<a href="https://colab.research.google.com/github/mitramir55/REConference/blob/main/1_data_cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🧹 A complete guide on cleaning textual data


In this tutorial, we’ll walk through the most important general steps for cleaning text before passing it to a machine learning or deep learning model.

Building deep learning or machine learning models can take anywhere from a few hours to several weeks. To improve the results, we try various packages, add layers, and apply different techniques. But what if inspecting the data itself and modifying it a little could improve both the performance and the training time of our model?


Keep in mind, though, that a deep learning model that learns the semantic relationships between words needs as little cleaning as possible, because such models can learn which parts of the text to focus on (pay attention to) in order to achieve their objective. Even so, a little cleaning can play a big role: it reduces memory usage by shrinking the vocabulary, and it helps the model recognize more words by deleting unnecessary characters around them.
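
To make the vocabulary claim concrete, here is a minimal sketch that counts distinct tokens before and after a light cleaning pass. The three-sentence corpus and the helper names `vocabulary` and `light_clean` are made up for illustration; they are not part of the cleaning functions used later in this tutorial.

```python
import re

# Hypothetical mini-corpus, used only for illustration.
corpus = [
    "The model works!",
    "the model WORKS.",
    "The Model works?",
]

def vocabulary(texts):
    """Return the set of distinct whitespace-separated tokens."""
    return {token for text in texts for token in text.split()}

def light_clean(text):
    """Lowercase the text and drop everything except letters, digits and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())

raw_vocab = vocabulary(corpus)
clean_vocab = vocabulary(light_clean(text) for text in corpus)

print(len(raw_vocab), sorted(raw_vocab))      # 7 distinct raw tokens
print(len(clean_vocab), sorted(clean_vocab))  # 3 distinct cleaned tokens
```

Seven surface forms collapse into three vocabulary entries here. On a real dataset the reduction will be smaller, but the effect points in the same direction.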

Here are a few steps we can take to improve and clean our text.


* Substituting Emojis
* Removing Stopwords
* Removing Punctuations
* Lower casing
* Lemmatization
* Stemming
* Tokenization
* Additional Resources
* Conclusion
* References


![](https://cdn-images-1.medium.com/max/800/1*B09y6eYoPTbHspcEoQUTNQ.png)

In [None]:
import pandas as pd
import numpy as np

In [None]:
# NLTK packages for text cleaning
import nltk
nltk.download('omw-1.4')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In this tutorial, we'll be using regular expressions a lot. A regular expression is a sequence of characters that defines a text search pattern. If you need help with any expression, go to https://regexr.com/ to explore different patterns and see what each one matches.

In [None]:
import regex as re
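
As a quick warm-up, the sketch below shows the two regex calls this tutorial relies on most, `findall` and `sub`. It uses Python's standard-library `re` module, and the sample sentence is invented for illustration.

```python
import re

text = "Contact us at 10:30 or 14:45, not at 9am."

# \d matches one digit and {2} means "exactly two", so this finds HH:MM times.
times = re.findall(r"\d{2}:\d{2}", text)
print(times)   # ['10:30', '14:45']

# re.sub replaces every match of the pattern with the replacement string.
masked = re.sub(r"\d{2}:\d{2}", "<TIME>", text)
print(masked)  # Contact us at <TIME> or <TIME>, not at 9am.
```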


### Substituting emojis and emoticons
When cleaning, you might be tempted to remove all punctuation first, and with it all the emoticons built from punctuation, like :), :( and :|. But doing so removes part of the meaning. A better approach is to substitute these emoticons with words first and then delete the remaining punctuation.



In [None]:
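def remove_emojis(text):
    """Substitute common emoticons with a word that describes them."""

    # Happy
    text = re.sub(r":D", 'grin', text)
    # keep the leading space in the replacement so "lol xD" becomes "lol laugh"
    text = re.sub(r" [xX]D", ' laugh', text)
    text = re.sub(r":\)+", 'happy', text)

    # Sad
    text = re.sub(r":\(+", 'sad', text)
    text = re.sub(r"-_+-", 'annoyed', text)

    return text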

In [None]:
# example sentence

text = 'This is so creepy! :D'
remove_emojis(text)

'This is so creepy! grin'
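
When the list of emoticons grows, one `re.sub` call per emoticon becomes hard to maintain. A possible alternative, sketched below, keeps the emoticons in a dictionary and builds a single pattern from it. The mapping and the name `substitute_emoticons` are assumptions for illustration (the word chosen for `:|` in particular is arbitrary).

```python
import re

# Hypothetical mapping; extend it with the emoticons that occur in your data.
EMOTICONS = {
    ":D": "grin",
    ":)": "happy",
    ":(": "sad",
    ":|": "neutral",
}

# Build one alternation pattern; re.escape keeps characters such as ")" literal.
# Longer emoticons come first so that they win over shorter prefixes.
_pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)

def substitute_emoticons(text):
    """Replace each known emoticon with its word equivalent."""
    return _pattern.sub(lambda match: EMOTICONS[match.group(0)], text)

print(substitute_emoticons("Great news :) but the build failed :("))
# Great news happy but the build failed sad
```

Unlike the patterns with `+` above, this version does not collapse repeated characters such as `:)))`; that would need an extra rule.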

### Tokenization

Before cleaning, we can tokenize the data, that is, split each record into a list of its words. This way, instead of looking at the whole text, we look at individual words, which lets us choose more precisely which words to remove.

The following example illustrates the concept:

    

In [None]:
# download punkt which helps in tokenizing text
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
text = "Hello! I'm Mary, and I'm reporting an issue"
word_tokenize(text)

['Hello',
 '!',
 'I',
 "'m",
 'Mary',
 ',',
 'and',
 'I',
 "'m",
 'reporting',
 'an',
 'issue']

Now, when we clean the text and remove stopwords, each token can be checked on its own: "I" is matched directly, and the contraction "'m" is isolated as a separate token that we can handle explicitly. If we had not tokenized the text, removing punctuation would have left us with "Im", which would not be identified as a stop word.
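
The "Im" problem is easy to reproduce. The sketch below first strips punctuation from the raw string and then, for comparison, splits the same sentence with a small regular expression. That regex is only a rough stand-in written for this example; it is not the algorithm behind `word_tokenize`, although it produces the same tokens for this sentence.

```python
import re

text = "Hello! I'm Mary, and I'm reporting an issue"

# Stripping punctuation first glues the contraction together: "I'm" -> "Im".
no_punct = re.sub(r"[^\w\s]", "", text)
print(no_punct.split())

# A rough regex stand-in for a tokenizer (not NLTK's algorithm): it splits
# off "'m"-style clitics and keeps punctuation marks as separate tokens.
tokens = re.findall(r"'\w+|\w+|[^\w\s]", text)
print(tokens)
```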

### Removing Stopwords
We all know how frequently words like ‘is’, ‘are’, ‘am’, ‘he’, and ‘she’ are used. These words are called stopwords, and they are so common that they appear in all sorts of sentences. They rarely add specific information that would change the meaning of a sentence. Google often ignores them [2] both when indexing entries for search and when retrieving them as the result of a search query.



Different libraries, such as [NLTK](https://www.nltk.org/search.html?q=stopwords&check_keywords=yes&area=default) and [spaCy](https://github.com/explosion/spaCy/blob/master/spacy/lang/en/stop_words.py), ship stopword lists of different sizes, so you can choose one depending on how many and which stopwords you want to remove (NLTK has around 180, while spaCy has more than 300).

Read more: [link](https://www.geeksforgeeks.org/removing-stop-words-nltk-python/)

In [None]:
from nltk.corpus import stopwords

# download NLTK stopwords
nltk.download('stopwords')

# you can either use default stopwords or customize it
stop_words = stopwords.words('english')
#stop_words = ["the", "and", "not"]

print(len(stop_words))
print(stop_words)

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# example
text = 'This is not accurate!'

tokens = [word for word in word_tokenize(text) if word.lower() not in stop_words]
text = ' '.join(tokens)
print(text)

accurate !
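
Notice that the example above drops “not”, which flips the meaning of the sentence. One way to avoid this, sketched below under the assumption that negations matter for your task, is to subtract a set of negation words from the stopword list before filtering. The tiny `base_stop_words` set and the helper `remove_stopwords` are illustrative stand-ins so that the snippet runs on its own.

```python
# A tiny hand-written stopword list, for illustration only; in the notebook
# you would start from stopwords.words('english') instead.
base_stop_words = {"this", "is", "not", "the", "a", "an"}

# Words that flip the meaning of a sentence and are worth keeping.
negations = {"not", "no", "nor", "never"}
custom_stop_words = base_stop_words - negations

def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token found in stop_words."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)

print(remove_stopwords("This is not accurate", base_stop_words))    # accurate
print(remove_stopwords("This is not accurate", custom_stop_words))  # not accurate
```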


### Removing Punctuations
Punctuation can have a big impact on the emotion expressed in a text. But sometimes we don't care about the sentiment in a dataset and want more clarity by removing these repetitive characters that don't add any further information.

In the following cell, we use a regex to find any of the punctuation marks listed inside the brackets and delete them (replace them with an empty string).

In [None]:
def remove_punct(text):
    # raw string so that backslashes reach the regex engine unchanged
    return re.sub(r"[()!><.,`?':\-\[\]_@]", '', text)

In [None]:
# example
remove_punct('this is... crazy!!!')

'this is crazy'
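
If you would rather remove every ASCII punctuation character than maintain a hand-picked list, a common regex-free alternative is `str.translate` with `string.punctuation`. The sketch below is one such variant; the name `remove_all_punct` is made up here. It removes more characters than the function above (for example `$` and `#`), so check that this suits your data.

```python
import string

# Translation table that deletes every character in string.punctuation.
_delete_punct = str.maketrans("", "", string.punctuation)

def remove_all_punct(text):
    """Remove all ASCII punctuation characters from text."""
    return text.translate(_delete_punct)

print(remove_all_punct("this is... crazy!!!"))  # this is crazy
```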

### Lower casing
Lower casing can greatly reduce the size of the vocabulary. It replaces every capital letter with its lowercase form, so “Another” and “There” become “another” and “there”. But note that it also robs words like “Bush”, “Bill”, and “Apple” of their accurate representation and meaning by turning them into “bush”, “bill”, and “apple”. You can lowercase your text with `.lower()`.

In [None]:
# example
text = 'Apple represents itself in New York.'
text.lower()

'apple represents itself in new york.'
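
If some capitalized words carry meaning you want to keep, one option is to lowercase everything except a list of protected words. The sketch below assumes you can supply such a list yourself; `lowercase_except` and `protected_words` are hypothetical names.

```python
import re

def lowercase_except(text, protected):
    """Lowercase every word unless it appears in the protected set."""
    def convert(match):
        word = match.group(0)
        return word if word in protected else word.lower()
    return re.sub(r"\w+", convert, text)

# Hypothetical list of names we want to keep capitalized.
protected_words = {"Apple", "Bush", "Bill"}

print(lowercase_except("Another Apple Store Opened There.", protected_words))
# another Apple store opened there.
```

A fixed list cannot tell “Apple” the company from “Apple” at the start of a sentence about fruit; the named-entity recognition shown later in this tutorial is one way to make that distinction.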

### Lemmatization or stemming?🤔

Lemmatization and stemming have a similar purpose: both relate the different forms of a word (verbs, nouns, and so on) to a base form, but they do this in different ways.

Stemming chops off the end of a word in the hope of arriving at a simple, correct base form. Lemmatization does this properly, with the help of a dictionary. So if we give “studies” to a stemmer, it returns “studi”, but a lemmatizer outputs “study”. Both techniques reduce the vocabulary size and the variety in your text, so be careful about the tradeoff between the performance of the model and the information that remains.
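
The difference can be sketched with two toy functions. Neither is a real algorithm: `toy_stem` only chops a few hard-coded suffixes (it is not the Porter stemmer), and `toy_lemmatize` looks words up in a three-entry dictionary written for this example. They are only meant to show "rule-based chopping" next to "dictionary lookup".

```python
def toy_stem(word):
    """Chop the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma dictionary; real lemmatizers use resources like WordNet.
TOY_LEMMAS = {"studies": "study", "cries": "cry", "rocks": "rock"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "cries", "rocks"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The stemmer needs no resources but produces non-words such as “studi”; the lemmatizer returns real words but only for entries its dictionary knows.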

### Lemmatization with NLTK

The NLTK lemmatizer uses WordNet, a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), which are interlinked by means of conceptual-semantic and lexical relations. Read more about this database [here](https://wordnet.princeton.edu/).

In [None]:
from nltk.stem import WordNetLemmatizer 

# download wordnet for lemmatization
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
# example
print("rocks :", lemmatizer.lemmatize("rocks")) 
print("corpora :", lemmatizer.lemmatize("corpora")) 
  
# pos="a" tells the lemmatizer to treat the word as an adjective
print("better :", lemmatizer.lemmatize("better", pos="a"))


rocks : rock
corpora : corpus
better : good


### Stemming with NLTK
Here we use NLTK's [PorterStemmer](https://www.nltk.org/_modules/nltk/stem/porter.html) to stem words. The Porter stemming algorithm normalizes text by stripping common word endings. Read more about this algorithm [here](https://tartarus.org/martin/PorterStemmer/).

In [None]:
from nltk.stem import PorterStemmer 
ps = PorterStemmer()
words = ["programs", "programer", "programing", "programers", "studies", "cries"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 

programs  :  program
programer  :  program
programing  :  program
programers  :  program
studies  :  studi
cries  :  cri


### Some additional steps

As a last step, go through your dataset, check which words your algorithm did not recognize, and look for ways to reduce their number. You may even consider manually correcting elongated words like “Goooaaaal” or “Snaaap”.
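
Elongated words can often be fixed automatically instead of by hand. The sketch below shortens any run of three or more identical characters; `squeeze_repeats` and its `max_run` parameter are names invented for this example. Treat it as a heuristic, since it cannot know whether the correct spelling has one or two letters.

```python
import re

def squeeze_repeats(text, max_run=1):
    """Shorten any run of 3+ identical characters to max_run characters."""
    return re.sub(r"(.)\1{2,}", lambda m: m.group(1) * max_run, text)

print(squeeze_repeats("Goooaaaal"))              # Goal
print(squeeze_repeats("Snaaap"))                 # Snap
print(squeeze_repeats("soooo good", max_run=2))  # soo good
```

Runs of exactly two characters, as in “good”, are left untouched, so ordinary double letters survive.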


In [None]:
def additional_cleaning(text):

    # HTML character entity references that often remain in scraped text
    character_entity_references_dict = {"&gt;": ">", "&lt;": "<", "&amp;": "&"}
    for entity, character in character_entity_references_dict.items():
        text = text.replace(entity, character)

    # remove links: match "http"/"https" plus every adjacent non-space character
    text = re.sub(r"\S*https?:\S*", "", text)

    # remove hashtags and mentions together with the word right after the
    # symbol (e.g. #peace); this must run before "#" and "@" are stripped below
    text = re.sub(r'[@#]\w*', '', text)

    # keep only spaces, word characters and . - ( ) ,
    text = re.sub(r'[^ \w\.\-\(\)\,]', ' ', text)

    # remove single lowercase letters (typos) surrounded by spaces, except i and a
    text = re.sub(r' +(?![ia])[a-z] +', ' ', text)

    # collapse runs of whitespace into a single space
    text = re.sub(r' \s+', ' ', text)

    return text

In [None]:
text = 'You can    look at my website https://regexr.com/ to learn more about     this topic! #cool !  c    '

additional_cleaning(text)

'You can look at my website to learn more about this topic '
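
The function above maps three character entity references by hand. If your scraped text contains other entities as well, the standard library's `html.unescape` converts them in one call. The sample string below is invented for illustration.

```python
import html

scraped = "x &gt; 3 &amp;&amp; y &lt; 5 &quot;quoted&quot;"

# html.unescape converts named and numeric character references in one pass.
print(html.unescape(scraped))  # x > 3 && y < 5 "quoted"
```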

#### Further parsing of words with spaCy

spaCy is an open-source library for advanced natural language processing, written in Python and Cython. Here, we'll use it to parse sentences further.

Description of the various spaCy models: [link](https://spacy.io/models/en)

In [None]:
import spacy

# load the small English model; if it is missing, install it first with:
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

If you need to analyze each token of a sentence, e.g., print out the `DEP` (dependency), `POS` (coarse-grained part-of-speech tag), `TAG` (fine-grained part-of-speech tag), or `LEMMA` (canonical form) of a word, you can use the following token attributes to see the features extracted by the spaCy language model.

To learn more about each of these topics, see the links below:
* Dependency parsing: [Stanford typed dependencies manual](https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf)
* POS tagging: [spaCy POS tagger](https://spacy.io/usage/linguistic-features#pos-tagging)
* Label definitions: [spaCy Glossary](https://github.com/explosion/spaCy/blob/master/spacy/glossary.py) and [CoNLL-U Format](https://spacy.io/models/en)

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
words_features = []

for token in doc:
    # collect the attributes of interest for each token
    words_features.append({
        'word': token.text,
        'lemma_': token.lemma_,
        'pos_': token.pos_,
        'tag_': token.tag_,
        'dep_': token.dep_,
        'shape_': token.shape_,
        'is_alpha': token.is_alpha,
        'is_stop': token.is_stop,
    })

pd.DataFrame(words_features)

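| | word | lemma_ | pos_ | tag_ | dep_ | shape_ | is_alpha | is_stop |
|---|---|---|---|---|---|---|---|---|
| 0 | Apple | Apple | PROPN | NNP | nsubj | Xxxxx | True | False |
| 1 | is | be | AUX | VBZ | aux | xx | True | True |
| 2 | looking | look | VERB | VBG | ROOT | xxxx | True | False |
| 3 | at | at | ADP | IN | prep | xx | True | True |
| 4 | buying | buy | VERB | VBG | pcomp | xxxx | True | False |
| 5 | U.K. | U.K. | PROPN | NNP | compound | X.X. | False | False |
| 6 | startup | startup | NOUN | NN | dobj | xxxx | True | False |
| 7 | for | for | ADP | IN | prep | xxx | True | True |
| 8 | \$ | \$ | SYM | \$ | quantmod | \$ | False | False |
| 9 | 1 | 1 | NUM | CD | compound | d | False | False |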
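
These attributes can drive the cleaning itself. As a minimal sketch, the snippet below keeps the lemma of every token that is alphabetic and not a stop word. To stay runnable without a spaCy model, it works on plain dictionaries copied from a few rows of the output above; the helper name `content_lemmas` is made up for this example.

```python
# A few rows copied from the output above (word, lemma_, is_alpha, is_stop).
rows = [
    {"word": "Apple", "lemma_": "Apple", "is_alpha": True, "is_stop": False},
    {"word": "is", "lemma_": "be", "is_alpha": True, "is_stop": True},
    {"word": "looking", "lemma_": "look", "is_alpha": True, "is_stop": False},
    {"word": "at", "lemma_": "at", "is_alpha": True, "is_stop": True},
    {"word": "buying", "lemma_": "buy", "is_alpha": True, "is_stop": False},
    {"word": "U.K.", "lemma_": "U.K.", "is_alpha": False, "is_stop": False},
    {"word": "$", "lemma_": "$", "is_alpha": False, "is_stop": False},
]

def content_lemmas(token_rows):
    """Keep the lemma of every alphabetic token that is not a stop word."""
    return [
        row["lemma_"]
        for row in token_rows
        if row["is_alpha"] and not row["is_stop"]
    ]

print(content_lemmas(rows))  # ['Apple', 'look', 'buy']
```

Filtering on `is_alpha` also drops “U.K.”, which shows why each filter should be checked against the entities you care about.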


In [None]:
from spacy import displacy

In [None]:
txt = "Bears are dreamt of in your fantasies."
doc = nlp(txt)

# with jupyter=True the visualization is displayed inline, so there is nothing to assign
displacy.render(doc, jupyter=True, style="dep")

#### NER

The goal of named-entity recognition is to identify and categorise named entities found in unstructured text into predefined groups, such as names of people, places, organisations, things, medical codes, amounts, numbers, dollar amounts, percentages, etc.

Read more: [link](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)

In [None]:
import spacy
from spacy import displacy

text = "Ellenore Smith started working on self-driving cars at Tesla in 2007."

doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)


### Conclusion
Today, computers and machines help us in almost every aspect of our lives. In return, we have to help them understand our language better, which makes the interaction easier for both humans and machines.

Cleaning is just one of the ways to get faster and more accurate models. But because it modifies the original text, we have to be careful to construct functions that remove as little as possible of the text's essential parts.

Thanks for reading this article!

### References

[1] https://www.kaggle.com/code/mitramir5/simple-bert-with-video

[2] https://bloggingx.com/stop-words/#:~:text=Search%20engines%2C%20in%20both%20search,are%20ignored%20or%20filtered%20out.

[3] https://www.kaggle.com/code/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert

[4] https://www.kaggle.com/code/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert

[5] https://regexr.com/

Image credit: https://www.analyticsvidhya.com/blog/2020/11/text-cleaning-nltk-library/

