## 🧹 A complete guide to cleaning textual data


In this tutorial, we’ll walk through the most common and important steps for cleaning text before feeding it to a machine learning or deep learning model.

Building deep learning or machine learning models takes weeks, days, or at least a few hours. To improve the results, we use various packages, add layers, and apply different techniques. But what if looking at the data itself and modifying it a little could improve both the performance and the training time of our model?


If we look at raw textual data, we’ll see that people tend to use language as they please, without much regard for grammar or structure. As a result, their words and sentences can end up as a series of characters that our natural language processing algorithms and models can’t properly distinguish and interpret. This makes our work a bit harder: we have to clarify and prepare the data before handing it to the final algorithm.[1]


But fear not! According to Tomas Mikolov, one of the authors of the well-known Word2vec algorithm, a deep learning model that can learn the semantic relationships between words needs as little cleaning as possible, because such models can work out which parts of the text to focus on (pay attention to) in order to achieve their objective. Still, even a little cleaning can play a big role: it reduces memory usage by shrinking the vocabulary, and it helps the model recognize more words by deleting unnecessary characters around them.
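To make the point about vocabulary size concrete, here is a minimal sketch. The three sample strings and the helper names `vocabulary` and `light_clean` are made up for illustration; it simply counts distinct whitespace-separated tokens before and after a very light cleaning step.

```python
import string

# Made-up sample documents that all say the same thing in different ways.
docs = ["Great game!", "great GAME", "Great, game."]

def vocabulary(texts):
    # The vocabulary is the set of distinct whitespace-separated tokens.
    return {token for text in texts for token in text.split()}

def light_clean(text):
    # Lowercase the text and delete ASCII punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

raw_vocab = vocabulary(docs)
clean_vocab = vocabulary(light_clean(d) for d in docs)

print(len(raw_vocab), sorted(raw_vocab))      # 6 distinct tokens
print(len(clean_vocab), sorted(clean_vocab))  # 2 distinct tokens
```

Six surface forms collapse into two words, which is the kind of reduction the cleaning steps in this guide aim for.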

Here are a few steps we can take to improve and clean our text.


* Substituting Emojis
* Removing Stopwords
* Removing Punctuations
* Lower casing
* Lemmatization
* Stemming
* Additional Resources
* Conclusion
* References

In [1]:
import pandas as pd
import numpy as np

# For cleaning the text
import spacy
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import regex as re
import string


### Substituting emojis and emoticons
When cleaning, you might be tempted to remove all punctuation first, and with it all the emoticons built from punctuation marks, like :), :( and :|. But by doing this you’re actually removing part of the meaning. A better approach is to first substitute these emoticons with words and only then delete the remaining punctuation.



In [2]:
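def remove_emojis(text):
    """Substitute common emoticons with words (despite the name, they are replaced, not dropped)."""

    # Happy
    text = re.sub(r":D", 'grin', text)
    # word boundaries keep the surrounding spaces and avoid matching inside longer words
    text = re.sub(r"\b[xX]D\b", 'laugh', text)
    text = re.sub(r":\)+", 'happy', text)

    # Sad
    text = re.sub(r":\(+", 'sad', text)
    text = re.sub(r"-_+-", 'annoyed', text)

    return text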

In [3]:
# example sentence

text = 'This is so creepy! :D'
remove_emojis(text)

'This is so creepy! grin'
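As the list of emoticons grows, one `re.sub` call per emoticon gets hard to maintain. The following is a minimal sketch of a table-driven alternative; the mapping `EMOTICON_WORDS` and the function name `substitute_emoticons` are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical emoticon-to-word mapping; extend it for your own data.
EMOTICON_WORDS = {
    ":D": "grin",
    ":)": "happy",
    ":(": "sad",
    ":|": "neutral",
}

# Escape every emoticon (they are full of regex metacharacters) and try
# longer entries first so a shorter one never shadows a longer one.
_EMOTICON_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_WORDS, key=len, reverse=True))
)

def substitute_emoticons(text):
    """Replace each known emoticon with its word in a single pass."""
    return _EMOTICON_PATTERN.sub(lambda m: EMOTICON_WORDS[m.group(0)], text)

print(substitute_emoticons("Great game :) but we lost :("))
# Great game happy but we lost sad
```

Unlike the regex version above, this sketch matches each emoticon literally, so a repeated ending such as `:)))` would leave the extra parentheses behind.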

### Removing Stopwords
We all know how frequently words like ‘is’, ‘are’, ‘am’, ‘he’, ‘she’ are used. These words are called stopwords, and they’re so common that they appear in all sorts of sentences. They usually carry little information of their own, so we often ignore them in tasks like text classification. Google often ignores them [2] when indexing entries for searching and when retrieving them as the result of a search query.

Libraries such as NLTK and spaCy ship stop word lists of different sizes, so you can pick one depending on how many and which stopwords you want to remove. (NLTK’s English list has 179 entries in the version used here, while spaCy’s has more than 300.)

In [4]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')

print(len(stop_words))
print(stop_words)

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mitra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
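# example
# Note: word_tokenize relies on NLTK's Punkt tokenizer data, which may need a one-time nltk.download.
# 'not' is in the stop word list, so the negation disappears from the output.
text = 'This is not accurate!'

text = [word for word in word_tokenize(text) if word.lower() not in stop_words]
text = ' '.join(text)
print(text)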

accurate !
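The example above drops "not" and so flips the meaning of the sentence. If negation matters for your task (sentiment analysis, for instance), one option is to take the negations out of the stop word list first. The sketch below uses a tiny hand-written stop word set and plain whitespace tokenization so that it runs without any NLTK downloads; both are assumptions made for illustration, not the NLTK list or tokenizer.

```python
# A tiny hand-written stop word set, used only so the example runs on its own;
# it is NOT the NLTK list.
stop_words = {"this", "is", "are", "am", "the", "a", "an", "not", "no", "nor"}

# Words we want to keep even though the stop word list contains them.
negations = {"not", "no", "nor"}
custom_stop_words = stop_words - negations

def remove_stopwords(text, stop_words):
    # Whitespace tokenization keeps the sketch dependency-free.
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

print(remove_stopwords("This is not accurate", stop_words))         # accurate
print(remove_stopwords("This is not accurate", custom_stop_words))  # not accurate
```

The same set subtraction works on `set(stopwords.words('english'))` if you are using the NLTK list.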


### Removing Punctuations
Punctuation can have a big impact on the emotion expressed in a text. But sometimes we don’t care about the feelings expressed in a dataset and want to reduce noise by removing these repetitive characters that don’t add any further information.

In the following cell, we use a regex to find any of the punctuation marks listed inside the brackets and delete them (that is, replace them with an empty string).

In [9]:
def remove_punct(text):
    # raw string so that the backslash escapes are passed to the regex engine unchanged
    return re.sub(r"[()!><.,`?':\-\[\]_@]", '', text)

In [10]:
# example
remove_punct('this is... crazy!!!')

'this is crazy'
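If you would rather remove every ASCII punctuation mark than maintain a hand-picked character class, the standard library offers `string.punctuation` together with `str.translate`. This is a minimal sketch; the name `remove_all_punct` is made up for illustration.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_all_punct(text):
    return text.translate(_PUNCT_TABLE)

print(remove_all_punct("this is... crazy!!!"))  # this is crazy
```

Keep in mind that `string.punctuation` only covers ASCII characters, so curly quotes or other Unicode punctuation would survive this step.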

### Lower casing
Usually, lower casing can hugely reduce the size of the vocabulary. It replaces all capital letters with their lowercase form, so “Another” and “There” become “another” and “there”. But pay close attention: at the same time it robs words like “Bush”, “Bill”, and “Apple” of their accurate representation and meaning by turning them into “bush”, “bill”, and “apple”. You can simply lowercase your text with `.lower()`.

In [11]:
# example
text = 'Apple represents itself in New York.'
text.lower()

'apple represents itself in new york.'
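If a handful of capitalized names matter for your task, one rough workaround is to lowercase everything except a protected list. The sketch below is only an illustration under that assumption: the `PROTECTED` set and the function name `lowercase_except` are hypothetical, and tokenization is a plain whitespace split.

```python
# Hypothetical set of tokens whose capitalization carries meaning.
PROTECTED = {"Apple", "Bush", "Bill"}

def lowercase_except(text, protected=PROTECTED):
    out = []
    for token in text.split():
        # Strip trailing/leading punctuation only for the lookup, not in the output.
        core = token.strip(".,!?;:")
        out.append(token if core in protected else token.lower())
    return " ".join(out)

print(lowercase_except("Apple represents itself in New York."))
# Apple represents itself in new york.
```

A fixed list cannot tell the company "Apple" from a sentence-initial "Apple" meaning the fruit; a named entity recognizer would be the more reliable, but heavier, choice.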

### Lemmatization or stemming?

Lemmatization and stemming serve a similar purpose. Both try to map the different forms of a word (verbs, nouns, and so on) to a base form, but they do so in different ways.


Stemming chops off the end of a word in the hope of arriving at a simple, correct base form. Lemmatization does this properly, with the help of a dictionary. So if we give “studies” to a stemmer, it will return “studi”, but if we give it to a lemmatizer, it will output “study”. Both techniques reduce the vocabulary size and the variety in your text, so be careful about the tradeoff between the performance of the model and the information that remains.
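The difference between the two approaches can be sketched with a toy example. The suffix rules and the mini dictionary below are made up for illustration and are far simpler than what real stemmers and lemmatizers use.

```python
# Toy illustration only: not the Porter algorithm and not WordNet.

def toy_stem(word):
    # Rule-based: apply the first suffix rule that matches, no dictionary involved.
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

# Hypothetical mini dictionary mapping inflected forms to their lemma.
TOY_LEMMAS = {"studies": "study", "cries": "cry", "better": "good"}

def toy_lemmatize(word):
    # Dictionary-based: look the word up, fall back to the word itself.
    return TOY_LEMMAS.get(word, word)

for w in ("studies", "cries", "better"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# studies -> studi | study
# cries -> cri | cry
# better -> better | good
```

The rule-based function is cheap but can produce non-words, while the lookup returns real words only for forms the dictionary knows about.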

### Lemmatization with NLTK

In [12]:

from nltk.stem import WordNetLemmatizer 
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer() 
  

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mitra\AppData\Roaming\nltk_data...


In [13]:
# example
print("rocks :", lemmatizer.lemmatize("rocks")) 
print("corpora :", lemmatizer.lemmatize("corpora")) 
  
# a denotes adjective in "pos" 
print("better :", lemmatizer.lemmatize("better", pos ="a"))


rocks : rock
corpora : corpus
better : good


### Stemming with NLTK


In [14]:
from nltk.stem import PorterStemmer 
ps = PorterStemmer()
words = ["programs", "programer", "programing", "programers", "studies", "cries"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 

programs  :  program
programer  :  program
programing  :  program
programers  :  program
studies  :  studi
cries  :  cri


### Some additional steps

As a last step, go through your dataset, check which words your algorithm did not recognize, and look for ways to reduce their number. You may even consider manually correcting some words like “Goooaaaal” or “Snaaap”.


In [16]:
def additional_cleaning(text):

    # decode HTML character entity references that often survive in scraped text
    # ("&amp;" goes last so that "&amp;gt;" is decoded only once, to "&gt;")
    character_entity_references_dict = {"&gt;": ">", "&lt;": "<", "&amp;": "&"}
    for entity, character in character_entity_references_dict.items():
        text = text.replace(entity, character)

    # removing links: drop every whitespace-delimited chunk that contains http: or https:
    text = re.sub(r"\S*https?:\S*", "", text)

    # removes hashtags and mentions together with the word right after the sign, e.g. #peace
    # (this has to run before the next step, which would otherwise strip the # and @ signs first)
    text = re.sub(r'[@#]\w*', '', text)

    # keep only word characters, spaces and . - ( ) , and turn everything else into a space
    text = re.sub(r'[^ \w.\-(),]', ' ', text)

    # removes single lowercase letters (typos) surrounded by spaces, except the letters i and a
    text = re.sub(r' +(?![ia])[a-z] +', ' ', text)

    # substitute runs of whitespace with a single space
    text = re.sub(r'\s+', ' ', text)

    return text

In [26]:
text = 'You can    look at my website https://regexr.com/ to learn more about     this topic! #cool !  c    '

additional_cleaning(text)

'You can look at my website to learn more about this topic '
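Elongated words such as “Goooaaaal” or “Snaaap” can often be normalized automatically instead of by hand. The following is a minimal sketch; the function name `collapse_elongation` is made up, and the rule (three or more repeats become one character) is an assumption that will not suit every dataset.

```python
import re

def collapse_elongation(text):
    # Any character repeated three or more times in a row is reduced to a single one.
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(collapse_elongation("Goooaaaal"))  # Goal
print(collapse_elongation("Snaaap"))     # Snap
print(collapse_elongation("good"))       # good (double letters are left alone)
```

The rule is lossy: “Goooood” becomes “God” rather than “Good”. Collapsing to two characters instead avoids that case but turns “Goooaaaal” into “Gooaal”, so a spell checker may still be needed afterwards.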


### Conclusion
Today, computers and machines help us in almost every aspect of our lives! In return, we have to help them understand our language better and make the interaction easier for both humans and machines.

Cleaning is just one of the ways to get faster and more accurate models. But because it modifies the original text, we have to be careful to write functions that remove as little as possible and leave the essential parts of the text intact.

Thanks for reading this article!

### References

[1] https://www.kaggle.com/code/mitramir5/simple-bert-with-video

[2] https://bloggingx.com/stop-words/#:~:text=Search%20engines%2C%20in%20both%20search,are%20ignored%20or%20filtered%20out.

https://www.kaggle.com/code/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert

https://www.kaggle.com/code/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert

https://regexr.com/

