# Text Cleaning and Pre-Processing

Clean text is human language rearranged into a format that machine models can understand. Text cleaning can be performed with simple Python code that removes stop words, strips or normalizes non-ASCII characters, and reduces words to their root form. This notebook walks through some of these techniques.

## Importing necessary modules

In [1]:
import re
import string
import unicodedata
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

## Character Filtering

We will filter out all non-printable and extended (non-ASCII) characters. A list is available here: [NON PRINTABLE CHARACTERS](https://web.itu.edu.tr/sgunduz/courses/mikroisl/ascii.html)

In [2]:
# string.printable is the concatenation of digits, ascii_letters, punctuation, and whitespace.
string.printable

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [3]:
# Return string with all non-alphanumerics backslashed; this is useful if you want to match an
# arbitrary literal string that may have regular expression metacharacters in it.
re.escape(string.printable)

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~\\ \\\t\\\n\\\r\\\x0b\\\x0c'

In [4]:
# Character class that matches any character NOT in string.printable
pattern = re.compile('[^%s]' % re.escape(string.printable))

In [5]:
extended_chars = "#@ € Hi ‡ x ™ is ® z <|> §character ¥"
extended_chars = extended_chars.split()
extended_chars

['#@', '€', 'Hi', '‡', 'x', '™', 'is', '®', 'z', '<|>', '§character', '¥']

In [6]:
printable = [pattern.sub('', word) for word in extended_chars]
printable

['#@', '', 'Hi', '', 'x', '', 'is', '', 'z', '<|>', 'character', '']

In [7]:
" ".join(printable)

'#@  Hi  x  is  z <|> character '
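The joined result above still contains doubled spaces where symbols were removed. A minimal sketch of a helper that applies the same pattern to a whole string and then collapses the leftover whitespace; the name `strip_non_printable` is hypothetical and not part of any library:

```python
import re
import string

# Same pattern as above: anything that is not in string.printable.
NON_PRINTABLE = re.compile("[^%s]" % re.escape(string.printable))

def strip_non_printable(text):
    """Remove non-printable/extended characters, then collapse whitespace runs."""
    cleaned = NON_PRINTABLE.sub("", text)
    # split() with no arguments splits on any whitespace run and drops empty pieces.
    return " ".join(cleaned.split())

print(strip_non_printable("#@ € Hi ‡ x ™ is ® z <|> §character ¥"))
# '#@ Hi x is z <|> character'
```

Note that `split()` also folds tabs and newlines into single spaces, which may or may not be wanted depending on the task.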

In [8]:
from unicodedata import normalize

In [9]:
fr = "Ratification et mise en œuvre des conventions de l'OIT mises à jour (vote)"

In [10]:
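# NFD splits accented letters into a base letter plus combining marks; encoding to ASCII
# with 'ignore' then drops the marks. Characters with no decomposition, such as 'œ', are lost.
fr = normalize('NFD', fr).encode('ascii', 'ignore')
fr = fr.decode('UTF-8')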
fr

"Ratification et mise en uvre des conventions de l'OIT mises a jour (vote)"

## Using str.maketrans() to remove punctuation

The Python string method str.maketrans() returns a translation table. When called with two strings of equal length, it maps each character in the first string to the character at the same position in the second; an optional third string lists characters to delete. The table is then passed to translate().

In [11]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [12]:
text = "<Hello <*^_^*> Sam!> "
# The first two arguments map "S" to "P"; the third lists the characters to delete.
table = str.maketrans("S", "P", string.punctuation)
print(text.translate(table))

Hello  Pam 
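Deleting punctuation outright can glue words together (for example, "well-known" becomes "wellknown"). One possible alternative, sketched here under the assumption that word boundaries matter for the task, is to map every punctuation character to a space using the two-argument form of str.maketrans(). The helper name `punctuation_to_spaces` is hypothetical:

```python
import string

# Two equal-length strings: each punctuation character maps to a space.
space_table = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(space_table).split())

# Deleting punctuation merges the hyphenated word ...
delete_table = str.maketrans("", "", string.punctuation)
print("well-known".translate(delete_table))                    # wellknown

# ... while mapping it to spaces keeps the parts separate.
print(punctuation_to_spaces("well-known, state-of-the-art!"))  # well known state of the art
```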


## Convert Unicode Characters to ASCII String in Python

The Python module unicodedata provides access to the Unicode Character Database, along with utility functions that make accessing, filtering, and looking up these characters significantly easier.

unicodedata has a function called normalize() that accepts two parameters: the name of the normalization form and the string to normalize.
There are four Unicode normalization forms: NFC, NFKC, NFD, and NFKD. For more information, see the [official Unicode normalization FAQ](https://unicode.org/faq/normalization.html#4b).

NFD - Normalisation Form Canonical Decomposition  
NFC - Normalisation Form Canonical Composition  
NFKD - Normalisation Form Compatibility Decomposition  
NFKC - Normalisation Form Compatibility Composition  
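
To make the difference between the four forms concrete, here is a small sketch using two sample characters chosen for illustration: an accented letter, which has a canonical decomposition, and the "fi" ligature, which only has a compatibility decomposition.

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent
ligature = "\ufb01"     # the 'fi' ligature, a compatibility character

# Canonical forms convert between the composed and decomposed spellings.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Canonical forms leave the ligature untouched ...
print(unicodedata.normalize("NFC", ligature) == ligature)    # True
print(unicodedata.normalize("NFD", ligature) == ligature)    # True

# ... while compatibility forms expand it into plain letters.
print(unicodedata.normalize("NFKC", ligature))               # fi
print(unicodedata.normalize("NFKD", composed + ligature) == decomposed + "fi")  # True
```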

In [13]:
import unicodedata

stringVal = u'Här är ett exempel på en svensk mening att ge dig.'
stringVal = unicodedata.normalize('NFKD', stringVal).encode('ascii', 'ignore')
print(stringVal)

b'Har ar ett exempel pa en svensk mening att ge dig.'


In [14]:
stringVal = stringVal.decode('UTF-8')
print(stringVal)

Har ar ett exempel pa en svensk mening att ge dig.
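
In the earlier French example, "œuvre" came out as "uvre": the letter "œ" has no Unicode decomposition, so encoding to ASCII with 'ignore' simply drops it. A minimal sketch of one way around this is to translate such letters by hand before normalizing. The mapping below is a small illustrative assumption, not an exhaustive list, and `to_ascii` is a hypothetical helper name:

```python
import unicodedata

# Hand-written replacements for letters that normalization cannot decompose.
# Illustrative only; extend as needed for the languages in your data.
LIGATURES = str.maketrans({"œ": "oe", "Œ": "OE", "æ": "ae", "Æ": "AE", "ß": "ss"})

def to_ascii(text):
    """Replace known ligatures, strip accents via NFD, and drop any remaining non-ASCII."""
    text = text.translate(LIGATURES)
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("mise en œuvre des conventions mises à jour"))
# mise en oeuvre des conventions mises a jour
```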


## Cleaning Text

**Note:** Not every step in the function below is necessary. Choose the steps with the downstream task in mind, for example sentiment classification or language translation.
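
One way to make the steps easy to switch on and off per task is to write each step as a small function and compose only the ones needed. The following is a rough sketch using the standard library only; all function names are hypothetical, and the URL pattern is deliberately simplified compared with the one used in the function below.

```python
import re
import string

def lowercase(text):
    return text.lower()

def strip_urls(text):
    # Simplified pattern: "http://" or "https://" followed by non-space characters.
    return re.sub(r"https?://\S+", "", text)

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def collapse_whitespace(text):
    return " ".join(text.split())

def make_cleaner(*steps):
    """Return a function that applies the given steps in order."""
    def clean(text):
        for step in steps:
            text = step(text)
        return text
    return clean

# A cleaner for one task might use every step ...
review_cleaner = make_cleaner(lowercase, strip_urls, strip_punctuation, collapse_whitespace)
# ... while another task might keep case and punctuation.
light_cleaner = make_cleaner(strip_urls, collapse_whitespace)

print(review_cleaner("Loved it!! See https://example.com/review now."))  # loved it see now
```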


In [15]:
def cleanText(dataFrame):
    """Clean each entry of dataFrame["Review_text"] and return the results as a list of strings."""
    
    cleanedText = []
    lines = dataFrame["Review_text"].values.tolist()
    
    for text in lines:
        # Converting text to lower case
        text = text.lower()
        
        # Removing hyperlinks
        # Raw string so that backslashes such as \( reach the regex engine unchanged
        pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
        text = pattern.sub("", text)
        
        # Removing Emojis
        emoji = re.compile("["
                           u"\U0001F600-\U0001FFFF"  # emoticons through the end of plane 1
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters; a very broad range that also covers CJK
                           "]+", flags=re.UNICODE)
        
        text = emoji.sub(r'', text)
        
        # Normalizing unicode characters
        text = unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
        text = text.decode('UTF-8')
        
        # Expanding common contractions
        text = re.sub(r"i'm", "i am", text)
        text = re.sub(r"he's", "he is", text)
        text = re.sub(r"she's", "she is", text)
        text = re.sub(r"that's", "that is", text)
        text = re.sub(r"what's", "what is", text)
        text = re.sub(r"where's", "where is", text)
        text = re.sub(r"\'ll", " will", text)
        text = re.sub(r"\'ve", " have", text)
        text = re.sub(r"\'re", " are", text)
        text = re.sub(r"\'d", " would", text)
        text = re.sub(r"won't", "will not", text)
        text = re.sub(r"don't", "do not", text)
        text = re.sub(r"didn't", "did not", text)
        text = re.sub(r"can't", "can not", text)
        text = re.sub(r"it's", "it is", text)
        text = re.sub(r"couldn't", "could not", text)
        text = re.sub(r"haven't", "have not", text)
        
        # Removing punctuations
        text = re.sub(r"[,.\"!@#$%^&*(){}?/;`~:<>+=-]", "", text)
        
        # Tokenizing
        tokens = word_tokenize(text)
        
        # Removing any remaining punctuation from each token
        table = str.maketrans('', '', string.punctuation)
        stripped = [w.translate(table) for w in tokens]
        
        # Keeping only purely alphabetic tokens (drops numbers, mixed tokens, and empty strings)
        words = [word for word in stripped if word.isalpha()]
        
        # Removing non-printable characters
        pattern = re.compile('[^%s]' % re.escape(string.printable))
        words = [pattern.sub('', word) for word in words]
        
#         # Removing Stop Words
#         stop_words = set(stopwords.words("english"))
#         stop_words.discard("not")
#         words = [word for word in words if word not in stop_words]
        
#         # Stemming words
#         stemmer = PorterStemmer()
#         words = [stemmer.stem(word) for word in words]
        
#         # Lemmatization
#         lemmatizer = WordNetLemmatizer()
#         words = [lemmatizer.lemmatize(word) for word in words]
        
        text = " ".join(words)
        cleanedText.append(text)
    return cleanedText
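
The chain of re.sub calls used for contractions in cleanText could also be driven by a lookup table, which makes it easier to add entries and avoids misspelled patterns going unnoticed. This is only a sketch of that idea, not a drop-in replacement: the names `CONTRACTIONS` and `expand_contractions` are hypothetical, and the table covers just a few of the whole-word entries.

```python
import re

# A few whole-word contractions; extend the table as needed.
CONTRACTIONS = {
    "i'm": "i am",
    "it's": "it is",
    "won't": "will not",
    "can't": "can not",
    "don't": "do not",
    "didn't": "did not",
    "haven't": "have not",
}

# One compiled alternation; \b keeps matches on word boundaries.
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)

def expand_contractions(text):
    """Lower-case the text and expand every contraction found in the table."""
    return CONTRACTION_RE.sub(lambda match: CONTRACTIONS[match.group(1)], text.lower())

print(expand_contractions("I'm sure it won't work"))  # i am sure it will not work
```

Compiling the pattern once at module level also avoids rebuilding it for every row of the data frame.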

## Spell Check


In [16]:
from textblob import TextBlob

In [17]:
b = TextBlob("I havv goood speling!")
print(b.correct())

I have good spelling!


In [18]:
b = TextBlob("He woud havv reachd home by noww")
print(b.correct())

He would have reached home by now


**Additional resources:**

**NLTK**  
https://www.nltk.org/index.html  
https://www.nltk.org/book/  

**spaCy**  
https://spacy.io/usage/spacy-101  