## Text Preprocessing

For unstructured data such as text, preprocessing is one of the most important phases. Some common text preprocessing and cleaning steps are:

- Lower casing
- Removal of punctuation
- Removal of stopwords
- Removal of frequent words
- Removal of rare words
- Stemming
- Lemmatization
- Removal of emojis
- Removal of emoticons
- Conversion of emoticons to words
- Conversion of emojis to words
- Removal of URLs
- Removal of HTML tags
- Chat words conversion
- Spelling correction

We need to choose our preprocessing steps carefully, based on the use case. For example, in sentiment analysis we should not remove emojis or emoticons, because they carry information about the sentiment.
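
One way to keep that choice explicit is to write each step as a small function and compose only the steps a task needs. The sketch below is illustrative only: `build_pipeline`, `strip_urls`, and `squeeze_spaces` are hypothetical names, not part of this notebook.

```python
import re
from functools import reduce

def lower_case(text):
    return text.lower()

def strip_urls(text):
    # Same URL pattern idea as the one used later in this notebook.
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def squeeze_spaces(text):
    return " ".join(text.split())

def build_pipeline(*steps):
    """Return a function that applies the given steps left to right."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

# Sentiment task: keep casing and emoticons. Topic task: lower-case as well.
sentiment_clean = build_pipeline(strip_urls, squeeze_spaces)
topic_clean = build_pipeline(strip_urls, lower_case, squeeze_spaces)

sample = "LOVE it :-) see https://example.com now"
print(sentiment_clean(sample))  # LOVE it :-) see now
print(topic_clean(sample))      # love it :-) see now
```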

In [1]:
import numpy as np
import pandas as pd
import re 
import nltk 
import spacy
import string 
# Suppress the SettingWithCopy warning raised when modifying a DataFrame slice
pd.options.mode.chained_assignment = None 

In [2]:
full_df = pd.read_csv("./twcs.csv")
full_df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


In [3]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2811774 entries, 0 to 2811773
Data columns (total 7 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   tweet_id                 int64  
 1   author_id                object 
 2   inbound                  bool   
 3   created_at               object 
 4   text                     object 
 5   response_tweet_id        object 
 6   in_response_to_tweet_id  float64
dtypes: bool(1), float64(1), int64(1), object(4)
memory usage: 131.4+ MB


In [4]:
df = full_df[["text"]]
df["text"] = df["text"].astype(str)
df.head()

Unnamed: 0,text
0,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...
3,@115712 Please send us a Private Message so th...
4,@sprintcare I did.


### Lower Casing
- The idea is to convert the input text to a single casing format so that 'text', 'Text' and 'TEXT' are treated the same way.

- This may not be helpful for tasks such as part-of-speech tagging (where proper casing gives some information about nouns and so on) and sentiment analysis (where upper casing can signal anger and so on).

- Many vectorizers and tokenizers lower-case by default, for example scikit-learn's `TfidfVectorizer` (`lowercase=True`) and the Keras `Tokenizer` (`lower=True`). Set these options to `False` when casing matters for the use case.

In [5]:
df["text_lower"] = df["text"].str.lower()
df.head()

Unnamed: 0,text,text_lower
0,@115712 I understand. I would like to assist y...,@115712 i understand. i would like to assist y...
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...,@sprintcare i have sent several private messag...
3,@115712 Please send us a Private Message so th...,@115712 please send us a private message so th...
4,@sprintcare I did.,@sprintcare i did.
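
When fully upper-case words are a useful signal, a middle ground is to lower-case everything else. The following is a minimal sketch under that assumption; `lower_except_shouting` and its `min_len` parameter are hypothetical and not part of the notebook.

```python
def lower_except_shouting(text, min_len=2):
    """Lower-case every token except fully upper-case alphabetic words."""
    out = []
    for token in text.split():
        if token.isalpha() and token.isupper() and len(token) >= min_len:
            out.append(token)          # keep e.g. "WORST" as an emphasis signal
        else:
            out.append(token.lower())  # "I", "Service", "@User" are lower-cased
    return " ".join(out)

print(lower_except_shouting("This is the WORST Service I have seen"))
# this is the WORST service i have seen
```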


### Removal of punctuation
- This is a text standardization step that helps treat 'hurray' and 'hurray!' in the same way.
- We can exclude specific punctuation characters from removal as needed.


In [6]:
type(string.punctuation)

str

In [7]:
df.drop(["text_lower"], axis = 1, inplace = True)

def remove_punctuation(sentence):
    translator = str.maketrans('', '', string.punctuation)
    return sentence.translate(translator)


df["text_no_punct"] = df["text"].apply(remove_punctuation)

In [8]:
df.head()

Unnamed: 0,text,text_no_punct
0,@115712 I understand. I would like to assist y...,115712 I understand I would like to assist you...
1,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...,sprintcare I have sent several private message...
3,@115712 Please send us a Private Message so th...,115712 Please send us a Private Message so tha...
4,@sprintcare I did.,sprintcare I did
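
If some punctuation should survive (for tweets, `@` and `#` often carry meaning), the translation table can be built from a reduced character set. This is a sketch; `remove_punctuation_except` and its `keep` parameter are hypothetical names. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes such as ‘ and ’ are not removed either way.

```python
import string

def remove_punctuation_except(text, keep="@#"):
    """Delete ASCII punctuation but leave the characters listed in `keep`."""
    to_delete = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", to_delete))

print(remove_punctuation_except("@sprintcare I did. #fail!"))
# @sprintcare I did #fail
```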


### Removal of stopwords

In [12]:
from nltk.corpus import stopwords
stopwords_english = " ".join(stopwords.words('english'))
stopwords_english

"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't"

In [13]:
"my" in stopwords_english

True

In [14]:
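def remove_stopwords(text):
    # Caveat: stopwords_english is a single space-joined string, so `in` is a
    # substring test. Words such as "would" (inside "wouldn") are dropped too.
    words = [word for word in str(text).lower().split() if word not in stopwords_english]
    return " ".join(words)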

In [15]:
text = "@115712 I understand. I would like to assist you"
remove_stopwords(text)

'@115712 understand. like assist'

In [16]:
df["text_no_stopwords"] = df["text_no_punct"].apply(remove_stopwords)
df.head()

Unnamed: 0,text,text_no_punct,text_no_stopwords
0,@115712 I understand. I would like to assist y...,115712 I understand I would like to assist you...,115712 understand like assist get private secu...
1,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose
2,@sprintcare I have sent several private messag...,sprintcare I have sent several private message...,sprintcare sent several private messages one r...
3,@115712 Please send us a Private Message so th...,115712 Please send us a Private Message so tha...,115712 please send private message assist clic...
4,@sprintcare I did.,sprintcare I did,sprintcare
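
Because `stopwords_english` above is one space-joined string, `word not in stopwords_english` checks for substrings rather than whole words. The sketch below contrasts that with a set lookup, using a tiny hand-written stop list (an assumption, so that it runs without the NLTK corpus); `remove_stopwords_with` is a hypothetical helper.

```python
STOP_LIST = ["i", "to", "you", "wouldn", "wouldn't"]  # tiny illustrative subset

stop_string = " ".join(STOP_LIST)  # `in` means "is a substring of"
stop_set = set(STOP_LIST)          # `in` means "is one of these words"

def remove_stopwords_with(text, container):
    return " ".join(w for w in text.lower().split() if w not in container)

text = "I would like to assist you"
print(remove_stopwords_with(text, stop_string))  # like assist
print(remove_stopwords_with(text, stop_set))     # would like assist
```

With the set, only exact stop words are removed, and membership tests are also faster on a corpus of this size.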


### Removal of frequent words
- In a domain-specific corpus, some words occur very frequently without being important to us.

- This step removes the most frequent words in the given corpus.

- If we use TF-IDF, this is largely taken care of, because words that occur in most documents receive a low weight.

In [17]:
from collections import Counter
cnt = Counter()

for text in df["text_no_stopwords"].values:
    for word in text.split():
        cnt[word] += 1

cnt.most_common(10)

[('please', 402709),
 ('dm', 335374),
 ('help', 267633),
 ('thanks', 206452),
 ('get', 200374),
 ('sorry', 192246),
 ('like', 146385),
 ('know', 145407),
 ('look', 139618),
 ('send', 138907)]

In [18]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])

def remove_frequent_words(text):
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_no_freq_words"] = df["text_no_stopwords"].apply(remove_frequent_words)
df.head()

Unnamed: 0,text,text_no_punct,text_no_stopwords,text_no_freq_words
0,@115712 I understand. I would like to assist y...,115712 I understand I would like to assist you...,115712 understand like assist get private secu...,115712 understand assist private secured link ...
1,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcare propose
2,@sprintcare I have sent several private messag...,sprintcare I have sent several private message...,sprintcare sent several private messages one r...,sprintcare sent several private messages one r...
3,@115712 Please send us a Private Message so th...,115712 Please send us a Private Message so tha...,115712 please send private message assist clic...,115712 private message assist click ‘message’ ...
4,@sprintcare I did.,sprintcare I did,sprintcare,sprintcare
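
To see why TF-IDF already discounts very frequent words, here is a small sketch of the plain inverse document frequency, log(N / df), on a made-up three-document corpus. Library implementations often use smoothed variants of this formula, so the exact numbers will differ.

```python
import math
from collections import Counter

docs = [
    "please dm us your email",
    "please send a dm",
    "refund arrived please",
]

n_docs = len(docs)
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))  # count each word once per document

# Plain (unsmoothed) inverse document frequency: log(N / df).
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

print(idf["please"])  # 0.0, because the word occurs in every document
print(idf["refund"])  # largest weight, because it occurs in one document only
```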


### Removal of Rare words
- This is similar to the previous step, but here the rarest words in the corpus are removed.

In [19]:
df.drop(["text_no_punct", "text_no_stopwords"], axis = 1, inplace=True)

In [20]:
n_rare_words = 10
RARE_WORDS = [w for (w, wc) in cnt.most_common()[-n_rare_words:]]
RARE_WORDS

['httpstco4dhaxwnqb4',
 '823867',
 '823868',
 'httpstco4v1ft0th5x',
 '823869',
 'httpstcov2tmhetl7q',
 'httpstcogfyuq1kjtk',
 '823870',
 'httpstco7uqpwyh1b6',
 'notjustxmasallyearround']

In [21]:
def remove_rarewords(text):
    return " ".join([word for word in str(text).split() if word not in RARE_WORDS])

In [22]:
df.head()

Unnamed: 0,text,text_no_freq_words
0,@115712 I understand. I would like to assist y...,115712 understand assist private secured link ...
1,@sprintcare and how do you propose we do that,sprintcare propose
2,@sprintcare I have sent several private messag...,sprintcare sent several private messages one r...
3,@115712 Please send us a Private Message so th...,115712 private message assist click ‘message’ ...
4,@sprintcare I did.,sprintcare


In [23]:
df["text_no_rare_words"] = df["text_no_freq_words"].apply(remove_rarewords)

In [24]:
df.head()

Unnamed: 0,text,text_no_freq_words,text_no_rare_words
0,@115712 I understand. I would like to assist y...,115712 understand assist private secured link ...,115712 understand assist private secured link ...
1,@sprintcare and how do you propose we do that,sprintcare propose,sprintcare propose
2,@sprintcare I have sent several private messag...,sprintcare sent several private messages one r...,sprintcare sent several private messages one r...
3,@115712 Please send us a Private Message so th...,115712 private message assist click ‘message’ ...,115712 private message assist click ‘message’ ...
4,@sprintcare I did.,sprintcare,sprintcare
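
In a large corpus many words are likely tied at the lowest count, so taking the last ten entries is a somewhat arbitrary slice. An alternative, sketched here with a hypothetical helper `rare_words_below`, is to remove every word below a minimum count.

```python
from collections import Counter

def rare_words_below(texts, min_count=2):
    """Return the set of words that occur fewer than `min_count` times."""
    counts = Counter(word for text in texts for word in text.split())
    return {word for word, n in counts.items() if n < min_count}

corpus = ["sprintcare propose", "sprintcare sent messages", "sprintcare"]
rare = rare_words_below(corpus)
cleaned = [" ".join(w for w in t.split() if w not in rare) for t in corpus]
print(cleaned)  # ['sprintcare', 'sprintcare', 'sprintcare']
```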


**We can combine the stopwords, frequent words, and rare words into a single list and remove all of them in one pass.**
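
A minimal sketch of that single-pass idea follows. The three small sets are illustrative stand-ins for the stop word list, `FREQWORDS`, and `RARE_WORDS` built earlier, and `remove_words` is a hypothetical helper.

```python
def remove_words(text, unwanted):
    """Drop every whitespace-separated token found in the `unwanted` set."""
    return " ".join(w for w in text.split() if w not in unwanted)

# Illustrative stand-ins for the three word lists built earlier.
stop_words = {"i", "to", "you"}
frequent_words = {"please", "dm"}
rare_words = {"httpstco4dhaxwnqb4"}

UNWANTED = stop_words | frequent_words | rare_words

print(remove_words("please dm to sprintcare httpstco4dhaxwnqb4", UNWANTED))
# sprintcare
```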

### Stemming
- Stemming is the process of reducing inflected words to their word stem, base, or root form.

- The Porter stemmer targets the English language.
- For other languages, we can use the Snowball stemmer.

In [26]:
from tqdm import tqdm
tqdm.pandas()

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

df["text_stemmed"] = df["text"].progress_apply(stem_words)
df.head()

100%|██████████| 2811774/2811774 [07:38<00:00, 6136.84it/s]


Unnamed: 0,text,text_no_freq_words,text_no_rare_words,text_stemmed
0,@115712 I understand. I would like to assist y...,115712 understand assist private secured link ...,115712 understand assist private secured link ...,@115712 i understand. i would like to assist y...
1,@sprintcare and how do you propose we do that,sprintcare propose,sprintcare propose,@sprintcar and how do you propos we do that
2,@sprintcare I have sent several private messag...,sprintcare sent several private messages one r...,sprintcare sent several private messages one r...,@sprintcar i have sent sever privat messag and...
3,@115712 Please send us a Private Message so th...,115712 private message assist click ‘message’ ...,115712 private message assist click ‘message’ ...,@115712 pleas send us a privat messag so that ...
4,@sprintcare I did.,sprintcare,sprintcare,@sprintcar i did.


In [27]:
# supported language for snowball stemmer
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

### Lemmatization
- Lemmatization reduces a word to its dictionary form (lemma). Unlike stemming, it depends on the part-of-speech (POS) tag to come up with the correct lemma.

In [30]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [31]:
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df["text_lemmatized"] = df["text"].progress_apply(lemmatize_words)
df.head()

100%|██████████| 2811774/2811774 [01:40<00:00, 27988.76it/s]


Unnamed: 0,text,text_no_freq_words,text_no_rare_words,text_stemmed,text_lemmatized
0,@115712 I understand. I would like to assist y...,115712 understand assist private secured link ...,115712 understand assist private secured link ...,@115712 i understand. i would like to assist y...,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that,sprintcare propose,sprintcare propose,@sprintcar and how do you propos we do that,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...,sprintcare sent several private messages one r...,sprintcare sent several private messages one r...,@sprintcar i have sent sever privat messag and...,@sprintcare I have sent several private messag...
3,@115712 Please send us a Private Message so th...,115712 private message assist click ‘message’ ...,115712 private message assist click ‘message’ ...,@115712 pleas send us a privat messag so that ...,@115712 Please send u a Private Message so tha...
4,@sprintcare I did.,sprintcare,sprintcare,@sprintcar i did.,@sprintcare I did.


In [32]:
print("Word is : stripes")
print("Lemma result for verb : ",lemmatizer.lemmatize("stripes", 'v'))
print("Lemma result for noun : ",lemmatizer.lemmatize("stripes", 'n'))

Word is : stripes
Lemma result for verb :  strip
Lemma result for noun :  stripe


In [35]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}

def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

df["text_lemmatized"] = df["text"].progress_apply(lemmatize_words)
df.head()

100%|██████████| 2811774/2811774 [26:31<00:00, 1766.44it/s]


Unnamed: 0,text,text_no_freq_words,text_no_rare_words,text_stemmed,text_lemmatized
0,@115712 I understand. I would like to assist y...,115712 understand assist private secured link ...,115712 understand assist private secured link ...,@115712 i understand. i would like to assist y...,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that,sprintcare propose,sprintcare propose,@sprintcar and how do you propos we do that,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...,sprintcare sent several private messages one r...,sprintcare sent several private messages one r...,@sprintcar i have sent sever privat messag and...,@sprintcare I have send several private messag...
3,@115712 Please send us a Private Message so th...,115712 private message assist click ‘message’ ...,115712 private message assist click ‘message’ ...,@115712 pleas send us a privat messag so that ...,@115712 Please send u a Private Message so tha...
4,@sprintcare I did.,sprintcare,sprintcare,@sprintcar i did.,@sprintcare I did.
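
The `wordnet_map.get(pos[0], wordnet.NOUN)` lookup above works because `nltk.pos_tag` returns Penn Treebank tags by default, and the first letter of such a tag identifies the broad word class. The sketch below reproduces only that mapping with plain strings, so it runs without NLTK; `penn_to_wordnet` is a hypothetical helper.

```python
# WordNet's POS constants are single letters: 'n', 'v', 'a', 'r'.
PENN_PREFIX_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

for tag in ["NNS", "VBD", "JJ", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
# NNS -> n, VBD -> v, JJ -> a, RB -> r, IN -> n (fallback to noun)
```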


### Removal of emojis
- Some text analysis tasks require the removal of emojis.

In [36]:
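def remove_emoji(text):
    # The parameter is named `text` so it does not shadow the `string` module.
    emoji_pattern = re.compile("["
                           "\U0001F600-\U0001F64F"  # emoticons
                           "\U0001F300-\U0001F5FF"  # symbols & pictographs
                           "\U0001F680-\U0001F6FF"  # transport & map symbols
                           "\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
                           "\U00002702-\U000027B0"  # dingbats
                           "\U000024C2-\U0001F251"  # very wide range; also matches non-emoji text such as CJK
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)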

remove_emoji("game is on 🔥🔥")

'game is on '

In [37]:
remove_emoji("Hilarious😂")

'Hilarious'

### Removal of Emoticons
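- An emoticon is built from ordinary keyboard characters, for example `:-)`.
- An emoji is a single pictographic character, for example 😀.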

In [38]:
from emoji_emoticons import EMOTICONS

In [39]:
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

remove_emoticons("Hello :-)")

'Hello '

In [40]:
remove_emoticons("I am sad :(")


'I am sad '

### Conversion of emoticons to words
- In sentiment analysis, emoticons give valuable information, so removing them is not a good solution.
- Hence, we can convert the emoticons to words so that they can be used in the downstream modeling process.

In [41]:
from emoji_emoticons import EMOTICONS

In [42]:
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text

text = "Hello :-) :-)"
convert_emoticons(text)

'Hello Happy_face_smiley Happy_face_smiley'

In [43]:
text = "I am sad :()"
convert_emoticons(text)

'I am sad Frown_sad_andry_or_poutingConfusion'
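
Looping over the dictionary and calling `re.sub` once per entry lets one emoticon match inside another, as the last output suggests. A possible alternative is a single compiled pattern with the longest emoticons tried first. The sketch below assumes literal (non-regex) keys and a hypothetical mini-dictionary `EMOTICON_WORDS`; if the keys of the real `EMOTICONS` mapping are written as regex patterns, `re.escape` would not be appropriate.

```python
import re

# Hypothetical mini-dictionary; the real EMOTICONS mapping is much larger.
EMOTICON_WORDS = {
    ":-)": "happy_face",
    ":-))": "very_happy_face",
    ":(": "sad_face",
}

# Longest alternatives first, so ":-))" is tried before its prefix ":-)".
_alternatives = sorted(EMOTICON_WORDS, key=len, reverse=True)
EMOTICON_RE = re.compile("|".join(re.escape(e) for e in _alternatives))

def convert_emoticons_once(text):
    """Replace each emoticon with its label in a single pass."""
    return EMOTICON_RE.sub(lambda m: EMOTICON_WORDS[m.group(0)], text)

print(convert_emoticons_once("Hello :-)) but sad :("))
# Hello very_happy_face but sad sad_face
```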

### Conversion of Emoji to Words
- We use the `UNICODE_EMO` dictionary, which maps each emoji to a description, to convert emojis to the corresponding words.

In [44]:
from emoji_emoticons import UNICODE_EMO

In [45]:
def convert_emojis(text):
    for emot in UNICODE_EMO:
        text = re.sub(r'('+emot+')', "_".join(UNICODE_EMO[emot].replace(",","").replace(":","").split()), text)
    return text

text = "game is on 🔥"
convert_emojis(text)

'game is on fire'

In [46]:
text = "Hilarious 😂"
convert_emojis(text)

'Hilarious face_with_tears_of_joy'
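
When no emoji dictionary is at hand, the official Unicode character names from the standard library can serve as a rough substitute. This is a sketch under a stated heuristic: characters at or above U+1F300 with the category "Symbol, other" are treated as emojis, which is an assumption and does not cover every emoji. `emoji_to_words` is a hypothetical helper.

```python
import unicodedata

def emoji_to_words(text):
    """Replace pictographic symbols with their lower-cased Unicode names."""
    out = []
    for ch in text:
        # Heuristic (an assumption): "Symbol, other" characters from
        # U+1F300 upwards are treated as emojis.
        if ord(ch) >= 0x1F300 and unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            out.append(name.lower().replace(" ", "_") if name else ch)
        else:
            out.append(ch)
    return "".join(out)

print(emoji_to_words("game is on 🔥"))  # game is on fire
print(emoji_to_words("Hilarious 😂"))   # Hilarious face_with_tears_of_joy
```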

### Removal of URLs
- If we are analyzing tweets, there is a good chance that a tweet contains a URL.

In [47]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [48]:
text = "Driverless AI NLP blog post on https://www.h2o.ai/blog/detecting-sarcasm-is-difficult-but-ai-may-have-an-answer/"
remove_urls(text)

'Driverless AI NLP blog post on '

In [49]:
text = "Please refer to link http://lnkd.in/ecnt5yC for the paper"
remove_urls(text)

'Please refer to link  for the paper'

In [50]:
text = "Want to know more. Checkout www.h2o.ai for additional information"
remove_urls(text)

'Want to know more. Checkout  for additional information'
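
The outputs above keep a double space where the URL used to be. A small variation, sketched here with the hypothetical helper `remove_urls_and_tidy`, collapses the leftover whitespace and optionally substitutes a placeholder token instead of deleting the URL.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def remove_urls_and_tidy(text, placeholder=""):
    """Replace URLs with `placeholder`, then collapse repeated whitespace."""
    without_urls = URL_RE.sub(placeholder, text)
    return " ".join(without_urls.split())

print(remove_urls_and_tidy("Please refer to link http://lnkd.in/ecnt5yC for the paper"))
# Please refer to link for the paper
print(remove_urls_and_tidy("Checkout www.h2o.ai now", placeholder="URL"))
# Checkout URL now
```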

### Removal of HTML Tags
- When we scrape data from websites, we might end up with HTML markup as part of our text.

In [51]:
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>"""

print(remove_html(text))


 H2O
 AutoML
 Driverless AI



In [1]:
from bs4 import BeautifulSoup

def remove_html(text):
    return BeautifulSoup(text, "lxml").text

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>
"""

print(remove_html(text))


 H2O
 AutoML
 Driverless AI
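
If installing `lxml` or BeautifulSoup is not an option, the standard library's `html.parser` can do a basic version of the same job. The sketch below is an assumption-laden alternative, not the notebook's method: `strip_html` is a hypothetical helper, and it keeps the text inside `<script>` and `<style>` elements, which a real scraper would usually want to drop.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__(convert_charrefs=True)  # also decodes &amp; etc.
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

print(strip_html("<p>H2O &amp; <b>AutoML</b></p>"))
# H2O & AutoML
```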




### Chat words conversion
- People use a lot of abbreviated words in chat, so it might be helpful to expand those words for our analysis.

In [2]:
from chat_word import chat_words_str

In [3]:
chat_words_map_dict = {}
chat_words_list = []
for line in chat_words_str.split("\n"):
    if line != "":
        cw = line.split("=")[0]
        cw_expanded = line.split("=")[1]
        chat_words_list.append(cw)
        chat_words_map_dict[cw] = cw_expanded
chat_words_list = set(chat_words_list)

def chat_words_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words_list:
            new_text.append(chat_words_map_dict[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

chat_words_conversion("one minute BRB")

'one minute Be Right Back'
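
Splitting on whitespace means that an abbreviation followed by punctuation, such as `BRB!`, is not found in the lookup. One possible variation matches alphanumeric runs with a regular expression instead. The sketch uses a hypothetical mini-dictionary `CHAT_WORDS` in place of `chat_words_map_dict`, and `expand_chat_words` is a hypothetical helper.

```python
import re

# Hypothetical mini-dictionary standing in for chat_words_map_dict.
CHAT_WORDS = {"BRB": "Be Right Back", "IMO": "In My Opinion"}

def expand_chat_words(text):
    """Expand known abbreviations, matching whole words case-insensitively."""
    def _expand(match):
        word = match.group(0)
        return CHAT_WORDS.get(word.upper(), word)
    # Each run of letters/digits is looked up; punctuation is left in place.
    return re.sub(r"[A-Za-z0-9]+", _expand, text)

print(expand_chat_words("one minute, brb!"))
# one minute, Be Right Back!
```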

### Spelling Correction
- Typos are common in text data, and we might want to correct those spelling errors before we do our analysis.

In [4]:
from spellchecker import SpellChecker

spell = SpellChecker()
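def correct_spellings(text):
    corrected_text = []
    words = text.split()
    misspelled_words = spell.unknown(words)
    for word in words:
        if word in misspelled_words:
            # correction() can return None in some pyspellchecker versions
            # when it finds no candidate, so fall back to the original word.
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)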
        
text = "speling correctin apple"
correct_spellings(text)

'spelling correcting apple'

In [5]:
misspelled_words = spell.unknown(text.split())
misspelled_words

{'correctin', 'speling'}
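
To illustrate the underlying idea without the `spellchecker` package, the sketch below swaps in a different technique: closest-match lookup against a fixed vocabulary with the standard library's `difflib`. The toy `VOCAB`, the `cutoff` value, and the helper `correct_with_vocab` are all assumptions for illustration; a real vocabulary would be far larger.

```python
import difflib

VOCAB = ["spelling", "correction", "apple", "correcting"]  # toy vocabulary

def correct_with_vocab(text, vocab=VOCAB, cutoff=0.8):
    """Replace out-of-vocabulary words with their closest vocabulary entry."""
    corrected = []
    for word in text.split():
        if word in vocab:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        # Keep the original word when nothing is similar enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

print(correct_with_vocab("speling aple zzz"))
# spelling apple zzz
```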