# **Preprocessing** 
   - 1 Lower casing
   - 2 Punctuation removal
   - 3 Stopwords removal
   - 4 Frequent words removal
   - 5 Rare words removal
   - 6 Spelling correction
   - 7 Tokenization
   - 8 Stemming
   - 9 Lemmatization


### **Load necessary libraries**

In [1]:
import numpy as np
import pandas as pd

### **Read dataset**

In [2]:
train = pd.read_csv("..\\..\\..\\data\\twitter_hate-speech\\train.csv")
test = pd.read_csv("..\\..\\..\\data\\twitter_hate-speech\\test.csv")

### **Preview dataset**

In [3]:
train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [4]:
test.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


## **1. Lower Casing**


- transform our tweets into lower case. 
- This avoids having multiple copies of the same words. 

In [8]:
def lower_case(df):
    df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
    print(df['tweet'].head())

In [9]:
lower_case(train)

0    @user when a father is dysfunctional and is so...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model i love u take with u all the time in ur...
4                  factsguide: society now #motivation
Name: tweet, dtype: object


In [10]:
lower_case(test)

0    #studiolife #aislife #requires #passion #dedic...
1    @user #white #supremacists want everyone to se...
2    safe ways to heal your #acne!! #altwaystoheal ...
3    is the hp and the cursed child book up for res...
4    3rd #bihday to my amazing, hilarious #nephew e...
Name: tweet, dtype: object


## **2. Punctuation Removal**


- The next step is to remove punctuation as it doesn’t add any extra information while treating text data. 

- Therefore removing all instances of it will help us reduce the size of the training data.

In [11]:
def punctuation_removal(df):
    df['tweet'] = df['tweet'].str.replace('[^\w\s]','')
    print(df['tweet'].head())

In [12]:
punctuation_removal(train)

0    @user when a father is dysfunctional and is so...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model i love u take with u all the time in ur...
4                  factsguide: society now #motivation
Name: tweet, dtype: object


In [13]:
punctuation_removal(test)

0    #studiolife #aislife #requires #passion #dedic...
1    @user #white #supremacists want everyone to se...
2    safe ways to heal your #acne!! #altwaystoheal ...
3    is the hp and the cursed child book up for res...
4    3rd #bihday to my amazing, hilarious #nephew e...
Name: tweet, dtype: object


## **3 Stop Words Removal**



-  stop words (or commonly occurring words) should be removed from the text data. 
- For this purpose, we can either create a list of stopwords ourselves or we can use predefined libraries.

In [14]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [15]:
def stop_words_removal(df):
    df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
    print(df['tweet'].head())

In [16]:
stop_words_removal(train)

0    @user father dysfunctional selfish drags kids ...
1    @user @user thanks #lyft credit can't use caus...
2                                       bihday majesty
3    #model love u take u time urð±!!! ððð...
4                      factsguide: society #motivation
Name: tweet, dtype: object


In [17]:
stop_words_removal(test)

0    #studiolife #aislife #requires #passion #dedic...
1    @user #white #supremacists want everyone see n...
2    safe ways heal #acne!! #altwaystoheal #healthy...
3    hp cursed child book reservations already? yes...
4    3rd #bihday amazing, hilarious #nephew eli ahm...
Name: tweet, dtype: object


## **4. Frequent Words Removal**

- We can also remove commonly occurring words from our text data.

- First, let’s check the 10 most frequently occurring words in our text data then take call to remove or retain.


In [18]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[:10]
freq

@user    17291
&amp;     1574
day       1454
#love     1449
happy     1328
-         1244
u         1116
love      1112
i'm        992
like       920
Name: count, dtype: int64

Now, we will remove these words as their presence will not of any use in classification of our text data.

In [19]:
freq = list(freq.index)

In [20]:
def frequent_words_removal(df):    
    df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
    print(df['tweet'].head())

In [21]:
frequent_words_removal(train)

0    father dysfunctional selfish drags kids dysfun...
1    thanks #lyft credit can't use cause offer whee...
2                                       bihday majesty
3    #model take time urð±!!! ðððð ð...
4                      factsguide: society #motivation
Name: tweet, dtype: object


In [22]:
frequent_words_removal(test)

0    #studiolife #aislife #requires #passion #dedic...
1    #white #supremacists want everyone see new â...
2    safe ways heal #acne!! #altwaystoheal #healthy...
3    hp cursed child book reservations already? yes...
4    3rd #bihday amazing, hilarious #nephew eli ahm...
Name: tweet, dtype: object


## **5. Rare Words Removal**

- Now, we will remove rarely occurring words from the text. 
- Because they’re so rare, the association between them and other words is dominated by noise. 
- We can replace rare words with a more general form and then this will have higher counts.

In [23]:
freq = pd.Series(' '.join(train['tweet']).split()).value_counts()[-10:]
freq

ukip        1
europ...    1
them,we     1
joke...     1
kylo,       1
prick       1
berry       1
ciroc       1
cents       1
chisolm.    1
Name: count, dtype: int64

In [24]:
freq = list(freq.index)

In [25]:
def rare_words_removal(df):
    df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
    print(df['tweet'].head())

In [26]:
rare_words_removal(train)

0    father dysfunctional selfish drags kids dysfun...
1    thanks #lyft credit can't use cause offer whee...
2                                       bihday majesty
3    #model take time urð±!!! ðððð ð...
4                      factsguide: society #motivation
Name: tweet, dtype: object


In [27]:
rare_words_removal(test)

0    #studiolife #aislife #requires #passion #dedic...
1    #white #supremacists want everyone see new â...
2    safe ways heal #acne!! #altwaystoheal #healthy...
3    hp cursed child book reservations already? yes...
4    3rd #bihday amazing, hilarious #nephew eli ahm...
Name: tweet, dtype: object


- All these pre-processing steps are essential and help us in reducing our vocabulary clutter so that the features produced in the end are more effective.

## **6 Spelling Correction** 

- Now tweets can be filled with plethora of spelling mistakes. Our task is to rectify these spelling mistakes.

- In that context, spelling correction is a useful pre-processing step because this also will help us in reducing multiple copies of words. For example, “Analytics” and “analytcs” will be treated as different words even if they are used in the same sense.

- To accomplish the above task, we will use the textblob library as follows-


In [30]:
!pip install textblob




[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [31]:
from textblob import TextBlob

In [32]:
def spell_correction(df):
    return df['tweet'][:5].apply(lambda x: str(TextBlob(x).correct()))

In [33]:
spell_correction(train)

0    father dysfunctional selfish drags kiss dysfun...
1    thanks #left credit can't use cause offer whee...
2                                       midday majesty
3    #model take time or±!!! ðððð ð¦...
4                      factsguide: society #motivation
Name: tweet, dtype: object

In [34]:
spell_correction(test)

0    #studiolife #dislike #requires #passion #educa...
1    #white #supremacists want everyone see new â...
2    safe ways heal #acne!! #altwaystoheal #healthy...
3    he cursed child book reservations already? yes...
4    rd #midday amazing, hilarious #nephew epi thei...
Name: tweet, dtype: object

## **7. Stemming**

- [Stemming](https://www.geeksforgeeks.org/introduction-to-stemming/) refers to the removal of suffices, like “ing”, “ly”, “s”, etc. by a simple rule-based approach. 

- So, stemming takes a word and refers it back to its base or root form. **Stems**, **Stemming**, **Stemmed** and **Stemtization** are all based on the single word **stem**.

- For this purpose, we will use *PorterStemmer* from the NLTK library.

In [73]:
from nltk.stem import PorterStemmer
st = PorterStemmer()

In [74]:
def stemming(df):
    return df['tweet'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

In [75]:
stemming(train)

0    father dysfunct selfish drag kid dysfunction. ...
1    thank #lyft credit can't use caus offer wheelc...
2                                       bihday majesti
3    #model take time urð±!!! ðððð ð...
4                           factsguide: societi #motiv
Name: tweet, dtype: object

In [76]:
stemming(test)

0    #studiolif #aislif #requir #passion #dedic #wi...
1    #white #supremacist want everyon see new â #...
2    safe way heal #acne!! #altwaystoh #healthi #he...
3    hp curs child book reserv already? yes, where?...
4    3rd #bihday amazing, hilari #nephew eli ahmir!...
Name: tweet, dtype: object

- We can see that *dysfunctional* has been transformed into *dysfunct*, among other changes.

## **8. Lemmatization**

- [Lemmatization](https://www.geeksforgeeks.org/python-lemmatization-with-nltk/) is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.

- Lemmatization is a more effective option than stemming because it converts the word into its root word, rather than just stripping the suffices. 

- Lemmatization makes use of the vocabulary and does a morphological analysis to obtain the root word. Therefore, lemmatization is prefered over stemming.

In [78]:
from textblob import Word

In [79]:
def lemmatization(df):
    df['tweet'] = df['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
    print(df['tweet'].head())

In [80]:
lemmatization(train)

0    father dysfunctional selfish drag kid dysfunct...
1    thanks #lyft credit can't use cause offer whee...
2                                       bihday majesty
3    #model take time urð±!!! ðððð ð...
4                      factsguide: society #motivation
Name: tweet, dtype: object


In [81]:
lemmatization(test)

0    #studiolife #aislife #requires #passion #dedic...
1    #white #supremacists want everyone see new â...
2    safe way heal #acne!! #altwaystoheal #healthy ...
3    hp cursed child book reservation already? yes,...
4    3rd #bihday amazing, hilarious #nephew eli ahm...
Name: tweet, dtype: object
