<h3>NLP Tutorial: Text Classification Using FastText</h3>

In [1]:
import pandas as pd

df = pd.read_csv("TwitterHate.csv")
print(df.shape)
df.head(3)

(31962, 3)


Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty


**Drop NA values**

In [2]:
df.dropna(inplace=True)
df.shape

(31962, 3)

In [3]:
df.label.unique()

array([0, 1], dtype=int64)

In [4]:
df['label_map'] = df['label'].map({0:"none_hate",1:"hate"})

In [5]:
df.head()

Unnamed: 0,id,label,tweet,label_map
0,1,0,@user when a father is dysfunctional and is s...,none_hate
1,2,0,@user @user thanks for #lyft credit i can't us...,none_hate
2,3,0,bihday your majesty,none_hate
3,4,0,#model i love u take with u all the time in ...,none_hate
4,5,0,factsguide: society now #motivation,none_hate


In [6]:
df.label_map.unique()

array(['none_hate', 'hate'], dtype=object)

In [7]:
df.label_map.value_counts()

none_hate    29720
hate          2242
Name: label_map, dtype: int64
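
The counts above show a heavy class imbalance, which matters when reading accuracy-style scores later on. As a rough sketch in plain Python, using only the two counts printed above, the share of the majority class is the score that a trivial classifier answering `none_hate` every time would reach:

```python
# Class counts copied from the value_counts() output above
counts = {"none_hate": 29720, "hate": 2242}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)                        # 31962
print(majority_label)               # none_hate
print(round(baseline_accuracy, 4))  # 0.9299
```

Any model score on this dataset is easier to judge against that roughly 93% baseline.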

When you train a fastText model, it expects every label to carry the `__label__` prefix. We first create a column that holds the prefixed label, then a second column that joins the prefixed label and the tweet text.

In [8]:
df['_label_map'] = '__label__' + df['label_map'].astype(str)
df.head(5)

Unnamed: 0,id,label,tweet,label_map,_label_map
0,1,0,@user when a father is dysfunctional and is s...,none_hate,__label__none_hate
1,2,0,@user @user thanks for #lyft credit i can't us...,none_hate,__label__none_hate
2,3,0,bihday your majesty,none_hate,__label__none_hate
3,4,0,#model i love u take with u all the time in ...,none_hate,__label__none_hate
4,5,0,factsguide: society now #motivation,none_hate,__label__none_hate


In [9]:
df['Label_Tweet'] = df['_label_map'] + ' ' + df['tweet']
df.head(3)

Unnamed: 0,id,label,tweet,label_map,_label_map,Label_Tweet
0,1,0,@user when a father is dysfunctional and is s...,none_hate,__label__none_hate,__label__none_hate @user when a father is dys...
1,2,0,@user @user thanks for #lyft credit i can't us...,none_hate,__label__none_hate,__label__none_hate @user @user thanks for #lyf...
2,3,0,bihday your majesty,none_hate,__label__none_hate,__label__none_hate bihday your majesty
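
The supervised input format is one example per line: one or more prefixed labels followed by the text. The sketch below builds such a line outside of pandas. `to_fasttext_line` is a hypothetical helper written for this illustration, not part of fastText, and the whitespace handling in it is an assumption about what is convenient rather than a requirement taken from the tutorial.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Build one training line: prefixed label(s) followed by the text.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    if isinstance(labels, str):
        labels = [labels]
    # A label containing a space would be read as several tokens, so replace spaces
    tagged = [prefix + label.strip().replace(" ", "_") for label in labels]
    # Keep each example on a single line by collapsing all whitespace
    one_line_text = " ".join(text.split())
    return " ".join(tagged + [one_line_text])


print(to_fasttext_line("none_hate", "bihday your majesty"))
# __label__none_hate bihday your majesty
```

Passing a list such as `["a", "b"]` yields a line with two prefixed labels, which is how multi-label examples are usually written.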


**Pre-processing**
1. Remove punctuation
2. Remove extra space
3. Make the entire sentence lower case

In [10]:
df.tweet[1]

"@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked"

In [11]:
import re

text ="@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked"

text = re.sub(r'[^\w\s\']',' ', text)
text = re.sub(' +', ' ', text)
text.strip().lower()

"user user thanks for lyft credit i can't use cause they don't offer wheelchair vans in pdx disapointed getthanked"

In [12]:
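def preprocess(text):
    # Replace everything except word characters, whitespace and apostrophes with a space
    text = re.sub(r'[^\w\s\']', ' ', text)
    # Collapse runs of spaces into a single space
    text = re.sub(' +', ' ', text)
    return text.strip().lower()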

In [13]:
df['Label_Tweet'] = df['Label_Tweet'].map(preprocess)
df.head()

Unnamed: 0,id,label,tweet,label_map,_label_map,Label_Tweet
0,1,0,@user when a father is dysfunctional and is s...,none_hate,__label__none_hate,__label__none_hate user when a father is dysfu...
1,2,0,@user @user thanks for #lyft credit i can't us...,none_hate,__label__none_hate,__label__none_hate user user thanks for lyft c...
2,3,0,bihday your majesty,none_hate,__label__none_hate,__label__none_hate bihday your majesty
3,4,0,#model i love u take with u all the time in ...,none_hate,__label__none_hate,__label__none_hate model i love u take with u ...
4,5,0,factsguide: society now #motivation,none_hate,__label__none_hate,__label__none_hate factsguide society now moti...
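
The cell above runs `preprocess` on the whole `Label_Tweet` string, label included. That works because underscores count as word characters (`\w`), so `__label__none_hate` passes through unchanged. One thing to keep in mind is that `' +'` only collapses spaces, so tabs and newlines inside a tweet survive. The variant below is a sketch under that observation; `preprocess_all_whitespace` is a hypothetical name and the sample string is made up.

```python
import re


def preprocess_all_whitespace(text):
    """Variant of preprocess() that also collapses tabs and newlines (hypothetical helper)."""
    # Same character filter as the tutorial: keep word characters, whitespace and apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    # \s+ matches runs of any whitespace, not just spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


sample = "__label__none_hate @user\tthanks for #lyft credit,\n can't use it!"
print(preprocess_all_whitespace(sample))
# __label__none_hate user thanks for lyft credit can't use it
```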


In [14]:
df.Label_Tweet[1]

"__label__none_hate user user thanks for lyft credit i can't use cause they don't offer wheelchair vans in pdx disapointed getthanked"

**Train Test Split**

In [15]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

In [16]:
train.shape, test.shape

((25569, 6), (6393, 6))
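
With only about 7% hate tweets, a purely random split can leave the two parts with slightly different class proportions, and without a fixed seed the split changes from run to run. The sketch below shows the idea of a seeded, stratified split in plain Python so that it runs without dependencies. `stratified_split` is a hypothetical helper and the rows are toy data, not the tweet dataset.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_size=0.2, seed=42):
    """Split rows so that each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_test = round(len(label_rows) * test_size)
        test.extend(label_rows[:n_test])
        train.extend(label_rows[n_test:])
    return train, test


# Toy data: 90 "none_hate" rows and 10 "hate" rows
rows = [("none_hate", i) for i in range(90)] + [("hate", i) for i in range(10)]
train, test = stratified_split(rows, label_of=lambda row: row[0])

print(len(train), len(test))                   # 80 20
print(sum(1 for r in test if r[0] == "hate"))  # 2
```

In the notebook itself, the same effect is normally obtained by passing `stratify=df['label']` and a `random_state` to scikit-learn's `train_test_split`.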

In [17]:
train.to_csv("Fasttext_Tweet.train", columns=["Label_Tweet"], index=False, header=False)
test.to_csv("Fasttext_Tweet.test", columns=["Label_Tweet"], index=False, header=False)
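
fastText reads these files line by line, so a quick sanity check that every line starts with a known label can catch blank lines or fields that the CSV writer wrapped in quotes. The following is a minimal sketch; `check_fasttext_lines` is a hypothetical helper and the sample lines are invented to show both failure cases.

```python
def check_fasttext_lines(lines, allowed_labels, prefix="__label__"):
    """Return (line_number, line) pairs that do not start with an allowed label."""
    expected = {prefix + label for label in allowed_labels}
    problems = []
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        # Flag empty lines and lines whose first token is not a known label
        if not tokens or tokens[0] not in expected:
            problems.append((number, line.rstrip("\n")))
    return problems


sample_lines = [
    "__label__none_hate bihday your majesty\n",
    '"__label__hate quoted by the csv writer"\n',
    "\n",
]
print(check_fasttext_lines(sample_lines, ["none_hate", "hate"]))
# [(2, '"__label__hate quoted by the csv writer"'), (3, '')]
```

Because the function accepts any iterable of strings, an open file handle for `Fasttext_Tweet.train` can be passed in place of `sample_lines`.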

**Train the model and evaluate performance**

In [18]:
import fasttext

model = fasttext.train_supervised(input="Fasttext_Tweet.train",lr=0.8, epoch=125, wordNgrams=2)
model.test("Fasttext_Tweet.test")

(6393, 0.9635538870639763, 0.9635538870639763)

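The first value (6393) is the number of test examples. The second and third values are precision and recall for the top prediction. They are identical here because each tweet has exactly one label and the model returns exactly one prediction, so both reduce to overall accuracy. About 96% looks good, but the data is dominated by none_hate tweets, so this figure alone says little about how well hate tweets are detected.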
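
Per-label scores show how the minority `hate` class fares, which a single overall figure cannot. The sketch below computes precision and recall for each label from two lists of labels. `per_label_scores` is a hypothetical helper, and the toy labels are made up for illustration; they are not results from the model trained above.

```python
from collections import Counter


def per_label_scores(true_labels, predicted_labels):
    """Compute precision and recall for each label from paired label lists."""
    true_pos, pred_count, true_count = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        true_count[truth] += 1
        pred_count[pred] += 1
        if truth == pred:
            true_pos[truth] += 1

    scores = {}
    for label in sorted(set(true_count) | set(pred_count)):
        # Guard against division by zero when a label is never predicted or never present
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / true_count[label] if true_count[label] else 0.0
        scores[label] = {"precision": precision, "recall": recall}
    return scores


# Made-up toy labels for illustration only, not output from the trained model
truth = ["none_hate"] * 8 + ["hate"] * 2
preds = ["none_hate"] * 8 + ["none_hate", "hate"]

for label, score in per_label_scores(truth, preds).items():
    print(label, score)
```

In this toy case nine of ten predictions are correct, yet recall for `hate` is only 0.5, which is the kind of gap an overall score hides. Recent versions of the fasttext Python package also appear to offer `model.test_label` for per-label results; check that your installed version provides it before relying on it.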

**Now let's predict labels for a few tweets**

In [19]:
model.predict("i dont like when women got lesser sallary than man")

(('__label__none_hate',), array([0.99234444]))

In [20]:
model.predict("Sometimes, you think that you want to disappear, but all you really want is to be found.")

(('__label__none_hate',), array([1.00000858]))

In [21]:
model.predict("Even the darkest night will end, and the sun will rise.")

(('__label__none_hate',), array([1.00001001]))

In [22]:
model.predict("You are not born a winner. You are not born a loser. You are born a chooser.")

(('__label__none_hate',), array([0.99977881]))

In [23]:
model.predict("	tweet78	@user hey, white people: you can call people 'white' by @user  #race  #identity #med…")

(('__label__hate',), array([0.98775697]))

In [24]:
model.predict("	tweet57	@user lets fight against ")

(('__label__hate',), array([1.00000525]))
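
The model was trained on text that went through `preprocess`, while the calls above pass raw strings with capitals, punctuation and tabs. A minimal sketch of a wrapper that applies the same cleaning before prediction follows. `predict_clean` is a hypothetical helper, and `StubModel` only stands in for the trained model so that the snippet runs on its own; its fixed answer is not a real prediction.

```python
import re


def preprocess(text):
    # Same cleaning as the tutorial's preprocess()
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(" +", " ", text)
    return text.strip().lower()


def predict_clean(model, text):
    """Clean the text like the training data, then call model.predict (hypothetical wrapper)."""
    # Joining on split() puts everything on one line and drops tabs and newlines
    cleaned = " ".join(preprocess(text).split())
    return cleaned, model.predict(cleaned)


class StubModel:
    """Stand-in for the trained fastText model; it only returns a fixed answer."""

    def predict(self, text):
        return (("__label__none_hate",), [1.0])


cleaned, result = predict_clean(StubModel(), "Even the darkest night will end,\tand the sun will rise.")
print(cleaned)
# even the darkest night will end and the sun will rise
```

With the notebook's trained model, the call would be `predict_clean(model, "...")`, so that the text seen at prediction time matches the form of the training lines.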