### **TF-IDF: Exercises**
 
- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.
 
- On social media platforms such as Twitter and Instagram, people comment on particular events, and these comments can express feelings such as sadness, happiness, joy, sarcasm, fear, and many others.
 
- Given a comment, we will use classical NLP techniques to classify which emotion it expresses.
 
- We will represent the text with techniques such as Bag of Words, n-grams, and TF-IDF, and then apply different classification algorithms.
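
To make the n-gram representation concrete, here is a minimal pure-Python sketch of a bag of n-grams. The helper names `ngrams` and `bag_of_ngrams` are hypothetical, and the whitespace tokenizer is a simplifying assumption: scikit-learn's `CountVectorizer` tokenizes differently (by default it drops one-character tokens such as "i"), so the counts here only illustrate the idea behind `ngram_range`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, ngram_range=(1, 1)):
    """Count all n-grams of `text` for each n in the inclusive range."""
    tokens = text.lower().split()  # simplified tokenizer (assumption)
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        counts.update(ngrams(tokens, n))
    return counts

comment = "i feel happy and i feel calm"

# Unigrams and bigrams together, analogous to ngram_range=(1, 2)
unigrams_and_bigrams = bag_of_ngrams(comment, ngram_range=(1, 2))

# Trigrams only, analogous to ngram_range=(3, 3)
trigrams_only = bag_of_ngrams(comment, ngram_range=(3, 3))

print(unigrams_and_bigrams["i feel"])  # the bigram "i feel" occurs twice
print(sorted(trigrams_only))
```

Every trigram in this short comment occurs only once, which hints at why trigram-only features tend to be sparse on short texts.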

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns.
        - Comment
        - Emotion
- Comments are statements or messages about a particular event or situation.

- The Emotion column labels each comment as Fear 😨, Anger 😡, or Joy 😂.

- Because there are three classes, this is a **multi-class classification** problem.

In [22]:
import pandas as pd

df = pd.read_csv('./Emotion_classify_Data.csv')
df.shape
df.head()


Unnamed: 0,Comment,Emotion
0,i seriously hate one subject to death but now ...,fear
1,im so full of life i feel appalled,anger
2,i sit here to write i start to dig out my feel...,fear
3,ive been really angry with r and i feel like a...,joy
4,i feel suspicious if there is no one outside l...,fear


In [23]:
df.Emotion.value_counts()

Emotion
anger    2000
joy      2000
fear     1937
Name: count, dtype: int64

In [24]:
# Undersample every class to the size of the smallest one (fear: 1937 rows)
min_samples = int(df['Emotion'].value_counts().min())
df_anger = df[df['Emotion'] == 'anger'].sample(min_samples, random_state=42)
df_joy = df[df['Emotion'] == 'joy'].sample(min_samples, random_state=42)
df_fear = df[df['Emotion'] == 'fear'].sample(min_samples, random_state=42)

df_final = pd.concat([df_anger, df_joy, df_fear], axis=0)

# check the result by printing the top 5 rows
df_final.head()


Unnamed: 0,Comment,Emotion
5528,i feel there are dangerous games or activities,anger
1072,i felt like facebook was a catalyst for me to ...,anger
3924,i am for the first time this year feeling the ...,anger
2656,ill take my gfathers ute down to get a load of...,anger
3768,i usually don t wear glasses at first i had un...,anger


In [25]:
df_final.Emotion.value_counts()

Emotion
anger    1937
joy      1937
fear     1937
Name: count, dtype: int64
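
The balancing step is random undersampling: each class is reduced to the size of the smallest class. As a rough illustration of the same idea without pandas, the sketch below uses a hypothetical helper `undersample` on made-up rows; it is not the notebook's code and will not pick the same rows as `DataFrame.sample`.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_of, seed=42):
    """Randomly keep the same number of rows per class (the smallest class size)."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    # Every class is cut down to the size of the rarest class
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced

# Toy stand-in for (Comment, Emotion) rows; the texts are made up
rows = (
    [(f"angry text {i}", "anger") for i in range(5)]
    + [(f"happy text {i}", "joy") for i in range(5)]
    + [(f"scared text {i}", "fear") for i in range(3)]
)

balanced = undersample(rows, label_of=lambda row: row[1])
balanced_counts = Counter(label for _, label in balanced)
print(balanced_counts)
```

In pandas, a similar result can likely be reached in one step with `df.groupby('Emotion').sample(n=min_samples, random_state=42)`, where `min_samples` is the size of the smallest class.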

In [26]:
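# Add a new column "Emotion_num" that gives a unique number to each emotion
# joy --> 0, fear --> 1, anger --> 2
df_final['Emotion_num'] = df_final['Emotion'].map({
    'joy': 0,
    'fear': 1,
    'anger': 2,
})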

In [27]:
df_final.head()

Unnamed: 0,Comment,Emotion,Emotion_num
5528,i feel there are dangerous games or activities,anger,2
1072,i felt like facebook was a catalyst for me to ...,anger,2
3924,i am for the first time this year feeling the ...,anger,2
2656,ill take my gfathers ute down to get a load of...,anger,2
3768,i usually don t wear glasses at first i had un...,anger,2


### **Modelling without Pre-processing Text data**

In [28]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(df_final['Comment'],df_final['Emotion_num'],test_size=.2,random_state=42,stratify=df_final.Emotion_num)

In [29]:
# shapes of x_train and x_test (not displayed: a notebook cell only shows its last expression)
x_train.shape, x_test.shape
# class counts in the training labels
y_train.value_counts()

Emotion_num
0    1550
2    1549
1    1549
Name: count, dtype: int64
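
The near-equal class counts above are the effect of `stratify`: the split is made per class, so train and test keep the same class proportions. The sketch below illustrates that idea with a hypothetical helper `stratified_split` on made-up labels. It is an assumption-laden simplification, and scikit-learn's `train_test_split` may distribute rounding remainders differently.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split row indices so each class contributes ~test_size of its rows to the test set."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label][:]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Made-up balanced labels: 50 rows for each of the three classes
labels = [0] * 50 + [1] * 50 + [2] * 50

train_idx, test_idx = stratified_split(labels, test_size=0.2)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```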


**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with only trigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [30]:
#import CountVectorizer, RandomForestClassifier, Pipeline, and classification_report from sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#1. create a pipeline object
clf = Pipeline([
    ('vec',CountVectorizer(ngram_range=(3,3))),
    ('model',RandomForestClassifier())
])

#2. fit with x_train and y_train
clf.fit(x_train, y_train)

#3. get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.57      0.17      0.26       387
           1       0.41      0.75      0.53       388
           2       0.52      0.44      0.48       388

    accuracy                           0.45      1163
   macro avg       0.50      0.45      0.42      1163
weighted avg       0.50      0.45      0.42      1163
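
To read a report like the one above, it helps to see how each column is derived. The sketch below computes precision, recall, F1, and support per class from scratch, along with accuracy and the macro-averaged F1. The labels are made up for illustration (they are not the notebook's predictions), and `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    support = tp + fn  # number of true rows of this class
    return precision, recall, f1, support

# Made-up labels: class 0 is rarely predicted, class 1 is over-predicted
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1]

metrics = {label: per_class_metrics(y_true, y_pred, label) for label in (0, 1, 2)}
for label, (precision, recall, f1, support) in metrics.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1 for _, _, f1, _ in metrics.values()) / len(metrics)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

In this toy data, class 0 has high precision but low recall, while class 1 has high recall but low precision because many rows of other classes are assigned to it.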




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.


In [31]:
#import MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB


#1. create a pipeline object
clf = Pipeline([
    ('vec',CountVectorizer(ngram_range=(1,2))),
    ('model',MultinomialNB())
])


#2. fit with x_train and y_train
clf.fit(x_train, y_train)

#3. get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.85      0.84      0.85       387
           1       0.83      0.86      0.85       388
           2       0.88      0.86      0.87       388

    accuracy                           0.85      1163
   macro avg       0.85      0.85      0.85      1163
weighted avg       0.85      0.85      0.85      1163




**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [32]:
#1. create a pipeline object
clf = Pipeline([
    ('vec',CountVectorizer(ngram_range=(1,2))),
    ('model',RandomForestClassifier())
])

#2. fit with x_train and y_train
clf.fit(x_train, y_train)

#3. get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.83      0.95      0.89       387
           1       0.91      0.88      0.90       388
           2       0.94      0.84      0.89       388

    accuracy                           0.89      1163
   macro avg       0.90      0.89      0.89      1163
weighted avg       0.90      0.89      0.89      1163




**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** to vectorize the text.
- use **RandomForest** as the classifier.
- print the classification report.


In [33]:
#import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer


#1. create a pipeline object

clf = Pipeline([
    ('vec',TfidfVectorizer()),
    ('model',RandomForestClassifier())
])


#2. fit with x_train and y_train
clf.fit(x_train, y_train)

#3. get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.87      0.94      0.91       387
           1       0.92      0.90      0.91       388
           2       0.94      0.88      0.91       388

    accuracy                           0.91      1163
   macro avg       0.91      0.91      0.91      1163
weighted avg       0.91      0.91      0.91      1163
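
For intuition about what a TF-IDF vectorizer computes, here is a minimal pure-Python sketch. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each document vector, which to my understanding matches the documented defaults of `TfidfVectorizer`. The function `tfidf`, the whitespace tokenizer, and the three-sentence corpus are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return (idf per term, list of L2-normalized TF-IDF vectors as dicts)."""
    docs = [Counter(text.lower().split()) for text in corpus]  # raw term counts
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())

    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for counts in docs:
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return idf, vectors

# Made-up corpus: "feel" and "so" appear everywhere, the emotion words only once
corpus = ["feel so happy today", "feel so scared tonight", "feel so angry now"]
idf, vectors = tfidf(corpus)

print(round(idf["feel"], 3), round(idf["happy"], 3))
print({term: round(weight, 3) for term, weight in vectors[0].items()})
```

Terms that occur in every document ("feel", "so") receive the minimum idf of 1, so the rarer, more distinctive words carry more weight in each vector.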



### **Use text pre-processing to remove stop words and punctuation, and apply lemmatization**

In [34]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [35]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient

df_final['preprocessed_comment']=df_final['Comment'].apply(preprocess)

**Build a model with pre-processed text**

In [40]:
# Do the train-test split with a test size of 20%, random_state=42, and stratified sampling
# Note: use the preprocessed_comment column
x_train,x_test,y_train,y_test=train_test_split(df_final['preprocessed_comment'],df_final['Emotion_num'],test_size=.2,random_state=42,stratify=df_final.Emotion_num)

**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [41]:
#1. create a pipeline object
clf = Pipeline([
    ('vec',CountVectorizer(ngram_range=(1,2))),
    ('model',RandomForestClassifier())
])

#2. fit with x_train and y_train
clf.fit(x_train, y_train)

#3. get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.93      0.95      0.94       387
           1       0.94      0.91      0.93       388
           2       0.92      0.93      0.93       388

    accuracy                           0.93      1163
   macro avg       0.93      0.93      0.93      1163
weighted avg       0.93      0.93      0.93      1163




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use the **TF-IDF vectorizer** to vectorize the pre-processed text.
- use **RandomForest** as the classifier.
- print the classification report.


In [42]:
#1. create a pipeline object

clf = Pipeline([
    ('vec',TfidfVectorizer()),
    ('model',RandomForestClassifier())
])


#2. fit with x_train and y_train
clf.fit(x_train, y_train)

#3. get the predictions for x_test and store them in y_pred
y_pred = clf.predict(x_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.95      0.94       387
           1       0.92      0.93      0.92       388
           2       0.93      0.90      0.92       388

    accuracy                           0.93      1163
   macro avg       0.93      0.93      0.93      1163
weighted avg       0.93      0.93      0.93      1163



In [52]:
clf.predict(['i feel like i ve regained another vital part of my life which is living'])

array([0], dtype=int64)
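
The prediction is a class number, so it has to be mapped back to an emotion name. A small sketch, assuming the same joy/fear/anger mapping used for `Emotion_num`; the list `predicted` is a stand-in for the array returned by `clf.predict`.

```python
# Mapping used when creating Emotion_num: joy --> 0, fear --> 1, anger --> 2
emotion_to_num = {"joy": 0, "fear": 1, "anger": 2}

# Invert the mapping so class numbers can be turned back into names
num_to_emotion = {num: emotion for emotion, num in emotion_to_num.items()}

predicted = [0]  # stand-in for the output of clf.predict([...])
decoded = [num_to_emotion[int(num)] for num in predicted]
print(decoded)  # ['joy']
```

Under this mapping, the predicted class 0 corresponds to joy.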