<h1 style="text-align:center;color:mediumvioletred">Exercise: TF-IDF</h1>

### **TF-IDF: Exercises**
 
- Humans 👦 show different emotions depending on the situation, and communicate them through facial expressions or in words.

- On social media platforms such as Twitter and Instagram, many people comment on particular events, and these comments can express sadness, happiness, joy, sarcasm, fear, and many other feelings.

- For a given comment, we will use classical NLP techniques to classify which emotion it expresses.

- We will represent the text with techniques such as Bag of Words, n-grams, and TF-IDF, and apply different classification algorithms.

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- The data consists of two columns:
    - Comment
    - Emotion
- Comment holds statements or messages about a particular event or situation.

- Emotion gives the label of the comment: fear 😨, anger 😡, or joy 😂.

- As there are 3 classes, this is a **multi-class classification** problem.
---

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with trigrams only.
- Use **RandomForest** as the classifier.
- Print the classification report.
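
As a quick illustration of what "trigrams only" means, the sketch below runs the analyzer of a `CountVectorizer(ngram_range=(3, 3))` on one comment from the dataset preview. The variable names are illustrative; the default tokenizer drops single-character tokens, so the standalone "i" does not appear in any trigram.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(3, 3) keeps only sequences of exactly three consecutive tokens
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3))

# build_analyzer() returns the function that turns raw text into features
analyze = trigram_vectorizer.build_analyzer()

trigrams = analyze("im so full of life i feel appalled")
for gram in trigrams:
    print(gram)
```

A short comment produces only a handful of trigrams, and each one is a fairly specific phrase, so many trigrams seen at test time may never have appeared in the training data.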

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("Emotion_classify_Data.csv")
df.head()

|   | Comment | Emotion |
|---|---------|---------|
| 0 | i seriously hate one subject to death but now ... | fear |
| 1 | im so full of life i feel appalled | anger |
| 2 | i sit here to write i start to dig out my feel... | fear |
| 3 | ive been really angry with r and i feel like a... | joy |
| 4 | i feel suspicious if there is no one outside l... | fear |


In [3]:
df.shape

(5937, 2)

#### Let's check the Emotion class distribution

In [4]:
df.Emotion.value_counts()

Emotion
anger    2000
joy      2000
fear     1937
Name: count, dtype: int64

##### The classes are already nearly balanced
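
One way to quantify "balanced" is to look at class shares and at the ratio between the largest and smallest class. The sketch below rebuilds the counts printed above as a small Series; on the real DataFrame, `df.Emotion.value_counts(normalize=True)` gives the shares directly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({"anger": 2000, "joy": 2000, "fear": 1937})

# Share of each class in the whole dataset
shares = counts / counts.sum()

# Ratio of the largest to the smallest class; 1.0 means perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(round(imbalance_ratio, 3))
```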

### Label Mapping

In [5]:
df["Emotion_num"] = df["Emotion"].map({
    'joy': 0,
    'fear': 1,
    'anger': 2
})

df.head()

|   | Comment | Emotion | Emotion_num |
|---|---------|---------|-------------|
| 0 | i seriously hate one subject to death but now ... | fear | 1 |
| 1 | im so full of life i feel appalled | anger | 2 |
| 2 | i sit here to write i start to dig out my feel... | fear | 1 |
| 3 | ive been really angry with r and i feel like a... | joy | 0 |
| 4 | i feel suspicious if there is no one outside l... | fear | 1 |
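
`Series.map` with a dictionary silently produces `NaN` for any value that is not a key, so it can be worth checking that every label was mapped. The sketch below shows such a check on a small made-up frame; the `toy` frame and the deliberately miscapitalized "Joy" are illustrative and not part of the dataset.

```python
import pandas as pd

label_map = {"joy": 0, "fear": 1, "anger": 2}

# Toy frame with one label ("Joy") that is not a key in label_map
toy = pd.DataFrame({"Emotion": ["joy", "fear", "anger", "Joy"]})
toy["Emotion_num"] = toy["Emotion"].map(label_map)

# Labels that did not match any key end up as NaN
unmapped = toy.loc[toy["Emotion_num"].isna(), "Emotion"].unique()

print(toy)
print("Unmapped labels:", list(unmapped))
```

On the real data, an empty `unmapped` result confirms that the three keys cover every value in the Emotion column.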


### Model Training without Preprocessing

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df.Comment,
    df.Emotion_num,
    test_size = 0.2,
    random_state = 69,
    stratify = df.Emotion_num
)

In [7]:
print(f"X_train shape = {X_train.shape}\nX_test shape = {X_test.shape}")

X_train shape = (4749,)
X_test shape = (1188,)


In [8]:
y_train.value_counts()

Emotion_num
0    1600
2    1600
1    1549
Name: count, dtype: int64

In [9]:
y_test.value_counts()

Emotion_num
0    400
2    400
1    388
Name: count, dtype: int64
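
The train and test counts above keep the same class proportions because of the `stratify` argument. A minimal sketch of that behaviour on made-up data (30 placeholder comments, 10 per class) looks like this:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 30 dummy comments, 10 for each of the 3 classes
texts = [f"comment {i}" for i in range(30)]
labels = [0] * 10 + [1] * 10 + [2] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=69,
    stratify=labels,  # keep class proportions equal in both splits
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)

print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Without `stratify`, the split is purely random and a small test set can end up with noticeably different class proportions from the training set.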

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('cv', CountVectorizer(ngram_range=(3,3))),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

In [14]:
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.57      0.25      0.35       400
           1       0.40      0.81      0.53       388
           2       0.57      0.32      0.41       400

    accuracy                           0.46      1188
   macro avg       0.51      0.46      0.43      1188
weighted avg       0.51      0.46      0.43      1188
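
In the report above, class 1 (fear) has high recall (0.81) but low precision (0.40), which suggests that many comments from the other classes are being predicted as fear. A confusion matrix makes this kind of pattern visible. The sketch below uses made-up labels, not the model's actual predictions, to show how precision and recall for one class are read off the matrix.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels that mimic an over-predicted class 1
y_true_toy = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 1, 1, 2, 1, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])

# Precision for class 1: correct class-1 predictions / all class-1 predictions
precision_1 = cm[1, 1] / cm[:, 1].sum()

# Recall for class 1: correct class-1 predictions / all true class-1 samples
recall_1 = cm[1, 1] / cm[1, :].sum()

print(cm)
print("precision (class 1):", precision_1)
print("recall    (class 1):", recall_1)
```

For the real model, `confusion_matrix(y_test, y_pred)` gives the same view of the predictions from the cell above.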



---

**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.

In [15]:
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1,2))),
    ('nb', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.89      0.90       400
           1       0.86      0.90      0.88       388
           2       0.88      0.86      0.87       400

    accuracy                           0.88      1188
   macro avg       0.88      0.88      0.88      1188
weighted avg       0.88      0.88      0.88      1188



---

**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [17]:
clf = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1,2))),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.98      0.94       400
           1       0.96      0.91      0.94       388
           2       0.94      0.90      0.92       400

    accuracy                           0.93      1188
   macro avg       0.94      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188
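
Attempts 1 to 3 differ only in the vectorizer settings and the classifier, so the comparison can also be written as a loop over configurations. The sketch below is one possible way to do that, on a tiny made-up dataset; the `configs` dictionary, the toy sentences, and the use of `accuracy_score` instead of the full report are all choices made for this illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up dataset (0 = joy, 1 = fear, 2 = anger), for illustration only
train_texts = [
    "i feel so happy today", "what a joyful happy day",
    "i am scared of the dark", "so afraid and scared now",
    "this makes me really angry", "i am furious and angry",
]
train_labels = [0, 0, 1, 1, 2, 2]
test_texts = ["happy joyful day", "scared of the dark", "really angry now"]
test_labels = [0, 1, 2]

# One entry per attempt: (vectorizer, classifier)
configs = {
    "trigrams + RF": (CountVectorizer(ngram_range=(3, 3)),
                      RandomForestClassifier(random_state=0)),
    "uni+bigrams + NB": (CountVectorizer(ngram_range=(1, 2)),
                         MultinomialNB()),
    "uni+bigrams + RF": (CountVectorizer(ngram_range=(1, 2)),
                         RandomForestClassifier(random_state=0)),
}

scores = {}
for name, (vectorizer, model) in configs.items():
    pipe = Pipeline([("vec", vectorizer), ("model", model)])
    pipe.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipe.predict(test_texts))

for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

The scores on such a small toy set are not meaningful in themselves; the point is the structure, which keeps every attempt on exactly the same split and metric.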



---

**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use the **TF-IDF vectorizer** to represent the text.
- Use **RandomForest** as the classifier.
- Print the classification report.

In [18]:
clf = Pipeline([
    ('tf-idf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.90      0.97      0.94       400
           1       0.95      0.91      0.93       388
           2       0.95      0.91      0.93       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188
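
To show what the TF-IDF weights stand for, the sketch below computes them by hand in plain Python. It follows the formula documented for scikit-learn's `TfidfVectorizer` defaults: raw term counts, a smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and L2 normalization of each document vector. The three sentences are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only
docs = ["feel happy today", "feel afraid today", "feel angry now"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed idf, as in TfidfVectorizer(smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

def tfidf_weights(tokens):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    tf = Counter(tokens)
    raw = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf_weights(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

"feel" occurs in every document, so its idf is exactly 1 and it gets the lowest weight; "happy" occurs in a single document and gets the highest.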



---

## Model Training and Testing after Text Preprocessing

In [19]:
import spacy

nlp = spacy.load("en_core_web_sm")

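def preprocess(text):
    """Remove stop words and punctuation, then lemmatize the remaining tokens."""
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        # skip stop words and punctuation
        if token.is_stop or token.is_punct:
            continue
        # keep the lemma (base form) of the token
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)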

In [20]:
df["Comment_new"] = df["Comment"].apply(preprocess)
df.head()

|   | Comment | Emotion | Emotion_num | Comment_new |
|---|---------|---------|-------------|-------------|
| 0 | i seriously hate one subject to death but now ... | fear | 1 | seriously hate subject death feel reluctant drop |
| 1 | im so full of life i feel appalled | anger | 2 | m life feel appalled |
| 2 | i sit here to write i start to dig out my feel... | fear | 1 | sit write start dig feeling think afraid accep... |
| 3 | ive been really angry with r and i feel like a... | joy | 0 | ve angry r feel like idiot trust place |
| 4 | i feel suspicious if there is no one outside l... | fear | 1 | feel suspicious outside like rapture happen |


In [21]:
X_train, X_test, y_train, y_test = train_test_split(
    df.Comment_new,
    df.Emotion_num,
    test_size = 0.2,
    random_state = 69,
    stratify = df.Emotion_num
)

---

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.

In [22]:
clf = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1,2))),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.97      0.95       400
           1       0.93      0.91      0.92       388
           2       0.92      0.91      0.91       400

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188



---

**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use the **TF-IDF vectorizer** to represent the text.
- Use **RandomForest** as the classifier.
- Print the classification report.

In [23]:
clf = Pipeline([
    ('tf-idf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.97      0.95       400
           1       0.96      0.91      0.93       388
           2       0.93      0.93      0.93       400

    accuracy                           0.94      1188
   macro avg       0.94      0.94      0.94      1188
weighted avg       0.94      0.94      0.94      1188



---

### Testing Predictions

In [24]:
X_test[:5]

1996                              read novel feel relaxed
1834    loose job f amp ed xmas hate xmas hate holiday...
2948              not feel mad god angry allow happen sad
620                                            monthe ago
1369    realize awful mother feel lack energy independ...
Name: Comment_new, dtype: object

| Label  | Class |
|--------|-------|
| joy    | 0     |
| fear   | 1     |
| anger  | 2     |

In [25]:
y_test[:5]

1996    0
1834    1
2948    2
620     2
1369    1
Name: Emotion_num, dtype: int64

In [26]:
y_pred[:5]

array([0, 1, 2, 2, 1])

In [29]:
X_test[2948]

'not feel mad god angry allow happen sad'
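
The predictions above are integer class ids. To read them as emotion names, the label mapping can be inverted. The sketch below shows this on a tiny made-up training set; it swaps in `MultinomialNB` for RandomForest so that the toy result is deterministic, and the names `id_to_label` and `toy_clf` are illustrative. With the real pipeline, new comments would first need the same `preprocess` step that was applied to the training text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same mapping as in the Label Mapping section, plus its inverse
label_map = {"joy": 0, "fear": 1, "anger": 2}
id_to_label = {v: k for k, v in label_map.items()}

# Tiny made-up training set, for illustration only
train_texts = [
    "so happy and glad", "happy joyful day",
    "scared and afraid", "afraid of the dark",
    "angry and furious", "furious mad angry",
]
train_labels = [0, 0, 1, 1, 2, 2]

toy_clf = Pipeline([
    ("tf-idf", TfidfVectorizer()),
    ("nb", MultinomialNB()),  # swapped in for RandomForest in this sketch
])
toy_clf.fit(train_texts, train_labels)

# Predict a new comment and translate the class id back to its name
pred_ids = toy_clf.predict(["afraid again"])
pred_names = [id_to_label[int(i)] for i in pred_ids]

print(pred_ids, pred_names)
```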