We will explore Twitter emotion classification. The goal is to identify the primary emotion expressed in a tweet. Consider the following tweets:
```
Tweet 1: @NationalGallery @ThePoldarkian I have always loved this painting.
Tweet 2: '@tateliverpool #BobandRoberta: I am angry more artists that have a profile are not speaking up #foundationcourses.'
``` 

How would you describe the emotions in `Tweet 1` vs `Tweet 2`? `Tweet 1` expresses enjoyment and happiness, while `Tweet 2` directly expresses anger. We will be working with the SMILE Twitter Emotion Dataset ([Wang et al. 2016](https://ceur-ws.org/Vol-1619/paper3.pdf)). At a high level, our goal is to develop different models (rule-based, machine learning, and deep learning), which can be used to identify the emotion of a tweet. We will be required to clean and preprocess the data, generate features for classification, train various models, and evaluate the models. 

Before you get started, run the cell below to download the dataset and install a few relevant libraries.

In [None]:
!wget -O data.csv "https://figshare.com/ndownloader/files/4988956"
!pip install emoji

import nltk
nltk.download('punkt')

--2023-02-26 10:56:54--  https://figshare.com/ndownloader/files/4988956
Resolving figshare.com (figshare.com)... 54.228.130.170, 99.81.233.31, 2a05:d018:1f4:d003:616f:a4e2:59fe:d704, ...
Connecting to figshare.com (figshare.com)|54.228.130.170|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/4988956/smileannotationsfinal.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20230226/eu-west-1/s3/aws4_request&X-Amz-Date=20230226T105654Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=8b7bf895d4e105a471c9bd8f1ef65e6d01454f762e6b3a9a0d7999224f2a1329 [following]
--2023-02-26 10:56:54--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/4988956/smileannotationsfinal.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20230226/eu-west-1/s3/aws4_request&X-Amz-Date=20230226T105654Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=8b7bf895d4e105a47

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Task 1. Data Cleaning, Preprocessing, and Splitting
The SMILE dataset is loaded into a pandas dataframe with three columns: `id`, `tweet`, and `label`. The `tweet` column contains the raw scraped tweet and the `label` column contains the annotated emotion category. Each tweet is labelled with one of the following emotion labels:
- 'nocode', 'not-relevant' 
- 'happy', 'happy|surprise', 'happy|sad'
- 'angry', 'disgust|angry', 'disgust' 
- 'sad', 'sad|disgust', 'sad|disgust|angry' 
- 'surprise'

### Task 1a. Label Consolidation
As we can see above, the annotated categories are complex. Several tweets express mixed emotions (e.g. 'happy|sad') or multiple emotions (e.g. 'sad|disgust|angry'). The first thing we need to do is clean up our dataset by removing the complex examples and consolidating others so that we have a clean set of emotions to predict.

For Task 1a., we will do the following:
1. Drop all rows which have the label 'happy|sad', 'happy|surprise', 'sad|disgust|angry', or 'sad|angry'.
2. Re-label 'nocode' and 'not-relevant' as 'no-emotion'.
3. Re-label 'disgust|angry' and 'disgust' as 'angry'.
4. Re-label 'sad|disgust' as 'sad'.

The updated `data` dataframe should have 3,062 rows and 5 label categories (no-emotion, happy, angry, sad, and surprise).


In [None]:
import pandas as pd
import numpy as np
df= pd.read_csv('data.csv',header= None)
df.columns = ['id','tweet','label']
df.head(5)

Unnamed: 0,id,tweet,label
0,611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,nocode
1,614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy
2,614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy
3,614877582664835073,@Sofabsports thank you for following me back. ...,happy
4,611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy


In [None]:
dropped_data = df[(df['label'] == 'happy|sad') | (df['label'] == 'happy|surprise') | (df['label'] == 'sad|disgust|angry') | (df['label'] == 'sad|angry')].index
df.drop(dropped_data , inplace=True)
df_1 = df.replace(to_replace=["nocode", "not-relevant"],value="no-emotion")
df_2 = df_1.replace(to_replace=["disgust|angry", "disgust"],value="angry")
data = df_2.replace(to_replace=["sad|disgust"],value="sad")
print(data.label.unique())
print(len(data))

['no-emotion' 'happy' 'angry' 'sad' 'surprise']
3062


### Task 1a Tests 
Run the cell below to evaluate your code.

In [None]:
# Test 1. Data should have 5 unique labels.
print(f"Unique label test: {len(data['label'].unique()) == 5}")

# Test 2. Data labels must be: angry, happy, no-emotion, sad, and surprise
labels = ["angry", "happy", "no-emotion", "sad", "surprise"]
print(f"Label check: { set(data['label'].unique()).difference(labels) == set() }")

# Test 3. Check example counts per label
print(f"Angry example count: {len(data[data['label']=='angry']) == 70}")
print(f"Happy example count: {len(data[data['label']=='happy']) == 1137}")
print(f"No-Emotion example count: {len(data[data['label']=='no-emotion']) == 1786}")
print(f"Sad example count: {len(data[data['label']=='sad']) == 34}")
print(f"Surprise example count: {len(data[data['label']=='surprise']) == 35}")

Unique label test: True
Label check: True
Angry example count: True
Happy example count: True
No-Emotion example count: True
Sad example count: True
Surprise example count: True


### Task 1b. Tweet Cleaning and Processing
Raw tweets are noisy. Consider the example below: 
```
'@tateliverpool #BobandRoberta: I am angry more artists that have a profile are not speaking up #foundationcourses. 😠'
```
The mention @tateliverpool and hashtag #BobandRoberta are extra noise that doesn't directly help with understanding the emotion of the text. The accompanying emoji can be useful, but it first needs to be decoded to its text form (`:angry_face:`).

For this task we will complete the `preprocess_tweet` function below with the following preprocessing steps:
1. Lower case all text
2. De-emoji the text
3. Remove all hashtags, mentions, and urls
4. Remove all non-alphabet characters except the following punctuation marks: period, exclamation mark, and question mark


In [None]:
import emoji 
import re

def preprocess_tweet(tweet: str) -> str:
  """
  Takes a raw tweet and performs the following processing steps:
  1. Lower case all text
  2. De-emoji the text
  3. Remove all hashtags, mentions, and urls
  4. Remove all non-alphabet characters except the following punctuation marks: period, exclamation mark, and question mark
  """
  # 1. Lower case all text
  lowercase_tweet = tweet.lower()
  # 2. De-emoji the text, e.g. 😠 becomes :angry_face:
  demojized_tweet = emoji.demojize(lowercase_tweet)
  # 3. Remove all urls, mentions, and hashtags.
  # Note: because the character classes include \s, any plain words that
  # directly follow a mention or hashtag are removed along with it.
  clean_tweet1 = re.sub(r'https?://\S+', '', demojized_tweet)
  clean_tweet2 = re.sub(r'@[A-Za-z0-9_\s]+', '', clean_tweet1)
  clean_tweet3 = re.sub(r'#[A-Za-z0-9_\s]+', '', clean_tweet2)
  # 4. Remove all non-alphabet characters except period, exclamation mark, and question mark
  tweet = re.sub(r'[^a-zA-Z.!?\s]+', '', clean_tweet3).lstrip()

  return tweet

test_tweet = "'@tateliverpool #BobandRoberta: I am angry more artists that have a profile are not speaking up! #foundationcourses 😠'"
print(preprocess_tweet(test_tweet))

i am angry more artists that have a profile are not speaking up! angryface
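
The `\s` inside the character classes `@[A-Za-z0-9_\s]+` and `#[A-Za-z0-9_\s]+` lets a match run on through the plain words that follow a handle, so a tweet such as `@britishmuseum Upper arm guard?` is reduced to `?`. Below is a minimal sketch of a stricter alternative, assuming we only want to drop the handle or hashtag itself. The helper name `remove_mentions_and_hashtags` is hypothetical and not part of the assignment.

```python
import re

def remove_mentions_and_hashtags(text: str) -> str:
    """Hypothetical helper: drop @mentions and #hashtags but keep the words after them."""
    text = re.sub(r"@\w+", "", text)           # remove the handle only
    text = re.sub(r"#\w+", "", text)           # remove the hashtag only
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

print(remove_mentions_and_hashtags("@britishmuseum upper arm guard?"))
# upper arm guard?
```

Swapping this into step 3 would change `cleaned_tweet` for many rows, so every downstream count and metric in this notebook would change as well.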


### Task 1b Test
Run the cell below to evaluate your code. 

In [None]:
# Create new column with cleaned tweets. We will use this for the subsequent tasks
data["cleaned_tweet"] = data["tweet"].apply(preprocess_tweet)

# Test 1b 
test_tweet = "'@tateliverpool #BobandRoberta: I am angry more artists that have a profile are not speaking up! #foundationcourses 😠'"
clean_tweet = "i am angry more artists that have a profile are not speaking up! angryface"

print(f"Test 1b: {preprocess_tweet(test_tweet) == clean_tweet}")

Test 1b: True


### Task 1c. Generating Evaluation Splits 
Finally, we need to split our data into train, validation, and test sets. We will use a 60-20-20 split, where 60% of the data is used for training, 20% for validation, and 20% for testing. As the dataset is heavily imbalanced, we stratify on the label so that the label distributions across the three splits are roughly equal.

The splits are stored in the variables `train`, `val`, and `test` respectively.


In [None]:
data

Unnamed: 0,id,tweet,label,cleaned_tweet
0,611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,no-emotion,!
1,614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy,dorian gray with rainbow scarf from
2,614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy,... replace with your wish which the artist us...
3,614877582664835073,@Sofabsports thank you for following me back. ...,happy,. great to hear from a diverse amp interesting...
4,611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy,portrait. is the r for rex ?
...,...,...,...,...
3080,613678555935973376,MT @AliHaggett: Looking forward to our public ...,happy,mt looking forward to our public engagement e...
3081,613294681225621504,@britishmuseum Upper arm guard?,no-emotion,?
3082,615246897670922240,@MrStuchbery @britishmuseum Mesmerising.,happy,.
3083,613016084371914753,@NationalGallery The 2nd GENOCIDE against #Bia...,no-emotion,days of unreported aerial bombardment in


In [None]:
from sklearn.model_selection import train_test_split

# First hold out 40% of the data, stratified by label
train, holdout = train_test_split(data, train_size=0.6, test_size=0.4, random_state=2023, stratify=data["label"])
# Then split the held-out 40% evenly into validation and test sets
val, test = train_test_split(holdout, train_size=0.5, test_size=0.5, random_state=2023, stratify=holdout["label"])

print(f"final datasets, train examples: {len(train)}, val examples: {len(val)}, test examples: {len(test)}")

final datasets, train examples: 1837, val examples: 612, test examples: 613


In [None]:
train["cleaned_tweet"].shape

(1837,)

In [None]:
print(len(train))
print(len(val))
print(len(test))

1837
612
613


In [None]:
train.shape,val.shape, test.shape

((1837, 4), (612, 4), (613, 4))

In [None]:
test['label'].value_counts()

no-emotion    357
happy         228
angry          14
surprise        7
sad             7
Name: label, dtype: int64
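
One way to check that stratification worked is to compare each label's share in a split against its share in the full dataset. The sketch below uses only the counts already reported in this notebook (the Task 1a tests and the test-split counts above). The helper names `label_shares` and `expand` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List

def label_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return each label's share of the total, rounded to three decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(n / total, 3) for label, n in counts.items()}

def expand(counts: Dict[str, int]) -> List[str]:
    """Turn a {label: count} mapping back into a flat list of labels."""
    return [label for label, n in counts.items() for _ in range(n)]

# Counts copied from the Task 1a tests (full dataset) and the test split
full_counts = {"no-emotion": 1786, "happy": 1137, "angry": 70, "surprise": 35, "sad": 34}
test_counts = {"no-emotion": 357, "happy": 228, "angry": 14, "surprise": 7, "sad": 7}

full_shares = label_shares(expand(full_counts))
test_shares = label_shares(expand(test_counts))
for label in full_shares:
    print(f"{label:<12}{full_shares[label]:>8.3f}{test_shares[label]:>8.3f}")
```

In the notebook itself, the same helper could be applied directly to `train["label"]`, `val["label"]`, and `test["label"]`.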

## Task 2: Naive Baseline Using a Rule-based Classifier 

Now that we have a dataset, let's work on developing some solutions for emotion classification. We'll start by implementing a simple rule-based classifier, which will also serve as our naive baseline. Emotive language (e.g. awesome, feel great, super happy) can be a strong signal of the overall emotion being expressed by the tweet. For each emotion in our label space (happy, surprise, sad, angry) we will generate a set of words and phrases that are often associated with that emotion. At classification time, the classifier will calculate a score based on the overlap between the words in the tweet and the emotive words and phrases for each of the emotions. The emotion label with the highest overlap will be selected as the prediction, and if there is no match the "no-emotion" label will be predicted. We can break the implementation of this rule-based classifier into three steps:
1. Emotive language extraction from train examples 
2. Developing a scoring algorithm
3. Building the end-to-end classification flow 

### Task 2a. Emotive Language Extraction
For this task we will generate a set of unigrams and bigrams that will be used to predict each of the labels. Using the training data we will need to extract all the unique unigrams and bigrams associated with each label (excluding no-emotion). Then we should ensure that the extracted terms for each emotion label do not appear in the other lists. In the real world, we would then manually curate the generated lists to ensure that associated words were useful and emotive. For the project, we won't be required to further curate the generated lists.

The extracted terms are stored as sets in the following variables: `happy_words`, `surprise_words`, `sad_words`, and `angry_words`.

In [None]:
from typing import Iterable, Set
from sklearn.feature_extraction.text import CountVectorizer

# 1. Extract all terms associated with each label
def extract_words(examples: Iterable[str]) -> Set[str]:
  """
  Given a collection of tweets, return the set of unigrams and bigrams
  found across all the tweets (English stop words are excluded).
  """
  model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
  model.fit(examples)
  # The keys of vocabulary_ are the extracted unigrams and bigrams
  return set(model.vocabulary_)

happy_examples = extract_words(train[train['label'] == 'happy']['cleaned_tweet'])
sad_examples = extract_words(train[train['label'] == 'sad']['cleaned_tweet'])
angry_examples = extract_words(train[train['label'] == 'angry']['cleaned_tweet'])
surprise_examples = extract_words(train[train['label'] == 'surprise']['cleaned_tweet'])

happy_words = happy_examples.difference(sad_examples,angry_examples,surprise_examples)
sad_words = sad_examples.difference(happy_examples,angry_examples,surprise_examples)
angry_words = angry_examples.difference(happy_examples,sad_examples,surprise_examples)
surprise_words = surprise_examples.difference(angry_examples,sad_examples,happy_examples)

### Task 2a Tests
Run the cell below to evaluate your code. 

In [None]:
# Check sets are non-empty
print("Checking sets are not empty: ")
print(f"Happy words count: {len(happy_words)}, {len(happy_words) > 0}")
print(f"Sad words count: {len(sad_words)}, {len(sad_words) > 0}")
print(f"Angry words count: {len(angry_words)}, {len(angry_words) > 0}")
print(f"Surprise words count: {len(surprise_words)}, {len(surprise_words) > 0}")

# Checks sets are disjoint 
union1 = sad_words.union(angry_words, surprise_words)
union2 = happy_words.union(surprise_words, angry_words) 
union3 = surprise_words.union(happy_words, sad_words)
union4 = angry_words.union(happy_words, sad_words) 

print("\nChecking sets are all disjoint:")
print(f"Happy words disjoint: {happy_words.isdisjoint(union1)}")
print(f"Sad words disjoint: {sad_words.isdisjoint(union2)}")
print(f"Angry words disjoint: {angry_words.isdisjoint(union3)}")
print(f"Surprise words disjoint: {surprise_words.isdisjoint(union4)}")

Checking sets are not empty: 
Happy words count: 3356, True
Sad words count: 145, True
Angry words count: 231, True
Surprise words count: 53, True

Checking sets are all disjoint:
Happy words disjoint: True
Sad words disjoint: True
Angry words disjoint: True
Surprise words disjoint: True


### Task 2b. Scoring using set overlaps

Next we will implement the scoring algorithm. Our score will simply be the count of overlapping terms between the tweet text and the emotive terms.

In [None]:
from nltk import word_tokenize
sample_words = {'cat', 'hat', 'mat', 'bowling', 'bat'}
sample_tweet1 = "that cat is super cool sitting on the mat" 
sample_tweet2 = "the man in the bowling hat sat on the cat"
sample_tweet3 = "the quick brown fox jumped over the lazy dog"

def score_tweet(tweet: str, terms: set) -> int:
  """Count the unique tokens in the tweet that appear in the given set of terms."""
  tokens = set(word_tokenize(tweet))
  return len(tokens.intersection(terms))

print(f"Test 1: {score_tweet(sample_tweet1, sample_words) == 2}")
print(f"Test 2: {score_tweet(sample_tweet2, sample_words) == 3}")
print(f"Test 3: {score_tweet(sample_tweet3, sample_words) == 0}")

Test 1: True
Test 2: True
Test 3: True
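
The word lists from Task 2a contain bigrams as well as unigrams, but a score based on single tokens can only ever match the unigrams. Below is a minimal sketch of a scorer that also matches bigrams, assuming plain whitespace tokenization instead of `word_tokenize`. The names `unigrams_and_bigrams` and `score_tweet_ngrams` are hypothetical.

```python
from typing import List, Set

def unigrams_and_bigrams(text: str) -> Set[str]:
    """Return the unique unigrams and space-joined bigrams of a text."""
    tokens: List[str] = text.split()
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return set(tokens) | bigrams

def score_tweet_ngrams(tweet: str, terms: Set[str]) -> int:
    """Count the unigrams and bigrams of the tweet that appear in terms."""
    return len(unigrams_and_bigrams(tweet) & terms)

sample_terms = {"cat", "bowling hat", "mat"}
print(score_tweet_ngrams("the man in the bowling hat sat on the cat", sample_terms))
# 2
```

`CountVectorizer` builds its bigrams after removing stop words, so matching the Task 2a lists exactly would require applying the same stop-word filter before forming bigrams. The sketch leaves that step out.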


### Task 2c. Rule-based Classification
Let's put together our rule-based classification system. Given a tweet, `simple_clf` will generate the overlap score for each of the emotion labels and return the emotion label with the highest score. If there is no match amongst the emotions, the classifier will return 'no-emotion'.


In [None]:
def simple_clf(tweet: str) -> str:
  """
  Given a tweet, calculate all the emotion overlap scores.
  Return the emotion label which has the largest score. If
  the largest overlap score is 0, return no-emotion.
  """
  scores = {
      "happy": score_tweet(tweet, happy_words),
      "sad": score_tweet(tweet, sad_words),
      "angry": score_tweet(tweet, angry_words),
      "surprise": score_tweet(tweet, surprise_words),
  }
  # max returns the first label with the highest score, so ties are broken
  # in the order happy, sad, angry, surprise
  best_label = max(scores, key=scores.get)
  return best_label if scores[best_label] > 0 else "no-emotion"

After finishing the above section, let's evaluate how our model did.

In [None]:
from sklearn.metrics import classification_report

preds = test["cleaned_tweet"].apply(simple_clf)
print(classification_report(test["label"], preds)) 

              precision    recall  f1-score   support

       angry       0.30      0.21      0.25        14
       happy       0.45      0.75      0.56       228
  no-emotion       0.67      0.42      0.52       357
         sad       0.00      0.00      0.00         7
    surprise       0.00      0.00      0.00         7

    accuracy                           0.53       613
   macro avg       0.28      0.28      0.27       613
weighted avg       0.57      0.53      0.51       613



## Task 3. Machine learning w/ grammar augmented features

Now that we have a naive baseline, let's build a more sophisticated solution using machine learning. Up to this point, we have only considered the words in the tweet as our primary features. The rules-based approach is a very simple bag-of-words classifier. Can we improve performance if we provide some additional linguistic knowledge?

For Task 3 we will do the following:
- Generate part-of-speech features for our tweets
- Train two different machine learning classifiers, one with linguistic features and one without
- Evaluate the trained models on the test set

### Task 3a. Grammar Augmented Feature Generation
For this task, we will generate part-of-speech (POS) tags for each token in our tweet and lemmatize the text as well. We will include the POS information directly by appending the tag to the lemma of the word itself. For example:
```
Raw Tweet: I am very angry with the increased prices.
POS Augmented Tweet: I-PRP be-VBP very-RB angry-JJ with-IN the-DT increase-VBN price-NNS .-.
```

We will generate the POS features in `generate_pos_features` using the spaCy library. Once we have an implementation that works, we'll update the `train` and `test` dataframes with a new column called `tweet_with_pos`, which contains the output of the `generate_pos_features` function.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def generate_pos_features(tweet: str) -> str:
  """
  Given a tweet, return the lemmatized tweet augmented
  with POS tags.
  E.g.:
  Input: "cats are super cool."
  output: "cat-NNS be-VBP super-RB cool-JJ .-."
  """
  # 1. Pass the text to spaCy
  doc = nlp(tweet)
  # 2. Join each token's lemma and fine-grained POS tag as lemma-TAG
  augmented_tweet = " ".join([f"{token.lemma_}-{token.tag_}" for token in doc])

  return augmented_tweet
 
sample_tweet = "I hate action movies"
generate_pos_features(sample_tweet)



'I-PRP hate-VBP action-NN movie-NNS'

In [None]:
# Once you have the code working above run this cell.
train["tweet_with_pos"] = train["cleaned_tweet"].apply(generate_pos_features)
test["tweet_with_pos"] = test["cleaned_tweet"].apply(generate_pos_features)

### Task 3a Tests
Run the cell below to evaluate your code. 

In [None]:
sample_texts = [
    ("i am super angry", "I-PRP be-VBP super-RB angry-JJ"),
    ("That movie was great", "that-DT movie-NN be-VBD great-JJ"),
    ("I hate action movies", "I-PRP hate-VBP action-NN movie-NNS")
]
for i, text in enumerate(sample_texts):
  print(f"Test {i+1}: {generate_pos_features(text[0]) == text[1]}")

Test 1: True
Test 2: True
Test 3: True


In [None]:
train["tweet_with_pos"]

2161                                      last-JJ week-NN
1566    see-VB all-PDT the-DT photo-NNS from-IN wednes...
1674    .how-: to-TO make-VB a-DT turquoise-NN goblet-...
2037    time-NN be-VBZ run-VBG out-RP to-TO catch-VB n...
2946                                        ..-NFP  　-_SP
                              ...                        
3027    stunning-JJ vike-VBG silver-NN thistle-NNP bro...
2324                                        ...-NFP  -_SP
1176    take-VB .-. free-JJ admission-NN print-NNS fro...
1793    what-WP a-DT treat-NN rt-NNP  -_SP exhibit-VBN...
419     m-VBP cezanne-NNP painting-NN will-MD leave-VB...
Name: tweet_with_pos, Length: 1837, dtype: object

### Task 3b. Model Training 
Next we will train two separate Random Forest classifiers. For this task we will generate two sets of input features using the `TfidfVectorizer`: TF-IDF statistics computed on the `cleaned_tweet` column and on the `tweet_with_pos` column.

Once we've generated the features, we train one Random Forest classifier per feature set and generate predictions on the test set with each classifier.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Fit a separate TF-IDF vectorizer per feature set on the training data only,
# then reuse it to transform the test data.
tfidf_cleaned = TfidfVectorizer()
X_train_cleanedtweet = tfidf_cleaned.fit_transform(train['cleaned_tweet'])
X_test_cleanedtweet = tfidf_cleaned.transform(test['cleaned_tweet'])

tfidf_pos = TfidfVectorizer()
X_train_tweet_with_pos = tfidf_pos.fit_transform(train['tweet_with_pos'])
X_test_tweet_with_pos = tfidf_pos.transform(test['tweet_with_pos'])

In [None]:
# Train one Random Forest per feature set so the two models stay independent
clf_cleanedtweet = RandomForestClassifier()
clf_cleanedtweet.fit(X_train_cleanedtweet, train["label"])
y_pred_cleanedtweet = clf_cleanedtweet.predict(X_test_cleanedtweet)

clf_tweet_with_pos = RandomForestClassifier()
clf_tweet_with_pos.fit(X_train_tweet_with_pos, train["label"])
y_pred_tweet_with_pos = clf_tweet_with_pos.predict(X_test_tweet_with_pos)

### Task 3c. Model Evaluation
Generate classification reports for both models and print them below.

In [None]:
from sklearn.metrics import classification_report

# Classification report for TF-IDF features
Classification_report1 = classification_report(test["label"], y_pred_cleanedtweet)
print("Classification report for TFIDF features",  "\n", Classification_report1)

# Classification report for TF-IDF w/ POS features
Classification_report2 = classification_report(test["label"], y_pred_tweet_with_pos)
print("Classification report for TFIDF w/ POS features",  "\n", Classification_report2)

Classification report for TFIDF features 
               precision    recall  f1-score   support

       angry       1.00      0.07      0.13        14
       happy       0.78      0.54      0.64       228
  no-emotion       0.73      0.92      0.81       357
         sad       0.00      0.00      0.00         7
    surprise       0.00      0.00      0.00         7

    accuracy                           0.74       613
   macro avg       0.50      0.31      0.32       613
weighted avg       0.73      0.74      0.71       613

Classification report for TFIDF w/ POS features 
               precision    recall  f1-score   support

       angry       1.00      0.07      0.13        14
       happy       0.72      0.54      0.62       228
  no-emotion       0.73      0.90      0.80       357
         sad       0.00      0.00      0.00         7
    surprise       0.00      0.00      0.00         7

    accuracy                           0.73       613
   macro avg       0.49      0.30     



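### Evaluation
The two Random Forest models perform almost identically: accuracy is 0.74 with plain TF-IDF features and 0.73 with POS-augmented features, so adding POS tags did not improve performance in this run. Both models clearly beat the rule-based baseline (0.53 accuracy), but neither correctly predicts any `sad` or `surprise` tweets.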

## Task 4. Transfer Learning with DistilBERT

For this task we will fine-tune a pretrained language model (DistilBERT) using the Hugging Face `transformers` library. We will need to:
- Encode the tweets using the DistilBERT tokenizer
- Create PyTorch datasets for the train, val, and test splits
- Fine-tune the DistilBERT model for 5 epochs
- Extract predictions from the model's output logits and convert them into emotion labels
- Generate a classification report on the predictions

Ensure you are running the notebook in Google Colab with the GPU runtime enabled for this section.
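
The step from logits to emotion labels can be sketched without any model: take the index of the largest logit in each row and map it back to a label name. The example below assumes the label order produced by `LabelEncoder`, which sorts class names alphabetically. The logit values are made up for illustration and are not model output.

```python
from typing import List

# LabelEncoder assigns integer ids in sorted order of the class names
classes: List[str] = ["angry", "happy", "no-emotion", "sad", "surprise"]

# Made-up logits for two tweets, one row per tweet and one column per class
logits: List[List[float]] = [
    [0.1, 2.3, 0.4, -1.0, -0.5],
    [-0.2, 0.3, 1.9, -0.8, -1.1],
]

def argmax(row: List[float]) -> int:
    """Return the index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

predicted_labels = [classes[argmax(row)] for row in logits]
print(predicted_labels)
# ['happy', 'no-emotion']
```

With real model output, `np.argmax(predictions, axis=1)` followed by `le.inverse_transform` performs the same two steps.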

In [None]:
!pip install -q transformers

In [None]:
from sklearn.preprocessing import LabelEncoder
import torch
from transformers import AutoTokenizer
from torch.utils.data import Dataset
from transformers import AutoModelForSequenceClassification
from transformers import Trainer
from transformers import TrainingArguments

# 1. Load Label Encoder
le = LabelEncoder()

# 2. Fit the label encoder to the label in our dataset
le.fit(train["label"])

# 3. Create a new column with encoded labels
train["encoded_label"] = le.transform(train["label"])
val["encoded_label"] = le.transform(val["label"])
test["encoded_label"] = le.transform(test["label"])

# Validate the mapping:
train.groupby(["label", "encoded_label"]).aggregate("count")

train_labels = torch.tensor(train["encoded_label"].tolist())
val_labels = torch.tensor(val["encoded_label"].tolist())
test_labels = torch.tensor(test["encoded_label"].tolist())

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode(texts):
    """Tokenize a list of tweets into PyTorch tensors for DistilBERT."""
    return tokenizer(
        texts,
        padding=True,           # pad to the longest sequence in the call
        truncation=True,        # cut sequences longer than max_length
        max_length=24,          # BERT max is 512, we choose 24 for computational efficiency
        return_tensors="pt",    # return PyTorch tensors
    )

train_encodings = encode(train['cleaned_tweet'].tolist())
val_encodings = encode(val['cleaned_tweet'].tolist())
test_encodings = encode(test['cleaned_tweet'].tolist())

# Create PyTorch datasets for the train, val, and test splits

# Define Custom Class for DistilBert Inputs
class RelationDataset(Dataset):
    
    def __init__(self, encodings: dict):  
        self.encodings = encodings
        
    def __len__(self) -> int:
        return len(self.encodings["input_ids"])
    
    def __getitem__(self, idx: int) -> dict:
        e = {k: v[idx] for k,v in self.encodings.items()}
        return e 


# Update encodings with labels
train_encodings["labels"] = train_labels
val_encodings["labels"] = val_labels
test_encodings["labels"] = test_labels

# Generate Datasets
train_ds = RelationDataset(train_encodings)
val_ds = RelationDataset(val_encodings)
test_ds = RelationDataset(test_encodings)

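# Note: num_labels=7 allocates seven output units although only five labels remain
# after Task 1a; len(le.classes_) would size the classification head to match.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=7)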


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classi

In [None]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    lr_scheduler_type='cosine',
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32, 
    fp16=False,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)

trainer.train()

***** Running training *****
  Num examples = 1837
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 290
  Number of trainable parameters = 66958855


Epoch,Training Loss,Validation Loss
1,No log,0.615466
2,No log,0.590491
3,No log,0.641292
4,No log,0.640196
5,No log,0.637942


***** Running Evaluation *****
  Num examples = 612
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-58
Configuration saved in ./results/checkpoint-58/config.json
Model weights saved in ./results/checkpoint-58/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 612
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-116
Configuration saved in ./results/checkpoint-116/config.json
Model weights saved in ./results/checkpoint-116/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 612
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-174
Configuration saved in ./results/checkpoint-174/config.json
Model weights saved in ./results/checkpoint-174/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 612
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-232
Configuration saved in ./results/checkpoint-232/config.json
Model weights saved in ./results/checkpoint-232/pytorch_model.bin
***** Runni

TrainOutput(global_step=290, training_loss=0.45147436733903556, metrics={'train_runtime': 3332.9184, 'train_samples_per_second': 2.756, 'train_steps_per_second': 0.087, 'total_flos': 57038510081520.0, 'train_loss': 0.45147436733903556, 'epoch': 5.0})
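
The step counts and checkpoint names in the log above can be checked with a little arithmetic. The sketch below assumes a single device and no gradient accumulation (as the log reports) and copies the validation losses from the table; the variable names are illustrative and are not part of the notebook.

```python
import math

num_examples = 1837   # "Num examples" in the training log
batch_size = 32       # per-device batch size, single device
num_epochs = 5

# One optimizer step per batch; the final, partial batch still counts.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# Validation loss per epoch, copied from the table above.
val_loss = {1: 0.615466, 2: 0.590491, 3: 0.641292, 4: 0.640196, 5: 0.637942}

# Epoch with the lowest validation loss, and the checkpoint saved after it.
best_epoch = min(val_loss, key=val_loss.get)
best_checkpoint = f"./results/checkpoint-{best_epoch * steps_per_epoch}"
print(best_epoch, best_checkpoint)
```

Validation loss is lowest after epoch 2 and higher in every later epoch, which may indicate mild overfitting. With the configuration shown, the predictions in the next cell come from the model as it stands after epoch 5, not from the epoch-2 checkpoint.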

In [None]:
bert_preds = trainer.predict(test_ds)

bert_preds = le.inverse_transform(np.argmax(bert_preds.predictions, axis=1))
print(classification_report(test["label"].tolist(), bert_preds))

***** Running Prediction *****
  Num examples = 613
  Batch size = 32


              precision    recall  f1-score   support

       angry       0.43      0.21      0.29        14
       happy       0.82      0.76      0.79       228
  no-emotion       0.82      0.90      0.86       357
         sad       0.00      0.00      0.00         7
    surprise       0.00      0.00      0.00         7

    accuracy                           0.82       613
   macro avg       0.41      0.38      0.39       613
weighted avg       0.79      0.82      0.80       613





## Task 5. Model Recommendation
In a paragraph, answer the following questions:
1. Which of the implemented models is recommended, and why?
2. Compare the metrics for each model implemented (Rules-Based, Machine Learning w/ POS features, and DistilBERT). What are the pros and cons of each model? Consider both macro performance and label-specific metrics, as well as the computational requirements.

# **Answers**
1. I would recommend the DistilBERT model. It is a deep learning model that was pretrained on a large corpus and then fine-tuned on our tweets, and it gives the best results of the three models: the highest accuracy (0.82), macro F1 (0.39), and weighted F1 (0.80). The trade-off is that it is the most complex model and the most expensive to train (the five-epoch run above took about 3,333 seconds), so in practice it needs a GPU.

2. The metrics for each model are printed in the cell below. Comparing the three classification reports:
*   Accuracy: DistilBERT is best at 82%, followed by the model w/ POS features at 73% and the Rules-based model at 53%.
*   Precision (the fraction of predicted positives that are correct): DistilBERT is highest on the two large classes (0.82 for both `happy` and `no-emotion`). The model w/ POS features has higher precision on `angry` (1.00 vs. 0.43), but only because it predicts `angry` very rarely; its recall on that class is 0.07.
*   Recall (the fraction of actual positives that are correctly identified): DistilBERT is at least as high as the other two models on every class.
*   Support is the number of true examples of each label in the test set. It is identical across the reports because all models are evaluated on the same 613 test tweets, so it says nothing about model quality. It does show how imbalanced the data is: 357 `no-emotion` and 228 `happy` tweets, but only 14 `angry`, 7 `sad`, and 7 `surprise`.
*   Macro average treats all classes equally regardless of size, which makes it the better indicator on an imbalanced dataset where every class matters. DistilBERT has the highest macro F1 (0.39, vs. 0.27 for the Rules-based model; the model w/ POS features has a lower F1 than DistilBERT on every class).
*   Weighted average weights each label's score by its support, so it is dominated by the large classes. It is also highest for DistilBERT, with a weighted F1 of 0.80.
*   None of the three models correctly predicts a single `sad` or `surprise` tweet (precision, recall, and F1 are all 0.00), so even the recommended model is unreliable on the rarest labels.
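
To make the difference between the macro and weighted averages concrete, the sketch below recomputes both from the per-class F1 scores and supports in the DistilBERT report. The numbers are copied from that report (already rounded to two decimals), and the helper names `macro_f1` and `weighted_f1` are illustrative, not part of the notebook.

```python
# Per-class (F1, support) pairs from the DistilBERT classification report.
report = {
    "angry": (0.29, 14),
    "happy": (0.79, 228),
    "no-emotion": (0.86, 357),
    "sad": (0.00, 7),
    "surprise": (0.00, 7),
}

def macro_f1(rows):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)

def weighted_f1(rows):
    """Mean of per-class F1 weighted by support: large classes dominate."""
    total = sum(n for _, n in rows.values())
    return sum(f1 * n for f1, n in rows.values()) / total

print(round(macro_f1(report), 2), round(weighted_f1(report), 2))
```

The two classes with an F1 of zero account for only 14 of the 613 test tweets, so they barely move the weighted average but make up two of the five terms in the macro average.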





In [None]:
print("Classification report for Rules-based : ",  "\n", classification_report(test["label"], preds)) 
print("Classification report for TFIDF w/ POS features : ",  "\n", Classification_report2)
print("Classification report for DistilBERT : ",  "\n", classification_report(test["label"].tolist(), bert_preds))

Classification report for Rules-based :  
               precision    recall  f1-score   support

       angry       0.30      0.21      0.25        14
       happy       0.45      0.75      0.56       228
  no-emotion       0.67      0.42      0.52       357
         sad       0.00      0.00      0.00         7
    surprise       0.00      0.00      0.00         7

    accuracy                           0.53       613
   macro avg       0.28      0.28      0.27       613
weighted avg       0.57      0.53      0.51       613

Classification report for TFIDF w/ POS features :  
               precision    recall  f1-score   support

       angry       1.00      0.07      0.13        14
       happy       0.72      0.54      0.62       228
  no-emotion       0.73      0.90      0.80       357
         sad       0.00      0.00      0.00         7
    surprise       0.00      0.00      0.00         7

    accuracy                           0.73       613
   macro avg       0.49      0.30  

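
A useful reference point for the accuracies above is a majority-class baseline, that is, a hypothetical classifier that always predicts the most frequent label. The sketch below computes its accuracy from the support column of the reports. This baseline is not one of the models in the notebook; it is introduced here only for comparison, and the variable names are illustrative.

```python
# Number of test tweets per label (the "support" column of the reports).
support = {"angry": 14, "happy": 228, "no-emotion": 357, "sad": 7, "surprise": 7}

# Accuracy reported for each model.
accuracy = {"Rules-based": 0.53, "TFIDF w/ POS features": 0.73, "DistilBERT": 0.82}

# Always predicting the most frequent label is right exactly that often.
majority_label = max(support, key=support.get)
baseline_acc = support[majority_label] / sum(support.values())
print(f"Always predicting {majority_label!r}: accuracy = {baseline_acc:.2f}")

# Difference between each model's accuracy and the baseline.
for name, acc in accuracy.items():
    print(f"{name}: {acc - baseline_acc:+.2f} vs. baseline")
```

Under this comparison the Rules-based model falls slightly below the baseline, while the other two models improve on it.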
