# Machine Learning Lab 2

## Assignment 3 (Deadline : 05/02/2023 11:59PM)

Total Points : 25

Your answers must be entered in LMS by midnight of the day it is due. 

If the question requires a textual response, you can create a PDF and upload that. 

The PDF might be generated from MS-WORD, LATEX, the image of a hand- written response, or using any other mechanism. 

Code must be uploaded and may require demonstration to the TA. 

Numbers in the parentheses indicate points allocated to the question. 

**Naming Convention**: FirstName_LastName_Lab3_TLP23.ipynb

**Assignment**: 3-class Sentiment Analysis with LSTM on Twitter Data
 

**Objective**:
The objective of this assignment is to train a LSTM neural network to perform 3-class sentiment analysis on Twitter data.
 

**Dataset**:
The dataset used in this assignment is the Sentiment140 dataset, which can be downloaded from http://help.sentiment140.com/for-students. The dataset consists of 1.6 million tweets, labeled as positive (4), neutral (2), or negative (0)


*   Collect a sample of at least 100,000 tweets from the dataset **(1 points)**


*   Preprocess the text data by removing punctuation, lowercasing, removing stop words, and tokenizing the words **(3 points)**

*   Split the data into training and testing sets, and pad the sequences to the same length **(2 points)**

*   Build a LSTM model to classify the tweets as positive, neutral, or negative. The model should have an Embedding layer, followed LSTM layers of your choosing, and a dense layer for output **(7 points)**

*   Train the model on the training data and evaluate its performance on the testing data **(3 points)**


*   Fine-tune the model by experimenting with different architectures, optimizers, activation functions, and hyperparameters. Feel free to experiment with GRUs **(4 points)**


*   Report the accuracy, precision, recall, and F1 score of the model on the testing data. Inclue graphs and necessary data. Include this in a markdown cell within the notebook. Compare the basic LSTM model against SOTA and other architectures which you can directly import **(3 points)**


*   Use the trained model to predict the sentiment of 25 new tweets with positive (2), neutral (1), or negative (0) **(2 points)**



## Import the Libraries

In [1]:
import numpy as np
import pandas as pd
import os
import csv
import string
import nltk
import torchtext

nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from torchtext.data import get_tokenizer

from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense, Flatten
from transformers import pipeline
from sklearn.metrics import confusion_matrix, classification_report

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/kishlay/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
2023-02-08 23:24:58.298309: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-08 23:24:58.479434: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-08 23:24:59.219674: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared o

## Load the data

In [2]:
df = pd.DataFrame(columns=['sentiment', 'tweet'])
directory = './ML2Lab2_LSTM/ML2Lab2_LSTM/'
for fname in os.listdir(directory):
    if '.txt' in fname:
        df2 = pd.read_csv(directory+fname, sep='\t', header=None, quoting=csv.QUOTE_NONE).drop([0], axis=1).rename(columns={1:'sentiment',2:'tweet'})
    elif '.tsv' in fname:
        df2 = pd.read_csv(directory+fname, sep='\t', header=None, quoting=csv.QUOTE_NONE).drop([0,1], axis=1).rename(columns={2:'sentiment',3:'tweet'})
    df = pd.concat([df,df2], axis=0)

print(df.shape)

(57103, 3)


In [3]:
df.head()

Unnamed: 0,sentiment,tweet,3
0,negative,Saturday without Leeds United is like Sunday w...,
1,positive,Catch Rainbow Valley at the @CBC #IMAF2014 Gal...,
2,positive,"""@NiklaklePinkel it doesn't really count, I wa...",
3,positive,"""#BEARDOWN Wish us luck...we may need it. (@ G...",
4,positive,We're so excited to be part of the Still We Ri...,


In [4]:
df = df.drop(3, axis=1)
df.head()

Unnamed: 0,sentiment,tweet
0,negative,Saturday without Leeds United is like Sunday w...
1,positive,Catch Rainbow Valley at the @CBC #IMAF2014 Gal...
2,positive,"""@NiklaklePinkel it doesn't really count, I wa..."
3,positive,"""#BEARDOWN Wish us luck...we may need it. (@ G..."
4,positive,We're so excited to be part of the Still We Ri...


In [5]:
df.isna().sum()

sentiment    1
tweet        1
dtype: int64

In [6]:
df.dropna(inplace=True)

In [7]:
df.isna().sum()

sentiment    0
tweet        0
dtype: int64

In [8]:
df.shape

(57102, 2)

In [9]:
df = df.reset_index(drop=True)

## Preprocessing the Data

In [10]:
# Convert to lowercase
df['tweet'] = df['tweet'].apply(lambda x:x.lower())
df['tweet']

0        saturday without leeds united is like sunday w...
1        catch rainbow valley at the @cbc #imaf2014 gal...
2        "@niklaklepinkel it doesn't really count, i wa...
3        "#beardown wish us luck...we may need it. (@ g...
4        we're so excited to be part of the still we ri...
                               ...                        
57097    it's a wednesday girls night out as '90's band...
57098    "night college course sorted, just have to enr...
57099    for the 1st time in 30 years. for your splendi...
57100    nurses day - 12 may 2012. nursing: the heart b...
57101    we have 15 minutes left until the 2nd episode ...
Name: tweet, Length: 57102, dtype: object

In [11]:
# Remove punctuation
df['tweet'] = df['tweet'].apply(lambda x:x.translate(str.maketrans('', '', string.punctuation)))
df['tweet']

0        saturday without leeds united is like sunday w...
1        catch rainbow valley at the cbc imaf2014 gala ...
2        niklaklepinkel it doesnt really count i was de...
3        beardown wish us luckwe may need it  georgia d...
4        were so excited to be part of the still we ris...
                               ...                        
57097    its a wednesday girls night out as 90s band wi...
57098    night college course sorted just have to enrol...
57099    for the 1st time in 30 years for your splendif...
57100    nurses day  12 may 2012 nursing the heart beat...
57101    we have 15 minutes left until the 2nd episode ...
Name: tweet, Length: 57102, dtype: object

In [12]:
# Remove stopwords
stop = stopwords.words('english')
df['tweet'] = df['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df['tweet']

0        saturday without leeds united like sunday with...
1        catch rainbow valley cbc imaf2014 gala oct 26 ...
2        niklaklepinkel doesnt really count decorating ...
3        beardown wish us luckwe may need georgia dome ...
4        excited part still rise gala dec 3 join us war...
                               ...                        
57097    wednesday girls night 90s band wilson phillips...
57098    night college course sorted enrole tomorrow fi...
57099    1st time 30 years splendiferous entertainment ...
57100     nurses day 12 may 2012 nursing heart beat health
57101    15 minutes left 2nd episode styled rock uknavi...
Name: tweet, Length: 57102, dtype: object

In [13]:
tokenizer = get_tokenizer("basic_english")
token_dict = {}
tweet_tokens = []
stemmer = PorterStemmer()
cur_val = 1
max_len = 0

for idx, tweet in enumerate(df['tweet']):
    tokens = tokenizer(tweet)
    if(max_len<len(tokens)):
        max_len = len(tokens)
    num_list = []
    for word in tokens:
        word = stemmer.stem(word)
        if word not in token_dict:
            token_dict[word] = cur_val
            cur_val+=1
        num_list.append(token_dict[word])
    tweet_tokens.append(num_list)
    
df["tweet"] = tweet_tokens

In [14]:
max_len, len(token_dict)

(39, 75368)

In [15]:
df['tweet'] = df['tweet'].apply(lambda x:x+[0]*(max_len-len(x)))

In [16]:
df['tweet'].apply(lambda x:len(x)).value_counts()

39    57102
Name: tweet, dtype: int64

In [17]:
sent_map = {'positive':2, 'neutral':1, 'negative': 0}
df['sentiment'] = df['sentiment'].apply(lambda x:sent_map[x])
df.head()

Unnamed: 0,sentiment,tweet
0,0,"[1, 2, 3, 4, 5, 6, 2, 6, 7, 8, 9, 10, 11, 0, 0..."
1,2,"[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 0, 0,..."
2,2,"[22, 8, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32..."
3,2,"[35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 4..."
4,2,"[50, 51, 52, 53, 17, 54, 55, 56, 37, 57, 58, 5..."


In [18]:
from sklearn.model_selection import train_test_split

train, val = train_test_split(df, test_size=0.2, random_state=42)

In [19]:
train = list(train.itertuples(index=False, name=None))
test = list(val.itertuples(index=False, name=None))

In [20]:
train_x = np.array([tweet[:15] for label, tweet  in train])
train_y = np.array([label for label, tweet in train])
val_x = np.array([tweet[:15] for label, tweet in test])
val_y = np.array([label for label, tweet in test])

max_len=15

In [21]:
train_y, val_y = pd.get_dummies(train_y), pd.get_dummies(val_y)

In [22]:
embedding_vector_features = 100
model=Sequential()
model.add(Embedding(len(token_dict),embedding_vector_features,input_length=max_len))
model.add(LSTM(10))
model.add(Flatten())
model.add(Dense(3,activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

2023-02-08 23:25:16.083058: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-08 23:25:16.121689: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-08 23:25:16.121978: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-02-08 23:25:16.122863: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operati

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 15, 100)           7536800   
                                                                 
 lstm (LSTM)                 (None, 10)                4440      
                                                                 
 flatten (Flatten)           (None, 10)                0         
                                                                 
 dense (Dense)               (None, 3)                 33        
                                                                 
Total params: 7,541,273
Trainable params: 7,541,273
Non-trainable params: 0
_________________________________________________________________
None


In [23]:
model.fit(train_x,train_y,validation_data=(val_x,val_y),epochs=10)

Epoch 1/10


2023-02-08 23:25:21.915368: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8302
2023-02-08 23:25:22.111790: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:630] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-02-08 23:25:22.114006: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x60f2a8f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-08 23:25:22.114044: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): NVIDIA GeForce RTX 3060 Laptop GPU, Compute Capability 8.6
2023-02-08 23:25:22.122628: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-02-08 23:25:22.215856: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-08 23:25:2

Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f18b3796d60>

In [34]:
test_df = pd.read_csv('./ML2Lab2_LSTM/ML2Lab2_LSTM/testdata.manual.2009.06.14.csv',names = ['sentiment','col2','col3','col4','col5','tweet'])
X_test = test_df['tweet']
y_test = test_df['sentiment']
y_test = y_test//2

In [36]:
# test_df = test_df['tweet'] = test_df['tweet'].apply(lambda x:x.translate(str.maketrans('', '', string.punctuation)))

# stop = stopwords.words('english')
# test_df['tweet'] = test_df['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

# tweet_tokens = []

# for idx, tweet in enumerate(test_df['tweet']):
#     tokens = tokenizer(tweet)
#     num_list = []
#     for word in tokens:
#         word = stemmer.stem(word)
#         if word not in token_dict:
#             token_dict[word] = cur_val

#         num_list.append(token_dict[word])
#     num_list = num_list[:15]
#     tweet_tokens.append(num_list)

# y_fin = model.predict(np.asarray(tweet_tokens))
# y_fin

## Transformer model

In [32]:
classifier = pipeline("text-classification", model="j-hartmann/sentiment-roberta-large-english-3-classes", return_all_scores=True)

test_df = pd.read_csv('./ML2Lab2_LSTM/ML2Lab2_LSTM/testdata.manual.2009.06.14.csv',names = ['sentiment','col2','col3','col4','col5','tweet'])
X_test = test_df['tweet']
y_test = test_df['sentiment']
y_test = y_test//2

Some weights of the model checkpoint at j-hartmann/sentiment-roberta-large-english-3-classes were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [33]:
y_pred = []
for i in range(len(list(X_test))):
  y_pred.append(np.argmax(pd.DataFrame(classifier(X_test[i])[0])['score']))

print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.89      0.82      0.85       177
           1       0.68      0.94      0.79       139
           2       0.92      0.73      0.81       182

    accuracy                           0.82       498
   macro avg       0.83      0.83      0.82       498
weighted avg       0.84      0.82      0.82       498

