# Machine Learning Lab 2

## Assignment 3 (Deadline : 05/02/2023 11:59PM)

Total Points : 25

Your answers must be entered in the LMS by midnight on the day they are due. 

If the question requires a textual response, you can create a PDF and upload that. 

The PDF may be generated from MS Word, LaTeX, an image of a handwritten response, or any other mechanism. 

Code must be uploaded and may require demonstration to the TA. 

Numbers in the parentheses indicate points allocated to the question. 

**Naming Convention**: FirstName_LastName_Lab3_TLP23.ipynb

**Assignment**: 3-class Sentiment Analysis with LSTM on Twitter Data
 

**Objective**:
The objective of this assignment is to train an LSTM neural network to perform 3-class sentiment analysis on Twitter data.
 

**Dataset**:
The dataset used in this assignment is the Sentiment140 dataset, which can be downloaded from http://help.sentiment140.com/for-students. The dataset consists of 1.6 million tweets, labeled as positive (4), neutral (2), or negative (0).


*   Collect a sample of at least 100,000 tweets from the dataset **(1 point)**


*   Preprocess the text data by removing punctuation, lowercasing, removing stop words, and tokenizing the words **(3 points)**

*   Split the data into training and testing sets, and pad the sequences to the same length **(2 points)**

*   Build an LSTM model to classify the tweets as positive, neutral, or negative. The model should have an Embedding layer, followed by LSTM layers of your choosing, and a dense layer for output **(7 points)**

*   Train the model on the training data and evaluate its performance on the testing data **(3 points)**


*   Fine-tune the model by experimenting with different architectures, optimizers, activation functions, and hyperparameters. Feel free to experiment with GRUs **(4 points)**


*   Report the accuracy, precision, recall, and F1 score of the model on the testing data. Include graphs and any necessary data in a markdown cell within the notebook. Compare the basic LSTM model against SOTA and other architectures which you can directly import **(3 points)**


*   Use the trained model to predict the sentiment of 25 new tweets as positive (2), neutral (1), or negative (0) **(2 points)**



In [11]:
import pandas as pd
import numpy as np
import random

In [12]:
def read_data(file_names, extension):
  # Read each tab-separated file (no header row) and stack them into one DataFrame
  frames = [pd.read_csv(str(name) + extension, sep='\t', header=None) for name in file_names]
  return pd.concat(frames, ignore_index=True)

In [13]:
df1 = read_data([1, 2], '.tsv')
df1.drop([0, 1], axis = 1, inplace = True)
df1.rename(columns = {3:'tweets', 2:'labels'}, inplace = True)
df1.head()

| | labels | tweets |
|---|---|---|
| 0 | negative | I know I missed something here , but what does... |
| 1 | neutral | What do you think of Beside Ourselves as a tit... |
| 2 | positive | :D I intend to be one someday . |
| 3 | negative | LLLINKKK LLLINKKK IIIMAGEEELLLINKKK The choice... |
| 4 | neutral | LLLINKKK Some more mountains . |


In [14]:
df2 = read_data(range(3, 14), '.txt')
df2.drop([0, 3], axis = 1, inplace = True)
df2.rename(columns = {2:'tweets', 1:'labels'}, inplace = True)
df2.head()

| | labels | tweets |
|---|---|---|
| 0 | neutral | Won the match #getin . Plus\u002c tomorrow is ... |
| 1 | neutral | Some areas of New England could see the first ... |
| 2 | negative | @francesco_con40 2nd worst QB. DEFINITELY Tony... |
| 3 | neutral | #Thailand Washington - US President Barack Oba... |
| 4 | neutral | Did y\u2019all hear what Tony Romo dressed up ... |


In [15]:
train_data = pd.concat([df1, df2], ignore_index=True)
train_data.head()

| | labels | tweets |
|---|---|---|
| 0 | negative | I know I missed something here , but what does... |
| 1 | neutral | What do you think of Beside Ourselves as a tit... |
| 2 | positive | :D I intend to be one someday . |
| 3 | negative | LLLINKKK LLLINKKK IIIMAGEEELLLINKKK The choice... |
| 4 | neutral | LLLINKKK Some more mountains . |


In [16]:
train_data.shape #import more data later

(53368, 2)

In [17]:
# Map string labels to the Sentiment140 codes: negative=0, neutral=2, positive=4
label_codes = {'negative': 0, 'neutral': 2, 'positive': 4}
train_data['labels'] = train_data['labels'].map(label_codes)

In [18]:
train_data.head()

| | labels | tweets |
|---|---|---|
| 0 | 0 | I know I missed something here , but what does... |
| 1 | 2 | What do you think of Beside Ourselves as a tit... |
| 2 | 4 | :D I intend to be one someday . |
| 3 | 0 | LLLINKKK LLLINKKK IIIMAGEEELLLINKKK The choice... |
| 4 | 2 | LLLINKKK Some more mountains . |


In [19]:
train_data.value_counts('labels')

labels
2    24143
4    20718
0     8507
dtype: int64

In [20]:
test = pd.read_csv("testdata.csv", header = None)

In [21]:
test.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 4 | 3 | Mon May 11 03:17:40 UTC 2009 | kindle2 | tpryan | @stellargirl I loooooooovvvvvveee my Kindle2. ... |
| 1 | 4 | 4 | Mon May 11 03:18:03 UTC 2009 | kindle2 | vcu451 | Reading my kindle2... Love it... Lee childs i... |
| 2 | 4 | 5 | Mon May 11 03:18:54 UTC 2009 | kindle2 | chadfu | Ok, first assesment of the #kindle2 ...it fuck... |
| 3 | 4 | 6 | Mon May 11 03:19:04 UTC 2009 | kindle2 | SIX15 | @kenburbary You'll love your Kindle2. I've had... |
| 4 | 4 | 7 | Mon May 11 03:21:41 UTC 2009 | kindle2 | yamarama | @mikefish Fair enough. But i have the Kindle2... |


In [22]:
test_data = pd.DataFrame()
test_data['labels'] = test.iloc[:,0]
test_data['tweets'] = test.iloc[:,5] 
test_data.head()

| | labels | tweets |
|---|---|---|
| 0 | 4 | @stellargirl I loooooooovvvvvveee my Kindle2. ... |
| 1 | 4 | Reading my kindle2... Love it... Lee childs i... |
| 2 | 4 | Ok, first assesment of the #kindle2 ...it fuck... |
| 3 | 4 | @kenburbary You'll love your Kindle2. I've had... |
| 4 | 4 | @mikefish Fair enough. But i have the Kindle2... |


In [23]:
test_data.shape

(498, 2)

Cleaning Data

In [24]:
import re
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [25]:

ps = PorterStemmer()
# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def datacleaning(data):
    """Return one cleaned, stemmed string per row of data['tweets']."""
    corpus = []
    for tweet in data['tweets']:
        # review = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", tweet) #removing links and some special characters
        # review = " ".join(review.split())
        review = re.sub('[^a-zA-Z]', ' ', tweet)  # keep letters only
        review = review.lower().split()           # lowercase, then tokenize on whitespace
        # Drop stop words, then stem what remains
        review = [ps.stem(word) for word in review if word not in stop_words]
        corpus.append(' '.join(review))
    return corpus


In [26]:
train_corpus = datacleaning(train_data)

In [27]:
train_corpus[0:5]

['know miss someth thud mean',
 'think besid titl',
 'intend one someday',
 'lllinkkk lllinkkk iiimageeelllinkkk choic take rocki put death row',
 'lllinkkk mountain']

In [28]:
test_corpus = datacleaning(test_data)

In [29]:
test_corpus[0:5]

['stellargirl loooooooovvvvvvee kindl dx cool fantast right',
 'read kindl love lee child good read',
 'ok first asses kindl fuck rock',
 'kenburbari love kindl mine month never look back new big one huge need remors',
 'mikefish fair enough kindl think perfect']
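The cleaned samples above still contain user handles such as `stellargirl` and `mikefish`, because the active regex only replaces non-letters with spaces. The commented-out line in `datacleaning` hints at removing handles and links first. A minimal sketch of that idea, assuming a simplified pattern of my own (the helper name `strip_noise` and the pattern are illustrative, not the notebook's code):

```python
import re

# Hypothetical pattern: @handles, http(s) links, and www. links
MENTION_OR_URL = re.compile(r"(?:@\w+|https?://\S+|www\.\S+)")

def strip_noise(tweet: str) -> str:
    """Remove handles and links, keep letters only, lowercase, collapse spaces."""
    no_handles = MENTION_OR_URL.sub(" ", tweet)
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_handles)
    return " ".join(letters_only.lower().split())

cleaned = strip_noise("@mikefish Fair enough. See http://example.com/x1 now")
print(cleaned)
```

Stop-word removal and stemming would still follow this step, as in `datacleaning`.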

One hot representation

In [30]:
import tensorflow as tf

In [31]:
tf.__version__

'2.9.2'

In [32]:
from tensorflow.keras.preprocessing.text import one_hot

In [33]:
vocab_size = 100000

In [34]:
onehot_rep_train = [one_hot(words, vocab_size) for words in train_corpus]
onehot_rep_train[0:5]

[[91986, 6740, 16726, 10317, 24333],
 [8061, 61682, 82056],
 [16106, 99961, 15675],
 [46060, 46060, 32816, 45676, 164, 32998, 91182, 98190, 94631],
 [46060, 5681]]

In [35]:
onehot_rep_test = [one_hot(words, vocab_size) for words in test_corpus]
onehot_rep_test[0:5]

[[53696, 93222, 55489, 96869, 46125, 37530, 95546],
 [50461, 55489, 63779, 78802, 78797, 33678, 50461],
 [78979, 88844, 73919, 55489, 8637, 13891],
 [25756,
  63779,
  55489,
  27015,
  28650,
  30362,
  10036,
  74977,
  19800,
  69397,
  99961,
  27646,
  89322,
  26366],
 [98385, 874, 2367, 55489, 8061, 51825]]
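Despite its name, Keras `one_hot` returns integer word indices rather than one-hot vectors: it hashes each word into the range `[1, n - 1]`, so two different words can share an index, and with the default Python `hash` the indices may differ between sessions. The sketch below imitates the idea with a stable MD5 hash; the helper names are assumptions for illustration and the indices will not match the ones printed above.

```python
import hashlib

def hashed_index(word: str, n: int) -> int:
    """Map a word to an integer in [1, n - 1]; index 0 stays free for padding."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(text: str, n: int) -> list:
    """Encode a whitespace-tokenized string as a list of hashed indices."""
    return [hashed_index(word, n) for word in text.split()]

vocab_size = 100000
first = encode("lllinkkk mountain", vocab_size)
second = encode("lllinkkk lllinkkk choic", vocab_size)
print(first[0] == second[0] == second[1])  # the same word always gets the same index

# With only 3 available indices, 4 distinct words must share at least one.
tiny = encode("know miss someth mean", 4)
print(len(set(tiny)) < 4)
```

A larger `vocab_size` lowers the collision rate but enlarges the embedding table.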

In [36]:
# Length of the longest encoded tweet, in tokens
max_length = max(len(seq) for seq in onehot_rep_train)

max_length

639
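The longest sequence has 639 tokens, yet the first samples have fewer than ten, so padding everything to the maximum would mostly add zeros. One common way to choose a padding length is a high percentile of the length distribution. A rough sketch using the nearest-rank method; the function name and the toy data are hypothetical:

```python
def length_at_percentile(sequences, pct: int) -> int:
    """Nearest-rank percentile of sequence lengths (pct is an integer in 1..100)."""
    lengths = sorted(len(seq) for seq in sequences)
    # Integer ceiling of pct/100 * n avoids floating-point rounding surprises
    rank = max(1, (pct * len(lengths) + 99) // 100)
    return lengths[rank - 1]

# Toy data: four short sequences and one outlier
toy = [[1] * 3, [1] * 5, [1] * 4, [1] * 6, [1] * 40]
print(length_at_percentile(toy, 80))
print(length_at_percentile(toy, 100))
```

Applied to `onehot_rep_train`, this would show how many tweets a given padding length truncates.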

Padding

In [37]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

In [38]:
sentence_length= 25
embedded_train=pad_sequences(onehot_rep_train,padding='post',maxlen=sentence_length)
print(embedded_train[0:5])

[[91986  6740 16726 10317 24333     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]
 [ 8061 61682 82056     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]
 [16106 99961 15675     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]
 [46060 46060 32816 45676   164 32998 91182 98190 94631     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]
 [46060  5681     0     0     0     0     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]]


In [39]:
# Pad on the same side as the training data ('post') so both sets share one layout
embedded_test=pad_sequences(onehot_rep_test,padding='post',maxlen=sentence_length)
print(embedded_test[0:5])

[[53696 93222 55489 96869 46125 37530 95546     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]
 [50461 55489 63779 78802 78797 33678 50461     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]
 [78979 88844 73919 55489  8637 13891     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]
 [25756 63779 55489 27015 28650 30362 10036 74977 19800 69397 99961 27646
  89322 26366     0     0     0     0     0     0     0     0     0     0
      0]
 [98385   874  2367 55489  8061 51825     0     0     0     0     0     0
      0     0     0     0     0     0     0     0     0     0     0     0
      0]]
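`pad_sequences` takes two separate side settings: `padding` decides where zeros are added to short sequences, and `truncating` decides which end of a long sequence is cut. Both default to `'pre'`. The pure-Python function below mimics that behavior for a single sequence, as a sketch for reasoning about the options rather than a replacement for the Keras function:

```python
def pad(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one sequence to exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

print(pad([46060, 5681], 5, padding="post"))
print(pad([46060, 5681], 5))
print(pad([1, 2, 3, 4, 5, 6], 4))
print(pad([1, 2, 3, 4, 5, 6], 4, truncating="post"))
```

With `padding='post'` and the default `truncating='pre'`, a tweet longer than `sentence_length` keeps only its last 25 tokens.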


In [40]:
np.array(embedded_train).shape

(53368, 25)

In [41]:
53368*25

1334200

In [42]:
y_train = pd.get_dummies(train_data['labels'])

In [43]:
y_train.shape

(53368, 3)
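`pd.get_dummies` creates one column per distinct label in sorted order, so the three columns here correspond to the codes 0, 2, and 4. Taking the argmax of a softmax output row therefore yields a column position (0, 1, or 2), not the original code. A small sketch of the round trip in plain Python; the helper names and the score values are illustrative only:

```python
CLASSES = [0, 2, 4]  # column order: negative, neutral, positive

def to_one_hot(label: int) -> list:
    """One-hot vector for a label code, following the column order above."""
    return [1 if label == code else 0 for code in CLASSES]

def from_scores(scores) -> int:
    """Return the label code whose column has the highest score."""
    position = max(range(len(scores)), key=lambda i: scores[i])
    return CLASSES[position]

print(to_one_hot(2))
print(from_scores([0.03, 0.64, 0.33]))
```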

Splitting into train and val data

In [44]:
from sklearn.model_selection import train_test_split

X_train, X_val, Y_train, Y_val = train_test_split(embedded_train,y_train, test_size = 0.3, random_state = 42 )

Embedding and making the model

In [45]:
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense, Flatten

In [46]:
embedding_vector_features = 100
model=Sequential()
model.add(Embedding(vocab_size,embedding_vector_features,input_length=sentence_length))
model.add(LSTM(10))
# model.add(Flatten())
model.add(Dense(3,activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 25, 100)           10000000  
                                                                 
 lstm (LSTM)                 (None, 10)                4440      
                                                                 
 dense (Dense)               (None, 3)                 33        
                                                                 
Total params: 10,004,473
Trainable params: 10,004,473
Non-trainable params: 0
_________________________________________________________________
None
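The parameter counts in the summary can be reproduced by hand, which is a useful check on how each layer scales. The sketch assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are mine.

```python
vocab_size, embed_dim, lstm_units, n_classes = 100000, 100, 10, 3

# Embedding: one embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates x (input weights + recurrent weights + bias)
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output class
dense_params = lstm_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding table, so `vocab_size` and the embedding width dominate the model size.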


Checking the train and validation splits

In [47]:
# X_train = np.array(embedded_train).astype('int32')
# Y_train = np.array([train_data['labels']]).astype('int32')

In [48]:
# train_data.shape

In [49]:
print(type(X_train))
print(type(Y_train))

<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>


In [50]:
X_train.shape

(37357, 25)

In [51]:
Y_train.shape

(37357, 3)

In [52]:
# X_test = np.array(embedded_test).astype('int32')
# Y_test = np.array([test_data['labels']]).astype('int32')

In [53]:
print(type(X_val))
print(type(Y_val))

<class 'numpy.ndarray'>
<class 'pandas.core.frame.DataFrame'>


In [54]:
X_val.shape

(16011, 25)

In [55]:
Y_val.shape

(16011, 3)

Model Training

In [56]:
model.fit(X_train,Y_train,validation_data=(X_val,Y_val),epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd96059dcd0>

In [67]:
y_pred=model.predict(X_val)



In [79]:
Y_val = np.array(Y_val)

In [76]:
y_pred

array([[9.94972587e-01, 3.89605784e-03, 1.13132421e-03],
       [7.02314836e-04, 9.98674870e-01, 6.22885011e-04],
       [7.36051414e-04, 9.98494267e-01, 7.69601553e-04],
       ...,
       [9.91189241e-01, 7.76691549e-03, 1.04376953e-03],
       [3.26800384e-02, 6.41889155e-01, 3.25430781e-01],
       [1.05009936e-01, 8.37223947e-01, 5.77661619e-02]], dtype=float32)

In [74]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [83]:
y_val_new = np.argmax(Y_val,axis = 1)
y_pred_new = np.argmax(y_pred, axis =1)

# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predictions
print(confusion_matrix(y_val_new, y_pred_new))
print(accuracy_score(y_val_new, y_pred_new))

[[ 886 1313  358]
 [ 649 5029 1541]
 [ 301 2355 3579]]
0.5929673349572169
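The assignment also asks for precision, recall, and F1, all of which can be derived from a confusion matrix. The sketch below assumes the scikit-learn convention of `confusion_matrix(y_true, y_pred)`, where rows are true classes and columns are predicted classes. The toy matrix is made up for illustration and is not the notebook's result.

```python
def per_class_metrics(matrix):
    """Return (accuracy, [(precision, recall, f1), ...]) for a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n))
    report = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        true_k = sum(matrix[k])                            # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / true_k if true_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report.append((precision, recall, f1))
    return correct / total, report

toy = [[5, 1, 0],
       [2, 6, 2],
       [0, 1, 3]]
accuracy, report = per_class_metrics(toy)
print(accuracy)
for k, (p, r, f) in enumerate(report):
    print(k, round(p, 3), round(r, 3), round(f, 3))
```

In the notebook itself, `sklearn.metrics.classification_report(y_val_new, y_pred_new)` should give the same per-class figures directly.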
