<a href="https://colab.research.google.com/github/michaeljf00/projects_in_ml_and_ai/blob/main/homework4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Homework 4: Sequence Models**

In your project, you will pick a dataset (time-series) and an associated problem that can be
solved via sequence models. You must describe why you need sequence models to solve this
problem. Include a link to the dataset source. Next, you should pick an RNN framework that you
would use to solve this problem (This framework can be in TensorFlow, PyTorch or any other
Python Package).

Dataset: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis 

**Task 1 (75 points)**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Libraries
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

df = pd.read_csv("drive/MyDrive/twitter_training.csv")
SEED = 5473

In [3]:
df.head()

Unnamed: 0,id,topic,sentiment,tweet
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...


In [4]:
len(df)

74682

In [5]:
df.isnull()

Unnamed: 0,id,topic,sentiment,tweet
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
...,...,...,...,...
74677,False,False,False,False
74678,False,False,False,False
74679,False,False,False,False
74680,False,False,False,False


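No null values appear in the rows displayed above. Because this preview shows only the first and last few rows, a column-wise count such as `df.isnull().sum()` would be a more complete check.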
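A minimal sketch of a column-wise null check, assuming a small made-up frame named `toy` in place of the real tweets data (the values below are illustrative and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the tweets data; the values are made up.
toy = pd.DataFrame({
    "sentiment": ["Positive", "Negative", "Neutral"],
    "tweet": ["great game", np.nan, "ok"],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(null_counts)

# True if any cell anywhere in the frame is missing.
has_nulls = bool(toy.isnull().values.any())
print(has_nulls)
```

Unlike the truncated preview, the per-column sum covers every row, so a missing value in the middle of the frame would still be counted.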

In [6]:
df = df.drop(columns = ["id", "topic"])
df

Unnamed: 0,sentiment,tweet
0,Positive,im getting on borderlands and i will murder yo...
1,Positive,I am coming to the borders and I will kill you...
2,Positive,im getting on borderlands and i will kill you ...
3,Positive,im coming on borderlands and i will murder you...
4,Positive,im getting on borderlands 2 and i will murder ...
...,...,...
74677,Positive,Just realized that the Windows partition of my...
74678,Positive,Just realized that my Mac window partition is ...
74679,Positive,Just realized the windows partition of my Mac ...
74680,Positive,Just realized between the windows partition of...


The tweet id was removed from the dataset because it is not a useful feature and has no effect on the sentiment. The topic column was also dropped from the dataset since it is not relevant here for determining sentiment either.

In [7]:
len(df[df["sentiment"] == "Positive"])

20832

In [8]:
len(df[df["sentiment"] == "Negative"])

22542

In [9]:
len(df[df["sentiment"] == "Neutral"])

18318

In [10]:
len(df[df["sentiment"] == "Irrelevant"])

12990

In [11]:
df = df[df.sentiment != "Irrelevant"]
df = df[df.sentiment != "Neutral"]

In [12]:
len(df)

43374

Each tweet is classified under one of four sentiments: Positive, Negative, Neutral, or Irrelevant. The Irrelevant label means the tweet is not related to the topic, so keeping those rows would add uncertainty to the RNN. The irrelevant observations were removed from the dataframe along with the neutral tweets, since in this particular problem we are looking for definite answers: each tweet should be classified as either negative or positive. Even after removing these observations, we are still left with more than enough data for training (43,374 tweets). The two remaining classes are roughly balanced, with 20,832 positive tweets (about 48%) and 22,542 negative tweets (about 52%).
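The counting and filtering steps above can also be written more compactly. This is a sketch on a made-up frame named `toy`, not the real data; `value_counts` and `isin` are standard pandas calls:

```python
import pandas as pd

# Toy frame standing in for the tweets data; the values are made up.
toy = pd.DataFrame({
    "sentiment": ["Positive", "Negative", "Neutral", "Irrelevant", "Negative"],
    "tweet": ["a", "b", "c", "d", "e"],
})

# Count every label in one call instead of one len(...) per label.
counts = toy["sentiment"].value_counts()
print(counts)

# Keep only the two labels used for binary classification.
kept = toy[toy["sentiment"].isin(["Positive", "Negative"])]

# Share of each remaining class, as a fraction of the kept rows.
balance = kept["sentiment"].value_counts(normalize=True)
print(balance)
```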

**PART 1**

In [13]:
df["sentiment"] = df["sentiment"].replace("Negative", 0)
df["sentiment"] = df["sentiment"].replace("Positive", 1)


# Clean text: drop any word containing a hashtag (#) or a mention (@)
def cleanTweet(tweet: str) -> str:
  ret = list()

  for word in str(tweet).split(): 
    if '#' not in word and "@" not in word:
      ret.append(word)

  return ' '.join(ret)

df["tweet"] = df["tweet"].apply(lambda text: cleanTweet(text))
df.head()

Unnamed: 0,sentiment,tweet
0,1,im getting on borderlands and i will murder yo...
1,1,I am coming to the borders and I will kill you...
2,1,im getting on borderlands and i will kill you ...
3,1,im coming on borderlands and i will murder you...
4,1,im getting on borderlands 2 and i will murder ...


In [14]:
from sklearn.model_selection import train_test_split

In [15]:
def convert_to_tfds(dataframe):

  dataset = tf.data.Dataset.from_tensor_slices((dataframe['tweet'], dataframe['sentiment']))
  dataset = dataset.shuffle(buffer_size=len(dataframe), seed=0)
  return dataset.batch(64).prefetch(tf.data.AUTOTUNE)

training_set = df.copy()

train, dev = train_test_split(training_set, test_size=0.1, random_state = 0)
train, test = train_test_split(train, test_size = 0.1, random_state = 0)

train_ds = convert_to_tfds(train)
valid_ds = convert_to_tfds(dev)
test_ds = convert_to_tfds(test)

In [16]:
encoder = tf.keras.layers.TextVectorization()
encoder.adapt(train_ds.map(lambda text, label: text))

In [17]:
len(encoder.get_vocabulary())

22349

In [18]:
model = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Embedding( 
            input_dim = len(encoder.get_vocabulary()),
            output_dim = 64,
            mask_zero = True
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.SimpleRNN(64, activation='relu')),
        tf.keras.layers.Dense(64, activation='relu'), 
        tf.keras.layers.Dense(1)
])

In [19]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [20]:
history = model.fit(train_ds, epochs=5,
                    validation_data=valid_ds,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [21]:
# Get Loss and Accuracy of test set
loss, accuracy = model.evaluate(test_ds)

print('Loss:', loss)
print('Accuracy:', accuracy)

Loss: 0.19062553346157074
Accuracy: 0.9254610538482666


I implemented a bidirectional recurrent neural network with an embedding layer and tokenized input. Training took a long time, but the accuracy increased as more epochs were added. On the test set the loss is about 0.19 and the accuracy is about 92.5%, which makes this a well-performing model. There does appear to be some overfitting, since the training accuracy is higher than the validation and test accuracies.
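One way to make the overfitting check concrete is to compare training and validation accuracy epoch by epoch. The sketch below assumes a dictionary shaped like Keras's `history.history` with `accuracy` and `val_accuracy` keys; the helper name `accuracy_gap` is hypothetical, and the numbers are placeholders rather than results from this notebook:

```python
def accuracy_gap(history_dict):
    """Return the per-epoch difference: training accuracy minus validation accuracy."""
    train = history_dict["accuracy"]
    val = history_dict["val_accuracy"]
    return [t - v for t, v in zip(train, val)]

# Placeholder numbers for illustration only; not results from this notebook.
example_history = {
    "accuracy": [0.70, 0.80, 0.90],
    "val_accuracy": [0.68, 0.75, 0.78],
}

gaps = accuracy_gap(example_history)
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: train - val = {gap:.2f}")
```

A gap that keeps widening across epochs is the usual sign that the model is fitting the training set more closely than it generalizes.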

**PART 2**

In [22]:
# LSTM Implementation

model = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Embedding( 
            input_dim = len(encoder.get_vocabulary()),
            output_dim = 64,
            mask_zero = True
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, activation='relu')),
        tf.keras.layers.Dense(64, activation='relu'), 
        tf.keras.layers.Dense(1)
])

In [23]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [24]:
history = model.fit(train_ds, epochs=5,
                    validation_data=valid_ds,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [25]:
# LSTM Implementation - Get Loss and Accuracy of test set
loss, accuracy = model.evaluate(test_ds)

print('Loss:', loss)
print('Accuracy:', accuracy)

Loss: 0.33649465441703796
Accuracy: 0.8665471076965332


In [26]:
# GRU Implementation

model = tf.keras.Sequential([
        encoder,
        tf.keras.layers.Embedding( 
            input_dim = len(encoder.get_vocabulary()),
            output_dim = 64,
            mask_zero = True
        ),
        tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64, activation='relu')),
        tf.keras.layers.Dense(64, activation='relu'), 
        tf.keras.layers.Dense(1)
])

In [27]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

In [28]:
history = model.fit(train_ds, epochs=5,
                    validation_data=valid_ds,
                    validation_steps=30)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [29]:
# GRU Implementation - Get Loss and Accuracy of test set
loss, accuracy = model.evaluate(test_ds)

print('Loss:', loss)
print('Accuracy:', accuracy)

Loss: 0.18657895922660828
Accuracy: 0.9180327653884888


Both the loss and the accuracy change for the LSTM and the GRU relative to the original RNN. The LSTM's test loss rose to about 0.34, which is concerning, and its test accuracy fell to about 86.7%. That accuracy is also lower than the validation accuracy, which may point to overfitting. The GRU shows a similar but milder pattern, with a test loss of about 0.19, a test accuracy of about 91.8%, and overfitting still apparent. Most tweets are close to the average length, but some are lengthy and some are as short as one word. This variation in tweet length most likely contributed to the variation in results across the three implementations.
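The spread of tweet lengths mentioned above can be measured directly. This is a sketch on three made-up tweets, assuming length is counted in whitespace-separated words:

```python
import pandas as pd

# Made-up tweets for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "tweet": [
        "wow",
        "this game is great",
        "i can not believe how bad this update is",
    ]
})

# Number of whitespace-separated words in each tweet.
lengths = toy["tweet"].str.split().str.len()

# Count, mean, standard deviation, min, quartiles, and max of the lengths.
summary = lengths.describe()
print(summary)
```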

**PART 3**

A traditional feed-forward network will not perform as well as a recurrent neural network on this problem. Because the meaning of a tweet depends on the order of its words, a sequence model is the natural choice. RNNs maintain a hidden state that acts as memory, so they can pick up on patterns in the text that determine the sentiment. Within a sentence or group of sentences, words appear in a particular order to convey a definite meaning. RNNs keep track of that order by carrying information from earlier words forward as they process each new word.
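To illustrate why word order matters, the sketch below compares an order-insensitive bag-of-words vector with the final hidden state of a tiny hand-rolled recurrent update. The vocabulary, the random weights, and the helper names are all assumptions made for this example; it is not the Keras model used above:

```python
import numpy as np

# Hypothetical four-word vocabulary for the example.
vocab = {"not": 0, "good": 1, "bad": 2, "but": 3}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

def bag_of_words(words):
    # Summing one-hot vectors discards word order entirely.
    return sum(one_hot(w) for w in words)

def rnn_final_state(words, W_x, W_h):
    # Simple recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1}).
    h = np.zeros(W_h.shape[0])
    for w in words:
        h = np.tanh(W_x @ one_hot(w) + W_h @ h)
    return h

# Random, untrained weights; only the order sensitivity matters here.
rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, len(vocab)))
W_h = rng.normal(size=(3, 3))

a = ["not", "good", "but", "bad"]
b = ["not", "bad", "but", "good"]

bag_a, bag_b = bag_of_words(a), bag_of_words(b)
h_a, h_b = rnn_final_state(a, W_x, W_h), rnn_final_state(b, W_x, W_h)

print("same bag of words:", np.array_equal(bag_a, bag_b))
print("same RNN state:   ", np.allclose(h_a, h_b))
```

The two sentences contain the same words, so an order-free input representation cannot tell them apart, while the recurrent state depends on the sequence in which the words arrive.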

**Task 2 (25 points)**

In [30]:
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embeddings = hub.KerasLayer(module_url)

In [34]:
def simFunction():
  x = str(input('Please enter first word: '))
  y = str(input('Please enter second word: '))
  embed_x = embeddings([x])[0].numpy()
  embed_y = embeddings([y])[0].numpy()
  similarity = np.inner(embed_x, embed_y)/(np.linalg.norm(embed_x)*np.linalg.norm(embed_y)) # cosine similarity
  dissimilarity = 1 - similarity 
  print(f'Cosine similarity of {x} and {y} is {similarity}.')
  print(f'Dissimilarity of {x} and {y} is {dissimilarity}.')

In [37]:
simFunction()

Please enter first word: good
Please enter second word: great
Cosine similarity of good and great is 0.8486340641975403.
Dissimilarity of good and great is 0.15136593580245972.


In [36]:
simFunction()

Please enter first word: love
Please enter second word: hate
Cosine similarity of love and hate is 0.5902369022369385.
Dissimilarity of love and hate is 0.4097630977630615.


The dissimilarity between a pair of words is defined as 1 - cosine similarity, which is equivalent to the cosine distance. Dissimilarity should vary inversely with similarity, which makes this definition appropriate, especially when two words have opposite meanings.
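As a small worked example of this definition, the sketch below computes cosine similarity and cosine distance for hand-picked 2-D vectors rather than real sentence embeddings. The function names are hypothetical:

```python
import numpy as np

def cosine_similarity(u, v):
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_distance(u, v):
    # Dissimilarity as defined above: 1 - cosine similarity.
    return 1.0 - cosine_similarity(u, v)

same = cosine_distance([1, 0], [1, 0])         # identical direction
orthogonal = cosine_distance([1, 0], [0, 1])   # unrelated direction
opposite = cosine_distance([1, 0], [-1, 0])    # opposite direction

print(same, orthogonal, opposite)
```

Since cosine similarity lies in [-1, 1], this dissimilarity lies in [0, 2]: 0 for vectors pointing the same way, 1 for orthogonal vectors, and 2 for vectors pointing in opposite directions.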