# Sentiment Analysis Using the Sentiment140 Dataset

In this notebook, we aim to accomplish the following:
1. Import and analyze the Sentiment140 dataset.
2. Prepare the data for model training.
3. Implement a train-validation split to evaluate our model's performance.
4. Train a tokenizer on the training data and tokenize the data.
5. Pad the tokenized sequences to a selected length.
6. Construct and train a neural network model that includes an embedding layer for text representation.
7. Inspect the output dimensionality of the embedding layer.
8. Train the model to classify the sentiment of tweets and evaluate its performance using the validation set.
9. Tune the hyperparameters for the tokenization and model training.

## Introduction
For this exercise, we'll be using a subsample of the Sentiment140 dataset from Kaggle.
Sentiment140 is a popular dataset for sentiment analysis that contains 1.6 million tweets labeled for sentiment. This dataset is widely used for training machine learning models to differentiate between positive and negative sentiment in text.

You can download the dataset [here](https://github.com/opencampus-sh/course-material/blob/main/machine-learning-with-tensorflow/week-05/sentiment140_small.csv).

Please download the dataset and upload it to your Google Drive.
Uploading the data to your Google Drive, rather than directly to your Colab environment, keeps it persistently available across sessions. You then only need to run the cell that mounts your Google Drive in your Colab environment.

The dataset contains the following 6 fields:  
`target`: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)  
`ids`: the ID of the tweet (2087)  
`date`: the date of the tweet (Sat May 16 23:58:44 UTC 2009)  
`flag`: the query (lyx). If there is no query, this value is NO_QUERY.  
`user`: the user that tweeted (robotickilldozr)  
`text`: the text of the tweet (Lyx is cool)  

## Google Drive Setup
Before proceeding, ensure that you have uploaded the Sentiment140 dataset to your Google Drive in a specified folder. Then, use the following code to mount your Google Drive and access files using the path '/content/drive/MyDrive/'.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


## Importing Data
Next, we will import the dataset and examine it using descriptive statistics to gain initial insights.

In [None]:
import pandas as pd

# Define column names
column_names = ["target", "ids", "date", "flag", "user", "text"]

# Update this to the path of your Sentiment140 dataset file
file_path = '/content/drive/MyDrive/path_to_your_file.csv'
data = pd.read_csv(file_path, encoding='ISO-8859-1', names=column_names)  # Note: The encoding may vary based on your dataset specifics

data.head()  # View the first few rows of the dataset


### Descriptive Statistics

In [None]:
# Average number of words in each tweet
# (split() without arguments ignores repeated whitespace, so empty strings are not counted as words)
data['text'].apply(lambda x: len(x.split())).mean()

In [None]:
# Summary statistics for numerical columns
data.describe()

In [None]:
# Class distribution
data['your_label_column'].value_counts()  # Replace 'your_label_column' with the actual column name for labels

## Prepare Labels

In [None]:
# Labels must start at 0 and increase sequentially by 1 to mark the different classes
data['label'] = # INCLUDE YOUR CODE HERE

data['label'].value_counts() 

# Another option would be to use one-hot encoding to represent the labels, where each class is represented by a vector of 0s and a 1 in the position of the class label
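
As a minimal sketch of the two encodings mentioned above, the following cell uses a hypothetical toy frame. It assumes that only the polarities 0 and 4 occur in your subsample, so check the class distribution first and extend the mapping if neutral tweets (2) are present. The names `toy`, `label_map`, and `one_hot` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using the same `target` coding as Sentiment140
toy = pd.DataFrame({"target": [0, 4, 4, 0]})

# Map the original polarity values to consecutive integers starting at 0
label_map = {0: 0, 4: 1}  # assumption: only negative (0) and positive (4) occur
toy["label"] = toy["target"].map(label_map)

# Alternative: one-hot encoding with one row per tweet and one column per class
one_hot = np.eye(len(label_map), dtype=int)[toy["label"].to_numpy()]

print(toy["label"].tolist())
print(one_hot)
```

Integer labels of this kind pair with `sparse_categorical_crossentropy`, while one-hot vectors pair with `categorical_crossentropy`.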

## Removing Stop Words

Stop words are common words that generally do not contribute much meaning to a sentence and are typically removed in the preprocessing stage of traditional text analysis. Removing them reduces the size of the dataset and can improve the performance of the model by focusing on words that carry more meaning.

In this section, we will use the NLTK library, a widely used Python library for natural language processing, to remove stop words from our dataset.


In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def remove_stop_words(data):
    stop_words = set(stopwords.words('english'))
    return [" ".join([word for word in sentence.split() if word.lower() not in stop_words]) for sentence in data]

# Add new column including the texts without stopwords
data['text_without_stopwords'] = remove_stop_words(data['your_text_column'])

# print the first few rows of the text columns
data[['your_text_column', 'text_without_stopwords']].head()

In [None]:
# Average number of words in each tweet without stopwords
# (split() without arguments returns an empty list for tweets that consisted only of stop words)
data['text_without_stopwords'].apply(lambda x: len(x.split())).mean()
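
Stop word lists often contain negations such as "not", which can matter for sentiment. The sketch below uses a small hand-picked stop word set as a stand-in for the NLTK list (an assumption made so that the example runs without a download) to show how two tweets with opposite sentiment can collapse to the same text. `toy_stop_words` and `strip_stop_words` are hypothetical names.

```python
# Hand-picked stand-in for a stop word list (assumption, not the NLTK list itself)
toy_stop_words = {"i", "am", "this", "is", "not", "a"}

def strip_stop_words(sentence, stop_words):
    """Drop every whitespace-separated token whose lowercase form is a stop word."""
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

positive = strip_stop_words("This is a good day", toy_stop_words)
negative = strip_stop_words("This is not a good day", toy_stop_words)

print(positive)
print(negative)
print(positive == negative)  # True means the negation was lost
```

If this happens on your data, it may be worth comparing model performance with and without stop word removal.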

## Splitting Data into Training and Test Sets

Before training our model, it is essential to split the dataset into training and test sets. This approach helps in assessing the performance of the model on unseen data, ensuring that our evaluations are realistic and our model is not overfitting to the training data.

We will use a typical split ratio of 80% for training and 20% for testing. You can adjust this ratio based on your dataset size and requirements.


In [None]:
from sklearn.model_selection import train_test_split

# Assuming 'data' is your dataframe and 'labels' is the column with sentiment labels
X_train, X_test, y_train, y_test = train_test_split(
    data['your_text'], data['your_labels'], test_size=0.2, random_state=42) # replace 'your_text' and 'your_labels' with the actual column names for text and labels

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
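
If the classes are imbalanced, a purely random split can shift the class proportions between the two sets. As a sketch on hypothetical toy data, `train_test_split` accepts a `stratify` argument that keeps the label proportions the same in both parts. `toy_texts` and `toy_labels` are illustrative names.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 texts with perfectly balanced labels
toy_texts = [f"tweet {i}" for i in range(10)]
toy_labels = [0, 1] * 5

# stratify=toy_labels preserves the 50/50 label ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

print(Counter(y_tr), Counter(y_te))
```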


## Tokenization Function
The following function will be responsible for tokenizing our text data. We'll use Keras' Tokenizer class, which allows us to vectorize a text corpus by turning each text into a sequence of integers. The `oov_token` parameter is used to handle out-of-vocabulary words during text conversion.


In [None]:
# Install TensorFlow library for text processing (if not already installed - for Colab you can skip it)
%pip install tensorflow_text

In [None]:
# Import libraries
from tensorflow.keras.preprocessing.text import Tokenizer # Implement tokenizer function

# Define the tokenizer
tokenizer = Tokenizer(oov_token="<OOV>", num_words = 0) # Change the num_words parameter to the desired number of words that your dictionary should contain

# Train the tokenizer
# INCLUDE YOUR CODE HERE

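# Tokenize the data (use the tokenizer fitted on the training data for both sets)
X_train_sequences = # INCLUDE YOUR CODE HERE
X_test_sequences = # INCLUDE YOUR CODE HERE  (needed later in the padding step)

# Create a dataframe with the original and the tokenized data to check the tokenization for the first rows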
pd.DataFrame(zip(X_train, X_train_sequences), columns=['text', 'sequence']).head()
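
To illustrate how `oov_token` and `num_words` interact, here is a small sketch on hypothetical toy sentences (not the tweet data). It assumes the legacy Keras `Tokenizer` behaves as documented: the vocabulary is built from the fitted texts only, and words that were never seen or that fall outside the `num_words` most frequent entries are replaced by the OOV index.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

toy_train = ["i love this movie", "i hate this film"]
toy_new = ["i love this song"]  # "song" never appears in toy_train

toy_tokenizer = Tokenizer(oov_token="<OOV>", num_words=5)
toy_tokenizer.fit_on_texts(toy_train)  # builds the word index from the training texts only
toy_sequences = toy_tokenizer.texts_to_sequences(toy_new)

print(toy_tokenizer.word_index)  # the OOV token is assigned index 1
print(toy_sequences)
```

Note that `word_index` still lists every word seen during fitting. The `num_words` limit is only applied when texts are converted to sequences.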

## Padding
To ensure consistent input shape for modeling, we apply padding to our tokenized text. Padding adjusts the sequence length so that all inputs are of the same length, which is necessary for batch processing in neural networks.

In this step, it is important to decide on a suitable maximum sequence length for padding. This length affects both the model's performance and computational efficiency. If the maximum length is too long, it increases computational costs and adds a lot of padding to shorter sequences. If it is too short, longer sequences are truncated and valuable information might be lost.

In this section, we will analyze the distribution of sequence lengths in our dataset and choose an appropriate maximum length. This is often a balance between capturing enough information and maintaining computational efficiency.



In [None]:
# Plot distribution of sequence lengths
import matplotlib.pyplot as plt

plt.hist([len(sequence) for sequence in X_train_sequences], bins=160)
plt.show()

# print maximum sequence length
print("Longest sequence: ", max([len(sequence) for sequence in X_train_sequences]))
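
One possible way to turn the histogram into a concrete number is to pick a length that covers most sequences, for example a high percentile. The sketch below uses hypothetical lengths, and the 90% coverage target is an arbitrary assumption. In the notebook the lengths would come from `X_train_sequences`.

```python
import numpy as np

# Hypothetical sequence lengths with one very long outlier
toy_lengths = np.array([2, 3, 3, 4, 4, 5, 5, 6, 7, 30])

coverage = 90  # assumed target: at least 90% of sequences fit without truncation
candidate_max_len = int(np.ceil(np.percentile(toy_lengths, coverage)))
share_untruncated = float(np.mean(toy_lengths <= candidate_max_len))

print(candidate_max_len, share_untruncated)
```

Compared with padding to the longest sequence, such a choice avoids letting a few outliers determine the input size.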

In [61]:
# Import libraries
import numpy as np
from keras.preprocessing.sequence import pad_sequences

max_len = None # Set the maxlen parameter to the desired maximum length of your sequences
# Note: By setting it to `None`, the maximum length of the sequences will be the length of the longest sequence in the data.
X_train_padded_sequences = pad_sequences(X_train_sequences, padding='post', maxlen=max_len)
X_test_padded_sequences = pad_sequences(X_test_sequences, padding='post', maxlen=max_len)

## Model Construction
Now, we will define our neural network model, incorporating an embedding layer to capture text representation effectively.


In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, GlobalAveragePooling1D

# Define model parameters
vocab_size = XXX  # Replace with your vocabulary size as defined in the tokenization step
max_length = XXX  # Replace with your maximum sequence length as defined in the padding step
embedding_dim = 128  # Size of the embedding vectors


# Define the model architecture
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length))
model.add(GlobalAveragePooling1D())
model.add(Dense(20, activation='relu'))
model.add(Dense(units=NUMBER_OF_CLASSES, activation='softmax'))

# Display the model's architecture
model.summary()

In [66]:
# Inspect the embedding layer dimensions (the weights exist as soon as the model is built, so training is not required)
embedding_layer_weights = model.layers[0].get_weights()[0]
print(f"Shape of embedding layer weights: {embedding_layer_weights.shape}")


Shape of embedding layer weights: (10000, 128)
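
The weight matrix above is the lookup table, with one row per vocabulary entry. The output of the embedding layer for a batch has a different shape: one vector per token position. The sketch below passes a hypothetical batch of random token ids through standalone layers. The sizes are arbitrary toy values, not the ones used in the notebook.

```python
import numpy as np
from keras.layers import Embedding, GlobalAveragePooling1D

toy_vocab_size, toy_max_length, toy_embedding_dim = 1000, 10, 16

# A batch of 4 padded sequences of token ids
toy_batch = np.random.randint(0, toy_vocab_size, size=(4, toy_max_length))

embedded = Embedding(input_dim=toy_vocab_size, output_dim=toy_embedding_dim)(toy_batch)
pooled = GlobalAveragePooling1D()(embedded)  # averages over the sequence axis

print(embedded.shape)  # (batch, sequence length, embedding dimension)
print(pooled.shape)    # (batch, embedding dimension)
```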


## Model Fitting

In this section, we will compile and train our model.

In [None]:
# Compile and train the model
model.compile(optimizer='adam', loss='replace_with_suitable_loss_function', metrics=['accuracy']) # The loss function can either be `sparse_categorical_crossentropy` or `categorical_crossentropy`, depending on how you have encoded your labels
history = model.fit(your_train_texts, your_train_labels, epochs=30, batch_size=1024, validation_data=(your_test_texts, your_test_labels)) # Replace your_train_texts, your_train_labels, your_test_texts, your_test_labels with the actual variables containing your data
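
As a sketch of how the two loss functions relate, the cell below computes both on hypothetical predictions: `sparse_categorical_crossentropy` takes integer class labels, `categorical_crossentropy` takes one-hot vectors, and for the same labels they should give the same per-sample values. The labels and probabilities are made up for illustration.

```python
import numpy as np
from keras.losses import categorical_crossentropy, sparse_categorical_crossentropy
from keras.utils import to_categorical

int_labels = np.array([0, 1, 1])                            # integer-encoded classes
one_hot_labels = to_categorical(int_labels, num_classes=2)  # the same labels, one-hot encoded
predicted_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], dtype="float32")

loss_sparse = np.asarray(sparse_categorical_crossentropy(int_labels, predicted_probs))
loss_one_hot = np.asarray(categorical_crossentropy(one_hot_labels, predicted_probs))

print(loss_sparse)
print(loss_one_hot)
```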

Use the code above to compile and train your model again, but replace the `validation_data` argument with `validation_split=0.2` (see below).  
Why is this approach of defining the validation data not recommended?

In [None]:
# Compile and train the model
# INCLUDE YOUR CODE HERE
history_val_split = # INCLUDE YOUR CODE HERE

In [None]:
# Plot training & validation accuracy values for both models above
plt.figure(figsize=(12, 6))

# Plot accuracy for the first model
plt.plot(history.history['accuracy'], 'b-', label='Train accuracy Base Model')
plt.plot(history.history['val_accuracy'], 'b--', label='Validation accuracy Base Model')

# Plot accuracy for the second model
plt.plot(history_val_split.history['accuracy'], 'r-', label='Train accuracy Validation Split Model')
plt.plot(history_val_split.history['val_accuracy'], 'r--', label='Validation accuracy Validation Split Model')

plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(loc='upper right')
plt.show()

## Hyperparameter Tuning

Now that you have a basic understanding of how to construct and train a text classification model using Keras, you can experiment with the model to improve its performance. One way to do this is by tuning the model's hyperparameters.

Here are a few hyperparameters you can experiment with:

1. **Vocabulary Size**: This is the number of most frequent words the tokenizer keeps (the `num_words` parameter); all other words are mapped to the out-of-vocabulary token. A larger vocabulary lets the model distinguish more words, but it also increases the number of embedding parameters and can lead to overfitting. Try reducing the vocabulary size to see if it improves the model's performance.

2. **Maximum Sequence Length**: This is the length of the input sequences. If you increase the maximum sequence length, the model will be able to process longer sequences, but it will also take longer to train and may be more prone to overfitting. Try decreasing the maximum sequence length to see if it improves the model's performance.

3. **Embedding Dimensionality**: This is the size of the vectors in which words will be embedded. A higher dimensionality can capture more nuanced relationships between words, but it also increases the computational cost and can lead to overfitting. Try experimenting with different embedding dimensionalities to see what works best.
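
The hyperparameters listed above can be explored with a simple loop. The following is a minimal sketch on random toy data, so the resulting accuracies are meaningless. `build_model`, the candidate values, and the toy arrays are all assumptions for illustration. In the notebook you would re-run tokenization and padding for each setting instead of slicing pre-padded arrays.

```python
import itertools
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Dense, GlobalAveragePooling1D

def build_model(vocab_size, embedding_dim, num_classes=2):
    """Hypothetical helper mirroring the architecture used in this notebook."""
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        GlobalAveragePooling1D(),
        Dense(20, activation='relu'),
        Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# Random toy data standing in for padded sequences and integer labels
rng = np.random.default_rng(0)
toy_vocab_size = 50
X_tr, y_tr = rng.integers(1, toy_vocab_size, size=(64, 10)), rng.integers(0, 2, size=64)
X_val, y_val = rng.integers(1, toy_vocab_size, size=(32, 10)), rng.integers(0, 2, size=32)

results = {}
for embedding_dim, max_len in itertools.product([8, 16], [5, 10]):
    model = build_model(toy_vocab_size, embedding_dim)
    # Slicing the columns imitates a shorter maximum sequence length
    history = model.fit(X_tr[:, :max_len], y_tr, epochs=2, batch_size=16,
                        validation_data=(X_val[:, :max_len], y_val), verbose=0)
    results[(embedding_dim, max_len)] = history.history['val_accuracy'][-1]

best_setting = max(results, key=results.get)
print(results)
print("Best (embedding_dim, max_len):", best_setting)
```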

Remember, the goal of hyperparameter tuning is to find the combination of hyperparameters that gives the best performance on your validation data. Happy tuning!