# Sentiment Analysis with Sentiment140 Dataset

In this notebook, we aim to accomplish the following:
1. Import and analyze the Sentiment140 dataset.
2. Prepare the data for model training using tokenization and padding techniques.
3. Implement a train-validation split to evaluate our model's performance.
4. Train a tokenizer on the training data.
5. Construct and train a neural network model that includes an embedding layer for text representation.
6. Inspect the output dimensionality of the embedding layer.
7. Train the model to classify the sentiment of tweets and evaluate its performance using the validation set.

## Introduction
For this exercise, we'll be using a subsample of the Sentiment140 dataset from Kaggle.
Sentiment140 is a popular dataset for sentiment analysis that contains 1.6 million tweets labeled for sentiment. This dataset is widely used for training machine learning models to differentiate between positive and negative sentiment in text.

Here is the link to the dataset: [Sentiment140_small]()

Please download the dataset and upload it to your Google Drive.
Storing the data in Google Drive, rather than uploading it directly to your Colab environment, keeps it available across sessions. You only need to run the cell that mounts your Google Drive in your Colab environment.

The dataset contains the following 6 fields:  
`target`: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)  
`ids`: the id of the tweet (2087)  
`date`: the date of the tweet (Sat May 16 23:58:44 UTC 2009)  
`flag`: the query (lyx). If there is no query, then this value is NO_QUERY.  
`user`: the user that tweeted (robotickilldozr)  
`text`: the text of the tweet (Lyx is cool)  
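
Classifiers in Keras generally expect class labels as consecutive integers starting at 0, while the `target` field uses the codes 0, 2 and 4. The following is a minimal sketch of one way to remap them; the names `POLARITY_TO_CLASS` and `encode_polarity` are hypothetical and not part of the dataset or any library.

```python
# Hypothetical mapping from Sentiment140 polarity codes to consecutive class indices.
POLARITY_TO_CLASS = {0: 0, 2: 1, 4: 2}


def encode_polarity(targets, mapping=POLARITY_TO_CLASS):
    """Map raw polarity codes to class indices, failing loudly on unknown codes."""
    encoded = []
    for value in targets:
        if value not in mapping:
            raise ValueError(f"Unexpected polarity code: {value!r}")
        encoded.append(mapping[value])
    return encoded


print(encode_polarity([0, 4, 4, 2, 0]))
```

If your subsample only contains the codes 0 and 4, a binary mapping such as `{0: 0, 4: 1}` can be passed instead. On a pandas column, `data['target'].map(mapping)` should achieve the same remapping.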

## Google Drive Setup
Before proceeding, ensure that you have uploaded the Sentiment140 dataset to your Google Drive in a specified folder. Then, use the following code to mount your Google Drive and access files using the path '/content/drive/MyDrive/'.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


## Importing Data
Next, we will import the dataset and examine it using descriptive statistics to gain initial insights.

In [None]:
import pandas as pd

# Update this to the path of your Sentiment140 dataset file
file_path = '/content/drive/MyDrive/path_to_your_file.csv'
data = pd.read_csv(file_path, encoding='ISO-8859-1')  # Note: The encoding may vary based on your dataset specifics

data.head()  # View the first few rows of the dataset


In [None]:
# Descriptive statistics
print(data.describe())  # Summary statistics for numerical columns
print(data['your_label_column'].value_counts())  # Replace 'your_label_column' with the actual column name for the labels

## Removing Stop Words

Stop words are common words that generally do not contribute much meaning to a sentence and are typically removed in the preprocessing stage of traditional text analysis. Removing them shrinks the vocabulary and shortens the sequences, which can help the model focus on words that carry more meaning. Note, however, that NLTK's English stop word list includes negations such as "not" and "no", which can matter for sentiment, so it is worth checking whether removal actually helps on this task.

In this section, we will use the NLTK library, a widely used Python library for natural language processing, to remove stop words from our dataset.


In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def remove_stop_words(data):
    stop_words = set(stopwords.words('english'))
    return [" ".join([word for word in sentence.split() if word.lower() not in stop_words]) for sentence in data]

# Assuming 'data' is your dataframe and it has a column named 'text' containing the tweets
data['text'] = remove_stop_words(data['text'])


## Tokenization Function
The following function will be responsible for tokenizing our text data. We'll use Keras' Tokenizer class, which allows us to vectorize a text corpus by turning each text into a sequence of integers. The `oov_token` parameter is used to handle out-of-vocabulary words during text conversion.
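
To make the integer encoding concrete, here is a simplified plain-Python sketch of the idea. It is not Keras' implementation (for instance, it does not filter punctuation); the helper names `build_word_index` and `texts_to_ids` are hypothetical. As in Keras, id 0 is left free for padding and id 1 is reserved for the out-of-vocabulary token.

```python
from collections import Counter


def build_word_index(sentences, oov_token="<OOV>"):
    """Assign integer ids by descending word frequency; id 1 is reserved for OOV."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_ids(sentences, word_index, oov_token="<OOV>"):
    """Replace each word by its id, falling back to the OOV id for unseen words."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in s.lower().split()] for s in sentences]


# Toy sentences, invented for illustration only.
train_sentences = ["i love this phone", "i hate this weather"]
index = build_word_index(train_sentences)
print(texts_to_ids(["i love this weather", "i adore this phone"], index))
```

The word "adore" never appears in the training sentences, so it is encoded with the OOV id instead of being dropped. This is why the word index should be built from the training text only.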


In [None]:
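# Import libraries
from tensorflow.keras.preprocessing.text import Tokenizer


# Fit a Keras Tokenizer on the given sentences and return the fitted object
def tokenizer(sentences, oov_token="<OOV>"):
    fitted_tokenizer = Tokenizer(oov_token=oov_token)
    fitted_tokenizer.fit_on_texts(sentences)  # builds the word index from these sentences
    return fitted_tokenizer

# Tokenize the data: call tokenizer(...) on the training text only, then use the
# returned object's texts_to_sequences(...) to convert texts into integer sequences
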

## Padding
To ensure consistent input shape for modeling, we apply padding to our tokenized text. Padding adjusts the sequence length so that all inputs are of the same length, which is necessary for batch processing in neural networks.
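
As an illustration of what padding does, here is a minimal plain-Python sketch. The helper `pad_to_length` is hypothetical and simpler than Keras' `pad_sequences`: it always truncates from the end, whereas `pad_sequences` pads and truncates at the front by default.

```python
def pad_to_length(sequences, maxlen=None, padding="post", pad_value=0):
    """Pad (or truncate) each sequence so that all have the same length."""
    if maxlen is None:
        # Default to the length of the longest sequence.
        maxlen = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]  # simplification: always truncate from the end
        fill = [pad_value] * (maxlen - len(seq))
        padded.append(seq + fill if padding == "post" else fill + seq)
    return padded


print(pad_to_length([[2, 4, 3], [5]]))                  # zeros appended
print(pad_to_length([[2, 4, 3], [5]], padding="pre"))   # zeros prepended
print(pad_to_length([[2, 4, 3], [5]], maxlen=2))        # longer sequence truncated
```

The pad value 0 is used because the tokenizer never assigns id 0 to a word, so padding cannot be confused with real tokens.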


In [None]:
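# Import libraries
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Pad (or truncate) integer sequences to a common length
def padding_function(sequences, padding='post', maxlen=None):
    # With maxlen=None, every sequence is padded to the length of the longest one
    return pad_sequences(sequences, padding=padding, maxlen=maxlen)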


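## Train-Validation Split
We hold out 20% of the tweets for validation so that the model is evaluated on data it has not seen during training. The tokenizer should then be fit on the training portion only.

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the data into train and validation sets
# ('target' is the polarity column listed above; adjust if your file names it differently)
train_data, validation_data, train_labels, validation_labels = train_test_split(
    data['text'], data['target'], test_size=0.2, random_state=42)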


## Model Construction
Now, we will define and compile our neural network model, incorporating an embedding layer to capture text representation effectively.


In [None]:
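from keras.models import Sequential
from keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense

# Example hyperparameters (placeholders; set them to match your tokenizer and padding)
vocab_size = 10000       # number of distinct token ids, including padding and OOV
embedding_dim = 16       # length of each word vector
max_length = 40          # padded sequence length
number_of_classes = 2    # e.g. negative vs. positive

# Define the model architecture
model = Sequential()
model.add(Input(shape=(max_length,)))  # fixes the input shape so the model is built right away
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim))
model.add(GlobalAveragePooling1D())  # average over the sequence: one vector per tweet
model.add(Dense(units=number_of_classes, activation='softmax'))

# Display the model's architecture
model.summary()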


In [None]:
# Inspect the embedding layer dimensions (the weights exist once the model is built;
# rerun this cell after training to look at the learned values)
embedding_layer_weights = model.layers[0].get_weights()[0]
print(f"Shape of embedding layer weights: {embedding_layer_weights.shape}")
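
The shape of the embedding weights follows directly from the layer's arguments: one row per token id and one column per embedding dimension. The sketch below works through that arithmetic in plain Python; `embedding_shapes` is a hypothetical helper and the numbers are example values, not settings taken from this notebook.

```python
def embedding_shapes(vocab_size, embedding_dim, max_length, batch_size=None):
    """Return the weight shape, output shape and parameter count of an embedding layer."""
    weights_shape = (vocab_size, embedding_dim)             # one vector per token id
    output_shape = (batch_size, max_length, embedding_dim)  # one vector per position
    num_params = vocab_size * embedding_dim
    return weights_shape, output_shape, num_params


weights, output, params = embedding_shapes(vocab_size=10_000, embedding_dim=16, max_length=40)
print("weights:", weights)
print("output:", output)
print("trainable parameters:", params)
```

Because the output has one vector per token position, it is three-dimensional. A pooling or flattening layer is therefore typically placed between the embedding and the final dense layer.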


In [None]:
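# Compile and train the model
# train_data and validation_data must hold padded integer sequences (not raw text),
# and the labels must be integer class ids in the range 0..number_of_classes-1
num_epochs = 5  # example value; adjust as needed
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(train_data, train_labels, epochs=num_epochs, validation_data=(validation_data, validation_labels))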
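
To evaluate the run on the validation set, the per-epoch metrics stored in `history.history` can be inspected. Here is a minimal sketch, assuming the dictionary contains a `val_accuracy` list (as it normally does when `metrics=['accuracy']` and validation data are supplied). The helper `best_epoch` is hypothetical, and the numbers are invented purely to show the mechanics.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return the 1-based epoch with the highest value of the metric, and that value."""
    values = history_dict[metric]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]


# Toy values for illustration only; these are not results from this notebook.
toy_history = {"val_accuracy": [0.70, 0.74, 0.73]}
epoch, value = best_epoch(toy_history)
print(f"Best validation accuracy {value:.2f} at epoch {epoch}")
```

In the notebook, the same helper could be called as `best_epoch(history.history)`. A validation accuracy that peaks early and then declines while training accuracy keeps rising is a common sign of overfitting.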


## Conclusion
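We have covered data preprocessing steps like tokenization and padding, followed by splitting our dataset into training and validation sets. We constructed and trained a neural network model with an embedding layer to classify the sentiment of tweets and evaluated it on the validation set.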
