<a href="https://colab.research.google.com/github/mimuruth-msft/NLP/blob/main/Text_Classification_2/Classification2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook uses the "Sentiment Analysis on Movie Reviews" dataset, which can be downloaded from Kaggle: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data

First, I read in the dataset and divided it into training and test sets using the train_test_split function from sklearn.model_selection, with 80% of the data for training and 20% for testing.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D

df = pd.read_csv("/content/sample_data/train.tsv", sep="\t")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

##### **Preprocess the text data.**
I preprocessed the text with the Tokenizer class and the pad_sequences function from Keras. Tokenizer converts each phrase into a sequence of integer word indices, restricted to the most frequent words (num_words=10000), and pad_sequences pads the sequences to a fixed length of 100.

In [8]:
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train_df['Phrase'])

X_train = tokenizer.texts_to_sequences(train_df['Phrase'])
X_test = tokenizer.texts_to_sequences(test_df['Phrase'])

maxlen = 100
X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

y_train = train_df['Sentiment'].values
y_test = test_df['Sentiment'].values
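
To make the padding step concrete, here is a minimal pure-Python sketch of what pad_sequences does with padding='post' and its default truncating='pre'. The helper name pad_post and the toy sequences are illustrative assumptions, not part of the notebook or of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Illustrative stand-in for pad_sequences(..., padding='post').

    Assumes the Keras default truncating='pre', so sequences longer
    than maxlen keep their last maxlen tokens.
    """
    padded = []
    for seq in sequences:
        # Default truncation drops tokens from the front.
        seq = list(seq)[-maxlen:]
        # 'post' padding appends the fill value after the real tokens.
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


# Hypothetical token-index sequences, one short and one too long.
toy_sequences = [[4, 9, 2], [7, 1, 3, 8, 5, 6]]
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)  # [[4, 9, 2, 0], [3, 8, 5, 6]]
```

Note that only the padding side is set to 'post' in the notebook; phrases longer than maxlen would still lose their leading tokens unless truncating='post' is also passed.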


##### **Create a baseline sequential model with an embedding layer, LSTM layer, and a dense output layer.**
Next, I created a baseline sequential model with an embedding layer, an LSTM layer, and a dense output layer, and compiled it with the sparse_categorical_crossentropy loss function and the adam optimizer.

In [9]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

embedding_dim = 100

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=embedding_dim, input_length=maxlen))
model.add(LSTM(units=32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(units=5, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 100, 100)          1000000   
                                                                 
 lstm (LSTM)                 (None, 32)                17024     
                                                                 
 dense (Dense)               (None, 5)                 165       
                                                                 
Total params: 1,017,189
Trainable params: 1,017,189
Non-trainable params: 0
_________________________________________________________________
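
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for these layer types; the helper function names are my own and are not Keras APIs.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized vector per vocabulary index.
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


lstm_model_counts = {
    "embedding": embedding_params(10000, 100),
    "lstm": lstm_params(100, 32),
    "dense": dense_params(32, 5),
}
lstm_model_total = sum(lstm_model_counts.values())
print(lstm_model_counts, lstm_model_total)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size and embedding dimension dominate the model size.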


##### **Train the model on the training data and evaluate it on the test data.**
Then I trained the model on the training data and evaluated it on the test data, which is passed to fit as validation_data. The model achieved an accuracy of around 51%.

In [10]:
batch_size = 128
epochs = 5

model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test, y_test))


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fc781e92fa0>
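
To put an accuracy figure in context, one option is to compare it with a majority-class baseline, which always predicts the most frequent training label. The sketch below is an assumption about how this could be checked, not something the notebook ran; the function name and toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Toy labels on the same 0-4 sentiment scale (not the real data).
toy_train = [2, 2, 2, 1, 3, 0, 4, 2]
toy_test = [2, 1, 2, 3]
baseline_label, baseline_acc = majority_baseline_accuracy(toy_train, toy_test)
print(baseline_label, baseline_acc)  # 2 0.5
```

With the notebook's arrays the call would be majority_baseline_accuracy(y_train.tolist(), y_test.tolist()). A model is only learning something useful if it clearly beats this number.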

##### **Try a different architecture like CNN and evaluate the test data.**
Next, I tried a different architecture, a convolutional neural network (CNN), by replacing the LSTM layer with a 1D convolutional layer followed by a max-pooling layer and a global max-pooling layer. I compiled the model with the same loss function and optimizer and trained it on the same training data. This model achieved an accuracy of around 64%, higher than the roughly 51% of the LSTM-based model.

In [11]:
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=embedding_dim, input_length=maxlen))
model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(GlobalMaxPooling1D())
model.add(Dense(units=5, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test, y_test))


Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 conv1d (Conv1D)             (None, 96, 64)            32064     
                                                                 
 max_pooling1d (MaxPooling1D  (None, 24, 64)           0         
 )                                                               
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 5)                 325       
                                                                 
Total params: 1,032,389
Trainable params: 1,032,389
Non-trainable params: 0
_________________________________________________________________

<keras.callbacks.History at 0x7fc78283f970>
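
The output shapes and parameter counts in the CNN summary follow from the layer settings. The sketch below recomputes them, assuming the Keras defaults of 'valid' padding, a convolution stride of 1, and a pooling stride equal to pool_size; the helper names are illustrative.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    # 'valid' padding: the kernel must fit entirely inside the input.
    return (input_length - kernel_size) // stride + 1


def max_pool1d_output_length(input_length, pool_size):
    # Non-overlapping windows; any leftover tail is dropped.
    return (input_length - pool_size) // pool_size + 1


def conv1d_params(in_channels, filters, kernel_size):
    # One kernel_size x in_channels kernel plus one bias per filter.
    return kernel_size * in_channels * filters + filters


conv_len = conv1d_output_length(100, kernel_size=5)
pool_len = max_pool1d_output_length(conv_len, pool_size=4)
conv_count = conv1d_params(in_channels=100, filters=64, kernel_size=5)
dense_count = 64 * 5 + 5  # GlobalMaxPooling1D leaves one value per filter
cnn_total = 10000 * 100 + conv_count + dense_count
print(conv_len, pool_len, conv_count, dense_count, cnn_total)
```

Because GlobalMaxPooling1D takes the maximum over all remaining time steps, the preceding MaxPooling1D layer does not change what reaches the dense layer; it only shortens the intermediate sequence.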

##### **Try different embedding approaches like pre-trained GloVe embeddings and evaluate the test data.**
Finally, I tried pre-trained GloVe embeddings for the embedding layer. I first loaded the GloVe vectors from a pre-trained file and built an embedding matrix, then created an embedding layer from this matrix and froze its weights so that they are not updated during training. I used the same CNN architecture as before and trained the model on the same training data. This model achieved an accuracy of around 68%, the best result among the models I tried.

In [14]:
import numpy as np

embedding_dim = 100
embeddings_index = {}

# Each line of the GloVe file holds a word followed by its vector components.
with open('/content/sample_data/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Rows for words that are missing from GloVe stay all-zero.
embedding_matrix = np.zeros((10000, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i >= 10000:
        break
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=embedding_dim, weights=[embedding_matrix], input_length=maxlen, trainable=False))
model.add(Conv1D(filters=64, kernel_size=5, activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(GlobalMaxPooling1D())
model.add(Dense(units=5, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_test, y_test))


Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 100, 100)          1000000   
                                                                 
 conv1d_1 (Conv1D)           (None, 96, 64)            32064     
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 24, 64)           0         
 1D)                                                             
                                                                 
 global_max_pooling1d_1 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_2 (Dense)             (None, 5)                 325       
                                                                 
Total params: 1,032,389
Trainable params: 32,389
Non-trainable params: 1,000,000
_________________________________________________________________

<keras.callbacks.History at 0x7fc782618310>
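
Since the embedding layer is frozen, any vocabulary word without a GloVe vector keeps an all-zero row. A quick coverage check, sketched below, shows how many words that affects. The function name and toy dictionaries are hypothetical; in the notebook the inputs would be tokenizer.word_index and embeddings_index.

```python
def embedding_coverage(word_index, embeddings_index, num_words):
    """Count the kept vocabulary words that have a pre-trained vector."""
    # Mirror the notebook's cut-off: only indices below num_words are used.
    kept = [word for word, i in word_index.items() if i < num_words]
    found = sum(1 for word in kept if word in embeddings_index)
    return found, len(kept)


# Toy stand-ins for tokenizer.word_index and the loaded GloVe dictionary.
toy_word_index = {"the": 1, "movie": 2, "zzzunseen": 3, "rare": 4}
toy_embeddings = {"the": [0.1], "movie": [0.2], "rare": [0.3]}
found, kept = embedding_coverage(toy_word_index, toy_embeddings, num_words=4)
print(f"{found} of {kept} vocabulary words have a pre-trained vector")
```

Low coverage would suggest either different text normalization or setting trainable=True so that the missing rows can be learned.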

Overall, I observed that pre-trained embeddings improved performance over randomly initialized embeddings (around 68% versus 64% accuracy with the same CNN architecture), and that the CNN architecture outperformed the LSTM-based architecture in this case (around 64% versus 51%). It should be possible to improve performance further by tuning the hyperparameters and trying other models.