# IMDB reviews sentiment analysis

This notebook uses the Kaggle dataset [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews).

1. [Explore and prepare training data](#explore_prepare_data)
1. [Create train and test dataset](#train_test_set)
1. [Train the model](#train_model)
1. [Save the model](#save_model)

In [1]:
import pandas as pd
import numpy as np
import pickle
import nltk
import keras

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing



<a id="explore_prepare_data"></a>
## 1. Explore and prepare training data

In [3]:
import ssl

# Work around SSL certificate errors in nltk.download by disabling HTTPS
# certificate verification (insecure; use only on a trusted network).
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

In [5]:
nltk.data.path.append('../nltk_data')
nltk.download(['stopwords', 'punkt'], download_dir='../nltk_data')

[nltk_data] Downloading package stopwords to ../nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to ../nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
stop_words = stopwords.words('english')
porter_stemmer = PorterStemmer()

df = pd.read_csv('../data/imdb_dataset.csv', delimiter=',')

In [None]:
df.head()

### 1.1. Exploring and preparing data

In this step you prepare the data for training a model, using the following text feature engineering techniques:

1. Tokenization
2. Stop-word removal
3. Stemming (Porter)
4. Joining the words (tokens) back into a single string
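
The helpers imported from `utils.preprocessing` are not shown in this notebook, so the following is only a minimal sketch of what the four steps do. The function names, the tiny stop-word list, and the crude suffix stripper are all assumptions made for illustration. In particular, `stem_simple` is a simplified stand-in and not the Porter algorithm.

```python
import re

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "and", "the", "this", "is", "was", "i", "it", "of"}

def tokenize(text):
    """Step 1: split raw text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def drop_stop_words(tokens):
    """Step 2: remove tokens that carry little sentiment information."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem_simple(tokens):
    """Step 3: crude suffix stripping, a stand-in for Porter stemming."""
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):
            # Strip at most one suffix, and keep at least three characters.
            if t.endswith(suffix) and len(t) - len(suffix) >= 3:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

def join_tokens(tokens):
    """Step 4: join the tokens back into a single string."""
    return " ".join(tokens)

review = "This movie was amazing and I loved the actors"
cleaned = join_tokens(stem_simple(drop_stop_words(tokenize(review))))
print(cleaned)  # movie amaz lov actor
```

The output is a single cleaned string per review, which is the form the TF-IDF vectorizer expects later in the notebook.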

In [None]:
import sys

sys.path.insert(0, '<your-directory>/tensorflow-keras-container')

In [None]:
from utils.preprocessing import tokenization, remove_stop_words, stem_porter, rejoin_words, word2vec_tfidf

In [None]:
reviews = df['review']

input_tokens = reviews.apply(tokenization)
input_tokens = input_tokens.apply(remove_stop_words)
input_tokens = input_tokens.apply(stem_porter)
input_text_cleaned = input_tokens.apply(rejoin_words)

df['cleaned_text'] = input_text_cleaned

In [None]:
df.head()

In [None]:
df['cleaned_text'] = df['cleaned_text'].str.lower()

<a id="train_test_set"></a>
## 2. Create train and test dataset

NOTE: The test dataset is 20% of the data and the training dataset is 80% (`test_size=0.2`).

In [None]:
X = df['cleaned_text']
Y = df['sentiment']

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.2,
                                                    random_state=48,
                                                    stratify=Y)

tfidf_vectorizer = TfidfVectorizer(ngram_range=(2, 3),
                        sublinear_tf=True,
                        max_features=10000)

NOTE: Machine learning and deep learning models take numeric input. TF-IDF text feature engineering (TFE) is used to transform the texts into vectors.
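
To make the weighting concrete, here is a small pure-Python sketch of how a TF-IDF weight is computed for word 2-grams and 3-grams with sublinear term frequency. It follows the formulas documented for scikit-learn's `TfidfVectorizer` defaults (smoothed IDF and L2 normalization). The three toy documents are made up, the whitespace tokenizer is a simplification of scikit-learn's own, and the `max_features` cutoff is left out.

```python
import math
from collections import Counter

def ngrams(text, n_min=2, n_max=3):
    """Return all word n-grams of length n_min..n_max."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(words) - n + 1)]

# Toy, already-cleaned documents (assumption, not taken from the dataset).
docs = ["great movi love", "bad movi hate", "great movi great movi"]
doc_terms = [Counter(ngrams(d)) for d in docs]

n_docs = len(docs)
# Document frequency: in how many documents each n-gram occurs.
df = Counter(term for counts in doc_terms for term in counts)

def tfidf(counts):
    weights = {}
    for term, tf in counts.items():
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smoothed IDF
        weights[term] = (1 + math.log(tf)) * idf           # sublinear TF
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(c) for c in doc_terms]
for term, weight in sorted(vectors[0].items()):
    print(f"{term!r}: {weight:.3f}")
```

In the first document, "great movi" receives a lower weight than "movi love" because it also occurs in another document, so its IDF is smaller.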

In [None]:
X_train_tf = tfidf_vectorizer.fit_transform(X_train)
X_test_tf = tfidf_vectorizer.transform(X_test)

print(Y.value_counts().shape)
print(X_train_tf.shape)

In [None]:
le = preprocessing.LabelEncoder()

Y_train_le = le.fit_transform(list(Y_train))
Y_test_le = le.transform(list(Y_test))

num_class = Y.value_counts().shape
input_shape = X_train_tf.shape
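
As an illustration of what the label encoding produces, the sketch below mimics it in plain Python, together with the one-hot step that `to_categorical` applies later. It assumes the `sentiment` column holds the strings `positive` and `negative`. `LabelEncoder` assigns indices in sorted class order, which is what `sorted` reproduces here.

```python
# Toy labels (assumption: the dataset uses 'positive' / 'negative').
labels = ["positive", "negative", "negative", "positive"]

# Classes in sorted order get the indices 0, 1, ...
classes = sorted(set(labels))
class_to_index = {c: i for i, c in enumerate(classes)}
encoded = [class_to_index[label] for label in labels]

# One-hot rows: a 1.0 in the column of the class index, 0.0 elsewhere.
one_hot = [[1.0 if col == idx else 0.0 for col in range(len(classes))]
           for idx in encoded]

print(classes)   # ['negative', 'positive']
print(encoded)   # [1, 0, 0, 1]
print(one_hot)   # [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```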

<a id="train_model"></a>
## 3. Train the model

Create a Keras neural network model.

In [None]:
from keras.utils import to_categorical

Y_train_label_keras = to_categorical(Y_train_le)
Y_test_label_keras = to_categorical(Y_test_le)

from keras import models
from keras import layers

In [None]:
network = models.Sequential()

network.add(layers.Dense(2, activation='relu', input_shape=(input_shape[1], )))
network.add(layers.Dropout(0.4))

network.add(layers.Dense(5, activation='relu'))
network.add(layers.Dropout(0.4))

network.add(layers.Dense(5, activation='sigmoid'))
network.add(layers.Dropout(0.4))

network.add(layers.Dense(num_class[0], activation='softmax'))

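# The targets are one-hot encoded and the output layer is a softmax,
# so categorical cross-entropy is the conventional loss.
network.compile(optimizer='adamax',
                loss="categorical_crossentropy",
                metrics=['accuracy'])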

network.summary()
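
As a rough cross-check of what `network.summary()` reports, the trainable parameter count can be worked out by hand. This sketch assumes the TF-IDF vocabulary reaches the `max_features=10000` limit and that there are two sentiment classes. Dropout layers add no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the units of each Dense layer (assumed values).
layer_sizes = [10000, 2, 5, 5, 2]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [20002, 15, 30, 12]
print(total)      # 20059
```

Under these assumptions almost all parameters sit in the first layer, which compresses each 10,000-dimensional TF-IDF vector into two values.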

In [None]:
network.fit(X_train_tf.toarray(),
            Y_train_label_keras,
            verbose=1,
            epochs=50,
            validation_split=0.3)
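
The sizes of the resulting data splits can be sketched as follows, assuming the full 50,000-review dataset and the default Keras batch size of 32. Keras takes the `validation_split` fraction from the end of the training arrays, before shuffling.

```python
import math

n_reviews = 50_000        # assumption: the complete Kaggle dataset is loaded
test_size = 0.2           # from train_test_split
validation_split = 0.3    # from network.fit
batch_size = 32           # Keras default when batch_size is not passed

n_test = round(n_reviews * test_size)
n_train = n_reviews - n_test
n_val = round(n_train * validation_split)
n_fit = n_train - n_val
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_test, n_val, n_fit)  # 10000 12000 28000
print(steps_per_epoch)       # 875
```

Only 28,000 reviews update the weights in each epoch. The 12,000 validation reviews are used for monitoring, and the 10,000 test reviews stay untouched.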

<a id="save_model"></a>
## 4. Save the model

Save the Keras model and the fitted TF-IDF vectorizer to disk.

In [None]:
network.save('../models/sentiment_network_model.keras')
# Use a context manager so the file is flushed and closed after writing.
with open('../models/tfidf_text_vectorizer.pickle', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)