# Sentiment classification

Classifying the sentiment of texts (for example, movie or product reviews) has high practical significance in online marketing as well as in financial prediction. It is a non-trivial task, since the concept of sentiment is not easily captured. We will use the [IMDB sentiment](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) benchmark dataset from Stanford.

We will try out multiple models, namely:

1. TF-IDF + classical statistical model
2. LSTM classification model
3. LSTM model, where the embeddings are initialized with pre-trained word vectors
4. fastText model
5. BERT-based model

# Data download with tensorflow-datasets

In [None]:
!pip install tensorflow-datasets > /dev/null

In [None]:
import tensorflow_datasets as tfds

# load the IMDB reviews as (text, label) pairs, together with the dataset metadata
(ds_train, ds_test), ds_info = tfds.load(
    name="imdb_reviews",
    split=("train", "test"),
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

In [None]:
# the train and test splits contain 25000 examples each; we pass a larger value to take() so that no example is left out
# tfds.as_dataframe creates a dataframe out of the tensorflow dataset object
ds_train = tfds.as_dataframe(ds_train.take(30000), ds_info)
ds_test = tfds.as_dataframe(ds_test.take(30000), ds_info)

In [None]:
# let's check the number of entries and columns in each dataframe
print(ds_train.info())
print(ds_test.info())

<class 'tensorflow_datasets.core.as_dataframe.StyledDataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   25000 non-null  int64 
 1   text    25000 non-null  object
dtypes: int64(1), object(1)
memory usage: 390.8+ KB
None
<class 'tensorflow_datasets.core.as_dataframe.StyledDataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   25000 non-null  int64 
 1   text    25000 non-null  object
dtypes: int64(1), object(1)
memory usage: 390.8+ KB
None


In [None]:
# let's look at a few rows
# the text column holds bytes objects, which is why each row is shown with a b prefix;
# the text is decoded to UTF-8 in the cleaning step that follows, so ignore it for now
import pandas as pd
pd.set_option('display.max_colwidth', 500)
ds_train.head(5)

Unnamed: 0,label,text
0,0,"b""This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothi..."
1,0,"b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everyt..."
2,0,"b'Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do. <br /><br />But come on Hollywood - a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town? <br /><br />Nothing even remotely resembling that happened on the Canadian side of the border ..."
3,1,"b'This is the kind of film for a snowy Sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm-chair and mellow for a couple of hours. Wonderful performances from Cher and Nicolas Cage (as always) gently row the plot along. There are no rapids to cross, no dangerous waters, just a warm and witty paddle through New York life at its best. A family film in every sense and one that deserves the praise it received.'"
4,1,"b'As others have mentioned, all the women that go nude in this film are mostly absolutely gorgeous. The plot very ably shows the hypocrisy of the female libido. When men are around they want to be pursued, but when no ""men"" are around, they become the pursuers of a 14 year old boy. And the boy becomes a man really fast (we should all be so lucky at this age!). He then gets up the courage to pursue his true love.'"
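
The `b` prefix in the rows above marks `bytes` objects rather than Python strings. As a minimal sketch (the sample is a shortened copy of the first review, and the variable names are only illustrative), decoding looks like this:

```python
# A bytes value, like the entries in the text column
raw = b"This was an absolutely terrible movie."

# Printing bytes keeps the b prefix; it is not decoded automatically
print(raw)

# Decoding with UTF-8 yields an ordinary str without the prefix
text = raw.decode("utf-8")
print(text)
print(type(raw).__name__, "->", type(text).__name__)
```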


In [None]:
# As we can see, there are some stray characters and HTML line breaks in the text, so let's do some very basic cleaning
import re

# punctuation and brackets that are removed outright
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")

# HTML line breaks (<br /><br />), hyphens and slashes that are replaced with a space
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def preprocess_reviews(reviews):
  reviews = reviews.decode("utf-8")   # required because the text is stored as bytes

  # convert to lower case, drop punctuation, then replace the remaining patterns with spaces
  reviews = REPLACE_NO_SPACE.sub("", reviews.lower())
  reviews = REPLACE_WITH_SPACE.sub(" ", reviews)
  return reviews

ds_train['text'] = ds_train['text'].apply(preprocess_reviews)
ds_test['text'] = ds_test['text'].apply(preprocess_reviews)
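
To see what the two patterns do without loading the dataset, here is a small standalone sketch. The helper name `clean` and the sample review are made up for illustration; the regular expressions are the ones from the cell above, written as raw strings.

```python
import re

# Same patterns as in the cleaning cell
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def clean(raw: bytes) -> str:
    """Decode, lower-case, drop punctuation, then turn breaks, hyphens and slashes into spaces."""
    text = raw.decode("utf-8").lower()
    text = REPLACE_NO_SPACE.sub("", text)
    return REPLACE_WITH_SPACE.sub(" ", text)

# Hypothetical review, not taken from the dataset
sample = b"Don't miss it! <br /><br />A feel-good film (8/10)."
cleaned = clean(sample)

# Split on whitespace to make the resulting tokens easy to read
print(cleaned.split())
```

Note that the tag pattern only matches a doubled `<br /><br />`. A single `<br />` is not removed as a tag: only its slash is replaced, which leaves `<br  >` in the text. If single breaks occur in the data, the pattern may need to be loosened.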

In [None]:
# cleaned data
ds_train.head(5)

Unnamed: 0,label,text
0,0,this was an absolutely terrible movie dont be lured in by christopher walken or michael ironside both are great actors but this must simply be their worst role in history even their great acting could not redeem this movies ridiculous storyline this movie is an early nineties us propaganda piece the most pathetic scenes were those when the columbian rebels were making their cases for revolutions maria conchita alonso appeared phony and her pseudo love affair with walken was nothing but a pat...
1,0,i have been known to fall asleep during films but this is usually due to a combination of things including really tired being warm and comfortable on the sette and having just eaten a lot however on this occasion i fell asleep because the film was rubbish the plot development was constant constantly slow and boring things seemed to happen but with no explanation of what was causing them or why i admit i may have missed part of the film but i watched the majority of it and everything just see...
2,0,mann photographs the alberta rocky mountains in a superb fashion and jimmy stewart and walter brennan give enjoyable performances as they always seem to do but come on hollywood a mountie telling the people of dawson city yukon to elect themselves a marshal yes a marshal and to enforce the law themselves then gunfighters battling it out on the streets for control of the town nothing even remotely resembling that happened on the canadian side of the border during the klondike gold rush mr...
3,1,this is the kind of film for a snowy sunday afternoon when the rest of the world can go ahead with its own business as you descend into a big arm chair and mellow for a couple of hours wonderful performances from cher and nicolas cage as always gently row the plot along there are no rapids to cross no dangerous waters just a warm and witty paddle through new york life at its best a family film in every sense and one that deserves the praise it received
4,1,as others have mentioned all the women that go nude in this film are mostly absolutely gorgeous the plot very ably shows the hypocrisy of the female libido when men are around they want to be pursued but when no men are around they become the pursuers of a 14 year old boy and the boy becomes a man really fast we should all be so lucky at this age he then gets up the courage to pursue his true love


In [None]:
# Let's mount Google Drive and save the data there, instead of downloading it every time
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
# create a directory called IMDB under MyDrive, unless it already exists
path = '/content/drive/MyDrive'
try:
    os.mkdir(path + "/IMDB")
except FileExistsError:
    print("The IMDB directory already exists in your Drive")

In [None]:
# update the path to point to the IMDB directory and save the files there
path = '/content/drive/MyDrive/IMDB'
ds_train.to_pickle(path + "/train.pkl")
ds_test.to_pickle(path + "/test.pkl")