# Sentiment classification - close to the state of the art

The task of classifying sentiments of texts (for example movie or product reviews) has high practical significance in online marketing as well as financial prediction. This is a non-trivial task, since the concept of sentiment is not easily captured.

For this assignment you have to use the larger [IMDB sentiment](https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) benchmark dataset from Stanford, and achieve close to state-of-the-art results.

The task is to try out multiple models in ascending complexity, namely:

1. TFIDF + classical statistical model (e.g. RandomForest)
2. LSTM classification model
3. LSTM model, where the embeddings are initialized with pre-trained word vectors
4. fastText model
5. BERT based model (you are advised to use a pre-trained one and finetune, since the resource consumption is considerable!)

You should get over 90% validation accuracy (though nearly 94% is achievable).

You are allowed to use any library or tool, though the Keras environment and some wrappers on top of it (e.g. ktrain) make your life easier.





__Groups__
This assignment is to be completed individually, two weeks after the class has finished. For the precise deadline please see Canvas.

__Format of submission__
You need to submit a PDF of your Google Colab notebooks.

__Due date__
Two weeks after the class has finished. For the precise deadline please see Canvas.

Grade distribution:
1. TFIDF + classical statistical model (e.g. RandomForest) (25% of the final grade)
2. LSTM classification model (15% of the final grade)
3. LSTM model, where the embeddings are initialized with pre-trained word vectors, e.g. fastText, GloVe etc. (15% of the final grade)
4. fastText model (15% of the final grade)
5. BERT based model (you are advised to use a pre-trained one and finetune it, since the resource consumption is considerable!) (30% of the final grade). For BERT you should get over 90% validation accuracy (though nearly 94% is achievable).


__For each of the models, the marks will be awarded according to the following three criteria__:

(1) The (appropriately measured) accuracy of your prediction for the task. The more accurate the prediction is, the better. Note that you need to validate the predictive accuracy of your model on a hold-out of unseen data that the model has not been trained with.

(2) How well you motivate the use of the model: what in this model's structure makes it suited for representing sentiment? After using the model for the task, how well do you evaluate the accuracy you got, and how well do you discuss the main advantages and disadvantages the model has in this particular modelling task? Ideally, you use parts of your modelling to support your arguments.

(3) The consistency of your take-aways, i.e. what you have learned from your analyses. Also, analyze when the model is good and when and where it does not predict well.

Please make sure that you comment with # on the separate steps of the code you have produced. For the verbal description and analyses, please insert markdown cells.


__Plagiarism__: The Frankfurt School does not accept any plagiarism. Data science is a collaborative exercise and you can discuss the research question with your classmates from other groups, if you like. You must not copy any code or text though. Plagiarism will be prosecuted and will result in a mark of 0 and you failing this class.

After carefully reading this document and having had a look at the data you may still have questions. Please submit those questions to the public Q&A board in Canvas and we will answer each question, so

# Download with tensorflow-datasets

In [1]:
!pip install tensorflow-datasets > /dev/null

# 0. Load dataset for preprocessing

In [2]:
import tensorflow_datasets as tfds

import numpy as np   
import pandas as pd
import xgboost as xgb
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Dropout, LSTM, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing import sequence

In [3]:
(ds_train, ds_test),ds_info = tfds.load(
    name="imdb_reviews",
    split=["train","test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=True)

Downloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [4]:
df_train = tfds.as_dataframe(ds_train, ds_info)
df_test = tfds.as_dataframe(ds_test, ds_info)
df = pd.concat([df_train, df_test], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df['text'] = df['text'].str.decode("utf-8")

# 1. TFIDF + classical statistical model with XGBoost


In [5]:
import nltk  # For text pre-processing
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from bs4 import BeautifulSoup

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [6]:
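def clean_text(text):
    
    # remove html content (explicit parser avoids bs4's parser-guessing warning)
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # tokenize the text and drop punctuation:
    # \w+ picks out runs of alphanumeric characters as tokens and drops everything else
    tokenizer = RegexpTokenizer(r'\w+')
    word_tokens = tokenizer.tokenize(text)
    
    # lemmatizer used to reduce each token to its base form
    lemmatizer = WordNetLemmatizer()
    
    # remove stop words, then lower-case and lemmatize the remaining tokens
    # NOTE: the stop-word test runs on the original-case token, so capitalised
    # stop words such as "This" or "I" are kept (visible in the cleaned_text output)
    stop_words = set(stopwords.words("english"))
    text = [lemmatizer.lemmatize(token.lower()) for token in word_tokens if token not in stop_words]
    
    return " ".join(text)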
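
In `clean_text`, the stop-word test is applied to the original-case token, so capitalised stop words pass through. A minimal sketch of the alternative ordering (lower-case first, then filter) is shown here. It uses only the standard library: `STOP_WORDS` is a small hypothetical stand-in for NLTK's English list, `simple_clean` is an illustrative name that does not appear in the notebook, and lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word set (NLTK's English list is much longer).
STOP_WORDS = {"this", "is", "a", "the", "i", "it", "and", "to", "of"}

def simple_clean(text):
    # Replace anything that looks like an HTML tag (e.g. <br />) with a space.
    text = re.sub(r"<[^>]+>", " ", text)
    # Keep runs of alphanumeric characters, mirroring RegexpTokenizer(r"\w+").
    tokens = re.findall(r"\w+", text)
    # Lower-case BEFORE the stop-word test so "This" and "this" are treated alike.
    tokens = [t.lower() for t in tokens]
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(simple_clean("This movie is good.<br /><br />I loved it!"))  # movie good loved
```

Switching the notebook to this ordering would change `cleaned_text` and therefore all downstream accuracies, so the models would need to be re-run.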

In [7]:
df['cleaned_text'] = df['text'].apply(clean_text)
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df['label'], test_size=0.1, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
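
The two nested splits determine how many reviews end up in each set. The arithmetic can be sketched as follows, assuming the test share is rounded up and the remainder goes to training (which is how scikit-learn treats a float `test_size` as far as the documentation describes it). `split_sizes` is an illustrative helper, not a library function.

```python
import math

def split_sizes(n_samples, test_fraction):
    # Round the held-out part up; everything else stays in the training part.
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test

n_rest, n_test = split_sizes(50_000, 0.1)    # first split: hold out the test set
n_train, n_valid = split_sizes(n_rest, 0.1)  # second split: carve validation out of the rest
print(n_train, n_valid, n_test)  # 40500 4500 5000
```

So the validation set is 9% of the full data (10% of the remaining 90%), not 10%.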

In [8]:
# Tfidf transform features
vectorizer = TfidfVectorizer(strip_accents='ascii')
X_train_tfidf = vectorizer.fit_transform(X_train)
X_valid_tfidf = vectorizer.transform(X_valid)
X_test_tfidf = vectorizer.transform(X_test)
# cast df into DMatrix
dtrain_tfidf = xgb.DMatrix(X_train_tfidf, label=y_train)
dvalid_tfidf = xgb.DMatrix(X_valid_tfidf, label=y_valid)
dtest_tfidf = xgb.DMatrix(X_test_tfidf, label=y_test)
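
To make the TF-IDF features less of a black box, here is a minimal pure-Python sketch of the weighting, assuming the scheme that `TfidfVectorizer` documents as its default: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document. The function name `tfidf_rows` and the three toy documents are made up for illustration, and tokenization is reduced to `str.split`.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalise so long and short documents are comparable.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(["good movie", "bad movie", "dull movie"])
print(rows[0])
```

In the first toy document, "good" receives a larger weight than "movie" because "movie" occurs in every document and therefore carries little information about the class.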

In [9]:
params = {
    # Other parameters
    'objective':'binary:logistic',
    'eval_metric':"error"
}
num_boost_round = 999
model = xgb.train(
    params,
    dtrain_tfidf,
    num_boost_round=num_boost_round,
    evals=[(dvalid_tfidf, "Valid")],
    early_stopping_rounds=10
)

[0]	Valid-error:0.293778
Will train until Valid-error hasn't improved in 10 rounds.
[1]	Valid-error:0.262
[2]	Valid-error:0.254889
[3]	Valid-error:0.249556
[4]	Valid-error:0.238667
[5]	Valid-error:0.233778
[6]	Valid-error:0.228889
[7]	Valid-error:0.228222
[8]	Valid-error:0.223778
[9]	Valid-error:0.223111
[10]	Valid-error:0.222889
[11]	Valid-error:0.218222
[12]	Valid-error:0.215556
[13]	Valid-error:0.211556
[14]	Valid-error:0.209333
[15]	Valid-error:0.206889
[16]	Valid-error:0.206
[17]	Valid-error:0.205778
[18]	Valid-error:0.203111
[19]	Valid-error:0.200222
[20]	Valid-error:0.199778
[21]	Valid-error:0.196444
[22]	Valid-error:0.197111
[23]	Valid-error:0.196222
[24]	Valid-error:0.193111
[25]	Valid-error:0.192
[26]	Valid-error:0.190667
[27]	Valid-error:0.190444
[28]	Valid-error:0.189333
[29]	Valid-error:0.187556
[30]	Valid-error:0.186889
[31]	Valid-error:0.182444
[32]	Valid-error:0.181778
[33]	Valid-error:0.180889
[34]	Valid-error:0.177556
[35]	Valid-error:0.176889
[36]	Valid-error:0.17444
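
The stopping rule behind `early_stopping_rounds` can be sketched in a few lines. This is a simplified illustration of the idea (stop once the validation error has not improved for `patience` rounds and report the best round), not XGBoost's actual implementation. The function name and the error values are made up.

```python
def early_stopping_round(errors, patience):
    best_round, best_error = 0, float("inf")
    for rnd, err in enumerate(errors):
        if err < best_error:
            # New best validation error: remember where it happened.
            best_round, best_error = rnd, err
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop training.
            break
    return best_round, best_error

errors = [0.30, 0.26, 0.25, 0.27, 0.26, 0.255, 0.24]
print(early_stopping_round(errors, patience=3))   # (2, 0.25)
print(early_stopping_round(errors, patience=10))  # (6, 0.24)
```

The toy run shows the trade-off: a small patience stops at round 2 and never sees the better value at round 6.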

In [10]:
predictions_train = model.predict(dtrain_tfidf)
print(f"train accuracy: {accuracy_score(y_train,predictions_train>0.5)}")

train accuracy: 0.9514567901234567


In [11]:
predictions = model.predict(dtest_tfidf)
print(f"test accuracy: {accuracy_score(y_test,predictions>0.5)}")

test accuracy: 0.8598
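
The assignment text names RandomForest as an example of a classical model, whereas the cells above train XGBoost on the TF-IDF features. A minimal sketch of the RandomForest variant with scikit-learn follows. The four-sentence corpus is made up purely to show the API shape, and the hyperparameters are arbitrary assumptions, not tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, only to show the API shape; not IMDB data.
toy_texts = ["great film loved it", "wonderful acting great plot",
             "terrible film hated it", "boring plot awful acting"]
toy_labels = [1, 1, 0, 0]

# Vectorizer and classifier in one pipeline, so the vocabulary is fit on training data only.
rf_pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
rf_pipeline.fit(toy_texts, toy_labels)
toy_preds = rf_pipeline.predict(["great acting", "awful film"])
print(toy_preds)
```

In the notebook, `toy_texts` and `toy_labels` would be replaced by `X_train` and `y_train`, and accuracy would be measured on `X_valid` and `X_test` as before.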


In [12]:
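# NOTE: chained indexing writes to a temporary copy, so this line leaves df unchanged
# (the next cell's output has no pred_Tfidf column); to store the scores in df,
# use df.loc[X_test.index, 'pred_Tfidf'] = predictions instead
df.loc[X_test.index]['pred_Tfidf'] = predictions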

In [13]:
df.loc[X_test.index]

| | label | text | cleaned_text |
|---|---|---|---|
| 33553 | 0 | I supposed 'Scarecrow Gone Wild' is a dull sla... | i supposed scarecrow gone wild dull slasher fl... |
| 9427 | 0 | Undeveloped/unbelievable story line,(by the ti... | undeveloped unbelievable story line time i sor... |
| 199 | 1 | This movie is good for what it is, and unprete... | this movie good unpretentious i watch twice ho... |
| 12447 | 1 | This film provides us with an interesting remi... | this film provides u interesting reminder easy... |
| 39489 | 1 | Gentle and genial film seems to have been over... | gentle genial film seems overlooked triviality... |
| ... | ... | ... | ... |
| 39885 | 1 | This movie was very good, not great but very g... | this movie good great good it based one man pl... |
| 17566 | 1 | I must give How She Move a near-perfect rating... | i must give how she move near perfect rating c... |
| 16062 | 1 | Fair drama/love story movie that focuses on th... | fair drama love story movie focus life blue co... |
| 48445 | 0 | I'm sorry but this guy is not funny. I swear I... | i sorry guy funny i swear i heard heard 4 year... |


In [14]:
df_error_analysis = pd.concat([df.loc[X_test.index], pd.Series(predictions>0.5)], ignore_index=True,axis=1)

In [15]:
df_error_analysis=df.loc[X_test.index]

In [16]:
df_error_analysis.reset_index(inplace=True,drop=True)

In [17]:
df_error_analysis.insert(1, "prediction_tfidf", predictions>0.5)

In [18]:
pd.set_option('display.max_colwidth', None)  # None shows full cell contents; -1 is deprecated
df_error_analysis[df_error_analysis['label']!=df_error_analysis['prediction_tfidf']]


Unnamed: 0,label,prediction_tfidf,text,cleaned_text
8,1,False,"The basic storyline here is, Aditiya (Kumar) is the spoilt son of a millionaire, Ishwar (Bachan) who owns a toy industry, in Ishwar's eyes his son Aditya can do nothing wrong, Aditya's mother Sumitra (Shefali Shah) warns Ishwar to bring his son to the responsible path before it is too late, for Ishwar is a patient of lung cancer and has only 9 months to live, when his son elopes and marries Mitali (Chopra), Ishwar readily forgives Aditya, but when the happy couple Aditya and Mitali come back from a honeymoon, Mitali is pregnant, and this forces Ishwar to kick Aditya out of the house to make him more responsible, Aditya doesn't know his father is suffering from lung cancer, and he also doesn't know that his father has kicked him out of the hose to make him more responsible, Ishwar cannot bring himself to tall Aditya that he is about to die, with a hungry and pregnant wife. it is a race against time so Aditya does all he can to prove himself to his father, and the climax comes when Aditya gets his big break in the movie industry and his father tells him that he is about to die.<br /><br />This movie is absolutely brilliant, this is the breakthrough in Indian cinema that was needed for the Bollywood industry, Shah's directing is almost flawless, but which movie doesn't have flaws? The best part if this movie is the father son relationship which is a tearjerker. the song interludes is just placed at the right time, the scenery is good, the only part where this movie fails is where the jokes between Boman Irani and Rajpal Yadav the jokes are too long and after a bit they are annoying, but overall this is a brilliant movie, i advise anybody Reading this review to go and watch it regardless of other reviews. 
9/10",the basic storyline aditiya kumar spoilt son millionaire ishwar bachan owns toy industry ishwar eye son aditya nothing wrong aditya mother sumitra shefali shah warns ishwar bring son responsible path late ishwar patient lung cancer 9 month live son elopes marries mitali chopra ishwar readily forgives aditya happy couple aditya mitali come back honeymoon mitali pregnant force ishwar kick aditya house make responsible aditya know father suffering lung cancer also know father kicked hose make responsible ishwar cannot bring tall aditya die hungry pregnant wife race time aditya prove father climax come aditya get big break movie industry father tell die this movie absolutely brilliant breakthrough indian cinema needed bollywood industry shah directing almost flawless movie flaw the best part movie father son relationship tearjerker song interlude placed right time scenery good part movie fails joke boman irani rajpal yadav joke long bit annoying overall brilliant movie advise anybody reading review go watch regardless review 9 10
16,0,True,"This Worldwide was the cheap man's version of what the NWA under Jim Crockett Junior and Jim Crockett Promotions made back in the 1980s on the localized ""Big 3"" Stations during the Saturday Morning/Afternoon Wrestling Craze. When Ted Turner got his hands on Crockett's failed version of NWA he turned it into World Championship Wrestling and proceeded to drop all NWA references all together. NWA World Wide and NWA Pro Wrestling were relabeled with the WCW logo and moved off the road to Disney/MGM Studios in Orlando, Florida and eventually became nothing more than recap shows for WCW's Nitro, Thunder, and Saturday Night. Worldwide was officially the last WCW program under Turner to air the weekend of the WCW buyout from Vince McMahon and WWF. Today the entire NWA World Wide/WCW Worldwide Video Tape Archive along with the entire NWA/WCW Video Tape Library in general lay in the vaults of WWE Headquarters in Stamford,Connecticut.",this worldwide cheap man version nwa jim crockett junior jim crockett promotion made back 1980s localized big 3 station saturday morning afternoon wrestling craze when ted turner got hand crockett failed version nwa turned world championship wrestling proceeded drop nwa reference together nwa world wide nwa pro wrestling relabeled wcw logo moved road disney mgm studio orlando florida eventually became nothing recap show wcw nitro thunder saturday night worldwide officially last wcw program turner air weekend wcw buyout vince mcmahon wwf today entire nwa world wide wcw worldwide video tape archive along entire nwa wcw video tape library general lay vault wwe headquarters stamford connecticut
27,0,True,"Best thing I can say about this porno-horror film is: boobies boobies boobies !<br /><br />Beyond that, this film is made by some Hindu/Indian guy with some background in porn films or such .<br /><br />Plot: Talk-Show host and girlfriend are stalked by a psychopath who is angry over the plight of the homeless and takes it out on, you guessed it, beautiful real-estate agent ladies ! (films like these are why the slasher films of the 80's got a rep for misogyny)<br /><br />This film is not really a Slasher, but has the same sort of implausibilities and stereotypes: the dumb-ass cops, the villain is an old white male, and the women are busty babes . <br /><br />If you like porno-horror, this is your movie, otherwise stay away . (Adrienne fans will get to see her sagging breasts for a second or two)",best thing i say porno horror film booby booby booby beyond film made hindu indian guy background porn film plot talk show host girlfriend stalked psychopath angry plight homeless take guessed beautiful real estate agent lady film like slasher film 80 got rep misogyny this film really slasher sort implausibility stereotype dumb as cop villain old white male woman busty babe if like porno horror movie otherwise stay away adrienne fan get see sagging breast second two
30,1,False,"To be brief, the story is paper thin and you can see the ending coming from a mile away, but Gene Kelly, Rita Hayworth, and an impossibly young Phil Silvers keep the movie afloat throughout and at times lift it right up into the air. A few of the songs are terrible clunkers (""Poor John"" is a train wreck) but most of them are great fun, and the scene of Hayworth performing on the absurdly huge set for Kelly's rival has to be seen to be believed. Another treat is the perfect faux-NYC sets in the best Hollywood tradition.<br /><br />Another attraction, if you consider such things attractions, is the howlingly awful male ""chivalry"" toward women. The oily leering and transparent obsequiousness that passed for male charm back then (in the movies, at least) is presented in its most lurid form here. Some of the men are about like a cartoon wolf.<br /><br />One minor disappointment is Eve Arden trapped in a role so minor that she barely has a chance to do anything. I can imagine a lot of potential comic interplay between her and Silvers--a missed opportunity.",to brief story paper thin see ending coming mile away gene kelly rita hayworth impossibly young phil silver keep movie afloat throughout time lift right air a song terrible clunkers poor john train wreck great fun scene hayworth performing absurdly huge set kelly rival seen believed another treat perfect faux nyc set best hollywood tradition another attraction consider thing attraction howlingly awful male chivalry toward woman the oily leering transparent obsequiousness passed male charm back movie least presented lurid form some men like cartoon wolf one minor disappointment eve arden trapped role minor barely chance anything i imagine lot potential comic interplay silver missed opportunity
48,0,True,"Thomas Edison had no other reason to make this film except to show that film can capture the electrocution of an innocent elephant. Edison was not a genius but a man out for money and profit; his love for life was measured by dollars, not experiences, as this film shows.",thomas edison reason make film except show film capture electrocution innocent elephant edison genius man money profit love life measured dollar experience film show
...,...,...,...,...
4981,1,False,"SPOILER NOTHING BUT SPOILER<br /><br />I have to add my name to the list of folks who feel that the other viewers just don't get it. But no one has even mentioned the ""s"" word so far as I have seen.<br /><br />While I agree that the kid died I think we can be more specific: he committed suicide. He races down the slope in an old wagon, shoots off the cliff and...""flies away"". Maybe the whole account of the form of death is allegory or maybe he does commit suicide in a wagon as laid out. In either case, he ""flies away"" (c'mon, not that tough a metaphor).<br /><br />Maybe I just have a thing for Tom Hanks, but I was ok with the narration. Besides he is raising $ for the WW2 memorial and you gotta love him for that.<br /><br />Oh yeah, I loved the movie and found it incredibly moving.",spoiler nothing but spoileri add name list folk feel viewer get but one even mentioned word far i seen while i agree kid died i think specific committed suicide he race slope old wagon shoot cliff fly away maybe whole account form death allegory maybe commit suicide wagon laid in either case fly away c mon tough metaphor maybe i thing tom hank i ok narration besides raising ww2 memorial gotta love oh yeah i loved movie found incredibly moving
4987,0,True,"Independent film that would make Hollywood proud. The movie substitutes good looks for good acting, a cryptic plot for a good story line, and self-absorption for character development. May be I missed something, go see it for yourself.",independent film would make hollywood proud the movie substitute good look good acting cryptic plot good story line self absorption character development may i missed something go see
4990,0,True,"This feels very stilted and patronizing to a great extent. The whole plot is extremely forced - especially the ""gallant"" effort to save the college from ruin, and the moralistic overtone (especially by the leading lady) grates a bit.<br /><br />But there are one or two comic moments that do help relieve the boredom, and the dancing is quite fun (especially for alleged amateurs - ha, ha!)<br /><br />The shop proprietor and the young guy doing spectacular tap dancing were particular highlights. And I liked Peter Hayes impressions of Charles Laughton and Ronald Coleman as well.",this feel stilted patronizing great extent the whole plot extremely forced especially gallant effort save college ruin moralistic overtone especially leading lady grate bit but one two comic moment help relieve boredom dancing quite fun especially alleged amateur ha ha the shop proprietor young guy spectacular tap dancing particular highlight and i liked peter hayes impression charles laughton ronald coleman well
4995,1,False,"This movie was very good, not great but very good. It is based on a one man play by Ruben Santiago Hudson..yes he played most of the parts. On paper it looks like stunt casting. Yes let's round up all the black folks in Hollywood and put them in one movie. Halle Berry even produced it. The only name I didn't see was Oprah's ,thank god because it probably would of ended up being like a Hallmark movie. Instead this movie was not some sentimental mess. It was moving but not phony, the characters came and went with the exception of her husband, Pauline and the writer in question. The movie revolved around the universe of Nanny, Mrs Bill Crosby and how she raised the writer and took in people. Now being a jaded New Yorker when he said she took in sick people and old and then we see them going to a mental institution to pick up a man, I'm thinking looks like sister has a medicare scam going. Getting folks jobs and taking the medicare/caid checks But no she explains to Lou Gosset she just wants 25 bucks a week and did not want the money ahead of time. I think that part was put in the movie just for us jaded New Yorkers so we know she is not scamming the poor folks.(g) It was written by a New Yorker so he knows the deal(g).. She almost seems angelic and looking through a little boys eyes I can see why. She is married to a ne'er do well who is 17 years younger and fools around on her. Terrence Howard was born to play these type of parts. He was good but I would like to see him play something different. Markerson who plays Nanny is also very good. But for some reason the person who stood out to me was a small role played by Jeffery Wright. Where is this mans Oscar? He already won a Emmy and a Tony. He was in Shaft and he stole the movie. I did not even know who he was in this movie. He is a chameleon never the same. I never seen him play a bad part yet. This was a 5 minute role and he managed to make me both laugh and cry. 
I re-winded the scene few times ..one time because I didn't know who he was. His wife Carman Ejogo was excellent. I have seen her in roles before mostly mousy stuff. But she is so good here. I actually know people who act just like her. So it was very real to me Macy Grey who had one of the bigger parts was also very good. I was very happy that they did not kill Nanny off. I thought she was a goner in the beginning of the movie. BUT she was able to go home and start her old routine of taking care of people. There are women like that in most of our lives. People we might know or even lived with. Thank god for them, I do not know how they do it all of the time. I have a friend who lost 2 children and been through a lot of stuff but whenever I am feeling selfishly sorry for myself I call her and she always puts me in a good mood. THis movie is a tribute to all of those people. I only wish they they told us what happened to some of the characters like the the one armed man, Paulines boyfriend who is played by one of my favorite actors on HBO's The Wire, Omar, Rosie Perez's character and Richard the lesbian and Delroy Lindo's one arm man, he was mesmerizing in another small role.",this movie good great good it based one man play ruben santiago hudson yes played part on paper look like stunt casting yes let round black folk hollywood put one movie halle berry even produced the name i see oprah thank god probably would ended like hallmark movie instead movie sentimental mess it moving phony character came went exception husband pauline writer question the movie revolved around universe nanny mr bill crosby raised writer took people now jaded new yorker said took sick people old see going mental institution pick man i thinking look like sister medicare scam going getting folk job taking medicare caid check but explains lou gosset want 25 buck week want money ahead time i think part put movie u jaded new yorkers know scamming poor folk g it written new yorker know deal g 
she almost seems angelic looking little boy eye i see she married ne er well 17 year younger fool around terrence howard born play type part he good i would like see play something different markerson play nanny also good but reason person stood small role played jeffery wright where man oscar he already emmy tony he shaft stole movie i even know movie he chameleon never i never seen play bad part yet this 5 minute role managed make laugh cry i winded scene time one time i know his wife carman ejogo excellent i seen role mostly mousy stuff but good i actually know people act like so real macy grey one bigger part also good i happy kill nanny i thought goner beginning movie but able go home start old routine taking care people there woman like life people might know even lived thank god i know time i friend lost 2 child lot stuff whenever i feeling selfishly sorry i call always put good mood this movie tribute people i wish told u happened character like one armed man paulines boyfriend played one favorite actor hbo the wire omar rosie perez character richard lesbian delroy lindo one arm man mesmerizing another small role


1. Accuracy

*   Train: 95%
*   Test: 86%

2. Motivation

*   What in this model's structure makes it suited for representing sentiment?



*   How well do you evaluate the accuracy you got for this model?



*   Discuss the main advantages and disadvantages the model has in the particular modelling task.



3. Consistency

*   What you have learned from your analyses.



*   Also, analyze when the model is good and when and where it does not predict well.
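
One way to approach the last question is to split the misclassified reviews into false positives and false negatives before reading them. A minimal sketch with made-up labels and predictions follows. `error_breakdown` is an illustrative helper, not part of the notebook.

```python
def error_breakdown(labels, predictions):
    # Tally the four confusion-matrix cells for a binary task (1 = positive review).
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for y, p in zip(labels, predictions):
        if y == 1 and p == 1:
            counts["tp"] += 1
        elif y == 0 and p == 0:
            counts["tn"] += 1
        elif y == 0 and p == 1:
            counts["fp"] += 1  # negative review predicted as positive
        else:
            counts["fn"] += 1  # positive review predicted as negative
    return counts

toy_labels = [1, 1, 0, 0, 1, 0]       # hypothetical ground truth
toy_predictions = [1, 0, 0, 1, 1, 0]  # hypothetical model output
print(error_breakdown(toy_labels, toy_predictions))
# {'tp': 2, 'tn': 2, 'fp': 1, 'fn': 1}
```

On the notebook's data the same table could be obtained with `pd.crosstab(df_error_analysis['label'], df_error_analysis['prediction_tfidf'])`.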

# 2. LSTM model


In [19]:
df['length']=df['cleaned_text'].str.split().apply(len)

In [20]:
df['length'].describe()

count    50000.000000
mean     131.139260  
std      97.021983   
min      4.000000    
25%      71.000000   
50%      98.000000   
75%      159.000000  
max      1492.000000 
Name: length, dtype: float64

In [21]:
maxlen = 200 # the 75th percentile of review length is 159 tokens, so 200 covers more than 75% of reviews
vocab_size = 2000
oov_token = "<OOV>"
padding_type = "post"
trunction_type= "post"

# tokenize text into numbers
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token, lower=False)
tokenizer.fit_on_texts(X_train)
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_valid_seq = tokenizer.texts_to_sequences(X_valid)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# pad sequences
X_train_lstm_padded = sequence.pad_sequences(X_train_seq, maxlen=maxlen, padding=padding_type, truncating=trunction_type)
X_valid_lstm_padded = sequence.pad_sequences(X_valid_seq, maxlen=maxlen, padding=padding_type, truncating=trunction_type)
X_test_lstm_padded = sequence.pad_sequences(X_test_seq, maxlen=maxlen, padding=padding_type, truncating=trunction_type)

total_words = len(tokenizer.word_index) + 1   # add 1 because of 0 padding
word_index = tokenizer.word_index
print('Encoded X Train Shape ', X_train_lstm_padded.shape, '\n')
print('Encoded X Valid Shape ', X_valid_lstm_padded.shape, '\n')
print('Encoded X Test Shape ', X_test_lstm_padded.shape, '\n')
print('Maximum review length: ', maxlen, '\n')

Encoded X Train Shape  (40500, 200) 

Encoded X Valid Shape  (4500, 200) 

Encoded X Test Shape  (5000, 200) 

Maximum review length:  200 
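
What `pad_sequences` does with `padding="post"` and `truncating="post"` can be illustrated in plain Python. `pad_post` is a made-up helper that mimics the behaviour for a single sequence. Note that Keras defaults to `"pre"` for both options, so the cell above overrides them deliberately.

```python
def pad_post(seq, maxlen, value=0):
    # truncating="post": keep the first maxlen ids and drop the tail.
    seq = list(seq)[:maxlen]
    # padding="post": append the padding value until the length is maxlen.
    return seq + [value] * (maxlen - len(seq))

print(pad_post([5, 8, 2], 5))              # [5, 8, 2, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], 5))  # [1, 2, 3, 4, 5]
```

With post-truncation, anything a reviewer writes after token 200 is discarded, which matters for reviews that state their verdict only at the end.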



In [22]:
# Model Parameters
EMBED_DIM = 200
LSTM_UNITS = 64
# Training Settings
batch_size = 128
epochs = 5
# Build Model
model = Sequential()
model.add(Embedding(input_dim = total_words, 
                    output_dim = EMBED_DIM, 
                    input_length = maxlen))
model.add(Bidirectional(LSTM(LSTM_UNITS, return_sequences=True)))
model.add(Bidirectional(LSTM(LSTM_UNITS, return_sequences=True)))
model.add(Bidirectional(LSTM(LSTM_UNITS,)))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile Model
model.compile(optimizer = 'adam', 
              loss = 'binary_crossentropy', 
              metrics = ['accuracy'])
# Model Summary (summary() prints the table itself and returns None)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 200, 200)          17232200  
_________________________________________________________________
bidirectional (Bidirectional (None, 200, 128)          135680    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 200, 128)          98816     
_________________________________________________________________
bidirectional_2 (Bidirection (None, 128)               98816     
_________________________________________________________________
dense (Dense)                (None, 8)                 1032      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 9         
Total params: 17,566,553
Trainable params: 17,566,553
Non-trainable params: 0
____________________________________________
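
The parameter counts in the summary can be reproduced by hand, which helps when judging where the model's capacity sits. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights and a bias, doubled for the two directions). The 86,161 embedding rows are inferred from the summary (17,232,200 / 200), and the helper names are illustrative.

```python
def embedding_params(n_rows, embed_dim):
    return n_rows * embed_dim

def bilstm_params(input_dim, units):
    # 4 gates x (input weights + recurrent weights + bias), times 2 directions.
    return 2 * 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    return input_dim * units + units

layer_params = [
    embedding_params(86161, 200),  # 17,232,200
    bilstm_params(200, 64),        # 135,680
    bilstm_params(128, 64),        # 98,816 (input is the 2 x 64 output of the previous layer)
    bilstm_params(128, 64),        # 98,816
    dense_params(128, 8),          # 1,032
    dense_params(8, 1),            # 9
]
print(sum(layer_params))  # 17566553
```

About 98% of the parameters sit in the embedding layer. Since the tokenizer is created with `num_words=vocab_size`, the sequences should only contain ids below `vocab_size`, so an embedding with `input_dim=vocab_size` (400,000 parameters) would presumably be sufficient. This is an assumption worth checking against the encoded data.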

In [23]:
# Train Model
tf.keras.backend.clear_session()
history = model.fit(X_train_lstm_padded, 
                    y_train, 
                    batch_size = batch_size, 
                    epochs = epochs, 
                    validation_data=(X_valid_lstm_padded, y_valid))

Epoch 1/5

KeyboardInterrupt: ignored

In [None]:
# Evaluate Model
y_pred = (model.predict(X_test_lstm_padded, batch_size = batch_size)> 0.5).astype("int32").ravel()  # flatten (n, 1) to (n,)
score_lstm = accuracy_score(y_true=y_test, y_pred=y_pred)
print(f"train accuracy: {history.history['accuracy'][-1]}")
print(f'test accuracy: {score_lstm}')

In [None]:
df_error_analysis.insert(1, "prediction_lstm", y_pred)

In [None]:
df_error_analysis[df_error_analysis['label']!=df_error_analysis['prediction_lstm']]

1. Accuracy

*   Train: 89%
*   Test: 87%

2. Motivation

*   What in this model's structure makes it suited for representing sentiment?



*   How well do you evaluate the accuracy you got for this model?



*   Discuss the main advantages and disadvantages the model has in the particular modelling task.



3. Consistency

*   What you have learned from your analyses.



*   Also, analyze when the model is good and when and where it does not predict well.

# 3. LSTM model with pre-trained embedding


In [None]:
from pathlib import Path

# Glove file
my_file = Path("glove.6B.zip")
if not my_file.is_file():
    # download and unpack the GloVe vectors only if the archive is missing
    !wget http://nlp.stanford.edu/data/glove.6B.zip
    !unzip glove.6B.zip
    !ls

In [None]:
embeddings_index = {}
with open('glove.6B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

In [None]:
# one row per token id, one column per GloVe dimension (EMBED_DIM, not maxlen)
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index keep their all-zero row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
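
Before training on frozen pre-trained vectors it is worth checking how much of the vocabulary they cover, because every missing word is represented by an all-zero row. A self-contained sketch with three made-up 4-dimensional vectors in GloVe's text format follows. `toy_word_index` is a hypothetical tokenizer index, not the notebook's `word_index`.

```python
import io
import numpy as np

# Made-up vectors in GloVe's text format: a word followed by its coefficients.
glove_text = io.StringIO(
    "good 0.1 0.2 0.3 0.4\n"
    "bad -0.1 -0.2 -0.3 -0.4\n"
    "film 0.5 0.5 0.5 0.5\n"
)
toy_vectors = {}
for line in glove_text:
    word, *coefs = line.split()
    toy_vectors[word] = np.asarray(coefs, dtype="float32")

toy_word_index = {"<OOV>": 1, "good": 2, "film": 3, "scarecrow": 4}  # hypothetical
toy_dim = 4
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim))  # row 0 is reserved for padding
hits = 0
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec
        hits += 1

coverage = hits / len(toy_word_index)
print(toy_matrix.shape, coverage)  # (5, 4) 0.5
```

A low coverage on the real vocabulary would be one candidate explanation if the GloVe model underperforms the model with trainable embeddings.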

In [None]:
# Build model
model_glove = Sequential()
model_glove.add(Embedding(input_dim = total_words, 
                          output_dim = EMBED_DIM, 
                          input_length = maxlen,
                          weights=[embedding_matrix], 
                          trainable=False))
model_glove.add(Bidirectional(LSTM(LSTM_UNITS*2, return_sequences=True)))
model_glove.add(Bidirectional(LSTM(LSTM_UNITS*2, return_sequences=True)))
model_glove.add(Bidirectional(LSTM(LSTM_UNITS,)))
model_glove.add(Dense(8, activation='relu'))
model_glove.add(Dense(1, activation='sigmoid'))

# Compile model
model_glove.compile(loss='binary_crossentropy',
                    optimizer='adam',
                    metrics=['accuracy'])

model_glove.summary()

In [None]:
# Train model
tf.keras.backend.clear_session()
history_glove=model_glove.fit(X_train_lstm_padded, 
                y_train, 
                batch_size = batch_size, 
                epochs = epochs, 
                validation_data=(X_valid_lstm_padded, y_valid))

In [None]:
# Evaluate performance
y_pred_glove = (model_glove.predict(X_test_lstm_padded, batch_size = batch_size)> 0.5).astype("int32").ravel()  # flatten (n, 1) to (n,)

score_lstm_glove = accuracy_score(y_true=y_test, y_pred=y_pred_glove)
print(f'train accuracy: {history_glove.history["accuracy"][-1]}')
print(f'test accuracy: {score_lstm_glove}')

In [None]:
df_error_analysis.insert(1, "prediction_lstm_glove", y_pred_glove)

In [None]:
df_error_analysis[df_error_analysis['label']!=df_error_analysis['prediction_lstm_glove']]

1. Accuracy

*   Train: 84%
*   Test: 80%

2. Motivation

*   What in this model's structure makes it suited for representing sentiment?



*   How well do you evaluate the accuracy you got for this model?



*   Discuss the main advantages and disadvantages the model has in the particular modelling task.



3. Consistency

*   What you have learned from your analyses.



*   Also, analyze when the model is good and when and where it does not predict well.