# Natural Language Processing Project | Sentiment Analysis on IMDB dataset Using BERT 

Sentiment analysis is one of the key areas of research in NLP and sequence modelling. We will fine-tune a BERT model to classify each review into one of two classes: positive or negative sentiment.

In [None]:
# Install the library first so the import below succeeds on a fresh runtime.
!pip install simpletransformers

import pandas as pd
import seaborn as sns
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

Collecting simpletransformers
  Downloading https://files.pythonhosted.org/packages/35/ef/0b70ae95138064d665d9298c4d96afba2edf4b86dc44f762807ceb12668e/simpletransformers-0.61.4-py3-none-any.whl (213kB)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
path = '/content/drive/My Drive/NLP/imdb.csv'

In [None]:
df = pd.read_csv(path)

In [None]:
# Encode the sentiment as an integer label: 1 = positive, 0 = negative.
df['label'] = (df['sentiment']=='positive').astype(int)

In [None]:
df.rename({'review': 'text'}, axis=1, inplace=True)
df.drop('sentiment', axis=1, inplace=True)

In [None]:
df.head()

Unnamed: 0,text,label
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [None]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [None]:
# Hold out 20% of the reviews for validation (no random_state, so the split differs per run).
df_train, df_valid = train_test_split(df, test_size=0.2)

In [None]:
args = {
    'fp16': False,                  # train in full precision
    'wandb_project': 'bert-imdb',   # log the run to Weights & Biases
    'num_train_epochs': 3,
    'overwrite_output_dir': True,
    'learning_rate': 1e-5,
}

## Model | Classification Model

In [None]:
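# Fine-tune bert-large-cased for binary classification (use_cuda=True requires a GPU).
model = ClassificationModel('bert', 'bert-large-cased', use_cuda=True, args=args)
# simpletransformers looks for columns named 'text' and 'labels'. This frame uses
# 'label', so the library falls back to column order (see the warning in the output).
model.train_model(df_train, output_dir='bert-imdb')
result, model_outputs, wrong_predictions = model.eval_model(df_valid)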

Some weights of the model checkpoint at bert-large-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at 

  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."



  "Dataframe headers not specified. Falling back to using column 0 as text and column 1 as labels."



0,1
Training loss,0.00072
lr,0.0
global_step,15000.0
_runtime,12346.0
_timestamp,1619649058.0
_step,299.0




In [None]:
df_valid.shape

(10000, 2)
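
The step count in the run summary can be cross-checked with a little arithmetic. The sketch below assumes a training batch size of 8 and no gradient accumulation. Neither is set in `args`; the batch size is inferred from the logged `global_step` of 15000 and is believed to be the simpletransformers default. The helper `steps_per_epoch` is a made-up name used only for illustration.

```python
import math

def steps_per_epoch(n_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once."""
    return math.ceil(n_examples / batch_size)

n_total = 50_000             # rows reported by df.describe()
n_valid = 10_000             # df_valid.shape[0], i.e. test_size=0.2
n_train = n_total - n_valid  # 40,000 rows left for training
batch_size = 8               # assumption: inferred from the logs, not set in args
epochs = 3                   # num_train_epochs in args

total_steps = steps_per_epoch(n_train, batch_size) * epochs
print(total_steps)  # 15000, matching global_step in the run summary
```

With the same assumed batch size, evaluating the 10,000 validation rows takes 1,250 batches.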

## Evaluation on Twitter Dataset and Random Reviews

In [None]:
model.predict(['The movie was really good'])

([1], array([[-3.71797347,  3.2923696 ]]))

In [None]:
model.predict(['The movie was really bad'])

([0], array([[ 4.15009832, -3.90226984]]))

In [None]:
model.predict(["FUCK YOU @apple DIE IN A FUCKING BLAZE INFERNO."])

([0], array([[ 3.12059498, -3.34273005]]))

In [None]:
model.predict(["Asked #Siri Where's Baby Lisa? and was told sorry I'm having trouble connecting to the network right now. @apple server fail. #ios5"])

([0], array([[ 1.33585036, -1.82419813]]))
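
`model.predict` returns a tuple: the predicted class indices and the raw model outputs, one row of two scores per input. Assuming those scores are unnormalised logits ordered as (negative, positive), a softmax turns them into probabilities that are easier to compare. The `softmax` helper below is written for illustration and is not part of simpletransformers.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Raw outputs copied from two of the predictions above.
examples = {
    "The movie was really good": [-3.71797347, 3.2923696],
    "#Siri tweet": [1.33585036, -1.82419813],
}

for text, logits in examples.items():
    p_neg, p_pos = softmax(logits)
    print(f"{text}: P(negative)={p_neg:.3f}, P(positive)={p_pos:.3f}")
```

The gap between the two scores is smaller for the tweet than for the movie sentence, so the tweet's negative prediction is the less confident of the two.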

### Random movie reviews taken from Google

In [None]:
r1 = model.predict(["Im just gonna start off by saying I LOVE this movie.Its one of my favorites of all time. I honestly cant think of too much wrong with this movie other than its a little long and Batmans by now infamous voice. But everything else is top notch. The acting,story,atmosphere,and actions scenes are all amazing. If you haven't seen this movie see it right now! I went into this not expecting to much but I came out blown away, I cant imagine any movie being much better. I'll just have to wait for The Dark Knight Rises to release to see if anything can be better. Until then, this stands as the best movie I've ever seen"])
r2 = model.predict(["This movie was on TV once so I decided to watch it since I wouldn't have to pay any money for it.The main character Will (played by Matt Lanter) has a dream where he meets a stone age Amy Winehouse (I think it's supposed to be a joke) who tells him that the world is going to end the day this movie premiered in the cinema (Coincidence?) and to stop it they must find a crystal skull. Matt later wakes up to celebrate his super-sweet sixteenth birthday (despite him being in his twenties) in a scene where we get one unfunny joke and celebrity impersonation after another. Then disaster strikes (it seems kinda redundant though since this movie already is one), hurricanes, earthquakes, meteorites and other classic disaster movie ingredients hit planet earth one after another. Will, followed by his friends: Juney (Crista Flanagan), Calvin (Gary \"G Thang\" Johnson), and Lisa (Kim Kardashian) go out into the city and tries to find his girlfriend and a safe place and later realizes that he has to find the crystal skull to set things right.The problem with this movie is, just like other movies by Jason Friedberg and Aaron Seltzer, that it doesn't stay on the theme but goes all over the place and try to spoof almost every popular movie that was made that year. And I use the term \"spoof\" lightly. Once again \"Seltzerberger\" show that they only grasp the most superficial concept of what humor is and never really bother to dig deeper and see what it is that makes things funny. Sometimes doing things outside the theme can work but not if it takes up a majority of the movie. And (for me) this movie is worse than Epic Movie. Yes you read right, Worse than Epic Movie. That movie at least had a story. Sure it was borrowed and \"crapified\" but at least it was a story. "
                    "In this movie, everything that happens during the second act, when they try to find a safe place/figure out where they should go, just feels like a filler where the gang stumble into one reference after another. \"Seltzerberger\'s\" over-reliance on potty humor, movie/TV references, random musical numbers, deliberately obvious stunt-doubles and crappy special effects does not save them this time.Seltzer and Friedberg, your movie sucks horribly. If I may paraphrase a line from \'Billy Madison\' I\'d like to say: I award you only one star, and may God have mercy on your souls.Once again, if you want to see a GOOD movie made in the style that this train wreck was trying (and failing) to emulate, watch \"Hotshots\" \"Airplane!\", \"The naked gun\" movies, \"Top Secret\" instead."])
r3 = model.predict(["I was lucky enough to go to a pre-screening of Hancock last night and I really enjoyed it. I don't understand all of the criticism this movie is receiving. Everyone take a second and realize this is not a Marvel or DC comic book superhero movie. Now think about that again. It is a different story entirely and has some very unique elements.Hancock isn't action packed. It doesn't have a Superhero vs. Supervillan plot. I would probably describe it as a character study of the superhero. I think this movie does a better job of addressing some of the issues (and vices) a superhero probably would have if they existed today. The biggest conflict in the movie is within Will Smith's character's attitude, not necessarily good vs. evil.I think much of the criticism I have read about is motivated by expectations that were not met, which isn't fair at all. If you watch Hancock with only the expectation of being entertained, you will leave happy. Its a good movie, don't jump on the bandwagon of not liking it just because you can. Give it a chance and take it for what it is, a July 4th action/comedy."])
r4 = model.predict(["While I had the misfortune to see 'Bright' in a theater, most people will simply press 'play' out of curiosity on their Roku remote. I am willing to concede that this might elevate the experience a little ... the ability to take a quick trip to the kitchen or restroom after shouting 'no, don't pause it' to your partner on the couch will be liberating."])
r5 = model.predict(["This film was just brilliant and great casting location scenery story direction everyones really suited the part they played and you could just imagine being there robert is an amazing actor and now the same being director father came from the same scottish island as myself so loved the fact there was real connection with this film is brilliant so much that bought the film as soon as it was released for and would recommend it to everyone to watch and the fly was amazing really funny at the end it must have been good and this definitely was also to the two little that played the of norman and paul they were just brilliant children are often left out of the list think because the stars that play them all grown up are such a big for the whole film but these children are amazing and should be for what they have done do not you think the whole story was so lovely."])

print("First Review is a Positive and Model predicted: [", r1, "]. If prediction is 1 then it's the positive review otherwise negative")
print("Second Review is a Negative and Model predicted: [", r2, "]. If prediction is 1 then it's the positive review otherwise negative")
print("Third Review is a Positive and Model predicted: [", r3, "]. If prediction is 1 then it's the positive review otherwise negative")
print("Fourth Review is a Negative and Model predicted: [", r4, "]. If prediction is 1 then it's the positive review otherwise negative")
print("Fifth Review is a Positive and Model predicted: [", r5, "]. If prediction is 1 then it's the positive review otherwise negative")

First Review is a Positive and Model predicted: [ ([1], array([[-4.15010548,  3.63698387]])) ]. If prediction is 1 then it's the positive review otherwise negative
Second Review is a Negative and Model predicted: [ ([0], array([[ 2.96931982, -2.68578029]])) ]. If prediction is 1 then it's the positive review otherwise negative
Third Review is a Positive and Model predicted: [ ([1], array([[-4.19277477,  3.65910673]])) ]. If prediction is 1 then it's the positive review otherwise negative
Fourth Review is a Negative and Model predicted: [ ([1], array([[-2.62061453,  2.04822135]])) ]. If prediction is 1 then it's the positive review otherwise negative
Fifth Review is a Positive and Model predicted: [ ([1], array([[-4.12974215,  3.60540986]])) ]. If prediction is 1 then it's the positive review otherwise negative
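
To make this spot check easier to read, the predicted classes can be tallied against the expected ones. The two lists below are transcribed by hand from the printed output; in the notebook the same predictions should be available as `r1[0][0]` through `r5[0][0]`.

```python
# Expected sentiment for the five reviews (1 = positive, 0 = negative).
expected = [1, 0, 1, 0, 1]
# Predicted classes, transcribed from the output above.
predicted = [1, 0, 1, 1, 1]

for i, (want, got) in enumerate(zip(expected, predicted), start=1):
    status = "correct" if want == got else "WRONG"
    print(f"Review {i}: expected {want}, predicted {got} -> {status}")

n_correct = sum(want == got for want, got in zip(expected, predicted))
print(f"{n_correct} of {len(expected)} correct")
```

Four of the five reviews are classified as expected. The miss is the fourth review, which is labelled negative here but predicted as positive.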


In [None]:
result

{'auprc': 0.9648316752352842,
 'auroc': 0.9647824743412426,
 'eval_loss': 0.5500777954512276,
 'fn': 469,
 'fp': 525,
 'mcc': 0.8012249368646528,
 'tn': 4440,
 'tp': 4566}
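
As a sanity check on how these metrics relate, the `mcc` entry can be reproduced from the four confusion counts. This is a minimal sketch using only the standard library, with the counts from `result` hard-coded.

```python
import math

tp, tn, fp, fn = 4566, 4440, 525, 469  # copied from `result`

# Confusion matrix: rows are the actual class, columns the predicted class.
print("            pred 0  pred 1")
print(f"actual 0    {tn:6d}  {fp:6d}")
print(f"actual 1    {fn:6d}  {tp:6d}")

# Matthews correlation coefficient from the four counts.
numerator = tp * tn - fp * fn
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = numerator / denominator
print(round(mcc, 4))  # 0.8012, in line with result['mcc']
```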

### Accuracy

In [None]:
# Accuracy: fraction of validation reviews classified correctly.
accuracy = (result['tp']+result['tn'])/(result['tp']+result['tn']+result['fp']+result['fn'])
print(accuracy)

0.9006
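
The quantity computed above, (tp + tn) / total, is the accuracy. The F1 score is the harmonic mean of precision and recall and can be derived from the same counts. A minimal sketch, with the numbers from `result` hard-coded so that it runs on its own:

```python
tp, tn, fp, fn = 4566, 4440, 525, 469  # copied from `result`

precision = tp / (tp + fp)  # share of predicted positives that are truly positive
recall = tp / (tp + fn)     # share of actual positives that the model found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because the validation set is close to balanced and the two error counts are similar, F1 lands close to the accuracy of 0.9006.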
