In [1]:
import pandas as pd
import numpy as np

In [2]:
# read in files
log = pd.read_csv('evaluationlogistic.csv', index_col = 'Unnamed: 0')
bert = pd.read_csv('bert-data.csv')

In [3]:
log.head()

Unnamed: 0,Result,Text
0,FN,<br /><br />When I unsuspectedly rented A Thou...
1,TP,This is the latest entry in the long series of...
2,TN,This movie was so frustrating. Everything seem...
3,FN,"I was truly and wonderfully surprised at ""O' B..."
4,TN,This movie spends most of its time preaching t...


In [4]:
bert.head()

Unnamed: 0,Result,Text
0,TP,<br /><br />When I unsuspectedly rented A Thou...
1,TP,This is the latest entry in the long series of...
2,TN,This movie was so frustrating. Everything seem...
3,TP,"I was truly and wonderfully surprised at ""O' B..."
4,TN,This movie spends most of its time preaching t...


In [5]:
wrong = ['FP', 'FN']
correct = ['TP', 'TN']
# BERT errors on examples where logistic regression gave a different (correct) result
lst = []
for i in range(len(bert)):
    if bert['Result'][i] != log['Result'][i] and bert['Result'][i] in wrong:
        lst.append(i)

In [6]:
# take a copy so new columns can be added without a SettingWithCopyWarning
incorrect_bert = bert.loc[bert.index[lst]].copy()
incorrect_bert.shape

(59, 2)

In [7]:
# logistic regression errors on examples where BERT gave a different (correct) result
lst1 = []
for i in range(len(bert)):
    if bert['Result'][i] != log['Result'][i] and log['Result'][i] in wrong:
        lst1.append(i)

In [8]:
# take a copy so new columns can be added without a SettingWithCopyWarning
incorrect_log = log.loc[log.index[lst1]].copy()
incorrect_log.shape

(202, 2)
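
The two loops above split the disagreements by which model was wrong. As a minimal sketch of the same bookkeeping, assuming each model's outcomes are available as equal-length lists of `TP`/`TN`/`FP`/`FN` labels in the same row order, the helper below counts all four combinations in one pass. The name `tally_outcomes` is hypothetical and the labels are toy data, not the contents of the real CSV files.

```python
from collections import Counter

WRONG = {"FP", "FN"}

def tally_outcomes(bert_results, log_results):
    """Count rows by which of the two models got them wrong."""
    if len(bert_results) != len(log_results):
        raise ValueError("result lists must be aligned row by row")
    counts = Counter()
    for bert_label, log_label in zip(bert_results, log_results):
        bert_wrong = bert_label in WRONG
        log_wrong = log_label in WRONG
        if bert_wrong and log_wrong:
            counts["both wrong"] += 1
        elif bert_wrong:
            counts["only BERT wrong"] += 1
        elif log_wrong:
            counts["only LR wrong"] += 1
        else:
            counts["both correct"] += 1
    return counts

# Toy labels for illustration only (not taken from the real files).
toy_bert = ["TP", "TP", "TN", "FP", "FN"]
toy_log = ["FN", "TP", "TN", "TN", "FN"]
print(tally_outcomes(toy_bert, toy_log))
```

If both files describe the same reviews with the same gold labels, the "only BERT wrong" and "only LR wrong" counts on the real data should match the 59 and 202 rows found above.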

## Bert Analysis
Let's first examine some of the samples that BERT predicted incorrectly, to get a sense of where its weaknesses lie.

In [9]:
# let's take a sample here
### Indexes used -> ind = [961, 297, 445, 891, 574, 805, 755, 554, 720, 499]
#checkbert = incorrect_bert.sample(10)
ind = [961, 297, 445, 891, 574, 805, 755, 554, 720, 499]

In [10]:
# all false positives used here
#lstcheckbert = list(checkbert['Text'])
#for i in range(len(lstcheckbert)):
#    with open("bert-samples/text" + str(i) + ".txt", 'w') as file:
#        file.write(lstcheckbert[i])

## Analysis of samples examined here thus far
### What does BERT struggle with?
Looking at the sampled texts above, all of which were false positives from BERT that logistic regression classified correctly, a few things stood out straight away. BERT appeared to struggle with shorter texts: half of the ten sampled reviews were under 170 words long when split on whitespace. I looked into this further below and found that roughly two thirds (38 of 59) of the reviews BERT got wrong and LR got right were 250 words or fewer, which I would regard as short. The content of the sampled files also showed a common theme: BERT seemed to struggle with long-distance dependencies, where an early passage is overturned later in the review. Many of these reviews start on a relatively positive note. In the file "text1.txt", for example, we see this contrast:

***First - let's start the with good.***

The author then lists the aspects of the film they enjoyed. The split in this piece is something BERT appears to struggle with: the positive context from the start seems to be carried through the rest of the review. It similarly appears to struggle with negation at various points in the text. Here's another example from "text1.txt":

***Do you remember in the movie "From Dusk til Dawn?". The movie started out interesting, then halfway through the movie it just took a degrading turn? Yep - same thing here. I would venture to say that the writers started with a concept, then had no idea what to do with it.***

Taken in isolation, "the movie started out interesting" would be regarded as positive sentiment, but the rest of the passage immediately overturns it. BERT's difficulty with negation can be seen again in text9.txt:

***The Alfred Newmann Score has to be the most sensual and seductive score Hollywood ever produced. It's a shame it is no longer available on CD. The actors, however, never rise to the occasion. The accents are so varied, from the subdued British of Ustinov and Purdom to the Hollywood of Baxter and Mature that it seems a true hodgepodge with no central vision. Tommy Rettig is jarringly American. Acting styles span the range from zombie-like to stilted. Only Ustinov as a conniving one-eyed servant steals the show - what there is of it to steal.***

BERT appears not to pick up on the negation here ("The actors, however, never rise to the occasion"). This is consistent with this [paper](https://watermark.silverchair.com/tacl_a_00298.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAApwwggKYBgkqhkiG9w0BBwagggKJMIIChQIBADCCAn4GCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQML1Ykm7hyGPbSo-AlAgEQgIICT1WXO0-zaJ4FY2Jg47BKlRVD9owsMj4cxwtvgvAVYHQOzu3T1mcGsWXXY-sU9avOATz2db6BBsOXhYTSoX4eqzq1hGiBfRdk_X071Xt3rHxZUD3bMVUHGrd51VxZ1fkvY1Vy7iM_BPXdmx_2zuSaFz311-mjggmRemdFp9CCJEEgT483_jUPHg-fgBbEXHWm_-8f5uZnnF7LnaRZvPuG02M_XHjD3PMJXw7gnuJ44jAzMFLGunpoAlZ1F6ZpY4wAhUtzL7EfvyAOG1LVJuIZUAjq8DcQPgmOLgchOpq-LuJ227VO4lPceyuKoFZKD_T9AzwFecnB218SNsrVOCdhGpz929AzCgjF4bSOW1bkb2wCY8k8QaI2BxGaVu4_aGAnFNLq3L_8UhRR_ygkg-jc-lJ6lcbmM5oNbYVtkwAg5PP5hi2DulBvk7S9q7WktnB9S7H5HSFOAj9OK4sv2KFrD6EJvpQdhWTqmL-JR1uQ1mGXJ_rK6rY4UU4LJwcKqAF_wb1cEBrbvxVrV96H869c8w_ZG6jb6_fMqVtkcMerDHxZV05xejuTc1d1XDMPABq_2AgpO7iPYxBTIsDZi5pbIZBxnOLSaezjhUroyJYQ7iBwguyOv2gM63BEz6LzSH-I-gLU5DFOcnjuuzcGyXE9SEFOGuXpJ28B6oBR5gnhE8fcUp0q4Z2WeYk28vSiQe-8B14ZFdUcJoiOL4RtCVtAxgxvAoVR5KYlBMy8SaFCA5PPcfMFPLieWFvolJuxS2O4298ojzN4hGZ0Ecbl2YkgUw), which reports that BERT does indeed struggle with "the contextual impacts of negation". Other things it appears to struggle with are sarcasm, poor sentence structure and informal language. All of these can be seen in "text6.txt". Informal words such as "crappy" and capitalized text such as "I CANNOT BELIEVE THAT DAN SCHNEIDER WOULD GO THIS LOW AND MAKE SOMETHING THIS CRAPPY!!! IT'S HORRIBLE!!!" very clearly express negative sentiment, yet BERT labelled this review positive. There is also quite a lot of sarcasm in this piece, such as "because a LOT of people have an Army veteran for a dad, an artist for a brother, and a popular teen web show taped and produced with thousands of dollars of equipment!".
Many sentiment classifiers would struggle with sarcasm, but this piece in particular should not really be misclassified, given its very prominent displeasure at what was being reviewed.

In [11]:
# word count per review, splitting on whitespace
l = [len(text.split()) for text in incorrect_bert['Text']]

In [12]:
incorrect_bert['Length'] = l

In [13]:
incorrect_bert.loc[incorrect_bert['Length'] <= 250].shape

(38, 3)

In [14]:
incorrect_bert['Length'].mean()

284.3220338983051
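
The mean above can be pulled upwards by a handful of very long reviews, so it may help to look at the median and the share of short reviews alongside it. The following is a minimal sketch in plain Python, assuming the reviews are available as a list of strings; `length_summary` is a hypothetical helper and the toy reviews are made up for illustration.

```python
from statistics import mean, median

def length_summary(texts, threshold=250):
    """Summarise whitespace word counts for a list of review texts."""
    if not texts:
        raise ValueError("need at least one text")
    lengths = [len(text.split()) for text in texts]
    short = sum(1 for n in lengths if n <= threshold)
    return {
        "n": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "share_short": short / len(lengths),
    }

# Toy reviews of 100, 300 and 1 words (illustrative only).
toy_reviews = ["good film " * 50, "bad " * 300, "fine"]
summary = length_summary(toy_reviews)
print(summary)
```

On the real data, `length_summary(list(incorrect_bert['Text']))` would be the equivalent call, with `share_short` corresponding to the 38 of 59 figure computed above.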

## Logistic Regression

In [15]:
# larger sample due to far more incorrects here
#checklog = incorrect_log.sample(20)
# indexes -> [865, 520, 759, 809, 830, 900, 144,  13, 840, 653, 402, 993, 182, 280, 340, 493, 726, 153, 904,  81]

In [16]:
# write the sampled texts to files (this sample contains both false positives and false negatives)
#lstchecklog = list(checklog['Text'])
#for i in range(len(lstchecklog)):
#    with open("logreg-samples/text" + str(i) + ".txt", 'w', encoding="utf-8") as file:
#        file.write(lstchecklog[i])

In [17]:
# word count per review, splitting on whitespace
l1 = [len(text.split()) for text in incorrect_log['Text']]

In [18]:
incorrect_log['Length'] = l1
ind1 = [865, 520, 759, 809, 830, 900, 144, 13, 840, 653, 402, 993, 182, 280, 340, 493, 726, 153, 904, 81]
incorrect_log.loc[incorrect_log.index.isin(ind1)]

Unnamed: 0,Result,Text,Length
13,FP,Teenager Tamara (Jenna Dewan) has it rough. Sh...,260
81,FN,What an awesome mini-series. The original TRAF...,283
144,FN,This is a cute little French silent comedy abo...,172
153,FN,I would say to the foreign people who have see...,67
182,FN,It starts a little slow but give it a chance. ...,116
280,FN,"If it wasn't meant to be a comedy, the filmmak...",60
340,FN,Our reviewer from Toronto told you what you ne...,125
402,FN,"I am probably a little too old for this movie,...",293
493,FP,The only time I seem to trawl through IMDb com...,149
520,FN,Hammer House of Horror: Witching Time is set i...,490


In [19]:
incorrect_log['Length'].mean()

223.7128712871287

## Analysis of Logistic Regression Samples
### What does Logistic Regression struggle with?
Judging by the document lengths computed above, logistic regression also appears to struggle with shorter documents, even more so than BERT. The mean length of the documents LR got wrong (about 224 words) is noticeably lower than that of the documents BERT got wrong (about 284 words), and it is computed over a much larger set (202 documents versus 59). I feel LR struggles with shorter documents because a short piece of text may not contain enough clearly positive or negative statements for it to detect. Take one of our samples, found in logreg-samples/text2.txt:

***I cant believe there are people out there that did not like this movie! I thought it was the funniest movie i had ever seen. It my have been b/c i am Mel Brooks biggest fan... I know almost all the words and get very discouraged when they censor them, when it is played on a Family Channel. :) this is one of my favorite movies, so i dont know why any one would disagree! thanks Kristina***

This is the full review, which is definitely very short. It was incorrectly predicted as negative despite being positive. It appears logistic regression struggles with long-distance relationships within a sentence. In the first sentence, which a human would clearly read as positive, LR seems to assign no real importance to "I cant believe there are people out there" and instead to weight "that did not like this movie!", which in isolation would be read as very negative. Judging by this piece it also struggles with "text speak": smiley faces and abbreviations such as "b/c" are unlikely to be common in the training data and most likely hinder the LR classification here. We see the same struggle with informal writing in "logreg-samples/text15.txt", again a short piece, this time incorrectly predicted as positive. One sentence in this piece is "It IS a rubbish film, it DOESN'T hang together and it DOES constitute a wasted evening sitting through it". From a human perspective this is again overwhelmingly negative, with the capitals only reinforcing the point. However, our LR classifier cannot use this emphasis, as in training it would not have seen these words, or any words, capitalized. LR also appears to struggle with words used in a backhanded way. Take this sentence from the same file:

***Lonesome Jim was a duff film packed with unbelievable characters in unbelievable situations which limped on lamely and boringly towards a cop-out hackneyed conclusion***

The LR classifier most likely picks up on "unbelievable", which appears twice in "unbelievable characters in unbelievable situations", as a feature and treats it as strongly positive. In this context, however, "unbelievable" is most likely meant dismissively and is not intended to praise the film in question. In our training data "unbelievable" was most likely associated mainly with positive reviews. We see further struggles with text speak, shorter documents and capitalization in "logreg-samples/text1.txt":

***The larger-than-life moral was horrible. I mean, sacrifice your life, sacrifice your beliefs, sacrifice your love so that your mighty nation can succeed???? WTF??***

It must be noted again that this LR implementation removed stop words, punctuation and other very frequent words, and used a maximum of 200 features, among other settings. Because punctuation was removed, the run of question marks, which a human reader would take as a sign of very negative sentiment, never reaches the classifier. The text speak "WTF??" is also something most readers would describe as very negative, but it is probably not among the 200 features, so our LR classifier appears unable to use it here.
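
To make the effect of this preprocessing concrete, here is a minimal sketch of a bag-of-words pipeline with lowercasing, punctuation removal, stop word removal and a vocabulary cap. It is an assumption-laden stand-in, not the actual LR pipeline: the stop word list, the regex tokenizer, the toy corpus and the function names are all hypothetical, and the cap is set to 2 instead of 200 so the effect is visible on a tiny corpus.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; the list used for the real model is not shown here.
STOP_WORDS = {"the", "was", "a", "it", "is"}

def tokenize(text):
    """Lowercase, keep only letter/apostrophe runs, then drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]

def build_vocabulary(corpus, max_features):
    """Keep the max_features most frequent terms across the whole corpus."""
    totals = Counter(token for doc in corpus for token in tokenize(doc))
    return {term for term, _ in totals.most_common(max_features)}

def featurize(text, vocabulary):
    """Count only the tokens that survived into the vocabulary."""
    return Counter(t for t in tokenize(text) if t in vocabulary)

toy_corpus = ["The film was rubbish", "A rubbish plot, a rubbish film", "WTF??"]
vocab = build_vocabulary(toy_corpus, max_features=2)

shouted = featurize("It IS a rubbish film!!! WTF??", vocab)
plain = featurize("it is a rubbish film", vocab)
print(vocab, shouted == plain)
```

In this sketch the shouted and plain versions end up with identical features: the capitals and repeated punctuation are erased by the tokenizer, and the rare token "wtf" falls outside the capped vocabulary.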

## Let's focus on both BERT's and LR's strengths now
### Let's start with BERT, looking at its correct predictions

In [20]:
# BERT correct predictions on examples where logistic regression gave a different (wrong) result
lst2 = []
for i in range(len(bert)):
    if bert['Result'][i] != log['Result'][i] and bert['Result'][i] in correct:
        lst2.append(i)

In [21]:
len(lst2)

202

In [22]:
# take a copy so new columns can be added without a SettingWithCopyWarning
correct_bert = bert.loc[bert.index[lst2]].copy()
correct_bert.shape

(202, 2)

In [23]:
#checkbertcorrect = correct_bert.sample(20)
# indexes -> [810, 81, 910, 451, 754, 450, 435, 682, 157, 759, 818, 517, 843, 565, 830, 493, 456, 541, 464, 985]

In [24]:
# word count per review, splitting on whitespace
l3 = [len(text.split()) for text in correct_bert['Text']]
correct_bert['Length'] = l3

In [26]:
#lstcheckbertcor = list(checkbertcorrect['Text'])
#for i in range(len(lstcheckbertcor)):
#    with open("bert-samples-correct/textcorrect" + str(i) + ".txt", 'w', encoding = "utf-8") as file:
#        file.write(lstcheckbertcor[i])

## BERT Correct Analysis
### What does BERT do well?

Let's now look at samples that BERT predicted correctly and logistic regression did not. We took 20 samples from the 202 instances of this occurring. One thing I noticed straight away is that BERT handled both negative and positive documents well when they did not mix sentiment. As mentioned earlier, both models appeared to struggle with documents that open with the opposite sentiment to their actual class and only later settle into it. Taking the files "bert-samples-correct/text14correct.txt", "bert-samples-correct/text12correct.txt", "bert-samples-correct/text10correct.txt" and "bert-samples-correct/text1correct.txt" as examples, these all express either negative or positive sentiment immediately and do not appear to deviate from it as the review progresses. Looking at the start of "bert-samples-correct/text1correct.txt" we see this positive sentiment immediately:

***What an awesome mini-series. The original TRAFFIK completely stole me away from anything else that was on. Far more engaging than the American remake, the original TRAFFIK boasts an amazing cast formed of lesser known actors to North American audiences.***

This review remains positive throughout, and BERT seems to capture this very well. The same can be seen in "bert-samples-correct/text13correct.txt":

***I will give it a second chance but was very disappointed in the first one. It wouldn't hold a candle to the other series. It has a lot of meaningless dialog that doesn't add to the storyline at all.***

This is quite a short piece of text, which I felt both BERT and LR struggled with, but because it does not switch between positive and negative sentiment BERT handles it well. It seems logical that BERT works better on pieces where one sentiment is clearly expressed. BERT is trained bidirectionally, so text that changes sentiment midway through could well impair its accuracy, whereas sentences such as the ones shown above are clearly negative throughout and can be given more weight in any prediction it makes. Another thing I found from looking at my samples is that BERT worked far better with more formal, well-structured text. Again, this is hardly surprising, as BERT does not treat text as a "bag of words" and expects properly structured sentences instead.

## Let's now focus on LR and what it does well

We will look at the samples we used earlier for the incorrect BERT predictions. LR predicted these correctly, so they make a good comparison.

In [28]:
# ind is list of indexes we used earlier for incorrect BERT samples
log.loc[log.index.isin(ind)]

## Analysis of LR Correct Samples
### What does LR appear strong in?

Looking at the same samples, all of which BERT incorrectly predicted as positive, there are a few things that LR appears to do well. In "bert-samples/text6.txt", BERT appeared to struggle with some of the language used. The document ended as follows:

***"MERRY SNIFFMUS!!?" WHAT!!? THAT'S ABOUT AS MUCH CREATIVITY AS A HILLARY CLINTON SPEECH ON DRUGS!!!! IT'S STUPID!!!! <br /><br />THE PLOT SETTINGS AND MORALS ARE EFFORTLESS BAGS OF POOP!!!! These shows are now telling kids that stealing, lying, and being an asshole to your parents is a GOOD THING!!! IF THESE ARE THE KINDS OF AWFUL CRAPPY SHOWS THAT THEY'RE THROWING AT KIDS THESE DAYS, THEN I DON'T WANT TO TAKE PART IN WATCHING ANY OF THEM!!!! THIS IS BIGGEST PIECE OF CRAP I'VE EVER WATCHED ON TV! BAR NONE!!! NICKELODEON, "I'M THROUGH WITH YOU!!!!" END OF STORY!!!!***

BERT most likely struggled with this because it is not standard writing (excessive punctuation, whole sentences in capitals) and because the document exceeds BERT's input limit of 512 tokens (it was 629 words long). This means this final part was either left out or interpreted completely incorrectly. I feel LR worked better here because this implementation used a maximum of 200 features, so it would only have picked up the most influential features, be they positive or negative. It would most likely have ignored anything in the capitalized ending that did not appear in training and relied only on the most prominent features from the 629 words. We see a similar pattern in "bert-samples/text8.txt", which is the same length as the previously mentioned document and whose class LR, unlike BERT, predicts correctly. This document uses some relatively uncommon words throughout, such as "imbecile", "cadaver" and "unrequited". It also uses some quite abstract metaphors to express negative sentiment, such as:

***The film had all of the substance of a stale white bread sandwich (with store brand white bread, no less) and the emotion of a cadaver.***

Even reading this bidirectionally, BERT would struggle to classify it. It again appears that LR works best here, as it has a low limit on its features and would select only the most prominent ones it finds in the text.
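
The 512-token limit mentioned above means that the emphatic ending of a 629-word review can be lost before the model ever sees it. The sketch below illustrates this with two truncation strategies. It rests on several assumptions: whitespace-separated words stand in for BERT's subword tokens (the real token count would be higher, and special tokens also take up slots), the notebook does not show which truncation the BERT model actually used, and both function names are hypothetical.

```python
def truncate_head(tokens, max_len=512):
    """Keep only the first max_len tokens (plain head truncation)."""
    return tokens[:max_len]

def truncate_head_tail(tokens, max_len=512, head=128):
    """Keep the first `head` tokens and fill the rest of the budget from the end."""
    if not 0 <= head < max_len:
        raise ValueError("head must satisfy 0 <= head < max_len")
    if len(tokens) <= max_len:
        return list(tokens)
    tail = max_len - head
    return tokens[:head] + tokens[-tail:]

# Stand-in for a 629-word review: w0 is the first word, w628 the last.
review = [f"w{i}" for i in range(629)]

head_only = truncate_head(review)
head_tail = truncate_head_tail(review)

print(len(head_only), head_only[-1])
print(len(head_tail), head_tail[-1])
```

With head truncation the last 117 words never reach the model, whereas the head-plus-tail variant keeps the closing passage at the cost of dropping the middle of the review. Whether that would have changed the prediction for "bert-samples/text6.txt" is untested here.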

### Let's now investigate examples that break both models

I found a movie review dataset on Kaggle that is not part of the IMDB movie review dataset, trained both models on the IMDB training data and then used the Kaggle data as the test set. This should provide us with instances of false positives and false negatives that "break" both models. The dataset can be found [here](https://www.kaggle.com/datasets/newra008/movie-review-and-rating). The cleaning process for this is found [here](https://gitlab.computing.dcu.ie/sweenk27/ca4023nlt-assignment2/-/blob/master/Question2/examples-break.ipynb) and the use of our BERT model to make predictions on this can be found [here](https://gitlab.computing.dcu.ie/sweenk27/ca4023nlt-assignment2/-/blob/master/Question2/samples-bert.ipynb).

In [44]:
logbreak = pd.read_csv('evaluationlogistic-break.csv', index_col = 'Unnamed: 0')
logbreak.head()

Unnamed: 0,Result,Text
0,TP,This far best SpiderMan even Marvel movie amou...
1,TP,Coming someone doesn’t like previous two mcu s...
2,TP,Marvel outdone themselves This quintessential ...
3,TP,Saw FDFS 5am You beauty SpiderMan The best Spi...
4,TP,Best Spiderman line takes top spot MCU movies ...


In [45]:
bertbreak = pd.read_csv('evaluationbert-break-results.csv', index_col = 'Unnamed: 0')
bertbreak.head()

Unnamed: 0,Result,Text
0,TP,This is by far the best Spider-Man or even Mar...
1,TP,Coming from someone who doesn’t like the previ...
2,TP,Marvel has outdone themselves. This is the qui...
3,TP,Saw this #FDFS at 5am\nYou beauty Spider-Man\n...
4,TP,"Best in the Spiderman line, and it takes top s..."


In [47]:
# find examples that "break" both models (same wrong result from LR and BERT)
lstlog = list(logbreak['Result'])
lstbert = list(bertbreak['Result'])
indexes = []
for i in range(len(lstlog)):
    if lstlog[i] == lstbert[i] and lstlog[i] in wrong:
        indexes.append(i)

In [48]:
# take a copy so the Length column below is set on an independent DataFrame
checks = bertbreak.loc[bertbreak.index.isin(indexes)].copy()

In [49]:
# word count per review, splitting on whitespace
checks['Length'] = [len(text.split()) for text in checks['Text']]


In [50]:
checks['Length'].mean()

76.02702702702703

#### Let's take a selection of these unseen inputs from a separate dataset that "break" both models.

##### FN

***I see lots of negative reviews, but to me they miss the point of the movie. The vacant, empty, "alone" feeling is what the whole experience is about. Not every movie needs car crashes, transforming robots and impossible fight scenes. If you like sci fi and don't need to wrack your brain - or you just want a good story with some eye candy for you and your spouse, give this one a try. My wife and I re-watch it once a year on a date night.***

##### FP

***First things first there have been alot of 5 star reviews by people who haven't even seen the movie.We've all seen great movie trailers that end up being terrible movies and this is a case in point. Really enjoyed the first film not perfect by any means but it had heart a reasonable plot good effects and a gorgeous lead in Mrs Gadot. I would echo any 1star review of the sequel. I went to see this with my wife and if it wasn't for the fact that it costs slot of money to go to the cinema these days we would both have walked out after half an hour. Truly truly awful in every way (Gals still gorgeous though)***

##### FP

***I’m writing this as I watch it. Here’s a quote from the movie that sums up my thoughts so far, “Reboots sell.” I was so excited to watch this, I’ve been a fan of the matrix ever since I watched the first film with my dad. The first 5 minutes had tacky action sequences and they had to throw in the liberal rhetoric to boot. I will treat this the same way I treat the Star Wars sequels: when I need a good laugh I’ll turn it on. Keanu Reeves told his fans that he’ll continue making Matrix films as long as they want it so as a fun I’m telling you to stop. DO NOT make another Matrix.***

### Analysis - Unseen instances that break both models

We take three instances that "break" both models' predictions here and try to assess why. Overall, both our LR and BERT models work well on this new test data. However, a couple of things stand out in the three cases above. The point discussed earlier, about reviews that start with one sentiment and end with another, appears to hold here as well. The models also appear to struggle with improper grammar, incorrect spelling and emphatic capitalization ("costs slot of money", "1star review", "DO NOT make another Matrix"). Both models most likely picked up on:

***I see lots of negative reviews, but to me they miss the point of the movie. The vacant, empty, "alone" feeling is what the whole experience is about.***

as overwhelmingly negative sentiment, when in reality it is meant to be positive. It appears neither model can detect negation unless set phrases are used. The sentences highlighted above use negation to convey a positive message, starting in the first sentence.
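
Negation comes up in both analyses in this notebook. As a small illustration of why order-free word features lose it, the sketch below compares unigram and bigram features for the sentence quoted earlier from logreg-samples/text2.txt. The regex tokenizer and the `ngrams` helper are assumptions for illustration and are not the preprocessing used for either model.

```python
import re

def ngrams(text, n):
    """Return the list of word n-grams in text after simple lowercasing."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I cant believe there are people out there that did not like this movie"

unigrams = set(ngrams(sentence, 1))
bigrams = set(ngrams(sentence, 2))

# Unigrams see "not" and "like" as unrelated features.
print("not" in unigrams, "like" in unigrams, "not like" in unigrams)
# Bigrams keep the adjacent pair together.
print("not like" in bigrams, "cant believe" in bigrams)
```

Bigrams keep "not like" together, but even they cannot connect "cant believe" at the start of the sentence to the clause it overturns nine words later, which is the kind of longer-range structure both models appear to miss in the examples above.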