# Movie Reviews

In [1]:
import pandas as pd

data = pd.read_csv("reviews.csv")

data.head()

Unnamed: 0,target,reviews
0,neg,"plot : two teen couples go to a church party ,..."
1,neg,the happy bastard's quick movie review \ndamn ...
2,neg,it is movies like these that make a jaded movi...
3,neg,""" quest for camelot "" is warner bros . ' firs..."
4,neg,synopsis : a mentally unstable man undergoing ...


The dataset is made up of positive and negative movie reviews.

## Preprocessing

👇 Remove punctuation and lowercase the text.

In [3]:
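from nltk.tokenize import RegexpTokenizer

def clean_text():
    """Return the reviews lowercased, with punctuation removed."""
    # \w+ keeps runs of letters, digits and underscores; everything else is dropped
    tokenizer = RegexpTokenizer(r'\w+')
    cleaned = [' '.join(tokenizer.tokenize(review.lower())) for review in data['reviews']]
    return pd.Series(cleaned, index=data.index, name='clean_text')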
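
As a quick sanity check of what the cleaning step does, here is a minimal sketch that uses only the standard library. It assumes that `RegexpTokenizer(r'\w+')` behaves like `re.findall` with the same pattern; `clean_review` is a hypothetical helper name, and the sample string is adapted from the first row shown above.

```python
import re

def clean_review(text: str) -> str:
    """Lowercase the text and keep only runs of word characters."""
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(tokens)

# Sample adapted from the first review in the dataset (capital letter added)
sample = "Plot : two teen couples go to a church party ,"
cleaned = clean_review(sample)
print(cleaned)
```

Note that `\w` also matches underscores, which is why a token such as `_babe_` survives the cleaning.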

In [18]:
data['clean_text'] = clean_text()
data = data.drop(columns=['reviews'])
data

Unnamed: 0,target,clean_text
0,neg,plot two teen couples go to a church party dri...
1,neg,the happy bastard s quick movie review damn th...
2,neg,it is movies like these that make a jaded movi...
3,neg,quest for camelot is warner bros first feature...
4,neg,synopsis a mentally unstable man undergoing ps...
...,...,...
1995,pos,wow what a movie it s everything a movie can b...
1996,pos,richard gere can be a commanding actor but he ...
1997,pos,glory starring matthew broderick denzel washin...
1998,pos,steven spielberg s second epic film on world w...


## Bag-of-Words modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a Bag-of-Words representation of the texts.
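
The cells in this section evaluate the model on a single train/test split. A cross-validated variant could look like the sketch below. It wraps the vectorizer and the classifier in a pipeline so that the vocabulary is learned only from each training fold. The toy corpus, the `cv=2` setting and the variable names are assumptions made for illustration; in the notebook the inputs would be `data['clean_text']` and `data['target']`, probably with more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration only
texts = [
    "a wonderful moving film with great acting",
    "great story and wonderful performances",
    "an enjoyable film with a great cast",
    "wonderful direction and a moving story",
    "a dull boring film with terrible acting",
    "terrible story and boring performances",
    "a boring mess with a dull cast",
    "dull direction and a terrible story",
]
labels = ["pos"] * 4 + ["neg"] * 4

# The vectorizer is refit inside every fold, so no test text leaks into the vocabulary
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
cv_results = cross_validate(pipeline, texts, labels, cv=2, scoring=["accuracy"])

mean_accuracy = cv_results["test_accuracy"].mean()
print(mean_accuracy)
```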

In [20]:
sms = data

In [21]:
sms.shape

(2000, 2)

In [22]:
sms.head()

Unnamed: 0,target,clean_text
0,neg,plot two teen couples go to a church party dri...
1,neg,the happy bastard s quick movie review damn th...
2,neg,it is movies like these that make a jaded movi...
3,neg,quest for camelot is warner bros first feature...
4,neg,synopsis a mentally unstable man undergoing ps...


In [23]:
sms.target.value_counts()

pos    1000
neg    1000
Name: target, dtype: int64

In [24]:
X = sms.clean_text
y = sms.target
print(X.shape)
print(y.shape)

(2000,)
(2000,)


In [25]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1500,)
(500,)
(1500,)
(500,)


In [27]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# Learn the vocabulary and build the document-term matrix in one step
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

<1500x35542 sparse matrix of type '<class 'numpy.int64'>'
	with 500167 stored elements in Compressed Sparse Row format>
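
To make the document-term matrix above more concrete, here is a minimal pure-Python sketch of the same idea on a made-up two-document corpus. It is not how `CountVectorizer` is implemented; the token pattern mirrors its documented default (`\b\w\w+\b`, tokens of at least two characters), and `build_dtm` is a hypothetical helper.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token pattern: words of 2+ characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def build_dtm(docs):
    """Return (sorted vocabulary, dense count matrix) for a list of texts."""
    counts = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per vocabulary term
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["a good movie a very good movie", "not a good movie"]
vocabulary, matrix = build_dtm(docs)
print(vocabulary)
print(matrix)
```

In the real matrix most cells are zero: 500167 stored elements out of 1500 × 35542 cells is just under 1%, which is why scikit-learn returns a sparse matrix rather than a dense one.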

In [28]:
X_test_dtm = vect.transform(X_test)
X_test_dtm

<500x35542 sparse matrix of type '<class 'numpy.int64'>'
	with 162110 stored elements in Compressed Sparse Row format>

In [29]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
%time nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred_class)

Wall time: 96 ms


0.784

In [30]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[204,  51],
       [ 57, 188]], dtype=int64)
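
The confusion matrix can be read by hand. In scikit-learn's convention the rows are the true labels and the columns the predicted labels, in sorted label order (`neg`, then `pos`). The sketch below copies the four counts from the output above and derives accuracy, precision and recall for the `pos` class; the variable names are only illustrative.

```python
# Counts copied from the confusion matrix above
# rows = true label (neg, pos), columns = predicted label (neg, pos)
tn, fp = 204, 51
fn, tp = 57, 188

total = tn + fp + fn + tp
accuracy = (tp + tn) / total      # share of all reviews classified correctly
precision_pos = tp / (tp + fp)    # of the reviews predicted 'pos', how many were 'pos'
recall_pos = tp / (tp + fn)       # of the true 'pos' reviews, how many were found

print(f"accuracy={accuracy:.3f} precision={precision_pos:.3f} recall={recall_pos:.3f}")
```

The accuracy computed this way agrees with the `accuracy_score` value of 0.784 reported above.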

In [31]:
# False positives: predicted 'pos' but actually 'neg'
X_test[(y_pred_class == 'pos') & (y_test == 'neg')]

808    stephen please post if appropriate mafia crime...
442    the original _babe_ was my favorite movie of 1...
631    the swooping shots across darkened rooftops su...
558    synopsis lifelong friends rafe affleck and dan...
275    porter stoddard warren beatty is a successful ...
868    aspiring broadway composer robert aaron willia...
693    i have nothing against unabashedly romantic fi...
671    david schwimmer from the television series fri...
120    starting with the little mermaid and most rece...
102    ever since wargames the first real computer ha...
48     sydney lumet is the director whose work happen...
285    in the series of the erotic thrillers that flo...
700    the beach is a structurally confusing film tha...
761    weighed down by tired plot lines and spielberg...
887    this talky terribly plotted thriller stars ale...
480    as any reasonable human being would i must adm...
612    joe versus the volcano is really one of the wo...
309    because no one demanded 

In [32]:
# False negatives: predicted 'neg' but actually 'pos'
X_test[(y_pred_class == 'neg') & (y_test == 'pos')]

1699    very few people would be unaware of beavis but...
1282    ok let s get one thing straight right away max...
1210    i must say from the outset that i have never b...
1636    we all know the fate of child s play which is ...
1149    the long and illustrious career of robin willi...
1138    the second jackal based film to come out in 19...
1044    what do you get when you slap together a movie...
1886    bob the happy bastard s quickie review the mum...
1761    film adaptation of hunter s thompson s infamou...
1071    zero effect gets its title from the main chara...
1842    if beavis and butthead had a favorite movie fr...
1135    as with his other stateside releases jackie ch...
1712    when i first heard about scream in 1996 i was ...
1464    swashbuckling adventure that can be enjoyed by...
1954    eddie murphy has had his share of ups and down...
1480    the trailers and the beginning of the move sum...
1986    i think the first thing this reviewer should m...
1595    hollyw

In [33]:
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([3.41031604e-06, 3.22880297e-02, 3.94452730e-10, 1.00000000e+00,
       3.08465943e-03, 1.75111641e-19, 9.29533370e-26, 6.43602614e-19,
       1.00000000e+00, 9.99508890e-01, 1.00000000e+00, 1.00000000e+00,
       1.00000000e+00, 2.21518554e-10, 9.93822747e-01, 1.00000000e+00,
       1.60864013e-09, 1.00000000e+00, 3.89136651e-24, 9.99999999e-01,
       1.95290167e-31, 1.00000000e+00, 1.00000000e+00, 4.49288352e-06,
       9.99862465e-01, 3.80680741e-15, 9.99997640e-01, 8.86366056e-42,
       9.99982549e-01, 9.99746959e-01, 9.99999998e-01, 7.89748204e-22,
       1.00000000e+00, 3.33942533e-31, 1.48693811e-04, 2.81430371e-21,
       4.58956003e-04, 1.40828814e-11, 9.98532671e-01, 9.99999038e-01,
       8.42337915e-13, 2.16798726e-08, 5.38095159e-07, 8.13711475e-02,
       9.99999995e-01, 1.00000000e+00, 2.17883384e-02, 1.00000000e+00,
       6.84081223e-04, 1.00000000e+00, 1.48235224e-18, 9.99999789e-01,
       4.76752584e-19, 3.58807955e-19, 9.99999993e-01, 6.63850629e-04,
      

In [34]:
metrics.roc_auc_score(y_test, y_pred_prob)

0.8675390156062425
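
One way to interpret the ROC AUC is as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes that quantity by brute force on a made-up four-sample example. `pairwise_auc` is a hypothetical helper written for illustration, not the routine scikit-learn uses internally.

```python
def pairwise_auc(y_true, scores, positive="pos"):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos_scores = [s for y, s in zip(y_true, scores) if y == positive]
    neg_scores = [s for y, s in zip(y_true, scores) if y != positive]
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up labels and predicted probabilities of the 'pos' class
y_true = ["neg", "neg", "pos", "pos"]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)
```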

## N-gram modelling

👇 Using `cross_validate`, score a Multinomial Naive Bayes model trained on a 2-gram Bag-of-Words representation of the texts.
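
With `ngram_range=(2,2)` the vectorizer counts pairs of consecutive tokens instead of single words, so a phrase like "not good" becomes a feature of its own. The sketch below shows which features a short made-up sentence would yield. It mimics the default tokenization (lowercasing, tokens of at least two characters) and is not the library's implementation; `word_bigrams` is a hypothetical helper.

```python
import re

def word_bigrams(text):
    """Return the consecutive word pairs of a text, joined by a space."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigrams = word_bigrams("this movie was not good")
print(bigrams)
```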

In [37]:
vects = CountVectorizer(ngram_range=(2,2))
# Learn the bigram vocabulary and build the document-term matrix in one step
X_train_dtm = vects.fit_transform(X_train)
X_train_dtm

<1500x400932 sparse matrix of type '<class 'numpy.int64'>'
	with 871757 stored elements in Compressed Sparse Row format>

In [38]:
X_test_dtm = vects.transform(X_test)
X_test_dtm

<500x400932 sparse matrix of type '<class 'numpy.int64'>'
	with 188548 stored elements in Compressed Sparse Row format>

In [39]:
nb = MultinomialNB()
%time nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)
metrics.accuracy_score(y_test, y_pred_class)

Wall time: 69 ms


0.838

In [40]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[197,  58],
       [ 23, 222]], dtype=int64)

In [41]:
# False positives: predicted 'pos' but actually 'neg'
X_test[(y_pred_class == 'pos') & (y_test == 'neg')]

808    stephen please post if appropriate mafia crime...
442    the original _babe_ was my favorite movie of 1...
558    synopsis lifelong friends rafe affleck and dan...
275    porter stoddard warren beatty is a successful ...
573    synopsis original jurassic park survivor alan ...
868    aspiring broadway composer robert aaron willia...
693    i have nothing against unabashedly romantic fi...
120    starting with the little mermaid and most rece...
535    plunkett macleane is a period piece mired down...
774    it is with some sad irony that i screened frig...
47     instinct is the kind of movie that inexperienc...
700    the beach is a structurally confusing film tha...
701    the most absurd remake of 1998 it s a toss up ...
761    weighed down by tired plot lines and spielberg...
480    as any reasonable human being would i must adm...
612    joe versus the volcano is really one of the wo...
697    marie couldn t talk paulie the parrot star of ...
846    capsule the running gag 

In [42]:
# False negatives: predicted 'neg' but actually 'pos'
X_test[(y_pred_class == 'neg') & (y_test == 'pos')]

1282    ok let s get one thing straight right away max...
1636    we all know the fate of child s play which is ...
1138    the second jackal based film to come out in 19...
1761    film adaptation of hunter s thompson s infamou...
1842    if beavis and butthead had a favorite movie fr...
1135    as with his other stateside releases jackie ch...
1712    when i first heard about scream in 1996 i was ...
1219    it must be tough to be a mob boss just ask pau...
1334    the booming introduction music finishes and th...
1073    screen story by kevin yagher and andrew kevin ...
1028    the blair witch project was perhaps one of a k...
1646    vampire lore and legend has always been a popu...
1184    i tried hard not like this movie without succe...
1162    note some may consider portions of the followi...
1000    films adapted from comic books have had plenty...
1403    my fellow americans is a movie that at first g...
1611    john carpenter directed this stylish and gory ...
1871    well t

In [43]:
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([4.03234213e-005, 9.99997761e-001, 1.22423008e-013, 1.00000000e+000,
       9.95840898e-001, 1.26264421e-001, 2.34748880e-033, 1.01292949e-011,
       1.00000000e+000, 9.99908852e-001, 1.00000000e+000, 1.00000000e+000,
       1.00000000e+000, 9.99950377e-001, 5.73132034e-001, 1.00000000e+000,
       2.24822998e-004, 1.00000000e+000, 2.77314270e-043, 1.00000000e+000,
       4.51765778e-007, 1.00000000e+000, 1.00000000e+000, 2.74237796e-001,
       1.99350946e-001, 1.56465990e-013, 1.00000000e+000, 4.88499629e-035,
       9.99616454e-001, 5.89255913e-001, 9.99999846e-001, 1.95942305e-033,
       1.00000000e+000, 1.39293241e-015, 6.49082046e-001, 4.80627345e-023,
       1.60805344e-003, 3.53307737e-006, 9.96104350e-001, 1.00000000e+000,
       8.56567925e-006, 4.19773189e-012, 1.76894001e-005, 1.00000000e+000,
       9.99762370e-001, 1.00000000e+000, 1.07372881e-006, 1.00000000e+000,
       9.99833848e-001, 1.00000000e+000, 7.23094750e-011, 1.00000000e+000,
       1.65869162e-023, 8

In [44]:
metrics.roc_auc_score(y_test, y_pred_prob)

0.9042416966786715
