# Classifying IMDb Movie Reviews

## Data Overview

For this analysis we’ll use a dataset of 50,000 movie reviews taken from IMDb. The data was compiled by Andrew Maas and is available at http://ai.stanford.edu/~amaas/data/sentiment/.

The data is split evenly: 25k reviews are intended for training and 25k for testing your classifier. Each set contains 12.5k positive and 12.5k negative reviews.

IMDb lets users rate movies on a scale from 1 to 10. The curator of the data labeled reviews with ≤ 4 stars as negative and reviews with ≥ 7 stars as positive. Reviews with 5 or 6 stars were left out.
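
The labeling rule can be written as a small function. This is only a sketch: `label_from_rating` is a hypothetical helper, not part of the dataset or the notebook, and it borrows the 1 = positive, 0 = negative encoding used for `target` later on.

```python
from typing import Optional


def label_from_rating(stars: int) -> Optional[int]:
    """Map a 1-10 star rating to 1 (positive), 0 (negative) or None (excluded)."""
    if not 1 <= stars <= 10:
        raise ValueError("rating must be between 1 and 10")
    if stars <= 4:
        return 0
    if stars >= 7:
        return 1
    return None  # 5 and 6 stars are left out of the dataset


# Show the label assigned to every possible rating.
labels = {stars: label_from_rating(stars) for stars in range(1, 11)}
print(labels)
```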

## Reading the Data

In [2]:
review_train = []
review_test = []

# Each line of the text file holds one review.
with open("Desktop/Dataset/imdb_review/full_train.txt", "r", encoding="utf8") as f:
    for line in f:
        review_train.append(line.strip())

with open("Desktop/Dataset/imdb_review/full_test.txt", "r", encoding="utf8") as f:
    for line in f:
        review_test.append(line.strip())


In [4]:
review_train[:5]

['Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!',
 'Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything fro

In [5]:
review_test[:5]

["I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.",
 'Actor turned director Bill Paxton follows up his promising debut, the Gothic-horror "Frailty", with this family friendly sports drama about the 1913 U.S. Open where a young American caddy rises from 

## Clean and Preprocess

In [7]:
import re

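# Punctuation that is deleted outright.
replace_no_space = re.compile(r"[.;:!'?,\"()\[\]]")
# Line-break tags, hyphens and slashes.
replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def replace_n_clean(review):
    # Lowercase each review and strip punctuation.
    review = [replace_no_space.sub("", line.lower()) for line in review]
    # Note: despite the pattern's name, matches are replaced with "" here,
    # which is why "Gothic-horror" appears as "gothichorror" in the output.
    # Use " " instead to keep the joined words separate.
    review = [replace_with_space.sub("", line) for line in review]

    return review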

review_train_clean = replace_n_clean(review_train)
review_test_clean = replace_n_clean(review_test)
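
To see what the two patterns do, here is a small sketch that applies them to a made-up sentence. The names `clean`, `filler` and `sample` are illustrative and do not appear in the notebook. It compares the empty-string replacement used in `replace_n_clean` with a single-space replacement.

```python
import re

# Same patterns as in the cell above, written as raw strings.
replace_no_space = re.compile(r"[.;:!'?,\"()\[\]]")
replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def clean(text, filler):
    """Lowercase, drop punctuation, then replace tags/hyphens/slashes with filler."""
    text = replace_no_space.sub("", text.lower())
    return replace_with_space.sub(filler, text)


sample = 'A Gothic-horror "classic"!<br /><br />Loved it.'
joined = clean(sample, "")   # what replace_n_clean does
spaced = clean(sample, " ")  # alternative that keeps word boundaries
print(joined)
print(spaced)
```

With the empty-string replacement, words on either side of a hyphen or a `<br /><br />` tag are merged into one token, which changes the vocabulary the vectorizer sees.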



In [9]:
review_train_clean[:5]

['bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt',
 'homelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter most

In [10]:
review_test_clean[:5]

['i went and saw this movie last night after being coaxed to by a few friends of mine ill admit that i was reluctant to see it because from what i knew of ashton kutcher he was only able to do comedy i was wrong kutcher played the character of jake fischer very well and kevin costner played ben randall with such professionalism the sign of a good movie is that it can toy with our emotions this one did exactly that the entire theater which was sold out was overcome by laughter during the first half of the movie and were moved to tears during the second half while exiting the theater i not only saw many women in tears but many full grown men as well trying desperately not to let anyone see them crying this movie was great and i suggest that you go see it before you judge',
 'actor turned director bill paxton follows up his promising debut the gothichorror frailty with this family friendly sports drama about the 1913 us open where a young american caddy rises from his humble background to

## Vectorization

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

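# binary=True records only whether a word occurs in a review, not how often.
cv = CountVectorizer(binary=True)

# Learn the vocabulary from the training reviews only, then encode both sets with it.
cv.fit(review_train_clean)
X = cv.transform(review_train_clean)
X_test = cv.transform(review_test_clean)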
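
As a rough illustration of what a binary bag-of-words encoding looks like, here is a pure-Python stand-in on a toy corpus. It is not scikit-learn's implementation (which returns a sparse matrix); `binary_bag_of_words` and the toy sentences are made up for this sketch, and the token pattern mirrors `CountVectorizer`'s default of words with two or more characters.

```python
import re


def binary_bag_of_words(train_docs, docs):
    """Encode docs as 0/1 rows over a vocabulary learned from train_docs."""
    token_re = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters
    vocab = sorted({tok for doc in train_docs
                    for tok in token_re.findall(doc.lower())})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in token_re.findall(doc.lower()):
            if tok in index:         # words unseen during fitting are ignored
                row[index[tok]] = 1  # presence only, repeats do not add up
        rows.append(row)
    return vocab, rows


train = ["great great movie", "bad movie"]
test = ["great plot"]
vocab, X_toy = binary_bag_of_words(train, train)
_, X_toy_test = binary_bag_of_words(train, test)
print(vocab)
print(X_toy)
print(X_toy_test)
```

Each row has one column per vocabulary word. "great" occurring twice still yields a 1, and "plot" is dropped from the test row because it was not seen during fitting.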



## Build Classifier

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

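# Labels assume each file lists the 12,500 positive reviews first, then the 12,500 negative ones.
target = [1 if i < 12500 else 0 for i in range(25000)]

# Hold out 25% of the training reviews for validation.
X_train, X_val, Y_train, Y_val = train_test_split(X, target, train_size=0.75)

# C is the inverse regularization strength: smaller values regularize more strongly.
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, Y_train)
    print("Accuracy for C=%s: %s" % (c, accuracy_score(Y_val, lr.predict(X_val))))
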
Accuracy for C=0.01: 0.87136
Accuracy for C=0.05: 0.88144
Accuracy for C=0.25: 0.87984
Accuracy for C=0.5: 0.87888
Accuracy for C=1: 0.87856
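
Picking the value of C with the highest validation accuracy can also be done programmatically. A minimal sketch, with the accuracies hard-coded from the run above (`val_accuracy` and `best_c` are illustrative names). Because `train_test_split` is called without a fixed `random_state`, a rerun may give slightly different numbers.

```python
# Validation accuracies copied from the recorded output.
val_accuracy = {
    0.01: 0.87136,
    0.05: 0.88144,
    0.25: 0.87984,
    0.5: 0.87888,
    1: 0.87856,
}

# Choose the C whose validation accuracy is highest.
best_c = max(val_accuracy, key=val_accuracy.get)
print("Best C: %s (accuracy %s)" % (best_c, val_accuracy[best_c]))
```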


## Train Final Model


In [14]:
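final_model = LogisticRegression(C=0.05)
# Refit on all 25,000 training reviews with the best C from validation.
final_model.fit(X, target)
# The test set is assumed to follow the same ordering (12,500 positive, then
# 12,500 negative), so `target` doubles as the test labels.
print("Final Accuracy: %s"
      % accuracy_score(target, final_model.predict(X_test)))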

Final Accuracy: 0.88276
