# Introduction

**This notebook is a basic Sentiment Analysis project using the IMDB movie review dataset, available [here](https://github.com/aaronkub/machine-learning-examples/blob/master/imdb-sentiment-analysis/movie_data.tar.gz)**

The data comes in two files, one for training and one for testing. Each file holds 25k reviews, split evenly into 12.5k positive and 12.5k negative reviews.

In [45]:
# Here we read both the train and the test file, one review per line. .strip() removes the whitespace at the start and end of each review (including the newline), so each review is left as a single paragraph-like string
review_train = []
with open('/Users/nikhilshah/Downloads/movie_data/full_train.txt', 'r') as f:
    for line in f:
        review_train.append(line.strip())

review_test = []
with open('/Users/nikhilshah/Downloads/movie_data/full_test.txt', 'r') as f:
    for line in f:
        review_test.append(line.strip())

In [40]:
# After stripping the whitespace, the reviews still contain punctuation and HTML line-break tags (<br />), which need to be cleaned
# Import the regular expression package
import re

# Raw strings pass the backslashes to the regex engine unchanged
replace_with_nospace = re.compile(r"[.;:!'?,\"()\[\]]")
replace_with_space = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")
def pre_process(reviews):
    # Lowercase and drop punctuation, then turn <br /> pairs, hyphens and slashes into spaces
    reviews = [replace_with_nospace.sub("", line.lower()) for line in reviews]
    reviews = [replace_with_space.sub(" ", line) for line in reviews]
    return reviews

In [46]:
review_train=pre_process(review_train)
review_train

['bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt',
 'homelessness or houselessness as george carlin stated has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school work or vote for the matter most
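
To make the two substitutions concrete, here is a minimal sketch that applies the same two patterns to one made-up review. The sample text and the helper name `clean_review` are assumptions for illustration only and are not part of the dataset or the notebook.

```python
import re

# The same two patterns as in the cell above, written as raw strings.
REPLACE_WITH_NOSPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def clean_review(text):
    """Lowercase, drop punctuation, then turn <br /> pairs, '-' and '/' into spaces."""
    text = REPLACE_WITH_NOSPACE.sub("", text.lower())
    return REPLACE_WITH_SPACE.sub(" ", text)

# A made-up review, not taken from the IMDB files.
sample = "It isn't far-fetched!<br /><br />A classic, 10/10."
cleaned = clean_review(sample)
print(cleaned)  # it isnt far fetched a classic 10 10
```

Note that the first pattern deletes apostrophes, which is why contractions such as "isn't" show up as "isnt" in the cleaned reviews. The second pattern only matches two consecutive `<br />` tags, so a lone tag would not be removed cleanly.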

In [31]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nikhilshah/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [38]:
# Stopwords (he, she, is, etc.) carry little sentiment on their own, so we remove them
from nltk.corpus import stopwords
english_stopwords = set(stopwords.words('english'))  # a set makes the membership test fast
def remove_stopwords(corpus):
    return [' '.join([word for word in review.split() if word not in english_stopwords]) for review in corpus]

In [47]:
review_train=remove_stopwords(review_train)
review_train

['bromwell high cartoon comedy ran time programs school life teachers 35 years teaching profession lead believe bromwell highs satire much closer reality teachers scramble survive financially insightful students see right pathetic teachers pomp pettiness whole situation remind schools knew students saw episode student repeatedly tried burn school immediately recalled high classic line inspector im sack one teachers student welcome bromwell high expect many adults age think bromwell high far fetched pity isnt',
 'homelessness houselessness george carlin stated issue years never plan help street considered human everything going school work vote matter people think homeless lost cause worrying things racism war iraq pressuring kids succeed technology elections inflation worrying theyll next end streets given bet live streets month without luxuries home entertainment sets bathroom pictures wall computer everything treasure see like homeless goddard bolts lesson mel brooks directs stars bo
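
The filtering step can be sketched without NLTK on a toy example. The stopword set below is a small hand-picked assumption, not the NLTK list, and `remove_stopwords_toy` is a hypothetical name used only here.

```python
# A tiny, hand-picked stopword set used only for illustration;
# the notebook itself uses the full list from nltk.corpus.stopwords.
TOY_STOPWORDS = {"it", "is", "a", "the", "to", "at", "as", "not"}

def remove_stopwords_toy(corpus, stopwords=TOY_STOPWORDS):
    """Drop every whitespace-separated token that appears in `stopwords`."""
    return [" ".join(w for w in review.split() if w not in stopwords)
            for review in corpus]

reviews = ["the movie is not a classic", "it is a classic"]
filtered = remove_stopwords_toy(reviews)
print(filtered)  # ['movie classic', 'classic']
```

The first toy review loses its negation and ends up looking as positive as the second one. NLTK's English list also contains negations such as "not", so the same effect may occur in the real pipeline; keeping negations could be worth trying.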

In [43]:
# After cleaning, each word is reduced to its stem, for example 'schools' to 'school'. Porter stems need not be dictionary words: 'deciding' becomes 'decid'
from nltk.stem.porter import PorterStemmer
def get_stemming(corpus):
    stemmer= PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()])for review in corpus]

In [48]:
stemmed= get_stemming(review_train)
stemmed

['bromwel high cartoon comedi ran time program school life teacher 35 year teach profess lead believ bromwel high satir much closer realiti teacher scrambl surviv financi insight student see right pathet teacher pomp petti whole situat remind school knew student saw episod student repeatedli tri burn school immedi recal high classic line inspector im sack one teacher student welcom bromwel high expect mani adult age think bromwel high far fetch piti isnt',
 'homeless houseless georg carlin state issu year never plan help street consid human everyth go school work vote matter peopl think homeless lost caus worri thing racism war iraq pressur kid succeed technolog elect inflat worri theyll next end street given bet live street month without luxuri home entertain set bathroom pictur wall comput everyth treasur see like homeless goddard bolt lesson mel brook direct star bolt play rich man everyth world decid make bet sissi rival jefferi tambor see live street thirti day without luxuri bolt

In [54]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nikhilshah/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [55]:
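# As we can see, the stemmer doesn't give a very readable result because it chops words into non-words, e.g. 'scramble' becomes 'scrambl'
# So we lemmatize instead: the lemmatizer maps each word to its dictionary form (lemma) rather than cutting off the ending
# Called without a part-of-speech tag it treats every word as a noun, so verb forms such as 'stated' stay unchanged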
from nltk.stem import WordNetLemmatizer
def get_lemmating(corpus):
    lemmater= WordNetLemmatizer()
    return [' '.join([lemmater.lemmatize(word) for word in review.split()])for review in corpus]

In [56]:
lemmatized=get_lemmating(review_train)
lemmatized    

['bromwell high cartoon comedy ran time program school life teacher 35 year teaching profession lead believe bromwell high satire much closer reality teacher scramble survive financially insightful student see right pathetic teacher pomp pettiness whole situation remind school knew student saw episode student repeatedly tried burn school immediately recalled high classic line inspector im sack one teacher student welcome bromwell high expect many adult age think bromwell high far fetched pity isnt',
 'homelessness houselessness george carlin stated issue year never plan help street considered human everything going school work vote matter people think homeless lost cause worrying thing racism war iraq pressuring kid succeed technology election inflation worrying theyll next end street given bet live street month without luxury home entertainment set bathroom picture wall computer everything treasure see like homeless goddard bolt lesson mel brook directs star bolt play rich man everyth

In [61]:
# Pre-processing the test set
review_test= pre_process(review_test)
review_test=remove_stopwords(review_test)
lemmatized_test=get_lemmating(review_test)
lemmatized_test

['went saw movie last night coaxed friend mine ill admit reluctant see knew ashton kutcher able comedy wrong kutcher played character jake fischer well kevin costner played ben randall professionalism sign good movie toy emotion one exactly entire theater sold overcome laughter first half movie moved tear second half exiting theater saw many woman tear many full grown men well trying desperately let anyone see cry movie great suggest go see judge',
 'actor turned director bill paxton follows promising debut gothic horror frailty family friendly sport drama 1913 u open young american caddy rise humble background play bristish idol dubbed greatest game ever played im fan golf scrappy underdog sport flick dime dozen recently done grand effect miracle cinderella man film enthralling film start creative opening credit imagine disneyfied version animated opening credit hbos carnivale rome lumber along slowly first number hour action move u open thing pick well paxton nice job show knack effe

In [60]:
# Here we vectorize the whole corpus, i.e. we turn the text into a numeric matrix, because models can't handle string values
# We use TfidfVectorizer, which gives each word in a review a weight: its frequency in that review, scaled down for words that occur in many reviews
from sklearn.feature_extraction.text import TfidfVectorizer
vectorize= TfidfVectorizer()
vectorize.fit(lemmatized)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
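
To see what the TF-IDF weights look like, here is a small pure-Python sketch of the weighting that matches the settings printed above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). It is an illustration under assumptions, not scikit-learn's implementation: it splits on whitespace instead of using `token_pattern`, and the function name `tfidf_rows` and the two toy documents are made up.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

docs = ["great movie great cast", "boring movie"]  # toy documents
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In this toy corpus "movie" appears in both documents, so it gets the lowest IDF, while "great" is both frequent within its document and rare across documents and therefore gets the largest weight in the first row.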

In [62]:
x = vectorize.transform(lemmatized)
x_test = vectorize.transform(lemmatized_test)

In [64]:
# For classification we use a Linear Support Vector Classifier, a popular choice for high-dimensional data
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# In each file the first 12.5k reviews are positive and the last 12.5k negative, so the same labels serve both train and test
target = [1 if i < 12500 else 0 for i in range(25000)]

In [65]:
# We achieved a test accuracy of approx. 87% with LinearSVC
model = LinearSVC().fit(x, target)
print("Final Accuracy: %s" % accuracy_score(target, model.predict(x_test)))

Final Accuracy: 0.86684
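
The label list and the accuracy figure can be reproduced in miniature without scikit-learn. This is only a sketch: `make_labels` and `accuracy` are hypothetical helper names, the predictions are made up, and the ordering (positives first) is the same assumption the `target` list relies on.

```python
def make_labels(n_total, n_positive):
    """Label the first n_positive rows 1 (positive) and the rest 0 (negative)."""
    return [1 if i < n_positive else 0 for i in range(n_total)]

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

labels = make_labels(6, 3)        # [1, 1, 1, 0, 0, 0]
preds = [1, 1, 0, 0, 0, 1]        # made-up predictions
score = accuracy(labels, preds)   # 4 of 6 correct
print(labels, score)
```

Read the same way, an accuracy of 0.86684 on 25,000 test reviews corresponds to 21,671 correct predictions.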


In [72]:
# Another way of vectorizing is CountVectorizer. With binary=True it turns the corpus into a matrix whose values are 1 if a term occurs in a review and 0 otherwise; ngram_range=(1,3) uses single words, word pairs and word triples as terms
# Note that this run uses the stopword-filtered reviews, not the lemmatized ones
from sklearn.feature_extraction.text import CountVectorizer
Cvectorizer = CountVectorizer(binary=True, ngram_range=(1,3))
Cvectorizer.fit(review_train)
x = Cvectorizer.transform(review_train)
x_test = Cvectorizer.transform(review_test)

In [73]:
model = LinearSVC().fit(x, target)
print("Final Accuracy: %s" % accuracy_score(target, model.predict(x_test)))

Final Accuracy: 0.88712
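
What `binary=True` together with `ngram_range=(1,3)` produces can be sketched in plain Python. This is a simplified illustration under assumptions, not CountVectorizer itself: it splits on whitespace (the real default tokenizer also drops single-character tokens), and the names `ngrams` and `binary_features` as well as the two toy documents are made up.

```python
def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def binary_features(doc, vocab):
    """1 if the vocabulary term occurs in the document at least once, else 0."""
    present = set(ngrams(doc.split()))
    return [1 if term in present else 0 for term in vocab]

train = ["not good", "good movie good cast"]  # toy documents
vocab = sorted({g for doc in train for g in ngrams(doc.split())})
features = binary_features("good movie good cast", vocab)
print(vocab)
print(features)
```

The word "good" occurs twice in the second document but still gets a 1, and the bigram "not good" becomes a feature of its own. That may be one reason n-grams help here: a phrase can carry a different sentiment than its individual words.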




## Conclusion

Switching from TF-IDF on lemmatized text to binary 1-3-gram counts improved accuracy by about two percentage points (from 0.867 to 0.887). It could perhaps be improved further by trying other models such as tree-based models, Naive Bayes, or simple logistic regression. This project is just for personal learning, which is why I have only tried one model. I learnt a lot from Aaron Kub's article [Sentiment Analysis with Python](https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184), and most of this project is inspired by it. I would like to thank him for this very useful article.