# IMDB Dataset and Sentiment Classification

The Large Movie Review Dataset contains 50,000 labeled reviews from IMDB, split evenly between positive and negative reviews. The authors considered only highly polarized reviews: a negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10. Neutral reviews are not included in the dataset. The dataset is divided into training and test sets, each containing 25,000 labeled reviews.

The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.

To get the dataset, in your terminal run the following commands:

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

gunzip aclImdb_v1.tar.gz

tar -xvf aclImdb_v1.tar


In [14]:
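# Import libraries
from fastai.nlp import *
import sklearn
# Imported explicitly so the cells below do not rely on the star import above
from sklearn.feature_extraction.text import CountVectorizer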

In [15]:
#set path 
PATH='aclImdb/'
names=['neg','pos']

In [16]:
!ls {PATH}train

labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [17]:
# Load the training and validation data (the test folder serves as the validation set)
trn,trn_y = texts_labels_from_folders(f'{PATH}train',names)
val,val_y = texts_labels_from_folders(f'{PATH}test',names)
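
For readers without the old fastai helper, the loading step can be approximated in plain Python. This is a minimal sketch, not fastai's actual implementation: `load_texts_labels` is a hypothetical name, and it assumes each class lives in its own subfolder of `.txt` files (as `neg/` and `pos/` do here), with the label being the index of the folder name in the list.

```python
from pathlib import Path
import tempfile

import numpy as np


def load_texts_labels(folder, class_names):
    """Read every .txt file under folder/<class_name>; label = index of the class name."""
    texts, labels = [], []
    for idx, name in enumerate(class_names):
        # sorted() keeps the file order deterministic across runs
        for fname in sorted((Path(folder) / name).glob('*.txt')):
            texts.append(fname.read_text(encoding='utf-8'))
            labels.append(idx)
    return texts, np.array(labels, dtype=np.int64)


# Tiny demonstration on a temporary directory that mimics the aclImdb layout
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [('neg', 'dull and slow'), ('pos', 'a delight')]:
        class_dir = Path(tmp) / name
        class_dir.mkdir()
        (class_dir / '0_1.txt').write_text(text, encoding='utf-8')
    demo_texts, demo_labels = load_texts_labels(tmp, ['neg', 'pos'])

print(demo_texts, demo_labels)
```

With `names=['neg','pos']`, negative reviews get label 0 and positive reviews get label 1.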

## Using Unigrams

In [18]:
# Convert reviews to bag-of-words features, using fastai's tokenize function as the tokenizer
vectorizer = CountVectorizer(tokenizer=tokenize)

In [19]:
# Transforming reviews to features
trn_term_doc = vectorizer.fit_transform(trn)
val_term_doc = vectorizer.transform(val)
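
To make the term-document matrix concrete, here is a small sketch on a three-document toy corpus. The documents and the `toy_` names are made up for illustration, and the example uses scikit-learn's default tokenizer rather than fastai's `tokenize`.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good movie", "bad movie", "good good plot"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse (documents x terms) count matrix
toy_term_doc = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort the terms by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)               # ['bad', 'good', 'movie', 'plot']
print(toy_term_doc.toarray())  # one row per document, one column per term
```

Each row is a review and each column a vocabulary term, so the third row holds a 2 in the `good` column. Calling `transform` on new text reuses the fitted vocabulary, which is why the validation set above is passed through `transform` rather than `fit_transform`.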

In [20]:
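# Check a slice of the vocabulary
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
vocab = vectorizer.get_feature_names()
vocab[2000:2010]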

['affinity',
 'affinité',
 'affirm',
 'affirmation',
 'affirmations',
 'affirmative',
 'affirmatively',
 'affirmed',
 'affirming',
 'affirms']

In [21]:
# renaming 
x=trn_term_doc
y=trn_y

In [22]:
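# Use logistic regression on the binarized features to make predictions
from sklearn.linear_model import LogisticRegression
# dual=True is only supported by the liblinear solver, so it is set explicitly
# (liblinear was the default solver in older scikit-learn releases)
lr = LogisticRegression(C=0.1, dual=True, solver='liblinear')
lr.fit(x.sign(), y)
preds = lr.predict(val_term_doc.sign())
# Accuracy on the validation set
(preds==val_y).mean()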

0.88404

## Using N-grams: Bigrams and Trigrams
I used unigrams in the previous approach, meaning each feature was a single word.
Now I add bigrams and trigrams (sequences of two and three consecutive tokens) as features to predict the sentiment of the review.

In [23]:
# Build unigram, bigram, and trigram features (ngram_range=(1,3)), capped at 800,000 features
vectorizer = CountVectorizer(ngram_range=(1,3),tokenizer=tokenize, max_features=800000)
trn_term_doc = vectorizer.fit_transform(trn)
val_term_doc = vectorizer.transform(val)
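
As a quick illustration of what `ngram_range=(1,3)` produces, the sketch below fits a vectorizer on a single made-up sentence. It uses scikit-learn's default tokenizer rather than fastai's `tokenize`, so punctuation tokens like the `!` in the real vocabulary would not appear here.

```python
from sklearn.feature_extraction.text import CountVectorizer

ngram_vectorizer = CountVectorizer(ngram_range=(1, 3))
ngram_vectorizer.fit(["not very good"])

# Every unigram, bigram, and trigram found in the sentence becomes a feature
ngram_vocab = sorted(ngram_vectorizer.vocabulary_)
print(ngram_vocab)
# ['good', 'not', 'not very', 'not very good', 'very', 'very good']
```

A unigram model sees `good` and `not` as unrelated features, while the trigram `not very good` captures the negation as a single feature. The vocabulary grows quickly with the n-gram order, which is one reason to cap it with `max_features`.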

In [24]:
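# Check a slice of the vocabulary
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
vocab = vectorizer.get_feature_names()
vocab[1000:1010]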

['! the bbc',
 '! the best',
 '! the camera',
 '! the cast',
 '! the character',
 '! the characters',
 '! the cinematography',
 '! the climax',
 '! the dialog',
 '! the direction']

In [25]:
# .sign() binarizes the counts: what matters here is whether a term occurs, not how many times
x = trn_term_doc.sign()
y = trn_y
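
The effect of `.sign()` on a count matrix can be checked on a small sparse matrix. This is a sketch with made-up counts: every nonzero count becomes 1 and zeros stay 0.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Two documents, three terms; entries are raw term counts
toy_counts = csr_matrix(np.array([[0, 3, 1],
                                  [2, 0, 0]]))

# sign() maps every positive count to 1 and leaves zeros untouched
toy_binary = toy_counts.sign()

print(toy_binary.toarray())
# [[0 1 1]
#  [1 0 0]]
```

The result is still a sparse matrix, so the operation stays cheap even with hundreds of thousands of n-gram features.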

In [26]:
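# Apply logistic regression to the binarized n-gram features
# dual=True is only supported by the liblinear solver, so it is set explicitly
lr = LogisticRegression(C=0.1, dual=True, solver='liblinear')
lr.fit(x, y)
preds = lr.predict(val_term_doc.sign())
# Accuracy on the validation set
(preds==val_y).mean()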

0.905