# ML Project IMDB movie rating

## 20640117 Liu, Zhechen

### Step 1: Load data

**get_data** is a function with parameters **train_test** and **pos_neg**.  

**train_test** is either "train" or "test"; **pos_neg** is either "pos" or "neg".  

It returns a pandas.DataFrame with columns ['review', 'label'].  

The review column holds the content of each txt file, and the label is 1 for a positive review and -1 otherwise.

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv
import os
import warnings
warnings.filterwarnings("ignore")
def get_data(train_test, pos_neg):
    files = os.listdir("{}/{}".format(train_test, pos_neg))
    content = []
    for filename in files:
        with open("{}/{}/{}".format(train_test, pos_neg, filename), 'r', encoding = 'UTF-8') as f:
            content.append(f.read())
    content = pd.DataFrame(content, columns =['review'])
    if pos_neg == 'pos':
        content['label'] = 1
    else:
        content['label'] = -1
    return content
train_pos = get_data("train", "pos")
train_neg = get_data("train", "neg")
test_pos = get_data("test", "pos")
test_neg = get_data("test", "neg")

### Step 2: Prepare data for the model

**X**, **Y** hold the training part; **x_test**, **y_test** hold the test part.

In [2]:
train = pd.concat([train_pos, train_neg])
test = pd.concat([test_pos, test_neg])
input_cols = ['review']
output_cols = ['label']
X = train[input_cols]
Y = train[output_cols]
x_test = test[input_cols]
y_test = test[output_cols]
X.head()

| | review |
|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... |
| 1 | Homelessness (or Houselessness as George Carli... |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... |
| 3 | This is easily the most underrated film inn th... |
| 4 | This is not the typical Mel Brooks film. It wa... |


In [3]:
Y.head()

| | label |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |


In [4]:
y = Y.values.ravel().tolist()
x = X.values.ravel()
x_test = x_test.values.ravel()

### Step 3: Import tools and clean data

Import the **nltk** package; run **nltk.download()** the first time it is used.  

Each movie review contains a lot of noise, such as "....", the HTML tag `<br /><br />`, or words that carry no meaning.  

So I use **stopwords**, a **tokenizer**, and **PorterStemmer** to remove these and make the text cleaner.  

This is a common routine for data cleaning.

In [5]:
#import nltk
#nltk.download()
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
tokenizer = RegexpTokenizer(r'\w+')  # raw string avoids an invalid-escape warning
stopword = set(stopwords.words('english'))
ps = PorterStemmer()
def clean_review(review):
    
    review = review.lower()
    review = review.replace("<br /><br />"," ")
    
    #Tokenize
    tokens = tokenizer.tokenize(review)
    new_tokens = [token for token in tokens if token not in stopword]
    stemmed_tokens = [ps.stem(token) for token in new_tokens]
    
    cleaned_review = ' '.join(stemmed_tokens)
    
    return cleaned_review

In [6]:
x_clean = [clean_review(sent) for sent in x]
x_test_clean = [clean_review(sent) for sent in x_test]
x_clean[0]

'bromwel high cartoon comedi ran time program school life teacher 35 year teach profess lead believ bromwel high satir much closer realiti teacher scrambl surviv financi insight student see right pathet teacher pomp petti whole situat remind school knew student saw episod student repeatedli tri burn school immedi recal high classic line inspector sack one teacher student welcom bromwel high expect mani adult age think bromwel high far fetch piti'

### Step 4: Convert texts to token counts

Import the **CountVectorizer** class from scikit-learn.  

CountVectorizer is a common text feature extraction class. For each training text, it considers only how often each word appears in that text.

CountVectorizer converts a collection of text documents into a word-frequency matrix. Its **fit_transform** function learns the vocabulary and counts the occurrences of each word in each document.
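To make the counting concrete, here is a minimal pure-Python sketch of the same bag-of-words idea, written without scikit-learn. The helper name `count_matrix` and the two toy sentences are assumptions for illustration only. The real CountVectorizer also lowercases the text, applies its own token pattern, and returns a sparse matrix.

```python
from collections import Counter

def count_matrix(docs):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    tokenized = [doc.split() for doc in docs]
    # The vocabulary is every distinct token, sorted alphabetically.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One column per vocabulary term; absent terms count as 0.
        rows.append([counts[term] for term in vocab])
    return vocab, rows

# Hypothetical, already-cleaned toy reviews (not taken from the IMDB data).
toy_docs = ["good movi good plot", "bad movi"]
vocab, matrix = count_matrix(toy_docs)
print(vocab)
print(matrix)
```

Each row corresponds to one document and each column to one vocabulary term, which is the layout that the classifiers in Step 5 expect.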

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
x_vec = cv.fit_transform(x_clean)
xt_vec = cv.transform(x_test_clean)
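# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(cv.get_feature_names_out().tolist())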

['00', '000', '0000000000001', '00001', '00015', '001', '003830', '006', '007', '0079', '0080', '0083', '0093638', '00am', '00pm', '01', '01pm', '02', '020410', '029', '03', '04', '041', '05', '050', '06', '06th', '07', '08', '087', '089', '08th', '09', '0f', '0ne', '0r', '0s', '10', '100', '1000', '1000000', '10000000000000', '1000lb', '1001', '100b', '100k', '100m', '100min', '100mph', '100th', '100x', '100yard', '101', '101st', '102', '102nd', '103', '104', '1040', '1040a', '105', '1050', '105lb', '106', '106min', '107', '108', '109', '10am', '10line', '10mil', '10min', '10minut', '10p', '10pm', '10star', '10th', '10x', '10yr', '11', '110', '1100', '11001001', '1100ad', '111', '112', '1138', '114', '1146', '115', '116', '117', '11f', '11m', '11th', '12', '120', '1200', '1200f', '1201', '1202', '123', '12383499143743701', '125', '125m', '127', '128', '12a', '12hr', '12m', '12mm', '12th', '13', '130', '1300', '131', '1318', '132', '134', '135', '135m', '136', '137', '138', '139', '13k

In [18]:
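from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(stop_words='english')
# Stored under separate names so the count matrices x_vec / xt_vec are not overwritten
x_tfidf = tv.fit_transform(x_clean)
xt_tfidf = tv.transform(x_test_clean)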

### Step 5: Train and predict

Import the **MultinomialNB** and **BernoulliNB** classes for training and prediction.  

**MultinomialNB** implements the naive Bayes algorithm for multinomially distributed data, and is one of the two classic naive Bayes variants used in text classification (where the data are typically represented as word vector counts, although tf-idf vectors are also known to work well in practice). The distribution is parametrized by vectors $\theta_y = (\theta_{y1},..., \theta_{yn})$ for each class $y$, where $n$ is the number of features (in text classification, the size of the vocabulary) and $\theta_{yi}$ is the probability $P(x_i|y)$ of feature $i$ appearing in a sample belonging to class $y$.

The parameters $\theta_y$ are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

$$\hat{\theta}_{yi} = \frac{N_{yi}+\alpha}{N_{y}+\alpha n}$$

where $N_{yi}=\sum_{x \in T} x_i$ is the number of times feature $i$ appears in samples of class $y$ in the training set $T$, and $N_{y}=\sum_{i=1}^{n} N_{yi}$ is the total count of all features for class $y$.

The smoothing prior $\alpha \geq 0$ accounts for features not present in the learning samples and prevents zero probabilities in further computations. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
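As a small worked example of the smoothed estimate, the sketch below evaluates $(N_{yi}+\alpha)/(N_{y}+\alpha n)$ for one class in plain Python. The function name `smoothed_theta` and the counts are hypothetical, chosen only to show how a zero count gets a non-zero probability.

```python
def smoothed_theta(feature_counts, alpha=1.0):
    """Return (N_yi + alpha) / (N_y + alpha * n) for every feature i of one class."""
    n = len(feature_counts)        # vocabulary size n
    total = sum(feature_counts)    # N_y, the total feature count for the class
    return [(c + alpha) / (total + alpha * n) for c in feature_counts]

# Hypothetical per-class counts N_yi for a 4-word vocabulary.
counts_pos = [0, 3, 2, 1]

theta_laplace = smoothed_theta(counts_pos, alpha=1.0)   # Laplace smoothing
theta_lidstone = smoothed_theta(counts_pos, alpha=0.5)  # Lidstone smoothing
print(theta_laplace)
print(theta_lidstone)
```

With $\alpha = 1$ the first feature, which never occurs in this class, receives probability $1/10$ instead of 0, and in both cases the estimates still sum to 1.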

In [29]:
from sklearn.naive_bayes import MultinomialNB,BernoulliNB

mb = MultinomialNB()
mb.fit(x_vec,y)

mb_train_pre = mb.predict(x_vec)
mb_train_pre = mb_train_pre.tolist()

mb_test_pre = mb.predict(xt_vec)
mb_test_pre = mb_test_pre.tolist()

In [31]:
from sklearn.metrics import classification_report
print('************** train part ***************')
print(classification_report(y, mb_train_pre))
print('')
print('************** test part ***************')
print(classification_report(y_test, mb_test_pre))

************** train part ***************
              precision    recall  f1-score   support

          -1       0.88      0.92      0.90     12500
           1       0.92      0.88      0.90     12500

    accuracy                           0.90     25000
   macro avg       0.90      0.90      0.90     25000
weighted avg       0.90      0.90      0.90     25000


************** test part ***************
              precision    recall  f1-score   support

          -1       0.79      0.87      0.83     12500
           1       0.86      0.76      0.81     12500

    accuracy                           0.82     25000
   macro avg       0.82      0.82      0.82     25000
weighted avg       0.82      0.82      0.82     25000



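Depending on the characteristics of the data and the requirements of the analysis task, some classifiers may perform better than others. Maximum entropy has recently become very popular and can be applied to many machine learning tasks.  
Note that maximum entropy is in fact LogisticRegression, and this algorithm is much slower than naive Bayes.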
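The link between the two names can be checked numerically. The sketch below, which uses hypothetical scores and function names, compares a two-class maximum entropy (softmax) probability with the sigmoid used by binary logistic regression. It illustrates the equivalence only and is not how scikit-learn fits the model.

```python
import math

def maxent_prob_positive(score_pos, score_neg):
    """Two-class softmax: P(y = +1 | x) from one linear score per class."""
    m = max(score_pos, score_neg)  # subtract the max for numerical stability
    e_pos = math.exp(score_pos - m)
    e_neg = math.exp(score_neg - m)
    return e_pos / (e_pos + e_neg)

def logistic_prob_positive(score):
    """Binary logistic regression: sigmoid of a single linear score."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical class scores w_y . x for one review.
s_pos, s_neg = 1.2, -0.3

p_maxent = maxent_prob_positive(s_pos, s_neg)
p_logistic = logistic_prob_positive(s_pos - s_neg)
print(p_maxent, p_logistic)
```

The two probabilities agree because the two-class softmax depends only on the difference between the class scores, which is exactly the single score that the sigmoid takes.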

In [33]:
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(x_vec, y)
train_predicted = LR.predict(x_vec)
test_predicted = LR.predict(xt_vec)
print(classification_report(y, train_predicted))
print(classification_report(y_test, test_predicted))

              precision    recall  f1-score   support

          -1       0.99      0.99      0.99     12500
           1       0.99      0.99      0.99     12500

    accuracy                           0.99     25000
   macro avg       0.99      0.99      0.99     25000
weighted avg       0.99      0.99      0.99     25000

              precision    recall  f1-score   support

          -1       0.85      0.86      0.86     12500
           1       0.86      0.84      0.85     12500

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000

