# Bag of Words Meets Bags of Popcorn: TF-IDF

## 1. Introduction

For this notebook, 

In [1]:
# Install a Drive FUSE wrapper.
# https://github.com/astrada/google-drive-ocamlfuse

#!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
#!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
#!apt-get update -qq 2>&1 > /dev/null
#!apt-get -y install -qq google-drive-ocamlfuse fuse

In [2]:
#from google.colab import auth
#auth.authenticate_user()

In [3]:
#from oauth2client.client import GoogleCredentials
#creds = GoogleCredentials.get_application_default()
#import getpass
#!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
#vcode = getpass.getpass()
#!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

In [4]:
#!mkdir -p my_drive
#!google-drive-ocamlfuse my_drive

In [5]:
#!mkdir -p my_drive/BOW_kaggle

## 2. Read the Data

In [6]:
# Import libraries

import pandas as pd
import numpy as np

In [7]:
# Read the data 

X_train = pd.read_csv("labeledTrainData.tsv",quoting = 3, delimiter = "\t", header= 0)
X_test = pd.read_csv("testData.tsv", quoting = 3, delimiter = "\t", header = 0)

In [8]:
# Show only the first 600 characters of the first review.

X_train['review'][0][:600]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br />'

In [9]:
print('Training set dimension:',X_train.shape)
print('Test set dimension:',X_test.shape)

Training set dimension: (25000, 3)
Test set dimension: (25000, 2)


In [10]:
X_train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


## 3. Preprocess the text

In [11]:
from bs4 import BeautifulSoup
import re
import nltk

In [12]:
def prep(review):
    
    # Remove HTML tags.
    review = BeautifulSoup(review,'html.parser').get_text()
    
    # Remove non-letters
    review = re.sub("[^a-zA-Z]", " ", review)
    
    # Lower case
    review = review.lower()
    
    # Tokenize to each word.
    token = nltk.word_tokenize(review)
    
    # Stemming (create the stemmer once per review rather than once per token)
    stemmer = nltk.stem.SnowballStemmer('english')
    review = [stemmer.stem(w) for w in token]
    
    # Join the words back into one string separated by space, and return the result.
    return " ".join(review)
    

In [13]:
X_train['review'].iloc[:2].apply(prep).iloc[0]

'with all this stuff go down at the moment with mj i ve start listen to his music watch the odd documentari here and there watch the wiz and watch moonwalk again mayb i just want to get a certain insight into this guy who i thought was realli cool in the eighti just to mayb make up my mind whether he is guilti or innoc moonwalk is part biographi part featur film which i rememb go to see at the cinema when it was origin releas some of it has subtl messag about mj s feel toward the press and also the obvious messag of drug are bad m kay visual impress but of cours this is all about michael jackson so unless you remot like mj in anyway then you are go to hate this and find it bore some may call mj an egotist for consent to the make of this movi but mj and most of his fan would say that he made it for the fan which if true is realli nice of him the actual featur film bit when it final start is onli on for minut or so exclud the smooth crimin sequenc and joe pesci is convinc as a psychopath

In [14]:
# If the previous cell ran without problems, apply prep to all rows.
X_train['clean'] = X_train['review'].apply(prep)
X_test['clean'] = X_test['review'].apply(prep)

In [15]:
X_train['clean'].iloc[0]

'with all this stuff go down at the moment with mj i ve start listen to his music watch the odd documentari here and there watch the wiz and watch moonwalk again mayb i just want to get a certain insight into this guy who i thought was realli cool in the eighti just to mayb make up my mind whether he is guilti or innoc moonwalk is part biographi part featur film which i rememb go to see at the cinema when it was origin releas some of it has subtl messag about mj s feel toward the press and also the obvious messag of drug are bad m kay visual impress but of cours this is all about michael jackson so unless you remot like mj in anyway then you are go to hate this and find it bore some may call mj an egotist for consent to the make of this movi but mj and most of his fan would say that he made it for the fan which if true is realli nice of him the actual featur film bit when it final start is onli on for minut or so exclud the smooth crimin sequenc and joe pesci is convinc as a psychopath

In [16]:
print('Training dim:',X_train.shape, 'Test dim:', X_test.shape)

Training dim: (25000, 4) Test dim: (25000, 3)


## 4. TF-IDF

TF-IDF (Term Frequency - Inverse Document Frequency) is the product tf(d, t) × idf(t), where tf(d, t) is the frequency of term t in document d and idf(t) shrinks as t appears in more documents. Unlike CountVectorizer, which simply counts word occurrences, TF-IDF reduces the weight (importance) of words that appear in many documents, on the grounds that such words do little to distinguish one document from another. The resulting matrix has one row per document and one column per word, and each value is the weight computed as tf × idf.
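
As a small illustration of this weighting, here is a minimal sketch on a made-up three-document corpus (the documents and the names `docs`, `toy_tv`, `idf`, and `row` are assumptions for illustration, not part of the notebook's data). Note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so the values differ from a textbook tf × idf, but the ordering of the weights is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "movie" occurs in every document, "boring" in one.
docs = [
    "movie great great",
    "movie boring",
    "movie fine",
]

toy_tv = TfidfVectorizer()
weights = toy_tv.fit_transform(docs)

# idf_ holds one inverse-document-frequency value per vocabulary term,
# ordered by column index.
terms = sorted(toy_tv.vocabulary_, key=toy_tv.vocabulary_.get)
idf = dict(zip(terms, toy_tv.idf_))
print(idf)

# In the second document both words occur once, so tf is equal and only idf
# separates them: the rarer word "boring" gets the larger weight.
row = weights[1].toarray().ravel()
print(row[toy_tv.vocabulary_["boring"]] > row[toy_tv.vocabulary_["movie"]])
```

The word that appears in every document ("movie") ends up with the smallest idf, which is the down-weighting described above.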


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import words

In [18]:
# analyzer: whether features are built from words ('word') or from characters ('char').
# vocabulary: a fixed vocabulary to use instead of learning one from the input data.
# Note: when vocabulary is given, scikit-learn ignores min_df and max_features.
# The other parameters are self-explanatory and are covered in other notebooks.

tv = TfidfVectorizer(min_df = 3,
                    stop_words = 'english',
                    lowercase = True,
                    ngram_range = (1,3),
                    analyzer = 'word',
                    vocabulary = set(words.words()),
                    max_features = 100000)

In [19]:
# Fit the vectorizer on the training set only, then reuse it to transform the test set.
# (Wrong: tv.fit_transform(X_test['clean']) would learn new idf weights from the test data.)

train_tv = tv.fit_transform(X_train['clean'])
test_tv = tv.transform(X_test['clean'])
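
The fit/transform split above can be sketched on toy data. The texts and the names `toy_train`, `toy_test`, and `toy_tv` are made up for illustration, and the toy vectorizer learns its vocabulary from the data, unlike `tv` above, which is given a fixed word list. The point is only that `transform` reuses whatever was learned during `fit_transform`, whereas refitting on the test set would learn something different.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["good movie", "bad movie"]   # hypothetical training texts
toy_test = ["good plot"]                  # hypothetical test text

toy_tv = TfidfVectorizer()
toy_train_tv = toy_tv.fit_transform(toy_train)  # learns vocabulary and idf
toy_test_tv = toy_tv.transform(toy_test)        # reuses them unchanged

# Same number of columns, in the same order, for both matrices.
print(toy_train_tv.shape[1] == toy_test_tv.shape[1])

# Refitting on the test set would build a different vocabulary,
# so the columns would no longer line up with the trained model.
refit_vocab = TfidfVectorizer().fit(toy_test).vocabulary_
print(sorted(toy_tv.vocabulary_), sorted(refit_vocab))
```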

In [20]:
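# List the vocabulary (feature names) used by the vectorizer.
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.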

vocab = tv.get_feature_names()
print(vocab[:5])

['A', 'Aani', 'Aaron', 'Aaronic', 'Aaronical']


In [21]:
print("Vocabulary length:", len(vocab))

Vocabulary length: 235892
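
The vocabulary length above is larger than `max_features = 100000`, and the first feature names are capitalized even though `lowercase = True`. Both follow from passing a fixed `vocabulary`: scikit-learn then skips the `min_df` and `max_features` pruning, and it lowercases the documents but not the vocabulary entries. A minimal sketch with a made-up three-word vocabulary (the names `fixed_vocab`, `toy_tv`, and `toy_matrix` are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-word vocabulary with one capitalized entry.
fixed_vocab = ["Aaron", "aaron", "film"]

toy_tv = TfidfVectorizer(vocabulary=fixed_vocab,
                         lowercase=True,
                         min_df=2,         # ignored: vocabulary is fixed
                         max_features=1)   # ignored: vocabulary is fixed
toy_matrix = toy_tv.fit_transform(["Aaron aaron film", "Aaron film"])

# All three vocabulary entries are kept as columns.
print(toy_matrix.shape)

# Documents are lowercased before matching, so the capitalized entry
# "Aaron" never matches and its column stays at zero.
column_sums = np.asarray(toy_matrix.sum(axis=0)).ravel()
print(dict(zip(fixed_vocab, column_sums)))
```

If the same holds for the word list used here, capitalized entries such as 'Aaron' would contribute all-zero columns, and stemmed tokens such as 'movi' would only count when the stem itself happens to be in the word list.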


In [22]:
# Sum the TF-IDF weights of each term over all training documents.
dist = np.sum(train_tv, axis=0)
checking = pd.DataFrame(dist,columns = vocab)

In [23]:
print('Training dim:',train_tv.shape, 'Test dim:', test_tv.shape)

Training dim: (25000, 235892) Test dim: (25000, 235892)


## 5. Modeling <a id= 'modeling'></a>

In [24]:
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import GridSearchCV, StratifiedKFold, learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier

In [25]:
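# Without shuffle=True the split is deterministic and random_state has no effect
# (recent scikit-learn versions raise an error if it is set), so it is omitted here.
kfold = StratifiedKFold( n_splits = 5 )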

### 5.1 Support Vector Machine <a id= 'svm'></a>

In [26]:
# LinearSVC

sv = LinearSVC(random_state=2018)

param_grid = {
    'loss':['hinge'],
    'class_weight':['balanced'],
    'C': [0.3]
}

gs_sv = GridSearchCV(sv, param_grid = param_grid, verbose = 1, cv = kfold, n_jobs = -1, scoring = 'roc_auc')
gs_sv.fit(train_tv, X_train['sentiment'])
gs_sv_best = gs_sv.best_estimator_
print(gs_sv.best_params_)

# {'C': 0.1, 'class_weight': {1: 1}, 'penalty': 'l2','loss':'squared_hinge'} - 0.85408
# {'C': 0.3, 'class_weight': 'balanced', 'loss': 'hinge'} - 0.85456 (309/578)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.9s finished


{'C': 0.3, 'class_weight': 'balanced', 'loss': 'hinge'}
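
The cell above prints only `best_params_`. The cross-validated ROC AUC itself is available from the fitted search object. A minimal sketch on synthetic data (the data from `make_classification` and the names `X_toy`, `y_toy`, and `toy_gs` are stand-ins, not the review matrix):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the review matrix).
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

toy_gs = GridSearchCV(LinearSVC(random_state=0),
                      param_grid={'C': [0.1, 0.3, 1.0]},
                      cv=StratifiedKFold(n_splits=5),
                      scoring='roc_auc')
toy_gs.fit(X_toy, y_toy)

# Mean cross-validated ROC AUC of the best candidate.
print(toy_gs.best_params_, round(toy_gs.best_score_, 4))

# Per-candidate mean scores, in the order of cv_results_['params'].
for params, score in zip(toy_gs.cv_results_['params'],
                         toy_gs.cv_results_['mean_test_score']):
    print(params, round(score, 4))
```

Applied to the notebook's own search, `gs_sv.best_score_` would give the mean validation AUC for the single candidate in `param_grid`.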


In [27]:
submission1 = gs_sv.predict(test_tv)

### 5.2 Bernoulli Naive Bayes Classifier <a id= 'bnb'></a>

In [28]:
#bnb = BernoulliNB()
#gs_bnb = GridSearchCV(bnb, param_grid = {'alpha': [3.0],
#                                         'binarize': [0.1]}, verbose = 1, cv = kfold, n_jobs = -1, scoring = "roc_auc")
#gs_bnb.fit(train_tv, X_train['sentiment'])
#gs_bnb_best = gs_bnb.best_estimator_
#print(gs_bnb.best_params_)

# max_features = 100000, 552/578 - 0.80476
# alpha - 3.0000000000000004, binarize - 0.10000000000000001


In [29]:
#submission2 = gs_bnb.predict(test_tv)

### 5.3 Multi-layer Perceptron

In [30]:
MLP = MLPClassifier(random_state = 2018)

mlp_param_grid = {
    'hidden_layer_sizes':[(1,)],
    'activation':['tanh'],
    'solver':['sgd'],
    'alpha':[0.1], # if no good value is found here either, try the range 0.1 to 1
    'learning_rate':['constant'],
    'max_iter':[1000]
}


gsMLP = GridSearchCV(MLP, param_grid = mlp_param_grid, cv = kfold, scoring = 'roc_auc', n_jobs= -1, verbose = 1)
gsMLP.fit(train_tv,X_train['sentiment'])
print(gsMLP.best_params_)
mlp_best0 = gsMLP.best_estimator_

# {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': (1,), 'learning_rate': 'constant', 'max_iter': 1000, 'solver': 'sgd'} - 85.560(304/578)
# {'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': (5,), 'learning_rate': 'constant', 'max_iter': 1000, 'solver': 'sgd'} - 85.552
# {'activation': 'tanh', 'alpha': 0.07, 'hidden_layer_sizes': (5,), 'learning_rate': 'constant', 'max_iter': 1000, 'solver': 'sgd'} - 85.452

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed: 22.4min finished


{'activation': 'tanh', 'alpha': 0.1, 'hidden_layer_sizes': (1,), 'learning_rate': 'constant', 'max_iter': 1000, 'solver': 'sgd'}


In [31]:
submission3 = gsMLP.predict(test_tv)

### 5.4 Logistic Regression

In [32]:
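# dual=True is only supported by the liblinear solver, so select it explicitly.
lr = LogisticRegression(solver = 'liblinear', random_state = 2018)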

lr2_param = {
    'penalty':['l2'],
    'dual':[True],
    'C':[1],
    'class_weight':[{1:2}]
    }

lr_CV = GridSearchCV(lr, param_grid = [lr2_param], cv = kfold, scoring = 'roc_auc', n_jobs = -1, verbose = 1)
lr_CV.fit(train_tv, X_train['sentiment'])
print(lr_CV.best_params_)
logi_best = lr_CV.best_estimator_

# {'C': 1, 'class_weight': {1: 2}, 'dual': True, 'penalty': 'l2'} - 83.932

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    1.2s finished


{'C': 1, 'class_weight': {1: 2}, 'dual': True, 'penalty': 'l2'}


In [33]:
submission6 = lr_CV.predict(test_tv)

### 5.5 Ensemble

In [51]:
from sklearn.ensemble import VotingClassifier

votingC = VotingClassifier(estimators = [('mlp0',mlp_best0), #('mlp1',mlp_best1), ('mlp2',mlp_best2), 
                                         ('svm',gs_sv_best)], voting='hard',n_jobs=-1)

votingC = votingC.fit(train_tv,X_train['sentiment'])

submission = votingC.predict(test_tv)

# mlp + svm + logi, [3,2,1] - 85.560
# mlp + svm + logi, [2,2,1] - 85.452
# mlp + svm + logi, [0.41,0.39,0.2] - 85.452
# mlp + svm, no weight - 85.572

In [54]:
print('Training Score:',votingC.score(train_tv,X_train['sentiment']))

Training Score: 0.89812
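
One property of the ensemble above is worth keeping in mind: with `voting='hard'` and only two estimators, every disagreement is a tie. scikit-learn documents that ties are broken by the ascending sort order of the class labels, so in a 0/1 problem a tie resolves to 0. A minimal sketch with two stand-in voters that always disagree (the `DummyClassifier` voters and the names `X_toy`, `y_toy`, and `toy_vote` are assumptions for illustration, not the notebook's models):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier

# Tiny stand-in data (an assumption); only the labels matter here.
X_toy = np.zeros((4, 1))
y_toy = np.array([0, 1, 0, 1])

# Two hypothetical voters that always disagree.
always_one = DummyClassifier(strategy='constant', constant=1)
always_zero = DummyClassifier(strategy='constant', constant=0)

toy_vote = VotingClassifier(estimators=[('one', always_one),
                                        ('zero', always_zero)],
                            voting='hard')
toy_vote.fit(X_toy, y_toy)

# Every sample is a 1-1 tie, which resolves to the lowest label.
toy_pred = toy_vote.predict(X_toy)
print(toy_pred)
```

Adding a third estimator, or passing `weights`, is one way to avoid systematic ties of this kind.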


## 6. Submission <a id= 'submission'></a>

In [55]:
output = pd.DataFrame( data = {'id': X_test['id'], 'sentiment': submission })
output.to_csv('submission16.csv', index = False, quoting = 3)