[Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)
======

## Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. Sentiment is binary: reviews with an IMDB rating below 5 get a sentiment score of 0, and reviews with a rating of 7 or higher get a score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, another 50,000 IMDB reviews are provided without any rating labels.
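
The labeling rule can be sketched as a small function. The name `rating_to_sentiment` is hypothetical, and returning `None` for ratings of 5 and 6 is an assumption: the description only defines labels for ratings below 5 and for ratings of 7 or higher.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB rating (1-10) to the binary sentiment label described above."""
    if rating < 5:
        return 0  # negative review
    if rating >= 7:
        return 1  # positive review
    return None  # assumption: ratings 5 and 6 have no defined label


labels = [rating_to_sentiment(r) for r in (2, 4, 5, 7, 10)]
print(labels)
```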

## File descriptions

labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.

## Data fields

* id - Unique ID of each review
* sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
* review - Text of the review
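
As a rough illustration of this layout, the sketch below reads a tab-delimited file with a header row into pandas and separates the text from the label. The two rows are made up for the example and are not taken from the data set; the real file would be passed to `pd.read_csv` by path instead of through `io.StringIO`.

```python
import io

import pandas as pd

# Two made-up rows in the layout described above (tab-delimited, header row)
sample_tsv = (
    "id\tsentiment\treview\n"
    "0001_8\t1\tA made-up positive review.\n"
    "0002_3\t0\tA made-up negative review.\n"
)

df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

X = df["review"]     # input text
y = df["sentiment"]  # binary target: 1 = positive, 0 = negative
print(df.shape)
```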

## Objective
The objective is to predict **sentiment** (positive or negative) from the **review** text, so X is the **review** column and y is the **sentiment** column.

## 1. Load Dataset

In [9]:
!pip install -q xgboost==0.4a30
import xgboost


In [1]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [9]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
# Load dataset
data = pd.read_csv('/content/gdrive/My Drive/PROJECTS/CoderSchool_Fansipan/github_repo/fansipan_imdb_review/data/movie_review.csv', sep='\t', encoding='latin-1')

In [11]:
data.head()

Unnamed: 0,id,review,sentiment
0,5814_8,With all this stuff going down at the moment w...,1
1,2381_9,"\The Classic War of the Worlds\"" by Timothy Hi...",1
2,7759_3,The film starts with a manager (Nicholas Bell)...,0
3,3630_4,It must be assumed that those who praised this...,0
4,9495_8,Superbly trashy and wondrously unpretentious 8...,1


In [12]:
# Get the list of stop words in English
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')

stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# Remove special characters, html tags and "trash"
import re

def preprocessor(text):
  # remove HTML markup
  text = re.sub('<[^>]*>', '', text)
  
  # Save emoticons for later appending
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)

  # Remove any non-word character and append the emoticons,
  # removing the nose character for standardization. Convert to lower case.
  # Raw strings keep the regex escapes (\), \(, \W) from being read as string escapes.
  text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
  
  return text
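
To see what the cleaning step does, the sketch below restates `preprocessor` so it runs on its own and applies it to a made-up sentence. The sample text is an assumption for illustration; the output is split on whitespace because the function leaves two spaces between the words and the appended emoticons.

```python
import re


def preprocessor(text):
    # Same logic as the cell above, restated so this snippet is self-contained
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    return text


sample = "<br />This movie was GREAT :-) but the ending... :("
cleaned = preprocessor(sample)
print(cleaned.split())
# ['this', 'movie', 'was', 'great', 'but', 'the', 'ending', ':)', ':(']
```

The HTML tag and punctuation are gone, the text is lower-cased, and the emoticons survive at the end with the nose character removed.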

In [0]:
# Tokenizer and Stemming
# tokenizer: breaks each review down into individual words
# stemming: reduces a word to its root form

from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
  return [porter.stem(word) for word in text.split()]
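
A quick way to check the tokenizer is to run it on a short made-up sentence, as sketched below. The function is restated so the snippet runs on its own; `PorterStemmer` needs no extra NLTK downloads. Stems are not always dictionary words, so inspecting a few outputs is worthwhile.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()


def tokenizer_porter(text):
    # Split on whitespace, then reduce each token to its Porter stem
    return [porter.stem(word) for word in text.split()]


tokens = tokenizer_porter("runners like running and thus they run")
print(tokens)
```

Here "runners" is reduced to "runner" and "running" to "run", so different inflections of a word map to the same feature.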

## Load Trained Models

In [0]:
import pickle
import os

In [0]:
with open('/content/gdrive/My Drive/PROJECTS/CoderSchool_Fansipan/github_repo/fansipan_imdb_review/models/Model_SVM_default_imdb.pkl', 'rb') as f:
    svm = pickle.load(f)

In [0]:
X = data['review']
y = data['sentiment']

In [0]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

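# Build TF-IDF features: clean each review, stem its tokens, and drop English stop words
tfidf = TfidfVectorizer(stop_words=stop, tokenizer=tokenizer_porter, preprocessor=preprocessor)
X_tfidf = tfidf.fit_transform(X)

# Hold out 30% of the reviews for evaluation
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(X_tfidf, y, test_size=0.3, random_state=101)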
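
The cell above fits the vectorizer on every review before splitting, so the IDF weights are computed with the test reviews included. One common alternative, sketched here on a made-up toy corpus, is to split the raw text first and fit the vectorizer on the training portion only. The toy documents and variable names are assumptions for illustration, and a model trained on differently built features would need retraining before it could use these.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up toy corpus standing in for data['review'] and data['sentiment']
docs = [
    "a wonderful and moving film",
    "terrible plot and wooden acting",
    "great cast, great story",
    "boring, slow and far too long",
    "an instant classic",
    "a complete waste of time",
]
labels = [1, 0, 1, 0, 1, 0]

# Split the raw text first so the test documents stay unseen
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.33, random_state=101
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)  # learn vocabulary and IDF on train only
X_test = vectorizer.transform(docs_test)        # reuse the fitted vocabulary

print(X_train.shape, X_test.shape)
```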

In [0]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [0]:
svm_preds = svm.predict(X_test_tfidf)
svm_cm = confusion_matrix(y_test_tfidf, svm_preds)
svm_report = classification_report(y_test_tfidf, svm_preds)

In [24]:
print(svm_cm)
print(svm_report)

[[2993  415]
 [ 335 3007]]
              precision    recall  f1-score   support

           0       0.90      0.88      0.89      3408
           1       0.88      0.90      0.89      3342

    accuracy                           0.89      6750
   macro avg       0.89      0.89      0.89      6750
weighted avg       0.89      0.89      0.89      6750
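
The figures in the report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy, precision, and recall from the four counts printed above, using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
cm = [[2993, 415],
      [335, 3007]]

tn, fp = cm[0]  # true class 0: predicted 0, predicted 1
fn, tp = cm[1]  # true class 1: predicted 0, predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_neg = tn / (tn + fn)  # class 0
recall_neg = tn / (tn + fp)     # class 0
precision_pos = tp / (tp + fp)  # class 1
recall_pos = tp / (tp + fn)     # class 1

print(f"accuracy={accuracy:.2f}")                                   # accuracy=0.89
print(f"class 0: precision={precision_neg:.2f} recall={recall_neg:.2f}")  # 0.90, 0.88
print(f"class 1: precision={precision_pos:.2f} recall={recall_pos:.2f}")  # 0.88, 0.90
```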



### Data Resampling

In [0]:
n = 10000

# Keep the first n rows with all features; [:n, :n] would also drop every column past the first n
X_train_tfidf_new = X_train_tfidf[:n]
X_test_tfidf_new = X_test_tfidf[:n]

y_train_tfidf_new = y_train_tfidf[:n]
y_test_tfidf_new = y_test_tfidf[:n]
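
Because `train_test_split` shuffles by default, the first `n` rows are already an arbitrary sample. When the rows are not shuffled, an explicit random draw is safer. The sketch below draws row indices without replacement on small made-up NumPy arrays. The helper name `subsample_rows` is hypothetical; the same integer-array indexing works on SciPy CSR matrices, while a pandas Series would need `.iloc[idx]`.

```python
import numpy as np


def subsample_rows(X, y, n, seed=101):
    """Return n randomly chosen rows of X with their matching labels (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = min(n, X.shape[0])  # cannot draw more rows than exist
    idx = rng.choice(X.shape[0], size=n, replace=False)
    return X[idx], y[idx]


X_demo = np.arange(20).reshape(10, 2)  # 10 made-up rows, 2 features
y_demo = np.arange(10) % 2             # made-up binary labels

X_small, y_small = subsample_rows(X_demo, y_demo, n=4)
print(X_small.shape, y_small.shape)
```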

In [0]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

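val_rate = []
c_range = range(1, 100, 10)
for this_c in c_range:
    # Use a separate name so the loaded `svm` model is not overwritten
    svc = SVC(kernel='linear', C=this_c)
    # Validation error = 1 - mean 5-fold CV accuracy on the subsample built above
    val_score = 1 - cross_val_score(svc, X_train_tfidf_new, y_train_tfidf_new, cv=5).mean()
    val_rate.append(val_score)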

plt.figure(figsize=(15,7))
plt.plot(c_range, val_rate, color='orange', linestyle='dashed', marker='o',
         markerfacecolor='black', markersize=5, label='Validation Error')

plt.xticks(np.arange(c_range.start, c_range.stop, c_range.step), rotation=60)
plt.grid()
plt.legend()
plt.title('Validation Error vs. C Value')
plt.xlabel('C')
plt.ylabel('Validation Error')
plt.show()
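
The same search can also be expressed with `GridSearchCV`, which runs the cross-validation for each candidate and records the best one. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the TF-IDF features, swaps in `LinearSVC` for `SVC(kernel='linear')` (usually faster, but it optimizes a slightly different objective), and the grid values are placeholders that have not been tuned for the IMDB data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and sentiment labels
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=101)

c_grid = [0.01, 0.1, 1, 10]  # assumed log-spaced candidates
search = GridSearchCV(LinearSVC(max_iter=10000), {"C": c_grid}, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_c = search.best_params_["C"]
val_error = 1 - search.best_score_  # same "validation error" definition as the plot above
print(best_c, round(val_error, 3))
```

A log-spaced grid covers several orders of magnitude with few candidates, which matters because each fit of a linear SVM on the full feature matrix is expensive.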