### Building A Machine Learning Model for Text Classification



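Opinion mining, also known as sentiment analysis, is a subfield of natural language processing (NLP). Here we classify movie reviews as positive or negative.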


####  Internet Movie Database (IMDb)

We use a dataset of IMDb movie reviews in which each review is labeled as positive (the movie was rated with more than 6 stars) or negative (the movie was rated with fewer than 5 stars).

This dataset was collected by Maas et al.:
https://www.aclweb.org/anthology/P11-1015/



In [22]:
# Imports
import tensorflow as tf
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')


In [29]:
#Load data
df = pd.read_csv('C:/Users/uknow/Desktop/movie_data.csv')
df.tail()

Unnamed: 0,review,sentiment
49964,This movie is terrible. It's about some no bra...,0
49965,"Well, what was fun... except for the fun part....",0
49966,By the time this film was released I had seen ...,0
49967,"Well, if you like pop/punk, punk, ska, and a t...",0
49968,Where this movie is faithful to Burroughs' vis...,1


In [54]:
# Print the first 80 characters of the review at index 49964
df.loc[49964, 'review'][:80]

'this movie is terrible it s about some no brain surfin dude that inherits some c'

### Cleaning text data

We clean the reviews using Python's regular expression library, re.

In [55]:
import re

In [56]:
def clean_text(text):
    text = re.sub(r'<[^>]*>', '', text)  # remove HTML markup
    # find emoticons before punctuation is stripped, otherwise none are left to match
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase the text and replace runs of non-word characters with a space
    text = re.sub(r'\W+', ' ', text.lower()).strip()
    # append the emoticons (without the '-' nose) at the end of the document string
    return ' '.join([text] + [e.replace('-', '') for e in emoticons])

clean_text(df.loc[49964, 'review'][:80])

'this movie is terrible it s about some no brain surfin dude that inherits some c'
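The review above contains no HTML or emoticons, so it does not exercise every step. The following is a minimal sketch on a made-up string, assuming emoticons are collected before non-word characters are stripped. The helper name `clean_text_demo` and the sample string are illustrative and are not part of the dataset.

```python
import re

def clean_text_demo(text):
    # Hypothetical helper for illustration; mirrors the cleaning steps described above.
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    # Collect emoticons first, while the punctuation is still present
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase and collapse runs of non-word characters into single spaces
    words = re.sub(r'\W+', ' ', text.lower()).strip()
    # Append the emoticons (with the '-' nose removed) at the end
    return ' '.join([words] + [e.replace('-', '') for e in emoticons])

sample = '</a>This :) is :( a test :-)!'  # made-up input
cleaned = clean_text_demo(sample)
print(cleaned)  # this is a test :) :( :)
```

Keeping the emoticons at the end of the string preserves a sentiment signal that would otherwise be lost when punctuation is removed.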

In [53]:
df["review"]= df["review"].apply(clean_text)
df.tail()

Unnamed: 0,review,sentiment
49964,this movie is terrible it s about some no brai...,0
49965,well what was fun except for the fun part it s...,0
49966,by the time this film was released i had seen ...,0
49967,well if you like pop punk punk ska and a tad b...,0
49968,where this movie is faithful to burroughs visi...,1


### Processing text data

- Tokenization: splitting the text into individual words on whitespace

In [102]:
# Split a document into individual tokens on whitespace
def tokenizer(text):
    return text.split()

tokenizer(df.loc[49964, 'review'][:80])

['this',
 'movie',
 'is',
 'terrible',
 'it',
 's',
 'about',
 'some',
 'no',
 'brain',
 'surfin',
 'dude',
 'that',
 'inherits',
 'some',
 'c']

 - Word stemming: reducing words to their root form, for example inherits to inherit. The stem is not always a real word (movie becomes movi).


In [64]:
# Porter stemmer from the NLTK library
from nltk.stem.porter import PorterStemmer
porter=PorterStemmer()

In [62]:
def word_stemming(text):
    return [porter.stem(w) for w in text.split()]


In [63]:
word_stemming(df.loc[49964, 'review'][:80])

['thi',
 'movi',
 'is',
 'terribl',
 'it',
 's',
 'about',
 'some',
 'no',
 'brain',
 'surfin',
 'dude',
 'that',
 'inherit',
 'some',
 'c']

 - Removing stop-words: very common words such as is, and, has that carry little information

In [68]:
# download the stop-words set
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\uknow\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [70]:
from nltk.corpus import stopwords
stop = stopwords.words('english')


In [71]:
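# Stop words are filtered after stemming here, so a stemmed stop word
# such as 'thi' (from 'this') no longer matches the list and is kept
[w for w in word_stemming(df.loc[49964, 'review'][:80]) if w not in stop]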

['thi', 'movi', 'terribl', 'brain', 'surfin', 'dude', 'inherit', 'c']

### Encoding text data


Text data must be converted into numerical form before it is passed to a machine learning algorithm.

- Bag of words 

In [94]:
# Raw term frequencies using CountVectorizer from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer()
bag = cvect.fit_transform(df['review'])

In [88]:
# Vocabulary of unique words from the entire review dataset:
# each unique word is mapped to an integer index, stored in a dictionary

cvect.vocabulary_

{'this': 93289,
 'movie': 62057,
 'is': 48287,
 'just': 50056,
 'crap': 21786,
 'even': 31795,
 'though': 93388,
 'the': 92869,
 'directors': 26236,
 'claim': 18165,
 'to': 94113,
 'be': 9255,
 'part': 68364,
 'of': 65709,
 'that': 92833,
 'oi': 65883,
 'culture': 22652,
 'it': 48450,
 'still': 88549,
 'very': 99773,
 'bad': 8055,
 'directorial': 26234,
 'debut': 23904,
 'topic': 94496,
 'itself': 48515,
 'interesting': 47615,
 'and': 4849,
 'accept': 2241,
 'acting': 2530,
 'due': 28551,
 'fact': 32888,
 'they': 93155,
 'are': 6095,
 'all': 3996,
 'amateurs': 4404,
 'never': 63904,
 'acted': 2524,
 'before': 9610,
 'but': 14273,
 'worst': 103432,
 'thing': 93220,
 'about': 2055,
 'film': 34319,
 'dialogs': 25703,
 'unexperienced': 97577,
 'naive': 63023,
 'directing': 26218,
 'there': 93068,
 'no': 64459,
 'timing': 93873,
 'at': 6985,
 'in': 46228,
 'felt': 33852,
 'like': 54397,
 'were': 101828,
 'so': 85965,
 'exited': 32298,
 'do': 27171,
 'their': 92944,
 'first': 34652,
 'featur

In [93]:
# Each row of the sparse feature matrix corresponds to one review.
# Each value in a row is the number of times the word at that vocabulary index occurs in the review.
bag

<49969x105112 sparse matrix of type '<class 'numpy.int64'>'
	with 6791470 stored elements in Compressed Sparse Row format>
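The full matrix is too large to inspect by eye, so here is a small sketch of the same bag-of-words encoding on three made-up sentences. The sentences and the names `toy_vect` and `toy_bag` are illustrative assumptions; only `CountVectorizer` and its default settings come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus for illustration (not from the IMDb dataset)
docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

toy_vect = CountVectorizer()
toy_bag = toy_vect.fit_transform(docs)

# Word -> column index mapping (indices follow alphabetical order of the words)
vocab = toy_vect.vocabulary_
print(sorted(vocab.items(), key=lambda kv: kv[1]))

# Dense view: one row per document, one column per vocabulary word
counts = toy_bag.toarray()
print(counts)
```

In the third row, the column for "is" holds 3 because that word occurs three times in the third sentence.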

 - Assessing word relevancy using TfidfTransformer or TfidfVectorizer
 
 Term frequency-inverse document frequency (tf-idf) downweights words that occur frequently across reviews and therefore carry little useful or discriminatory information.

In [97]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
tfidf.fit_transform(bag)

<49969x105112 sparse matrix of type '<class 'numpy.float64'>'
	with 6791470 stored elements in Compressed Sparse Row format>
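To make the downweighting concrete, the sketch below recomputes tf-idf by hand on a made-up toy corpus and compares it with `TfidfTransformer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency is ln((1 + n) / (1 + df)) + 1; the corpus and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up toy corpus for illustration (not from the IMDb dataset)
docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

counts_sparse = CountVectorizer().fit_transform(docs)
counts = counts_sparse.toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)           # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf (scikit-learn default)
raw = counts * idf                            # term frequency times idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalize each row

sk = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(manual, sk))
```

Words that appear in every document, such as "is" and "the", get the smallest idf value, so their share of each normalized row shrinks relative to rarer words.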

In [100]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf2 = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)


### Logistic Regression for text classification

Train a logistic regression model to classify the movie reviews into positive and negative reviews


-  50/50 Train/Test splits:

In [121]:
X = df["review"]
y = df['sentiment']

from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(24984,) (24984,)
(24985,) (24985,)


In [122]:
from sklearn.model_selection import KFold
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

 - Pipeline

In [130]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

logreg = Pipeline([('vect', tfidf2), ('clf', LogisticRegression(random_state=1))])


In [136]:
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
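The pipeline chains vectorization and classification so that raw strings can be passed directly to `fit` and `predict`. A minimal sketch on made-up sentences illustrates this; the toy data and the name `toy_pipe` are assumptions for illustration, and no particular prediction should be expected from so little data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data for illustration only (not from the IMDb dataset)
train_docs = [
    'great movie loved it',
    'wonderful acting great plot',
    'terrible movie hated it',
    'awful acting boring plot',
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

toy_pipe = Pipeline([
    ('vect', TfidfVectorizer()),                  # raw text -> tf-idf features
    ('clf', LogisticRegression(random_state=1)),  # features -> class label
])

toy_pipe.fit(train_docs, train_labels)
toy_pred = toy_pipe.predict(['great plot', 'boring movie'])
print(toy_pred)
```

Because the vectorizer is fitted inside the pipeline, its vocabulary is learned from the training documents only, which avoids leaking test data into the features during cross-validation.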

### Performance metrics

In [137]:
# Classification accuracy on the test dataset
print('Test Accuracy: %.3f' % logreg.score(X_test, y_test))

Test Accuracy: 0.892


In [151]:
# Accuracy on the training set vs. the test set
print(logreg.score(X_train, y_train), logreg.score(X_test, y_test))

0.9323567082933077 0.8916549929957974


In [133]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score

In [139]:
print(confusion_matrix(y_test, y_pred))

[[10942  1463]
 [ 1244 11336]]
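The confusion matrix can be turned into per-class metrics. The sketch below copies the four counts from the output above and derives accuracy, precision, and recall for the positive class, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, in label order 0, 1).

```python
import numpy as np

# Counts copied from the confusion matrix printed above
# (rows = true class, columns = predicted class, label order 0, 1)
cm = np.array([[10942, 1463],
               [1244, 11336]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all reviews classified correctly
precision = tp / (tp + fp)       # share of predicted positives that are truly positive
recall = tp / (tp + fn)          # share of true positives that were found

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}')
```

The accuracy derived this way agrees with the test accuracy of 0.892 reported by `logreg.score`.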


In [148]:
# Accuracy score on the test set, as a percentage
print(round(accuracy_score(y_test, y_pred),2)*100)

89.0


In [150]:
# CV accuracy:
# mean 5-fold cross-validation accuracy on the training set
CV = (cross_val_score(logreg, X_train, y_train, cv=5, n_jobs=-1, scoring = 'accuracy').mean())
CV

0.8861266684415959
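The call above uses `cv=5`, so the 10-fold `k_fold` splitter defined earlier is not used. As a sketch of how such a splitter could be passed instead, the example below runs on synthetic data from `make_classification`; the data and the names `X_toy`, `y_toy`, and `k_fold_demo` are assumptions for illustration and say nothing about the scores on the review data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data for illustration only (not the IMDb reviews)
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# A shuffled 10-fold splitter, configured like the k_fold object defined earlier
k_fold_demo = KFold(n_splits=10, shuffle=True, random_state=0)

# Passing the splitter as cv yields one accuracy score per fold
scores = cross_val_score(LogisticRegression(), X_toy, y_toy,
                         cv=k_fold_demo, scoring='accuracy')
print(scores.shape, scores.mean())
```

Passing an explicit splitter makes the fold assignment reproducible through its `random_state`, whereas an integer `cv` leaves the splitting strategy to scikit-learn's defaults.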

### We built a machine learning model that performs well

It predicts whether a movie review is positive or negative with about 89% accuracy on the held-out test set (88.6% mean cross-validation accuracy on the training set).