## IMDB Dataset Classification 

#### Credits and sources:
Andrew Maas (http://ai.stanford.edu/~amaas/data/sentiment/)

Aaron Kub via Medium (https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184)

This notebook uses the vectorizer and parameters that gave the best results in Aaron Kub's Medium article linked above.

In [1]:
import re

In [2]:
# Load data (one review per line). Source: Aaron Kub
train = []
with open('../aclImdb/movie_data/full_train.txt', 'r', encoding='utf-8') as f:
    for line in f:
        train.append(line.strip())

test = []
with open('../aclImdb/movie_data/full_test.txt', 'r', encoding='utf-8') as f:
    for line in f:
        test.append(line.strip())

print("Train samples: ", train[0])
print("Test samples: ", test[0])

Train samples:  Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!
Test samples:  I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton 

In [3]:
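# Preprocess data. Source: Aaron Kub
# Raw strings avoid invalid-escape warnings for sequences such as \s and \[.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")          # punctuation to delete
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")  # separators to turn into spaces

def preprocess(data):
    # Lowercase and drop punctuation, then replace line-break tags, hyphens and slashes with spaces.
    data = [REPLACE_NO_SPACE.sub("", line.lower()) for line in data]
    data = [REPLACE_WITH_SPACE.sub(" ", line) for line in data]

    return data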

train = preprocess(train)
test = preprocess(test)

print("Train samples: ", train[0])
print("Test samples: ", test[0])

Train samples:  bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt
Test samples:  i went and saw this movie last night after being coaxed to by a few friends of mine ill admit that i was reluctant to see it because from what i knew of ashton kutcher he was only able to do comedy i was wr
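
To see what the two patterns do in isolation, here is a minimal, self-contained sketch that applies the same regexes to a made-up sentence (the sample text and the helper name `preprocess_line` are illustrative assumptions, not part of the dataset or the original code). Punctuation is deleted outright, while doubled `<br /><br />` tags, hyphens and slashes are turned into spaces.

```python
import re

# Same patterns as in the preprocessing cell, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_line(line):
    # Hypothetical single-line helper mirroring preprocess() above.
    line = REPLACE_NO_SPACE.sub("", line.lower())
    return REPLACE_WITH_SPACE.sub(" ", line)

sample = "Great movie!<br /><br />Well-acted, isn't it?"
cleaned = preprocess_line(sample)
print(cleaned)  # great movie well acted isnt it
```

Note that only the doubled `<br /><br />` form is matched as a tag; a single `<br />` would only have its slash replaced by a space.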

In [4]:
# Remove English stop words. Source: Aaron Kub
# Requires the NLTK stop-word corpus: nltk.download('stopwords')
from nltk.corpus import stopwords

# A set makes the membership test below much faster than a list.
english_stop_words = set(stopwords.words('english'))

def stopwords_removal(corpus):
    removed_stop_words = []
    for review in corpus:
        removed_stop_words.append(
            ' '.join([word for word in review.split()
                      if word not in english_stop_words])
        )
    return removed_stop_words

no_stop_words = stopwords_removal(train)
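
As a small illustration of the filtering step, the sketch below runs the same idea on one sentence without NLTK. The tiny stop list and the name `remove_stop_words` are assumptions made for the example; the notebook itself relies on NLTK's full English list.

```python
# Hypothetical mini stop list; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "is", "a", "of", "it", "this"}

def remove_stop_words(review, stop_words=STOP_WORDS):
    # Keep only the whitespace-separated words that are not stop words.
    return " ".join(w for w in review.split() if w not in stop_words)

example = "this movie is a classic of the genre"
filtered = remove_stop_words(example)
print(filtered)  # movie classic genre
```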

In [5]:
# Stem data (the result is not used by the model below, which is fit on `train`). Source: Aaron Kub
def stem_text(corpus):
    from nltk.stem.porter import PorterStemmer
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

stemmed_reviews = stem_text(train)

In [6]:
# Lemmatize data (requires nltk.download('wordnet'); the result is not used by the model below). Source: Aaron Kub
def lemmatize_text(corpus):
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

lemmatized_reviews = lemmatize_text(train)

In [7]:
# Vectorize data with binary 1- to 3-gram features and tune the SVM parameter C. Source: Aaron Kub
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Labels: the first 12,500 reviews are positive (1), the remaining 12,500 negative (0).
target = [1 if i < 12500 else 0 for i in range(25000)]

stop_words = ['in', 'of', 'at', 'a', 'the']
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)
ngram_vectorizer.fit(train)
X = ngram_vectorizer.transform(train)
X_test = ngram_vectorizer.transform(test)

X_train, X_val, y_train, y_val = train_test_split(
    X, target, train_size=0.75
)

for c in [0.001, 0.005, 0.01, 0.05, 0.1]:
    svm = LinearSVC(C=c)
    svm.fit(X_train, y_train)
    print("Accuracy for C=%s: %s"
          % (c, accuracy_score(y_val, svm.predict(X_val))))

# Validation accuracies from an earlier run (the split is unseeded, so values vary between runs):
# Accuracy for C=0.001: 0.88784
# Accuracy for C=0.005: 0.89456
# Accuracy for C=0.01: 0.89376
# Accuracy for C=0.05: 0.89264
# Accuracy for C=0.1: 0.8928
    
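# Retrain on the full training set with C=0.01.
final = LinearSVC(C=0.01)
final.fit(X, target)
# `target` doubles as the test labels here: full_test.txt follows the same
# layout as the training file (12,500 positive reviews, then 12,500 negative).
print("Final Accuracy: %s"
      % accuracy_score(target, final.predict(X_test)))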

# Final Accuracy: 0.90064

Accuracy for C=0.001: 0.88976
Accuracy for C=0.005: 0.89408
Accuracy for C=0.01: 0.89504
Accuracy for C=0.05: 0.896
Accuracy for C=0.1: 0.896
Final Accuracy: 0.90024
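
To make the feature representation more concrete, here is a simplified sketch of what `binary=True` with `ngram_range=(1, 3)` and a stop-word list amounts to: each review is reduced to the set of 1-, 2- and 3-word sequences it contains, built after the stop words are removed. This is not scikit-learn's implementation. The function `binary_ngrams` is a hypothetical helper, and it splits on whitespace, whereas `CountVectorizer`'s default tokenizer also drops single-character tokens.

```python
def binary_ngrams(text, stop_words=(), max_n=3):
    """Return the set of 1..max_n word n-grams present in text."""
    # Stop words are removed first, so n-grams span the remaining tokens.
    tokens = [t for t in text.split() if t not in stop_words]
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

stop_words = ['in', 'of', 'at', 'a', 'the']
features = binary_ngrams("not a good movie", stop_words)
print(sorted(features))
# ['good', 'good movie', 'movie', 'not', 'not good', 'not good movie']
```

The bigram `not good` shows why n-grams can help sentiment models: a unigram model would see `good` only as a standalone word.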


In [8]:
# Try the model on a custom sentence (1 = positive, 0 = negative)
try_test = ngram_vectorizer.transform(["fuckin mad"])
final.predict(try_test)

array([1])

In [10]:
# Export the trained model to a file
import pickle

filename = 'model.ml'
with open(filename, 'wb') as outfile:
    pickle.dump(final, outfile)
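
A saved classifier can only score new text if that text is transformed by the same fitted vectorizer, so it may be worth pickling both objects together. The sketch below shows one possible way to do that. The helper names `save_bundle` and `load_bundle` and the file name `model_bundle.ml` are assumptions, and plain dictionaries stand in for `ngram_vectorizer` and `final` so the example runs on its own.

```python
import os
import pickle
import tempfile

def save_bundle(path, vectorizer, model):
    """Pickle the vectorizer and model together so they stay in sync."""
    with open(path, "wb") as f:
        pickle.dump({"vectorizer": vectorizer, "model": model}, f)

def load_bundle(path):
    """Load the pair back; only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    return bundle["vectorizer"], bundle["model"]

# Placeholder objects stand in for ngram_vectorizer and final.
path = os.path.join(tempfile.mkdtemp(), "model_bundle.ml")
save_bundle(path, {"ngram_range": (1, 3)}, {"C": 0.01})
vectorizer, model = load_bundle(path)
print(vectorizer, model)
```

In the notebook, the call would presumably be `save_bundle('model_bundle.ml', ngram_vectorizer, final)`, followed later by `vectorizer.transform([...])` and `model.predict(...)` on the loaded pair.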