# Proposal

* <h6>What is the problem you are attempting to solve?</h6>
    * Movie reviews are subjective attitudes, emotions and opinions of people and a sentiment extracts information from these reviews. This sentiment analysis will neatly categorize people's sentiment as positive, negative or neutral and provide a topic summary. 


* <h6>How is your solution valuable?</h6>
    * Movie rating doesn't capture the nuances of the opinions and no one wants to sit and read thousands of reviews. These insights can be used to business decisions such as direct marketing, recommendation system, future project directions.
    

* <h6>What is your data source and how will you access it?</h6>
	* IMDb movie reviews, using scrapy web scraper.
    

* <h6>What techniques from the course do you anticipate using?</h6>
	* Web scraping (scrapy): to get the data
	* Natural language processing (spaCy): to process the data and to create features 
	* word2vec (Continuous Bag of Words): converting words to vectors
	* Neural Network: to analyze the data


* <h6>What do you expect to be the biggest challenge you’ll face?</h6>
	* Web scraping is new to me and it seems you need some understanding of web development.
	* Feature creation: sentence negation, sarcasm, terseness, language ambiguity, slang, misspellings, and many others make this task very challenging.
	* Neural Network optimization: adjusting hyperparameters
    * Evaluating results: first stragey is to compare it to existing ratings.



Problem Description: 
* What is the problem that you will be investigating? Why is it interesting?
* Data: What data will you use? If you are collecting new datasets, how do you plan to collect them?
* Methodology/Algorithm: What method or algorithm are you proposing? If there are existing implementations, will you use them and how? How do you plan to improve or modify such implementations?
* Related Work: What reading will you examine to provide context and background?
* Evaluation Plan: How will you evaluate your results? Qualitatively, what kind of results do you expect (e.g. plots or figures)? Quantitatively, what kind of analysis will you use to evaluate and/or compare your results (e.g. what performance metrics or statistical tests)?

* What is the problem that you will be investigating? Why is it interesting?
<p style="font-family:Helvetica Neue;font-size:20px"> Movie reviews are subjective attitudes, emotions and opinions of people and a sentiment analysis extracts these information from the reviews. This sentiment analysis will neatly categorize people's sentiment as positive, negative or neutral and provide a topic summary.</p>

In [1]:
import pandas as pd
import numpy as np

# gensim modules
import gensim
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Word2Vec, Doc2Vec

#string maluplation
import re

# random
from random import shuffle

# Beautiful soup is the best way to remove html tags form paragraphs.
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
import spacy
nlp = spacy.load('en')

#plot liberiry
import matplotlib.pyplot as plt
%matplotlib inline

#classifiers
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

#Evaluater
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix



# Importing the data

<p style="font-family:Helvetica Neue;font-size:20px">This dataset contains 50,000 reviews split evenly into a 25k train and 25k test sets, and each review is in individual files. The overall distribution of labels is balanced (25k pos and 25k neg), and those were divided into two folders named pos and neg. This data was collected by  Christopher Potts in 2011 and I used all of it for training by model and collected my own data from IMBd, using scraper called scrapy, of a new movie for testing purposes.</p>

In [2]:
data_df = pd.read_csv("imdb_master.csv", encoding='ISO-8859-1')
print(data_df.head())
data = data_df['review']

   Unnamed: 0  type                                             review label  \
0           0  test  Once again Mr. Costner has dragged out a movie...   neg   
1           1  test  This is an example of why the majority of acti...   neg   
2           2  test  First of all I hate those moronic rappers, who...   neg   
3           3  test  Not even the Beatles could write songs everyon...   neg   
4           4  test  Brass pictures (movies is not a fitting word f...   neg   

          file  
0      0_2.txt  
1  10000_4.txt  
2  10001_1.txt  
3  10002_3.txt  
4  10003_3.txt  


# Cleaning the review text

<p style="font-family:Helvetica Neue;font-size:20px">Put all the data in a pandas data frame from the individual files and assign sentiment based on the folder the files are in.  Clean the data using Regular expression operations (re), beautiful soup, NLTK and python functions. Since we need to capture the contextual meaning of words and documents, when cleaning the data I need to be careful not to remove too much so as to lose the contextual meaning.</p>

In [6]:
# Function to convert a raw review to a string of words
# The input is a single string (a raw movie review), and 
# the output is a single string (a preprocessed movie review)
    
def review_to_words( raw_review ):
    #Beautiful soup is the best way to remove HTML tags form paragraphs.
    review_text = BeautifulSoup(raw_review, "lxml").get_text() 
    
    #Remove everything but the char after ^   
    letters_only = re.sub("[^a-zA-Z],.", " ", review_text) 
    
    #Convert to lower case, split into individual words
    words = letters_only.lower().split()
    
    #Join the words back into one string separated by space, and return the result.
    return(words)



#In Python, searching a set is much faster than searching a list, so convert the stop words to a set
#stops = set(stopwords.words("english"))                  
# Remove stop words (this may not be helpful for this project)
#meaningful_words = [w for w in words if not w in stops] 

# NLP processed each review
<p style="font-family:Helvetica Neue;font-size:20px">NLP tokenenizes tex, meaning it segments the text into individual words and annotations, and returns iterable processed Doc with all the information of the original text.</p>

<p style="font-family:Helvetica Neue;font-size:20px">Using this to creat list of review list</p>

In [9]:
#Cleaning the text data, apply review_to_words() function created above to raw review data
clean_review = []
for review in data:
    clean_review.append(review_to_words(review))
    


In [None]:
#parsing the review so to make each review a list of word to pass into Word2Vec
clean_review_vocab = []
for review in clean_review:
    review = [token.lemma_ for token in review]
    clean_review_vocab.append(review)

In [68]:
#list of words in the model
vect_df = pd.DataFrame()
vect_df['file'] = data_df['file']
vect_df['type'] = data_df['type']
vect_df['review'] = clean_review
vect_df['label'] = data_df['label']
sent_value = {'label': {'neg': 0, 'unsup': 2, 'pos': 1}}
vect_df.replace(sent_value, inplace=True)
vect_df.to_csv( "NLPprossReview.csv", index=False)
vect_df.head()

Unnamed: 0,file,type,review,label
0,0_2.txt,test,"[once, again, mr., costner, has, dragged, out,...",0
1,10000_4.txt,test,"[this, is, an, example, of, why, the, majority...",0
2,10001_1.txt,test,"[first, of, all, i, hate, those, moronic, rapp...",0
3,10002_3.txt,test,"[not, even, the, beatles, could, write, songs,...",0
4,10003_3.txt,test,"[brass, pictures, (movies, is, not, a, fitting...",0


In [107]:
df = pd.read_csv("NLPprossReview.csv")

# Word2Vec

* Methodology/Algorithm: What method or algorithm are you proposing? If there are existing implementations, will you use them and how? How do you plan to improve or modify such implementations?

<p style="font-family:Helvetica Neue;font-size:20px">The meat of this project is a shallow neural network called word2vec. Word2vec is a two-layer neural network that takes in text and outputs their vectors, so it is a vectorizer. The key difference between word2vec and the other vectorizer (e.g. tfidf, frequency, one-hot-encoder) is the word embading, which says somthing about the relation between the words probabilistically. Other vectorizers lose the ordering of the words and they also ignore semantics of the words because words' vectors are equidistance from each other. Word2Vec expects sentences and each sentnce is a list of words.</p>

In [13]:

model = Word2Vec(
    clean_review,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

#can save the model for continued training later or load and use later
model.save("word2vec.model")

In [14]:
#Examin the build model
vocab = model.wv.vocab.keys()

In [15]:
#load the saved model
from gensim.models import KeyedVectors
model = KeyedVectors.load("word2vec.model")

In [17]:
print(model.wv.most_similar('great'))

[('wonderful', 0.7530227899551392), ('terrific', 0.7318073511123657), ('fantastic', 0.7208240628242493), ('good', 0.6874746084213257), ('fine', 0.6351935863494873), ('brilliant', 0.6263442039489746), ('great,', 0.6158812642097473), ('superb', 0.6036423444747925), ('excellent', 0.582080602645874), ('top-notch', 0.5528312921524048)]


In [18]:
print(model.wv.doesnt_match("good bad awful terrible".split()))
print(model.wv.doesnt_match("awesome bad awful terrible".split()))
print(model.wv.doesnt_match("nice pleasant fine excellent".split()))
# Classic test
model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

good
awesome
pleasant


  """


[('princess', 0.4177146553993225)]

# Graph the result of Word2Vec

<p style="font-family:Helvetica Neue;font-size:20px">How good is this vector representation?
    <br> show this using graphs
    <br> Hey Zack I am having trouble graphing this so I will do it later
    <br> moving on to doc2vec</p>

# Build class to take the average of each review vector to represent that review

<p style="font-family:Helvetica Neue;font-size:20px">Now that we have vector representation of the words in our corpus, we need to use it to creat vector representation of the documents to pass it into the classifier neurol network. One way to do that is to take the average of the word vectors in the documents and use the averaged word vector to reprsent the documents.</p>

In [19]:
num_features = 300

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given
    # paragraph
    #
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    #
    nwords = 0
    # 
    # Index2word is a list that contains the names of the words in 
    # the model's vocabulary. Convert it to a set, for speed 
    index2word_set = set(model.wv.index2word)
    #
    # Loop over each word in the review and, if it is in the model's
    # vocaublary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1
            featureVec = np.add(featureVec,model.wv[word])
    # 
    # Divide the result by the number of words to get the average
    featureVec = np.divide(featureVec,nwords)
    return featureVec


def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    # 
    # Initialize a counter
    counter = 0
    # 
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32")
    # 
    # Loop through the reviews
    for review in reviews:
       #
       # Print a status message every 1000th review
       if counter%1000 == 0:
           print ("Review %d of %d" % (counter, len(reviews)))
       # 
       # Call the function (defined above) that makes average feature vectors
       reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)
       #
       # Increment the counter
       counter = counter + 1
    return reviewFeatureVecs

In [75]:
#Using the new dataframe segment out the train and test 
train_data = vect_df[vect_df.type == 'train']
train_data = train_data[train_data.label != 2]
test_data = vect_df[vect_df.type == 'test']
test_data = test_data[test_data.label != 2]
train_data.shape

(25000, 4)

# Average the vector for each review document and used the averaged vector for classification

In [77]:
#creat training and testing datasets

trainDataVecs = getAvgFeatureVecs(train_data['review'], model, num_features)
train_sent = train_data['label']

testDataVecs = getAvgFeatureVecs(test_data['review'], model, num_features)
test_sent = test_data['label']

Review 0 of 25000
Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 of 25000
Review 22000 of 25000
Review 23000 of 25000
Review 24000 of 25000
Review 0 of 25000
Review 1000 of 25000
Review 2000 of 25000
Review 3000 of 25000
Review 4000 of 25000
Review 5000 of 25000
Review 6000 of 25000
Review 7000 of 25000
Review 8000 of 25000
Review 9000 of 25000
Review 10000 of 25000
Review 11000 of 25000
Review 12000 of 25000
Review 13000 of 25000
Review 14000 of 25000
Review 15000 of 25000
Review 16000 of 25000
Review 17000 of 25000
Review 18000 of 25000
Review 19000 of 25000
Review 20000 of 25000
Review 21000 o

# MLP

In [78]:
# Establish and fit the model, with a triple, 1000 perceptron layer.
mlp_wv = MLPClassifier(hidden_layer_sizes=(1000,3))
mlp_wv.fit(trainDataVecs, train_sent)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000, 3), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

# Evaluate the classifer model

In [80]:
cross_val_score(mlp_wv, test_arrays, test_labels, cv=5)

0.84936


In [81]:
y_mlp_pred = mlp_wv.predict(testDataVecs)
pd.crosstab(test_sent, y_mlp_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,0,1,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,10407,2093,12500
1,1673,10827,12500
All,12080,12920,25000


# logistic regression

In [82]:
classifier_wv = LogisticRegression()
classifier_wv.fit(trainDataVecs, train_sent)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [83]:
cross_val_score(classifier_wv, test_arrays, test_labels, cv=5)

0.86728


In [85]:
y_logis_pred = classifier.predict(testDataVecs)
3confusion_mat = confusion_matrix(test_sent, y_logis_pred)
pd.crosstab(test_sent, y_logis_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Predicted,0,1,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,10838,1662,12500
1,1656,10844,12500
All,12494,12506,25000


In [117]:
confusion_mat = confusion_matrix(test_sent, y_logis_pred)

In [106]:
D2V_mlp_result = pd.DataFrame({'Actuallabel': test_sent,
                             'Predictedlabel': y_logis_pred,
                              'Review': test_data['review']})
testDataVecs_df = pd.DataFrame(testDataVecs)
D2V_mlp_result = pd.concat([D2V_mlp_result, testDataVecs_df],1)
D2V_mlp_result.to_csv('D2V_mlp_result', index=False)

# Doc2Vec
* How does Doc2Vec work?
Doc2Vec does essentially the same thing we did above, vectorizes the word and "aggregates" the words to represent the document.
* The differece is the way it aggregates the vectors, and how does Doc2Vec does that?
Doc2Vec treats all the documents as "labeled" word and does "mathameaticl trick" to represent that word as a vector. The label here will be the name of the file each review contains in.
* format
Doc2Vec like word to Word2Vec, expects sentences and each sentnce as a list of words, and each sentence is labled. To acomplish this we will be using LabeledSentence class (see code below)
* What is this mathamatical trick?
labeled documents goes into a neural network and creates word embedding just like the Word2Vec with word window for context and document embedding

In [86]:
class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
       self.labels_list = labels_list
       self.doc_list = doc_list
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc,tags=[self.labels_list[idx]])

In [87]:
tagged_data = LabeledLineSentence(clean_review, data_df['file'])

<p style="font-family:Helvetica Neue;font-size:20px">another method is to use doc2vec module
<br> build the model</p>

In [88]:
model2 = Doc2Vec(
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=1,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    #sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    vector_size=300,      # Word vector length.
    hs=1,           # Use hierarchical softmax.
    alpha = 0.0002, # decrease the learing rate
    min_alpha = 0.0002 # fix the learing rate, no dacay
    
)

model2.build_vocab(tagged_data)

<p style="font-family:Helvetica Neue;font-size:20px">Here we are training the model by controling the learning rate and iterating over the documents multiple times for better result[1]
<br>
<br>[1] https://rare-technologies.com/doc2vec-tutorial/ </p>

In [89]:
#iterating multiple times trians gives better result as claimed by rare-technologies
for epoch in range(10):
    print('iteration {0}'.format(epoch))
    model2.train(tagged_data, total_examples=model2.corpus_count, epochs=model2.epochs)

iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9


<p style="font-family:Helvetica Neue;font-size:20px"> not a great result, can be impoved with feature engenering, delet less of text, add more data, and add parts of speach.
<br>
<br>
traing and test data split with vectorization of the documents<p>

In [90]:
train_arrays = []
trian_labels = np.array(train_data['label'])
test_arrays = []
test_labels = np.array(test_data['label'])

for x in train_data['file']:
    train_arrays.append(model2[x])

for x in test_data['file']:
    test_arrays.append(model2[x])

In [91]:
# Import the model.
from sklearn.neural_network import MLPClassifier

# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,3))
mlp.fit(train_arrays, trian_labels)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000, 3), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [92]:
print(mlp.score(test_arrays, test_labels))

cross_val_score(mlp, test_arrays, test_labels, cv=5)

0.5


array([0.6402, 0.6294, 0.6256, 0.6088, 0.6048])

In [93]:
classifier = LogisticRegression()
classifier.fit(train_arrays, trian_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [94]:
y_pred = classifier.predict(test_arrays)

In [95]:

confusion_matrix(test_labels, y_pred)

array([[7917, 4583],
       [4623, 7877]], dtype=int64)

In [96]:
classifier.score(test_arrays, test_labels)

0.63176

In [97]:
cross_val_score(classifier, test_arrays, test_labels, cv=5)

array([0.6986, 0.6436, 0.6196, 0.5884, 0.5866])

### References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.