# Proposal

* <h6>What is the problem you are attempting to solve?</h6>
    * Movie reviews are subjective attitudes, emotions and opinions of people and a sentiment extracts information from these reviews. This sentiment analysis will neatly categorize people's sentiment as positive, negative or neutral and provide a topic summary. 


* <h6>How is your solution valuable?</h6>
    * Movie rating doesn't capture the nuances of the opinions and no one wants to sit and read thousands of reviews. These insights can be used to business decisions such as direct marketing, recommendation system, future project directions.
    

* <h6>What is your data source and how will you access it?</h6>
	* IMDb movie reviews, using scrapy web scraper.
    

* <h6>What techniques from the course do you anticipate using?</h6>
	* Web scraping (scrapy): to get the data
	* Natural language processing (spaCy): to process the data and to create features 
	* word2vec (Continuous Bag of Words): converting words to vectors
	* Neural Network: to analyze the data


* <h6>What do you expect to be the biggest challenge you’ll face?</h6>
	* Web scraping is new to me and it seems you need some understanding of web development.
	* Feature creation: sentence negation, sarcasm, terseness, language ambiguity, slang, misspellings, and many others make this task very challenging.
	* Neural Network optimization: adjusting hyperparameters
    * Evaluating results: first stragey is to compare it to existing ratings.



Problem Description: 
* What is the problem that you will be investigating? Why is it interesting?
* Data: What data will you use? If you are collecting new datasets, how do you plan to collect them?
* Methodology/Algorithm: What method or algorithm are you proposing? If there are existing implementations, will you use them and how? How do you plan to improve or modify such implementations?
* Related Work: What reading will you examine to provide context and background?
* Evaluation Plan: How will you evaluate your results? Qualitatively, what kind of results do you expect (e.g. plots or figures)? Quantitatively, what kind of analysis will you use to evaluate and/or compare your results (e.g. what performance metrics or statistical tests)?

* What is the problem that you will be investigating? Why is it interesting?
<p style="font-family:Helvetica Neue;font-size:20px"> Movie reviews are subjective attitudes, emotions and opinions of people and a sentiment extracts information from these reviews. This sentiment analysis will neatly categorize people's sentiment as positive, negative or neutral and provide a topic summary.</p>

In [17]:
import pandas as pd
import numpy as np

# gensim modules
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Word2Vec
from gensim.models import Doc2Vec

#string maluplation
import re

# numpy
import numpy as np

# random
from random import shuffle

# classifier
import sklearn
from sklearn.linear_model import LogisticRegression

# Beautiful soup is the best way to remove html tags form paragraphs.
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

<p style="font-family:Helvetica Neue;font-size:20px">This dataset contains 50,000 reviews split evenly into a 25k train and 25k test sets, and each review is in individual files. The overall distribution of labels is balanced (25k pos and 25k neg), and those were divided into two folders named pos and neg. This data was collected by  Christopher Potts in 2011 and I used all of it for training by model and collected my own data from IMBd, using scraper called scrapy, of a new movie for testing purposes.</p>

In [141]:
from os import listdir
from os.path import isfile, join

docLabels = []
docLabels1 = [f for f in listdir('aclImdb\\train\\neg') if f.endswith('.txt')]
docLabels2 = [f for f in listdir('aclImdb\\train\\pos') if f.endswith('.txt')]
docLabels3 = [f for f in listdir('aclImdb\\test\\neg') if f.endswith('.txt')]
docLabels4 = [f for f in listdir('aclImdb\\test\\pos') if f.endswith('.txt')]
docLabels = docLabels1 + docLabels2 + docLabels3 + docLabels4

In [145]:
data = []
for doc in docLabels:
    if isfile(join('aclImdb\\train\\neg\\', doc)):
        open_file = open('aclImdb\\train\\neg\\' + doc, 'r', encoding="utf8")
        open_file = open_file.read()
        data.append(open_file)
    elif isfile(join('aclImdb\\train\\pos\\', doc)):
        open_file = open('aclImdb\\train\\pos\\' + doc, 'r', encoding="utf8")
        open_file = open_file.read()
        data.append(open_file)
    elif isfile(join('aclImdb\\test\\neg\\', doc)):
        open_file = open('aclImdb\\test\\neg\\' + doc, 'r', encoding="utf8")
        open_file = open_file.read()
        data.append(open_file)
    else:
        open_file = open('aclImdb\\test\\pos\\' + doc, 'r', encoding="utf8")
        open_file = open_file.read()
        data.append(open_file)

<p style="font-family:Helvetica Neue;font-size:20px">Put all the data in a pandas data frame from the individual files and assign sentiment based on the folder the files are in.  Clean the data using Regular expression operations (re), beautiful soup, NLTK and python functions. Since we need to capture the contextual meaning of words and documents, when cleaning the data I need to be careful not to remove too much so as to lose the contextual meaning.</p>

In [146]:
# Function to convert a raw review to a string of words
# The input is a single string (a raw movie review), and 
# the output is a single string (a preprocessed movie review)
    
def review_to_words( raw_review ):
    #Beautiful soup is the best way to remove HTML tags form paragraphs.
    review_text = BeautifulSoup(raw_review, "lxml").get_text() 
    #
    #Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text) 
    #
    #Convert to lower case, split into individual words
    words = letters_only.lower().split()                             
    #
    #In Python, searching a set is much faster than searching a list, so convert the stop words to a set
    #stops = set(stopwords.words("english"))                  
    # Remove stop words (this may not be helpful for this project)
    #meaningful_words = [w for w in words if not w in stops]   
    #
    #Join the words back into one string separated by space, and return the result.
    return( " ".join( words )) 

In [147]:
#apply review_to_words() function created above to raw review data
clean_review = []
for x in data:
    clean_review.append(review_to_words(x))

In [148]:
class LabeledLineSentence(object):
    def __init__(self, doc_list, labels_list):
       self.labels_list = labels_list
       self.doc_list = doc_list
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(),tags=[self.labels_list[idx]])

In [149]:
tagged_data = LabeledLineSentence(clean_review, docLabels)

* Methodology/Algorithm: What method or algorithm are you proposing? If there are existing implementations, will you use them and how? How do you plan to improve or modify such implementations?

<p style="font-family:Helvetica Neue;font-size:20px">The meat of this project is a shallow neural network called word2vec. Word2vec is a two-layer neural network that takes in text and outputs their vectors, so it is a vectorizer. The key difference between word2vec and the other vectorizer (e.g. tfidf, frequency, one-hot-encoder) is the word embading, which says somthing about the relation between the words probabilistically. Other vectorizers lose the ordering of the words and they also ignore semantics of the words because words' vectors are equidistance from each other.</p>

In [None]:
model = word2vec.Word2Vec(
    list_of_reviews,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=10,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    size=300,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

#can save the model for continued training later or load and use later
model.save("word2vec.model")

In [8]:
model = Word2Vec.load("word2vec.model")

In [12]:
#Examin the build model
vocab = model.wv.vocab.keys()
vect_df = pd.DataFrame(model.wv.vectors)
vect_df['vocab_key'] = vocab
vect_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,291,292,293,294,295,296,297,298,299,vocab_key
0,0.674226,0.413999,-0.191157,0.412055,0.198682,0.122666,-0.745847,0.1224,-1.134042,0.178597,...,0.023496,-0.020488,-0.286641,-0.515498,0.454745,0.88541,0.347443,-0.049568,-0.011023,
1,0.328533,0.435401,0.239809,-0.855035,-0.299644,-0.536809,0.291407,-0.284222,-0.722225,1.159497,...,-0.344035,-0.564331,0.495891,0.631248,-0.59441,-0.922665,-0.512988,0.841829,0.158342,this
2,0.068361,0.402387,-0.402625,1.044695,0.42352,-0.200199,-0.787988,0.129094,-0.227085,0.336626,...,-0.020124,0.580071,-0.067423,-0.571375,0.442019,1.007197,0.748192,-0.028187,0.417967,possibly
3,-0.814146,-1.120861,-0.676685,1.929251,-0.184621,0.664005,-0.402437,-0.099938,-0.725896,0.273354,...,0.09068,0.344576,0.535061,-0.872072,0.987788,-0.049637,0.225583,0.685698,0.627497,bad
4,-0.393544,-0.488274,-0.941881,0.055469,0.310603,0.294286,-0.236678,0.35938,-0.201362,0.236543,...,0.318816,-0.167705,-0.071356,0.031614,0.319891,0.108595,-0.398131,0.372208,0.084882,-pron-


<p style="font-family:Helvetica Neue;font-size:20px">How good is this vector representation?
    <br> show this using graphs
    <br> Hey Zack I am having trouble graphing this so I will do it later
    <br> moving on to doc2vec</p>

In [None]:
from sklearn.manifold import TSNE

tsne = sklearn.manifold.TSNE(n_components = 0 , random_state = 0)
all_vector_matrix = vect_df.drop('vocab_key',1)
all_vector_matrix_2d = tsne.fit_transform(all_vector_matrix)

<p style="font-family:Helvetica Neue;font-size:20px">Now that we have vector representation of the words in our corpus, we need to use it to creat vector representation of the documents to pass it into the classifier neurol network. There are couple of previous methods for acomplish this, one of which I will implement here term-frequency-inverse-document-frequency (tfidf)</p>

<p style="font-family:Helvetica Neue;font-size:20px">another method is to use doc2vec module
<br> build the model</p>

In [150]:
model = Doc2Vec(
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=1,  # Minimum word count threshold.
    window=6,      # Number of words around target word to consider.
    #sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3 ,  # Penalize frequent words.
    vector_size=300,      # Word vector length.
    hs=1           # Use hierarchical softmax.
)

model.build_vocab(tagged_data)

<p style="font-family:Helvetica Neue;font-size:20px">Here we are training the model by controling the learning rate and iterating over the documents multiple times for better result[1]
<br>
<br>[1] https://rare-technologies.com/doc2vec-tutorial/ </p>

In [151]:
for epoch in range(20):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
iteration 10
iteration 11
iteration 12
iteration 13
iteration 14
iteration 15
iteration 16
iteration 17
iteration 18
iteration 19


<p style="font-family:Helvetica Neue;font-size:20px">doc2vec is build on top of word2vec so we get word vectors and can be tested.<p>

In [251]:
model.most_similar('future')

  """Entry point for launching an IPython kernel.
  if np.issubdtype(vec.dtype, np.int):


[('entirety', 0.25249195098876953),
 ('subsequent', 0.23595638573169708),
 ('dwivedi', 0.22563199698925018),
 ('esp', 0.2254733145236969),
 ('cheta', 0.22267746925354004),
 ('riefenstahl', 0.22252562642097473),
 ('anchorwoman', 0.22070996463298798),
 ('avantegardistic', 0.21919968724250793),
 ('davinci', 0.21898135542869568),
 ('halla', 0.2186593860387802)]

<p style="font-family:Helvetica Neue;font-size:20px"> not a great result, can be impoved with feature engenering, delet less of text, add more data, and add parts of speach.
<br>
<br>
traing and test data split with vectorization of the documents<p>

In [245]:
trian_arrays = numpy.zeros((25000, 300))
trian_labels = numpy.zeros(25000)

for i, file_name in enumerate(docLabels1 + docLabels2):
    if i < (12500):
        trian_arrays[i] = model[file_name]
        trian_labels[i] = 0
    elif i >= (12500):
        trian_arrays[i] = model[file_name]
        trian_labels[i] = 1


In [247]:
test_arrays = numpy.zeros((25000, 300))
test_labels = numpy.zeros(25000)

for i, file_name in enumerate(docLabels3 + docLabels4):
    if i < (12500):
        test_arrays[i] = model[file_name]
        test_labels[i] = 0
    elif i >= (12500):
        test_arrays[i] = model[file_name]
        test_labels[i] = 1

In [252]:
# Alright! We've done our prep, let's build the model.
# Neural networks are hugely computationally intensive.
# This may take several minutes to run.

# Import the model.
from sklearn.neural_network import MLPClassifier

# Establish and fit the model, with a single, 1000 perceptron layer.
mlp = MLPClassifier(hidden_layer_sizes=(1000,3))
mlp.fit(train_arrays, train_labels)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1000, 3), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [253]:
mlp.score(test_arrays, test_labels)

0.5

In [183]:
from sklearn.model_selection import cross_val_score
cross_val_score(mlp, test_arrays, test_labels, cv=5)

array([0.75444911, 0.7094    , 0.685     , 0.657     , 0.65193039])

In [249]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(train_arrays, trian_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [254]:
y_pred = classifier.predict(test_arrays)

In [256]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_labels, y_pred)

array([[12500,     0],
       [12500,     0]], dtype=int64)

In [250]:
classifier.score(test_arrays, test_labels)

0.5

In [241]:
cross_val_score(classifier, test_arrays, test_labels, cv=5)

array([0.793 , 0.7178, 0.7014, 0.6696, 0.6606])

### References

Potts, Christopher. 2011. On the negativity of negation. In Nan Li and
David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20,
636-659.