# Predicting sentiment from product reviews

### Introduction:
The goal is to use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative [1]. The code includes:
<ol>
    <li><p>data cleaning</p></li>
    <li><p>feature engineering</p></li>
    <li><p>Train a logistic regression model to predict the sentiment of product reviews</p></li>
    <ul>
        <li>inspect the weights of a trained logistic regression model</li>
        <li>make a prediction (for class and probability) of sentiment for a new product review</li>
        <li>compute the accuracy of the model</li> 
    </ul>
    <li>compare different logistic regression models</li>
</ol>

(1)This is one of the assignments from the Coursera class: [Machine Learning: Classification](https://www.coursera.org/learn/ml-classification/home/welcome).

<ul>
    <li><p>Import the required modules</p></li>
    <li><p>create 2 functions:</p></li>
    <ul>
        <li>remove_ponctuations(): replace all ponctuations in the review by None.</li>
        <li>json_2_np(): the list of samples (in fact list of indexes) to use as training examples or as test examples were given in a json file. Note that there is apparently a buil-in function to do that (will update that soon)</li>
    <ul>
</ul>

In [54]:
import pandas as pd
import numpy as np
import string
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from math import exp

def remove_punctuation(text):
    return text.translate(None, string.punctuation)                      

def json_to_np(fileName):
    with open(fileName) as f:
        content = f.readlines()
    content = ''.join(content)
    content = content.translate(None,'[,]')
    return np.fromstring(content, dtype=int, sep=' ')

<ol>
    <li>Load amazon dataset (csv file) in a pandas DataFrame 'products' 
    The file has 3 columns: baby product name(type=text)/reviews(type=text)/rating (type=int(1-5))
    dataFile = r'E:\Data...' since the slashes are special characters, prefixing the string with a 'r' will prevent escape the whole string. 
    </li>
    
    <li><p>Perform text cleaning:</p></li>
    <ul>
        <li>use remove_ponctuations() on all reviews</li>
        <li>fill n/a values in the review column with empty strings (if applicable). The n/a values indicate empty reviews. Use pandas fillna()</li>
        <li>save all the clean reviews as another column in products</li>
    </ul>
</ol>


In [55]:
dataFile = r'E:\DataScientist\myNotebook\ML_classification (Uni.Washington)\amazon_baby.csv'
TrainFile = r'E:\DataScientist\myNotebook\ML_classification (Uni.Washington)\module-2-assignment-train-idx.json'
TestFile = r'E:\DataScientist\myNotebook\ML_classification (Uni.Washington)\module-2-assignment-test-idx.json'
products = pd.read_csv(dataFile, header=0) #[shape=(183531,3)].
products.fillna({'review': ''}, inplace=True)
#create a new col 'review_clean' = copy of col review but without 
#all punctuations in text: note that it removes also punctuation of I'd, would've, hadn't
products['review_clean'] = products['review'].apply(remove_punctuation)

###### EXTRACT SENTIMENT ########
products = #[166752,4]
#ignore all reviews with rating=3, bcz they tend to have neutral sentiment.
#products['sentiment'][(products.sentiment > 3)] = 1 thsi approach can generate some error warning
#change rating to \in {-1,1} : for rating<3 new_rating=-1, rating>3 new_rating=1
#use this token pattern  to keep single-letter words 
#First, learn vocabulary from the training data and assign columns of words
#Then, convert the training data into a sparse matrix train_matrix
#4.2.3.7. Limitations of the Bag of Words representation
#A collection of unigrams (what bag of words is) cannot capture phrases and multi-word expressions, effectively disregarding any word order dependence. Additionally, the bag of words model doesn’t account for potential misspellings or word derivations.
#N-grams to the rescue! Instead of building a simple collection of unigrams (n=1), one might prefer a collection of bigrams (n=2), where occurrences of pairs of consecutive words are counted.
#One might alternatively consider a collection of character n-grams, a representation resilient against misspellings and derivations.
#For example, let’s say we’re dealing with a corpus of two documents: ['words', 'wprds']. The second document contains a misspelling of the word ‘words’ . A simple bag of words representation would consider these two as very distinct documents, differing in both of the two possible features. A character 2-gram representation, however , would find the documents matching in 4 out of 8 features, which may help the preferred classifier decide better:
#The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
#There are three different positions that qualify as word boundaries:
#Before the first character in the string, if the first character is a word character.
#After the last character in the string, if the last character is a word character.
#Between two characters in the string, where one is a word character and the other is not a word character.
Buidl the word count vector for each review.
We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as bag-of-word features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, sci-kit learn and many other tools use sparse matrices to store a collection of word vectors.
Learn a vocabulary (set of all words) from the training data. only the words that show up in the training data will be considered for feature extraction. Compute the occurences of the words in each review and collect them into a row vector.
Build a sparse matrix where each row is the word count vector for the corresponding review. 
#You are not supposed to do fit_transform on your test data, but only transform. Otherwise, you will get different vectorization than the one used during training.

#The traceback holds the answer here: when you call X.toarray() at the end, it's converting a sparse matrix representation 
#to a dense representation. This means that instead of storing a constant amount of data for each word in each document, you're now 
#storing a value for all words over all documents. Use a small subset of data and then toarray to see the array

In [56]:

products = products[products['rating'] !=3] 
train_dataIdx = json_to_np(TrainFile) # [shape=(133416,)]
test_dataIdx = json_to_np(TestFile) #[shape=(33336,)]

products['sentiment'] = products['rating']
products.loc[products['sentiment'] < 3, 'sentiment'] = -1
products.loc[products['sentiment'] > 3, 'sentiment'] = 1
train_data = products.iloc[train_dataIdx]
test_data = products.iloc[test_dataIdx]

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

print '\n step: ****Generating Train_matrix and Test_Matrix****'
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.transform(test_data['review_clean'])


 step: ****Generating Train_matrix and Test_Matrix****


We also need to flatten y into a 1-D array, so that scikit-learn will properly understand it as the response variable.

We will now use logistic regression to create a sentiment classifier on the training data.
first create an instance of the logistic regression class.
Then call the method fit to train the classifier. this model should use the sparse word count matrix (train matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.
there should be over 100,000 coefficients in this sentiment_model. remembef that positive weights correspond to weights that cause positive sentiments, while negative weights correspond to negative sentiment. Calculate the number of positive (>=0 ) coefficients.

In [57]:
sentiment = np.ravel(train_data['sentiment']) # flatten y into a 1-D array
 
# instantiate a logistic regression model, and fit with X and y
print '\n step: Start Logistic Regression'
LogReg = LogisticRegression()
sentiment_model = LogReg.fit(train_matrix, sentiment)
weight_matrix = sentiment_model.coef_
weight_m, weight_n = weight_matrix.shape
neg_weight_count = (weight_matrix < 0).sum()
pos_weight_count = weight_n - neg_weight_count
print "\n Total Number of weights: ", weight_n ,"| Number of positive weights: ", pos_weight_count



 step: Start Logistic Regression

 Total Number of weights:  121712 | Number of positive weights:  85707


#10. We will now make a class prediction for the sample_test_data. The sentiment_model should predict +1 if the sentiment is positive and -1 if the sentiment is negative. The score (also called margin) is defined by:
\begin{align}
score^{(i)}= w^T x^{(i)}
\end{align}
where x^{(i)} is the features for data point $i$. For each row, the score/margin is in the range (+inf/-inf).  
#Now that a model is trained we can make predictiosns. Use a subset of test_data (take the 11th, 12th and 13th data points in the test data)
 
#11. Predicting sentiment
#These score can be used to make class predictions as follows : y=+1 if \Theta^T x > 0 and y=-1 if Theta^Tx <= 0 

\begin{align}
y^{(i)} = \left\{
    \begin{array}{rl}
        +1 & \mathrm{\ if \ } w^T x^{(i)}>0 \\
        -1 & \mathrm{\ if \ } w^T x^{(i)} \leq 0
    \end{array}
\right.
\end{align}

In [66]:
sample_test_data = test_data.iloc[10:13] 
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
class_predict_testData = np.zeros((3,1))
for index, myscore in enumerate(scores):
    if myscore > 0 : 
        class_predict_testData[index] = 1
    else:
        class_predict_testData[index] = -1

ValueError: X has 121712 features per sample; expecting 20

#Checkpoint: Make sure your class predictions match with the ones obtained from sentiment_model. The logistic regression classifier in scikit-learn comes with the predict function for this purpose:
#print '\n Predicted Label from sentiment_model for sample_test_data 11/12/13: ', model_predict_spleTest    

#12. Probability predictions
\begin{align}
P(y^{(i)}=(+1|x^{(i)},w) = \frac{1}{1+exp(-w^T x^{(i)})}
\end{align}

Find the most positive and negative review


In [59]:
model_predict_spleTest = sentiment_model.predict(sample_test_matrix)

prob_pos_class = np.zeros((3,1))
 
for index, myscore in enumerate(scores):
    prob_pos = 1/(1 + exp(-myscore) ) 
    prob_pos_class[index] = prob_pos
print '\n*** P(y=1|x,w) (probability of sample test being pos. class (+1)): \n ', prob_pos_class
       


*** P(y=1|x,w) (probability of sample test being pos. class (+1)): 
  [[  9.96300679e-01]
 [  4.17571851e-02]
 [  3.01815820e-05]]


In [60]:
#CheckPoint
#print 'predict_proba output', model_proba_spleTest
#you can define a custom function that takes a float value as its input and returns a formatted string:
#The f here means fixed-point format (not 'scientific'), and the .2 means two decimal places (you can read 
#more about string formatting here).
#float_formatter = lambda x: "%.2f" % x
#model_proba = np.set_printoptions(formatter={'float_kind':float_formatter})
#print '\n P(y=1|x,w) (Model based probability of sample test being positive class(-1|+1)): ', model_proba_spleTest
#model_proba_spleTest_arr = model_proba_spleTest_arr.reshape((model_proba_spleTest_NbrRows,1))
#print model_proba_spleTest_arr

In [61]:
model_proba_spleTest = sentiment_model.predict_proba(sample_test_matrix)
model_proba_spleTest_arr = model_proba_spleTest[:,1]
sample_test_data.loc[:,'probaPosRev'] = pd.Series(model_proba_spleTest_arr, index=sample_test_data.index)

#13. Examine full test dataset: test_data
#Sort the dataframe's rows by reports, in ascending order
##print '\n*** Products of Top 20 reviews: ', top20reviews.iloc[0:20]['name']
#14.
##print '\n*** Products of Bottom 20 reviews: ', bottom20reviews.iloc[0:20]['name']

#15 Accuracy classifier on the train_data. Examine full train dataset: train_data:

\begin{align}
accuracy = \frac{\mathrm{\# \  correctly \  classified \ examples}}{\mathrm{\# \ total \ examples}}
\end{align}


In [67]:
model_proba_test = sentiment_model.predict_proba(test_matrix)
model_proba_test_arr = model_proba_test[:,1]
test_data.loc[:,'probaPosRev'] = pd.Series(model_proba_test_arr, index=test_data.index)

top20reviews = test_data.sort_index(by='probaPosRev', ascending=0)

bottom20reviews = test_data.sort_index(by='probaPosRev', ascending=1)

model_sentiment_Train = sentiment_model.predict(train_matrix)
train_data.loc[:,'modelSentiment'] = pd.Series(model_sentiment_Train, index=train_data.index)
train_data.loc[:,'modelSentimentVSMeasured'] = pd.Series(train_data['modelSentiment'] == train_data['sentiment'], \
                                                         index=train_data.index )
correctly_classified_TrainExples = float(len(train_data[train_data['modelSentimentVSMeasured']==True]))
total_TrainExamples = train_data['modelSentimentVSMeasured'].shape[0]
accuracy_sentiment_Train = (float(correctly_classified_TrainExples) / total_TrainExamples)

print '\n***The accuracy of the "Sentiment model" classifier on the training set is:  ', accuracy_sentiment_Train

ValueError: X has 121712 features per sample; expecting 20

#15 Accuracy classifier on the test_data. Examine full test dataset: test_data
#13. Examine full test dataset: test_data
#Find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review.
#model_proba_Test = sentiment_model.predict_proba(test_matrix)
#print model_proba_Test[1]
# check the accuracy on the training set
#sentiment_model.score(train_matrix, sentiment)
 
#loc works on labels in the index.
#iloc works on the positions in the index (so it only takes integers).
#ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.
#It's important to note some subtleties that can make ix slightly tricky to use:
 
#if the index is of integer type, ix will only use label-based indexing and not fall back to position-based indexing. If the label is not in the index, an error is raised.
#if the index does not contain only integers, then given an integer, ix will immediately use position-based indexing rather than label-based indexing. If however ix is given another type (e.g. a string), it can use label-based indexing.
#Compute a new set
   
#17-Train logistic regression
#We also need to flatten y into a 1-D array, so that scikit-learn will properly understand it as the response variable.

In [63]:
model_sentiment_Test = sentiment_model.predict(test_matrix)
test_data.loc[:,'modelSentiment'] = pd.Series(model_sentiment_Test, index=test_data.index)
test_data.loc[:,'modelSentimentVSMeasured'] = pd.Series((test_data['modelSentiment'] == test_data['sentiment']), \
                                                        index=test_data.index )
correctly_classified_TestExples = float(len(test_data[test_data['modelSentimentVSMeasured']==True]))
total_TestExamples = float(test_data['modelSentimentVSMeasured'].shape[0])
accuracy_testModelSentiment = (correctly_classified_TestExples/total_TestExamples)
print 'The accuracy of the sentiment_model classifier on the test data is: ', accuracy_testModelSentiment

significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 'well', 'able', 'car', \
                     'broke', 'less', 'even', 'waste', 'disappointed', 'work', 'product', 'money', 'would', 'return']

vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) #limit of 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.fit_transform(test_data['review_clean'])

simple_model = LogReg.fit(train_matrix_word_subset, sentiment)

The accuracy of the sentiment_model classifier on the test data is:  0.932325413967


## Weights
Build a table to store (word, coefficient) pairs. Consider the coefficients of the Simple Model, how many of the 20 coefficients (corresponding to the 20 significant words) are positive for the simple model. (Make sure to exclude the intercept term)

## Comparing models
   #accuracy of the sentiment model on the train_data


On both, the training set and the test set, we compare the accuracy of the sentiment model vs that of the simple_Model 


In [64]:
simple_model_coef_table = pd.DataFrame({'word':significant_words, 'coefficient': simple_model.coef_.flatten()})
smallestWeights = simple_model_coef_table.sort_index(by='coefficient', ascending=0)
   
weight_m, weight_n = simple_model.coef_.shape
neg_weight_count = (simple_model.coef_ < 0).sum()
pos_weight_count = weight_n - neg_weight_count
print "\n***Total Number of weights of the Simple Model that are positive: ", pos_weight_count

model_simple_Train = simple_model.predict(train_matrix_word_subset)
train_data.loc[:,'simpleModel'] = pd.Series(model_simple_Train, index=train_data.index)
train_data.loc[:,'SimpleModelVSMeasured'] = pd.Series((train_data['simpleModel'] == train_data['sentiment']), \
                                                      index=train_data.index) 
pos_classified_TrainExples_sm = float(train_data['SimpleModelVSMeasured'].sum())
accuracy_simpleModel_Train = pos_classified_TrainExples_sm / total_TrainExamples
print 'Accuracy of the Simple model on the training set: ', accuracy_simpleModel_Train , ' |Sentiment_Model: ', \
accuracy_sentiment_Train

model_simple_Test = simple_model.predict(test_matrix_word_subset)
test_data.loc[:,'simpleModel'] = pd.Series(model_simple_Test, index=test_data.index)
test_data['SimpleModelVSMeasured'] = np.array(test_data['simpleModel'] == test_data['sentiment'] )
pos_classified_TestExples_sm = float(test_data['SimpleModelVSMeasured'].sum())
accuracy_simpleModel_Test = (pos_classified_TestExples_sm / total_TestExamples)
print 'Accuracy of the Simple model on the test set : ', accuracy_simpleModel_Test, '| Sentiment_Model: ', \
accuracy_testModelSentiment



***Total Number of weights of the Simple Model that are positive:  10
Accuracy of the Simple model on the training set:  0.866822570007  |Sentiment_Model:  0.967627570906
Accuracy of the Simple model on the test set :  0.869360451164 | Sentiment_Model:  0.932325413967


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


### Baseline Majority Class Prediction
The majority classifier is a very simple model, which predicts the majority output label in training data for all results.
This can be used as a reference model for comparison with the classifier model. Let's say I have a training data set of size 100. 60 are positives (+1) and 40 are negatives (-1). Now since positives are majority, the model will predict positive for any input.
#Let's assume we have a test data set of size 50. 10 are positives, and 40 negatives. Since your model predicts positive for all of them, it is correct only 10 times. Your accuracy is then 10/50 = 0.2
#If you have a test set of size 50 with 32 positives and 18 negatives, your accuracy will be 32/50 = 0.64
#I hope this helps.
<ol>
    <li>Find the majority of the class in training data ('1' or '-1'), i.e the class with the largest count.<li>
</ol>


In [65]:
#occurence of values in column sentiment ordered by ascendant: this is a series.
getClassCnts_sentiment_train = pd.value_counts(train_data['sentiment']) 
majorClass_in_train = getClassCnts_sentiment_train.keys()[0]
countsMaj_test = test_data[test_data.sentiment == majorClass_in_train].sum()['sentiment']
majority_class_accuracy = float(countsMaj_test)/float(total_TestExamples)
print '\n*** Accuracy of the majority Class Classifier:', majority_class_accuracy


*** Accuracy of the majority Class Classifier: 0.842782577394
