# Latent Semantic Indexing/Analysis and Sentiment Classification

Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text.

LSA is an information retrieval technique that analyzes an unstructured collection of text to identify patterns in it and the relationships between its documents and terms.

**LSA itself is an unsupervised way of uncovering synonyms in a collection of documents.**

Goals of this notebook:

1. Use LSA to analyze relationships between a set of documents and the terms they contain.
2. Analyze and classify sentiment.

# Data

This is a collection of over 500,000 fine foods reviews from Amazon.

In [1]:
import pandas as pd

df = pd.read_csv('Reviews.csv')
df.head()

| | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


## TF-IDF

TF-IDF is an information retrieval technique that weighs a term's frequency (TF) against its inverse document frequency (IDF). Each word has its own TF and IDF score, and the product of the two is called the TF-IDF weight of that word.

Basically, a word gets a **higher** TF-IDF weight the more often it occurs in a given document and the **rarer** it is across the whole collection, and vice versa.
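
To make the weighting concrete, here is a minimal sketch that recomputes the weights by hand and compares them with `TfidfVectorizer`. It assumes scikit-learn's default settings (raw counts as TF, smoothed IDF, L2 row normalisation) and a made-up three-document corpus; none of the variable names come from this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: 'peanuts' is in every document, 'jumbo' in only one.
docs = ["jumbo peanuts", "salted peanuts", "peanuts peanuts again"]

counts = CountVectorizer().fit_transform(docs).toarray()   # raw term counts (TF)
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                        # documents containing each term

# Smoothed IDF as used by TfidfVectorizer(smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
weights = counts * idf
weights = weights / np.linalg.norm(weights, axis=1, keepdims=True)  # L2-normalise each row

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(weights, reference))
```

In the first toy document both words occur once, yet 'jumbo' ends up with the larger weight because it appears in fewer documents.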

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(df['Text'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [3]:
X = tfidf.transform(df['Text'])
df['Text'][1]

'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".'

For the review above, let's look at the tf-idf scores of a few of its words to get a better sense of how they behave.

In [4]:
X[1, tfidf.vocabulary_['peanuts']]

0.37995462060339136

In [5]:
X[1, tfidf.vocabulary_['jumbo']]

0.530965343023095

In [6]:
X[1, tfidf.vocabulary_['error']]

0.2302711360436964

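'jumbo' has the highest tf-idf score of the three. 'jumbo' and 'peanuts' each occur twice in this review, so the higher score for 'jumbo' means it is the rarer word across the whole collection; 'error' occurs only once here, which contributes to its lower score. This is how the scores can be used to judge how important each term is to a document within a collection of documents.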

## Sentiment Classification

We will use the 'Score' column from the Reviews table. Sentiment will be binary: 0 or 1. Reviews with a score of 3 are dropped because they are neutral, scores 1 and 2 are mapped to negative sentiment (0), and scores 4 and 5 are mapped to positive sentiment (1).

In [7]:
import numpy as np

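df.dropna(inplace=True)
# Drop neutral reviews; the filtered frame must be assigned back to take effect.
# Note: the recorded outputs below total 568,411 rows, so they appear to predate this filter.
df = df[df['Score'] != 3].copy()
# Scores 4-5 -> 1 (positive), scores 1-2 -> 0 (negative)
df['Positivity'] = np.where(df['Score'] > 3, 1, 0)
cols = ['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Time', 'Summary']
df.drop(cols, axis=1, inplace=True)
df.head()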

| | Text | Positivity |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 1 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 0 |
| 2 | This is a confection that has been around a fe... | 1 |
| 3 | If you are looking for the secret ingredient i... | 0 |
| 4 | Great taffy at a great price. There was a wid... | 1 |


This is the distribution of Positivity:

In [8]:
df.groupby('Positivity').size()

Positivity
0    124645
1    443766
dtype: int64

## Train/Test Split

Now, let us set up the model to classify sentiment.

In [9]:
from sklearn.model_selection import train_test_split

X = df.Text
y = df.Positivity
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

In [10]:
print("Training set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive".format(
    len(X_train), (y_train == 0).mean() * 100, (y_train == 1).mean() * 100))

Training set has total 426308 entries with 21.91% negative, 78.09% positive


In [11]:
print("Test set has total {0} entries with {1:.2f}% negative, {2:.2f}% positive".format(
    len(X_test), (y_test == 0).mean() * 100, (y_test == 1).mean() * 100))

Test set has total 142103 entries with 21.99% negative, 78.01% positive


The classes are imbalanced, which can weaken the model's predictive power on the minority (negative) class.

The ratio of negative to positive instances is 22:78.

One tactic for working around this is to use decision-tree-based algorithms. We will use the Random Forest classifier to learn from the imbalanced data and set class_weight="balanced".
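
For intuition about what the balanced setting does, the sketch below computes the per-class weights for a hypothetical label vector with the same 22:78 split. scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * np.bincount(y))`; `y_toy` is an illustrative stand-in for `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with the same 22:78 negative-to-positive ratio.
y_toy = np.array([0] * 22 + [1] * 78)

# Manual version of the "balanced" heuristic
manual = len(y_toy) / (2 * np.bincount(y_toy))

# The same weights via scikit-learn's helper
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

print(dict(zip([0, 1], weights.round(3))))
```

The minority class receives the larger weight, so mistakes on negative reviews cost more during training.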

First, we will define a function to print out the accuracy score.

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [13]:
def accuracy_summary(pipeline, X_train, y_train, X_test, y_test):
    sentiment_fit = pipeline.fit(X_train, y_train)
    y_pred = sentiment_fit.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("accuracy score: {0:.2f}%".format(accuracy*100))
    return accuracy
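
Because the classes are imbalanced, the accuracy printed by `accuracy_summary` is worth comparing against a trivial baseline. The sketch below uses hypothetical labels with the same 22:78 split and a "model" that always predicts positive; balanced accuracy is shown only as one possible complementary metric, not something this notebook computes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 22 + [1] * 78)   # hypothetical 22:78 label split
y_pred = np.ones_like(y_true)            # always predict "positive"

print(accuracy_score(y_true, y_pred))            # 0.78
print(balanced_accuracy_score(y_true, y_pred))   # 0.5
```

Any accuracy near 78% on this data is therefore no better than ignoring the review text entirely.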

## Selecting the Number of Features

Sentiment analysis, like most NLP problems, usually benefits from a large number of features. We will try 10,000, 20,000 and 30,000 features and print the accuracy score associated with each setting.
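
The number of features is controlled through the vectorizer's `max_features` parameter, which keeps only the terms with the highest total count across the corpus. A minimal sketch on a made-up three-document corpus (`docs` and `cv_toy` are illustrative names):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: 'good' occurs 3 times, 'tea' twice, the rest once.
docs = ["good good tea", "good coffee", "bad tea"]

for n in (1, 2, 3):
    cv_toy = CountVectorizer(max_features=n).fit(docs)
    print(n, sorted(cv_toy.vocabulary_))   # vocabulary kept for this setting
```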

In [14]:
cv = CountVectorizer()
rf = RandomForestClassifier(class_weight="balanced")
n_features = np.arange(10000,30001,10000)

def nfeature_accuracy_checker(vectorizer=cv, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=rf):
    result = []
    print(classifier)
    print("\n")
    for n in n_features:
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])
        print("Test result for {} features".format(n))
        nfeature_accuracy = accuracy_summary(checker_pipeline, X_train, y_train, X_test, y_test)
        result.append((n,nfeature_accuracy))
    return result

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [16]:
print("Result for trigram with stop words (Tfidf)\n")
feature_result_tgt = nfeature_accuracy_checker(vectorizer=tfidf,ngram_range=(1, 3))

Result for trigram with stop words (Tfidf)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators='warn', n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)


Test result for 10000 features
accuracy score: 90.55%
Test result for 20000 features


MemoryError: 
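
The output does not show which step ran out of memory, so the following is only one possible mitigation and not a confirmed fix: compress the TF-IDF features with a truncated SVD (the LSA step) and bound the size of the forest. All parameter values and the toy `texts` / `labels` are illustrative assumptions; in the notebook they would be `X_train` and `y_train`.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train.
texts = ["great taffy", "not as advertised",
         "good quality food", "tastes like cough medicine"] * 5
labels = [1, 0, 1, 0] * 5

lsa_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 3), max_features=10000)),
    # Project the sparse n-gram space onto a few dense latent dimensions.
    ("lsa", TruncatedSVD(n_components=3, random_state=0)),
    # Fixed tree count and depth keep the forest's memory footprint bounded.
    ("classifier", RandomForestClassifier(
        n_estimators=50, max_depth=20, class_weight="balanced", random_state=0)),
])
lsa_pipeline.fit(texts, labels)
print(lsa_pipeline.predict(["great quality taffy"]))
```

On the full review set the number of SVD components would need tuning, and accuracy may differ from the 90.55% reported for the uncompressed features.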

## Now, check the Classification Report

In [None]:
from sklearn.metrics import classification_report

# Refit with a fixed vocabulary size, then report per-class precision, recall and F1.
cv = CountVectorizer(max_features=30000, ngram_range=(1, 3))
pipeline = Pipeline([('vectorizer', cv), ('classifier', rf)])
sentiment_fit = pipeline.fit(X_train, y_train)
y_pred = sentiment_fit.predict(X_test)

print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))

## Chi-Squared Fit for Feature Selection

Feature selection is an important problem in machine learning.

We will calculate the chi-squared scores for all the features and visualize the top 20.

Here the features are terms (words and n-grams), and the two classes are positive and negative.

Given a feature X, we can use the chi-squared test to measure how strongly it is associated with a class, and therefore how useful it is for distinguishing that class.

In [None]:
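from sklearn.feature_selection import chi2
import matplotlib.pyplot as plt
%matplotlib inline

# Assumption: chiscore was never defined in the notebook, so it is computed here from a
# TF-IDF matrix of the training text, using the 10,000-feature setting that completed above.
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(X_train)
chiscore = chi2(X_tfidf, y_train)[0]

plt.figure(figsize=(12, 8))
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
scores = list(zip(tfidf.get_feature_names_out(), chiscore))
ranked = sorted(scores, key=lambda x: x[1])   # renamed so the chi2 import is not shadowed
topchi2 = list(zip(*ranked[-20:]))            # 20 highest-scoring terms
x = range(len(topchi2[1]))
labels = topchi2[0]
plt.barh(x, topchi2[1], align='center', alpha=0.5)
plt.plot(topchi2[1], x, '-o', markersize=5, alpha=0.8)
plt.yticks(x, labels)
plt.xlabel(r'$\chi^2$')
plt.show()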

The plot shows the terms with the highest chi-squared scores, which are the most influential in classifying sentiment. The top 5 are:
Based on these words, they are suspected to be indicative of _____ reviews.