# INFS 770 - Advanced Data Mining Application
## Assignment 4
### John Herbert

## T0: Import Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from gensim.models import LdaModel
from gensim import corpora
from gensim import matutils
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

## T1: Import Dataset

In [2]:
data = pd.read_csv('amazon_review_texts.csv')

In [3]:
data.head(5)

Unnamed: 0,pid,helpful,score,text,category
0,B000GAYQL8,0/0,5,GREAT WATCH AND GREAT LOOK. BIG FACE AND 4 DIF...,watch
1,B000IBNPDA,0/0,5,"Bought this as a Christmas gift, my boyfriend ...",watch
2,B000J2HA16,0/0,5,"I love this watch! Its sporty, without looking...",watch
3,B000BDIQPM,0/0,5,"Works great,looks nice,dont have to worry abou...",watch
4,B000GZTH9E,0/3,4,I need to change the watch wrist and I havent ...,watch


In [4]:
print(data["score"].value_counts().sort_index())
print(data["category"].value_counts().sort_index())

1     595
2     259
3     303
4     773
5    2070
Name: score, dtype: int64
automotive     1000
electronics    1000
software       1000
watch          1000
Name: category, dtype: int64


## T2: Tokenization

In [5]:
stopwords = set(nltk.corpus.stopwords.words("english"))

def before_token(documents):
    # convert words to lower case
    lower = map(str.lower, documents)
    # remove punctuation (keep only word tokens of two or more characters)
    punctuationless = list(map(lambda x: " ".join(re.findall(r'\b\w\w+\b', x)), lower))
    # remove tokens that consist only of digits
    return list(map(lambda x: re.sub(r'\b[0-9]+\b', '', x), punctuationless))

# initialize a stemmer
stemmer = nltk.stem.PorterStemmer()

# define a function that preprocesses a single document and returns a list of tokens
def preprocess(doc):
    tokens = []
    for token in doc.split():
        if token not in stopwords:
            tokens.append(stemmer.stem(token))
    return tokens
            
# preprocess all documents
processed = list(map(preprocess, before_token(data['text'])))

# calculate the token frequency
# the FreqDist function takes in a list of tokens and returns a dict containing unique tokens and their frequencies
fdist = nltk.FreqDist([token for doc in processed for token in doc])
fdist.tabulate(10)

  watch     use     one    work    time    like product   great     get   would 
   2553    2476    1795    1605    1420    1375    1336    1318    1309    1217 
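To make the cleaning step concrete, the following minimal sketch applies the same two regular expressions to a single made-up review. The `clean` helper and the sample sentence are illustrative assumptions, not part of the assignment code.

```python
import re

def clean(text):
    # Lowercase, keep only word tokens of two or more characters,
    # then blank out tokens made up entirely of digits.
    lowered = text.lower()
    words = " ".join(re.findall(r"\b\w\w+\b", lowered))
    return re.sub(r"\b[0-9]+\b", "", words)

# Hypothetical review text, used only for illustration.
sample = "I LOVE this watch! Its 4 dials, 100% great."
cleaned = clean(sample)
tokens = cleaned.split()
print(tokens)
```

Single-character tokens such as "I" and "4" are dropped by the first pattern, and "100" is removed by the second.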


## T3: Top 10 words

Several of the top 10 words would not work well in clustering and classification if the goal is to determine each cluster's topic:
* **use**
* **one**
* **like**
* **great**
* **get**
* **would**
* **product**

These words are general and are unlikely to be specific to any one cluster. In this case we know the four categories in advance, and none of these words is specific to any of them; they describe how a product is used or the reviewer's sentiment instead. Since every document is a review of a product from Amazon, **product** is too general and describes items in every category.

However, **watch** and **time** are good descriptors of the **watch** category.
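One way to check how category-specific a word is would be to look at how the documents containing it are spread across categories. The sketch below does this on a handful of made-up stemmed reviews; the `docs` list and the `top_category_share` helper are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical stemmed reviews as (category, tokens) pairs.
docs = [
    ("watch", ["watch", "band", "great", "use"]),
    ("watch", ["watch", "time", "like"]),
    ("software", ["softwar", "instal", "use", "great"]),
    ("software", ["program", "use", "like"]),
]

# For every term, count how many documents in each category contain it.
doc_freq = defaultdict(Counter)
for category, tokens in docs:
    for term in set(tokens):
        doc_freq[term][category] += 1

def top_category_share(term):
    """Fraction of the documents containing `term` that fall in its most common category."""
    counts = doc_freq[term]
    return max(counts.values()) / sum(counts.values())

for term in ["watch", "use", "great"]:
    print(term, round(top_category_share(term), 2))
```

A share close to 1 suggests a word that points at a single category, while a share near 1 divided by the number of categories suggests a general word.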

## T4: Reconstruct the documents

In [6]:
processed_doc = list(map(" ".join, processed))
# normalization is needed for clustering
vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english') # Default norm is 'l2'
X = vectorizer.fit_transform(processed_doc)

print('The number of features = %d' % X.shape[1])

The number of features = 10833
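The comment about normalization can be checked directly: with the default `norm='l2'`, every non-empty row of the TF-IDF matrix has unit length, so Euclidean distance in k-means depends only on the angle between documents. A minimal sketch on made-up documents (`toy_docs` is an assumption, not the review data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stemmed documents, used only for illustration.
toy_docs = [
    "watch band look great",
    "instal softwar program",
    "batteri charger power adapt",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)

# Euclidean length of each row of the sparse TF-IDF matrix.
row_norms = np.sqrt(np.asarray(toy_X.multiply(toy_X).sum(axis=1)).ravel())
print(row_norms)
```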


## T5: K-means Categorization

In [7]:
km = KMeans(n_clusters=4, max_iter=100, random_state=4)
km.fit(X)
km.transform(X)

# examine the representative words for each cluster
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()
for i in range(4):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

Cluster 0:
 use
 product
 work
 great
 good
 instal
 program
 like
 softwar
 time
Cluster 1:
 batteri
 charg
 charger
 power
 adapt
 appl
 camera
 canon
 work
 origin
Cluster 2:
 watch
 look
 band
 time
 great
 wear
 love
 like
 nice
 price
Cluster 3:
 bed
 air
 inflat
 comfort
 pump
 sleep
 mattress
 deflat
 airb
 easi


## T6: Cluster Examination and Explanation

Some of the clusters do a good job of explaining the 4 categories of Amazon reviews in the data set.
* **Cluster 0**: Appears to be focused on the 'software' category, as it has features like 'instal', 'program', and 'softwar'. The missing 'e' at the end of 'softwar' is due to the stemmer reducing the word to its stem, as shown below. However, 'time' also appears in this cluster, which may be related to the 'watch' category.
* **Cluster 1**: Appears to be focused on the 'electronics' category. Features such as 'batteri', 'charg', 'charger', 'power', 'canon', and 'adapt' suggest this is the category. However, if 'appl' was not in the top 10, I might mistake this for a camera category (if I did not know the categories ahead of time). This cluster could also be partially related to the 'automotive' category, as it talks about adapters and batteries.
* **Cluster 2**: Appears to be focused on the 'watch' category. There are features such as 'watch', 'band', 'time', and 'wear'. The other features in this cluster appear to be focused on the sentiment toward the products, not the category.
* **Cluster 3**: This cluster does an overall poor job of describing a category. It has features such as 'air', 'bed', 'inflat', 'sleep', and 'mattress', which would imply that it is about inflatable mattresses rather than 'automotive', the only remaining category.
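The reading of the clusters above is based on top terms only. A complementary check, sketched here on made-up labels, is to count how the true categories are distributed within each cluster. The `cluster_labels` and `categories` lists are hypothetical stand-ins for `km.labels_` and `data["category"]`.

```python
from collections import Counter

# Hypothetical cluster assignments and true categories for eight reviews.
cluster_labels = [0, 0, 1, 2, 2, 3, 0, 1]
categories = ["software", "software", "electronics", "watch", "watch",
              "automotive", "automotive", "electronics"]

# Count (cluster, category) pairs, i.e. a small contingency table.
table = Counter(zip(cluster_labels, categories))

# For each cluster, record its most common category and the cluster's purity.
summary = {}
for cluster in sorted(set(cluster_labels)):
    counts = {cat: n for (c, cat), n in table.items() if c == cluster}
    majority = max(counts, key=counts.get)
    summary[cluster] = (majority, counts[majority] / sum(counts.values()))
    print(cluster, summary[cluster])
```

With the real data, `pd.crosstab(km.labels_, data["category"])` would give the same table in one call.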

In [8]:
print('Stems of software and apple')
print(stemmer.stem("software"))
print(stemmer.stem("apple"))

Stems of software and apple
softwar
appl


## T7: Gensim LDA Model Topics

In [9]:
# convert the vectorized data to a gensim corpus object
word2id = dict((k, v) for k, v in vectorizer.vocabulary_.items())
id2word = dict((v, k) for k,v in vectorizer.vocabulary_.items())
d=corpora.Dictionary()
d.id2token = id2word
d.token2id = word2id
corpus = matutils.Sparse2Corpus(X, documents_columns=False)

# build the LDA model with 4 topics, one per review category
lda = LdaModel(corpus, num_topics=4,id2word=id2word, random_state=42, passes=30)

lda.show_topics()

[(0,
  '0.006*"use" + 0.006*"work" + 0.005*"product" + 0.005*"great" + 0.004*"good" + 0.004*"time" + 0.004*"batteri" + 0.004*"like" + 0.004*"easi" + 0.003*"instal"'),
 (1,
  '0.030*"watch" + 0.008*"band" + 0.006*"look" + 0.006*"wear" + 0.005*"love" + 0.004*"wrist" + 0.004*"face" + 0.004*"beauti" + 0.003*"nice" + 0.003*"great"'),
 (2,
  '0.002*"cartridg" + 0.001*"cannon" + 0.001*"printer" + 0.001*"dud" + 0.001*"matress" + 0.001*"pic" + 0.001*"deskjet" + 0.001*"bilstein" + 0.001*"sd950i" + 0.001*"bronco"'),
 (3,
  '0.003*"mop" + 0.001*"nissan" + 0.001*"la" + 0.001*"el" + 0.001*"en" + 0.001*"es" + 0.001*"gasket" + 0.001*"vw" + 0.001*"producto" + 0.001*"radioand"')]

## T8: LDA Examination and Explanation

Overall, these features do a poor job of explaining the categories:
* **Topic 0**: The category of this topic is unclear, as it has *batteri* and *instal*, which could belong to software, automotive, or electronics. It also has *time*, but alongside *instal* it is unclear whether watch would be the category.
* **Topic 1**: This topic is fairly clearly the watch category, as its top features are *watch*, *band*, *wrist*, and *face*, and the remaining features are neutral terms.
* **Topic 2**: This topic is ambiguous: *cannon*, *printer*, and *deskjet* suggest electronics, but *bronco* might suggest automotive.
* **Topic 3**: This topic appears to be about automotive, as it has *gasket*, *vw*, *nissan*, and *radioand*, which are components or makes of a vehicle. It also contains Spanish function words (*la*, *el*, *en*, *es*), which carry no category information.

Overall, I believe the clustering does a better job of categorizing the data. Its features make the category easy to decipher for 3 of the 4 clusters. The LDA does, however, do a good job of identifying the automotive category, although it tends to have more uninformative features in its topic lists than the clustering does.
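One possible contributor to the weak topics, stated here as an assumption rather than a finding, is that the LDA above was fit on TF-IDF weights, while the LDA model is defined over word counts. The sketch below shows the count-based variant. Note that it swaps in scikit-learn's `LatentDirichletAllocation` instead of gensim, and `toy_docs` is made-up data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical stemmed documents, used only for illustration.
toy_docs = [
    "watch band wrist face watch",
    "watch wear band look",
    "instal softwar program instal",
    "softwar program updat instal",
]

# Raw term counts instead of TF-IDF weights.
counts = CountVectorizer().fit_transform(toy_docs)

lda_counts = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda_counts.fit_transform(counts)

# Each row is one document's distribution over the two topics.
print(doc_topics.round(2))
```

Whether this would improve the topics on the review data is untested here; it is only meant to show the shape of the alternative.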

## T9: 5-fold CV SGD Classifier

In [10]:
# Join each document's tokens back into a single string to insert as a column in the data frame
# The processed data from the earlier task is reused for the SGD classifier
processed2 = [' '.join(tokens) for tokens in processed]
data['proc_text'] = processed2

In [11]:
# 5-fold cross validation
skf = StratifiedKFold(n_splits=5)
fold = 0
f1 = []
for train_index, test_index in skf.split(data["proc_text"], data["score"]):
#for train_index, test_index in skf:
    fold += 1
    print("Fold %d" % fold)
    # partition
    train_x, test_x = data["proc_text"].iloc[train_index], data["proc_text"].iloc[test_index]
    train_y, test_y = data["score"].iloc[train_index], data["score"].iloc[test_index]
    # vectorize
    vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',min_df=2) 
    # min_df=2 drops terms that appear in fewer than 2 training documents
    X = vectorizer.fit_transform(train_x)
    X_test = vectorizer.transform(test_x)
    # train model
    clf = SGDClassifier(random_state=fold) # Stochastic gradient descent SVM (default hinge loss)
    clf.fit(X, train_y)
    # predict
    pred_y = clf.predict(X_test)
    # classification results
    for line in metrics.classification_report(test_y, pred_y).split("\n"):
        print(line)
    f1.append(metrics.f1_score(test_y, pred_y, average='weighted'))

   
print("Average F1: %.2f" % np.mean(f1))
print('The number of features = %d' % X_test.shape[1]) 

Fold 1
              precision    recall  f1-score   support

           1       0.66      0.33      0.44       119
           2       0.27      0.06      0.10        51
           3       0.04      0.02      0.02        61
           4       0.27      0.17      0.21       155
           5       0.62      0.91      0.74       414

    accuracy                           0.56       800
   macro avg       0.37      0.30      0.30       800
weighted avg       0.49      0.56      0.49       800

Fold 2
              precision    recall  f1-score   support

           1       0.44      0.59      0.50       119
           2       0.12      0.06      0.08        52
           3       0.18      0.15      0.17        60
           4       0.27      0.23      0.25       155
           5       0.67      0.70      0.68       414

    accuracy                           0.51       800
   macro avg       0.34      0.35      0.34       800
weighted avg       0.49      0.51      0.50       800

Fold 3
 

## T10: Create New Variable: satisfaction & 5-fold CV SGD Classifier

In [12]:
# Creating a new variable 'satisfaction' where scores of 4 and 5 = 1, all else = 0
data['satisfaction'] = data['score'].apply(lambda x: 1 if x == 4 or x == 5 else 0 )

# 5-fold cross validation
fold = 0
f1 = []
for train_index, test_index in skf.split(data["proc_text"], data["satisfaction"]):
#for train_index, test_index in skf:
    fold += 1
    print("Fold %d" % fold)
    # partition
    train_x, test_x = data["proc_text"].iloc[train_index], data["proc_text"].iloc[test_index]
    train_y, test_y = data["satisfaction"].iloc[train_index], data["satisfaction"].iloc[test_index]
    # vectorize
    vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',min_df=2) 
    # min_df=2 drops terms that appear in fewer than 2 training documents
    X = vectorizer.fit_transform(train_x)
    X_test = vectorizer.transform(test_x)
    # train model
    clf = SGDClassifier(random_state=fold) # Stochastic gradient descent SVM (default hinge loss)
    clf.fit(X, train_y)
    # predict
    pred_y = clf.predict(X_test)
    # classification results
    for line in metrics.classification_report(test_y, pred_y).split("\n"):
        print(line)
    f1.append(metrics.f1_score(test_y, pred_y, average='weighted'))

   
print("Average F1: %.2f" % np.mean(f1))
print('The number of features = %d' % X_test.shape[1]) 

Fold 1
              precision    recall  f1-score   support

           0       0.84      0.23      0.35       231
           1       0.76      0.98      0.86       569

    accuracy                           0.76       800
   macro avg       0.80      0.60      0.61       800
weighted avg       0.78      0.76      0.71       800

Fold 2
              precision    recall  f1-score   support

           0       0.62      0.73      0.67       231
           1       0.88      0.82      0.85       569

    accuracy                           0.79       800
   macro avg       0.75      0.77      0.76       800
weighted avg       0.81      0.79      0.80       800

Fold 3
              precision    recall  f1-score   support

           0       0.65      0.71      0.67       231
           1       0.88      0.84      0.86       569

    accuracy                           0.80       800
   macro avg       0.76      0.77      0.77       800
weighted avg       0.81      0.80      0.81       800

## T11: Bing Liu's Lexicon

In [13]:
# read the lexicon
lexicon = dict()

# read negative words
with open("negative-words.txt", "r") as in_file:
    for line in in_file.readlines():
        if not line.startswith(";") and line != "\n":
            lexicon[line.strip()] = -1

# read positive words
with open("positive-words.txt", "r") as in_file:
    for line in in_file.readlines():
        if not line.startswith(";") and line != "\n":
            lexicon[line.strip()] = 1

In [14]:
# Setting the vocabulary argument to Bing Liu's lexicon
vocab = lexicon.keys()
# 5-fold cross validation
fold = 0
f1 = []
for train_index, test_index in skf.split(data["proc_text"], data["satisfaction"]):
#for train_index, test_index in skf:
    fold += 1
    print("Fold %d" % fold)
    # partition
    train_x, test_x = data["proc_text"].iloc[train_index], data["proc_text"].iloc[test_index]
    train_y, test_y = data["satisfaction"].iloc[train_index], data["satisfaction"].iloc[test_index]
    # vectorize
    vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',min_df=2,vocabulary=vocab) 
    # with a fixed vocabulary, scikit-learn ignores max_df and min_df; the features are exactly the lexicon entries
    X = vectorizer.fit_transform(train_x)
    X_test = vectorizer.transform(test_x)
    # train model
    clf = SGDClassifier(random_state=fold) # Stochastic gradient descent SVM (default hinge loss)
    clf.fit(X, train_y)
    # predict
    pred_y = clf.predict(X_test)
    # classification results
    for line in metrics.classification_report(test_y, pred_y).split("\n"):
        print(line)
    f1.append(metrics.f1_score(test_y, pred_y, average='weighted'))

   
print("Average F1: %.2f" % np.mean(f1))
print('The number of features = %d' % X_test.shape[1]) 

Fold 1
              precision    recall  f1-score   support

           0       0.64      0.25      0.36       231
           1       0.76      0.94      0.84       569

    accuracy                           0.74       800
   macro avg       0.70      0.60      0.60       800
weighted avg       0.72      0.74      0.70       800

Fold 2
              precision    recall  f1-score   support

           0       0.69      0.51      0.59       231
           1       0.82      0.91      0.86       569

    accuracy                           0.79       800
   macro avg       0.75      0.71      0.72       800
weighted avg       0.78      0.79      0.78       800

Fold 3
              precision    recall  f1-score   support

           0       0.71      0.47      0.57       231
           1       0.81      0.92      0.86       569

    accuracy                           0.79       800
   macro avg       0.76      0.70      0.71       800
weighted avg       0.78      0.79      0.78       800

## T12: Comparing T10 & T11 

The average F1 score across all 5 cross-validation folds for T10 (using an SGD classifier on all words in the corpus, excluding stop words) is 0.78.

The average F1 score across all 5 cross-validation folds for T11 (using an SGD classifier on just the words in the lexicon) is 0.75.

In this comparison, using all the words in the corpus works better than using only the words in a specific dictionary. This is most likely because some words that indicate whether a customer was satisfied are removed when the vocabulary is restricted to the lexicon. The SGD classifier therefore does a better job when it has more features to train on and predict with, that is, when the lexicon is not used as the vocabulary in the vectorizer.
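The loss of information can be illustrated with a small, made-up example. Both the `mini_lexicon` dictionary and the `review_tokens` list below are hypothetical; the tokens are written in stemmed form to mirror the `proc_text` column.

```python
# Hypothetical slice of an opinion lexicon (unstemmed, as such lists are usually distributed).
mini_lexicon = {"great": 1, "love": 1, "terrible": -1, "broken": -1}

# Hypothetical stemmed review tokens, in the style of the Porter-stemmed text above.
review_tokens = ["batteri", "die", "two", "week", "terribl", "return", "great"]

# Only tokens that match a lexicon entry exactly survive as features.
kept = [t for t in review_tokens if t in mini_lexicon]
dropped = [t for t in review_tokens if t not in mini_lexicon]
print("kept:", kept)
print("dropped:", dropped)
```

In this toy case, words such as "return" fall outside the lexicon even though they hint at dissatisfaction, and the stemmed form "terribl" no longer matches the lexicon entry "terrible". Stemming the lexicon entries with the same stemmer might recover some of these matches, though that is untested here.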

## T13: PCA

In [15]:
# Vectorizing the data using the same parameters as in T9
vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',min_df=2) 
X = vectorizer.fit_transform(data["proc_text"]).toarray()
# Standardize first, since PCA is sensitive to the relative scaling of the original variables
X = StandardScaler().fit_transform(X)

# Fit PCA with all components; the number needed to explain 90% of the variance is counted in T15
pca = PCA(svd_solver='randomized',whiten=True).fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_)

[3.99642752e-03 3.34949529e-03 2.53746962e-03 ... 3.36972499e-38
 1.10980541e-38 4.60171578e-39]


## T14: PCA Explanation

PCA is a dimension reduction method that is generally used to reduce the number of variables in a model. It does this by creating new variables, called principal components, each of which is a linear combination of the initial variables. Each of these principal components is essentially a dimension in the model, so if you have 10 predictors, there are 10 dimensions and therefore up to 10 principal components. PCA puts the maximum possible variance in the first component, then the maximum remaining variance in the second, and so on. Because the components are ordered this way, it is easy to identify which trailing components can be removed while losing little of the variance.
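The ordering of components by variance can be seen in a small NumPy sketch that computes PCA through the singular value decomposition of centered data. The toy matrix is an assumption: three features, of which the second is nearly a copy of the first.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 3 features; the second feature nearly duplicates the first.
a = rng.normal(size=200)
toy = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Center the data, then take the SVD; the rows of `components` are the principal directions.
centered = toy - toy.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance captured by each component, largest first.
explained_ratio = singular_values**2 / np.sum(singular_values**2)
print(explained_ratio.round(3))
```

Because two of the three features are almost redundant, the last component carries very little variance and could be dropped with little loss.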

## T15: PCA Variable Reduction and SGD Classifier with 5-fold CV

In [16]:
sumofvariance=0.0
n_components = 0
for item in pca.explained_variance_ratio_:
    sumofvariance += item
    n_components+=1
    if sumofvariance>=0.9:
        break
print('The number of components selected by PCA = %d' % n_components) 

The number of components selected by PCA = 2017
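The counting loop above can also be written with a cumulative sum. The helper name `components_for_variance` and the toy ratios are assumptions for illustration.

```python
import numpy as np

def components_for_variance(ratios, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(ratios)
    # searchsorted returns the first index where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical explained-variance ratios, sorted from largest to smallest.
toy_ratios = np.array([0.5, 0.3, 0.15, 0.05])
print(components_for_variance(toy_ratios))
```

scikit-learn can also make this selection itself: passing a float between 0 and 1 as `n_components` together with `svd_solver='full'` keeps just enough components to reach that share of the variance.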


In [22]:
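# Note: pca was fit without n_components, so this projection keeps every component,
# not only the n_components counted above
X_train_pca = pca.transform(X)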
skf = StratifiedKFold(n_splits=5)
fold = 0
f1 = []
for train_index, test_index in skf.split(X_train_pca, data["satisfaction"]):
#for train_index, test_index in skf:
    fold += 1
    print("Fold %d" % fold)
    # partition
    train_x, test_x = X_train_pca[train_index], X_train_pca[test_index]
    train_y, test_y = data["satisfaction"].iloc[train_index], data["satisfaction"].iloc[test_index]
    # train model
    clf = SGDClassifier(random_state=fold) # Stochastic gradient descent SVM (default hinge loss)
    clf.fit(train_x, train_y)
    # predict
    pred_y = clf.predict(test_x)
    # classification results
    for line in metrics.classification_report(test_y, pred_y).split("\n"):
        print(line)
    f1.append(metrics.f1_score(test_y, pred_y, average='weighted'))

print("Average F1: %.2f" % np.mean(f1))

Fold 1
              precision    recall  f1-score   support

           0       0.29      0.49      0.37       231
           1       0.72      0.52      0.61       569

    accuracy                           0.51       800
   macro avg       0.51      0.51      0.49       800
weighted avg       0.59      0.51      0.54       800

Fold 2
              precision    recall  f1-score   support

           0       0.33      0.23      0.27       231
           1       0.72      0.81      0.76       569

    accuracy                           0.64       800
   macro avg       0.52      0.52      0.52       800
weighted avg       0.61      0.64      0.62       800

Fold 3
              precision    recall  f1-score   support

           0       0.30      0.39      0.34       231
           1       0.72      0.63      0.67       569

    accuracy                           0.56       800
   macro avg       0.51      0.51      0.51       800
weighted avg       0.60      0.56      0.58       800

## T16: Explanation of PCA Results

The average F1 score from the model using the PCA components was 0.59. This is worse than the model in T10, which scored 0.78. The T10 model trains the SGD classifier directly on all the TF-IDF features, excluding stop words and terms that appear in only one document. PCA generally works better when the features are highly correlated and variable reduction is needed to reduce the noise and focus on the features that bring new information. It appears that there is not much multicollinearity here, as the categories are fairly different and unique. However, if 2 of the categories were, for example, 'personal computers' and 'laptops', we might see better scores with the PCA model, as there is a lot of overlap between these 2 topics.
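For a comparison in which the reduction is actually applied and re-fit inside every fold, one option is a scikit-learn pipeline. The sketch below is an assumption-laden illustration, not the assignment's method: it swaps PCA for `TruncatedSVD`, which works directly on sparse TF-IDF matrices, and runs on made-up reviews (`pos`, `neg`).

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stemmed reviews and binary satisfaction labels.
pos = ["great watch love band", "work great easi instal", "love look nice price",
       "good product work great", "nice batteri great charg"]
neg = ["broke week return", "wast money stop work", "poor qualiti return",
       "fail instal wast time", "stop work poor batteri"]
toy_text = (pos + neg) * 2
toy_y = ([1] * 5 + [0] * 5) * 2

# The reducer is a pipeline step, so it is re-fit on the training part of every fold.
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=5, random_state=0),
    SGDClassifier(random_state=0),
)
scores = cross_val_score(model, toy_text, toy_y,
                         cv=StratifiedKFold(n_splits=5), scoring="f1_weighted")
print(scores.round(2))
```

Keeping the vectorizer and the reducer inside the pipeline means the test fold never influences the fitted vocabulary or components.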