# Semantic Analysis of Amazon Electronics Reviews

## Introduction:

For my project in the CSCI760 Computer Linguistics course (Spring 2018 semester), I will be processing a corpus and building an ML model that helps determine the semantic meaning of a given document. The dataset, which I found online, is a corpus of Amazon reviews on electronics. I will run the reviews through various NLP libraries to compute a tf-idf score vector for each document, and then calculate their pairwise cosine similarities to determine how each document relates to the other documents in the corpus. I will also try to visualize these documents in a 2- or 3-dimensional Euclidean space, and run the data points through various clustering algorithms to better visualize their semantics.

For this project I will be using the following programming languages/libraries:

 - Python 3: https://docs.python.org/3/
 - NLTK: https://www.nltk.org/
 - gensim: https://radimrehurek.com/gensim/
 - scikit-learn: http://scikit-learn.org/stable/
 - matplotlib: https://matplotlib.org/
 - pandas: https://pandas.pydata.org/

To set up this project, please refer to the README that is committed to this git repository as part of the project.

## Finding data

The first step in my project was to find a suitable dataset from which to start building my model. For my project on semantic analysis, I decided to use a set of Amazon reviews on electronics that I found at the following URL:

http://jmcauley.ucsd.edu/data/amazon/

Credit goes to Julian McAuley, UCSD, for providing a collection of 1000K+ Amazon reviews on electronics.

## Cleaning/Formatting the data

After obtaining the data, I noticed it was structured in JSON format and included fields that were not really useful for my project (e.g. reviewerName, the asin product code, unixReviewTime, etc.). To extract the field that I was interested in (reviewText), I decided to convert the records into CSV format and load them into an array for easy access.

In hindsight, I could probably have used a JSON parser to run through them directly, but an unintended benefit of the conversion was that I discovered some entries were not formatted properly and/or were missing reviewText fields. Below is the Python script that I used to parse my data and output a `testdata.csv` file, which I will use to build my model.

In [None]:
import json
import csv
import pandas as pd

def allFieldsPresent(jsondata):
    return len(jsondata.keys()) == 9

#LONGER/ACTUAL DATA
f = open('../datasets/amazon_review_electronic_full.json', 'r')
# newline='' stops the csv module from writing blank rows on Windows
w = open('../datasets/CSV_AMAZON_REVIEW_ELECTRONIC_FULL.csv', 'w', newline='')

csvwriter = csv.writer(w)

rowcount=0

print('CSV Conversion Start')

for line in f:
    jsondata = json.loads(line)
    if rowcount == 0:
        header = jsondata.keys()
        print (header)
        csvwriter.writerow(header)
        
    rowcount += 1

    # Only convert if all fields are present. Some docs do not have reviewerName.
    if allFieldsPresent(jsondata):
        csvwriter.writerow(jsondata.values())
        
    if rowcount % 100000 == 0:
        print ('Processing Line:', rowcount)

    # Limiter to process smaller dataset.
    if rowcount == 1000:
        break
    
w.close()
f.close()

print('CSV Conversion Complete')

print('Using Pandas to extract reviewText column')

df = pd.read_csv('../datasets/CSV_AMAZON_REVIEW_ELECTRONIC_FULL.csv', skipinitialspace=True)
df.index.name = 'index'
df.reviewText.to_csv('../datasets/testdata.csv', header=['reviewText'], encoding='utf-8')

print ('\n\n==END==\n\n')
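In the spirit of the hindsight note above, here is a minimal sketch of pulling `reviewText` straight out of line-delimited JSON without the intermediate CSV. The `extract_review_texts` helper and the sample records are illustrative assumptions, not part of the original script. Malformed lines and records without `reviewText` are simply skipped.

```python
import io
import json

def extract_review_texts(lines):
    """Collect reviewText from line-delimited JSON, skipping bad records."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line, ignore it
        if not isinstance(record, dict):
            continue  # valid JSON, but not a review object
        text = record.get('reviewText')
        if text:
            texts.append(text)
    return texts

# Hypothetical sample standing in for the real dataset file.
sample = io.StringIO(
    '{"reviewerID": "A1", "asin": "B000", "reviewText": "Great GPS unit."}\n'
    '{"reviewerID": "A2", "asin": "B001"}\n'
    'not valid json\n'
    '{"reviewerID": "A3", "asin": "B002", "reviewText": "Stopped working."}\n'
)
review_texts = extract_review_texts(sample)
print(review_texts)
```

With the real dataset, the same helper could be fed an open file handle instead of the `StringIO` sample.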


## Creating stoplist

A stoplist is a list of words that we generally do not care about. Preposition words such as "like", "through" and "at", along with other very common words that add unnecessary noise to our dataset, are considered stop words and need to be removed from our corpus. Fortunately, the Python NLTK library provides a ready-made list of stop words: given a language name (in our case 'english'), it returns the stop words for that language.

```python
import nltk

# Load the English stop words to reduce noise
# (requires a one-time nltk.download('stopwords')).
stoplist = set(nltk.corpus.stopwords.words('english'))
# Add a few corpus-specific noise tokens.
stoplist.update(['-', 'yet', 'yea', 'zs15'])
```

This list will be passed to the `TfidfVectorizer` class (part of the scikit-learn library) as a parameter to reduce the noise generated by these unnecessary words when calculating the tf-idf score.
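To make the effect of a stoplist concrete, here is a small sketch that filters one sentence by hand. The `remove_stopwords` helper and the tiny `toy_stoplist` are assumptions made so the example runs without NLTK data. `TfidfVectorizer` applies its own tokenization, so this only illustrates the idea.

```python
def remove_stopwords(text, stoplist):
    """Lower-case the text, split on whitespace, and drop stoplist tokens."""
    return [token for token in text.lower().split() if token not in stoplist]

# Tiny hand-written stoplist, used only for illustration.
toy_stoplist = {'this', 'is', 'a', 'very', 'at', 'through'}

tokens = remove_stopwords('This project is a very hard project', toy_stoplist)
print(tokens)
```

Only the content-bearing words ("project", "hard") survive the filter, and these are the terms we want the tf-idf scores to focus on.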



## Calculating the tf-idf score

To find the tf-idf score, we will use the scikit-learn library. It generates tf-idf scores from a corpus passed as input, and an optional stoplist parameter removes the unnecessary stop words for you automatically.

Before we continue, it's important to understand what the tf-idf score is. Tf-idf stands for Term Frequency - Inverse Document Frequency, and it is built from the following two quantities:

Tf(term) = (number of times the term occurs in a document) / (total number of terms in the document)

For example, given the sentence "This project is a very hard project":

Tf(project) = 2/7 = 0.285714

Idf(term) = log((total number of documents) / (number of documents containing the term))

For example, given these two sentences:

 - A: "This project is a very hard project"
 - B: "I like this project"

Idf(project) = log(2/2) = log(1) = 0

The tf-idf score is then simply Tf * Idf.

<img src="img/tfidf_eq.png">

This score is useful because it tells us how frequent a particular term is in a document relative to how common that term is across the entire corpus.

Note: a logarithmic function is applied in the Idf calculation. Its purpose is to dampen the result as the size of the dataset (and thus the frequency of the term) grows. This is not very noticeable in our example, as it only contains two documents/sentences, as opposed to a corpus of millions of lines/documents.
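The worked example above can be reproduced in a few lines of plain Python. This is a sketch of the textbook formulas as stated in this section, with hypothetical helper names `tf` and `idf`. It is not how scikit-learn computes its scores: by default `TfidfVectorizer` uses a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each document vector, so its numbers will differ from these.

```python
import math

def tf(term, document):
    """Term frequency: occurrences of the term / total tokens in the document."""
    tokens = document.lower().split()
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency: log(total docs / docs containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / containing)

corpus = [
    'This project is a very hard project',  # document A
    'I like this project',                  # document B
]

tf_project = tf('project', corpus[0])
idf_project = idf('project', corpus)
tfidf_project = tf_project * idf_project

# "hard" appears only in document A, so its idf is log(2/1) > 0.
tfidf_hard = tf('hard', corpus[0]) * idf('hard', corpus)

print(tf_project, idf_project, tfidf_project, tfidf_hard)
```

Because "project" appears in every document, its idf (and therefore its tf-idf) is zero, while the rarer word "hard" receives a positive score.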

To calculate the tf-idf score, we will use the scikit-learn library, which readily provides a class that gives us the vectorized tf-idf score per document. We simply need to provide the corpus and an optional stoplist parameter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
#
# TF-IDF Vectorizing using scikitlearn
#

# Recent scikit-learn releases expect stop_words as a list rather than a set
tfidf_vectorizer = TfidfVectorizer(stop_words=list(stoplist), use_idf=True)
V = tfidf_vectorizer.fit_transform(documents)
```
#### Vectorized tf-idf score
The output is a sparse matrix V holding the tf-idf scores for each individual document. Sample output of the first 7 terms of the first document:

```
  (0, 532)	0.0880608061719492
  (0, 535)	0.09853796408314193
  (0, 614)	0.11644878383359814
  (0, 869)	0.11644878383359814
  (0, 1060)	0.10901512199433466
  (0, 1293)	0.23289756766719627
  (0, 623)	0.21803024398866933
  .
  .
  .
```
In the above output, the first value on each line is a (document index in corpus, term index in dictionary) pair, while the second value is the corresponding tf-idf feature score.

Printing out the shape of V, we can see that it's a 99x1430 sparse matrix, where the 99 rows are the documents in our corpus and the 1430 columns are the unique terms stored in the dictionary for our corpus. (There are 99 rows because one of the rows was an invalid entry that was dropped during the CSV parsing step.)

```
<99x1430 sparse matrix of type '<type 'numpy.float64'>'
```

#### Matching term to actual word

We can look up each term index and map it back to the actual word using the following lines of code:

```python
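review_number = 0
# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
feature_names = tfidf_vectorizer.get_feature_names_out()
# Column indices of the terms that actually occur in this review
feature_index = V[review_number].nonzero()[1]
tfidf_scores = zip(feature_index, [V[review_number, x] for x in feature_index])
for word, score in [(feature_names[index], score) for (index, score) in tfidf_scores]:
    print(str(word) + ' => ' + str(score))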
```
Below is the output of the first 7 terms.

```
got => 0.0880608061719492
gps => 0.09853796408314193
husband => 0.11644878383359814
otr => 0.11644878383359814
road => 0.10901512199433466
trucker => 0.23289756766719627
impressed => 0.21803024398866933
.
.
.
```

## Finding cosine similarities

Now that we have the tf-idf scores of each document in a vectorized space, we can calculate the angle between any two documents to determine how close they are to each other. Recall that cos(theta) ranges from -1 to 1, and that cos(0) = 1. Thus the closer the cosine of the angle between two document vectors is to 1, the more similar their tf-idf scores are. (Because tf-idf scores are never negative, the similarity here actually falls between 0 and 1.) To get the pairwise cosine similarity of a document to all the other documents in our corpus, scikit-learn provides the `cosine_similarity` function, which takes two parameters: the tf-idf vectors of the documents that we would like to compare, and the tf-idf vectors of all the documents in the corpus to compare them against:

```python
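from sklearn.metrics.pairwise import cosine_similarity

# Compare the first three documents against every document in the corpus
cs_results = cosine_similarity(V[0:3], V)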
```

<img src="img/cosine_pic.png">
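For intuition, the quantity behind `cosine_similarity` can be computed by hand for a pair of vectors. The sketch below uses NumPy and three made-up vectors; the `cosine` helper is an assumption for illustration, not part of `process_data.py`.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

doc_a = np.array([0.5, 0.5, 0.0])
doc_b = np.array([1.0, 1.0, 0.0])  # same direction as doc_a, different length
doc_c = np.array([0.0, 0.0, 2.0])  # shares no terms with doc_a

sim_ab = cosine(doc_a, doc_b)
sim_ac = cosine(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score (approximately) 1 regardless of their length, while vectors with no terms in common score 0.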

The code snippet below, from our `process_data.py`, calculates the cosine similarities for the documents and prints out the pairwise similarity scores for the first 3 documents (note the limiter `if` statement in the snippet).


In [None]:
#
# cosine similarity output
#

cs_results = cosine_similarity(V[0:], V)
doc_num = 1
for i_result in cs_results:
    # Limiter
    if doc_num > 3:
        break
    print ("Document#: ",doc_num)
    print (i_result,'\n\n')
    doc_num += 1

### Output of the cosine similarity result:

<img src="img/cosine_score.png">

## Visualizing the data

With the tf-idf and cosine similarity scores in hand for `V`, it was time to start visualizing these data points to see how similar they are to one another. As human beings living in a 3-dimensional space, we find it nearly impossible to visualize a 1000+ dimensional vector space, so a dimensionality reduction algorithm is needed to reduce the ~1000+ features in `V` to 2- or 3-dimensional vectors that we can then plot and visualize. There are many dimensionality reduction algorithms, but I ultimately settled on the t-SNE algorithm, which emphasizes preserving the relative distances between data points while reducing their features to 2 or 3 dimensions. This algorithm was invented by Laurens van der Maaten, an ML research scientist working at Facebook at the time of writing, and Geoffrey Hinton. Below is a link to their 2008 paper, along with an excerpt.

Paper on T-SNE: http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

<img src="img/tsne_paper.png">

In addition, I found a wonderful YouTube tutorial that explains in detail, with visual examples, what exactly the t-SNE algorithm does. Below is the link to this video, followed by a short description of the tutorial.

Source: https://www.youtube.com/watch?v=NEaUSP4YerM

#### t-SNE

t-Distributed Stochastic Neighbor Embedding, available in the sklearn library.
Developed by Laurens van der Maaten and Geoffrey Hinton, it is a machine learning algorithm that helps with dimensionality reduction.

 - Dimensionality reduction algorithm
 - Maps higher-dimensional data (n > 3) into a 2- or 3-dimensional space
 - Reducing the data to 2 or 3 dimensions allows us to plot and visualize it
 - Retains many of the relationships present in the higher-dimensional data
 
#### Naive projection:
 
<img src="img/tsne1.png">

As you can see, projecting it in a naive way will cause the points to cluster together in an unfavorable way that will cause them to lose their original properties and relative values towards one another.

#### T-SNE projection:

<img src="img/tsne2.png">

Ideally, we want the projected points in the lower dimension to retain their original properties, which allows us to view the data in a faithful manner.

t-SNE uses a machine learning approach to calculate, for each pair of points, the probability that they belong to the same neighborhood. In the original space, it derives this similarity score from a normal (Gaussian) distribution. Once these scores are obtained, it places the points in the lower-dimensional space at random and then repeatedly adjusts their positions to match the similarity scores obtained above:

<img src="img/tsnelearning.png">

Lastly, when measuring similarity between the points in the lower dimension, it uses a t-distribution instead of the normal Gaussian distribution that was used earlier. This is where the "t" in t-SNE comes from:

<img src="img/tdistribution.png">
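As a rough sketch of how this reduction can be run with scikit-learn's `TSNE` class: the random stand-in matrix, the `V_dense` name and the parameter values (`perplexity=5`, `init='random'`, `random_state=0`) are assumptions for illustration, since the exact call used in this project is not shown here. A sparse tf-idf matrix such as `V` would typically be converted to a dense array first, for example with `V.toarray()`.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for a dense tf-idf matrix: 30 "documents" with 50 features each.
rng = np.random.default_rng(0)
V_dense = rng.random((30, 50))

# perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, init='random', random_state=0)
V_tsne = tsne.fit_transform(V_dense)

# One 2-D point per document, in the same row order as the input.
print(V_tsne.shape)
```

Row `i` of `V_tsne` corresponds to row `i` of the input, which is what makes it possible to label plotted points by document number.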


#### Plotting tf-idf vector in 2-D

For plotting our data, I chose the matplotlib library for all my plotting and data visualization needs. The library also supports custom functions that attach data to each plotted point for visualization/annotation purposes. I used the two functions below to provide a hover-over effect for each of my data points, showing which document number it represents. Since the tf-idf and t-SNE steps preserve the row order of the data, I was able to use the index of each data point to associate it back to the document # in my corpus. Below are the sample code and the annotation effect when hovering over two points that are in close proximity to each other (31 and 85).

In [1]:
# Hover over annotation/labeling.
def update_annot(ind):
    
    pos = sc.get_offsets()[ind['ind'][0]]
    annot.xy = pos
    text = 'Review#: {}'.format(ind['ind'][0])
    annot.set_text(text)
    annot.get_bbox_patch().set_alpha(1)


def hover(event):
    vis = annot.get_visible()
    if event.inaxes == ax:
        cont, ind = sc.contains(event)
        if cont:
            update_annot(ind)
            annot.set_visible(True)
            fig.canvas.draw_idle()
        else:
            if vis:
                annot.set_visible(False)
                fig.canvas.draw_idle()


<img src="img/31_85.png">
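The two functions above rely on `fig`, `ax`, `sc` and `annot` already existing. A minimal, self-contained sketch of that setup might look as follows. The random points, the annotation styling and the compact `on_move` callback (which folds the two functions into one so the block runs on its own) are assumptions, not the project's actual code.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((20, 2))  # stand-in for the 2-D t-SNE coordinates

fig, ax = plt.subplots()
sc = ax.scatter(points[:, 0], points[:, 1])

# Hidden annotation that the callback moves and reveals.
annot = ax.annotate('', xy=(0, 0), xytext=(15, 15), textcoords='offset points',
                    bbox=dict(boxstyle='round', fc='w'))
annot.set_visible(False)

def on_move(event):
    """Show 'Review#: i' while the cursor is over point i, hide it otherwise."""
    hit, info = sc.contains(event) if event.inaxes == ax else (False, {})
    if hit:
        index = info['ind'][0]
        annot.xy = sc.get_offsets()[index]
        annot.set_text('Review#: {}'.format(index))
    annot.set_visible(hit)
    fig.canvas.draw_idle()

# Register the callback for mouse movement over the figure.
cid = fig.canvas.mpl_connect('motion_notify_event', on_move)
# Calling plt.show() afterwards opens the interactive window outside a notebook.
```

The key step is `mpl_connect('motion_notify_event', ...)`, which asks matplotlib to call the handler every time the mouse moves over the figure.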

#### Checking/Manual analysis 

With the labels annotated, I was able to see which documents were closely related to each other and start manually checking my visualized data to determine whether the results were meaningful. For the above example, I manually checked documents 31 and 85, shown below:

#### 31
`Great bracket for small TVs.  I used it on a 19" LCD and it worked well.  The large mounting plate may block some connections on some TVs.  Installed easily and could be adjusted to make level.  The links were very tight and need to be loosened for initial installation.  When the TV is in the desired position, you can tighten them up.`

#### 85
`Used this with a Panasonic 22 inch LCD. Very durable and works as advertised.It was advertised as fitting 22"-37" televisions, but the size of the t.v mounting plate is so large that it covers some of the plug connections on the back of the television. I will have to cut the metal plate to get to the P.C connector if i choose to connect a computer.Haven't decided if i will keep or send back to get something with a smaller mounting plate.Other than that, it's a great product and meets my need for a hanging television.`

#### Analysis:

An initial analysis reveals that the words common to both reviews include "mounting" and "connections", which leads me to believe that the semantics behind these reviews concern how easy it was to install these mounts and how they affected the connection ports of each device.

*Note also the other common "clusters" that occur in the plot.*

### K-Means clustering

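Performing K-Means clustering allows us to start grouping these reviews into k clusters. The algorithm starts from k initial center points (centroids), assigns each review to its nearest centroid, and then repeatedly moves each centroid to the mean of the points assigned to it until the assignments settle. Because the initial centroids involve a random choice, results can vary between runs.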
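To illustrate the assign-and-update loop, here is a simplified NumPy sketch of the basic K-Means iteration (Lloyd's algorithm). It is not scikit-learn's implementation: the hypothetical `kmeans_sketch` helper uses the first k points as initial centroids purely to stay deterministic, whereas the project code relies on `init='k-means++'`.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=10):
    """Simplified Lloyd's algorithm on an (n_points, n_dims) float array."""
    # Assumption for this sketch: the first k points act as initial centroids.
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy groups of 2-D points.
toy_points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centroids = kmeans_sketch(toy_points, k=2)
print(labels)
print(centroids)
```

Even though both initial centroids start inside the first group, a few iterations are enough for one of them to migrate to the second group.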



In [None]:
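from sklearn.cluster import KMeans

# Cluster the 2-D t-SNE coordinates into 5 groups
km = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=1000)
km.fit(V_tsne)
# Cluster label (0-4) for each review
kmeanV = km.predict(V_tsne)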

Below is a color-coded K-Means clustering of the original 100 reviews that I used during development while writing the code to process the data:

<img src="img/kmean_100.png">

I realized that I needed more data to get a better feel for the semantics that each cluster represented. Here's the K-Means clustering with my data increased to about 1000 reviews:

<img src="img/kmean_1000.png">

In addition to performing the clustering, I also wanted to know the possible semantic meaning that each cluster represented. For that, I computed tf-idf for each cluster separately and printed out the top 5 features (words) for each cluster.

In [5]:
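def print_top_features(cluster_number, dataframe):
    # Relies on nltk, numpy (np), TfidfVectorizer and kmeanColors from earlier cells.
    stoplist = set(nltk.corpus.stopwords.words('english'))
    stoplist.update(['-', 'yet'])

    # Keep only the reviews assigned to this cluster.
    cluster = dataframe.loc[dataframe['cluster'] == cluster_number]
    # Recent scikit-learn releases expect stop_words as a list rather than a set.
    cluster_tfidf = TfidfVectorizer(stop_words=list(stoplist), use_idf=True)

    # Fitting on the cluster's reviews populates cluster_tfidf.idf_.
    V_cluster = cluster_tfidf.fit_transform(cluster.reviewText)

    # Sort terms by idf weight, highest first (the rarest terms within the cluster).
    indices = np.argsort(cluster_tfidf.idf_)[::-1]
    # get_feature_names_out() needs scikit-learn >= 1.0 (formerly get_feature_names()).
    features = cluster_tfidf.get_feature_names_out()
    top_n = 5
    top_features = [str(features[i]) for i in indices[:top_n]]
    print('{} - {}'.format(kmeanColors[cluster_number], top_features))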

<img src="img/kmean_topword.png">


I noticed that the term 'zune' showed up in multiple clusters. I believe this is due to the different semantic meanings surrounding this device. A larger dataset should provide a better clustering outcome, which may help us better predict the semantic meaning around this word. In addition, I tried increasing and decreasing the value of K for the K-Means clustering, but the word still shows up in multiple groups.
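One thing worth noting is that `print_top_features` sorts by `idf_`, which favours terms that are rare within a cluster. An alternative, which is not what the code above does, is to rank terms by their mean tf-idf score across the cluster's reviews. The sketch below shows the idea on a made-up matrix. The `top_terms_by_mean_tfidf` helper, the scores and the feature names are all hypothetical.

```python
import numpy as np

def top_terms_by_mean_tfidf(tfidf_matrix, feature_names, top_n=5):
    """Rank terms by their average tf-idf score over the rows of the matrix."""
    # np.asarray(...).ravel() handles both dense arrays and scipy sparse matrices.
    mean_scores = np.asarray(tfidf_matrix.mean(axis=0)).ravel()
    order = np.argsort(mean_scores)[::-1][:top_n]
    return [(feature_names[i], float(mean_scores[i])) for i in order]

# Toy 3-document x 4-term matrix with made-up scores.
toy_matrix = np.array([
    [0.9, 0.0, 0.1, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.3],
])
toy_features = ['zune', 'cable', 'screen', 'battery']

top = top_terms_by_mean_tfidf(toy_matrix, toy_features, top_n=2)
print(top)
```

Under this ranking, a term that dominates many reviews in a cluster rises to the top, which may give a different picture of each cluster than the idf-based ordering.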

#### Closing Analysis

For the next step, I would need to look into effective methods of running this modeling process over larger datasets. This could include better hardware, pre-processing the datasets, and removing unnecessary steps when formatting the data.

In addition, I would also like to look into more effective dimensionality reduction algorithms, so that I can better visualize my K-Means clustering. Currently the clustering is done on 2-D vectors, which means the other 1000+ features have been reduced away, making the clustering less effective than it could have been had it been done on the full set of 1000+ dimensions. See the comparison of the two results below:

#### Pre T-SNE clustering

<img src="img/pre_kmean.png">

Notice the red cluster: its points were clustered together in the higher-dimensional space, but when projected onto a 2D space, the cluster appears to sit in the "background" of the other clusters. A more effective visualization method may exist that shows these clusters better. I could also look into 3D visualization to help reduce the visual distortion introduced by the dimensionality reduction algorithm.

#### Post T-SNE clustering

<img src="img/post_kmean.png">

Compared to the pre-t-SNE clustering, the post-t-SNE clustering looks much neater. However, as the clustering was done on 2D vectors, which have significantly fewer features, the results may not be as accurate or meaningful as the pre-t-SNE clustering shown above.