# Project 5: Semantic Analysis (Part 1)
## Introduction:

For the first part of my project, I will process a dataset that I found online: a corpus of Amazon reviews of electronics. I will run the reviews through several NLP libraries to compute the tf-idf scores of each document in a vector space, then calculate their pairwise cosine similarities to determine how closely each document is related to the other documents in the corpus.

For this project I will be using the following programming languages/libraries:

Python 2.7: https://docs.python.org/2/

NLTK: https://www.nltk.org/

gensim: https://radimrehurek.com/gensim/

scikit-learn: http://scikit-learn.org/stable/


## Step 1:  Finding data

The first step in my project was to find a dataset large enough to build my model from. For this project on semantic analysis, I decided to use a set of Amazon reviews of electronics that I found at the following URL:

http://jmcauley.ucsd.edu/data/amazon/

Credit goes to Julian McAuley, UCSD, for providing a collection of 1000K+ Amazon reviews on electronics.

## Step 2: Cleaning/Formatting the data

After obtaining the data, I noticed that it was stored in JSON format, one record per line, and included fields that were not useful for my project (e.g. reviewerName, the asin product code, unixReviewTime, etc.). To extract the field I was interested in (reviewText), I decided to convert the records into CSV format so that they could later be read into an array for easy access.

In hindsight, I could probably have used a JSON parser to read the field directly, but an unintended benefit of the conversion was that I discovered some entries were not formatted properly and/or were missing reviewText fields.

In [6]:
import json
import csv

def allFieldsPresent(jsondata):
    return len(jsondata.keys()) == 9

#SHORT DATA
f=open('./datasets/amazon_review_electronic_short.json','r')
#f=open('./src/jsondata_test.json','r')
w=open('./src/amazonReviewElectronicShortCSV.csv','w')

#LONGER/ACTUAL DATA
#f=open('./datasets/amazon_review_electronic_full.json','r')
#w=open('./datasets/CSV_AMAZON_REVIEW_ELECTRONIC_FULL.csv','w')

csvwriter = csv.writer(w)

rowcount=0
for line in f:
    jsondata = json.loads(line)
    if rowcount == 0:
        header = jsondata.keys()
        print (header)
        csvwriter.writerow(header)
    
    rowcount += 1

    # Only convert if all fields are present. Some docs do not have reviewerName.
    if allFieldsPresent(jsondata):
        csvwriter.writerow(jsondata.values())
        
    #if rowcount % 100000 == 0:
    #    print ('Processing Mark:',rowcount)

w.close()
f.close()
print ('\n\n==END==\n\n')

[u'reviewerID', u'asin', u'reviewerName', u'helpful', u'reviewText', u'overall', u'summary', u'unixReviewTime', u'reviewTime']


==END==




Sample CSV output of a document:

```csv
[reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime]

[AO94DHGC771SJ,0528881469,amazdnu,"[0, 0]","We got this GPS for my husband who is an (OTR) over the road trucker.  Very Impressed with the shipping time, it arrived a few days earlier than expected...  within a week of use however it started freezing up... could of just been a glitch in that unit.  Worked great when it worked!  Will work great for the normal person as well but does have the ""trucker"" option. (the big truck routes - tells you when a scale is coming up ect...)  Love the bigger screen, the ease of use, the ease of putting addresses into memory.  Nothing really bad to say about the unit with the exception of it freezing which is probably one in a million and that's just my luck.  I contacted the seller and within minutes of my email I received a email back with instructions for an exchange! VERY impressed all the way around!",5.0,Gotta have GPS!,1370131200,"06 2, 2013"]
```
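
The conversion above relies on `keys()` and `values()` returning the fields in the same order for every record. A minimal sketch of a variant that pins the column order explicitly with `csv.DictWriter` is shown here. It is written for Python 3, and the `FIELDS` list and the `json_lines_to_csv` helper are names introduced for this illustration, not part of the original script. It also checks the field names themselves rather than only counting them.

```python
import csv
import io
import json

# Column order taken from the header printed above.
FIELDS = ['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText',
          'overall', 'summary', 'unixReviewTime', 'reviewTime']

def json_lines_to_csv(lines, out):
    """Write one CSV row per complete JSON record; return (written, skipped)."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    written = skipped = 0
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            # Malformed JSON line: count it and move on.
            skipped += 1
            continue
        if not isinstance(record, dict) or set(record) != set(FIELDS):
            # Missing or unexpected fields (e.g. no reviewerName).
            skipped += 1
            continue
        writer.writerow(record)
        written += 1
    return written, skipped

# Hypothetical two-line input: the second record lacks several fields.
sample = [
    '{"reviewerID": "A1", "asin": "B1", "reviewerName": "n", "helpful": [0, 0], '
    '"reviewText": "Works great", "overall": 5.0, "summary": "Good", '
    '"unixReviewTime": 1370131200, "reviewTime": "06 2, 2013"}',
    '{"reviewerID": "A2", "asin": "B2", "reviewText": "No name here"}',
]
buf = io.StringIO()
counts = json_lines_to_csv(sample, buf)
```

Returning the number of skipped records makes it easy to see how many entries were dropped, which is the information that explains a smaller-than-expected corpus later on.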

## Step 3: Creating the corpus

To create the corpus, I read through the CSV file, extracted the column at index 4 (reviewText), and stored the values in a single array, with each element being a single review/document. The libraries that I'll be using require the corpus to be stored as a single array.

```python
import csv

f = open('./src/amazonReviewElectronicShortCSV.csv','r')

csvdata = csv.reader(f)

#Index listing
# 0 - reviewerID
# 1 - asin
# 2 - reviewerName
# 3 - helpful
# 4 - reviewText (Use this as corpus?)
# 5 - overall
# 6 - summary
# 7 - unixReviewTime
# 8 - reviewTime

#Generate corpus
documents = []
rowcount = 0
for row in csvdata:
    if rowcount > 0:
        documents.append(row[4])
    rowcount+=1
    
f.close()
```
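
Reading the column by name rather than by position avoids having to track the index listing by hand. The following sketch uses `csv.DictReader`, which treats the first row as the header. The `load_reviews` helper and the in-memory sample are assumptions made for this example, and the code targets Python 3.

```python
import csv
import io

def load_reviews(csvfile, column='reviewText'):
    """Return the values of one named column, one entry per CSV row."""
    reader = csv.DictReader(csvfile)  # first row is consumed as the header
    return [row[column] for row in reader]

# Hypothetical in-memory stand-in for the CSV file written in Step 2.
sample_csv = io.StringIO(
    'reviewerID,asin,reviewText\n'
    'A1,B1,"Great GPS, love it"\n'
    'A2,B2,Stopped working\n'
)
documents = load_reviews(sample_csv)
```

With a real file, the same function could be called on the object returned by `open(...)`.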

## Step 4: Tokenize/Creating stoplist

A stoplist is a list of words that we generally do not care about. Prepositions such as "through" and "at", along with other common words that add unnecessary noise to our dataset, are considered stop words and need to be removed from our corpus. Fortunately, the Python nltk library provides a ready-made list of stop words: we simply pass a language name (in our case 'english') and it returns the list, which we then convert to a set.

```python
import nltk
# The stopwords corpus must be available locally: nltk.download('stopwords')

#Remove stop words
stoplist = set(nltk.corpus.stopwords.words('english'))
stoplist.update(['-'])

texts = [[ word for word in document.lower().split() if word not in stoplist]
         for document in documents]
```

While checking each word of each document against the stoplist, we also tokenize the documents at the same time, here by lowercasing them and splitting on whitespace.
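
One limitation of `split()` is that punctuation stays attached to the tokens, so "unit." and "unit" count as different words, and a stop word followed by a comma slips past the filter. A possible refinement, sketched below with the standard `re` module, extracts runs of letters, digits, and apostrophes instead. The tiny `STOPLIST` here is a hypothetical stand-in for the nltk list, used only so the example runs on its own.

```python
import re

# Hypothetical miniature stoplist; the project itself uses nltk's English list.
STOPLIST = {'the', 'is', 'a', 'it', 'and', 'of', 'with'}

# A token is a run of lowercase letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")

def tokenize(document, stoplist=STOPLIST):
    """Lowercase, split into word-like tokens, and drop stop words."""
    return [tok for tok in TOKEN_RE.findall(document.lower())
            if tok not in stoplist]

tokens = tokenize("The unit is great, and it works with the truck!")
# tokens == ['unit', 'great', 'works', 'truck']
```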

## Step 5: Storing tokenized words into a dictionary

Next, once we have tokenized all the words, we can store each unique token in a dictionary. For this, we will use the gensim library to generate a Dictionary by supplying the list of tokenized documents.

```python
import gensim

#Store dictionary as binary/txt using gensim
dictionary = gensim.corpora.Dictionary(texts)
#dictionary.save('./dict/amazon_electronic_review.dict')
dictionary.save_as_text('./src/dict/amazon_slectronic_review_text.txt')
```
Although we won't be using the dictionary/tokenized values in this part of the project, it's nonetheless a good resource to have in case other libraries require a tokenized list or a dictionary of the corpus.
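
To make concrete what such a dictionary holds, here is a plain-Python sketch of the idea: each unique token gets an integer id, and a document can then be expressed as (token id, count) pairs. This illustrates the concept only and is not gensim's implementation; `build_token_ids` and `to_bag_of_words` are hypothetical helpers.

```python
from collections import Counter

def build_token_ids(texts):
    """Assign each unique token an integer id in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for tok in tokens:
            if tok not in token2id:
                token2id[tok] = len(token2id)
    return token2id

def to_bag_of_words(tokens, token2id):
    """Return sorted (token id, count) pairs for the known tokens of a document."""
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())

# Hypothetical tokenized documents.
texts = [['gps', 'trucker', 'gps'], ['trucker', 'screen']]
token2id = build_token_ids(texts)
bow = to_bag_of_words(texts[0], token2id)
# token2id == {'gps': 0, 'trucker': 1, 'screen': 2}
# bow == [(0, 2), (1, 1)]
```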

## Step 6: Calculating the tf-idf score

To find the tf-idf score, we will use the scikit-learn library. It generates tf-idf scores from a corpus passed as input, along with an optional stoplist parameter that automatically removes the stop words for you.

Before we continue, it's important to define what the tf-idf score is. Tf-idf stands for Term Frequency - Inverse Document Frequency. It is built from the following two quantities:

Tf(term) = (number of times the term occurs in a document) / (total number of terms in the document)

For example, given the sentence "This project is a very hard project":

Tf(project) = 2/7 = 0.285714

Idf(term) = log((total number of documents) / (number of documents containing the term))

For example, given these two sentences:

A - "This project is a very hard project"

B - "I like this project"

Idf(project) = log(2/2) = log(1) = 0

Tf-idf is then simply the product: Tf * Idf

<img src="files/tfidf_eq.png">
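
The worked numbers above can be reproduced with a few lines of plain Python. This is only a sketch of the textbook formula as stated here, written for Python 3; the natural logarithm is an assumption, since the base is not specified, and `tf` and `idf` are illustrative helper names.

```python
import math

def tf(term, tokens):
    """Occurrences of the term divided by the number of tokens in the document."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """log(N / df); assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

doc_a = "this project is a very hard project".split()
doc_b = "i like this project".split()
corpus = [doc_a, doc_b]

tf_project = tf('project', doc_a)          # 2/7
idf_project = idf('project', corpus)       # log(2/2) = 0
tfidf_project = tf_project * idf_project   # 0: the term appears in every document
tfidf_hard = tf('hard', doc_a) * idf('hard', corpus)  # (1/7) * log(2/1)
```

Because "project" occurs in both sentences, its score is zero even though it is the most frequent word in sentence A, while "hard", which occurs only in A, receives a positive score.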

This score is useful because it tells us how frequent a particular term is in a document, relative to how common the term is across the entire corpus.

Note: a logarithmic function is applied in the Idf calculation. It dampens the weight as the size of the dataset (and thus the number of documents containing the term) grows. This is not as noticeable in our example, which contains only two documents/sentences, as it would be in a corpus of millions of lines/documents.

To calculate the tf-idf score, we will use scikit-learn, which readily provides a class that gives us the vectorized tf-idf scores per document. We simply need to provide the corpus and an optional stoplist parameter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
#
# TF-IDF Vectorizing using scikitlearn
#

tfidf_vectorizer = TfidfVectorizer(stop_words=stoplist, use_idf=True)
V = tfidf_vectorizer.fit_transform(documents)
```
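
Note that scikit-learn's default settings differ from the textbook formula given earlier: term frequency is the raw count, the idf is smoothed as ln((1 + n) / (1 + df)) + 1, and each document vector is then scaled to unit Euclidean length. The plain-Python sketch below follows that recipe so the effect can be seen without the library. `smoothed_tfidf` is a hypothetical helper, and its output is not guaranteed to match scikit-learn digit for digit on real text, since the library's tokenizer differs from a simple `split()`.

```python
import math
from collections import Counter

def smoothed_tfidf(corpus):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in corpus for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in corpus:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # Scale the document vector to unit Euclidean (L2) length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors

corpus = [
    "this project is a very hard project".split(),
    "i like this project".split(),
]
vectors = smoothed_tfidf(corpus)
```

Under this smoothed variant a term that appears in every document, such as "project" here, keeps a non-zero weight, which is one reason the scores printed below do not equal a plain Tf * Idf product.
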
#### Vectorized tf-idf score
The output will be a sparse matrix V holding the tf-idf scores of every document, one row per document. Sample output showing the first 7 stored terms of the first document:

```
  (0, 532)	0.0880608061719492
  (0, 535)	0.09853796408314193
  (0, 614)	0.11644878383359814
  (0, 869)	0.11644878383359814
  (0, 1060)	0.10901512199433466
  (0, 1293)	0.23289756766719627
  (0, 623)	0.21803024398866933
  .
  .
  .
```
In the output above, the first value is a (document index in corpus, term index in the vectorizer's vocabulary) pair, while the second value is the corresponding tf-idf feature score.

Printing out V, we can see that it's a 99x1430 sparse matrix, where the 99 rows are the documents in our corpus and the 1430 columns are the unique terms stored in the vocabulary for our corpus. (There are 99 rows because one of the entries was invalid and was skipped during the CSV conversion step.)

```
<99x1430 sparse matrix of type '<type 'numpy.float64'>'
```

#### Matching them to actual term

We can look up each term index and map it back to the actual term using the following lines of code:

```python
document_number=0
feature_names = tfidf_vectorizer.get_feature_names()
feature_index = V[document_number,:].nonzero()[1]
tfidf_scores = zip(feature_index, [V[document_number, x] for x in feature_index])
for word, score in [(feature_names[index], score) for (index, score) in tfidf_scores]:
  print str(word) + ' => ' + str(score)
```
This gives the following output for the first 7 terms:

```
got => 0.0880608061719492
gps => 0.09853796408314193
husband => 0.11644878383359814
otr => 0.11644878383359814
road => 0.10901512199433466
trucker => 0.23289756766719627
impressed => 0.21803024398866933
.
.
.
```
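
Once the scores are paired with their terms, it is often handy to list only the highest-scoring terms of a document. A small sketch using the standard `heapq` module follows; `top_terms` is a hypothetical helper, and the dictionary simply copies the seven sample scores printed above.

```python
import heapq

def top_terms(term_scores, n=3):
    """Return the n (term, score) pairs with the highest scores."""
    return heapq.nlargest(n, term_scores.items(), key=lambda item: item[1])

# Scores copied from the sample output above.
scores = {
    'got': 0.0880608061719492,
    'gps': 0.09853796408314193,
    'husband': 0.11644878383359814,
    'otr': 0.11644878383359814,
    'road': 0.10901512199433466,
    'trucker': 0.23289756766719627,
    'impressed': 0.21803024398866933,
}
best = top_terms(scores, n=2)
# best == [('trucker', 0.23289756766719627), ('impressed', 0.21803024398866933)]
```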

## Step 7: Finding cosine similarities

Now that we have the tf-idf scores of the documents in a vector space, we can calculate the angle between any two documents to determine how close they are to each other. Recall that cos(theta) ranges from -1 to 1, and that cos(0) = 1. Since tf-idf scores are never negative, the cosine of the angle between two documents falls between 0 and 1: the closer the result is to 1, the more similar their tf-idf vectors.

To get the pairwise cosine similarity of a document to all the other documents in our corpus, scikit-learn provides the cosine_similarity function, to which we pass two parameters. The first is the tf-idf vectors of the documents that we would like to compare (here, the first three documents), and the second is the entire collection of tf-idf vectors for all the documents in the corpus:

```python
from sklearn.metrics.pairwise import cosine_similarity

cs_results = cosine_similarity(V[0:3], V)
```

<img src="files/cosine_pic.png">
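
For reference, the quantity being computed is the dot product of two vectors divided by the product of their lengths. The sketch below does this by hand for sparse vectors stored as dictionaries and then ranks the other documents by similarity. It is an illustration under stated assumptions rather than what scikit-learn does internally: `cosine` and `most_similar` are hypothetical helpers, and the toy vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)

def most_similar(index, vectors):
    """Rank the other documents by similarity to vectors[index], best first."""
    scores = [(cosine(vectors[index], v), j)
              for j, v in enumerate(vectors) if j != index]
    return [(j, s) for s, j in sorted(scores, reverse=True)]

# Hypothetical toy vectors, not taken from the review corpus.
vectors = [
    {'gps': 0.6, 'trucker': 0.8},
    {'gps': 1.0},
    {'camera': 1.0},
]
ranking = most_similar(0, vectors)  # document 1 ranks ahead of document 2
```

A document compared with itself always scores 1, which is why the first entry of the first row in the results below is 1.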


In [5]:
import csv
import gensim
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from pprint import pprint

f = open('./src/amazonReviewElectronicShortCSV.csv','r')

csvdata = csv.reader(f)

#Index listing
# 0 - reviewerID
# 1 - asin
# 2 - reviewerName
# 3 - helpful
# 4 - reviewText (Use this as corpus?)
# 5 - overall
# 6 - summary
# 7 - unixReviewTime
# 8 - reviewTime

#Generate corpus
documents = []
rowcount = 0
for row in csvdata:
    if rowcount > 0:
        documents.append(row[4])
    rowcount+=1
    
f.close()

#Remove stop words
stoplist = set(nltk.corpus.stopwords.words('english'))
stoplist.update(['-'])

texts = [[ word for word in document.lower().split() if word not in stoplist]
         for document in documents]


#Store dictionary as binary/txt using gensim
dictionary = gensim.corpora.Dictionary(texts)
#dictionary.save('./dict/amazon_electronic_review.dict')
dictionary.save_as_text('./src/dict/amazon_slectronic_review_text.txt')


#
# TF-IDF Vectorizing using scikitlearn
#

tfidf_vectorizer = TfidfVectorizer(stop_words=stoplist, use_idf=True)
V = tfidf_vectorizer.fit_transform(documents)

#print V[0:3]

#
# Mapping feature score to actual words in doc
#
document_number=0
feature_names = tfidf_vectorizer.get_feature_names()
feature_index = V[document_number,:].nonzero()[1]
tfidf_scores = zip(feature_index, [V[document_number, x] for x in feature_index])
#for word, score in [(feature_names[index], score) for (index, score) in tfidf_scores]:
#  print str(word) + ' => ' + str(score)

#
# Calculating the pairwise cosine similarity for each document in corpus
#

cs_results = cosine_similarity(V[0:3], V)
doc_num = 1
for i_result in cs_results:
    print ('Document#: ',doc_num,'\n')
    print (i_result,'\n\n')
    doc_num += 1


print ('\n\n==END==\n\n')


Document#:  1 

[ 1.          0.06448182  0.17047312  0.10509248  0.1618741   0.          0.
  0.02447777  0.03198922  0.06696618  0.          0.          0.01635356
  0.          0.04392718  0.00866927  0.03240968  0.00879309  0.
  0.04998871  0.05426122  0.01145236  0.08385288  0.06949689  0.02494436
  0.04321394  0.01178378  0.10896441  0.          0.05077     0.04628443
  0.05622397  0.0429728   0.02878872  0.01505179  0.04503667  0.
  0.00559552  0.04642412  0.00637493  0.03398488  0.03087508  0.
  0.05252068  0.          0.04118057  0.13082372  0.07467776  0.05148969
  0.04338703  0.03146021  0.04940529  0.02292995  0.05466849  0.01069828
  0.03212453  0.04472577  0.03248506  0.08553808  0.01743837  0.03154781
  0.02796502  0.04210158  0.03061191  0.12345703  0.03368483  0.02726801
  0.0094039   0.10593499  0.03497451  0.05347856  0.03176913  0.04370887
  0.10662     0.03737303  0.01101159  0.07845182  0.00998599  0.02682439
  0.01561005  0.          0.03599854  0.01830357  0.   