# Table of Contents
- [Introduction](#introduction)
- [Dataset Description and Preprocessing](#dataset)

<br/>

## 1 Introduction <a name="introduction"></a>

In this assignment, we demonstrate the use of machine learning for clustering and classifying text data.
Specifically, we use the K-means algorithm to cluster user reviews of certain Amazon products.
The inferred clusters are used to label the dataset and train an SVM model to classify new samples.
We present the performance of the model on the training and test datasets and deploy the model as a microservice to be used in production.

<br/>

## 2 Dataset Description and Preprocessing <a name="dataset"></a>

### 2.1 The dataset
The dataset is composed of various user reviews on products of the following categories:

* Books
* Clothing, shoes and jewelry
* Movies and TV
* Musical Instruments
* Software

Some samples taken from the movies dataset:

In [32]:
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)

from main import import_data
import_data('datasets/Movies_and_TV_5.json', size=100, file_type='json', field='reviewText').sample(4)

Unnamed: 0,input
86,Liked
99,"This is your classic Christmas Carol with Henry Winkler playing Scrooge. He was actually fairly young when he played this part. They did a great job making him look like an old man. Of course, Henry did a great job with the acting portion. The ending in this take was a little unbelievable as Scroodge was very old by that time. Still a very uplifting film with a good lesson for life."
41,"Great product, fast shipping"
81,This is a classic scrooge movie but set in the 30's and I love Henry Winkler!


The full dataset can be found here: http://jmcauley.ucsd.edu/data/amazon/

In order to construct the training dataset, we sample 1000 reviews from each category (`size_per_class` in the cell below) and mix them.

In [5]:
%load_ext autoreload
%autoreload 2

from main import import_data
import pandas as pd

size_per_class = 1000
categories = ['Clothing_Shoes_and_Jewelry', 'Books', 'Movies_and_TV', 'Software', 'Musical_Instruments']

# Sample size_per_class reviews from each category and stack them into one frame.
# random_state makes the sampling reproducible; random.seed() does not affect DataFrame.sample.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
frames = [
    import_data('datasets/{}_5.json'.format(name), size=10000, file_type='json', field='reviewText')
    .sample(size_per_class, random_state=2)
    .reset_index()
    for name in categories
]
data = pd.concat(frames).reset_index()

<br/>


### 2.2 Feature extraction
In order to represent the data in numeric form and in a meaningful way, the TFIDF algorithm is used.


#### 2.2.1 Term Frequency – Inverse Document Frequency (TFIDF)
The TFIDF algorithm is used to transform a document into a numeric vector. Each unique word that appears in the
collection of documents (the corpus) corresponds to a dimension of the vector. The algorithm combines two quantities:

<br/>

**Term Frequency**

The number of times a word $i$ appears in the document $d$ divided by the total number of words in that document.

<br/>

$TF(i,d) = \frac{f_{i,d}}{\sum_{i' \in d} f_{i',d}}$


<br/>

**Inverse Document Frequency**

Common words that appear in most documents may not provide useful information. In our use case, the words “good”,
“great” and “like” appear frequently in all kinds of user reviews and therefore do not add much value to the clustering process.
To down-weight such words, the IDF score of a term is calculated from the number of documents containing the term relative to the total
number of documents:

 <br/>

$IDF(i,D) = \log\frac{|D|}{|\left\{ d \in D : i \in d \right\}|}$

<br/>

where $|D|$ is the total number of documents and $|\left\{ d \in D : i \in d \right\}|$ is the number of documents containing the term $i$.
The TFIDF score is calculated by multiplying the two quantities:

<br/>

$TFIDF(i,d) = TF(i,d) \cdot IDF(i,D)$

<br/>

**Example**

Consider the following documents:

* d0: “Good quality hat.”
* d1: “Fits great, looks good!”
* d2: “Good hat.”

The TFIDF matrix is calculated as:

In [14]:
import pandas as pd
from main import TextPreprocessor

example = pd.Series(['Good quality hat.', 'Fits great, looks good!', 'Good hat.'])
tfidf_vector, tfidf_matrix, dense_tfidf_matrix = TextPreprocessor().generate_tfidf(example)
dense_tfidf_matrix

| | fit | good | great | hat | look | quality |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.425441 | 0.0 | 0.547832 | 0.0 | 0.720333 |
| 1 | 0.546454 | 0.322745 | 0.546454 | 0.0 | 0.546454 | 0.0 |
| 2 | 0.0 | 0.613356 | 0.0 | 0.789807 | 0.0 | 0.0 |
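
The formulas above can also be checked by hand on the three example documents. The following is a minimal sketch in plain Python, assuming the documents have already been lemmatized into the tokens that appear as column names in the matrix. The helper names `tf`, `idf` and `tfidf` are illustrative and are not part of `main`. Its numbers will not match the matrix above: those values appear consistent with scikit-learn's default `TfidfVectorizer` behaviour (smoothed IDF followed by L2 normalization of each row), although that is an inference from the output rather than something stated in this report.

```python
import math
from collections import Counter

# Assumption: tokens are already lemmatized and stripped of punctuation,
# mirroring the column names of the matrix above.
docs = [
    ["good", "quality", "hat"],
    ["fit", "great", "look", "good"],
    ["good", "hat"],
]


def tf(term, doc):
    """Occurrences of term in doc divided by the number of words in doc."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tfidf(term, doc, corpus):
    """Textbook TFIDF: product of the two quantities defined above."""
    return tf(term, doc) * idf(term, corpus)


# One column per unique term in the corpus, one row per document.
vocabulary = sorted({term for doc in docs for term in doc})
matrix = [[tfidf(term, doc, docs) for term in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print([round(value, 3) for value in row])
```

Because “good” occurs in every document, its IDF is $\log 1 = 0$, so its TFIDF score is zero in every row under the textbook definition.
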


#### 2.2.2 Punctuation and stop words removal
In order to reduce the dataset size and noise, we remove punctuation and common words like “this”, “and”, “a”, etc.

spaCy is a fully featured NLP library that can be used to perform such tasks.

In [15]:
import spacy
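
To make the idea concrete without depending on any NLP library, here is a minimal sketch in plain Python. The stop word list `STOP_WORDS` and the function name `remove_punctuation_and_stop_words` are hypothetical and chosen only for illustration; they do not reflect how spaCy or `TextPreprocessor` implement this step.

```python
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# a real pipeline would rely on the list shipped with an NLP library.
STOP_WORDS = {"this", "and", "a", "the", "is", "it", "of", "to"}


def remove_punctuation_and_stop_words(text):
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in cleaned.split() if token not in STOP_WORDS]


print(remove_punctuation_and_stop_words("This is a classic movie, and I love it!"))
# prints ['classic', 'movie', 'i', 'love']
```
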

#### 2.2.3 Stemming/Lemmatization
Words inside a document need to be transformed to their root form in order for TFIDF to perform well. For example, the words “driving” and “driver” should point to the same term “drive”. Stemming is the process of performing such a conversion.

Lemmatization is similar to stemming, but it also takes into account the meaning and the context in which the term is used. For example, the words “people” and “person” should point to the same lemma.

<br/>

### 2.3 Implementation

For preprocessing the data, we created a wrapper class that uses scikit-learn and spaCy.
The class provides a method that takes the raw text data as input, performs lemmatization and stop word removal,
and outputs the TFIDF transformation.

In [16]:
from main import TextPreprocessor



## 3 Clustering with K-means

In this chapter we describe the K-means algorithm used for clustering the documents and discuss the results.

### 3.1 The K-means algorithm
K-means is an unsupervised machine learning algorithm for clustering data into a predefined number (K) of distinct groups (clusters). Each data point belongs to exactly one cluster. The data is clustered in such a way that the sum of squared distances between each cluster’s centroid and the points belonging to that cluster is minimized.
The inputs and outputs of the algorithm are listed below:

Input: 
* Number of clusters
* Dataset

Output:
* Cluster centroids
* Clustered data points

The algorithm iterates until a stop condition is met:

<br/>

<img src="kmeans.png" alt="kmeans" width="600"/>
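
The iterative process can be sketched in a few lines of plain Python. This is a minimal illustration of the standard assignment and update steps (Lloyd's algorithm), assuming random initialization from the data points and "assignments no longer change" as the stop condition. It is not the implementation behind `TextClustering`, and the names `kmeans` and `squared_distance` are illustrative.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct data points
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # stop condition: assignments no longer change
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On this toy input the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.
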

### 3.2 Choosing optimal parameters


In [30]:
from main import *
import pandas as pd

size_per_class = 200
categories = ['Clothing_Shoes_and_Jewelry', 'Books', 'Movies_and_TV', 'Software', 'Musical_Instruments']

# Sample size_per_class reviews from each category and stack them into one frame.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
frames = [
    import_data('datasets/{}_5.json'.format(name), size=10000, file_type='json', field='reviewText')
    .sample(size_per_class)
    .reset_index()
    for name in categories
]
data = pd.concat(frames).reset_index()
print("Loaded data")

tpr = TextPreprocessor(drop_common_words=False)
tfidf_vector, tfidf_matrix, dense_tfidf_matrix = tpr.generate_tfidf(data.input, debug=True)

Loaded data
Created tfidf_vectorizer
Created tfidf matrix
Created tfidf dense matrix


In [31]:
%load_ext autoreload
%autoreload 2
from main import TextClustering

tcl = TextClustering(data, tfidf_matrix, tfidf_vector, 8)
tcl.find_optimal_clusters(20, debug=False)



### 3.3 Exploring the data

In [34]:
from IPython.display import display

tcl.fit_kmeans(8)

for i in range(8):
    idx = tcl.clustered_data.pred == i
    print("Cluster {}".format(i))
    display(tcl.clustered_data.loc[idx, ['input', 'top_terms']][0:5])







Cluster 0


Unnamed: 0,input,top_terms
1,Love them,"[love, book, shoe, movie, granddaughter, daughter]"
11,I love them!,"[love, book, shoe, movie, granddaughter, daughter]"
15,"I love converse! I still don't really get their sizing bc every pair I've had seems to be different, but this pair is 6.5/8.5, I am a women's 8.5-9 and they fit perfectly! I love love love them","[love, book, shoe, movie, granddaughter, daughter]"
25,LOVE EM,"[love, book, shoe, movie, granddaughter, daughter]"
27,Love them !!!,"[love, book, shoe, movie, granddaughter, daughter]"


Cluster 1


Unnamed: 0,input,top_terms
2,Great! !!,"[great, product, fit, sound, string, work]"
9,Great product! Fast shipping,"[great, product, fit, sound, string, work]"
16,Great ones,"[great, product, fit, sound, string, work]"
30,Great pair of boots and my son loves them!,"[great, product, fit, sound, string, work]"
34,"Good, inexpensive Jessie costume, worked great for Halloween.","[great, product, fit, sound, string, work]"


Cluster 2


Unnamed: 0,input,top_terms
70,Good price..Shipped as promised,"[good, movie, capo, cable, classic, love]"
100,Good fit. my son is very happy with these.,"[good, movie, capo, cable, classic, love]"
141,I love this mask it was the best it was so realistic,"[good, movie, capo, cable, classic, love]"
163,good shoes,"[good, movie, capo, cable, classic, love]"
224,Good human interest story.,"[good, movie, capo, cable, classic, love]"


Cluster 3


Unnamed: 0,input,top_terms
105,"I want to buy 10 of these. They do what a watch should do; tell time. They are not smart, but have convenient features and its a very small design. I hope Casio never stops the manufacturing of this watch.","[movie, film, watch, good, time, great]"
153,Enjoy them!,"[movie, film, watch, good, time, great]"
249,Hillerman has a poetic way of describing the setting for his stories. The stories include\ndetailed information about the traditions of the Navajos...their beliefs and how they have\nAdapted to modern day living. I enjoy his stories because I knew nothing about the Native\nAmericans. The story plots are always unique and the endings unexpected. He is a good\nStoryteller.,"[movie, film, watch, good, time, great]"
271,Great yarn- I read it after my 16 yr old daughter who is steampunk crazy! Would make a good movie!,"[movie, film, watch, good, time, great]"
292,"Though I am not one that really believes in past lives, I am aware that those types of things make for a good story. Its also a good way for a story to use both history and present day to intertwine a great story about a historic figure.\n\nLady of Hay is a dramatic story of Joanna who is researching hypnotic regression (which I am not sure I believe in either) and as she researches this she decided to let them put her under hypnosis. While under hypnosis Jo becomes Matilda. Matilda is a young lady from the 12th century and her life a mixture of emotions.\n\nMatilda's husband William is not really nice, King John feels he can do what he wants with her, and Richard de Clare is the only man who really loves and understand her. Joanna has a life not to far away from Matilda's as she has a group of men who think they know what is best for her but she is not married to any of them and Jo is pretty head strong, but with each man thinking they know whats best for her she finds her self confused and overwhelmed and sometimes she has a hard time distinguishing between herself and Matilda at times.\n\nIt was a interesting story about two different centuries, but two women who where a lot alike, although I don't really believe in reincarnation, past life stuff it was still a interesting story and if you like historical/modern storytelling then I think you would like this book.","[movie, film, watch, good, time, great]"


Cluster 4


Unnamed: 0,input,top_terms
200,"First off, I accidentally read two sentences in the front sleeve, and then I read the excerpt before page 1, and I immediately knew what the problem was, and how the problem would be solved. Unfortunately, Crichton decided to spend half the book building a mystery that is answered on the front sleeve of the book. The mystery falls flat, as the main character thinks his wife is cheating. Of course, that is not true, but it takes Crichton a heck of lot of time to get to the point. His ""insights"" into Silicon Valley are pretty much an extrapolation of stereotypes.\nAbout 140 pages into it, he finally starts the darn book as the main character gets to Nevada. It's a finally action packed, fluid, and fun, despite the fact that it feels like a blatant [take] of the movie Tremors. But hey, it was a good movie. Just replace worms with nano-particles, and you've got the same thing.\nThen, in the ending, it gets truly surreal with the typical hack ending. The last twenty pages are pretty much a yawner.\nHis classic fluidity, and the 100 pages of action are what makes this book somewhat enjoyable.\n\nWithout a doubt, one of his worst books. Timeline and now this. It seems like Crichton's career is just about winding down. Maybe he and Tom Clancy can move in with other Al Gore, MC Hammer, Michael Jordan, and other has beens.","[book, read, child, narnia, story, great]"
202,"Sadly, the ""Chronicles of Narnia"" boxed set that I treasured as a child is out of print, so I decided to replace it with this collectible edition that contains all seven books in the series. I can't find any fault with the books themselves: These stories about four children who travel repeatedly to the world of Narnia are treasured classics. In this edition, the books are presented in the order that author C.S. Lewis preferred, which sucks. The correct order that I recommend is: 1.) ""The Lion, The Witch, and the Wardrobe,"" 2.) ""Prince Caspian"", 3.) ""The Voyage of the Dawn Treader,"" 4.) ""The Silver Chair,"" 5.) ""The Horse and His Boy,"" 6.) ""The Magician's Nephew,"" and 7.) ""The Last Battle."" If you are new to Narnia, I highly suggest that you read the books in the order I listed here. If you read the books in the sequence they are arranged in the book, you will be very disappointed because everything will happen out of order. Also, I was a little disappointed with the quality of the book itself...the binding cracked almost as soon as I opened the book. Still, for $14.95, this edition is a steal, and the Narnia books are just as enjoyable to me now as they were when I was a young girl.","[book, read, child, narnia, story, great]"
203,Great book for the grand kids. I recommended it!,"[book, read, child, narnia, story, great]"
204,"Absolutely the best juvenile novel I've read in years! While AIRBORN is reminiscent of many classic adventure yarns, the characters are nicely fleshed out and much more three-dimensional than those of its predecessors.\n\nWhile this book at first appears to be a young man's ""coming of age"" story, the young woman who is his supporting character is strong, funny, and independent -- an excellent foil who easily stands on her own.\n\nEven though AIRBORN is obviously an ""alternate history"" fantasy, the characters could easily be real people. An excellent read, especially for adults who need a break from real life for just a bit. ;-D\n\nI'll be on the lookout for more of Kenneth Oppel's well-written, substantial (but fun!) books.","[book, read, child, narnia, story, great]"
206,"Lewis crafts a masterpiece with so much truth interwoven, as to bring joy to the heart. He captures the realities of struggle, evil, joy and joy. Read, and re-read often.","[book, read, child, narnia, story, great]"


Cluster 5


Unnamed: 0,input,top_terms
562,"Bought this for my kids as once again I go to this original version being much better then the ""New and Improved"" version!","[use, program, version, software, work, product]"
600,"Recently decided to upgrade my Toshiba P75-A7200 (24GB RAM) laptop with a SSD. I ran into problems with the Samsung 500GB SSB and the OEM Windows 8 didn't take after cloning the original HDD. I talked with my neighbor who is a certified IT guy and he said to use the OEM recovery disk...which I didn't have or make. So, I decided to buy Windows 8.1 Full Version. I was going to use the HDD as a backup hard drive and SSD as my main drive. Well that didn't work either! Hah! I ended up with just the SSD as my main drive. I installed Windows 8.1 without any issues...well, that is what I thought. The first install did not take at all. My graphics were all off, wireless adapter had to be re-installed, and there is no Office free trial deal. Every page while surfing was blotchy and gaming was almost impossible. So, I consulted my IT guys and said to reload with the disk. Now, seven days into this project...I figured what the heck. Well, the good news is that Windows 8.1 found some missing registries and other files and fixed them. All the updates took...which took a long time. My wireless adapter had to be reloaded...which actually went easier the second time around. As for Office...well, I chunked down some more $$$ and bought Office 365. 
All in all this upgrade set me back almost $300.\n\nLIKES:\n- Laptop is beyond stupid fast now (with SSD, 24GB RAM, and Windows 8.1, Office 365)\n- Start up from a complete cold shut down takes 5 seconds...stupid fast!\n- 8.1 takes some getting used to...but now after using it...it is okay.\n- Display/graphics processor/videos/gaming are all sharp, accurate, and fun again\n- Starting to use APPs and cloud...takes some patience, but it isn't that bad\n- ""Start"" button is back and can be used similar to Windows 7\n- I still have a 64-bit Windows 7 on my desktop machine.\n- It is slightly better than Windows 8\n\nNOT SO GOOD\n- Upgrading to Windows 8.1 took two loads, it didn't take the first time\n- Obviously an attempt by Microsoft to fix Windows 8's issues\n- I hope it doesn't crash...shouldn't have that burden on my mind\n- I have my old HDD and slot for it...afraid to install it now.\n\nBOTTOM LINE\n- I think if you really want to maximize Windows 8.1, get an SSD and max out your RAM. Back up your files before loading and ensure you sequence your BIOS to boot from the CD drive. Now that the dust has settled after one week of piddling with the laptop, it is beyond fast, looks great, and gaming is lightning fast (had to buy a laptop cooler because the processor is screaming heat). Office 365 is very stable and I like it...however, it doesn't come with Office Picture Viewer. You can go to Microsoft's webpage and download Sharepoint 2010 for free and have it back...very easy to do. All good now...whew!","[use, program, version, software, work, product]"
602,"I have been using TurboTax for years. The transfer of data from prior years is fast and easy. I use it when it first comes out each year to do some quick tax calculations to plan my year end transactions. For planning purposes where one does not have all the various forms from financial institutions, one must force the program to ignore certain missing information, or go back to the paper copies of the various forms received from those institutions in prior years. These problems, of course, go away when one is doing the actual tax return, since those forms for the current year would be at hand. This is only a problem when doing early estimates without the various forms.\n\nI like that the program is constantly updated as one approaches the actual filing time, so that everything is current and up to date at the time of filing.\n\nA good solid product that gets better each year.\n\nBe sure to buy the level of Turbo Tax that you need for your own particular situation - since moving up their product line can be expensive.","[use, program, version, software, work, product]"
603,"Norton is THE BEST when it comes to internet security. This particular product gives you 3 licenses, which you know you're eventually going to want another license if you buy computers often, so I definitely recommend going ahead and getting the 3 pack like this. Norton has prevented several viruses from infecting my computer. When I've tried competitors they've let things through. I also love the Norton 360 for the password management system, which keeps track of all your passwords even across computers. It really is a game-changing product. I would not own a computer without Norton installed on it.","[use, program, version, software, work, product]"
604,"Easy to install, easy to use with decent parental controls, Trend Micro has outdid themselves again. It works well on my netbook as well as one of my other systems and doesn't get in the way the way McAfee and other security suites do. It's great because you don't have to buy extra copies as it allows you to install it on three different computers. I highly recommend it.","[use, program, version, software, work, product]"


Cluster 6


Unnamed: 0,input,top_terms
0,Fits perfectly,"[nice, work, like, perfect, guitar, use]"
3,in transit,"[nice, work, like, perfect, guitar, use]"
4,Can never go wrong with some Chucks!,"[nice, work, like, perfect, guitar, use]"
6,Decent sneeks,"[nice, work, like, perfect, guitar, use]"
12,"I'm pretty sure they aren't REAL but they are high quality. They say ""Made in Vietnam"" and I'm pretty sure they're usually manufactured in China. Great buy.","[nice, work, like, perfect, guitar, use]"


Cluster 7


Unnamed: 0,input,top_terms
5,"I wear converse all the time, these ones are just the way they are supposed to be!","[size, shoe, wear, converse, order, fit]"
7,This sneakers run approximately 2 sizes larger. They are way too large,"[size, shoe, wear, converse, order, fit]"
8,I really like my new white classic shoes! They are very nice. I can wear them with whatever I want!,"[size, shoe, wear, converse, order, fit]"
10,"Chucks are awesome, especially the high tops. Chucks as a whole run about a half size large so order about a half size smaller than you would normally buy in other shoes. These look awesome with shorts or if you're a hipster, cut off jeans","[size, shoe, wear, converse, order, fit]"
13,"I like the look of the shoes for sure, but they are too big. I usually wear a 12, sometimes a 13 as in the pic of my DC's, but if I slide my foot all the way forward into these size 12's I have two finger widths I can put between the heel of the shoe and my heel. How are a size 13 DC shoe smaller than a size 12 Converse?? Gonna be sending these guys back.","[size, shoe, wear, converse, order, fit]"
