In [358]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from collections import Counter
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.cluster import MeanShift, estimate_bandwidth
from textblob import TextBlob
%matplotlib inline

from sklearn import ensemble
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

For this capstone I will be working with Amazon reviews. There are millions of them, so I'm limiting my dataset to the 10 most prolific reviewers of pet supplies. Prolific reviewers tend to write more comprehensive reviews, so their reviews should be long enough to train and test models that group reviews by author.

In [332]:
# The file is JSON from the 5-core hyperlink at http://jmcauley.ucsd.edu/data/amazon/

df = pd.read_json('Pet_Supplies_5.json', lines=True)

In [333]:
df.head(20)

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,1223000893,"[0, 0]",3,"I purchased the Trilogy with hoping my two cats, age 3 and 5 would be interested. The 3 yr old cat was fascinated for about 15 minutes but when the same pictures came on, she got bored. The 5 year old watched for about a few minutes but then walked away. It is possible that because we have a wonderful courtyard full of greenery and trees and one of my neighbors has a bird feeder, that there is enough going on outside that they prefer real life versus a taped version. I will more than likely pass this on to a friend who has cats that don't have as much wildlife to watch as mine do.","01 12, 2011",A14CK12J7C7JRK,Consumer in NorCal,Nice Distraction for my cats for about 15 minutes,1294790400
1,1223000893,"[0, 0]",5,There are usually one or more of my cats watching TV and staying out of trouble when this DVD is playing. They seem to like the mice and birds the most and maybe go a little less stir crazy being inside all the time.,"09 14, 2013",A39QHP5WLON5HV,Melodee Placial,Entertaining for my cats,1379116800
2,1223000893,"[0, 0]",4,"I bought the triliogy and have tested out all the DVDs. It appears that volume 2 is the most well received of the three and the one I would recommend. It's funny to watch my cat watch it bc she looks behind the TV trying to find the birds. I turn this on sometimes when I'm leaving the house, by the time I get home, she doesn't seem to be paying attention anymore but figured she'd at least enjoy the sounds.","12 19, 2012",A2CR37UY3VR7BN,Michelle Ashbery,Entertaining,1355875200
3,1223000893,"[2, 2]",4,"My female kitty could care less about these videos-but she cares less about almost everything. My little male however digs these. He doesn't go ape over them, but he really does watch them for a bit and it makes me feel better to throw them on when I have to go out to work and leave him.","05 12, 2011",A2A4COGL9VW2HY,Michelle P,Happy to have them,1305158400
4,1223000893,"[6, 7]",3,"If I had gotten just volume two, I would have given it five stars, but since I got the trilogy, I can only give it three stars. I read all the reviews, and knew that vol. two was the best, hands down, but for few extra dollars I decided to get all three in a combo pack. Since birds are a natural source of food for cats (feral) it was natural that they were instantly attracted to vol. two. Contrary to all the cartoons, cats are not fishermen, and thus, fish, either in a bowl or in the wild are not something they are naturally attracted to. Since gerbils and guinea pigs are not native where I live, they could have been little dogs as far as my cats could tell. Rodents are also a natural food, so volume one was much better than volume three. Also, the quality could have been better (for my eyes only) but my cats could care less about video quality... they just see birds.. LOL","03 5, 2012",A2UBQA85NIGLHA,"Tim Isenhour ""Timbo""",You really only need vol 2,1330905600
5,4847676011,"[10, 10]",5,"My Rottie has food allergies to poultry, beef and dairy. I've had a difficult time finding a toothpaste that doesn't make him allergic and he enjoys the taste. This toothpaste is peanut flavor (smells like black licorice). He loves the taste and doesn't wiggle as much when I brush his teeth every night. The price is ok, but I do wish that the tube came in a larger size. Soooo, if your pup has allergies or doesn't like his/her current toothpaste you might want to try this one.","07 13, 2007",A2V3UP9NPMHVKJ,"Alex Thomas ""Tommy""",Great for Pups with Food Allergies,1184284800
6,4847676011,"[2, 2]",5,My puppy loves this stuff! His tail starts wagging as soon as I ask him if he's ready to brush his teeth! It is actually an enjoyable daily experience! Definitely my &#34;Go To&#34; dog toothpaste.,"11 16, 2013",A2R4JCEFLTFU8F,"goldilox ""goldi""",Naturally Yummy!,1384560000
7,4847676011,"[2, 2]",4,My toy poodle loves this stuff and will let me &#34;sort&#34; of brush her teeth because of it. I was hoping it would help with her doggy breath and it does some. Interestingly... it says &#34;peanutbutter&#34; but it doesn't smell like peanutbutter.,"03 9, 2014",A14B4MJ7KZE63B,K Young,bought to help with dog breath,1394323200
8,4847676011,"[0, 0]",5,Works great and dog doesn't hate the taste. Gum health is important so just have to brush those pearly whites.,"11 19, 2013",A2JDB26Y78TT8Z,"L. Miller ""LMinGA""",Brushing those teeth isn't so hard with this dog toothpaste,1384819200
9,4847676011,"[1, 1]",5,"Yes , my Princess is enjoying the taste showing that She is getting best/top results .. She loves that and me too ... Strongly recommended , without regret ...","05 2, 2013",A8GY0E0SS8VMD,Wolves,Good product ...,1367452800


In [334]:
# Remove the column-width limit so I can get a sense of the size of the reviews

# None means "no limit"; passing -1 has been deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)

In [335]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157836 entries, 0 to 157835
Data columns (total 9 columns):
asin              157836 non-null object
helpful           157836 non-null object
overall           157836 non-null int64
reviewText        157836 non-null object
reviewTime        157836 non-null object
reviewerID        157836 non-null object
reviewerName      156493 non-null object
summary           157836 non-null object
unixReviewTime    157836 non-null int64
dtypes: int64(2), object(7)
memory usage: 10.8+ MB


How many reviews were written by the most prolific reviewers?

In [336]:
df['reviewerName'].value_counts()

Amazon Customer                              2045
Jessica                                      175 
Peter Suslock                                147 
Lisa                                         139 
Sarah                                        135 
Jennifer                                     133 
Chris                                        128 
Ashley                                       110 
Emily                                        104 
Nancy                                        100 
Dee                                          99  
Linda                                        97  
Jen                                          96  
Susan                                        96  
Nicole                                       94  
Maggie                                       93  
Melissa                                      91  
Spudman                                      90  
Michael                                      89  
JJ                                           88  


'Amazon Customer' looks like a placeholder display name shared by many accounts rather than a single author, and it has more than 10x the reviews of the next most frequent name, so I'm excluding it. The next 10 names have fairly similar review counts, so I'll make those my dataset.

In [337]:
df_top_10 = df.loc[df['reviewerName'].isin(
    ['Jessica', 'Peter Suslock', 'Lisa', 'Sarah', 'Jennifer', 'Chris', 'Ashley', 'Emily', 'Nancy', 'Dee'])]

In [338]:
df_top_10.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
117,B00005MF9W,"[0, 0]",4,"I love this litter box. I do not use the lids, and keep using the same receptacle until it tears or cracks. (Usually 3-4 months). I just dump it out a couple times a week. Makes these things last forever","06 26, 2014",A2H83XMHUVDLJY,Ashley,Waste Receptacles,1403740800
663,B00006IX59,"[1, 1]",5,"I've had a chuckit for 15 years, yes it has lasted that long. I got a new one because of the handle looking more comfortable. The handle is a bit shorter then the original one I had and I love it. My arm isn't sore after useing it with my 2 big dogs and I have much better control.","05 4, 2013",A1K9BV3RXPRV3V,Lisa,fantastic,1367625600
765,B00006JHRE,"[0, 0]",1,Not worth the money. The individual dishes are very small. Maybe if you have a cat this would work well but defiantly not for dogs.,"02 7, 2013",A3809LKNGB175R,Ashley,Not a good buy.,1360195200
950,B000084E6V,"[0, 0]",3,"My enthusiastic chewer has barely put a dent in his dinosaur after a month of chewing. Unfortunately, my vet doesn't recommend this type of toy as it may break his teeth. So, I won't buy another one. Luckily, it looks like this one will last forever.","03 5, 2013",A2P8VF13PRUPV6,Dee,Extremely durable,1362441600
1075,B000084E6V,"[0, 0]",5,"This is a great product for heavy chewers. My dogs love these and they keep their teeth clean. They last a long time, but you'll want to be sure to discard them when they are so warn that your dog can tear little pieces off.","02 21, 2013",A1XPUPKHHPO7ZR,Lisa,Great Product,1361404800


In [339]:
df_top_10.shape

(1270, 9)
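
One caveat on selecting by `reviewerName`: in the rows shown above, the two 'Ashley' reviews carry different `reviewerID` values, and so do the two 'Lisa' reviews, so a display name may cover several distinct accounts. Here is a minimal sketch of how that could be checked. The frame `sample` is hand-made for illustration and only reuses names and IDs visible in the output above.

```python
import pandas as pd

# Illustrative rows only; names and IDs are copied from the head() output above.
sample = pd.DataFrame({
    "reviewerName": ["Ashley", "Lisa", "Ashley", "Dee", "Lisa"],
    "reviewerID": ["A2H83XMHUVDLJY", "A1K9BV3RXPRV3V", "A3809LKNGB175R",
                   "A2P8VF13PRUPV6", "A1XPUPKHHPO7ZR"],
})

# Number of distinct account IDs behind each display name.
ids_per_name = sample.groupby("reviewerName")["reviewerID"].nunique()
print(ids_per_name)
```

Running the same `groupby` on `df_top_10` would show whether each of the ten names is one author or several.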

In [340]:
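# Reserving roughly 25% of the dataset for testing. Rows are taken in file order, not shuffled.
# Note: the test slice starts at 954, so row 953 ends up in neither set; df_top_10[953:] would include it.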
df_train = df_top_10[:953]
df_test = df_top_10[954:1270]
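
The slices above keep the rows in file order, which in the output shown appears to follow the product ID (`asin`), so the training and test sets would cover different products. A hedged alternative is to shuffle and stratify by author with the `train_test_split` that is already imported. The sketch below runs on a made-up frame (`toy`), and `random_state=42` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for df_top_10: 40 reviews spread evenly over four names.
toy = pd.DataFrame({
    "reviewText": ["review {}".format(i) for i in range(40)],
    "reviewerName": ["Ashley", "Lisa", "Dee", "Emily"] * 10,
})

# Shuffle, hold out 25%, and keep each author's share similar in both parts.
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.25,
    stratify=toy["reviewerName"],
    random_state=42,
)
print(toy_train.shape, toy_test.shape)
print(toy_test["reviewerName"].value_counts())
```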

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?

What's interesting about this part of the assignment is that I already know the classification of the data from the DataFrame. Each row clearly states which reviewer wrote the review. However, I assume that I am meant to set this knowledge aside and cluster the data as if the authors were unknown.

First, it's probably for the best to clean the data. The most obvious thing I need to do is remove all of the unnecessary columns. All that matters are the review and the reviewer name.

## Text Cleaning

In [341]:
df_train.columns

Index(['asin', 'helpful', 'overall', 'reviewText', 'reviewTime', 'reviewerID',
       'reviewerName', 'summary', 'unixReviewTime'],
      dtype='object')

In [342]:
asin = pd.DataFrame(df_train['asin'])

In [254]:
# Passing the axis positionally (drop([...], 1)) no longer works in pandas 2.x; name the columns instead
df_train = df_train.drop(columns=['asin', 'helpful', 'overall', 'reviewTime', 'reviewerID', 'summary', 'unixReviewTime'])

In [255]:
df_test = df_test.drop(columns=['asin', 'helpful', 'overall', 'reviewTime', 'reviewerID', 'summary', 'unixReviewTime'])

In [256]:
df_test.head()

Unnamed: 0,reviewText,reviewerName
113447,"Not only does this doggy wash leave our dog smelling delicious but it also leaves his super curly coat soft and manageable. Our last order we only received one of the two bottles, but it's not clear whether that is a seller or Amazon packaging issue and it was handled quickly via Amazon support. Regardless of the receipt issue, this shampoo is awesome. I'd buy it for the super fresh scent alone!",Sarah
113455,Cloud Star products are really great for your dog. Smells wonderful and leaves his coat soft and fresh for at least a week.,Chris
113484,"Love this hoodie! So soft, thick, and great attention to detail like the soft fuzzy fabric which has been added to the inside along the velcro, so it won't rub. However, do NOT go by the measurements in the description or the one review with the various sizes. Based on both of those, I ordered the XXS for my 4 pound 2 ounce foster Chihuahua. The hoodie actually measures 6.5"" inches long from the neckline to the waist band, not including the hood. That is quite a bit shorter than 8 inches. The waist band is super tiny! It measures 7 inches in diameter, that's the same as my wrist. I will upload two photos with a tape measure. Can't exchange it since its not sold by, only fulfilled by, Amazon. Will be returning and reordering an XS, hopefully the waist will be big enough without the chest getting to big, that was tight but almost fit.",Jennifer
113945,"I have a dog who barks a lot and we have been training him not to bark as much by using the ""Quiet"" command and giving him a treat every time he is quiet (first we conditioned him before he barked by saying ""quiet and giving him a treat every time we said that word - now when we say that word he immediately stops whatever he is doing and looks for a treat).Treats we have used in the past were good but because they were dry, they weren't as stinky and didn't get his attention during one of his barking rants. These treats are soo stinky and my dog loves them!I love them because they are easy to break up into 10+ pieces for training since they are soft. Also, compared to most other snacks they are healthy with duck meat being the first ingredient. Although I wouldn't consider these treats to be the healthiest you can get, they are certainly not the unhealthiest. We use the duck flavor because my dog has diarrhea when he has chicken, beef, or salmon and these work well for him.",Sarah
114042,"I ordered this as a gift to entertain my parents' two dogs. Dog #1 is a 45 pound, 9 year old mutt. It took about 5 minutes for him to figure out he could fit his entire mouth around the yellow piece and then lift it out. Now that he knows the trick, it takes him about 10 seconds. Dog #2 is a 15 pound, 1 year old Chihuahua. He gave up after about a minute and now won't even try the toy. He does like to chew on the yellow cups though, so it can't be left out.After dog #2 gave up, my 1 year old extremely food motivated cat tried it out. It took her at least 3 minutes to work the cup out, but she persevered and was rewarded with the treat. I may end up keeping it for her since the it isn't working out for the dogs.I would recommend this toy to dogs under 30-35 pounds only. Bigger dogs that can fit their mouth around it have no challenge. Also only dogs that are inquisitive and intelligent. It works out well for some cats too.",Emily


In [258]:
# Next I will remove the stop words from reviewText with a lambda.
# Note: the match is case-sensitive, so capitalized stop words ("I", "The", "My") are kept.

stop = stopwords.words('english')

df_test['reviewText'] = df_test['reviewText'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df_train['reviewText'] = df_train['reviewText'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [260]:
# Instantiate tokenizer and lemmatizer

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

In [261]:
def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

In [262]:
df_train['text_lemmatized'] = df_train.reviewText.apply(lemmatize_text)

In [263]:
df_test['text_lemmatized'] = df_test.reviewText.apply(lemmatize_text)

In [264]:
df_train.head()

Unnamed: 0,reviewText,reviewerName,text_lemmatized
117,"I love litter box. I use lids, keep using receptacle tears cracks. (Usually 3-4 months). I dump couple times week. Makes things last forever",Ashley,"[I, love, litter, box., I, use, lids,, keep, using, receptacle, tear, cracks., (Usually, 3-4, months)., I, dump, couple, time, week., Makes, thing, last, forever]"
663,"I've chuckit 15 years, yes lasted long. I got new one handle looking comfortable. The handle bit shorter original one I I love it. My arm sore useing 2 big dogs I much better control.",Lisa,"[I've, chuckit, 15, years,, yes, lasted, long., I, got, new, one, handle, looking, comfortable., The, handle, bit, shorter, original, one, I, I, love, it., My, arm, sore, useing, 2, big, dog, I, much, better, control.]"
765,Not worth money. The individual dishes small. Maybe cat would work well defiantly dogs.,Ashley,"[Not, worth, money., The, individual, dish, small., Maybe, cat, would, work, well, defiantly, dogs.]"
950,"My enthusiastic chewer barely put dent dinosaur month chewing. Unfortunately, vet recommend type toy may break teeth. So, I buy another one. Luckily, looks like one last forever.",Dee,"[My, enthusiastic, chewer, barely, put, dent, dinosaur, month, chewing., Unfortunately,, vet, recommend, type, toy, may, break, teeth., So,, I, buy, another, one., Luckily,, look, like, one, last, forever.]"
1075,"This great product heavy chewers. My dogs love keep teeth clean. They last long time, want sure discard warn dog tear little pieces off.",Lisa,"[This, great, product, heavy, chewers., My, dog, love, keep, teeth, clean., They, last, long, time,, want, sure, discard, warn, dog, tear, little, piece, off.]"
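
In the output above, capitalized stop words such as "I", "The", and "My" survive because the membership test is case-sensitive and NLTK's English list is lowercase. Tokens like "box." also keep their punctuation, which stops the lemmatizer from matching them. The sketch below shows a case-insensitive variant that also strips punctuation. `TOY_STOP` is a tiny made-up stand-in for `stopwords.words('english')`, and `remove_stop_words` is a hypothetical helper, not part of the notebook.

```python
import re

# Tiny illustrative stop list; the notebook uses NLTK's full English list.
TOY_STOP = {"i", "the", "my", "this", "not", "do", "a", "and", "it", "is"}

def remove_stop_words(text, stop=TOY_STOP):
    # Pull out runs of letters, digits and apostrophes, dropping punctuation.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # Compare in lowercase so 'The' and 'the' are treated alike.
    return " ".join(t for t in tokens if t.lower() not in stop)

cleaned = remove_stop_words("I love this litter box. The lids crack, and I do not use them.")
print(cleaned)
```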


## Vectorizing

Clustering can only be done on numerical data, so I will have to convert my cleaned, tokenized, lemmatized column of reviews into vectors. I'll try Bag of Words and TF-IDF. Once I have those, I'll create the clusters using various methods (k-means, mean-shift, spectral affinity, and Latent Dirichlet Allocation).
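
To make the Bag of Words idea concrete before handing it to scikit-learn, here is a hand-rolled sketch on two made-up sentences (`docs`, `vocab`, and `rows` are illustrative names). Each document becomes a row of counts over a shared, sorted vocabulary.

```python
from collections import Counter

docs = ["dog loves toy", "cat loves litter box loves"]

# Shared vocabulary: every distinct word, sorted so the column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row of counts per document, one column per vocabulary word.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[word] for word in vocab])

print(vocab)
print(rows)
```

`CountVectorizer` does the same job, with its own tokenization rules, and stores the result as a sparse matrix.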

In [266]:
# Use the stop-word-filtered reviewText column as the vectorizer input (the text_lemmatized column is not used here)

df_train_vector = df_train['reviewText']
df_test_vector = df_test['reviewText']

In [267]:
df_train_vector.head()

117     I love litter box. I use lids, keep using receptacle tears cracks. (Usually 3-4 months). I dump couple times week. Makes things last forever                                           
663     I've chuckit 15 years, yes lasted long. I got new one handle looking comfortable. The handle bit shorter original one I I love it. My arm sore useing 2 big dogs I much better control.
765     Not worth money. The individual dishes small. Maybe cat would work well defiantly dogs.                                                                                                
950     My enthusiastic chewer barely put dent dinosaur month chewing. Unfortunately, vet recommend type toy may break teeth. So, I buy another one. Luckily, looks like one last forever.     
1075    This great product heavy chewers. My dogs love keep teeth clean. They last long time, want sure discard warn dog tear little pieces off.                                               
Name: reviewText, dtype: object

In [268]:
df_train_vector.shape

(953,)

In [269]:
# .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
df_train_vector.to_numpy()

array(['I love litter box. I use lids, keep using receptacle tears cracks. (Usually 3-4 months). I dump couple times week. Makes things last forever',
       "I've chuckit 15 years, yes lasted long. I got new one handle looking comfortable. The handle bit shorter original one I I love it. My arm sore useing 2 big dogs I much better control.",
       'Not worth money. The individual dishes small. Maybe cat would work well defiantly dogs.',
       'My enthusiastic chewer barely put dent dinosaur month chewing. Unfortunately, vet recommend type toy may break teeth. So, I buy another one. Luckily, looks like one last forever.',
       'This great product heavy chewers. My dogs love keep teeth clean. They last long time, want sure discard warn dog tear little pieces off.',
       'This one chewy bones dog destroy day. She months, slowly working head tail gone body remained. She would continued chewing loved it, disappeared somewhere. A lot chew toys claim durable, I found one live claim mos

## Vectorization Methods

## Bag of Words

Rather than use the Bag of Words function from the lesson I'll instantiate the CountVectorizer from sklearn. 

In [270]:
vectorizer = CountVectorizer()

In [271]:
# CountVectorizer accepts any iterable of strings; I'll pass a plain list

df_train_vector_list = df_train_vector.tolist()

In [272]:
# Learn the vocabulary; fit() returns the vectorizer itself, so there is nothing to store yet
vectorizer.fit(df_train_vector_list)

In [273]:
bag_of_words = vectorizer.transform(df_train_vector_list)

In [274]:
print(bag_of_words)

  (0, 879)	1
  (0, 1585)	1
  (0, 1599)	1
  (0, 2108)	1
  (0, 2680)	1
  (0, 3547)	1
  (0, 3642)	1
  (0, 3736)	1
  (0, 3794)	1
  (0, 3852)	1
  (0, 3909)	1
  (0, 4177)	1
  (0, 5213)	1
  (0, 6499)	1
  (0, 6594)	1
  (0, 6665)	1
  (0, 6981)	1
  (0, 6988)	1
  (0, 6991)	1
  (0, 7199)	1
  (1, 26)	1
  (1, 479)	1
  (1, 745)	1
  (1, 751)	1
  (1, 779)	1
  :	:
  (952, 3157)	1
  (952, 3311)	1
  (952, 3495)	1
  (952, 3586)	1
  (952, 3720)	1
  (952, 3846)	1
  (952, 4175)	1
  (952, 4237)	1
  (952, 4340)	1
  (952, 4431)	1
  (952, 4439)	1
  (952, 4788)	1
  (952, 5228)	1
  (952, 5445)	2
  (952, 5535)	1
  (952, 5830)	1
  (952, 5998)	1
  (952, 6393)	1
  (952, 6796)	3
  (952, 6895)	1
  (952, 6991)	1
  (952, 7014)	1
  (952, 7018)	2
  (952, 7123)	1
  (952, 7251)	1


Take a look at this printed list of the bag of words. Each line shows (review index, word index) followed by a count, so the last line means that in review 952 the word assigned to index 7251 occurs once. In that same review, word 6796 occurs three times. What is this word that occurs three times?

In [275]:
print(df_train['reviewText'][-1:])

113141    I tried one 6-month old kittens soon I got it. He usually wants play various brushes I've tried (not surprising) I got couple good brushes flopped front let brush him. I impressed. My friend recommended certainly right said got whole lot hair brush I've tried since gets right undercoat.
Name: reviewText, dtype: object


My bet is that word 6796 is "tried"

In [276]:
print(vectorizer.vocabulary_.get("tried"))

6796


And there you have it. 
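
Guessing the word and looking up its index works, but the lookup can also go the other way. `vocabulary_` maps each term to its column index, so inverting that dictionary turns an index back into a term. This is a sketch on two made-up sentences; `toy_docs`, `toy_vectorizer`, and `index_to_term` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["tried the brush", "tried and tried again"]
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps term -> column index; invert it to go from index -> term.
index_to_term = {idx: term for term, idx in toy_vectorizer.vocabulary_.items()}

# Column with the highest count in the second document.
top_col = int(toy_counts[1].toarray().argmax())
print(top_col, index_to_term[top_col])
```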

In [277]:
type(bag_of_words)

scipy.sparse.csr.csr_matrix

So now that I have my bag of words, how do I create clusters for it? Well to start, I want to make an array out of all this. 

In [278]:
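# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# The trailing .head() is dropped so that df_bow keeps every review, not just the first five.
df_bow = pd.DataFrame(bag_of_words.toarray(), columns=vectorizer.get_feature_names_out())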

In [279]:
df_bow.head()

Unnamed: 0,00,000,00this,011,03,04,0mm,10,100,10gph,...,zipper,zippers,zips,zogoflex,zone,zoo,zoomed,zoomgroom,zuke,zymox
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Note for the reader:

Below, under the "TF-IDF Vectorizing" header, I compare Bag of Words and TF-IDF for feature generation and then cluster using K-Means. Knowing the result of that, I've come back to this bag of words to cluster it using Mean Shift. This is just to spread out the clustering methods and see how each works on this data. So keep that in mind if you get to the section below and wonder why I'm repeating code and seemingly starting over with feature generation and clustering.

## Mean Shift Clustering

In [280]:
# I'll need to convert my BoW into a dense array if I want Mean Shift to process it.
bag_of_words_array = bag_of_words.toarray()

In [281]:
bag_of_words_array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [282]:
# Here we set the bandwidth. This function automatically derives a bandwidth
# number based on an inspection of the distances among points in the data.
bandwidth = estimate_bandwidth(bag_of_words_array, quantile=0.2, n_samples=500)
bandwidth

9.064613238052992

In [283]:
# Declare and fit the model.
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(bag_of_words_array)

MeanShift(bandwidth=9.064613238052992, bin_seeding=True, cluster_all=True,
     min_bin_freq=1, n_jobs=1, seeds=None)

In [284]:
# Extract cluster assignments for each data point.
labels = ms.labels_

In [285]:
# Coordinates of the cluster centers.
cluster_centers = ms.cluster_centers_

In [286]:
# Count our clusters.
n_clusters_ = len(np.unique(labels))

print("Number of estimated clusters: {}".format(n_clusters_))

Number of estimated clusters: 1


A single cluster is not useful, given that I know these reviews come from several different authors. Perhaps a K-Means model, where I choose the number of clusters myself, will better fit my data.

But before that I want to try out a TF-IDF to see how it vectorizes my data versus the Bag of Words.
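
One plausible reason for the single Mean Shift cluster is that the estimated bandwidth is large relative to how far apart the reviews sit in this sparse count space, so every point falls into one kernel window. I have not verified that on the review data. The sketch below only shows the effect on synthetic 2-D points; `points`, `count_clusters`, and both bandwidth values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.RandomState(0)
# Two tight, well-separated 2-D blobs (synthetic data, not the reviews).
blob_a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

def count_clusters(data, bandwidth):
    # Fit Mean Shift with a fixed bandwidth and count the distinct labels.
    model = MeanShift(bandwidth=bandwidth).fit(data)
    return len(np.unique(model.labels_))

n_small = count_clusters(points, bandwidth=1.0)   # window smaller than the gap between blobs
n_large = count_clusters(points, bandwidth=50.0)  # window wider than the whole data set
print(n_small, n_large)
```

If the same thing is happening with the reviews, a smaller `quantile` in `estimate_bandwidth` would be one thing to try.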

## TF-IDF Vectorizing

In [287]:
# Create a tokenizer that lowercases the text, splits it into words, and stems each word. This will be used in my Bag of Words

def textblob_tokenizer(str_input):
    blob = TextBlob(str_input.lower())
    tokens = blob.words
    words = [token.stem() for token in tokens]
    return words

In [288]:
#Bag of words
vec = CountVectorizer(tokenizer=textblob_tokenizer, stop_words='english')
matrix = vec.fit_transform(df_train_vector_list)
pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out()).head()

Unnamed: 0,'bottom,'d,'edibl,'essenti,'get,'it,'ll,'m,'o,'perfect,...,zipper,zogoflex,zone,zone.i,zoo,zoo-m,zoom,zoomgroom,zuke,zymox
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


By using the tokenizer function I was able to eliminate upwards of 1200 features that would otherwise have been stop words or inflected variants of the same stem. I have removed the cells that showed the higher number of features created without this processing.

## Term Frequency

In [289]:
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      use_idf=False)
matrix = vec.fit_transform(df_train_vector_list)
# Note: this reuses the name `df`, replacing the raw reviews DataFrame loaded at the top
df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())
df.head()

Unnamed: 0,'bottom,'d,'edibl,'essenti,'get,'it,'ll,'m,'o,'perfect,...,zipper,zogoflex,zone,zone.i,zoo,zoo-m,zoom,zoomgroom,zuke,zymox
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


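The TfidfVectorizer produced the same vocabulary (the same columns) as the CountVectorizer. The visible corner of the table is all zeros in both cases, but the non-zero entries are now normalized term frequencies rather than raw counts.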

Which review is most about dogs?

In [290]:
df.sort_values(by='dog', ascending=False).head()

Unnamed: 0,'bottom,'d,'edibl,'essenti,'get,'it,'ll,'m,'o,'perfect,...,zipper,zogoflex,zone,zone.i,zoo,zoo-m,zoom,zoomgroom,zuke,zymox
534,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
697,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
854,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
524,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With all these features I can't see the ones I'm looking for. Let's see what reviews are most about "dog", "cat", "fish", and... "breakfast", sorted by 'dog'.

In [291]:
df[['dog', 'cat', 'fish', 'breakfast']].sort_values(by='dog', ascending=False).head()

Unnamed: 0,dog,cat,fish,breakfast
534,0.75,0.0,0.0,0.0
697,0.727607,0.0,0.0,0.0
854,0.666667,0.0,0.0,0.0
603,0.632456,0.0,0.0,0.0
524,0.625543,0.0,0.0,0.0


In [292]:
df[['dog', 'cat', 'fish', 'breakfast']].sort_values(by='cat', ascending=False).head()

Unnamed: 0,dog,cat,fish,breakfast
526,0.0,0.707107,0.0,0.0
403,0.0,0.688247,0.0,0.0
676,0.0,0.6742,0.0,0.0
272,0.0,0.632456,0.0,0.0
877,0.0,0.6,0.0,0.0


In [293]:
df[['dog', 'cat', 'fish', 'breakfast']].sort_values(by='fish', ascending=False).head()

Unnamed: 0,dog,cat,fish,breakfast
41,0.0,0.0,0.57735,0.0
67,0.043685,0.043685,0.524222,0.0
65,0.0,0.0,0.5,0.0
183,0.0,0.0,0.499484,0.0
32,0.0,0.0,0.428571,0.0


In [294]:
df[['dog', 'cat', 'fish', 'breakfast']].sort_values(by='breakfast', ascending=False).head()

Unnamed: 0,dog,cat,fish,breakfast
537,0.0,0.3849,0.0,0.19245
627,0.0,0.0,0.110432,0.0
628,0.0,0.0,0.0,0.0
629,0.0,0.0,0.0,0.0
630,0.27735,0.0,0.0,0.0


Only one instance of 'breakfast'. Well, that's not a surprise. But I am surprised at the low weights given to such common terms: dog, cat, and fish. However, these are reviews of pet supplies, so I want to try this again for the terms "chew", "litter", and "treat".

### Meta Note for the reader: 

I originally ran these with L1 normalization. Switching back to the default L2 (Euclidean) normalization gave me MUCH stronger results for these words. So keep that in mind when reading and saying "These results look good to me, what's your problem?"
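
The gap between the two normalizations is easy to see on a single row. L2 divides each count by the row's Euclidean length, while L1 divides by the sum of the counts, so L1 weights shrink faster as a review gains terms. This is a small worked sketch with a made-up count vector, not data from the reviews.

```python
import numpy as np

# Raw term counts for one made-up review: two distinct terms, each seen once.
counts = np.array([1.0, 1.0])

# L2: divide by the Euclidean length of the row.
l2_weights = counts / np.sqrt((counts ** 2).sum())
# L1: divide by the sum of absolute values of the row.
l1_weights = counts / np.abs(counts).sum()

print(l2_weights)
print(l1_weights)
```

The L2 value here, 1/sqrt(2), is consistent with the 0.707107 weight that tops the 'cat' table above, which is what a review with two equally frequent remaining terms would produce.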

In [295]:
df[['chew', 'litter', 'treat']].sort_values(by='chew', ascending=False).head()

Unnamed: 0,chew,litter,treat
806,0.609994,0.0,0.0
860,0.601929,0.0,0.0
796,0.566947,0.0,0.0
700,0.566947,0.0,0.188982
715,0.509175,0.0,0.0


In [296]:
df[['chew', 'litter', 'treat']].sort_values(by='litter', ascending=False).head()

Unnamed: 0,chew,litter,treat
289,0.0,0.800641,0.0
763,0.0,0.625543,0.0
288,0.0,0.508001,0.0
290,0.0,0.508001,0.0
28,0.0,0.481543,0.0


In [297]:
df[['chew', 'litter', 'treat']].sort_values(by='treat', ascending=False).head()

Unnamed: 0,chew,litter,treat
380,0.0,0.0,0.625543
161,0.0,0.0,0.5547
737,0.0,0.0,0.534522
739,0.0,0.0,0.534522
151,0.0,0.0,0.516398


You can't see because I've written over it, but the L1 normalization results were MUCH lower. Before I create my TF-IDF I want to run K-means on my TF DataFrame.

In [298]:
number_of_clusters=5

km = KMeans(n_clusters=number_of_clusters)

km.fit(matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [299]:
print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names_out()
for i in range(number_of_clusters):
    # Keep the five highest-weighted terms for this cluster
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: cat love litter like use
Cluster 1: food dog love like eat
Cluster 2: use work 's thi great
Cluster 3: toy dog love chew play
Cluster 4: dog love like use 's


I can see too much overlap in the top terms per cluster. I'll increase the number of clusters to see if they differentiate. I expect that using TF without IDF is not the way to go. Still: prove, don't speculate.

In [300]:
number_of_clusters=9

km = KMeans(n_clusters=number_of_clusters)

km.fit(matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=9, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [301]:
print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names_out()
for i in range(number_of_clusters):
    # Keep the five highest-weighted terms for this cluster
    top_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_words)))

Top terms per cluster:
Cluster 0: work thi product great 's
Cluster 1: love great like 's look
Cluster 2: use 's easi work great
Cluster 3: treat dog love like buy
Cluster 4: tank water filter use product
Cluster 5: toy dog love chew play
Cluster 6: dog like great use 's
Cluster 7: food dog love like cat
Cluster 8: cat love litter like box


While I can see that the clusters are becoming more distinct, the prevalence of terms such as "like", "love", "great", and "dog" is a problem for me. This may be solved by instantiating my vectorizer with IDF included.
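
Reading the top terms is one way to judge a cluster count. A complementary, more mechanical check is the silhouette score, which is higher when points sit close to their own cluster and far from the others. The sketch below is an assumption about how that comparison could look, run on synthetic points (`X`, `centers`) rather than on the review matrix. The fixed `random_state` only makes the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
# 40 synthetic points around each of three centers (not the review data).
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

# Fit K-Means for several cluster counts and record the silhouette score of each.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On the review data the same loop would take the TF or TF-IDF `matrix` in place of `X`.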

## Inverse Document Frequency

In [302]:
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      norm='l1',
                      use_idf=True)
matrix = vec.fit_transform(df_train_vector_list)
idf_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())
idf_df.head()

Unnamed: 0,'bottom,'d,'edibl,'essenti,'get,'it,'ll,'m,'o,'perfect,...,zipper,zogoflex,zone,zone.i,zoo,zoo-m,zoom,zoomgroom,zuke,zymox
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now I want to compare the two:

In [303]:
# Original 
pd.DataFrame({
    'dog': df.dog,
    'cat': df.cat,
    'dog + cat': df.dog + df.cat
}).sort_values(by='dog', ascending=False).head(8)

Unnamed: 0,cat,dog,dog + cat
534,0.0,0.75,0.75
697,0.0,0.727607,0.727607
854,0.0,0.666667,0.666667
603,0.0,0.632456,0.632456
524,0.0,0.625543,0.625543
905,0.0,0.617213,0.617213
447,0.0,0.612372,0.612372
807,0.0,0.610847,0.610847


In [304]:
# New 
pd.DataFrame({
    'dog': idf_df.dog,
    'cat': idf_df.cat,
    'dog + cat': idf_df.dog + idf_df.cat
}).sort_values(by='dog', ascending=False).head(8)

Unnamed: 0,cat,dog,dog + cat
534,0.0,0.16034,0.16034
697,0.0,0.150294,0.150294
603,0.0,0.144942,0.144942
447,0.0,0.128717,0.128717
799,0.0,0.11951,0.11951
737,0.0,0.113425,0.113425
143,0.0,0.102164,0.102164
250,0.0,0.093591,0.093591


So the new TF-IDF weights for these words are much lower than the term-frequency weights in the original matrix. Lower weights on such common words should aid in clustering.
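One caveat when reading the two tables side by side: the new vectorizer sets norm='l1' in addition to use_idf=True. If the original matrix was built with a different norm (scikit-learn's default is 'l2'), part of the drop comes from normalization rather than from IDF. A rough sketch of how the norm alone rescales a row, using made-up counts rather than values from the matrix:

```python
import math

# Made-up raw term counts for a single review.
counts = {"dog": 3, "love": 1, "toy": 1}

# L1 norm: divide by the sum of absolute values, so the row sums to 1.
l1_total = sum(abs(v) for v in counts.values())
l1_weights = {term: v / l1_total for term, v in counts.items()}

# L2 norm: divide by the Euclidean length, so the squared values sum to 1.
l2_total = math.sqrt(sum(v * v for v in counts.values()))
l2_weights = {term: v / l2_total for term, v in counts.items()}

for term in counts:
    print(f"{term:5s} l1={l1_weights[term]:.3f} l2={l2_weights[term]:.3f}")
```

The same counts give smaller numbers under L1 than under L2, so a fair before/after comparison of IDF would keep the norm fixed.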

## K-Means Clustering

In [305]:
# Reset the matrix!
vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      use_idf=True)
matrix = vec.fit_transform(df_train_vector_list)
idf_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())
idf_df.head()

Unnamed: 0,'bottom,'d,'edibl,'essenti,'get,'it,'ll,'m,'o,'perfect,...,zipper,zogoflex,zone,zone.i,zoo,zoo-m,zoom,zoomgroom,zuke,zymox
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [329]:
matrix.shape

(953, 6237)

In [306]:
number_of_clusters=2

km = KMeans(n_clusters=number_of_clusters)

km.fit(matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [307]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster:
Cluster 0: dog cat use food love
Cluster 1: toy dog chew play love


I don't like how the clusters overlap in their most common words. I'll increase the number of clusters.

In [308]:
number_of_clusters=5

km = KMeans(n_clusters=number_of_clusters)

km.fit(matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [309]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster:
Cluster 0: food dog eat love cat
Cluster 1: tank water filter use fish
Cluster 2: cat litter box love like
Cluster 3: toy dog chew play ball
Cluster 4: dog use work love great


I like this a LOT more. Each cluster seems to have a distinct theme:

- 0 - dog and cat food
- 1 - fish tank
- 2 - cat litter
- 3 - dog chew/play toys
- 4 - dog product usage and enjoyment
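KMeans numbers its clusters arbitrarily, and with random_state=None the numbering can change from one run to the next, so a theme list keyed only by number is easy to get out of step with the printed terms. Below is a small sketch of a helper that keeps the top terms attached to each label. `top_terms` is a hypothetical helper, and the toy `centers` and `terms` values stand in for km.cluster_centers_ and the vectorizer's feature names.

```python
def top_terms(centers, terms, n=5):
    """Return {cluster_index: [the n highest-weighted terms in that centroid]}."""
    summary = {}
    for i, center in enumerate(centers):
        # Rank column indices by centroid weight, largest first.
        ranked = sorted(range(len(terms)), key=lambda j: center[j], reverse=True)
        summary[i] = [terms[j] for j in ranked[:n]]
    return summary


# Toy stand-ins (assumptions): three centroids over a six-word vocabulary.
terms = ["cat", "dog", "filter", "litter", "tank", "toy"]
centers = [
    [0.05, 0.40, 0.00, 0.00, 0.00, 0.55],
    [0.50, 0.02, 0.00, 0.45, 0.00, 0.03],
    [0.00, 0.01, 0.45, 0.00, 0.50, 0.00],
]

summary = top_terms(centers, terms, n=2)
for label, words in summary.items():
    print("Cluster {}: {}".format(label, " ".join(words)))
```

Passing a fixed integer as random_state to KMeans is the other half of this: it makes the clustering, and therefore the numbering, repeatable.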

So now that I have my clusters, I'll add them back into my DataFrame with each review having an assigned cluster.

In [310]:
results = pd.DataFrame()
results['review'] = df_train_vector_list
results['cluster'] = km.labels_
results.head(10)

Unnamed: 0,review,cluster
0,"I love litter box. I use lids, keep using receptacle tears cracks. (Usually 3-4 months). I dump couple times week. Makes things last forever",4
1,"I've chuckit 15 years, yes lasted long. I got new one handle looking comfortable. The handle bit shorter original one I I love it. My arm sore useing 2 big dogs I much better control.",4
2,Not worth money. The individual dishes small. Maybe cat would work well defiantly dogs.,2
3,"My enthusiastic chewer barely put dent dinosaur month chewing. Unfortunately, vet recommend type toy may break teeth. So, I buy another one. Luckily, looks like one last forever.",3
4,"This great product heavy chewers. My dogs love keep teeth clean. They last long time, want sure discard warn dog tear little pieces off.",4
5,"This one chewy bones dog destroy day. She months, slowly working head tail gone body remained. She would continued chewing loved it, disappeared somewhere. A lot chew toys claim durable, I found one live claim most.",3
6,"Taking consideration single Perfect supplement creature - human otherwise, Nupro much proud of. Unlike many supplements Nupro manufactured United States first quality sources. Recent findings Pharmaceutical giant Pfizer.Who sourced manufacturing Pet-Tabs India, along newly published report Consumer labs levels lead approaching toxic zone discovered.Nupro based whole food concept. You well could feed dog nothing Nupro expect nutritional needs met. However specific proteins essential vitamins trace minerals would missed.I mention enforce although Nupro replacement healthy diet - enhance ever food currently feeding pooch.As dog lover never without dog three 50 plus years. I making dog food thirty.Long Internet I teaching I could canine nutrition. My self education continues - I remain highly suspect commercial dog food.Especially consider literally thousands brands sub brands - mere 14 considered safe healthy. Since founding Nupro 25 years ago never single instance recall, tainted ingredient unfortunate mishap.Using palatable desiccated (powdered) liver base Nupro adds following:1) Kelp - rich trace minerals iodine Kelp proven improve glandular functions provide rich source natural vitamins A, B1, B2, C, E.2) Amino Acids Enzymes - proven strengthen immune system dogs humans.3) Flax Seed - Great source Omega 3 Fats. Unlike fish oils smell more, chance mercury.4) Lecithin - long applauded effect brain, nerve healthy liver function.5) Allicin (highly condensed garlic oil) Proven fight cardiovascular disease. A powerhouse anti oxidants.6) Lactobacillus Acidophilus - Helps regulate digestive system, reduces gas, bloating stomach upset. Can compliment use whole fat yogurt use it.Nupro powdered form need refrigerated. 
Over past year, I exchanged emails founder eager answer questions support claims factual studies information.My real concern Nupro I believe initial dose recommendations far excessive.The company would benefit adding label information addressed dogs sensitive stomachs digestive problems.But common sense prevail. Never overload new ingredient dogs diet. Acclimation reduced smallest denominator.Rather one large scoop (included) I would advise teaspoon dogs 50 lbs significantly less smaller dogs. For finicky eaters Nupro mixed little warm water make gravy - like gravy?By adding Nupro dogs diet expect see:1) Dramatic improvement coat. More shine less shedding.2) Less stomach upset3) Firmer stools4) Higher balanced behavior energy5) Improved breathNupro excellent time proven product.Used conservatively beginning beloved pooch pooches experience overall improvement many areas dogs overall health well being.",0
7,"Our Bulldogs LOVE toy we. First shape great design full sorts curves areas pooch get good hold settle long chew-fest. Easy carry around even goliath sized Souper presents mobility problem 10 month old puppy.Our oldest 95 pound 6 year old Old English Bull dog practically weaned toy adores them. I can't even begin ti count many we've give good idea much dogs love these. The material newer extra tough Nylabone - tough enough hold Harley's massive jaw strength relentless, merct chewing fests.Like Nylabones discard ends fray. This one best toys time favorite.",3
8,"I feed elderly cat Wellness canned dry. She gets beef chicken chicken canned food. Fish flavors question seizure disorder. Her last vet checkup went well, even kidney issues. I would recommend Wellness anybody looking better cat food option. Doesn't well cats digestive problems. My younger cat can't eat Wellness without getting foul gas diarrhea.",0
9,It breaks apart smaller sharp pieces I step living room without shoes on. Much like Lego dark. Sad.,4


In [400]:
results['cluster'].value_counts()

4    459
2    140
1    121
3    119
0    114
Name: cluster, dtype: int64

Let's take a look at a few of these, keeping in mind that the cluster numbers are only labels and the top terms are what define each cluster. Starting with the reviews assigned to cluster 0 (food, dog, eat). Review 6 is a long write-up of a nutritional supplement for dogs, and review 8 is about cat food, particularly how an elderly cat with medical conditions tolerates it. Both are about what the pet eats.

Cluster 3 reviews (3, 5, and 7) are SQUARELY about dog chew toys. Excellent clustering here.

Cluster 4 (dog, use, work) is more of a catch-all. In this head() it holds a litter box (review 0), a Chuckit (review 1), a chew product (review 4), and something that breaks into sharp pieces (review 9). Review 4 might be better suited to cluster 3, but it is written more from the perspective of the owner's value of the product than the dog's use. I think it may be on the edge, but is still appropriately assigned. Review 0 looks like it belongs with the cat litter cluster, so this cluster is not perfect.

There is only one review in this head() for cluster 2 (cat, litter, box). It is about a dish that might have been meant for a dog but would work better for a cat. None of the first ten reviews landed in cluster 1 (tank, water, filter).

I am satisfied with this clustering for its ability to determine the subject of the review. But how does it help with predicting the author of the review? Each of the ten authors is presumably an American English speaker with a common vernacular, so the author is going to be very difficult to predict. The subject of the review is a different matter. I've shown that I can cluster the training set around specific content, like dog food and fish supplies. And the column 'asin' (the Amazon Standard Identification Number) identifies the particular product being reviewed. I expect that if I integrate my clustering back into my vectorized matrix I can run supervised learning models that predict the product.

## Returning to the Test Group

In [330]:
# join asin to results (df_new = pd.concat([df_])) - DONE
# Supervised learning models setting asin for y, matrix for X - DONE
# try with and without the cluster feature - IN PROGRESS
# filter out a single cluster and review the text versus asin for similarities - value_counts vs unique per cluster

In [359]:
# reset my indices so I can join together the matrix, asin, and clusters
idf_df.reset_index(drop=True, inplace=True)
asin.reset_index(drop=True, inplace=True)
results.reset_index(drop=True, inplace=True)
train_sup = pd.concat([idf_df, results['cluster'], asin], axis=1)

In [361]:
train_sup.head()

Unnamed: 0,'bottom,'d,'edibl,'essenti,'get,'it,'ll,'m,'o,'perfect,...,zone,zone.i,zoo,zoo-m,zoom,zoomgroom,zuke,zymox,cluster,asin
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,B00005MF9W
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,B00006IX59
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,B00006JHRE
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,B000084E6V
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,B000084E6V


In [363]:
Y = train_sup['asin']
X = train_sup.drop(['asin'], axis=1)

In [364]:
Y.head()

0    B00005MF9W
1    B00006IX59
2    B00006JHRE
3    B000084E6V
4    B000084E6V
Name: asin, dtype: object

In [365]:
X.head()

Unnamed: 0,'bottom,'d,'edibl,'essenti,'get,'it,'ll,'m,'o,'perfect,...,zogoflex,zone,zone.i,zoo,zoo-m,zoom,zoomgroom,zuke,zymox,cluster
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4


In [366]:
# Yes I have the reserve batch, but I'll train, test split what I have for testing anyway
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

In [367]:
# Gradient boosting has proven the most effective model for other NLP so far. I'll start with it this time.
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.9982486865148862

Test set score: 0.005235602094240838


Can't say I've seen this before. 99.8% training score and 0.5% test score. 

In [368]:
rfc = ensemble.RandomForestClassifier()
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9947460595446584

Test set score: 0.013089005235602094


Okay... let's leave the cluster out and see if that changes anything. 

In [369]:
X_noCluster = train_sup.drop(['asin', 'cluster'], axis=1)

In [370]:
X_train, X_test, y_train, y_test = train_test_split(X_noCluster, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

In [371]:
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9912434325744308

Test set score: 0.010471204188481676


In [372]:
lr = LogisticRegression()
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(571, 6237) (571,)
Training set score: 0.9632224168126094

Test set score: 0.028795811518324606


Clearly, the models are having a VERY tough time predicting the asin given the matrix and the cluster. Removing the cluster has not helped. Perhaps the models will be more effective targeting the clusters. How many distinct asin values are there anyway?

In [385]:
train_sup['asin'].value_counts()

B000JQALA4    6
B0002AR0II    5
B0009X29WK    5
B00008DFGY    5
B0002AR19Q    5
B0006N9I68    4
B001LWRFW2    4
B000BQN9LA    4
B001TI0XRW    4
B0009ZBKGE    4
B000GFI7UY    4
B00020SVDG    4
B0002AB9FS    3
B00251EPL2    3
B0002DK4IS    3
B000LPOUNW    3
B000F4AVPA    3
B0009X63SQ    3
B000255NCI    3
B0002566H4    3
B002JVUAM6    3
B000FS4OYA    3
B0002DHV16    3
B0002ARUKQ    3
B000MD3NLS    3
B000AUJFHE    3
B0002IEYIE    3
B000CMKHDG    3
B000084E6V    3
B000OX64P8    3
             ..
B002ABKBU6    1
B003982KVM    1
B001WOOC9S    1
B0002Z15UM    1
B000RSSCJ6    1
B0002602S2    1
B00028ZLTU    1
B0017JBHMS    1
B000NHSLVU    1
B0002DHZ9Y    1
B0040BJBC8    1
B0018CJJ9W    1
B0002DK6W2    1
B0002DIQD8    1
B0002563QI    1
B0002DHOJA    1
B003V4ARLE    1
B003TLUZ16    1
B000WFIVSQ    1
B0032GCISQ    1
B000ALY0OQ    1
B000VK33C6    1
B0002DIXOK    1
B0034DT2L8    1
B003BYQ09C    1
B000O39TE6    1
B000A8CUSM    1
B002FYZ0UY    1
B000FPKZLO    1
B001T8MCIU    1
Name: asin, Length: 765,

Well, there you have the answer. There are 765 distinct asin values across only 953 reviews: no product has more than 6 reviews, and most have fewer than 3. With so few examples per class, predicting asin was lost before it started. On to targeting the clusters.
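A quick check along these lines would have flagged the problem before any model was fit. The sketch below uses made-up labels, and `label_coverage` and `singleton_share` are hypothetical helper names. The idea is that a classifier can only predict labels it saw during training, so the share of test rows whose label also appears in the training split is a ceiling on test accuracy.

```python
from collections import Counter


def label_coverage(train_labels, test_labels):
    """Fraction of test rows whose label also occurs in the training labels."""
    seen = set(train_labels)
    if not test_labels:
        return 0.0
    return sum(label in seen for label in test_labels) / len(test_labels)


def singleton_share(labels):
    """Fraction of distinct labels that occur exactly once."""
    counts = Counter(labels)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Made-up product IDs standing in for the asin column.
train_labels = ["A1", "A1", "B2", "C3", "D4"]
test_labels = ["A1", "E5", "F6", "G7"]

coverage = label_coverage(train_labels, test_labels)
singles = singleton_share(train_labels + test_labels)

print("Upper bound on test accuracy:", coverage)
print("Share of labels seen only once:", round(singles, 3))
```

In the notebook, the same two calls could be pointed at y_train and y_test (converted to lists) right after the train/test split.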

In [376]:
Y = train_sup['cluster']
X = train_sup.drop(['cluster', 'asin'], axis=1)

In [377]:
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

In [378]:
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(571, 6237) (571,)
Training set score: 0.9019264448336253

Test set score: 0.7513089005235603


Overfitted, but a HUGE improvement.
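For context on how big an improvement this is, a majority-class baseline is useful: always guess the largest cluster. The counts below are copied from the value_counts() output above for the full 953-review set. The 40% test split will have slightly different proportions, so treat the result as approximate.

```python
# Cluster sizes copied from results['cluster'].value_counts() above.
cluster_counts = {4: 459, 2: 140, 1: 121, 3: 119, 0: 114}

total = sum(cluster_counts.values())

# The label with the most reviews, and the accuracy of always predicting it.
majority_label, majority_count = max(cluster_counts.items(), key=lambda kv: kv[1])
baseline_accuracy = majority_count / total

print("Majority cluster:", majority_label)
print("Baseline accuracy: {:.3f}".format(baseline_accuracy))
```

Always predicting cluster 4 would be right a little under half the time, so the test scores in this section should be read against that floor rather than against zero.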

In [379]:
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 1.0

Test set score: 0.7801047120418848


100% training score is a pretty big red flag for overfitting. But perhaps not necessarily. 

In [380]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 1.0

Test set score: 0.8979057591623036


90% accuracy, once again using gradient boosting. Let's look at a cross validation. 

In [381]:
cross_val_score(clf, X_test, y_test, cv=5)

array([0.88311688, 0.92207792, 0.87012987, 0.92105263, 0.85333333])

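Much more encouraging. The five folds score between 0.85 and 0.92, in line with the 0.90 test score, though this cross-validation only uses the 382 rows of the test split.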

Now I'll run through much the same process (streamlined to leave out all the digging around) for the reserve batch.

In [387]:
df_test_vector_list = df_test_vector.tolist()

vec = TfidfVectorizer(tokenizer=textblob_tokenizer,
                      stop_words='english',
                      norm='l1',
                      use_idf=True)
matrix = vec.fit_transform(df_test_vector_list)
df_test_tfidf = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())
df_test_tfidf.head()

Unnamed: 0,'blue,'d,'dog,'ll,'m,'medium,'mobil,'posh,'re,'rememb,...,york,yorki,young,younger,youngest,youtu.be/xwxbfzg9w5m,youtub,yr,ythe,zero
0,0.0,0.029571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [388]:
number_of_clusters=5

km = KMeans(n_clusters=number_of_clusters)

km.fit(matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [389]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster:
Cluster 0: machin fight wash bed cover
Cluster 1: absolut pup love zero feral
Cluster 2: dog cat love like toy
Cluster 3: mess bone make rawhid busi
Cluster 4: girl med thank littl bell


These clusters do NOT match the clean and distinct clusters from the training set. Remember, the training clusters were categorized as:

- 0 - dog and cat food
- 1 - fish tank
- 2 - cat litter
- 3 - dog chew/play toys
- 4 - dog product usage and enjoyment

I'll try playing with the number of clusters a bit.

In [394]:
number_of_clusters=3

km = KMeans(n_clusters=number_of_clusters)

km.fit(matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [395]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster:
Cluster 0: absolut pup love zero feral
Cluster 1: dog treat love flavor size
Cluster 2: dog cat love like toy


An 8-cluster run (not shown) got a little too specific, with top terms like '26', 'feral', 'stainless', and 'goldendoodl'. 3 left things too vague, with overlapping terms. It seems that 5 is still a sweet spot, but there is some definite change in the clusters. Let's try modeling on the 5 and see how they test out.

In [396]:
number_of_clusters=5

km = KMeans(n_clusters=number_of_clusters)

km.fit(matrix)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [397]:
print("Top terms per cluster:")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vec.get_feature_names()
for i in range(number_of_clusters):
    top_ten_words = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster {}: {}".format(i, ' '.join(top_ten_words)))

Top terms per cluster:
Cluster 0: terribl final start took litter
Cluster 1: dog love cat like toy
Cluster 2: absolut pup love zero feral
Cluster 3: 26 disappoint cat play toy
Cluster 4: gsd usa 6 hold puppi


In [398]:
results_test = pd.DataFrame()
results_test['review'] = df_test_vector_list
results_test['cluster'] = km.labels_
results_test.head(10)

Unnamed: 0,review,cluster
0,"Not doggy wash leave dog smelling delicious also leaves super curly coat soft manageable. Our last order received one two bottles, clear whether seller Amazon packaging issue handled quickly via Amazon support. Regardless receipt issue, shampoo awesome. I'd buy super fresh scent alone!",1
1,Cloud Star products really great dog. Smells wonderful leaves coat soft fresh least week.,1
2,"Love hoodie! So soft, thick, great attention detail like soft fuzzy fabric added inside along velcro, rub. However, NOT go measurements description one review various sizes. Based those, I ordered XXS 4 pound 2 ounce foster Chihuahua. The hoodie actually measures 6.5"" inches long neckline waist band, including hood. That quite bit shorter 8 inches. The waist band super tiny! It measures 7 inches diameter, that's wrist. I upload two photos tape measure. Can't exchange since sold by, fulfilled by, Amazon. Will returning reordering XS, hopefully waist big enough without chest getting big, tight almost fit.",1
3,"I dog barks lot training bark much using ""Quiet"" command giving treat every time quiet (first conditioned barked saying ""quiet giving treat every time said word - say word immediately stops whatever looks treat).Treats used past good dry, stinky get attention one barking rants. These treats soo stinky dog loves them!I love easy break 10+ pieces training since soft. Also, compared snacks healthy duck meat first ingredient. Although I consider treats healthiest get, certainly unhealthiest. We use duck flavor dog diarrhea chicken, beef, salmon work well him.",1
4,"I ordered gift entertain parents' two dogs. Dog #1 45 pound, 9 year old mutt. It took 5 minutes figure could fit entire mouth around yellow piece lift out. Now knows trick, takes 10 seconds. Dog #2 15 pound, 1 year old Chihuahua. He gave minute even try toy. He like chew yellow cups though, can't left out.After dog #2 gave up, 1 year old extremely food motivated cat tried out. It took least 3 minutes work cup out, persevered rewarded treat. I may end keeping since working dogs.I would recommend toy dogs 30-35 pounds only. Bigger dogs fit mouth around challenge. Also dogs inquisitive intelligent. It works well cats too.",1
5,"My German Shepherd auto-immune disease requires low-dose prednisone rest life, meaning, I need give extremely healthy low-fat treats prevent pancreatitis trips animal ER. A coworker brought back dog one occasion, absolutely hooked. I tear small bites large peices use treats. I unable use lieu raw hides since large aggressive chewer.I continue recommend product.",1
6,"I would suggest soak bowl minutes rinse running water. It seems residue coming filter, run water.Other that..........should fine. My kittys love new fountain fits in. Amazon greatest......has everything!!!",1
7,My cats love stainless steel pet fountain filters help keep fur pump. It easy replace filters.,1
8,"I ""5 Gal Repitat Reptile Habitat"" I house young Snake in. This light canopy fits great top. The thing I noticed, didnt give 5 stars, level. There 3 small pillars allow lock place using exo terra terrariums. You sort see picture. The lack forth one makes uneven flat surface. You I put adhesive chair leg pad solves problem, I show picture I added. But aware level need be.Also description says ""Use Exo Terra compact fluorescent (max 26W) incandescent (max 40W) light bulbs"".But box manual says incandescent (MAX 25W). Just something aware using incandescent bulbs.I think mentions description either unit used dimmer. It states manual comes it. I guess dimmer cause fail burn something.Great Canopy tho small Terrarium.",1
9,"I've keeping fish thirty five years - wonderfully innovative double bright LED lighting systems best advances I've seen aquarium lighting.I impressed largest sized units - 48-60 I 55 gallons way 125 gallons ( need Double uo units larger tanks avoid shadow ensure uniform lighting)This size perfect tanks 36"" across, regular breeder tank results absolutely stunning.Gone dull white haze expensive, energy draining full spectrum tubes. Not mention awfully shoddy fixtures changed least since I kid 40 years ago.Having looked alternative lighting past, setups either expensive, ran hot difficult install.One thing certain Marineland hit hands. The unit simplicity itself. A sleek smart design - functional beautiful. Solid build quality, need special wiring - installation breeze. Simply place glass canopy, turn get ready jaw drop.The light unit nothing short gorgeous, pours crystal, pristine brilliant. The ""shimmering effect"" subtle really quite remarkable - Lunar mode almost surreal beautiful.I've read forums folks using lights without versa top. All takes one good splash short system - mention void warrantee.The unit incredibly thin, trouble fitting canopy top tank. The lights rest 1/4 inch glass tops - course want place back. The end brackets fully adjustable make perfect, snap fit. However, I suggest using size smaller tanks, say 40 gallon ignore brackets position unit without brackets extended dead center glass versa top.The reason go nuts trying get system perfectly centered. In addition look without brackets extended sensational, clean high tech.There's lot bang buck :* The day light LED lights consume 35 watts opposed 80 single fluorescent bulb.* The Lunar LED lights spectacular - allow watch fish without disturbing natural sleep/rest cycle.* Each LED covered Polycarbonate lens - protects bulb, helps create ""shimmer effect"" hobbyists raving about.* Everything need controlled single side switch. 
Daylight, Lunar off.Nothing I've seen re-creates look actual sunlight perfectly - spectacular.You get impressive 17,000 lifetime hours, Daylight mode, Lunar mode, incredible Shimmer effect that's bound knock - whopping 1800 Lumens clear, clean light. The Marineland Double Bright LED lighting system exceptional.Once turn know immediately system worth every penny. Just brilliant product - meet expectations, exceeds them.",1


In [399]:
results_test['cluster'].value_counts()

1    311
3    2  
4    1  
2    1  
0    1  
Name: cluster, dtype: int64

Uh oh. That's not good at all. Seeing that all ten rows of my .head() display were assigned to cluster 1, I checked the value counts. Only 1 review each was assigned to clusters 0, 2, and 4, and only 2 to cluster 3. The other 311 are all in cluster 1. This is decidedly different from my training set. There seems no point in fitting a supervised learning model onto this.
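Before writing off the reserve batch, one difference from the training run is worth ruling out. The reserve cell builds a new TfidfVectorizer with norm='l1' and fits it on the reserve reviews, while the training clusters came from a vectorizer with the default norm, so the two sets of clusters live in different feature spaces. A minimal sketch of the alternative is to reuse the fitted vectorizer and KMeans model to label new reviews. The toy reviews, the default tokenizer (instead of textblob_tokenizer), and the fixed random_state are assumptions for illustration, and this has not been run on the actual data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the training and reserve review lists.
train_reviews = [
    "dog loves this chew toy",
    "tough chew toy for big dog",
    "cat uses the litter box",
    "litter box works for my cat",
]
reserve_reviews = [
    "new chew toy for the dog",
    "cat litter tracks everywhere",
]

# Fit the vocabulary and IDF weights on the training reviews only.
vec = TfidfVectorizer(stop_words="english")
train_matrix = vec.fit_transform(train_reviews)

# random_state is fixed so the cluster numbering is repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(train_matrix)

# transform (not fit_transform) keeps the reserve reviews in the same columns.
reserve_matrix = vec.transform(reserve_reviews)

# predict assigns each reserve review to the nearest training centroid.
reserve_labels = km.predict(reserve_matrix)

print("Same number of columns:", train_matrix.shape[1] == reserve_matrix.shape[1])
print("Training labels:", list(km.labels_))
print("Reserve labels:", list(reserve_labels))
```

With this setup the reserve labels refer to the same clusters as the training labels, which is what a comparison of the two value_counts() outputs needs.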

At this point I think it wiser to keep the dataset together (training and reserve). In fact, since I'm able to run my model with high scores on the clusters, and I can see that there is no way (that I know of) to train on the author or the asin, I should be able to run this model on the entire Pet Supplies dataset of 157,836 reviews instead of just the batch from the top 10 most prolific authors in that genre, which makes up only 1,270.

Let's check how a single cluster (0) came out as far as the content of its reviews. Remember, the top terms of cluster 0 were food, dog, eat, love, & cat.

In [404]:
results_0 = results.loc[results['cluster'] == 0]

In [405]:
results_0.head(10)

Unnamed: 0,review,cluster,text_lemmatized,author
6,"Taking consideration single Perfect supplement creature - human otherwise, Nupro much proud of. Unlike many supplements Nupro manufactured United States first quality sources. Recent findings Pharmaceutical giant Pfizer.Who sourced manufacturing Pet-Tabs India, along newly published report Consumer labs levels lead approaching toxic zone discovered.Nupro based whole food concept. You well could feed dog nothing Nupro expect nutritional needs met. However specific proteins essential vitamins trace minerals would missed.I mention enforce although Nupro replacement healthy diet - enhance ever food currently feeding pooch.As dog lover never without dog three 50 plus years. I making dog food thirty.Long Internet I teaching I could canine nutrition. My self education continues - I remain highly suspect commercial dog food.Especially consider literally thousands brands sub brands - mere 14 considered safe healthy. Since founding Nupro 25 years ago never single instance recall, tainted ingredient unfortunate mishap.Using palatable desiccated (powdered) liver base Nupro adds following:1) Kelp - rich trace minerals iodine Kelp proven improve glandular functions provide rich source natural vitamins A, B1, B2, C, E.2) Amino Acids Enzymes - proven strengthen immune system dogs humans.3) Flax Seed - Great source Omega 3 Fats. Unlike fish oils smell more, chance mercury.4) Lecithin - long applauded effect brain, nerve healthy liver function.5) Allicin (highly condensed garlic oil) Proven fight cardiovascular disease. A powerhouse anti oxidants.6) Lactobacillus Acidophilus - Helps regulate digestive system, reduces gas, bloating stomach upset. Can compliment use whole fat yogurt use it.Nupro powdered form need refrigerated. 
Over past year, I exchanged emails founder eager answer questions support claims factual studies information.My real concern Nupro I believe initial dose recommendations far excessive.The company would benefit adding label information addressed dogs sensitive stomachs digestive problems.But common sense prevail. Never overload new ingredient dogs diet. Acclimation reduced smallest denominator.Rather one large scoop (included) I would advise teaspoon dogs 50 lbs significantly less smaller dogs. For finicky eaters Nupro mixed little warm water make gravy - like gravy?By adding Nupro dogs diet expect see:1) Dramatic improvement coat. More shine less shedding.2) Less stomach upset3) Firmer stools4) Higher balanced behavior energy5) Improved breathNupro excellent time proven product.Used conservatively beginning beloved pooch pooches experience overall improvement many areas dogs overall health well being.",0,"[Taking, consideration, single, Perfect, supplement, creature, -, human, otherwise,, Nupro, much, proud, of., Unlike, many, supplement, Nupro, manufactured, United, States, first, quality, sources., Recent, finding, Pharmaceutical, giant, Pfizer.Who, sourced, manufacturing, Pet-Tabs, India,, along, newly, published, report, Consumer, lab, level, lead, approaching, toxic, zone, discovered.Nupro, based, whole, food, concept., You, well, could, feed, dog, nothing, Nupro, expect, nutritional, need, met., However, specific, protein, essential, vitamin, trace, mineral, would, missed.I, mention, enforce, although, Nupro, replacement, healthy, diet, -, enhance, ever, food, currently, feeding, pooch.As, dog, lover, never, without, dog, three, 50, plus, years., I, making, dog, food, thirty.Long, Internet, I, teaching, I, ...]",
8,"I feed elderly cat Wellness canned dry. She gets beef chicken chicken canned food. Fish flavors question seizure disorder. Her last vet checkup went well, even kidney issues. I would recommend Wellness anybody looking better cat food option. Doesn't well cats digestive problems. My younger cat can't eat Wellness without getting foul gas diarrhea.",0,"[I, feed, elderly, cat, Wellness, canned, dry., She, get, beef, chicken, chicken, canned, food., Fish, flavor, question, seizure, disorder., Her, last, vet, checkup, went, well,, even, kidney, issues., I, would, recommend, Wellness, anybody, looking, better, cat, food, option., Doesn't, well, cat, digestive, problems., My, younger, cat, can't, eat, Wellness, without, getting, foul, gas, diarrhea.]",
10,cats seem like food seems stop urinary tract infections said seems like great cat food,0,"[cat, seem, like, food, seems, stop, urinary, tract, infection, said, seems, like, great, cat, food]",
18,"dog does!.When rescued puppy North Shore Animal League, said use brand dog food. Our dog recently graduated puppy formula &#34;adult&#34; version, thankfully, digestion issues.",0,"[dog, does!.When, rescued, puppy, North, Shore, Animal, League,, said, use, brand, dog, food., Our, dog, recently, graduated, puppy, formula, &#34;adult&#34;, version,, thankfully,, digestion, issues.]",
19,This good food kittens. It's nice small pieces kittens love. It's got nutrients growing kitten needs. I love it.,0,"[This, good, food, kittens., It's, nice, small, piece, kitten, love., It's, got, nutrient, growing, kitten, needs., I, love, it.]",
21,"Good seal; problems woith bugs food. I keep garage. I like see much food left; helps know re-purchase.Good capacity, too.",0,"[Good, seal;, problem, woith, bug, food., I, keep, garage., I, like, see, much, food, left;, help, know, re-purchase.Good, capacity,, too.]",
22,This holds fairly large bag dog food - enough two pugs one month. It fits narrow cabinet. Very convenient.,0,"[This, hold, fairly, large, bag, dog, food, -, enough, two, pug, one, month., It, fit, narrow, cabinet., Very, convenient.]",
41,I like vary fish's diet flake food time. I give fish twice week LOVE it. Definitely bettas! My tropical community fish cichlids love these!,0,"[I, like, vary, fish's, diet, flake, food, time., I, give, fish, twice, week, LOVE, it., Definitely, bettas!, My, tropical, community, fish, cichlid, love, these!]",
66,"I've used Hikari thirty years great results. Some Hikari formulas graded protein carry MSG - However Cichlid Gold one them.The best things food first ingredient white fish meal. It's important evaluating pet food source animal protein identified, case white fish. In addition many companies manufactures pet foods get around certain guidelines weighing protein source prior cooking. Being raw meat 90% water prior cooking often listed first ingredient - fact much lower list.This Hikari formula lists ""white fish meal"" first ingredient making food step quality. I use food conjunction Zoo-Meds Spirulina 20 incredibly rich protein, amino acids essentials virtually zero filler. This combination - addition fresh blanched kale dark greens brought color dramatically, fish wonderfully healthy highly resistant bacterial infections illness.My 12"" Red Devil equally large Flowerhorn Texas Red virtually head stands see Hiraki bag. My complaint amount filler including wheat. The first three ingredients fish food without question animal based protein. To credit free Ethoxyquin often listed preservative fact pesticide. Which read studies especially lethal fish - even trace amounts. 
A good article found Wiki-pedia links one mentioned studies.",0,"[I've, used, Hikari, thirty, year, great, results., Some, Hikari, formula, graded, protein, carry, MSG, -, However, Cichlid, Gold, one, them.The, best, thing, food, first, ingredient, white, fish, meal., It's, important, evaluating, pet, food, source, animal, protein, identified,, case, white, fish., In, addition, many, company, manufacture, pet, food, get, around, certain, guideline, weighing, protein, source, prior, cooking., Being, raw, meat, 90%, water, prior, cooking, often, listed, first, ingredient, -, fact, much, lower, list.This, Hikari, formula, list, ""white, fish, meal"", first, ingredient, making, food, step, quality., I, use, food, conjunction, Zoo-Meds, Spirulina, 20, incredibly, rich, protein,, amino, acid, essential, virtually, zero, filler., ...]",
67,"I've keeping Fancy Goldfish close 40 years I've learned anything marvelous creatures need room provide, massive water changes wide variety sinking foods.Here ingrediants TetraFin:Fish meal - You want see specific fish named salmon meal white fish meal (as pet foods) ""Fish meal"" means MFG using whatever cheapest market including rancid diseased fish fish products.Ground brown rice - The first three ingrediants ANY pet fish food specific animal protien - period. Rice filler used weight volume packaging.Torula dried yeast - cheap protien enhancerOat meal - binding agent fillerShrimp meal - Excellent found trace amounts foodWheat gluten - The number one source pet food related deaths recalls past 20 years. Works binding agent artifically pumps crude protien percentageSoybean oil - Few Omega fatty acidsFish oil - Again want see specific fish named like salmon. When specific animal mentioned means manufacturer buying whatever cheapest time. What more, leaves free use rancid diseased fish.Algae meal - Excellent low arounts here.Sorbitol- Artifical sweetner makes food palatable fish, potentially toxic.Artifical colors including yellow 5, red 3, blue 2, - Zero nutrional valueethoxyquin. - Ethoxyquin usually listed ""preservtive - however lethal pesticide. The FDA banned dog cat foods. Studies show ethoxyquin especially lethal fish. research Google simply look Wiki-Pedia follow links research findings.What more, flake foods avoided costs. Most Fancy Goldfish especially deep chested fish like Oranda, Ranch butterflys extremly susceptable bouyancy issues swim bladder disease. One significant causes this, taking air. 
That said fish infinatley better sinking foods.Much better commercial foods include ZooMed, Aqueon, Omega One, New Spectrum Hikari.",0,"[I've, keeping, Fancy, Goldfish, close, 40, year, I've, learned, anything, marvelous, creature, need, room, provide,, massive, water, change, wide, variety, sinking, foods.Here, ingrediants, TetraFin:Fish, meal, -, You, want, see, specific, fish, named, salmon, meal, white, fish, meal, (as, pet, foods), ""Fish, meal"", mean, MFG, using, whatever, cheapest, market, including, rancid, diseased, fish, fish, products.Ground, brown, rice, -, The, first, three, ingrediants, ANY, pet, fish, food, specific, animal, protien, -, period., Rice, filler, used, weight, volume, packaging.Torula, dried, yeast, -, cheap, protien, enhancerOat, meal, -, binding, agent, fillerShrimp, meal, -, Excellent, found, trace, amount, foodWheat, gluten, -, The, number, one, source, ...]",


Pretty outstanding results, really. Be it dog food or cat food, this cluster is clearly about pet food. Heck, that 10th review is about fish food! This also shows that you can't always trust your gut reaction to the top terms a cluster produces. Looking back at the code now, I can see there was less reason to believe this cluster was about cats than about pet food in general.

## Conclusion

Clustering allowed me to take thousands of Amazon product reviews and group them by the content of the review itself rather than by labeled data about that content. Using K-Means, I was able to dial in the number of clusters: I started with 10 because of the number of authors, then reduced it to 5, which gave the most distinct categorization of the reviews' subjects.
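
As a rough illustration of how the number of clusters could be compared, here is a minimal sketch that scores several candidate values of k with the silhouette coefficient. This is an assumption on my part for illustration: the silhouette score is one common heuristic, not necessarily how the 5 above was reached, and `toy_reviews` and `score_cluster_counts` are hypothetical stand-ins for the real cleaned review text and code.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-in data; the real input would be the cleaned review text.
toy_reviews = [
    "dog food kibble healthy ingredients",
    "cat food canned wellness chicken",
    "puppy formula dog food digestion",
    "kitten food small pieces nutrients",
    "dog chew toy durable rubber",
    "cat toy feather wand fun",
    "squeaky toy dog loves fetch",
    "fish flake food bettas cichlids",
    "goldfish sinking pellets fish meal",
    "aquarium filter fish tank water",
    "storage container seal bag capacity",
    "container fits cabinet holds large bag",
]

# Turn the text into TF-IDF features.
X = TfidfVectorizer().fit_transform(toy_reviews)


def score_cluster_counts(X, candidate_ks, random_state=42):
    """Fit K-Means once per candidate k and return {k: silhouette score}."""
    scores = {}
    for k in candidate_ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = km.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = score_cluster_counts(X, candidate_ks=[2, 3, 4, 5])
best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(scores, best_k)
```

A higher silhouette score suggests tighter, better separated clusters, but it is still worth reading sample reviews from each cluster before settling on a value.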

I also tried to model for the `asin`, the Amazon Standard Identification Number that identifies a specific product. But this ran into a problem somewhat opposite to the issue with the authors. For the authors, I did indeed have only 10 classes to categorize. But their vernacular was so similar, and their individual reviews so short, that there was very little chance of predicting who was talking about a certain kind of chew toy based on the language used.

The ASINs, on the other hand, were so numerous, with so few reviews assigned to each, that categorizing by product was impossible. I considered researching the ASIN format to see whether it encodes a category, but the ASINs came in several different formats.
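
The contrast between the two candidate targets can be checked by counting reviews per label. The sketch below is only illustrative: the rows, ID values, and the helper name `label_support` are made up, and only the column names `asin` and `reviewerID` come from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the review table; the values are placeholders.
toy_df = pd.DataFrame({
    "asin": ["P1", "P1", "P2", "P3", "P4", "P5"],
    "reviewerID": ["R1", "R1", "R1", "R2", "R2", "R2"],
})


def label_support(series):
    """Summarize how many examples each class of a candidate target has."""
    counts = series.value_counts()
    return {
        "n_classes": len(counts),
        "median_per_class": float(counts.median()),
        "min_per_class": int(counts.min()),
    }


asin_support = label_support(toy_df["asin"])
author_support = label_support(toy_df["reviewerID"])
print(asin_support)
print(author_support)
```

Many classes with only a review or two each, as with `asin` here, leave a classifier almost nothing to learn from per class.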

Instead I created clusters from the content of the reviews, which feels much more unsupervised anyway. So modeling to the clusters with supervised learning models was the right way to go. And I see that I can use this method to expand well beyond these reviews to the entire Pet Supplies department, maybe even other departments. Why not, if the unsupervised clustering methods are able to separate reviews by content? It's just a matter of choosing the number of clusters and tuning parameters.
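
To make the "cluster first, then classify" idea concrete, here is a minimal sketch under stated assumptions: the corpus is synthetic, the three topics and their word lists are invented, and TF-IDF with logistic regression stands in for whichever features and models are actually used.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical topic vocabularies used to build a small synthetic corpus.
topic_words = {
    "dog": ["dog", "kibble", "leash", "puppy", "bark"],
    "cat": ["cat", "litter", "kitten", "scratch", "purr"],
    "fish": ["fish", "aquarium", "flake", "tank", "filter"],
}

# Ten three-word "reviews" per topic, built by sliding over each word list.
toy_reviews = [
    " ".join(words[(i + j) % len(words)] for j in range(3))
    for words in topic_words.values()
    for i in range(10)
]

X = TfidfVectorizer().fit_transform(toy_reviews)

# Step 1 (unsupervised): derive a label for every review from its content.
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Step 2 (supervised): check how well a classifier can recover those labels.
clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X, cluster_labels, cv=3)
print(cv_scores.mean())
```

One caveat with this setup: when the classifier sees the same features the clusters were built from, high accuracy mostly confirms that the clusters are well separated, so held-out reviews are the more honest test.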