# Simple Popularity, Item-Item Collaborative Filtering, and Matrix Factorization Models Using TuriCreate

TuriCreate was developed by Apple and uses its own type of dataframe, called SFrame. The package has its own visualization capabilities, which are briefly tried below.

The code below tests a simple popularity model and two more advanced recommendation methods, item-item cosine similarity and matrix factorization, all within TuriCreate.

In the simple popularity model, we find the most popular items (here, those with the highest mean rating, 5.0) and recommend these items to every user. I think this method may be useful in a true cold-start scenario, but not in situations where a user's history is available.

The first of the more advanced methods computes the cosine similarity between items and recommends to a user items 'similar' to ones they have previously purchased. This method only produced an RMSE of 4.301, which leaves considerable room for improvement.

The final method we tried was a matrix factorization approach, in which both similar items and similar users are found based on rating histories. This model was able to produce stronger recommendations and resulted in an RMSE of about 1.29.



Importing and cleaning data

In [201]:
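#import ratings data
import gzip
import pandas as pd

path = '/Users/marcushimelhoch/Downloads/'

def parse(path):
  # yield one review record (a dict) per line of the gzipped file
  # note: eval() executes whatever is in the file, so only use it on trusted data
  with gzip.open(path, 'rb') as g:
    for l in g:
      yield eval(l)

def getDF(path):
  # collect the parsed records into a DataFrame, one row per review
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')

df = getDF('reviews_Grocery_and_Gourmet_Food_5.json.gz')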

In [202]:
df.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A1VEELTKS8NLZB,616719923X,Amazon Customer,"[0, 0]",Just another flavor of Kit Kat but the taste i...,4.0,Good Taste,1370044800,"06 1, 2013"
1,A14R9XMZVJ6INB,616719923X,amf0001,"[0, 1]",I bought this on impulse and it comes from Jap...,3.0,"3.5 stars, sadly not as wonderful as I had hoped",1400457600,"05 19, 2014"
2,A27IQHDZFQFNGG,616719923X,Caitlin,"[3, 4]",Really good. Great gift for any fan of green t...,4.0,Yum!,1381190400,"10 8, 2013"
3,A31QY5TASILE89,616719923X,DebraDownSth,"[0, 0]","I had never had it before, was curious to see ...",5.0,Unexpected flavor meld,1369008000,"05 20, 2013"
4,A2LWK003FFMCI5,616719923X,Diana X.,"[1, 2]",I've been looking forward to trying these afte...,4.0,"Not a very strong tea flavor, but still yummy ...",1369526400,"05 26, 2013"


In [203]:
df= df.drop(columns = [ 'reviewerName','helpful', 'unixReviewTime', 'reviewTime'])

In [204]:
df = df.drop(columns = ['reviewText','summary'])

In [205]:
df.head()

Unnamed: 0,reviewerID,asin,overall
0,A1VEELTKS8NLZB,616719923X,4.0
1,A14R9XMZVJ6INB,616719923X,3.0
2,A27IQHDZFQFNGG,616719923X,4.0
3,A31QY5TASILE89,616719923X,5.0
4,A2LWK003FFMCI5,616719923X,4.0


In [207]:
import numpy as np

8713

In [212]:
import turicreate
tu_data = turicreate.SFrame(df)

In [213]:
tu_data['overall'].show()

1. Create a simple popularity model: all users have the same recommendation based on the most popular choices

In [214]:
#create instance
popularity_model = turicreate.popularity_recommender.create(tu_data, user_id = 'reviewerID', item_id = 'asin', target = 'overall')

In [215]:
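#find top 5 products for first 5 users
#in the event that a user already rated a product, it is not proposed again
#note: the integers 1-5 are not actual reviewerID strings, so the model likely treats them as new users and returns the same list for each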

popularity_recomm = popularity_model.recommend(users = [1,2,3,4,5], k = 5)
popularity_recomm.print_rows(num_rows = 25)

+------------+------------+-------+------+
| reviewerID |    asin    | score | rank |
+------------+------------+-------+------+
|     1      | B0000CNU15 |  5.0  |  1   |
|     1      | B0000CFLIL |  5.0  |  2   |
|     1      | B0000CFLCT |  5.0  |  3   |
|     1      | B0000CDBQN |  5.0  |  4   |
|     1      | B00005C2M2 |  5.0  |  5   |
|     2      | B0000CNU15 |  5.0  |  1   |
|     2      | B0000CFLIL |  5.0  |  2   |
|     2      | B0000CFLCT |  5.0  |  3   |
|     2      | B0000CDBQN |  5.0  |  4   |
|     2      | B00005C2M2 |  5.0  |  5   |
|     3      | B0000CNU15 |  5.0  |  1   |
|     3      | B0000CFLIL |  5.0  |  2   |
|     3      | B0000CFLCT |  5.0  |  3   |
|     3      | B0000CDBQN |  5.0  |  4   |
|     3      | B00005C2M2 |  5.0  |  5   |
|     4      | B0000CNU15 |  5.0  |  1   |
|     4      | B0000CFLIL |  5.0  |  2   |
|     4      | B0000CFLCT |  5.0  |  3   |
|     4      | B0000CDBQN |  5.0  |  4   |
|     4      | B00005C2M2 |  5.0  |  5   |
|     5    

In [274]:
meta.loc[meta['asin']== 'B0000CNU15']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
19,B0000CNU15,{'Grocery & Gourmet Food': 2619},http://ecx.images-amazon.com/images/I/51YAihJn...,[['Grocery & Gourmet Food']],Lee Kum Kee Chiu Chow Chili Oil,An authentic hot chili sauce originated in Chi...,7.09,"{'also_bought': ['B000F06ZCW', 'B0001WOSQY', '...",Unknown


In [275]:
meta.loc[meta['asin']== 'B0000CFLIL']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
13,B0000CFLIL,{},http://ecx.images-amazon.com/images/I/51cnoigB...,[['Grocery & Gourmet Food']],"Melitta Cone Coffee Filters, Natural Brown, No...","Thicker, textured, high quality paper with pat...",29.99,"{'also_bought': ['B000MIT2OK', 'B000BUDDTY', '...",Melitta


In [276]:
meta.loc[meta['asin']== 'B0000CFLCT']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
14,B0000CFLCT,,http://ecx.images-amazon.com/images/I/51R86XH4...,[['Grocery & Gourmet Food']],"Melitta Coffee Maker, 6 Cup Pour-Over Brewer w...",CM6/4 Features: -Coffee maker.-Prepares a full...,9.09,"{'also_bought': ['B00006IUTQ', 'B0014CX7KI', '...",Melitta


In [277]:
meta.loc[meta['asin']== 'B0000CDBQN']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
10,B0000CDBQN,{'Grocery & Gourmet Food': 46351},http://ecx.images-amazon.com/images/I/41RRpfr9...,[['Grocery & Gourmet Food']],Chef Paul Prudhomme's Magic Seasoning Blends ~...,Chef Paul Prudhommes Magic Seasoning Blends ha...,3.5,"{'also_bought': ['B0000CDBQL', 'B0000CDBPW', '...",Magic Seasoning Blends


In [278]:
meta.loc[meta['asin']== 'B00005C2M2']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
6,B00005C2M2,{},http://ecx.images-amazon.com/images/I/518Pt3s4...,[['Grocery & Gourmet Food']],American Outdoor Products Astronaut Ice Cream ...,Funkyfoodshop is the #1 seller of space food o...,23.5,"{'also_bought': ['B001CCQCR0', 'B00005C2M3', '...",American Outdoor Products


In [69]:
#verifying whether these products are indeed the top 5 highest-rated products
#looks like there is an issue: many products have a mean rating of 5.0, so the top 5 is ambiguous
#a more sophisticated model is likely needed

df.groupby(by = 'asin')['overall'].mean().sort_values(ascending = False).head(20)

asin
B0024KGQJI    5.0
B0050IM4MY    5.0
B004YZSJLO    5.0
B004Z4PKP2    5.0
B000N49OWS    5.0
B000MT8FK6    5.0
B004ZWRALQ    5.0
B000MOEUNC    5.0
B0050ILOZW    5.0
B0050MMMMW    5.0
B000NU4VSO    5.0
B000MAK3UK    5.0
B00515JKXW    5.0
B0051QZM60    5.0
B00522AFRE    5.0
B005258A2I    5.0
B0052AHU38    5.0
B0052LDET6    5.0
B000NERTSE    5.0
B004YTV5S4    5.0
Name: overall, dtype: float64
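One possible way to break the ties above, offered as a sketch rather than as what TuriCreate does internally, is to shrink each item's mean toward the global mean according to how many ratings it has. The toy data, the `damping` constant, and the `damped_mean` helper are all assumptions made for illustration.

```python
import pandas as pd

def damped_mean(ratings, item_col="asin", rating_col="overall", damping=5.0):
    """Rank items by a mean rating shrunk toward the global mean.

    Items with few ratings are pulled toward the global mean, so a single
    5.0 rating no longer ties with many consistently high ratings.
    """
    global_mean = ratings[rating_col].mean()
    stats = ratings.groupby(item_col)[rating_col].agg(["sum", "count"])
    score = (stats["sum"] + damping * global_mean) / (stats["count"] + damping)
    return score.sort_values(ascending=False)

# Hypothetical toy data: item "A" has one 5.0 rating, item "B" has six ratings.
toy = pd.DataFrame({
    "asin": ["A", "B", "B", "B", "B", "B", "B", "C", "C"],
    "overall": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 3.0, 2.0],
})
ranking = damped_mean(toy)
print(ranking)
```

With a plain mean, item "A" (one rating) would rank first; with damping, the better-supported item "B" moves ahead of it. The right `damping` value would need to be tuned on real data.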

In [216]:
#create a train test split
training_data, validation_data = turicreate.recommender.util.random_split_by_user(tu_data, 'reviewerID', 'asin',item_test_proportion=0.2)


In [217]:
#Create a model based on item-item similarity

#create an instance
item_sim_model = turicreate.item_similarity_recommender.create(training_data, user_id = 'reviewerID', item_id = 'asin', target = 'overall', similarity_type = 'cosine')

In [218]:
items_similarity = item_sim_model.get_similar_items()

In [219]:
#before evaluating the model, empirically test what the model thinks are similar products to ASIN '616719923X'
(items_similarity[(items_similarity['asin'] == '616719923X' )]).sort('rank', ascending = True).print_rows()


+------------+------------+---------------------+------+
|    asin    |  similar   |        score        | rank |
+------------+------------+---------------------+------+
| 616719923X | B004MFNGEQ |  0.2685350179672241 |  1   |
| 616719923X | B0007LXU86 |  0.1521187424659729 |  2   |
| 616719923X | B00374ZKQ0 |  0.1464167833328247 |  3   |
| 616719923X | B0052589L0 |  0.1366724967956543 |  4   |
| 616719923X | B005VBDBT0 |  0.1310752034187317 |  5   |
| 616719923X | B002KRVSNY | 0.12708216905593872 |  6   |
| 616719923X | B00A9OE03A | 0.12343311309814453 |  7   |
| 616719923X | B000E63LP6 | 0.12209659814834595 |  8   |
| 616719923X | B005Q8BIAC | 0.11579620838165283 |  9   |
| 616719923X | B002VT3GXG | 0.11560887098312378 |  10  |
+------------+------------+---------------------+------+
[10 rows x 4 columns]



In [220]:
meta.loc[meta['asin']== '616719923X']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
0,616719923X,{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,[['Grocery & Gourmet Food']],Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",


In [221]:
meta.loc[meta['asin']== 'B004MFNGEQ']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
6403,B004MFNGEQ,{'Grocery & Gourmet Food': 19281},http://ecx.images-amazon.com/images/I/41Fwx2gT...,[['Grocery & Gourmet Food']],Republic of Tea: RED VELVET CHOCOLATE (36 unbl...,Republic of Tea: RED VELVET CHOCOLATE (36 unbl...,6.45,"{'also_bought': ['B00JPK1A7I', 'B003SO58Y8', '...",The Republic of Tea


In [92]:
meta.loc[meta['asin']== 'B0007LXU86']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
306,B0007LXU86,{'Grocery & Gourmet Food': 130147},http://ecx.images-amazon.com/images/I/41I2T39T...,"[[['grocery , gourmet food']]]","Kashi GOLEAN Bar, Chocolate Almond Toffee, 2.7...",Kashi Company was founded in 1984 on the belie...,,,


In [243]:
meta.loc[meta['asin']== 'B00374ZKQ0']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
5169,B00374ZKQ0,{'Grocery & Gourmet Food': 16522},http://ecx.images-amazon.com/images/I/51wf2mHY...,[['Grocery & Gourmet Food']],"Stevia Sweetener In The Raw, 50-Count Packages...","Stevia Sweetener In The Raw, comes in 50 count...",42.05,"{'also_bought': ['B006XEGXCG', 'B003ZFG7E0', '...",Stevia


In [244]:
meta.loc[meta['asin']== 'B0052589L0']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
6880,B0052589L0,{'Grocery & Gourmet Food': 10027},http://ecx.images-amazon.com/images/I/51nUqART...,[['Grocery & Gourmet Food']],"PUR gum Pomegranate Mint Gum-Aspartame Free, 9...","PUR Gum is Vegan, Gluten-free, Non-GMO, Nut-fr...",14.68,"{'also_bought': ['B005258A0K', 'B005258A2S', '...",PUR Gum


In [245]:
meta.loc[meta['asin']== 'B005VBDBT0']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
7267,B005VBDBT0,{'Grocery & Gourmet Food': 9869},http://ecx.images-amazon.com/images/I/517XfSOk...,[['Grocery & Gourmet Food']],"Habitant French-Canadian Pea Soup, 14 Ounce Ca...",Habitant&#xA0;soups have been made using tradi...,29.37,"{'also_bought': ['B001682QB6', 'B001684OPM', '...",Habitant


In [246]:
meta.loc[meta['asin']== 'B0013TJB7K']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
2676,B0013TJB7K,{'Grocery & Gourmet Food': 258711},http://ecx.images-amazon.com/images/I/512tlZ3S...,[['Grocery & Gourmet Food']],"Mr. Z Premium Cuts Beef Jerky Peppered Flavor,...",Mr. Z's Premium Beef Jerky is the ultimate por...,,,


These products all belong to a dessert category, so they look similar to me. Furthermore, the item with the highest similarity score (about 0.27) contains the word "tea" in its title, as does the given item. This suggests that the model has conceptually grouped similar items together.
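For reference, a cosine score like those in the similarity table can be computed from two items' rating vectors. The sketch below uses a tiny made-up user-by-item matrix and treats unrated entries as 0; TuriCreate's internal implementation may differ in its details.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical ratings: rows are users, columns are items; 0 means "not rated".
R_toy = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

sim_01 = cosine_similarity(R_toy[:, 0], R_toy[:, 1])  # items rated by the same users
sim_02 = cosine_similarity(R_toy[:, 0], R_toy[:, 2])  # items with no raters in common
print(sim_01, sim_02)
```

Items rated by the same users in a similar way score close to 1, while items with no raters in common score 0, which is why sparse data tends to produce the low scores seen above.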

In [222]:
item_sim_model.evaluate(validation_data)


Precision and recall summary statistics by cutoff
+--------+----------------------+----------------------+
| cutoff |    mean_precision    |     mean_recall      |
+--------+----------------------+----------------------+
|   1    | 0.07079646017699107  | 0.03025406255494752  |
|   2    |  0.0518331226295828  | 0.042386083504920444 |
|   3    | 0.04846186262115462  | 0.05861563636532033  |
|   4    | 0.043299620733249014 | 0.07190444405994338  |
|   5    | 0.03893805309734513  |  0.0790247258009584  |
|   6    | 0.03581963758954908  | 0.08472669344982875  |
|   7    | 0.033411594726386176 | 0.09277557671995085  |
|   8    | 0.030973451327433624 | 0.09681657364969633  |
|   9    | 0.028655710071639282 | 0.09902349020553824  |
|   10   | 0.026927939317319845 | 0.10316402527830004  |
+--------+----------------------+----------------------+
[10 rows x 3 columns]


Overall RMSE: 4.301494995335515

Per User RMSE (best)
+----------------+--------------------+-------+
|   reviewerID   |       

{'precision_recall_by_user': Columns:
 	reviewerID	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 14238
 
 Data:
 +----------------+--------+-----------+--------+-------+
 |   reviewerID   | cutoff | precision | recall | count |
 +----------------+--------+-----------+--------+-------+
 | A31QY5TASILE89 |   1    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   2    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   3    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   4    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   5    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   6    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   7    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   8    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   9    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   10   |    0.0    |  0.0   |   1   |
 +----------------+--------+-----------+--------+-------+
 [14238 rows x 5 columns]
 Note: Only the head of t

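RMSE = 4.301 (very poor). The item-similarity scores shown in this notebook are on the order of 0.1 rather than on the 1-5 rating scale, which likely explains much of this error.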
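As a reminder of what this number measures, RMSE is the square root of the mean squared difference between predicted and actual ratings. The sketch below uses made-up values (not taken from the validation set) to show how predictions that are not on the rating scale inflate it.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

actual = [5.0, 4.0, 3.0, 5.0]           # hypothetical true ratings
on_scale = [4.5, 4.0, 3.5, 4.0]         # hypothetical predictions on the 1-5 scale
off_scale = [0.12, 0.11, 0.10, 0.09]    # hypothetical similarity-style scores

print(rmse(actual, on_scale), rmse(actual, off_scale))
```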

In [223]:
#These are the model's recommended products. I don't like these recommendations because every user gets the same products
item_sim_recomm = item_sim_model.recommend(users = [1,2,3,4,5], k = 5)
item_sim_recomm.print_rows(num_rows = 25)

+------------+------------+---------------------+------+
| reviewerID |    asin    |        score        | rank |
+------------+------------+---------------------+------+
|     1      | B002IEVJRY | 0.12151253700256348 |  1   |
|     1      | B00934WBRO | 0.11005025506019592 |  2   |
|     1      | B006MONQMC | 0.10037628889083862 |  3   |
|     1      | B0041NYV8E | 0.09997156143188476 |  4   |
|     1      | B005HG9ERW | 0.09253469705581666 |  5   |
|     2      | B002IEVJRY | 0.12151253700256348 |  1   |
|     2      | B00934WBRO | 0.11005025506019592 |  2   |
|     2      | B006MONQMC | 0.10037628889083862 |  3   |
|     2      | B0041NYV8E | 0.09997156143188476 |  4   |
|     2      | B005HG9ERW | 0.09253469705581666 |  5   |
|     3      | B002IEVJRY | 0.12151253700256348 |  1   |
|     3      | B00934WBRO | 0.11005025506019592 |  2   |
|     3      | B006MONQMC | 0.10037628889083862 |  3   |
|     3      | B0041NYV8E | 0.09997156143188476 |  4   |
|     3      | B005HG9ERW | 0.0

In [224]:
df.head()

Unnamed: 0,reviewerID,asin,overall
0,A1VEELTKS8NLZB,616719923X,4.0
1,A14R9XMZVJ6INB,616719923X,3.0
2,A27IQHDZFQFNGG,616719923X,4.0
3,A31QY5TASILE89,616719923X,5.0
4,A2LWK003FFMCI5,616719923X,4.0


In [225]:
#Trying a matrix factorization approach, which takes into account users AND items.
#It learns latent features that minimize the RMSE, using Stochastic Gradient Descent while optimizing the learning rate

model = turicreate.recommender.ranking_factorization_recommender.create(training_data, user_id = 'reviewerID', item_id = 'asin', target = 'overall')

In [226]:
results = model.recommend(k=3)

In [227]:
results.sort(['reviewerID', 'rank'], ascending=True).print_rows(20)

+-----------------------+------------+--------------------+------+
|       reviewerID      |    asin    |       score        | rank |
+-----------------------+------------+--------------------+------+
| A00177463W0XWB16A9O05 | B0013TJB7K | 5.040018835631425  |  1   |
| A00177463W0XWB16A9O05 | B001AHFVHO |  4.94908312853914  |  2   |
| A00177463W0XWB16A9O05 | B001EO5Q64 | 4.862770238486345  |  3   |
| A022899328A0QROR32DCT | B000GAT6NG | 5.491548129883821  |  1   |
| A022899328A0QROR32DCT | B000ENUC3S |  5.32439167198282  |  2   |
| A022899328A0QROR32DCT | B001CGTN1I | 5.2390094488535475 |  3   |
| A04309042SDSL8YX2HRR7 | B000E1D7RS | 4.395323523846681  |  1   |
| A04309042SDSL8YX2HRR7 | B00014JNI0 | 4.146215596762712  |  2   |
| A04309042SDSL8YX2HRR7 | B001KTA03C |  4.14152509864908  |  3   |
| A068255029AHTHDXZURNU | B001D0GV4K | 5.2743754774008345 |  1   |
| A068255029AHTHDXZURNU | B00B18PAWI | 5.027715721694047  |  2   |
| A068255029AHTHDXZURNU | B0029XDZIK | 5.016996124354417  |  3

In [228]:
model.evaluate(validation_data)


Precision and recall summary statistics by cutoff
+--------+----------------------+-----------------------+
| cutoff |    mean_precision    |      mean_recall      |
+--------+----------------------+-----------------------+
|   1    | 0.011378002528445006 | 0.0038980193847450483 |
|   2    | 0.012642225031605555 |  0.00854437487483165  |
|   3    | 0.011378002528445012 |  0.01194272536547037  |
|   4    | 0.009481668773704176 |  0.013206947868630916 |
|   5    | 0.008091024020227558 |  0.014787225997581619 |
|   6    | 0.007796038769490098 |  0.01721031912863936  |
|   7    | 0.007404731804226114 |  0.019738764134960483 |
|   8    | 0.007269279393173198 |  0.021032585498314084 |
|   9    | 0.007023458350891975 |  0.022800490300352885 |
|   10   | 0.006700379266750943 |  0.024802175930357086 |
+--------+----------------------+-----------------------+
[10 rows x 3 columns]


Overall RMSE: 1.2913739200449972

Per User RMSE (best)
+---------------+-----------------------+-------+
|   revi

{'precision_recall_by_user': Columns:
 	reviewerID	str
 	cutoff	int
 	precision	float
 	recall	float
 	count	int
 
 Rows: 14238
 
 Data:
 +----------------+--------+-----------+--------+-------+
 |   reviewerID   | cutoff | precision | recall | count |
 +----------------+--------+-----------+--------+-------+
 | A31QY5TASILE89 |   1    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   2    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   3    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   4    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   5    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   6    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   7    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   8    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   9    |    0.0    |  0.0   |   1   |
 | A31QY5TASILE89 |   10   |    0.0    |  0.0   |   1   |
 +----------------+--------+-----------+--------+-------+
 [14238 rows x 5 columns]
 Note: Only the head of t

This produced a much better RMSE (about 1.29, versus 4.30 for the item-similarity model).
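RMSE is only one of the metrics that `evaluate` reports; the output also lists mean precision and recall by cutoff. The sketch below shows how those two quantities are commonly defined for a single user. The item IDs are made up and the function is illustrative, not TuriCreate's implementation.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the top-k recommendations for one user.

    recommended: item IDs ordered from best to worst.
    relevant: set of item IDs the user actually interacted with in the held-out data.
    """
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: one of the five recommendations is in the held-out set.
recommended = ["B001", "B002", "B003", "B004", "B005"]
relevant = {"B002", "B009"}
p, r = precision_recall_at_k(recommended, relevant, k=5)
print(p, r)
```

Averaging these per-user values over all users at each cutoff gives tables like the ones printed by `evaluate`.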

In [261]:
meta.loc[meta['asin']== 'B000GAT6NG']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
1458,B000GAT6NG,{'Grocery & Gourmet Food': 45},http://ecx.images-amazon.com/images/I/41PytU9K...,[['Grocery & Gourmet Food']],"Nutiva Organic Virgin Coconut Oil, 54-Ounce Jar",Nutiva began in 1999 as an idea in the mind of...,24.4,"{'also_bought': ['B009324C0U', 'B008RJMXPQ', '...",Nutiva


In [262]:
meta.loc[meta['asin']== 'B000ENUC3S']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
911,B000ENUC3S,{'Grocery & Gourmet Food': 813},http://ecx.images-amazon.com/images/I/41r0TfGr...,[['Grocery & Gourmet Food']],"LARABAR Fruit &amp; Nut Food Bar, Apple Pie, G...",,21.99,"{'also_bought': ['B00426ATTK', 'B00BCNTCHG', '...",L&Auml;RABAR


In [263]:
meta.loc[meta['asin']== 'B001CGTN1I']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
3010,B001CGTN1I,{'Grocery & Gourmet Food': 54},http://ecx.images-amazon.com/images/I/41b%2B5Y...,[['Grocery & Gourmet Food']],"Navitas Naturals Organic Raw Chia Seeds, 1 Po...",,13.49,"{'also_bought': ['B000FFLHSY', 'B000FFLHU2', '...",Navitas Naturals


In [260]:
df.loc[df['reviewerID']=='A022899328A0QROR32DCT']

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
42689,A022899328A0QROR32DCT,B001ACMCNU,Rox,"[0, 1]","OK, am I doing something wrong here? I am trul...",1.0,What am I missing here,1357430400,"01 6, 2013"
86624,A022899328A0QROR32DCT,B003VKR0HM,Rox,"[4, 8]",I can't speak for the benefits of using this f...,1.0,My son would not drink this,1357776000,"01 10, 2013"
88130,A022899328A0QROR32DCT,B003YBH398,Rox,"[0, 0]","I was super hopeful to like these as a celiac,...",2.0,Was not impressed,1386633600,"12 10, 2013"
145576,A022899328A0QROR32DCT,B00DBWU2JS,Rox,"[0, 1]",This product tasted burnt and there wasn't eve...,2.0,Burnt flavor and bad taste,1395878400,"03 27, 2014"
143382,A022899328A0QROR32DCT,B00CMQDKES,Rox,"[0, 0]",These pancakes are gritty in texture. I have t...,3.0,There is better out there,1386633600,"12 10, 2013"
86023,A022899328A0QROR32DCT,B003TO9RSU,Rox,"[0, 0]",**UPDATE**- once I opened the box of these cra...,4.0,Great GF grahams,1363392000,"03 16, 2013"
14049,A022899328A0QROR32DCT,B000EVE3Y4,Rox,"[0, 0]",awesome texture for even the gluten eating eat...,5.0,awesome texture for even the gluten eating eat...,1404864000,"07 9, 2014"
86560,A022899328A0QROR32DCT,B003VIJI1A,Rox,"[1, 2]","Being a celiac & having to eat GF, I was sad w...",5.0,So grateful for these condensed cream soups!,1356739200,"12 29, 2012"
86574,A022899328A0QROR32DCT,B003VIJI38,Rox,"[0, 0]",As I have wrote a review for the cream of mush...,5.0,So grateful for these soups for cooking!,1362441600,"03 5, 2013"
86449,A022899328A0QROR32DCT,B003V8QGAG,Rox,"[0, 0]",The kettle takes the cake as my favorite flavo...,5.0,Love the kettle flavor,1363392000,"03 16, 2013"


In [264]:
meta.loc[meta['asin']== 'B003V8QGAG']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
5599,B003V8QGAG,{'Grocery & Gourmet Food': 1140},http://ecx.images-amazon.com/images/I/514yCT8h...,[['Grocery & Gourmet Food']],"Medora Snacks Popcorners Popped Corn Chips, Wh...",PopCorners are the delicious new snack with th...,24.99,"{'also_bought': ['B00IYYW2HS', 'B00FYR5HS4', '...",Popcorners


In [265]:
meta.loc[meta['asin']== 'B003VIJI38']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
5611,B003VIJI38,{'Grocery & Gourmet Food': 5117},http://ecx.images-amazon.com/images/I/51BpNpX5...,[['Grocery & Gourmet Food']],Pacific Natural Foods Organic Cream Of Chicken...,Chef-inspired heary soups that deliver fresh h...,33.99,"{'also_bought': ['B003VIJI1A', 'B002FYJTYW', '...",Pacific Natural Foods


In [267]:
meta.loc[meta['asin']== 'B000EVE3Y4']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
964,B000EVE3Y4,,http://ecx.images-amazon.com/images/I/513jBafB...,[['Grocery & Gourmet Food']],Glutino Gluten Free Pantry Yankee Cornbread Mi...,The Gluten-Free Pantry was founded by professi...,36.77,"{'also_bought': ['B00J8BI9FK', 'B000EVE3YE', '...",Glutino Gluten Free Pantry


In [269]:
meta.loc[meta['asin']=='B007OSBFY6']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
7655,B007OSBFY6,{'Grocery & Gourmet Food': 6302},http://ecx.images-amazon.com/images/I/51G8vIM0...,[['Grocery & Gourmet Food']],"Brown Gold 100% Colombian Coffee Capsules, 48-...",,26.9,"{'also_bought': ['B008I1XPKA', 'B007OSBEV0', '...",Brown Gold


In [268]:
meta.loc[meta['asin']=='B003TO9RSU']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
5575,B003TO9RSU,{'Grocery & Gourmet Food': 4859},http://ecx.images-amazon.com/images/I/51zFLMqE...,[['Grocery & Gourmet Food']],"Kinnikinnick, Smoreable Graham Style Crackers,...",A gluten free cracker for anytime snacks and s...,31.79,"{'also_bought': ['B000VK4F5A', 'B000LKZ5XQ', '...",Akmak


# Matrix Factorization, Stochastic Gradient Descent & Predicting User Ratings 

This process uses Matrix Factorization and SGD to predict a user's rating on a product. The user would then be recommended the products with their highest predicted ratings.

The process begins by creating a pivot table with products as the columns, users as the rows, and the given ratings as the values. As each user will likely not have rated most products, these missing values are represented by NaN.
Because we cannot use NaN in calculations, and because users rated products on a 1-5 scale (so 0 cannot be confused with a real rating), we can transform any NaN value into a 0.

Now that the data has been transformed properly, Matrix Factorization and Stochastic Gradient Descent will be used to create predicted ratings for each product, for each user.

At a high level, Matrix Factorization produces two matrices whose product approximates the original matrix. The two matrices represent generated item and user features, which are inferred from rating patterns. High correspondence between item and user factors leads to a recommendation.

Matrix factorization models map both users and items to a joint latent factor space of dimensionality f, such that user-item interactions are modeled as inner products in that space. The result is that each item has a vector Q and each user has a vector P. For each product, the elements of Q show the extent to which the item possesses those factors. For each user, the elements of P show the extent to which the user is interested in items that are high on the corresponding factors. The resulting dot product Q · P captures the interaction between user and product and approximates the user's rating on that product.
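The inner-product idea can be sketched in a few lines of NumPy. The dimensions and the random factor values here are arbitrary assumptions chosen only to show the shapes involved; in practice P and Q are learned from the ratings.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, f = 4, 6, 3   # f = number of latent factors (arbitrary here)

P = rng.normal(scale=1.0 / f, size=(num_users, f))  # one row of factors per user
Q = rng.normal(scale=1.0 / f, size=(num_items, f))  # one row of factors per item

u, i = 2, 5
r_hat_ui = P[u] @ Q[i]   # predicted interaction for user u and item i
R_hat = P @ Q.T          # all user-item predictions at once, shape (num_users, num_items)
print(R_hat.shape, r_hat_ui)
```

Each entry of `R_hat` is the dot product of one user row and one item row, so the full prediction matrix costs only `(num_users + num_items) * f` parameters to store.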

Within the Matrix Factorization process, Stochastic Gradient Descent is used to fit the model to the observed ratings only, which avoids imputing the missing ones. The model generalizes these observed ratings in a way that predicts future ratings. The parameters are chosen to minimize the squared error between observed and predicted ratings and, to avoid overfitting, a constant is applied to control the extent of regularization. For each observed rating, SGD computes the prediction error and moves the parameters by a step proportional to the learning rate in the direction opposite to the gradient.
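Written out, the objective and update rules look roughly as follows. This is a sketch that mirrors the `MF` class used in this notebook, where `alpha` is the learning rate and `beta` the regularization constant; the symbol $\mathcal{K}$ for the set of observed (user, item) pairs is notation introduced here, and constant factors are absorbed into $\alpha$.

$$
\hat{r}_{ui} = b + b_u + b_i + p_u^{\top} q_i
$$

$$
\min_{P,\,Q,\,b_u,\,b_i} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \beta \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 + b_u^2 + b_i^2 \right)
$$

For each observed rating, with error $e_{ui} = r_{ui} - \hat{r}_{ui}$:

$$
\begin{aligned}
b_u &\leftarrow b_u + \alpha \left( e_{ui} - \beta\, b_u \right) \\
b_i &\leftarrow b_i + \alpha \left( e_{ui} - \beta\, b_i \right) \\
p_u &\leftarrow p_u + \alpha \left( e_{ui}\, q_i - \beta\, p_u \right) \\
q_i &\leftarrow q_i + \alpha \left( e_{ui}\, p_u - \beta\, q_i \right)
\end{aligned}
$$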

A challenge for our model is that we were given 5-core data, which guarantees that each user gave at least 5 ratings and each product received at least 5 ratings. So in some instances, we will be making inferences based on only 5 inputs by the user(s).

Although not included in this model, a strength of matrix factorization is that it allows for the incorporation of additional information, including implicit feedback. Implicit feedback can be defined as user preferences inferred from observed behavior such as browsing history, search patterns, mouse movements, and purchase history. It usually represents the presence or absence of an event.

In [101]:
#using the ratings data, create a new table that has all products as the columns (8713), all users as the rows, and each rating a user gave appearing under the respective product.
#if that user did not provide a rating for a product, it will appear as NaN

rating = pd.pivot_table(df, values = 'overall', index = ['reviewerID'], columns = ['asin'])

In [102]:
rating.sort_index(axis = 1, inplace = True)

In [103]:
rating.head()

asin,616719923X,9742356831,B00004S1C5,B0000531B7,B00005344V,B0000537AF,B00005C2M2,B00006IUTN,B0000CCZYY,B0000CD06J,...,B00IVT3LLW,B00IWBMCMS,B00J9IUCHA,B00JAXNMRG,B00JEL3N1E,B00JGPG60I,B00JL6LTMW,B00K00H9I6,B00KC0LGI8,B00KCJRVO2
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A00177463W0XWB16A9O05,,,,,,,,,,,...,,,,,,,,,,
A022899328A0QROR32DCT,,,,,,,,,,,...,,,,,,,,,,
A04309042SDSL8YX2HRR7,,,,,,,,,,,...,,,,,,,,,,
A068255029AHTHDXZURNU,,,,,,,,,,,...,,,,,,,,,,
A06944662TFWOKKV4GJKX,,,,,,,,,,,...,,,,,,,,,,


In [273]:
meta.loc[meta['asin']=='B00004S1C5']

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
2,B00004S1C5,{'Kitchen & Dining': 4494},http://ecx.images-amazon.com/images/I/41F75K9F...,"[['Grocery & Gourmet Food', 'Cooking & Baking'...","Ateco Food Coloring Kit, 6 colors","From Easter eggs to colorful cookies, Spectrum...",9.76,"{'also_bought': ['B0000CFMLT', 'B002PO3KBK', '...",HIC Harold Import Co.


In [104]:
#because the lowest rating is 1 (products were rated on a scale of 1-5), I can fill all NaN values with 0 to represent no rating
ratings = rating.fillna(0)

In [105]:
#this gives a DF that can have calculations performed on it
ratings.head()

asin,616719923X,9742356831,B00004S1C5,B0000531B7,B00005344V,B0000537AF,B00005C2M2,B00006IUTN,B0000CCZYY,B0000CD06J,...,B00IVT3LLW,B00IWBMCMS,B00J9IUCHA,B00JAXNMRG,B00JEL3N1E,B00JGPG60I,B00JL6LTMW,B00K00H9I6,B00KC0LGI8,B00KCJRVO2
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A00177463W0XWB16A9O05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A022899328A0QROR32DCT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A04309042SDSL8YX2HRR7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A068255029AHTHDXZURNU,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A06944662TFWOKKV4GJKX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [402]:
#creating the Matrix Factorization process

class MF():

    # Initializing the user-product rating matrix, no. of latent features, alpha and beta.
    def __init__(self, R, K, alpha, beta, iterations):
        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations

    # Initializing user-feature and product-feature matrix 
    def train(self):
        self.P = np.random.normal(scale=1./self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1./self.K, size=(self.num_items, self.K))

        # Initializing the bias terms
        self.b_u = np.zeros(self.num_users)
        self.b_i = np.zeros(self.num_items)
        self.b = np.mean(self.R[np.where(self.R != 0)])

        # List of training samples
        self.samples = [
        (i, j, self.R[i, j])
        for i in range(self.num_users)
        for j in range(self.num_items)
        if self.R[i, j] > 0
        ]

        # Stochastic gradient descent for given number of iterations
        training_process = []
        for i in range(self.iterations):
            np.random.shuffle(self.samples)
            self.sgd()
            mse = self.mse()
            training_process.append((i, mse))
            if (i+1) % 20 == 0:
                print("Iteration: %d ; error = %.4f" % (i+1, mse))

        return training_process

    # Computing the square root of the total squared error over observed ratings (despite the name, no mean is taken)
    def mse(self):
        xs, ys = self.R.nonzero()
        predicted = self.full_matrix()
        error = 0
        for x, y in zip(xs, ys):
            error += pow(self.R[x, y] - predicted[x, y], 2)
        return np.sqrt(error)

    # Stochastic gradient descent to get optimized P and Q matrix
    def sgd(self):
        for i, j, r in self.samples:
            prediction = self.get_rating(i, j)
            e = (r - prediction)

            self.b_u[i] += self.alpha * (e - self.beta * self.b_u[i])
            self.b_i[j] += self.alpha * (e - self.beta * self.b_i[j])

            self.P[i, :] += self.alpha * (e * self.Q[j, :] - self.beta * self.P[i,:])
            self.Q[j, :] += self.alpha * (e * self.P[i, :] - self.beta * self.Q[j,:])

    # Ratings for user i and product j
    def get_rating(self, i, j):
        prediction = self.b + self.b_u[i] + self.b_i[j] + self.P[i, :].dot(self.Q[j, :].T)
        return prediction

    # Full user-product rating matrix
    def full_matrix(self):
        return self.b + self.b_u[:, np.newaxis] + self.b_i[np.newaxis, :] + self.P.dot(self.Q.T)
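
For reference, the prediction and the per-sample updates performed in `get_rating` and `sgd` can be written out as follows. The symbols are notation introduced here, not taken from the code: $b^{U}_i$ and $b^{I}_j$ stand for `b_u[i]` and `b_i[j]`, and $p_i$, $q_j$ for the rows `P[i, :]` and `Q[j, :]`.

$$
\begin{aligned}
\hat r_{ij} &= b + b^{U}_i + b^{I}_j + p_i^\top q_j, \qquad e_{ij} = r_{ij} - \hat r_{ij} \\
b^{U}_i &\leftarrow b^{U}_i + \alpha\,(e_{ij} - \beta\, b^{U}_i) \\
b^{I}_j &\leftarrow b^{I}_j + \alpha\,(e_{ij} - \beta\, b^{I}_j) \\
p_i &\leftarrow p_i + \alpha\,(e_{ij}\, q_j - \beta\, p_i) \\
q_j &\leftarrow q_j + \alpha\,(e_{ij}\, p_i - \beta\, q_j)
\end{aligned}
$$

These are gradient steps on a squared-error objective with L2 regularization of strength $\beta$, with constant factors absorbed into the learning rate $\alpha$. As the code is written, the update of $q_j$ uses the value of $p_i$ that was just updated on the previous line.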

In [403]:
R = np.array(ratings)

In [407]:
mf = MF(R, K = 20, alpha=0.001, beta = 0.01, iterations = 100)
training_process = mf.train()
print()
print("P x Q:")
print(mf.full_matrix())
print()

Iteration: 20 ; error = 379.5982
Iteration: 40 ; error = 364.6512
Iteration: 60 ; error = 355.0153
Iteration: 80 ; error = 347.6986
Iteration: 100 ; error = 341.2795

P x Q:
[[4.19416718 4.68168383 4.27092143 ... 4.67948643 3.99989421 4.44777709]
 [3.48442614 4.02689956 3.6109085  ... 3.98100734 3.34345949 3.79389303]
 [3.86041338 4.4076995  4.02039345 ... 4.38876794 3.71067567 4.19191041]
 ...
 [4.36964002 4.8626674  4.4913279  ... 4.86629689 4.22978787 4.6842897 ]
 [4.50183423 5.00473647 4.65328508 ... 5.00375832 4.3171197  4.80932307]
 [3.83587924 4.34607603 3.95011198 ... 4.36916864 3.69791207 4.14307216]]



# LightFM 

LightFM is a library for building matrix factorization recommenders.  It factorizes the original interaction matrix into a user embedding matrix and an item embedding matrix; in this case the model was trained with a WARP loss function.  The result was strong item recommendations for a user, and user recommendations for an item.

A note on the WARP loss function: WARP stands for Weighted Approximate-Rank Pairwise loss.  Its advantage over pointwise loss functions is that it optimizes the ranking of items relative to one another, rather than the predicted score of each item in isolation.  For a given user and a positive item, WARP randomly samples negative items until it finds one that the model currently scores above (or too close to) the positive item, and then applies an update only to that violating pair.  The number of samples needed gives a rough estimate of the positive item's rank, and the update is weighted accordingly.  The result is a model that learns to rank the items a user has interacted with above the others.
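
The sampling step described above can be sketched in plain Python. This is only an illustration under stated assumptions: `warp_sample`, the margin of 1.0 and the log-based weight are hypothetical choices made for this example, not LightFM's internals.

```python
import math
import random

def warp_sample(scores, positive, negatives, max_trials=None, margin=1.0, rng=random):
    """Sample negatives until one violates the ranking against `positive`.

    Returns (violating_negative, weight), or (None, 0.0) if no violation is found.
    """
    if max_trials is None:
        max_trials = len(negatives)
    for trial in range(1, max_trials + 1):
        candidate = rng.choice(negatives)
        # Violation: the negative scores above the positive minus the margin.
        if scores[candidate] > scores[positive] - margin:
            # Finding a violator quickly suggests the positive item is ranked low,
            # so the estimated rank (and the update weight) is larger.
            est_rank = len(negatives) // trial
            weight = math.log(1 + est_rank)  # illustrative weighting (assumption)
            return candidate, weight
    return None, 0.0

# Toy scores for one user: item "a" is the positive, the rest are negatives.
toy_scores = {"a": 2.0, "b": 0.0, "c": 5.0, "d": 0.5}
print(warp_sample(toy_scores, "a", ["b", "c", "d"], rng=random.Random(0)))
```

Only the returned pair would receive a gradient update; pairs that are already ranked correctly are left alone.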

In [110]:
from scipy import sparse

In [113]:
from lightfm import LightFM



In [148]:
merged = pd.read_csv('/Users/marcushimelhoch/Downloads/reviews_merged.csv', sep = ',')

In [149]:
merged.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,salesRank,imUrl,categories,title,description,price,related,brand
0,A1VEELTKS8NLZB,616719923X,Amazon Customer,"[0, 0]",Just another flavor of Kit Kat but the taste i...,4.0,Good Taste,1370044800,"06 1, 2013",{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,[['Grocery & Gourmet Food']],Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",
1,A14R9XMZVJ6INB,616719923X,amf0001,"[0, 1]",I bought this on impulse and it comes from Jap...,3.0,"3.5 stars, sadly not as wonderful as I had hoped",1400457600,"05 19, 2014",{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,[['Grocery & Gourmet Food']],Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",
2,A27IQHDZFQFNGG,616719923X,Caitlin,"[3, 4]",Really good. Great gift for any fan of green t...,4.0,Yum!,1381190400,"10 8, 2013",{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,[['Grocery & Gourmet Food']],Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",
3,A31QY5TASILE89,616719923X,DebraDownSth,"[0, 0]","I had never had it before, was curious to see ...",5.0,Unexpected flavor meld,1369008000,"05 20, 2013",{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,[['Grocery & Gourmet Food']],Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",
4,A2LWK003FFMCI5,616719923X,Diana X.,"[1, 2]",I've been looking forward to trying these afte...,4.0,"Not a very strong tea flavor, but still yummy ...",1369526400,"05 26, 2013",{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,[['Grocery & Gourmet Food']],Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",


In [115]:

def create_user_dict(ratings):
    
    user_id = list(ratings.index)
    user_dict = {}
    counter = 0 
    for i in user_id:
        user_dict[i] = counter
        counter += 1
    return user_dict
    
def create_item_dict(df,id_col,name_col):
 
    item_dict ={}
    for i in range(df.shape[0]):
        item_dict[(df.loc[i,id_col])] = df.loc[i,name_col]
    return item_dict

def runMF(ratings, n_components=30, loss='warp', k=15, epoch=30,n_jobs = 4):
  
    x = sparse.csr_matrix(ratings.values)
    model = LightFM(no_components= n_components, loss=loss,k=k)
    model.fit(x,epochs=epoch,num_threads = n_jobs)
    return model

def sample_recommendation_user(model, ratings, user_id, user_dict, 
                               item_dict,threshold = 0,nrec_items = 10, show = True):
   
    n_users, n_items = ratings.shape
    user_x = user_dict[user_id]
    scores = pd.Series(model.predict(user_x,np.arange(n_items)))
    scores.index = ratings.columns
    scores = list(pd.Series(scores.sort_values(ascending=False).index))
    
    # Items this user has already rated above the threshold
    user_ratings = ratings.loc[user_id, :]
    known_items = list(pd.Series(user_ratings[user_ratings > threshold].index)
                       .sort_values(ascending=False))
    
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    known_items = list(pd.Series(known_items).apply(lambda x: item_dict[x]))
    scores = list(pd.Series(return_score_list).apply(lambda x: item_dict[x]))
    if show == True:
        print("Known Likes:")
        counter = 1
        for i in known_items:
            print(str(counter) + '- ' + i)
            counter+=1

        print("\n Recommended Items:")
        counter = 1
        for i in scores:
            print(str(counter) + '- ' + i)
            counter+=1
    return return_score_list
    

def sample_recommendation_item(model,ratings,item_id,user_dict,item_dict,number_of_user):
   
    n_users, n_items = ratings.shape
    # Column position of the item in the interaction matrix
    # (get_loc does not rely on the columns being sorted, unlike searchsorted)
    item_x = ratings.columns.get_loc(item_id)
    scores = pd.Series(model.predict(np.arange(n_users), np.repeat(item_x, n_users)))
    user_list = list(ratings.index[scores.sort_values(ascending=False).head(number_of_user).index])
    return user_list 




In [116]:
# use ratings interaction matrix already created in matrix factorization step
mf_model = runMF(ratings = ratings,
                 n_components = 30,
                 loss = 'warp',
                 epoch = 30,
                 n_jobs = 4)

In [117]:
user_dict = create_user_dict(ratings = ratings)

In [154]:
item_dict = create_item_dict(df = merged,
                            id_col = 'asin',
                            name_col = 'title')

Recommending Items to a User

In [155]:
rec_list = sample_recommendation_user(model = mf_model,
                                     ratings = ratings,
                                     user_id = 'A00177463W0XWB16A9O05',
                                     user_dict = user_dict,
                                     item_dict = item_dict,
                                     threshold = 4,
                                     nrec_items = 10,
                                     show = True)

Known Likes:
1- The Organic Coffee Co. Java Love, 12 OneCup Single Serve Cups
2- Brooklyn Beans Breakfast Blend Decaffeinated CoffeeSingle-cup coffee for Keurig K-Cup Brewers for Keurig Brewers, 40 Count
3- Cameron's Donut Shop Single Serve Coffees,  12-Count
4- San Francisco Bay Coffee Organic Rainforest Blend, 80 OneCup Single Serve Cups
5- Martinson Coffee Capsules, Dark Roast Package compatible with Keurig K-Cup Brewers, 48 Count
6- Green Mountain Coffee, Vermont Country Blend, K-Cup Portion Pack for Keurig Brewers 24-Count
7- Keurig, The Original Donut Shop, 50 Count K-Cup Packs

 Recommended Items:
1- Brown Gold 100% Colombian Coffee Capsules, 48-Count Package compatible with Keurig K-Cup Brewers
2- Green Mountain Coffee Dark Magic (Extra Bold), K-cups For Keurig Brewers, 24-count Box
3- San Francisco Bay Coffee Breakfast Blend, 80 OneCup Single Serve Cups
4- Wolfgang Puck Coffee, Sorrento Fair Trade (Medium Roast), 24-Count K-Cups for Keurig Brewers
5- Grove Square Cappuccino, F

In [156]:
print(rec_list)

['B007OSBFY6', 'B000E1D7RS', 'B007TGDXMU', 'B003TBRF1O', 'B005K4Q1YA', 'B001D0IZBM', 'B0051SU0OW', 'B002HQCWYM', 'B00EDHW7F2', 'B007Y59HVM']


Recommending Users to an Item

In [157]:
sample_recommendation_item(model = mf_model,
                          ratings = ratings,
                          item_id = 'B007OSBFY6',
                          user_dict = user_dict,
                          item_dict = item_dict,
                          number_of_user = 10)

['A21R75GGTSDMKS',
 'A224O69F7AVXDR',
 'AGWU5LPTJHRYX',
 'A2GKWC2UIDRZ42',
 'A2TUD0VXUTNO7N',
 'AWW61BQBYZ401',
 'A32P1J924BM882',
 'A3P8QGAYQZHLTE',
 'A26POXLU9XDM11',
 'A1CRY0H4LM3SWV']

# Import Meta Data for Content-Based Methods

In [118]:
import pandas as pd
meta = pd.read_csv('/Users/marcushimelhoch/Downloads/meta_filtered.csv')

In [119]:
meta.head()

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
0,616719923X,{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,[['Grocery & Gourmet Food']],Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",
1,9742356831,{'Grocery & Gourmet Food': 3434},http://ecx.images-amazon.com/images/I/41pQp67A...,[['Grocery & Gourmet Food']],Mae Ploy Thai Green Curry Paste - 14 oz jar,Used to make various curry soups and stir fry ...,7.23,"{'also_bought': ['B000EI2LLO', 'B000EICISA', '...",Mae Ploy
2,B00004S1C5,{'Kitchen & Dining': 4494},http://ecx.images-amazon.com/images/I/41F75K9F...,"[['Grocery & Gourmet Food', 'Cooking & Baking'...","Ateco Food Coloring Kit, 6 colors","From Easter eggs to colorful cookies, Spectrum...",9.76,"{'also_bought': ['B0000CFMLT', 'B002PO3KBK', '...",HIC Harold Import Co.
3,B0000531B7,{'Grocery & Gourmet Food': 2858},http://ecx.images-amazon.com/images/I/519SuVj1...,[['Grocery & Gourmet Food']],"PowerBar Harvest Energy Bars, Double Chocolate...",,24.75,"{'also_bought': ['B000EC63PU', 'B00DZGEY44', '...",Powerbar
4,B00005344V,{'Grocery & Gourmet Food': 5034},http://ecx.images-amazon.com/images/I/51H54cd-...,[['Grocery & Gourmet Food']],"Traditional Medicinals Breathe Easy, 16-Count ...","For nearly forty years, we&#x2019;ve been pass...",21.74,"{'also_bought': ['B0009F3POE', 'B0009F3POO', '...",Traditional Medicinals


In [120]:
meta['categories'] = meta['categories'].map(lambda x: x.lower().split('&'))

In [121]:
meta.head()

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
0,616719923X,{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,"[[['grocery , gourmet food']]]",Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",
1,9742356831,{'Grocery & Gourmet Food': 3434},http://ecx.images-amazon.com/images/I/41pQp67A...,"[[['grocery , gourmet food']]]",Mae Ploy Thai Green Curry Paste - 14 oz jar,Used to make various curry soups and stir fry ...,7.23,"{'also_bought': ['B000EI2LLO', 'B000EICISA', '...",Mae Ploy
2,B00004S1C5,{'Kitchen & Dining': 4494},http://ecx.images-amazon.com/images/I/41F75K9F...,"[[['grocery , gourmet food', 'cooking , baki...","Ateco Food Coloring Kit, 6 colors","From Easter eggs to colorful cookies, Spectrum...",9.76,"{'also_bought': ['B0000CFMLT', 'B002PO3KBK', '...",HIC Harold Import Co.
3,B0000531B7,{'Grocery & Gourmet Food': 2858},http://ecx.images-amazon.com/images/I/519SuVj1...,"[[['grocery , gourmet food']]]","PowerBar Harvest Energy Bars, Double Chocolate...",,24.75,"{'also_bought': ['B000EC63PU', 'B00DZGEY44', '...",Powerbar
4,B00005344V,{'Grocery & Gourmet Food': 5034},http://ecx.images-amazon.com/images/I/51H54cd-...,"[[['grocery , gourmet food']]]","Traditional Medicinals Breathe Easy, 16-Count ...","For nearly forty years, we&#x2019;ve been pass...",21.74,"{'also_bought': ['B0009F3POE', 'B0009F3POO', '...",Traditional Medicinals


In [122]:
meta.isnull().sum()

asin              0
salesRank       402
imUrl            13
categories        0
title            13
description    1204
price          1347
related         257
brand          2764
dtype: int64

In [123]:
meta = meta.dropna(subset = ['description'])

In [124]:
meta.head()

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
0,616719923X,{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,"[[['grocery , gourmet food']]]",Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",
1,9742356831,{'Grocery & Gourmet Food': 3434},http://ecx.images-amazon.com/images/I/41pQp67A...,"[[['grocery , gourmet food']]]",Mae Ploy Thai Green Curry Paste - 14 oz jar,Used to make various curry soups and stir fry ...,7.23,"{'also_bought': ['B000EI2LLO', 'B000EICISA', '...",Mae Ploy
2,B00004S1C5,{'Kitchen & Dining': 4494},http://ecx.images-amazon.com/images/I/41F75K9F...,"[[['grocery , gourmet food', 'cooking , baki...","Ateco Food Coloring Kit, 6 colors","From Easter eggs to colorful cookies, Spectrum...",9.76,"{'also_bought': ['B0000CFMLT', 'B002PO3KBK', '...",HIC Harold Import Co.
4,B00005344V,{'Grocery & Gourmet Food': 5034},http://ecx.images-amazon.com/images/I/51H54cd-...,"[[['grocery , gourmet food']]]","Traditional Medicinals Breathe Easy, 16-Count ...","For nearly forty years, we&#x2019;ve been pass...",21.74,"{'also_bought': ['B0009F3POE', 'B0009F3POO', '...",Traditional Medicinals
5,B0000537AF,{'Health & Personal Care': 132146},http://ecx.images-amazon.com/images/I/41pmuVri...,"[[['grocery , gourmet food']]]","PowerBar ProteinPlus High Protein Bar, Vanilla...",The PowerBar ProteinPlus Protein Bar is a grea...,,"{'also_bought': ['B001U89ITK', 'B009VV7G60', '...",


In [113]:
meta.isnull().sum()

asin              0
salesRank       330
imUrl             0
categories        0
title             0
description       0
price          1079
related         190
brand          2323
dtype: int64

In [130]:
type(meta.title)

pandas.core.series.Series

# TF-IDF, Truncated SVD and KMeans

The purpose of this method is to cluster similar item descriptions together and recommend items to users that are similar to what they've purchased previously.
TF-IDF assigns a value to each word in a description based on how often it appears there (term frequency).  It then accounts for unimportant words by down-weighting words that appear in most descriptions, such as 'the' or 'it' (inverse document frequency).  Finally, it creates a matrix in which each row is an item description and each column is a word from the full vocabulary.
This will be a very sparse matrix because most words do not appear in any given item description.  TruncatedSVD reduces the number of dimensions by keeping only the leading singular vectors, and it works directly on the sparse matrix because it does not center the data first.  We use it instead of PCA because PCA in the scikit-learn version used here does not accept a scipy.sparse.csr_matrix as input.  Dimension reduction is helpful because it keeps the directions that explain the most variance and discards the rest, resulting in a much smaller matrix that retains most of the information.
Now that each item is represented by a dense vector derived from the TF-IDF weights of its words, we can cluster similar vectors together using KMeans.  KMeans assigns each observation to the nearest cluster centroid by Euclidean distance.  It is an efficient and simple model, but it requires the number of clusters to be chosen in advance, which could be a weak point of this method.

The conclusion of this process is: if a user bought a Kit Kat bar, we would look at what cluster the Kit Kat bar belongs to.  If it belongs to cluster #1, we would recommend other items in cluster #1 to that user.
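
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on three invented descriptions. It assumes whitespace tokenization and the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, and it skips the row normalization that `TfidfVectorizer` applies by default, so the numbers will not match scikit-learn's output exactly. `tfidf_rows` is a hypothetical helper for this example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {word: weight} dict per document (no normalization)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    # Weight = raw count in the document times the word's idf.
    return [{w: count * idf[w] for w, count in Counter(tokens).items()}
            for tokens in tokenized]

toy_docs = ["the green tea kit kat", "the green curry paste", "the red curry paste"]
toy_rows = tfidf_rows(toy_docs)
for row in toy_rows:
    print({w: round(v, 3) for w, v in row.items()})
```

In the first description, 'the' appears in every document and gets the lowest weight, 'green' appears in two and sits in the middle, and 'tea' appears only once in the corpus and gets the highest weight.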

In [13]:
meta.head()

Unnamed: 0,asin,salesRank,imUrl,categories,title,description,price,related,brand
0,616719923X,{'Grocery & Gourmet Food': 37305},http://ecx.images-amazon.com/images/I/51LdEao6...,"[[['grocery , gourmet food']]]",Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...,Green Tea Flavor Kit Kat have quickly become t...,,"{'also_bought': ['B00FD63L5W', 'B0047YG5UY', '...",
1,9742356831,{'Grocery & Gourmet Food': 3434},http://ecx.images-amazon.com/images/I/41pQp67A...,"[[['grocery , gourmet food']]]",Mae Ploy Thai Green Curry Paste - 14 oz jar,Used to make various curry soups and stir fry ...,7.23,"{'also_bought': ['B000EI2LLO', 'B000EICISA', '...",Mae Ploy
2,B00004S1C5,{'Kitchen & Dining': 4494},http://ecx.images-amazon.com/images/I/41F75K9F...,"[[['grocery , gourmet food', 'cooking , baki...","Ateco Food Coloring Kit, 6 colors","From Easter eggs to colorful cookies, Spectrum...",9.76,"{'also_bought': ['B0000CFMLT', 'B002PO3KBK', '...",HIC Harold Import Co.
4,B00005344V,{'Grocery & Gourmet Food': 5034},http://ecx.images-amazon.com/images/I/51H54cd-...,"[[['grocery , gourmet food']]]","Traditional Medicinals Breathe Easy, 16-Count ...","For nearly forty years, we&#x2019;ve been pass...",21.74,"{'also_bought': ['B0009F3POE', 'B0009F3POO', '...",Traditional Medicinals
5,B0000537AF,{'Health & Personal Care': 132146},http://ecx.images-amazon.com/images/I/41pmuVri...,"[[['grocery , gourmet food']]]","PowerBar ProteinPlus High Protein Bar, Vanilla...",The PowerBar ProteinPlus Protein Bar is a grea...,,"{'also_bought': ['B001U89ITK', 'B009VV7G60', '...",


In [14]:
#take the text out of the description column
text = meta.description.values

In [15]:
#take the text out of the title column
titles = meta.title.values

In [16]:
#build a TF-IDF matrix from the item descriptions
from sklearn.feature_extraction.text import TfidfVectorizer

#create an instance
tfidf = TfidfVectorizer()

#apply fit/transform to my text
csr_mat = tfidf.fit_transform(text)

#print the result of .toarray method
print(csr_mat.toarray())

#get just the words
#(get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
words = tfidf.get_feature_names()

#print the words
print(words)


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [1]:
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.manifold import TSNE

In [252]:
#Elbow method, left commented out because it takes too long on this data
#from sklearn.cluster import KMeans
#wcss = []
#
#for i in range(1, 11):
#    kmeans = KMeans(n_clusters = i, init = 'k-means++',
#                    max_iter = 400, n_init = 10, random_state = 0)
#    kmeans.fit(csr_mat)
#    wcss.append(kmeans.inertia_)
#
#Plotting the results onto a line graph to observe 'The elbow'
#plt.plot(range(1, 11), wcss)
#plt.title('Elbow Method')
#plt.xlabel('Number of clusters')
#plt.ylabel('WCSS') #within cluster sum of squares
#plt.show()

In [229]:


#create a TruncatedSVD instance with 50 components
svd = TruncatedSVD(n_components = 50)


#create a KMeans instance with 15 clusters.
#The elbow method takes too long on this data, so 15 clusters is an assumed value
kmeans = KMeans(n_clusters = 15)

#create a pipeline for TruncatedSVD and Kmeans 
pipeline = make_pipeline(svd, kmeans)

In [230]:
#fit the pipeline to the matrix generated above
pipeline.fit(csr_mat)

Pipeline(memory=None,
         steps=[('truncatedsvd',
                 TruncatedSVD(algorithm='randomized', n_components=50, n_iter=5,
                              random_state=None, tol=0.0)),
                ('kmeans',
                 KMeans(algorithm='auto', copy_x=True, init='k-means++',
                        max_iter=300, n_clusters=15, n_init=10, n_jobs=None,
                        precompute_distances='auto', random_state=None,
                        tol=0.0001, verbose=0))],
         verbose=False)

In [185]:
#create the predictions/cluster labels (0, 1, 2, ..., 14)
labels = pipeline.predict(csr_mat)

In [186]:
labels 

array([9, 2, 2, ..., 2, 2, 8], dtype=int32)

In [212]:
#create a DF with cluster labels and title of item
df = pd.DataFrame({'label': labels, 'item': titles})

In [213]:
#dessert items have been clustered in cluster #0, coconut items in cluster #14, etc.
df.sort_values('label')

Unnamed: 0,label,item
5328,0,Ghirardelli Peppermint Bark Squares with Dark ...
6871,0,Sanders Dark Chocolate Sea Salt Caramels 36 Ou...
4754,0,"Monin Flavored Sauce, Dark Chocolate, 12-Ounce..."
6765,0,"Ghirardelli Milk Chocolate Easter Eggs, 3.5-Ou..."
4753,0,"Nestle Ovaltine Rich Chocolate, 12-Ounce Tubs ..."
4401,0,"Cafe Escapes Dark Chocolate Hot Cocoa K-Cups, ..."
5269,0,"Green &amp; Black Organic Hot Chocolate Mix, 5..."
5268,0,Cadbury Easter Royal Dark Chocolate Candy Coat...
4191,0,Arnott's Tim Tam Original
3606,0,"Ghirardelli Triple Chocolate Brownie Mix, Semi..."


In [232]:
df.head()

Unnamed: 0,label,item
0,9,Japanese Kit Kat Maccha Green Tea Flavor (5 Ba...
1,2,Mae Ploy Thai Green Curry Paste - 14 oz jar
2,2,"Ateco Food Coloring Kit, 6 colors"
3,1,"Traditional Medicinals Breathe Easy, 16-Count ..."
4,2,"PowerBar ProteinPlus High Protein Bar, Vanilla..."
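
The last step of this approach, recommending other items from the same cluster, is not implemented above. A minimal sketch follows, with plain lists standing in for `titles` and `labels`; the helper `recommend_from_cluster` and the toy data are invented for illustration.

```python
from collections import defaultdict

def recommend_from_cluster(title, titles, labels, n=5):
    """Return up to n other titles that share a cluster label with `title`."""
    clusters = defaultdict(list)
    for item_title, label in zip(titles, labels):
        clusters[label].append(item_title)
    label_of = dict(zip(titles, labels))
    if title not in label_of:
        return []  # unknown item: nothing to recommend
    return [t for t in clusters[label_of[title]] if t != title][:n]

toy_titles = ["Dark Chocolate Bar", "Hot Cocoa Mix", "Green Curry Paste", "Chocolate Brownie Mix"]
toy_labels = [0, 0, 2, 0]
print(recommend_from_cluster("Hot Cocoa Mix", toy_titles, toy_labels))
```

With the real data, the same function could be called with the `titles` and `labels` arrays from the cells above. Items come back in their original order, so some ranking within the cluster would still be needed to choose among them.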


# CountVectorizer with Description

This method is similar to TF-IDF in that it is based on the frequency of words appearing in the item descriptions.  It differs from the approach above in two ways: it uses raw word counts rather than TF-IDF weights, and it compares items directly with cosine similarity instead of clustering them.

The method starts by transforming the item descriptions into a count matrix in which each row is an item and each column holds the number of times a word appears in that item's description.  
Next, it uses cosine similarity to measure how close each pair of item descriptions is.
Finally, a function is created to return the top 10 recommended items based on their cosine similarity to a given item.
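
As a small illustration of these three steps, the sketch below builds word counts for three invented descriptions and ranks them by cosine similarity using only the standard library. The helpers `cosine` and `top_similar` are hypothetical, and whitespace tokenization stands in for `CountVectorizer`'s tokenizer.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_similar(query_idx, docs, n=2):
    """Positions of the n documents most similar to docs[query_idx]."""
    counts = [Counter(d.lower().split()) for d in docs]
    scored = [(cosine(counts[query_idx], c), i)
              for i, c in enumerate(counts) if i != query_idx]
    scored.sort(reverse=True)
    return [i for _, i in scored[:n]]

toy_docs = ["thai green curry paste", "thai red curry paste", "lima beans in cans"]
print(top_similar(0, toy_docs, n=1))
```

The two curry pastes share three of their four words, so their cosine similarity is 0.75, while the lima beans description shares none and scores 0.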

In [234]:
from sklearn.feature_extraction.text import CountVectorizer

#create an instance
count = CountVectorizer()

#fit and transform the "description column"
count_matrix = count.fit_transform(meta['description'])

In [235]:
count_matrix.shape

(7509, 17251)

In [236]:
#use cosine similarity to assess similarities between item descriptions
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(count_matrix, count_matrix)
cosine_sim

array([[1.        , 0.08084521, 0.17393131, ..., 0.11012975, 0.1713649 ,
        0.10102241],
       [0.08084521, 1.        , 0.21248509, ..., 0.28758185, 0.25831064,
        0.13884203],
       [0.17393131, 0.21248509, 1.        , ..., 0.33468064, 0.29057907,
        0.23509295],
       ...,
       [0.11012975, 0.28758185, 0.33468064, ..., 1.        , 0.41397466,
        0.29158274],
       [0.1713649 , 0.25831064, 0.29057907, ..., 0.41397466, 1.        ,
        0.32831724],
       [0.10102241, 0.13884203, 0.23509295, ..., 0.29158274, 0.32831724,
        1.        ]])

In [237]:
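#store item titles in a Series indexed by row position, so that positions line up with the rows of cosine_sim
#(meta kept its original index labels after dropna, so its labels no longer match row positions)
indices = pd.Series(meta['title'].values)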

In [238]:
#define a function that recommends items from the cosine similarity matrix computed above; the item title is used to look up the item

def recommendations(title, cosine_sim = cosine_sim):
    #find the index of the first item with this title
    idx = indices[indices == title].index[0]
    
    #sort this item's cosine similarity scores from highest to lowest
    score_series = pd.Series(cosine_sim[idx]).sort_values(ascending = False)
    
    #take the top 10 products, skipping the first entry (normally the item itself)
    top_10_indexes = list(score_series.iloc[1:11].index)
    
    #return the row positions of the recommended products (look up titles with meta.iloc)
    return top_10_indexes

In [242]:
#find an item title from meta data
meta.title[1]

'Mae Ploy Thai Green Curry Paste - 14 oz jar'

In [241]:
#what are the recommended products if a user bought this green curry paste?
recommendations('Mae Ploy Thai Green Curry Paste - 14 oz jar')

[723, 722, 724, 725, 3314, 2307, 1676, 5031, 4850, 6368]

In [247]:
#what is item 722?
meta.iloc[722].title

'Mae Ploy Thai Red Curry Paste - 14 ounce per jar'

In [248]:
meta.iloc[723].title

'Mae Ploy Thai Panang Curry Paste - 14 oz jar'

In [249]:
meta.iloc[725].title

'Mae Ploy Thai Yellow Curry Paste - 14 oz jar'

In [250]:
meta.iloc[5031].title

"Libby's Lima Beans, 15-Ounce Cans (Pack of 12)"

The top recommendations (other Mae Ploy curry pastes) hold up conceptually, although lower-ranked ones such as the lima beans are weaker matches.