## Simple Product Recommendations

Products can be matched with one another by turning each description into a vector of TF-IDF weights and comparing those vectors with cosine similarity.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

## The Data

The data looks something like this:

In [9]:
products = pd.read_csv('data/sample-data.csv')
products.head()
# To-Do: split the product title from each description so items can be queried by name

Unnamed: 0,id,description
0,1,Active classic boxers - There's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...
2,3,Active sport briefs - These superbreathable no...
3,4,"Alpine guide pants - Skin in, climb ice, switc..."
4,5,"Alpine wind jkt - On high ridges, steep ice an..."


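## TF-IDF Vectorising

This converts each description into a vector of TF-IDF weights. A term's weight grows with how often it appears in a description and shrinks with the number of descriptions that contain it, so distinctive terms count for more than common ones.

To-Do: Add a fuller explanation of the theory behind TF-IDF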

In [10]:
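# Initialize the vectorizer to be word-based and to consider uni-, bi-, and tri-grams.
# min_df=1 keeps every term, as min_df=0 did; recent scikit-learn versions reject an integer min_df of 0.
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')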
tfidf_matrix = tfidf.fit_transform(products.description)
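
To make the weighting concrete, here is a minimal sketch on three made-up descriptions (hypothetical toy data, not rows from `sample-data.csv`). It uses unigrams only to keep the vocabulary small, and the names `docs`, `toy_tfidf`, and `toy_matrix` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions (not taken from sample-data.csv)
docs = [
    "merino wool base layer",
    "merino wool hiking socks",
    "waterproof rain jacket",
]

# Unigrams only, to keep the toy vocabulary small
toy_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
toy_matrix = toy_tfidf.fit_transform(docs)

vocab = toy_tfidf.vocabulary_               # maps term -> column index
first_doc = toy_matrix[0].toarray().ravel() # dense weights for the first description

# "merino" occurs in two descriptions and "base" in only one,
# so "base" receives the larger weight within the first description.
print("merino:", first_doc[vocab["merino"]])
print("base:  ", first_doc[vocab["base"]])
```

Both terms appear once in the first description, so the difference in weight comes entirely from the inverse document frequency.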

## Cosine Similarities

Go back to vector algebra! The cosine of the angle between two vectors is 1 when that angle is 0 degrees, i.e., when the vectors point in the same direction.

So the closer the cosine similarity is to 1, the more similar the vectors are, and thus the more similar the product descriptions are in content. Very basic!

In [22]:
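# TfidfVectorizer L2-normalizes each row by default, so the dot products from linear_kernel are cosine similarities
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)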
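
`linear_kernel` computes plain dot products. It yields cosine similarities here because `TfidfVectorizer` scales every row to unit length by default (`norm='l2'`). The sketch below checks this on hypothetical toy descriptions. The names `docs`, `toy_matrix`, `dot_products`, and `cosines` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy descriptions (not taken from sample-data.csv)
docs = [
    "merino wool base layer",
    "merino wool hiking socks",
    "waterproof rain jacket",
]

toy_matrix = TfidfVectorizer().fit_transform(docs)

# Euclidean length of every row; the default norm='l2' should make these all 1
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two matrices would generally differ and `cosine_similarity` would be the safer call.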

Let's look at our amazing similarity matrix:

In [28]:
pd.DataFrame(cosine_similarities).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
0,1.0,0.101106,0.064874,0.054205,0.045668,0.043036,0.038365,0.033483,0.065326,0.023683,...,0.055639,0.044734,0.049726,0.16939,0.138845,0.138136,0.116579,0.060974,0.065469,0.069556
1,0.101106,1.0,0.418166,0.05454,0.05834,0.040264,0.043855,0.032784,0.03704,0.027327,...,0.049276,0.037292,0.048259,0.113034,0.072082,0.07473,0.054444,0.0355,0.069364,0.064805
2,0.064874,0.418166,1.0,0.050032,0.063913,0.045049,0.051047,0.062232,0.043588,0.044433,...,0.048307,0.039313,0.044638,0.081965,0.110537,0.061168,0.059371,0.034024,0.045514,0.050385
3,0.054205,0.05454,0.050032,1.0,0.099679,0.104469,0.065436,0.033349,0.057038,0.042155,...,0.0443,0.036827,0.040935,0.065442,0.047781,0.062331,0.049173,0.072626,0.047726,0.05847
4,0.045668,0.05834,0.063913,0.099679,1.0,0.115886,0.080627,0.033233,0.054776,0.056843,...,0.039783,0.032425,0.043029,0.052072,0.043866,0.053034,0.062629,0.107587,0.042696,0.05481


In [55]:
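# For every product, collect the other products ranked by descending similarity
similarities = {}

for index, row in products.iterrows():
    # argsort is ascending; [:-100:-1] takes the 99 highest-scoring positions, best first
    # (index doubles as the row position because products has the default RangeIndex)
    similar_indices = cosine_similarities[index].argsort()[:-100:-1]
    # add product title after you separate it!
    similar_items = [(cosine_similarities[index][i], products['id'][i]) for i in similar_indices]
    # Drop the first entry, which is expected to be the product itself (similarity 1.0)
    similarities[row['id']] = similar_items[1:]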
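
The reversed `argsort` slice is easy to misread, so here is a small sketch of the same idea on a hypothetical four-item similarity row. It also filters the item out by position instead of assuming it sorts first, which is a variation on the loop above rather than what the notebook does.

```python
import numpy as np

# Hypothetical similarity row for item 0 against four items (itself included)
scores = np.array([1.0, 0.2, 0.7, 0.4])
item_position = 0
k = 2

# argsort is ascending, so [:-(k + 2):-1] walks back from the end
# and yields the k + 1 highest-scoring positions, best first.
top = scores.argsort()[:-(k + 2):-1]

# Remove the item itself by position, then keep the k best neighbours
neighbours = [(float(scores[i]), int(i)) for i in top if i != item_position][:k]
print(neighbours)  # [(0.7, 2), (0.4, 3)]
```

Filtering by position stays correct even when another product has an identical description and therefore also scores 1.0.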

In [58]:
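# Each column is a product id; row k holds its (k+1)-th most similar product as (score, id)
# To-Do: make this output easier to read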
pd.DataFrame(similarities)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,491,492,493,494,495,496,497,498,499,500
0,"(0.220379214726, 19)","(0.418166399216, 3)","(0.418166399216, 2)","(0.825385675995, 159)","(0.955003649316, 308)","(0.301900561799, 438)","(0.266268232308, 354)","(0.913268218046, 220)","(0.375508408435, 417)","(0.302470602334, 425)",...,"(0.401490868812, 116)","(0.981262273383, 286)","(0.417995838571, 138)","(0.528361268034, 19)","(0.311684484525, 494)","(0.615185258188, 173)","(0.704989080363, 22)","(0.237471568777, 302)","(0.386247565848, 462)","(0.36281626186, 499)"
1,"(0.16938950913, 494)","(0.115463820986, 19)","(0.11401848122, 299)","(0.20769755385, 184)","(0.183044200891, 96)","(0.293119445145, 184)","(0.254988934946, 104)","(0.449363734332, 262)","(0.116546926409, 469)","(0.216625222869, 466)",...,"(0.400336582155, 72)","(0.254601905813, 116)","(0.38993391691, 116)","(0.311684484525, 495)","(0.2400276437, 496)","(0.415822638361, 22)","(0.63344523705, 360)","(0.226974058613, 267)","(0.384221675213, 463)","(0.318046459929, 462)"
2,"(0.167694580653, 18)","(0.113033922454, 494)","(0.110537294466, 495)","(0.188279918017, 438)","(0.180399268592, 281)","(0.184122871365, 382)","(0.254573476606, 403)","(0.446206375211, 255)","(0.108203020376, 474)","(0.190674483329, 428)",...,"(0.391326437078, 139)","(0.251968784786, 56)","(0.356114740564, 347)","(0.217157070817, 496)","(0.222285329056, 173)","(0.385800979652, 23)","(0.608114612943, 359)","(0.22636494599, 386)","(0.383771537386, 32)","(0.31778345313, 463)"
3,"(0.164855277456, 172)","(0.112478545211, 300)","(0.109176400166, 300)","(0.165740268287, 343)","(0.157542170023, 293)","(0.16468574577, 415)","(0.242631339147, 464)","(0.38174009592, 291)","(0.103801046495, 475)","(0.167516510742, 408)",...,"(0.376283960326, 98)","(0.240331749948, 372)","(0.346087059195, 98)","(0.214422798401, 173)","(0.222213654254, 19)","(0.383677745095, 359)","(0.567262700882, 23)","(0.184041254207, 212)","(0.36281626186, 500)","(0.315561344229, 32)"
4,"(0.148126154606, 442)","(0.111470179244, 299)","(0.101723204487, 156)","(0.163738275363, 384)","(0.152097992726, 210)","(0.150837910762, 387)","(0.231217902502, 437)","(0.369299287912, 240)","(0.101759256605, 230)","(0.153008463261, 465)",...,"(0.375307991886, 397)","(0.23862073502, 138)","(0.330714640509, 56)","(0.185013957657, 497)","(0.209197225592, 23)","(0.383561421069, 497)","(0.530864978094, 175)","(0.179860082443, 415)","(0.283497236892, 34)","(0.256628673385, 34)"
5,"(0.145778632844, 171)","(0.101106417012, 1)","(0.100468654961, 318)","(0.160573220239, 379)","(0.151214243544, 364)","(0.150293224215, 268)","(0.222388324826, 436)","(0.368140595673, 219)","(0.0971748646245, 459)","(0.130002424701, 135)",...,"(0.373547483693, 124)","(0.227441859125, 397)","(0.329607208576, 26)","(0.175388122992, 22)","(0.204739935194, 497)","(0.372303808111, 360)","(0.514832978499, 174)","(0.179049140747, 382)","(0.21831892567, 453)","(0.234254644175, 483)"
6,"(0.141376423654, 21)","(0.0991219664716, 318)","(0.0994017094776, 155)","(0.15095357534, 353)","(0.140482415814, 97)","(0.146019382017, 212)","(0.211795216032, 393)","(0.335665677801, 261)","(0.0960472039411, 414)","(0.110135362311, 17)",...,"(0.367779223578, 26)","(0.226893182143, 72)","(0.329028871694, 289)","(0.16938950913, 1)","(0.200120725316, 22)","(0.338743376303, 175)","(0.383561421069, 496)","(0.172783761085, 274)","(0.217008656335, 303)","(0.216607899285, 303)"
7,"(0.138844634262, 495)","(0.0882901384989, 155)","(0.0922480131726, 165)","(0.142800330471, 216)","(0.14014100135, 216)","(0.141379996745, 216)","(0.204461998427, 31)","(0.328735991843, 254)","(0.0940791373569, 461)","(0.103455454116, 13)",...,"(0.3626377951, 73)","(0.225028191282, 333)","(0.328418638177, 237)","(0.169159320092, 359)","(0.191519987957, 359)","(0.335415854042, 174)","(0.372267314822, 173)","(0.168242135979, 352)","(0.214288524812, 483)","(0.201515257636, 482)"
8,"(0.138795333314, 25)","(0.0882217484489, 214)","(0.0919951896154, 164)","(0.141957072879, 120)","(0.139815759168, 207)","(0.140385600725, 120)","(0.15202059228, 389)","(0.289259815254, 221)","(0.0937959299595, 460)","(0.10166227909, 312)",...,"(0.360657733483, 36)","(0.223805928384, 491)","(0.327737078794, 333)","(0.167947713015, 23)","(0.186814653713, 360)","(0.30026359112, 442)","(0.303843941939, 441)","(0.166801024269, 327)","(0.192695177237, 481)","(0.20115947894, 481)"
9,"(0.138135502991, 496)","(0.0873103968644, 301)","(0.0904754695913, 258)","(0.1313713738, 331)","(0.139626470502, 120)","(0.131837611921, 302)","(0.151310410216, 456)","(0.25918402555, 241)","(0.0922100615964, 248)","(0.0747966554577, 29)",...,"(0.352830943768, 95)","(0.220499739806, 322)","(0.324766654346, 491)","(0.164707435462, 360)","(0.18291858869, 175)","(0.255209064612, 440)","(0.260824305938, 443)","(0.165330146601, 67)","(0.191456566735, 482)","(0.199109620966, 37)"


Function to look up an item by its id and return its name:

In [71]:
def query_item(item_id):
    # The name is the part of the description before the first ' - '
    return products.loc[products['id'] == item_id, 'description'].iloc[0].split(' - ')[0]
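
The To-Do about separating titles from descriptions could be approached along these lines. This is only a sketch on hypothetical rows that follow the same "title - body" shape. The `title` and `body` columns and the `title_to_id` mapping are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows in the same "title - body" shape as the sample data
toy_products = pd.DataFrame({
    "id": [1, 2],
    "description": [
        "Trail shorts - Lightweight shorts for warm days",
        "Rain shell - Packable shell - now with pit zips",
    ],
})

# n=1 splits only on the first " - ", so separators inside the body are kept
parts = toy_products["description"].str.split(" - ", n=1, expand=True)
toy_products["title"] = parts[0]
toy_products["body"] = parts[1]

# Map title -> id for name-based lookups (assumes titles are unique)
title_to_id = dict(zip(toy_products["title"], toy_products["id"]))
print(title_to_id["Rain shell"])  # 2
```

Titles may not be unique in the real data (two different products can share a name), so a production lookup would need to handle several ids per title.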

Generate top n similar items given an item's id and n (of course):

In [76]:
def recommend(item_id, n):
    print(str(n) + " products similar to " + query_item(item_id) + " :")
    print("------------------")
    recommendations = similarities[item_id][:n]
    for r in recommendations:
        print(query_item(r[1]) + " (score:" + str(r[0]) + ")")

In [78]:
recommend(3,5)

5 products similar to Active sport briefs :
------------------
Active sport boxer briefs (score:0.418166399216)
Active boy shorts (score:0.11401848122)
Active briefs (score:0.110537294466)
Active briefs (score:0.109176400166)
Active mesh bra (score:0.101723204487)


In [None]:
/recommend "Active sport briefs" 5