Item-to-item collaborative filtering (CF) matches each item purchased or rated by a target user to similar items and combines those similar items into a recommendation list.

Similarity is computed from the following features:
    1. Average rating
    2. Product description
    3. Co-occurrence of items in other customers' baskets
    

Steps:
    1. Scan the products in the target user's basket: for every customer who bought any product from the basket, identify the other products that customer bought
    2. Find which of those products are bought together with the basket items most often
    3. Compute the similarity between the target user's basket and the products identified in step 2
    4. Select the top N most similar products to recommend
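
A minimal sketch of steps 1 and 2 on toy data may help make the idea concrete. The `baskets` dictionary and the `co_purchased` helper are hypothetical names introduced for illustration only; the notebook itself works on the Amazon review tables.

```python
from collections import Counter

# Hypothetical toy data: user -> set of products they bought or rated
baskets = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "C", "D"},
    "u3": {"B", "D"},
    "target": {"A"},
}

def co_purchased(target_user, baskets, top_n=2):
    """Count products that co-occur with the target user's items in other baskets."""
    own = baskets[target_user]
    counts = Counter()
    for user, items in baskets.items():
        if user == target_user or not (items & own):
            continue  # skip the target user and users with no shared product
        counts.update(items - own)  # only products the target user does not have yet
    return counts.most_common(top_n)

candidates = co_purchased("target", baskets)
print(candidates)
```

Steps 3 and 4 then rank these candidates by their similarity to the basket.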
    
    
    
There are some #TODOs in the code

In [475]:
import pandas as pd
import numpy as np
import gzip
import time

In [476]:
# high-level parameters
example_id = 'AHQRU3MRORIWQ' # user ID (the user for whom the recommendation is made)
topN = 10 # how many products to recommend

In [480]:
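# ==== Reading data ====== #
import ast

def parse(path):
  # each line holds one dict literal; literal_eval parses it without executing arbitrary code
  with gzip.open(path, 'rt', encoding='utf-8') as g:
    for l in g:
      yield ast.literal_eval(l)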

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')


# Tools and Home Improvement: reviews (1,926,047 reviews), metadata (269,120 products)
start_time = time.time()

data_review = getDF('reviews_Tools_and_Home_Improvement_5.json.gz')
data_meta = getDF('meta_Tools_and_Home_Improvement.json.gz')

print("--- %s seconds ---" % (time.time() - start_time))  

--- 65.97688794136047 seconds ---


In [481]:
data_review.head(n=1) # preview of the reviews data

Unnamed: 0,reviewerID,reviewTime,overall,summary,reviewerName,reviewText,helpful,asin,unixReviewTime
0,A4IL0CLL27Q33,"01 29, 2014",5.0,Perfect for collar stay management,D. Brennan,"I hate it when my shirt collars, not otherwise...","[0, 1]",104800001X,1390953600


In [482]:
data_meta.set_index('asin', inplace = True)
data_meta.index.name = None  # clear the index label
data_meta.head(n=1) # preview of the items data

Unnamed: 0,title,imUrl,categories,related,price,brand,description,salesRank
001212835X,Everett's Cottage Table Lamp,http://ecx.images-amazon.com/images/I/41R2RAs9...,"[[Tools & Home Improvement, Lighting & Ceiling...",,,,,


In [483]:
# lists of the all users IDs and all product IDs (unique)
#users_ids = list(set(data_review['reviewerID']))
#products_asins = list(set(data_review['asin']))

In [484]:
# grouping of data by users and by products
reviews_grouped_by_product = data_review[['reviewerID','asin','overall']].groupby('asin')
reviews_grouped_by_user = data_review[['reviewerID','asin','overall']].groupby('reviewerID')

In [485]:
def who_also_bought(product_id):
    # input - product ID, output - Series with the IDs of the users who rated this product
    return reviews_grouped_by_product.get_group(product_id)['reviewerID']

def get_product_list(customer_id):
    # input - user ID, output - Series with the IDs of the products rated by the given user
    return reviews_grouped_by_user.get_group(customer_id)['asin']

def do_df(asin):
    # input - product ID, output - dataframe with the ratings of this product, indexed by the users who rated it
    tmp = reviews_grouped_by_product.get_group(asin)[['reviewerID','overall']].set_index('reviewerID')
    tmp.columns = [asin]
    tmp.index.name = None
    return tmp

In [486]:
# catalog of the items that user has rated
example_user_catalog = reviews_grouped_by_user.get_group(example_id)['asin']

In [487]:
# list of dataframes;
# each dataframe has: index = user IDs, column = product ID, data = the users' ratings of that product
reviews_matrix = [] 

# dictionary of dictionaries; 
# format: item_X = {item_A: # of times user rated item_A AND item_X}
purchase_frequency = {}

start_time = time.time()

# the target user and the products they already rated are excluded from the candidates
target_products = set(example_user_catalog.values)
products_in_matrix = set()

#TODO: find a better way than nested for loops

for item_id in example_user_catalog:
    purchase_frequency[item_id] = {}

    for c_id in who_also_bought(item_id):
        if c_id == example_id:
            continue

        for prod in get_product_list(c_id):
            if prod in target_products:
                continue
            # add each product's rating column to the matrix only once
            if prod not in products_in_matrix:
                products_in_matrix.add(prod)
                reviews_matrix.append( do_df(prod) )
            purchase_frequency[item_id][prod] = purchase_frequency[item_id].get(prod, 0) + 1

# matrix - collecting the reviews from the users about the products, not used now.
matrix = pd.concat(reviews_matrix, axis = 1, join='outer')

sorted_purchase_frequency = {}
for val in purchase_frequency.keys():
    sorted_purchase_frequency[val] = sorted(purchase_frequency[val], key=purchase_frequency[val].__getitem__, reverse=True)

print("--- %s seconds ---" % (time.time() - start_time))  

--- 30.34269618988037 seconds ---
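
One possible answer to the TODO about nested loops is to express the co-occurrence count as two joins followed by a group-by. The sketch below runs on a hypothetical toy table that reuses the `reviewerID` and `asin` column names; `reviews`, `target_id`, `buyers`, `pairs` and `co_counts` are illustrative names, and the approach assumes each user reviews a product at most once.

```python
import pandas as pd

# Hypothetical toy reviews table with the same column names as data_review
reviews = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3", "u3", "target", "target"],
    "asin":       ["A",  "B",  "A",  "C",  "A",  "B",  "A",      "D"],
})
target_id = "target"

# items of the target user, and everyone else's reviews
target_items = reviews.loc[reviews["reviewerID"] == target_id, ["asin"]]
others = reviews[reviews["reviewerID"] != target_id]

# pair each basket item with the other users who reviewed it ...
buyers = target_items.merge(others, on="asin").rename(columns={"asin": "basket_item"})
# ... then with everything those users reviewed
pairs = buyers.merge(others, on="reviewerID")
# drop products the target user already has
pairs = pairs[~pairs["asin"].isin(target_items["asin"])]

# co-occurrence counts per (basket item, candidate product)
co_counts = pairs.groupby(["basket_item", "asin"]).size()
print(co_counts)
```

On the full dataset the intermediate `pairs` table can grow large, so whether this is faster than the loops would need to be measured.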


In [488]:
matrix.head(n = 1) # preview of the matrix that collects the users' reviews

Unnamed: 0,B0000224L6,B0000225OD,B0007VYL48,B000CFNCKS,B000I1EFKM,B001EYU97A,B001PTGBRQ,B002B56CUO,B002B56CUY,B0055HPIOQ,...,B00443I32G,B004SKY73O,B008186IAY,B00APL6Q0W,B00G5R4E1S,B00GWBYDTU,B00I5CKF0A,B00IL62XK0,B00JALS9Y4,B00JALSF7K
A00473363TJ8YSZ3YAGG9,,,,,,,,,,,...,,,,,,,,,,


Select the top N products (N is set above) based on how often they are bought together with the basket items

In [489]:
topN_products = []
for val in sorted_purchase_frequency.keys():
    # note: the slice [1:topN] skips the most frequent product and keeps at most topN - 1 per basket item
    topN_products = topN_products + (sorted_purchase_frequency[val][1:topN])

# TODO: the information that a product is a duplicate can be useful, need to include it in the analysis
# delete the duplicates
topN_products = list(set(topN_products))
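
Regarding the TODO about duplicates: instead of discarding them with `set`, one could count in how many basket items' top lists each candidate appears. This is only a sketch; `ranked`, `top_n` and `appearances` are hypothetical names, and `ranked` merely mimics the shape of `sorted_purchase_frequency`.

```python
from collections import Counter

# Hypothetical ranked candidate lists, one per basket item
# (same shape as sorted_purchase_frequency in the notebook)
ranked = {
    "item_1": ["P1", "P2", "P3"],
    "item_2": ["P2", "P4"],
    "item_3": ["P2", "P1"],
}
top_n = 2

# count in how many basket items' top lists each candidate appears
appearances = Counter()
for candidates in ranked.values():
    appearances.update(candidates[:top_n])

print(appearances.most_common())
```

A candidate that shows up for several basket items could then be given extra weight in the final ranking.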

Among the top N, find the M products most similar to the ones in the user's basket

To do so, we need to look at product similarity. Possible features:
- average rating
- price
- category
- brand

In [490]:
def consructDataFeatures(data_in, item_list):
    data_out = data_in.loc[item_list, ['categories','price','brand','salesRank']]
    for el in item_list:
        data_out.loc[el,'Rating'] = reviews_grouped_by_product.get_group(el)['overall'].mean()
        try:
            data_out.loc[el,'SalesCategory'] = data_out.loc[el, 'salesRank'].keys()
            k = list(data_out.loc[el,'SalesCategory'])
            data_out.loc[el,'SalesRank'] = data_out.loc[el, 'salesRank'][k[0]]
        except AttributeError:
            data_out.loc[el,'SalesCategory'] = np.nan
            data_out.loc[el,'SalesRank'] = np.nan

    data_out = data_out.loc[:,['Rating','price','SalesCategory','SalesRank','brand', 'categories']]

    # treat NA as 0s
    data_out.fillna(0, inplace = True)
    return data_out

In [491]:
user_products_data = consructDataFeatures(data_meta, example_user_catalog)
topN_products_data = consructDataFeatures(data_meta, topN_products)

Each item is characterized by the vector (average rating, price, sales category, sales rank, brand, categories)

#TODO: there are ways to improve

To find the similarity:

For the numeric variables (rating and price): how close the values are

For the categorical variables (sales category, brand, categories):
- sales category and brand: 1 if they coincide fully; tends to 0 if the sales categories do not match or the ranks are far apart
- categories: the overlap of the subcategories, as a fraction of the maximum possible overlap
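
As a small worked example of these measures, the sketch below evaluates the closeness score `exp(-|a - b|)` and the category overlap on made-up values. The helper names `closeness` and `category_overlap` and the input values are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def closeness(a, b):
    """exp(-|a - b|): 1 when the values are equal, decaying towards 0 as they diverge."""
    return float(np.exp(-abs(a - b)))

def category_overlap(cats_a, cats_b):
    """Shared subcategories as a fraction of the longer category path."""
    shared = len(set(cats_a) & set(cats_b))
    return shared / max(len(cats_a), len(cats_b))

# Hypothetical values chosen for illustration
rating_sim = closeness(4.5, 4.0)        # ratings half a star apart
price_sim = closeness(12.95, 14.24)     # prices about 1.29 apart
cat_sim = category_overlap(
    ["Tools & Home Improvement", "Electrical", "Light Bulbs"],
    ["Tools & Home Improvement", "Electrical", "Plugs"],
)
print(round(rating_sim, 3), round(price_sim, 3), round(cat_sim, 3))
```

Because the price difference enters unscaled, a gap of only a few currency units already pushes the price score close to 0, which may be one of the points the TODO above has in mind.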


In [511]:
def similarity_rate(product1, product2):
    # TODO: implement other measures for the categorical variables of SalesCategory, SalesRank, brand, categories_overlap
    # for the categorical values the weighted average overlap measure is applied 
    
    # numeric measures
    rating_similarity = np.exp(-abs(product1['Rating'] - product2['Rating'])) #closer - higher similarity
    price_similarity = np.exp(-abs(product1['price'] - product2['price'])) #closer - higher similarity
    
    # overlap measures
    brand_similarity = product1['brand'] == product2['brand']
    
    sales_category_similarity = product1['SalesCategory'] == product2['SalesCategory']
    sales_rank_similarity = np.exp(-abs(product1['SalesRank'] - product2['SalesRank']))
    sales_similarity = (sales_category_similarity * sales_rank_similarity) if sales_category_similarity!=0 else 0.
    
    # category similarity is defined as a fraction of the maximum possible overlap
    max_similarity = max(len(product2['categories'][0]), len(product1['categories'][0]))
    category_similarity = len(set(product2['categories'][0]).intersection(product1['categories'][0]))/ max_similarity
    
    # return average rate
    return np.mean([rating_similarity, price_similarity, brand_similarity, sales_similarity, category_similarity])
    

In [512]:
similarity_matrix = pd.DataFrame([], index = topN_products_data.index.values, columns = user_products_data.index.values)

start_time = time.time()
for topN_el in similarity_matrix.index.values:
    p2 = topN_products_data.loc[topN_el,:]
    for user_prod in similarity_matrix.columns.values:
        p1 = user_products_data.loc[user_prod,:]
        similarity_matrix.loc[topN_el,user_prod] = similarity_rate(p1,p2)
        
print("--- %s seconds ---" % (time.time() - start_time))          

--- 0.9321529865264893 seconds ---


In [513]:
similarity_matrix.head(n = 2)

Unnamed: 0,B00005A1JN,B00009YUHK,B001RQG6Z4,B0030A85M2,B0037NXKY0,B003MP8MGO,B003MP8MGY,B0048WPV3M,B0064MRP0G,B007BE9OQ4,B009GMJOG4,B00FIYJXAQ,B00FZKTRPY,B00GZGC3IK
B0000223QY,0.679923,0.12262,0.38888,0.372214,0.371544,0.240166,0.436032,0.376634,0.437469,0.501531,0.425378,0.40537,0.440933,0.383801
B001GAOO6Y,0.578426,0.103003,0.40008,0.394463,0.380099,0.222685,0.426591,0.39372,0.425112,0.494664,0.437883,0.419391,0.421753,0.356081


In [516]:
# calculate the average similarity of each product to the whole basket of the user's reviewed products
sorted_average_similarity = (similarity_matrix.mean(axis = 1)).sort_values(ascending = False)

# get the products most similar to the whole basket
# note: the slice [1:topN] skips the top-ranked product and returns topN - 1 items
prod_IDs_to_recommend = sorted_average_similarity[1:topN]
prod_IDs_to_recommend

B00DHU85NE    0.467945
B00APB0IX8    0.465836
B00JPBDL9W    0.461515
B008U3R9OE    0.445298
B00GM477G8    0.441877
B009UVDE5S    0.440528
B005E6H7BU    0.438174
B0080PLOE8    0.431286
B001I9TI4Q    0.429026
dtype: float64
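
The aggregation step can be illustrated on a tiny, made-up similarity table. The names `sim`, `n`, `ranking` and `recommended` and all scores below are hypothetical.

```python
import pandas as pd

# Hypothetical similarity scores: rows = candidate products, columns = basket items
sim = pd.DataFrame(
    {"basket_A": [0.6, 0.2, 0.5], "basket_B": [0.4, 0.3, 0.9]},
    index=["cand_1", "cand_2", "cand_3"],
)
n = 2

# average over the basket, then keep the n best-scoring candidates
ranking = sim.mean(axis=1).sort_values(ascending=False)
recommended = ranking.head(n)
print(recommended)
```

Note that `head(n)` keeps the top-ranked candidate, whereas a slice such as `[1:topN]` starts from the second one.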

In [517]:
# View the list of recommended products
topN_products_data.loc[prod_IDs_to_recommend.index.values,:]

Unnamed: 0,Rating,price,SalesCategory,SalesRank,brand,categories
B00DHU85NE,4.492308,12.95,0,0.0,0,"[[Tools & Home Improvement, Electrical, Light ..."
B00APB0IX8,4.4,14.24,0,0.0,0,"[[Tools & Home Improvement, Electrical, Light ..."
B00JPBDL9W,4.444444,35.99,0,0.0,0,"[[Tools & Home Improvement, Electrical, Light ..."
B008U3R9OE,4.37037,21.99,0,0.0,0,"[[Tools & Home Improvement, Lighting & Ceiling..."
B00GM477G8,4.125,24.99,0,0.0,0,"[[Tools & Home Improvement, Lighting & Ceiling..."
B009UVDE5S,4.5,49.95,0,0.0,0,"[[Tools & Home Improvement, Lighting & Ceiling..."
B005E6H7BU,4.666667,69.95,0,0.0,0,"[[Tools & Home Improvement, Lighting & Ceiling..."
B0080PLOE8,4.307692,37.8,0,0.0,GE Lighting,"[[Tools & Home Improvement, Electrical, Light ..."
B001I9TI4Q,4.585586,15.99,0,0.0,TerraLUX,"[[Tools & Home Improvement, Electrical, Light ..."


In [518]:
# original user dataset
user_products_data

Unnamed: 0,Rating,price,SalesCategory,SalesRank,brand,categories
B00005A1JN,4.666667,11.89,0,0.0,Stanley,"[[Tools & Home Improvement, Power & Hand Tools..."
B00009YUHK,3.222222,166.66,(Home Improvement),12232.0,0,"[[Tools & Home Improvement, Painting Supplies ..."
B001RQG6Z4,4.6,123.17,0,0.0,Hunter Fan Company,"[[Tools & Home Improvement, Lighting & Ceiling..."
B0030A85M2,4.6,7.25,0,0.0,Neiko,"[[Tools & Home Improvement, Safety & Security,..."
B0037NXKY0,4.733333,24.6,0,0.0,Leviton,"[[Tools & Home Improvement, Electrical, Plugs]]"
B003MP8MGO,4.285714,5.59,(Home Improvement),9404.0,Dorcy,"[[Tools & Home Improvement, Electrical, Light ..."
B003MP8MGY,4.307692,4.21,0,0.0,Dorcy,"[[Tools & Home Improvement, Electrical, Light ..."
B0048WPV3M,4.692308,15.75,0,0.0,Leviton,"[[Tools & Home Improvement, Electrical, Outlet..."
B0064MRP0G,4.3,79.99,0,0.0,Black &amp; Decker,"[[Tools & Home Improvement, Painting Supplies ..."
B007BE9OQ4,4.512821,24.0,0,0.0,Coast,"[[Tools & Home Improvement, Power & Hand Tools..."
