# Intro to Recommender Systems Lab

Complete the exercises below to solidify your knowledge and understanding of recommender systems.

For this lab, we are going to be putting together a user similarity based recommender system in a step-by-step fashion. Our data set contains customer grocery purchases, and we will use similar purchase behavior to inform our recommender system. Our recommender system will generate 5 recommendations for each customer based on the purchases they have made.

In [342]:
import pandas as pd
import numpy as np

# note we're importing some libs from scipy
from scipy.spatial.distance import pdist, squareform


* __pdist__: Pairwise distances between observations in n-dimensional space.
Source: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

* __squareeform__: Converts a vector-form distance vector to a square-form distance matrix, and vice-versa --

Source: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.squareform.html

In [343]:
data = pd.read_csv('../data/customer_product_sales.csv')

In [344]:
data.head()

Unnamed: 0,CustomerID,FirstName,LastName,SalesID,ProductID,ProductName,Quantity
0,61288,Rosa,Andersen,134196,229,Bread - Hot Dog Buns,16
1,77352,Myron,Murray,6167892,229,Bread - Hot Dog Buns,20
2,40094,Susan,Stevenson,5970885,229,Bread - Hot Dog Buns,11
3,23548,Tricia,Vincent,6426954,229,Bread - Hot Dog Buns,6
4,78981,Scott,Burch,819094,229,Bread - Hot Dog Buns,20


## Step 1: Create a data frame that contains the total quantity of each product purchased by each customer.

You will need to group by CustomerID and ProductName and then sum the Quantity field.

In [345]:
# groupby CustomerID', and 'ProductName', return the 'Quantity' var and reset the index
df_prod = data.groupby(['CustomerID', 'ProductName']).sum()['Quantity'].reset_index()

In [346]:
df_prod.head(1)

Unnamed: 0,CustomerID,ProductName,Quantity
0,33,Apricots - Dried,1


## Step 2: Use the `pivot_table` method to create a product by customer matrix.

The rows of the matrix should represent the products, the columns should represent the customers, and the values should be the quantities of each product purchased by each customer. You will also need to replace nulls with zeros, which you can do using the `fillna` method.

In [347]:
# create a utility matrix with the sum of 'Quantity'
pivot_customer = data.pivot_table(index='ProductName', 
                               columns='CustomerID', 
                               values='Quantity',
                               aggfunc='sum',
                               fill_value=0)

In [348]:
pivot_customer.head(1)

CustomerID,33,200,264,356,412,464,477,639,649,669,...,97697,97753,97769,97793,97900,97928,98069,98159,98185,98200
ProductName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Anchovy Paste - 56 G Tube,0,0,0,0,0,0,0,1,0,0,...,0,25,0,0,0,0,0,0,0,0


In [349]:
# df_customer has 1000 customers
pivot_customer.shape

(452, 1000)

## Step 3: Create a customer similarity matrix using `squareform` and `pdist`. For the distance metric, choose "euclidean."

In [350]:
dist = squareform(
    # Find distances between all users: the smaller the distance more similar customers are
    pdist(
        # receive transposed pivot_customer table as data
        pivot_customer.T,
        # calculate the euclidean distance
        metric='euclidean'))

In [351]:
dist

array([[  0.        ,  11.91637529,  10.48808848, ..., 228.62851966,
        239.        , 229.77380181],
       [ 11.91637529,   0.        ,  11.74734012, ..., 228.01096465,
        239.03765394, 229.70415756],
       [ 10.48808848,  11.74734012,   0.        , ..., 228.08112592,
        238.26665734, 229.77380181],
       ...,
       [228.62851966, 228.01096465, 228.08112592, ...,   0.        ,
        304.13812651, 305.16389039],
       [239.        , 239.03765394, 238.26665734, ..., 304.13812651,
          0.        , 303.10889132],
       [229.77380181, 229.70415756, 229.77380181, ..., 305.16389039,
        303.10889132,   0.        ]])

In [352]:
# return an array with the distances
type(dist)

numpy.ndarray

In [353]:
# the array lengths 1000 cases, one for each customer
len(dist)

1000

In [354]:
from sklearn.neighbors import DistanceMetric as dm
# get the euclidean distance using sklearn
eucl = dm.get_metric('euclidean')

# applyt pairwise method
# note the df is transposed, as the function receives a different type of data
dist2 = eucl.pairwise(pivot_customer.T)

In [355]:
# note dist and dis2 are equals
dist == dist2

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

In [356]:
# Now that we have the distance, let's create a matrix
# select unique customer ID
cust_id = pivot_customer.columns

# convert the by summing up 1 and divide by 1
convert_dist = 1 / (1 + dist)

In [357]:
convert_dist

array([[1.        , 0.0774211 , 0.08704668, ..., 0.00435486, 0.00416667,
        0.00433325],
       [0.0774211 , 1.        , 0.07844774, ..., 0.0043666 , 0.00416601,
        0.00433456],
       [0.08704668, 0.07844774, 1.        , ..., 0.00436527, 0.00417944,
        0.00433325],
       ...,
       [0.00435486, 0.0043666 , 0.00436527, ..., 1.        , 0.0032772 ,
        0.00326622],
       [0.00416667, 0.00416601, 0.00417944, ..., 0.0032772 , 1.        ,
        0.0032883 ],
       [0.00433325, 0.00433456, 0.00433325, ..., 0.00326622, 0.0032883 ,
        1.        ]])

In [358]:
# Dataframe + Conversion - Big distance between similar customers (if equal, then 1)
distances_df = pd.DataFrame(convert_dist, 
                            # both columns and index are customerID
                            index=cust_id, columns=cust_id)


In [359]:
distances_df.head()

CustomerID,33,200,264,356,412,464,477,639,649,669,...,97697,97753,97769,97793,97900,97928,98069,98159,98185,98200
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
33,1.0,0.077421,0.087047,0.0818,0.080634,0.082709,0.074573,0.08302,0.081503,0.08007,...,0.004811,0.004669,0.004412,0.005019,0.004312,0.004515,0.004583,0.004355,0.004167,0.004333
200,0.077421,1.0,0.078448,0.076435,0.073693,0.075255,0.075956,0.076435,0.077674,0.076923,...,0.004824,0.004681,0.004431,0.005047,0.004311,0.004521,0.004614,0.004367,0.004166,0.004335
264,0.087047,0.078448,1.0,0.08007,0.0818,0.08035,0.076923,0.080634,0.0821,0.078448,...,0.004822,0.004674,0.004416,0.005035,0.004322,0.004543,0.004595,0.004365,0.004179,0.004333
356,0.0818,0.076435,0.08007,1.0,0.076435,0.078187,0.075025,0.082403,0.077171,0.075956,...,0.004816,0.004671,0.004416,0.005038,0.00431,0.004526,0.004578,0.004365,0.004175,0.004339
412,0.080634,0.073693,0.0818,0.076435,1.0,0.078711,0.075025,0.082403,0.078187,0.078448,...,0.00481,0.004702,0.004414,0.005034,0.004318,0.00453,0.004578,0.004367,0.004177,0.004349


## Step 4: Check your results by generating a list of the top 5 most similar customers for a specific CustomerID.

In [360]:
# take the first customerID
cust_33 = distances_df.loc[33]

In [361]:
# select the first 5 most similar customers
top5 = list(cust_33.sort_values(ascending=False).index[1:6])

In [362]:
top5

[264, 3535, 3317, 2503, 3305]

## Step 5: From the data frame you created in Step 1, select the records for the list of similar CustomerIDs you obtained in Step 4.

In [363]:
# create data frame with CustomerID from top5
top5_CustomerID = df_prod[
    # boolean mask with costuomer from top5 distancies
    df_prod['CustomerID'].isin(top5)]

In [364]:
top5_CustomerID.head(1)

Unnamed: 0,CustomerID,ProductName,Quantity
131,264,Apricots - Halves,1


In [365]:
# reset index from top5_CustomerID
top5_similar_customer = top5_CustomerID.reset_index(
    # dop the old index column
    drop=True)

In [366]:
top5_similar_customer.head(1)

Unnamed: 0,CustomerID,ProductName,Quantity
0,264,Apricots - Halves,1


## Step 6: Aggregate those customer purchase records by ProductName, sum the Quantity field, and then rank them in descending order by quantity.

This will give you the total number of each product purchased by the 5 most similar customers to the customer you selected in order from most purchased to least.

In [367]:
# groupby the 'ProductName' using sum(), sort the vallues by quantity and only return the quantity column
items_df = top5_similar_customer.groupby('ProductName').sum().sort_values(by='Quantity', ascending=False)['Quantity']
items_df.head()

ProductName
Butter - Unsalted                3
Wine - Ej Gallo Sierra Valley    3
Towels - Paper / Kraft           3
Soup - Campbells Bean Medley     3
Wine - Blue Nun Qualitatswein    3
Name: Quantity, dtype: int64

## Step 7: Filter the list for products that the chosen customer has not yet purchased and then recommend the top 5 products with the highest quantities that are left.

- Merge the ranked products data frame with the customer product matrix on the ProductName field.
- Filter for records where the chosen customer has not purchased the product.
- Show the top 5 results.

In [368]:
# Merge the 'items_df': df with the quantities of the most similar customers
# with the similarity of customer 33 from pivot_customer
merged = pd.merge(items_df, pivot_customer[33], on='ProductName').reset_index()

# rename columns
merged.columns = ['ProductName', 'Quantity_similar_33', 'Quantity_33'] 

In [369]:
# Filter only quantities == 0: the customer hasn't bought the product
filtered = merged[merged['Quantity_33'] == 0]

In [370]:
# Top 5
filtered.head()

Unnamed: 0,ProductName,Quantity_similar_33,Quantity_33
0,Butter - Unsalted,3,0
1,Wine - Ej Gallo Sierra Valley,3,0
3,Soup - Campbells Bean Medley,3,0
4,Wine - Blue Nun Qualitatswein,3,0
6,Chicken - Soup Base,2,0


## Step 8: Now that we have generated product recommendations for a single user, put the pieces together and iterate over a list of all CustomerIDs.

- Create an empty dictionary that will hold the recommendations for all customers.
- Create a list of unique CustomerIDs to iterate over.
- Iterate over the customer list performing steps 4 through 7 for each and appending the results of each iteration to the dictionary you created.

In [417]:
data[data.Quantity == 0]

Unnamed: 0,CustomerID,FirstName,LastName,SalesID,ProductID,ProductName,Quantity


In [371]:
recommender_dict = {customer: [] for customer in set(data['CustomerID'])}

In [372]:
customer_id

65535

In [420]:
# Empty dictionary
recommender_dict = {customer: [] for customer in set(data['CustomerID'])}

# Iteration
for customer_id in set(data['CustomerID']):
    # Step 4
    top_5 = list(distances_df[customer_id].sort_values(ascending=False)[1:6].index)
    # Step 5
    items_of_similar_customers_df = top5_similar_customer[top5_similar_customer['CustomerID'].isin(top_5)].reset_index(drop=True)
    # Step 6
    items_df = items_of_similar_customers_df.groupby('ProductName').sum().sort_values(by='Quantity', ascending=False)['Quantity']
    # Step 7
    # Merge the 'items_df': df with the quantities of the most similar customers
    # with the similarity of customer 33 from pivot_customer
    merged = pd.merge(items_df, pivot_customer[33], on='ProductName').reset_index()

    # rename columns
    merged.columns = ['ProductName', 'Quantity_similar_33', 'Quantity_33'] 
    filtered = merged[merged['Quantity_33'] == 0]
    # Results
    recommender_dict[customer_id] = list(filtered['ProductName'].head())

##  Step 9: Store the results in a Pandas data frame. The data frame should a column for Customer ID and then a column for each of the 5 product recommendations for each customer.

In [424]:
# transform dictionary into a dataframe
df_final = pd.DataFrame.from_dict(
    # passing dict
    recommender_dict,
    # column = index
    orient='index',
    # name columns
    columns=['rec_1', 'rec_2', 'rec_3', 'rec_4', 'rec_5']
)

In [404]:
df_final.reset_index(inplace=True)

In [405]:
df_final

Unnamed: 0,index,rec_1,rec_2,rec_3,rec_4,rec_5
0,83973,,,,,
1,59399,,,,,
2,92168,,,,,
3,49159,,,,,
4,18441,,,,,
5,22536,"Mushrooms - Black, Dried",Guinea Fowl,Chocolate - Dark,Apricots - Halves,Snapple Lemon Tea
6,86028,,,,,
7,75791,,,,,
8,96272,,,,,
9,32785,,,,,


In [392]:
df_final.columns = ['CustomerID', 'rec_1', 'rec_2', 'rec_3', 'rec_4', 'rec_5']

In [394]:
df_final = df_final.sort_values(by='rec_1')

In [401]:
df_final.head()

Unnamed: 0,CustomerID,rec_1,rec_2,rec_3,rec_4,rec_5
308,31373,Anchovy Paste - 56 G Tube,"Sole - Dover, Whole, Fresh",Milk Powder,Muffin Batt - Blueberry Passion,Ocean Spray - Kiwi Strawberry
223,4595,Anchovy Paste - 56 G Tube,"Sole - Dover, Whole, Fresh",Milk Powder,Muffin Batt - Blueberry Passion,Ocean Spray - Kiwi Strawberry
875,17458,Anchovy Paste - 56 G Tube,"Sole - Dover, Whole, Fresh",Milk Powder,Muffin Batt - Blueberry Passion,Ocean Spray - Kiwi Strawberry
246,4644,Anchovy Paste - 56 G Tube,"Sole - Dover, Whole, Fresh",Milk Powder,Muffin Batt - Blueberry Passion,Ocean Spray - Kiwi Strawberry
256,14913,Anchovy Paste - 56 G Tube,"Sole - Dover, Whole, Fresh",Milk Powder,Muffin Batt - Blueberry Passion,Ocean Spray - Kiwi Strawberry


## Step 10: Change the distance metric used in Step 3 to something other than euclidean (correlation, cityblock, consine, jaccard, etc.). Regenerate the recommendations for all customers and note the differences.

*  Metrics we can use as distance parameters: ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’, ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’, ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’

In [409]:
def recommend_system(data, metric='euclidean'):
    # groupby elements by customerID and ProductName and return Quantity
    group_cust_product = data.groupby(['CustomerID', 'ProductName']).sum()['Quantity'].reset_index()
    # create pivot_table
    pivot_data = data.pivot_table(
        index='ProductName',
        columns='CustomerID',
        values='Quantity',
        aggfunc='sum',
        fill_value=0
    )
    # calculate distance
    dist = squareform(
    # Find distances between all users: the smaller the distance more similar customers are
    pdist(
        # receive transposed pivot_customer table as data
        pivot_customer.T,
        # calculate the euclidean distance
        metric=metric))
    # Now that we have the distance, let's create a matrix
    # select unique customer ID
    cust_id = pivot_customer.columns
    # convert the by summing up 1 and divide by 1
    convert_dist = 1 / (1 + dist)
    # Dataframe + Conversion - Big distance between similar customers (if equal, then 1)
    distances_df = pd.DataFrame(convert_dist, 
                            # both columns and index are customerID
                            index=cust_id, columns=cust_id)
    # take the first customerID
    cust_33 = distances_df.loc[33]
    # select the first 5 most similar customers
    top5 = list(cust_33.sort_values(ascending=False).index[1:6])
    # create data frame with CustomerID from top5
    top5_CustomerID = df_prod[
    # boolean mask with costuomer from top5 distancies
    df_prod['CustomerID'].isin(top5)]
    # reset index from top5_CustomerID
    top5_similar_customer = top5_CustomerID.reset_index(
    # dop the old index column
    drop=True)
    # groupby the 'ProductName' using sum(), sort the vallues by quantity and only return the quantity column
    items_df = top5_similar_customer.groupby('ProductName').sum().sort_values(by='Quantity', ascending=False)['Quantity']
    # Merge the 'items_df': df with the quantities of the most similar customers
    # with the similarity of customer 33 from pivot_customer
    merged = pd.merge(items_df, pivot_customer[33], on='ProductName').reset_index()
    # rename columns
    merged.columns = ['ProductName', 'Quantity_similar_33', 'Quantity_33'] 
    # Filter only quantities == 0: the customer hasn't bought the product
    filtered = merged[merged['Quantity_33'] == 0]
    # Empty dictionary
    recommender_dict = {customer: [] for customer in set(data['CustomerID'])}        
    recommender_dict[customer_id] = list(filtered['ProductName'].head())
    # transform dictionary into a dataframe
    df_final = pd.DataFrame.from_dict(
        # passing dict
        recommender_dict,
        # column = index
        orient='index',
        # name columns
        columns=['rec_1', 'rec_2', 'rec_3', 'rec_4', 'rec_5'])
    # reset index
    df_final.reset_index(inplace=True)
    # rename columns
    df_final.columns = ['CustomerID', 'rec_1', 'rec_2', 'rec_3', 'rec_4', 'rec_5']
    # sort values
#    df_final = df_final.sort_values(by='rec_1')
    return(df_final)

In [425]:
recommend_system(data, metric='euclidean').sort_values(by='rec_1').head(1)

Unnamed: 0,CustomerID,rec_1,rec_2,rec_3,rec_4,rec_5
999,65535,Butter - Unsalted,Wine - Ej Gallo Sierra Valley,Soup - Campbells Bean Medley,Wine - Blue Nun Qualitatswein,Chicken - Soup Base
