# Intro to Recommender Systems Lab

Complete the exercises below to solidify your knowledge and understanding of recommender systems.

For this lab, we are going to be putting together a user similarity based recommender system in a step-by-step fashion. Our data set contains customer grocery purchases, and we will use similar purchase behavior to inform our recommender system. Our recommender system will generate 5 recommendations for each customer based on the purchases they have made.

In [1]:
import pandas as pd
from scipy.spatial.distance import pdist, squareform
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = pd.read_csv('../data/customer_product_sales.csv')

In [3]:
data.head()

Unnamed: 0,CustomerID,FirstName,LastName,SalesID,ProductID,ProductName,Quantity
0,61288,Rosa,Andersen,134196,229,Bread - Hot Dog Buns,16
1,77352,Myron,Murray,6167892,229,Bread - Hot Dog Buns,20
2,40094,Susan,Stevenson,5970885,229,Bread - Hot Dog Buns,11
3,23548,Tricia,Vincent,6426954,229,Bread - Hot Dog Buns,6
4,78981,Scott,Burch,819094,229,Bread - Hot Dog Buns,20


In [4]:
data = data.sort_values("CustomerID",ascending=True).reset_index(drop=True)
data.head()

Unnamed: 0,CustomerID,FirstName,LastName,SalesID,ProductID,ProductName,Quantity
0,33,Lindsay,Santana,2005605,162,Sauce - Demi Glace,1
1,33,Lindsay,Santana,5638266,214,French Pastry - Mini Chocolate,1
2,33,Lindsay,Santana,5056183,387,Fondant - Icing,1
3,33,Lindsay,Santana,1888258,53,Cassis,1
4,33,Lindsay,Santana,140335,245,Grouper - Fresh,1


In [5]:
data.CustomerID.value_counts().head()

33759    95
60862    94
29287    93
8711     93
63086    92
Name: CustomerID, dtype: int64

In [6]:
data.ProductName.value_counts().head()

Spinach - Baby                      186
Sole - Dover, Whole, Fresh          182
Oil - Shortening - All - Purpose    181
Tea - Jasmin Green                  181
Bandage - Flexible Neon             180
Name: ProductName, dtype: int64

In [7]:
data.shape

(68584, 7)

In [8]:
data.isnull().sum()

CustomerID     0
FirstName      0
LastName       0
SalesID        0
ProductID      0
ProductName    0
Quantity       0
dtype: int64

In [9]:
'''
#CASTEAMOS LOS IDs porque después nos daría error
data['CustomerID'] = data['CustomerID'].astype(str)
data['ProductID'] = data['ProductID'].astype(str)
data['SalesID'] = data['SalesID'].astype(str)
data.dtypes
'''
##### NO HACE FALTA SI SE HACE LA BÚSQUEDA SIN COMILLAS. #####
#similarities = distances['33'].sort_values(ascending=False)[1:6]  --> NO   distances['33']
#similarities = distances[33].sort_values(ascending=False)[1:6]    --> SI   distances[33]

"\n#CASTEAMOS LOS IDs porque después nos daría error\ndata['CustomerID'] = data['CustomerID'].astype(str)\ndata['ProductID'] = data['ProductID'].astype(str)\ndata['SalesID'] = data['SalesID'].astype(str)\ndata.dtypes\n"

## Step 1: Create a data frame that contains the total quantity of each product purchased by each customer.

You will need to group by CustomerID and ProductName and then sum the Quantity field.

In [10]:
dfcust = data.groupby(['CustomerID','ProductName']).sum().sort_values('CustomerID').drop(columns=['SalesID','ProductID'])
dfcust.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Quantity
CustomerID,ProductName,Unnamed: 2_level_1
33,Apricots - Dried,1
33,"Pepper - White, Ground",1
33,Phyllo Dough,1
33,"Pork - Bacon, Double Smoked",1
33,Pork - Hock And Feet Attached,1


## Step 2: Use the `pivot_table` method to create a product by customer matrix.

The rows of the matrix should represent the products, the columns should represent the customers, and the values should be the quantities of each product purchased by each customer. You will also need to replace nulls with zeros, which you can do using the `fillna` method.

In [11]:
matrix = pd.pivot_table(dfcust,values="Quantity",
                        index="ProductName",
                        columns="CustomerID",
                        aggfunc=np.sum,
                        fill_value=0)
print(matrix.shape)
matrix.head()

(452, 1000)


CustomerID,33,200,264,356,412,464,477,639,649,669,...,97697,97753,97769,97793,97900,97928,98069,98159,98185,98200
ProductName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Anchovy Paste - 56 G Tube,0,0,0,0,0,0,0,1,0,0,...,0,25,0,0,0,0,0,0,0,0
"Appetizer - Mini Egg Roll, Shrimp",0,0,0,0,0,0,0,0,0,0,...,25,25,0,0,0,0,0,0,0,0
Appetizer - Mushroom Tart,0,0,0,0,0,0,0,1,0,0,...,25,0,0,0,0,0,0,0,25,0
Appetizer - Sausage Rolls,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,25,25,25,0,25,0
Apricots - Dried,1,0,0,0,1,0,0,0,0,0,...,0,25,0,0,0,0,0,0,0,0


## Step 3: Create a customer similarity matrix using `squareform` and `pdist`. For the distance metric, choose "euclidean."

In [12]:
distances = pd.DataFrame(1/(1 + squareform(pdist(matrix.T, 'euclidean'))), 
                         index=matrix.columns, 
                         columns=matrix.columns)

distances.head()

CustomerID,33,200,264,356,412,464,477,639,649,669,...,97697,97753,97769,97793,97900,97928,98069,98159,98185,98200
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
33,1.0,0.077421,0.087047,0.0818,0.080634,0.082709,0.074573,0.08302,0.081503,0.08007,...,0.004811,0.004669,0.004412,0.005019,0.004312,0.004515,0.004583,0.004355,0.004167,0.004333
200,0.077421,1.0,0.078448,0.076435,0.073693,0.075255,0.075956,0.076435,0.077674,0.076923,...,0.004824,0.004681,0.004431,0.005047,0.004311,0.004521,0.004614,0.004367,0.004166,0.004335
264,0.087047,0.078448,1.0,0.08007,0.0818,0.08035,0.076923,0.080634,0.0821,0.078448,...,0.004822,0.004674,0.004416,0.005035,0.004322,0.004543,0.004595,0.004365,0.004179,0.004333
356,0.0818,0.076435,0.08007,1.0,0.076435,0.078187,0.075025,0.082403,0.077171,0.075956,...,0.004816,0.004671,0.004416,0.005038,0.00431,0.004526,0.004578,0.004365,0.004175,0.004339
412,0.080634,0.073693,0.0818,0.076435,1.0,0.078711,0.075025,0.082403,0.078187,0.078448,...,0.00481,0.004702,0.004414,0.005034,0.004318,0.00453,0.004578,0.004367,0.004177,0.004349


## Step 4: Check your results by generating a list of the top 5 most similar customers for a specific CustomerID.

In [13]:
def giveMe5(CustomerID):
    similarities = pd.DataFrame(distances[CustomerID].sort_values(ascending=False)[1:6])
    return similarities

In [14]:
customer = 63086
giveMe5(customer)

Unnamed: 0_level_0,63086
CustomerID,Unnamed: 1_level_1
8367,0.005865
12529,0.005865
15986,0.005859
15165,0.005853
25779,0.005836


## Step 5: From the data frame you created in Step 1, select the records for the list of similar CustomerIDs you obtained in Step 4.

In [15]:
dfcust.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Quantity
CustomerID,ProductName,Unnamed: 2_level_1
33,Apricots - Dried,1
33,"Pepper - White, Ground",1
33,Phyllo Dough,1
33,"Pork - Bacon, Double Smoked",1
33,Pork - Hock And Feet Attached,1


In [16]:
## Esto sirve para elimiar uno de los multiples index, ProductName ahora pasa a ser column.
## No es necesario para hacer el match con los productos pero es un atributo interesante.

topprod_2 = dfcust.reset_index(level=["ProductName"])
topprod_2.head()

Unnamed: 0_level_0,ProductName,Quantity
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1
33,Apricots - Dried,1
33,"Pepper - White, Ground",1
33,Phyllo Dough,1
33,"Pork - Bacon, Double Smoked",1
33,Pork - Hock And Feet Attached,1


In [17]:
#NO HACE FALTA lst=list(giveMe5(customer).index.values)
topprod = dfcust.loc[giveMe5(customer).index]
topprod


Unnamed: 0_level_0,Unnamed: 1_level_0,Quantity
CustomerID,ProductName,Unnamed: 2_level_1
8367,Sausage - Liver,3
8367,Steam Pan - Half Size Deep,3
8367,Sponge Cake Mix - Chocolate,3
8367,"Soup - Canadian Pea, Dry Mix",3
8367,Sobe - Tropical Energy,3
...,...,...
25779,Juice - Apple Cider,7
25779,Halibut - Fletches,7
25779,Grenadine,7
25779,Foam Dinner Plate,7


## Step 6: Aggregate those customer purchase records by ProductName, sum the Quantity field, and then rank them in descending order by quantity.

This will give you the total number of each product purchased by the 5 most similar customers to the customer you selected in order from most purchased to least.

In [18]:
topprod = topprod.groupby("ProductName").sum().sort_values("Quantity",ascending=False)
topprod.head(15)

Unnamed: 0_level_0,Quantity
ProductName,Unnamed: 1_level_1
Wine - Redchard Merritt,22
"Pepsi - Diet, 355 Ml",20
Sausage - Liver,19
Beef - Top Sirloin - Aaa,19
Oil - Shortening - All - Purpose,18
Sardines,18
"Cheese - Brie, Triple Creme",17
"Wine - Red, Colio Cabernet",17
"Pepper - White, Ground",16
Mayonnaise - Individual Pkg,16


## Step 7: Filter the list for products that the chosen customer has not yet purchased and then recommend the top 5 products with the highest quantities that are left.

- Merge the ranked products data frame with the customer product matrix on the ProductName field.
- Filter for records where the chosen customer has not purchased the product.
- Show the top 5 results.

In [19]:
custprod = matrix[matrix[customer] != 0]
custprod.head()

CustomerID,33,200,264,356,412,464,477,639,649,669,...,97697,97753,97769,97793,97900,97928,98069,98159,98185,98200
ProductName,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Apricots Fresh,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,25
Baking Powder,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Beans - Kidney, Canned",1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Beans - Kidney, Red Dry",0,0,0,0,0,0,0,0,1,0,...,25,0,0,0,0,25,0,0,0,0
Beef - Texas Style Burger,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,25,0,0,0,0,25


In [20]:
q=[]
for i in topprod.index.values:
    if i in custprod:
        print("SI "+i)
        q.append(i[customer])
    else:
        #print("NO "+i)
        q.append(0)


'''
if sum(q)!=0:
    topprod['CustomerQuantity']=q
    #Nos quedamos con los productos que aún no se han comprado
    topprod = topprod[topprod.q == 0]
    display(topprod.head(5).drop(columns=['Quantity','CustomerQuantity']))
else:
    #Ningún elemento de la lista de top productos se ha comprado"
    display(topprod.head(5).drop(columns="Quantity"))
'''

print("Conoces estos productos:")
topprod['CustomerQuantity']=q
#Nos quedamos con los productos que aún no se han comprado
topprod = topprod[topprod.CustomerQuantity == 0]
display(topprod.head(5).drop(columns=['Quantity','CustomerQuantity']))

recommended = topprod.head(5).drop(columns=['Quantity','CustomerQuantity'])
list(recommended.index)

Conoces estos productos:


Wine - Redchard Merritt
"Pepsi - Diet, 355 Ml"
Sausage - Liver
Beef - Top Sirloin - Aaa
Oil - Shortening - All - Purpose


['Wine - Redchard Merritt',
 'Pepsi - Diet, 355 Ml',
 'Sausage - Liver',
 'Beef - Top Sirloin - Aaa',
 'Oil - Shortening - All - Purpose']

## Step 8: Now that we have generated product recommendations for a single user, put the pieces together and iterate over a list of all CustomerIDs.

- Create an empty dictionary that will hold the recommendations for all customers.
- Create a list of unique CustomerIDs to iterate over.
- Iterate over the customer list performing steps 4 through 7 for each and appending the results of each iteration to the dictionary you created.

In [42]:
lst = []

for customer in data.CustomerID.drop_duplicates():
    #print(customer)
    topprod = dfcust.loc[giveMe5(customer).index]
    topprod = topprod.groupby("ProductName").sum().sort_values("Quantity",ascending=False)
    custprod = matrix[matrix[customer] != 0]
    q=[]
    for i in topprod.index.values:
        if i in custprod:
            #print("SI "+i)
            q.append(i[customer])
        else:
            #print("NO "+i)
            q.append(0)
    topprod['CustomerQuantity']=q
    topprod = topprod[topprod.CustomerQuantity == 0]
    topprod = topprod.head(5).drop(columns=['Quantity','CustomerQuantity'])
    dctone={}
    dctone["customer"]= customer
    dcttwo={}
    for ind,val in enumerate(list(topprod.index)):
        dcttwo[f'prod_{ind+1}']= val
    dctone["prod_reco"]=dcttwo
    lst.append(dctone)
print(lst[:5])

[{'customer': 33, 'prod_reco': {'prod_1': 'Butter - Unsalted', 'prod_2': 'Wine - Ej Gallo Sierra Valley', 'prod_3': 'Towels - Paper / Kraft', 'prod_4': 'Soup - Campbells Bean Medley', 'prod_5': 'Wine - Blue Nun Qualitatswein'}}, {'customer': 200, 'prod_reco': {'prod_1': 'Soup - Campbells Bean Medley', 'prod_2': 'Muffin - Carrot Individual Wrap', 'prod_3': 'Bay Leaf', 'prod_4': 'Pork - Kidney', 'prod_5': 'Pepper - Black, Whole'}}, {'customer': 264, 'prod_reco': {'prod_1': 'Soupfoamcont12oz 112con', 'prod_2': 'Wine - Two Oceans Cabernet', 'prod_3': 'Bread - Italian Roll With Herbs', 'prod_4': 'Veal - Inside, Choice', 'prod_5': 'Potatoes - Idaho 100 Count'}}, {'customer': 356, 'prod_reco': {'prod_1': 'Veal - Inside, Choice', 'prod_2': 'Wine - Ej Gallo Sierra Valley', 'prod_3': 'Lamb - Ground', 'prod_4': 'Wine - Blue Nun Qualitatswein', 'prod_5': 'Pomello'}}, {'customer': 412, 'prod_reco': {'prod_1': 'Olive - Spread Tapenade', 'prod_2': 'Sprouts - Baby Pea Tendrils', 'prod_3': 'Wine - Blue

##  Step 9: Store the results in a Pandas data frame. The data frame should a column for Customer ID and then a column for each of the 5 product recommendations for each customer.

In [38]:
dfreco = pd.DataFrame(lst)
dfreco.head()

Unnamed: 0,customer,prod_reco
0,33,"{'prod_1': 'Butter - Unsalted', 'prod_2': 'Win..."
1,200,"{'prod_1': 'Soup - Campbells Bean Medley', 'pr..."
2,264,"{'prod_1': 'Soupfoamcont12oz 112con', 'prod_2'..."
3,356,"{'prod_1': 'Veal - Inside, Choice', 'prod_2': ..."
4,412,"{'prod_1': 'Olive - Spread Tapenade', 'prod_2'..."


In [39]:
#dfreco = dfreco.explode('prod_reco')
dfexpand = dfreco[["prod_reco"]].apply(lambda r: r.prod_reco, result_type="expand", axis=1)
dfexpand.head()

Unnamed: 0,prod_1,prod_2,prod_3,prod_4,prod_5
0,Butter - Unsalted,Wine - Ej Gallo Sierra Valley,Towels - Paper / Kraft,Soup - Campbells Bean Medley,Wine - Blue Nun Qualitatswein
1,Soup - Campbells Bean Medley,Muffin - Carrot Individual Wrap,Bay Leaf,Pork - Kidney,"Pepper - Black, Whole"
2,Soupfoamcont12oz 112con,Wine - Two Oceans Cabernet,Bread - Italian Roll With Herbs,"Veal - Inside, Choice",Potatoes - Idaho 100 Count
3,"Veal - Inside, Choice",Wine - Ej Gallo Sierra Valley,Lamb - Ground,Wine - Blue Nun Qualitatswein,Pomello
4,Olive - Spread Tapenade,Sprouts - Baby Pea Tendrils,Wine - Blue Nun Qualitatswein,"Veal - Inside, Choice","Pepper - Black, Whole"


In [40]:
dfreco = pd.concat([dfreco,dfexpand], axis=1)
dfreco = dfreco.drop(columns="prod_reco")

In [41]:
dfreco.head()

Unnamed: 0,customer,prod_1,prod_2,prod_3,prod_4,prod_5
0,33,Butter - Unsalted,Wine - Ej Gallo Sierra Valley,Towels - Paper / Kraft,Soup - Campbells Bean Medley,Wine - Blue Nun Qualitatswein
1,200,Soup - Campbells Bean Medley,Muffin - Carrot Individual Wrap,Bay Leaf,Pork - Kidney,"Pepper - Black, Whole"
2,264,Soupfoamcont12oz 112con,Wine - Two Oceans Cabernet,Bread - Italian Roll With Herbs,"Veal - Inside, Choice",Potatoes - Idaho 100 Count
3,356,"Veal - Inside, Choice",Wine - Ej Gallo Sierra Valley,Lamb - Ground,Wine - Blue Nun Qualitatswein,Pomello
4,412,Olive - Spread Tapenade,Sprouts - Baby Pea Tendrils,Wine - Blue Nun Qualitatswein,"Veal - Inside, Choice","Pepper - Black, Whole"


In [44]:
#Comprobamos que funciona
dfreco[dfreco.customer==63086]

Unnamed: 0,customer,prod_1,prod_2,prod_3,prod_4,prod_5
646,63086,Wine - Redchard Merritt,"Pepsi - Diet, 355 Ml",Sausage - Liver,Beef - Top Sirloin - Aaa,Oil - Shortening - All - Purpose


## Step 10: Change the distance metric used in Step 3 to something other than euclidean (correlation, cityblock, cosine, jaccard, etc.). Regenerate the recommendations for all customers and note the differences.

In [48]:
def giveMeAll(data,distance):
    dfcust = data.groupby(['CustomerID','ProductName']).sum().sort_values('CustomerID').drop(columns=['SalesID','ProductID'])
    matrix = pd.pivot_table(dfcust,values="Quantity",
                        index="ProductName",
                        columns="CustomerID",
                        aggfunc=np.sum,
                        fill_value=0)
    distances = pd.DataFrame(1/(1 + squareform(pdist(matrix.T, distance))), 
                         index=matrix.columns, 
                         columns=matrix.columns)
    lst = []
    for customer in data.CustomerID.drop_duplicates():
        #print(customer)
        topprod = dfcust.loc[giveMe5(customer).index]
        topprod = topprod.groupby("ProductName").sum().sort_values("Quantity",ascending=False)
        custprod = matrix[matrix[customer] != 0]
        q=[]
        for i in topprod.index.values:
            if i in custprod:
                #print("SI "+i)
                q.append(i[customer])
            else:
                #print("NO "+i)
                q.append(0)
        topprod['CustomerQuantity']=q
        topprod = topprod[topprod.CustomerQuantity == 0]
        topprod = topprod.head(5).drop(columns=['Quantity','CustomerQuantity'])
        dctone={}
        dctone["customer"]= customer
        dcttwo={}
        for ind,val in enumerate(list(topprod.index)):
            dcttwo[f'prod_{ind+1}']= val
        dctone["prod_reco"]=dcttwo
        lst.append(dctone)
    dfreco = pd.DataFrame(lst)
    dfexpand = dfreco[["prod_reco"]].apply(lambda r: r.prod_reco, result_type="expand", axis=1)
    dfreco = pd.concat([dfreco,dfexpand], axis=1)
    dfreco = dfreco.drop(columns="prod_reco")
    
    return dfreco

In [47]:
giveMeAll(data,"euclidean")

Unnamed: 0,customer,prod_1,prod_2,prod_3,prod_4,prod_5
0,33,Butter - Unsalted,Wine - Ej Gallo Sierra Valley,Towels - Paper / Kraft,Soup - Campbells Bean Medley,Wine - Blue Nun Qualitatswein
1,200,Soup - Campbells Bean Medley,Muffin - Carrot Individual Wrap,Bay Leaf,Pork - Kidney,"Pepper - Black, Whole"
2,264,Soupfoamcont12oz 112con,Wine - Two Oceans Cabernet,Bread - Italian Roll With Herbs,"Veal - Inside, Choice",Potatoes - Idaho 100 Count
3,356,"Veal - Inside, Choice",Wine - Ej Gallo Sierra Valley,Lamb - Ground,Wine - Blue Nun Qualitatswein,Pomello
4,412,Olive - Spread Tapenade,Sprouts - Baby Pea Tendrils,Wine - Blue Nun Qualitatswein,"Veal - Inside, Choice","Pepper - Black, Whole"
...,...,...,...,...,...,...
995,97928,Bouq All Italian - Primerba,Tea - Jasmin Green,"Soup - Campbells, Lentil",Arizona - Green Tea,"Cheese - Brie,danish"
996,98069,Skirt - 29 Foot,Chocolate - Dark,Beans - Kidney White,Milk - 1%,Longos - Grilled Salmon With Bbq
997,98159,Chips Potato All Dressed - 43g,Table Cloth 81x81 White,"Lamb - Whole, Fresh",Pernod,"Ice - Clear, 300 Lb For Carving"
998,98185,Halibut - Steaks,Cod - Black Whole Fillet,Wine - Pinot Noir Latour,Chicken - Wieners,Crackers - Trio


In [73]:
def giveMeAll(data):
    
    dfcust = data.groupby(['CustomerID','ProductName']).sum().sort_values('CustomerID').drop(columns=['SalesID','ProductID'])
    matrix = pd.pivot_table(dfcust,values="Quantity",
                        index="ProductName",
                        columns="CustomerID",
                        aggfunc=np.sum,
                        fill_value=0)
    lst = []
    distance=['euclidean','correlation', 'cityblock', 'cosine', 'jaccard']
    for dist in distance:
        distances = pd.DataFrame(1/(1 + squareform(pdist(matrix.T, dist))), 
                             index=matrix.columns, 
                             columns=matrix.columns)
        
        for customer in data.CustomerID.drop_duplicates():
            #print(customer)
            similarities = pd.DataFrame(distances[customer].sort_values(ascending=False)[1:6])
            topprod = dfcust.loc[similarities.index]
            topprod = topprod.groupby("ProductName").sum().sort_values("Quantity",ascending=False)
            custprod = matrix[matrix[customer] != 0]
            
            q=[]
            for i in topprod.index.values:
                if i in custprod:
                    #print("SI "+i)
                    q.append(i[customer])
                else:
                    #print("NO "+i)
                    q.append(0)
                    
            topprod['CustomerQuantity']=q
            topprod = topprod[topprod.CustomerQuantity == 0]
            topprod = topprod.head(5).drop(columns=['Quantity','CustomerQuantity'])
            
            dctone={}
            dctone["customer"]= customer
            dctone["distance"]= dist
            dcttwo={}
            for ind,val in enumerate(list(topprod.index)):
                dcttwo[f'prod_{ind+1}']= val
            dctone["prod_reco"]=dcttwo
            
            lst.append(dctone)
            
    dfreco = pd.DataFrame(lst)
    dfexpand = dfreco[["prod_reco"]].apply(lambda r: r.prod_reco, result_type="expand", axis=1)
    dfreco = pd.concat([dfreco,dfexpand], axis=1)
    dfreco = dfreco.drop(columns="prod_reco")

    return dfreco

In [74]:
df = giveMeAll(data)

In [79]:
#Comprobamos que funciona y vemos las diferencias.
df63086 = df[df.customer==63086]
df63086

Unnamed: 0,customer,distance,prod_1,prod_2,prod_3,prod_4,prod_5
646,63086,euclidean,Wine - Redchard Merritt,"Pepsi - Diet, 355 Ml",Sausage - Liver,Beef - Top Sirloin - Aaa,Oil - Shortening - All - Purpose
1646,63086,correlation,Ecolab - Mikroklene 4/4 L,Foam Dinner Plate,Meldea Green Tea Liquor,Flavouring - Orange,Mussels - Frozen
2646,63086,cityblock,"Bar Mix - Pina Colada, 355 Ml","Salsify, Organic",Pork - Kidney,Ice Cream Bar - Hageen Daz To,Black Currants
3646,63086,cosine,Flavouring - Orange,Ecolab - Mikroklene 4/4 L,Pecan Raisin - Tarts,Mangoes,"Wine - Red, Cooking"
4646,63086,jaccard,Mangoes,Flavouring - Orange,Cumin - Whole,Cheese - Parmesan Cubes,Cheese - Cottage Cheese


In [None]:
#Cada distancia devuelve un resultado diferente.