<a href="https://colab.research.google.com/github/karinboc/Recommenders/blob/main/Basic_Recom_Karin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### As part of my recommender system project, I implemented a few different approaches:
* **Heuristics**:
  * Recommend to each user the product they bought most often (accuracy: 31.24%)
  * Recommend to each user the product they bought most, taking quantity into account (accuracy: 29.32%)
* **Matrix factorization**
* **Microsoft's SotA algorithms (sliRec, Caser)**



In [None]:
!git clone https://github.com/urigoren/recom-day-2020-challenge.git

Cloning into 'recom-day-2020-challenge'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 12 (delta 3), reused 5 (delta 2), pack-reused 0
Unpacking objects: 100% (12/12), done.


In [None]:
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from scipy import sparse
#import jovian
import pandas as pd
import urllib.request
from functools import partial
#import sys
#sys.path.insert(0, './recom-day-2020-challenge')
%cd recom-day-2020-challenge
import argmaxml

/content/recom-day-2020-challenge/recom-day-2020-challenge


Add your full name, email, and phone number.
We will contact you with a free ticket if you win.

In [None]:
# Fill in your full name, email and phone number
my_submit = partial(argmaxml.submit, "Karin", "karinboc@gmail.com", "052")


## Reading the data

In [None]:
df = pd.read_csv("jul_train.csv")
test_users = df.user_id.unique()

## Creating a submission


## Evaluation 1 

##### Recommend to each user the product they bought most often, regardless of the quantity bought each time.

In [None]:
test_submission = df.groupby("user_id")["product_id"]\
                    .apply(lambda x: x.value_counts().idxmax())
test_submission

user_id
1       1621
2       2879
3       1978
4       3917
5       2196
        ... 
4407    3277
4408     564
4409    3917
4412     985
4414    1188
Name: product_id, Length: 2609, dtype: int64

In [None]:
submission_name = "already_bought_best"
my_accuracy = my_submit(submission_name, test_submission)
print ("Submission Accuracy for {s} : {a:0.2f}%".format(a=100*my_accuracy, s=submission_name))

Submission Accuracy for already_bought_best : 31.24%
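
To make the heuristic concrete, here is a minimal sketch on a hypothetical toy frame. The `toy` data is made up for illustration and is not taken from `jul_train.csv`; only the column names match the challenge data.

```python
import pandas as pd

# Hypothetical toy purchases, not taken from the challenge data.
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 2],
    "product_id": [10, 10, 20, 30, 40, 40],
    "quantity":   [1, 1, 9, 5, 1, 1],
})

# For each user, pick the product that appears in the most order rows,
# ignoring how many units were bought in each row.
most_frequent = toy.groupby("user_id")["product_id"].apply(
    lambda s: s.value_counts().idxmax()
)
print(most_frequent)
```

User 1 has two rows for product 10 and one row for product 20, so product 10 is chosen even though product 20 has the larger total quantity.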


##### Select each user's two most frequent products, then sample one of them at random.

In [None]:
test_submission = df.groupby("user_id")["product_id"]\
                    .apply(lambda x: x.value_counts().nlargest(2).sample(1).idxmax())
test_submission

user_id
1       1621
2       2879
3       1978
4       2686
5       2196
        ... 
4407    3277
4408     564
4409     167
4412     985
4414    4112
Name: product_id, Length: 2609, dtype: int64

In [None]:
submission_name = "already_bought_best"
my_accuracy = my_submit(submission_name, test_submission)
print ("Submission Accuracy for {s} : {a:0.2f}%".format(a=100*my_accuracy, s=submission_name))

Submission Accuracy for already_bought_best : 26.41%
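
The sampling step is not seeded in the cell above, so each run can produce a different submission. A minimal sketch of a reproducible variant follows; the helper name `sample_from_top_k` and its `seed` parameter are hypothetical and introduced only for illustration.

```python
import pandas as pd

# Hypothetical toy purchases for a single user.
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 1, 1],
    "product_id": [10, 10, 10, 20, 20, 30],
})

def sample_from_top_k(products, k=2, seed=0):
    """Return one product drawn uniformly from the k most frequent ones."""
    top_k = products.value_counts().nlargest(k)  # index = product ids, values = counts
    return top_k.sample(1, random_state=seed).index[0]

picked = toy.groupby("user_id")["product_id"].apply(sample_from_top_k)
print(picked)
```

With `k=2`, product 30 can never be picked for this user, because only products 10 and 20 are among the two most frequent.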


## Evaluation 2 

##### Recommend to each user the product that he bought most, taking into account the quantity.  

In [None]:
test_submission = df.groupby("user_id")[["product_id", "quantity"]]\
                    .apply(lambda x: x.groupby("product_id")["quantity"].sum().idxmax())
test_submission


user_id
1       3917
2       2879
3       3917
4       3917
5        167
        ... 
4407    3277
4408     167
4409     167
4412      31
4414    3165
Length: 2609, dtype: int64

In [None]:
submission_name = "already_bought_best"
my_accuracy = my_submit(submission_name, test_submission)
print ("Submission Accuracy for {s} : {a:0.2f}%".format(a=100*my_accuracy, s=submission_name))

Submission Accuracy for already_bought_best : 29.32%
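
The difference between counting rows and summing quantities is easiest to see on a small example. The sketch below uses a hypothetical toy frame (not the challenge data) and an aggregate-then-`idxmax` idiom as one possible way to express the quantity-based heuristic.

```python
import pandas as pd

# Hypothetical toy purchases, not taken from the challenge data.
toy = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 2],
    "product_id": [10, 10, 20, 30, 40, 40],
    "quantity":   [1, 1, 9, 5, 1, 1],
})

# Total units bought per (user, product) pair.
totals = toy.groupby(["user_id", "product_id"], as_index=False)["quantity"].sum()

# For each user, keep the row with the largest total quantity.
best_rows = totals.loc[totals.groupby("user_id")["quantity"].idxmax()]
by_quantity = best_rows.set_index("user_id")["product_id"]
print(by_quantity)
```

Here user 1 bought product 10 in two separate rows but product 20 in a larger total quantity, so the quantity-based rule picks product 20 where a row count would pick product 10.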







## Evaluation 3 - Matrix Factorization

In [None]:
df.head()

Unnamed: 0,user_id,segment_id,order_id,quantity,product_id,category_id,category_name,order_year,order_month,order_day
0,2832,4,26369,4,1906,1,תינוקות וילדים,2018,10,28
1,2832,4,26369,2,3029,1,תינוקות וילדים,2018,10,28
2,2230,4,21747,2,3298,1,תינוקות וילדים,2018,10,31
3,2230,4,21747,1,1815,1,תינוקות וילדים,2018,10,31
4,2908,4,33134,1,3666,1,תינוקות וילדים,2018,10,31


## Train-Valid Split

In [None]:
# Total quantity per (user, product) pair; the result is indexed by (user_id, product_id)
df_totals = df.groupby(["user_id", "product_id"])[["quantity"]].sum()


In [None]:
train_df, valid_df = train_test_split(df_totals.copy(), test_size=0.02)

#resetting indices to avoid indexing errors in the future
train_df = train_df.reset_index()[['user_id', 'product_id', 'quantity']]
valid_df = valid_df.reset_index()[['user_id', 'product_id', 'quantity']]

In [None]:
valid_df.shape
train_df.shape

(81137, 3)

## Training

### Encoding columns with contiguous IDs
Because the user and product embeddings are stored as rows of NumPy matrices, each user and product needs a contiguous integer ID (0 to n-1) that indexes directly into the embedding matrix and selects its embedding.

In [None]:
def encode_column(column):
    """Encodes a pandas column with contiguous IDs (0..n-1, in order of first appearance)."""
    keys = column.unique()
    key_to_id = {key:idx for idx,key in enumerate(keys)}
    id_to_key = {idx:key for idx,key in enumerate(keys)}
    return key_to_id, id_to_key, np.array([key_to_id[x] for x in column]), len(keys)

def encode_df(df):
    """Encodes quantity data with contiguous user and product IDs (modifies df in place)."""

    prod_ids, id_prods, df['product_id'], num_prod = encode_column(df['product_id'])
    user_ids, id_users, df['user_id'], num_users = encode_column(df['user_id'])
    return df, num_users, num_prod, user_ids, prod_ids, id_prods, id_users
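
As a cross-check of the encoding idea, pandas ships `pd.factorize`, which assigns contiguous integer codes in order of first appearance. The sketch below is an illustration on hypothetical IDs, not a replacement for `encode_column`.

```python
import pandas as pd

# Hypothetical raw IDs with repeats.
raw_ids = pd.Series([3373, 1430, 3373, 3893, 1430])

# codes[i] is the contiguous ID of raw_ids[i];
# uniques[code] maps a contiguous ID back to the raw ID.
codes, uniques = pd.factorize(raw_ids)

print(list(codes))    # contiguous IDs
print(list(uniques))  # raw ID for each contiguous ID
```

Indexing `uniques` with `codes` recovers the original column, which is the same round trip the `key_to_id` and `id_to_key` dictionaries provide.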



In [None]:
dfp, num_users, num_prod, user_ids, prod_ids, id_prods, id_users = encode_df(train_df)
print("Number of users :", num_users)
print("Number of prods :", num_prod)
dfp.head()

Number of users : 2609
Number of prods : 4227


Unnamed: 0,user_id,product_id,quantity
0,0,0,23
1,1,1,2
2,2,2,1
3,3,3,2
4,4,4,2


#### Initializing user and item embeddings

In [None]:
def create_embeddings(n, d):
    """
    Creates a random numpy matrix of shape n, d with uniform values in (0, 11/d)
    n: number of items/users
    d: number of features in the embedding 
    """
    return 11*np.random.random((n, d)) / d

def create_sparse_matrix(df, rows, cols, column_name="quantity"): # we treat quantity as rating of a product
    """ Returns a sparse utility matrix""" 
    return sparse.csc_matrix((df[column_name].values,(df['user_id'].values, df['product_id'].values)),shape=(rows, cols))
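
The constructor call above takes the data as (value, (row, column)) triplets. A small sketch with hypothetical values shows the layout it produces:

```python
import numpy as np
from scipy import sparse

# Hypothetical (user, product, quantity) triplets.
users    = np.array([0, 0, 1, 2])
products = np.array([0, 2, 1, 2])
quantity = np.array([23, 1, 2, 5])

# 3 users x 4 products; entries that are not listed stay implicit zeros.
Y_toy = sparse.csc_matrix((quantity, (users, products)), shape=(3, 4))
dense = Y_toy.toarray()
print(dense)
```

Only the four listed entries are stored, so memory grows with the number of observed purchases rather than with users times products.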

In [None]:
# Y is a sparse matrix whose values are the quantities; rows are indexed by user ID and columns by product ID
Y = create_sparse_matrix(dfp, num_users, num_prod)


In [None]:
Y.todense()

matrix([[23,  0,  0, ...,  0,  0,  0],
        [ 2,  2,  0, ...,  0,  0,  0],
        [ 0,  0,  1, ...,  0,  0,  0],
        ...,
        [ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ...,  0,  0,  0]], dtype=int64)

##### Making Predictions

In [None]:
def predict(df, emb_user, emb_prod):
    """ This function computes df["prediction"] without doing (U*V^T).
    
    Computes df["prediction"] by using elementwise multiplication of the corresponding embeddings and then 
    sum to get the prediction u_i*v_j. This avoids creating the dense matrix U*V^T.
    """
    df['prediction'] = np.sum(np.multiply(emb_prod[df['product_id']],emb_user[df['user_id']]), axis=1)
    return df
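
The claim in the docstring can be checked numerically: the row-wise multiply-and-sum gives the same numbers as reading the corresponding entries out of the full product. The sketch uses small random matrices with hypothetical sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.random((3, 4))  # 3 users, 4 latent features
V = rng.random((5, 4))  # 5 products, 4 latent features

# Observed (user, product) pairs, as they would appear in the DataFrame.
user_idx = np.array([0, 2, 2])
prod_idx = np.array([1, 0, 4])

# Row-wise dot products for the observed pairs only.
pairwise = np.sum(U[user_idx] * V[prod_idx], axis=1)

# The same values read from the full dense product, which predict() avoids building.
full = (U @ V.T)[user_idx, prod_idx]

print(np.allclose(pairwise, full))
```

The pairwise form touches one row of each matrix per observed pair, while the dense product computes every user-product score whether or not it is needed.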

##### Loss

In [None]:
lmbda = 0.0002

In [None]:
def cost(df, emb_user, emb_prod):
    """ Computes mean square error"""
    Y = create_sparse_matrix(df, emb_user.shape[0], emb_prod.shape[0])
    predicted = create_sparse_matrix(predict(df, emb_user, emb_prod), emb_user.shape[0], emb_prod.shape[0], 'prediction')
    return np.sum((Y-predicted).power(2))/df.shape[0] 
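
Read together with the `2*lmbda` terms in the gradient code, the objective being minimized appears to be a regularized squared error. This is inferred from the code rather than stated in the notebook, and the symbols are notation introduced here: $\Omega$ is the set of observed user-product pairs, $N = |\Omega|$, and $u_i$, $v_j$ are rows of the user and product embedding matrices $U$ and $V$.

$$
L(U, V) = \frac{1}{N} \sum_{(i,j) \in \Omega} \left(y_{ij} - u_i^\top v_j\right)^2 + \lambda \left(\lVert U \rVert_F^2 + \lVert V \rVert_F^2\right)
$$

$$
\frac{\partial L}{\partial u_i} = -\frac{2}{N} \sum_{j : (i,j) \in \Omega} \left(y_{ij} - u_i^\top v_j\right) v_j + 2\lambda u_i,
\qquad
\frac{\partial L}{\partial v_j} = -\frac{2}{N} \sum_{i : (i,j) \in \Omega} \left(y_{ij} - u_i^\top v_j\right) u_i + 2\lambda v_j
$$

Note that `cost` reports only the first term, the mean squared error over observed pairs.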

##### Gradient Descent

In [None]:
def gradient(df, emb_user, emb_prod):
    """ Computes the gradient for user and product embeddings"""
    Y = create_sparse_matrix(df, emb_user.shape[0], emb_prod.shape[0])
    predicted = create_sparse_matrix(predict(df, emb_user, emb_prod), emb_user.shape[0], emb_prod.shape[0], 'prediction')
    delta =(Y-predicted)
    grad_user = (-2/df.shape[0])*(delta*emb_prod) + 2*lmbda*emb_user
    grad_prod = (-2/df.shape[0])*(delta.T*emb_user) + 2*lmbda*emb_prod
    return grad_user, grad_prod
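
One way to gain confidence in a hand-written gradient is to compare it with a finite-difference estimate. The sketch below does this on small dense matrices; the `loss` function, the `mask` array, and the helper names are assumptions made for the illustration (the notebook works with sparse matrices and its `cost` omits the regularization term).

```python
import numpy as np

def loss(U, V, Y, mask, lam):
    """Regularized MSE over observed entries (mask == 1). Assumed objective."""
    n_obs = mask.sum()
    err = mask * (Y - U @ V.T)
    return (err ** 2).sum() / n_obs + lam * ((U ** 2).sum() + (V ** 2).sum())

def grad_U(U, V, Y, mask, lam):
    """Analytic gradient w.r.t. U, same form as grad_user above."""
    n_obs = mask.sum()
    delta = mask * (Y - U @ V.T)
    return (-2 / n_obs) * (delta @ V) + 2 * lam * U

rng = np.random.default_rng(0)
U, V = rng.random((3, 2)), rng.random((4, 2))
Y = rng.integers(1, 5, size=(3, 4)).astype(float)
mask = (rng.random((3, 4)) < 0.6).astype(float)
mask[0, 0] = 1.0  # make sure at least one entry is observed
lam, eps = 0.0002, 1e-6

# Central finite difference for a single entry of U.
U_plus, U_minus = U.copy(), U.copy()
U_plus[1, 0] += eps
U_minus[1, 0] -= eps
numeric = (loss(U_plus, V, Y, mask, lam) - loss(U_minus, V, Y, mask, lam)) / (2 * eps)
analytic = grad_U(U, V, Y, mask, lam)[1, 0]

print(abs(numeric - analytic))
```

If the two numbers agree to several decimal places, the analytic formula is consistent with the assumed loss; a large gap usually points to a sign or scaling error.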

In [None]:
def gradient_descent(df, emb_user, emb_prod, iterations=3000, learning_rate=0.01):
    """ 
    Computes gradient descent with momentum (0.9) for given number of iterations.
    emb_user: the trained user embedding
    emb_prod: the trained prod embedding
    """
    Y = create_sparse_matrix(df, emb_user.shape[0], emb_prod.shape[0])
    beta = 0.9
    grad_user, grad_prod = gradient(df, emb_user, emb_prod)
    v_user = grad_user
    v_prod = grad_prod
    for i in range(iterations):
        grad_user, grad_prod = gradient(df, emb_user, emb_prod)
        v_user = beta*v_user + (1-beta)*grad_user
        v_prod = beta*v_prod + (1-beta)*grad_prod
        emb_user = emb_user - learning_rate*v_user
        emb_prod = emb_prod - learning_rate*v_prod
        if (i + 1) % 50 == 0:
            print("\niteration", i+1, ":")
            print("train mse:",  cost(df, emb_user, emb_prod))           
    return emb_user, emb_prod
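
The update rule in the loop above keeps an exponential moving average of past gradients and steps along that average. A minimal sketch of the same rule on a toy quadratic makes its behavior easier to inspect; `momentum_descent` and the toy objective are hypothetical and only mirror the structure of `gradient_descent`.

```python
import numpy as np

def momentum_descent(grad_fn, x0, iterations=500, learning_rate=0.05, beta=0.9):
    """Gradient descent with an exponential moving average of the gradients."""
    x = np.asarray(x0, dtype=float)
    v = grad_fn(x)                     # velocity starts at the first gradient
    for _ in range(iterations):
        g = grad_fn(x)
        v = beta * v + (1 - beta) * g  # smooth the gradient
        x = x - learning_rate * v      # step along the smoothed direction
    return x

# Toy objective f(x) = ||x - target||^2, whose gradient is 2 * (x - target).
target = np.array([3.0, -1.0])
x_min = momentum_descent(lambda x: 2 * (x - target), x0=[0.0, 0.0])
print(x_min)
```

On this well-conditioned problem the iterate settles at `target`; on the factorization problem the loss surface is non-convex, so the same rule gives no such guarantee.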

In [None]:
emb_user = create_embeddings(num_users, 127)
emb_prod = create_embeddings(num_prod, 127)
emb_user, emb_prod = gradient_descent(dfp, emb_user, emb_prod, iterations=20000, learning_rate=0.05)


iteration 50 :
train mse: 90.8775990778375

iteration 100 :
train mse: 90.28570645710681

iteration 150 :
train mse: 89.6128608640258

iteration 200 :
train mse: 88.83514477676887

iteration 250 :
train mse: 87.92731733147404

iteration 300 :
train mse: 86.8629591590736

iteration 350 :
train mse: 85.6150905695193

iteration 400 :
train mse: 84.15739626854247

iteration 450 :
train mse: 82.4661895146431

iteration 500 :
train mse: 80.52319617095421

iteration 550 :
train mse: 78.31910999005892

iteration 600 :
train mse: 75.85764370199192

iteration 650 :
train mse: 73.15948270607429

iteration 700 :
train mse: 70.2652027684358

iteration 750 :
train mse: 67.23598130477289

iteration 800 :
train mse: 64.15101470445151

iteration 850 :
train mse: 61.101124412196604

iteration 900 :
train mse: 58.17908603049376

iteration 950 :
train mse: 55.46844043405121

iteration 1000 :
train mse: 53.03338290803372

iteration 1050 :
train mse: 50.91225305749329

iteration 1100 :
train mse: 49.116061

#### Making Predictions

In [None]:
def encode_new_data(valid_df, user_ids, prod_ids):
    """ Encodes valid_df with the same encoding as train_df.
    """
    df_val_chosen = valid_df['product_id'].isin(prod_ids.keys()) & valid_df['user_id'].isin(user_ids.keys())
    valid_df = valid_df[df_val_chosen].copy()  # copy so the assignments below do not write into a slice
    valid_df['product_id'] =  np.array([prod_ids[x] for x in valid_df['product_id']])
    valid_df['user_id'] = np.array([user_ids[x] for x in valid_df['user_id']])
    return valid_df
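
The filtering step matters because a user or product that never appeared in training has no row in the embedding matrices. A small sketch with hypothetical mappings and data shows rows with unseen IDs being dropped before encoding:

```python
import pandas as pd

# Hypothetical raw ID -> contiguous ID mappings learned from training data.
user_ids = {3373: 0, 1430: 1}
prod_ids = {1978: 0, 3347: 1}

# Hypothetical new rows; user 9999 was never seen during training.
new_df = pd.DataFrame({
    "user_id":    [3373, 1430, 9999],
    "product_id": [3347, 1978, 1978],
    "quantity":   [2, 1, 4],
})

# Keep only rows whose user and product both have an embedding.
known = new_df["user_id"].isin(list(user_ids)) & new_df["product_id"].isin(list(prod_ids))
encoded = new_df[known].copy()
encoded["user_id"] = encoded["user_id"].map(user_ids)
encoded["product_id"] = encoded["product_id"].map(prod_ids)
print(encoded)
```

Rows for unseen users are silently dropped here, so such users would need a fallback (for example, one of the heuristics) to receive a recommendation.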

In [None]:
print("before encoding:", dfp.shape)
dfp_original_idx = encode_new_data(dfp, id_users, id_prods)
print("after encoding:", dfp_original_idx.shape)

before encoding: (81137, 4)
after encoding: (81137, 4)


In [None]:
dfp_original_idx

Unnamed: 0,user_id,product_id,quantity,prediction
0,3373,1978,23,28.505148
1,1430,3347,2,1.612112
2,3893,4103,1,0.359289
3,1872,1351,2,2.618326
4,2060,2738,2,1.792436
...,...,...,...,...
81132,1778,3895,3,1.498406
81133,3035,1794,2,1.379310
81134,3406,3302,7,2.488433
81135,2733,126,1,1.027973


In [None]:
test_submission = dfp_original_idx.groupby(['user_id']).apply(lambda x: int(x.loc[x.prediction.idxmax()]['product_id'] ))             

In [None]:
submission_name = "learn_quants"
my_accuracy = my_submit(submission_name, test_submission)
print ("Submission Accuracy for {s} : {a:0.2f}%".format(a=100*my_accuracy, s=submission_name))

Submission Accuracy for learn_quants : 25.68%


In [None]:
# print("before encoding:", valid_df.shape)
# valid_df = encode_new_data(valid_df, user_ids, prod_ids)
# print("after encoding:", valid_df.shape)


In [None]:
# train_mse = cost(train_df, emb_user, emb_prod)
# val_mse = cost(valid_df, emb_user, emb_prod)
# print(train_mse, val_mse)

## See how you compare:

Visit http://leaderboard.argmax.ml/jul

In [None]:
valid_df.head()

Unnamed: 0,user_id,product_id,quantity,prediction
0,363,83,1,0.699034
1,549,2103,1,2.162716
2,551,654,1,1.567627
3,830,1458,1,2.949503
4,1021,1454,1,3.194179


In [None]:
test_submission = df.groupby("user_id")["prediction"]\
                    .apply(lambda x: list(x)[0]).to_dict()
                    

In [None]:
test_submission = dfp_original_idx.groupby(['user_id']).apply(lambda x: int(x.loc[x.prediction.idxmax()]['product_id'] ))             

In [None]:
submission_name = "learn_quants"
my_accuracy = my_submit(submission_name, test_submission)
print ("Submission Accuracy for {s} : {a:0.2f}%".format(a=100*my_accuracy, s=submission_name))

Submission Accuracy for learn_quants : 19.36%


## Evaluation 4
### Microsoft Convolutional Sequence Embedding Recommendation (Caser)