In this notebook we will use RankFM.
First, we will install RankFM.
Second, we will download the Instacart online grocery shopping dataset from Kaggle.
Third, we will create a factorization machine model built on RankFM and compare its results to a popularity baseline.

# Installation and imports

Install RankFM

In [54]:
!pip install rankfm==0.2.5



In [55]:
import os
import numpy as np
import pandas as pd
from rankfm.rankfm import RankFM

# Get Dataset

In [3]:
! pip install -q kaggle

In [None]:
from google.colab import files

files.upload()

In [12]:
! mkdir -p ~/.kaggle

! cp kaggle.json ~/.kaggle/

In [13]:
 ! chmod 600 ~/.kaggle/kaggle.json

In [16]:
!kaggle datasets download -d psparks/instacart-market-basket-analysis

Downloading instacart-market-basket-analysis.zip to /content
 99% 196M/197M [00:02<00:00, 116MB/s] 
100% 197M/197M [00:02<00:00, 95.4MB/s]


In [20]:
!unzip /content/instacart-market-basket-analysis.zip -d ./instacart_online_grocery_shopping


Archive:  /content/instacart-market-basket-analysis.zip
  inflating: ./instacart_online_grocery_shopping/aisles.csv  
  inflating: ./instacart_online_grocery_shopping/departments.csv  
  inflating: ./instacart_online_grocery_shopping/order_products__prior.csv  
  inflating: ./instacart_online_grocery_shopping/order_products__train.csv  
  inflating: ./instacart_online_grocery_shopping/orders.csv  
  inflating: ./instacart_online_grocery_shopping/products.csv  


In [21]:
data_path = '/content/instacart_online_grocery_shopping'

# Create Merged Interaction dataset

In [22]:
orders_cols = ['order_id', 'user_id']
order_products_cols = ['order_id', 'product_id']
interaction_cols = ['user_id', 'product_id', 'order_id']

#Load Products data
products_dtypes = {
    'product_id': np.int32, 
    'product_name': str, 
    'aisle_id': np.uint8, 
    'department_id': np.uint8
}

products_df = pd.read_csv(os.path.join(data_path, 'products.csv'), dtype=products_dtypes)

#Load orders data
orders_dtypes = {
    'order_id': np.int32, 
    'user_id': np.int32, 
    'eval_set': str, 
    'order_number': np.uint8, 
    'order_dow': np.uint8, 
    'order_hour_of_day': np.uint8, 
    'days_since_prior_order': np.float32
}

orders_df = pd.read_csv(os.path.join(data_path, 'orders.csv'), dtype=orders_dtypes)

#load orders products data
order_product_dtypes = {
    'order_id': np.int32, 
    'product_id': np.int32, 
    'add_to_cart_order': np.uint8,
    'reordered': np.uint8
}

order_products_df = pd.read_csv(os.path.join(data_path, 'order_products__prior.csv'), dtype=order_product_dtypes)


interactions = pd.merge(orders_df[orders_cols], order_products_df[order_products_cols], on='order_id', how='inner')
interactions = interactions[interaction_cols]



In [23]:
interactions.info()
interactions.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32434489 entries, 0 to 32434488
Data columns (total 3 columns):
 #   Column      Dtype
---  ------      -----
 0   user_id     int32
 1   product_id  int32
 2   order_id    int32
dtypes: int32(3)
memory usage: 618.6 MB


Unnamed: 0,user_id,product_id,order_id
0,1,196,2539329
1,1,14084,2539329
2,1,12427,2539329
3,1,26088,2539329
4,1,26405,2539329
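
The inner merge above attaches a `user_id` to every order line. As a minimal sketch of what that join does, here is the same pattern on two toy frames. All IDs below are made up for illustration and do not come from the Instacart files.

```python
import pandas as pd

# Toy stand-ins for orders.csv and order_products__prior.csv (made-up values).
toy_orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 20]})
toy_order_products = pd.DataFrame(
    {"order_id": [1, 1, 2, 4], "product_id": [100, 101, 100, 102]}
)

# An inner join keeps only order lines whose order_id appears in both frames:
# order 3 has no products and order 4 has no owner, so both are dropped.
toy_interactions = pd.merge(toy_orders, toy_order_products, on="order_id", how="inner")
toy_interactions = toy_interactions[["user_id", "product_id", "order_id"]]
print(toy_interactions)
```

Each surviving row is one (user, product, order) interaction, which is the shape the rest of the notebook works with.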


In [24]:
item_features = pd.get_dummies(products_df[['product_id', 'aisle_id']], columns=['aisle_id'])
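
The cell above one-hot encodes each product's aisle, giving one indicator column per aisle next to `product_id`. A small sketch on a toy frame (the product and aisle IDs are illustrative assumptions) shows the resulting layout:

```python
import pandas as pd

# Toy products table: three products spread over two aisles (made-up IDs).
toy_products = pd.DataFrame({"product_id": [1, 2, 3], "aisle_id": [61, 104, 61]})

# One indicator column is created per distinct aisle_id; product_id is left untouched.
toy_item_features = pd.get_dummies(toy_products, columns=["aisle_id"])
print(toy_item_features.columns.tolist())
print(toy_item_features.shape)
```

With two distinct aisles the toy frame ends up with three columns: `product_id` plus one column per aisle.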


## Sample data

To demonstrate the use of RankFM, we will sample 20K unique users from the dataset (approximately 10% of the users).

In [69]:
all_users = interactions.user_id.unique()
all_items = interactions.product_id.unique()

np.random.seed(42)
s_users = np.random.choice(all_users, size=20000, replace=False)
s_interactions = interactions[interactions.user_id.isin(s_users)].copy()
print(s_interactions.shape)

(3191742, 3)


In [70]:
s_items = s_interactions.product_id.unique()
print(len(s_items))

41223


In total, we are considering ~3.2M records, with around 41K unique items.
Let's look at the properties of this sample: the number of unique users, the number of unique items, and the sparsity level.

In [71]:
n_s_users = len(s_users)
n_s_items = len(s_items)

print("sample users:", n_s_users)
print("sample items:", n_s_items)
print("sample interactions:", s_interactions.shape)


s_sparsity = 1 - (s_interactions[['user_id', 'product_id']].drop_duplicates().shape[0] / (n_s_users * n_s_items))
print("sample interaction data sparsity: {}".format(round(100 * s_sparsity, 2)))

sample users: 20000
sample items: 41223
sample interactions: (3191742, 3)
sample interaction data sparsity: 99.84
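
Sparsity here is the share of the user-item grid with no observed interaction: one minus the number of distinct (user, item) pairs divided by users times items. The helper below is a hypothetical illustration of that formula on toy data, not part of RankFM.

```python
def sparsity(pairs, n_users, n_items):
    """Fraction of the user-item grid that has no observed interaction."""
    observed = len(set(pairs))  # repeated purchases of the same item count once
    return 1 - observed / (n_users * n_items)


# Toy data: 2 users, 3 items, 3 distinct (user, item) pairs out of 6 possible.
toy_pairs = [(1, "a"), (1, "a"), (1, "b"), (2, "c")]
toy_sparsity = sparsity(toy_pairs, n_users=2, n_items=3)
print(round(100 * toy_sparsity, 2))  # 50.0
```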


In [72]:
# random shuffle
shuffle_index = np.arange(len(s_interactions))
np.random.shuffle(shuffle_index)

s_interactions = s_interactions.iloc[shuffle_index]
s_interactions['random'] = np.random.random(size=len(s_interactions))
s_interactions.head()

Unnamed: 0,user_id,product_id,order_id,random
32014209,203449,39947,2566299,0.248523
6113441,38914,30805,2907435,0.853104
22867106,145227,29101,1425935,0.953094
29185370,185383,30756,3222520,0.366231
5552326,35294,12381,2810344,0.706596


Create train/test splits

In [73]:
# Split the data into train and test (without using sklearn)
test_pct = 0.25
train_mask = s_interactions['random'] <  (1 - test_pct)
test_mask = s_interactions['random'] >= (1 - test_pct)
          
# s_interactions is already shuffled, so no second reshuffle is needed
interactions_total = s_interactions[['user_id', 'product_id']]

interactions_train = s_interactions[train_mask].groupby(['user_id', 'product_id']).size().to_frame('orders').reset_index()
interactions_test = s_interactions[test_mask].groupby(['user_id', 'product_id']).size().to_frame('orders').reset_index()

In [74]:
interactions_test

Unnamed: 0,user_id,product_id,orders
0,5,6808,1
1,5,15349,1
2,5,21413,1
3,5,24231,1
4,5,24535,1
...,...,...,...
517913,206198,31647,1
517914,206198,32238,1
517915,206198,34358,1
517916,206198,46667,1
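
The split above works on individual order lines and only afterwards aggregates them into per-pair order counts. The sketch below reproduces that logic in plain Python on toy data. The function name `split_and_count` and the toy rows are assumptions made for illustration.

```python
import random
from collections import Counter


def split_and_count(order_lines, test_pct=0.25, seed=42):
    """Assign each order line to train or test at random, then count orders per pair."""
    rng = random.Random(seed)
    train_counts, test_counts = Counter(), Counter()
    for user_id, product_id in order_lines:
        # Same rule as the masks above: a draw >= 1 - test_pct goes to test.
        bucket = test_counts if rng.random() >= 1 - test_pct else train_counts
        bucket[(user_id, product_id)] += 1
    return train_counts, test_counts


# Toy order lines: user 1 bought product 196 eight times, and so on.
toy_lines = [(1, 196)] * 8 + [(1, 14084)] * 2 + [(2, 196)] * 4
train_counts, test_counts = split_and_count(toy_lines)
print(sum(train_counts.values()), sum(test_counts.values()))
```

Because the split is per order line rather than per pair, a frequently reordered (user, product) pair can appear in both the train and the test counts.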


In [75]:
# sample_weight_train = interactions_train['orders']
# sample_weight_test = interactions_test['orders']
sample_weight_train = np.log2(interactions_train['orders'] + 1)
sample_weight_test = np.log2(interactions_test['orders'] + 1)

interactions_train = interactions_train[['user_id', 'product_id']]
interactions_test = interactions_test[['user_id', 'product_id']]

train_users = np.sort(interactions_train.user_id.unique())
test_users = np.sort(interactions_test.user_id.unique())
cold_start_users = set(test_users) - set(train_users)

train_items = np.sort(interactions_train.product_id.unique())
test_items = np.sort(interactions_test.product_id.unique())
cold_start_items = set(test_items) - set(train_items)

item_features_train = item_features[item_features.product_id.isin(train_items)]
item_features_test = item_features[item_features.product_id.isin(test_items)]

print("total shape: {}".format(interactions_total.shape))
print("train shape: {}".format(interactions_train.shape))
print("test shape: {}".format(interactions_test.shape))

print("\ntrain weights shape: {}".format(sample_weight_train.shape))
print("test weights shape: {}".format(sample_weight_test.shape))

print("\ntrain users: {}".format(len(train_users)))
print("test users: {}".format(len(test_users)))
print("cold-start users: {}".format(cold_start_users))

print("\ntrain items: {}".format(len(train_items)))
print("test items: {}".format(len(test_items)))
print("number of cold-start items: {}".format(len(cold_start_items)))

print("\ntrain item features: {}".format(item_features_train.shape))
print("test item features: {}".format(item_features_test.shape))

total shape: (3191742, 2)
train shape: (1093314, 2)
test shape: (517918, 2)

train weights shape: (1093314,)
test weights shape: (517918,)

train users: 20000
test users: 19888
cold-start users: set()

train items: 39474
test items: 31594
number of cold-start items: 1749

train item features: (39474, 135)
test item features: (31594, 135)
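
The sample weights above use `log2(orders + 1)`, which dampens the influence of heavily reordered products while keeping a single order at weight 1. A short sketch with illustrative order counts:

```python
import math

# Illustrative order counts for four (user, product) pairs.
toy_orders = [1, 3, 7, 15]

# log2(orders + 1): each doubling of (orders + 1) adds only 1 to the weight.
toy_weights = [math.log2(n + 1) for n in toy_orders]
print(toy_weights)  # [1.0, 2.0, 3.0, 4.0]
```

A pair ordered 15 times gets four times the weight of a pair ordered once, rather than fifteen times as it would with raw counts.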


# Train the RankFM model

In [76]:
model = RankFM(factors=64, loss='warp', max_samples=50, alpha=0.01, learning_rate=0.01, learning_schedule='invscaling')

In [77]:
%%time
model.fit(interactions_train, sample_weight=sample_weight_train, epochs=30, verbose=True)


training epoch: 0
log likelihood: -572073.9375

training epoch: 1
log likelihood: -524473.0

training epoch: 2
log likelihood: -523613.78125

training epoch: 3
log likelihood: -526168.25

training epoch: 4
log likelihood: -528736.875

training epoch: 5
log likelihood: -530344.5625

training epoch: 6
log likelihood: -531521.5

training epoch: 7
log likelihood: -531846.125

training epoch: 8
log likelihood: -532626.5

training epoch: 9
log likelihood: -533012.75

training epoch: 10
log likelihood: -533220.3125

training epoch: 11
log likelihood: -532427.5625

training epoch: 12
log likelihood: -531902.6875

training epoch: 13
log likelihood: -531325.125

training epoch: 14
log likelihood: -530571.375

training epoch: 15
log likelihood: -529767.5

training epoch: 16
log likelihood: -528649.875

training epoch: 17
log likelihood: -526964.25

training epoch: 18
log likelihood: -525803.625

training epoch: 19
log likelihood: -523906.65625

training epoch: 20
log likelihood: -521815.46875

...

# Evaluate

## Baseline - popularity

In [78]:
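k = 10
# top-k products by number of training users who bought them
most_popular = interactions_train.groupby('product_id')['user_id'].count().sort_values(ascending=False)[:k]
test_user_items = interactions_test.groupby('user_id')['product_id'].apply(set).to_dict()
# keep only users seen in training; build the lookup set once instead of once per key
train_user_set = set(train_users)
test_user_items = {key: val for key, val in test_user_items.items() if key in train_user_set}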

In [79]:
base_hit = np.mean([int(len(set(most_popular.index) & set(val)) > 0) for key, val in test_user_items.items()])
base_precision = np.mean([len(set(most_popular.index) & set(val)) / len(set(most_popular.index)) for key, val in test_user_items.items()])

print("baseline hit rate: {:.3f}".format(base_hit))
print("baseline precision: {:.3f}".format(base_precision))

baseline hit rate: 0.600
baseline precision: 0.132
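
The popularity baseline recommends the same top-k training items to every user and scores them against each user's test items. The following is a self-contained sketch of that computation in plain Python. The function name and the toy data are assumptions, and `train_pairs` is assumed to hold each (user, item) pair once, as `interactions_train` does after the group-by.

```python
from collections import Counter


def popularity_baseline(train_pairs, test_user_items, k=10):
    """Hit rate and precision when every user is shown the k most popular items."""
    counts = Counter(item for _, item in train_pairs)  # users per item
    top_k = {item for item, _ in counts.most_common(k)}
    hits = [len(top_k & items) for items in test_user_items.values()]
    hit_rate = sum(h > 0 for h in hits) / len(hits)
    precision = sum(h / len(top_k) for h in hits) / len(hits)
    return hit_rate, precision


toy_train = [(1, "a"), (2, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "c")]
toy_test = {1: {"a", "c"}, 2: {"c"}, 3: {"a", "b"}, 4: {"b"}}
toy_hit, toy_precision = popularity_baseline(toy_train, toy_test, k=2)
print(toy_hit, toy_precision)  # 0.75 0.5
```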


## RankFM

In [80]:
from rankfm.evaluation import hit_rate, precision

model_hit = hit_rate(model, interactions_test, k=k)
model_precision = precision(model, interactions_test, k=k)


print('hit@10: ', model_hit)
print('precision@10: ', model_precision)


hit@10:  0.6209271922767498
precision@10:  0.1475160901045857
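
For reference, hit rate and precision at k are commonly defined per user and then averaged. The sketch below follows those common definitions on toy data. It is not RankFM's implementation, and the helper names `hit_rate_at_k` and `precision_at_k` are hypothetical.

```python
def hit_rate_at_k(recommendations, test_user_items, k=10):
    """Share of users with at least one test item among their top-k recommendations."""
    users = [u for u in test_user_items if u in recommendations]
    hits = [bool(set(recommendations[u][:k]) & test_user_items[u]) for u in users]
    return sum(hits) / len(users)


def precision_at_k(recommendations, test_user_items, k=10):
    """Average fraction of the top-k recommendations that appear in the user's test items."""
    users = [u for u in test_user_items if u in recommendations]
    fractions = [len(set(recommendations[u][:k]) & test_user_items[u]) / k for u in users]
    return sum(fractions) / len(users)


# Toy ranked recommendations per user, best first (made-up items).
toy_recs = {1: ["a", "b", "c"], 2: ["c", "a", "b"]}
toy_truth = {1: {"b"}, 2: {"z"}}
print(hit_rate_at_k(toy_recs, toy_truth, k=2))   # 0.5
print(precision_at_k(toy_recs, toy_truth, k=2))  # 0.25
```

Unlike the popularity baseline, the recommendations here differ per user, which is where a personalized model can gain over a fixed top-k list.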


There is another popular factorization machine package named LightFM.
Give it a try and compare the precision@10 or hit@10 of the two.

# LightFM

In [37]:
!pip install lightfm==1.16



In [38]:
from lightfm.data import Dataset
from lightfm import LightFM, evaluation

In [81]:
# YOUR CODE HERE #

# Answer

In [86]:
# all_users = interactions_total.user_id.unique()
# all_items = interactions_total.product_id.unique()
# print(len(all_users), len(all_items))

lfm_dataset = Dataset()
lfm_dataset.fit(users=s_users, items=s_items)

lfm_interactions, lfm_weights = lfm_dataset.build_interactions(zip(interactions_train['user_id'], interactions_train['product_id'], sample_weight_train))
lfm_interactions, lfm_weights

(<20000x41223 sparse matrix of type '<class 'numpy.int32'>'
 	with 1093314 stored elements in COOrdinate format>,
 <20000x41223 sparse matrix of type '<class 'numpy.float32'>'
 	with 1093314 stored elements in COOrdinate format>)
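
The matrices above are indexed by row and column position rather than by raw `user_id` and `product_id`, so raw IDs first have to be mapped to contiguous indices. The sketch below illustrates that idea in plain Python. It is a conceptual illustration under that assumption, not LightFM's actual code, and `build_index` is a hypothetical helper.

```python
def build_index(raw_ids):
    """Map each raw ID to a contiguous 0-based position, in first-seen order."""
    index = {}
    for raw_id in raw_ids:
        index.setdefault(raw_id, len(index))
    return index


# Made-up raw IDs, standing in for s_users and s_items.
user_index = build_index([203449, 38914, 145227])
item_index = build_index([39947, 30805])

# (user_id, product_id, weight) triples become COO-style row/col/value lists.
toy_triples = [(203449, 39947, 1.0), (145227, 30805, 2.0)]
rows = [user_index[u] for u, _, _ in toy_triples]
cols = [item_index[i] for _, i, _ in toy_triples]
vals = [w for _, _, w in toy_triples]
print(rows, cols, vals)  # [0, 2] [0, 1] [1.0, 2.0]
```

The matrix shape is then the number of known users by the number of known items, which matches the 20000x41223 shape reported above.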

In [87]:
lfm_model = LightFM(no_components=64, loss='warp', max_sampled=50)

In [88]:
%%time
lfm_model.fit(lfm_interactions, epochs=30, verbose=True)


Epoch:   0%|          | 0/30 [00:00<?, ?it/s]
Epoch:   3%|▎         | 1/30 [00:06<03:04,  6.35s/it]
Epoch:   7%|▋         | 2/30 [00:13<03:05,  6.62s/it]
Epoch:  10%|█         | 3/30 [00:20<03:04,  6.84s/it]
Epoch:  13%|█▎        | 4/30 [00:28<03:02,  7.02s/it]
Epoch:  17%|█▋        | 5/30 [00:36<03:00,  7.20s/it]
Epoch:  20%|██        | 6/30 [00:43<02:56,  7.34s/it]
Epoch:  23%|██▎       | 7/30 [00:51<02:51,  7.46s/it]
Epoch:  27%|██▋       | 8/30 [00:59<02:46,  7.59s/it]
Epoch:  30%|███       | 9/30 [01:07<02:41,  7.71s/it]
Epoch:  33%|███▎      | 10/30 [01:16<02:40,  8.02s/it]
Epoch:  37%|███▋      | 11/30 [01:24<02:35,  8.17s/it]
Epoch:  40%|████      | 12/30 [01:33<02:28,  8.26s/it]
Epoch:  43%|████▎     | 13/30 [01:41<02:21,  8.35s/it]
Epoch:  47%|████▋     | 14/30 [01:50<02:14,  8.38s/it]
Epoch:  50%|█████     | 15/30 [01:58<02:06,  8.44s/it]
Epoch:  53%|█████▎    | 16/30 [02:07<01:58,  8.48s/it]
Epoch:  57%|█████▋    | 17/30 [

CPU times: user 4min 14s, sys: 104 ms, total: 4min 14s
Wall time: 4min 14s





<lightfm.lightfm.LightFM at 0x7fbaa0cc9d50>

In [89]:
lfm_test_interactions, lfm_test_weights = lfm_dataset.build_interactions(zip(interactions_test['user_id'], interactions_test['product_id']))

In [90]:
lfm_precision = evaluation.precision_at_k(lfm_model, lfm_test_interactions, k=k).mean()
print("precision: {:.3f}".format(lfm_precision))


precision: 0.178
