### Modeling

As mentioned in the previous notebook, this second notebook covers the modeling process for making hotel recommendations from the clustered utility matrix.
As before, the aim is to address sparsity, scalability and the cold start problem.

The starting point is a baseline Collaborative Filtering model. The Final Hybrid System is an aggregate of models: existing users can be recommended hotels by the SVD model, while new users go through the LightFM Hybrid Model, which builds recommendations from user information.

I incorporated user and item features to allow a content-based approach alongside the collaborative filtering. Finally, I implemented an ontology model (decision tree), on the premise that a user profile (attributes and classes) determines user behavior to an extent. Inputting a user profile therefore helps predict the user cluster, and I can then create recommendations based on that user's cluster.
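
The routing idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `recommend`, `known_users`, `svd_scores` and `hybrid_scores` are hypothetical names that do not appear in this notebook, and the two toy scoring functions stand in for the fitted SVD and LightFM models.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical alias: a scorer returns {hotel_cluster: score}.
Scores = Dict[int, float]


def recommend(
    user_clust: Optional[int],
    profile: Dict[str, int],
    known_users: Set[int],
    svd_scores: Callable[[int], Scores],
    hybrid_scores: Callable[[Dict[str, int]], Scores],
    k: int = 5,
) -> List[int]:
    """Return the top-k hotel clusters, choosing the model by user type."""
    if user_clust is not None and user_clust in known_users:
        # Existing user: rely on collaborative filtering.
        scores = svd_scores(user_clust)
    else:
        # New user: fall back to profile features (cold start).
        scores = hybrid_scores(profile)
    # Highest score first; ties are broken by the smaller cluster id.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:k]


# Toy stand-ins for the fitted models (made-up numbers).
def toy_svd(user_clust: int) -> Scores:
    return {0: 0.2, 1: 0.9, 2: 0.5}


def toy_hybrid(profile: Dict[str, int]) -> Scores:
    return {0: 0.7, 1: 0.1, 2: 0.4}


old_user_top = recommend(3, {}, {1, 2, 3}, toy_svd, toy_hybrid, k=2)
new_user_top = recommend(None, {"user_country": 66}, {1, 2, 3}, toy_svd, toy_hybrid, k=2)
print(old_user_top, new_user_top)
```

Keeping the two scorers behind plain callables means either model could be swapped out without touching the routing logic.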

In [2]:
#Import relevant packages
import pandas as pd
import numpy as np

import tensorflow as tf
import lightfm
from lightfm import LightFM
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import auc_score
from scipy import sparse

#Precision at k is useful for evaluating how many of the top k recommended items are relevant.
import mean_average_precision as mapr
from lightfm import cross_validation

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from surprise import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV

In [4]:
df = pd.read_csv('s_df')

In [5]:
df_fm = pd.read_csv('clustered_utility')

In [6]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,user_clust,hotel_clust,rating
0,0,0,0,0.074074
1,1,0,1,0.133333


In [7]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [8]:
#Potential step to consider only hotels that are liked.
df.loc[df['rating'] > 2]

Unnamed: 0,user_clust,hotel_clust,rating
23,0,28,2.649383
31,0,40,2.908832
560,5,93,2.876923
617,6,55,2.760833
647,6,91,2.663810
...,...,...,...
90650,996,46,2.280411
90713,997,18,2.789105
90741,997,48,2.895671
90792,998,4,2.746883


#### Collaborative Filtering (SVD) with Surprise library

In [9]:
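#Trying SVD Baseline
from surprise import Reader, Dataset
#Note: rating_scale should span the observed ratings. Values in df fall below 0.5
#(e.g. 0.074074), and Surprise clips predictions to this range by default.
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(df,reader)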

In [10]:
trainset, testset = train_test_split(data, test_size=.2)

In [11]:
# Print number of users and items for the trainset
print('Number of users in train set : ', trainset.n_users, '\n')
print('Number of items in train set : ', trainset.n_items)

Number of users in train set :  1000 

Number of items in train set :  100


In [15]:
#Instantiate a baseline
svd = SVD()

svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1a5d748a58>

In [16]:
predictions = svd.test(testset)

In [17]:
%%time
accuracy.rmse(predictions)
accuracy.mae(predictions)

RMSE: 0.4633
MAE:  0.3893
CPU times: user 18.6 ms, sys: 765 µs, total: 19.4 ms
Wall time: 19 ms
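
For reference, the two error metrics reported above can be computed by hand. The sketch below uses made-up ratings (not values from this dataset), and the helper names `rmse` and `mae` are my own rather than Surprise functions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: large misses are penalised more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: every miss counts in proportion to its size."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [0.5, 1.0, 2.0, 0.0]
predicted = [1.0, 1.0, 1.0, 1.0]

print(rmse(actual, predicted), mae(actual, predicted))
```

RMSE is never smaller than MAE on the same predictions, so the gap between the two gives a rough sense of how much a few large errors contribute.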


In [18]:
cv_baseline = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.4532  0.4577  0.4608  0.4572  0.0031  
MAE (testset)     0.3876  0.3884  0.3883  0.3881  0.0003  
Fit time          3.10    3.06    3.08    3.08    0.02    
Test time         0.16    0.26    0.16    0.19    0.05    


In [19]:
#Tune SVD model with GridSearchCV
#Create set of parameters to run on GridSearchCV
parameters = {'n_factors': [20, 50, 80],
             'reg_all': [0.04, 0.06],
             'n_epochs': [10, 20, 30],
             'lr_all': [.002, .005, .01]}
svdgrid = GridSearchCV(SVD, param_grid=parameters, n_jobs=-1)
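
To get a feel for the cost of this search, the grid can be enumerated directly. This is a rough sketch: the fold count of 5 is an assumption based on Surprise's documented default when `cv` is not passed, and `combinations`, `n_folds` and `n_fits` are names introduced here for illustration.

```python
from itertools import product

parameters = {
    "n_factors": [20, 50, 80],
    "reg_all": [0.04, 0.06],
    "n_epochs": [10, 20, 30],
    "lr_all": [0.002, 0.005, 0.01],
}

# Every combination of one value per hyperparameter.
combinations = [
    dict(zip(parameters, values)) for values in product(*parameters.values())
]

n_folds = 5  # assumption: default number of folds when `cv` is not given
n_fits = len(combinations) * n_folds

print(len(combinations), n_fits)
```

Each added value multiplies the number of fits, which is why the grid is kept to two or three values per hyperparameter.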

In [20]:
svdgrid.fit(data)

In [21]:
print(svdgrid.best_score)
print(svdgrid.best_params)

{'rmse': 0.4520644415307971, 'mae': 0.3846430379857714}
{'rmse': {'n_factors': 80, 'reg_all': 0.04, 'n_epochs': 30, 'lr_all': 0.01}, 'mae': {'n_factors': 80, 'reg_all': 0.04, 'n_epochs': 30, 'lr_all': 0.01}}


In [24]:
svd1 = SVD(n_factors=80, reg_all=0.04, n_epochs=30, lr_all=0.01)

svd1.fit(trainset)
svdpreds = svd1.test(testset)

In [232]:
accuracy.rmse(svdpreds)
accuracy.mae(svdpreds)

RMSE: 0.4606
MAE:  0.3870


0.3870429446867108

In [26]:
svd1_cv = cross_validate(svd1, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.4525  0.4517  0.4544  0.4528  0.0011  
MAE (testset)     0.3855  0.3851  0.3855  0.3854  0.0002  
Fit time          3.65    3.71    3.95    3.77    0.13    
Test time         0.26    0.15    0.16    0.19    0.05    


#### Pure Collaborative Filtering LightFM 

This is the baseline and the point of comparison for the Hybrid Modeling below.

LightFM needs the utility matrix in wide format (one row per user cluster and one column per hotel cluster, rather than the 3-column format used for Surprise SVD), converted to a
SciPy sparse matrix (COO format).
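
As a small illustration of the two layouts, the sketch below pivots a toy long-format table into a wide utility matrix and converts it to COO. The data is made up; only the column names mirror `df`.

```python
import pandas as pd
from scipy import sparse

# Toy long-format ratings with the same column names as `df`.
long_df = pd.DataFrame(
    {
        "user_clust": [0, 0, 1, 2],
        "hotel_clust": [0, 2, 1, 2],
        "rating": [0.5, 1.5, 2.0, 0.25],
    }
)

# Wide utility matrix: one row per user cluster, one column per hotel cluster.
wide = long_df.pivot(
    index="user_clust", columns="hotel_clust", values="rating"
).fillna(0.0)

# COO keeps only the non-zero cells as (row, col, value) triplets.
coo = sparse.coo_matrix(wide.to_numpy())

print(coo.shape, coo.nnz)
```

Only id-free rating columns should go into the matrix, since every column is interpreted as an item.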

In [27]:
df_fm.head(2)

Unnamed: 0,cluster,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,0,0.074074,0.133333,0.037037,0.111111,0.333333,0.074074,0.333333,0.234568,0.111111,...,0.111111,0.488889,0.0,0.037037,0.234568,0.0,0.074074,0.037037,0.074074,0.0
1,1,1.149392,0.04501,0.047945,0.054795,0.184682,0.232877,0.054795,0.047945,0.082192,...,0.134051,0.205479,0.130137,0.061644,0.123288,0.171233,0.751402,0.10274,0.226027,0.09589


In [28]:
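#Note: the `cluster` id column is still part of df_fm here, so it is converted along with the 100 hotel columns (Length: 101 below).
df_fm = df_fm.astype(pd.SparseDtype("float", 0))
df_fm.dtypes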

cluster    Sparse[float64, 0]
0          Sparse[float64, 0]
1          Sparse[float64, 0]
2          Sparse[float64, 0]
3          Sparse[float64, 0]
                  ...        
95         Sparse[float64, 0]
96         Sparse[float64, 0]
97         Sparse[float64, 0]
98         Sparse[float64, 0]
99         Sparse[float64, 0]
Length: 101, dtype: object

In [29]:
df_fm.sparse.density

0.9107029702970297
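
The density above can be cross-checked against shapes reported elsewhere in this notebook. This is a back-of-the-envelope sketch, assuming 1000 user-cluster rows (the row count of the 1000x100 interaction matrices shown later) and the 101 columns listed in the dtypes output.

```python
n_rows = 1000  # assumption: one row per user cluster
n_cols = 101   # from the dtypes output: `cluster` plus 100 hotel clusters
density = 0.9107029702970297

# Number of non-zero cells implied by the density.
nonzero_cells = round(density * n_rows * n_cols)

# Stored elements reported later for the 1000x100 train and test matrices.
interaction_cells = 72785 + 18197

extra_cells = nonzero_cells - interaction_cells
print(nonzero_cells, interaction_cells, extra_cells)
```

The difference of 999 equals the number of non-zero ids in a `cluster` column running from 0 to 999, which suggests the id column is being counted as an extra item at this point. Dropping it before the sparse conversion would give a 1000x100 matrix.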

In [30]:
df_fm = df_fm.sparse.to_coo()

In [34]:
#Single random train/test split using LightFM's built-in helper (a hold-out split, not k-fold cross-validation)
train_fm, test_fm = lightfm.cross_validation.random_train_test_split(df_fm, test_percentage=0.2, random_state=None)

In [290]:
#LightFM optimizes the chosen loss function (BPR here, WARP below) with stochastic gradient descent
light = LightFM(loss='bpr')
light.fit(sparse.coo_matrix(train_fm), epochs=3)

<lightfm.lightfm.LightFM at 0x1a64b806a0>

In [291]:
train_precision = precision_at_k(light, train_fm, k=10).mean()
test_precision = precision_at_k(light, test_fm, k=10).mean()

train_auc = auc_score(light, train_fm).mean()
test_auc = auc_score(light, test_fm).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

Precision: train 0.92, test 0.06.
AUC: train 0.72, test 0.35.
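
Precision at k is the share of a user's top-k ranked items that are positives in the split being evaluated. The sketch below uses made-up scores and a hypothetical helper, `precision_at_k_for_user` (not LightFM's function), to show how leaving training items in the ranking can push test precision down. That is one possible contributor to the train/test gap above; LightFM's `precision_at_k` and `auc_score` accept a `train_interactions` argument for excluding them.

```python
def precision_at_k_for_user(scores, positives, k, exclude=frozenset()):
    """Fraction of the top-k ranked items (after exclusions) that are positives."""
    candidates = [item for item in scores if item not in exclude]
    # Highest score first; ties broken by the smaller item id.
    ranked = sorted(candidates, key=lambda item: (-scores[item], item))
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in positives)
    return hits / k


# Made-up scores: the model ranks the items it was trained on highest.
scores = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.6, 4: 0.5, 5: 0.4}
train_items = {0, 1, 2}
test_items = {3, 5}

raw = precision_at_k_for_user(scores, test_items, k=3)
filtered = precision_at_k_for_user(scores, test_items, k=3, exclude=train_items)

print(raw, filtered)
```

In the raw case the top 3 slots are all taken by training items, so no test positive can be counted as a hit.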


In [None]:
#Testing WARP to optimize precision at k

In [253]:
#Learning schedule makes the model more robust to hyperparameter choices.
light = LightFM(loss='warp', learning_rate=0.01)
light.fit(sparse.coo_matrix(train_fm), epochs=3)

<lightfm.lightfm.LightFM at 0x1a64b80ac8>

In [254]:
%%time
train_precision = precision_at_k(light, train_fm, k=10).mean()
test_precision = precision_at_k(light, test_fm, k=10).mean()

train_auc = auc_score(light, train_fm).mean()
test_auc = auc_score(light, test_fm).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

Precision: train 0.81, test 0.15.
AUC: train 0.60, test 0.49.
CPU times: user 91.6 ms, sys: 4.05 ms, total: 95.6 ms
Wall time: 100 ms


In [41]:
test_precision = precision_at_k(light, test_fm).mean()
print('Test precision: %s' % test_precision)

Test precision: 0.15450002


# Hybrid Modeling 
With Content-Based Filtering and Collaborative Filtering using Item-Features

In [42]:
data1 = pd.read_csv("data1")

In [43]:
data1.head(5)

Unnamed: 0.1,Unnamed: 0,site_name,posa_continent,user_location_country,user_location_region,user_location_city,orig_destination_distance,user_id,is_mobile,is_package,...,is_booking,cnt,hotel_continent,hotel_country,hotel_market,hotel_cluster,year,month,hotel_nights,rating
0,0,2,3,66,348,48862,2234.2641,12,0,1,...,0,3,2,50,628,1,2014,8,4.0,1
1,1,2,3,66,348,48862,2234.2641,12,0,1,...,1,1,2,50,628,1,2014,8,4.0,5
2,2,2,3,66,348,48862,2234.2641,12,0,0,...,0,1,2,50,628,1,2014,8,4.0,1
3,3,2,3,66,442,35390,913.1932,93,0,0,...,0,1,2,50,1457,80,2014,8,5.0,1
4,4,2,3,66,442,35390,913.6259,93,0,0,...,0,1,2,50,1457,21,2014,8,5.0,1


In [44]:
post_final = pd.read_csv('post_final')
post_final.rename(columns={'cluster':'user_clust'}, inplace=True)

In [45]:
post_final.head(2)

Unnamed: 0,user_clust,user_country,user_region,user_city,package
0,0,66,442,36086,1
1,1,66,174,48862,1


In [46]:
df.head(5)

Unnamed: 0,user_clust,hotel_clust,rating
0,0,0,0.074074
1,0,1,0.133333
2,0,2,0.037037
3,0,3,0.111111
4,0,4,0.333333


In [47]:
feat = pd.DataFrame()
feat['hotel_cluster'] = data1['hotel_cluster']
feat['hotel_market'] = data1['hotel_market']
feat['hotel_country'] = data1['hotel_country']
feat['distance'] = data1['orig_destination_distance']

In [48]:
pd.set_option('display.max_columns', None)
feat.head(2)

Unnamed: 0,hotel_cluster,hotel_market,hotel_country,distance
0,1,628,50,2234.2641
1,1,628,50,2234.2641


In [49]:
import scipy.stats
feat1 = feat.groupby('hotel_cluster').agg(lambda x: scipy.stats.mode(x)[0])

In [50]:
#Represent each hotel_cluster by the mode (most frequently occurring value) of each feature.
features = pd.DataFrame(feat1, columns = ['hotel_market', 'hotel_country', 'distance']).reset_index()
features.head(2)

Unnamed: 0,hotel_cluster,hotel_market,hotel_country,distance
0,0,212,50,1611.8127
1,1,628,50,227.9798


### Creating Item Features

LightFM's Dataset treats each item feature as a label, so I encode every feature-value pair as a string in a fixed format (x:value, y:value) and
generate a list of all feature-value pairs.
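
Two structures are needed in the steps that follow: the full set of `name:value` strings (supplied when fitting the dataset) and, for each hotel cluster, the list of strings that apply to it. A minimal sketch with made-up rows; `to_feature_strings`, `item_tuples` and `all_features` are hypothetical names used only here.

```python
# Toy rows; the values are illustrative, not taken from `features`.
rows = [
    {"hotel_cluster": 0, "hotel_market": 212, "hotel_country": 50},
    {"hotel_cluster": 1, "hotel_market": 628, "hotel_country": 50},
    {"hotel_cluster": 2, "hotel_market": 19, "hotel_country": 105},
]
feature_columns = ["hotel_market", "hotel_country"]


def to_feature_strings(row):
    """Encode one row as a list of 'column:value' labels."""
    return [f"{col}:{row[col]}" for col in feature_columns]


# Per-item assignment: (item id, [labels that apply to it]).
item_tuples = [(row["hotel_cluster"], to_feature_strings(row)) for row in rows]

# Universe of labels; dict.fromkeys de-duplicates and keeps first-seen order.
all_features = list(dict.fromkeys(f for _, feats in item_tuples for f in feats))

print(item_tuples[0])
print(all_features)
```

Every label assigned to an item has to appear in the universe, and the strings must match exactly, so `212` and `212.0` would count as different features.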

In [51]:
if1 = []
col = ['hotel_market']*len(features.hotel_market.unique()) + ['hotel_country']*len(features.hotel_country.unique()) + ['distance']*len(features.distance.unique())
unique_h = list(features.hotel_market.unique()) + list(features.hotel_country.unique()) + list(features.distance.unique())
#print('hotel_market:', unique_hotel_market)
#print('hotel_country:', unique_hotel_country)
#print('distance:', unique_distance)
for x,y in zip(col, unique_h):
    res = str(x)+ ":" +str(y)
    if1.append(res)
    print(res)

hotel_market:212
hotel_market:628
hotel_market:19
hotel_market:675
hotel_market:365
hotel_market:27
hotel_market:659
hotel_market:29
hotel_market:411
hotel_market:682
hotel_market:46
hotel_market:366
hotel_market:681
hotel_market:701
hotel_market:637
hotel_market:126
hotel_market:213
hotel_market:73
hotel_market:20
hotel_market:1230
hotel_market:118
hotel_market:1503
hotel_market:83
hotel_market:110
hotel_market:1480
hotel_market:402
hotel_market:1400
hotel_market:58
hotel_market:12
hotel_market:623
hotel_market:191
hotel_market:59
hotel_country:50
hotel_country:105
hotel_country:182
hotel_country:152
hotel_country:77
hotel_country:144
hotel_country:8
hotel_country:198
distance:1611.8127
distance:227.9798
distance:73.6105
distance:77.5249
distance:65.0841
distance:4029.0141
distance:60.2484
distance:217.30900000000003
distance:7448.2962
distance:1.39
distance:2485.1057
distance:0.8893
distance:4252.1855
distance:313.1939
distance:534.631
distance:1060.2808
distance:720.9157
distance:22

### Creating LightFM-compliant dataset
LightFM requires its input in a specific format, so I converted my data into a LightFM Dataset object
and built the interactions matrix to feed into the model.

In [53]:
from lightfm.data import Dataset
dataset = Dataset()

In [54]:
#Fitting the dataset
dataset.fit(
    df['user_clust'].unique(), 
    df['hotel_clust'].unique(),
    item_features = if1
)

In [55]:
#Plugging in the interactions and their weights/ratings
(interactions, weights) = dataset.build_interactions([(x[0], x[1], x[2]) for x in df.values ])

In [56]:
interactions.todense()

matrix([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        ...,
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1]], dtype=int32)

In [57]:
weights.todense()

matrix([[0.07407407, 0.13333334, 0.03703704, ..., 0.        , 0.        ,
         0.        ],
        [1.1493921 , 0.04500978, 0.04794521, ..., 0.13013698, 0.17123288,
         0.09589041],
        [0.03389831, 0.00338983, 0.01016949, ..., 0.00338983, 0.01694915,
         0.01355932],
        ...,
        [0.13636364, 0.07713499, 0.14141414, ..., 0.09090909, 0.14141414,
         0.14141414],
        [0.05555556, 0.06591022, 0.03732639, ..., 0.01041667, 0.09288195,
         0.03645833],
        [0.2195122 , 0.09756097, 0.5121951 , ..., 0.38455284, 0.271777  ,
         0.3902439 ]], dtype=float32)

### Building Item Features

In [58]:
def feature_value(my_list):
    result = []
    ll = ['hotel_market:','hotel_country:', 'distance:']
    aa = my_list
    for x,y in zip(ll,aa):
        res = str(x) +""+ str(y)
        result.append(res)
    return result

In [59]:
ad_subset = features[['hotel_market', 'hotel_country','distance']] 
ad_list = [list(x) for x in ad_subset.values]
feature_list = []

In [None]:
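#DataFrame.values returns a single float64 array when int and float columns are mixed,
#so integer ids came back as floats (212 -> 212.0) and no longer matched the feature names the dataset was fit on.
#This loop converts whole-number floats back to ints to remove the trailing .0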
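
The int-to-float conversion noted above is consistent with how NumPy stores mixed values: one array holds one dtype, so integers are promoted to `float64`. A small sketch on a made-up row, with a hypothetical helper `format_value` that formats whole numbers without the trailing `.0`:

```python
import numpy as np

# Mixed ints and floats end up as float64 in a single array.
row = np.array([212, 50, 1611.8127])
print(row.dtype, str(row[0]))


def format_value(value):
    """Render whole-number floats as ints, leave other floats unchanged."""
    value = float(value)
    return str(int(value)) if value.is_integer() else str(value)


names = ["hotel_market", "hotel_country", "distance"]
labels = [f"{name}:{format_value(v)}" for name, v in zip(names, row)]
print(labels)
```

Formatting at the point where the string is built avoids mutating the list of rows in place.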

In [60]:
for i in range(len(ad_list)):
    for j in range(len(ad_list[i])):
        if ad_list[i][j].is_integer():
            ad_list[i][j] = int(ad_list[i][j])

In [61]:
ad_list

[[212, 50, 1611.8127],
 [628, 50, 227.9798],
 [19, 50, 73.6105],
 [675, 50, 77.5249],
 [365, 50, 65.0841],
 [27, 50, 4029.0141],
 [365, 50, 60.2484],
 [659, 50, 217.30900000000003],
 [29, 105, 7448.2962],
 [411, 50, 1.39],
 [682, 50, 2485.1057],
 [27, 50, 0.8893],
 [46, 182, 4252.1855],
 [366, 50, 313.1939],
 [365, 50, 534.631],
 [681, 50, 1060.2808],
 [682, 50, 720.9157],
 [701, 50, 2268.9354],
 [637, 50, 117.1706],
 [628, 50, 43.2695],
 [126, 50, 5473.8088],
 [675, 50, 42.9461],
 [29, 105, 110.04],
 [365, 50, 1564.6379],
 [628, 50, 1035.1268],
 [27, 50, 1094.2035],
 [213, 50, 1614.2436],
 [73, 152, 10021.5382],
 [366, 50, 4858.6758],
 [20, 77, 441.7211],
 [126, 50, 1053.2523],
 [126, 50, 2.0881],
 [366, 50, 2092.3516],
 [212, 50, 2035.164],
 [682, 50, 65.0057],
 [29, 50, 1562.4754],
 [46, 105, 69.1866],
 [675, 50, 60.3896],
 [701, 50, 1093.9088],
 [365, 50, 2.198],
 [1230, 50, 1.4053],
 [675, 50, 1093.561],
 [682, 50, 995.7545],
 [29, 50, 50.4227],
 [118, 50, 1669.9105],
 [628, 50, 2

In [62]:
for item in ad_list:
    feature_list.append(feature_value(item))
    print(feature_value(item))
print(f'Final output: {feature_list}')

['hotel_market:212', 'hotel_country:50', 'distance:1611.8127']
['hotel_market:628', 'hotel_country:50', 'distance:227.9798']
['hotel_market:19', 'hotel_country:50', 'distance:73.6105']
['hotel_market:675', 'hotel_country:50', 'distance:77.5249']
['hotel_market:365', 'hotel_country:50', 'distance:65.0841']
['hotel_market:27', 'hotel_country:50', 'distance:4029.0141']
['hotel_market:365', 'hotel_country:50', 'distance:60.2484']
['hotel_market:659', 'hotel_country:50', 'distance:217.30900000000003']
['hotel_market:29', 'hotel_country:105', 'distance:7448.2962']
['hotel_market:411', 'hotel_country:50', 'distance:1.39']
['hotel_market:682', 'hotel_country:50', 'distance:2485.1057']
['hotel_market:27', 'hotel_country:50', 'distance:0.8893']
['hotel_market:46', 'hotel_country:182', 'distance:4252.1855']
['hotel_market:366', 'hotel_country:50', 'distance:313.1939']
['hotel_market:365', 'hotel_country:50', 'distance:534.631']
['hotel_market:681', 'hotel_country:50', 'distance:1060.2808']
['hote

In [63]:
feature_list

[['hotel_market:212', 'hotel_country:50', 'distance:1611.8127'],
 ['hotel_market:628', 'hotel_country:50', 'distance:227.9798'],
 ['hotel_market:19', 'hotel_country:50', 'distance:73.6105'],
 ['hotel_market:675', 'hotel_country:50', 'distance:77.5249'],
 ['hotel_market:365', 'hotel_country:50', 'distance:65.0841'],
 ['hotel_market:27', 'hotel_country:50', 'distance:4029.0141'],
 ['hotel_market:365', 'hotel_country:50', 'distance:60.2484'],
 ['hotel_market:659', 'hotel_country:50', 'distance:217.30900000000003'],
 ['hotel_market:29', 'hotel_country:105', 'distance:7448.2962'],
 ['hotel_market:411', 'hotel_country:50', 'distance:1.39'],
 ['hotel_market:682', 'hotel_country:50', 'distance:2485.1057'],
 ['hotel_market:27', 'hotel_country:50', 'distance:0.8893'],
 ['hotel_market:46', 'hotel_country:182', 'distance:4252.1855'],
 ['hotel_market:366', 'hotel_country:50', 'distance:313.1939'],
 ['hotel_market:365', 'hotel_country:50', 'distance:534.631'],
 ['hotel_market:681', 'hotel_country:50

In [64]:
item_tuple = list(zip(features.hotel_cluster, feature_list))
item_tuple

[(0, ['hotel_market:212', 'hotel_country:50', 'distance:1611.8127']),
 (1, ['hotel_market:628', 'hotel_country:50', 'distance:227.9798']),
 (2, ['hotel_market:19', 'hotel_country:50', 'distance:73.6105']),
 (3, ['hotel_market:675', 'hotel_country:50', 'distance:77.5249']),
 (4, ['hotel_market:365', 'hotel_country:50', 'distance:65.0841']),
 (5, ['hotel_market:27', 'hotel_country:50', 'distance:4029.0141']),
 (6, ['hotel_market:365', 'hotel_country:50', 'distance:60.2484']),
 (7, ['hotel_market:659', 'hotel_country:50', 'distance:217.30900000000003']),
 (8, ['hotel_market:29', 'hotel_country:105', 'distance:7448.2962']),
 (9, ['hotel_market:411', 'hotel_country:50', 'distance:1.39']),
 (10, ['hotel_market:682', 'hotel_country:50', 'distance:2485.1057']),
 (11, ['hotel_market:27', 'hotel_country:50', 'distance:0.8893']),
 (12, ['hotel_market:46', 'hotel_country:182', 'distance:4252.1855']),
 (13, ['hotel_market:366', 'hotel_country:50', 'distance:313.1939']),
 (14, ['hotel_market:365', '

In [65]:
item_features = dataset.build_item_features(item_tuple, normalize=False)

In [66]:
item_features.todense()

matrix([[1., 0., 0., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 1.]], dtype=float32)

In [67]:
user_id_map, user_feature_map, item_id_map, item_feature_map = dataset.mapping()
dataset.mapping()

({0: 0,
  1: 1,
  2: 2,
  3: 3,
  4: 4,
  5: 5,
  6: 6,
  7: 7,
  8: 8,
  9: 9,
  10: 10,
  11: 11,
  12: 12,
  13: 13,
  14: 14,
  15: 15,
  16: 16,
  17: 17,
  18: 18,
  19: 19,
  20: 20,
  21: 21,
  22: 22,
  23: 23,
  24: 24,
  25: 25,
  26: 26,
  27: 27,
  28: 28,
  29: 29,
  30: 30,
  31: 31,
  32: 32,
  33: 33,
  34: 34,
  35: 35,
  36: 36,
  37: 37,
  38: 38,
  39: 39,
  40: 40,
  41: 41,
  42: 42,
  43: 43,
  44: 44,
  45: 45,
  46: 46,
  47: 47,
  48: 48,
  49: 49,
  50: 50,
  51: 51,
  52: 52,
  53: 53,
  54: 54,
  55: 55,
  56: 56,
  57: 57,
  58: 58,
  59: 59,
  60: 60,
  61: 61,
  62: 62,
  63: 63,
  64: 64,
  65: 65,
  66: 66,
  67: 67,
  68: 68,
  69: 69,
  70: 70,
  71: 71,
  72: 72,
  73: 73,
  74: 74,
  75: 75,
  76: 76,
  77: 77,
  78: 78,
  79: 79,
  80: 80,
  81: 81,
  82: 82,
  83: 83,
  84: 84,
  85: 85,
  86: 86,
  87: 87,
  88: 88,
  89: 89,
  90: 90,
  91: 91,
  92: 92,
  93: 93,
  94: 94,
  95: 95,
  96: 96,
  97: 97,
  98: 98,
  99: 99,
  100: 100,
  101: 1

### Creating User Features

In [68]:
post_final.head(2)

Unnamed: 0,user_clust,user_country,user_region,user_city,package
0,0,66,442,36086,1
1,1,66,174,48862,1


In [69]:
features2 = pd.DataFrame(post_final,  columns = ['user_country', 'user_region', 'user_city', 'package']).reset_index()
features2.rename(columns={'index':'user_clust'}, inplace=True)
features2

Unnamed: 0,user_clust,user_country,user_region,user_city,package
0,0,66,442,36086,1
1,1,66,174,48862,1
2,2,66,174,29254,0
3,3,66,348,48862,1
4,4,66,442,3781,1
...,...,...,...,...,...
995,995,66,174,49272,1
996,996,66,347,56153,1
997,997,66,442,37916,1
998,998,66,174,25315,0


In [70]:
uf = []
col = ['user_country']*len(features2.user_country.unique()) + ['user_region']*len(features2.user_region.unique()) + ['user_city']*len(features2.user_city.unique()) + ['package']*len(features2.package.unique())
unique_z = list(features2.user_country.unique()) + list(features2.user_region.unique()) + list(features2.user_city.unique()) + list(features2.package.unique())

for x,y in zip(col, unique_z):
    res = str(x)+ ":" +str(y)
    uf.append(res)
    print(res)

user_country:66
user_country:46
user_country:205
user_country:1
user_country:77
user_country:215
user_region:442
user_region:174
user_region:348
user_region:368
user_region:155
user_region:462
user_region:354
user_region:435
user_region:184
user_region:220
user_region:824
user_region:363
user_region:395
user_region:448
user_region:311
user_region:646
user_region:385
user_region:135
user_region:226
user_region:346
user_region:331
user_region:977
user_region:467
user_region:343
user_region:258
user_region:314
user_region:256
user_region:196
user_region:436
user_region:447
user_region:312
user_region:318
user_region:347
user_region:171
user_region:871
user_region:322
user_region:337
user_region:351
user_region:520
user_region:401
user_region:480
user_region:153
user_region:293
user_city:36086
user_city:48862
user_city:29254
user_city:3781
user_city:49272
user_city:36643
user_city:42300
user_city:7317
user_city:24103
user_city:3263
user_city:26232
user_city:14703
user_city:28620
user_city:

In [71]:
# we call fit to supply userid, item id and user/item features
dataset2 = Dataset()
dataset2.fit(
        df['user_clust'].unique(), # all the users
        df['hotel_clust'].unique(), # all the items
        item_features = if1,
        user_features = uf
)

In [72]:
(interactions, weights) = dataset2.build_interactions([(x[0], x[1], x[2]) for x in df.values ])

In [73]:
interactions.todense()

matrix([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        ...,
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1],
        [1, 1, 1, ..., 1, 1, 1]], dtype=int32)

In [74]:
weights.todense()

matrix([[0.07407407, 0.13333334, 0.03703704, ..., 0.        , 0.        ,
         0.        ],
        [1.1493921 , 0.04500978, 0.04794521, ..., 0.13013698, 0.17123288,
         0.09589041],
        [0.03389831, 0.00338983, 0.01016949, ..., 0.00338983, 0.01694915,
         0.01355932],
        ...,
        [0.13636364, 0.07713499, 0.14141414, ..., 0.09090909, 0.14141414,
         0.14141414],
        [0.05555556, 0.06591022, 0.03732639, ..., 0.01041667, 0.09288195,
         0.03645833],
        [0.2195122 , 0.09756097, 0.5121951 , ..., 0.38455284, 0.271777  ,
         0.3902439 ]], dtype=float32)

### Building User Features

In [75]:
def feature_value2(my_list):
    result = []
    ll = ['user_country:', 'user_region:', 'user_city:', 'package:']
    aa = my_list
    for x,y in zip(ll,aa):
        res = str(x) +""+ str(y)
        result.append(res)
    return result

In [76]:
ad_subset2 = features2[['user_country', 'user_region', 'user_city','package']] 
ad_list2 = [list(x) for x in ad_subset2.values]
feature_list2 = []
for item in ad_list2:
    feature_list2.append(feature_value2(item))
    print(feature_value2(item))
print(f'Final output: {feature_list2}')

['user_country:66', 'user_region:442', 'user_city:36086', 'package:1']
['user_country:66', 'user_region:174', 'user_city:48862', 'package:1']
['user_country:66', 'user_region:174', 'user_city:29254', 'package:0']
['user_country:66', 'user_region:348', 'user_city:48862', 'package:1']
['user_country:66', 'user_region:442', 'user_city:3781', 'package:1']
['user_country:66', 'user_region:174', 'user_city:49272', 'package:0']
['user_country:66', 'user_region:174', 'user_city:48862', 'package:1']
['user_country:66', 'user_region:174', 'user_city:36086', 'package:1']
['user_country:66', 'user_region:174', 'user_city:36643', 'package:1']
['user_country:66', 'user_region:442', 'user_city:48862', 'package:1']
['user_country:66', 'user_region:442', 'user_city:42300', 'package:0']
['user_country:66', 'user_region:174', 'user_city:48862', 'package:0']
['user_country:66', 'user_region:174', 'user_city:48862', 'package:0']
['user_country:66', 'user_region:174', 'user_city:48862', 'package:1']
['user_

In [77]:
user_tuple = list(zip(features2.user_clust, feature_list2))
user_tuple

[(0, ['user_country:66', 'user_region:442', 'user_city:36086', 'package:1']),
 (1, ['user_country:66', 'user_region:174', 'user_city:48862', 'package:1']),
 (2, ['user_country:66', 'user_region:174', 'user_city:29254', 'package:0']),
 (3, ['user_country:66', 'user_region:348', 'user_city:48862', 'package:1']),
 (4, ['user_country:66', 'user_region:442', 'user_city:3781', 'package:1']),
 (5, ['user_country:66', 'user_region:174', 'user_city:49272', 'package:0']),
 (6, ['user_country:66', 'user_region:174', 'user_city:48862', 'package:1']),
 (7, ['user_country:66', 'user_region:174', 'user_city:36086', 'package:1']),
 (8, ['user_country:66', 'user_region:174', 'user_city:36643', 'package:1']),
 (9, ['user_country:66', 'user_region:442', 'user_city:48862', 'package:1']),
 (10, ['user_country:66', 'user_region:442', 'user_city:42300', 'package:0']),
 (11, ['user_country:66', 'user_region:174', 'user_city:48862', 'package:0']),
 (12, ['user_country:66', 'user_region:174', 'user_city:48862',

In [78]:
user_features = dataset2.build_user_features(user_tuple, normalize=False)
user_features.todense()

matrix([[1., 0., 0., ..., 0., 1., 0.],
        [0., 1., 0., ..., 0., 1., 0.],
        [0., 0., 1., ..., 0., 0., 1.],
        ...,
        [0., 0., 0., ..., 1., 1., 0.],
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 1., 0.]], dtype=float32)

In [79]:
user_id_map2, user_feature_map2, item_id_map2, item_feature_map2 = dataset2.mapping()
dataset2.mapping()

({0: 0,
  1: 1,
  2: 2,
  3: 3,
  4: 4,
  5: 5,
  6: 6,
  7: 7,
  8: 8,
  9: 9,
  10: 10,
  11: 11,
  12: 12,
  13: 13,
  14: 14,
  15: 15,
  16: 16,
  17: 17,
  18: 18,
  19: 19,
  20: 20,
  21: 21,
  22: 22,
  23: 23,
  24: 24,
  25: 25,
  26: 26,
  27: 27,
  28: 28,
  29: 29,
  30: 30,
  31: 31,
  32: 32,
  33: 33,
  34: 34,
  35: 35,
  36: 36,
  37: 37,
  38: 38,
  39: 39,
  40: 40,
  41: 41,
  42: 42,
  43: 43,
  44: 44,
  45: 45,
  46: 46,
  47: 47,
  48: 48,
  49: 49,
  50: 50,
  51: 51,
  52: 52,
  53: 53,
  54: 54,
  55: 55,
  56: 56,
  57: 57,
  58: 58,
  59: 59,
  60: 60,
  61: 61,
  62: 62,
  63: 63,
  64: 64,
  65: 65,
  66: 66,
  67: 67,
  68: 68,
  69: 69,
  70: 70,
  71: 71,
  72: 72,
  73: 73,
  74: 74,
  75: 75,
  76: 76,
  77: 77,
  78: 78,
  79: 79,
  80: 80,
  81: 81,
  82: 82,
  83: 83,
  84: 84,
  85: 85,
  86: 86,
  87: 87,
  88: 88,
  89: 89,
  90: 90,
  91: 91,
  92: 92,
  93: 93,
  94: 94,
  95: 95,
  96: 96,
  97: 97,
  98: 98,
  99: 99,
  100: 100,
  101: 1

In [80]:
user_feature_map2

{0: 0,
 1: 1,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20,
 21: 21,
 22: 22,
 23: 23,
 24: 24,
 25: 25,
 26: 26,
 27: 27,
 28: 28,
 29: 29,
 30: 30,
 31: 31,
 32: 32,
 33: 33,
 34: 34,
 35: 35,
 36: 36,
 37: 37,
 38: 38,
 39: 39,
 40: 40,
 41: 41,
 42: 42,
 43: 43,
 44: 44,
 45: 45,
 46: 46,
 47: 47,
 48: 48,
 49: 49,
 50: 50,
 51: 51,
 52: 52,
 53: 53,
 54: 54,
 55: 55,
 56: 56,
 57: 57,
 58: 58,
 59: 59,
 60: 60,
 61: 61,
 62: 62,
 63: 63,
 64: 64,
 65: 65,
 66: 66,
 67: 67,
 68: 68,
 69: 69,
 70: 70,
 71: 71,
 72: 72,
 73: 73,
 74: 74,
 75: 75,
 76: 76,
 77: 77,
 78: 78,
 79: 79,
 80: 80,
 81: 81,
 82: 82,
 83: 83,
 84: 84,
 85: 85,
 86: 86,
 87: 87,
 88: 88,
 89: 89,
 90: 90,
 91: 91,
 92: 92,
 93: 93,
 94: 94,
 95: 95,
 96: 96,
 97: 97,
 98: 98,
 99: 99,
 100: 100,
 101: 101,
 102: 102,
 103: 103,
 104: 104,
 105: 105,
 106: 106,
 107: 107,
 108: 108,
 109: 109,
 110: 110,

## Modeling 

In [81]:
train1, test1 = lightfm.cross_validation.random_train_test_split(interactions, test_percentage=0.2, random_state=None)

In [82]:
train1

<1000x100 sparse matrix of type '<class 'numpy.int32'>'
	with 72785 stored elements in COOrdinate format>

In [83]:
test1

<1000x100 sparse matrix of type '<class 'numpy.int32'>'
	with 18197 stored elements in COOrdinate format>
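
The two stored-element counts above are consistent with an 80/20 split of the non-zero interactions. A quick arithmetic check follows; the truncating cut-off is an assumption about how the split point is chosen, not something stated in this notebook.

```python
train_nnz, test_nnz = 72785, 18197
total_nnz = train_nnz + test_nnz

test_percentage = 0.2
# Assumed rule: the first (1 - test_percentage) share of shuffled entries goes to train.
cutoff = int((1.0 - test_percentage) * total_nnz)

print(total_nnz, cutoff, total_nnz - cutoff)
print(round(test_nnz / total_nnz, 4))
```

Because `random_state=None`, which entries land in each side changes from run to run, although the counts stay the same.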

#### Baseline Hybrid Model

In [84]:
base = LightFM()
base.fit(train1, item_features = item_features, user_features = user_features)

<lightfm.lightfm.LightFM at 0x1a61c1df28>

In [85]:
%%time
train_precision = precision_at_k(base, train1, item_features=item_features, user_features=user_features, k=5).mean()
test_precision = precision_at_k(base, test1, item_features=item_features, user_features=user_features, k=5).mean()

train_auc = auc_score(base, train1, item_features=item_features, user_features=user_features).mean()
test_auc = auc_score(base, test1, item_features=item_features, user_features=user_features).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

Precision: train 0.73, test 0.18.
AUC: train 0.54, test 0.52.
CPU times: user 105 ms, sys: 2.75 ms, total: 108 ms
Wall time: 109 ms
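
AUC can be read as the probability that a randomly chosen positive item is scored above a randomly chosen negative one, so 0.5 corresponds to a random ordering. A minimal per-user sketch with made-up scores; `pairwise_auc` is a hypothetical helper, not the LightFM implementation.

```python
def pairwise_auc(scores, positives):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for item, s in scores.items() if item in positives]
    neg = [s for item, s in scores.items() if item not in positives]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


scores = {0: 0.9, 1: 0.1, 2: 0.6, 3: 0.4}

good = pairwise_auc(scores, positives={0, 2})   # both positives outrank both negatives
mixed = pairwise_auc(scores, positives={0, 1})  # one positive at the top, one at the bottom

print(good, mixed)
```

Read this way, a test AUC close to 0.5 means the ranking of held-out positives is only slightly better than chance.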


#### Tuning Hybrid Model

In [273]:
#Learning schedule makes the model more robust to hyperparameter choices.
light = LightFM(loss='warp', learning_rate=0.01, learning_schedule='adagrad')
light.fit(train1, item_features= item_features, user_features=user_features)

<lightfm.lightfm.LightFM at 0x1a64bab390>

In [274]:
%%time
train_precision = precision_at_k(light, train1, item_features=item_features, user_features=user_features, k=5).mean()

train_auc = auc_score(light, train1, item_features=item_features, user_features=user_features).mean()

print('Precision: %.2f.' % (train_precision))
print('AUC: %.2f.' % (train_auc))

Precision: 0.80.
AUC: 0.57.
CPU times: user 75.3 ms, sys: 3.3 ms, total: 78.6 ms
Wall time: 80.2 ms


In [270]:
light = LightFM(loss='bpr', learning_rate=0.05, learning_schedule='adagrad')
light.fit(train1, item_features= item_features, user_features=user_features)

<lightfm.lightfm.LightFM at 0x1a64babd30>

In [271]:
%%time
train_precision = precision_at_k(light, train1, item_features=item_features, user_features=user_features, k=5).mean()

train_auc = auc_score(light, train1, item_features=item_features, user_features=user_features).mean()

print('Precision: %.2f.' % (train_precision))
print('AUC: %.2f.' % (train_auc))

Precision: 0.87.
AUC: 0.65.
CPU times: user 77.3 ms, sys: 2.04 ms, total: 79.3 ms
Wall time: 79.1 ms


#### Predictions

In [88]:
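#Predict for an existing user
user_x = user_id_map2[0]
n_users, n_items = interactions.shape # number of users x number of items
#Score every item for this user. Note: the model was fit with user_features and item_features,
#which are not passed here, so these scores may not reflect the metadata features used in training.
light.predict(user_x, np.arange(n_items))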

array([-0.2641093 , -0.33006698, -0.28969663, -0.3159843 , -0.19584833,
       -0.1795793 , -0.24201661, -0.2918671 , -0.19407591, -0.22385195,
       -0.24519959, -0.28380385, -0.2621444 , -0.3742377 , -0.22786663,
       -0.2048486 , -0.26394472, -0.23153996, -0.14257511, -0.24627802,
       -0.21421279, -0.38154045, -0.21380149, -0.24035321, -0.2303197 ,
       -0.21543238, -0.2230413 , -0.23228784, -0.42008924, -0.22233266,
       -0.33869737, -0.18158995, -0.2048643 , -0.2569126 , -0.2594498 ,
       -0.3258415 , -0.33152723, -0.24385454, -0.24276686, -0.22772142,
       -0.2660419 , -0.22541521, -0.32385787, -0.33977932, -0.32179964,
       -0.17454326, -0.32313   , -0.19510908, -0.32545027, -0.28993186,
       -0.273052  , -0.28319412, -0.29132807, -0.17593041, -0.33078107,
       -0.30590665, -0.43644503, -0.262209  , -0.22951174, -0.3425389 ,
       -0.2305027 , -0.30928105, -0.34925374, -0.32993913, -0.23620038,
       -0.26394328, -0.44236156, -0.2780687 , -0.4207214 , -0.25
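
These raw scores are mainly useful for ranking items for a given user; they should not be read as predicted ratings. A sketch of turning such a score vector into a top-k list of hotel clusters, using made-up scores and a hypothetical helper `top_k_items`:

```python
import numpy as np


def top_k_items(scores, k):
    """Indices of the k largest scores, best first."""
    scores = np.asarray(scores)
    # Negate so that argsort's ascending order puts the best score first.
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist()


scores = np.array([-0.26, -0.33, -0.14, -0.31, -0.19])
top3 = top_k_items(scores, 3)
print(top3)
```

The returned positions are internal item indices; in this notebook they coincide with the hotel cluster ids because the item id mapping is the identity.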

#### Being able to predict for NEW users.

In [89]:
user_feature_list = ['user_country:66', 'user_region:174', 'user_city:48862', 'package:1']

In [90]:
def new_user_input(user_feature_map, user_feature_list):
    """Build a 1 x n_features sparse indicator row for a new user."""
    normalised_val = 1.0
    target_indices = []
    for feature in user_feature_list:
        try:
            target_indices.append(user_feature_map[feature])
        except KeyError:
            # Features absent from the feature map have no column, so they are skipped
            print("new user feature encountered '{}'".format(feature))
    new_user_features = np.zeros(len(user_feature_map))
    for i in target_indices:
        new_user_features[i] = normalised_val
    return sparse.csr_matrix(new_user_features)

In [91]:
new_user_features = new_user_input(user_feature_map2, user_feature_list)

In [92]:
new_user_features.todense()

matrix([[0., 0., 0., ..., 0., 1., 0.]])

In [93]:
# With an explicit user_features matrix, user id 0 refers to its first (and only) row
light.predict(0, np.arange(n_items), user_features=new_user_features)

array([-27.64719 , -27.84846 , -27.789427, -27.955019, -27.692997,
       -27.59371 , -27.75037 , -27.845528, -27.578558, -27.689655,
       -27.783152, -27.722939, -27.871927, -28.050377, -27.736067,
       -27.613546, -27.766125, -27.792068, -27.520163, -27.70413 ,
       -27.571182, -28.034004, -27.569946, -27.73175 , -27.705309,
       -27.728773, -27.723295, -27.623236, -28.084944, -27.603584,
       -27.952738, -27.630623, -27.669527, -27.707672, -27.858715,
       -27.979872, -27.903032, -27.794706, -27.647255, -27.590855,
       -27.747269, -27.743156, -27.842402, -28.048119, -27.907158,
       -27.64153 , -27.821224, -27.531546, -28.002157, -27.823479,
       -27.84626 , -27.75831 , -27.932985, -27.65263 , -27.894604,
       -27.825382, -28.114437, -27.66928 , -27.748356, -27.92019 ,
       -27.651892, -27.898592, -27.934505, -27.905355, -27.660353,
       -27.859673, -28.104322, -27.846777, -28.120453, -27.650785,
       -27.775906, -27.65272 , -27.64584 , -27.613794, -27.724
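
LightFM can score a user it has never seen because a user's latent vector is built as the sum of the embeddings of that user's features. Below is a rough NumPy sketch of that scoring rule. All parameters here are random stand-ins rather than values from the fitted model, and the item embeddings are taken as given (in a model trained with item features they would be summed from item feature embeddings in the same way).

```python
import numpy as np

rng = np.random.default_rng(0)
n_user_features, n_items, n_components = 6, 4, 3

# Hypothetical learned parameters; a fitted model would supply these.
user_feature_embeddings = rng.normal(size=(n_user_features, n_components))
user_feature_biases = rng.normal(size=n_user_features)
item_embeddings = rng.normal(size=(n_items, n_components))
item_biases = rng.normal(size=n_items)

# Indicator row for a new user who has features 1 and 4 switched on.
new_user_row = np.zeros(n_user_features)
new_user_row[[1, 4]] = 1.0

# The user's latent vector and bias are sums over the active features.
user_vector = new_user_row @ user_feature_embeddings
user_bias = new_user_row @ user_feature_biases

# One score per item: dot product plus item and user biases.
scores = item_embeddings @ user_vector + item_biases + user_bias
print(scores.shape)
```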

### Ontology Classifier Modeling

An ontology model that predicts the user cluster for a new user.
It is an additional solution to the cold start problem, alongside the LightFM hybrid recommender.

Decision Tree Classifier

In [174]:
decision_df = pd.read_csv('decision_df')

In [175]:
decision_df.drop(columns=['Unnamed: 0'], inplace=True)

In [176]:
decision_df.drop_duplicates(inplace=True)

In [188]:
decision_df.head(5)

Unnamed: 0,cluster,user_country,user_region,user_city,package,mobile
0,136,66,258,45545,0,0
1,472,66,348,48862,1,0
2,782,66,174,44951,1,1
3,821,66,142,17440,1,0
4,515,66,442,37449,0,1


Decision Tree Classification

In [178]:
# Define X (features) and y (target cluster) for the train/test split
X = decision_df.drop(['cluster'],axis=1)
y = decision_df['cluster']

In [179]:
y.shape,X.shape

((91091,), (91091, 5))

In [180]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.33)

In [181]:
#Instantiate classifier
tree = DecisionTreeClassifier(max_depth=6, criterion='gini', min_samples_split=10)

tree.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=6, min_samples_split=10)

In [182]:
tree_ypred = tree.predict(X_test)

In [183]:
#These are the user cluster predictions for the test set.
tree_ypred

array([434,  11, 434, ...,  11, 434,  76])

In [196]:
# Three arbitrary new-user profiles: [user_country, user_region, user_city, package, mobile]
user_prof = [[5, 142, 17440, 1, 0]]
user_prof2 = [[66, 119, 37448, 0, 1]]
user_prof3 = [[66, 258, 45545, 0, 0]]

In [194]:
#For new users.
tree.predict(user_prof)

array([35])

In [192]:
tree.predict(user_prof2)

array([36])

In [197]:
tree.predict(user_prof3)

array([11])

I do not evaluate the classifier here: for the purposes of this system, it is enough that it assigns every new user to some cluster,
since recommendations can then be generated from that cluster. The recommendations themselves are also treated leniently.
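
If a quick sanity check were ever wanted, one lightweight option is to compare test accuracy with a majority-class baseline. The sketch below is only an illustration on made-up labels; it is not a step the notebook runs, and the function name is hypothetical.

```python
import numpy as np

def accuracy_vs_majority(y_true, y_pred):
    """Return (accuracy, accuracy of always predicting the most common class)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    _, counts = np.unique(y_true, return_counts=True)
    majority_baseline = float(counts.max() / counts.sum())
    return accuracy, majority_baseline

# Made-up cluster labels standing in for y_test and tree_ypred.
toy_true = [434, 11, 434, 76, 11, 434]
toy_pred = [434, 11, 434, 11, 11, 76]

acc, base = accuracy_vs_majority(toy_true, toy_pred)
print(acc, base)
```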

The idea of the ontology method is to solve the cold start problem: the classifier assigns a new user to a user cluster, and the Hybrid Model then generates recommendations for that cluster. Keep in mind that this is a solution outside of LightFM, because LightFM itself does not suffer from the cold start problem when it is given user and item features.
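
The two-step flow described above can be sketched as follows. This is a minimal outline under assumptions: `classifier` is any object with a scikit-learn style `predict`, and `cluster_scores` is a hypothetical lookup of per-cluster item scores (for example, scores produced by the hybrid model). A stub classifier stands in for the fitted decision tree so the example runs on its own.

```python
import numpy as np

def recommend_for_profile(profile, classifier, cluster_scores, k=5):
    """Predict a user cluster from a profile, then rank items for that cluster."""
    cluster = int(classifier.predict([profile])[0])
    scores = np.asarray(cluster_scores[cluster])
    return cluster, np.argsort(-scores)[:k].tolist()

class StubClassifier:
    """Stand-in for the fitted decision tree; always answers cluster 11."""
    def predict(self, rows):
        return np.array([11 for _ in rows])

# Hypothetical per-cluster item scores.
cluster_scores = {11: [0.1, 0.9, 0.4, 0.7]}

cluster, items = recommend_for_profile(
    [66, 258, 45545, 0, 0], StubClassifier(), cluster_scores, k=2
)
print(cluster, items)
```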

In conclusion, my recommender system is an aggregate of several algorithms.
- User preferences and item features are vital in recommender systems.
- The hybrid model compensates for the shortcomings of collaborative filtering and content-based filtering used on their own.
- The hybrid model reaches an AUC of 0.65, measured on the training interactions.