<head><h1 align="center">
Food.com Recipe Interactions
</h1></head>  
  
<head><h3 align="center">Recipe Recommender System Modeling</h3></head>

We will develop a **personalized recommender system**. Such systems can gather the data behind their recommendations in different ways:
  
- The first method is to ask for *explicitly* defined ratings from a user regarding the content that they have used/viewed/consumed.  

- Another method is to gather data *implicitly* as the user interacts with the system or service.
  
In this dataset, we have both:  
- **Explicit Data** — the `rating` feature of the **interactions dataset** for each `user`.
- **Implicit Data** — the `x` feature of the **x dataset**.  
  
Technically, I suppose the `review` feature would fall under explicit data as well, but using it as such would require *Natural Language Processing*, which we may get into later, but not in this notebook.  
  
Additionally, with personalized recommender systems, we can center our system either around suggesting similar *content* (recipes) or around suggestions based on other *users of similar preference*. Both systems are succinctly described below:
  
**Content-Based Recommenders**  
> Main Idea: If you like an item, you will also like "similar" items.$_1$  
  
**Collaborative Filtering Systems**  
> Main Idea: If user A likes items 5, 6, 7, and 8 and user B likes items 5, 6, and 7, then it is highly likely that user B will also like item 8.$_1$  
  
We will use `scikit-surprise` for a user-based collaborative filtering model.
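
As a toy illustration of the collaborative filtering idea above (not the model used in this notebook), the sketch below finds the user most similar to user B and suggests what that neighbor liked but B has not. The `likes` dictionary and the Jaccard overlap measure are assumptions chosen for illustration only.

```python
# Toy user-based collaborative filtering on sets of liked items.
# The data and the Jaccard overlap measure are illustrative assumptions.
likes = {
    "A": {5, 6, 7, 8},
    "B": {5, 6, 7},
    "C": {1, 2},
}

def jaccard(a, b):
    """Share of items two users have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

def recommend(user, likes):
    """Suggest items liked by the most similar other user."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return sorted(likes[nearest] - likes[user])

suggestions = recommend("B", likes)
print(suggestions)
```

A real system would weigh many neighbors and use ratings rather than a like/not-like set, which is what the neighborhood and factorization models later in this notebook attempt.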

#### Setup

In [6]:
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np                                
import pandas as pd                               
import matplotlib.pyplot as plt                   
from IPython.display import Image                 
from IPython.display import display               
from time import gmtime, strftime                 
from sagemaker.predictor import csv_serializer   

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its XGBoost container
my_region = boto3.session.Session().region_name # set the region of the instance
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + containers[my_region] + " container for your SageMaker endpoint.")

Success - the MySageMakerInstance is in the us-east-2 region. You will use the 825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest container for your SageMaker endpoint.


In [2]:
%conda install --yes --prefix {sys.prefix} -c conda-forge scikit-surprise

Collecting package metadata (current_repodata.json): done
Solving environment: | 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - conda-forge/noarch::imageio==2.9.0=py_0
  - conda-forge/linux-64::jupyter_server==1.4.1=py36h5fab9bb_0
  - conda-forge/noarch::black==20.8b1=py_1
  - conda-forge/linux-64::bokeh==2.2.3=py36h5fab9bb_0
  - defaults/linux-64::_anaconda_depends==5.1.0=py36_2
  - conda-forge/noarch::pyls-black==0.4.6=pyh9f0ad1d_0
  - conda-forge/noarch::aiobotocore==1.2.1=pyhd8ed1ab_0
  - conda-forge/noarch::pyls-spyder==0.3.2=pyhd8ed1ab_0
  - conda-forge/linux-64::anyio==2.1.0=py36h5fab9bb_0
  - conda-forge/noarch::jupyterlab_server==2.3.0=pyhd8ed1ab_0
  - conda-forge/linux-64::matplotlib-base==3.3.4=py36hd391965_0
  - conda-forge/linux-64::spyder==4.2.0=py36h5fab9bb_0
  - conda-forge/noarch::python-language-server==0.36.2=pyhd8ed1ab_0
  - conda-forge/noarch::seaborn-base==0.11.1=pyhd8ed1ab_1
  -

In [7]:
from surprise.similarities import cosine, msd, pearson # for Memory-based Methods (Neighborhood-based)
from surprise.prediction_algorithms import SVD, knns, KNNWithMeans, KNNBasic, KNNBaseline, KNNWithZScore, CoClustering, BaselineOnly, NormalPredictor, NMF, SVDpp, SlopeOne
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split
from surprise import Reader, Dataset, accuracy, dump

In [8]:
# Additional import statements are my own, in case I would like to copy the above tutorial statement to another notebook.
import time
import csv
import json
import pickle
import seaborn as sns
from os import system
from math import floor
from copy import deepcopy
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import cdist
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 200)

In [9]:
bucket = 'sagemaker-studio-t1ems8mtnoj'
subfolder = ''
s3 = boto3.client('s3')
contents = s3.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
# for f in contents:
#     print(f['Key'])

# Data Import & Preparation

In [7]:
rdf = pd.read_csv('data/RAW_recipes.csv')
idf = pd.read_csv('data/RAW_interactions.csv')

pp_rdf = pd.read_csv('data/PP_recipes.csv')
pp_idf = pd.read_csv('data/PP_users.csv')

ingID = pd.read_pickle('data/ingr_map.pkl')

tdf = pd.read_csv('data/interactions_train.csv')
vdf = pd.read_csv('data/interactions_validation.csv')

# """Cleaning: All steps carried over from EDA Notebook"""
rdf.drop(labels=721, inplace = True)

# Cleaning/FE: Creating columns for recipe's respective nutrients
rdf['kcal'] = rdf.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[0])
rdf['fat'] = rdf.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[1])
rdf['sugar'] = rdf.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[2])
rdf['salt'] = rdf.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[3])
rdf['protein'] = rdf.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[4])
rdf['sat_fat'] = rdf.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[5])
rdf['carbs'] = rdf.nutrition.apply(lambda x: x[1:-1].split(sep=', ')[6])

# Cleaning: Imputing outlier value to median
rdf['minutes'] = np.where(rdf.minutes == 2147483647,
                         rdf.minutes.median(),
                         rdf.minutes)

idf['date'] = pd.to_datetime(idf.date)
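
The nutrient columns created above hold string slices of the `nutrition` text rather than numbers. If numeric values are wanted, one option is to parse the list literal once per row. The sketch below assumes each entry is a bracketed, comma-separated list of seven numbers (which is what the slicing above implies); the `sample` value is hypothetical.

```python
import ast

# Hypothetical sample value; the real column holds one such string per recipe.
sample = "[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]"

# Column order follows the indices used in the cell above.
names = ["kcal", "fat", "sugar", "salt", "protein", "sat_fat", "carbs"]

# literal_eval safely turns the string into a Python list of numbers.
values = ast.literal_eval(sample)
nutrients = dict(zip(names, values))
print(nutrients)
```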

# Model-based Collaborative Filtering with `scikit-Surprise`

In [8]:
idf.columns

Index(['user_id', 'recipe_id', 'date', 'rating', 'review'], dtype='object')

### Data Preparation

In [10]:
idf = pd.read_csv('data/RAW_interactions.csv')
reader_zero = Reader(rating_scale=(0, 5))
sidf_zero = Dataset.load_from_df(idf[['user_id', 'recipe_id', 'rating']], reader_zero)
cross_validate(BaselineOnly(bsl_options={'method':'als'}), sidf_zero, cv=5, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2153  1.2091  1.2194  1.2092  1.2117  1.2129  0.0039  
MAE (testset)     0.7393  0.7355  0.7402  0.7351  0.7371  0.7374  0.0020  
Fit time          5.92    5.87    6.12    6.00    6.12    6.01    0.10    
Test time         2.03    1.95    1.96    1.95    1.94    1.97    0.03    


{'test_rmse': array([1.21532947, 1.20913504, 1.21939506, 1.20921464, 1.21166491]),
 'test_mae': array([0.7392681 , 0.7355139 , 0.74016826, 0.73511423, 0.73708935]),
 'fit_time': (5.917161226272583,
  5.873985052108765,
  6.123218536376953,
  6.003076076507568,
  6.116864442825317),
 'test_time': (2.0266876220703125,
  1.9502432346343994,
  1.9635095596313477,
  1.9478819370269775,
  1.9431097507476807)}

In [9]:
# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

In [10]:
# The columns must correspond to user id, item id and ratings (in that order).
sidf = Dataset.load_from_df(idf[['user_id', 'recipe_id', 'rating']], reader)

In [11]:
type(sidf)

surprise.dataset.DatasetAutoFolds

### Prediction Algorithm Cross Validation Splits

**Baseline Model with ALS**  
Our first model, with 5 cross-validation folds and all other hyperparameters left at their defaults, produced a mean RMSE of **1.2127** on predicted ratings. I would be interested in seeing how switching the baseline estimator's method to Stochastic Gradient Descent changes outcomes, as well as changing the number of cross-validation folds.
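
For readers unfamiliar with the two metrics reported by `cross_validate`, here is a minimal sketch of how RMSE and MAE are computed from actual and predicted ratings. The four ratings are made-up values for illustration; in the notebook, Surprise computes these per fold.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ratings and predictions.
actual = [5, 4, 1, 5]
predicted = [4.5, 4.0, 3.0, 4.5]

print(round(rmse(actual, predicted), 4), mae(actual, predicted))
```

The single two-star miss dominates the RMSE, which is one reason RMSE sits well above MAE in the fold tables throughout this notebook.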

In [12]:
cross_validate(BaselineOnly(bsl_options={'method':'als'}), sidf, cv=5, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2074  1.2152  1.2161  1.2100  1.2150  1.2127  0.0034  
MAE (testset)     0.7342  0.7374  0.7389  0.7372  0.7383  0.7372  0.0016  
Fit time          5.24    5.96    6.01    5.92    6.05    5.84    0.30    
Test time         1.92    1.93    1.94    2.44    1.41    1.93    0.33    


{'test_rmse': array([1.20743482, 1.21524522, 1.21607505, 1.21002917, 1.21495625]),
 'test_mae': array([0.7341978 , 0.73737237, 0.7389167 , 0.73718265, 0.73831716]),
 'fit_time': (5.243390083312988,
  5.9565110206604,
  6.009812355041504,
  5.9188947677612305,
  6.051447153091431),
 'test_time': (1.9163506031036377,
  1.9299907684326172,
  1.9351892471313477,
  2.441502809524536,
  1.4138259887695312)}

**Stochastic Gradient Descent**  
This model differed only marginally from the previous baseline ALS model. With a mean RMSE across the 5 folds of **1.2126**, it performed *slightly better* (by only **.0001!**). However, that marginal improvement doesn't take fit time into consideration: in the runs shown here, the ALS model's mean fit time per fold (5.84 s) was about 44% shorter than the SGD model's (10.44 s).

In [12]:
6.56/10.44

0.6283524904214559

In [22]:
cross_validate(BaselineOnly(bsl_options={'method':'sgd'}), sidf, cv=5, verbose=True)

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2110  1.2179  1.2107  1.2116  1.2115  1.2126  0.0027  
MAE (testset)     0.7347  0.7374  0.7345  0.7344  0.7337  0.7349  0.0013  
Fit time          9.78    10.36   10.90   10.66   10.51   10.44   0.38    
Test time         2.51    1.47    1.51    1.49    2.04    1.80    0.41    


{'test_rmse': array([1.21103128, 1.21793577, 1.21072264, 1.21160403, 1.21147109]),
 'test_mae': array([0.73465602, 0.73739129, 0.73451846, 0.7343926 , 0.73368812]),
 'fit_time': (9.78078579902649,
  10.355447053909302,
  10.901717185974121,
  10.656184911727905,
  10.510430335998535),
 'test_time': (2.512019157409668,
  1.472841739654541,
  1.5068244934082031,
  1.4883172512054443,
  2.0438711643218994)}

**ALS with 25 Cross Validation Folds**  
As the dataset is so large, I was curious to see how increasing the number of cross-validation folds affected performance. Although the output is difficult to read, the model performed slightly better, with a mean RMSE across all folds of **1.2099** and a higher standard deviation of 0.0070.

In [None]:
cross_validate(BaselineOnly(bsl_options={'method':'als'}), sidf, cv=25, verbose=False)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


**CoClustering**  
While this algorithm produced performance metrics not very far off from our other models, it performed worse all around—mean across folds, variance across folds, and fit time for each fold.

In [None]:
cross_validate(CoClustering(random_state=1984), sidf, cv=5, verbose=True, n_jobs=-1)

**SVD**  
This model performed with a mean RMSE of **1.2201** and a standard deviation of 0.0053 across all folds, but the fit time was far higher than for the previous models (about 55.5 s per fold, versus roughly 6 s for the ALS baseline). There are also many hyperparameters that may be tuned. Despite the much longer training time, let's see how hyperparameter tuning with **GridSearchCV** changes performance.  
  
Note that this model was run with the default hyperparameters of Surprise's SVD algorithm:  
- `n_factors`: **100**
- `n_epochs`: **20**
- `lr_all`: **0.005** – The learning rate for all parameters.
- `reg_all`: **0.02** – The regularization term for all parameters.
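
For reference, this is how the hyperparameters above enter the model, following the formulation in the Surprise documentation for `SVD`. The predicted rating for user $u$ and item $i$ is

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u$$

where $\mu$ is the global mean rating, $b_u$ and $b_i$ are user and item biases, and $p_u$ and $q_i$ are factor vectors of length `n_factors`. The parameters are fit by stochastic gradient descent over `n_epochs` passes with learning rate `lr_all`, minimizing the regularized squared error

$$\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)$$

with $\lambda$ given by `reg_all`.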

In [42]:
# We can now use this dataset as we please with Surprise's utilities, e.g. calling cross_validate
cross_validate(SVD(random_state=1984), sidf, cv=5, verbose=True, n_jobs=-1)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2253  1.2232  1.2208  1.2100  1.2214  1.2201  0.0053  
MAE (testset)     0.7413  0.7409  0.7393  0.7344  0.7399  0.7391  0.0025  
Fit time          55.76   55.60   55.61   55.55   55.14   55.53   0.21    
Test time         2.25    2.23    2.36    2.22    2.16    2.25    0.06    


{'test_rmse': array([1.22526615, 1.22324572, 1.22075873, 1.20997109, 1.22137993]),
 'test_mae': array([0.74128305, 0.74086325, 0.7393137 , 0.73437507, 0.73991007]),
 'fit_time': (55.75669455528259,
  55.60287833213806,
  55.61086654663086,
  55.54658079147339,
  55.135498046875),
 'test_time': (2.2547242641448975,
  2.233457088470459,
  2.358004093170166,
  2.2175631523132324,
  2.1614022254943848)}

### SVD Grid Search Cross Validation Split

Here I performed GridSearchCV on the SVD algorithm to see if it can be optimized further.  
After performing grid search cross validation, the optimum hyperparameters for SVD were the following:
- `'n_factors'`: **20** 
- `'n_epochs'`: **10** 
- `'lr_all'`: **0.01** 
- `'reg_all'`: **0.05**  
  
With the optimized hyperparameters, SVD did <u>**NOT perform better than my baseline models**</u>, with an RMSE of <u>**1.2142**</u>. Note that the optimum `n_factors` was the minimum value the grid search was instructed to evaluate, while `n_epochs`, `lr_all`, and `reg_all` were the maximum values. Because every optimum sits on an edge of its range, a follow-up grid search over values adjacent to those above could evaluate this further.
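
To make the cost and the edge-of-range observation concrete, the sketch below enumerates the same grid with the standard library and checks where each reported best value falls. It does not call Surprise; the `best` dictionary is copied from the `params` output in this section.

```python
from itertools import product

# Same grid as the search in this section.
param_grid = {'n_factors': [20, 50], 'n_epochs': [5, 10],
              'lr_all': [0.005, 0.01], 'reg_all': [0.002, 0.05]}
cv = 5

# Every combination of one value per hyperparameter.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(combos) * cv
print(len(combos), "combinations,", n_fits, "model fits")

# Which end of its range does each reported best value sit on?
best = {'n_factors': 20, 'n_epochs': 10, 'lr_all': 0.01, 'reg_all': 0.05}
edges = {k: ("min" if v == min(param_grid[k]) else "max") for k, v in best.items()}
print(edges)
```

With only two candidate values per hyperparameter, every optimum is necessarily at an edge, so widening the ranges is the natural next experiment.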

In [13]:
param_grid = {'n_factors':[20, 50],'n_epochs':[5, 10],  'lr_all':[0.005,0.01], 'reg_all':[0.002,0.05]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
gs.fit(sidf)
params = gs.best_params['rmse']
gs_svd = SVD(n_factors=params['n_factors'], n_epochs=params['n_epochs'], lr_all=params['lr_all'], reg_all=params['reg_all'], verbose=True, random_state=1984)

In [41]:
cross_validate(gs_svd, sidf, cv=5, verbose=True, n_jobs=-1)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2144  1.2081  1.2217  1.2184  1.2083  1.2142  0.0054  
MAE (testset)     0.7393  0.7367  0.7427  0.7424  0.7375  0.7397  0.0025  
Fit time          11.14   11.10   11.17   11.25   10.97   11.13   0.09    
Test time         1.98    2.05    2.03    2.05    1.96    2.02    0.04    


{'test_rmse': array([1.21435208, 1.20805289, 1.2216954 , 1.21839995, 1.20829103]),
 'test_mae': array([0.7393403 , 0.73665142, 0.74265809, 0.7424117 , 0.73749491]),
 'fit_time': (11.142059564590454,
  11.095774412155151,
  11.170443296432495,
  11.254150390625,
  10.971884965896606),
 'test_time': (1.9834518432617188,
  2.0528981685638428,
  2.032770872116089,
  2.0463125705718994,
  1.9609873294830322)}

In [21]:
params

{'n_factors': 20, 'n_epochs': 10, 'lr_all': 0.01, 'reg_all': 0.05}

In [28]:
type(gs_svd)

surprise.prediction_algorithms.matrix_factorization.SVD

# Model Serialization

#### Using Surprise's `dump` module:

In [38]:
trainset = sidf.build_full_trainset()

algo = gs_svd
algo.fit(trainset)

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Here we use scikit-surprise's dump module to save the optimized algorithm state.
file_name = 'models/serialized_optimum_svd_model'
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name) # ... and here we reload the file...

# and now we ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print('Predictions are the same')

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9


#### Using Pickle

In [None]:
# The context manager closes the file even if pickling fails.
with open("optimized_svd_model.pickle", "wb") as pickle_out:
    pickle.dump(gs_svd, pickle_out)

# Model Parallelism

While `scikit-surprise` is not inherently equipped to run computations on distributed cloud clusters, AWS allows for model parallelism. 

## Collaborative Filtering Models

**KNN with Means**  
- First I will run KNN with Means on user-user similarity, and will evaluate item-item similarity predictions as well. 
- I use the **Pearson** correlation coefficient as my similarity metric, as it best accounts for ratings of `0` not actually representing values less than ratings of `1`. 
- Additionally, I increased the number of nearest neighbors `k` to 350, from the default of 40. With a dataset so large, it would be a shame not to consider many users. I could increase this further in later iterations.  
- Due to memory restrictions, I've since reduced `k` to see if we can get the algorithm to function.
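
To show what the Pearson metric measures, here is a minimal NumPy sketch that correlates two users over the items both have rated, treating `0` as a gap as described above. The rating vectors are made up, and this is an illustration of the idea rather than Surprise's own `pearson` implementation, which may differ in detail.

```python
import numpy as np

def pearson_corated(u, v, missing=0):
    """Pearson correlation over items both users rated; `missing` marks gaps."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    both = (u != missing) & (v != missing)
    if both.sum() < 2:
        return 0.0  # not enough overlap to measure agreement
    du = u[both] - u[both].mean()
    dv = v[both] - v[both].mean()
    denom = np.sqrt((du ** 2).sum() * (dv ** 2).sum())
    return float((du * dv).sum() / denom) if denom > 0 else 0.0

# Hypothetical ratings for five recipes; 0 means "not rated".
user_a = [5, 3, 0, 1, 4]
user_b = [4, 2, 5, 1, 0]
sim = pearson_corated(user_a, user_b)
print(round(sim, 3))
```

Because each user's ratings are centered on that user's own mean, a generous rater and a strict rater who rank recipes the same way still come out as highly similar.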

In [None]:
cross_validate(KNNWithMeans(k=10, sim_options={'name':'pearson', 'user_based':True}), sidf_zero, measures=['rmse'], cv=5, verbose=True)

In [None]:
cross_validate(KNNWithMeans(k=10, sim_options={'name':'pearson', 'user_based':True}), sidf, measures=['rmse'], cv=5, verbose=True)

In [None]:
# param_grid = {'n_factors':[20, 50],'n_epochs':[5, 10],  'lr_all':[0.005,0.01], 'reg_all':[0.002,0.05]}
# gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
# gs.fit(sidf)
# params = gs.best_params['rmse']
# gs_svd = SVD(n_factors=params['n_factors'], n_epochs=params['n_epochs'], lr_all=params['lr_all'], reg_all=params['reg_all'], verbose=True, random_state=1984)

**SVD Parameters** (for reference; these are the parameters of Surprise's `SVD`, not of `KNNWithMeans`):
- `n_factors`: **100** (Default) – The number of factors. 
- `n_epochs`: **20** (Default) – The number of iterations of the SGD procedure. 
- `biased`: **True** (Default) (bool) – Whether to use baselines (or biases). 
- `init_mean`: **0** (Default) – The mean of the normal distribution for factor vectors initialization. 
- `init_std_dev`: **0.1** (Default) – The standard deviation of the normal distribution for factor vectors initialization. 
- `lr_all`: **0.005** (Default) – The learning rate for all parameters.
- `reg_all`: **0.02** (Default) – The regularization term for all parameters. 
- `lr_bu`: **None** (Default) – The learning rate for 𝑏𝑢. Takes precedence over lr_all if set. 
- `lr_bi`: **None** (Default) – The learning rate for 𝑏𝑖. Takes precedence over lr_all if set. 
- `lr_pu`: **None** (Default) – The learning rate for 𝑝𝑢. Takes precedence over lr_all if set. 
- `lr_qi`: **None** (Default) – The learning rate for 𝑞𝑖. Takes precedence over lr_all if set. 
- `reg_bu`: **None** (Default) – The regularization term for 𝑏𝑢. Takes precedence over reg_all if set. 
- `reg_bi`: **None** (Default) – The regularization term for 𝑏𝑖. Takes precedence over reg_all if set. 
- `reg_pu`: **None** (Default) – The regularization term for 𝑝𝑢. Takes precedence over reg_all if set. 
- `reg_qi`: **None** (Default) – The regularization term for 𝑞𝑖. Takes precedence over reg_all if set. 
- `random_state`: **None** (Default) (int, RandomState instance from numpy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
- `verbose`: **False** (Default) – If True, prints the current epoch. 

In [None]:
cross_validate(KNNWithMeans(k=100, sim_options={'name':'pearson', 'user_based':True}), sidf, measures=['rmse'], cv=5, verbose=True)

In [13]:
cross_validate(KNNWithMeans(k=350, sim_options={'name':'pearson', 'user_based':True}), sidf, measures=['rmse'], cv=50, verbose=True)

Computing the pearson similarity matrix...


MemoryError: Unable to allocate 371. GiB for an array with shape (223246, 223246) and data type int64
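
The size in this error follows directly from the shape: a dense user-by-user matrix needs one value per pair of users. The short sketch below reproduces the arithmetic; `dense_gib` is a hypothetical helper written for this note, not part of any library.

```python
def dense_gib(rows, cols, bytes_per_value):
    """Memory needed for a dense rows x cols array, in GiB (2**30 bytes)."""
    return rows * cols * bytes_per_value / 2**30

# 223,246 x 223,246 matrix of 8-byte values, as in the error above.
needed = dense_gib(223246, 223246, 8)
print(f"{needed:.0f} GiB")
```

Changing `k` does not help here, because the full similarity matrix is built before any neighbors are selected; only reducing the number of users (or items) being compared shrinks it.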

In [17]:
# param_grid = {'n_factors':[50,100,150],'n_epochs':[20,30],  'lr_all':[0.005,0.01],'reg_all':[0.02,0.1]}
# gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5, joblib_verbose=1)
# gs.fit(sidf)
# params = gs.best_params['rmse']
# svdtuned = SVD(n_factors=params['n_factors'], n_epochs=params['n_epochs'],lr_all=params['lr_all'], reg_all=params['reg_all'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


KeyboardInterrupt: 

In [47]:
print('Number of users: ', sidf.df.user_id.value_counts().index.shape[0], '\n')
print('Number of recipes: ', sidf.df.recipe_id.value_counts().index.shape[0], '\n')


Number of users:  226570 

Number of recipes:  231637 



In [40]:
# data.split(n_folds=3)
# trainset, testset = train_test_split(sidf, test_size=0.2)

In [None]:
# vector = CountVectorizer()

NameError: name 'train_test_split' is not defined

In [None]:
# param_grid = {'n_factors':[20, 100],'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
#               'reg_all': [0.4, 0.6]}
# gs_model = GridSearchCV(SVD,param_grid=param_grid,n_jobs = -1,joblib_verbose=5)
# gs_model.fit(jokes)

# svd = SVD(n_factors=100, n_epochs=10, lr_all=0.005, reg_all=0.4)
# svd.fit(trainset)
# predictions = svd.test(testset)
# print(accuracy.rmse(predictions))

## Memory-based

In [49]:
trainset, testset = train_test_split(sidf, test_size = 0.2, random_state = 1984)

In [50]:
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items, '\n')

Number of users:  192093 

Number of items:  211233 



In [53]:
sim_cos = {'name':'cosine', 'user_based':True}

In [54]:
basic = knns.KNNBasic(sim_options=sim_cos)
basic.fit(trainset)

Computing the cosine similarity matrix...


MemoryError: Unable to allocate 332. GiB for an array with shape (211233, 211233) and data type float64

### Collaborative Filtering — User-based Recommender System

>- Collaborative Filtering (CF) is currently the most widely used approach to build recommendation systems  
>- The key idea behind CF is that similar users have similar interests and that a user generally likes items that are similar to other items they like  
>- CF is filling an "empty cell" in the utility matrix based on the similarity between users or item. Matrix factorization or decomposition can help us solve this problem by determining what the overall "topics" are when a matrix is factored
  
Note: We will likely need to bring this notebook over to Databricks or AWS.  
  
We can use **cosine similarity** between users, or **Pearson Correlation Coefficient**. Different metrics will offer different results, but I don't see why we couldn't offer multiple recommendations using different metrics.

**Note: Don't forget to downplay zeros as they are gaps, not values representative of sentiment!**

### SVD — Singular Value (Matrix) Decomposition

Notes:  
- eigendecompositions require square matrices
- for SVD, we can create square matrices by taking the dot product of a matrix and its transpose
- we can use SVD to draw vectors in a new space to capture as much of the variance in our data as possible

~~Below we use the raw interactions dataset to create the rating matrix `A`, with rows representing recipes and columns representing users.~~  
Never mind: it looks like we will need distributed computing for even a simple SVD, since the dense matrix does not fit in memory (see the `MemoryError` below).



In [14]:
A = np.ndarray(
    shape=(np.max(idf.recipe_id.values), np.max(idf.index.values)),
    dtype=np.uint8)

A[idf.recipe_id.values-1, idf.index.values-1] = idf.rating.values

MemoryError: Unable to allocate 567. GiB for an array with shape (537716, 1132366) and data type uint8
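
Much of that allocation is for cells that can never be filled, because the array is sized by the largest raw id rather than by the number of distinct recipes and users. One possible way around this, sketched below with NumPy only, is to map raw ids to contiguous positions and keep just the observed (row, column, value) triplets. The ids and ratings here are hypothetical stand-ins for `idf.user_id`, `idf.recipe_id`, and `idf.rating`.

```python
import numpy as np

# Hypothetical interactions: raw ids are large and sparse, like real user/recipe ids.
user_ids = np.array([38094, 1293707, 38094, 8937, 1293707])
recipe_ids = np.array([40893, 40893, 85009, 85009, 120345])
ratings = np.array([4, 5, 5, 0, 3], dtype=np.uint8)

# Map raw ids to contiguous 0..n-1 positions so no row or column goes unused.
users, col = np.unique(user_ids, return_inverse=True)
recipes, row = np.unique(recipe_ids, return_inverse=True)

# (row, col, value) triplets describe the matrix without storing its empty cells.
shape = (len(recipes), len(users))
density = len(ratings) / (shape[0] * shape[1])
print(shape, f"{density:.0%} of cells filled")
```

Triplets in this form are what `scipy.sparse.coo_matrix((ratings, (row, col)), shape=shape)` accepts, should a sparse matrix object be needed later.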

### ALS — Alternating Least Squares Matrix Decomposition

- Plug in guess values for P and Q.
- Hold one of them constant (say P), then use the values of R and the constant factor to find the optimum values for the other (Q).
- Repeat with the roles swapped, and keep alternating until the values stop changing.
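
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the notebook's model: the toy matrix `R`, the rank `k`, the ridge term `reg`, and the choice to treat zeros as gaps are all assumptions made here. Note also that the `als` option of Surprise's `BaselineOnly` estimates user and item biases rather than factor matrices.

```python
import numpy as np

def als(R, mask, k=2, reg=0.1, n_iters=20, seed=0):
    """Factor R ~ P @ Q.T using only the observed cells flagged in `mask`."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))   # guess values for P
    Q = rng.normal(scale=0.1, size=(n_items, k))   # guess values for Q
    ridge = reg * np.eye(k)
    for _ in range(n_iters):
        # Hold Q constant: each user's row of P is a small least-squares solve.
        for u in range(n_users):
            seen = mask[u]
            if seen.any():
                Qs = Q[seen]
                P[u] = np.linalg.solve(Qs.T @ Qs + ridge, Qs.T @ R[u, seen])
        # Hold P constant: repeat for each item's row of Q.
        for i in range(n_items):
            seen = mask[:, i]
            if seen.any():
                Ps = P[seen]
                Q[i] = np.linalg.solve(Ps.T @ Ps + ridge, Ps.T @ R[seen, i])
    return P, Q

# Toy users-by-recipes ratings; zeros are treated as gaps, not ratings.
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
P, Q = als(R, mask)
error = np.sqrt(np.mean((R - P @ Q.T)[mask] ** 2))
print(f"RMSE on observed cells: {error:.3f}")
```

Each inner solve involves only a `k` by `k` system, which is why ALS parallelizes well across users and items on larger datasets.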

> "When we talk about collaborative filtering for recommender systems we want to solve the problem of our original matrix having millions of different dimensions, but our 'tastes' not being nearly as complex. Even if i’ve \[sic\] viewed hundreds of items they might just express a couple of different tastes. Here we can actually use matrix factorization to mathematically reduce the dimensionality of our original 'all users by all items' matrix into something much smaller that represents 'all items by some taste dimensions' and 'all users by some taste dimensions'. These dimensions are called ***latent or hidden features*** and we learn them from our data" ([Medium article: "ALS Implicit Collaborative Filtering"](https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe)).

For collaborative ALS, we will want our data to be shaped something like below, where the column numbers represent different recipes.  

Recipe->| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |...| n  
:-------|:--|---|---|---|---|---|---|---|---|---|---|---|---|---|---:  
user_01 | 0 | 0 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0  
user_02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  
user_03 | 5 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 0 | 0  
user_04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  
user_05 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0  
user_06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  
user_07 | 0 | 0 | 5 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0  
  ...   
user_n  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0  


# Evaluation

# Reference

## Reading in Big Data from S3

If the above read_csv statements don't work, it's likely because the larger files are not located in the /data folder. To add them to the folder, do so by accessing the S3 bucket where they are located, using the statements below.

In [19]:
# My Bucket ARN Access Key, just in case
arn = 'arn:aws:s3:us-east-2:133716175259:accesspoint/recipe-book-GitHub-t2-exec-to-s3'

In [17]:
# try:
#     urllib.request.urlretrieve ('https://sagemaker-studio-t1ems8mtnoj.s3.us-east-2.amazonaws.com/RAW_interactions.csv', 'data/RAW_interactions.csv')
#     print('Success: downloaded RAW_interactions.csv.')
# except Exception as e:
#     print('Data load error: ',e)

# try:
#     idf = pd.read_csv('data/RAW_interactions.csv',index_col=0)
#     print('Success: Data loaded into dataframe.')
# except Exception as e:
#     print('Data load error: ',e)

In [18]:
# try:
#     urllib.request.urlretrieve ('https://sagemaker-studio-t1ems8mtnoj.s3.us-east-2.amazonaws.com/RAW_recipes.csv', 'data/RAW_recipes.csv')
#     print('Success: downloaded RAW_recipes.csv.')
# except Exception as e:
#     print('Data load error: ',e)

# try:
#     rdf = pd.read_csv('data/RAW_recipes.csv',index_col=0)
#     print('Success: Data loaded into dataframe.')
# except Exception as e:
#     print('Data load error: ',e)

Reference:  
>(*Both personalized and content-based recommendation systems*) make use of different similarity metrics to determine how "similar" items are to one another. The most common similarity metrics are [**Euclidean distance**](https://en.wikipedia.org/wiki/Euclidean_distance), [**cosine similarity**](https://en.wikipedia.org/wiki/Cosine_similarity), [**Pearson correlation**](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and the [**Jaccard index**](https://en.wikipedia.org/wiki/Jaccard_index) (useful with binary data). Each one of these distance metrics has its advantages and disadvantages depending on the type of ratings you are using and the characteristics of your data.$_1$  
  

> The second category covers the Model based approaches, which involve a step to reduce or compress the large but sparse user-item matrix. For understanding this step, a basic understanding of dimensionality reduction can be very helpful.

>It's worth pointing out that when SVD is calculated for recommendation systems, it is preferred to be done with a modified version called "Funk's SVD" that only takes into account the rated values, ignoring whatever items have not been rated by users. The algorithm is named after Simon Funk, who was part of the team who placed 3rd in the Netflix challenge with this innovative way of performing matrix decomposition. Read more about Funk's SVD implementation at his original blog post. There is no simple way to include for this fact with SciPy's implementation of svd(), but luckily the surprise library has Funk's version of SVD implemented to make our lives easier!
>
>Similar to other sklearn features, we can expedite the process of trying out different parameters by using an implementation of grid search. Let's make use of the grid search here to account for some different configurations of parameters within the SVD pipeline. This might take some time! You'll notice that the n_jobs is parameter set to -1, which ensures that all of the cores on your computer will be used to process fitting and evaluating all of these models. To help keep track of what is occurring here, take note of the different values. This code ended up taking over 16 minutes to complete even with parallelization in effect, so the optimal parameters are given to you for the SVD model below. Use them to train a model and let's see how well it performs. If you want the full grid search experience, feel free to uncomment the code and give it a go!

### Addtl. Package Installs & Import Statements

#### Additional Rec. System Packages:

- implicit
- lightfm
- pyspark.mllib.recommendation
- Amazon Personalize
- crab
- suggest

In [12]:
# %conda install --yes --prefix {sys.prefix} -c conda-forge implicit

In [13]:
# %conda install --yes --prefix {sys.prefix} -c conda-forge lightfm

#### Unused Packages Import Statements

In [14]:
## PYSPARK MODELING NOW IN SEPARATE DOCUMENT

# import pyspark
# import pyspark.sql.functions as F
# from pyspark.sql.types import ArrayType, IntegerType

## Further Reading

For those of you visiting this page who are interested in reading more about recommender systems, below are fantastic resources that I have collected.  
$_n$$_o$$_w$ &nbsp;$_I$ &nbsp;$_a$$_m$ &nbsp;$_t$$_h$$_e$ &nbsp;$_r$$_e$$_c$$_o$$_m$$_m$$_e$$_n$$_d$$_a$$_t$$_i$$_o$$_n$ &nbsp;$_s$$_y$$_s$$_t$$_e$$_m$$_!$

[*Mining Massive Datasets: Chapter 9*](http://infolab.stanford.edu/~ullman/mmds/ch9.pdf), © Copyright Stanford University. Stanford, California 94305  
[Singular Value Decomposition (SVD) & Its Application In Recommender System](https://analyticsindiamag.com/singular-value-decomposition-svd-application-recommender-system/), Dr. Vaibhav Kumar

# Bibliography

1. [*Introduction to Recommender Systems* by Flatiron School](https://github.com/learn-co-curriculum/dsc-recommendation-system-introduction) is licensed under CC BY-NC-SA 4.0
>© 2018 Flatiron School, Inc.  
>© 2021 Flatiron School, LLC

# Notes

In [11]:
contents[0]

{'Key': 'RAW_interactions.csv',
 'LastModified': datetime.datetime(2021, 4, 29, 17, 32, 1, tzinfo=tzlocal()),
 'ETag': '"b76f667f757a6e0b6d3b08d23e0bc5e0-21"',
 'Size': 349436524,
 'StorageClass': 'STANDARD',
 'Owner': {'ID': 'ee74dd80de8b382fe7543b22cf2221ca7e68c76f28d55bec19b96bbe32873a4c'}}

In [12]:
contents[0]['Key']

'RAW_interactions.csv'

In [None]:
# import boto3
# import sys

# if sys.version_info[0] < 3: 
#     from StringIO import StringIO # Python 2.x
# else:
#     from io import StringIO # Python 3.x

# # get your credentials from environment variables
# aws_id = os.environ['133716175259']
# aws_secret = os.environ['AWS_SECRET']

# client = boto3.client('s3', aws_access_key_id=aws_id,
#         aws_secret_access_key=aws_secret)

# bucket_name = 'my_bucket'

# object_key = 'my_file.csv'
# csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
# body = csv_obj['Body']
# csv_string = body.read().decode('utf-8')

# df = pd.read_csv(StringIO(csv_string))