<head><h1 align="center">
Food.com Recipe Interactions
</h1></head>  
  
<head><h3 align="center">Recipe Recommender System Modeling</h3></head>

We will develop a **personalized recommender system**. Such systems can collect the preference data they learn from in two ways:
  
- The first method is to ask a user for *explicit* ratings of the content that they have used, viewed, or consumed.  

- Another method is to gather data *implicitly* as the user interacts with the system or service.
  
In this dataset, we have both:  
- **Explicit Data** — the `rating` feature of the **interactions dataset** for each `user`.
- **Implicit Data** — the `x` feature of the **x dataset**.  
  
Technically, the `review` feature would fall under explicit data as well, but using it as such would require *Natural Language Processing*, which we may explore later, but not in this notebook.  
  
Additionally, with personalized recommender systems, we can center our system either on suggesting similar *content* (recipes) or on suggestions based on other *users of similar preference*. Both approaches are succinctly described below:
  
**Content-Based Recommenders**  
> Main Idea: If you like an item, you will also like "similar" items.$_1$  
  
**Collaborative Filtering Systems**  
> Main Idea: If user A likes items 5, 6, 7, and 8 and user B likes items 5, 6, and 7, then it is highly likely that user B will also like item 8.$_1$  
  
We will use `scikit-surprise` for a user-based collaborative filtering model.
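
To make the collaborative filtering idea above concrete, here is a minimal sketch on made-up data. The `likes` dictionary, the `jaccard` helper, and the `recommend` helper are all hypothetical names introduced for illustration; they are not part of `scikit-surprise`, and a real model would work from ratings rather than plain sets of liked items.

```python
# Toy illustration of user-based collaborative filtering (made-up data).
likes = {
    "A": {5, 6, 7, 8},
    "B": {5, 6, 7},
    "C": {1, 2},
}

def jaccard(a, b):
    """Share of items two users have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

def recommend(user, likes):
    """Suggest unseen items liked by the single most similar other user."""
    others = [u for u in likes if u != user]
    nearest = max(others, key=lambda u: jaccard(likes[user], likes[u]))
    return nearest, sorted(likes[nearest] - likes[user])

neighbor, suggestions = recommend("B", likes)
print(neighbor, suggestions)
```

User B overlaps most with user A, so the only item A likes that B has not seen (item 8) becomes the suggestion.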

#### Setup

In [1]:
# import libraries
import boto3, re, sys, math, json, os, sagemaker, urllib.request
from sagemaker import get_execution_role
import numpy as np                                
import pandas as pd                               
import matplotlib.pyplot as plt                   
from IPython.display import Image                 
from IPython.display import display               
from time import gmtime, strftime                 
from sagemaker.predictor import csv_serializer   

# Define IAM role
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/xgboost:latest'} # each region has its XGBoost container
my_region = boto3.session.Session().region_name # set the region of the instance
print("Success - the MySageMakerInstance is in the " + my_region + " region. You will use the " + containers[my_region] + " container for your SageMaker endpoint.")

Success - the MySageMakerInstance is in the us-east-2 region. You will use the 825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest container for your SageMaker endpoint.


In [4]:
# %conda install --yes --prefix {sys.prefix} -c conda-forge scikit-surprise

Collecting package metadata (current_repodata.json): done
Solving environment: - 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - conda-forge/noarch::imageio==2.9.0=py_0
  - conda-forge/linux-64::jupyter_server==1.4.1=py36h5fab9bb_0
  - conda-forge/noarch::black==20.8b1=py_1
  - conda-forge/linux-64::bokeh==2.2.3=py36h5fab9bb_0
  - defaults/linux-64::_anaconda_depends==5.1.0=py36_2
  - conda-forge/noarch::pyls-black==0.4.6=pyh9f0ad1d_0
  - conda-forge/noarch::aiobotocore==1.2.1=pyhd8ed1ab_0
  - conda-forge/noarch::pyls-spyder==0.3.2=pyhd8ed1ab_0
  - conda-forge/linux-64::anyio==2.1.0=py36h5fab9bb_0
  - conda-forge/noarch::jupyterlab_server==2.3.0=pyhd8ed1ab_0
  - conda-forge/linux-64::matplotlib-base==3.3.4=py36hd391965_0
  - conda-forge/linux-64::spyder==4.2.0=py36h5fab9bb_0
  - conda-forge/noarch::python-language-server==0.36.2=pyhd8ed1ab_0
  - conda-forge/noarch::seaborn-base==0.11.1=pyhd8ed1ab_1
  -

In [8]:
from surprise.similarities import cosine, msd, pearson # for Memory-based Methods (Neighborhood-based)
from surprise.prediction_algorithms import SVD, knns, KNNWithMeans, KNNBasic, KNNBaseline, KNNWithZScore, CoClustering, BaselineOnly, NormalPredictor, NMF, SVDpp, SlopeOne
from surprise.model_selection import GridSearchCV, cross_validate, train_test_split
from surprise import Reader, Dataset, accuracy

In [9]:
# Additional import statements are my own, in case I would like to copy the tutorial statements above to another notebook.
import time
import csv
import json
import pickle
import seaborn as sns
from os import system
from math import floor
from copy import deepcopy
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import euclidean_distances
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import cdist
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 200)

In [10]:
bucket = 'sagemaker-studio-t1ems8mtnoj'
subfolder = ''
s3 = boto3.client('s3')
contents = s3.list_objects(Bucket=bucket, Prefix=subfolder)['Contents']
# for f in contents:
#     print(f['Key'])

# Additional Data Import & Preparation

In [11]:
rdf = pd.read_csv('data/RAW_recipes.csv')
idf = pd.read_csv('data/RAW_interactions.csv')

pp_rdf = pd.read_csv('data/PP_recipes.csv')
pp_idf = pd.read_csv('data/PP_users.csv')

ingID = pd.read_pickle('data/ingr_map.pkl')

tdf = pd.read_csv('data/interactions_train.csv')
vdf = pd.read_csv('data/interactions_validation.csv')

# """Cleaning: All steps carried over from EDA Notebook"""
rdf.drop(labels=721, inplace = True)

# Cleaning/FE: Creating columns for recipe's respective nutrients
# `nutrition` is a stringified list; strip the brackets and split it once into 7 columns
nutrient_cols = ['kcal', 'fat', 'sugar', 'salt', 'protein', 'sat_fat', 'carbs']
nutrients = rdf.nutrition.str[1:-1].str.split(', ', expand=True)
for position, col in enumerate(nutrient_cols):
    rdf[col] = nutrients[position]  # values stay strings, as before

# Cleaning: Imputing outlier value to median
rdf['minutes'] = np.where(rdf.minutes == 2147483647,
                         rdf.minutes.median(),
                         rdf.minutes)

idf['date'] = pd.to_datetime(idf.date)

# Surprise

## Model-based

> The second category covers the Model based approaches, which involve a step to reduce or compress the large but sparse user-item matrix. For understanding this step, a basic understanding of dimensionality reduction can be very helpful.

First we will split our RAW Interactions data into a training dataset and a testing dataset.

>It's worth pointing out that when SVD is calculated for recommendation systems, it is preferred to be done with a modified version called "Funk's SVD" that only takes into account the rated values, ignoring whatever items have not been rated by users. The algorithm is named after Simon Funk, who was part of the team who placed 3rd in the Netflix challenge with this innovative way of performing matrix decomposition. Read more about Funk's SVD implementation at his original blog post. There is no simple way to include for this fact with SciPy's implementation of svd(), but luckily the surprise library has Funk's version of SVD implemented to make our lives easier!
>
>Similar to other sklearn features, we can expedite the process of trying out different parameters by using an implementation of grid search. Let's make use of the grid search here to account for some different configurations of parameters within the SVD pipeline. This might take some time! You'll notice that the n_jobs is parameter set to -1, which ensures that all of the cores on your computer will be used to process fitting and evaluating all of these models. To help keep track of what is occurring here, take note of the different values. This code ended up taking over 16 minutes to complete even with parallelization in effect, so the optimal parameters are given to you for the SVD model below. Use them to train a model and let's see how well it performs. If you want the full grid search experience, feel free to uncomment the code and give it a go!
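
As a rough illustration of the "only the rated values" idea in the passage above, the sketch below factorizes a handful of made-up ratings with stochastic gradient descent. It is a simplified stand-in, not the `surprise` implementation: the `funk_svd` and `rmse` functions and their parameter values are assumptions chosen for this example, and the bias terms that `surprise.SVD` also learns are left out.

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, n_factors=2, n_epochs=200,
             lr=0.01, reg=0.02, seed=0):
    """Factorize observed (user, item, rating) triples with SGD.

    Only the triples in `ratings` contribute to the updates; cells of the
    user-item matrix that were never rated are never touched.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(0, 0.1, (n_users, n_factors))  # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))  # item factors
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            pu = P[u].copy()  # keep the old user vector for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error over the observed triples."""
    return float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings])))

# Made-up (user index, item index, rating) triples.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 5), (1, 2, 1), (2, 1, 4), (2, 2, 2)]

P0, Q0 = funk_svd(ratings, 3, 3, n_epochs=0)  # untrained starting point
P, Q = funk_svd(ratings, 3, 3)
print(rmse(ratings, P0, Q0), rmse(ratings, P, Q))
```

The training error after fitting should be well below the error of the random starting point, even though three of the nine user-item cells were never filled in.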

In [12]:
idf.columns

Index(['user_id', 'recipe_id', 'date', 'rating', 'review'], dtype='object')

In [15]:
# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))

In [16]:
# The columns must correspond to user id, item id and ratings (in that order).
sidf = Dataset.load_from_df(idf[['user_id', 'recipe_id', 'rating']], reader)

### SVD Grid Search Cross Validation Split

Note: This section belongs under the **Cross Validation Split** header below, but I am keeping it here so that I can work close to my dataset load.

In [25]:
param_grid = {'n_factors':[20, 50, 100, 150],'n_epochs':[20,30],  'lr_all':[0.005,0.01],'reg_all':[0.02,0.1]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
gs.fit(sidf)
params = gs.best_params['rmse']
svdtuned = SVD(n_factors=params['n_factors'], n_epochs=params['n_epochs'], lr_all=params['lr_all'], reg_all=params['reg_all'], verbose=True, random_state=1984)

MemoryError: Unable to allocate 161. MiB for an array with shape (211293, 100) and data type float64

In [17]:
# param_grid = {'n_factors':[50,100,150],'n_epochs':[20,30],  'lr_all':[0.005,0.01],'reg_all':[0.02,0.1]}
# gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5, joblib_verbose=1)
# gs.fit(sidf)
# params = gs.best_params['rmse']
# svdtuned = SVD(n_factors=params['n_factors'], n_epochs=params['n_epochs'],lr_all=params['lr_all'], reg_all=params['reg_all'])

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


KeyboardInterrupt: 

### Cross Validation Split

In [20]:
# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(SVD(), sidf, cv=5, verbose=True, n_jobs=-1)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2174  1.2205  1.2189  1.2189  1.2250  1.2201  0.0026  
MAE (testset)     0.7379  0.7401  0.7384  0.7393  0.7427  0.7397  0.0017  
Fit time          70.56   71.87   70.18   71.78   70.03   70.88   0.79    
Test time         2.80    2.91    2.84    2.11    2.88    2.71    0.30    


{'test_rmse': array([1.21739199, 1.22050868, 1.21886082, 1.21894024, 1.22502467]),
 'test_mae': array([0.73785234, 0.74011418, 0.738434  , 0.73929538, 0.74273665]),
 'fit_time': (70.5595211982727,
  71.86913752555847,
  70.17905449867249,
  71.77816939353943,
  70.03437781333923),
 'test_time': (2.8024628162384033,
  2.9095535278320312,
  2.8377883434295654,
  2.106692314147949,
  2.880246162414551)}
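
For reference, the two metrics in the table can be written out directly. The sketch below defines them on made-up predictions and then recomputes the `Mean` and `Std` columns from the per-fold RMSE values printed above; the helper names `rmse` and `mae` are my own, not functions from `surprise`.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error: large misses are penalized more heavily."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error: every miss counts in proportion to its size."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))

# Made-up ratings and predictions, for illustration only.
actual = [5, 4, 1, 5]
predicted = [4.5, 4.0, 3.0, 5.0]
print(rmse(actual, predicted), mae(actual, predicted))

# Per-fold RMSE values copied from the SVD output above.
fold_rmse = np.array([1.21739199, 1.22050868, 1.21886082, 1.21894024, 1.22502467])
fold_mean = round(float(fold_rmse.mean()), 4)
fold_std = round(float(fold_rmse.std()), 4)
print(fold_mean, fold_std)
```

The single bad prediction (3.0 for a true rating of 1) dominates the RMSE more than the MAE, which is why the two metrics can rank models differently.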

In [27]:
cross_validate(CoClustering(), sidf, cv=5, verbose=True, n_jobs=-1)

Evaluating RMSE, MAE of algorithm CoClustering on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.3047  1.3035  1.3016  1.3069  1.3114  1.3056  0.0034  
MAE (testset)     0.7507  0.7440  0.7415  0.7529  0.7499  0.7478  0.0043  
Fit time          70.71   71.66   70.47   70.04   70.42   70.66   0.54    
Test time         3.17    1.71    2.50    3.10    2.48    2.59    0.53    


{'test_rmse': array([1.30465138, 1.30351823, 1.30161146, 1.30690978, 1.31141918]),
 'test_mae': array([0.75067556, 0.74403444, 0.74154924, 0.75291796, 0.74994738]),
 'fit_time': (70.7091155052185,
  71.6614785194397,
  70.4731376171112,
  70.04253816604614,
  70.42406725883484),
 'test_time': (3.165019989013672,
  1.705190896987915,
  2.502916097640991,
  3.0976758003234863,
  2.481900930404663)}

In [21]:
cross_validate(BaselineOnly(bsl_options={'method':'als'}), sidf, cv=5, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2140  1.2142  1.2122  1.2084  1.2149  1.2127  0.0023  
MAE (testset)     0.7369  0.7389  0.7374  0.7354  0.7379  0.7373  0.0012  
Fit time          5.83    6.49    6.79    6.77    6.90    6.56    0.39    
Test time         2.47    1.47    1.48    1.48    2.10    1.80    0.41    


{'test_rmse': array([1.21398794, 1.21421063, 1.212189  , 1.20844943, 1.21489597]),
 'test_mae': array([0.73693909, 0.73889184, 0.73744221, 0.73536794, 0.73793712]),
 'fit_time': (5.83281135559082,
  6.491120100021362,
  6.787977457046509,
  6.771663188934326,
  6.903219699859619),
 'test_time': (2.466940402984619,
  1.473618507385254,
  1.4813072681427002,
  1.4766147136688232,
  2.1048150062561035)}

In [22]:
cross_validate(BaselineOnly(bsl_options={'method':'sgd'}), sidf, cv=5, verbose=True)

Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Estimating biases using sgd...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.2110  1.2179  1.2107  1.2116  1.2115  1.2126  0.0027  
MAE (testset)     0.7347  0.7374  0.7345  0.7344  0.7337  0.7349  0.0013  
Fit time          9.78    10.36   10.90   10.66   10.51   10.44   0.38    
Test time         2.51    1.47    1.51    1.49    2.04    1.80    0.41    


{'test_rmse': array([1.21103128, 1.21793577, 1.21072264, 1.21160403, 1.21147109]),
 'test_mae': array([0.73465602, 0.73739129, 0.73451846, 0.7343926 , 0.73368812]),
 'fit_time': (9.78078579902649,
  10.355447053909302,
  10.901717185974121,
  10.656184911727905,
  10.510430335998535),
 'test_time': (2.512019157409668,
  1.472841739654541,
  1.5068244934082031,
  1.4883172512054443,
  2.0438711643218994)}

In [20]:
cross_validate(BaselineOnly(), sidf, cv=25, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 25 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10 Fold 11 Fold 12 Fold 13 Fold 14 Fold 15 Fold 16 Fold 17 Fold 18

{'test_rmse': array([1.20069381, 1.20432673, 1.21696639, 1.22165007, 1.20939011,
        1.21501733, 1.22251124, 1.20808387, 1.21296566, 1.20964866,
        1.20483298, 1.20023967, 1.20505683, 1.21366249, 1.19894757,
        1.21399537, 1.22313645, 1.20913698, 1.20772644, 1.201417  ,
        1.21397934, 1.21002465, 1.21558545, 1.19813607, 1.21114252]),
 'test_mae': array([0.73071006, 0.73185704, 0.73909519, 0.73943119, 0.73363071,
        0.73830049, 0.74385617, 0.73310156, 0.73679779, 0.73365416,
        0.73214594, 0.73057217, 0.73160296, 0.73700097, 0.72887674,
        0.73572434, 0.73998376, 0.73610725, 0.73142927, 0.73051474,
        0.7364471 , 0.73319174, 0.73705975, 0.73208366, 0.73480419]),
 'fit_time': (7.626070261001587,
  8.421735525131226,
  8.420256853103638,
  8.573269605636597,
  8.557183027267456,
  8.369316577911377,
  8.699493408203125,
  8.528411388397217,
  8.719077825546265,
  8.61153531074524,
  8.747264623641968,
  8.60050916671753,
  8.693900108337402,
  8.3987

In [None]:
# check parameters for non-neural net rec systems
param_grid = {'n_factors':[20, 100],'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs_model = GridSearchCV(SVD, param_grid=param_grid, n_jobs = -1, joblib_verbose=5)
gs_model.fit(sidf)

In [47]:
print('Number of users: ', sidf.df.user_id.value_counts().index.shape[0], '\n')
print('Number of recipes: ', sidf.df.recipe_id.value_counts().index.shape[0], '\n')


Number of users:  226570 

Number of recipes:  231637 



In [40]:
# data.split(n_folds=3)
# trainset, testset = train_test_split(sidf, test_size=0.2)

In [None]:
# vector = CountVectorizer()

NameError: name 'train_test_split' is not defined

In [None]:
# param_grid = {'n_factors':[20, 100],'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
#               'reg_all': [0.4, 0.6]}
# gs_model = GridSearchCV(SVD,param_grid=param_grid,n_jobs = -1,joblib_verbose=5)
# gs_model.fit(sidf)

# svd = SVD(n_factors=100, n_epochs=10, lr_all=0.005, reg_all=0.4)
# svd.fit(trainset)
# predictions = svd.test(testset)
# print(accuracy.rmse(predictions))

## Memory-based

In [49]:
trainset, testset = train_test_split(sidf, test_size = 0.2, random_state = 1984)

In [50]:
print('Number of users: ', trainset.n_users, '\n')
print('Number of items: ', trainset.n_items, '\n')

Number of users:  192093 

Number of items:  211233 



In [53]:
sim_cos = {'name':'cosine', 'user_based':True}

In [54]:
basic = knns.KNNBasic(sim_options=sim_cos)
basic.fit(trainset)

Computing the cosine similarity matrix...


MemoryError: Unable to allocate 332. GiB for an array with shape (211233, 211233) and data type float64
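
The size in that error message follows from simple arithmetic: a dense similarity matrix needs one value per pair of rows. The helper below (`dense_gib` is a hypothetical name) reproduces the figure from the shape and dtype reported in the error.

```python
def dense_gib(shape, bytes_per_value):
    """Memory needed by a dense array of the given shape, in GiB."""
    n_values = 1
    for dim in shape:
        n_values *= dim
    return n_values * bytes_per_value / 2**30

# Shape and dtype taken from the error message above (float64 = 8 bytes).
needed = dense_gib((211233, 211233), 8)
print(round(needed, 1))
```

Because the matrix is n by n, memory grows with the square of n, so halving the number of rows would cut the requirement to roughly a quarter.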

### Collaborative Filtering — User-based Recommender System

>- Collaborative Filtering (CF) is currently the most widely used approach to build recommendation systems  
>- The key idea behind CF is that similar users have similar interests and that a user generally likes items that are similar to other items they like  
>- CF is filling an "empty cell" in the utility matrix based on the similarity between users or item. Matrix factorization or decomposition can help us solve this problem by determining what the overall "topics" are when a matrix is factored
  
Note: We will likely need to bring this notebook over to Databricks or AWS.  
  
We can use **cosine similarity** between users, or **Pearson Correlation Coefficient**. Different metrics will offer different results, but I don't see why we couldn't offer multiple recommendations using different metrics.

**Note: Don't forget to downplay zeros as they are gaps, not values representative of sentiment!**
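
One way to respect that note is to compare two users only on the recipes both of them rated. The sketch below does this for cosine similarity and the Pearson correlation on made-up rating vectors; the function names and the convention that `0` marks "not rated" are assumptions for this example, and pairs with fewer than two co-rated recipes are not handled.

```python
import numpy as np

def co_rated(u, v):
    """Keep only positions where both users gave a rating (0 = not rated)."""
    mask = (u > 0) & (v > 0)
    return u[mask], v[mask]

def cosine_sim(u, v):
    """Cosine of the angle between the two co-rated rating vectors."""
    u, v = co_rated(u, v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_sim(u, v):
    """Pearson correlation of the two co-rated rating vectors."""
    u, v = co_rated(u, v)
    return float(np.corrcoef(u, v)[0, 1])

# Made-up rating vectors over six recipes; 0 marks an unrated recipe.
alice = np.array([5, 3, 0, 4, 0, 1], dtype=float)
bob = np.array([4, 2, 5, 5, 0, 2], dtype=float)

print(cosine_sim(alice, bob), pearson_sim(alice, bob))
```

Cosine similarity comes out higher than Pearson here because Pearson first subtracts each user's mean rating, so it measures agreement in how ratings deviate rather than in their raw size.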

### SVD — Singular Value (Matrix) Decomposition

Notes:  
- eigendecompositions require square matrices
- for SVD, we can create square matrices by taking the dot product of a matrix and its transpose
- we can use SVD to draw vectors in a new space to capture as much of the variance in our data as possible
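
The notes above can be checked numerically on a small made-up matrix. The sketch uses only `numpy.linalg`; the matrix `A` here is an arbitrary example, not the ratings matrix.

```python
import numpy as np

A = np.array([[5.0, 4.0, 0.0],
              [0.0, 2.0, 5.0]])  # made-up 2 x 3 matrix, not square

# A @ A.T is square (2 x 2) and symmetric, so it has an eigendecomposition.
eigvals = np.linalg.eigvalsh(A @ A.T)          # ascending order
singular = np.linalg.svd(A, compute_uv=False)  # descending order

# Squared singular values of A equal the eigenvalues of A @ A.T.
matches = np.allclose(np.sort(singular ** 2), eigvals)
print(matches)

# Keeping only the largest singular value gives a rank-1 approximation of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rank1 = s[0] * np.outer(U[:, 0], Vt[0])
explained = float(s[0] ** 2 / np.sum(s ** 2))
print(explained)
```

`explained` is the share of the total squared singular values captured by the first direction, which is the sense in which the leading vectors capture as much of the variation as possible.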

~~Below we use the raw interactions dataset to create the rating matrix `A`, with rows representing recipes and columns representing users.~~  
NVM LOOKS LIKE WE'RE GONNA NEED DISTRIBUTED COMPUTING FOR EVEN A SIMPLE SVD



In [14]:
A = np.ndarray(
    shape=(np.max(idf.recipe_id.values), np.max(idf.index.values)),
    dtype=np.uint8)

A[idf.recipe_id.values-1, idf.index.values-1] = idf.rating.values

MemoryError: Unable to allocate 567. GiB for an array with shape (537716, 1132366) and data type uint8
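
A dense array reserves a cell for every possible id pair, which is what exhausts memory here. A possible alternative, sketched below on made-up interactions, is to store only the observed ratings in a `scipy.sparse` matrix. The id values are invented, and mapping raw ids to contiguous indices with `np.unique` is my own choice for this example rather than something the notebook does.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up interactions standing in for the user_id, recipe_id and rating columns.
user_ids = np.array([1533, 1533, 8937, 126440, 8937])
recipe_ids = np.array([40893, 85009, 40893, 85009, 120345])
ratings = np.array([5, 4, 5, 1, 3], dtype=np.uint8)

# Map the raw, widely spaced ids onto contiguous row and column indices.
recipes, row_idx = np.unique(recipe_ids, return_inverse=True)
users, col_idx = np.unique(user_ids, return_inverse=True)

# Rows are recipes and columns are users; only the 5 observed cells are stored.
A = csr_matrix((ratings, (row_idx, col_idx)), shape=(len(recipes), len(users)))
print(A.shape, A.nnz)
```

Storage then scales with the number of interactions rather than with the product of the largest recipe id and the largest column index.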

### ALS — Alternating Least Squares Matrix Decomposition

- plug in guess values for the factor matrices P and Q
- hold one of them constant, then use the values of R and the constant matrix to solve for the optimum values of the other
- repeat with the roles swapped, alternating until the values stop changing
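
The alternating steps listed above can be sketched in NumPy. The notes do not define the symbols, so this example assumes R is the user-by-item ratings matrix and P and Q are its user and item factor matrices, with R approximated by `P @ Q.T`. The `als` function, its regularization term, and the toy matrix are all illustrative choices, not a library API.

```python
import numpy as np

def als(R, mask, n_factors=2, n_iters=20, reg=0.1, seed=0):
    """Alternating least squares on the observed entries of R.

    R    : (n_users, n_items) ratings matrix
    mask : boolean array, True where a rating was observed
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(0, 0.1, (n_users, n_factors))  # initial guess, user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))  # initial guess, item factors
    ridge = reg * np.eye(n_factors)
    for _ in range(n_iters):
        # Hold Q constant and solve a small ridge regression for each user.
        for u in range(n_users):
            Qu = Q[mask[u]]
            P[u] = np.linalg.solve(Qu.T @ Qu + ridge, Qu.T @ R[u, mask[u]])
        # Hold P constant and solve the same kind of problem for each item.
        for i in range(n_items):
            Pi = P[mask[:, i]]
            Q[i] = np.linalg.solve(Pi.T @ Pi + ridge, Pi.T @ R[mask[:, i], i])
    return P, Q

# Made-up ratings; 0 marks a missing rating and is excluded through the mask.
R = np.array([[5, 4, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
P, Q = als(R, mask)
pred = P @ Q.T
fit_rmse = float(np.sqrt(np.mean((R[mask] - pred[mask]) ** 2)))
print(fit_rmse)
```

Each inner step is an ordinary least squares problem with a closed-form solution, which is why the method alternates instead of updating both matrices at once.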

> "When we talk about collaborative filtering for recommender systems we want to solve the problem of our original matrix having millions of different dimensions, but our 'tastes' not being nearly as complex. Even if i’ve \[sic\] viewed hundreds of items they might just express a couple of different tastes. Here we can actually use matrix factorization to mathematically reduce the dimensionality of our original 'all users by all items' matrix into something much smaller that represents 'all items by some taste dimensions' and 'all users by some taste dimensions'. These dimensions are called ***latent or hidden features*** and we learn them from our data" ([Medium article: "ALS Implicit Collaborative Filtering"](https://medium.com/radon-dev/als-implicit-collaborative-filtering-5ed653ba39fe)).

For collaborative ALS, we will want our data to be shaped something like below, where the column numbers represent different recipes.  

Recipe->| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |...| n  
:-------|:--|---|---|---|---|---|---|---|---|---|---|---|---|---|---:  
user_01 | 0 | 0 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0  
user_02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  
user_03 | 5 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 0 | 0  
user_04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  
user_05 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0  
user_06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0  
user_07 | 0 | 0 | 5 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0  
  ...   
user_n  | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 0  


# Evaluation

# Reference

## Reading in Big Data from S3

If the `read_csv` statements above fail, the larger files are probably missing from the `data/` folder. To add them, download them from the S3 bucket where they are stored, using the statements below.

In [19]:
# ARN of the S3 access point for my bucket (an identifier, not an access key), just in case
arn = 'arn:aws:s3:us-east-2:133716175259:accesspoint/recipe-book-GitHub-t2-exec-to-s3'

In [17]:
# try:
#     urllib.request.urlretrieve ('https://sagemaker-studio-t1ems8mtnoj.s3.us-east-2.amazonaws.com/RAW_interactions.csv', 'data/RAW_interactions.csv')
#     print('Success: downloaded RAW_interactions.csv.')
# except Exception as e:
#     print('Data load error: ',e)

# try:
#     idf = pd.read_csv('data/RAW_interactions.csv',index_col=0)
#     print('Success: Data loaded into dataframe.')
# except Exception as e:
#     print('Data load error: ',e)

In [18]:
# try:
#     urllib.request.urlretrieve ('https://sagemaker-studio-t1ems8mtnoj.s3.us-east-2.amazonaws.com/RAW_recipes.csv', 'data/RAW_recipes.csv')
#     print('Success: downloaded RAW_recipes.csv.')
# except Exception as e:
#     print('Data load error: ',e)

# try:
#     rdf = pd.read_csv('data/RAW_recipes.csv',index_col=0)
#     print('Success: Data loaded into dataframe.')
# except Exception as e:
#     print('Data load error: ',e)

Reference:  
>(*Both personalized and content-based recommendation systems*) make use of different similarity metrics to determine how "similar" items are to one another. The most common similarity metrics are [**Euclidean distance**](https://en.wikipedia.org/wiki/Euclidean_distance), [**cosine similarity**](https://en.wikipedia.org/wiki/Cosine_similarity), [**Pearson correlation**](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and the [**Jaccard index**](https://en.wikipedia.org/wiki/Jaccard_index) (useful with binary data). Each one of these distance metrics has its advantages and disadvantages depending on the type of ratings you are using and the characteristics of your data.$_1$  
  

### Addtl. Package Installs & Import Statements

#### Additional Rec. System Packages:

- implicit
- lightfm
- pyspark.mllib.recommendation
- Amazon Personalize
- crab
- suggest

In [12]:
# %conda install --yes --prefix {sys.prefix} -c conda-forge implicit

In [13]:
# %conda install --yes --prefix {sys.prefix} -c conda-forge lightfm

#### Unused Packages Import Statements

In [14]:
## PYSPARK MODELING NOW IN SEPARATE DOCUMENT

# import pyspark
# import pyspark.sql.functions as F
# from pyspark.sql.types import ArrayType, IntegerType

## Further Reading

For those of you visiting this page who are interested in reading more about recommender systems, below are fantastic resources that I have collected.  
$_n$$_o$$_w$ &nbsp;$_I$ &nbsp;$_a$$_m$ &nbsp;$_t$$_h$$_e$ &nbsp;$_r$$_e$$_c$$_o$$_m$$_m$$_e$$_n$$_d$$_a$$_t$$_i$$_o$$_n$ &nbsp;$_s$$_y$$_s$$_t$$_e$$_m$$_!$

[*Mining Massive Datasets: Chapter 9*](http://infolab.stanford.edu/~ullman/mmds/ch9.pdf), © Copyright Stanford University. Stanford, California 94305  
[Singular Value Decomposition (SVD) & Its Application In Recommender System](https://analyticsindiamag.com/singular-value-decomposition-svd-application-recommender-system/), Dr. Vaibhav Kumar

# Bibliography

1. [*Introduction to Recommender Systems* by Flatiron School](https://github.com/learn-co-curriculum/dsc-recommendation-system-introduction) is licensed under CC BY-NC-SA 4.0
>© 2018 Flatiron School, Inc.  
>© 2021 Flatiron School, LLC

# Notes

In [11]:
contents[0]

{'Key': 'RAW_interactions.csv',
 'LastModified': datetime.datetime(2021, 4, 29, 17, 32, 1, tzinfo=tzlocal()),
 'ETag': '"b76f667f757a6e0b6d3b08d23e0bc5e0-21"',
 'Size': 349436524,
 'StorageClass': 'STANDARD',
 'Owner': {'ID': 'ee74dd80de8b382fe7543b22cf2221ca7e68c76f28d55bec19b96bbe32873a4c'}}

In [12]:
contents[0]['Key']

'RAW_interactions.csv'

In [None]:
# import boto3
# import sys

# if sys.version_info[0] < 3: 
#     from StringIO import StringIO # Python 2.x
# else:
#     from io import StringIO # Python 3.x

# # get your credentials from environment variables
# aws_id = os.environ['AWS_ID']  # placeholder env var name; look up the key by variable name, not by account id
# aws_secret = os.environ['AWS_SECRET']

# client = boto3.client('s3', aws_access_key_id=aws_id,
#         aws_secret_access_key=aws_secret)

# bucket_name = 'my_bucket'

# object_key = 'my_file.csv'
# csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
# body = csv_obj['Body']
# csv_string = body.read().decode('utf-8')

# df = pd.read_csv(StringIO(csv_string))