# Group_Assignment_Retail
<img src="https://d26a57ydsghvgx.cloudfront.net/product/Customer%20Story%20Images/instacartlogo.png"  width=300, height=100  align="right" />

**Authors**: 
- Isobel Rae Impas
- Jan P. Thoma
- Nikolas Artadi
- Camila Vasquez
- Santiago Alfonso Galeano
- Miguel Frutos Soriano

**Subject**: 
Analytics for Retail and Consumer Goods


<img src="./images/instructions.png"    align="center" />


Similar products from the two databases will fall close together in the embedding space, because their vector components will be similar across all dimensions. This gives two ways of approaching the problem of recommending a recipe to a specific user:
1. Finding proximity between Baskets and Recipes.
2. Finding the probability of a user buying specific products contained in a recipe.
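
The first approach can be sketched with cosine similarity, which the notebook imports from scikit-learn. The version below is a minimal, dependency-free illustration and not the notebook's implementation; the vectors and the names `basket_vec` and `recipe_vecs` are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors (the notebook uses 20 dimensions).
basket_vec = [0.9, 0.1, 0.0]
recipe_vecs = {
    "recipe_a": [1.0, 0.0, 0.0],
    "recipe_b": [0.0, 1.0, 0.0],
}

# Rank recipes from most to least similar to the basket.
ranked = sorted(recipe_vecs,
                key=lambda r: cosine(basket_vec, recipe_vecs[r]),
                reverse=True)
print(ranked)
```

Cosine similarity ignores vector length, which is convenient here because basket and recipe vectors are averages over different numbers of products.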
  

### STEPS
1. [Chapter 1 - Obtain a retail database that contains information with a relationship between users, products and orders.](#ch1)
1. [Chapter 2 - Obtain a database of recipes that contains each recipe with a list of products.](#ch2)
1. [Chapter 3 - Use products from both databases and put them into a single list.](#ch3)
1. [Chapter 4 - Input this list into word2vec.](#ch4)
1. [Chapter 5 - Extract vectors for each product.](#ch5)
1. [Chapter 6 - Finding proximity between Baskets and Recipes](#ch6)
1. [Chapter 7 - Finding the probability of a user buying specific products contained in a recipe.](#ch7)


### IMPORTANT ASSUMPTIONS TO CONSIDER
- **Ingredients** from the "*simplified-recipes-1M dataset*" would be equivalent to **Products** of the "*instacart dataset*".
- **Recipes** from the "*simplified-recipes-1M dataset*" would be equivalent to **Baskets** of the "*instacart dataset*".
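
These assumptions are useful because both sources are described with overlapping words, so a single word2vec model can place ingredients and products in the same space. The sketch below uses made-up names (`ingredient_names`, `product_names`) to show how the two sources become one tokenized corpus and which tokens they share.

```python
# Hypothetical samples from each source, already lowercased.
ingredient_names = ["sea salt", "black pepper", "butter"]
product_names = ["organic sea salt", "unsalted butter", "cat food"]

# One token list per name: the list-of-lists shape that Word2Vec expects.
corpus = [name.split() for name in ingredient_names + product_names]

ingredient_tokens = {t for name in ingredient_names for t in name.split()}
product_tokens = {t for name in product_names for t in name.split()}

# Tokens that appear in both sources end up with a single shared embedding.
shared = sorted(ingredient_tokens & product_tokens)
print(shared)  # ['butter', 'salt', 'sea']
```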


In [1]:
#Start by importing the libraries
#!pip install gensim
#!pip install annoy

%matplotlib inline

# Standard library
import os
import random
import re, string
import shutil
import tempfile
import time
from collections import Counter, defaultdict

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.patheffects as PathEffects
import seaborn as sns
import plotly.express as px
import imageio
from IPython.display import Image

# Modelling
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from gensim.test.utils import common_texts
from gensim.models import Word2Vec
from gensim.utils import tokenize
from gensim.similarities.annoy import AnnoyIndexer
from gensim.similarities import Similarity
from annoy import AnnoyIndex

# Text preprocessing
import nltk
import spacy
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/isobelimpas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Datasets

In [2]:
#INSTACART
df_aisles = pd.read_csv('./data/aisles.csv')
df_departments = pd.read_csv('./data/departments.csv')
df_products = pd.read_csv('./data/products.csv')
df_orders = pd.read_csv('./data/orders.csv')
#df_orders covers the prior, train and test orders (see the eval_set column)

df_order_products__prior = pd.read_csv('./data/order_products__prior.csv')
df_order_products__train = pd.read_csv('./data/order_products__train.csv')

#1M RECIPES
with np.load('./data/simplified-recipes-1M.npz', allow_pickle=True) as data:
    recipes = data['recipes']
    ingredients = data['ingredients']

<a id="ch1"></a>
# Chapter 1 - Obtain a retail database that contains information with a relationship between users, products and orders.

## Dataset - simplified-recipes-1M 
### Description
This recipe-ingredient dataset contains about **1,000,000 carefully cleaned and preprocessed recipes**. The underlying data comes from five different base datasets  which were merged in order to create a more complete recipe collection. The main contribution here is that all recipes have been meticulously cleaned and standardized.<br>

### Source

http://dominikschmidt.xyz/simplified-recipes-1M/

Dataset Sources:
-   **Recipe1M+** - [Recipe1M+](http://pic2recipe.csail.mit.edu/) <br>
    Marin, Javier, et al. "Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, pp. 1–1. doi:10.1109/TPAMI.2019.2927476.

-   **Epicurious** - [Recipes with Rating and Nutrition](https://www.kaggle.com/datasets/hugodarwood/epirecipes)

-   **Yummly** - [Recipe Ingredients Dataset](https://www.kaggle.com/datasets/kaggle/recipe-ingredients-dataset)

-   **Datafiniti** - [Food Ingredient Lists](https://www.kaggle.com/datasets/datafiniti/food-ingredient-lists)

-   **Eight Portions** - [Recipe Box](https://eightportions.com/datasets/Recipes/)


### INGREDIENTS/PRODUCTS

In [3]:
ingredients

array(['salt', 'pepper', 'butter', ..., 'watercress leaves',
       'emerils essence', 'corn flakes cereal'], dtype='<U39')

In [4]:
#Including it as a Dataframe
df_ingredients = pd.DataFrame(ingredients)
#Renaming the column
df_ingredients.rename(columns = {0:'name'}, inplace = True)
#Creating the source column
df_ingredients['source'] = 'recipes_ingredients'
# Split products into terms: Tokenize.
df_ingredients['name_tokenize'] = df_ingredients['name'].str.split()
df_ingredients

Unnamed: 0,name,source,name_tokenize
0,salt,recipes_ingredients,[salt]
1,pepper,recipes_ingredients,[pepper]
2,butter,recipes_ingredients,[butter]
3,garlic,recipes_ingredients,[garlic]
4,sugar,recipes_ingredients,[sugar]
...,...,...,...
3495,poblano chilies,recipes_ingredients,"[poblano, chilies]"
3496,crystal hot sauce,recipes_ingredients,"[crystal, hot, sauce]"
3497,watercress leaves,recipes_ingredients,"[watercress, leaves]"
3498,emerils essence,recipes_ingredients,"[emerils, essence]"


### COOKING RECIPES

In [5]:
ingredients[recipes[4]]

array(['black pepper', 'coarse sea salt', 'fresh lemon',
       'fresh lemon juice', 'ground', 'ground black pepper', 'lemon',
       'lemon juice', 'lime', 'lime peel', 'mayonaise', 'pepper',
       'sea salt', 'shallots', 'sherry wine', 'sherry wine vinegar',
       'vinegar', 'wine vinegar'], dtype='<U39')

In [6]:
#We noticed a discrepancy in recipe 727892, so we delete it.
recipy = np.delete(recipes, 727892)
#Map each recipe's ingredient indices to the ingredient names.
recipes_words = [ingredients[recipe] for recipe in recipy]

In [7]:
#Including it as a Dataframe
df_recipes = pd.DataFrame({0: recipes_words})

In [8]:
#Renaming the column
df_recipes.rename(columns = {0:'recipe_name'}, inplace = True)
#Creating the source column
df_recipes['source'] = 'recipes_recipes'
df_recipes

Unnamed: 0,recipe_name,source
0,"[basil leaves, focaccia, leaves, mozzarella, p...",recipes_recipes
1,"[balsamic vinegar, boiling water, butter, cook...",recipes_recipes
2,"[bottle, bouillon, carrots, celery, chicken bo...",recipes_recipes
3,"[grand marnier, kahlua]",recipes_recipes
4,"[black pepper, coarse sea salt, fresh lemon, f...",recipes_recipes
...,...,...
1067551,"[butter, coconut, deep dish pie crust, eggs, f...",recipes_recipes
1067552,"[brown sugar, butter, flour, nectarines, oats,...",recipes_recipes
1067553,"[basil, buttermilk, buttermilk biscuits, fat f...",recipes_recipes
1067554,"[brown sugar, butter, liquid, potatoes, salt, ...",recipes_recipes


<a id="ch2"></a>
# Chapter 2 - Obtain a database of recipes that contains each recipe with a list of products.

## Dataset - Instacart
### Description
`orders` (3.4m rows, 206k users):
* `order_id`: order identifier
* `user_id`: customer identifier
* `eval_set`: which evaluation set this order belongs in (see `SET` described below)
* `order_number`: the order sequence number for this user (1 = first, n = nth)
* `order_dow`: the day of the week the order was placed on
* `order_hour_of_day`: the hour of the day the order was placed on
* `days_since_prior_order`: days since the last order, capped at 30 (with NAs for `order_number` = 1)

`products` (50k rows):
* `product_id`: product identifier
* `product_name`: name of the product
* `aisle_id`: foreign key
* `department_id`: foreign key

`aisles` (134 rows):
* `aisle_id`: aisle identifier
* `aisle`: the name of the aisle

`departments` (21 rows):
* `department_id`: department identifier
* `department`: the name of the department

`order_products__SET` (30m+ rows):
* `order_id`: foreign key
* `product_id`: foreign key
* `add_to_cart_order`: order in which each product was added to cart
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise

where `SET` is one of the three following evaluation sets (`eval_set` in `orders`):
* `"prior"`: orders prior to that user's most recent order (~3.2m orders)
* `"train"`: training data supplied to participants (~131k orders)
* `"test"`: test data reserved for machine learning competitions (~75k orders)
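
The tables above are linked through `order_id`, `product_id` and `user_id`. As a rough illustration of how those keys turn rows into baskets, the sketch below uses tiny made-up tables held in plain Python structures; the values are hypothetical and only the column names come from the description above.

```python
from collections import defaultdict

# Hypothetical miniature versions of the tables.
orders = [
    {"order_id": 1, "user_id": 10, "eval_set": "prior"},
    {"order_id": 2, "user_id": 10, "eval_set": "train"},
]
products = {101: "Bananas", 102: "Whole Milk"}  # product_id -> product_name
order_products = [
    {"order_id": 1, "product_id": 101, "add_to_cart_order": 1},
    {"order_id": 1, "product_id": 102, "add_to_cart_order": 2},
    {"order_id": 2, "product_id": 101, "add_to_cart_order": 1},
]

# Group product names by order_id to obtain one basket per order.
baskets = defaultdict(list)
for row in sorted(order_products,
                  key=lambda r: (r["order_id"], r["add_to_cart_order"])):
    baskets[row["order_id"]].append(products[row["product_id"]])

# Attach each basket to its user through the orders table.
user_baskets = defaultdict(list)
for order in orders:
    user_baskets[order["user_id"]].append(baskets[order["order_id"]])

print(dict(baskets))
print(dict(user_baskets))
```

In the notebook the same grouping is done with `pd.merge` and `groupby`, which scales far better on the 30m+ rows of `order_products__SET`.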


### Source

Kaggle Competition: https://www.kaggle.com/competitions/instacart-market-basket-analysis/data



## Prepare and structure the data to be useable by the word2vec algorithm

In [9]:
def clean_text(text):
    '''Lowercase the text and remove text in square brackets, the ® symbol, punctuation, words containing numbers and stopwords.'''
    # Make everything lowercase.
    text = text.lower()
    # Remove text in square brackets.
    text = re.sub(r'\[.*?\]', '', text)
    # Remove the ® symbol.
    text = re.sub(r'®', '', text)
    # Remove punctuation (hyphenated words are joined: "all-seasons" -> "allseasons").
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    # Remove words containing numbers.
    text = re.sub(r'\w*\d\w*', '', text)
    # Names with two or fewer characters left are discarded (returned as None).
    if len(text) > 2:
        return ' '.join(word for word in text.split() if word not in STOPWORDS)
    return None

In [10]:
#Apply clean_text
df_clean = pd.DataFrame(df_products.product_name.apply(lambda x: clean_text(x)))

In [11]:
#Check for missing values
missing_values = df_clean.isnull().sum().sort_values(ascending=False)
missing_values

product_name    6
dtype: int64

In [12]:
#Change col name
df_clean = df_clean.rename(columns = {'product_name':'prod_name_new'})

In [13]:
#Join with df_products to compare against the original product_name column
df_clean = pd.DataFrame(pd.concat([df_products, df_clean], axis=1, join="inner"))

In [14]:
df_clean

Unnamed: 0,product_id,product_name,aisle_id,department_id,prod_name_new
0,1,Chocolate Sandwich Cookies,61,19,chocolate sandwich cookies
1,2,All-Seasons Salt,104,13,allseasons salt
2,3,Robust Golden Unsweetened Oolong Tea,94,7,robust golden unsweetened oolong tea
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,smart ones classic favorites mini rigatoni vod...
4,5,Green Chile Anytime Sauce,5,13,green chile anytime sauce
...,...,...,...,...,...
49683,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,vodka triple distilled twist vanilla
49684,49685,En Croute Roast Hazelnut Cranberry,42,1,en croute roast hazelnut cranberry
49685,49686,Artisan Baguette,112,3,artisan baguette
49686,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,smartblend healthy metabolism dry cat food


In [15]:
#Cast prod_name_new from object to string type for lemmatization
df_clean['prod_name_new'] = df_clean['prod_name_new'].astype('str')

In [16]:
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser']) # disabling Named Entity Recognition and the dependency parser for speed

def lemmatizer(text):
    '''Return the text with every token replaced by its lemma.'''
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

df_clean["text_lemmatize"] = df_clean['prod_name_new'].apply(lemmatizer)

In [17]:
#Remove the -PRON- placeholder lemma, as it adds complexity to the model without adding value
df_clean['text_lemmatize_clean'] = df_clean['text_lemmatize'].str.replace('-PRON-', '', regex=False)

In [18]:
# Split products into terms: Tokenize (using the cleaned text so names and tokens stay consistent).
df_clean['product_name_tokenize'] = df_clean['text_lemmatize_clean'].str.split()

In [19]:
df_instacart = pd.DataFrame(df_clean[['text_lemmatize_clean','product_name_tokenize', 'product_id']])
df_instacart.rename(columns = {'text_lemmatize_clean':'name','product_name_tokenize':'name_tokenize'}, inplace = True)

In [20]:
df_instacart['source'] = 'instacart_products'

In [21]:
df_instacart

Unnamed: 0,name,name_tokenize,product_id,source
0,chocolate sandwich cookie,"[chocolate, sandwich, cookie]",1,instacart_products
1,allseason salt,"[allseason, salt]",2,instacart_products
2,robust golden unsweetene oolong tea,"[robust, golden, unsweetene, oolong, tea]",3,instacart_products
3,smart one classic favorite mini rigatoni vodka...,"[smart, one, classic, favorite, mini, rigatoni...",4,instacart_products
4,green chile anytime sauce,"[green, chile, anytime, sauce]",5,instacart_products
...,...,...,...,...
49683,vodka triple distil twist vanilla,"[vodka, triple, distil, twist, vanilla]",49684,instacart_products
49684,en croute roast hazelnut cranberry,"[en, croute, roast, hazelnut, cranberry]",49685,instacart_products
49685,artisan baguette,"[artisan, baguette]",49686,instacart_products
49686,smartblend healthy metabolism dry cat food,"[smartblend, healthy, metabolism, dry, cat, food]",49687,instacart_products


In [22]:
df_instacart.product_id.nunique()

49688

In [23]:
df_instacart['product_id'].astype(int)

0            1
1            2
2            3
3            4
4            5
         ...  
49683    49684
49684    49685
49685    49686
49686    49687
49687    49688
Name: product_id, Length: 49688, dtype: int64

<a id="ch3"></a>
# Chapter 3 - Use products from both databases and put them into a single list.

In [24]:
frames = [df_ingredients, df_instacart[['name','name_tokenize','source','product_id']]]

total_products = pd.concat(frames)

In [25]:
total_products

Unnamed: 0,name,source,name_tokenize,product_id
0,salt,recipes_ingredients,[salt],
1,pepper,recipes_ingredients,[pepper],
2,butter,recipes_ingredients,[butter],
3,garlic,recipes_ingredients,[garlic],
4,sugar,recipes_ingredients,[sugar],
...,...,...,...,...
49683,vodka triple distil twist vanilla,instacart_products,"[vodka, triple, distil, twist, vanilla]",49684.0
49684,en croute roast hazelnut cranberry,instacart_products,"[en, croute, roast, hazelnut, cranberry]",49685.0
49685,artisan baguette,instacart_products,"[artisan, baguette]",49686.0
49686,smartblend healthy metabolism dry cat food,instacart_products,"[smartblend, healthy, metabolism, dry, cat, food]",49687.0


<a id="ch4"></a>
# Chapter 4 - Input this list into word2vec.

In [26]:
#MODEL:
#Input: a list of token lists (one per product or ingredient name).
#vector_size=20: each word is embedded in 20 dimensions.
#window=5 (the Gensim default): up to five words before and five words after the input word are used as context.
#min_count=1: keep every word, including those that occur only once.

w2vec_model = Word2Vec(list(total_products['name_tokenize']), vector_size=20, window=5, min_count=1, workers=4)
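
To make the `window` parameter concrete, the sketch below lists the (center, context) pairs that a symmetric window produces for one product name. The helper `context_pairs` is hypothetical and written only for this illustration; it is not part of Gensim, and it ignores the fact that Gensim may sample a smaller effective window at random during training.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs using up to `window` tokens on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["green", "chile", "anytime", "sauce"]

# With window=5 every token sees every other token of this short name.
pairs = context_pairs(tokens, window=5)
print(len(pairs))  # 12

# With window=1 only direct neighbours are paired.
print(context_pairs(tokens, window=1))
```

Because most product names are shorter than the window, nearly all words in a name act as context for each other.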

### WORDS VECTORS

In [27]:
# Create  dictionaries to obtain word vectors

#Loop the words 
word_vec = dict()
for w in w2vec_model.wv.index_to_key:
    word_vec[w] = w2vec_model.wv[w]
    
display(list(word_vec.items())[:2])

[('organic',
  array([ 0.6412038 ,  0.81334287,  1.4783449 ,  0.76453996, -0.5468077 ,
          0.6737628 ,  0.8331944 ,  1.5893403 , -0.05049876, -2.0816164 ,
          0.9515679 ,  0.52965826,  0.54076886,  0.028983  ,  0.5928873 ,
          0.6319864 ,  0.33908325,  0.14977205, -1.6992792 , -2.38657   ],
        dtype=float32)),
 ('chocolate',
  array([ 1.3187026 ,  0.8681453 ,  3.4522731 ,  1.2101547 , -0.16830018,
         -1.250247  ,  1.44484   ,  1.8248792 , -1.3747295 , -0.51753265,
          0.08031385, -2.5597017 ,  1.1982394 , -0.93762326, -1.4354299 ,
          0.80700475,  2.3752043 , -2.87967   , -4.690836  , -2.064535  ],
        dtype=float32))]

<a id="ch5"></a>
# Chapter 5 - Extract vectors for each product.

### PRODUCT VECTORS

In [28]:
# Create  dictionaries to obtain prod vectors
# Loop through each word in the product name to generate the vector.
# The vector for each product would be the average of the different words

prods_w2v = dict()
for index, row in total_products.iterrows():
    # Collect the vector of every word in the product name.
    word_vector = [word_vec[word] for word in row['name_tokenize']]
    prods_w2v[row['name']] = np.average(word_vector, axis=0)
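
The averaging step can be illustrated on a toy vocabulary. This is a minimal sketch with made-up 2-dimensional vectors; the helpers `average_vectors` and `product_vector` are hypothetical, and the `None` fallback for names without any known token is an assumption that the notebook itself does not make.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def product_vector(tokens, word_vec):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [word_vec[t] for t in tokens if t in word_vec]
    return average_vectors(known) if known else None

# Hypothetical word vectors (the notebook uses 20 dimensions).
toy_word_vec = {"sea": [1.0, 3.0], "salt": [3.0, 1.0]}

print(product_vector(["sea", "salt"], toy_word_vec))  # [2.0, 2.0]
print(product_vector([], toy_word_vec))               # None
```

Since the model is trained with `min_count=1`, every token has a vector, so the fallback mainly matters for names whose token list is empty.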

In [29]:
display(list(prods_w2v.items())[:2])

[('salt',
  array([-0.4064497 ,  0.62870306,  1.9824469 , -1.2775681 , -2.5620747 ,
          1.8138994 ,  0.94125956,  2.9013045 ,  1.4211092 , -0.57436246,
          1.5717244 , -0.62417585, -0.40272334,  0.09225506,  0.24318846,
         -2.448614  , -0.2816807 , -4.098007  , -4.235197  , -1.8655416 ],
        dtype=float32)),
 ('pepper',
  array([ 0.98076695, -0.22865653,  2.2217455 ,  0.90035087, -0.10680755,
          2.4719598 , -1.2166597 ,  1.3206859 ,  2.49167   ,  0.9706965 ,
          1.6113052 ,  1.8975558 ,  2.4170737 ,  0.2728517 ,  0.55506575,
          0.18317232,  0.28092062, -0.3937776 , -2.5857167 , -3.2600374 ],
        dtype=float32))]

In [30]:
#Including product_vector inside the dataframe
prod_lists = pd.DataFrame(list(prods_w2v.items()), columns=['name', 'vector'])
product_lists = pd.merge(total_products, prod_lists, on="name")

In [31]:
product_lists = product_lists.rename(columns = {'product_id':'id' })

In [32]:
product_lists

Unnamed: 0,name,source,name_tokenize,id,vector
0,salt,recipes_ingredients,[salt],,"[-0.4064497, 0.62870306, 1.9824469, -1.2775681..."
1,salt,instacart_products,[salt],20126.0,"[-0.4064497, 0.62870306, 1.9824469, -1.2775681..."
2,pepper,recipes_ingredients,[pepper],,"[0.98076695, -0.22865653, 2.2217455, 0.9003508..."
3,pepper,instacart_products,[pepper],25297.0,"[0.98076695, -0.22865653, 2.2217455, 0.9003508..."
4,butter,recipes_ingredients,[butter],,"[-0.20565397, 0.49051186, 2.420473, 2.0453606,..."
...,...,...,...,...,...
53183,vodka triple distil twist vanilla,instacart_products,"[vodka, triple, distil, twist, vanilla]",49684.0,"[-0.1393412, 0.793434, 0.58441895, 0.049001902..."
53184,en croute roast hazelnut cranberry,instacart_products,"[en, croute, roast, hazelnut, cranberry]",49685.0,"[-0.01120081, 0.1534398, 1.2319181, 0.52216744..."
53185,artisan baguette,instacart_products,"[artisan, baguette]",49686.0,"[-0.038701084, 0.46550775, 0.34835252, 0.18698..."
53186,smartblend healthy metabolism dry cat food,instacart_products,"[smartblend, healthy, metabolism, dry, cat, food]",49687.0,"[0.36770225, -0.12373075, 0.5402949, 1.0274087..."


In [33]:
## We also want to recover the names to understand what is going on with the products

#Draw a single sample of 3000 products so vectors, names and sources stay aligned
tsne_sample = product_lists.sample(n=3000, random_state=1)
#Vectors of the sampled products
w2vec_tsne = list(tsne_sample['vector'])
#Names of the sampled products
product_names = list(tsne_sample['name'])
#Source of the sampled products (used as the plot color)
colors = list(tsne_sample['source'])

In [34]:
# Train the TSNE MODEL
tsne_model = TSNE(perplexity=30, 
                  n_components=2,
                  init='pca',
                  n_iter=3500,
                  random_state=23)
#Fit the model
tsne_results = tsne_model.fit_transform(w2vec_tsne)

#Creating a dataframe with the results
df_tsne_data = pd.DataFrame()

df_tsne_data['tsne-2d-one'] = tsne_results[:,0]
df_tsne_data['tsne-2d-two'] = tsne_results[:,1]
df_tsne_data['product_names'] = product_names
df_tsne_data['color'] = colors

In [35]:
fig = px.scatter(df_tsne_data,
                 x="tsne-2d-one", y="tsne-2d-two",
                 color="color",
                 hover_data=['product_names'],
                 title='Products',
                 width=800, height=800)
fig.show()

In [36]:
df_instacart

Unnamed: 0,name,name_tokenize,product_id,source
0,chocolate sandwich cookie,"[chocolate, sandwich, cookie]",1,instacart_products
1,allseason salt,"[allseason, salt]",2,instacart_products
2,robust golden unsweetene oolong tea,"[robust, golden, unsweetene, oolong, tea]",3,instacart_products
3,smart one classic favorite mini rigatoni vodka...,"[smart, one, classic, favorite, mini, rigatoni...",4,instacart_products
4,green chile anytime sauce,"[green, chile, anytime, sauce]",5,instacart_products
...,...,...,...,...
49683,vodka triple distil twist vanilla,"[vodka, triple, distil, twist, vanilla]",49684,instacart_products
49684,en croute roast hazelnut cranberry,"[en, croute, roast, hazelnut, cranberry]",49685,instacart_products
49685,artisan baguette,"[artisan, baguette]",49686,instacart_products
49686,smartblend healthy metabolism dry cat food,"[smartblend, healthy, metabolism, dry, cat, food]",49687,instacart_products


### BASKET LISTS

In [37]:
#We use df_order_products__prior because it links each order_id to its product_id
baskets = pd.merge(df_instacart[['product_id', 'name']], df_order_products__prior[['order_id', 'product_id']], on='product_id')

In [38]:
baskets.tail()

Unnamed: 0,product_id,name,order_id
32434484,49688,fresh foaming cleanser,3111954
32434485,49688,fresh foaming cleanser,3122003
32434486,49688,fresh foaming cleanser,3166828
32434487,49688,fresh foaming cleanser,3290206
32434488,49688,fresh foaming cleanser,3401313


In [39]:
baskets.order_id.nunique()

3214874

In [40]:
basket_lists = baskets.groupby('order_id')['name'].apply(list)
basket_lists_id = baskets.groupby('order_id')['product_id'].apply(list)
display(basket_lists.head())
display(basket_lists_id.head())

order_id
2    [natural stir creamy almond butter, garlic pow...
3    [air chill organic boneless skinless chicken b...
4    [kelloggs nutrigrain apple cinnamon cereal, go...
5    [clementine, mini original babybel cheese, ori...
6    [dryer sheet geranium scent, cleanse, clean da...
Name: name, dtype: object

order_id
2    [1819, 9327, 17794, 28985, 30035, 33120, 40141...
3    [17461, 17668, 17704, 21903, 24838, 32665, 337...
4    [10054, 17616, 21351, 22598, 25146, 26434, 277...
5    [6184, 6348, 8479, 9633, 12962, 13176, 13245, ...
6                                [15873, 40462, 41897]
Name: product_id, dtype: object

### BASKET VECTORS

In [41]:
# Obtain the average vector for each basket.
# As there are about 3.2m baskets, this might take some time.

baskets_w2vec = dict()
print('executing...')
for order_id, basket_products in basket_lists.items():
    # Collect the vector of every product in the basket.
    basket_vector = [prods_w2v[product] for product in basket_products]
    baskets_w2vec[order_id] = np.average(basket_vector, axis=0)

executing...


In [42]:
# Order_ID: Vector
list(baskets_w2vec.items())[:5]

[(2,
  array([ 0.13472296,  0.48697677,  0.86039525,  0.6526383 , -0.14831406,
          0.20565213,  0.44811997,  1.2721009 ,  0.15117984, -0.23962559,
          0.8292485 , -0.14681986,  0.3423673 ,  0.06778838,  0.5773539 ,
          0.608443  ,  0.6868708 , -0.62058413, -1.5527236 , -0.9910861 ],
        dtype=float32)),
 (3,
  array([-0.05451326,  0.8877266 ,  0.69503367,  0.39022902, -0.37164974,
         -0.26124418,  0.31999624,  1.4450022 , -0.618386  , -0.5495236 ,
          0.87686276, -0.11502355,  0.8125913 , -0.1234179 ,  0.75872403,
          0.60335606,  0.79617   , -0.33121836, -1.624976  , -1.5655068 ],
        dtype=float32)),
 (4,
  array([-0.11562122,  0.52167666,  0.86436963,  0.86814046, -0.44983053,
         -0.4896961 ,  0.6765699 ,  1.3392094 , -0.5010141 , -0.17928672,
          0.39944902, -0.6125184 ,  0.32910582, -0.3877094 ,  0.5760803 ,
          0.7931175 ,  1.3802223 , -0.9221834 , -1.5881171 , -1.8559049 ],
        dtype=float32)),
 (5,
  array([ 0.09

In [43]:
basket_lists = basket_lists.to_frame()#.reset_index()

In [44]:
basket_lists['order_id']=basket_lists.index

In [45]:
basket_lists

Unnamed: 0_level_0,name,order_id
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1
2,"[natural stir creamy almond butter, garlic pow...",2
3,[air chill organic boneless skinless chicken b...,3
4,"[kelloggs nutrigrain apple cinnamon cereal, go...",4
5,"[clementine, mini original babybel cheese, ori...",5
6,"[dryer sheet geranium scent, cleanse, clean da...",6
...,...,...
3421079,[moisture soap],3421079
3421080,"[vanilla bean ice cream, import butter, organi...",3421080
3421081,"[pepper jack cheese slice, classic wheat bread...",3421081
3421082,"[toast coconut chip original recipe, original ...",3421082


In [46]:
# To make it easy to insert into the dataframe, we generate a
# list that can be used directly as a column in the dataframe.
# We use try/except for problematic baskets (there shouldn't be any, but it keeps things clean).

basket_vectors = list()
for basket in basket_lists.iterrows():
    try:
        basket_vectors.append(baskets_w2vec[basket[1]['order_id']])
    except KeyError:
        print(basket)
        print(basket[1]['order_id'])
        print(basket[1]['name'])

In [47]:
basket_lists['basket_vectors'] = basket_vectors
basket_lists.tail()

Unnamed: 0_level_0,name,order_id,basket_vectors
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3421079,[moisture soap],3421079,"[0.00588765, 0.51124877, 0.69981205, -1.388482..."
3421080,"[vanilla bean ice cream, import butter, organi...",3421080,"[0.16407602, 0.7882831, 1.2202548, 0.7493656, ..."
3421081,"[pepper jack cheese slice, classic wheat bread...",3421081,"[-0.07891184, 0.76611936, 0.54213536, 0.644906..."
3421082,"[toast coconut chip original recipe, original ...",3421082,"[-0.011147954, 0.7171343, 0.73358124, 0.548353..."
3421083,"[natural french toast stick, organic sweet sal...",3421083,"[-0.03521915, 0.5513612, 0.7440297, 1.1261252,..."


In [48]:
basket_lists['name_2'] = basket_lists['name']
basket_lists['name_2'] = basket_lists['name_2'].str.join(',')
basket_lists.rename(columns = {'name_2':'name_tokenize','order_id':'id','basket_vectors':'vector'},inplace = True)
basket_lists['source'] = 'instacart_baskets'

In [49]:
basket_lists

Unnamed: 0_level_0,name,id,vector,name_tokenize,source
order_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,"[natural stir creamy almond butter, garlic pow...",2,"[0.13472296, 0.48697677, 0.86039525, 0.6526383...","natural stir creamy almond butter,garlic powde...",instacart_baskets
3,[air chill organic boneless skinless chicken b...,3,"[-0.05451326, 0.8877266, 0.69503367, 0.3902290...",air chill organic boneless skinless chicken br...,instacart_baskets
4,"[kelloggs nutrigrain apple cinnamon cereal, go...",4,"[-0.115621224, 0.52167666, 0.86436963, 0.86814...","kelloggs nutrigrain apple cinnamon cereal,gold...",instacart_baskets
5,"[clementine, mini original babybel cheese, ori...",5,"[0.09988686, 0.83936363, 0.58972186, 0.4815239...","clementine,mini original babybel cheese,origin...",instacart_baskets
6,"[dryer sheet geranium scent, cleanse, clean da...",6,"[0.21051049, 0.73799723, 0.21527433, -0.908677...","dryer sheet geranium scent,cleanse,clean day l...",instacart_baskets
...,...,...,...,...,...
3421079,[moisture soap],3421079,"[0.00588765, 0.51124877, 0.69981205, -1.388482...",moisture soap,instacart_baskets
3421080,"[vanilla bean ice cream, import butter, organi...",3421080,"[0.16407602, 0.7882831, 1.2202548, 0.7493656, ...","vanilla bean ice cream,import butter,organic p...",instacart_baskets
3421081,"[pepper jack cheese slice, classic wheat bread...",3421081,"[-0.07891184, 0.76611936, 0.54213536, 0.644906...","pepper jack cheese slice,classic wheat bread,f...",instacart_baskets
3421082,"[toast coconut chip original recipe, original ...",3421082,"[-0.011147954, 0.7171343, 0.73358124, 0.548353...","toast coconut chip original recipe,original sp...",instacart_baskets


### RECIPES LIST

In [50]:
df_recipes

Unnamed: 0,recipe_name,source
0,"[basil leaves, focaccia, leaves, mozzarella, p...",recipes_recipes
1,"[balsamic vinegar, boiling water, butter, cook...",recipes_recipes
2,"[bottle, bouillon, carrots, celery, chicken bo...",recipes_recipes
3,"[grand marnier, kahlua]",recipes_recipes
4,"[black pepper, coarse sea salt, fresh lemon, f...",recipes_recipes
...,...,...
1067551,"[butter, coconut, deep dish pie crust, eggs, f...",recipes_recipes
1067552,"[brown sugar, butter, flour, nectarines, oats,...",recipes_recipes
1067553,"[basil, buttermilk, buttermilk biscuits, fat f...",recipes_recipes
1067554,"[brown sugar, butter, liquid, potatoes, salt, ...",recipes_recipes


### RECIPES VECTORS

In [51]:
df_recipes['recipe_id']=df_recipes.index

In [52]:
recipes_w2vec = dict()
print('executing...')

for recipe in df_recipes.iterrows():
    try:
        # Average the vectors of all ingredients in the recipe.
        recipe_vector = [prods_w2v[product] for product in recipe[1]['recipe_name']]
        recipes_w2vec[recipe[1]['recipe_id']] = np.average(recipe_vector, axis=0)
    except Exception:
        print('error')

executing...


In [53]:
# Recipe_ID: Vector
list(recipes_w2vec.items())[:2]

[(0,
  array([ 0.12241288,  0.5001333 ,  0.09830185,  0.20527565, -0.00590448,
          0.30233386, -0.23661828,  0.95352304,  0.37756976,  0.4064672 ,
          0.6295367 ,  0.14396349,  0.6101173 , -0.02359849,  0.44494778,
          0.17520979,  0.4520603 , -0.17216928, -0.6315498 , -0.5328419 ],
        dtype=float32)),
 (1,
  array([-0.10585818,  0.29251096,  0.45397046,  0.12504661, -0.52095103,
         -0.0379079 ,  0.09785165,  1.1288114 , -0.08085217,  0.22542897,
          0.44537878,  0.19100739,  0.42642975, -0.07085072,  0.75970805,
          0.26095238,  0.54637295, -0.5867161 , -0.912922  , -0.68762136],
        dtype=float32))]

In [54]:
# To make it easy to insert into the dataframe, we generate a
# list that can be used directly as a column in the dataframe.
# We use try/except for problematic recipes (there shouldn't be any, but it keeps things clean).

recipe_vectors = list()
for recipe in df_recipes.iterrows():
    try:
        recipe_vectors.append(recipes_w2vec[recipe[1]['recipe_id']])
    except KeyError:
        print(recipe)
        print(recipe[1]['recipe_id'])
        print(recipe[1]['recipe_name'])

In [55]:
df_recipes['recipe_vectors'] = recipe_vectors

In [56]:
df_recipes['recipe_name_2'] = df_recipes['recipe_name']
df_recipes['recipe_name_2'] = df_recipes['recipe_name_2'].str.join(',')
df_recipes.rename(columns = {'recipe_name':'name_tokenize','recipe_name_2':'name','recipe_vectors':'vector'},inplace = True)
df_recipes.drop('recipe_id', axis=1, inplace=True)

In [57]:
df_recipes.tail()

Unnamed: 0,name_tokenize,source,vector,name
1067551,"[butter, coconut, deep dish pie crust, eggs, f...",recipes_recipes,"[-0.16565832, 1.087761, 0.91491175, 0.85712767...","butter,coconut,deep dish pie crust,eggs,flaked..."
1067552,"[brown sugar, butter, flour, nectarines, oats,...",recipes_recipes,"[-0.026276851, 0.9721594, 0.84825337, 0.762600...","brown sugar,butter,flour,nectarines,oats,peach..."
1067553,"[basil, buttermilk, buttermilk biscuits, fat f...",recipes_recipes,"[0.26833847, 1.0071172, 0.2884569, 0.45283347,...","basil,buttermilk,buttermilk biscuits,fat free,..."
1067554,"[brown sugar, butter, liquid, potatoes, salt, ...",recipes_recipes,"[-0.25019228, 1.1563611, 1.08847, 0.22800772, ...","brown sugar,butter,liquid,potatoes,salt,sugar,..."
1067555,"[basil, bay leaves, boneless, butter, canned c...",recipes_recipes,"[0.24490994, 0.40884346, 0.87505096, 0.6047019...","basil,bay leaves,boneless,butter,canned chicke..."


### USER VECTORS

In [58]:
#Merge df_order_products__train with orders, products, departments and aisles
orders = df_order_products__train.merge(df_orders, how='left',on='order_id')
orders = orders.merge(df_products,how='left', on ='product_id')
orders = orders.merge(df_departments,how='left', on='department_id')
orders = orders.merge(df_aisles,how='left', on='aisle_id')

In [59]:
users = pd.merge(df_instacart[['product_id', 'name']], orders[['user_id', 'product_id']])

In [60]:
users.tail()

Unnamed: 0,product_id,name,user_id
1384612,49687,smartblend healthy metabolism dry cat food,137573
1384613,49688,fresh foaming cleanser,159099
1384614,49688,fresh foaming cleanser,40460
1384615,49688,fresh foaming cleanser,187233
1384616,49688,fresh foaming cleanser,40384


In [61]:
users_lists = users.groupby('user_id')['name'].apply(list)
users_lists_id = users.groupby('user_id')['product_id'].apply(list)
display(users_lists.tail())
display(users_lists_id.tail())

user_id
206199    [organic peel cooked beet, original fresh stac...
206200    [organic navel orange, basil, bag organic bana...
206203    [protein bar lemon cream pie, original whole f...
206205    [mango chunk, classic guacamole, original vege...
206209    [diet pepsi pack, calcium enrich lactose free ...
Name: name, dtype: object

user_id
206199    [1468, 6128, 6701, 7103, 7702, 8898, 10333, 12...
206200    [8174, 8955, 13176, 15592, 21137, 21709, 22312...
206203    [2482, 3765, 3957, 13176, 14050, 15693, 21469,...
206205    [1158, 10181, 16174, 17600, 21137, 22035, 2477...
206209    [6846, 9405, 15655, 24852, 37966, 39216, 40603...
Name: product_id, dtype: object

In [62]:
# Obtain the average vector for each user.
# There are around 131k users in this set (user ids run up to 206209).

users_w2vec = dict()
print('executing...')
for _, user_row in users_lists.to_frame().reset_index().iterrows():
    # Collect the word2vec vector of every product the user bought.
    users_vector = [prods_w2v[product] for product in user_row['name']]
    # The user vector is the element-wise mean of those product vectors.
    users_w2vec[user_row['user_id']] = np.average(users_vector, axis=0)

executing...
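
The averaging step above can be sketched on toy data. This is only an illustration, not the notebook's code: `toy_w2v` is a made-up stand-in for `prods_w2v`, and `average_vector` is a hypothetical helper that skips names missing from the vocabulary (the loop above would raise a `KeyError` instead).

```python
import numpy as np

# Toy stand-in for prods_w2v (hypothetical 3-dimensional vectors).
toy_w2v = {
    "banana": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "spinach": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

def average_vector(names, vectors):
    """Mean of the vectors of the known names; None if no name is known."""
    known = [vectors[n] for n in names if n in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

# "unknown item" is not in the vocabulary and is simply ignored.
user_vector = average_vector(["banana", "spinach", "unknown item"], toy_w2v)
print(user_vector)
```

Whether to skip unknown products or fail loudly is a design choice; skipping keeps the loop running but can hide vocabulary gaps.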


In [63]:
# user_id: vector
list(users_w2vec.items())[0:5]

[(1,
  array([-0.0257036 ,  1.0246985 ,  0.9214456 ,  0.52856314, -0.24973412,
         -0.09103654,  0.3383019 ,  1.4290547 , -0.46524373, -0.21430093,
          0.5267204 , -0.31316724,  0.8894281 , -0.26815107,  0.44281435,
          0.2117354 ,  1.2450072 , -0.46552363, -1.9319936 , -1.5563462 ],
        dtype=float32)),
 (2,
  array([ 0.14423807,  0.5845108 ,  0.5330917 ,  0.68122023,  0.01821856,
          0.36131823,  0.22650278,  1.359647  ,  0.05696385, -0.08862846,
          0.8258939 , -0.11192022,  0.6539112 ,  0.19449982,  0.5713464 ,
          0.51384413,  0.77348393, -0.74483067, -1.269674  , -1.3991578 ],
        dtype=float32)),
 (5,
  array([ 0.30747432,  0.6849458 ,  0.78011864,  0.638885  , -0.15647191,
          0.46940005,  0.08167948,  1.5069749 ,  0.44431052, -0.15435949,
          0.89021915,  0.2556543 ,  1.0276848 ,  0.05055719,  0.8109936 ,
          0.42557842,  0.8019122 , -0.27787548, -1.2823838 , -1.4401332 ],
        dtype=float32)),
 (7,
  array([ 0.03

In [64]:
users_lists = users_lists.to_frame()

In [65]:
users_lists['user_id']=users_lists.index

In [66]:
users_lists

Unnamed: 0_level_0,name,user_id
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,"[soda, pistachio, cinnamon toast crunch, organ...",1
2,"[organic cashew carrot ginger soup, mint chip,...",2
5,"[organic raw agave nectar, sharp cheddar chees...",5
7,"[panama peach antioxidant infusion, lean groun...",7
8,"[organic green onion, broccoli rabe, organic s...",8
...,...,...
206199,"[organic peel cooked beet, original fresh stac...",206199
206200,"[organic navel orange, basil, bag organic bana...",206200
206203,"[protein bar lemon cream pie, original whole f...",206203
206205,"[mango chunk, classic guacamole, original vege...",206205


In [67]:
# To make it easy to insert into the dataframe we're generating a 
# list that can be used directly as a column in the dataframe.
# We wrap the lookup in try/except to catch problematic users. (There shouldn't be any, but to keep it clean...)

users_vectors = list()
for users in users_lists.iterrows():
    try:
        users_vectors.append(users_w2vec[users[1]['user_id']])
    except KeyError:
        # iterrows() yields (index, row) tuples, so the row itself is users[1].
        print(users)
        print(users[1]['user_id'])
        print(users[1]['name'])

In [68]:
users_lists['user_vectors'] = users_vectors
users_lists.tail()

Unnamed: 0_level_0,name,user_id,user_vectors
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
206199,"[organic peel cooked beet, original fresh stac...",206199,"[-0.017089121, 0.71270674, 0.82169324, 0.81923..."
206200,"[organic navel orange, basil, bag organic bana...",206200,"[0.0718578, 0.62231255, 0.783742, 0.46168876, ..."
206203,"[protein bar lemon cream pie, original whole f...",206203,"[0.052184146, 0.70741, 1.1525983, 0.4260726, -..."
206205,"[mango chunk, classic guacamole, original vege...",206205,"[0.106130816, 0.7316446, 0.7280097, 0.71691686..."
206209,"[diet pepsi pack, calcium enrich lactose free ...",206209,"[0.11774992, 0.7756263, 0.7322011, 0.70409715,..."


In [69]:
users_lists['name_2'] = users_lists['name']
users_lists['name_2'] = users_lists['name_2'].str.join(',')
users_lists.rename(columns = {'name':'name_tokenize','name_2':'name','user_vectors':'vector','user_id':'id'},inplace = True)
users_lists['source'] = 'instacart_users'

## Combine Users, Baskets, Products, Recipes & Ingredients into one Dataset

In [70]:
# List with all datasets
datasets = [users_lists, basket_lists, product_lists, df_recipes]

# List with all datasets names
datasets_names = ['user_lists', 'basket_lists', 'product_lists', 'df_recipes']



In [71]:
# Show columns, info, description.
for i in range(len(datasets)):
    print("")
    print(datasets_names[i])
    print(datasets[i].info())


user_lists
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131209 entries, 1 to 206209
Data columns (total 5 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   name_tokenize  131209 non-null  object
 1   id             131209 non-null  int64 
 2   vector         131209 non-null  object
 3   name           131209 non-null  object
 4   source         131209 non-null  object
dtypes: int64(1), object(4)
memory usage: 6.0+ MB
None

basket_lists
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3214874 entries, 2 to 3421083
Data columns (total 5 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   name           object
 1   id             int64 
 2   vector         object
 3   name_tokenize  object
 4   source         object
dtypes: int64(1), object(4)
memory usage: 147.2+ MB
None

product_lists
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53188 entries, 0 to 53187
Data columns (total 5 columns):
 #   Column         

In [72]:
big_table = pd.concat(datasets)

In [73]:
big_table

Unnamed: 0,name_tokenize,id,vector,name,source
1,"[soda, pistachio, cinnamon toast crunch, organ...",1.0,"[-0.025703602, 1.0246985, 0.9214456, 0.5285631...","soda,pistachio,cinnamon toast crunch,organic s...",instacart_users
2,"[organic cashew carrot ginger soup, mint chip,...",2.0,"[0.14423807, 0.5845108, 0.5330917, 0.68122023,...","organic cashew carrot ginger soup,mint chip,sm...",instacart_users
5,"[organic raw agave nectar, sharp cheddar chees...",5.0,"[0.30747432, 0.6849458, 0.78011864, 0.638885, ...","organic raw agave nectar,sharp cheddar cheese,...",instacart_users
7,"[panama peach antioxidant infusion, lean groun...",7.0,"[0.036747053, 0.7592107, 0.86147314, 0.5052083...","panama peach antioxidant infusion,lean ground ...",instacart_users
8,"[organic green onion, broccoli rabe, organic s...",8.0,"[0.29801205, 0.5706611, 0.7100633, 0.31531656,...","organic green onion,broccoli rabe,organic spro...",instacart_users
...,...,...,...,...,...
1067551,"[butter, coconut, deep dish pie crust, eggs, f...",,"[-0.16565832, 1.087761, 0.91491175, 0.85712767...","butter,coconut,deep dish pie crust,eggs,flaked...",recipes_recipes
1067552,"[brown sugar, butter, flour, nectarines, oats,...",,"[-0.026276851, 0.9721594, 0.84825337, 0.762600...","brown sugar,butter,flour,nectarines,oats,peach...",recipes_recipes
1067553,"[basil, buttermilk, buttermilk biscuits, fat f...",,"[0.26833847, 1.0071172, 0.2884569, 0.45283347,...","basil,buttermilk,buttermilk biscuits,fat free,...",recipes_recipes
1067554,"[brown sugar, butter, liquid, potatoes, salt, ...",,"[-0.25019228, 1.1563611, 1.08847, 0.22800772, ...","brown sugar,butter,liquid,potatoes,salt,sugar,...",recipes_recipes


In [74]:
# Sort by the different sources
big_table = big_table.sort_values("source")
# "id" column is not an identifier of the observation anymore, so we have decided to delete it.
big_table.drop('id', axis=1, inplace=True)
# Instead we will use the index as the identifier for our models
big_table = big_table.reset_index()
big_table['id']=big_table.index
big_table.drop('index', axis=1, inplace=True)

In [75]:
big_table

Unnamed: 0,name_tokenize,vector,name,source,id
0,"philadelphia cream cheese,granny smith apple,b...","[-0.18721125, 0.56669676, 0.51225615, 0.836172...","[philadelphia cream cheese, granny smith apple...",instacart_baskets,0
1,envirokidz gluten free wheat free gorilla munc...,"[0.23514682, 0.73965114, 0.684528, 1.0336504, ...",[envirokidz gluten free wheat free gorilla mun...,instacart_baskets,1
2,"total lowfat greek strain yogurt blueberry,map...","[0.10246369, 0.8207988, 0.6681635, 0.7025926, ...","[total lowfat greek strain yogurt blueberry, m...",instacart_baskets,2
3,"organic red onion,thyme,organic yellow onion,o...","[0.1910567, 0.76884663, 0.74673134, 0.6947772,...","[organic red onion, thyme, organic yellow onio...",instacart_baskets,3
4,"calorie light lemonade,yellow onion,less sodiu...","[0.06870963, 0.7890909, 0.691059, 0.5913845, -...","[calorie light lemonade, yellow onion, less so...",instacart_baskets,4
...,...,...,...,...,...
4466822,"[boiling water, cold water, gelatin, jello, ma...","[-0.4137816, 0.29615954, 0.26641956, -0.048193...","boiling water,cold water,gelatin,jello,mandari...",recipes_recipes,4466822
4466823,"[chicken, chicken broth, clove, dry white wine...","[0.18970525, 0.5653795, 0.55356914, 0.41273332...","chicken,chicken broth,clove,dry white wine,gar...",recipes_recipes,4466823
4466824,"[black pepper, breadcrumbs, condensed cream, c...","[0.09424104, 0.8146942, 0.78937924, 0.2342297,...","black pepper,breadcrumbs,condensed cream,conde...",recipes_recipes,4466824
4466825,"[baking powder, baking soda, black, black sesa...","[-0.06529629, 0.6206819, 1.0193346, 0.13594516...","baking powder,baking soda,black,black sesame s...",recipes_recipes,4466825


The different categories are distributed as follows:
- **instacart_baskets**: 0 - 3214873
- **instacart_products**: 3214874 - 3264561
- **instacart_users**: 3264562 - 3395770
- **recipes_ingredients**: 3395771 - 3399270
- **recipes_recipes**: 3399271 - 4466826
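
These boundaries can also be derived from the table instead of being hard-coded. A minimal sketch on a toy frame, assuming only the `source` and `id` columns of `big_table`; the data below is made up.

```python
import pandas as pd

# Toy frame that mirrors the sort / reset_index / id steps applied to big_table.
toy = pd.DataFrame({
    "source": ["recipes_recipes"] * 4 + ["instacart_users"] * 2 + ["instacart_baskets"] * 3,
})
toy = toy.sort_values("source").reset_index(drop=True)
toy["id"] = toy.index

# First and last id of every source.
id_ranges = toy.groupby("source")["id"].agg(["min", "max"])
print(id_ranges)

# A Python range covering one source (the upper bound is exclusive, hence +1).
first, last = id_ranges.loc["recipes_recipes"]
recipe_ids = range(int(first), int(last) + 1)
```

Building the range this way avoids off-by-one mistakes when filtering nearest neighbours by source.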

In [76]:
# Draw one sample of 3000 rows and list its vectors and names
sample = big_table.sample(n=3000, random_state=1)
w2vec_tsne = list(sample['vector'])
product_names = list(sample['name'])
# The source of each sampled row is used to colour the plot
colors = list(sample['source'])

In [77]:
# Train the TSNE MODEL
tsne_model = TSNE(perplexity=30, 
                  n_components=2,
                  init='pca',
                  n_iter=3500,
                  random_state=23)
#Fit the model
tsne_results = tsne_model.fit_transform(w2vec_tsne)

#Creating a dataframe with the results
df_tsne_data = pd.DataFrame()

df_tsne_data['tsne-2d-one'] = tsne_results[:,0]
df_tsne_data['tsne-2d-two'] = tsne_results[:,1]
df_tsne_data['product_names'] = product_names
df_tsne_data['color'] = colors

In [78]:
fig = px.scatter(df_tsne_data,
                 x="tsne-2d-one", y="tsne-2d-two",
                 color="color",
                 hover_data=['product_names'],
                 title='Products',
                 width=800, height=600)
fig.show()

<a id="ch6"></a>
# Chapter 6 - Finding proximity between Baskets and Recipes

## Per Ticket
- The input to the model would be an order/basket id, which would be transformed into a vector.
- Using annoy we would receive as output, for example, the 10 closest recipes.
- The heart of the model would be word2vec, with annoy used to obtain distances between vectors.
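
To make the lookup concrete, here is a brute-force sketch of what the index is asked to do: rank recipe vectors by angular distance to a basket vector. Annoy's `angular` metric is the Euclidean distance between normalized vectors, that is sqrt(2(1 - cos)). The vectors and names below are invented, and the exhaustive search merely stands in for Annoy's approximate one.

```python
import numpy as np

def angular_distance(u, v):
    """sqrt(2 * (1 - cosine similarity)), the quantity Annoy calls 'angular'."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.sqrt(max(0.0, 2.0 * (1.0 - cos)))

# Hypothetical 2-dimensional recipe vectors and one basket vector.
recipes = {
    "fruit salad": np.array([1.0, 0.1]),
    "beef stew": np.array([0.0, 1.0]),
    "smoothie": np.array([0.9, 0.3]),
}
basket = np.array([1.0, 0.0])

# Nearest recipe first.
ranked = sorted(recipes, key=lambda name: angular_distance(basket, recipes[name]))
print(ranked)
```

With millions of vectors an exhaustive scan like this becomes slow, which is the reason for using an approximate index.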

<img src="./images/per_ticket.png" align="center" />

### ANNOY - Basket TO Recipe

**COSINE DISTANCE**

In [79]:
f = 20  # Length of the item vectors that will be indexed
r = AnnoyIndex(f, metric='angular')
for index, row in big_table.iterrows():
    r.add_item(row['id'], row['vector'])

r.build(10) # 10 trees

True

**10 CLOSEST RECIPES FROM A BASKET ID**

In [99]:
basket_selected = int(input("Welcome to Power Ranger Market. Please select a basket number between 0 - 3214873: "))
print('Thank you. Your basket number "%d" has been selected.' %(basket_selected))
print('We will now provide you with 10 recipes recommendations based on the following items in your basket: "%s"' %big_table.iloc[basket_selected]['name'])

Thank you. Your basket number "3214873" has been selected.
We will now provide you with 10 recipes recommendations based on the following items in your basket: "['organic mayonnaise', 'strawberry', 'banana', 'apple juicy red family pack', 'large lemon', 'spinach']"


In [96]:
i = 0
j = 0
similar_list = list()
similar_recipes = list()
# recipes_recipes ids run from 3399271 to 4466826 inclusive; range() excludes its upper bound.
while i < 10:
    similar_list = r.get_nns_by_item(basket_selected, j+1, search_k=-1, include_distances=False)
    if similar_list[-1] in range(3399271, 4466827) and similar_list[-1] not in similar_recipes:
        similar_recipes.append(similar_list[-1])
        i+=1
    else:
        j+=1
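
The loop above re-queries the index with one more neighbour on each pass. A possible alternative, sketched under assumptions: ask for a batch of neighbours, keep the ids that fall in the recipe range, and double the batch size until enough are found. `first_k_in_range` is a hypothetical helper, and `fake_query` only imitates a call such as `lambda n: r.get_nns_by_item(basket_selected, n)`.

```python
def first_k_in_range(query, id_range, k, max_n=1_000_000):
    """Return the first k ids from query(n) that lie in id_range, nearest first."""
    n = k
    while True:
        neighbours = query(n)
        hits = [i for i in neighbours if i in id_range]
        # Stop when enough hits were found, the index is exhausted, or the cap is reached.
        if len(hits) >= k or len(neighbours) < n or n >= max_n:
            return hits[:k]
        n *= 2

# Fake index: ids already ordered by distance; ids 100-199 play the role of recipes.
ordered_ids = [3, 150, 7, 101, 9, 12, 199, 120, 5]
fake_query = lambda n: ordered_ids[:n]

print(first_k_in_range(fake_query, range(100, 200), k=3))  # [150, 101, 199]
```

Doubling keeps the number of index queries logarithmic in the final batch size, at the cost of sometimes fetching more neighbours than needed.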

In [97]:
display(similar_recipes)
list(big_table[big_table.id.isin(similar_recipes)]['name_tokenize'])

[4433481,
 3795949,
 3658298,
 3721998,
 4409961,
 4029050,
 3749332,
 3799142,
 4142420,
 4288009]

[array(['butter', 'cake', 'cake mix', 'chocolate', 'margarine',
        'marshmallow', 'marshmallow creme', 'milk',
        'nonhydrogenated margarine', 'peanut', 'peanut butter', 'peanuts',
        'salted peanuts', 'semisweet chocolate'], dtype='<U39'),
 array(['a', 'butter', 'cake', 'chocolate', 'cookie', 'peanut butter',
        'prepared'], dtype='<U39'),
 array(['candy', 'chocolate', 'crackers', 'peanut butter', 'ritz crackers',
        'sandwiches'], dtype='<U39'),
 array(['butter', 'chocolate', 'chocolate chips', 'cookie', 'cool whip',
        'oreo', 'peanut butter'], dtype='<U39'),
 array(['butter', 'chocolate', 'chocolate chips', 'cookie', 'cream',
        'granola', 'other', 'other nuts', 'peanut', 'peanut butter',
        'peanuts', 'semisweet chocolate', 'semisweet chocolate chips',
        'whipping cream'], dtype='<U39'),
 array(['brownie', 'butter', 'chocolate', 'cookie', 'mix', 'peanut butter'],
       dtype='<U39'),
 array(['candy', 'candy sprinkles', 'marshmallow', 

## Per User
### Input using users
- The input to the model would be a user id, which would be transformed into a vector.
- Using annoy we would receive as output, for example, the 10 closest recipes.
- The heart of the model would be word2vec, with annoy used to obtain distances between vectors.

<img src="./images/per_user_using_users.png" align="center" />

**10 CLOSEST RECIPES FROM A USER ID**

In [102]:
user_selected = int(input("Welcome to Power Ranger Market. Please select a user number between 3264562 - 3395770: "))
print('Thank you. Your user number "%d" has been selected.' %(user_selected))
print('We will now provide you with 10 recipes recommendations based on the following items from previous orders: "%s"' %big_table.iloc[user_selected]['name_tokenize'])

Thank you. Your user number "3395770" has been selected.
We will now provide you with 10 recipes recommendations based on the following items from previous orders: "['raw coconut water', 'baby spinach', 'tuna belly ventresca tin', 'original thin rye crispbread']"


In [100]:
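i = 0
j = 0
similar_list = list()
similar_users = list()  # despite the name, this holds the ids of the recipes closest to the user
# recipes_recipes ids run from 3399271 to 4466826 inclusive; range() excludes its upper bound.
while i < 10:
    similar_list = r.get_nns_by_item(user_selected, j+1, search_k=-1, include_distances=False)
    if similar_list[-1] in range(3399271, 4466827) and similar_list[-1] not in similar_users:
        similar_users.append(similar_list[-1])
        i+=1
    else:
        j+=1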

In [101]:
display(similar_users)
list(big_table[big_table.id.isin(similar_users)]['name_tokenize'])

[3765679,
 3876425,
 3709237,
 3586464,
 4466311,
 3552168,
 4218241,
 4134505,
 4304739,
 3791521]

[array(['cherry pie filling', 'filling', 'liquid', 'marshmallows',
        'mini marshmallows', 'nonfat yogurt', 'pie filling', 'pineapple',
        'sweetened', 'yogurt'], dtype='<U39'),
 array(['citrus', 'colors', 'extract', 'fruit', 'grapes', 'nonfat yogurt',
        'papaya', 'raspberry', 'ricotta', 'ricotta cheese', 'sliced',
        'sugar', 'vanilla extract', 'yogurt'], dtype='<U39'),
 array(['banana', 'bananas', 'cool whip', 'fat free', 'gelatin', 'pudding',
        'sliced', 'strawberries', 'strawberry', 'vanilla',
        'vanilla pudding', 'water'], dtype='<U39'),
 array(['egg white', 'fruit cocktail', 'gelatin', 'juice',
        'nonfat vanilla yogurt', 'strawberry', 'strawberry yogurt',
        'vanilla yogurt', 'white'], dtype='<U39'),
 array(['angel food cake', 'apricot', 'apricot nectar', 'apricots',
        'bing cherries', 'cake', 'cherries', 'extract', 'lemon',
        'lemon juice', 'plain yogurt', 'vanilla extract', 'yogurt'],
       dtype='<U39'),
 array(['butterm

### Input using product list
- The input to the model would be a product list, which would be transformed into a vector.
- Using annoy we would receive as output, for example, the 10 closest recipes.
- The heart of the model would be word2vec, with annoy used to obtain distances between vectors.
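
Because the query vector is just a mean of product vectors, it can be updated as products are added instead of being recomputed from the whole list. A minimal sketch with made-up 2-dimensional vectors; `RunningMean` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

class RunningMean:
    """Keeps the mean of the vectors seen so far."""

    def __init__(self, dim):
        self.count = 0
        self.mean = np.zeros(dim)

    def add(self, vector):
        self.count += 1
        # Incremental update: new_mean = old_mean + (x - old_mean) / n
        self.mean = self.mean + (np.asarray(vector) - self.mean) / self.count
        return self.mean

basket = RunningMean(dim=2)
basket.add([1.0, 0.0])
basket.add([0.0, 1.0])
print(basket.add([2.0, 2.0]))  # mean of the three vectors added so far
```

The resulting `basket.mean` could then be passed to a vector query such as `get_nns_by_vector` after every addition.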

<img src="./images/per_user_using_productlist.png" align="center" />

**10 CLOSEST RECIPES FROM A PRODUCT LIST**

In [86]:
i = 0
j = 0
lista_products = list()
similar_list = list()
similar_recipes = list()

# Add the products: append the vector of each one to a list.
lista_products.append(prods_w2v['organic tomato basil pasta sauce'])
lista_products.append(prods_w2v['vanilla bean frozen yogurt'])
lista_products.append(prods_w2v['original unflavored gelatine mix'])

# The vector changes as the user adds these products to the basket. 
v = np.average(lista_products, axis=0)

# recipes_recipes ids run from 3399271 to 4466826 inclusive; range() excludes its upper bound.
while i < 10:
    similar_list = r.get_nns_by_vector(v, j+1, search_k=-1, include_distances=False)
    if similar_list[-1] in range(3399271, 4466827) and similar_list[-1] not in similar_recipes:
        similar_recipes.append(similar_list[-1])
        i+=1
    else:
        j+=1

In [87]:
display(similar_recipes)
list(big_table[big_table.id.isin(similar_recipes)]['name_tokenize'])

[4218870,
 3590445,
 3649059,
 3448701,
 3724874,
 4370641,
 4023967,
 4403269,
 3938550,
 3677593]

[array(['cilantro', 'fat free', 'fat free yogurt', 'filet', 'halibut',
        'hot sauce', 'juice', 'lime', 'pepper', 'salt', 'sauce', 'yogurt'],
       dtype='<U39'),
 array(['cranberry', 'cranberry sauce', 'onion', 'plain yogurt',
        'red onion', 'sauce', 'sliced', 'wine', 'yogurt'], dtype='<U39'),
 array(['carrot', 'celery', 'celery ribs', 'cilantro', 'curry powder',
        'fat free', 'fat free greek yogurt', 'greek yogurt', 'honey',
        'lemon', 'lemon juice', 'macaroni', 'mayonnaise', 'onion',
        'orzo pasta', 'pasta', 'pepper', 'powder', 'raisin', 'red onion',
        'rotelle', 'salt', 'yogurt'], dtype='<U39'),
 array(['greek yogurt', 'seasoning', 'seasoning mix', 'taco seasoning',
        'taco seasoning mix', 'yogurt'], dtype='<U39'),
 array(['chili', 'cinnamon', 'cumin', 'garam masala', 'masala', 'onion',
        'seeds', 'yogurt'], dtype='<U39'),
 array(['bread crumbs', 'celery', 'celery salt', 'chicken', 'crumbs',
        'lemon', 'lemon juice', 'lowfat yog

<a id="ch7"></a>
# Chapter 7 - Finding the probability of a user buying specific products contained in a recipe.

### By predicting each product individually and then averaging to obtain a recipe.

- The input to the model would be a user id, which would be transformed into a vector.
- The output would be the probability of the user buying a specific recipe, for all recipes.
- The heart of the model would be word2vec, with xgboost used to give a probability score.
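
As a rough illustration of the averaging step: assuming a classifier has already produced a purchase probability per product for one user, a recipe score can be taken as the mean over its ingredients. All numbers and names below are invented, `recipe_score` is a hypothetical helper, and no model is trained here.

```python
# Hypothetical per-product purchase probabilities for one user
# (stand-ins for the output of a classifier such as XGBoost).
product_probability = {"butter": 0.75, "flour": 0.5, "eggs": 0.5, "saffron": 0.25}

recipes = {
    "shortbread": ["butter", "flour"],
    "saffron cake": ["butter", "flour", "eggs", "saffron"],
}

def recipe_score(ingredients, probabilities):
    """Mean purchase probability over the ingredients that have a score."""
    known = [probabilities[i] for i in ingredients if i in probabilities]
    return sum(known) / len(known) if known else 0.0

scores = {name: recipe_score(items, product_probability) for name, items in recipes.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A plain mean treats every ingredient equally; a single unlikely ingredient lowers the score of an otherwise likely recipe.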

<img src="./images/prob_prod_indiv.png" align="center" />


### By predicting the whole recipe.
- The input to the model would be a user id and a recipe id, which would be transformed into a vector.
- The output would be the probability of the user buying a specific recipe.
- The heart of the model would be word2vec, with xgboost used to give a probability score.
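
One possible way to turn a (user, recipe) pair into a single model input is to concatenate the two embeddings. This is an assumption for illustration only, since the text does not fix how the features are built; `pair_features` is a hypothetical helper and the vectors are made up.

```python
import numpy as np

def pair_features(user_vector, recipe_vector):
    """Concatenate a user embedding and a recipe embedding into one feature row."""
    return np.concatenate([user_vector, recipe_vector])

# Made-up 3-dimensional embeddings; the notebook's vectors have 20 dimensions.
user_vector = np.array([0.1, 0.2, 0.3])
recipe_vector = np.array([0.4, 0.5, 0.6])

row = pair_features(user_vector, recipe_vector)
print(row.shape)  # (6,)
```

With the 20-dimensional vectors used above, each pair would yield a 40-dimensional row that a classifier could score.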

<img src="./images/prob_whole_recipe.png" align="center" />