# Group_Assignment_Retail
<img src="https://d26a57ydsghvgx.cloudfront.net/product/Customer%20Story%20Images/instacartlogo.png"  width=300, height=100  align="right" />

**Authors**: 
- Isobel Rae Impas
- Jan P. Thoma
- Nikolas Artadi
- Camila Vasquez
- Santiago Alfonso Galeano
- Miguel Frutos Soriano

**Subject**: 
Analytics for Retail and Consumer Goods

### STEPS
1. [Chapter 1 - Obtain a retail database that contains information with a relationship between users products and orders.](#ch1)
1. [Chapter 2 - Obtain a database of recipes that contains each recipe with a list of products. Products must be expressed in the same language.](#ch2)
1. [Chapter 3 - Use products from both databases and put them into a single list.](#ch3)
1. [Chapter 4 - Input this list into word2vec.](#ch4)
1. [Chapter 5 - Extract vectors for each product.](#ch5)

In [172]:
# Install the optional dependencies if they are missing
#!pip install gensim
#!pip install annoy

%matplotlib inline

import os
import random
import re
import shutil
import string
import tempfile
import time
from collections import Counter, defaultdict

import imageio
import matplotlib.cm as cm
import matplotlib.patheffects as PathEffects
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import spacy
from annoy import AnnoyIndex
from gensim.models import Word2Vec
from gensim.similarities import Similarity
from gensim.similarities.annoy import AnnoyIndexer
from gensim.test.utils import common_texts
from gensim.utils import tokenize
from IPython.display import Image
from nltk.corpus import stopwords
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

# Download the stopword list and cache it as a set for fast lookups
nltk.download('stopwords')
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/miguelfrutossoriano/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Datasets

In [147]:
df_products = pd.read_csv('products.csv')

with np.load('simplified-recipes-1M.npz', allow_pickle=True) as data:
    recipes = data['recipes']
    ingredients = data['ingredients']

<a id="ch1"></a>
# Chapter 1 - Obtain a retail database that contains information with a relationship between users products and orders.

## Dataset - simplified-recipes-1M 
### Description
This recipe-ingredient dataset contains about **1,000,000 carefully cleaned and preprocessed recipes**. The underlying data comes from five different base datasets, which were merged to create a more complete recipe collection. The main contribution here is that all recipes have been meticulously cleaned and standardized.

### Source

http://dominikschmidt.xyz/simplified-recipes-1M/

Dataset Sources:
-   **Recipe1M+** - [Recipe1M+](http://pic2recipe.csail.mit.edu/)<br>
    Marin, Javier, et al. "Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, pp. 1–1. doi:10.1109/TPAMI.2019.2927476.

-   **Epicurious** - [Recipes with Rating and Nutrition](https://www.kaggle.com/datasets/hugodarwood/epirecipes)

-   **Yummly** - [Recipe Ingredients Dataset](https://www.kaggle.com/datasets/kaggle/recipe-ingredients-dataset)

-   **Datafiniti** - [Food Ingredient Lists](https://www.kaggle.com/datasets/datafiniti/food-ingredient-lists)

-   **Eight Portions** - [Recipe Box](https://eightportions.com/datasets/Recipes/)


### INGREDIENTS/PRODUCTS

In [208]:
ingredients

array(['salt', 'pepper', 'butter', ..., 'watercress leaves',
       'emerils essence', 'corn flakes cereal'], dtype='<U39')

In [223]:
#Including it as a Dataframe
df_1M_products = pd.DataFrame(ingredients)
#Renaming the column
df_1M_products.rename(columns = {0:'product_name'}, inplace = True)
#Creating the source column
df_1M_products['source'] = 'recipes'
# Split products into terms: Tokenize.
df_1M_products['product_name_tokenize'] = df_1M_products['product_name'].str.split()
df_1M_products

Unnamed: 0,product_name,source,product_name_tokenize
0,salt,recipes,[salt]
1,pepper,recipes,[pepper]
2,butter,recipes,[butter]
3,garlic,recipes,[garlic]
4,sugar,recipes,[sugar]
...,...,...,...
3495,poblano chilies,recipes,"[poblano, chilies]"
3496,crystal hot sauce,recipes,"[crystal, hot, sauce]"
3497,watercress leaves,recipes,"[watercress, leaves]"
3498,emerils essence,recipes,"[emerils, essence]"


### COOKING RECIPES

In [149]:
ingredients[recipes[4]]

array(['black pepper', 'coarse sea salt', 'fresh lemon',
       'fresh lemon juice', 'ground', 'ground black pepper', 'lemon',
       'lemon juice', 'lime', 'lime peel', 'mayonaise', 'pepper',
       'sea salt', 'shallots', 'sherry wine', 'sherry wine vinegar',
       'vinegar', 'wine vinegar'], dtype='<U39')
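
The lookup `ingredients[recipes[4]]` works because each recipe is stored as an array of integer positions into the `ingredients` vocabulary. A minimal sketch of the same idea, using invented toy arrays rather than the real dataset (the values and the plain Python list used for `recipes` are assumptions for illustration):

```python
import numpy as np

# Toy stand-ins for the arrays in the .npz file (all values are invented).
ingredients = np.array(['salt', 'pepper', 'butter', 'garlic', 'sugar'])
recipes = [np.array([0, 2]), np.array([1, 3, 4])]

# Indexing the vocabulary with an integer array returns the matching names.
first_recipe = ingredients[recipes[0]]

# The same lookup applied to every recipe gives one ingredient list per recipe.
all_recipes = [ingredients[idx].tolist() for idx in recipes]

print(first_recipe)
print(all_recipes)
```

Each recipe can have a different length, which is why the lookup is done recipe by recipe instead of on one rectangular array.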

In [165]:
# Recipe 727892 has a discrepancy, so we drop it.
recipy = np.delete(recipes, 727892)
# Map each recipe's index array to its ingredient names.
recipes_words = [ingredients[recipe] for recipe in recipy]

In [184]:
#Including it as a Dataframe
df_recipes = pd.DataFrame({0: recipes_words})

In [247]:
#Renaming the column
df_recipes.rename(columns = {0:'recipe_name'}, inplace = True)
#Creating the source column
df_recipes['source'] = 'recipes'
df_recipes

Unnamed: 0,product_name,source
0,"[basil leaves, focaccia, leaves, mozzarella, p...",recipes
1,"[balsamic vinegar, boiling water, butter, cook...",recipes
2,"[bottle, bouillon, carrots, celery, chicken bo...",recipes
3,"[grand marnier, kahlua]",recipes
4,"[black pepper, coarse sea salt, fresh lemon, f...",recipes
...,...,...
1067551,"[butter, coconut, deep dish pie crust, eggs, f...",recipes
1067552,"[brown sugar, butter, flour, nectarines, oats,...",recipes
1067553,"[basil, buttermilk, buttermilk biscuits, fat f...",recipes
1067554,"[brown sugar, butter, liquid, potatoes, salt, ...",recipes


<a id="ch2"></a>
# Chapter 2 - Obtain a database of recipes that contains each recipe with a list of products.

## Dataset - Instacart
### Description
`orders` (3.4m rows, 206k users):
* `order_id`: order identifier
* `user_id`: customer identifier
* `eval_set`: which evaluation set this order belongs in (see `SET` described below)
* `order_number`: the order sequence number for this user (1 = first, n = nth)
* `order_dow`: the day of the week the order was placed on
* `order_hour_of_day`: the hour of the day the order was placed on
* `days_since_prior`: days since the last order, capped at 30 (with NAs for `order_number` = 1)

`products` (50k rows):
* `product_id`: product identifier
* `product_name`: name of the product
* `aisle_id`: foreign key
* `department_id`: foreign key

`aisles` (134 rows):
* `aisle_id`: aisle identifier
* `aisle`: the name of the aisle

`departments` (21 rows):
* `department_id`: department identifier
* `department`: the name of the department

`order_products__SET` (30m+ rows):
* `order_id`: foreign key
* `product_id`: foreign key
* `add_to_cart_order`: order in which each product was added to cart
* `reordered`: 1 if this product has been ordered by this user in the past, 0 otherwise

where `SET` is one of the following three evaluation sets (`eval_set` in `orders`):
* `"prior"`: orders prior to that user's most recent order (~3.2m orders)
* `"train"`: training data supplied to participants (~131k orders)
* `"test"`: test data reserved for machine learning competitions (~75k orders)
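
The foreign keys above are what link users, orders, and products. As a rough sketch of how the tables could be joined into one basket per order, the example below uses small invented DataFrames that follow the schema (the rows are made up, and this join is an illustration, not a step the notebook runs):

```python
import pandas as pd

# Toy rows following the schema above (all values are invented).
orders = pd.DataFrame({'order_id': [1, 2], 'user_id': [10, 11]})
order_products = pd.DataFrame({
    'order_id': [1, 1, 2],
    'product_id': [100, 101, 100],
    'add_to_cart_order': [1, 2, 1],
})
products = pd.DataFrame({
    'product_id': [100, 101],
    'product_name': ['Artisan Baguette', 'Green Chile Anytime Sauce'],
})

# Follow the foreign keys: order -> order lines -> product names.
lines = (order_products
         .merge(orders, on='order_id')
         .merge(products, on='product_id')
         .sort_values(['order_id', 'add_to_cart_order']))

# One basket (list of product names) per order.
baskets = lines.groupby('order_id')['product_name'].apply(list).to_dict()
print(baskets)
```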


### Source

Kaggle Competition: https://www.kaggle.com/competitions/instacart-market-basket-analysis/data



## Prepare and structure the data to be useable by the word2vec algorithm

In [224]:
def clean_text(text):
    '''Lowercase the text; remove bracketed text, the ® symbol, punctuation, words containing digits, and stopwords.'''
    # Make everything lowercase.
    text = text.lower()
    # Remove text in square brackets.
    text = re.sub(r'\[.*?\]', '', text)
    # Remove the ® symbol.
    text = text.replace('®', '')
    # Remove punctuation.
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    # Remove words containing numbers.
    text = re.sub(r'\w*\d\w*', '', text)
    # Keep the name only if more than two characters remain, then drop stopwords.
    if len(text) > 2:
        return ' '.join(word for word in text.split() if word not in STOPWORDS)
    # Otherwise the name is treated as missing.
    return None
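
To make the effect of each cleaning step concrete, here is a self-contained sketch of the same sequence applied to a few product names. The function name `clean_text_sketch`, the tiny `TOY_STOPWORDS` set, and the sample names are assumptions for illustration; the notebook itself uses NLTK's English stopword list.

```python
import re
import string

# Tiny hand-picked stopword set for illustration only.
TOY_STOPWORDS = {'and', 'the', 'with', 'of'}

def clean_text_sketch(text):
    """Apply the same cleaning steps with a toy stopword list."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)    # drop bracketed text
    text = text.replace('®', '')           # drop the registered symbol
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)   # drop words containing digits
    if len(text) > 2:
        return ' '.join(w for w in text.split() if w not in TOY_STOPWORDS)
    return None                            # too little text left

print(clean_text_sketch('All-Seasons Salt'))
print(clean_text_sketch('Cheese & Crackers with 2% Milk [Family Size]'))
print(clean_text_sketch('#2'))
```

Because punctuation is deleted rather than replaced by a space, hyphenated names collapse into a single token, and names that consist only of digits and symbols come back as `None`.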

In [225]:
#Apply clean_text
df_clean = pd.DataFrame(df_products.product_name.apply(lambda x: clean_text(x)))

In [226]:
#Checking missings
missing_values = df_clean.isnull().sum().sort_values(ascending=False)
missing_values

product_name    6
dtype: int64

In [227]:
#Change col name
df_clean = df_clean.rename(columns = {'product_name':'prod_name_new'})

In [229]:
df_clean

Unnamed: 0,prod_name_new
0,chocolate sandwich cookies
1,allseasons salt
2,robust golden unsweetened oolong tea
3,smart ones classic favorites mini rigatoni vod...
4,green chile anytime sauce
...,...
49683,vodka triple distilled twist vanilla
49684,en croute roast hazelnut cranberry
49685,artisan baguette
49686,smartblend healthy metabolism dry cat food


In [230]:
#Transform product_name from object to string type for lemmatization
df_clean['prod_name_new'] = df_clean['prod_name_new'].astype('str')

In [231]:
# Disable named entity recognition and the dependency parser for speed
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def lemmatizer(text):
    '''Return the text with every token replaced by its lemma.'''
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

df_clean["text_lemmatize"] = df_clean['prod_name_new'].apply(lemmatizer)

In [232]:
# Remove the -PRON- placeholder lemma: it adds complexity to the model but no value
df_clean['text_lemmatize_clean'] = df_clean['text_lemmatize'].str.replace('-PRON-', '')

In [233]:
# Split products into terms (tokenize), using the cleaned lemmas so the tokens match product_name.
df_clean['product_name_tokenize'] = df_clean['text_lemmatize_clean'].str.split()

In [235]:
df_instacart = pd.DataFrame(df_clean[['text_lemmatize_clean','product_name_tokenize']])
df_instacart.rename(columns = {'text_lemmatize_clean':'product_name'}, inplace = True)

In [236]:
df_instacart['source'] = 'instacart'

In [237]:
df_instacart

Unnamed: 0,product_name,product_name_tokenize,source
0,chocolate sandwich cookie,"[chocolate, sandwich, cookie]",instacart
1,allseason salt,"[allseason, salt]",instacart
2,robust golden unsweetene oolong tea,"[robust, golden, unsweetene, oolong, tea]",instacart
3,smart one classic favorite mini rigatoni vodka...,"[smart, one, classic, favorite, mini, rigatoni...",instacart
4,green chile anytime sauce,"[green, chile, anytime, sauce]",instacart
...,...,...,...
49683,vodka triple distil twist vanilla,"[vodka, triple, distil, twist, vanilla]",instacart
49684,en croute roast hazelnut cranberry,"[en, croute, roast, hazelnut, cranberry]",instacart
49685,artisan baguette,"[artisan, baguette]",instacart
49686,smartblend healthy metabolism dry cat food,"[smartblend, healthy, metabolism, dry, cat, food]",instacart


<a id="ch3"></a>
# Chapter 3 - Use products from both databases and put them into a single list.

In [238]:
frames = [df_1M_products, df_instacart]

total_products = pd.concat(frames)

In [239]:
total_products

Unnamed: 0,product_name,source,product_name_tokenize
0,salt,recipes,[salt]
1,pepper,recipes,[pepper]
2,butter,recipes,[butter]
3,garlic,recipes,[garlic]
4,sugar,recipes,[sugar]
...,...,...,...
49683,vodka triple distil twist vanilla,instacart,"[vodka, triple, distil, twist, vanilla]"
49684,en croute roast hazelnut cranberry,instacart,"[en, croute, roast, hazelnut, cranberry]"
49685,artisan baguette,instacart,"[artisan, baguette]"
49686,smartblend healthy metabolism dry cat food,instacart,"[smartblend, healthy, metabolism, dry, cat, food]"


<a id="ch4"></a>
# Chapter 4 - Input this list into word2vec.

In [242]:
# MODEL:
# Input: a list of token lists, one per product name.
# vector_size=20: each word is embedded in 20 dimensions.
# window=5: up to five words on each side of the target word are used as context (5 is also the gensim default).
# min_count=1: keep every word, even if it occurs only once.

w2vec_model = Word2Vec(list(total_products['product_name_tokenize']), vector_size=20, window=5, min_count=1, workers=4)
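
In gensim, `window` is the maximum distance between the target word and a context word, so `window=5` allows up to five words on each side. The sketch below enumerates the (target, context) pairs for one tokenized product name. It is a simplified illustration: `context_pairs` is a hypothetical helper, and gensim's actual training may additionally shrink the window at random for each target.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs using up to `window` tokens per side."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = ['green', 'chile', 'anytime', 'sauce']
pairs_w1 = context_pairs(tokens, window=1)
pairs_w5 = context_pairs(tokens, window=5)

print(len(pairs_w1), len(pairs_w5))
```

Product names are short, so with `window=5` almost every word in a name ends up as context for every other word in that name.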

### WORDS VECTORS

In [243]:
# Build a dictionary mapping each vocabulary word to its vector
word_vec = {w: w2vec_model.wv[w] for w in w2vec_model.wv.index_to_key}

display(list(word_vec.items())[:2])

[('organic',
  array([ 0.8227379 ,  0.85770893,  0.7821191 ,  0.94946855, -1.0111732 ,
          0.25177354,  0.7361633 ,  1.712379  ,  0.1892081 , -1.6166791 ,
          1.2296935 , -0.08542251,  1.1780608 , -0.1348745 ,  0.39695787,
          0.57542956,  0.23890342,  0.41779295, -1.5441748 , -2.378866  ],
        dtype=float32)),
 ('chocolate',
  array([ 0.4257686 ,  0.83233166,  4.238562  ,  1.9029965 , -0.6545395 ,
         -0.9106448 ,  0.40481508,  2.039697  , -0.6449666 , -0.46806687,
          0.10918399, -2.1554935 ,  1.2563599 , -0.79320264, -0.6668195 ,
         -0.36873293,  2.0192173 , -2.8527844 , -5.009957  , -2.228268  ],
        dtype=float32))]

<a id="ch5"></a>
# Chapter 5 - Extract vectors for each product.

### PRODUCT VECTORS

In [244]:
# Build a dictionary mapping each product name to its vector.
# A product's vector is the average of the vectors of the words in its name.

prods_w2v = dict()
for index, row in total_products.iterrows():
    word_vector = [word_vec[word] for word in row['product_name_tokenize']]
    prods_w2v[row['product_name']] = np.average(word_vector, axis=0)
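
The averaging step can be checked by hand on a small case. The sketch below uses invented 3-dimensional vectors in place of the model's 20-dimensional ones. The helper `product_vector` is hypothetical, and its zero-vector fallback for names with no known tokens is an assumption, not something the notebook does.

```python
import numpy as np

# Invented 3-dimensional word vectors (the notebook's model uses 20 dimensions).
toy_word_vec = {
    'artisan': np.array([1.0, 0.0, 2.0]),
    'baguette': np.array([3.0, 2.0, 0.0]),
}

def product_vector(tokens, word_vec, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [word_vec[t] for t in tokens if t in word_vec]
    if not vectors:
        # Assumed fallback for empty or fully unknown names.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

vec = product_vector(['artisan', 'baguette'], toy_word_vec, dim=3)
empty = product_vector([], toy_word_vec, dim=3)

print(vec)
print(empty)
```

Averaging ignores word order, so two names made of the same words in a different order get the same vector.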

In [245]:
display(list(prods_w2v.items())[:2])

[('salt',
  array([-0.9277114 ,  1.3709065 ,  0.7961813 , -1.4068056 , -2.5468886 ,
          3.5155895 ,  0.54835004,  3.152282  ,  0.999399  , -1.2660598 ,
          1.4716754 , -1.2852179 ,  1.3523791 , -0.95161647,  0.33901408,
         -1.5528069 , -0.6865827 , -3.2860389 , -3.7523117 , -1.2448114 ],
        dtype=float32)),
 ('pepper',
  array([ 1.2926804 ,  0.2748337 ,  1.1499476 ,  0.84602904,  0.38453394,
          3.2115936 , -1.1842566 ,  1.2449261 ,  2.051805  ,  0.9805605 ,
          1.4547281 ,  1.346296  ,  3.4658117 ,  0.3665797 ,  1.1006526 ,
          0.43127254,  0.25667349, -0.22884527, -1.8060076 , -3.160971  ],
        dtype=float32))]

In [246]:
# Add the product vectors to the dataframe
prod = pd.DataFrame(list(prods_w2v.items()), columns=['product_name', 'product_vector'])
total_products_vectors = pd.merge(total_products, prod, on="product_name")
total_products_vectors

Unnamed: 0,product_name,source,product_name_tokenize,product_vector
0,salt,recipes,[salt],"[-0.9277114, 1.3709065, 0.7961813, -1.4068056,..."
1,salt,instacart,[salt],"[-0.9277114, 1.3709065, 0.7961813, -1.4068056,..."
2,pepper,recipes,[pepper],"[1.2926804, 0.2748337, 1.1499476, 0.84602904, ..."
3,pepper,instacart,[pepper],"[1.2926804, 0.2748337, 1.1499476, 0.84602904, ..."
4,butter,recipes,[butter],"[-1.1640556, 0.42402837, 2.6274712, 1.409844, ..."
...,...,...,...,...
53183,vodka triple distil twist vanilla,instacart,"[vodka, triple, distil, twist, vanilla]","[-0.33344287, 0.641096, 0.78675956, 0.16579585..."
53184,en croute roast hazelnut cranberry,instacart,"[en, croute, roast, hazelnut, cranberry]","[0.09934117, 0.15263984, 1.182023, 0.6832714, ..."
53185,artisan baguette,instacart,"[artisan, baguette]","[-0.09982366, 0.40260994, 0.30901158, 0.127719..."
53186,smartblend healthy metabolism dry cat food,instacart,"[smartblend, healthy, metabolism, dry, cat, food]","[0.04900678, -0.23788376, 0.36894712, 0.932850..."
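
Once every product has a vector, products can be compared by cosine similarity. The following is a minimal sketch with invented 2-dimensional vectors standing in for the `product_vector` column. The names, values, and the helper `most_similar` are assumptions for illustration, not results from the model.

```python
import numpy as np

# Invented product vectors standing in for the `product_vector` column.
names = ['salt', 'sea salt', 'cat food']
vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
])

def most_similar(query, names, vectors, top_n=2):
    """Rank the other products by cosine similarity to `query`."""
    # Normalise rows so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[names.index(query)]
    order = np.argsort(-sims)
    ranked = [(names[i], float(sims[i])) for i in order if names[i] != query]
    return ranked[:top_n]

ranking = most_similar('salt', names, vectors)
print(ranking)
```

For the full table of roughly 53,000 products, an approximate nearest-neighbour index such as the Annoy library imported at the top of the notebook would avoid comparing every pair.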
