# Grocery Recommendation Project

1. Data Clustering on users  
2. Recommendation systems
  * Content-based filtering  
  * Collaborative filtering  
3. Metadata NLP search engine
4. Market Basket analysis  
5. Website interface  

For this analysis I am working as a Data Scientist for a grocery store that wants to discover insights from its sales data. Those insights could be used for targeted direct mail marketing (specific coupons mailed to customers), targeted email marketing ("An item you like has gone on sale!"), and online 'add to cart' recommendations based both on similar items and on what other people who bought the same item have purchased.  

If time permits, I may also perform a market basket analysis to forecast what products a customer is likely to purchase in their next order.

## Dataset information

This data was retrieved from Kaggle and was provided by Instacart for a market basket analysis competition in 2017.  

The data is divided into 6 files:

- **_Aisles.csv_**: 134 unique aisle numbers and descriptions
- **_Departments.csv_**: 21 unique department numbers and descriptions
- **_Products.csv_**: 49,688 unique product ids, with description, aisle id, and department id
- **_Orders.csv_**: 3,421,083 unique order ids, with user id, order number, order_dow, order_hour_of_day, days_since_prior_order, and eval_set indicating whether the order is in train, prior, or test
- **_Order_products_train.csv_**: Order id, product id, add to cart order, and reorder indicator
- **_Order_products_prior.csv_**: Order id, product id, add to cart order, and reorder indicator
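
To make the relationships between these files concrete, here is a minimal sketch of how they could be joined into one wide table. It uses tiny in-memory stand-ins rather than the real CSVs (in practice each frame would come from `pd.read_csv`); the two rows are taken from the sample output shown later in this notebook, and the join order is only one plausible choice, not necessarily how `merged_orders` was actually built.

```python
import pandas as pd

# Tiny stand-ins for the real CSV files; only the column names mirror the dataset.
aisles = pd.DataFrame({"aisle_id": [77, 23], "aisle": ["soft drinks", "popcorn jerky"]})
departments = pd.DataFrame({"department_id": [7, 19], "department": ["beverages", "snacks"]})
products = pd.DataFrame({
    "product_id": [196, 26088],
    "product_name": ["Soda", "Aged White Cheddar Popcorn"],
    "aisle_id": [77, 23],
    "department_id": [7, 19],
})
orders = pd.DataFrame({
    "order_id": [2539329], "user_id": [1], "eval_set": ["prior"], "order_number": [1],
})
order_products = pd.DataFrame({
    "order_id": [2539329, 2539329],
    "product_id": [196, 26088],
    "add_to_cart_order": [1, 4],
    "reordered": [0, 0],
})

# Orders -> order lines -> product details -> aisle and department names
merged = (
    orders.merge(order_products, on="order_id")
    .merge(products, on="product_id")
    .merge(aisles, on="aisle_id")
    .merge(departments, on="department_id")
)
print(merged[["user_id", "product_name", "aisle", "department"]])
```

The result has one row per product per order, which is the grain the clustering section below starts from.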


## EDA and Data Preprocessing

See separate notebook

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
import time
from user_functions import *
from datetime import datetime
import pickle
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Set default visualization parameters

CB91_Blue = '#2CBDFE'
CB91_Green = '#47DBCD'
CB91_Pink = '#F3A0F2'
CB91_Purple = '#9D2EC5'
CB91_Violet = '#661D98'
CB91_Amber = '#F5B14C'
color_list = [CB91_Blue, CB91_Pink, CB91_Green, CB91_Amber, CB91_Purple, CB91_Violet]
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=color_list)
sns.set_context("notebook", rc={"font.size":16, "axes.titlesize":20, "axes.labelsize":18})
sns.set(font='Franklin Gothic Book',
rc={'axes.axisbelow': False,
'axes.edgecolor': 'lightgrey',
# 'axes.edgecolor': 'white',
'axes.facecolor': 'None',
'axes.grid': False,
'axes.labelcolor': 'dimgrey',
# 'axes.labelcolor': 'white',
'axes.spines.right': False,
'axes.spines.top': False,
'axes.prop_cycle': plt.cycler(color=color_list),
'figure.facecolor': 'white',
'lines.solid_capstyle': 'round',
'patch.edgecolor': 'w',
'patch.force_edgecolor': True,
'text.color': 'dimgrey',
# 'text.color': 'white',    
'xtick.bottom': False,
'xtick.color': 'dimgrey',
# 'xtick.color': 'white',    
'xtick.direction': 'out',
'xtick.top': False,
'ytick.color': 'dimgrey',
# 'ytick.color': 'white',
'ytick.direction': 'out',
'ytick.left': False,
'ytick.right': False})
%matplotlib inline

# NOTE: if your visualizations are too cluttered to read, try calling 'plt.gcf().autofmt_xdate()'!

## Clustering

In [3]:
# Now I want to experiment with clustering the 'similar' users together
# But what data do I need for each user?  Some kind of summary statistics?
# I guess I need each product to be a column, with the number of times it was ordered?
# Product level is too granular so I am going to try aisle

In [4]:
merged_orders = pickle.load(open("Pickle/merged_orders.p", "rb"))

In [5]:
merged_orders.head()

| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | aisle | department |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | | 196 | 1 | 0 | Soda | 77 | 7 | soft drinks | beverages |
| 1 | 2539329 | 1 | prior | 1 | 2 | 8 | | 26088 | 4 | 0 | Aged White Cheddar Popcorn | 23 | 19 | popcorn jerky | snacks |
| 2 | 2539329 | 1 | prior | 1 | 2 | 8 | | 14084 | 2 | 0 | Organic Unsweetened Vanilla Almond Milk | 91 | 16 | soy lactosefree | dairy eggs |
| 3 | 2539329 | 1 | prior | 1 | 2 | 8 | | 26405 | 5 | 0 | XL Pick-A-Size Paper Towel Rolls | 54 | 17 | paper goods | household |
| 4 | 2539329 | 1 | prior | 1 | 2 | 8 | | 12427 | 3 | 0 | Original Beef Jerky | 23 | 19 | popcorn jerky | snacks |


In [6]:
# What don't I need for my user dataframe?
# Since product level is too granular, I will capture the aisle
user_info = merged_orders[['user_id', 'order_number', 'order_dow', 'order_hour_of_day', 
                           'days_since_prior_order', 'aisle']]

In [7]:
user_info.head()

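| | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order | aisle |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 8 | | soft drinks |
| 1 | 1 | 1 | 2 | 8 | | popcorn jerky |
| 2 | 1 | 1 | 2 | 8 | | soy lactosefree |
| 3 | 1 | 1 | 2 | 8 | | paper goods |
| 4 | 1 | 1 | 2 | 8 | | popcorn jerky |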


### Get dummy variables for each aisle

In [8]:
user_data = pd.get_dummies(user_info, prefix=None, columns=['aisle'])

In [9]:
user_data
# Group By User_id
# I need max of order_number
# Mode of order_dow, median of order_hour_of_day, mean of days_since
# Sum of each aisle

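| | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order | aisle_air fresheners candles | aisle_asian foods | aisle_baby accessories | aisle_baby bath body care | aisle_baby food formula | ... | aisle_spreads | aisle_tea | aisle_tofu meat alternatives | aisle_tortillas flat bread | aisle_trail mix snack mix | aisle_trash bags liners | aisle_vitamins supplements | aisle_water seltzer sparkling water | aisle_white wines | aisle_yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 8 | | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 2 | 8 | | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 2 | 8 | | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 2 | 8 | | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 1 | 2 | 8 | | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33819101 | 206209 | 14 | 6 | 14 | 30.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33819102 | 206209 | 14 | 6 | 14 | 30.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33819103 | 206209 | 14 | 6 | 14 | 30.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 33819104 | 206209 | 14 | 6 | 14 | 30.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |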


In [10]:
user_data['aisle_soft drinks'].value_counts()

0    33445290
1      373816
Name: aisle_soft drinks, dtype: int64

### Group By each User Id

In [11]:
# Split the data into thirds so each groupby fits in memory, then rejoin the results afterwards
user_data1 = user_data[user_data['user_id'] <= 65000]
user_data2 = user_data[(user_data['user_id'] <= 135000) & (user_data['user_id'] > 65000)]
user_data3 = user_data[user_data['user_id'] > 135000]

In [12]:
grouped_user1 = user_data1.groupby('user_id').sum()

In [15]:
grouped_user2 = user_data2.groupby('user_id').sum()

MemoryError: Unable to allocate 11.5 GiB for an array with shape (134, 11508594) and data type int64
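
The dummy-variable frame has one row per ordered product and one column per aisle, which is what makes the groupby so memory hungry. One alternative I could try is to count (user, aisle) pairs directly, so the wide 0/1 matrix is never materialised. The sketch below uses a toy `user_info` frame (the values are made up) and `pd.crosstab`; it should give the same per-user aisle counts as `get_dummies` followed by `groupby('user_id').sum()`, though I have not benchmarked it on the full data.

```python
import pandas as pd

# Toy stand-in for user_info: one row per product ordered
user_info = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle": ["soft drinks", "popcorn jerky", "popcorn jerky", "yogurt", "soft drinks"],
})

# One row per user, one column per aisle, values = number of products ordered from that aisle
aisle_counts = pd.crosstab(user_info["user_id"], user_info["aisle"]).add_prefix("aisle_")
print(aisle_counts)
```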

In [None]:
grouped_user3 = user_data3.groupby('user_id').sum()

In [None]:
# Go back to user_data and drop the aisle info before I group the other variables in different ways
user_data1_noaisles = user_data1.iloc[:,:5]
user_data2_noaisles = user_data2.iloc[:,:5]
user_data3_noaisles = user_data3.iloc[:,:5]

In [None]:
# Cool.  Found a way to aggregate each variable differently.
group1 = user_data1_noaisles.groupby('user_id').agg({'order_number': 'max', 'order_dow': lambda x:x.value_counts().index[0], 
                                            'order_hour_of_day': 'median', 'days_since_prior_order': 'mean'})
group2 = user_data2_noaisles.groupby('user_id').agg({'order_number': 'max', 'order_dow': lambda x:x.value_counts().index[0], 
                                            'order_hour_of_day': 'median', 'days_since_prior_order': 'mean'})
group3 = user_data3_noaisles.groupby('user_id').agg({'order_number': 'max', 'order_dow': lambda x:x.value_counts().index[0], 
                                            'order_hour_of_day': 'median', 'days_since_prior_order': 'mean'})
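
The same per-column aggregation can also be written with pandas named aggregation, which produces the final column names in one step. This is only a sketch on a toy frame with one row per order (the real frame has one row per product, so the mean there is weighted by basket size); `most_common` is a hypothetical helper name, not something from the notebook.

```python
import pandas as pd

def most_common(values: pd.Series):
    """Return the most frequent value; on ties, Series.mode() lists the smallest first."""
    return values.mode().iloc[0]

# Toy data: one row per order
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "order_dow": [2, 2, 5, 0, 0],
    "order_hour_of_day": [8, 9, 14, 20, 22],
    "days_since_prior_order": [None, 7.0, 14.0, None, 30.0],
})

summary = orders.groupby("user_id").agg(
    num_orders=("order_number", "max"),
    mode_order_dow=("order_dow", most_common),
    median_order_hour=("order_hour_of_day", "median"),
    mean_days_since=("days_since_prior_order", "mean"),  # NaN (first orders) is skipped
)
print(summary)
```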

In [None]:
group1

In [None]:
# get number of orders for each user and add to grouped_user dfs
grouped_user1['num_orders'] = group1.order_number
grouped_user2['num_orders'] = group2.order_number
grouped_user3['num_orders'] = group3.order_number

In [None]:
grouped_user1['mean_days_since'] = group1.days_since_prior_order
grouped_user2['mean_days_since'] = group2.days_since_prior_order
grouped_user3['mean_days_since'] = group3.days_since_prior_order

In [None]:
grouped_user1['mode_order_dow'] = group1.order_dow
grouped_user2['mode_order_dow'] = group2.order_dow
grouped_user3['mode_order_dow'] = group3.order_dow

In [None]:
grouped_user1['median_order_hour'] = group1.order_hour_of_day
grouped_user2['median_order_hour'] = group2.order_hour_of_day
grouped_user3['median_order_hour'] = group3.order_hour_of_day

In [None]:
grouped_user1.drop(columns=['order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order'], inplace=True)
grouped_user2.drop(columns=['order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order'], inplace=True)
grouped_user3.drop(columns=['order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order'], inplace=True)

In [None]:
grouped_users = pd.concat([grouped_user1, grouped_user2, grouped_user3], axis=0)

In [None]:
grouped_users

In [None]:
# This function came from a Medium article by Adam Ross Nelson to rearrange columns in a df
def movecol(df, cols_to_move=None, ref_col='', place='After'):
    # Avoid a mutable default argument
    cols_to_move = list(cols_to_move) if cols_to_move is not None else []
    
    cols = df.columns.tolist()
    if place == 'After':
        seg1 = cols[:cols.index(ref_col) + 1]
        seg2 = cols_to_move
    elif place == 'Before':
        seg1 = cols[:cols.index(ref_col)]
        seg2 = cols_to_move + [ref_col]
    else:
        raise ValueError("place must be 'After' or 'Before'")
    
    # Everything before the reference column, then the moved columns, then the rest
    seg1 = [i for i in seg1 if i not in seg2]
    seg3 = [i for i in cols if i not in seg1 + seg2]
    
    return df[seg1 + seg2 + seg3]

In [None]:
grouped_users = movecol(grouped_users, 
             cols_to_move=['num_orders', 'mode_order_dow', 'median_order_hour', 'mean_days_since'], 
             ref_col='aisle_air fresheners candles',
             place='Before')
grouped_users

In [None]:
pickle.dump(grouped_users, open("Pickle/grouped_users.p", "wb"))

### Run KMeans clustering

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_users = scaler.fit_transform(grouped_users)

In [None]:
scaled_users
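
As a reminder of what the scaler is doing: each column is shifted to mean 0 and divided by its standard deviation, so a high-volume aisle does not dominate the distance calculation in KMeans just because its counts are larger. A small sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])

scaled = StandardScaler().fit_transform(X)

# The same z-score by hand; StandardScaler uses the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(scaled, manual))
```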

In [None]:
from sklearn.cluster import KMeans
random_state = 12

'''The classical EM-style algorithm is “full”. The “elkan” variation is more efficient on data with well-defined clusters,
by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape 
(n_samples, n_clusters).'''
# Note: newer scikit-learn releases name the classical algorithm 'lloyd' rather than 'full'.

# Is having 20 separate market segments helpful?  Can I figure out what makes them different and target that?  

k_means_13 = KMeans(n_clusters=13, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_14 = KMeans(n_clusters=14, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_15 = KMeans(n_clusters=15, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_16 = KMeans(n_clusters=16, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_17 = KMeans(n_clusters=17, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_18 = KMeans(n_clusters=18, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_19 = KMeans(n_clusters=19, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_10 = KMeans(n_clusters=10, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_11 = KMeans(n_clusters=11, random_state=random_state, algorithm='full').fit(scaled_users)
k_means_12 = KMeans(n_clusters=12, random_state=random_state, algorithm='full').fit(scaled_users)
# k_means_20 = KMeans(n_clusters=20, random_state=random_state, algorithm='full').fit(scaled_users)
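
Fitting one model per value of K can also be written as a loop that stores the models in a dictionary keyed by K. The sketch below runs on synthetic blobs rather than `scaled_users`, and the `models` name is my own. It leaves `algorithm` at its default because, as far as I know, recent scikit-learn releases no longer accept `'full'` and use `'lloyd'` for the same classical algorithm.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled user matrix
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=12)

# One fitted model per K, keyed by K
models = {
    k: KMeans(n_clusters=k, n_init=10, random_state=12).fit(X)
    for k in range(2, 7)
}

for k, model in models.items():
    print(k, round(model.inertia_, 1))
```

A dictionary keeps the K values and the models together, so the plotting cells can use `list(models)` for the x-axis instead of a hand-typed list.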


In [None]:
# pickle.dump(k_means_20, open("Pickle/k_means_20.p", "wb"))

In [None]:
k_means_20 = pickle.load(open("Pickle/k_means_20.p", "rb"))

In [None]:
k_list = [k_means_10, k_means_11, k_means_12, k_means_13, 
          k_means_14, k_means_15, k_means_16, k_means_17, k_means_18, k_means_19, k_means_20]

In [None]:
from sklearn.metrics import calinski_harabasz_score

CH_score = []

# Score in the same (scaled) feature space the models were fitted on
for model in k_list:
    labels = model.labels_
    CH_score.append(calinski_harabasz_score(scaled_users, labels))

In [None]:
# Need to decide if I keep going with more clusters
# Previous k_means_20 on unscaled data had CH around 16000, now scaled it is at 8000

plt.plot([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], CH_score)
plt.xticks([10,11,12,13,14,15,16,17,18,19,20])
plt.title('Calinski Harabasz Scores for Different Values of K')
plt.ylabel('Variance Ratio')
plt.xlabel('K=')
plt.savefig('Images/ch_scaled_scores.png');
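
For reference, the Calinski-Harabasz score is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom (k - 1 and n - k). Because both terms are sums of squared distances, the score depends on the feature matrix it is given, which is why the same labels can score very differently on scaled and unscaled data. A sketch on synthetic blobs that computes it by hand and compares against scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=12)
labels = KMeans(n_clusters=3, n_init=10, random_state=12).fit_predict(X)

n, k = len(X), len(np.unique(labels))
overall_mean = X.mean(axis=0)

between = 0.0  # size-weighted squared distance of each cluster centre from the overall mean
within = 0.0   # squared distance of each point from its own cluster centre
for c in np.unique(labels):
    members = X[labels == c]
    centre = members.mean(axis=0)
    between += len(members) * np.sum((centre - overall_mean) ** 2)
    within += np.sum((members - centre) ** 2)

manual_ch = (between / (k - 1)) / (within / (n - k))
print(np.isclose(manual_ch, calinski_harabasz_score(X, labels)))
```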

In [None]:
# Looking at Within Cluster Sum of Squares

# inertia_ is the within-cluster sum of squares of each fitted model
wcss_score = [model.inertia_ for model in k_list]

In [None]:
plt.plot([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], wcss_score)
plt.xticks([10,11,12,13,14,15,16,17,18,19,20])
plt.title('Within Cluster Sum of Squares Scores for Different Values of K')
plt.ylabel('WCSS')
plt.xlabel('K=')
plt.savefig('Images/wcss_scores.png');

In [None]:
# Silhouette score: 1 is good, -1 is bad, near 0 means overlapping clusters
# Scored on scaled_users, the same feature space the models were fitted on

from sklearn import metrics
metrics.silhouette_score(scaled_users, k_means_20.labels_, sample_size = 30000, random_state = random_state)

In [None]:
metrics.silhouette_score(scaled_users, k_means_19.labels_, sample_size = 30000, random_state = random_state)
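
The silhouette score needs pairwise distances, so it gets expensive on hundreds of thousands of users; `sample_size` scores a random subset instead. On synthetic blobs, a sketch of how close the sampled estimate stays to the full score:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=2000, centers=4, n_features=5, random_state=12)
labels = KMeans(n_clusters=4, n_init=10, random_state=12).fit_predict(X)

full_score = silhouette_score(X, labels)
# Same metric, computed on a random subset of 500 points
sampled_score = silhouette_score(X, labels, sample_size=500, random_state=12)
print(round(full_score, 3), round(sampled_score, 3))
```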

In [None]:
# OK let's focus on k_means_20 and have a look at our clusters

k_means_20.labels_

In [None]:
# Add cluster assignment to the grouped_users dataframe
grouped_users['cluster'] = k_means_20.labels_

### Analyze clusters

In [None]:
# There are three large clusters and many smaller ones.  It may be hard to determine what the large clusters have in common.
grouped_users.cluster.value_counts()

In [None]:
# Now take the grouped users and group them by cluster

# For each user, I took the mode of their order_dow.  Now I am taking the median value for the cluster.

# cluster_data = grouped_users.groupby('cluster').agg({'num_orders': 'median', 
#                                                      'mode_order_dow': lambda x:x.value_counts().index[0], 
#                                                      'median_order_hour': 'median', 'mean_days_since': 'mean'})
cluster_data = grouped_users.groupby('cluster').median()

In [None]:
cluster_data # This is the median info for each cluster

# I can see cluster 7 has a lot of baby products

In [None]:
# These are the users that make up cluster 7, and yes they have a lot of baby products

grouped_users[grouped_users['cluster'] == 7]

In [None]:
# These are all of cluster 7's values that aren't zero... still 62 of them

cluster_data.iloc[7,(cluster_data.loc[7].values > 0)]

### Use TSNE to convert cluster data to 3D

In [None]:
# Convert to three dimensional for graphing

from sklearn.manifold import TSNE

cluster_embedded = TSNE(n_components=3).fit_transform(cluster_data)
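
One thing to watch here: `cluster_data` only has 20 rows, while t-SNE's default `perplexity` is 30. As far as I know, recent scikit-learn versions require the perplexity to be smaller than the number of samples, so this call may need an explicit, smaller value. A sketch on random stand-in data (the shape 20 x 138 and `perplexity=5` are assumptions for illustration); setting `random_state` also makes the layout repeatable between runs.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(12)
cluster_profiles = rng.random((20, 138))  # stand-in for 20 rows of cluster medians

embedded = TSNE(n_components=3, perplexity=5, random_state=12).fit_transform(cluster_profiles)
print(embedded.shape)
```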

In [None]:
# Create dataframe to plot
cluster_embedded_df = pd.DataFrame(cluster_embedded, index = cluster_data.index, columns = ['1','2','3'])
cluster_embedded_df.reset_index(inplace=True)
cluster_embedded_df

In [None]:
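# pandas, numpy, seaborn and pyplot were already imported in the first cell
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions
from matplotlib.colors import ListedColormap

# axes instance
fig = plt.figure(figsize=(6,6))
# Axes3D(fig) no longer attaches itself to the figure in recent matplotlib versions
ax = fig.add_subplot(projection='3d')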

# get colormap from seaborn
cmap = ListedColormap(sns.color_palette("husl", 256).as_hex())

# plot
sc = ax.scatter(cluster_embedded_df['1'], cluster_embedded_df['2'], cluster_embedded_df['3'], 
                s=40, c=cluster_embedded_df['cluster'], marker='o', cmap=cmap, alpha=1)
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
ax.set_zlabel('t-SNE 3')

# legend
plt.legend(*sc.legend_elements(num=20), bbox_to_anchor=(1.05, 1), loc=2)

# save
plt.savefig("Images/scatter_hue.png", bbox_inches='tight')

In [None]:
# This shows clusters 5, 11, 12, 14, and 17 as being very separate from the rest

import plotly.express as px

fig = px.scatter_3d(cluster_embedded_df, x='1', y='2', z='3', color='cluster')
fig.show()

In [None]:
# Can I plot all of the users with their cluster color?
# This took about 2 hours and 20 minutes to run.

# user_embedded = TSNE(n_components=3).fit_transform(grouped_users.drop(columns='cluster'))

In [None]:
# pickle.dump(user_embedded, open("Pickle/user_embedded.p", "wb"))

In [None]:
user_embedded = pickle.load(open("Pickle/user_embedded.p", "rb"))

In [None]:
user_embedded

In [None]:
# Create dataframe to plot
user_embedded_df = pd.DataFrame(user_embedded, index = grouped_users.index, columns = ['1','2','3'])
user_embedded_df.reset_index(inplace=True)
user_embedded_df['cluster'] = k_means_20.labels_
user_embedded_df

In [None]:
fig = px.scatter_3d(user_embedded_df.sample(200), x='1', y='2', z='3', color='cluster')
fig.show()

### Cluster Breakdown by Aisle

In [None]:
# Create a way to compare clusters.  Find if they have max or min values for any features.

cluster_sizes = grouped_users.cluster.value_counts()
cluster_metrics = {}
for cluster in cluster_data.index:
    # First entry is the cluster size, followed by (label, value) tuples
    cluster_list = [cluster_sizes[cluster]]
    for col in cluster_data.columns:
        if (cluster_data[col].max() > 0) and (cluster_data[col].idxmax() == cluster):
            cluster_list.append(('max ' + col, cluster_data.loc[cluster,col]))
        if (cluster_data[col].min() > 0) and (cluster_data[col].idxmin() == cluster):
            cluster_list.append(('min ' + col, cluster_data.loc[cluster,col]))
    cluster_metrics[cluster] = cluster_list

In [None]:
cluster_metrics

In [None]:
# Clusters 5, 11, 12, 14, and 17 looked very separate from the rest on the graph.

print(cluster_metrics[5]) # Lots of personal care / pharmacy type products
print(cluster_metrics[11]) # Soap and skin care
print(cluster_metrics[12]) # Very large cluster, with fewest number of orders and highest days between orders
print(cluster_metrics[14]) # Tons of veggies, herb, and spices
print(cluster_metrics[17]) # Bulk dried fruits and veggies

In [None]:
#Other interesting clusters

print(cluster_metrics[7]) # Baby products
print(cluster_metrics[8]) # Lots of orders, shortest days between, big buyers
print(cluster_metrics[9]) # Alcohol purchasers
print(cluster_metrics[13]) # Household, laundry, cleaning products
print(cluster_metrics[15]) # Chocolate, gum and soft drinks, least veggies
print(cluster_metrics[16]) # Vegan and tofu

In [None]:
# Make a heatmap of clusters and aisles
# Scale the data first to make it more meaningful
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
graph_data = scaler.fit_transform(cluster_data)
graph_df=pd.DataFrame(graph_data, columns = cluster_data.columns)
fig = plt.figure(figsize=(15,10))
heat_map = sns.heatmap(graph_df, cmap="YlGnBu")
plt.savefig('Images/cluster_heatmap.png')

In [None]:
# Can I write a function to predict which cluster someone will belong to?
# They would have to give me a shopping list... and even then, some clusters may be based on order frequency rather than items.
# Or rather, if they give me an item, can I output "Others who bought this item also bought..."?
# I think we will get that from the recommendation system below
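
If I did want to assign a new user to a cluster, the mechanics would be: build the same feature vector, apply the already-fitted scaler, then ask the fitted KMeans model for the nearest centroid. A minimal sketch on synthetic data; `predict_cluster` is a hypothetical helper, and the six random features stand in for the real per-user columns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12)
# Stand-in for the per-user feature table: two well-separated groups of 100 users
users = np.vstack([rng.normal(0, 1, (100, 6)), rng.normal(8, 1, (100, 6))])

scaler = StandardScaler().fit(users)
model = KMeans(n_clusters=2, n_init=10, random_state=12).fit(scaler.transform(users))

def predict_cluster(new_user_features):
    """Scale one user's feature vector with the fitted scaler, then return the nearest centroid's label."""
    row = np.asarray(new_user_features, dtype=float).reshape(1, -1)
    return int(model.predict(scaler.transform(row))[0])

print(predict_cluster(np.full(6, 8.0)))
```

The important detail is reusing the fitted scaler with `transform` rather than fitting a new one, so the new user lands in the same feature space as the centroids.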

### Cluster Buying Power

In [None]:
# Add up all of the products for each person

grouped_users['num_products'] = grouped_users[grouped_users.columns[4:-1]].sum(axis=1)

In [None]:
# This will be used to count how many users are in each cluster when I do the groupby
grouped_users['user_count'] = list(np.ones(len(grouped_users)))

In [None]:
grouped_users = movecol(grouped_users, 
                        cols_to_move=['num_products', 'user_count', 'cluster'], 
                        ref_col='mode_order_dow', 
                        place='Before')
grouped_users

In [None]:
# Group by cluster, adding up the number of products purchased
grouped_clusters = grouped_users.groupby('cluster').sum()
grouped_clusters

In [None]:
# This doesn't take into account the relative prices of the items purchased
# But we can now see the portion of products purchased by each cluster
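# .copy() so the new columns added below go on an independent frame rather than a slice of grouped_clusters
cluster_power = grouped_clusters.iloc[:,0:3].copy()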

In [None]:
cluster_power

In [None]:
# Calculate ordering statistics per cluster
cluster_power['portion_of_orders'] = cluster_power['num_orders'] / cluster_power['num_orders'].sum()
cluster_power['portion_of_products'] = cluster_power['num_products'] / cluster_power['num_products'].sum()
cluster_power['portion_of_users'] = cluster_power['user_count'] / cluster_power['user_count'].sum()
cluster_power['orders_per_user'] = cluster_power['num_orders']/cluster_power['user_count']
cluster_power['products_per_user'] = cluster_power['num_products']/cluster_power['user_count']

In [None]:
cluster_power.sort_values('products_per_user', ascending=False)

In [None]:
'''Sorting these statistics in different ways highlights different things.  Cluster 8 orders a very large number of
products per user, but represents only a small portion of all users.  Cluster 12 represents over 50% of the users,
but only 25% of the orders and 17% of the products.  Cluster 1 is roughly proportional, with about 23% of the users,
orders, and products.  Cluster 4 is the third largest cluster with 7% of the users, yet it accounts for about 17% of
the orders and products.'''

## NLP Metadata search engine

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
products_desc = pickle.load(open("Pickle/products_desc.p", "rb"))

### Create the metadata and fit to a vectorizer

In [None]:
products_desc['metadata'] = products_desc.apply(lambda x : x['aisle']+' '+x['department']+' '+x['product_name'], axis = 1)

In [None]:
products_desc

In [None]:
count_vec = CountVectorizer(stop_words='english')
count_vec_matrix = count_vec.fit_transform(products_desc['metadata'])
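
To see what the vectorizer and cosine similarity are doing, here is a toy version with three metadata strings built the same way (aisle + department + product name) from the sample rows shown earlier. The query is transformed with the already-fitted vocabulary and compared against every row of the matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

metadata = [
    "soft drinks beverages Soda",
    "popcorn jerky snacks Aged White Cheddar Popcorn",
    "popcorn jerky snacks Original Beef Jerky",
]

vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(metadata)  # rows = products, columns = word counts

# Words the vectorizer has never seen are simply ignored by transform()
query_vec = vectorizer.transform(["beef jerky"])
scores = cosine_similarity(query_vec, matrix).ravel()
best = int(scores.argmax())
print(metadata[best])
```

A product sharing no words with the query scores exactly zero, which is why the search function filters on `score > 0`.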

In [None]:
# This function vectorizes the search words, then finds the most similar product vectors in count_vec_matrix

def vectorize_products_based_on_metadata(product_input):

    vec = count_vec.transform(pd.Series(product_input))
    
    simil = cosine_similarity(vec, count_vec_matrix)
    
    # Use the matrix's own row count rather than a hard-coded 49688
    simil_scores = pd.DataFrame(simil.reshape(count_vec_matrix.shape[0],), 
                                index = products_desc.index, columns=['score'])
    
    # Don't return scores of zero, only as many positive scores as exist
    non_zero_scores = simil_scores[simil_scores['score'] > 0]
    
    if len(non_zero_scores) == 0:
        print('No similar products found.  Please refine your search terms and try again')
        return
    
    # Keep at most the 10 best matches
    item_count = min(len(non_zero_scores), 10)
    
    similarity_scores = non_zero_scores.sort_values(['score'], ascending=False)[:item_count]
    
    # The scores carry products_desc's index labels, so look up by label (.loc), not position
    return (products_desc['product_name'].loc[similarity_scores.index])

In [None]:
vectorize_products_based_on_metadata('Bubble Bath')

In [None]:
vectorize_products_based_on_metadata('Oreo')

In [None]:
vectorize_products_based_on_metadata('Oreos')

In [None]:
vectorize_products_based_on_metadata('Oreos Cookies')

In [None]:
vectorize_products_based_on_metadata('Premium Almonds')

In [None]:
# I'd rather put more weight on the noun and less on the adjective

vectorize_products_based_on_metadata('Red Potatoes')
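
One option I could try for this is TF-IDF weighting instead of raw counts. It is not part-of-speech aware, so it does not literally favour nouns, but it down-weights words that appear in many products ("red") relative to rarer ones ("potatoes"), which often has a similar effect. A sketch with hypothetical product names chosen to show the difference:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product names: "red" is common, "potatoes" is rare
names = ["Red Apples", "Red Onions", "Red Peppers", "Red Grapes", "Russet Potatoes"]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(names)

scores = cosine_similarity(tfidf.transform(["Red Potatoes"]), matrix).ravel()
ranked = [names[i] for i in scores.argsort()[::-1]]
print(ranked[0])
```

With plain counts, every one of these names shares exactly one of two words with the query and they all tie; with TF-IDF the rarer word carries more weight, so the potato product ranks first.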

In [None]:
vectorize_products_based_on_metadata('randomword')

### Stem the product metadata and refit

These are mostly proper names of products, so I don't think I want to lemmatize, as that may change the product name too much.

In [None]:
import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

In [None]:
stem_list=[]
for metadata in products_desc['metadata']:
    word_list = nltk.word_tokenize(metadata)
    # Keep each stem once per product; word order does not matter for a bag-of-words vectorizer
    stem_set = {stemmer.stem(word) for word in word_list}
    stem_list.append(' '.join(stem_set))
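
The point of stemming here is that singular and plural search terms should land on the same token. A small sketch of that effect; `stem_text` is a hypothetical helper, and it splits on whitespace instead of calling `nltk.word_tokenize` so that it runs without downloading tokenizer data.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_text(text):
    """Split on whitespace and stem each token; the stemmer also lower-cases."""
    return " ".join(stemmer.stem(token) for token in text.split())

print(stem_text("Oreos"), "|", stem_text("Oreo"))
```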

In [None]:
products_desc['stemmed'] = stem_list

In [None]:
products_desc

In [None]:
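# Fitting the vectorizer
# Note: this refits the same count_vec object, replacing the vocabulary learned from the unstemmed metadata,
# so vectorize_products_based_on_metadata above will no longer match count_vec_matrix after this cell runs.

stem_count_vec_matrix = count_vec.fit_transform(products_desc['stemmed'])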

In [None]:
# This improved function stems the search words and then vectorizes them

def stem_and_vectorize_products_based_on_metadata(product_input):

    word_list = nltk.word_tokenize(product_input)
    input_stemmed = [stemmer.stem(word) for word in word_list]
    # Join the stems back into ONE query document; passing the list directly would
    # treat each word as a separate query and break the reshape for multi-word input
    vec = count_vec.transform([' '.join(input_stemmed)])
    
    simil = cosine_similarity(vec, stem_count_vec_matrix)
    
    simil_scores = pd.DataFrame(simil.reshape(stem_count_vec_matrix.shape[0],), 
                                index = products_desc.index, columns=['score'])
    
    # Don't return scores of zero, only as many positive scores as exist
    non_zero_scores = simil_scores[simil_scores['score'] > 0]
    
    if len(non_zero_scores) == 0:
        print('No similar products found.  Please refine your search terms and try again')
        return
    
    # Keep at most the 10 best matches
    item_count = min(len(non_zero_scores), 10)
    
    similarity_scores = non_zero_scores.sort_values(['score'], ascending=False)[:item_count]
    
    # The scores carry products_desc's index labels, so look up by label (.loc), not position
    return (products_desc['product_name'].loc[similarity_scores.index])

In [None]:
stem_and_vectorize_products_based_on_metadata('Oreos')

In [None]:
stem_and_vectorize_products_based_on_metadata('Oreo')