## 3.0 Item2Vec 
Our recommender model follows the Item2Vec specification, a direct transfer of the Word2Vec model first reported by Mikolov et al. (2013) at Google. Instead of natural-language documents, we use the individual orders of customers, grouped by the clusters that the KMeans clustering identified. In this notebook, we train 13 Item2Vec models, one for each cluster. We restrict the data to orders with at least 4 items, as this should give the recommender sufficient information beyond the cluster the user belongs to.
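
To make the analogy with Word2Vec concrete, the following minimal sketch treats one order as a "sentence" and lists the (target, context) pairs that a skip-gram model with a window of 3 would see. The function name `skipgram_pairs` and the product IDs are hypothetical, and the sketch always uses the full window; actual implementations may shrink the window at random or subsample frequent items.

```python
from typing import List, Tuple


def skipgram_pairs(order: List[str], window: int = 3) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for one order, treated like a sentence."""
    pairs = []
    for i, target in enumerate(order):
        # Context = items at most `window` positions before or after the target
        start = max(0, i - window)
        stop = min(len(order), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, order[j]))
    return pairs


# Hypothetical order made of product IDs stored as strings
order = ["196", "14084", "12427", "26088", "26405"]
pairs = skipgram_pairs(order, window=3)
print(len(pairs))
```

In this toy order, the first and last products are four positions apart, so they never appear together as a training pair when the window is 3.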

In [1]:
# Importing libraries
import random
import pickle

import pandas as pd
import numpy as np

import pyarrow.parquet as pq

from os import listdir
from os import cpu_count

from typing import List
from toolz.functoolz import pipe
from gensim.models import Word2Vec

In [2]:
def import_data(data_dir: str) -> List[pd.DataFrame]:
    """
    Parameters:
    ----------------
    data_dir: str
      The path where the data is stored

    Returns:
    ----------------
    dataframes_ls: List[pd.DataFrame]
      A list of pandas dataframes
    """
    # Collect the file stems, skipping hidden files (which have an empty stem)
    files = [file.split('.')[0] for file in listdir(data_dir) if file.split('.')[0] != ""]

    # Read each CSV directly; this avoids eval and always returns a list,
    # even when the directory contains a single file
    dataframes_ls = [pd.read_csv(f'{data_dir}/{file}.csv') for file in files]

    return dataframes_ls

In [3]:
dataframes = import_data("./data")
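# Apply the same filter as import_data so the indices line up with the list of dataframes
files = [file.split('.')[0] for file in listdir("./data") if file.split('.')[0] != ""]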
files_dict = dict(zip(files, range(len(files))))
products = dataframes[files_dict['products']]

In [4]:
cluster_data = pq.read_table('./dummy_k13.parquet').to_pandas()

In [5]:
cluster_data_named = pd.merge(cluster_data, products, on='product_id', how='inner')

In [6]:
cluster_data_named['product_id'] = cluster_data_named['product_id'].astype(str)
cluster_data_named['user_id'] = cluster_data_named['user_id'].astype(str)

In [7]:
def filter_data_by_cluster(data: pd.DataFrame, cluster_num: int):
    return data.loc[data['cluster'] == cluster_num, :]

In [8]:
clusters_separated = [filter_data_by_cluster(cluster_data_named, cluster_num) for cluster_num in range(0, len(cluster_data_named['cluster'].unique()))]

In [9]:
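def split_users_in_cluster(cluster_data: pd.DataFrame, train_rate: float):
    unique_users = cluster_data['user_id'].unique()
    # Sample the training users without replacement
    train_users = np.random.choice(unique_users, round(len(unique_users)*train_rate), replace=False).tolist()
    # Use a set for the membership test so the split stays linear in the number of users
    train_users_set = set(train_users)
    test_users = [user for user in unique_users if user not in train_users_set]
    return train_users, test_users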

In [10]:
train_test_tuples = [split_users_in_cluster(cluster, 0.75) for cluster in clusters_separated]

In [11]:
train_users = [users[0] for users in train_test_tuples]
test_users = [users[1] for users in train_test_tuples]
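
The split above draws from NumPy's global random state without a seed, so the train and test users change on every run. If a repeatable split is wanted, one possible variant is sketched below using only the standard library. The function name `split_users`, the seed value, and the user IDs are assumptions for illustration and not part of the notebook.

```python
import random
from typing import List, Sequence, Tuple


def split_users(users: Sequence[str], train_rate: float, seed: int) -> Tuple[List[str], List[str]]:
    """Shuffle users with a private, seeded generator and cut at train_rate."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(users)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_rate)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical user IDs
users = [f"user_{i}" for i in range(20)]
train, test = split_users(users, 0.75, seed=42)
print(len(train), len(test))
```

Because the list is shuffled once and then cut, the two groups are disjoint by construction, and calling the function again with the same seed returns the same split.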

For computational ease, we save the test users of each cluster to a separate directory, so that they can be loaded directly when testing the model.

In [12]:
def save_test_users_in_cluster(test_users: list, cluster_num: int):

    with open(f'./user_segments/test_users_cluster{cluster_num}.pkl', 'wb') as file:
        pickle.dump(test_users, file)
    return f"Test users for cluster {cluster_num} saved."

In [13]:
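# Note: these files are numbered 1-13 (i+1), whereas the cluster labels and the saved model files are numbered 0-12
[save_test_users_in_cluster(test_users[i], i+1) for i in range(len(test_users))]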

['Test users for cluster 1 saved.',
 'Test users for cluster 2 saved.',
 'Test users for cluster 3 saved.',
 'Test users for cluster 4 saved.',
 'Test users for cluster 5 saved.',
 'Test users for cluster 6 saved.',
 'Test users for cluster 7 saved.',
 'Test users for cluster 8 saved.',
 'Test users for cluster 9 saved.',
 'Test users for cluster 10 saved.',
 'Test users for cluster 11 saved.',
 'Test users for cluster 12 saved.',
 'Test users for cluster 13 saved.']
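
Reading the saved test users back is the mirror image of the save step. The sketch below is a minimal round trip in a temporary directory, following the file-name pattern used above. The helper `load_test_users` and the sample IDs are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import List


def load_test_users(directory: Path, cluster_num: int) -> List[str]:
    """Load the pickled list of test users for one cluster."""
    path = directory / f"test_users_cluster{cluster_num}.pkl"
    with open(path, "rb") as file:
        return pickle.load(file)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    sample_users = ["101", "202", "303"]  # made-up user IDs
    # Write a file the same way the notebook does, then read it back
    with open(tmp_dir / "test_users_cluster1.pkl", "wb") as file:
        pickle.dump(sample_users, file)
    loaded = load_test_users(tmp_dir, 1)

print(loaded)
```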

In [14]:
def save_product_lookup(products: pd.DataFrame):
    product_lookup = dict(zip(products['product_id'].astype(str).to_list(), products['product_name'].to_list()))
    with open('product_lookup.pkl', 'wb') as file:
        pickle.dump(product_lookup, file)

In [15]:
save_product_lookup(products)

In [16]:
def subset_cluster(cluster: pd.DataFrame, train_users):
    return cluster[cluster['user_id'].isin(train_users)]

In [17]:
def get_orders_from_cluster(cluster_subset):
    return cluster_subset.groupby(['user_id', 'order_id'])['product_id'].apply(list).values

In [18]:
def generate_user_purchase_history_in_cluster(cluster: pd.DataFrame, train_users):
    cluster_subset = subset_cluster(cluster, train_users)
    purchase_history = get_orders_from_cluster(cluster_subset)
    # Keep only orders with at least 4 items
    filtered_purchase_history = [purchase for purchase in purchase_history if len(purchase) > 3]
    return filtered_purchase_history

In [19]:
purchase_history_in_cluster = [generate_user_purchase_history_in_cluster(clusters_separated[i], train_users[i]) for i in range(0, len(clusters_separated))]
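
The introduction restricts training to orders with at least 4 items. As a standalone illustration of that step, the sketch below groups order rows into baskets and drops the short ones, using plain Python instead of pandas. The function name `build_baskets`, the constant `MIN_ITEMS`, and the rows are made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MIN_ITEMS = 4  # minimum basket size, taken from the introduction


def build_baskets(rows: List[Tuple[str, str, str]], min_items: int = MIN_ITEMS) -> List[List[str]]:
    """Group (user_id, order_id, product_id) rows by order and drop short orders."""
    baskets: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for user_id, order_id, product_id in rows:
        baskets[(user_id, order_id)].append(product_id)
    return [items for items in baskets.values() if len(items) >= min_items]


# Hypothetical rows: one order with 4 items, one with 2, one with 5
rows = (
    [("1", "10", p) for p in ["196", "14084", "12427", "26088"]]
    + [("1", "11", p) for p in ["196", "25133"]]
    + [("2", "20", p) for p in ["32792", "47766", "20574", "12000", "48110"]]
)
baskets = build_baskets(rows)
print(len(baskets))
```

Only the orders with 4 and 5 items survive; the two-item order is discarded before training.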

In [22]:
def build_item2vec_model(purchases_data):

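    # Skip-gram (sg=1) with negative sampling (hs=0, negative=10) and 100-dimensional vectors.
    # Note: with workers > 1, the seed alone does not make training fully reproducible.
    model = Word2Vec(window=3, sg=1, hs=0, vector_size=100, negative=10, alpha=0.03, min_alpha=0.0007, seed=28101997, workers=6)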

    model.build_vocab(purchases_data, progress_per=200)

    model.train(purchases_data, total_examples = model.corpus_count, 
            epochs=10, report_delay=1)

    return model

In [23]:
models = [build_item2vec_model(purchase_history) for purchase_history in purchase_history_in_cluster]
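
Once trained, each model maps a product ID to a vector, and products bought in similar contexts should end up close together. In gensim this lookup is available through `model.wv.most_similar`. The sketch below only reproduces the underlying idea, a cosine-similarity ranking, on toy vectors invented for the example; the function names and the numbers are assumptions and do not come from the trained models.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(item: str, vectors: Dict[str, List[float]], topn: int = 2) -> List[Tuple[str, float]]:
    """Rank all other items by cosine similarity to `item`, highest first."""
    scores = [
        (other, cosine(vectors[item], vec))
        for other, vec in vectors.items()
        if other != item
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-dimensional embeddings, for illustration only
toy_vectors = {
    "banana": [0.9, 0.1, 0.0],
    "apple": [0.8, 0.2, 0.1],
    "detergent": [0.0, 0.1, 0.9],
    "sponge": [0.1, 0.0, 0.8],
}
neighbours = most_similar("banana", toy_vectors, topn=2)
print([name for name, _ in neighbours])
```

In the toy data, the fruit vectors point in nearly the same direction, so "apple" ranks first for "banana", while the household items score much lower.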

In [24]:
def save_cluster_model(model, cluster_id: int):
    # cluster_id avoids shadowing the built-in id()
    model.save(f'./cluster_models/model_cluster_{cluster_id}.model')
    return f"Model for cluster {cluster_id} successfully saved."

In [25]:
[save_cluster_model(models[i], i) for i in range(len(models))]

['Model for cluster 0 successfully saved.',
 'Model for cluster 1 successfully saved.',
 'Model for cluster 2 successfully saved.',
 'Model for cluster 3 successfully saved.',
 'Model for cluster 4 successfully saved.',
 'Model for cluster 5 successfully saved.',
 'Model for cluster 6 successfully saved.',
 'Model for cluster 7 successfully saved.',
 'Model for cluster 8 successfully saved.',
 'Model for cluster 9 successfully saved.',
 'Model for cluster 10 successfully saved.',
 'Model for cluster 11 successfully saved.',
 'Model for cluster 12 successfully saved.']

## Link to [3. item2vec](../3_Item2Vec/3_1_Recommendation_Testing.ipynb)