# Kaggle Featured Prediction Competition: H&M Personalized Fashion Recommendations

In this [competition](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations), the task is to recommend products to customers based on their previous purchases. A wide range of data is available: customer metadata, product metadata, and other data ranging from simple attributes such as garment type and customer age, to text from product descriptions, to images of the garments.

In this notebook we use the ALS model from the implicit library to build our recommender system. See the [docs](https://benfred.github.io/implicit/index.html) for more information.

Install the kfp package by uncommenting the line below and restarting the kernel. Comment it out again once the kernel has restarted.

In [None]:
# Install the kfp 
# !pip install kfp --upgrade 

The following imports are required to build the Kubeflow pipeline and to pass data between its components.

In [2]:
import kfp
from kfp.components import func_to_container_op
import kfp.components as comp
from typing import NamedTuple

All the packages required by the pipeline components are collected in a list, which is then passed to each component. This can become inefficient when there are many packages and dependencies; in that case, you can build a Docker image instead and pass it to each pipeline component.

In [3]:
# 'scikit-learn' is the PyPI package name; the 'sklearn' alias is deprecated
import_packages = ['pandas', 'scikit-learn', 'implicit', 'kaggle', 'numpy', 'pyarrow']

In the following implementation of the Kubeflow pipeline we use [lightweight Python function components](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/) to build the pipeline. Data is passed between component instances (tasks) using InputPath and OutputPath. This notebook explores different ways of storing and passing data between pipeline components.
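
To make the path-based data passing more concrete, here is a minimal sketch that mimics it outside Kubeflow. The function names `produce` and `consume` are hypothetical, and the temporary directory is only a stand-in for the storage location that the pipeline backend would choose. The point is that the upstream step writes to a path it is given, and the downstream step reads from a path it is given.

```python
import json
import tempfile
from pathlib import Path


def produce(output_path: str) -> None:
    """Upstream step: write its result to the path it was handed."""
    Path(output_path).write_text(json.dumps({"rows": [1, 2, 3]}))


def consume(input_path: str) -> int:
    """Downstream step: read the upstream result from the path it was handed."""
    payload = json.loads(Path(input_path).read_text())
    return sum(payload["rows"])


# Stand-in for the location that the pipeline system would normally pick.
with tempfile.TemporaryDirectory() as workdir:
    artifact = str(Path(workdir) / "data")
    produce(artifact)
    total = consume(artifact)

print(total)
```

Neither function decides where the file lives, which is what lets the pipeline system move the data between containers.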

The pipeline is divided into five components:

    1. Download the data from Kaggle
    2. Load and preprocess the data
    3. Create the sparse matrix
    4. Train the model
    5. Make predictions

### Download the data from Kaggle

Follow the prerequisites in the GitHub README.md to create a secret for the Kaggle credentials and mount it to the pod using a PodDefault resource. Once the secret is mounted, you can read the username and key from it to download the files you need from Kaggle. For this competition, we download only the required files rather than the entire dataset.

In [4]:
def download_kaggle_dataset(path:str)->str:
    
     import os

     # Retrieve the credentials from the secret mounted and 
     # bring it onto our working environment
     with open('/secret/kaggle/KAGGLE_KEY', 'r') as file:
          kaggle_key = file.read().rstrip()
     with open('/secret/kaggle/KAGGLE_USERNAME', 'r') as file:
          kaggle_user = file.read().rstrip()
     os.environ['KAGGLE_USERNAME'] = kaggle_user 
     os.environ['KAGGLE_KEY'] = kaggle_key

     # Create the target directory if it does not exist yet and download into it
     os.makedirs(path, exist_ok=True)
     os.chdir(path)
    
     # Using Kaggle Public API to download the datasets
     import kaggle   
     from kaggle.api.kaggle_api_extended import KaggleApi
     
     api = KaggleApi()
     api.authenticate()
        
     # Download the required files individually. You can also choose to download the entire dataset if you want to work with images as well.   
     api.competition_download_file('h-and-m-personalized-fashion-recommendations','customers.csv')
     api.competition_download_file('h-and-m-personalized-fashion-recommendations','transactions_train.csv')
     api.competition_download_file('h-and-m-personalized-fashion-recommendations','articles.csv')
     api.competition_download_file('h-and-m-personalized-fashion-recommendations','sample_submission.csv')     
     
     return path   
    


In [5]:
download_data_op = func_to_container_op(download_kaggle_dataset, packages_to_install = import_packages)
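
The credential handling in the component above can be tried out locally with a small sketch like the following. The helper `load_secret_into_env` is a hypothetical name, and a temporary directory with dummy values stands in for the mounted `/secret/kaggle` directory, so no real credentials are involved.

```python
import os
import tempfile
from pathlib import Path


def load_secret_into_env(secret_dir: str, names: list) -> dict:
    """Read one file per credential and export each value as an environment variable."""
    loaded = {}
    for name in names:
        # Secret files often end with a newline, so strip trailing whitespace.
        value = (Path(secret_dir) / name).read_text().rstrip()
        os.environ[name] = value
        loaded[name] = value
    return loaded


# Demo: a temporary directory stands in for the mounted secret.
with tempfile.TemporaryDirectory() as fake_secret_dir:
    (Path(fake_secret_dir) / "KAGGLE_USERNAME").write_text("demo-user\n")
    (Path(fake_secret_dir) / "KAGGLE_KEY").write_text("demo-key\n")
    creds = load_secret_into_env(fake_secret_dir, ["KAGGLE_USERNAME", "KAGGLE_KEY"])

print(sorted(creds))
```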

### Load and Preprocess the data

In [6]:
def load_and_preprocess_data(path:str, preprocess_data_path: comp.OutputPath('ApacheParquet'))->NamedTuple('Outputs', [('data_path',str),('int_list', list)]):
    
    
    import pandas as pd
    import os
    from zipfile import ZipFile 
    from pyarrow import parquet
    import pyarrow as pa
    
    # List the downloaded files and move into the data directory
    print(os.listdir(path))
    os.chdir(path)
    
    # Extract each CSV from its individual zip archive
    for name in ['customers.csv', 'transactions_train.csv', 'articles.csv', 'sample_submission.csv']:
        with ZipFile(name + '.zip', 'r') as archive:
            archive.extract(name)
    
    # Converting to pandas dataframe 
    customer_data = pd.read_csv("customers.csv")
    article_data = pd.read_csv("articles.csv")
    train_data = pd.read_csv("transactions_train.csv") 
        
    # Create a purchase_count column holding the number of times each customer bought each article
    X = train_data.groupby(['customer_id', 'article_id'])['article_id'].count().reset_index(name = "purchase_count") 

    # Getting unique number of customers and articles using the customer and article metadata data files
    unique_customers = customer_data['customer_id'].unique()
    unique_articles = article_data['article_id'].unique()
    
    # length of the customers and articles
    n_customers = len(unique_customers)
    n_articles = len(unique_articles)

    # Create a mapping for customer_id to convert it from an object column to an int column for the sparse matrix creation
    customer_id_dict = {unique_customers[i]:i  for i in range(len(unique_customers))}
    reverse_customer_id_dict = {i:unique_customers[i] for i in range(len(unique_customers))} 
    numeric_cus_id = []
    for i in range(len(X['customer_id'])):
        numeric_cus_id.append(customer_id_dict.get(X['customer_id'][i]))
    X['customer_id'] = numeric_cus_id

    # Create a mapping for article_id so that the sparse matrix does not become too large because of the long integer values of the article IDs
    article_id_dict = {unique_articles[i]:i  for i in range(len(unique_articles))}
    rev_art_id_dict = {i:int(unique_articles[i]) for i in range(len(unique_articles))}
    numeric_art_id = []
    for i in range(len(X['article_id'])):
        numeric_art_id.append(article_id_dict.get(X['article_id'][i]))
    X['article_id'] = numeric_art_id
    
    # Convert from pandas to Arrow
    table = pa.Table.from_pandas(X)
    parquet.write_table(table, preprocess_data_path)
    
    values=[n_customers, n_articles]
    
    return (path, values)
    

In [7]:
load_and_preprocess_data_op = func_to_container_op(load_and_preprocess_data,packages_to_install = import_packages)
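
The two preprocessing ideas used above, counting purchases per customer and article and then mapping raw IDs to contiguous integer indices, can be illustrated on toy data. This is only a sketch: the IDs are made up, and it uses the vectorized `Series.map` in place of the row-by-row loops, which is an alternative and not what the component does.

```python
import pandas as pd

# Toy transactions; the real file has one row per purchased article.
transactions = pd.DataFrame({
    "customer_id": ["c_b", "c_a", "c_b", "c_b", "c_a"],
    "article_id": [108775015, 108775015, 111565001, 108775015, 111565001],
})

# Count how many times each customer bought each article (rows per group).
counts = (
    transactions.groupby(["customer_id", "article_id"])
    .size()
    .reset_index(name="purchase_count")
)

# Map raw IDs to contiguous integer indices.
unique_customers = ["c_a", "c_b", "c_c"]   # would come from customers.csv
unique_articles = [108775015, 111565001]   # would come from articles.csv
customer_index = {cid: i for i, cid in enumerate(unique_customers)}
article_index = {aid: i for i, aid in enumerate(unique_articles)}

counts["customer_id"] = counts["customer_id"].map(customer_index)
counts["article_id"] = counts["article_id"].map(article_index)

print(counts)
```

Because the mappings are built from the metadata files, a customer such as `c_c` gets an index even without any transactions.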

### Creating sparse matrix

In [8]:
def sparse_matrix_creation(data_path:str, list_val: list, file_path: comp.InputPath('ApacheParquet'), sparse_path: comp.OutputPath())->str:
    
    import pandas as pd
    from pyarrow import parquet
    import pyarrow as pa
    import scipy.sparse as sparse
    from scipy.sparse import coo_matrix
    from pathlib import Path
    import pickle
    
    X = parquet.read_pandas(file_path).to_pandas()
    
    n_customers = list_val[0]
    n_articles = list_val[1]

    # Constructing sparse matrices for alternating least squares algorithm    
    sparse_user_item_coo = sparse.coo_matrix((X.purchase_count, (X.customer_id, X.article_id)), shape = (n_customers, n_articles))
    sparse_user_item_csr = sparse.csr_matrix((X['purchase_count'], (X['customer_id'], X['article_id'])), shape = (n_customers, n_articles))

    # Use a context manager so the file is flushed and closed
    with open(sparse_path, 'wb') as f:
        pickle.dump(sparse_user_item_csr, f)
    
    return data_path  

In [9]:
sparse_matrix_creation_op = func_to_container_op(sparse_matrix_creation, packages_to_install = import_packages)
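
As a small illustration of the construction above, the following sketch builds a user-item matrix from toy (customer index, article index, purchase count) triples. The numbers are invented; the explicit `shape` argument plays the same role as `n_customers` and `n_articles` in the component.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy interaction triples: customer index, article index, purchase count.
rows = np.array([0, 0, 1, 1])
cols = np.array([0, 1, 0, 1])
vals = np.array([1, 1, 2, 1])

n_customers, n_articles = 3, 2  # customer 2 has no purchases

# COO is convenient for construction; CSR supports efficient row slicing.
user_item = coo_matrix((vals, (rows, cols)), shape=(n_customers, n_articles)).tocsr()

print(user_item.toarray())
print(user_item[2].nnz)
```

Passing the shape explicitly keeps an empty row for customers without purchases, so every customer index remains valid when slicing rows later.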

### Train the Model

In [10]:
def train_model(path:str, sparse_matrix_path: comp.InputPath(), model_path: comp.OutputPath())->str:
    
    import implicit
    import pandas as pd
    from pyarrow import parquet
    import pyarrow as pa
    import scipy.sparse as sparse
    import pickle
    
    # Loading the sparse user item matrix from pickle
    sparse_user_item_csr = pickle.load(open(sparse_matrix_path, 'rb'))
    
    # parameters for the model
    als_params = dict(
        factors = 200,         # number of latent factors - try between 50 to 1000
        regularization = 0.01, # regularization factor - try between 0.001 to 0.2
        iterations = 5,        # iterations            - try between 2 to 100
    )

    # initialize a model
    model = implicit.als.AlternatingLeastSquares(**als_params)

    # train the model on a sparse matrix of user/item/confidence weights    
    model.fit(sparse_user_item_csr)
    
    # Use a context manager so the file is flushed and closed
    with open(model_path, 'wb') as f:
        pickle.dump(model, f)
    
    return path

In [11]:
train_model_op = func_to_container_op(train_model, packages_to_install = import_packages)
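
For intuition about what alternating least squares does, here is a small dense NumPy sketch of a common implicit-feedback formulation: preferences are 1 where a purchase exists, confidence grows with the purchase count, and user and item factors are solved for in turn. It is not how the implicit library implements training (which works on sparse matrices with optimized solvers); the helper names, the `alpha` confidence scaling, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def als_step(R, fixed, reg, alpha):
    """Solve for one side's factors while the other side's factors stay fixed."""
    n_factors = fixed.shape[1]
    solved = np.zeros((R.shape[0], n_factors))
    for u in range(R.shape[0]):
        confidence = 1.0 + alpha * R[u]         # c_ui = 1 + alpha * r_ui
        preference = (R[u] > 0).astype(float)   # p_ui = 1 if purchased, else 0
        A = fixed.T @ (confidence[:, None] * fixed) + reg * np.eye(n_factors)
        b = fixed.T @ (confidence * preference)
        solved[u] = np.linalg.solve(A, b)
    return solved


def weighted_loss(R, users, items, reg, alpha):
    """Confidence-weighted squared error plus L2 regularization."""
    confidence = 1.0 + alpha * R
    preference = (R > 0).astype(float)
    error = preference - users @ items.T
    penalty = reg * ((users ** 2).sum() + (items ** 2).sum())
    return float((confidence * error ** 2).sum() + penalty)


rng = np.random.default_rng(0)
R = np.array([[2, 0, 1, 0],
              [0, 1, 0, 3],
              [1, 0, 2, 0]], dtype=float)   # toy purchase counts
users = rng.normal(scale=0.1, size=(3, 2))
items = rng.normal(scale=0.1, size=(4, 2))

losses = []
for _ in range(5):
    users = als_step(R, items, reg=0.01, alpha=10.0)      # fix items, solve users
    items = als_step(R.T, users, reg=0.01, alpha=10.0)    # fix users, solve items
    losses.append(weighted_loss(R, users, items, reg=0.01, alpha=10.0))

print(losses)
```

Each half-step minimizes the same objective exactly for one set of factors, so the recorded loss should not increase from one sweep to the next.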

### Predictions

In [12]:
def predictions(test_path:str, model_path : comp.InputPath(), sparse_path: comp.InputPath()):
    
    import pandas as pd
    import os
    from zipfile import ZipFile 
    import pickle
    from pyarrow import parquet
    import pyarrow as pa
    import scipy.sparse as sparse
    
    sparse_user_item_csr = pickle.load(open(sparse_path, 'rb'))
    model = pickle.load(open(model_path, 'rb'))
 
    os.chdir(os.getcwd())
    print(os.listdir(test_path))
    os.chdir(test_path)

    # Converting to pandas dataframe 
    customer_data = pd.read_csv("customers.csv")
    article_data = pd.read_csv("articles.csv") 
    test_data  = pd.read_csv("sample_submission.csv")
      
    # Getting unique number of customers and articles using the customer and article metadata data files
    unique_customers = customer_data['customer_id'].unique()
    unique_articles = article_data['article_id'].unique()
    
    # length of the customers and articles
    n_customers = len(unique_customers)
    n_articles = len(unique_articles)
    
    # Create a mapping for customer_id
    customer_id_dict = {unique_customers[i]:i  for i in range(len(unique_customers))}

    # Create a reverse mapping for article_id
    reverse_article_id_dict = {i:int(unique_articles[i]) for i in range(len(unique_articles))}

    predictions=[]
    for cust_id in test_data.customer_id:
        cust_id = customer_id_dict.get(cust_id)
        result=[]
        if cust_id is not None:
            # recommendations[0] holds the indices of the recommended articles
            recommendations = model.recommend(cust_id, sparse_user_item_csr[cust_id],10)
            for i in range(len(recommendations[0])):
                val = reverse_article_id_dict.get(recommendations[0][i])
                result.append(val)
        # Append a result for every customer so the list stays aligned with test_data
        predictions.append(result)
            
    test_data['prediction'] = predictions
    test_data

In [13]:
prediction_op = func_to_container_op(predictions, packages_to_install = import_packages)
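
Conceptually, recommending from a factorization model comes down to scoring every item against a user's factor vector and keeping the best-scoring items the user has not bought yet. The sketch below shows that idea with plain NumPy; `top_n` is a hypothetical helper and the factor values are invented, so this should be read as an illustration rather than as the library's implementation.

```python
import numpy as np


def top_n(user_vector, item_factors, purchased, n):
    """Rank items by dot-product score, skipping items the user already bought."""
    scores = (item_factors @ user_vector).astype(float)
    already = np.array(sorted(purchased), dtype=int)
    scores[already] = -np.inf            # exclude previously purchased items
    order = np.argsort(-scores)[:n]      # indices of the highest scores first
    return order, scores[order]


item_factors = np.array([[1.0, 0.0],
                         [0.8, 0.2],
                         [0.0, 1.0],
                         [0.5, 0.5]])
user_vector = np.array([1.0, 0.1])

ids, scores = top_n(user_vector, item_factors, purchased={0}, n=2)
print(ids, scores)
```

Item 0 has the highest raw score here but is filtered out because it was already purchased, so items 1 and 3 are returned.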

### Defining function that implements the pipeline

In [14]:
def kfp_pipeline():
    
    vop = kfp.dsl.VolumeOp(
    name="create-volume",    
    resource_name="mypvc",
    size="10Gi",
    modes = kfp.dsl.VOLUME_MODE_RWM
    )
    
    download_task = download_data_op("/mnt/data/").add_pvolumes({"/mnt": vop.volume}).add_pod_label("kaggle-secret", "true")
    load_and_preprocess_data_task = load_and_preprocess_data_op(download_task.output).add_pvolumes({"/mnt": vop.volume})
    sparse_matrix_task = sparse_matrix_creation_op(data_path =load_and_preprocess_data_task.outputs['data_path'], 
                                                   file = load_and_preprocess_data_task.outputs['preprocess_data'], 
                                                   list_val = load_and_preprocess_data_task.outputs['int_list']).add_pvolumes({"/mnt": vop.volume})
    train_model_task = train_model_op(path = sparse_matrix_task.outputs['Output'], 
                                      sparse_matrix = sparse_matrix_task.outputs['sparse']).add_pvolumes({"/mnt": vop.volume})
    prediction_task = prediction_op(test_path = train_model_task.outputs['Output'],
                                    model = train_model_task.outputs['model'], 
                                    sparse = sparse_matrix_task.outputs['sparse']).add_pvolumes({"/mnt": vop.volume})
    
    

In [15]:
# Using kfp.Client() to run the pipeline from notebook itself
client = kfp.Client() # change arguments accordingly

# Running the pipeline
client.create_run_from_pipeline_func(
    kfp_pipeline,
    arguments={
    })

RunPipelineResult(run_id=61cfeaf5-5caa-49f2-89c1-c3f3e9e8d5b5)
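
The pipeline function above never states an execution order explicitly; the order follows from which task consumes which output. As a rough sketch of that idea, the dependencies can be written down as a graph and sorted topologically with the standard library (Python 3.9+). The task names are shorthand chosen for this example.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it consumes.
dependencies = {
    "download": set(),
    "preprocess": {"download"},
    "sparse_matrix": {"preprocess"},
    "train": {"sparse_matrix"},
    "predict": {"train", "sparse_matrix"},
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

For this graph there is only one valid order: download, preprocess, sparse_matrix, train, predict.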