# Kaggle Featured Prediction Competition: H&M Personalized Fashion Recommendations

In this [competition](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations), the task is to recommend products based on customers' previous purchases. A wide range of data is available: customer metadata, product metadata, and data that spans from simple fields, such as garment type and customer age, to text from product descriptions and images of the garments.

In this notebook we will use the ALS model from the `implicit` library to build our recommender system. See the [docs](https://benfred.github.io/implicit/index.html) for more information.

## Install necessary packages

We can install the necessary packages either by running `pip install --user <package_name>` for each one, or by listing everything in a `requirements.txt` file and running `pip install --user -r requirements.txt`. The dependencies are already listed in a `requirements.txt` file, so we will use the latter method.

Restart the kernel after installation.

In [None]:
!pip install --user -r requirements.txt

## Download Data from Kaggle

Download the relevant data from Kaggle by running the code cell below. Follow the initial steps in the GitHub README.md to get the Kaggle username and key needed to authenticate with the Kaggle public API; both are stored in the `kaggle.json` file. No secret needs to be created for this step. Run this cell before starting the Kale pipeline from the Kale deployment panel.
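
As an alternative to pasting the values by hand, the credentials can be read from `kaggle.json` directly. The sketch below assumes the file holds a JSON object with `username` and `key` fields; the helper name `load_kaggle_credentials` and the demo file are illustrative and not part of this notebook.

```python
import json
import os
import tempfile


def load_kaggle_credentials(json_path):
    """Read a kaggle.json-style file and return the matching environment variables."""
    with open(json_path) as f:
        creds = json.load(f)
    return {"KAGGLE_USERNAME": creds["username"], "KAGGLE_KEY": creds["key"]}


# Demonstration with a throwaway file holding made-up values.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaggle.json")
    with open(demo_path, "w") as f:
        json.dump({"username": "demo_user", "key": "demo_key"}, f)
    env_vars = load_kaggle_credentials(demo_path)

# With a real file, export the values before calling api.authenticate():
# os.environ.update(load_kaggle_credentials("/path/to/kaggle.json"))
```

Returning a dictionary instead of setting the variables inside the helper keeps the demo free of side effects.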

In [1]:
import os


# Get the Kaggle username and key from the kaggle.json file
# and paste them in place of KAGGLE_USERNAME and KAGGLE_KEY

os.environ['KAGGLE_USERNAME'] = "KAGGLE_USERNAME"
os.environ['KAGGLE_KEY'] = "KAGGLE_KEY"

path = "data/"

# Create the data directory if it does not exist yet
os.makedirs(path, exist_ok=True)
os.chdir(path)

import kaggle
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
        
# Download the required files individually. You can also download the entire dataset
# if you want to work with the image data as well. Each file is downloaded as a zip
# archive into the current directory.
api.competition_download_file('h-and-m-personalized-fashion-recommendations','customers.csv')
api.competition_download_file('h-and-m-personalized-fashion-recommendations','transactions_train.csv')
api.competition_download_file('h-and-m-personalized-fashion-recommendations','articles.csv')
api.competition_download_file('h-and-m-personalized-fashion-recommendations','sample_submission.csv')   

# Get the path of the directory where the files were downloaded
path_dir = os.getcwd()

from zipfile import ZipFile

# Extract each CSV from its individual zip archive
csv_names = ["customers.csv", "transactions_train.csv", "articles.csv", "sample_submission.csv"]
for csv_name in csv_names:
    with ZipFile(os.path.join(path_dir, csv_name + ".zip"), "r") as archive:
        archive.extract(csv_name)



Downloading customers.csv.zip to /home/jovyan/examples-1/h-and-m-fash-rec-kaggle-competition/data
100%|██████████| 97.9M/97.9M [00:01<00:00, 80.3MB/s]

Downloading transactions_train.csv.zip to /home/jovyan/examples-1/h-and-m-fash-rec-kaggle-competition/data
100%|██████████| 584M/584M [00:08<00:00, 69.6MB/s]

Downloading articles.csv.zip to /home/jovyan/examples-1/h-and-m-fash-rec-kaggle-competition/data
100%|██████████| 4.26M/4.26M [00:00<00:00, 240MB/s]

Downloading sample_submission.csv.zip to /home/jovyan/examples-1/h-and-m-fash-rec-kaggle-competition/data
100%|██████████| 50.3M/50.3M [00:00<00:00, 81.6MB/s]

/home/jovyan/examples-1/h-and-m-fash-rec-kaggle-competition/data/customers.csv.zip


## Imports

In [None]:
import numpy as np
import pandas as pd
import implicit 
import scipy.sparse as sparse

## Load Data

In [None]:
path = "data/"
train_data_filepath = path + "transactions_train.csv"
article_metadata_filepath = path + "articles.csv"
customer_metadata_filepath = path + "customers.csv"
test_data_filepath = path + "sample_submission.csv"

In [None]:
train_data = pd.read_csv(train_data_filepath)
test_data = pd.read_csv(test_data_filepath)
customer_data = pd.read_csv(customer_metadata_filepath)
article_data = pd.read_csv(article_metadata_filepath)

## Exploring the dataset

In [None]:
train_data.head()

In [None]:
train_data.info()

In [None]:
train_data.describe()

In [None]:
test_data.tail()

In [None]:
customer_data.tail()

In [None]:
article_data.tail()

In [None]:
# Drop t_dat, sales_channel_id and price, as they are not used by the recommendation system we will build
train_data.drop(['t_dat','sales_channel_id','price'], axis= 1, inplace = True)

In [None]:
train_data.head()

## Preprocess Data

In [None]:
# Create a purchase_count column giving the number of times each customer bought each article
X = train_data.groupby(['customer_id', 'article_id'])['article_id'].count().reset_index(name = "purchase_count") 

# Get the unique customers and articles from the customer and article metadata files
unique_customers = customer_data['customer_id'].unique()
unique_articles = article_data['article_id'].unique()

# Number of unique customers and articles
n_customers = len(unique_customers)
n_articles = len(unique_articles)

# Map each customer_id (an object column) to an integer index for the sparse matrix
customer_id_dict = {cust_id: i for i, cust_id in enumerate(unique_customers)}
reverse_customer_id_dict = dict(enumerate(unique_customers))
# Series.map applies the dictionary to the whole column at once, avoiding a slow Python loop
X['customer_id'] = X['customer_id'].map(customer_id_dict)

# Map each article_id to a contiguous integer index so that the sparse matrix
# is not sized by the large integer values of the raw article_ids
article_id_dict = {art_id: i for i, art_id in enumerate(unique_articles)}
reverse_article_id_dict = dict(enumerate(unique_articles))
X['article_id'] = X['article_id'].map(article_id_dict)

In [None]:
X.head()
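
To make the counting and ID mapping concrete, here is a toy version of the same steps on a hand-made transactions table. The customer and article IDs are invented for illustration, and the `toy_` names are not used elsewhere in the notebook.

```python
import pandas as pd

# Invented transactions: customer "b" bought article 300 twice.
toy = pd.DataFrame({
    "customer_id": ["a", "b", "b", "c", "b"],
    "article_id": [100, 300, 300, 200, 100],
})

# Count purchases per (customer, article) pair.
toy_counts = (
    toy.groupby(["customer_id", "article_id"])
    .size()
    .reset_index(name="purchase_count")
)

# Map raw IDs to contiguous integer positions 0..n-1.
toy_customers = sorted(toy["customer_id"].unique())
toy_articles = sorted(toy["article_id"].unique())
customer_index = {cid: i for i, cid in enumerate(toy_customers)}
article_index = {aid: i for i, aid in enumerate(toy_articles)}

toy_counts["customer_id"] = toy_counts["customer_id"].map(customer_index)
toy_counts["article_id"] = toy_counts["article_id"].map(article_index)
```

Each row of `toy_counts` is now a (row index, column index, value) triple, which is the form a sparse matrix constructor expects.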

## Sparse Matrix Creation

In [None]:
# Constructing sparse matrices for alternating least squares algorithm    
sparse_user_item_coo = sparse.coo_matrix((X.purchase_count, (X.customer_id, X.article_id)), shape = (n_customers, n_articles))
sparse_user_item_csr = sparse.csr_matrix((X['purchase_count'], (X['customer_id'], X['article_id'])), shape = (n_customers, n_articles))

In [None]:
sparse_user_item_csr
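
The same construction on a few invented triples shows what the matrix holds: one row per customer, one column per article, and the purchase count as the stored value. The shape and values below are made up for illustration.

```python
import numpy as np
import scipy.sparse as sparse

# Invented (customer index, article index, purchase count) triples.
rows = np.array([0, 1, 1, 2])
cols = np.array([0, 0, 2, 1])
counts = np.array([1, 1, 2, 1])

# The shape covers every customer and article, including the fourth
# customer here, who has no purchases at all.
toy_matrix = sparse.csr_matrix((counts, (rows, cols)), shape=(4, 3))

dense_view = toy_matrix.toarray()
# Fraction of cells that hold a purchase count.
density = toy_matrix.nnz / (toy_matrix.shape[0] * toy_matrix.shape[1])
```

On the real data the density is far lower, which is why a sparse format is used instead of a dense array.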

## Model Training

In [None]:
# parameters for the model
als_params = dict(
    factors = 200,         # number of latent factors - try between 50 to 1000
    regularization = 0.01, # regularization factor - try between 0.001 to 0.2
    iterations = 5,        # iterations            - try between 2 to 100
)

# initialize a model
model = implicit.als.AlternatingLeastSquares(**als_params)

# train the model on a sparse matrix of user/item/confidence weights    
model.fit(sparse_user_item_csr)

In [None]:
model
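
For intuition about what `fit` optimizes, the sketch below performs a single user-factor update from the standard implicit-feedback ALS formulation in plain NumPy. It is not the `implicit` library's code, which uses an optimized solver. The confidence weighting `1 + alpha * count`, the value of `alpha`, and all data are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_factors = 5, 2
reg, alpha = 0.01, 40.0   # alpha is an assumed confidence scale

item_factors = rng.normal(size=(n_items, n_factors))   # Y: one row per article
purchases = np.array([2.0, 0.0, 0.0, 1.0, 0.0])        # one customer's purchase counts

preference = (purchases > 0).astype(float)   # p_u: 1 if bought, else 0
confidence = 1.0 + alpha * purchases         # c_u: higher for repeat purchases

# Solve (Y^T C_u Y + reg * I) x_u = Y^T C_u p_u for the customer's factor vector.
YtCY = item_factors.T @ (confidence[:, None] * item_factors)
YtCp = item_factors.T @ (confidence * preference)
user_factors = np.linalg.solve(YtCY + reg * np.eye(n_factors), YtCp)

# Estimated preference of this customer for each article.
predicted = item_factors @ user_factors
```

ALS alternates this update over all customers with the matching update over all articles, once per iteration.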

## Predictions

In [None]:
predictions = []
for cust_id in test_data.customer_id:
    # Convert the raw customer_id to its row index in the user-item matrix
    cust_idx = customer_id_dict.get(cust_id)
    # Assumes implicit >= 0.5, where recommend() returns (article indices, scores)
    article_idxs, scores = model.recommend(cust_idx, sparse_user_item_csr[cust_idx], N=10)
    # Map the article indices back to the original article_ids
    result = [reverse_article_id_dict.get(idx) for idx in article_idxs]
    predictions.append(result)
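
Conceptually, a recommendation for one customer ranks articles by the dot product of the customer's factor vector with each article's factor vector, skipping articles the customer has already bought. The sketch below shows that ranking step on invented factors in NumPy; it illustrates the idea and is not the `implicit` library's implementation.

```python
import numpy as np

toy_user_factors = np.array([1.0, 0.5])
toy_item_factors = np.array([
    [0.9, 0.1],   # article 0
    [0.2, 0.8],   # article 1
    [1.0, 1.0],   # article 2
    [0.0, 0.1],   # article 3
])
already_bought = np.array([2])   # article indices to exclude

scores = toy_item_factors @ toy_user_factors
scores[already_bought] = -np.inf   # never recommend purchased articles

top_n = 2
top_items = np.argsort(-scores)[:top_n]   # indices of the highest scores first
```

Article 2 has the highest raw score but is excluded, so articles 0 and 1 are returned.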


In [None]:
test_data['prediction'] = predictions
test_data

### Final Submission

In [None]:
test_data.to_csv('data/submission.csv', index=False)
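
Before uploading, it is worth comparing the `prediction` column with `sample_submission.csv`. The sketch below shows one way to turn a list of integer article IDs into a single space-separated string. It assumes that the expected format is space-separated 10-digit IDs and that the leading zero is lost when `articles.csv` is read as integers; confirm both against the sample file. The helper name `format_prediction` is illustrative.

```python
def format_prediction(article_ids, width=10):
    """Join article IDs into one space-separated string, zero-padded to `width` digits."""
    return " ".join(str(int(a)).zfill(width) for a in article_ids)


# Invented IDs, used only to show the padding and joining.
example = format_prediction([108775015, 706016001])
```

In this notebook the helper would be applied as `test_data['prediction'] = [format_prediction(p) for p in predictions]` before writing the CSV.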