<a href="https://www.kaggle.com/code/niramay/h-m-recommendations?scriptVersionId=111361451" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Problem Statement
The goal is to develop product recommendations based on data from previous transactions, as well as customer and product metadata. The available metadata ranges from simple attributes, such as garment type and customer age, to text from product descriptions and images of the garments.

Although NLP or image processing could be used to improve the recommendations, this notebook takes a simpler approach.

# Approach 
To recommend products to a user, this notebook uses three sources of recommendations:
* Recommending products based on previously purchased items 
* Recommending products that are usually bought together with the previously bought products 
* Recommending popular products
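
One way the three sources above could be merged into a single recommendation list is sketched below. This is only an illustration: the function name `merge_candidates`, the priority order (previous purchases first, popular items last), and the cutoff `k` are assumptions, not something this notebook has defined.

```python
from itertools import chain


def merge_candidates(previous, bought_together, popular, k):
    """Merge three candidate lists in priority order, dropping repeats.

    Hypothetical helper: the priority order and the cutoff k are assumptions.
    """
    seen = set()
    merged = []
    for article in chain(previous, bought_together, popular):
        if len(merged) == k:
            break
        if article not in seen:
            seen.add(article)
            merged.append(article)
    return merged


# Toy article IDs, not taken from the dataset.
recs = merge_candidates([1, 2], [2, 3], [4, 5, 6], k=4)
print(recs)
```

Popular products act as a fallback here, so customers with little or no purchase history still receive `k` recommendations whenever the popular list is long enough.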

# Importing Necessary Libraries

In [1]:
import numpy as np 
import pandas as pd 
import cudf
import os

In [2]:
train = cudf.read_csv('../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv')
train.head()

| | t_dat | customer_id | article_id | price | sales_channel_id |
|---|---|---|---|---|---|
| 0 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 663713001 | 0.050831 | 2 |
| 1 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 541518023 | 0.030492 | 2 |
| 2 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 505221004 | 0.015237 | 2 |
| 3 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 685687003 | 0.016932 | 2 |
| 4 | 2018-09-20 | 00007d2de826758b65a93dd24ce629ed66842531df6699... | 685687004 | 0.016932 | 2 |


In [3]:
train['customer_id'][0]

'000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318'

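The full 64-character customer ID is not needed to keep customers unique, so we keep only its last 16 hexadecimal characters and convert them to a 64-bit integer. We also downcast `article_id`, parse the dates, and drop the columns that are not used below.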

In [4]:
train['customer_id'] = train['customer_id'].str[-16:].str.hex_to_int().astype('int64')  # last 16 hex chars -> int64
train['article_id'] = train['article_id'].astype('int32')
train['t_dat'] = cudf.to_datetime(train['t_dat'])
train = train[['t_dat','customer_id','article_id']]  # drop price and sales_channel_id
train.to_parquet('train.pqt',index=False)  # save the slimmed-down table
print(train.shape)

(31788324, 3)
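
For readers without a GPU, the ID shortening can be approximated on the CPU. The sketch below uses pandas and NumPy rather than cuDF; the helper name `shorten_customer_id` is hypothetical, and it assumes the cast to `int64` reinterprets the unsigned bits as a signed value, which is consistent with the negative customer IDs that appear in later outputs.

```python
import numpy as np
import pandas as pd


def shorten_customer_id(ids: pd.Series) -> pd.Series:
    """CPU stand-in (assumption) for .str[-16:].str.hex_to_int().astype('int64')."""
    # Parse the last 16 hex characters of each ID as an unsigned 64-bit integer.
    unsigned = np.array([int(h[-16:], 16) for h in ids], dtype=np.uint64)
    # Reinterpret the same bits as signed int64; values >= 2**63 become negative.
    return pd.Series(unsigned.view(np.int64), index=ids.index)


ids = pd.Series([
    "000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318",
    "0" * 63 + "1",  # toy ID, not from the dataset
])
short_ids = shorten_customer_id(ids)
print(short_ids.dtype)
```

Storing each ID in 8 bytes instead of a 64-character string makes the later group-by and merge steps much lighter on memory.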


Let's find each customer's purchases from their last week of activity, i.e. the transactions made within 6 days of that customer's most recent purchase date.

In [5]:
tmp = train.groupby('customer_id').t_dat.max().reset_index()
tmp.head()

| | customer_id | t_dat |
|---|---|---|
| 0 | -5930446966655949845 | 2020-03-25 |
| 1 | 6138898004712415003 | 2020-07-28 |
| 2 | -3758009466528006904 | 2019-04-15 |
| 3 | -4320000672183660287 | 2019-12-03 |
| 4 | -4194961289286638255 | 2020-04-20 |


In [6]:
tmp.columns = ['customer_id','max_dat']
train = train.merge(tmp,on=['customer_id'],how='left')
train.head()

| | t_dat | customer_id | article_id | max_dat |
|---|---|---|---|---|
| 0 | 2018-09-21 | -4745167340148134637 | 637194001 | 2020-07-11 |
| 1 | 2018-09-21 | -4745167340148134637 | 627147001 | 2020-07-11 |
| 2 | 2018-09-21 | -4745167340148134637 | 627147002 | 2020-07-11 |
| 3 | 2018-09-21 | -4745167340148134637 | 464454011 | 2020-07-11 |
| 4 | 2018-09-21 | -4745167340148134637 | 637194001 | 2020-07-11 |


In [7]:
train['diff_dat'] = (train.max_dat - train.t_dat).dt.days  # days before the customer's latest purchase
train = train.loc[train['diff_dat']<=6]  # keep only the customer's last week of activity
train.head()

| | t_dat | customer_id | article_id | max_dat | diff_dat |
|---|---|---|---|---|---|
| 64 | 2018-09-20 | 1724137143448920012 | 653337002 | 2018-09-20 | 0 |
| 65 | 2018-09-20 | 1724137143448920012 | 650799002 | 2018-09-20 | 0 |
| 66 | 2018-09-20 | 1724137143448920012 | 575074001 | 2018-09-20 | 0 |
| 67 | 2018-09-20 | 1724137143448920012 | 553488008 | 2018-09-20 | 0 |
| 68 | 2018-09-20 | 1724137143448920012 | 673806003 | 2018-09-20 | 0 |
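
The last-week filter can also be written without the intermediate merge. The following is a minimal pandas sketch (cuDF is not required to follow it); the function name `last_week_purchases` and the toy data are made up for illustration, while the 6-day threshold matches the cell above.

```python
import pandas as pd


def last_week_purchases(tx: pd.DataFrame, window_days: int = 6) -> pd.DataFrame:
    """Keep rows within `window_days` of each customer's latest purchase."""
    # Broadcast each customer's most recent purchase date to all of their rows.
    max_dat = tx.groupby("customer_id")["t_dat"].transform("max")
    # Whole days between each purchase and that customer's latest purchase.
    diff_days = (max_dat - tx["t_dat"]).dt.days
    return tx.loc[diff_days <= window_days]


# Toy transactions, not taken from the dataset.
tx = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-01-10", "2019-06-01"]),
    "customer_id": [1, 1, 1, 2],
    "article_id": [10, 11, 12, 13],
})
recent = last_week_purchases(tx)
print(recent)
```

Note that the window is relative to each customer's own latest purchase, not to the end of the dataset, so a customer whose last purchase was in 2019 still contributes rows.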


# PART 1: Items previously purchased (most often)

In [8]:
tmp = train.groupby(['customer_id','article_id'])['t_dat'].agg('count').reset_index()  # purchases per customer-article pair
tmp.columns = ['customer_id','article_id','ct']
tmp.head()


| | customer_id | article_id | ct |
|---|---|---|---|
| 0 | 7081293666638850256 | 843873007 | 1 |
| 1 | 78461283201566517 | 860815001 | 1 |
| 2 | -5552346397672546557 | 886229001 | 1 |
| 3 | 8010096731079176441 | 834924005 | 1 |
| 4 | 195838681510355618 | 799421001 | 1 |


In [9]:
train = train.merge(tmp,on=['customer_id','article_id'],how='left')  # attach the purchase count to every row
train = train.sort_values(['ct','t_dat'],ascending=False)  # most-bought first, then most recent
train.head()

| | t_dat | customer_id | article_id | max_dat | diff_dat | ct |
|---|---|---|---|---|---|---|
| 1132001 | 2019-07-16 | 2729025827381139556 | 719348003 | 2019-07-16 | 0 | 100 |
| 1132003 | 2019-07-16 | 2729025827381139556 | 719348003 | 2019-07-16 | 0 | 100 |
| 1132005 | 2019-07-16 | 2729025827381139556 | 719348003 | 2019-07-16 | 0 | 100 |
| 1132007 | 2019-07-16 | 2729025827381139556 | 719348003 | 2019-07-16 | 0 | 100 |
| 1132009 | 2019-07-16 | 2729025827381139556 | 719348003 | 2019-07-16 | 0 | 100 |


In [10]:
train = train.drop_duplicates(['customer_id','article_id'])  # one row per customer-article pair
train = train.sort_values(['ct','t_dat'],ascending=False)  # re-sort after deduplication
train.head()

| | t_dat | customer_id | article_id | max_dat | diff_dat | ct |
|---|---|---|---|---|---|---|
| 1132001 | 2019-07-16 | 2729025827381139556 | 719348003 | 2019-07-16 | 0 | 100 |
| 69344 | 2018-10-04 | 4485518665254175540 | 557247001 | 2018-10-04 | 0 | 86 |
| 2142658 | 2020-03-06 | -906958334866810496 | 852521001 | 2020-03-06 | 0 | 81 |
| 3390624 | 2020-07-06 | 3601599666106972342 | 685813001 | 2020-07-06 | 0 | 80 |
| 853057 | 2019-05-14 | -4601407992705575197 | 695545001 | 2019-05-14 | 0 | 80 |
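
The count, sort, and deduplicate steps of Part 1 can be condensed into one function. Below is a hedged pandas sketch of the same idea; the name `rank_repeat_purchases` and the toy data are assumptions for illustration, and `transform('count')` replaces the separate group-by and merge used above.

```python
import pandas as pd


def rank_repeat_purchases(tx: pd.DataFrame) -> pd.DataFrame:
    """One row per (customer, article), ordered by purchase count, then recency."""
    out = tx.copy()
    # Number of times each customer bought each article.
    out["ct"] = out.groupby(["customer_id", "article_id"])["t_dat"].transform("count")
    # Most-bought first; ties broken by the most recent purchase date.
    out = out.sort_values(["ct", "t_dat"], ascending=False)
    # Keep the first (most recent) row of each customer-article pair.
    return out.drop_duplicates(["customer_id", "article_id"])


# Toy transactions, not taken from the dataset.
tx = pd.DataFrame({
    "t_dat": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "customer_id": [1, 1, 1],
    "article_id": [10, 11, 10],
})
ranked = rank_repeat_purchases(tx)
print(ranked)
```

In this toy example article 10 was bought twice, so it ranks above article 11, and its surviving row carries the later of its two purchase dates.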
