### Data Description

**articles.csv** - detailed metadata for each article_id available for purchase
<br>
**customers.csv** - metadata for each customer_id in the dataset
<br>
**sample_submission.csv** - a sample submission file in the correct format
<br>
**transactions_train.csv** - the training data, consisting of the purchases of each customer on each date, as well as additional information. Duplicate rows correspond to multiple purchases of the same item. Your task is to predict the article_ids each customer will purchase during the 7-day period immediately after the training data period.

[Data Link to Kaggle](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/data)
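
Because the target is the 7 days right after the training period, one way to test ideas offline is to hold out the final 7 days of the transactions as a validation set. The following is a minimal sketch on made-up data, not part of the original notebook. It assumes the date column is named `t_dat`, as in the Kaggle file; `split_last_week` is a hypothetical helper.

```python
import pandas as pd

def split_last_week(trans, date_col="t_dat", days=7):
    """Split transactions into (train, valid); valid holds the final `days` days."""
    dates = pd.to_datetime(trans[date_col])
    # First day of the hold-out window, counting the latest date as day `days`
    cutoff = dates.max() - pd.Timedelta(days=days - 1)
    is_valid = dates >= cutoff
    return trans[~is_valid], trans[is_valid]

# Toy stand-in for transactions_train.csv (values are made up)
toy = pd.DataFrame({
    "t_dat": ["2020-09-01", "2020-09-10", "2020-09-16", "2020-09-22"],
    "customer_id": ["a", "b", "a", "c"],
    "article_id": [1, 2, 3, 1],
})
train, valid = split_last_week(toy)
```

Predictions built from `train` can then be scored against the purchases in `valid`.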

### Brainstorming

**How to recommend an article to a customer A?**

*Recommendation based on similar customer taste*
- If A bought 5 articles in the past, and 100 other customers bought the same articles, we can cluster these customers together.
- Take the articles bought by these 100 other customers but not by A, ranked by popularity in descending order.
- Recommend the most popular articles among the 100 other customers to A.
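
The idea in the list above can be sketched in a few lines of pandas. This is a simplified illustration on toy data, not the notebook's final method: it treats any customer who shares at least one article with A as "similar", which is looser than requiring the same set of articles. `recommend_from_similar` is a hypothetical helper name.

```python
import pandas as pd

def recommend_from_similar(trans, customer, k=3):
    """Return up to k articles popular among customers who share purchases with `customer`."""
    own = set(trans.loc[trans["customer_id"] == customer, "article_id"])
    # Customers who bought at least one of the same articles (a crude notion of "similar")
    similar = trans.loc[
        trans["article_id"].isin(own) & (trans["customer_id"] != customer),
        "customer_id",
    ].unique()
    # Articles bought by the similar customers but not yet by `customer`
    candidates = trans[
        trans["customer_id"].isin(similar) & ~trans["article_id"].isin(own)
    ]
    # value_counts() sorts by count in descending order
    return candidates["article_id"].value_counts().head(k).index.tolist()

# Toy transactions (made up)
toy = pd.DataFrame({
    "customer_id": ["A", "A", "B", "B", "B", "C", "C", "D"],
    "article_id":  [1,   2,   1,   3,   4,   2,   3,   5],
})
recs = recommend_from_similar(toy, "A")
```

Here B and C overlap with A, so article 3 (bought by both) ranks ahead of article 4, and article 5 is never considered because D shares nothing with A.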

*Recommendation based on customer taste, unrelated to other customers*
- If a customer bought 5 items in the past, could we use extensive customer data to create customer personas, where each persona represents a style?
- Use a text-prediction LSTM?


## Import Data

In [1]:
# Linear algebra and data handling
import numpy as np
import pandas as pd

import tensorflow.keras as keras
from tensorflow.keras import layers

# ML Framework
from keras.models import Sequential
from keras.layers import Dense, Activation

# Visualization
import matplotlib.pyplot as plt

Using TensorFlow backend.


In [47]:
# Load the training transactions
trans_df = pd.read_csv("../hm_data/transactions_train.csv")

In [48]:
article_df = pd.read_csv("../hm_data/articles.csv")

In [49]:
cust_df = pd.read_csv("../hm_data/customers.csv")

## Data Preprocessing



There is too much data for initial modelling.

**Pseudo code:**
1. Get the popularity list (pop_list) for articles, where popularity is the number of times an article was bought
    - count each unique article_id in trans_df, in descending order, so the first article is the most popular
2. Merge the popularity list into article_df, so that article_df gains an additional column named popularity
3. Sort article_df by popularity
4. The most popular articles are expected to be common, regular-use articles such as socks and underwear. We want to throw away these data points. Inspect article_df by eye to determine where the common articles end
5. Take the 10000 most popular articles (top), i.e. the top 10k of pop_list
6. Throw away data not related to the top 10000 popular articles from both trans_df and article_df


We take the data for 


**Extra:**
- Should I convert the customer_id to something more manageable?
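
One possible answer to the identifier question above is to map the long string ids to compact integer codes. A minimal sketch, assuming the identifier in question is the `customer_id` string column; the toy ids and the `customer_idx` column name are made up for illustration.

```python
import pandas as pd

# Toy stand-in for trans_df; real customer_id values are long strings
toy = pd.DataFrame({
    "customer_id": ["cust_a", "cust_b", "cust_a", "cust_c"],
    "article_id": [1, 2, 3, 1],
})

# factorize() assigns 0, 1, 2, ... in order of first appearance
codes, uniques = pd.factorize(toy["customer_id"])
toy["customer_idx"] = codes.astype("int32")

# uniques[code] maps an integer code back to the original id
original_id = uniques[2]
```

Integer codes take less memory than repeated strings and can be used directly as embedding indices, while `uniques` keeps the mapping needed to write a submission with the original ids.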

In [139]:
# pop_list = popularity list for articles
# value_counts() counts each unique article_id and returns a Series sorted in descending order
pop_list = trans_df["article_id"].value_counts()
# Series to DataFrame with explicit column names, independent of the default naming of the installed pandas version
pop_list = pop_list.rename_axis("article_id").reset_index(name="popularity")
pop_list

Unnamed: 0,article_id,popularity
0,706016001,50287
1,706016002,35043
2,372860001,31718
3,610776002,30199
4,759871002,26329
...,...,...
104542,520736002,1
104543,619777003,1
104544,586904003,1
104545,512385003,1


In [142]:
# Merge pop_list into article_df. The default inner join drops rows in article_df whose article_id does not appear in pop_list
article_df = pd.merge(article_df, pop_list, on="article_id")
# Now article_df has a popularity count for each article
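
Because the inner join silently drops articles that never appear in the transactions, it can be useful to see which rows are lost. The sketch below uses toy frames (made-up values) to contrast the default inner join with a left join plus `indicator=True`. It illustrates an alternative and is not a change to the cell above.

```python
import pandas as pd

# Toy stand-ins for article_df and pop_list
articles = pd.DataFrame({"article_id": [1, 2, 3], "prod_name": ["a", "b", "c"]})
pop = pd.DataFrame({"article_id": [1, 2], "popularity": [5, 2]})

# Default how="inner": article 3 disappears because it has no popularity row
inner = pd.merge(articles, pop, on="article_id")

# Left join keeps every article; the _merge column records where each row came from
left = pd.merge(articles, pop, on="article_id", how="left", indicator=True)
never_sold = left.loc[left["_merge"] == "left_only", "article_id"].tolist()

# Articles without transactions get popularity 0 instead of being dropped
left["popularity"] = left["popularity"].fillna(0).astype(int)
```

Which variant fits depends on whether never-purchased articles should remain candidates for recommendation.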

In [146]:
# Sort article_df by the popularity column in descending order
article_df = article_df.sort_values(by="popularity", ascending=False)

In [156]:
article_df

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc,popularity
53832,706016001,706016,Jade HW Skinny Denim TRS,272,Trousers,Garment Lower body,1010016,Solid,9,Black,...,D,Divided,2,Divided,53,Divided Collection,1009,Trousers,High-waisted jeans in washed superstretch deni...,50287
53833,706016002,706016,Jade HW Skinny Denim TRS,272,Trousers,Garment Lower body,1010016,Solid,71,Light Blue,...,D,Divided,2,Divided,53,Divided Collection,1009,Trousers,High-waisted jeans in washed superstretch deni...,35043
1711,372860001,372860,7p Basic Shaftless,302,Socks,Socks & Tights,1010016,Solid,9,Black,...,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Fine-knit trainer socks in a soft cotton blend.,31718
24808,610776002,610776,Tilly (1),255,T-shirt,Garment Upper body,1010016,Solid,9,Black,...,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,T-shirt in lightweight jersey with a rounded h...,30199
70124,759871002,759871,Tilda tank,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,D,Divided,2,Divided,80,Divided Complements Other,1002,Jersey Basic,"Cropped, fitted top in cotton jersey with narr...",26329
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8205,521266001,521266,Patsy PU Glove,71,Gloves,Accessories,1010016,Solid,9,Black,...,C,Ladies Accessories,1,Ladieswear,65,Womens Big accessories,1019,Accessories,"Gloves in soft, supple imitation leather. Lined.",1
36666,651538001,651538,Sigge leather helmet bag,66,Bag,Accessories,1010016,Solid,17,Yellowish Brown,...,F,Menswear,3,Menswear,25,Men Accessories,1019,Accessories,Leather bag with two handles and a zip at the ...,1
8211,521302001,521302,Make the boys wink,252,Sweater,Garment Upper body,1010010,Melange,23,Dark Yellow,...,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1003,Knitwear,Wide jumper in a soft rib knit containing some...,1
36710,651645003,651645,HARIBO TEE,255,T-shirt,Garment Upper body,1010005,Colour blocking,91,Light Green,...,F,Menswear,3,Menswear,20,Contemporary Smart,1005,Jersey Fancy,Block-print T-shirt in cotton jersey.,1


In [155]:
article_df[article_df["product_group_name"] == "Socks & Tights"]

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc,popularity
1711,372860001,372860,7p Basic Shaftless,302,Socks,Socks & Tights,1010016,Solid,9,Black,...,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Fine-knit trainer socks in a soft cotton blend.,31718
1712,372860002,372860,7p Basic Shaftless,302,Socks,Socks & Tights,1010016,Solid,10,White,...,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Fine-knit trainer socks in a soft cotton blend.,24458
67,156231001,156231,Box 4p Tights,304,Underwear Tights,Socks & Tights,1010016,Solid,9,Black,...,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Matt tights with an elasticated waist. 20 denier.,21013
24237,608776002,608776,Scallop 5p Socks,302,Socks,Socks & Tights,1010016,Solid,9,Black,...,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Fine-knit socks with a scalloped edge.,17886
74,160442007,160442,3p Sneaker Socks,302,Socks,Socks & Tights,1010016,Solid,9,Black,...,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,"Short, fine-knit socks designed to be hidden b...",17866
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14448,562990013,562990,Billie 10-p sock BG,302,Socks,Socks & Tights,1010016,Solid,7,Grey,...,I,Children Sizes 134-170,4,Baby/Children,79,Girls Underwear & Basics,1021,Socks and Tights,Fine-knit socks in a cotton blend with elastic...,1
14402,562734022,562734,Liseberg 5-p sock BG,302,Socks,Socks & Tights,1010013,Other pattern,7,Grey,...,I,Children Sizes 134-170,4,Baby/Children,79,Girls Underwear & Basics,1021,Socks and Tights,Jacquard-knit socks in a soft cotton blend in ...,1
14401,562734021,562734,Liseberg 5-p sock BG,302,Socks,Socks & Tights,1010013,Other pattern,8,Dark Grey,...,I,Children Sizes 134-170,4,Baby/Children,79,Girls Underwear & Basics,1021,Socks and Tights,Jacquard-knit socks in a soft cotton blend in ...,1
781,293433012,293433,2-p basic cotton tights SG,304,Underwear Tights,Socks & Tights,1010016,Solid,43,Dark Red,...,H,Children Sizes 92-140,4,Baby/Children,79,Girls Underwear & Basics,1021,Socks and Tights,"Tights in a soft, fine knit with an elasticate...",1


In [153]:
article_df["product_group_name"].value_counts().to_frame()

Unnamed: 0,product_group_name
Garment Upper body,42313
Garment Lower body,19661
Garment Full body,13160
Accessories,11023
Underwear,5447
Shoes,5228
Swimwear,3127
Socks & Tights,2420
Nightwear,1882
Unknown,113
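
Step 4 of the pseudo code (discarding common everyday articles) could be approximated by filtering on `product_group_name`. The sketch below uses toy data, and the set of "common" groups is a hypothetical choice. The notebook leaves that decision to manual inspection.

```python
import pandas as pd

# Hypothetical choice of "common" groups; adjust after inspecting article_df
COMMON_GROUPS = {"Socks & Tights", "Underwear"}

# Toy stand-in for the merged article_df (made-up popularity values)
articles = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "product_group_name": [
        "Garment Lower body", "Socks & Tights", "Underwear", "Garment Upper body",
    ],
    "popularity": [50, 30, 20, 10],
})

# Keep only articles outside the common groups
kept = articles[~articles["product_group_name"].isin(COMMON_GROUPS)]

# Share of all purchases removed by the filter
removed_share = 1 - kept["popularity"].sum() / articles["popularity"].sum()
```

Checking `removed_share` before committing to a filter shows how much purchase volume the chosen groups account for.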


In [143]:
len(article_df)

104547

# Article ids of the 10000 most popular articles (pop_list is already sorted by popularity, descending)
top = pop_list["article_id"].head(10000).tolist()

In [38]:
trans_df[trans_df['article_id'].isin(top)]
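
The expression above only displays the filtered rows; to actually shrink the data, the result has to be assigned. A minimal sketch on toy data, where `n_top = 2` stands in for 10000 and the `*_top` names are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for trans_df and article_df
trans = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "article_id":  [1,   2,   1,   3,   1,   2],
})
articles = pd.DataFrame({"article_id": [1, 2, 3]})

n_top = 2  # stands in for 10000
top = trans["article_id"].value_counts().head(n_top).index.tolist()

# Keep only rows that refer to a top article, in both frames
trans_top = trans[trans["article_id"].isin(top)].copy()
articles_top = articles[articles["article_id"].isin(top)].copy()

# Fraction of transactions that survive the filter
kept_share = len(trans_top) / len(trans)
```

`kept_share` gives a quick check on how much of the purchase history remains after restricting to the top articles.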

In [77]:
len(article_df)

105542

In [82]:
article_df[article_df['article_id'].isin(top)]

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
58,153115020,153115,OP Strapless^,306,Bra,Underwear,1010016,Solid,12,Light Beige,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Strapless bra in microfibre with underwired, p..."
186,189626001,189626,Jodi skirt,275,Skirt,Garment Lower body,1010016,Solid,9,Black,...,Basic 1,D,Divided,2,Divided,51,Divided Basics,1002,Jersey Basic,"Short, bell-shaped skirt in stretch jersey wit..."
230,201219001,201219,Heavy plain 2 p tights,304,Underwear Tights,Socks & Tights,1010016,Solid,9,Black,...,Tights basic,B,Lingeries/Tights,1,Ladieswear,62,"Womens Nightwear, Socks & Tigh",1021,Socks and Tights,Fine-knit tights with an elasticated waist.
261,212629004,212629,Alcazar strap dress,265,Dress,Garment Full body,1010016,Solid,9,Black,...,Basic 1,D,Divided,2,Divided,51,Divided Basics,1002,Jersey Basic,"Long, sleeveless dress in jersey with narrow s..."
505,253448002,253448,OP Push Melbourne^,306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Push-up bra in microfibre with underwired, mou..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97827,876053003,876053,Ruby,258,Blouse,Garment Upper body,1010001,All over pattern,92,Green,...,Blouse,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1010,Blouses,Blouse in a crêpe weave with a sweetheart neck...
98422,879248001,879248,Glamping,274,Shorts,Garment Lower body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Short shorts in lightweight sweatshirt fabric ...
98796,880839001,880839,Eleonor button dress,265,Dress,Garment Full body,1010016,Solid,9,Black,...,Basic 1,D,Divided,2,Divided,51,Divided Basics,1002,Jersey Basic,Calf-length dress in ribbed jersey with a V-ne...
99169,882888002,882888,Agneta jumpsuit,272,Trousers,Garment Lower body,1010017,Stripe,12,Light Beige,...,Jersey fancy,A,Ladieswear,1,Ladieswear,15,Womens Everyday Collection,1005,Jersey Fancy,Ankle-length jumpsuit in airy jersey crêpe wit...


In [27]:
len(cust_df)

1371980