## Data Loading

This notebook loads the cleaned H&M interaction datasets from an
attached Kaggle dataset. Files are read from Kaggle's `/kaggle/input`
directory, which is mounted automatically and is read-only.


In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict

pd.set_option("display.max_columns", None)

In [2]:
import os

os.listdir("/kaggle/input/hm-cleaned")

['data_export', 'data']

In [3]:
os.listdir("/kaggle/input/hm-cleaned/data")

['valid.csv', 'train.csv']

In [4]:
os.listdir("/kaggle/input/hm-cleaned/data_export")

['valid.csv', 'train.csv']

In [5]:
train = pd.read_csv(
    "/kaggle/input/hm-cleaned/data/train.csv",
    dtype={
        "customer_id": "string",
        "article_id": "int32"
    }
)

valid = pd.read_csv(
    "/kaggle/input/hm-cleaned/data/valid.csv",
    dtype={
        "customer_id": "string",
        "article_id": "int32"
    }
)

In [6]:
print("Train shape:", train.shape)
print("Valid shape:", valid.shape)

train.head()

Train shape: (24627945, 7)
Valid shape: (6131468, 7)


| | t_dat | customer_id | article_id | price | sales_channel_id | year | month |
|---|---|---|---|---|---|---|---|
| 0 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 663713001 | 0.050831 | 2 | 2018 | 9 |
| 1 | 2018-09-20 | a8d14751a68b4cab69fed60b169c03c5d62f1c8b73fb1c... | 572124001 | 0.040661 | 2 | 2018 | 9 |
| 2 | 2018-09-20 | a8d14751a68b4cab69fed60b169c03c5d62f1c8b73fb1c... | 615959001 | 0.016932 | 2 | 2018 | 9 |
| 3 | 2018-09-20 | a8d14751a68b4cab69fed60b169c03c5d62f1c8b73fb1c... | 477507002 | 0.013542 | 2 | 2018 | 9 |
| 4 | 2018-09-20 | a8d14751a68b4cab69fed60b169c03c5d62f1c8b73fb1c... | 635673002 | 0.015237 | 2 | 2018 | 9 |


## Exploratory Analysis

In [8]:
print("Train interactions:", len(train))
print("Validation interactions:", len(valid))

print("Unique users (train):", train["customer_id"].nunique())
print("Unique items (train):", train["article_id"].nunique())

Train interactions: 24627945
Validation interactions: 6131468
Unique users (train): 880152
Unique items (train): 81656


In [9]:
# Ten most frequent articles in the training interactions
top_items = (
    train["article_id"]
    .value_counts()
    .head(10)
)

top_items


article_id
706016001    40678
706016002    30173
372860001    24457
610776002    22530
464297007    20190
399223001    20081
562245001    19476
759871002    18838
156231001    18819
562245046    18167
Name: count, dtype: int64

In [10]:
# Build Popularity Recommender
TOP_K = 12

popular_items = (
    train["article_id"]
    .value_counts()
    .head(TOP_K)
    .index
    .tolist()
)

popular_items

[706016001,
 706016002,
 372860001,
 610776002,
 464297007,
 399223001,
 562245001,
 759871002,
 156231001,
 562245046,
 351484002,
 706016003]

In [11]:
# Prepare User → Ground Truth Mapping
valid_user_items = (
    valid
    .groupby("customer_id")["article_id"]
    .apply(set)
    .to_dict()
)
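
To make the shape of this mapping concrete, here is a minimal sketch on a toy frame. The frame `toy_valid` and its IDs are hypothetical and are not rows from the real dataset; only the `groupby(...).apply(set).to_dict()` chain is taken from the cell above.

```python
import pandas as pd

# Hypothetical toy interactions (not real H&M rows).
toy_valid = pd.DataFrame(
    {
        "customer_id": ["u1", "u1", "u2", "u1"],
        "article_id": [101, 102, 101, 101],
    }
)

# Same chain as the notebook cell: one set of article IDs per customer.
toy_user_items = (
    toy_valid
    .groupby("customer_id")["article_id"]
    .apply(set)
    .to_dict()
)

print(toy_user_items)
```

Because each value is a set, the repeated `(u1, 101)` interaction is counted once, so the ground truth for a user is the set of distinct articles rather than the number of transactions.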

In [12]:
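# Recall@K Metric
def recall_at_k(recommended_items, ground_truth_items, k):
    """Fraction of a user's ground-truth items found in the top-k recommendations."""
    if len(ground_truth_items) == 0:
        return 0.0
    # ground_truth_items is expected to be a set so the intersection works
    hits = len(set(recommended_items[:k]) & ground_truth_items)
    return hits / len(ground_truth_items)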
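
As a quick sanity check of the metric, the sketch below applies it to a hand-made example. The function body is repeated only so the snippet runs on its own; the `recommended` and `purchased` values are hypothetical.

```python
def recall_at_k(recommended_items, ground_truth_items, k):
    # Same logic as the notebook cell above, repeated for self-containment.
    if len(ground_truth_items) == 0:
        return 0.0
    hits = len(set(recommended_items[:k]) & ground_truth_items)
    return hits / len(ground_truth_items)

recommended = [10, 20, 30, 40]  # hypothetical ranked recommendations
purchased = {20, 40, 50}        # hypothetical ground-truth set

# Top 2 is [10, 20]: one of the three purchased items is recovered.
r2 = recall_at_k(recommended, purchased, k=2)

# Top 4 is [10, 20, 30, 40]: two of the three purchased items are recovered.
r4 = recall_at_k(recommended, purchased, k=4)

print(r2, r4)
```

Item 50 is never recommended, so recall cannot reach 1.0 for this user no matter how large `k` is.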

In [13]:
recalls = []

for user, true_items in valid_user_items.items():
    if len(true_items) == 0:
        continue
    rec = recall_at_k(popular_items, true_items, TOP_K)
    recalls.append(rec)

mean_recall = np.mean(recalls)
mean_recall


np.float64(0.008661540850496847)
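
Since every user receives the same list, the per-user loop can also be expressed with pandas operations. The following is a minimal sketch rather than the notebook's method: `mean_recall_popularity` is a hypothetical helper, and the toy frame is invented for illustration. It deduplicates `(customer_id, article_id)` pairs to mirror the set-based ground truth, so it should agree with the loop above when applied to `valid` and `popular_items`.

```python
import pandas as pd

def mean_recall_popularity(valid_df, popular, k):
    """Mean Recall@k when every user gets the same top-k list (hypothetical helper)."""
    top = list(popular)[:k]
    # Distinct (user, item) pairs correspond to the per-user sets of the loop version.
    pairs = valid_df[["customer_id", "article_id"]].drop_duplicates()
    is_hit = pairs["article_id"].isin(top)
    # Mean of a boolean column per user = hits / number of distinct true items.
    per_user_recall = is_hit.groupby(pairs["customer_id"]).mean()
    return float(per_user_recall.mean())

# Hypothetical toy validation data.
toy_valid = pd.DataFrame(
    {
        "customer_id": ["u1", "u1", "u1", "u1", "u2", "u3"],
        "article_id": [1, 2, 2, 3, 4, 1],
    }
)

toy_mean_recall = mean_recall_popularity(toy_valid, popular=[1, 4, 9], k=2)
print(toy_mean_recall)
```

Here `u1` recovers 1 of 3 distinct items while `u2` and `u3` each recover their single item, so the mean is (1/3 + 1 + 1) / 3.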

In [14]:
np.save("popular_items.npy", popular_items)
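
A later notebook can read the saved list back with `np.load`. The sketch below round-trips the first three IDs from the list above through a temporary directory; the temporary path is an assumption made so the example does not touch the working directory.

```python
import os
import tempfile

import numpy as np

# First three article IDs from the popularity list shown earlier.
items = [706016001, 706016002, 372860001]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "popular_items.npy")
    np.save(path, items)               # the list is converted to an integer array
    restored = np.load(path).tolist()  # back to a plain Python list

print(restored)
```

`np.save` stores the list as a NumPy array, so `.tolist()` is needed if downstream code expects a Python list of ints.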

## Baseline Results

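We evaluated a popularity-based recommender using Recall@12.
Recommending the 12 most frequent training articles to every user gives
a mean Recall@12 of about 0.0087 over the users in the validation set.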
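
For reference, the metric computed by `recall_at_k` and the averaging loop can be written as follows. The symbols are notation introduced here, not taken from the notebook: $G_u$ is the set of distinct validation articles for user $u$, $R_u$ is the ranked recommendation list (identical for all users in this baseline), and $U$ is the set of validation users.

```latex
\mathrm{Recall@}K(u) = \frac{\lvert \mathrm{top}_K(R_u) \cap G_u \rvert}{\lvert G_u \rvert},
\qquad
\overline{\mathrm{Recall@}K} = \frac{1}{\lvert U \rvert} \sum_{u \in U} \mathrm{Recall@}K(u)
```

With $K = 12$, the second expression corresponds to the `mean_recall` value reported above.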

This model serves as a non-personalized baseline: it is the reference
score that more advanced models, such as collaborative filtering or
neural recommenders, should exceed.

Despite their simplicity, popularity-based methods are often strong
baselines in fashion recommendation systems.
