# Exploratory Data Analysis

This notebook presents an exploratory analysis of the dataset for the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/overview) challenge, which consists of recommending products to H&M customers based on their previous purchases.


First, we load the datasets and print summary statistics and a few sample rows to get a feel for the data.


In [3]:
from pyspark.sql import SparkSession
# Note: these Spark column functions shadow Python's built-in min/max from here on
from pyspark.sql.functions import min, max
from pyspark.sql.functions import unix_timestamp, lit

# Cap each input partition at about 5 MB when reading files
sc = (
    SparkSession.builder
    .appName("h&m_recommendations")
    .config("spark.sql.files.maxPartitionBytes", 5000000)
    .getOrCreate()
)

# Load transactions
transactions = sc.read.option("header",True) \
              .csv("./inputs/transactions_train.csv")
min_date, max_date = transactions.select(min("t_dat"), max("t_dat")).first()
print('----------TRANSACTIONS------------')
transactions.printSchema()
print(f'Number of rows: {transactions.count()}')
print(f'Min Date: {min_date}')
print(f'Max Date: {max_date}')
print(f'Number of unique customers: {transactions.select("customer_id").distinct().count()}')
print(f'Number of unique articles: {transactions.select("article_id").distinct().count()}')
transactions.show(5)


# Load articles
print('\n-------------ARTICLES-------------')
articles = sc.read.option("header",True) \
              .csv("./inputs/articles.csv")
articles.printSchema()
print(f'Number of rows: {articles.count()}')
articles.show(5)

# Load customers
print('\n-------------CUSTOMERS-------------')
customers = sc.read.option("header",True) \
              .csv("./inputs/customers.csv")
customers.printSchema()
print(f'Number of rows: {customers.count()}')
customers.show(5)


----------TRANSACTIONS------------
root
 |-- t_dat: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- article_id: string (nullable = true)
 |-- price: string (nullable = true)
 |-- sales_channel_id: string (nullable = true)

Number of rows: 31788324
Min Date: 2018-09-20
Max Date: 2020-09-22
Number of unique customers: 1362281
Number of unique articles: 104547
+----------+--------------------+----------+--------------------+----------------+
|     t_dat|         customer_id|article_id|               price|sales_channel_id|
+----------+--------------------+----------+--------------------+----------------+
|2018-09-20|000058a12d5b43e67...|0663713001|0.050830508474576264|               2|
|2018-09-20|000058a12d5b43e67...|0541518023| 0.03049152542372881|               2|
|2018-09-20|00007d2de826758b6...|0505221004| 0.01523728813559322|               2|
|2018-09-20|00007d2de826758b6...|0685687003|0.016932203389830508|               2|
|2018-09-20|00007d2de826758b6...|0
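
Note that every column in the schema above is read as a string, because the CSV is loaded without a schema. The minimum and maximum of `t_dat` are still correct, since ISO-formatted dates sort the same way as strings and as dates, but the same does not hold for numeric values in general. The following is a minimal plain-Python sketch of that difference, not Spark code; the sample rows and the `parse_row` helper are hypothetical and only for illustration.

```python
from datetime import date

# Hypothetical sample rows with string-typed fields, as in the schema above.
# The values are illustrative and are not real rows from the dataset.
rows = [
    {"t_dat": "2019-01-05", "price": "0.0152", "sales_channel_id": "1"},
    {"t_dat": "2018-09-20", "price": "0.0508", "sales_channel_id": "2"},
    {"t_dat": "2020-09-22", "price": "0.0305", "sales_channel_id": "2"},
]


def parse_row(row):
    """Convert the string fields of one row into typed Python values."""
    return {
        "t_dat": date.fromisoformat(row["t_dat"]),
        "price": float(row["price"]),
        "sales_channel_id": int(row["sales_channel_id"]),
    }


typed_rows = [parse_row(r) for r in rows]

# ISO dates: sorting the raw strings and sorting the parsed dates agree.
earliest_as_string = sorted(r["t_dat"] for r in rows)[0]
earliest_as_date = sorted(r["t_dat"] for r in typed_rows)[0]
print(earliest_as_string, earliest_as_date.isoformat())

# Numbers: string order and numeric order can disagree ("10" sorts before "9").
numbers_as_strings = sorted(["9", "10"])
numbers_as_ints = sorted([9, 10])
print(numbers_as_strings, numbers_as_ints)
```

In Spark, one option would be to cast the columns after loading (for example `col("price").cast("double")`) or to enable the CSV reader's `inferSchema` option. Inferring types would likely turn `article_id` into an integer and drop its leading zero, so keeping that column as a string is probably the safer choice.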

Next, we want to find the most active customers and the most popular articles in the dataset, measured by their number of transaction rows.
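
To make the aggregation concrete, here is a minimal plain-Python sketch of what a group-by, count, and descending sort computes. It uses a tiny hypothetical sample (the customer IDs and the pairings are made up), and `collections.Counter` stands in for the Spark aggregation used in the notebook.

```python
from collections import Counter

# Hypothetical (customer_id, article_id) pairs, one per transaction row.
transactions_sample = [
    ("cust_a", "0706016001"),
    ("cust_a", "0706016002"),
    ("cust_b", "0706016001"),
    ("cust_c", "0706016001"),
    ("cust_a", "0372860001"),
    ("cust_b", "0706016002"),
]

# Count rows per customer and per article (the "groupBy + count" step).
customer_counts = Counter(customer for customer, _ in transactions_sample)
article_counts = Counter(article for _, article in transactions_sample)

# Keep the two largest counts of each (the "orderBy descending + show" step).
top_customers = customer_counts.most_common(2)
top_articles = article_counts.most_common(2)
print(top_customers)
print(top_articles)
```

Each count is a number of transaction rows, so a customer who buys the same article several times contributes one row per purchase.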

In [4]:
customer_count = transactions.groupBy(['customer_id']).count().orderBy('count', ascending=False)
customer_count.show()

article_count = transactions.groupBy(['article_id']).count().orderBy('count', ascending=False)
article_count.show()

+--------------------+-----+
|         customer_id|count|
+--------------------+-----+
|be1981ab818cf4ef6...| 1895|
|b4db5e5259234574e...| 1441|
|49beaacac0c7801c2...| 1364|
|a65f77281a528bf5c...| 1361|
|cd04ec2726dd58a8c...| 1237|
|55d15396193dfd458...| 1208|
|c140410d72a41ee5e...| 1170|
|8df45859ccd71ef1e...| 1169|
|03d0011487606c37c...| 1157|
|6cc121e5cc202d2bf...| 1143|
|e34f8aa5e7c8c2585...| 1117|
|3493c55a7fe252c84...| 1115|
|0bf4c6fd4e9d33f9b...| 1099|
|e6498c7514c61d3c2...| 1068|
|d80ed4ababfa96812...| 1066|
|1320d4b3dd6481cde...| 1059|
|a76cf5ea515d09f22...| 1038|
|e238725cbff3774b7...| 1022|
|689f4eda82fdf3d9b...| 1022|
|e97c3a6c680cd3569...| 1009|
+--------------------+-----+
only showing top 20 rows

+----------+-----+
|article_id|count|
+----------+-----+
|0706016001|50287|
|0706016002|35043|
|0372860001|31718|
|0610776002|30199|
|0759871002|26329|
|0464297007|25025|
|0372860002|24458|
|0610776001|22451|
|0399223001|22236|
|0706016003|21241|
|0720125001|21063|
|0156231001|