# Exploratory Data Analysis

This notebook presents an exploratory analysis of the dataset for the [H&M Personalized Fashion Recommendations](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/overview) challenge, which consists of recommending products to H&M customers based on their previous purchases.


The first step is to load the datasets, print some summary statistics, and show a few sample rows to get a feel for the data.
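
One detail to keep in mind when reading the summary: the CSV files are read without a schema, so Spark treats every column, including `t_dat`, as a string. Taking the minimum and maximum of `t_dat` is therefore a string comparison. This still gives the correct date range, because zero-padded ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in chronological order. The following is a minimal pure-Python sketch of that property; the sample values are made up for illustration and are not taken from the dataset.

```python
from datetime import date

# Hypothetical sample of t_dat values, kept as strings as they appear in the CSV
t_dat_strings = ["2019-03-01", "2018-09-20", "2020-09-22", "2018-12-31"]

# String comparison: what min/max do on an untyped (string) column
min_str, max_str = min(t_dat_strings), max(t_dat_strings)

# Comparison after parsing the strings into real dates
parsed = [date.fromisoformat(s) for s in t_dat_strings]
min_date, max_date = min(parsed), max(parsed)

# Both orderings agree for zero-padded ISO 8601 dates
assert min_str == min_date.isoformat()
assert max_str == max_date.isoformat()
print(min_str, max_str)
```

The same shortcut would not hold for other layouts such as `DD/MM/YYYY`, where the column would need to be cast to a date type before aggregating.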


In [10]:
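from pyspark.sql import SparkSession
# Note: these two imports shadow Python's built-in min and max in this notebook
from pyspark.sql.functions import min, max
from pyspark.sql.functions import unix_timestamp, lit

# Build (or reuse) the Spark session. getOrCreate() already returns a SparkSession,
# so it does not need to be wrapped in SparkSession(...) again.
spark = SparkSession.builder.appName("h&m_recommendations") \
              .config("spark.sql.files.maxPartitionBytes", 5000000) \
              .getOrCreate()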

# Load transactions
transactions = spark.read.option("header",True) \
              .csv("./inputs/transactions_train.csv")
min_date, max_date = transactions.select(min("t_dat"), max("t_dat")).first()
print('----------TRANSACTIONS------------')
transactions.printSchema()
print(f'Number of rows: {transactions.count()}')
print(f'Min Date: {min_date}')
print(f'Max Date: {max_date}')
print(f'Number of unique customers: {transactions.select("customer_id").distinct().count()}')
print(f'Number of unique articles: {transactions.select("article_id").distinct().count()}')
transactions.show(5)


# Load articles
print('\n-------------ARTICLES-------------')
articles = spark.read.option("header",True) \
              .csv("./inputs/articles.csv")
articles.printSchema()
print(f'Number of rows: {articles.count()}')
articles.show(5)

# Load customers
print('\n-------------CUSTOMERS-------------')
customers = spark.read.option("header",True) \
              .csv("./inputs/customers.csv")
customers.printSchema()
print(f'Number of rows: {customers.count()}')
customers.show(5)


----------TRANSACTIONS------------
root
 |-- t_dat: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- article_id: string (nullable = true)
 |-- price: string (nullable = true)
 |-- sales_channel_id: string (nullable = true)

Number of rows: 31788324
Min Date: 2018-09-20
Max Date: 2020-09-22
Number of unique customers: 1362281
Number of unique articles: 104547
+----------+--------------------+----------+--------------------+----------------+
|     t_dat|         customer_id|article_id|               price|sales_channel_id|
+----------+--------------------+----------+--------------------+----------------+
|2018-09-20|000058a12d5b43e67...|0663713001|0.050830508474576264|               2|
|2018-09-20|000058a12d5b43e67...|0541518023| 0.03049152542372881|               2|
|2018-09-20|00007d2de826758b6...|0505221004| 0.01523728813559322|               2|
|2018-09-20|00007d2de826758b6...|0685687003|0.016932203389830508|               2|
|2018-09-20|00007d2de826758b6...|0