# RecSys Challenge - EDA and Data Cleaning

*Alvin Karanja*

In this notebook, I will perform an exploratory data analysis (EDA) and any relevant data cleaning on the provided dataset. The goal is to understand the data, identify any issues, and prepare it for further analysis and modeling.

To reproduce the results in this notebook, please ensure that the data files `yoochoose-buys.dat` and `yoochoose-clicks.dat` are located in a folder named `data` in the same directory as this notebook.

## 1. Preprocessing the Data

The raw data is stored in generic `.dat` files, which can be confusing to work with. The files are in fact comma-delimited, so we can read them as CSV files with pandas. We assign column names in accordance with the dataset documentation and set each column to an appropriate data type.
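As a minimal sketch of this step, the snippet below reads two illustrative records from an in-memory buffer rather than from the real file. It assumes the raw files contain no header row and that the buys columns appear in the order `session_id`, `time`, `item_id`, `price`, `quantity`; the sample values are for illustration only.

```python
import io
import pandas as pd

# Two illustrative records in the assumed layout of the buys file:
# session_id, time, item_id, price, quantity (no header row).
raw = io.StringIO(
    "420374,2014-04-06T18:44:58.314Z,214537888,12462,1\n"
    "420374,2014-04-06T18:44:58.325Z,214537850,10471,1\n"
)

sample = pd.read_csv(
    raw,
    header=None,  # treat the first line as data, not as column names
    names=["session_id", "time", "item_id", "price", "quantity"],
    dtype={
        "session_id": "str",
        "item_id": "str",
        "price": "float64",
        "quantity": "int64",
    },
)

print(sample.shape)
print(sample.dtypes)
```

Passing `names` and `dtype` directly to `read_csv` is one option; assigning `.columns` and calling `.astype(...)` afterwards, as done in the cell below, gives the same column layout.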

Note that when parsing the time column, we keep the timestamps timezone-aware by converting them to UTC. This is important for any time series analysis we may perform later.
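The following sketch illustrates timezone-aware parsing on a couple of made-up timestamps. It assumes the raw values are ISO 8601 strings with a trailing `Z`, and it uses `pd.to_datetime(..., utc=True)` as an alternative to the `astype` call used in the cell below.

```python
import pandas as pd

# Illustrative timestamps in ISO 8601 form with a trailing "Z" (UTC).
raw_times = pd.Series(["2014-04-07T10:51:09.277Z", "2014-04-07T10:54:09.868Z"])

# utc=True returns timezone-aware values normalised to UTC.
times = pd.to_datetime(raw_times, utc=True)

print(times.dt.tz)
print(times.iloc[1] - times.iloc[0])
```

Because both values carry the same timezone, differences between them are unambiguous, which is what later time-based analysis relies on.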

The results are saved as new `.csv` files, which will be used for further analysis.

In [None]:
import os
import pandas as pd

# Source (.dat) and destination (.csv) file paths
destination_buys = 'data/yoochoose-buys.csv'
src_buys = 'data/yoochoose-buys.dat'

destination_clicks = 'data/yoochoose-clicks.csv'
src_clicks = 'data/yoochoose-clicks.dat'

# Check if data already exists
if not os.path.exists(destination_buys) or not os.path.exists(destination_clicks):

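    # The raw files have no header row: header=None keeps the first record
    # as data instead of consuming it as column names.
    buys_data_raw = pd.read_csv(src_buys, header=None)
    clicks_data_raw = pd.read_csv(src_clicks, header=None)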

    # Assigning column names
    buys_data_raw.columns = ['session_id', 'time', 'item_id', 'price', 'quantity']
    clicks_data_raw.columns = ['session_id', 'time', 'item_id', 'category']

    # Setting appropriate data types
    buys_data_raw = buys_data_raw.astype({
        'session_id': 'str',
        'time': 'datetime64[ns, UTC]',
        'item_id': 'str',
        'price': 'float64',
        'quantity': 'int64'
    })

    clicks_data_raw = clicks_data_raw.astype({
        'session_id': 'str',
        'time': 'datetime64[ns, UTC]',
        'item_id': 'str',
        'category': 'str'
    })

    # Saving the processed data to CSV files
    buys_data_raw.to_csv(destination_buys, index=False)
    clicks_data_raw.to_csv(destination_clicks, index=False)

    # Clear the raw data variables 
    # to free up memory
    del buys_data_raw
    del clicks_data_raw

Next we proceed to check the data for missing values.

In [6]:
buys_df = pd.read_csv(
    destination_buys,
    dtype={
        'session_id': 'str',
        'item_id': 'str',
        'price': 'float64',
        'quantity': 'int64'
    },
    parse_dates=['time'],
)
print(buys_df.shape)
buys_df.isnull().sum()

(1150752, 5)


session_id    0
time          0
item_id       0
price         0
quantity      0
dtype: int64

In [7]:
clicks_df = pd.read_csv(
    destination_clicks,
    dtype={
        'session_id': 'str',
        'item_id': 'str',
        'category': 'str'
    },
    parse_dates=['time'],
)
print(clicks_df.shape)
clicks_df.isnull().sum()

(33003943, 4)


session_id    0
time          0
item_id       0
category      0
dtype: int64

We observe that there are no missing values in the dataset and all columns are populated as expected. 

Next we inspect the date ranges for the data to understand the time period covered.

In [5]:
# Date ranges covered by the buys and clicks data
buys_date_range = buys_df['time'].min(), buys_df['time'].max()
clicks_date_range = clicks_df['time'].min(), clicks_df['time'].max()

print(f"Clicks data date range: {clicks_date_range}")
print(f"Buys data date range: {buys_date_range}")

Clicks data date range: ('2014-04-01 03:00:00.124000+00:00', '2014-09-30 02:59:59.430000+00:00')
Buys data date range: ('2014-04-01 03:05:31.743000+00:00', '2014-09-30 02:35:12.859000+00:00')


We note that the two datasets cover a consistent time period, from 2014-04-01 to 2014-09-30 (UTC), and that the purchase timestamps fall entirely within the range of the click timestamps.
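The consistency check can also be expressed in code rather than read off the printed ranges. The sketch below uses small toy frames as stand-ins for `clicks_df` and `buys_df`; the helper `date_range` is a hypothetical name introduced here for illustration.

```python
import pandas as pd


def date_range(df, column="time"):
    """Return the (earliest, latest) timestamp in the given column."""
    return df[column].min(), df[column].max()


# Toy stand-ins for clicks_df and buys_df.
clicks = pd.DataFrame({"time": pd.to_datetime(
    ["2014-04-01 03:00:00", "2014-06-15 12:00:00", "2014-09-30 02:59:59"],
    utc=True,
)})
buys = pd.DataFrame({"time": pd.to_datetime(
    ["2014-04-01 03:05:31", "2014-09-30 02:35:12"],
    utc=True,
)})

clicks_start, clicks_end = date_range(clicks)
buys_start, buys_end = date_range(buys)

# Every buy should fall inside the window covered by the clicks.
buys_within_clicks = clicks_start <= buys_start and buys_end <= clicks_end
print(buys_within_clicks)
```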

## 2. Data Cleaning

The data appears to be clean: there are no missing values, and no obvious outliers were observed. This lets us focus on implementing the prediction algorithm rather than on data quality issues.

### 2.1 Duplicate Data Observation

In the data, we observe some instances where a user appears to purchase the same item multiple times within a very short time frame (less than 1 minute). There are several possible explanations for this:

1. The user may be purchasing the same item multiple times within a single checkout session.
2. System-level logging artifacts, where the same item is logged multiple times due to retries or errors.

![Duplicate Entries](assets/duplicate_data.png)
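One way to surface such repeated purchases is sketched below on a toy frame. It is not necessarily how the entries in the screenshot were found: the grouping by `session_id` and `item_id`, the one-minute threshold taken from the description above, and the column name `is_rapid_repeat` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for buys_df with one rapid repeat purchase in session "1".
buys = pd.DataFrame({
    "session_id": ["1", "1", "1", "2", "2"],
    "time": pd.to_datetime(
        [
            "2014-04-07 10:00:00",
            "2014-04-07 10:00:20",  # same item, 20 seconds later
            "2014-04-07 10:30:00",  # same item, but 30 minutes later
            "2014-04-07 11:00:00",
            "2014-04-07 11:00:05",  # different item, so not a repeat
        ],
        utc=True,
    ),
    "item_id": ["A", "A", "A", "B", "C"],
})

# Order purchases so that repeats of the same item in a session are adjacent.
buys = buys.sort_values(["session_id", "item_id", "time"])

# Time since the previous purchase of the same item in the same session
# (NaT for the first purchase in each group).
gap = buys.groupby(["session_id", "item_id"])["time"].diff()

# Flag purchases that follow the previous one by less than a minute.
buys["is_rapid_repeat"] = gap < pd.Timedelta(minutes=1)

print(buys[buys["is_rapid_repeat"]])
```

Flagging the rows instead of dropping them keeps the data intact, so the repeats remain available to any downstream model.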

Research into the dataset suggested that the first case is the more likely one. In addition, the dataset is intended for developing a suitable prediction algorithm rather than for tasks such as data cleaning or anomaly detection. We therefore do not remove these duplicate entries, as they may be relevant for the prediction algorithm.

This decision follows Section 2 of the RecSys Challenge 2015 publication (Ben-Shimon et al., 2015), which makes no specific mention of any data-cleaning step required for the dataset and appears to imply that the dataset is ready for use in prediction tasks.

## *References*

- Ben-Shimon, D., Tsikinovsky, A., Friedmann, M., Shapira, B., Rokach, L. and Hoerle, J. (2015). RecSys Challenge 2015 and the YOOCHOOSE Dataset. https://doi.org/10.1145/2792838.2798723