# EDA & Data Cleaning — Seoul Bike Sharing Demand

The purpose of this exploratory data analysis (EDA) is fourfold. First, the raw Seoul Bike Sharing Demand dataset is examined to provide an overview of the data structure, the distribution of the target variable, and the presence of missing values or outliers. Second, the dataset is preprocessed and cleaned using the `clean_seoul_bike_data` pipeline to ensure consistency and reproducibility. Third, key relationships between explanatory variables and bike rental demand are explored to inform feature selection and modelling choices. Finally, the cleaned dataset is saved in a reproducible parquet format under `data/processed/` for subsequent modelling and evaluation.

In [None]:
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import seaborn as sns

To ensure reproducibility and modularity, the project root directory is programmatically identified and the `src` directory is added to the Python path. This allows project-specific data loading, cleaning, and plotting functions to be imported directly within the notebook.

In [None]:
repo_root = Path.cwd().resolve()
if repo_root.name == "notebooks":
    repo_root = repo_root.parent
sys.path.insert(0, str(repo_root / "src"))

repo_root

## Data overview
The exploratory analysis begins by loading the raw dataset using a dedicated data-loading function. Encapsulating data access within a reusable function ensures that the same raw input is consistently used across notebooks and modelling scripts. An initial inspection of the first few rows is performed to verify variable names, data types, and overall structure.

In [None]:
from bike_demand.data.load_data import load_data

df_raw = load_data()
df_raw.head()

Next, the dataset size, column names, dtypes, and summary statistics are inspected.

In [None]:
# shape + columns
df_raw.shape, df_raw.columns.tolist()[:10]

In [None]:
# info / dtypes
df_raw.info()

In [None]:
# numeric describe
df_raw.describe().T

In [None]:
# categorical describe
df_raw.select_dtypes(include=["object", "category"]).describe().T

## Data cleaning plan 

Based on the inspection of the raw dataset, the following data cleaning and preprocessing steps are implemented to ensure consistency, reproducibility, and suitability for subsequent modelling:

- The dataset consists of 8,760 hourly observations (365 × 24) with no missing values, so no imputation or removal of missing values is required.
- The `Date` variable is stored as an object and is therefore parsed into a proper datetime format to enable reliable temporal feature extraction and indexing.
- Column names contain spaces and measurement units; these are standardised to improve code robustness and facilitate downstream processing within the modelling pipeline.
- Categorical variables (`Seasons`, `Holiday`, `Functioning Day`) are initially stored as strings and are converted to categorical types, ensuring appropriate handling during feature engineering and model estimation.
- Extreme weather values are retained at this stage, as they are plausible and informative for demand prediction.
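The steps listed above can be sketched on a tiny hand-made frame. This is only an illustration, not the actual `clean_seoul_bike_data` implementation: the function name `sketch_clean`, the regex-based renaming rule, and the day-first date format `%d/%m/%Y` are assumptions made for this example, and the real pipeline evidently applies its own name mapping (for instance it produces `dew_point_temp`).

```python
import re

import pandas as pd


def sketch_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the cleaning steps described in the plan."""
    out = df.copy()
    # Drop bracketed units, lowercase, and replace spaces with underscores.
    out.columns = [
        re.sub(r"\(.*?\)", "", col).strip().lower().replace(" ", "_")
        for col in out.columns
    ]
    # Parse the date column (day-first format is an assumption).
    out["date"] = pd.to_datetime(out["date"], format="%d/%m/%Y")
    # Convert string-valued columns to categorical dtype.
    for col in ("seasons", "holiday", "functioning_day"):
        out[col] = out[col].astype("category")
    return out


# Toy rows that mimic the raw schema; the values are illustrative only.
toy = pd.DataFrame(
    {
        "Date": ["01/12/2017", "02/12/2017"],
        "Rented Bike Count": [254, 204],
        "Temperature(°C)": [-5.2, -5.5],
        "Seasons": ["Winter", "Winter"],
        "Holiday": ["No Holiday", "No Holiday"],
        "Functioning Day": ["Yes", "Yes"],
    }
)
toy_clean = sketch_clean(toy)
print(toy_clean.dtypes)
```

Keeping each step as a pure transformation on a copy makes it straightforward to unit-test the cleaning logic independently of the notebook.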

In [None]:
from bike_demand.preprocessing import clean_seoul_bike_data

df_clean = clean_seoul_bike_data(df_raw)

print(df_clean.shape)
df_clean.info()
df_clean.head()

## Exploratory analysis on cleaned data

After standardising the dataset schema, we conduct exploratory analysis on the cleaned data to:
- understand the target distribution,
- identify the distribution of explanatory variables and analyse potential outliers,
- examine relationships between key features and rental demand,
- uncover temporal and seasonal patterns that may inform feature engineering and model choice,
- check the correlation between the explanatory variables and the target variable, and use a heatmap to check the correlations among the explanatory variables themselves.

### Target distribution

The distribution of the target variable, hourly bike rental demand, is examined using a histogram with a kernel density estimate and a complementary boxplot. This visualisation provides an overview of the central tendency, dispersion, skewness, and the presence of extreme values in the cleaned dataset.

In [None]:
# Target distribution
from bike_demand.plotting import plot_target_distribution

plot_target_distribution(df_clean, target_col="rented_bike_count")
plt.show()

The distribution of hourly bike rental demand is strongly right-skewed, with a large concentration of observations at relatively low demand levels and a long right tail corresponding to periods of exceptionally high usage. The boxplot further highlights substantial dispersion and the presence of extreme values, particularly during peak demand periods. 

Moreover, the pronounced skewness and heavy-tailed nature of the target distribution indicate that Gaussian error assumptions are unlikely to be appropriate. Instead, the distributional characteristics are consistent with a non-negative, right-skewed outcome featuring both frequent moderate demand and occasional extreme peaks. This motivates the use of flexible modelling approaches and alternative distributional assumptions, such as the Tweedie family, which can accommodate the observed dispersion and tail behaviour more effectively than standard linear models.
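For reference, the Tweedie family mentioned above is usually characterised by its power variance function. The symbols $\mu$, $\phi$, and $p$ are introduced here purely for illustration and do not come from the project's modelling code; which power parameter suits this dataset is left open.

```latex
\operatorname{Var}(Y) = \phi \, \mu^{p},
\qquad \mu = \mathbb{E}[Y], \quad \phi > 0,
\qquad
\begin{cases}
p = 0 & \text{Gaussian} \\
p = 1 & \text{Poisson} \\
1 < p < 2 & \text{compound Poisson--gamma (non-negative, point mass at zero)} \\
p = 2 & \text{gamma}
\end{cases}
```

Because the variance grows with the mean for $p > 0$, such a model allows greater dispersion during high-demand periods, which is the pattern suggested by the histogram and boxplot.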

### Explanatory Variable Distribution

In the following, I examine the marginal distributions of several explanatory variables to assess
skewness, zero inflation, outliers, and scale differences that may affect model
choice and feature engineering.


In [None]:
from bike_demand.plotting import plot_combined_explanatory_distributions

vars_to_plot = [
    "temperature",
    "humidity",
    "wind_speed",
    "visibility",
    "dew_point_temp",
    "solar_radiation",
]

plot_combined_explanatory_distributions(df_clean, vars_to_plot, bins=30, kde=True)
plt.show()

The continuous weather variables exhibit heterogeneous distributional shapes. Temperature and dew point temperature display relatively smooth, approximately unimodal distributions spanning a wide range of values, consistent with seasonal variation over the annual cycle. Humidity is moderately concentrated around mid-range values, with fewer observations at extreme low or high levels. Wind speed shows pronounced right skewness, with most observations concentrated at low values and a long right tail corresponding to occasional strong wind events. Visibility is heavily concentrated near its upper bound, indicating that clear-weather conditions dominate the sample, while reduced visibility occurs relatively infrequently. Solar radiation exhibits extreme right skewness and substantial mass near zero, reflecting the diurnal cycle and the absence of solar radiation during nighttime hours.

Overall, the observed distributions suggest that several weather variables are non-Gaussian and exhibit skewness or bounded behaviour. This may limit the suitability of linear modelling assumptions and motivates the use of flexible modelling approaches capable of capturing non-linear effects and heterogeneous dispersion.

In [None]:
vars_to_plot2 = [
    "rainfall",
    "snowfall",
]

plot_combined_explanatory_distributions(df_clean, vars_to_plot2, bins=30, kde=True)
plt.show()

As shown in the full distributions, precipitation is absent for most hours, while positive values occur relatively infrequently. This results in highly skewed unconditional distributions dominated by a large mass at zero, obscuring the behaviour of precipitation intensity when it does occur. I therefore calculate the proportion of observations with zero rainfall and zero snowfall.

In [None]:
df_clean["rain_binary"] = (df_clean["rainfall"] > 0).astype(int)
rain_occurrence_table = (
    df_clean["rain_binary"]
    .value_counts()
    .rename(index={0: "No Rainfall", 1: "Rainfall > 0"})
    .to_frame(name="Count")
)

rain_occurrence_table["Proportion"] = (
    rain_occurrence_table["Count"] / rain_occurrence_table["Count"].sum()
)

rain_occurrence_table

In [None]:
df_clean["snow_binary"] = (df_clean["snowfall"] > 0).astype(int)
snow_occurrence_table = (
    df_clean["snow_binary"]
    .value_counts()
    .rename(index={0: "No Snowfall", 1: "Snowfall > 0"})
    .to_frame(name="Count")
)

snow_occurrence_table["Proportion"] = (
    snow_occurrence_table["Count"] / snow_occurrence_table["Count"].sum()
)

snow_occurrence_table

Both rainfall and snowfall display severe zero inflation, with over 90% of observations equal to zero, as shown by the proportion tables above. To prevent the dominance of zero values from masking the behaviour of precipitation intensity, the analysis next examines the distribution of positive values, conditional on precipitation occurring, to better understand the informational content of these variables.

In [None]:
rain_pos = df_clean.loc[df_clean["rainfall"] > 0, "rainfall"]

plt.figure(figsize=(8, 5))
sns.histplot(rain_pos, bins=20, kde=True)
plt.title("Distribution of positive rainfall values")
plt.xlabel("Rainfall (mm)")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

In [None]:
snow_pos = df_clean.loc[df_clean["snowfall"] > 0, "snowfall"]

plt.figure(figsize=(8, 5))
sns.histplot(snow_pos, bins=20, kde=True)
plt.title("Distribution of positive snowfall values")
plt.xlabel("Snowfall (cm)")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

These distributional patterns suggest that rainfall and snowfall may affect demand through two distinct mechanisms: first, whether precipitation occurs at all, and second, the magnitude of precipitation when it does occur. However, the conditional distributions indicate that most non-zero precipitation values are relatively small, with only a limited number of extreme events. Including precipitation intensity directly as a continuous regressor may therefore contribute little additional explanatory power while exposing the models to numerical instability and undue influence from these rare extreme observations. To address this, binary indicator variables are constructed to capture the occurrence of rainfall and snowfall, allowing the models to focus on the presence of precipitation as the primary signal. The original continuous measures are retained only to represent intensity conditional on precipitation occurring, where appropriate.
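The occurrence-versus-intensity split described above can be sketched as follows. The helper name `split_occurrence_intensity`, the `_when_positive` column suffix, and the toy values are hypothetical; the notebook itself only creates the `rain_binary` and `snow_binary` indicators.

```python
import pandas as pd


def split_occurrence_intensity(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Add an occurrence flag and a positive-only intensity column for `col`."""
    out = df.copy()
    positive = out[col] > 0
    # 1 when any precipitation was recorded, 0 otherwise.
    out[f"{col}_binary"] = positive.astype(int)
    # Intensity is kept only where precipitation occurred; dry hours become NaN
    # so that summaries of this column are conditional on occurrence.
    out[f"{col}_when_positive"] = out[col].where(positive)
    return out


# Illustrative values only, not taken from the dataset.
toy = pd.DataFrame({"rainfall": [0.0, 0.0, 0.5, 0.0, 2.0]})
toy_split = split_occurrence_intensity(toy, "rainfall")

print(toy_split)
print("share of wet hours:", toy_split["rainfall_binary"].mean())
print("mean intensity when wet:", toy_split["rainfall_when_positive"].mean())
```

Storing the conditional intensity as `NaN` on dry hours keeps the two signals separate for exploration; a model that cannot handle missing values would need a different encoding.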

After examining the distributions of continuous weather variables, the analysis proceeds to assess calendar-related variables to ensure that the sample is evenly distributed across time.

In [None]:
from bike_demand.plotting import plot_categorical_frequency

plot_categorical_frequency(df_clean, col="hour", order=list(range(24)), title="Hour Frequency")
plt.show()

plot_categorical_frequency(
    df_clean, col="day_of_week", order=list(range(7)), title="Day of Week Frequency"
)
plt.show()

plot_categorical_frequency(df_clean, col="month", order=list(range(1, 13)), title="Month Frequency")
plt.show()

The frequency distributions of hour, day of week, and month are approximately uniform, indicating balanced temporal coverage across intraday, weekly, and monthly dimensions, with no systematic over- or under-representation of specific time periods in the sample. Thus, the sample is evenly distributed across time.
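A small sketch shows why the frequencies are only approximately uniform for a complete year of hourly records. The timestamps below are a hypothetical non-leap year built for illustration and are not the dataset's actual date range.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly index covering one non-leap year (365 days x 24 hours).
timestamps = pd.Timestamp("2021-01-01") + pd.to_timedelta(
    np.arange(365 * 24), unit="h"
)

# Number of hourly records per calendar category.
hour_counts = pd.Series(timestamps.hour).value_counts().sort_index()
dow_counts = pd.Series(timestamps.dayofweek).value_counts().sort_index()
month_counts = pd.Series(timestamps.month).value_counts().sort_index()

print(hour_counts.unique())                   # every hour occurs once per day
print(dow_counts.min(), dow_counts.max())     # 52 vs 53 occurrences of a weekday
print(month_counts.min(), month_counts.max()) # 28-day vs 31-day months
```

Each hour of the day appears exactly 365 times, whereas weekday and month counts differ slightly because a year is not a whole number of weeks and months have unequal lengths.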

In [None]:
plot_categorical_frequency(df_clean, col="seasons", title="Distribution of seasons", palette="Set2")

print(df_clean["seasons"].value_counts())
plt.show()

Seasonal frequencies are similarly balanced.

Next, the analysis checks the distributions of binary variables.

In [None]:
# 1. holiday
plot_categorical_frequency(df_clean, col="holiday", title="Holiday frequency", palette="Set2")

print(df_clean["holiday"].value_counts())
plt.show()

In [None]:
# 2. functioning_day
plot_categorical_frequency(
    df_clean, col="functioning_day", title="Functioning day frequency", palette="Set2"
)

print(df_clean["functioning_day"].value_counts())
plt.show()

The frequency distributions of the holiday and functioning_day indicators reveal class imbalance. Only a small fraction of observations correspond to public holidays or non-functioning days, while the vast majority of hours occur on non-holidays and days when the bike-sharing system is operational. This pattern reflects the underlying calendar structure rather than sampling bias.

### Preliminary Correlation Between Explanatory Variables and Demand
The analysis first examines the hourly pattern of the target variable and assesses whether this intraday structure is stable across the year or varies seasonally, thereby assessing the presence of seasonal effects.

In [None]:
# hourly pattern of target variable
import matplotlib.pyplot as plt

hourly = df_clean.groupby("hour", observed=True)["rented_bike_count"].mean()

plt.figure(figsize=(10, 4))
plt.plot(hourly.index, hourly.values)
plt.title("Average bike rentals by hour of day")
plt.xlabel("Hour of day")
plt.ylabel("Average rented bike count")
plt.xticks(range(0, 24, 2))
plt.show()

The average hourly rental profile exhibits a pronounced diurnal pattern. Demand is lowest during the early morning hours, increases sharply during the morning commuting period with a clear peak around 8 a.m., and declines slightly during mid-morning. Demand then rises steadily throughout the afternoon, reaching a second and higher peak during the evening commuting hours around 6 p.m., before gradually declining into the night.

This bimodal intraday structure is consistent with commuter-driven usage patterns and indicates strong temporal dependence at the hourly level. The sharp changes in demand across adjacent hours suggest that hour-of-day is a critical explanatory variable and that linear time trends are unlikely to adequately capture these dynamics.

In [None]:
from bike_demand.plotting import plot_hourly_trend_by_season

plot_hourly_trend_by_season(df_clean)
plt.show()

When disaggregated by season, the same bimodal intraday pattern is observed across all seasons, indicating a stable underlying temporal structure in bike usage. However, the level of demand varies by season. Summer consistently exhibits the highest rental volumes throughout the day, particularly during peak commuting hours, while winter demand remains markedly lower at all times. Spring and autumn display intermediate demand levels with similar intraday shapes.

The preservation of the hourly demand profile across seasons, combined with systematic seasonal scaling, suggests an interaction between hour-of-day and seasonal effects. This motivates the inclusion of both temporal and seasonal variables in the modelling framework, as well as the use of flexible models capable of capturing interaction effects and non-linear temporal patterns.
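One way to separate the shape of the hourly profile from its seasonal scale is to tabulate mean demand by hour and season and then normalise each season by its own average. The sketch below uses made-up numbers chosen so that winter is a scaled-down copy of summer; it does not reproduce `plot_hourly_trend_by_season`.

```python
import pandas as pd

# Made-up observations for two hours and two seasons (illustration only).
toy = pd.DataFrame(
    {
        "hour": [8, 18, 8, 18, 8, 18, 8, 18],
        "seasons": ["Summer", "Summer", "Winter", "Winter"] * 2,
        "rented_bike_count": [1000, 2000, 200, 400, 1200, 2200, 240, 440],
    }
)

# Mean demand for every hour/season combination.
profile = toy.pivot_table(
    index="hour", columns="seasons", values="rented_bike_count", aggfunc="mean"
)

# Divide each season's profile by its own mean to compare shapes only.
shape = profile / profile.mean(axis=0)

print(profile)
print(shape)
```

When the normalised columns coincide, as they do in this toy example, the seasons differ only by a multiplicative factor. That appears as an hour-by-season interaction on the raw count scale but as additive effects on a log scale.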

After analysing the temporal effects of the target variable, the analysis proceeds to explore the correlations between explanatory variables and the target. As an initial step, Pearson correlation coefficients are computed between each numerical explanatory variable and hourly bike rental demand.

In [None]:
# Correlation
num_cols = [
    "temperature",
    "humidity",
    "wind_speed",
    "visibility",
    "dew_point_temp",
    "solar_radiation",
    "rainfall",
    "snowfall",
    "month",
    "day_of_week",
    "hour",
]

corr = (
    df_clean[num_cols + ["rented_bike_count"]]
    .corr(numeric_only=True)["rented_bike_count"]
    .sort_values(ascending=False)
)
corr

Having established that the continuous explanatory variables exhibit differing degrees of correlation with bike rental demand, the analysis proceeds to a visual inspection of their relationships with the target variable. Scatter plots of hourly rental counts against each continuous feature are constructed to assess the functional form of these associations, identify potential non-linear patterns, and detect heteroskedasticity or threshold effects that may not be captured by simple correlation measures.
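A minimal example illustrates why a correlation coefficient alone can miss structure that a scatter plot reveals. The arrays are synthetic and unrelated to the bike data: a perfectly deterministic but symmetric relationship yields a Pearson coefficient of zero.

```python
import numpy as np

# Synthetic inputs: x is symmetric around zero.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

y_linear = 2.0 * x + 1.0  # exact linear dependence on x
y_curved = x**2           # exact but non-linear dependence on x

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"Pearson r, linear relationship:    {r_linear:.3f}")
print(f"Pearson r, quadratic relationship: {r_curved:.3f}")
```

Both `y_linear` and `y_curved` are fully determined by `x`, yet only the linear case is picked up by the Pearson coefficient.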

In [None]:
from bike_demand.plotting import plot_target_vs_continuous

continuous_feature = [
    "temperature",
    "humidity",
    "wind_speed",
    "visibility",
    "dew_point_temp",
    "solar_radiation",
    "rainfall",
    "snowfall",
]


plot_target_vs_continuous(df_clean, continuous_feature);

The scatter plots reveal heterogeneous and largely non-linear relationships between continuous weather variables and hourly bike rental demand. Temperature and dew point temperature exhibit a clear positive association with demand, particularly at moderate to warm levels, although dispersion increases substantially at higher values. Humidity and wind speed show weaker and more diffuse relationships with demand. While high humidity and strong winds are generally associated with lower rental volumes, the relationship is highly dispersed and does not follow a simple linear pattern. Visibility displays a weak positive association with rentals, with higher demand occurring predominantly under clearer conditions; however, substantial variation remains across the full range of visibility values. Solar radiation exhibits a non-linear relationship, reflecting its strong interaction with time of day: demand is low when solar radiation is near zero (nighttime) and increases during daylight hours, but the marginal effect weakens at higher radiation levels.

Rainfall and snowfall are characterised by pronounced zero inflation and sharp discontinuities. Demand drops markedly when precipitation occurs, even at low intensities, while further increases in precipitation intensity have comparatively limited additional effect. This pattern supports modelling precipitation primarily through occurrence indicators rather than relying solely on continuous intensity measures.

Following the analysis of continuous explanatory variables, the analysis turns to temporal variables that enter the dataset in categorical form, including month, hour, season, and day of week. These variables capture systematic temporal structures in bike rental demand, such as diurnal commuting cycles and seasonal usage patterns, which are not adequately summarised by simple correlation measures. Examining these temporal categories provides insight into how demand varies across different time segments and whether such variation is stable and economically meaningful.

In [None]:
from bike_demand.plotting import plot_target_vs_categorical_mean

temporal_categorical_features = ["month", "hour", "seasons", "day_of_week"]

plot_target_vs_categorical_mean(df_clean, temporal_categorical_features, ncols=2);

To assess the relationship between temporal categorical variables and bike rental demand, the average hourly rental count is computed for each category of month, hour, season, and day of week. Plotting category-specific means allows differences in average demand across time segments to be visualised directly, offering an intuitive assessment of the magnitude and structure of temporal effects. This approach is well suited to categorical variables and complements the interpretation of the correlation coefficients.

In [None]:
categorical_indicators = ["functioning_day"]

plot_target_vs_categorical_mean(df_clean, categorical_indicators, ncols=2);

In [None]:
categorical_indicators = ["holiday"]

plot_target_vs_categorical_mean(df_clean, categorical_indicators, ncols=2);

To further assess relationships among explanatory variables and identify potential multicollinearity, a correlation heatmap of numerical features is constructed using Pearson correlation coefficients.

In [None]:
from bike_demand.plotting import plot_correlation_heatmap

# Exclude target and obvious identifiers
corr = plot_correlation_heatmap(
    df_clean,
    cmap="coolwarm",
    exclude_cols=["rented_bike_count", "date"],
)

The heatmap reveals several strong correlations among weather-related variables, most notably between temperature and dew point temperature, indicating substantial redundancy in their information content. In contrast, most temporal variables, such as hour and day of week, exhibit relatively weak correlations with weather features, suggesting that they capture distinct dimensions of demand variation. Binary precipitation indicators are, as expected, strongly correlated with their corresponding continuous measures, reflecting the construction of these variables. Overall, the heatmap highlights the presence of correlated feature groups while confirming that many explanatory variables provide complementary information for modelling.
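To complement the visual reading of the heatmap, strongly correlated feature pairs can also be listed programmatically. The helper `high_corr_pairs`, the 0.8 threshold, and the toy columns `a`, `b`, `c` are assumptions made for this sketch; `plot_correlation_heatmap` may expose its results differently.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so each pair is reported once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)


# Toy frame: `b` is nearly proportional to `a`, while `c` is only weakly related.
toy = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 11],
        "c": [5, 1, 4, 2, 3],
    }
)
result = high_corr_pairs(toy, threshold=0.8)
print(result)
```

Applied to the cleaned data, a listing of this kind would make pairs such as temperature and dew point temperature explicit, which can help when deciding whether to drop or combine redundant features.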

In [None]:
from bike_demand.preprocessing import save_processed_data

save_processed_data(df_clean)