# 1. Data Preparation
This notebook handles data cleaning and feature engineering for the city bike data. It assumes that the city bike ride, station location, and weather data have already been downloaded by separate scripts and saved to `../data/raw/` (relative to this notebook).

In [1]:
import pandas as pd
from pathlib import Path
from citybike.data_cleaning import merge_station_info, handle_wind_speed_gaps

# Define data directories
RAW_DIR = Path("../data/raw")
CLEAN_DIR = Path("../data/clean")
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

pd.set_option('display.float_format', '{:.2f}'.format)

### Load the data

In [2]:
dtypes = {'departure_id': str, 'departure_name': str, 
            'return_id': str, 'return_name': str}
bike_df = pd.read_csv(RAW_DIR / 'bike_rides.csv', dtype=dtypes, parse_dates=['departure', 'return'])
bike_df.head()

,departure,return,departure_id,departure_name,return_id,return_name,distance,duration
0,2020-03-23 06:09:44,2020-03-23 06:16:26,86,Kuusitie,111,Esterinportti,1747.0,401.0
1,2020-03-23 06:11:58,2020-03-23 06:26:31,26,Kamppi (M),10,Kasarmitori,1447.0,869.0
2,2020-03-23 06:16:29,2020-03-23 06:24:23,268,Porolahden koulu,254,Agnetankuja,1772.0,469.0
3,2020-03-23 06:33:53,2020-03-23 07:14:03,751,Vallipolku,106,Korppaanmäentie,7456.0,2406.0
4,2020-03-23 06:36:09,2020-03-23 07:04:10,62,Välimerenkatu,121,Vilhonvuorenkatu,7120.0,1679.0


In [3]:
station_df = pd.read_csv(RAW_DIR / 'stations.csv')
station_df = station_df.set_index('id')
# Add leading zeros to IDs
station_df.index = station_df.index.fillna(-1).astype(int).astype(str).str.zfill(3)  
station_df.head()

id,name,lat,lon,capacity,source
150,Töölönlahden puisto,60.17,24.94,24.0,HSL
161,Eteläesplanadi,60.17,24.95,34.0,HSL
162,Leppäsuonaukio,60.17,24.93,28.0,HSL
163,Lehtisaarentie,60.18,24.85,12.0,HSL
118,Fleminginkatu,60.19,24.95,22.0,HSL


In [4]:
weather_df = pd.read_csv(RAW_DIR / 'weather.csv', index_col='time')
weather_df.head()

time,temperature,wind_speed,precipitation
2020-04-01 00:00:00,2.5,6.4,0.0
2020-04-01 01:00:00,3.1,4.4,0.0
2020-04-01 02:00:00,3.3,4.1,0.0
2020-04-01 03:00:00,3.1,3.7,0.0
2020-04-01 04:00:00,3.0,4.4,0.0


### Data Cleaning

#### Missing weather values

In [5]:
missing_counts = weather_df.isnull().sum()
missing_percent = 100 * missing_counts / len(weather_df)
missing_summary = pd.DataFrame({'count': missing_counts, 'percent': round(missing_percent, 3)})
missing_summary

,count,percent
temperature,24,0.09
wind_speed,154,0.6
precipitation,49,0.19


Imputing even a single missing precipitation value could introduce a false rainfall event. Missing values are therefore replaced with the sentinel -1, and a flag column marks the affected rows.

In [6]:
weather_df['precip_missing'] = weather_df['precipitation'].isna().astype(int)

# Replace all NaNs with -1 to preserve missing information
weather_df['precipitation'] = weather_df['precipitation'].fillna(-1)
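
A minimal illustration of why the flag column matters, using made-up values rather than the real weather data: once missing values are encoded as -1, any aggregate has to mask them out first, otherwise the sentinel biases sums and means. The frame `toy` and the variable names are assumptions for this sketch only.

```python
import numpy as np
import pandas as pd

# Toy hourly precipitation in mm; the values are illustrative only
toy = pd.DataFrame({"precipitation": [0.0, 1.2, np.nan, 0.4, np.nan]})

# Flag first, then fill, so the flag still sees the NaNs
toy["precip_missing"] = toy["precipitation"].isna().astype(int)
toy["precipitation"] = toy["precipitation"].fillna(-1)

# Naive aggregate includes the two -1 sentinels
naive_total = toy["precipitation"].sum()

# Masked aggregate uses observed hours only
observed = toy.loc[toy["precip_missing"] == 0, "precipitation"]
masked_total = observed.sum()

print(f"naive total: {naive_total:.1f} mm, masked total: {masked_total:.1f} mm")
```

The same masking applies to any later feature built from `precipitation`, for example daily rainfall totals.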

Short gaps in temperature and wind speed can be interpolated or forward-filled, but first we investigate the lengths of gaps to decide on an appropriate strategy.

In [7]:
gap_limit = 6
for col in ['temperature', 'wind_speed']:
    # Length of each run of consecutive NaNs: group the NaNs by the running count of observed values
    is_nan = weather_df[col].isna()
    consecutive_nans = is_nan.astype(int).groupby((~is_nan).cumsum()).sum().tolist()
    gaps = [gap for gap in consecutive_nans if gap > gap_limit]
    print(f'{col}: {len(gaps)} gaps > {gap_limit} hours, lengths = {gaps}')


temperature: 0 gaps > 6 hours, lengths = []
wind_speed: 7 gaps > 6 hours, lengths = [12, 8, 9, 8, 8, 12, 18]
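
The gap lengths above come from a `groupby` on a cumulative count of observed values. The small sketch below, on a made-up series, shows how that grouping works: the counter only increases at observed values, so every NaN in a run shares the group id of the observation just before it.

```python
import numpy as np
import pandas as pd

# Toy series with NaN runs of length 2, 1 and 3
s = pd.Series([1.0, np.nan, np.nan, 2.0, 3.0, np.nan, 4.0, np.nan, np.nan, np.nan])

is_nan = s.isna()

# Running count of observed values: constant across each NaN run
group_id = s.notna().cumsum()

# Number of NaNs per group; groups without NaNs sum to 0
run_lengths = is_nan.astype(int).groupby(group_id).sum()

# Keep only the actual gaps
gap_lengths = run_lengths[run_lengths > 0].tolist()
print(gap_lengths)
```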


Temperature changes slowly and no temperature gap exceeds 6 hours, so all gaps can be forward-filled without introducing unrealistic trends.

In [8]:
weather_df['temperature'] = weather_df['temperature'].ffill()

Wind speed is more variable, so gaps are handled according to their length:
- Short gaps (<6 hours) are interpolated linearly to preserve natural fluctuations.
- Longer gaps are replaced with -1 and flagged to indicate missing data.

In [9]:
weather_df = handle_wind_speed_gaps(weather_df)
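
The implementation of `handle_wind_speed_gaps` lives in `citybike.data_cleaning` and is not shown in this notebook. As a rough sketch of the strategy described above, the hypothetical function below interpolates short gaps and replaces long gaps with -1 plus a flag. The function name, the `max_gap` parameter, and the `wind_speed_missing` column name are assumptions for illustration and may differ from the real helper.

```python
import numpy as np
import pandas as pd

def fill_wind_gaps_sketch(df, col="wind_speed", max_gap=6):
    """Hypothetical stand-in, not the citybike implementation."""
    out = df.copy()
    is_nan = out[col].isna()

    # Label each run of consecutive equal values in the NaN mask
    run_id = is_nan.ne(is_nan.shift(fill_value=False)).cumsum()
    # Length of the NaN run each row belongs to (0 for observed rows)
    run_len = is_nan.astype(int).groupby(run_id).transform("sum")
    long_gap = is_nan & (run_len > max_gap)

    # Interpolate interior gaps linearly, then overwrite the long ones.
    # Leading or trailing NaNs are left untouched by limit_area="inside".
    out[col] = out[col].interpolate(method="linear", limit_area="inside")
    out[f"{col}_missing"] = long_gap.astype(int)
    out.loc[long_gap, col] = -1
    return out

# Toy data: one 1-hour gap and one 7-hour gap
demo = pd.DataFrame({"wind_speed": [1.0, np.nan, 3.0] + [np.nan] * 7 + [5.0]})
result = fill_wind_gaps_sketch(demo, max_gap=6)
print(result)
```

In this sketch the 1-hour gap is filled with the midpoint of its neighbours, while the 7-hour gap is set to -1 and flagged.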

#### Missing values in the bike ride data
The city bike season typically runs from the beginning of April to the end of October. In 2020, the season started unusually early on March 23rd. To maintain consistency across seasons, rides from March 2020 are excluded from the analysis.

In [10]:
bike_df = bike_df[bike_df['departure'].dt.month != 3]

Check the percentage of missing values for each column in the ride dataset.

In [11]:
missing_counts = bike_df.isnull().sum()
missing_percent = 100 * missing_counts / len(bike_df)
missing_summary = pd.DataFrame({'count': missing_counts, 'percent': round(missing_percent, 3)})
missing_summary

,count,percent
departure,69,0.0
return,20,0.0
departure_id,0,0.0
departure_name,0,0.0
return_id,79,0.0
return_name,79,0.0
distance,8479,0.06
duration,209240,1.54


The duration column has the most missing values and requires further investigation to decide how to handle them.
The other columns have such a small percentage of missing data that removing the rows with missing values will have a negligible impact on the analysis.

#### Exploring the Missing Duration Values


In [12]:
bike_df[(bike_df['departure'].dt.month == 10) & (bike_df['departure'].dt.year == 2021) & (bike_df['duration'].notna())]

,departure,return,departure_id,departure_name,return_id,return_name,distance,duration


All rows in October 2021 are missing the duration values, while other months are unaffected. This indicates that the values are not missing at random and are likely due to a data collection error. Dropping a whole month of data could distort the analysis of seasonal patterns and year-over-year comparisons.
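
One way to confirm that missing durations are concentrated in a single month is to compute the share of missing values per calendar month. The sketch below uses a small made-up frame, not the real `bike_df`; the names `toy` and `missing_share` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rides: both October 2021 rides lack a duration
toy = pd.DataFrame({
    "departure": pd.to_datetime([
        "2021-09-30 08:00", "2021-09-30 09:00",
        "2021-10-01 08:00", "2021-10-02 09:00",
        "2022-10-01 08:00",
    ]),
    "duration": [420.0, 610.0, np.nan, np.nan, 530.0],
})

# Share of rides with a missing duration per calendar month
month = toy["departure"].dt.to_period("M")
missing_share = toy["duration"].isna().groupby(month).mean()
print(missing_share)
```

A share of 1.0 for one month and 0.0 elsewhere points to a systematic collection problem rather than random missingness.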

Since the data contains both departure and return timestamps, the duration can be calculated using these columns. Check if the timestamps are consistent with the duration values.

In [13]:
bike_df['duration_calc'] = (
    pd.to_datetime(bike_df['return']) - pd.to_datetime(bike_df['departure'])
).dt.total_seconds()

bike_df['duration_diff'] = abs(bike_df['duration_calc'] - bike_df['duration'])

bike_df['duration_diff'].describe()

count   13345181.00
mean         284.94
std        10589.12
min            0.00
25%            3.00
50%            4.00
75%            5.00
max      4319138.00
Name: duration_diff, dtype: float64

Although the mean difference between the calculated and reported durations is large (≈285 seconds, just under 5 minutes), it is skewed by a few extreme outliers, with a maximum of ≈50 days (4,319,138 seconds). The median difference is 4 seconds, and 75% of the calculated values are within 5 seconds of the reported values.
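
For reference, the conversion of the `describe()` figures into more readable units, followed by a toy series (made-up values) showing how a single extreme value inflates the mean while leaving the median unchanged.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60

# Values copied from the describe() output above (seconds)
mean_diff_min = 284.94 / 60
max_diff_days = 4319138 / SECONDS_PER_DAY
print(f"mean: {mean_diff_min:.2f} min, max: {max_diff_days:.2f} days")

# Toy differences: four typical values and one extreme outlier
toy_diff = pd.Series([3.0, 4.0, 4.0, 5.0, 4319138.0])
toy_mean = toy_diff.mean()
toy_median = toy_diff.median()
print(f"toy mean: {toy_mean:.1f} s, toy median: {toy_median:.1f} s")
```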

In [14]:
bike_df.loc[bike_df['duration'].isna(), 'duration'] = bike_df.loc[bike_df['duration'].isna(), 'duration_calc']

print('Duration for missing values filled with calculated values:')
print(bike_df[(bike_df['departure'].dt.month == 10) & (bike_df['departure'].dt.year == 2021)]['duration'].describe())
print('Duration in October of the other years:')
print(bike_df[(bike_df['departure'].dt.month == 10) & (bike_df['departure'].dt.year != 2021)]['duration'].describe())

Duration for missing values filled with calculated values:
count    209240.00
mean       1611.26
std       16668.16
min       -3205.00
25%         328.00
50%         554.00
75%         943.00
max     2499520.00
Name: duration, dtype: float64
Duration in October of the other years:
count    846614.00
mean       1021.85
std       11078.84
min           0.00
25%         320.00
50%         545.00
75%         926.00
max     2914721.00
Name: duration, dtype: float64


The calculated durations for October 2021 are consistent with October in the other years: the median and the 25th and 75th percentiles align closely, which suggests that the calculated values are reliable. The means differ more (1611 vs. 1022 seconds) because both sets contain extreme outliers, and the calculated set also contains negative durations (minimum -3205 seconds). These outliers should be removed.

#### Remove the rows with missing values

In [15]:
# Drop rows with missing values and temporary duration columns
bike_df = bike_df.drop(columns=['duration_calc', 'duration_diff']).dropna()

# Convert seconds to minutes
bike_df['duration'] = bike_df['duration'] / 60  
bike_df.describe()

print(f'Share of rides that are over 5 hours: {round(len(bike_df[bike_df.duration > 5 * 60]) / len(bike_df), 3)}')
print(f'Share of rides that are over 15 kilometers: {round(len(bike_df[bike_df.distance > 15000]) / len(bike_df), 3)}')


Share of rides that are over 5 hours: 0.003
Share of rides that are over 15 kilometers: 0.001


From 2020 to 2023, the bike pass allowed free rides up to 30 minutes, with a charge of 1 euro for every additional 30 minutes, up to a total of 5 hours. After 5 hours, a delay fee of 80 euros applies, plus 9 euros for each additional 30 minutes. In 2024, the free ride period was extended to one hour, while the rest of the pricing structure remained the same.
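
The pricing rule can be written as a small function. This is only a sketch of one reading of the description above: it assumes that each started 30-minute block is charged in full and that the 80 euro delay fee is added on top of the 1 euro charges already accrued. Neither detail is stated in the text, and the function name and parameters are hypothetical.

```python
import math

def overage_fee_eur(duration_min, free_min=30):
    """Illustrative extra fee for one ride; not an official price calculator."""
    block = 30      # minutes per charged block
    cap = 5 * 60    # minutes after which the delay fee applies

    # 1 euro per started 30-minute block between the free period and 5 hours
    charged_min = min(duration_min, cap) - free_min
    fee = max(0, math.ceil(charged_min / block)) * 1

    if duration_min > cap:
        # Delay fee plus 9 euros per started 30-minute block beyond 5 hours
        fee += 80 + math.ceil((duration_min - cap) / block) * 9
    return fee

# 2020-2023 pass: 30 free minutes
fees_30 = [overage_fee_eur(m) for m in (25, 45, 300, 301)]
# 2024 pass: 60 free minutes
fees_60 = [overage_fee_eur(m, free_min=60) for m in (45, 300)]
print(fees_30, fees_60)
```

Under these assumptions the fee jumps sharply once a ride passes the 5-hour mark, which matches the intent of discouraging long rentals.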

The pricing encourages short rides, which keeps bikes available. Only 0.3% of rides last longer than 5 hours, which is beyond the intended use, so these rows are removed. Unlocking and returning a bike takes time, so rides of 1 minute or less are also removed; these are likely errors or cases where the user unlocked the bike and returned it to the station without riding it. For the same reason, rides of 50 meters or less are removed. Rides of 15 km or more are removed as well: the bikes are designed for short trips, so such distances are more likely due to GPS or data recording errors.

In [16]:
bike_df = bike_df[
    (bike_df['duration'] > 1) & (bike_df['duration'] <= 5 * 60)
    & (bike_df['distance'] > 50) & (bike_df['distance'] < 15000)
]
bike_df.reset_index(drop=True, inplace=True)
bike_df[['distance', 'duration']].describe()

,distance,duration
count,12960378.0,12960378.0
mean,2480.29,13.87
std,1775.17,13.47
min,51.0,1.02
25%,1178.0,6.5
50%,1999.0,10.83
75%,3310.0,17.72
max,14998.0,300.0


### Merge station location information

We merge station location and capacity information into the dataset to enable spatial analysis and explore how geography impacts bike usage.
- `lat` and `lon`: Latitude and longitude coordinates that enable geospatial analysis and mapping. These help identify patterns in demand by location and station clustering.
- `capacity`: The maximum number of designated docking spots at a station. While this is the official station size, bikes can still be returned even when a station is full by locking them to existing bikes. As a result, actual usage can exceed the stated capacity.

In [17]:
# Merge with station info
bike_df = merge_station_info(bike_df, station_df, station_type='departure')
bike_df = merge_station_info(bike_df, station_df, station_type='return')

bike_df.head()

,departure,return,departure_id,departure_name,return_id,return_name,distance,duration,departure_lat,departure_lon,departure_capacity,return_lat,return_lon,return_capacity
0,2020-04-01 00:04:08,2020-04-01 00:21:27,62,Välimerenkatu,62,Välimerenkatu,999.0,17.3,60.16,24.92,16.0,60.16,24.92,16.0
1,2020-04-01 00:12:31,2020-04-01 00:21:34,149,Toinen linja,16,Liisanpuistikko,2372.0,8.97,60.18,24.94,22.0,60.17,24.96,17.0
2,2020-04-01 00:16:46,2020-04-01 00:46:09,118,Fleminginkatu,105,Tilkantori,4299.0,18.02,60.19,24.95,22.0,60.2,24.89,16.0
3,2020-04-01 00:19:29,2020-04-01 00:30:13,17,Varsapuistikko,13,Merisotilaantori,1923.0,10.65,60.17,24.95,28.0,60.17,24.98,24.0
4,2020-04-01 00:22:32,2020-04-01 00:27:29,30,Itämerentori,67,Perämiehenkatu,1376.0,4.87,60.16,24.91,40.0,60.16,24.93,16.0
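
The helper `merge_station_info` is defined in `citybike.data_cleaning` and is not shown here. Judging from the output columns, it adds prefixed `lat`, `lon` and `capacity` columns for the chosen station type. A minimal sketch of such a merge, assuming a left join on the zero-padded station ID, could look like this; the function name and the toy frames are hypothetical.

```python
import pandas as pd

def merge_station_info_sketch(rides, stations, station_type):
    """Hypothetical stand-in for merge_station_info; the real helper may differ."""
    # Prefix the station columns, e.g. lat -> departure_lat
    cols = stations[["lat", "lon", "capacity"]].add_prefix(f"{station_type}_")
    # Left join keeps every ride, even if its station ID has no match
    return rides.merge(
        cols, how="left", left_on=f"{station_type}_id", right_index=True
    )

# Toy inputs; "999" is a station ID with no match
rides = pd.DataFrame({"departure_id": ["062", "999"], "return_id": ["149", "062"]})
stations = pd.DataFrame(
    {"lat": [60.16, 60.18], "lon": [24.92, 24.94], "capacity": [16.0, 22.0]},
    index=pd.Index(["062", "149"], name="id"),
)

merged = merge_station_info_sketch(rides, stations, "departure")
merged = merge_station_info_sketch(merged, stations, "return")
print(merged)
```

With a left join, rides whose station ID is absent from the station table keep NaN in the new columns, so it is worth checking for such rows after merging.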


### Save cleaned data

In [20]:
bike_df.to_csv(CLEAN_DIR / 'bike_rides_cleaned.csv', index=False)

weather_df.to_csv(CLEAN_DIR / 'weather_cleaned.csv', index=True)

### Data Cleaning Summary
- Handling missing weather values:
    - Precipitation: missing values replaced with -1 and flagged to indicate missing data.
    - Temperature: missing values forward-filled.
    - Wind speed: short gaps (<6 hours) interpolated linearly; longer gaps replaced with -1 and flagged.
- Calculated missing duration values using departure and return timestamps.
- Removed data from March 2020 to maintain consistency in the city bike season.
- Removed rows with missing values.
- Filtered out rides with unrealistic durations and distances.
- Merged station location information and capacity.