# Feature Engineering - Divvy Bike Usage Prediction

**Project:** Predicting Divvy bike usage in Chicago  
**Author:** Noé  
**Date:** 2026-01-19  
**Objective:** Build simple, explainable features for modeling

---

## Objective

**Predict the number of trips PER HOUR** to estimate how many bikes are needed

### Feature Strategy:
- **Temporal**: hour, day_of_week, month, is_weekend, season
- **Weather**: temperature, precipitation, wind_speed (daily aggregates)
- **Calendar**: is_holiday

### Approach:
- Hourly trip aggregation
- Simple, interpretable features (no lag or rolling features)
- Training on 2024, testing on 2025

---

## Table of Contents

1. [Data Loading](#1-data-loading)
2. [Hourly Data Aggregation](#2-hourly-data-aggregation)
3. [Temporal Features](#3-temporal-features)
4. [Weather Features](#4-weather-features)
5. [Calendar Features](#5-calendar-features)
6. [Final Dataset](#6-final-dataset)
7. [Train/Test Split](#7-traintest-split)

---

## 1. Data Loading

Load the data prepared during the EDA

In [1]:
# Import libraries
import pandas as pd
import plotly.express as px
import warnings
from pathlib import Path
from datetime import datetime

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

print("Libraries loaded")

Libraries loaded


In [2]:
# Paths
DATA_PATH = Path('../data')
RAW_PATH = DATA_PATH / 'raw'
PROCESSED_PATH = DATA_PATH / 'processed'

DIVVY_2024_PATH = RAW_PATH / 'divvy' / '2024'
DIVVY_2025_PATH = RAW_PATH / 'divvy' / '2025'
WEATHER_PATH = RAW_PATH / 'weather'
HOLIDAYS_PATH = RAW_PATH / 'holidays'

# Create the processed folder if needed
PROCESSED_PATH.mkdir(exist_ok=True)

print("Paths configured")
print(f"   2024 data: {DIVVY_2024_PATH}")
print(f"   2025 data: {DIVVY_2025_PATH}")
print(f"   Output: {PROCESSED_PATH}")

Paths configured
   2024 data: ..\data\raw\divvy\2024
   2025 data: ..\data\raw\divvy\2025
   Output: ..\data\processed


### 1.1 Loading Divvy 2024 (Train)

In [3]:
# Load all 2024 files
print("Loading Divvy 2024...")
divvy_files_2024 = sorted(DIVVY_2024_PATH.glob('*.csv'))
print(f"   Files found: {len(divvy_files_2024)}")

dfs_2024 = []
for file in divvy_files_2024:
    df_month = pd.read_csv(file)
    dfs_2024.append(df_month)
    print(f"   {file.name}: {len(df_month):,} rows")

df_divvy_2024 = pd.concat(dfs_2024, ignore_index=True)
print(f"\nTotal 2024: {len(df_divvy_2024):,} trips")

Loading Divvy 2024...
   Files found: 12
   202401-divvy-tripdata.csv: 144,873 rows
   202402-divvy-tripdata.csv: 223,164 rows
   202403-divvy-tripdata.csv: 301,687 rows
   202404-divvy-tripdata.csv: 415,025 rows
   202405-divvy-tripdata.csv: 609,493 rows
   202406-divvy-tripdata.csv: 710,721 rows
   202407-divvy-tripdata.csv: 748,962 rows
   202408-divvy-tripdata.csv: 755,639 rows
   202409-divvy-tripdata.csv: 821,276 rows
   202410-divvy-tripdata.csv: 616,281 rows
   202411-divvy-tripdata.csv: 335,075 rows
   202412-divvy-tripdata.csv: 178,372 rows

Total 2024: 5,860,568 trips


### 1.2 Loading Divvy 2025 (Test)

**Note**: The 2025 data is now available (as of January 2026)

In [4]:
# Load the 2025 files
print("Loading Divvy 2025...")
divvy_files_2025 = sorted(DIVVY_2025_PATH.glob('*.csv'))
print(f"   Files found: {len(divvy_files_2025)}")

dfs_2025 = []
for file in divvy_files_2025:
    df_month = pd.read_csv(file)
    dfs_2025.append(df_month)
    print(f"   {file.name}: {len(df_month):,} rows")

df_divvy_2025 = pd.concat(dfs_2025, ignore_index=True)
print(f"\nTotal 2025: {len(df_divvy_2025):,} trips")

Loading Divvy 2025...
   Files found: 12
   202501-divvy-tripdata.csv: 138,689 rows
   202502-divvy-tripdata.csv: 151,880 rows
   202503-divvy-tripdata.csv: 298,155 rows
   202504-divvy-tripdata.csv: 371,341 rows
   202505-divvy-tripdata.csv: 502,456 rows
   202506-divvy-tripdata.csv: 678,904 rows
   202507-divvy-tripdata.csv: 763,432 rows
   202508-divvy-tripdata.csv: 790,177 rows
   202509-divvy-tripdata.csv: 714,759 rows
   202510-divvy-tripdata.csv: 646,039 rows
   202511-divvy-tripdata.csv: 356,628 rows
   202512-divvy-tripdata.csv: 140,534 rows

Total 2025: 5,552,994 trips
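
The two loading cells above repeat the same loop. As a minimal sketch (not the notebook's own code), the pattern could be wrapped in one helper. The function name `load_divvy_year` and the demo file names are hypothetical, and the demo writes two tiny fake CSV files to a temporary folder so the block runs on its own.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_divvy_year(folder: Path) -> pd.DataFrame:
    """Read every monthly CSV in `folder` and stack them into one DataFrame."""
    files = sorted(folder.glob("*.csv"))
    if not files:
        raise FileNotFoundError(f"No CSV files found in {folder}")
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)


# Self-contained demo with two fake monthly files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    pd.DataFrame({"ride_id": ["a", "b"]}).to_csv(tmp_path / "202401-demo.csv", index=False)
    pd.DataFrame({"ride_id": ["c"]}).to_csv(tmp_path / "202402-demo.csv", index=False)
    demo = load_divvy_year(tmp_path)

print(len(demo))
```

In the notebook, the same helper could be called once with `DIVVY_2024_PATH` and once with `DIVVY_2025_PATH`.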


### 1.3 Loading Weather Data

In [5]:
# Load 2024 and 2025 weather
df_weather_2024 = pd.read_csv(WEATHER_PATH / '2024_weather_chicago.csv')
df_weather_2025 = pd.read_csv(WEATHER_PATH / '2025_weather_chicago.csv')

# Combine
df_weather = pd.concat([df_weather_2024, df_weather_2025], ignore_index=True)

# Prepare the date column
df_weather['date'] = pd.to_datetime(df_weather['date']).dt.date

print(f"Weather data loaded: {len(df_weather)} days")
print(f"   Columns: {df_weather.columns.tolist()}")
df_weather.head()

Weather data loaded: 731 days
   Columns: ['date', 'tavg', 'tmin', 'tmax', 'prcp', 'snow', 'wdir', 'wspd', 'wpgt', 'pres', 'tsun']


Unnamed: 0,date,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
0,2024-01-01,-1.0,-2.8,0.0,0.0,,,15.8,,1025.8,
1,2024-01-02,-0.5,-3.3,2.8,0.0,,,19.3,,1021.6,
2,2024-01-03,0.9,-0.6,2.2,0.0,,,11.0,,1019.3,
3,2024-01-04,-0.5,-3.3,1.7,0.0,,,7.1,,1027.3,
4,2024-01-05,-0.4,-4.4,2.8,0.0,,,7.6,,1023.0,


### 1.4 Loading Holidays Data

In [6]:
# Load holidays
df_holidays = pd.read_csv(HOLIDAYS_PATH / 'us_holidays_2024_2025.csv')
df_holidays['date'] = pd.to_datetime(df_holidays['date']).dt.date
holiday_dates = df_holidays['date'].tolist()

print(f"Holidays loaded: {len(df_holidays)} holidays")
print(f"\nOverview:")
df_holidays.head(10)

Holidays loaded: 22 holidays

Overview:


Unnamed: 0,date,holiday_name,type
0,2024-01-01,New Year's Day,federal
1,2024-01-15,Martin Luther King Jr. Day,federal
2,2024-02-19,Washington's Birthday,federal
3,2024-05-27,Memorial Day,federal
4,2024-06-19,Juneteenth National Independence Day,federal
5,2024-07-04,Independence Day,federal
6,2024-09-02,Labor Day,federal
7,2024-10-14,Columbus Day,federal
8,2024-11-11,Veterans Day,federal
9,2024-11-28,Thanksgiving Day,federal
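
As an optional cross-check, the dates in the holiday file can be compared with pandas' built-in US federal holiday calendar. This is a sketch under assumptions: the three dates below are copied from the preview above rather than read from the CSV, and the exact set of holidays the calendar returns may vary with the installed pandas version.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Observed US federal holidays over the two years covered by the notebook.
calendar_dates = USFederalHolidayCalendar().holidays(start="2024-01-01", end="2025-12-31")
calendar_set = set(calendar_dates.date)

# Three dates taken from the holiday preview above, as datetime.date objects.
csv_dates = pd.to_datetime(pd.Series(["2024-01-01", "2024-07-04", "2024-11-28"])).dt.date

# Any date listed in the file but unknown to the calendar would show up here.
missing_from_calendar = [d for d in csv_dates if d not in calendar_set]
print(missing_from_calendar)
```

An empty list means every checked date is also a holiday according to the pandas calendar.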


---

## 2. Hourly Data Aggregation

**Objective**: Build a dataset with the number of trips PER HOUR

### 2.1 Preparing 2024 Data

In [7]:
# Convert dates (ISO8601 format handles the millisecond timestamps)
print("Converting datetime columns 2024...")
df_divvy_2024['started_at'] = pd.to_datetime(df_divvy_2024['started_at'], format='ISO8601')

# Create a datetime_hour column (floored to the hour)
# 'h' replaces the 'H' alias, which is deprecated in recent pandas versions
df_divvy_2024['datetime_hour'] = df_divvy_2024['started_at'].dt.floor('h')

print("Dates converted")
print(f"   Period: {df_divvy_2024['started_at'].min()} → {df_divvy_2024['started_at'].max()}")

Converting datetime columns 2024...
Dates converted
   Period: 2024-01-01 00:00:39 → 2024-12-31 23:56:49.854000


In [8]:
# HOURLY aggregation
print("Aggregating data by hour 2024...")
df_hourly_2024 = df_divvy_2024.groupby('datetime_hour').agg({
    'ride_id': 'count'  # Number of trips
}).reset_index()

df_hourly_2024.columns = ['datetime_hour', 'trip_count']

print(f"Aggregation complete: {len(df_hourly_2024):,} hours")
print(f"\nStatistics:")
print(df_hourly_2024['trip_count'].describe())
df_hourly_2024.head()

Aggregating data by hour 2024...
Aggregation complete: 8,782 hours

Statistics:
count    8782.000000
mean      667.338647
std       667.821275
min         1.000000
25%       131.000000
50%       442.500000
75%       997.000000
max      3789.000000
Name: trip_count, dtype: float64


Unnamed: 0,datetime_hour,trip_count
0,2024-01-01 00:00:00,180
1,2024-01-01 01:00:00,373
2,2024-01-01 02:00:00,238
3,2024-01-01 03:00:00,49
4,2024-01-01 04:00:00,23


### 2.2 Preparing 2025 Data

In [9]:
# Prepare 2025
print("Converting datetime columns 2025...")
df_divvy_2025['started_at'] = pd.to_datetime(df_divvy_2025['started_at'], format='ISO8601')
df_divvy_2025['datetime_hour'] = df_divvy_2025['started_at'].dt.floor('h')

# Keep ONLY trips that started in 2025
print("Filtering 2025 data...")
avant_filtre = len(df_divvy_2025)
df_divvy_2025 = df_divvy_2025[df_divvy_2025['started_at'].dt.year == 2025]
apres_filtre = len(df_divvy_2025)
print(f"   Before filter: {avant_filtre:,}")
print(f"   After filter: {apres_filtre:,}")
print(f"   Removed: {avant_filtre - apres_filtre:,}")

print("Aggregating data by hour 2025...")
df_hourly_2025 = df_divvy_2025.groupby('datetime_hour').agg({
    'ride_id': 'count'
}).reset_index()
df_hourly_2025.columns = ['datetime_hour', 'trip_count']

print(f"Aggregation 2025 complete: {len(df_hourly_2025):,} hours")
print(f"   Period: {df_hourly_2025['datetime_hour'].min()} → {df_hourly_2025['datetime_hour'].max()}")

# Compare with 2024
# Both years are 2 hours short of a full calendar (8,784 and 8,760 hours):
# hours in which no trip started are absent from the groupby result
print(f"\nDataset comparison:")
print(f"   2024: {len(df_hourly_2024):,} hours")
print(f"   2025: {len(df_hourly_2025):,} hours")
print(f"   Difference: {abs(len(df_hourly_2024) - len(df_hourly_2025))} hours")

print(f"\nNote: The 24-hour difference matches the extra leap day in 2024")

df_hourly_2025.head()

Converting datetime columns 2025...
Filtering 2025 data...
   Before filter: 5,552,994
   After filter: 5,552,941
   Removed: 53
Aggregating data by hour 2025...
Aggregation 2025 complete: 8,758 hours
   Period: 2025-01-01 00:00:00 → 2025-12-31 23:00:00

Dataset comparison:
   2024: 8,782 hours
   2025: 8,758 hours
   Difference: 24 hours

Note: The 24-hour difference matches the extra leap day in 2024


Unnamed: 0,datetime_hour,trip_count
0,2025-01-01 00:00:00,336
1,2025-01-01 01:00:00,436
2,2025-01-01 02:00:00,213
3,2025-01-01 03:00:00,57
4,2025-01-01 04:00:00,24
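
A `groupby` only returns hours in which at least one trip started, so an hour with zero trips is missing from the table rather than present with `trip_count = 0`. The sketch below shows one possible way to make such hours explicit by reindexing onto a complete hourly range. It is not part of the notebook's pipeline, and the three rows are hypothetical toy data.

```python
import pandas as pd

# Hypothetical hourly counts where 02:00 is absent because no trip started then.
hourly = pd.DataFrame({
    "datetime_hour": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 03:00"]
    ),
    "trip_count": [180, 373, 49],
})

# Complete hourly range between the first and last observed hour.
full_index = pd.date_range(
    hourly["datetime_hour"].min(), hourly["datetime_hour"].max(), freq="h"
)

# Reindex onto the full range; hours without trips get an explicit 0.
complete = (
    hourly.set_index("datetime_hour")
    .reindex(full_index, fill_value=0)
    .rename_axis("datetime_hour")
    .reset_index()
)
print(complete)
```

Whether to add these zero rows is a modeling choice: keeping them lets a model learn that some hours have no demand, while leaving them out matches what the notebook does.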


---

## 3. Temporal Features

Create simple, explainable temporal features

In [10]:
def create_temporal_features(df):
    """
    Create temporal features from datetime_hour
    
    Features created:
    - hour: hour of the day (0-23)
    - day_of_week: day of the week (0=Monday, 6=Sunday)
    - month: month (1-12)
    - is_weekend: 1 if weekend, 0 otherwise
    - season: season (winter, spring, summer, fall)
    - date: date (used to merge with weather)
    """
    df = df.copy()
    
    # Basic features
    df['hour'] = df['datetime_hour'].dt.hour
    df['day_of_week'] = df['datetime_hour'].dt.dayofweek
    df['month'] = df['datetime_hour'].dt.month
    df['date'] = df['datetime_hour'].dt.date
    
    # Weekend (Saturday=5, Sunday=6)
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    
    # Season (simple month-based mapping)
    season_map = {
        12: 'winter', 1: 'winter', 2: 'winter',
        3: 'spring', 4: 'spring', 5: 'spring',
        6: 'summer', 7: 'summer', 8: 'summer',
        9: 'fall', 10: 'fall', 11: 'fall'
    }
    df['season'] = df['month'].map(season_map)
    
    return df

print("Function create_temporal_features defined")

Function create_temporal_features defined


In [11]:
# Apply to 2024
print("Creating temporal features 2024...")
df_hourly_2024 = create_temporal_features(df_hourly_2024)

print("Temporal features created:")
print(f"   Columns: {df_hourly_2024.columns.tolist()}")
print(f"\nOverview:")
df_hourly_2024.head()

Creating temporal features 2024...
Temporal features created:
   Columns: ['datetime_hour', 'trip_count', 'hour', 'day_of_week', 'month', 'date', 'is_weekend', 'season']

Overview:


Unnamed: 0,datetime_hour,trip_count,hour,day_of_week,month,date,is_weekend,season
0,2024-01-01 00:00:00,180,0,0,1,2024-01-01,0,winter
1,2024-01-01 01:00:00,373,1,0,1,2024-01-01,0,winter
2,2024-01-01 02:00:00,238,2,0,1,2024-01-01,0,winter
3,2024-01-01 03:00:00,49,3,0,1,2024-01-01,0,winter
4,2024-01-01 04:00:00,23,4,0,1,2024-01-01,0,winter


In [12]:
# Apply to 2025
print("Creating temporal features 2025...")
df_hourly_2025 = create_temporal_features(df_hourly_2025)
print("Temporal features 2025 created")
print(f"\nOverview:")
df_hourly_2025.head()

Creating temporal features 2025...
Temporal features 2025 created

Overview:


Unnamed: 0,datetime_hour,trip_count,hour,day_of_week,month,date,is_weekend,season
0,2025-01-01 00:00:00,336,0,2,1,2025-01-01,0,winter
1,2025-01-01 01:00:00,436,1,2,1,2025-01-01,0,winter
2,2025-01-01 02:00:00,213,2,2,1,2025-01-01,0,winter
3,2025-01-01 03:00:00,57,3,2,1,2025-01-01,0,winter
4,2025-01-01 04:00:00,24,4,2,1,2025-01-01,0,winter


### 3.1 Temporal Features Verification

In [13]:
# Check the distribution of the features
print("Temporal features distribution (2024):")
print(f"\nHour: {df_hourly_2024['hour'].min()} → {df_hourly_2024['hour'].max()}")
print(f"Day of week: {df_hourly_2024['day_of_week'].min()} → {df_hourly_2024['day_of_week'].max()}")
print(f"Month: {df_hourly_2024['month'].min()} → {df_hourly_2024['month'].max()}")
print(f"\nSeasons: {df_hourly_2024['season'].value_counts().to_dict()}")
print(f"Weekend ratio: {df_hourly_2024['is_weekend'].mean():.2%}")

Temporal features distribution (2024):

Hour: 0 → 23
Day of week: 0 → 6
Month: 1 → 12

Seasons: {'summer': 2208, 'spring': 2207, 'fall': 2184, 'winter': 2183}
Weekend ratio: 28.41%
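
The weekend ratio can be sanity-checked against the calendar itself. The sketch below counts Saturdays and Sundays among the days of a year using only the standard library. The helper name `weekend_share` is hypothetical, and the result is a share of calendar days, so it should only roughly match the share of hourly rows reported above.

```python
from datetime import date, timedelta


def weekend_share(year: int) -> float:
    """Share of calendar days in `year` that fall on Saturday or Sunday."""
    day = date(year, 1, 1)
    total = 0
    weekend = 0
    while day.year == year:
        total += 1
        # weekday(): Monday=0 ... Sunday=6, the same convention as dt.dayofweek
        if day.weekday() >= 5:
            weekend += 1
        day += timedelta(days=1)
    return weekend / total


print(f"{weekend_share(2024):.2%}")
```

For 2024 this gives 104 weekend days out of 366, about 28.4%, which is consistent with the hourly ratio.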


---

## 4. Weather Features

Merge the (daily) weather data with the hourly data

### 4.1 Selecting the Relevant Weather Features

In [14]:
# Check which weather columns are available
print("Available weather columns:")
print(df_weather.columns.tolist())
print(f"\nWeather statistics:")
df_weather.describe()

Available weather columns:
['date', 'tavg', 'tmin', 'tmax', 'prcp', 'snow', 'wdir', 'wspd', 'wpgt', 'pres', 'tsun']

Weather statistics:


Unnamed: 0,tavg,tmin,tmax,prcp,snow,wdir,wspd,wpgt,pres,tsun
count,731.0,731.0,731.0,526.0,0.0,0.0,731.0,1.0,730.0,0.0
mean,11.595212,6.893434,16.175923,0.90057,,,12.706703,28.0,1016.65589,
std,10.697211,10.388528,11.51219,3.21903,,,5.528717,,7.049609,
min,-21.0,-23.3,-16.7,0.0,,,2.3,28.0,988.3,
25%,3.7,-0.6,7.2,0.0,,,8.4,28.0,1012.0,
50%,12.3,7.8,18.0,0.0,,,12.0,28.0,1016.95,
75%,20.9,15.0,26.1,0.0,,,16.25,28.0,1021.2,
max,31.2,27.2,35.6,24.0,,,34.3,28.0,1038.8,
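
The `count` row above already shows that some columns are partly or entirely empty (`prcp` has 526 values for 731 days, `snow`, `wdir` and `tsun` have none). A quick way to quantify this is the share of missing values per column. The sketch below uses a hypothetical miniature table rather than the real weather file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the weather table: one partly empty and one fully empty column.
weather = pd.DataFrame({
    "tavg": [-1.0, -0.5, 0.9, -0.5],
    "prcp": [0.0, np.nan, 0.0, np.nan],
    "snow": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = weather.isna().mean().sort_values(ascending=False)

# Columns that contain at least one observed value.
usable_cols = missing_share[missing_share < 1.0].index.tolist()

print(missing_share)
print(usable_cols)
```

Running the same two lines on `df_weather` would make the selection of `tavg`, `prcp` and `wspd` in the next cell easier to justify.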


In [15]:
# Select and rename the relevant columns
# Adapt to the actual column names in the weather CSV
weather_features = df_weather[['date']].copy()

# Add the weather features (adapt the column names if needed)
# Here the source columns are 'tavg', 'prcp' and 'wspd'
if 'tavg' in df_weather.columns:
    weather_features['temperature'] = df_weather['tavg']
if 'prcp' in df_weather.columns:
    weather_features['precipitation'] = df_weather['prcp']
if 'wspd' in df_weather.columns:
    weather_features['wind_speed'] = df_weather['wspd']

print("Weather features selected:")
print(weather_features.columns.tolist())
weather_features.head()

Weather features selected:
['date', 'temperature', 'precipitation', 'wind_speed']


Unnamed: 0,date,temperature,precipitation,wind_speed
0,2024-01-01,-1.0,0.0,15.8
1,2024-01-02,-0.5,0.0,19.3
2,2024-01-03,0.9,0.0,11.0
3,2024-01-04,-0.5,0.0,7.1
4,2024-01-05,-0.4,0.0,7.6


### 4.2 Merging with Hourly Data

In [16]:
# Merge 2024
print("Merging weather with 2024 hourly data...")
df_hourly_2024 = df_hourly_2024.merge(weather_features, on='date', how='left')

print(f"Merge complete: {len(df_hourly_2024)} rows")
print(f"\nMissing values after merge:")
print(df_hourly_2024[['temperature', 'precipitation', 'wind_speed']].isnull().sum())
df_hourly_2024.head()

Merging weather with 2024 hourly data...
Merge complete: 8782 rows

Missing values after merge:
temperature         0
precipitation    3648
wind_speed          0
dtype: int64


Unnamed: 0,datetime_hour,trip_count,hour,day_of_week,month,date,is_weekend,season,temperature,precipitation,wind_speed
0,2024-01-01 00:00:00,180,0,0,1,2024-01-01,0,winter,-1.0,0.0,15.8
1,2024-01-01 01:00:00,373,1,0,1,2024-01-01,0,winter,-1.0,0.0,15.8
2,2024-01-01 02:00:00,238,2,0,1,2024-01-01,0,winter,-1.0,0.0,15.8
3,2024-01-01 03:00:00,49,3,0,1,2024-01-01,0,winter,-1.0,0.0,15.8
4,2024-01-01 04:00:00,23,4,0,1,2024-01-01,0,winter,-1.0,0.0,15.8


In [17]:
# Merge 2025
print("Merging weather with 2025 hourly data...")
df_hourly_2025 = df_hourly_2025.merge(weather_features, on='date', how='left')

print(f"Merge 2025 complete: {len(df_hourly_2025)} rows")
print(f"\nMissing values after merge (2025):")
print(df_hourly_2025[['temperature', 'precipitation', 'wind_speed']].isnull().sum())
df_hourly_2025.head()

Merging weather with 2025 hourly data...
Merge 2025 complete: 8758 rows

Missing values after merge (2025):
temperature         0
precipitation    1271
wind_speed          0
dtype: int64


Unnamed: 0,datetime_hour,trip_count,hour,day_of_week,month,date,is_weekend,season,temperature,precipitation,wind_speed
0,2025-01-01 00:00:00,336,0,2,1,2025-01-01,0,winter,-1.5,,21.0
1,2025-01-01 01:00:00,436,1,2,1,2025-01-01,0,winter,-1.5,,21.0
2,2025-01-01 02:00:00,213,2,2,1,2025-01-01,0,winter,-1.5,,21.0
3,2025-01-01 03:00:00,57,3,2,1,2025-01-01,0,winter,-1.5,,21.0
4,2025-01-01 04:00:00,24,4,2,1,2025-01-01,0,winter,-1.5,,21.0
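
A left merge of hourly rows onto daily weather silently duplicates rows if a date appears twice in the weather table, and silently leaves gaps if a date is absent. The sketch below shows how the `validate` and `indicator` parameters of `DataFrame.merge` can surface both problems. The two small frames are hypothetical toy data, not the notebook's tables.

```python
import datetime as dt

import pandas as pd

# Hypothetical hourly rows covering two days.
hourly = pd.DataFrame({
    "datetime_hour": pd.to_datetime(
        ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-02 00:00"]
    ),
})
hourly["date"] = hourly["datetime_hour"].dt.date

# Hypothetical daily weather that only covers the first day.
weather = pd.DataFrame({
    "date": [dt.date(2024, 1, 1)],
    "temperature": [-1.0],
})

# validate raises if a date is duplicated on the weather side;
# indicator adds a _merge column telling where each row found a match.
merged = hourly.merge(
    weather, on="date", how="left", validate="many_to_one", indicator=True
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```

Rows flagged `left_only` are hours whose date has no weather record at all, which is a different situation from a date that is present but has an empty `prcp` value.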


### 4.3 Handling Missing Weather Values

In [18]:
# Fill missing weather values with 0
# Only precipitation has gaps: temperature and wind_speed have no missing values after the merge
# Assumption: a missing precipitation value means a dry day (0 mm)
# Caveat: this concerns 3,648 of 8,782 hourly rows in 2024 (about 41.5%)
# and 1,271 of 8,758 in 2025 (about 14.5%), so it is a strong assumption
weather_cols = ['temperature', 'precipitation', 'wind_speed']

print("Handling missing values (filling with 0):\n")

# Process 2024
for col in weather_cols:
    if col in df_hourly_2024.columns:
        missing_count = df_hourly_2024[col].isnull().sum()
        if missing_count > 0:
            df_hourly_2024[col] = df_hourly_2024[col].fillna(0)
            print(f"2024 - {col}: {missing_count} missing values replaced by 0")

# Process 2025
for col in weather_cols:
    if col in df_hourly_2025.columns:
        missing_count_2025 = df_hourly_2025[col].isnull().sum()
        if missing_count_2025 > 0:
            df_hourly_2025[col] = df_hourly_2025[col].fillna(0)
            print(f"2025 - {col}: {missing_count_2025} missing values replaced by 0")

print("\nAll missing weather values handled")


Handling missing values (filling with 0):

2024 - precipitation: 3648 missing values replaced by 0
2025 - precipitation: 1271 missing values replaced by 0

All missing weather values handled
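
Because a large share of precipitation values is missing, a zero-fill makes "no measurement" indistinguishable from "0 mm of rain". One optional variation, sketched below on hypothetical toy data, is to record a missing-value flag before filling. The column name `precipitation_missing` is an assumption for illustration and is not part of the notebook's final feature set.

```python
import numpy as np
import pandas as pd

# Hypothetical precipitation column with two unmeasured days.
df = pd.DataFrame({"precipitation": [0.0, np.nan, 2.5, np.nan]})

# Record which rows had no measurement before overwriting them.
df["precipitation_missing"] = df["precipitation"].isna().astype(int)

# Same zero-fill as in the notebook, written as an assignment.
df["precipitation"] = df["precipitation"].fillna(0)

print(df)
```

With the flag in place, a model can learn a separate effect for unmeasured days instead of treating them as confirmed dry days.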


---

## 5. Calendar Features

Add the holiday indicator

In [19]:
# Add is_holiday
print("Adding the holiday indicator...")
df_hourly_2024['is_holiday'] = df_hourly_2024['date'].isin(holiday_dates).astype(int)

holiday_count = df_hourly_2024['is_holiday'].sum()
print(f"is_holiday added: {holiday_count} hours flagged as holiday")
print(f"   Ratio: {df_hourly_2024['is_holiday'].mean():.2%}")

Adding the holiday indicator...
is_holiday added: 264 hours flagged as holiday
   Ratio: 3.01%


In [20]:
# Add for 2025
df_hourly_2025['is_holiday'] = df_hourly_2025['date'].isin(holiday_dates).astype(int)
holiday_count_2025 = df_hourly_2025['is_holiday'].sum()
print(f"is_holiday added for 2025: {holiday_count_2025} hours flagged as holiday")
print(f"   Ratio: {df_hourly_2025['is_holiday'].mean():.2%}")

is_holiday added for 2025: 264 hours flagged as holiday
   Ratio: 3.01%


---

## 6. Final Dataset

Prepare the final dataset for modeling

### 6.1 Selecting the Final Columns

In [21]:
# Columns to keep for modeling
feature_cols = [
    'datetime_hour',  # Time reference
    'trip_count',     # TARGET
    # Temporal
    'hour',
    'day_of_week',
    'month',
    'is_weekend',
    'season',
    # Weather
    'temperature',
    'precipitation',
    'wind_speed',
    # Calendar
    'is_holiday'
]

df_final_2024 = df_hourly_2024[feature_cols].copy()
print(f"Final dataset 2024: {df_final_2024.shape}")
print(f"\nColumns: {df_final_2024.columns.tolist()}")
df_final_2024.head()

Final dataset 2024: (8782, 11)

Columns: ['datetime_hour', 'trip_count', 'hour', 'day_of_week', 'month', 'is_weekend', 'season', 'temperature', 'precipitation', 'wind_speed', 'is_holiday']


Unnamed: 0,datetime_hour,trip_count,hour,day_of_week,month,is_weekend,season,temperature,precipitation,wind_speed,is_holiday
0,2024-01-01 00:00:00,180,0,0,1,0,winter,-1.0,0.0,15.8,1
1,2024-01-01 01:00:00,373,1,0,1,0,winter,-1.0,0.0,15.8,1
2,2024-01-01 02:00:00,238,2,0,1,0,winter,-1.0,0.0,15.8,1
3,2024-01-01 03:00:00,49,3,0,1,0,winter,-1.0,0.0,15.8,1
4,2024-01-01 04:00:00,23,4,0,1,0,winter,-1.0,0.0,15.8,1


In [22]:
# Prepare 2025
df_final_2025 = df_hourly_2025[feature_cols].copy()
print(f"Final dataset 2025: {df_final_2025.shape}")
print(f"\nColumns: {df_final_2025.columns.tolist()}")
df_final_2025.head()

Final dataset 2025: (8758, 11)

Columns: ['datetime_hour', 'trip_count', 'hour', 'day_of_week', 'month', 'is_weekend', 'season', 'temperature', 'precipitation', 'wind_speed', 'is_holiday']


Unnamed: 0,datetime_hour,trip_count,hour,day_of_week,month,is_weekend,season,temperature,precipitation,wind_speed,is_holiday
0,2025-01-01 00:00:00,336,0,2,1,0,winter,-1.5,0.0,21.0,1
1,2025-01-01 01:00:00,436,1,2,1,0,winter,-1.5,0.0,21.0,1
2,2025-01-01 02:00:00,213,2,2,1,0,winter,-1.5,0.0,21.0,1
3,2025-01-01 03:00:00,57,3,2,1,0,winter,-1.5,0.0,21.0,1
4,2025-01-01 04:00:00,24,4,2,1,0,winter,-1.5,0.0,21.0,1


### 6.2 Final Checks

In [23]:
# Check for missing values
print("Checking for missing values:")
missing = df_final_2024.isnull().sum()
if missing.sum() > 0:
    print("Missing values detected:")
    print(missing[missing > 0])
else:
    print("No missing values!")

# Descriptive statistics
print("\nDescriptive statistics:")
df_final_2024.describe()

Checking for missing values:
No missing values!

Descriptive statistics:


Unnamed: 0,datetime_hour,trip_count,hour,day_of_week,month,is_weekend,temperature,precipitation,wind_speed,is_holiday
count,8782,8782.0,8782.0,8782.0,8782.0,8782.0,8782.0,8782.0,8782.0,8782.0
mean,2024-07-02 00:16:02.924162816,667.338647,11.502277,2.986108,6.514689,0.284104,12.301412,0.261535,12.652961,0.030061
min,2024-01-01 00:00:00,1.0,0.0,0.0,1.0,0.0,-21.0,0.0,2.7,0.0
25%,2024-04-01 13:15:00,131.0,6.0,1.0,4.0,0.0,3.8,0.0,8.3,0.0
50%,2024-07-02 00:30:00,442.5,12.0,3.0,7.0,0.0,14.1,0.0,12.0,0.0
75%,2024-10-01 11:45:00,997.0,17.75,5.0,10.0,1.0,21.0,0.0,16.2,0.0
max,2024-12-31 23:00:00,3789.0,23.0,6.0,12.0,1.0,30.4,24.0,29.7,1.0
std,,667.821275,6.921719,2.003422,3.451118,0.451012,10.077546,1.848902,5.486283,0.170766


In [24]:
# Check the distribution of the target
print("Distribution of trip_count (TARGET):")
print(df_final_2024['trip_count'].describe())

# Visualization
fig = px.histogram(df_final_2024, x='trip_count', nbins=50,
                   title='Distribution of the number of trips per hour',
                   labels={'trip_count': 'Number of trips'})
fig.show()

Distribution of trip_count (TARGET):
count    8782.000000
mean      667.338647
std       667.821275
min         1.000000
25%       131.000000
50%       442.500000
75%       997.000000
max      3789.000000
Name: trip_count, dtype: float64


### 6.3 Encoding Categorical Variables

We apply one-hot encoding to the `season` variable to turn this categorical variable into binary variables. This lets ML models handle seasons correctly without imposing an artificial order between them.

**One-Hot Encoding of 'season'**:

In [25]:
# One-hot encode 'season'
print("Encoding the 'season' variable...")
season_dummies_2024 = pd.get_dummies(df_final_2024['season'], prefix='season', drop_first=False)
df_final_2024 = pd.concat([df_final_2024.drop('season', axis=1), season_dummies_2024], axis=1)

print(f"Encoding complete")
print(f"   New columns: {season_dummies_2024.columns.tolist()}")
print(f"\nFinal shape: {df_final_2024.shape}")
df_final_2024.head()

Encoding the 'season' variable...
Encoding complete
   New columns: ['season_fall', 'season_spring', 'season_summer', 'season_winter']

Final shape: (8782, 14)


Unnamed: 0,datetime_hour,trip_count,hour,day_of_week,month,is_weekend,temperature,precipitation,wind_speed,is_holiday,season_fall,season_spring,season_summer,season_winter
0,2024-01-01 00:00:00,180,0,0,1,0,-1.0,0.0,15.8,1,False,False,False,True
1,2024-01-01 01:00:00,373,1,0,1,0,-1.0,0.0,15.8,1,False,False,False,True
2,2024-01-01 02:00:00,238,2,0,1,0,-1.0,0.0,15.8,1,False,False,False,True
3,2024-01-01 03:00:00,49,3,0,1,0,-1.0,0.0,15.8,1,False,False,False,True
4,2024-01-01 04:00:00,23,4,0,1,0,-1.0,0.0,15.8,1,False,False,False,True


In [26]:
# Encode 2025
season_dummies_2025 = pd.get_dummies(df_final_2025['season'], prefix='season', drop_first=False)
df_final_2025 = pd.concat([df_final_2025.drop('season', axis=1), season_dummies_2025], axis=1)
print(f"Encoding 2025 complete: {df_final_2025.shape}")
print(f"   New columns: {season_dummies_2025.columns.tolist()}")

Encoding 2025 complete: (8758, 14)
   New columns: ['season_fall', 'season_spring', 'season_summer', 'season_winter']
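
Encoding train and test separately works here because both years contain all four seasons. If the test set covered only part of a year, `pd.get_dummies` would produce fewer columns and the two datasets would no longer line up. The sketch below, on hypothetical toy data, shows one way to guard against that by reusing the training columns.

```python
import pandas as pd

train = pd.DataFrame({"season": ["winter", "spring", "summer", "fall"]})
test = pd.DataFrame({"season": ["winter", "spring"]})  # hypothetical partial year

train_dummies = pd.get_dummies(train["season"], prefix="season").astype(int)
test_dummies = pd.get_dummies(test["season"], prefix="season").astype(int)

# Reuse the training columns; seasons absent from the test set become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_dummies.columns.tolist())
```

After the reindex, both frames have the same columns in the same order, which is what a fitted model expects at prediction time.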


---

## 7. Train/Test Split

Save the train (2024) and test (2025) datasets

### 7.1 Saving Datasets

In [27]:
# Save 2024 (train)
output_file_2024 = PROCESSED_PATH / 'train_2024_hourly.csv'
df_final_2024.to_csv(output_file_2024, index=False)
print(f"TRAIN dataset saved: {output_file_2024}")
print(f"   Shape: {df_final_2024.shape}")
print(f"   Size: {output_file_2024.stat().st_size / 1024**2:.2f} MB")

TRAIN dataset saved: ..\data\processed\train_2024_hourly.csv
   Shape: (8782, 14)
   Size: 0.61 MB


In [28]:
# Save 2025 (test)
output_file_2025 = PROCESSED_PATH / 'test_2025_hourly.csv'
df_final_2025.to_csv(output_file_2025, index=False)
print(f"TEST dataset saved: {output_file_2025}")
print(f"   Shape: {df_final_2025.shape}")
print(f"   Size: {output_file_2025.stat().st_size / 1024**2:.2f} MB")

TEST dataset saved: ..\data\processed\test_2025_hourly.csv
   Shape: (8758, 14)
   Size: 0.60 MB
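
CSV files do not store column types, so `datetime_hour` comes back as plain text unless it is parsed again on load. The sketch below illustrates the round trip with an in-memory buffer and hypothetical toy rows. In the modeling notebook the same `parse_dates` argument would be passed when reading `train_2024_hourly.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt shaped like the saved dataset.
saved = pd.DataFrame({
    "datetime_hour": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"]),
    "trip_count": [180, 373],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
saved.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype lost in the CSV round trip.
reloaded = pd.read_csv(buffer, parse_dates=["datetime_hour"])
print(reloaded.dtypes)
```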


### 7.2 Final Summary

In [29]:
# Full summary
print("="*80)
print("FEATURE ENGINEERING SUMMARY")
print("="*80)

print(f"\nTRAIN Dataset (2024):")
print(f"   Shape: {df_final_2024.shape}")
print(f"   Period: {df_final_2024['datetime_hour'].min()} → {df_final_2024['datetime_hour'].max()}")
print(f"   Target (trip_count): min={df_final_2024['trip_count'].min()}, "
      f"max={df_final_2024['trip_count'].max()}, "
      f"mean={df_final_2024['trip_count'].mean():.0f}")

print(f"\nTEST Dataset (2025):")
print(f"   Shape: {df_final_2025.shape}")
print(f"   Period: {df_final_2025['datetime_hour'].min()} → {df_final_2025['datetime_hour'].max()}")
print(f"   Target (trip_count): min={df_final_2025['trip_count'].min()}, "
      f"max={df_final_2025['trip_count'].max()}, "
      f"mean={df_final_2025['trip_count'].mean():.0f}")

print(f"\nFeatures created ({df_final_2024.shape[1]-2} features + 1 target + 1 datetime):")
feature_list = [col for col in df_final_2024.columns if col not in ['datetime_hour', 'trip_count']]
print(f"   {feature_list}")

print("\nFeature types:")
print(f"   Temporal: hour, day_of_week, month, is_weekend, season_*")
print(f"   Weather: temperature, precipitation, wind_speed")
print(f"   Calendar: is_holiday")

print(f"\nNote on hour difference:")
print(f"   2024 (leap year): {df_final_2024.shape[0]} hours")
print(f"   2025 (regular year): {df_final_2025.shape[0]} hours")
print(f"   Difference: {abs(df_final_2024.shape[0] - df_final_2025.shape[0])} hours (expected: 2024 has one extra day as a leap year)")

print("\n" + "="*80)
print("FEATURE ENGINEERING COMPLETED")
print("="*80)
print(f"\nProcessing date: {datetime.now().strftime('%Y-%m-%d %H:%M')}")
print("Next step: Modeling (notebook 03)")

FEATURE ENGINEERING SUMMARY

TRAIN Dataset (2024):
   Shape: (8782, 14)
   Period: 2024-01-01 00:00:00 → 2024-12-31 23:00:00
   Target (trip_count): min=1, max=3789, mean=667

TEST Dataset (2025):
   Shape: (8758, 14)
   Period: 2025-01-01 00:00:00 → 2025-12-31 23:00:00
   Target (trip_count): min=1, max=3491, mean=634

Features created (12 features + 1 target + 1 datetime):
   ['hour', 'day_of_week', 'month', 'is_weekend', 'temperature', 'precipitation', 'wind_speed', 'is_holiday', 'season_fall', 'season_spring', 'season_summer', 'season_winter']

Feature types:
   Temporal: hour, day_of_week, month, is_weekend, season_*
   Weather: temperature, precipitation, wind_speed
   Calendar: is_holiday

Note on hour difference:
   2024 (leap year): 8782 hours
   2025 (regular year): 8758 hours
   Difference: 24 hours (expected: 2024 has one extra day as a leap year)

FEATURE ENGINEERING COMPLETED

Processing date: 2026-01-19 12:47
Next step: Modeling (notebook 03)


---

## Feature Engineering Completed

**Date:** 2026-01-19  
**Status:** Completed  
**Next:** Linear Regression, Random Forest, XGBoost

### What was done:
- Hourly trip aggregation (2024 + 2025)
- Simple, interpretable temporal features
- Merged with daily weather data
- Missing precipitation values filled with 0
- Holiday indicator
- One-hot encoding for 'season' variable
- Train/test datasets saved

### Datasets ready for modeling:
- `train_2024_hourly.csv`: 8,782 hours
- `test_2025_hourly.csv`: 8,758 hours
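
For the modeling step, the saved tables need to be split into a feature matrix and a target vector. The sketch below shows that split on a hypothetical two-row excerpt. The variable names `X_train` and `y_train` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Hypothetical excerpt with the same kind of columns as the saved train set.
train = pd.DataFrame({
    "datetime_hour": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"]),
    "trip_count": [180, 373],
    "hour": [0, 1],
    "is_holiday": [1, 1],
})

target_col = "trip_count"
# datetime_hour is kept in the file for reference only and is not a model input.
non_feature_cols = ["datetime_hour", target_col]

X_train = train.drop(columns=non_feature_cols)
y_train = train[target_col]

print(X_train.columns.tolist(), y_train.tolist())
```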

### Final statistics:
- **12 features** total after one-hot encoding seasons (14 columns including `datetime_hour` and the `trip_count` target)
- **No missing values** in final datasets
- **Hour difference** between 2024 and 2025 (24 hours) explained by the extra leap day in 2024

---