# Model Training and Evaluation

**Project:** Predicting Divvy bike usage in Chicago  
**Author:** Noé  
**Date:** 2026-01-14  
**Objective:** Train 3 ML models and select the best one

---

## Objective

**Predict the number of trips PER HOUR** to optimize bike availability.

### Models Compared:
1. **Linear Regression** - Simple baseline model
2. **Random Forest** - Captures non-linear interactions
3. **XGBoost** - State-of-the-art gradient boosting

### Evaluation Metrics:
- **RMSE** (Root Mean Squared Error)
- **MAE** (Mean Absolute Error)
- **R²** (Coefficient of determination) - Target ≥ 0.75
- **MAPE** (Mean Absolute Percentage Error) - Target ≤ 20%
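
For reference, the four metrics can be written as follows, where $y_i$ is the observed hourly trip count, $\hat{y}_i$ the prediction, $\bar{y}$ the mean of the observed counts, and $n$ the number of hours. These are the standard textbook definitions, which is what `evaluate_model` computes later in this notebook.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},
\qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
$$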

---

## 1. Setup and Data Loading

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
from pathlib import Path
import joblib

# Machine Learning
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# Configuration
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
np.random.seed(42)

print("Libraries loaded")

Libraries loaded


In [2]:
# Paths
DATA_PATH = Path('../data')
PROCESSED_PATH = DATA_PATH / 'processed'
MODELS_PATH = Path('../models')
MODELS_PATH.mkdir(exist_ok=True)

print("Paths configured")

Paths configured


### 1.1 Loading Datasets

In [3]:
# Load train and test sets
print("Loading datasets...")
df_train = pd.read_csv(PROCESSED_PATH / 'train_2024_hourly.csv')
df_test = pd.read_csv(PROCESSED_PATH / 'test_2025_hourly.csv')

print(f"\nDatasets loaded:")
print(f"   Train 2024: {df_train.shape}")
print(f"   Test 2025:  {df_test.shape}")

df_train.head()  # not displayed: a notebook cell only renders its last expression
df_test.head()

Loading datasets...

Datasets loaded:
   Train 2024: (8782, 14)
   Test 2025:  (8758, 14)


| | datetime_hour | trip_count | hour | day_of_week | month | is_weekend | temperature | precipitation | wind_speed | is_holiday | season_fall | season_spring | season_summer | season_winter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-01-01 00:00:00 | 336 | 0 | 2 | 1 | 0 | -1.5 | 0.0 | 21.0 | 1 | False | False | False | True |
| 1 | 2025-01-01 01:00:00 | 436 | 1 | 2 | 1 | 0 | -1.5 | 0.0 | 21.0 | 1 | False | False | False | True |
| 2 | 2025-01-01 02:00:00 | 213 | 2 | 2 | 1 | 0 | -1.5 | 0.0 | 21.0 | 1 | False | False | False | True |
| 3 | 2025-01-01 03:00:00 | 57 | 3 | 2 | 1 | 0 | -1.5 | 0.0 | 21.0 | 1 | False | False | False | True |
| 4 | 2025-01-01 04:00:00 | 24 | 4 | 2 | 1 | 0 | -1.5 | 0.0 | 21.0 | 1 | False | False | False | True |
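
The shapes above show 8,782 rows for 2024 and 8,758 for 2025, whereas a leap year has 8,784 hours and a regular year 8,760. This suggests that two hourly rows are absent from each file, or were dropped upstream. A minimal sketch of how such gaps could be listed, using a small synthetic frame in place of the real CSV (the `find_missing_hours` helper is hypothetical and not part of the notebook):

```python
import pandas as pd

def find_missing_hours(df, col="datetime_hour"):
    """Return the hourly timestamps between the first and last row that are absent from df."""
    hours = pd.to_datetime(df[col])
    expected = pd.date_range(hours.min(), hours.max(), freq="h")
    return expected.difference(pd.DatetimeIndex(hours))

# Synthetic example: six consecutive hours with the 02:00 row removed
demo = pd.DataFrame({
    "datetime_hour": pd.date_range("2025-01-01 00:00", periods=6, freq="h").astype(str)
}).drop(index=2)

missing = find_missing_hours(demo)
print(list(missing))
```

Applied to `df_train` and `df_test`, the same helper would return whichever hours are missing between the first and last timestamp of each file.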


### 1.2 Preparing X and y

In [4]:
# Separate features and target
X_train = df_train.drop(['datetime_hour', 'trip_count'], axis=1)
y_train = df_train['trip_count']

X_test = df_test.drop(['datetime_hour', 'trip_count'], axis=1)
y_test = df_test['trip_count']

print(f"Data prepared:")
print(f"   X_train: {X_train.shape}")
print(f"   y_train: {y_train.shape}")
print(f"\nFeatures: {X_train.columns.tolist()}")

Data prepared:
   X_train: (8782, 12)
   y_train: (8782,)

Features: ['hour', 'day_of_week', 'month', 'is_weekend', 'temperature', 'precipitation', 'wind_speed', 'is_holiday', 'season_fall', 'season_spring', 'season_summer', 'season_winter']


---

## 2. Model 1: Linear Regression

In [5]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train
print("Training Linear Regression...")
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)

# Predict
y_train_pred_lr = lr_model.predict(X_train_scaled)
y_test_pred_lr = lr_model.predict(X_test_scaled)

print("Linear Regression completed")

Training Linear Regression...
Linear Regression completed
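
Because the linear model is fitted on standardized features, the scaler and the regressor have to travel together at prediction time. One way to keep them in sync is scikit-learn's `Pipeline`. The sketch below shows the idea on synthetic data; the variable names and toy values are illustrative assumptions, not the notebook's actual objects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two synthetic features on very different scales
X_demo = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# The pipeline fits the scaler on the training rows only and reapplies it at predict time
lr_pipeline = make_pipeline(StandardScaler(), LinearRegression())
lr_pipeline.fit(X_demo[:150], y_demo[:150])

# Raw (unscaled) features go in; scaling happens inside the pipeline
r2_holdout = lr_pipeline.score(X_demo[150:], y_demo[150:])
print(f"Hold-out R²: {r2_holdout:.4f}")
```

A single pipeline object can then be saved and reloaded as one file, which avoids pairing a model with the wrong scaler.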


In [6]:
# Evaluation function
def evaluate_model(y_true, y_pred, name="Model"):
    """Print and return RMSE, MAE, R² and MAPE for one set of predictions."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # MAPE is undefined if y_true contains zeros and is inflated by very small counts
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
    print(f"\n{name}:")
    print(f"  RMSE: {rmse:.2f}")
    print(f"  MAE:  {mae:.2f}")
    print(f"  R²:   {r2:.4f}")
    print(f"  MAPE: {mape:.2f}%")
    
    return {'RMSE': rmse, 'MAE': mae, 'R2': r2, 'MAPE': mape}


In [7]:
# Evaluate Linear Regression
lr_train = evaluate_model(y_train, y_train_pred_lr, "Train")
lr_test = evaluate_model(y_test, y_test_pred_lr, "Test")


Train:
  RMSE: 530.85
  MAE:  400.30
  R²:   0.3681
  MAPE: 300.10%

Test:
  RMSE: 506.77
  MAE:  382.62
  R²:   0.3858
  MAPE: 313.23%
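
A MAPE above 300% next to an MAE of roughly 400 trips is worth a second look. MAPE divides each error by the actual count, so hours with very few trips (for example the 24 trips at 04:00 in the sample rows above) can dominate the average, and an hour with zero trips would make it undefined. The sketch below illustrates the effect on made-up numbers and compares it with a volume-weighted alternative, WAPE (sum of absolute errors over sum of actuals). WAPE is an assumption added here for illustration and is not a metric used in this notebook.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, skipping hours whose actual count is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

def wape(y_true, y_pred):
    """Weighted absolute percentage error: total absolute error over total volume."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100

# Made-up hourly counts: two quiet night hours and two busy daytime hours
actual = [10, 20, 1000, 2000]
predicted = [110, 120, 1100, 2100]  # every prediction is off by exactly 100 trips

print(f"MAPE: {mape(actual, predicted):.1f}%")
print(f"WAPE: {wape(actual, predicted):.1f}%")
```

With identical absolute errors everywhere, the two quiet hours alone push MAPE into the hundreds of percent while WAPE stays low, which is why MAPE should be read alongside MAE and RMSE for count data with quiet periods.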


---

## 3. Model 2: Random Forest

In [8]:
# Train Random Forest
print("Training Random Forest...")
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_split=5,
    random_state=42,
    n_jobs=-1,
    verbose=1
)
rf_model.fit(X_train, y_train)

# Predict
y_train_pred_rf = rf_model.predict(X_train)
y_test_pred_rf = rf_model.predict(X_test)

print("Random Forest completed")

Training Random Forest...


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.1s


Random Forest completed


[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.4s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done 100 out of 100 | elapsed:    0.0s finished


In [9]:
# Evaluate Random Forest
rf_train = evaluate_model(y_train, y_train_pred_rf, "Train")
rf_test = evaluate_model(y_test, y_test_pred_rf, "Test")


Train:
  RMSE: 84.23
  MAE:  49.98
  R²:   0.9841
  MAPE: 14.54%

Test:
  RMSE: 205.07
  MAE:  122.15
  R²:   0.8994
  MAPE: 34.49%


In [10]:
# Feature importance
rf_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\n Feature Importance:")
print(rf_importance)

fig = px.bar(rf_importance, x='importance', y='feature', orientation='h',
             title='Random Forest - Feature Importance')
fig.show()


 Feature Importance:
          feature  importance
0            hour    0.534208
4     temperature    0.329756
1     day_of_week    0.046657
6      wind_speed    0.035968
3      is_weekend    0.019012
2           month    0.015989
8     season_fall    0.006642
7      is_holiday    0.005089
10  season_summer    0.002735
9   season_spring    0.001653
5   precipitation    0.001442
11  season_winter    0.000849
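
Impurity-based importances such as `feature_importances_` are computed on the training data and can favor continuous or high-cardinality features. A common cross-check is permutation importance on held-out rows, available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below uses a small synthetic dataset with hypothetical columns, so its numbers say nothing about the Divvy data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(42)
n = 600
X_demo = pd.DataFrame({
    "hour": rng.integers(0, 24, size=n),
    "temperature": rng.normal(10, 8, size=n),
    "noise": rng.normal(size=n),  # unrelated to the target by construction
})
# Synthetic target: depends on hour and temperature only
y_demo = (50 * np.sin(X_demo["hour"] / 24 * 2 * np.pi)
          + 5 * X_demo["temperature"]
          + rng.normal(size=n))

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_demo.iloc[:450], y_demo.iloc[:450])

# Shuffle one column at a time on held-out rows and measure the drop in R²
result = permutation_importance(
    model, X_demo.iloc[450:], y_demo.iloc[450:], n_repeats=5, random_state=42
)
perm_importance = pd.Series(
    result.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(perm_importance)
```

If the ranking obtained this way on the test set broadly agrees with the impurity-based one, the dominance of `hour` and `temperature` is less likely to be an artifact of the importance measure.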


---

## 4. Model 3: XGBoost 🚀

In [11]:
# Train XGBoost
print("Training XGBoost...")
xgb_model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=8,
    learning_rate=0.1,
    subsample=0.8,
    random_state=42,
    n_jobs=-1
)
xgb_model.fit(X_train, y_train)

# Predict
y_train_pred_xgb = xgb_model.predict(X_train)
y_test_pred_xgb = xgb_model.predict(X_test)

print("\n XGBoost completed!")

Training XGBoost...

 XGBoost completed!


In [12]:
# Evaluate XGBoost
xgb_train = evaluate_model(y_train, y_train_pred_xgb, "Train")
xgb_test = evaluate_model(y_test, y_test_pred_xgb, "Test")


Train:
  RMSE: 38.64
  MAE:  25.68
  R²:   0.9967
  MAPE: 11.18%

Test:
  RMSE: 208.66
  MAE:  124.01
  R²:   0.8959
  MAPE: 36.17%


---

## 5. Model Comparison 🏆

In [13]:
# Full comparison table (Train + Test)
comparison = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'XGBoost'],
    'Train_R2': [lr_train['R2'], rf_train['R2'], xgb_train['R2']],
    'Test_R2': [lr_test['R2'], rf_test['R2'], xgb_test['R2']],
    'Train_RMSE': [lr_train['RMSE'], rf_train['RMSE'], xgb_train['RMSE']],
    'Test_RMSE': [lr_test['RMSE'], rf_test['RMSE'], xgb_test['RMSE']],
    'Train_MAE': [lr_train['MAE'], rf_train['MAE'], xgb_train['MAE']],
    'Test_MAE': [lr_test['MAE'], rf_test['MAE'], xgb_test['MAE']],
    'Train_MAPE': [lr_train['MAPE'], rf_train['MAPE'], xgb_train['MAPE']],
    'Test_MAPE': [lr_test['MAPE'], rf_test['MAPE'], xgb_test['MAPE']]
})

# Add Gap column (Train - Test R²)
comparison['R2_Gap'] = comparison['Train_R2'] - comparison['Test_R2']

print("Comparison of the 3 models")

print(comparison.to_string(index=False))

# Best model
best_idx = comparison['Test_R2'].idxmax()
best_model = comparison.loc[best_idx, 'Model']
best_r2 = comparison.loc[best_idx, 'Test_R2']
best_mape = comparison.loc[best_idx, 'Test_MAPE']
best_gap = comparison.loc[best_idx, 'R2_Gap']

print("\n" + "-"*90)
print(f"🏆 The best model is: {best_model}")
print(f"\n   • Test R²: {best_r2:.4f}")
print(f"   • MAPE: {best_mape:.2f}%")
print(f"   • Train/test gap: {best_gap:.4f}", end="")
if best_gap < 0.10:
    print(" → the model generalizes well ")
elif best_gap < 0.15:
    print(" → slight overfitting, but acceptable ")
else:
    print(" → warning, overfitting detected")
print("-"*90)

Comparison of the 3 models
            Model  Train_R2  Test_R2  Train_RMSE  Test_RMSE  Train_MAE   Test_MAE  Train_MAPE  Test_MAPE    R2_Gap
Linear Regression  0.368075 0.385802  530.845739 506.774456 400.296099 382.619940  300.097300 313.232525 -0.017727
    Random Forest  0.984090 0.899431   84.229808 205.065062  49.984944 122.149323   14.536771  34.494014  0.084659
          XGBoost  0.996651 0.895878   38.644237 208.656613  25.681711 124.012604   11.183673  36.170424  0.100773

------------------------------------------------------------------------------------------
🏆 The best model is: Random Forest

   • Test R²: 0.8994
   • MAPE: 34.49%
   • Train/test gap: 0.0847 → the model generalizes well 
------------------------------------------------------------------------------------------


In [14]:
# Improved comparative visualization
fig = make_subplots(rows=1, cols=3, 
                    subplot_titles=('R² Score (Train vs Test)', 'MAPE %', 'R² Gap (Overfitting)'))

# Plot 1: R² Train vs Test
fig.add_trace(go.Bar(x=comparison['Model'], y=comparison['Train_R2'], 
                     name='Train R²', marker_color='lightblue'), row=1, col=1)
fig.add_trace(go.Bar(x=comparison['Model'], y=comparison['Test_R2'], 
                     name='Test R²', marker_color='steelblue'), row=1, col=1)

# Plot 2: MAPE
fig.add_trace(go.Bar(x=comparison['Model'], y=comparison['Test_MAPE'], 
                     name='MAPE', marker_color='coral', showlegend=False), row=1, col=2)

# Plot 3: R² Gap
fig.add_trace(go.Bar(x=comparison['Model'], y=comparison['R2_Gap'], 
                     name='R² Gap', marker_color='orange', showlegend=False), row=1, col=3)

# Reference lines
fig.add_hline(y=0.75, line_dash="dash", line_color="green", 
              annotation_text="R² target", row=1, col=1)
fig.add_hline(y=20, line_dash="dash", line_color="red", 
              annotation_text="MAPE target", row=1, col=2)
fig.add_hline(y=0.10, line_dash="dash", line_color="red", 
              annotation_text="Overfitting threshold", row=1, col=3)

fig.update_layout(title="Performance overview", height=500, barmode='group')
fig.update_yaxes(title_text="R²", row=1, col=1)
fig.update_yaxes(title_text="MAPE (%)", row=1, col=2)
fig.update_yaxes(title_text="Gap", row=1, col=3)
fig.show()

---

## 6. Prediction Visualization 📊

In [15]:
# Select the best model's predictions
if best_model == 'Linear Regression':
    best_pred = y_test_pred_lr
elif best_model == 'Random Forest':
    best_pred = y_test_pred_rf
else:
    best_pred = y_test_pred_xgb

# Predictions vs Actual
fig = go.Figure()
fig.add_trace(go.Scatter(x=y_test, y=best_pred, mode='markers',
                         marker=dict(size=3, opacity=0.5)))
fig.add_trace(go.Scatter(x=[0, y_test.max()], y=[0, y_test.max()],
                         mode='lines', line=dict(color='red', dash='dash')))
fig.update_layout(title=f'{best_model} - Predictions vs Actual',
                  xaxis_title='Actual', yaxis_title='Predicted', height=500)
fig.show()

In [16]:
# Timeline (first week)
sample = slice(0, 168)
fig = go.Figure()
fig.add_trace(go.Scatter(y=y_test.iloc[sample].values, name='Actual',
                         line=dict(color='steelblue')))
fig.add_trace(go.Scatter(y=best_pred[sample], name='Predicted',
                         line=dict(color='coral', dash='dash')))
fig.update_layout(title='Timeline - First week of 2025',
                  xaxis_title='Hours', yaxis_title='Trip Count', height=500)
fig.show()

---

## 7. Overfitting/Underfitting Analysis

To check that the model generalizes well, we analyze:
- The gap between train and test performance
- The learning curves, to understand the model's behavior
- Stability under cross-validation

### 7.1 Train vs Test Gap

In [17]:
# Analyze the gap from the comparison dataframe
print("\nOverfitting/underfitting analysis:\n")
print(comparison[['Model', 'Train_R2', 'Test_R2', 'R2_Gap']].to_string(index=False))

# Interpret the gap for the best model
best_gap = comparison.loc[comparison['Test_R2'].idxmax(), 'R2_Gap']
print(f"\nFor {best_model}:")
print(f"  Train-test gap = {best_gap:.4f}")
if best_gap < 0.10:
    print("  -> No overfitting, the model generalizes well")
elif best_gap < 0.15:
    print("  -> Slight overfitting, but still acceptable")
else:
    print("  -> Warning, overfitting detected")


Overfitting/underfitting analysis:

            Model  Train_R2  Test_R2    R2_Gap
Linear Regression  0.368075 0.385802 -0.017727
    Random Forest  0.984090 0.899431  0.084659
          XGBoost  0.996651 0.895878  0.100773

For Random Forest:
  Train-test gap = 0.0847
  -> No overfitting, the model generalizes well


### 7.2 Learning Curves

To better understand the model's behavior, we plot the learning curves of the best model.

In [18]:
from sklearn.model_selection import learning_curve

print(f"Computing learning curves for {best_model}...\n")

# Select the model and the data
if best_model == 'XGBoost':
    model_to_test = xgb_model
    X_to_test = X_train
elif best_model == 'Random Forest':
    model_to_test = rf_model
    X_to_test = X_train
else:
    model_to_test = lr_model
    X_to_test = X_train_scaled

# Compute learning curves
train_sizes = np.linspace(0.1, 1.0, 8)
train_sizes_abs, train_scores, val_scores = learning_curve(
    model_to_test, X_to_test, y_train,
    train_sizes=train_sizes,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    random_state=42  # only takes effect when shuffle=True
)

train_scores_mean = train_scores.mean(axis=1)
val_scores_mean = val_scores.mean(axis=1)

print("Done!")

Computing learning curves for Random Forest...

Done!


In [19]:
# Visualization
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=train_sizes_abs, y=train_scores_mean,
    name='Train score',
    mode='lines+markers',
    line=dict(color='blue', width=2)
))

fig.add_trace(go.Scatter(
    x=train_sizes_abs, y=val_scores_mean,
    name='Validation score (CV)',
    mode='lines+markers',
    line=dict(color='red', width=2)
))

fig.update_layout(
    title=f'Learning Curves - {best_model}',
    xaxis_title='Number of training examples',
    yaxis_title='R²',
    height=500
)
fig.show()

# Interpretation
print("\nInterpretation:")
if val_scores_mean[-1] > 0.75:
    print(f"The curves converge toward a good score ({val_scores_mean[-1]:.3f})")
    print("The model learns well and generalizes correctly.")
else:
    print("The curves suggest that more data could improve the model.")


Interpretation:
The curves converge toward a good score (0.763)
The model learns well and generalizes correctly.


### 7.3 Cross-Validation

To check the stability of the best model, we use 5-fold cross-validation.

In [20]:
from sklearn.model_selection import cross_val_score

print(f"Cross-validation for {best_model}...\n")

# CV only for the best model
if best_model == 'XGBoost':
    cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring='r2', n_jobs=-1)
elif best_model == 'Random Forest':
    cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='r2', n_jobs=-1)
else:
    cv_scores = cross_val_score(lr_model, X_train_scaled, y_train, cv=5, scoring='r2', n_jobs=-1)

print(f"Scores for the 5 folds: {cv_scores}")
print(f"Mean: {cv_scores.mean():.4f}")
print(f"Standard deviation: {cv_scores.std():.4f}")

if cv_scores.std() < 0.05:
    print("\nThe model is stable (low variance across folds)")
else:
    print("\nThe model shows some variance depending on the data")

Cross-validation for Random Forest...

Scores for the 5 folds: [0.74054813 0.82625906 0.84314591 0.81704151 0.59019965]
Mean: 0.7634
Standard deviation: 0.0935

The model shows some variance depending on the data
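
With `cv=5`, `cross_val_score` uses an unshuffled 5-fold split for regressors. If the training rows are ordered by time, each fold is a contiguous block of roughly ten weeks, which may explain part of the spread between folds (0.59 to 0.84): a block can fall in a season that the other folds cover poorly. For time-ordered data, a forward-chaining split such as scikit-learn's `TimeSeriesSplit` is a common alternative. The sketch below runs it on synthetic hourly data; all names and values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(42)
n_hours = 24 * 60  # sixty synthetic days, in chronological order
hour = np.arange(n_hours) % 24
temperature = 10 + 10 * np.sin(np.arange(n_hours) / n_hours * np.pi) + rng.normal(size=n_hours)
X_demo = np.column_stack([hour, temperature])
y_demo = (100 + 80 * np.sin(hour / 24 * 2 * np.pi)
          + 5 * temperature
          + rng.normal(scale=5, size=n_hours))

# Each split trains on the past and validates on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X_demo):
    assert train_idx.max() < val_idx.min()  # no future rows in the training part

model = RandomForestRegressor(n_estimators=50, random_state=42)
ts_scores = cross_val_score(model, X_demo, y_demo, cv=tscv, scoring="r2")
print(ts_scores.round(3), ts_scores.mean().round(3))
```

Scores from this scheme are usually lower for the early splits, because the model sees less history, so they are not directly comparable with the 5-fold figures above.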


In [21]:
# Save all models
joblib.dump(lr_model, MODELS_PATH / 'linear_regression.pkl')
joblib.dump(scaler, MODELS_PATH / 'scaler.pkl')
joblib.dump(rf_model, MODELS_PATH / 'random_forest.pkl')
joblib.dump(xgb_model, MODELS_PATH / 'xgboost.pkl')

# Save the best one
if best_model == 'Linear Regression':
    joblib.dump(lr_model, MODELS_PATH / 'best_model.pkl')
elif best_model == 'Random Forest':
    joblib.dump(rf_model, MODELS_PATH / 'best_model.pkl')
else:
    joblib.dump(xgb_model, MODELS_PATH / 'best_model.pkl')

comparison.to_csv(MODELS_PATH / 'model_comparison.csv', index=False)

print("✅ All models saved!")
print(f"   Best: {best_model}")

✅ All models saved!
   Best: Random Forest
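
To reuse a saved model later, the file can be read back with `joblib.load` and given a frame with the same columns, in the same order, as the training features. The sketch below round-trips a small stand-in model through a temporary directory; the toy data and file location are placeholders, not the notebook's `../models` artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "hour": rng.integers(0, 24, size=200),
    "temperature": rng.normal(10, 8, size=200),
})
y_demo = 20 * X_demo["hour"] + 3 * X_demo["temperature"]

model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save to a temporary location and load the model back
with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "best_model.pkl"
    joblib.dump(model, model_file)
    reloaded = joblib.load(model_file)

# New rows must expose the same feature columns, in the same order, as at fit time
new_rows = pd.DataFrame({"hour": [8, 17], "temperature": [5.0, 22.0]})
predictions = reloaded.predict(new_rows)
print(predictions)
```

If the linear model had been selected instead, `scaler.pkl` would also need to be loaded and applied to the features before calling `predict`.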
