<a href="https://colab.research.google.com/github/kazikamil/backend_track/blob/main/tracking_model_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cell imports the required libraries (pandas, glob, os), lists all CSV files in the current directory, and reads each one into a DataFrame. For every file it computes `remaining_laps` as the file's maximum `lap_number` minus the current `lap_number`, and prints the affected row when a file has exactly one missing (NaN) `incident` value. Finally, it concatenates the DataFrames into a single DataFrame, `df_all`.

In [None]:
import pandas as pd
import glob
import os

# List all CSV files in the current directory
csv_files = glob.glob("*.csv")

# (Optional) If your files are in a specific folder:
# csv_files = glob.glob("/content/mon_dossier/*.csv")

print("Files found:", csv_files)

dfs = [pd.read_csv(f, encoding="utf-8") for f in csv_files]

for df in dfs:
    # Laps left, relative to the last lap recorded in this file
    df["remaining_laps"] = df["lap_number"].max() - df["lap_number"]
    # Print the row when a file has exactly one missing 'incident' value
    if df["incident"].isna().sum() == 1:
        print(df[df["incident"].isna()])

# Stack all races into a single DataFrame
df_all = pd.concat(dfs, ignore_index=True)
print(df_all.shape)
df_all.head()


Files found: ['IND_race_2 (1).csv', 'IND_race_1 (1).csv', 'COTA_race_1 (1).csv', 'BARBER_race_1 (1).csv', 'COTA_race_2 (1).csv', 'SON_race_1 (1).csv', 'RA_race_1.csv', 'RA_race_2.csv', 'SEB_race_2 (1).csv', 'BARBER_race_2 (1).csv', 'SEB_race_1 (1).csv']
    vehicle_number  driver_number  lap_number                lap_time  \
52              11              1           9  0 days 00:02:30.663000   

    lap_improvement crossing_finish_line_in_pit      s1  s1_improvement  \
52                0                         NaN  44.282               0   

        s2  s2_improvement  ... ath  gear  nmot pbrake_f pbrake_r speed  \
52  52.848               0  ... NaN   NaN   NaN      NaN      NaN   NaN   

   incident           best_lap_time  loss_per_lap remaining_laps  
52      NaN  0 days 00:02:28.095000         2.568             10  

[1 rows x 67 columns]
(4986, 79)


Unnamed: 0,vehicle_number,driver_number,lap_number,lap_time,lap_improvement,crossing_finish_line_in_pit,s1,s1_improvement,s2,s2_improvement,...,speed,incident,best_lap_time,loss_per_lap,remaining_laps,ath,int-1_time,int-1_elapsed,int-2_time,int-2_elapsed
0,2,1,1,0 days 00:01:52.476000,0,,38.431,0,40.412,0,...,86.550548,0.0,0 days 00:01:41.003000,11.473,22,,,,,
1,2,1,2,0 days 00:02:04.809000,0,,35.068,0,40.306,0,...,93.533951,0.0,0 days 00:01:41.003000,23.806,21,,,,,
2,2,1,3,0 days 00:02:13.716000,0,,50.63,0,41.998,0,...,123.409228,1.0,0 days 00:01:41.003000,32.713,20,,,,,
3,2,1,4,0 days 00:01:43.847000,0,,35.789,0,34.448,0,...,113.729261,0.0,0 days 00:01:41.003000,2.844,19,,,,,
4,2,1,5,0 days 00:01:42.299000,0,,35.346,0,34.178,0,...,105.783729,0.0,0 days 00:01:41.003000,1.296,18,,,,,


This cell converts the `lap_time` column from timedelta-formatted strings (for example `0 days 00:01:52.476000`) to total seconds (float) for numerical processing. With `errors='coerce'`, any value that cannot be parsed becomes `NaT` (Not a Time), which `total_seconds()` then turns into `NaN`.

In [None]:
df_all["lap_time"] = pd.to_timedelta(df_all["lap_time"], errors="coerce")

# Convert to seconds (float)
df_all["lap_time"] = df_all["lap_time"].dt.total_seconds()
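As a minimal sketch of this conversion on toy data (the series `raw` and its invalid entry are made up for illustration), the following shows how a bad value passes through as `NaT` and ends up as `NaN`:

```python
import pandas as pd

# Toy lap times in the same string format as the CSV files, plus one invalid entry.
raw = pd.Series(["0 days 00:01:52.476000", "0 days 00:02:04.809000", "not a time"])

lap_td = pd.to_timedelta(raw, errors="coerce")  # the invalid entry becomes NaT
lap_seconds = lap_td.dt.total_seconds()         # NaT becomes NaN

print(lap_seconds)
```

Counting `lap_seconds.isna().sum()` after the conversion is a quick way to see how many values were silently coerced.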

This cell removes outliers from the `loss_per_lap` column. It computes the 99th percentile of `loss_per_lap` and keeps only the rows at or below it, which drops the top 1% of loss values. Rows where `loss_per_lap` is `NaN` are dropped as well, because the comparison evaluates to `False` for them.

In [None]:
q99 = df_all['loss_per_lap'].quantile(0.99)
df_all = df_all[df_all['loss_per_lap'] <= q99]
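The effect of this filter can be checked on a small example. The sketch below uses a hypothetical toy frame with 99 ordinary values, one extreme value, and one missing value; it is only meant to illustrate how the quantile cut behaves:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 99 ordinary laps, one extreme lap, one missing value.
toy = pd.DataFrame(
    {"loss_per_lap": list(np.linspace(0.0, 9.8, 99)) + [300.0, np.nan]}
)

q99_toy = toy["loss_per_lap"].quantile(0.99)  # quantile() ignores NaN
kept = toy[toy["loss_per_lap"] <= q99_toy]    # NaN <= q99_toy is False, so NaN rows go too

print(len(toy), len(kept))
```

Both the extreme lap and the missing-value row are removed, so the later `dropna` calls never see a `NaN` in `loss_per_lap`.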

This cell removes rows from the DataFrame `df_all` where the 'incident' column has missing (NaN) values, ensuring that subsequent analyses involving 'incident' are based on complete data.

In [None]:
df_all = df_all.dropna(subset=["incident"])

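This cell calculates a new feature, `expected_gain_if_pit_now`, which estimates the potential gain if a car pits now by comparing the current `loss_per_lap` with the mean `loss_per_lap` of the next 5 rows of `df_all` (fewer near the end of the frame; the last row gets `NaN`). A positive value means the upcoming laps lose less time than the current one. This aims to quantify the immediate benefit of a pit stop. Note that the window runs over consecutive rows of the concatenated DataFrame, so it does not reset at vehicle or race boundaries.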

In [None]:
import numpy as np

# Window of 5 laps used to estimate future losses
window = 5
expected_gain_if_pit_now = []

for i in range(len(df_all)):
    future_losses = df_all['loss_per_lap'].iloc[i+1:i+1+window]
    if len(future_losses) > 0:
        gain = df_all['loss_per_lap'].iloc[i] - np.mean(future_losses)
    else:
        gain = np.nan
    expected_gain_if_pit_now.append(gain)

df_all['expected_gain_if_pit_now'] = expected_gain_if_pit_now
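The loop above looks ahead over consecutive rows of the whole frame. One possible variant, sketched here under the assumption that each row can be tied to a race and a car, computes the same look-ahead mean per group so the window never crosses into another car's laps. The `race` column and the helper `forward_mean` are hypothetical and do not exist in the notebook; the window is shortened to 2 to keep the toy data small.

```python
import numpy as np
import pandas as pd

def forward_mean(losses: pd.Series, window: int) -> pd.Series:
    """Mean of the next `window` values (current row excluded); NaN if none remain."""
    reversed_losses = losses.iloc[::-1]
    # On the reversed series, the "next" laps are the previous rows.
    rolled = reversed_losses.shift(1).rolling(window, min_periods=1).mean()
    return rolled.iloc[::-1]

# Hypothetical toy data: two races for the same car number.
toy = pd.DataFrame({
    "race": ["A"] * 4 + ["B"] * 3,
    "vehicle_number": [2] * 7,
    "lap_number": [1, 2, 3, 4, 1, 2, 3],
    "loss_per_lap": [10.0, 4.0, 2.0, 6.0, 1.0, 3.0, 5.0],
})
toy = toy.sort_values(["race", "vehicle_number", "lap_number"])

future = toy.groupby(["race", "vehicle_number"])["loss_per_lap"].transform(
    lambda s: forward_mean(s, window=2)
)
toy["expected_gain_if_pit_now"] = toy["loss_per_lap"] - future

print(toy)
```

In this sketch the last lap of race A gets `NaN` instead of borrowing the first laps of race B, which is the behavioral difference from the row-based loop.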


This cell calculates `traffic_density`. It groups the data by `lap_number` and assigns every row in a group the inverse of that group's standard deviation of `speed`, on the assumption that a low spread in speed indicates denser traffic. A small constant, `1e-3`, is added to the denominator to avoid division by zero. Because the grouping key is `lap_number` alone, rows from all races that share a lap number fall into the same group.

In [None]:
traffic_density = (
    df_all.groupby("lap_number")["speed"]
      .transform(lambda x: 1 / (x.std() + 1e-3))  # low std → high density
)
df_all["traffic_density"] = traffic_density
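To make the inverse-spread idea concrete, here is a small sketch on made-up speeds (the toy values are assumptions chosen so the standard deviations are easy to verify by hand):

```python
import pandas as pd

# Hypothetical speeds: lap 1 is tightly packed, lap 2 is spread out, lap 3 has one row.
toy = pd.DataFrame({
    "lap_number": [1, 1, 1, 2, 2, 2, 3],
    "speed": [100.0, 100.5, 99.5, 80.0, 100.0, 120.0, 90.0],
})

toy["traffic_density"] = toy.groupby("lap_number")["speed"].transform(
    lambda x: 1 / (x.std() + 1e-3)
)

print(toy)
```

Lap 1 has a sample standard deviation of 0.5 and lap 2 of 20, so lap 1 receives the much larger density. A group with a single row has an undefined sample standard deviation, so its density is `NaN`.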

This cell defines the list of feature columns (`features`) that will be used for training the machine learning model and sets the target variable (`target`) to `expected_gain_if_pit_now`.

In [None]:
features = [
    "lap_number", "lap_time", "remaining_laps", "accx_can", "accy_can",
    "Steering_Angle", "pbrake_f", "speed", "incident", "tyre_age",
]
target = "expected_gain_if_pit_now"

This cell checks for missing values (NaN) within the specified feature columns. It prints the count of `NaN` values for each feature, helping to identify if any further data cleaning is needed before model training.

In [None]:
df_all[features].isna().sum()

Unnamed: 0,0
lap_number,0
lap_time,0
remaining_laps,0
accx_can,0
accy_can,0
Steering_Angle,0
pbrake_f,0
speed,0
incident,0
tyre_age,0


This cell removes any rows where the target variable (`expected_gain_if_pit_now`) is missing (NaN).

In [None]:
df_all = df_all.dropna(subset=[target])


This cell separates the features (`X`) and the target (`y`), standardizes the features with `StandardScaler`, and splits the data into training (80%) and test (20%) sets for model development. Note that the scaler is fitted on the full dataset before the split, so the scaling statistics also reflect the test rows.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df_all[features]
y = df_all[target]

# Standardize features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Rebuild a DataFrame, keeping the original index so X stays aligned with y
X = pd.DataFrame(X_scaled, columns=features, index=df_all.index)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
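The cell above fits the scaler before splitting. A common alternative, sketched here on random toy data (all names ending in `_toy` and the two-column frame are assumptions for illustration), is to split first and fit the scaler on the training rows only, so no test-set statistics enter the preprocessing:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "lap_time": rng.normal(100, 5, 200),
    "speed": rng.normal(120, 10, 200),
})
y_toy = pd.Series(rng.normal(0, 1, 200))

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

scaler_toy = StandardScaler().fit(X_tr)  # statistics come from the training rows only
X_tr_scaled = pd.DataFrame(scaler_toy.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler_toy.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.mean().round(6))
```

For tree-based models such as XGBoost, feature scaling has little effect on the fitted trees, so this mainly matters if the same pipeline is later reused with a scale-sensitive model.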

This cell initializes and trains an XGBoost Regressor model using the scaled training data (`X_train`, `y_train`). After training, it makes predictions on the test set (`X_test`) and evaluates the model's performance using R-squared (R²), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

# Model definition
model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# Training
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R² : {r2:.3f}")
print(f"MAE : {mae:.3f}")
print(f"RMSE : {rmse:.3f}")


R² : 0.903
MAE : 4.095
RMSE : 6.400
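For readers who want to see what these three metrics measure, the sketch below recomputes them by hand with NumPy on four made-up values (the arrays `y_true_toy` and `y_hat_toy` are illustrative and unrelated to the results above):

```python
import numpy as np

y_true_toy = np.array([3.0, -1.0, 2.0, 0.0])
y_hat_toy = np.array([2.0, 0.0, 2.0, 1.0])

residuals = y_true_toy - y_hat_toy                        # [1, -1, 0, -1]

mae_toy = np.mean(np.abs(residuals))                      # mean absolute error
rmse_toy = np.sqrt(np.mean(residuals ** 2))               # root mean squared error
ss_res = np.sum(residuals ** 2)                           # residual sum of squares
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)    # total sum of squares
r2_toy = 1 - ss_res / ss_tot                              # coefficient of determination

print(f"R2: {r2_toy:.3f}  MAE: {mae_toy:.3f}  RMSE: {rmse_toy:.3f}")
```

MAE and RMSE are expressed in the same units as the target, while R² is unitless and compares the model against always predicting the mean of the target.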


This cell imports the `joblib` library, which is commonly used for efficient serialization and deserialization of Python objects, particularly large NumPy arrays and models from scikit-learn or similar libraries.

In [None]:
import joblib

This cell saves the trained XGBoost model to a file named 'model3.pkl' using `joblib.dump()`. This allows the model to be reloaded and reused later without needing to retrain it.

In [None]:
joblib.dump(model, "model3.pkl")

['model3.pkl']

This cell downloads the saved model file ('model3.pkl') to the user's local machine using `files.download()` from Google Colab's utility functions.

In [None]:
from google.colab import files
files.download("model3.pkl")


This cell imports the `xgboost` library and prints its version. This is useful for reproducibility, ensuring that the model is used or re-trained with a compatible version of the library.

In [None]:
import xgboost
xgboost.__version__

'3.1.1'

This cell saves the `StandardScaler` object (used to scale the features) to a file named 'scaler_model3.pkl' and then downloads it. Saving the scaler is crucial to ensure that new data is scaled consistently with the training data before making predictions with the saved model.

In [None]:
joblib.dump(scaler, "scaler_model3.pkl")
from google.colab import files
files.download("scaler_model3.pkl")
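A quick way to confirm that a saved scaler behaves identically after reloading is a round trip through `joblib`. The sketch below uses a toy scaler and a temporary directory; the file name `scaler_demo.pkl` and the toy arrays are hypothetical and stand in for the real artifacts:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data standing in for the real feature matrix.
train = np.array([[90.0, 100.0], [100.0, 120.0], [110.0, 140.0]])
scaler_demo = StandardScaler().fit(train)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler_demo.pkl")  # hypothetical file name
    joblib.dump(scaler_demo, path)
    restored = joblib.load(path)

# New data must go through the restored scaler before reaching the model.
new_row = np.array([[105.0, 130.0]])
same = np.allclose(scaler_demo.transform(new_row), restored.transform(new_row))
print(same)
```

At inference time the same order applies to the real files: load the scaler, transform the incoming features with the columns in the same order as `features`, then call `predict` on the loaded model.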
