
This cell imports `pandas`, `glob`, and `os`, finds every CSV file in the current directory, and reads each one into a pandas DataFrame. It then prints the number of missing values in the 'incident' column of each DataFrame, concatenates all DataFrames into a single `df_all`, and displays its shape and first 5 rows.

In [None]:
import pandas as pd
import glob
import os

# List every CSV file in the current directory
csv_files = glob.glob("*.csv")

# (Optional) If your files are in a specific folder:
# csv_files = glob.glob("/content/mon_dossier/*.csv")

print("Files found:", csv_files)

dfs = [pd.read_csv(f, encoding="utf-8") for f in csv_files]

# Number of missing 'incident' values in each file
for df in dfs:
    print(df['incident'].isna().sum())

df_all = pd.concat(dfs, ignore_index=True)
print(df_all.shape)
df_all.head()


Files found: ['IND_race_2 (1).csv', 'IND_race_1 (1).csv', 'COTA_race_1 (1).csv', 'BARBER_race_1 (1).csv', 'COTA_race_2 (1).csv', 'SON_race_1 (1).csv', 'RA_race_1.csv', 'RA_race_2.csv', 'SEB_race_2 (1).csv', 'BARBER_race_2 (1).csv', 'SEB_race_1 (1).csv']
0
0
12
12
20
29
170
2
0
14
1
(4986, 78)


Unnamed: 0,vehicle_number,driver_number,lap_number,lap_time,lap_improvement,crossing_finish_line_in_pit,s1,s1_improvement,s2,s2_improvement,...,pbrake_r,speed,incident,best_lap_time,loss_per_lap,ath,int-1_time,int-1_elapsed,int-2_time,int-2_elapsed
0,2,1,1,0 days 00:01:52.476000,0,,38.431,0,40.412,0,...,2.499473,86.550548,0.0,0 days 00:01:41.003000,11.473,,,,,
1,2,1,2,0 days 00:02:04.809000,0,,35.068,0,40.306,0,...,3.118171,93.533951,0.0,0 days 00:01:41.003000,23.806,,,,,
2,2,1,3,0 days 00:02:13.716000,0,,50.63,0,41.998,0,...,5.059603,123.409228,1.0,0 days 00:01:41.003000,32.713,,,,,
3,2,1,4,0 days 00:01:43.847000,0,,35.789,0,34.448,0,...,3.45025,113.729261,0.0,0 days 00:01:41.003000,2.844,,,,,
4,2,1,5,0 days 00:01:42.299000,0,,35.346,0,34.178,0,...,2.420892,105.783729,0.0,0 days 00:01:41.003000,1.296,,,,,
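The per-file missing counts above are printed without labels, so it is not obvious which race each number belongs to. A minimal sketch of one way to keep that link, assuming the same 'incident' column; the `load_races` and `missing_incidents_by_file` helpers and the `source_file` column are hypothetical names introduced here for illustration:

```python
import glob
import os

import pandas as pd


def load_races(pattern="*.csv"):
    """Read every CSV matching `pattern` and tag each row with its file name."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path, encoding="utf-8")
        # Hypothetical helper column: remember which file each row came from.
        df["source_file"] = os.path.basename(path)
        frames.append(df)
    if not frames:
        raise FileNotFoundError(f"No CSV files match {pattern!r}")
    return pd.concat(frames, ignore_index=True)


def missing_incidents_by_file(df_all):
    """Count missing 'incident' values per source file."""
    return df_all["incident"].isna().groupby(df_all["source_file"]).sum()
```

Calling `missing_incidents_by_file(load_races())` would return one labelled count per file instead of a bare list of numbers.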


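This cell parses the 'lap_time' column, which is read from the CSV files as strings such as `0 days 00:01:52.476000`, into timedeltas and then converts it to total seconds (float) for numerical operations. With `errors="coerce"`, any value that cannot be parsed becomes missing instead of raising an error.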

In [None]:
df_all["lap_time"] = pd.to_timedelta(df_all["lap_time"], errors="coerce")

# Convert to seconds (float)
df_all["lap_time"] = df_all["lap_time"].dt.total_seconds()
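To make the two-step conversion concrete, here is a small sketch on made-up values in the same string format as the 'lap_time' column, plus one deliberately malformed entry. The `raw` Series is illustrative and not taken from the race data:

```python
import pandas as pd

# Sample strings in the same format as 'lap_time', plus one malformed entry
# to show what errors="coerce" does.
raw = pd.Series(["0 days 00:01:52.476000", "0 days 00:02:04.809000", "not a time"])

as_timedelta = pd.to_timedelta(raw, errors="coerce")  # malformed value -> NaT
seconds = as_timedelta.dt.total_seconds()             # NaT -> NaN

print(seconds.tolist())
```

The first two entries come out as roughly 112.476 and 124.809 seconds, and the malformed one as NaN, so it is worth checking `seconds.isna().sum()` after a coerced conversion.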

This cell calculates the sample variance of 'lap_time' for each 'vehicle_number' across all concatenated races and writes it to every row of that vehicle in a new column called 'lap_time_variance'.

In [None]:
df_all["lap_time_variance"] = df_all.groupby("vehicle_number")["lap_time"].transform("var")
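The difference between `transform("var")` and a plain aggregation is that the result keeps one value per row. A small sketch on a toy frame (the numbers are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "vehicle_number": [2, 2, 2, 7, 7],
    "lap_time": [100.0, 102.0, 104.0, 90.0, 94.0],
})

# Per-vehicle sample variance (ddof=1), repeated on every row of that vehicle.
toy["lap_time_variance"] = toy.groupby("vehicle_number")["lap_time"].transform("var")

print(toy["lap_time_variance"].tolist())  # [4.0, 4.0, 4.0, 8.0, 8.0]
```

Because pandas uses the sample variance, a vehicle with a single lap would get NaN here, which is one reason to check the feature columns for missing values before training.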

This cell removes outliers from the 'loss_per_lap' column by filtering out rows where 'loss_per_lap' is greater than its 99th percentile.

In [None]:
q99 = df_all['loss_per_lap'].quantile(0.99)
df_all = df_all[df_all['loss_per_lap'] <= q99]
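One side effect of this filter is easy to miss: rows where 'loss_per_lap' is missing are dropped too, because a comparison with NaN evaluates to False. A minimal sketch on invented values, with a variant that keeps those rows (the `keep_nan` name is hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"loss_per_lap": [1.0, 2.0, 3.0, 250.0, np.nan]})

# quantile() ignores NaN when computing the threshold.
q99 = toy["loss_per_lap"].quantile(0.99)

# Same filter as above: removes the outlier AND the NaN row.
strict = toy[toy["loss_per_lap"] <= q99]

# Variant that removes only the outlier and keeps rows with a missing value.
keep_nan = toy[(toy["loss_per_lap"] <= q99) | toy["loss_per_lap"].isna()]

print(len(strict), len(keep_nan))  # 3 4
```

Which variant is appropriate depends on whether a missing 'loss_per_lap' should disqualify a lap from training; the notebook's filter takes the stricter option.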


This cell drops rows where the 'incident' column has missing values, ensuring that the target variable is complete.

In [None]:
df_all = df_all.dropna(subset=["incident"])

This cell defines the `features` (predictor variables) and the `target` (the variable to be predicted) for the machine learning model.

In [None]:
features = ["accx_can", "accy_can", "Steering_Angle", "pbrake_f", "speed", "lap_time_variance","tyre_age"]
target = "incident"

This cell prepares the data for model training. It imports `RandomForestClassifier`, `train_test_split`, `roc_auc_score`, `classification_report`, and `StandardScaler` (`RandomForestClassifier` and `classification_report` are not used in the cells shown here). It then separates features (X) and target (y), scales the features with a `StandardScaler` fitted on the full dataset, and splits the data into 80% training and 20% test sets.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler

X = df_all[features]
y = df_all[target]

# Standardize the features (the scaler is fitted on the full dataset)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Keep the original row index so X stays aligned with y and df_all
X = pd.DataFrame(X_scaled, columns=features, index=df_all.index)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
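The cell above fits the scaler on all rows before splitting, so the test rows contribute to the scaling statistics. Tree-based models are largely insensitive to feature scaling, so the effect is probably small here, but a common alternative is to split first and fit the scaler on the training rows only. A sketch on synthetic data; the `*_demo` names, the three feature columns, and the `stratify` argument are assumptions added for illustration, not part of the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and the binary 'incident' target.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series((np.arange(200) % 10 == 0).astype(int), name="incident")

# Split first; stratify keeps the share of incidents similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = pd.DataFrame(scaler_demo.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_s = pd.DataFrame(scaler_demo.transform(X_te), columns=X_te.columns, index=X_te.index)
```

With this ordering the saved scaler reflects only data the model was allowed to see, which matches how it will be applied to new laps at inference time.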

This cell trains an XGBoost classifier model. It initializes the `XGBClassifier` with specified hyperparameters, fits the model to the training data, predicts the incident risk probability for every row of `df_all` (training rows included, so those values are in-sample), and prints the ROC AUC score on the test set.

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

model.fit(X_train, y_train)

df_all["incident_risk_prob"] = model.predict_proba(X)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))


ROC AUC: 0.9565095559249281
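For readers less familiar with the metric, ROC AUC can be read as the probability that a randomly chosen incident lap receives a higher predicted risk than a randomly chosen non-incident lap. The sketch below computes it directly from that definition on a toy example; `roc_auc_pairwise` is a hypothetical helper written for illustration and is not how scikit-learn computes the score internally:

```python
import numpy as np


def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. The cost is O(n_pos * n_neg), so this is only
    suitable for small samples.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("Need at least one positive and one negative label")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(labels, probs))  # 0.75
```

Since the metric depends only on how the laps are ranked, it says nothing about whether the probabilities are well calibrated or which decision threshold to use.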


This cell checks for any remaining missing values in the selected `features` columns to ensure data quality before further processing or deployment.

In [None]:
df_all[features].isna().sum()


Unnamed: 0,0
accx_can,0
accy_can,0
Steering_Angle,0
pbrake_f,0
speed,0
lap_time_variance,0
tyre_age,0


This cell imports the `joblib` library, which is used for efficiently saving and loading Python objects, particularly large NumPy arrays or models.

In [None]:
import joblib

This cell saves the trained XGBoost model to a file named 'model2.pkl' using `joblib.dump` for later use or deployment.

In [None]:
joblib.dump(model, "model2.pkl")

['model2.pkl']

This cell downloads the saved model file ('model2.pkl') to the local machine using Google Colab's `files.download` utility.

In [None]:
from google.colab import files
files.download("model2.pkl")


This cell saves the fitted `StandardScaler` object to a file named 'scaler_model2.pkl' and then downloads it, allowing for consistent scaling of new data when the model is used for inference.

In [None]:
joblib.dump(scaler, "scaler_model2.pkl")
from google.colab import files
files.download("scaler_model2.pkl")
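Once both files are saved, new data has to go through the same scaler and the same feature order before it reaches the model. A minimal sketch of that inference step; `score_laps` and `FEATURES` are hypothetical names, and the function only assumes that the loaded objects expose `transform` and `predict_proba`, as the scaler and model saved above do:

```python
import pandas as pd

# Same feature order as used for training above.
FEATURES = ["accx_can", "accy_can", "Steering_Angle", "pbrake_f",
            "speed", "lap_time_variance", "tyre_age"]


def score_laps(model, scaler, laps):
    """Return incident probabilities for a DataFrame of new laps."""
    missing = [c for c in FEATURES if c not in laps.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    # Apply the saved scaling, keeping column names and the row index.
    X_new = pd.DataFrame(
        scaler.transform(laps[FEATURES]), columns=FEATURES, index=laps.index
    )
    # Column 1 of predict_proba is the probability of the positive class.
    return pd.Series(model.predict_proba(X_new)[:, 1], index=laps.index,
                     name="incident_risk_prob")


# Usage (assumes both files sit in the working directory and that
# `new_laps` is a DataFrame containing the feature columns):
# import joblib
# model = joblib.load("model2.pkl")
# scaler = joblib.load("scaler_model2.pkl")
# probs = score_laps(model, scaler, new_laps)
```

Failing early on missing columns avoids a less readable error from inside the scaler or the model when an upstream feature is renamed or dropped.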
