# MACHINE LEARNING PROJECT (REGRESSION): PREDICTING CONCRETE STRENGTH

**Authors**:
* HOUSSAM EL AOUTMANI
* ABDELALI HIRRI
* SOUFIANE EL AMRAOUI

## Objective:
* **Regression**: Predict the compressive strength of concrete as a continuous value (MPa)

## Algorithms used:
* Linear Regression
* Decision Tree Regressor
* Random Forest Regressor
* Gradient Boosting Regressor
* XGBoost Regressor
* Support Vector Regressor (SVR)

## 1. Importing the Libraries

In [100]:
import pandas as pd
import numpy as np
import warnings
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from xgboost import XGBRegressor
from sklearn.metrics import (mean_squared_error, r2_score, mean_absolute_error)
warnings.filterwarnings('ignore')


## 2. Loading and Exploring the Data

In [101]:
df = pd.read_csv('concrete_data.csv')
print(f"\nDataset dimensions: {df.shape[0]} rows × {df.shape[1]} columns\n")
print("First rows:")
print(df.head(10))


Dataset dimensions: 1030 rows × 9 columns

First rows:
   cement  blast_furnace_slag  fly_ash  water  superplasticizer  \
0   540.0                 0.0      0.0  162.0               2.5   
1   540.0                 0.0      0.0  162.0               2.5   
2   332.5               142.5      0.0  228.0               0.0   
3   332.5               142.5      0.0  228.0               0.0   
4   198.6               132.4      0.0  192.0               0.0   
5   266.0               114.0      0.0  228.0               0.0   
6   380.0                95.0      0.0  228.0               0.0   
7   380.0                95.0      0.0  228.0               0.0   
8   266.0               114.0      0.0  228.0               0.0   
9   475.0                 0.0      0.0  228.0               0.0   

   coarse_aggregate  fine_aggregate   age  concrete_compressive_strength  
0            1040.0            676.0   28                          79.99  
1            1055.0            676.0   28   

In [102]:
print("\nGeneral information:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print("\nDescriptive statistics:")
print(df.describe())


General information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   cement                         1030 non-null   float64
 1   blast_furnace_slag             1030 non-null   float64
 2   fly_ash                        1030 non-null   float64
 3   water                          1030 non-null   float64
 4   superplasticizer               1030 non-null   float64
 5   coarse_aggregate               1030 non-null   float64
 6   fine_aggregate                 1030 non-null   float64
 7   age                            1030 non-null   int64  
 8   concrete_compressive_strength  1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.6 KB

Descriptive statistics:
            cement  blast_furnace_slag      fly_ash        water  \
count  1030.000000         1030.000000  1030.000000  1030

## 3. Data Cleaning and Preparation

In [103]:
print("\nMissing values per column:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values")


Missing values per column:
No missing values


In [104]:
duplicates = df.duplicated().sum()
print(f"\nNumber of duplicates: {duplicates}")
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"    {duplicates} duplicates removed")


Number of duplicates: 25
    25 duplicates removed


In [105]:
print("\nChecking for negative values:")
for col in df.columns:
    neg_count = (df[col] < 0).sum()
    if neg_count > 0:
        print(f"    {col}: {neg_count} negative values")



Checking for negative values:


In [106]:
print("\nOutlier detection (IQR method): ")
outlier_summary = {}
for col in df.columns:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # A value is flagged when it lies more than 1.5 × IQR outside the quartiles
    outliers = ((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))).sum()
    outlier_summary[col] = outliers
    if outliers > 0:
        print(f"    {col}: {outliers} outliers ({outliers/len(df)*100:.2f}%) ")


# Outliers are only reported, not removed: df_clean keeps every deduplicated row
df_clean = df.copy()

print(f"\nFinal dataset: {df_clean.shape[0]} rows × {df_clean.shape[1]} columns")


Outlier detection (IQR method): 
    blast_furnace_slag: 2 outliers (0.20%) 
    water: 15 outliers (1.49%) 
    superplasticizer: 10 outliers (1.00%) 
    fine_aggregate: 5 outliers (0.50%) 
    age: 59 outliers (5.87%) 
    concrete_compressive_strength: 8 outliers (0.80%) 

Final dataset: 1005 rows × 9 columns
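
To make the 1.5 × IQR rule used above concrete, here is a minimal sketch on a small made-up array (the values are illustrative and not taken from the dataset). It relies on `np.percentile`, whose default linear interpolation is the same as the default of `Series.quantile`.

```python
import numpy as np

# Illustrative values (not from the concrete dataset); 100 is an obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A point is flagged when it falls outside [lower_fence, upper_fence]
outlier_mask = (values < lower_fence) | (values > upper_fence)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {values[outlier_mask]}")
```

Only the value 100 falls outside the fences here. The rule flags candidates for inspection; whether to drop them remains a separate decision.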


## 4. Exploratory Analysis and Visualization with Plotly

In [134]:
fig1 = px.histogram(df_clean, x='concrete_compressive_strength', 
                    nbins=50,
                    title='Distribution of Concrete Strength',
                    labels={'concrete_compressive_strength': 'Strength (MPa)'},
                    color_discrete_sequence=['#1f77b4'])
fig1.update_layout(showlegend=False, height=400)
fig1.show()

In [108]:
fig2 = go.Figure()
for col in df_clean.columns:
    fig2.add_trace(go.Box(y=df_clean[col], name=col))
fig2.update_layout(title='Distribution of the Variables (Boxplot)',
                    yaxis_title='Values',
                    height=500)
fig2.show()

In [109]:
corr_matrix = df_clean.corr()
fig3 = px.imshow(corr_matrix, 
                 text_auto='.2f',
                 aspect='auto',
                 color_continuous_scale='RdBu_r',
                 title='Correlation Matrix')
fig3.update_layout(height=600)
fig3.show()

In [110]:
# Drop the target's self-correlation by name rather than by position
correlations = corr_matrix['concrete_compressive_strength'].drop('concrete_compressive_strength').sort_values(ascending=False)
fig4 = px.bar(x=correlations.index, y=correlations.values,
              title='Correlation of the Variables with Strength',
              labels={'x': 'Variables', 'y': 'Correlation'},
              color=correlations.values,
              color_continuous_scale='RdYlGn')
fig4.update_layout(height=400)
fig4.show()

In [111]:
fig5 = px.scatter(df_clean, x='cement', y='concrete_compressive_strength',
                  color='age', size='water',
                  title='Cement-Strength Relationship (colored by age)',
                  labels={'cement': 'Cement (kg/m³)', 
                          'concrete_compressive_strength': 'Strength (MPa)',
                          'age': 'Age (days)',
                          'water': 'Water (kg/m³)'})
fig5.show()

In [112]:
fig6 = px.box(df_clean, x='age', y='concrete_compressive_strength',
              title='Concrete Strength by Age',
              labels={'age': 'Age (days)', 
                      'concrete_compressive_strength': 'Strength (MPa)'})
fig6.show()

print("Visualizations created.")

Visualizations created.


## 5. Preparing the Data for Machine Learning (70/20/10 Split and Standardization)

In [113]:
X = df_clean.drop('concrete_compressive_strength', axis=1)
y = df_clean['concrete_compressive_strength']

# Hold out the test set (20%); the remaining 80% is train + validation
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split the remaining 80% into train (70% of the total) and validation (10% of the total): 0.125 × 0.8 = 0.10
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.125, random_state=42
)

print("\nData prepared:")
print(f"    Train (70%): {X_train.shape[0]} samples")
print(f"    Test (20%): {X_test.shape[0]} samples")
print(f"    Validation (10%): {X_val.shape[0]} samples")


Data prepared:
    Train (70%): 703 samples
    Test (20%): 201 samples
    Validation (10%): 101 samples
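
The two-stage split can be checked with a few lines of arithmetic. This is a sketch under one assumption, namely that `train_test_split` rounds a fractional test size up to the next whole sample; with that assumption the counts printed above are reproduced exactly.

```python
import math

n_total = 1005  # rows left after removing the duplicates

# First split: 20% of all rows go to the test set.
# Assumption: a fractional test size is rounded up (ceil).
n_test = math.ceil(0.2 * n_total)
n_train_val = n_total - n_test

# Second split: 12.5% of the remaining 80% go to validation,
# i.e. 0.125 * 0.8 = 0.10 of the full dataset.
n_val = math.ceil(0.125 * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
print(f"{n_train / n_total:.3f} {n_val / n_total:.3f} {n_test / n_total:.3f}")
```

The shares come out close to, but not exactly, 70/10/20 because sample counts must be whole numbers.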


In [114]:
# Standardize the features; the scaler is fitted on the training set only, then applied to test and validation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_val_scaled = scaler.transform(X_val)
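
One subtlety: the cross-validation cells of section 6 pass `X_train_scaled` to `cross_val_score`, so the scaler has already seen the rows that each fold holds out. A possible alternative, sketched here on synthetic data and offered as an assumption rather than as the notebook's method, wraps the scaler and the model in a pipeline so that the scaler is refitted inside every fold. The names `X_demo`, `y_demo`, `svr_pipeline` and `kf_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical, 8 features like the concrete dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.1, size=200)

# Inside cross_val_score, the whole pipeline is cloned and refitted per fold,
# so the scaler only ever sees the fitting part of that fold
svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, gamma=0.1))
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
pipeline_scores = cross_val_score(svr_pipeline, X_demo, y_demo, cv=kf_demo, scoring='r2')

print(f"Fold R² scores: {np.round(pipeline_scores, 4)}")
print(f"Mean R²: {pipeline_scores.mean():.4f}")
```

With standardization the leak is usually small, but the pipeline form also guarantees that the same preprocessing is applied at prediction time.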

## 6. REGRESSION MODELS and Cross-Validation

### Initialization and Configuration

In [115]:
# Dictionary storing the results and the fitted models
regression_results = {}

# K-Fold configuration for cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42, max_depth=10),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, max_depth=15),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42, max_depth=5),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42, max_depth=5, learning_rate=0.1),
    'SVR': SVR(kernel='rbf', C=100, gamma=0.1)
}

print("Models and K-Fold cross-validation initialized.")

Models and K-Fold cross-validation initialized.
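
As a quick illustration of what the `KFold` configuration above does, the sketch below splits 703 row indices (the training-set size reported in section 5) into five shuffled folds and prints how many rows each fold holds out. The names `n_train_rows`, `kf_demo` and `fold_sizes` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

n_train_rows = 703  # size of the training set reported in section 5
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Each iteration yields index arrays for the fitting part and the held-out part
fold_sizes = [len(held_out_idx) for _, held_out_idx in kf_demo.split(np.arange(n_train_rows))]
print(fold_sizes)
```

Every row is held out exactly once, so each cross-validated R² is computed on roughly 140 rows while the model is fitted on the other four folds.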


### 6.1 Linear Regression

In [116]:
name = 'Linear Regression'
model = models[name]
X_train_data = X_train_scaled
X_test_data = X_test_scaled

cv_scores = cross_val_score(model, X_train_data, y_train, cv=kf, scoring='r2')
cv_r2_mean = np.mean(cv_scores)
model.fit(X_train_data, y_train)
y_pred_test = model.predict(X_test_data)

r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_test = mean_absolute_error(y_test, y_pred_test)

regression_results[name] = {
    'model': model, 'CV_R2': cv_r2_mean, 'R2': r2_test, 'RMSE': rmse_test, 'MAE': mae_test, 'predictions': y_pred_test
}

print(f"Results - {name}:\nR² CV: {cv_r2_mean:.4f} | R² Test: {r2_test:.4f} | RMSE: {rmse_test:.4f}")

fig = go.Figure()
fig.add_trace(go.Scatter(x=y_test, y=y_pred_test, mode='markers', name='Predictions', marker=dict(color='blue', opacity=0.5)))
fig.add_trace(go.Scatter(x=[y_test.min(), y_test.max()], y=[y_test.min(), y_test.max()], mode='lines', name='Ideal', line=dict(color='red', dash='dash')))
fig.update_layout(title=f'Predictions vs Actual - {name}', xaxis_title='Actual', yaxis_title='Predicted', height=400)
fig.show()

Results - Linear Regression:
R² CV: 0.5888 | R² Test: 0.5825 | RMSE: 11.1597
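
For readers who want to see what the three reported metrics measure, here is a small worked example that computes MAE, RMSE and R² by hand on made-up numbers (the values are illustrative and unrelated to the dataset).

```python
import math

# Illustrative values (not from the dataset)
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average absolute error, in the unit of the target
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the average squared error, penalizes large errors more
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R²: 1 minus the ratio of residual variation to total variation around the mean
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R²={r2:.4f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off, which is why both are reported side by side in the summary table.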


### 6.2 Decision Tree Regressor (unscaled features)

In [117]:
name = 'Decision Tree'
model = models[name]
X_train_data = X_train
X_test_data = X_test

cv_scores = cross_val_score(model, X_train_data, y_train, cv=kf, scoring='r2')
cv_r2_mean = np.mean(cv_scores)
model.fit(X_train_data, y_train)
y_pred_test = model.predict(X_test_data)

r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_test = mean_absolute_error(y_test, y_pred_test)

regression_results[name] = {
    'model': model, 'CV_R2': cv_r2_mean, 'R2': r2_test, 'RMSE': rmse_test, 'MAE': mae_test, 'predictions': y_pred_test
}

print(f"Results - {name}:\nR² CV: {cv_r2_mean:.4f} | R² Test: {r2_test:.4f} | RMSE: {rmse_test:.4f}")



Results - Decision Tree:
R² CV: 0.7489 | R² Test: 0.8164 | RMSE: 7.4000


### 6.3 Random Forest Regressor

In [118]:
name = 'Random Forest'
model = models[name]
X_train_data = X_train
X_test_data = X_test

cv_scores = cross_val_score(model, X_train_data, y_train, cv=kf, scoring='r2')
cv_r2_mean = np.mean(cv_scores)
model.fit(X_train_data, y_train)
y_pred_test = model.predict(X_test_data)

r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_test = mean_absolute_error(y_test, y_pred_test)

regression_results[name] = {
    'model': model, 'CV_R2': cv_r2_mean, 'R2': r2_test, 'RMSE': rmse_test, 'MAE': mae_test, 'predictions': y_pred_test
}

print(f"Results - {name}:\nR² CV: {cv_r2_mean:.4f} | R² Test: {r2_test:.4f} | RMSE: {rmse_test:.4f}")



Results - Random Forest:
R² CV: 0.8737 | R² Test: 0.9041 | RMSE: 5.3491


### 6.4 Gradient Boosting Regressor

In [119]:
name = 'Gradient Boosting'
model = models[name]
X_train_data = X_train
X_test_data = X_test

cv_scores = cross_val_score(model, X_train_data, y_train, cv=kf, scoring='r2')
cv_r2_mean = np.mean(cv_scores)
model.fit(X_train_data, y_train)
y_pred_test = model.predict(X_test_data)

r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_test = mean_absolute_error(y_test, y_pred_test)

regression_results[name] = {
    'model': model, 'CV_R2': cv_r2_mean, 'R2': r2_test, 'RMSE': rmse_test, 'MAE': mae_test, 'predictions': y_pred_test
}

print(f"Results - {name}:\nR² CV: {cv_r2_mean:.4f} | R² Test: {r2_test:.4f} | RMSE: {rmse_test:.4f}")



Results - Gradient Boosting:
R² CV: 0.8910 | R² Test: 0.9291 | RMSE: 4.5996


### 6.5 XGBoost Regressor

In [120]:
name = 'XGBoost'
model = models[name]
X_train_data = X_train
X_test_data = X_test

cv_scores = cross_val_score(model, X_train_data, y_train, cv=kf, scoring='r2')
cv_r2_mean = np.mean(cv_scores)
model.fit(X_train_data, y_train)
y_pred_test = model.predict(X_test_data)

r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_test = mean_absolute_error(y_test, y_pred_test)

regression_results[name] = {
    'model': model, 'CV_R2': cv_r2_mean, 'R2': r2_test, 'RMSE': rmse_test, 'MAE': mae_test, 'predictions': y_pred_test
}

print(f"Results - {name}:\nR² CV: {cv_r2_mean:.4f} | R² Test: {r2_test:.4f} | RMSE: {rmse_test:.4f}")



Results - XGBoost:
R² CV: 0.8924 | R² Test: 0.9265 | RMSE: 4.6840


### 6.6 Support Vector Regressor (SVR)

In [121]:
name = 'SVR'
model = models[name]
X_train_data = X_train_scaled
X_test_data = X_test_scaled

cv_scores = cross_val_score(model, X_train_data, y_train, cv=kf, scoring='r2')
cv_r2_mean = np.mean(cv_scores)
model.fit(X_train_data, y_train)
y_pred_test = model.predict(X_test_data)

r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mae_test = mean_absolute_error(y_test, y_pred_test)

regression_results[name] = {
    'model': model, 'CV_R2': cv_r2_mean, 'R2': r2_test, 'RMSE': rmse_test, 'MAE': mae_test, 'predictions': y_pred_test
}

print(f"Results - {name}:\nR² CV: {cv_r2_mean:.4f} | R² Test: {r2_test:.4f} | RMSE: {rmse_test:.4f}")


Results - SVR:
R² CV: 0.8359 | R² Test: 0.8764 | RMSE: 6.0729
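
Sections 6.1 to 6.6 repeat the same steps for each model: cross-validate, fit, predict, compute the metrics. One way to factor that out is a small helper like the one sketched below. The function name `evaluate_model` and the synthetic data are hypothetical and only serve to show the shape of the refactoring.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score


def evaluate_model(model, X_fit, y_fit, X_eval, y_eval, cv):
    """Hypothetical helper: cross-validate on the fitting data, then score on held-out data."""
    cv_r2 = cross_val_score(model, X_fit, y_fit, cv=cv, scoring='r2').mean()
    model.fit(X_fit, y_fit)
    predictions = model.predict(X_eval)
    return {
        'model': model,
        'CV_R2': cv_r2,
        'R2': r2_score(y_eval, predictions),
        'RMSE': np.sqrt(mean_squared_error(y_eval, predictions)),
        'MAE': mean_absolute_error(y_eval, predictions),
        'predictions': predictions,
    }


# Synthetic stand-in data (hypothetical): a linear target with a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 8))
y_demo = X_demo @ np.arange(1, 9) + rng.normal(scale=0.5, size=300)

demo_result = evaluate_model(
    LinearRegression(),
    X_demo[:240], y_demo[:240],
    X_demo[240:], y_demo[240:],
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
print(f"R² CV: {demo_result['CV_R2']:.4f} | R² Test: {demo_result['R2']:.4f} | RMSE: {demo_result['RMSE']:.4f}")
```

The returned dictionary has the same keys as the entries of `regression_results`, so a loop over `models` could fill that dictionary directly, passing the scaled arrays for Linear Regression and SVR and the raw ones for the tree-based models.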


### 6.7 Performance Summary

In [122]:
print("CV scores (on the training set) gauge robustness; test scores (on the test set) give the final evaluation.")
print(f"\n{'Model':<25} {'CV R²':<10} {'Test R²':<10} {'Test RMSE':<12} {'Test MAE':<10}")
print("-" * 70)
for name, results in regression_results.items():
    print(f"{name:<25} {results['CV_R2']:<10.4f} {results['R2']:<10.4f} {results['RMSE']:<12.4f} {results['MAE']:<10.4f}")

CV scores (on the training set) gauge robustness; test scores (on the test set) give the final evaluation.

Model                     CV R²      Test R²    Test RMSE    Test MAE  
----------------------------------------------------------------------
Linear Regression         0.5888     0.5825     11.1597      8.8560    
Decision Tree             0.7489     0.8164     7.4000       4.8304    
Random Forest             0.8737     0.9041     5.3491       3.6903    
Gradient Boosting         0.8910     0.9291     4.5996       3.1337    
XGBoost                   0.8924     0.9265     4.6840       3.1771    
SVR                       0.8359     0.8764     6.0729       4.2153    


In [123]:
results_df = pd.DataFrame({
    'Modèle': list(regression_results.keys()),
    'CV R²': [r['CV_R2'] for r in regression_results.values()],
    'Test R²': [r['R2'] for r in regression_results.values()],
    'RMSE': [r['RMSE'] for r in regression_results.values()],
    'MAE': [r['MAE'] for r in regression_results.values()]
}).sort_values('Test R²', ascending=False)

fig7 = make_subplots(rows=1, cols=3, subplot_titles=('CV R² Score (Mean)', 'Test RMSE', 'Test MAE'))
fig7.add_trace(go.Bar(x=results_df['Modèle'], y=results_df['CV R²'], name='CV R²', marker_color='lightblue'), row=1, col=1)
fig7.add_trace(go.Bar(x=results_df['Modèle'], y=results_df['RMSE'], name='Test RMSE', marker_color='lightcoral'), row=1, col=2)
fig7.add_trace(go.Bar(x=results_df['Modèle'], y=results_df['MAE'], name='Test MAE', marker_color='lightgreen'), row=1, col=3)
fig7.update_layout(height=450, showlegend=False, title_text="Performance Comparison - Regression (CV & Test Scores)")
fig7.update_xaxes(tickangle=45)
fig7.show()

In [124]:
best_model_name = results_df.iloc[0]['Modèle']
best_predictions = regression_results[best_model_name]['predictions']

fig8 = go.Figure()
fig8.add_trace(go.Scatter(x=y_test, y=best_predictions, mode='markers', name='Predictions (Test)', marker=dict(size=5, opacity=0.6)))
fig8.add_trace(go.Scatter(x=[y_test.min(), y_test.max()], y=[y_test.min(), y_test.max()], mode='lines', name='Perfect fit', line=dict(color='red', dash='dash')))
fig8.update_layout(title=f'Predictions vs Actual - {best_model_name} (Test Set)', xaxis_title='Actual (MPa)', yaxis_title='Predicted (MPa)', height=500)
fig8.show()
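
Note that the ranking above picks the best model by its test R², which tends to make the reported test score slightly optimistic, while the validation split (`X_val`, `y_val`) created in section 5 is not used. A possible alternative, sketched on synthetic data with hypothetical names and not taken from the notebook, is to rank the candidates on the validation set and touch the test set only once, for the selected model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical) with a nonlinear target
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(400, 8))
y_demo = 10 * X_demo[:, 0] ** 2 + 5 * X_demo[:, 1] + rng.normal(scale=0.3, size=400)

# Same 70/10/20 scheme as in section 5
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.125, random_state=42)

candidates = {
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=50, random_state=42),
}

# Rank the candidates on the validation set only
val_scores = {}
for cand_name, cand in candidates.items():
    cand.fit(X_tr, y_tr)
    val_scores[cand_name] = r2_score(y_va, cand.predict(X_va))

selected_name = max(val_scores, key=val_scores.get)

# The test set is used once, for the selected model
test_r2 = r2_score(y_te, candidates[selected_name].predict(X_te))
print(f"Selected on validation: {selected_name} (val R² = {val_scores[selected_name]:.4f})")
print(f"Test R² of the selected model: {test_r2:.4f}")
```

With this scheme the test score remains an unbiased estimate for the chosen model, since it played no part in the choice.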

## 7. Feature Importance (Tree-Based Models)

In [125]:
rf_reg_final = regression_results['Random Forest']['model']
feature_importance_rf = pd.DataFrame({'Feature': X.columns, 'Importance': rf_reg_final.feature_importances_}).sort_values('Importance', ascending=False)

fig11 = px.bar(feature_importance_rf, x='Importance', y='Feature', orientation='h', title='Feature Importance - Random Forest', color='Importance', color_continuous_scale='Viridis')
fig11.update_layout(height=400)
fig11.show()

In [126]:
xgb_reg_final = regression_results['XGBoost']['model']
feature_importance_xgb = pd.DataFrame({'Feature': X.columns, 'Importance': xgb_reg_final.feature_importances_}).sort_values('Importance', ascending=False)
fig12 = px.bar(feature_importance_xgb, x='Importance', y='Feature', orientation='h', title='Feature Importance - XGBoost', color='Importance', color_continuous_scale='Plasma')
fig12.update_layout(height=400)
fig12.show()

## 8. Final Summary and Recommendations

In [127]:
print("\nDataset:")
print(f"    - Total number of samples: {len(df_clean)}")
print(f"    - Split: Train ({X_train.shape[0]}) / Test ({X_test.shape[0]}) / Validation ({X_val.shape[0]})")
print("    - Target variable: Concrete strength (MPa)")
print("\nBEST REGRESSION MODEL (based on test R²):")
best_result = regression_results[best_model_name]
print(f"    - Model: {best_model_name}")
print(f"      • Cross-validation R²: {best_result['CV_R2']:.4f}")
print(f"      • Test R²: {best_result['R2']:.4f}")
print(f"      • Test RMSE: {best_result['RMSE']:.4f} MPa")
print(f"      • Test MAE: {best_result['MAE']:.4f} MPa")
print("\nMOST IMPORTANT FEATURES (according to Random Forest):")
print(feature_importance_rf.head(3).to_string(index=False))
print("\nRECOMMENDATIONS:")
print("    1. Age and cement are the most influential factors for strength.")
print(f"    2. The cross-validation scores indicate that the {best_model_name} model is robust.")
print(f"    3. For production, use {best_model_name} to predict concrete strength, with a mean absolute error (MAE) of approx. {best_result['MAE']:.2f} MPa.")


Dataset:
    - Total number of samples: 1005
    - Split: Train (703) / Test (201) / Validation (101)
    - Target variable: Concrete strength (MPa)

BEST REGRESSION MODEL (based on test R²):
    - Model: Gradient Boosting
      • Cross-validation R²: 0.8910
      • Test R²: 0.9291
      • Test RMSE: 4.5996 MPa
      • Test MAE: 3.1337 MPa

MOST IMPORTANT FEATURES (according to Random Forest):
Feature  Importance
    age    0.338259
 cement    0.307413
  water    0.109276

RECOMMENDATIONS:
    1. Age and cement are the most influential factors for strength.
    2. The cross-validation scores indicate that the Gradient Boosting model is robust.
    3. For production, use Gradient Boosting to predict concrete strength, with a mean absolute error (MAE) of approx. 3.13 MPa.
