Preprocessing the Worldwide Travel Cities dataset

This notebook prepares the **Worldwide Travel Cities.csv** file for the travel recommendation project.

Steps
- Import the dataset
- Initial exploration
- Check for missing values and duplicates
- Basic cleaning
- Encode categorical variables
- Normalize numeric variables
- Generate a data dictionary
- Save the cleaned dataset

In [None]:
# 1. Import libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.inspection import permutation_importance

# Fixed seed so that splits and models are reproducible
RANDOM_STATE = 42


In [2]:
# 2. Load the dataset
df = pd.read_csv("Worldwide Travel Cities.csv")
df



Unnamed: 0,id,city,country,region,short_description,latitude,longitude,avg_temp_monthly,ideal_durations,budget_level,culture,adventure,nature,beaches,nightlife,cuisine,wellness,urban,seclusion
0,c54acf38-3029-496b-8c7a-8343ad82785c,Milan,Italy,europe,"Chic streets lined with fashion boutiques, his...",45.464194,9.189635,"{""1"":{""avg"":3.7,""max"":7.8,""min"":0.4},""2"":{""avg...","[""Short trip"",""One week""]",Luxury,5,2,2,1,4,5,3,5,2
1,0bd12654-ed64-424e-a044-7bc574bcf078,Yasawa Islands,Fiji,oceania,"Crystal-clear waters, secluded beaches, and vi...",-17.290947,177.125786,"{""1"":{""avg"":28,""max"":30.8,""min"":25.8},""2"":{""av...","[""Long trip"",""One week""]",Luxury,2,4,5,5,2,3,4,1,5
2,73036cda-9134-46fc-a2c6-807782d59dfb,Whistler,Canada,north_america,Snow-capped peaks and lush forests create a se...,50.117190,-122.954302,"{""1"":{""avg"":-2.5,""max"":0.4,""min"":-5.5},""2"":{""a...","[""Short trip"",""Weekend"",""One week""]",Luxury,3,5,5,2,3,3,4,2,4
3,3872c9c0-6b6e-49e1-9743-f46bfe591b86,Guanajuato,Mexico,north_america,Winding cobblestone streets and colorful facad...,20.987700,-101.000000,"{""1"":{""avg"":15.5,""max"":22.8,""min"":8.7},""2"":{""a...","[""Weekend"",""One week"",""Short trip""]",Mid-range,5,3,3,1,3,4,3,4,2
4,e1ebc1b6-8798-422d-847a-22016faff3fd,Surabaya,Indonesia,asia,Bustling streets filled with the aroma of loca...,-7.245972,112.737827,"{""1"":{""avg"":28.1,""max"":32.5,""min"":25.5},""2"":{""...","[""Short trip"",""Weekend""]",Budget,4,3,3,2,3,4,3,4,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
555,778d28df-a4fa-4328-896e-4a9f80216fda,Maun,Botswana,africa,"A gateway to the Okavango Delta, offering a se...",-19.986095,23.422435,"{""1"":{""avg"":26.6,""max"":32,""min"":21.2},""2"":{""av...","[""One week"",""Short trip""]",Mid-range,3,5,5,1,2,3,3,2,4
556,44fb18eb-2641-46ab-b3fa-df6870ba3c74,Gothenburg,Sweden,europe,"A charming city with picturesque canals, lush ...",57.707233,11.967017,"{""1"":{""avg"":1.4,""max"":3.2,""min"":-1.2},""2"":{""av...","[""Weekend"",""One week"",""Short trip""]",Mid-range,4,3,4,3,3,4,3,4,3
557,8c8c7203-2a45-44ba-9fb2-b5158104375e,Manchester,United Kingdom,europe,"Industrial heritage meets modern creativity, w...",53.479489,-2.245115,"{""1"":{""avg"":4.7,""max"":7.1,""min"":2},""2"":{""avg"":...","[""Weekend"",""One week"",""Short trip""]",Mid-range,4,2,2,1,4,4,3,4,2
558,ba72b976-10f9-4415-a818-32cf17d8e649,Copenhagen,Denmark,europe,"Charming canals, vibrant neighborhoods, and a ...",55.686724,12.570072,"{""1"":{""avg"":2.6,""max"":4.2,""min"":0.6},""2"":{""avg...","[""One week"",""Short trip"",""Weekend""]",Mid-range,5,2,3,2,4,4,3,5,2


In [3]:
# Preview the data
df.head()

Unnamed: 0,id,city,country,region,short_description,latitude,longitude,avg_temp_monthly,ideal_durations,budget_level,culture,adventure,nature,beaches,nightlife,cuisine,wellness,urban,seclusion
0,c54acf38-3029-496b-8c7a-8343ad82785c,Milan,Italy,europe,"Chic streets lined with fashion boutiques, his...",45.464194,9.189635,"{""1"":{""avg"":3.7,""max"":7.8,""min"":0.4},""2"":{""avg...","[""Short trip"",""One week""]",Luxury,5,2,2,1,4,5,3,5,2
1,0bd12654-ed64-424e-a044-7bc574bcf078,Yasawa Islands,Fiji,oceania,"Crystal-clear waters, secluded beaches, and vi...",-17.290947,177.125786,"{""1"":{""avg"":28,""max"":30.8,""min"":25.8},""2"":{""av...","[""Long trip"",""One week""]",Luxury,2,4,5,5,2,3,4,1,5
2,73036cda-9134-46fc-a2c6-807782d59dfb,Whistler,Canada,north_america,Snow-capped peaks and lush forests create a se...,50.11719,-122.954302,"{""1"":{""avg"":-2.5,""max"":0.4,""min"":-5.5},""2"":{""a...","[""Short trip"",""Weekend"",""One week""]",Luxury,3,5,5,2,3,3,4,2,4
3,3872c9c0-6b6e-49e1-9743-f46bfe591b86,Guanajuato,Mexico,north_america,Winding cobblestone streets and colorful facad...,20.9877,-101.0,"{""1"":{""avg"":15.5,""max"":22.8,""min"":8.7},""2"":{""a...","[""Weekend"",""One week"",""Short trip""]",Mid-range,5,3,3,1,3,4,3,4,2
4,e1ebc1b6-8798-422d-847a-22016faff3fd,Surabaya,Indonesia,asia,Bustling streets filled with the aroma of loca...,-7.245972,112.737827,"{""1"":{""avg"":28.1,""max"":32.5,""min"":25.5},""2"":{""...","[""Short trip"",""Weekend""]",Budget,4,3,3,2,3,4,3,4,2


In [4]:
# 3. General information and descriptive statistics
print(df.info())
df.describe(include="all")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560 entries, 0 to 559
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 560 non-null    object 
 1   city               560 non-null    object 
 2   country            560 non-null    object 
 3   region             560 non-null    object 
 4   short_description  560 non-null    object 
 5   latitude           560 non-null    float64
 6   longitude          560 non-null    float64
 7   avg_temp_monthly   560 non-null    object 
 8   ideal_durations    560 non-null    object 
 9   budget_level       560 non-null    object 
 10  culture            560 non-null    int64  
 11  adventure          560 non-null    int64  
 12  nature             560 non-null    int64  
 13  beaches            560 non-null    int64  
 14  nightlife          560 non-null    int64  
 15  cuisine            560 non-null    int64  
 16  wellness           560 non

Unnamed: 0,id,city,country,region,short_description,latitude,longitude,avg_temp_monthly,ideal_durations,budget_level,culture,adventure,nature,beaches,nightlife,cuisine,wellness,urban,seclusion
count,560,560,560,560,560,560.0,560.0,560,560,560,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0,560.0
unique,560,559,167,7,560,,,545,21,3,,,,,,,,,
top,44aaa492-d7a7-43bb-a9bf-4f0ecc209def,Granada,United States,europe,"Expansive landscapes filled with geysers, hot ...",,,"{""1"":{""avg"":28.5,""max"":32.5,""min"":25.2},""2"":{""...","[""Short trip"",""Weekend"",""One week""]",Mid-range,,,,,,,,,
freq,1,2,50,177,1,,,3,120,339,,,,,,,,,
mean,,,,,,22.502186,7.914665,,,,3.85,3.178571,3.728571,2.380357,3.019643,3.792857,3.073214,3.146429,3.028571
std,,,,,,27.980022,78.813803,,,,0.81291,0.79819,0.90392,1.435547,0.921599,0.679329,0.592134,1.018604,0.989699
min,,,,,,-54.807306,-175.201808,,,,2.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,1.0
25%,,,,,,5.268054,-64.439118,,,,3.0,3.0,3.0,1.0,2.0,3.0,3.0,2.0,2.0
50%,,,,,,31.793618,10.711854,,,,4.0,3.0,4.0,2.0,3.0,4.0,3.0,3.0,3.0
75%,,,,,,43.673199,50.020162,,,,4.0,4.0,4.0,3.0,4.0,4.0,3.0,4.0,4.0


In [5]:
# 4. Check for missing values and duplicates
print("Missing values per column:")
print(df.isnull().sum())

print("\nNumber of duplicates:", df.duplicated().sum())



Missing values per column:
id                   0
city                 0
country              0
region               0
short_description    0
latitude             0
longitude            0
avg_temp_monthly     0
ideal_durations      0
budget_level         0
culture              0
adventure            0
nature               0
beaches              0
nightlife            0
cuisine              0
wellness             0
urban                0
seclusion            0
dtype: int64

Number of duplicates: 0


In [6]:
# Drop duplicates as a precaution (none were found above)
df = df.drop_duplicates()

In [7]:
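# 5. Basic cleaning
# Drop rows that are missing more than two values
df = df.dropna(thresh=len(df.columns) - 2)

# Fill numeric columns with their mean.
# Note: none of these three columns exists in this dataset (see df.info() above),
# so the loop does nothing here.
for col in ["Popularity", "Rating", "Average_Satisfaction"]:
    if col in df.columns:
        # Assign the result back rather than calling fillna(inplace=True) on a column selection
        df[col] = df[col].fillna(df[col].mean())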

In [8]:
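# 6. Encode categorical variables
# Note: the dataset's column names are lowercase ("country") and it has no
# climate column, so neither name matches and nothing is encoded here.
# The preview that follows is therefore unchanged.
cat_cols = ["Country", "Climate"]
encoder = LabelEncoder()

for col in cat_cols:
    if col in df.columns:
        df[col] = encoder.fit_transform(df[col].astype(str))

df.head()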

Unnamed: 0,id,city,country,region,short_description,latitude,longitude,avg_temp_monthly,ideal_durations,budget_level,culture,adventure,nature,beaches,nightlife,cuisine,wellness,urban,seclusion
0,c54acf38-3029-496b-8c7a-8343ad82785c,Milan,Italy,europe,"Chic streets lined with fashion boutiques, his...",45.464194,9.189635,"{""1"":{""avg"":3.7,""max"":7.8,""min"":0.4},""2"":{""avg...","[""Short trip"",""One week""]",Luxury,5,2,2,1,4,5,3,5,2
1,0bd12654-ed64-424e-a044-7bc574bcf078,Yasawa Islands,Fiji,oceania,"Crystal-clear waters, secluded beaches, and vi...",-17.290947,177.125786,"{""1"":{""avg"":28,""max"":30.8,""min"":25.8},""2"":{""av...","[""Long trip"",""One week""]",Luxury,2,4,5,5,2,3,4,1,5
2,73036cda-9134-46fc-a2c6-807782d59dfb,Whistler,Canada,north_america,Snow-capped peaks and lush forests create a se...,50.11719,-122.954302,"{""1"":{""avg"":-2.5,""max"":0.4,""min"":-5.5},""2"":{""a...","[""Short trip"",""Weekend"",""One week""]",Luxury,3,5,5,2,3,3,4,2,4
3,3872c9c0-6b6e-49e1-9743-f46bfe591b86,Guanajuato,Mexico,north_america,Winding cobblestone streets and colorful facad...,20.9877,-101.0,"{""1"":{""avg"":15.5,""max"":22.8,""min"":8.7},""2"":{""a...","[""Weekend"",""One week"",""Short trip""]",Mid-range,5,3,3,1,3,4,3,4,2
4,e1ebc1b6-8798-422d-847a-22016faff3fd,Surabaya,Indonesia,asia,Bustling streets filled with the aroma of loca...,-7.245972,112.737827,"{""1"":{""avg"":28.1,""max"":32.5,""min"":25.5},""2"":{""...","[""Short trip"",""Weekend""]",Budget,4,3,3,2,3,4,3,4,2


In [9]:
# 7. Normalize numeric variables
# Note: these columns are absent from this dataset, so no column is rescaled
# and the preview that follows is unchanged.
scaler = MinMaxScaler()
num_cols = ["Popularity", "Rating", "Average_Satisfaction"]

for col in num_cols:
    if col in df.columns:
        df[[col]] = scaler.fit_transform(df[[col]])

df.head()

Unnamed: 0,id,city,country,region,short_description,latitude,longitude,avg_temp_monthly,ideal_durations,budget_level,culture,adventure,nature,beaches,nightlife,cuisine,wellness,urban,seclusion
0,c54acf38-3029-496b-8c7a-8343ad82785c,Milan,Italy,europe,"Chic streets lined with fashion boutiques, his...",45.464194,9.189635,"{""1"":{""avg"":3.7,""max"":7.8,""min"":0.4},""2"":{""avg...","[""Short trip"",""One week""]",Luxury,5,2,2,1,4,5,3,5,2
1,0bd12654-ed64-424e-a044-7bc574bcf078,Yasawa Islands,Fiji,oceania,"Crystal-clear waters, secluded beaches, and vi...",-17.290947,177.125786,"{""1"":{""avg"":28,""max"":30.8,""min"":25.8},""2"":{""av...","[""Long trip"",""One week""]",Luxury,2,4,5,5,2,3,4,1,5
2,73036cda-9134-46fc-a2c6-807782d59dfb,Whistler,Canada,north_america,Snow-capped peaks and lush forests create a se...,50.11719,-122.954302,"{""1"":{""avg"":-2.5,""max"":0.4,""min"":-5.5},""2"":{""a...","[""Short trip"",""Weekend"",""One week""]",Luxury,3,5,5,2,3,3,4,2,4
3,3872c9c0-6b6e-49e1-9743-f46bfe591b86,Guanajuato,Mexico,north_america,Winding cobblestone streets and colorful facad...,20.9877,-101.0,"{""1"":{""avg"":15.5,""max"":22.8,""min"":8.7},""2"":{""a...","[""Weekend"",""One week"",""Short trip""]",Mid-range,5,3,3,1,3,4,3,4,2
4,e1ebc1b6-8798-422d-847a-22016faff3fd,Surabaya,Indonesia,asia,Bustling streets filled with the aroma of loca...,-7.245972,112.737827,"{""1"":{""avg"":28.1,""max"":32.5,""min"":25.5},""2"":{""...","[""Short trip"",""Weekend""]",Budget,4,3,3,2,3,4,3,4,2
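
As a reminder of what the scaling step computes: with its default range, `MinMaxScaler` maps each value x to (x - min) / (max - min). The following is a minimal pure-Python sketch of that formula, not scikit-learn's implementation. The `min_max_scale` helper and the `ratings` list are hypothetical and used only for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; this sketch maps it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ratings = [2, 3, 5, 4]  # hypothetical scores on a 1-5 scale
scaled = min_max_scale(ratings)
print(scaled)
```

The smallest value becomes 0, the largest becomes 1, and the others keep their relative spacing.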


In [10]:
# 8. Generate a data dictionary automatically
dictionnaire = pd.DataFrame({
    "Colonne": df.columns,
    "Type": [str(df[col].dtype) for col in df.columns],
    "Nb_valeurs_uniques": [df[col].nunique() for col in df.columns],
    "Exemple": [df[col].iloc[0] for col in df.columns]
})
dictionnaire

Unnamed: 0,Colonne,Type,Nb_valeurs_uniques,Exemple
0,id,object,560,c54acf38-3029-496b-8c7a-8343ad82785c
1,city,object,559,Milan
2,country,object,167,Italy
3,region,object,7,europe
4,short_description,object,560,"Chic streets lined with fashion boutiques, his..."
5,latitude,float64,552,45.464194
6,longitude,float64,552,9.189635
7,avg_temp_monthly,object,545,"{""1"":{""avg"":3.7,""max"":7.8,""min"":0.4},""2"":{""avg..."
8,ideal_durations,object,21,"[""Short trip"",""One week""]"
9,budget_level,object,3,Luxury


In [11]:
# Save the preprocessed dataset
df.to_csv("Worldwide_Travel_Cities_clean.csv", index=False)
print("Cleaned dataset saved as 'Worldwide_Travel_Cities_clean.csv'")

Cleaned dataset saved as 'Worldwide_Travel_Cities_clean.csv'


In [12]:
# Load the data (the original CSV, not the cleaned file saved above)
df = pd.read_csv("Worldwide Travel Cities.csv")


In [13]:
# Create a "pertinent" target: 1 when the mean of five ratings is at least 3.5
df['pertinent'] = (
    (df['culture'] + df['nature'] + df['cuisine'] + df['nightlife'] + df['adventure']) / 5 >= 3.5
).astype(int)
# Define X and y
# The 'pertinent' target is roughly balanced (see the counts printed by this cell)
y = df['pertinent']

# Feature columns to use
# Note: the five ratings that define the target are also kept as features
wanted_num = ['culture','adventure','nature','beaches','nightlife','cuisine','wellness','urban','seclusion']
wanted_cat = ['budget_level','climate']  # 'climate' is skipped automatically because the column is absent

num_cols = [c for c in wanted_num if c in df.columns]
cat_cols = [c for c in wanted_cat if c in df.columns]

X = df[num_cols + cat_cols]
# Quick sanity check
print("Features used:", X.columns.tolist())
print("Target distribution (pertinent):\n", y.value_counts())


Features used: ['culture', 'adventure', 'nature', 'beaches', 'nightlife', 'cuisine', 'wellness', 'urban', 'seclusion', 'budget_level']
Target distribution (pertinent):
 pertinent
1    301
0    259
Name: count, dtype: int64
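
Because the ratings are integers, the condition "mean of five ratings >= 3.5" is the same as "sum of the five ratings >= 18". The sketch below checks this equivalence exhaustively, assuming ratings on a 1-5 scale; the function names are hypothetical. It shows that the target is a deterministic linear rule on columns that are also used as features.

```python
from itertools import product


def pertinent(culture, adventure, nature, nightlife, cuisine):
    """Label as defined in the notebook: mean of five ratings >= 3.5."""
    return int((culture + nature + cuisine + nightlife + adventure) / 5 >= 3.5)


def linear_rule(culture, adventure, nature, nightlife, cuisine):
    """Equivalent integer form: the five ratings sum to at least 18."""
    return int(culture + adventure + nature + nightlife + cuisine >= 18)


# Compare the two rules on every combination of ratings from 1 to 5.
all_match = all(
    pertinent(*combo) == linear_rule(*combo)
    for combo in product(range(1, 6), repeat=5)
)
print(all_match)

# Milan, first row of the dataset: culture=5, adventure=2, nature=2, nightlife=4, cuisine=5
milan_label = pertinent(5, 2, 2, 4, 5)
print(milan_label)
```

A model that can learn a weighted sum of its inputs can reproduce such a rule exactly, which is worth keeping in mind when reading the scores further on.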


In [14]:
# Binary budget variable: 1 for non-luxury destinations, 0 for Luxury
df["target"] = (df["budget_level"] != "Luxury").astype(int)

In [15]:
# Stratified 80/20 split: stratify=y keeps the class ratio the same in both parts
# (train_test_split and numpy are already imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

# Sanity check: class counts in each split
print("Train class distribution:", np.bincount(y_train))
print("Test class distribution :", np.bincount(y_test))


Train class distribution: [207 241]
Test class distribution : [52 60]
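
The effect of `stratify=y` can be checked by hand from the counts printed so far. This small sketch uses only those printed numbers (no data is reloaded) and computes the share of class 1 in the full set, the training set, and the test set.

```python
# (class 0 count, class 1 count), copied from the outputs above
counts = {"all": (259, 301), "train": (207, 241), "test": (52, 60)}

shares = {}
for name, (neg, pos) in counts.items():
    shares[name] = pos / (neg + pos)
    print(f"{name:>5}: class 1 share = {shares[name]:.4f}")

# Largest gap between any two shares
spread = max(shares.values()) - min(shares.values())
print(f"spread = {spread:.4f}")
```

The three shares agree to within a fraction of a percentage point, which is what a stratified split is meant to guarantee.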


In [16]:

# Numeric and categorical columns
num_cols = ['culture', 'adventure', 'nature', 'beaches', 'nightlife',
            'cuisine', 'wellness', 'urban', 'seclusion']
cat_cols = ['budget_level']


In [17]:

# Preprocessor shared by the pipelines
# (StandardScaler, OneHotEncoder and ColumnTransformer are already imported in the first cell)
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols) if num_cols else ('num','passthrough',[]),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols) if cat_cols else ('cat','passthrough',[])
])


# class_weight='balanced' reweights the classes inversely to their frequency
# (this pipeline is defined again, with n_jobs=-1, in the Models section)
pipe_rf = Pipeline([
    ('pre', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE, class_weight='balanced'))
])
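
To make the two transformers in the preprocessor concrete, here is a rough pure-Python sketch of what they compute: z-score standardization (using the population standard deviation, as `StandardScaler` does) and one-hot encoding with categories in sorted order. The helper names are hypothetical; the sample values are the `culture` and `budget_level` entries of the first five rows shown earlier.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return the sorted categories and one indicator row per value."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


culture = [5, 2, 3, 5, 4]
budget = ["Luxury", "Luxury", "Luxury", "Mid-range", "Budget"]

z = standardize(culture)
categories, encoded = one_hot(budget)
print([round(v, 3) for v in z])
print(categories)
print(encoded)
```

After standardization the column has mean 0 and unit variance; after encoding, each budget level becomes its own 0/1 column.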

In [18]:
print("X_train columns:", X_train.columns.tolist())
print("preprocessor num_cols:", num_cols, "cat_cols:", cat_cols)


X_train columns: ['culture', 'adventure', 'nature', 'beaches', 'nightlife', 'cuisine', 'wellness', 'urban', 'seclusion', 'budget_level']
preprocessor num_cols: ['culture', 'adventure', 'nature', 'beaches', 'nightlife', 'cuisine', 'wellness', 'urban', 'seclusion'] cat_cols: ['budget_level']


In [19]:
# Split for the regression model (same X and same binary y, without stratification)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)


Models

In [20]:
# Logistic Regression (a classifier, despite its name)
pipe_log = Pipeline([
    ('pre', preprocessor),
    ('clf', LogisticRegression(max_iter=2000, random_state=RANDOM_STATE, class_weight='balanced'))
])


In [21]:
# Random Forest Classifier (classification)
pipe_rf = Pipeline([
    ('pre', preprocessor),
    ('clf', RandomForestClassifier(n_estimators=300, random_state=RANDOM_STATE, n_jobs=-1, class_weight='balanced'))
])

In [22]:
# Random Forest Regressor(Regression)
pipe_rf_reg = Pipeline([
    ('pre', preprocessor),
    ('reg', RandomForestRegressor(n_estimators=200, random_state=RANDOM_STATE, n_jobs=-1))
])

pipe_rf_reg.fit(X_train_r, y_train_r)
pred3 = pipe_rf_reg.predict(X_test_r)

print("\nRandom Forest Regressor")
rmse = np.sqrt(mean_squared_error(y_test_r, pred3))
print("R²:", r2_score(y_test_r, pred3))



Random Forest Regressor
R²: 0.8435194871794871
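
For reference, the R² reported here is the coefficient of determination, 1 - SS_res / SS_tot. The sketch below implements that formula in plain Python on made-up values; `y_true` and `y_score` are hypothetical and are not taken from the notebook's test set.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1, 0, 1, 1, 0]              # hypothetical binary labels
y_score = [0.9, 0.2, 0.8, 1.0, 0.1]   # hypothetical regressor outputs

score = r2(y_true, y_score)
perfect = r2(y_true, y_true)
print(round(score, 4), perfect)
```

With a 0/1 target, R² measures how close the predicted scores are to the labels; predictions that match every label exactly give an R² of 1.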


Why these models were chosen for our data

Logistic Regression: a simple, interpretable baseline model that estimates the probability that a trip is pertinent from the input variables.

Random Forest Classifier: a more flexible model that handles non-linear relationships between the variables and can improve accuracy when classifying trips as pertinent or not.

Random Forest Regressor: used to predict a continuous relevance score, which gives a sense of how well a trip matches the user. In this notebook it is trained on the same binary `pertinent` label, so its output is an average over the trees and lies between 0 and 1.

Training

In [23]:
# Logistic Regression (classification)
pipe_log.fit(X_train, y_train)

In [24]:
# Random Forest Classifier (classification)
pipe_rf.fit(X_train, y_train)

In [25]:
# Random Forest Regressor (Regression)
pipe_rf_reg.fit(X_train_r, y_train_r)


Retraining and evaluation of the Logistic Regression, the Random Forest Regressor, and the Random Forest Classifier

In [26]:
# Training (the regressor was already fitted in the previous cell)
pipe_log.fit(X_train, y_train)
pipe_rf.fit(X_train, y_train)

# Evaluation
y_pred_log = pipe_log.predict(X_test)
y_pred_rf  = pipe_rf.predict(X_test)
pred3 = pipe_rf_reg.predict(X_test_r)

print("\n Logistic Regression ")
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

print("\n Random Forest Classifier ")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

# Random Forest Regressor


print("\n Evaluation: Random Forest Regressor ")
rmse = np.sqrt(mean_squared_error(y_test_r, pred3))  # computed but not printed
print("R² :", r2_score(y_test_r, pred3))




 Logistic Regression 
Accuracy: 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        52
           1       1.00      1.00      1.00        60

    accuracy                           1.00       112
   macro avg       1.00      1.00      1.00       112
weighted avg       1.00      1.00      1.00       112


 Random Forest Classifier 
Accuracy: 0.9464285714285714
              precision    recall  f1-score   support

           0       0.91      0.98      0.94        52
           1       0.98      0.92      0.95        60

    accuracy                           0.95       112
   macro avg       0.95      0.95      0.95       112
weighted avg       0.95      0.95      0.95       112


 Evaluation: Random Forest Regressor 
R² : 0.8383852564102564
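
To see how the figures in the Random Forest Classifier report relate to each other, the sketch below recomputes them from a confusion matrix. The matrix is an assumption: the notebook never prints it, and these counts were inferred from the supports (52 and 60), the recalls, and the accuracy of 106/112 in the report.

```python
# Rows = true class, columns = predicted class (inferred, not printed by the notebook)
cm = [[51, 1],
      [5, 55]]


def class_metrics(cm, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                    # true class k, predicted otherwise
    fp = sum(row[k] for row in cm) - tp     # predicted k, true class differs
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
report = {k: tuple(round(m, 2) for m in class_metrics(cm, k)) for k in (0, 1)}
print(report)
print(round(accuracy, 4))
```

Under this assumption, six of the 112 test rows are misclassified: one class-0 row and five class-1 rows.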


Predictions

In [27]:
# Prediction for a new user
# Example profile: median of each numeric rating and a 'Mid-range' budget
new_user = {}
for c in num_cols:
    new_user[c] = float(df[c].median())  # example value: the median
for c in cat_cols:
    new_user[c] = 'Mid-range' if c=='budget_level' else df[c].iloc[0]

new_user_df = pd.DataFrame([new_user])
print("New user features:", new_user_df.to_dict(orient='records')[0])

if hasattr(pipe_rf, "predict_proba"):
    proba = pipe_rf.predict_proba(new_user_df)[:,1][0]
    print(f"Probability that new_user is pertinent (RF): {proba:.2f}")
    print("Predicted class:", "Pertinent" if proba>=0.5 else "Not pertinent")
else:
    print("The RF pipeline has no predict_proba()")


New user features: {'culture': 4.0, 'adventure': 3.0, 'nature': 4.0, 'beaches': 2.0, 'nightlife': 3.0, 'cuisine': 4.0, 'wellness': 3.0, 'urban': 3.0, 'seclusion': 3.0, 'budget_level': 'Mid-range'}
Probability that new_user is pertinent (RF): 0.98
Predicted class: Pertinent


In [28]:
# Predictions with Logistic Regression (classification)
y_pred_log = pipe_log.predict(X_test)

# Display the first 10 predictions
print("\n Predictions: Logistic Regression ")
for i, p in enumerate(y_pred_log[:10]):
    print(f"Observation = {i+1}  Prediction: {p}")

print("\nAccuracy :", accuracy_score(y_test, y_pred_log))
# r2_score on 0/1 class labels is only a rough indicator; accuracy is the metric that matters here
print("R² (approximation) :", r2_score(y_test, y_pred_log))



 Predictions: Logistic Regression 
Observation = 1  Prediction: 1
Observation = 2  Prediction: 1
Observation = 3  Prediction: 1
Observation = 4  Prediction: 1
Observation = 5  Prediction: 1
Observation = 6  Prediction: 0
Observation = 7  Prediction: 1
Observation = 8  Prediction: 1
Observation = 9  Prediction: 0
Observation = 10  Prediction: 0

Accuracy : 1.0
R² (approximation) : 1.0


In [29]:
# Predictions with Random Forest (classification)
y_pred_rf = pipe_rf.predict(X_test)

print("\n Predictions: Random Forest (Classification) ")
for i, p in enumerate(y_pred_rf[:10]):  # show the first 10 predictions
    statut = "Pertinent" if p == 1 else "Not pertinent"
    print(f"Traveler {i+1} = {statut}")

print("\nAccuracy :", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))



 Predictions: Random Forest (Classification) 
Traveler 1 = Pertinent
Traveler 2 = Pertinent
Traveler 3 = Pertinent
Traveler 4 = Pertinent
Traveler 5 = Pertinent
Traveler 6 = Not pertinent
Traveler 7 = Pertinent
Traveler 8 = Pertinent
Traveler 9 = Not pertinent
Traveler 10 = Not pertinent

Accuracy : 0.9464285714285714
              precision    recall  f1-score   support

           0       0.91      0.98      0.94        52
           1       0.98      0.92      0.95        60

    accuracy                           0.95       112
   macro avg       0.95      0.95      0.95       112
weighted avg       0.95      0.95      0.95       112



In [30]:
# Predictions with Random Forest Regressor (regression)
y_pred_rf_reg = pipe_rf_reg.predict(X_test_r)

print("\n Predictions: Random Forest Regressor ")
for i, p in enumerate(y_pred_rf_reg[:10]):  # show the first 10 predictions
    print(f"Traveler {i+1} = Predicted score: {round(p, 2)}")

# Use the predictions computed in this cell (computed but not printed)
rmse = np.sqrt(mean_squared_error(y_test_r, y_pred_rf_reg))
print("R² :", r2_score(y_test_r, y_pred_rf_reg))



 Predictions: Random Forest Regressor 
Traveler 1 = Predicted score: 1.0
Traveler 2 = Predicted score: 0.0
Traveler 3 = Predicted score: 0.83
Traveler 4 = Predicted score: 1.0
Traveler 5 = Predicted score: 0.98
Traveler 6 = Predicted score: 0.78
Traveler 7 = Predicted score: 0.0
Traveler 8 = Predicted score: 0.1
Traveler 9 = Predicted score: 1.0
Traveler 10 = Predicted score: 1.0
R² : 0.8383852564102564


Optimizations

In [31]:
# Optimize Logistic Regression
from sklearn.model_selection import GridSearchCV

param_grid_log = {
    'clf__C': [0.01, 0.1, 1, 10],
    'clf__solver': ['liblinear', 'lbfgs']
}

grid_log = GridSearchCV(pipe_log, param_grid_log, cv=5, scoring='accuracy', n_jobs=-1)
grid_log.fit(X_train, y_train)

print("\n Optimized Logistic Regression ")
print("Best parameters:", grid_log.best_params_)
print("Accuracy (test) :", accuracy_score(y_test, grid_log.predict(X_test)))



 Optimized Logistic Regression 
Best parameters: {'clf__C': 1, 'clf__solver': 'liblinear'}
Accuracy (test) : 1.0


In [32]:

# Optimize Random Forest Classifier
param_grid_rf = {
    'clf__n_estimators': [100, 300, 500],
    'clf__max_depth': [None, 10, 20],
    'clf__min_samples_split': [2, 5, 10]
}

grid_rf = GridSearchCV(pipe_rf, param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train, y_train)

print("\n Optimized Random Forest Classifier ")
print("Best parameters:", grid_rf.best_params_)
print("Accuracy (test) :", accuracy_score(y_test, grid_rf.predict(X_test)))




 Optimized Random Forest Classifier 
Best parameters: {'clf__max_depth': 10, 'clf__min_samples_split': 2, 'clf__n_estimators': 100}
Accuracy (test) : 0.9464285714285714


Training the optimized models

In [33]:
# Train the optimized Logistic Regression model (re-runs the same grid search)
grid_log.fit(X_train, y_train)


In [34]:
# Train the optimized Random Forest Classifier model (re-runs the same grid search)
grid_rf.fit(X_train, y_train)


Evaluation of the optimized models

In [35]:
# Evaluate the optimized Logistic Regression model
y_pred_log_opt = grid_log.predict(X_test)

print("\n Evaluation: Optimized Logistic Regression ")
print("Accuracy :", accuracy_score(y_test, y_pred_log_opt))
print(classification_report(y_test, y_pred_log_opt))



 Evaluation: Optimized Logistic Regression 
Accuracy : 1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        52
           1       1.00      1.00      1.00        60

    accuracy                           1.00       112
   macro avg       1.00      1.00      1.00       112
weighted avg       1.00      1.00      1.00       112



In [36]:
# Evaluate the optimized Random Forest Classifier model
y_pred_rf_opt = grid_rf.predict(X_test)

print("\n Evaluation: Optimized Random Forest Classifier ")
print("Accuracy :", accuracy_score(y_test, y_pred_rf_opt))
print(classification_report(y_test, y_pred_rf_opt))



 Evaluation: Optimized Random Forest Classifier 
Accuracy : 0.9464285714285714
              precision    recall  f1-score   support

           0       0.93      0.96      0.94        52
           1       0.97      0.93      0.95        60

    accuracy                           0.95       112
   macro avg       0.95      0.95      0.95       112
weighted avg       0.95      0.95      0.95       112



Predictions of the optimized models

In [37]:
# Predictions with the optimized Logistic Regression
y_pred_log_opt = grid_log.predict(X_test)

print("\n Predictions: Optimized Logistic Regression ")
for i, p in enumerate(y_pred_log_opt[:10]):  # show the first 10
    statut = "Pertinent" if p == 1 else "Not pertinent"
    print(f"Traveler {i+1} = {statut}")



 Predictions: Optimized Logistic Regression 
Traveler 1 = Pertinent
Traveler 2 = Pertinent
Traveler 3 = Pertinent
Traveler 4 = Pertinent
Traveler 5 = Pertinent
Traveler 6 = Not pertinent
Traveler 7 = Pertinent
Traveler 8 = Pertinent
Traveler 9 = Not pertinent
Traveler 10 = Not pertinent


In [38]:
# Predictions with the optimized Random Forest Classifier
y_pred_rf_opt = grid_rf.predict(X_test)

print("\n Predictions: Optimized Random Forest Classifier ")
for i, p in enumerate(y_pred_rf_opt[:10]):  # show the first 10
    statut = "Pertinent" if p == 1 else "Not pertinent"
    print(f"Traveler {i+1} = {statut}")



 Predictions: Optimized Random Forest Classifier 
Traveler 1 = Pertinent
Traveler 2 = Pertinent
Traveler 3 = Pertinent
Traveler 4 = Pertinent
Traveler 5 = Pertinent
Traveler 6 = Not pertinent
Traveler 7 = Pertinent
Traveler 8 = Pertinent
Traveler 9 = Not pertinent
Traveler 10 = Not pertinent


Why GridSearchCV was chosen for optimization


We chose GridSearchCV to search automatically for the best hyperparameters of our models. It scores every combination in the grid with 5-fold cross-validation on the training set, which favors settings that generalize and so helps limit overfitting while improving accuracy.
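
To give a sense of the search cost, the sketch below enumerates the Random Forest grid used in this notebook with `itertools.product` and counts the model fits implied by `cv=5`. It only counts combinations and does not train anything; the variable names `candidates` and `n_fits` are illustrative.

```python
from itertools import product

# Same grid as param_grid_rf in the optimization cell
param_grid_rf = {
    "clf__n_estimators": [100, 300, 500],
    "clf__max_depth": [None, 10, 20],
    "clf__min_samples_split": [2, 5, 10],
}
cv = 5

names = list(param_grid_rf)
candidates = [dict(zip(names, values)) for values in product(*param_grid_rf.values())]
n_fits = len(candidates) * cv

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
print(candidates[0])
```

The same arithmetic for the Logistic Regression grid (4 values of `C`, 2 solvers) gives 8 candidates and 40 fits. With the default `refit=True`, GridSearchCV then fits the best candidate once more on the full training set.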

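Comparison between the baseline models and the optimized models

After analyzing the classification reports, we found that, despite using GridSearchCV to improve our models, test accuracy did not change (1.0 for the Logistic Regression, about 0.946 for the Random Forest Classifier) and the first ten predictions are identical. For the Random Forest Classifier, only the per-class figures shifted slightly (class 0 precision from 0.91 to 0.93 and recall from 0.98 to 0.96). This may suggest that, with these features, our models are already close to the best performance they can reach.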

Choice of a final model and explanation

Among all our models, we select the Random Forest Classifier. It has an accuracy of 0.94, a precision of 0.91 and 0.98, a recall of 0.98 and 0.92, and an F1-score of 0.94 and 0.95 for classes 0 and 1. The Logistic Regression, by contrast, has a perfect classification report (accuracy, precision, recall, and F1-score all equal to 1), which makes its performance look suspicious: it may be caused by overfitting or by data leakage. The `pertinent` target is computed directly from five ratings that are also used as features, so the two classes are linearly separable and a perfect score is exactly what a linear model can reach here. The Random Forest Classifier, with about 94% accuracy, also compares well with the Random Forest Regressor, whose R² is about 0.83, although the two metrics are not directly comparable. We therefore choose the Random Forest Classifier.

However, the optimization with GridSearchCV, which is meant to improve performance and reduce overfitting, left the Logistic Regression's results unchanged. This suggests that the Logistic Regression may genuinely perform well, but as a precaution we keep the Random Forest Classifier as the final model.

In [None]:
# Save the final model

import joblib

# Train the model
pipe_rf.fit(X_train, y_train)

# Save the model to a file
joblib.dump(pipe_rf, 'random_forest_model.pkl')

print("Model saved successfully!")

Model saved successfully!
