## Business Context

## ✈️ Customer Satisfaction Analysis - Anvestico Airline

### 🎯 Business Context

Anvestico Airline wants to understand the **factors that drive customer satisfaction** in order to:
- Improve the passenger experience,
- Reduce complaints and poor ratings,
- Optimize internal services (boarding, comfort, Wi-Fi, etc.),
- Segment customers more effectively (by class, type of travel...),
- And ultimately, **improve loyalty and profitability**.

The goal of this study is therefore to **model passenger satisfaction** as a function of passenger characteristics, the service received, and flight conditions.

---

### 📋 Data dictionary

| Variable                               | Definition |
|----------------------------------------|------------|
| `satisfaction`                         | **Target variable** indicating whether the customer is satisfied (values: "satisfied", "dissatisfied"). |
| `Gender`                               | Customer's gender (values: "Male", "Female"). |
| `Customer Type`                        | Customer type (values: "Loyal Customer", "disloyal Customer"). |
| `Age`                                  | Passenger's age (in years). |
| `Type of Travel`                       | Purpose of the trip (values: "Business travel", "Personal Travel"). |
| `Class`                                | Travel class (values: "Eco", "Eco Plus", "Business"). |
| `Flight Distance`                      | Flight distance in miles (numeric). |
| `Seat comfort`                         | Seat comfort rating (0 to 5). |
| `Departure/Arrival time convenient`    | Rating of how convenient the departure/arrival times were (0 to 5). |
| `Food and drink`                       | Satisfaction with the food and drink provided (0 to 5). |
| `Gate location`                        | Rating of how convenient the boarding gate location was (0 to 5). |
| `Inflight wifi service`                | Quality of the in-flight Wi-Fi service (0 to 5). |
| `Inflight entertainment`               | Quality of in-flight entertainment (films, music, etc.) (0 to 5). |
| `Online support`                       | Effectiveness of online customer support (website, chat, etc.) (0 to 5). |
| `Ease of Online booking`               | Ease of booking online (0 to 5). |
| `On-board service`                     | Overall quality of on-board service (staff, atmosphere...) (0 to 5). |
| `Leg room service`                     | Leg room comfort (0 to 5). |
| `Baggage handling`                     | Satisfaction with baggage handling (0 to 5). |
| `Checkin service`                      | Rating of check-in service efficiency (0 to 5). |
| `Cleanliness`                          | Cleanliness of the aircraft (0 to 5). |
| `Online boarding`                      | Quality of the online boarding process (0 to 5). |
| `Departure Delay in Minutes`           | Departure delay in minutes (numeric). |
| `Arrival Delay in Minutes`             | Arrival delay in minutes (numeric). |

---

## 💡 Remarks

- All service ratings (0 to 5) reflect the customer's **subjective assessment**.
- The **delay** variables (departure/arrival) may matter when modeling satisfaction.
- The dataset mixes **ordinal**, categorical, and numerical variables, which need to be handled differently in the analysis.
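
The three variable families mentioned in the remarks can be separated programmatically before any modeling. The following is a minimal sketch, assuming the column names from the data dictionary; the helper `split_by_kind`, the shortened column lists, and the toy frame are hypothetical and only serve as illustration.

```python
import pandas as pd

# Column groups taken from the data dictionary (rating list shortened for brevity)
RATING_COLS = ["Seat comfort", "Food and drink", "Cleanliness"]  # ordinal, 0 to 5
NUMERIC_COLS = ["Age", "Flight Distance",
                "Departure Delay in Minutes", "Arrival Delay in Minutes"]
CATEGORICAL_COLS = ["Gender", "Customer Type", "Type of Travel", "Class"]

def split_by_kind(frame):
    """Return the ordinal, numeric and categorical columns present in `frame`."""
    present = set(frame.columns)
    return {
        "ordinal": [c for c in RATING_COLS if c in present],
        "numeric": [c for c in NUMERIC_COLS if c in present],
        "categorical": [c for c in CATEGORICAL_COLS if c in present],
    }

# Toy frame, not taken from the dataset
toy = pd.DataFrame({"Age": [30], "Class": ["Eco"], "Seat comfort": [4]})
kinds = split_by_kind(toy)
print(kinds)
```

Keeping the groups in explicit lists, rather than inferring them from dtypes, avoids treating the 0 to 5 ratings as plain continuous numbers by accident.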


## Load packages

In [None]:
import pandas as pd 
import pathlib
import os
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
import time

## Load data

In [15]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv(r"C:\Users\Client\Desktop\R studio project\Airline_project\Invistico_Airline.csv")
df.head()

Unnamed: 0,satisfaction,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Male,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,...,4,2,2,0,2,4,2,5,0,0.0


## Data structure

In [16]:
def data_summary(df):
    """Display shape, column names, dtypes, variable kinds and missing values."""
    from IPython.display import display, Markdown

    separator = "\n" + "=" * 70 + "\n"

    display(Markdown("## 📊 **Dataset Overview**"))

    total_rows, total_cols = df.shape
    display(Markdown(f"- **Number of rows:** {total_rows:,}"))
    display(Markdown(f"- **Number of columns:** {total_cols:,}"))

    display(Markdown("### 🏷️ **Column names**"))
    display(Markdown(", ".join([f"`{col}`" for col in df.columns])))

    print(separator)

    display(Markdown("## 🧮 **Data types**"))
    display(df.dtypes.value_counts().to_frame("Number of columns").rename_axis("Data type"))

    # Split qualitative and quantitative columns
    qualitative_columns = df.select_dtypes(include=["object"]).columns.tolist()
    quantitative_columns = df.select_dtypes(include=["number"]).columns.tolist()

    print(separator)

    display(Markdown("## 🔤 **Qualitative (categorical) variables**"))
    if qualitative_columns:
        display(Markdown(", ".join([f"`{col}`" for col in qualitative_columns])))
    else:
        display(Markdown("_No qualitative variables._"))

    display(Markdown("## 🔢 **Quantitative (numerical) variables**"))
    if quantitative_columns:
        display(Markdown(", ".join([f"`{col}`" for col in quantitative_columns])))
    else:
        display(Markdown("_No quantitative variables._"))

    print(separator)

    # Missing values: keep only the columns that have at least one
    display(Markdown("## ❗ **Missing values**"))
    missing_values = df.isnull().sum()
    missing_values = missing_values[missing_values > 0]
    if missing_values.empty:
        display(Markdown("✅ _No missing values detected._"))
    else:
        display(missing_values.to_frame("Missing values").rename_axis("Variable"))

data_summary(df)


## 📊 **Dataset Overview**

- **Number of rows:** 129,880

- **Number of columns:** 23

### 🏷️ **Column names**

`satisfaction`, `Gender`, `Customer Type`, `Age`, `Type of Travel`, `Class`, `Flight Distance`, `Seat comfort`, `Departure/Arrival time convenient`, `Food and drink`, `Gate location`, `Inflight wifi service`, `Inflight entertainment`, `Online support`, `Ease of Online booking`, `On-board service`, `Leg room service`, `Baggage handling`, `Checkin service`, `Cleanliness`, `Online boarding`, `Departure Delay in Minutes`, `Arrival Delay in Minutes`

## 🧮 **Data types**

| Data type | Number of columns |
|-----------|-------------------|
| int64     | 17 |
| object    | 5 |
| float64   | 1 |

## 🔤 **Qualitative (categorical) variables**

`satisfaction`, `Gender`, `Customer Type`, `Type of Travel`, `Class`

## 🔢 **Quantitative (numerical) variables**

`Age`, `Flight Distance`, `Seat comfort`, `Departure/Arrival time convenient`, `Food and drink`, `Gate location`, `Inflight wifi service`, `Inflight entertainment`, `Online support`, `Ease of Online booking`, `On-board service`, `Leg room service`, `Baggage handling`, `Checkin service`, `Cleanliness`, `Online boarding`, `Departure Delay in Minutes`, `Arrival Delay in Minutes`

## ❗ **Missing values**

| Variable | Missing values |
|----------|----------------|
| Arrival Delay in Minutes | 393 |


## Data cleaning

### Data issues

#### Clean column names by replacing spaces and special characters

In [17]:
df.columns

Index(['satisfaction', 'Gender', 'Customer Type', 'Age', 'Type of Travel',
       'Class', 'Flight Distance', 'Seat comfort',
       'Departure/Arrival time convenient', 'Food and drink', 'Gate location',
       'Inflight wifi service', 'Inflight entertainment', 'Online support',
       'Ease of Online booking', 'On-board service', 'Leg room service',
       'Baggage handling', 'Checkin service', 'Cleanliness', 'Online boarding',
       'Departure Delay in Minutes', 'Arrival Delay in Minutes'],
      dtype='object')

In [18]:
import pandas as pd

class DataCleaner:
    def __init__(self, df):
        self.df = df

    def clean_column_names(self):
        """Clean column names by replacing spaces and special characters with underscores."""
        self.df.columns = (
            self.df.columns.str.strip()
            .str.lower()
            .str.replace(r'[^a-z0-9_]', '_', regex=True)
        )
        return self

    def clean_numeric_columns(self, numeric_columns):
        """Strip stray characters from text columns that should hold numbers."""
        for col in numeric_columns:
            # Only touch columns that exist and are still stored as text
            if col in self.df.columns and self.df[col].dtype == object:
                self.df[col] = (
                    self.df[col]
                    .astype(str)
                    .str.replace(r'[^0-9.,-]', '', regex=True)  # keep digits, separators and the minus sign
                    .str.replace(',', '.', regex=False)  # decimal comma -> decimal point
                )
                # Anything still unparsable becomes NaN
                self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
        return self

    def clean_data(self, numeric_columns):
        """Clean the column names, then the listed numeric columns."""
        return (
            self.clean_column_names()
            .clean_numeric_columns(numeric_columns)
            .df
        )

# Columns expected to be numeric, written with the cleaned (snake_case) names
# because clean_column_names() runs first. The categorical columns
# (customer_type, type_of_travel, class) are left out: stripping their letters
# would turn every value into NaN.
numeric_columns = [
        'age', 'flight_distance', 'seat_comfort',
        'departure_arrival_time_convenient', 'food_and_drink', 'gate_location',
        'inflight_wifi_service', 'inflight_entertainment', 'online_support',
        'ease_of_online_booking', 'on_board_service', 'leg_room_service',
        'baggage_handling', 'checkin_service', 'cleanliness', 'online_boarding',
        'departure_delay_in_minutes', 'arrival_delay_in_minutes'
]
# Cleaning
cleaner = DataCleaner(df)
df = cleaner.clean_data(numeric_columns)
print(df)

        satisfaction  gender      customer_type  age   type_of_travel  \
0          satisfied  Female     Loyal Customer   65  Personal Travel   
1          satisfied    Male     Loyal Customer   47  Personal Travel   
2          satisfied  Female     Loyal Customer   15  Personal Travel   
3          satisfied  Female     Loyal Customer   60  Personal Travel   
4          satisfied  Female     Loyal Customer   70  Personal Travel   
...              ...     ...                ...  ...              ...   
129875     satisfied  Female  disloyal Customer   29  Personal Travel   
129876  dissatisfied    Male  disloyal Customer   63  Personal Travel   
129877  dissatisfied    Male  disloyal Customer   69  Personal Travel   
129878  dissatisfied    Male  disloyal Customer   66  Personal Travel   
129879  dissatisfied  Female  disloyal Customer   38  Personal Travel   

           class  flight_distance  seat_comfort  \
0            Eco              265             0   
1       Business     
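
To make the renaming rule concrete, the sketch below applies the same strip, lowercase, and regex-replace chain to three column names from the dataset. It is only an illustration on a standalone `pd.Index`; the variable names `raw_names` and `cleaned_names` are hypothetical.

```python
import pandas as pd

raw_names = pd.Index([
    "Customer Type",
    "Departure/Arrival time convenient",
    "On-board service",
])

# Same chain as clean_column_names: trim, lowercase, then replace every
# character outside [a-z0-9_] with an underscore
cleaned_names = (
    raw_names.str.strip()
    .str.lower()
    .str.replace(r"[^a-z0-9_]", "_", regex=True)
)
print(list(cleaned_names))
```

Because every later cell has to use these snake_case names, any list of columns written with the original spelling will silently match nothing.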

#### Replace illogical values using realistic bounds

In [19]:
class AirlineDataCleaner:
    def __init__(self, df):
        self.df = df.copy()  # work on a copy; the input DataFrame is left untouched

    def _bound_then_median(self, col, min_val, max_val):
        """Set values outside [min_val, max_val] to NaN, then fill NaN with the median."""
        self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
        self.df[col] = self.df[col].where(self.df[col].between(min_val, max_val))
        self.df[col] = self.df[col].fillna(self.df[col].median(skipna=True))

    def correct_categorical_values(self):
        """Replace unexpected categories with the most frequent valid value."""
        # Names and labels follow the cleaned DataFrame
        # (the data spells the label "disloyal Customer" with a lowercase d)
        categorical_rules = {
            "customer_type": ["Loyal Customer", "disloyal Customer"],
            "type_of_travel": ["Business travel", "Personal Travel"],
            "class": ["Eco", "Eco Plus", "Business"]
        }
        for col, valid_values in categorical_rules.items():
            if col in self.df.columns:
                self.df[col] = self.df[col].where(self.df[col].isin(valid_values), np.nan)
                # Fill NaN with the most frequent value
                mode_val = self.df[col].mode(dropna=True)
                if not mode_val.empty:
                    self.df[col] = self.df[col].fillna(mode_val[0])
        return self

    def correct_numerical_values(self):
        """Bound age, distance and delays to realistic ranges."""
        numeric_rules = {
            "age": (10, 100),
            "flight_distance": (50, 20000),
            "departure_delay_in_minutes": (0, 1440),
            "arrival_delay_in_minutes": (0, 1440)
        }
        for col, (min_val, max_val) in numeric_rules.items():
            if col in self.df.columns:
                self._bound_then_median(col, min_val, max_val)
        return self

    def correct_rating_values(self):
        """Correct the satisfaction ratings (0 to 5)."""
        rating_columns = [
            'seat_comfort', 'departure_arrival_time_convenient', 'food_and_drink',
            'gate_location', 'inflight_wifi_service', 'inflight_entertainment',
            'online_support', 'ease_of_online_booking', 'on_board_service',
            'leg_room_service', 'baggage_handling', 'checkin_service',
            'cleanliness', 'online_boarding'
        ]
        for col in rating_columns:
            if col in self.df.columns:
                self._bound_then_median(col, 0, 5)
        return self

    def clean_data(self):
        """Apply all corrections."""
        return (
            self.correct_categorical_values()
                .correct_numerical_values()
                .correct_rating_values()
                .df
        )

cleaner = AirlineDataCleaner(df)
df_cleaned = cleaner.clean_data()  # stored separately; the following cells keep working on df
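
The bounding rule used for the numerical columns can be seen on a handful of values. This is a minimal sketch on made-up ages, assuming the (10, 100) range from the rules; the function name `bound_then_median` is hypothetical.

```python
import pandas as pd

def bound_then_median(series, low, high):
    """Set values outside [low, high] to NaN, then fill NaN with the median."""
    numeric = pd.to_numeric(series, errors="coerce")
    in_range = numeric.where(numeric.between(low, high))
    return in_range.fillna(in_range.median())

ages = pd.Series([25, 40, 7, 130, 55])  # toy values, not from the dataset
fixed_ages = bound_then_median(ages, 10, 100)
print(fixed_ages.tolist())
```

Here 7 and 130 fall outside the range and are replaced by 40, the median of the remaining values. Note that the median is computed after the out-of-range values are masked, so they do not influence the replacement value.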


### Missing values

In [20]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, must precede the IterativeImputer import
from sklearn.impute import IterativeImputer, KNNImputer

class MissingDataHandler:
    def __init__(self, df):
        self.df = df  # Changes are applied directly to the DataFrame passed in

    def _numeric_columns(self):
        """Return the names of the numerical columns."""
        return self.df.select_dtypes(include=[np.number]).columns

    # 1. Drop columns with too many missing values
    def drop_column(self, threshold=0.5):
        """Drop columns whose share of missing values exceeds `threshold` (0 to 1)."""
        min_non_null = int(np.ceil(len(self.df) * (1 - threshold)))
        self.df.dropna(thresh=min_non_null, axis=1, inplace=True)

    # 2. Drop rows with missing values
    def drop_row(self):
        """Drop rows with any missing values."""
        self.df.dropna(inplace=True)

    # 3. Impute with the mean or the median
    def impute_mean_median(self, strategy='mean'):
        """Impute missing values in all numerical columns with mean or median."""
        for col in self._numeric_columns():
            imputer = SimpleImputer(strategy=strategy)
            # fit_transform returns a 2-D float array; flatten it back to one column
            self.df[col] = imputer.fit_transform(self.df[[col]]).ravel()

    def group_imputation(self, group_by, strategy='mean'):
        """Group-wise imputation using mean or median for all numerical columns."""
        for col in self._numeric_columns():
            group_value = self.df.groupby(group_by)[col].transform(strategy)
            self.df[col] = self.df[col].fillna(group_value)

    # 4. Fill with the most frequent value (categorical columns)
    def impute_categorical(self, strategy='most_frequent', fill_value=None):
        """Impute missing values in all categorical columns."""
        cat_cols = self.df.select_dtypes(exclude=[np.number]).columns
        for col in cat_cols:
            if strategy == 'constant' and fill_value is not None:
                self.df[col] = self.df[col].fillna(fill_value)
            else:
                imputer = SimpleImputer(strategy=strategy)
                self.df[col] = imputer.fit_transform(self.df[[col]]).ravel()

    # 5. Forward and backward fill
    def forward_fill(self):
        """Forward fill missing values for all columns."""
        self.df.ffill(inplace=True)

    def backward_fill(self):
        """Backward fill missing values for all columns."""
        self.df.bfill(inplace=True)

    # 6. Interpolation (useful for time series)
    def interpolate(self, method='linear'):
        """Interpolate missing values for all numerical columns."""
        for col in self._numeric_columns():
            self.df[col] = self.df[col].interpolate(method=method)

    # 7. KNN imputation, restricted to numerical columns
    def knn_imputation(self, n_neighbors=5):
        """Impute missing values using KNN for all numerical columns."""
        num_cols = self._numeric_columns()
        imputer = KNNImputer(n_neighbors=n_neighbors)
        self.df[num_cols] = imputer.fit_transform(self.df[num_cols])

    # 8. Iterative Imputer (MICE), restricted to numerical columns
    def iterative_imputation(self):
        """Impute missing values using Iterative Imputer (MICE) for all numerical columns."""
        num_cols = self._numeric_columns()
        imputer = IterativeImputer()
        self.df[num_cols] = imputer.fit_transform(self.df[num_cols])

    def get_dataframe(self):
        """Return the processed DataFrame."""
        return self.df
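
Group-wise imputation is the least obvious of these strategies, so here is a minimal sketch of the idea on a toy frame: each missing delay is filled with the median of its own travel class instead of the global median. The values are made up, and grouping by `class` is an assumption chosen for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "class": ["Eco", "Eco", "Business", "Business", "Business"],
    "arrival_delay_in_minutes": [10.0, np.nan, 0.0, 4.0, np.nan],
})

# Median of each class, broadcast back to the original row order
group_median = toy.groupby("class")["arrival_delay_in_minutes"].transform("median")
toy["arrival_delay_in_minutes"] = toy["arrival_delay_in_minutes"].fillna(group_median)
print(toy["arrival_delay_in_minutes"].tolist())
```

The missing Eco value receives 10 (the Eco median) and the missing Business value receives 2 (the median of 0 and 4).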


In [21]:
# Create the handler on the working DataFrame
handler = MissingDataHandler(df)
# Drop columns with more than 90% missing values
handler.drop_column(threshold=0.9)
# Impute the remaining missing numerical values with the median
handler.impute_mean_median(strategy='median')
# Retrieve the processed DataFrame
cleaned_df = handler.get_dataframe()
print("\nDataFrame after processing:")
cleaned_df.info()
df = cleaned_df
df.describe()


DataFrame after processing:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 23 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   satisfaction                       129880 non-null  object 
 1   gender                             129880 non-null  object 
 2   customer_type                      129880 non-null  object 
 3   age                                129880 non-null  float64
 4   type_of_travel                     129880 non-null  object 
 5   class                              129880 non-null  object 
 6   flight_distance                    129880 non-null  float64
 7   seat_comfort                       129880 non-null  float64
 8   departure_arrival_time_convenient  129880 non-null  float64
 9   food_and_drink                     129880 non-null  float64
 10  gate_location                      129880 non-null  float64
 11  inflight_

Unnamed: 0,age,flight_distance,seat_comfort,departure_arrival_time_convenient,food_and_drink,gate_location,inflight_wifi_service,inflight_entertainment,online_support,ease_of_online_booking,on_board_service,leg_room_service,baggage_handling,checkin_service,cleanliness,online_boarding,departure_delay_in_minutes,arrival_delay_in_minutes
count,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0,129880.0
mean,39.427957,1981.409055,2.838597,2.990645,2.851994,2.990422,3.24913,3.383477,3.519703,3.472105,3.465075,3.485902,3.695673,3.340807,3.705759,3.352587,14.713713,15.045465
std,15.11936,1027.115606,1.392983,1.527224,1.443729,1.30597,1.318818,1.346059,1.306511,1.30556,1.270836,1.292226,1.156483,1.260582,1.151774,1.298715,38.071126,38.416353
min,7.0,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,27.0,1359.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,3.0,2.0,3.0,3.0,3.0,2.0,0.0,0.0
50%,40.0,1925.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,4.0,0.0,0.0
75%,51.0,2544.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,5.0,4.0,5.0,5.0,4.0,5.0,4.0,12.0,13.0
max,85.0,6951.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1592.0,1584.0


In [22]:
def na(df, percent = True):
    srs = df.isna().sum()[df.isna().sum() > 0]
    srs = srs.sort_values(ascending=False)
    if percent:
        print('% of NaNs in df:')
        return srs / df.shape[0]
    else:
        print('# of NaNs in df:')
        return srs

na(df, False)

# of NaNs in df:


Series([], dtype: int64)

### Removing duplicate rows

In [None]:
def remove_duplicates(df, subset=None, keep='first', inplace=False):
    """Report and remove duplicate rows; keep is 'first', 'last' or 'none' (drop every copy)."""
    if keep not in ['first', 'last', 'none']:
        raise ValueError("keep must be one of 'first', 'last', or 'none'.")
    # pandas uses keep=False to drop every member of a duplicate group
    pandas_keep = False if keep == 'none' else keep

    # Count duplicates before removal (every row that belongs to a duplicate group)
    total_rows = len(df)
    duplicate_rows = df.duplicated(subset=subset, keep=False).sum()
    percentage_duplicates = (duplicate_rows / total_rows) * 100
    print(f"Total Rows: {total_rows}")
    print(f"Duplicate Rows: {duplicate_rows} ({percentage_duplicates:.2f}%)")
    if duplicate_rows == 0:
        print("No duplicates found. No rows removed.")
        return df if not inplace else None

    # Handle duplicate removal
    result = df.drop_duplicates(subset=subset, keep=pandas_keep)

    # Count duplicates after removal
    remaining_rows = len(result)
    rows_removed = total_rows - remaining_rows
    print(f"Rows Removed: {rows_removed}")
    print(f"Remaining Rows: {remaining_rows}")
    print("Duplicates successfully removed.")

    if inplace:
        df.drop_duplicates(subset=subset, keep=pandas_keep, inplace=True)
        return None
    return result

remove_duplicates(df, subset=None, keep='first', inplace=False)

Total Rows: 129880
Duplicate Rows: 0 (0.00%)
No duplicates found. No rows removed.
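
The "Duplicate Rows" figure printed by this function counts every row that belongs to a duplicate group, which is not the same as the number of rows that get removed. The toy example below, on made-up values, is a small sketch of that difference.

```python
import pandas as pd

# Toy frame: the first three rows are identical
toy = pd.DataFrame({
    "age": [30, 30, 30, 41],
    "class": ["Eco", "Eco", "Eco", "Business"],
})

rows_in_duplicate_groups = int(toy.duplicated(keep=False).sum())  # every copy is counted
redundant_rows = int(toy.duplicated(keep="first").sum())          # copies beyond the first
deduplicated = toy.drop_duplicates(keep="first")

print(rows_in_duplicate_groups, redundant_rows, len(deduplicated))
```

Three rows belong to a duplicate group, but only two are dropped with `keep="first"`, leaving two rows.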


Unnamed: 0,satisfaction,gender,customer_type,age,type_of_travel,class,flight_distance,seat_comfort,departure_arrival_time_convenient,food_and_drink,...,online_support,ease_of_online_booking,on_board_service,leg_room_service,baggage_handling,checkin_service,cleanliness,online_boarding,departure_delay_in_minutes,arrival_delay_in_minutes
0,satisfied,Female,Loyal Customer,65.0,Personal Travel,Eco,265.0,0.0,0.0,0.0,...,2.0,3.0,3.0,0.0,3.0,5.0,3.0,2.0,0.0,0.0
1,satisfied,Male,Loyal Customer,47.0,Personal Travel,Business,2464.0,0.0,0.0,0.0,...,2.0,3.0,4.0,4.0,4.0,2.0,3.0,2.0,310.0,305.0
2,satisfied,Female,Loyal Customer,15.0,Personal Travel,Eco,2138.0,0.0,0.0,0.0,...,2.0,2.0,3.0,3.0,4.0,4.0,4.0,2.0,0.0,0.0
3,satisfied,Female,Loyal Customer,60.0,Personal Travel,Eco,623.0,0.0,0.0,0.0,...,3.0,1.0,1.0,0.0,1.0,4.0,1.0,3.0,0.0,0.0
4,satisfied,Female,Loyal Customer,70.0,Personal Travel,Eco,354.0,0.0,0.0,0.0,...,4.0,2.0,2.0,0.0,2.0,4.0,2.0,5.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,satisfied,Female,disloyal Customer,29.0,Personal Travel,Eco,1731.0,5.0,5.0,5.0,...,2.0,2.0,3.0,3.0,4.0,4.0,4.0,2.0,0.0,0.0
129876,dissatisfied,Male,disloyal Customer,63.0,Personal Travel,Business,2087.0,2.0,3.0,2.0,...,1.0,3.0,2.0,3.0,3.0,1.0,2.0,1.0,174.0,172.0
129877,dissatisfied,Male,disloyal Customer,69.0,Personal Travel,Eco,2320.0,3.0,0.0,3.0,...,2.0,4.0,4.0,3.0,4.0,2.0,3.0,2.0,155.0,163.0
129878,dissatisfied,Male,disloyal Customer,66.0,Personal Travel,Eco,2450.0,3.0,2.0,3.0,...,2.0,3.0,3.0,2.0,3.0,2.0,1.0,2.0,193.0,205.0


### Outliers

In [None]:
class DataCleaner:
    def __init__(self, df):
        self.df = df
        self.numeric_columns = self.df.select_dtypes(include=['number']).columns.to_list()

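    def remove_na_inf(self):
        """Replaces infinite values with NaN and drops rows containing NaN."""
        self.df[self.numeric_columns] = self.df[self.numeric_columns].replace([np.inf, -np.inf], np.nan)
        # Drop in place: assigning a .dropna() result back to the columns would
        # realign on the index and keep the rows
        self.df.dropna(subset=self.numeric_columns, inplace=True)
        print(" NaN and infinite values removed.")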

    def detect_outliers_iqr(self):
        outliers_dict = {}

        for col in self.numeric_columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            # Find outlier indices
            outliers = self.df[(self.df[col] < lower_bound) | (self.df[col] > upper_bound)].index
            outliers_dict[col] = outliers

        return outliers_dict

    def drop_outliers(self):
        """Drops rows containing outliers based on the IQR method."""
        outliers_dict = self.detect_outliers_iqr()
        outlier_indices = set(index for indices in outliers_dict.values() for index in indices)
        self.df = self.df.drop(outlier_indices)
        print(" Outliers removed.")

    def replace_outliers_iqr(self):
        """Replaces outliers with the lower and upper IQR bounds (Winsorization)."""
        for col in self.numeric_columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR

            self.df[col] = np.where(self.df[col] < lower_bound, lower_bound, self.df[col])
            self.df[col] = np.where(self.df[col] > upper_bound, upper_bound, self.df[col])
        
        print(" Outliers replaced using IQR bounds.")

    def replace_outliers_median(self):
        """Replaces outliers with the column median."""
        for col in self.numeric_columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            median_value = self.df[col].median()

            self.df[col] = np.where((self.df[col] < lower_bound) | (self.df[col] > upper_bound), median_value, self.df[col])
        
        print(" Outliers replaced with median.")

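    def replace_outliers_knn(self, n_neighbors=5):
        """Masks IQR outliers as NaN, then fills them using K-Nearest Neighbors (KNN) imputation."""
        from sklearn.impute import KNNImputer
        # KNNImputer only fills missing values, so outliers are masked first
        self.df[self.numeric_columns] = self.df[self.numeric_columns].astype(float)
        for col, outlier_index in self.detect_outliers_iqr().items():
            self.df.loc[outlier_index, col] = np.nan
        imputer = KNNImputer(n_neighbors=n_neighbors)
        self.df[self.numeric_columns] = imputer.fit_transform(self.df[self.numeric_columns])
        print(" Outliers replaced using KNN imputation.")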

    def save_cleaned_data(self, filename="cleaned_data.csv"):
        """Saves the cleaned DataFrame as a CSV file."""
        self.df.to_csv(filename, index=False)
        print(f" Cleaned data saved to {filename}.")

    def clean_data(self, method="median"):
        self.remove_na_inf()  # Remove NaN and infinite values

        if method == "drop":
            self.drop_outliers()
        elif method == "winsorization":
            self.replace_outliers_iqr()
        elif method == "median":
            self.replace_outliers_median()
        elif method == "knn":
            self.replace_outliers_knn()
        else:
            print("Invalid method. Choose from 'drop', 'winsorization', 'median', 'knn'.")

        self.save_cleaned_data()


In [25]:
# Initialize the DataCleaner
cleaner = DataCleaner(df)
cleaner.clean_data(method="median")  

 NaN and infinite values removed.
 Outliers replaced with median.
 Cleaned data saved to cleaned_data.csv.
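
The IQR rule behind all of these methods can be checked by hand on a short series. The sketch below uses made-up delay values and `Series.clip` to apply the winsorization variant; it illustrates the bounds only and is not the cell that produced the output above.

```python
import pandas as pd

delays = pd.Series([0.0, 0.0, 5.0, 10.0, 12.0, 300.0])  # toy values in minutes

q1, q3 = delays.quantile(0.25), delays.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Winsorization: values beyond the bounds are pulled back to the bounds
winsorized = delays.clip(lower=lower, upper=upper)
print(lower, upper, winsorized.tolist())
```

With Q1 = 1.25 and Q3 = 11.5, the upper bound is 26.875, so the 300-minute delay is capped there. For skewed columns such as delays, where most values are zero, the IQR bounds are tight and many legitimate long delays can be flagged, which is worth keeping in mind when choosing between the methods.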


### Data type conversions

In [26]:
class DataTypeConverter:
    def __init__(self, df):
        self.df = df 

    def convert_data_types(self):
        """
        Converts column data types based on their values:
        - Converts strings containing numbers to int/float.
        - Converts float columns with only integer values to int.
        - Converts boolean-like columns (0/1) to bool.
        - Converts date-like columns to datetime.
        - Converts categorical columns (few unique values) to category.
        """
        for col in self.df.columns:
            # Convert numeric strings to proper numbers
            if self.df[col].dtype == 'object':
                try:
                    self.df[col] = pd.to_numeric(self.df[col])
                except ValueError:
                    pass  # Keep as object if conversion fails

            # Convert float columns with only integer values to int
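            if self.df[col].dtype == 'float64':
                if (self.df[col].dropna() % 1 == 0).all():
                    # Nullable Int64 tolerates NaN; a plain int cast would raise on missing values
                    target_dtype = 'Int64' if self.df[col].isna().any() else 'int64'
                    self.df[col] = self.df[col].astype(target_dtype)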

            # Convert boolean-like columns (0/1) to bool
            if set(self.df[col].dropna().unique()) == {0, 1}:
                self.df[col] = self.df[col].astype(bool)

            # Convert date-like columns to datetime
            if self.df[col].dtype == 'object':
                try:
                    self.df[col] = pd.to_datetime(self.df[col])
                except ValueError:
                    pass  # Keep as object if not a valid date format
            
            # Convert categorical columns (few unique values) to category
            if self.df[col].dtype == 'object' and self.df[col].nunique() / len(self.df) < 0.05:
                self.df[col] = self.df[col].astype('category')

        print(" Data types converted successfully.")

    def get_dataframe(self):
        """Returns the converted DataFrame."""
        return self.df


In [27]:
# Initialize DataTypeConverter
converter = DataTypeConverter(df)

converter.convert_data_types()

# Get the transformed DataFrame
df_converted = converter.get_dataframe()


 Data types converted successfully.


In [None]:
def get_object_columns(df, target=None):
    object_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    if target and target in object_cols:
        object_cols.remove(target)
    return object_cols

# ----- Usage example -----
object_columns = get_object_columns(df, target='satisfaction')
print(len(object_columns), "\n", object_columns)


4 
 ['gender', 'customer_type', 'type_of_travel', 'class']


In [None]:
def label_encode_columns(df, columns=object_columns, verbose=True):
    # Loop through the specified columns and encode them
    for col in columns:
        if col in df.columns:
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col])
            
            # Display label mapping
            if verbose:
                mapping = dict(zip(le.classes_, le.transform(le.classes_)))
                print(f"\n[INFO] Label Encoding for '{col}': {mapping}")
        else:
            print(f"[WARNING] Column '{col}' not found in DataFrame.")
    
    return df
df=label_encode_columns(df, columns=object_columns, verbose=True)


[INFO] Label Encoding for 'gender': {'Female': 0, 'Male': 1}

[INFO] Label Encoding for 'customer_type': {'Loyal Customer': 0, 'disloyal Customer': 1}

[INFO] Label Encoding for 'type_of_travel': {'Business travel': 0, 'Personal Travel': 1}

[INFO] Label Encoding for 'class': {'Business': 0, 'Eco': 1, 'Eco Plus': 2}
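
`LabelEncoder` sorts labels alphabetically, which is why `class` is mapped to Business = 0, Eco = 1, Eco Plus = 2 above. If the codes should follow the service level instead, one option is scikit-learn's `OrdinalEncoder` with an explicit category order. The sketch below assumes the order Eco < Eco Plus < Business, which the source does not state, and runs on a toy frame.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"class": ["Eco", "Business", "Eco Plus", "Eco"]})

# Assumed service order, lowest to highest (not stated in the source)
class_order = [["Eco", "Eco Plus", "Business"]]

encoder = OrdinalEncoder(categories=class_order)
toy["class_encoded"] = encoder.fit_transform(toy[["class"]]).ravel()
print(toy)
```

For tree-based models the choice of codes matters little, but for linear or distance-based models an order that matches the service level is usually easier to interpret.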


In [30]:
df.to_csv("aireline_without_feat.csv", index=False)


## Feature engineering

##### 1. total_service_score

In [32]:
service_cols = [
    'seat_comfort', 'departure_arrival_time_convenient', 'food_and_drink', 
    'gate_location', 'inflight_wifi_service', 'inflight_entertainment', 
    'online_support', 'ease_of_online_booking', 'on_board_service', 
    'leg_room_service', 'baggage_handling', 'checkin_service', 
    'cleanliness', 'online_boarding'
]

df['total_service_score'] = df[service_cols].sum(axis=1)


##### 2. average_service_score

In [34]:
df['average_service_score'] = df[service_cols].mean(axis=1)

##### 3. total_delay

In [33]:
df['total_delay'] = df['departure_delay_in_minutes'] + df['arrival_delay_in_minutes']

##### 4. delay_level

In [35]:
def delay_category(total_delay):
    if total_delay == 0:
        return 'no_delay'
    elif total_delay <= 15:
        return 'short_delay'
    elif total_delay <= 60:
        return 'moderate_delay'
    else:
        return 'long_delay'

df['delay_level'] = df['total_delay'].apply(delay_category)
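
On roughly 130,000 rows, a row-by-row `apply` works but a vectorized binning is usually faster. The sketch below reproduces the same thresholds with `pd.cut` on toy values, assuming total delays are never negative (a negative value would land in `no_delay` here, unlike in `delay_category`).

```python
import numpy as np
import pandas as pd

total_delay = pd.Series([0, 7, 15, 45, 60, 240])  # toy values in minutes

# Right-closed bins: (-inf, 0], (0, 15], (15, 60], (60, inf)
delay_level = pd.cut(
    total_delay,
    bins=[-np.inf, 0, 15, 60, np.inf],
    labels=["no_delay", "short_delay", "moderate_delay", "long_delay"],
)
print(delay_level.astype(str).tolist())
```

The result is an ordered categorical, so the levels keep their natural order in plots and group-by summaries.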


#####  One-hot encoding

In [37]:
new_vars = [
    'delay_level' 
]
df_encoded = pd.get_dummies(df, columns=new_vars, prefix=new_vars, drop_first=True)
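
With `drop_first=True`, pandas drops the first category in sorted order and uses it as the implicit reference level. The toy example below shows which `delay_level` columns remain; it is only an illustration on four made-up rows.

```python
import pandas as pd

toy = pd.DataFrame({
    "delay_level": ["no_delay", "short_delay", "moderate_delay", "long_delay"]
})

encoded = pd.get_dummies(toy, columns=["delay_level"], prefix=["delay_level"], drop_first=True)
print(list(encoded.columns))
```

`long_delay` comes first alphabetically, so it has no column of its own: a row with all three indicators off is a long delay.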


In [38]:
df_encoded.to_csv("aireline_with_feat.csv", index=False)  # save the frame that contains the one-hot columns