Neural Networks final project. Group members:
- Luiz Eduardo Schmalz (lefvs)
- Pedro Calheiros (pca)
- Matheus Braga (mbb4)
- Henrique Melo (hcm)

In [40]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data from the CSV file
df = pd.read_csv('train.csv')

# Split the dataset on the 'satisfaction' column; .copy() gives independent
# frames, so later column assignments do not trigger SettingWithCopyWarning
satisfied_df = df[df['satisfaction'] == 'satisfied'].copy()
non_satisfied_df = df[df['satisfaction'] == 'neutral or dissatisfied'].copy()

# Inspect the results
print("Satisfied DataFrame:")
print(satisfied_df.head())

print("Non Satisfied DataFrame:")
print(non_satisfied_df.head())

Satisfied DataFrame:
    Unnamed: 0      id  Gender   Customer Type  Age   Type of Travel   
2            2  110028  Female  Loyal Customer   26  Business travel  \
4            4  119299    Male  Loyal Customer   61  Business travel   
7            7   96462  Female  Loyal Customer   52  Business travel   
13          13   83502    Male  Loyal Customer   33  Personal Travel   
16          16   71142  Female  Loyal Customer   26  Business travel   

       Class  Flight Distance  Inflight wifi service   
2   Business             1142                      2  \
4   Business              214                      3   
7   Business             2035                      4   
13       Eco              946                      4   
16  Business             2123                      3   

    Departure/Arrival time convenient  ...  Inflight entertainment   
2                                   2  ...                       5  \
4                                   3  ...                       3   

In [41]:

def detect_and_replace_outliers(df):
    # Note: df is modified in place, so the caller's DataFrame changes too
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    
    for column in numeric_columns:
        # Fences at 1.5 * IQR below Q1 and above Q3
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        
        # Detect outliers
        outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
        outlier_count = outliers.shape[0]
        
        # Print the number of outliers
        print(f'Column {column}: {outlier_count} outliers detected.')
        
        # Replace outliers with the column mean
        mean_value = df[column].mean()
        df.loc[df[column] < lower_bound, column] = mean_value
        df.loc[df[column] > upper_bound, column] = mean_value

    return df

# Apply the function (df_cleaned and satisfied_df are the same object afterwards)
df_cleaned = detect_and_replace_outliers(satisfied_df)

# Check the first rows of the cleaned DataFrame
print(df_cleaned.head())

Column Unnamed: 0: 0 outliers detected.
Column id: 0 outliers detected.
Column Age: 18 outliers detected.
Column Flight Distance: 0 outliers detected.
Column Inflight wifi service: 0 outliers detected.
Column Departure/Arrival time convenient: 0 outliers detected.
Column Ease of Online booking: 0 outliers detected.
Column Gate location: 0 outliers detected.
Column Food and drink: 0 outliers detected.
Column Online boarding: 4843 outliers detected.
Column Seat comfort: 6034 outliers detected.
Column Inflight entertainment: 5507 outliers detected.
Column On-board service: 0 outliers detected.
Column Leg room service: 0 outliers detected.
Column Baggage handling: 5558 outliers detected.
Column Checkin service: 0 outliers detected.
Column Inflight service: 5505 outliers detected.
Column Cleanliness: 0 outliers detected.
Column Departure Delay in Minutes: 6655 outliers detected.
Column Arrival Delay in Minutes: 7073 outliers detected.
    Unnamed: 0  

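After this check, we concluded that the flagged values are not real outliers in this dataset: the counts are far too high (thousands of rows in several columns), which means these values matter for learning and should not be removed. Note that the function above writes the column mean into satisfied_df itself, so the flagged values in that frame have already been replaced; the float values in the outputs below appear to be a side effect of that. Next, we drop the '#' (unnamed index) and 'id' columns, which add nothing to our machine learning model.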
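To inspect the counts without changing any data, the same IQR rule can be applied in a read-only way. The sketch below is only an illustration: `count_iqr_outliers` and the toy values are hypothetical and not part of the original notebook. It also hints at one plausible reason for the large counts: when most answers in a rating column share the same value, the IQR collapses and every other value falls outside the fences.

```python
import numpy as np
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame) -> pd.Series:
    """Count values outside the 1.5 * IQR fences per numeric column.

    The input frame is only read, never modified.
    """
    numeric = frame.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()


# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "rating": [4, 4, 4, 4, 4, 4, 1, 5],
    "delay": [0, 0, 0, 0, 0, 0, 0, 120],
})
counts = count_iqr_outliers(toy)
print(counts)
```

In the toy frame both quartiles of `rating` equal 4, so the ratings 1 and 5 are flagged even though they are perfectly valid answers.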

In [42]:
satisfied_df = satisfied_df.drop([satisfied_df.columns[0],'id'], axis=1)
non_satisfied_df = non_satisfied_df.drop([non_satisfied_df.columns[0],'id'], axis=1)

print(satisfied_df.head())
print(non_satisfied_df.head())

    Gender   Customer Type   Age   Type of Travel     Class  Flight Distance   
2   Female  Loyal Customer  26.0  Business travel  Business           1142.0  \
4     Male  Loyal Customer  61.0  Business travel  Business            214.0   
7   Female  Loyal Customer  52.0  Business travel  Business           2035.0   
13    Male  Loyal Customer  33.0  Personal Travel       Eco            946.0   
16  Female  Loyal Customer  26.0  Business travel  Business           2123.0   

    Inflight wifi service  Departure/Arrival time convenient   
2                     2.0                                2.0  \
4                     3.0                                3.0   
7                     4.0                                3.0   
13                    4.0                                2.0   
16                    3.0                                3.0   

    Ease of Online booking  Gate location  ...  Inflight entertainment   
2                      2.0            2.0  ...              

Now we need to convert some text columns into binary values, so that the neural network can use them and produce better results.

In [43]:
# Map gender to 0 and 1
map_genero = {'Male': 1, 'Female': 0}
satisfied_df['Gender'] = satisfied_df['Gender'].map(map_genero)
non_satisfied_df['Gender'] = non_satisfied_df['Gender'].map(map_genero)
print(satisfied_df['Gender'])
print(non_satisfied_df['Gender'])



2         0
4         1
7         0
13        1
16        0
         ..
103890    0
103891    1
103894    1
103897    0
103900    1
Name: Gender, Length: 45025, dtype: int64
0         1
1         1
3         0
5         0
6         1
         ..
103898    1
103899    0
103901    1
103902    0
103903    1
Name: Gender, Length: 58879, dtype: int64
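One thing worth checking with `Series.map` is that any label missing from the dictionary silently becomes NaN. A minimal sketch of a guarded mapping follows; the helper name `map_binary` and the toy series are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def map_binary(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers and fail loudly if a label has no mapping."""
    mapped = series.map(mapping)
    # Labels that were present in the input but turned into NaN after mapping
    unmapped = series[mapped.isna() & series.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Labels without a mapping: {list(unmapped)}")
    return mapped.astype("int64")


# Toy series (hypothetical values)
gender = pd.Series(["Male", "Female", "Female"])
encoded = map_binary(gender, {"Male": 1, "Female": 0})
print(encoded.tolist())
```

A misspelled or differently capitalized label, such as "male", would raise a ValueError here instead of producing a missing value.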


Now we apply the same kind of mapping to the other variables that are still stored as text.

In [44]:
# Customer type
map_customer = {'Loyal Customer': 1, 'disloyal Customer': 0}
# Type of travel
map_travel = {'Business travel': 1, 'Personal Travel': 0}
# Class (three categories, so this one is not binary)
map_class = {'Business': 1, 'Eco': 0, 'Eco Plus': 2}

# Now apply these mappings to both datasets
satisfied_df['Customer Type'] = satisfied_df['Customer Type'].map(map_customer)
satisfied_df['Type of Travel'] = satisfied_df['Type of Travel'].map(map_travel)
satisfied_df['Class'] = satisfied_df['Class'].map(map_class)

non_satisfied_df['Customer Type'] = non_satisfied_df['Customer Type'].map(map_customer)
non_satisfied_df['Type of Travel'] = non_satisfied_df['Type of Travel'].map(map_travel)
non_satisfied_df['Class'] = non_satisfied_df['Class'].map(map_class)

print(satisfied_df.head())
print(non_satisfied_df.head())

    Gender  Customer Type   Age  Type of Travel  Class  Flight Distance   
2        0              1  26.0               1      1           1142.0  \
4        1              1  61.0               1      1            214.0   
7        0              1  52.0               1      1           2035.0   
13       1              1  33.0               0      0            946.0   
16       0              1  26.0               1      1           2123.0   

    Inflight wifi service  Departure/Arrival time convenient   
2                     2.0                                2.0  \
4                     3.0                                3.0   
7                     4.0                                3.0   
13                    4.0                                2.0   
16                    3.0                                3.0   

    Ease of Online booking  Gate location  ...  Inflight entertainment   
2                      2.0            2.0  ...                     5.0  \
4               
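The Class column has three categories, and the codes used above (Eco 0, Business 1, Eco Plus 2) do not follow the cabin order. Two common alternatives are sketched below on a toy frame: an ordinal code that assumes the order Eco < Eco Plus < Business, and a one-hot encoding with `pd.get_dummies`. Both are offered only as options to consider, not as what the notebook does.

```python
import pandas as pd

# Toy column (hypothetical values)
toy = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus", "Eco"]})

# Option 1: ordinal codes, assuming the order Eco < Eco Plus < Business
ordinal = toy["Class"].map({"Eco": 0, "Eco Plus": 1, "Business": 2})

# Option 2: one-hot encoding, one 0/1 column per category
one_hot = pd.get_dummies(toy["Class"], prefix="Class", dtype=int)

print(ordinal.tolist())
print(one_hot)
```

One-hot encoding avoids implying any order between the classes, at the cost of two extra input columns.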

In [45]:
# Now bin the age and flight-distance columns into categories

# With right=False the age intervals are [0, 16), [16, 30), [30, 50), [50, 87), [87, inf)
bins = [0, 16, 30, 50, 87, float('inf')]

# Category labels, one per interval
labels = [0, 1, 2, 3, 4]

# Categorize with pd.cut(); the ages are read from the original df and the
# assignment aligns on the row index, so each subset keeps only its own rows
satisfied_df['Age'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
non_satisfied_df['Age'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

# Show the result
print(satisfied_df['Age'])
print(non_satisfied_df['Age'])

2         1
4         3
7         3
13        2
16        1
         ..
103890    3
103891    3
103894    1
103897    3
103900    2
Name: Age, Length: 45025, dtype: category
Categories (5, int64): [0 < 1 < 2 < 3 < 4]
0         0
1         1
3         1
5         1
6         2
         ..
103898    3
103899    1
103901    2
103902    1
103903    1
Name: Age, Length: 58879, dtype: category
Categories (5, int64): [0 < 1 < 2 < 3 < 4]
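To make the bin edges concrete, here is a small sketch that runs the same `pd.cut` call on a handful of hand-picked ages (toy values chosen to sit on and around the edges). With `right=False` each interval includes its left edge and excludes its right edge.

```python
import pandas as pd

# Same edges and labels as in the cell above
bins = [0, 16, 30, 50, 87, float("inf")]
labels = [0, 1, 2, 3, 4]

# Toy ages picked to land on and around the bin edges
ages = pd.Series([7, 16, 26, 30, 52, 61, 87])
codes = pd.cut(ages, bins=bins, labels=labels, right=False)

# The intervals behind each label, for reference
intervals = pd.cut(ages, bins=bins, right=False).cat.categories

print(codes.tolist())
print(list(intervals))
```

An age of exactly 16 falls in category 1 and an age of exactly 30 in category 2, because the left edge of each interval is inclusive.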


In [47]:
# Now define the flight-distance intervals. According to Eurocontrol data, short flights
# go up to 1500 km, medium flights up to 4000 km, and long flights beyond that. The dataset
# is probably in miles, so the cut points used here are 935 and 2485 (1500 km is about 932 miles)
bins = [0, 935, 2485, float('inf')]  # With right=False: [0, 935), [935, 2485), [2485, inf)
labels = [0, 1, 2]
satisfied_df['Flight Distance'] = pd.cut(df['Flight Distance'], bins=bins, labels=labels, right=False)
non_satisfied_df['Flight Distance'] = pd.cut(df['Flight Distance'], bins=bins, labels=labels, right=False)

print(satisfied_df['Flight Distance'])
print(non_satisfied_df['Flight Distance'])

2         1
4         0
7         1
13        1
16        1
         ..
103890    0
103891    1
103894    0
103897    1
103900    1
Name: Flight Distance, Length: 45025, dtype: category
Categories (3, int64): [0 < 1 < 2]
0         0
1         0
3         0
5         1
6         1
         ..
103898    1
103899    0
103901    1
103902    1
103903    1
Name: Flight Distance, Length: 58879, dtype: category
Categories (3, int64): [0 < 1 < 2]
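The conversion behind the two cut points can be checked with a few lines. This sketch assumes statute miles (1 mile = 1.609344 km) and takes the 1500 km and 4000 km thresholds quoted in the cell above as given; the sample distances are toy values.

```python
import pandas as pd

KM_PER_MILE = 1.609344  # statute mile, assumed to be the dataset's unit


def km_to_miles(km: float) -> float:
    """Convert kilometres to statute miles."""
    return km / KM_PER_MILE


short_limit = km_to_miles(1500)   # about 932 miles
medium_limit = km_to_miles(4000)  # about 2485 miles
print(round(short_limit, 1), round(medium_limit, 1))

# Binning a few toy distances with the cut points used in the notebook
distances = pd.Series([214, 946, 2035, 2123, 3000])
bands = pd.cut(distances, bins=[0, 935, 2485, float("inf")],
               labels=[0, 1, 2], right=False)
print(bands.tolist())
```

The notebook's first cut point (935) is slightly above the converted value of roughly 932, so flights between those two distances are labelled short rather than medium.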


Now all the feature columns in our file hold numeric, categorized values. Next we normalize everything with min-max scaling to finish our preprocessing.
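Min-max scaling rescales each column to the range [0, 1] using x' = (x - min) / (max - min). A minimal pandas sketch is shown below, under the assumption that every column passed in can be cast to float; the helper name `min_max_scale` and the toy frame are hypothetical. Constant columns are mapped to 0 to avoid dividing by zero.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to [0, 1]; constant columns become 0."""
    numeric = frame.astype(float)
    col_min = numeric.min()
    col_range = numeric.max() - col_min
    # A zero range means the column is constant; use 1 to avoid dividing by zero
    col_range = col_range.replace(0, 1.0)
    return (numeric - col_min) / col_range


# Toy frame (hypothetical values in the style of the encoded columns)
toy = pd.DataFrame({
    "Age": [1, 3, 2],
    "Class": [0, 2, 1],
    "Gender": [1, 1, 1],
})
scaled = min_max_scale(toy)
print(scaled)
```

If the two subsets are scaled separately, each one gets its own minimum and maximum, so it is usually safer to compute them once on the full training data and reuse them everywhere.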