# Data Mining Project - Group XX 2025/2026

# Import Libraries

In [83]:
import sqlite3
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

from itertools import product
from ydata_profiling import ProfileReport

# for better resolution plots
%config InlineBackend.figure_format = 'retina'

# SVG lets plots scale indefinitely without losing quality, but it is sometimes slower,
# so for now we use retina


sns.set()

# Data Exploration and Initial Analysis

## Loading the data

Import the datasets from CSV files. FlightsDB and CustomerDB use commas as column separators, and the unique customer identifier ("Loyalty#") is set as the index of both tables. The metadata file uses semicolons and has no header row.

In [84]:
flightsDB = pd.read_csv('DM_AIAI_FlightsDB.csv', sep=",", index_col="Loyalty#")
customerDB = pd.read_csv('DM_AIAI_CustomerDB.csv', sep=",", index_col="Loyalty#")
metaData = pd.read_csv('DM_AIAI_Metadata.csv', sep=";", header=None)

Remove the 'Unnamed' column, which only holds a sequential row number and is redundant now that "Loyalty#" is the index.

In [85]:
customerDB = customerDB.iloc[:, 1:]
customerDB

Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,Postal code,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,M2Z 4K1,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.490930,T3G 6Y6,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.282730,-123.120740,V6E 3D9,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,P1W 1K4,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.428730,-75.713364,J8Y 3Z5,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100012,Ethan,Thompson,Ethan Thompson,Canada,Quebec,Quebec City,46.759733,-71.141009,Y0C 7D6,male,Bachelor,Suburban,,Single,Star,2/27/2019,2/27/2019,,Standard
100013,Layla,Young,Layla Young,Canada,Alberta,Edmonton,53.524829,-113.546357,L3S 9Y3,female,Bachelor,Rural,,Married,Star,9/20/2017,9/20/2017,,Standard
100014,Amelia,Bennett,Amelia Bennett,Canada,New Brunswick,Moncton,46.051866,-64.825428,G2S 2B6,male,Bachelor,Rural,,Married,Star,11/28/2020,11/28/2020,,Standard
100015,Benjamin,Wilson,Benjamin Wilson,Canada,Quebec,Quebec City,46.862970,-71.133444,B1Z 8T3,female,College,Urban,,Married,Star,4/9/2020,4/9/2020,,Standard


## Metadata

In [86]:
# display(metaData)

**CustomerDB Database Variable Description**
- **Loyalty#:**  Unique customer identifier for loyalty program members
- **First Name:**   Customer's first name
- **Last Name:**   Customer's last name 
- **Customer Name:** Customer's full name (concatenated)
- **Country:**	Customer's country of residence
- **Province or State:**	Customer's province or state
- **City:**	Customer's city of residence
- **Latitude:**	Geographic latitude coordinate of customer location
- **Longitude:**	Geographic longitude coordinate of customer location
- **Postal code:**	Customer's postal/ZIP code
- **Gender:**	Customer's gender
- **Education:**	Customer's highest education level (Bachelor, College, etc.)
- **Location Code:** Urban/Suburban/Rural classification of customer residence
- **Income:**	Customer's annual income
- **Marital Status:**	Customer's marital status (Married, Single, Divorced)
- **LoyaltyStatus:**	Current tier status in loyalty program (Star > Nova > Aurora)
- **EnrollmentDateOpening:**	Date when customer joined the loyalty program
- **CancellationDate:**	Date when customer left the program
- **Customer Lifetime Value:** Total calculated monetary value of customer relationship
- **EnrollmentType:**	Method of joining loyalty program


**FlightsDB Database Variable Description**
- **Loyalty#:**	Unique customer identifier linking to CustomerDB
- **Year:**	Year of flight activity record
- **Month:**	Month of flight activity record (1-12)
- **YearMonthDate:**	First day of the month for the activity period
- **NumFlights:**	Total number of flights taken by customer in the month
- **NumFlightsWithCompanions:**	Number of flights where customer traveled with companions
- **DistanceKM:**	Total distance traveled in kilometers for the month
- **PointsAccumulated:**	Loyalty points earned by customer during the month
- **PointsRedeemed:**	Loyalty points spent/redeemed by customer during the month
- **DollarCostPointsRedeemed:**	Dollar value of points redeemed during the month

# Data Understanding

In this section we inspect the shape, column names, and data types of each dataset.
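A minimal sketch of that inspection, assuming pandas is available. The helper name `describe_structure` is hypothetical, and the small frame below only reuses three rows from the FlightsDB output shown in this notebook; in the notebook itself the helper would be called on `flightsDB` and `customerDB`.

```python
import pandas as pd

# Three rows copied from the FlightsDB output in this notebook, for illustration only.
flights_sample = pd.DataFrame(
    {"DistanceKM": [9384.0, 0.0, 14745.0], "PointsAccumulated": [938.0, 0.0, 1474.0]},
    index=pd.Index([413052, 464105, 681785], name="Loyalty#"),
)


def describe_structure(df, name):
    """Print the shape and return one row per column with its dtype and non-null count."""
    n_rows, n_cols = df.shape
    print(f"{name}: {n_rows} rows x {n_cols} columns")
    return pd.DataFrame({"dtype": df.dtypes.astype(str), "non_null": df.count()})


structure = describe_structure(flights_sample, "flightsDB sample")
print(structure)
```

Returning a DataFrame rather than only printing keeps the result easy to compare across the two datasets.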

## Relationships between Variables (FlightsDB) - Margarida

Relationship between distance (km) and points (flights)

In [115]:
flightsDB[["DistanceKM", "PointsAccumulated"]]

Loyalty#,DistanceKM,PointsAccumulated
413052,9384.0,938.00
464105,0.0,0.00
681785,14745.0,1474.00
185013,26311.0,2631.00
216596,19275.0,1927.00
...,...,...
999902,30766.5,3076.65
999911,0.0,0.00
999940,18261.0,1826.10
999982,0.0,0.00


Every 10 km earns 1 point, but the rounding differs across the rows shown:

- in the first rows, points are rounded down, so all values are whole numbers;
- in the last rows, the decimals are kept.
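One way to check this observation on the full table is to label each row by which variant of the rule it matches. The sketch below is only an illustration: the function name `classify_points_rule` and the tolerance are assumptions, and the sample frame reuses the rows from the output above instead of the real `flightsDB`.

```python
import numpy as np
import pandas as pd

# Rows copied from the output above; in the notebook, pass flightsDB instead.
sample = pd.DataFrame(
    {
        "DistanceKM": [9384.0, 0.0, 14745.0, 26311.0, 19275.0, 30766.5, 0.0, 18261.0, 0.0],
        "PointsAccumulated": [938.0, 0.0, 1474.0, 2631.0, 1927.0, 3076.65, 0.0, 1826.1, 0.0],
    }
)


def classify_points_rule(df, tol=1e-6):
    """Label each row: points equal distance/10 exactly, floored, both, or neither."""
    exact = df["DistanceKM"] / 10
    floored = np.floor(exact)
    is_exact = np.isclose(df["PointsAccumulated"], exact, atol=tol)
    is_floored = np.isclose(df["PointsAccumulated"], floored, atol=tol)
    # Conditions are evaluated in order, so "both" wins when the two rules agree.
    labels = np.select(
        [is_exact & is_floored, is_floored, is_exact],
        ["both", "floored", "exact"],
        default="neither",
    )
    return pd.Series(labels, index=df.index, name="points_rule")


rule = classify_points_rule(sample)
print(rule.value_counts())
```

Rows labelled "neither" on the full data would be the ones worth inspecting by hand.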

In [116]:
flightsDB[["PointsRedeemed", "DollarCostPointsRedeemed"]][flightsDB.PointsRedeemed > 0]

Loyalty#,PointsRedeemed,DollarCostPointsRedeemed
185013,3213.0,32.0
281305,4638.0,46.0
755276,4050.0,40.0
950107,5151.0,51.0
360472,6244.0,62.0
...,...,...
994285,2759.4,27.0
994993,4783.5,47.7
996745,4127.4,40.5
998934,2709.0,27.0


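Roughly 100 points = 1 dollar, with different rounding across the rows shown:

- in the top rows, the dollar value is rounded down to whole dollars (no cents);
- in the bottom rows, the dollar value keeps decimals (cents), and the ratio is only approximate (for example, 4783.5 points map to 47.7 dollars).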
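The same kind of check can be sketched for redemptions. The column names come from FlightsDB; the derived column names (`points_per_dollar`, `floor_rule_holds`, `has_cents`) are hypothetical, and the frame below reuses the rows from the output above rather than the full table.

```python
import pandas as pd

# Rows copied from the output above; in the notebook, filter flightsDB on PointsRedeemed > 0.
redeemed = pd.DataFrame(
    {
        "PointsRedeemed": [3213.0, 4638.0, 4050.0, 5151.0, 6244.0, 2759.4, 4783.5, 4127.4, 2709.0],
        "DollarCostPointsRedeemed": [32.0, 46.0, 40.0, 51.0, 62.0, 27.0, 47.7, 40.5, 27.0],
    }
)

# Observed exchange rate per row.
redeemed["points_per_dollar"] = redeemed["PointsRedeemed"] / redeemed["DollarCostPointsRedeemed"]
# True where the dollar value equals points / 100 rounded down to whole dollars.
redeemed["floor_rule_holds"] = (redeemed["PointsRedeemed"] // 100) == redeemed["DollarCostPointsRedeemed"]
# True where the dollar value has a fractional part (cents).
redeemed["has_cents"] = redeemed["DollarCostPointsRedeemed"] % 1 != 0

print(redeemed)
```

On this sample, the two rows with cents are also the two rows where the floor rule does not hold, which suggests a different conversion was applied to them.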

In [117]:
numeric_variables = flightsDB.loc[:, ['NumFlights', 'NumFlightsWithCompanions', 'PointsAccumulated', 'PointsRedeemed', 'DollarCostPointsRedeemed']]
numeric_variables.sum()

NumFlights                  2.377833e+06
NumFlightsWithCompanions    5.986672e+05
PointsAccumulated           4.829630e+08
PointsRedeemed              1.431356e+08
DollarCostPointsRedeemed    1.414513e+06
dtype: float64

To do: build correlation matrices for the pairs of variables that make sense, and look for other potential relationships between variables.
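As a starting point for that to-do item, a correlation matrix over a chosen set of columns could look like the sketch below. The helper `correlation_matrix` is hypothetical, and the frame only reuses the distance/points rows shown earlier, so the resulting number is illustrative and not a finding about the full dataset.

```python
import pandas as pd

# Rows copied from the DistanceKM / PointsAccumulated output shown earlier.
pairs_sample = pd.DataFrame(
    {
        "DistanceKM": [9384.0, 0.0, 14745.0, 26311.0, 19275.0, 30766.5, 0.0, 18261.0, 0.0],
        "PointsAccumulated": [938.0, 0.0, 1474.0, 2631.0, 1927.0, 3076.65, 0.0, 1826.1, 0.0],
    }
)


def correlation_matrix(df, columns, method="pearson"):
    """Correlation matrix restricted to the given columns."""
    return df[columns].corr(method=method)


corr = correlation_matrix(pairs_sample, ["DistanceKM", "PointsAccumulated"])
print(corr.round(3))

# In the notebook, the matrix could then be plotted with seaborn, e.g.:
# sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
```

Passing `method="spearman"` instead would measure monotonic rather than linear association, which may suit skewed variables better.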


## Relationships between Variables (CustomerDB) - Margarida

In [118]:
customerDB[["Income", "Customer Lifetime Value"]]

Loyalty#,Income,Customer Lifetime Value
480934,70146.0,3839.14
549612,0.0,3839.61
429460,0.0,3839.75
608370,0.0,3839.75
530508,97832.0,3842.79
...,...,...
100012,,
100013,,
100014,,
100015,,


From this we can tentatively conclude that the 20 NaN values in Income may belong to the same 20 rows that have NaN in Customer Lifetime Value.
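This hypothesis can be checked directly by comparing the two missing-value masks. A minimal sketch, using only the rows visible in the output above (in the notebook, replace `sample` with `customerDB`):

```python
import numpy as np
import pandas as pd

# Rows copied from the output above, for illustration only.
sample = pd.DataFrame(
    {
        "Income": [70146.0, 0.0, 0.0, 0.0, 97832.0, np.nan, np.nan, np.nan, np.nan],
        "Customer Lifetime Value": [3839.14, 3839.61, 3839.75, 3839.75, 3842.79,
                                    np.nan, np.nan, np.nan, np.nan],
    },
    index=pd.Index(
        [480934, 549612, 429460, 608370, 530508, 100012, 100013, 100014, 100015],
        name="Loyalty#",
    ),
)

income_missing = sample["Income"].isna()
clv_missing = sample["Customer Lifetime Value"].isna()

both_missing = int((income_missing & clv_missing).sum())  # missing in both columns
only_one_missing = int((income_missing ^ clv_missing).sum())  # missing in exactly one

print(f"missing in both: {both_missing}, missing in only one: {only_one_missing}")
```

If `only_one_missing` is 0 on the full table and `both_missing` is 20, the two sets of NaN rows coincide.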

In [119]:
customerDB[["EnrollmentDateOpening", "CancellationDate"]]

Loyalty#,EnrollmentDateOpening,CancellationDate
480934,2/15/2019,
549612,3/9/2019,
429460,7/14/2017,1/8/2021
608370,2/17/2016,
530508,10/25/2017,
...,...,...
100012,2/27/2019,2/27/2019
100013,9/20/2017,9/20/2017
100014,11/28/2020,11/28/2020
100015,4/9/2020,4/9/2020


Once again the last values do not seem to make sense, because the enrollment date is the same as the cancellation date. We will probably have to drop these observations, since they are most likely errors.
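A sketch of how these suspicious rows could be flagged, assuming the dates follow the month/day/year format seen in the output above. The frame reuses only the visible rows; in the notebook, the same logic would run on `customerDB`.

```python
import pandas as pd

# Rows copied from the output above, for illustration only.
dates = pd.DataFrame(
    {
        "EnrollmentDateOpening": ["2/15/2019", "3/9/2019", "7/14/2017", "2/17/2016", "10/25/2017",
                                  "2/27/2019", "9/20/2017", "11/28/2020", "4/9/2020"],
        "CancellationDate": [None, None, "1/8/2021", None, None,
                             "2/27/2019", "9/20/2017", "11/28/2020", "4/9/2020"],
    },
    index=pd.Index(
        [480934, 549612, 429460, 608370, 530508, 100012, 100013, 100014, 100015],
        name="Loyalty#",
    ),
)

# Parse the strings as dates; missing cancellation dates become NaT.
enrolled = pd.to_datetime(dates["EnrollmentDateOpening"], format="%m/%d/%Y")
cancelled = pd.to_datetime(dates["CancellationDate"], format="%m/%d/%Y")

# Flag customers whose cancellation falls on the enrollment day itself.
same_day = cancelled.notna() & (enrolled == cancelled)
suspicious_ids = dates.index[same_day].tolist()
print(suspicious_ids)
```

Keeping the flag as a boolean column first, rather than dropping rows immediately, makes it possible to check whether the same customers also have missing Income and Customer Lifetime Value.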

Note: it makes sense to compare many variables two at a time, so it may be easier to plot histograms for each pair of variables in order to compare all pairs more quickly and draw conclusions.

# Data Quality Check - Maria and Margarida

To do in this section:
- Identifying missing values - **Margarida**
- Checking and correcting data types - **Maria** (general look at the data)
- Detecting and handling duplicate records (open question: do we need to handle duplicates, or only detect them?) - **Maria** (general look at the data)
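The three checks listed above could start from a sketch like the following. The helper `quality_summary` is hypothetical, and the `toy` frame is invented for illustration: it deliberately repeats one Loyalty# so the duplicate checks have something to find, and it does not reflect the real data.

```python
import numpy as np
import pandas as pd


def quality_summary(df):
    """Per-column dtype, number of missing values, and share of missing values (%)."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(2),
        }
    )


# Toy data (not real): Loyalty# 549612 appears twice on purpose.
toy = pd.DataFrame(
    {"Income": [70146.0, 0.0, np.nan, 0.0], "Gender": ["female", "male", "male", "male"]},
    index=pd.Index([480934, 549612, 100012, 549612], name="Loyalty#"),
)

summary = quality_summary(toy)
n_dup_ids = int(toy.index.duplicated().sum())  # repeated customer identifiers
n_dup_rows = int(toy.reset_index().duplicated().sum())  # fully identical records

print(summary)
print(f"duplicated ids: {n_dup_ids}, fully duplicated rows: {n_dup_rows}")
```

Counting repeated identifiers separately from fully identical rows matters here because FlightsDB is expected to repeat Loyalty# across months, while CustomerDB presumably should not.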

## To do:
1. Feature Engineering - Maria
2. Identify Strange Values - Lourenço