# Data Mining Project - Group XX 2025/2026

# Import Libraries

In [2]:
import sqlite3
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

from itertools import product
from ydata_profiling import ProfileReport

# for better resolution plots
%config InlineBackend.figure_format = 'retina'

# SVG can scale the plots indefinitely without losing quality, but it is sometimes slower,
# so for now we use retina


sns.set()

# Loading the Data

Import the datasets from the CSV files. The two data files use commas as column separators and the metadata file uses semicolons; the unique customer identifier, "Loyalty#", is set as the index of both DataFrames.

In [3]:
flightsDB = pd.read_csv('DM_AIAI_FlightsDB.csv', sep = ",", index_col= "Loyalty#")
customerDB = pd.read_csv('DM_AIAI_CustomerDB.csv', sep = ",", index_col= "Loyalty#")
metaData = pd.read_csv('DM_AIAI_Metadata.csv', sep = ";", header= None)

Remove the 'Unnamed' column, which only holds a sequential numbering of the rows and is not needed now that "Loyalty#" is the index.

In [4]:
customerDB = customerDB.iloc[:, 1:]
customerDB

Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,Postal code,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,M2Z 4K1,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.490930,T3G 6Y6,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.282730,-123.120740,V6E 3D9,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,P1W 1K4,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.428730,-75.713364,J8Y 3Z5,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100012,Ethan,Thompson,Ethan Thompson,Canada,Quebec,Quebec City,46.759733,-71.141009,Y0C 7D6,male,Bachelor,Suburban,,Single,Star,2/27/2019,2/27/2019,,Standard
100013,Layla,Young,Layla Young,Canada,Alberta,Edmonton,53.524829,-113.546357,L3S 9Y3,female,Bachelor,Rural,,Married,Star,9/20/2017,9/20/2017,,Standard
100014,Amelia,Bennett,Amelia Bennett,Canada,New Brunswick,Moncton,46.051866,-64.825428,G2S 2B6,male,Bachelor,Rural,,Married,Star,11/28/2020,11/28/2020,,Standard
100015,Benjamin,Wilson,Benjamin Wilson,Canada,Quebec,Quebec City,46.862970,-71.133444,B1Z 8T3,female,College,Urban,,Married,Star,4/9/2020,4/9/2020,,Standard
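
Dropping the first column by position only works as long as the column order in the CSV never changes. A slightly more defensive sketch, shown on a small made-up frame rather than on the real file, removes any column whose name starts with "Unnamed". The frame `toy` and its values are purely illustrative.

```python
import pandas as pd

# Toy frame standing in for customerDB right after read_csv (values are made up)
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1, 2],
        "First Name": ["Ana", "Bo", "Cy"],
        "Income": [70146.0, 0.0, None],
    },
    index=pd.Index([480934, 549612, 429460], name="Loyalty#"),
)

# Keep every column whose name does not start with "Unnamed"
cleaned = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]
print(cleaned.columns.tolist())
```

Selecting by name keeps the step safe to re-run: applying it twice leaves the frame unchanged, whereas a second `iloc[:, 1:]` would silently drop a real feature.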


# Metadata

**FlightsDB Database Variable Description**
- **Loyalty#:**	Unique customer identifier linking to CustomerDB
- **Year:**	Year of flight activity record
- **Month:**	Month of flight activity record (1-12)
- **YearMonthDate:**	First day of the month for the activity period
- **NumFlights:**	Total number of flights taken by customer in the month
- **NumFlightsWithCompanions:**	Number of flights where customer traveled with companions
- **DistanceKM:**	Total distance traveled in kilometers for the month
- **PointsAccumulated:**	Loyalty points earned by customer during the month
- **PointsRedeemed:**	Loyalty points spent/redeemed by customer during the month
- **DollarCostPointsRedeemed:**	Dollar value of points redeemed during the month

**CustomerDB Database Variable Description**
- **Loyalty#:**  Unique customer identifier for loyalty program members
- **First Name:**   Customer's first name
- **Last Name:**   Customer's last name 
- **Customer Name:** Customer's full name (concatenated)
- **Country:**	Customer's country of residence
- **Province or State:**	Customer's province or state
- **City:**	Customer's city of residence
- **Latitude:**	Geographic latitude coordinate of customer location
- **Longitude:**	Geographic longitude coordinate of customer location
- **Postal code:**	Customer's postal/ZIP code
- **Gender:**	Customer's gender
- **Education:**	Customer's highest education level (Bachelor, College, etc.)
- **Location Code:**	Urban/Suburban/Rural classification of customer residence
- **Income:**	Customer's annual income
- **Marital Status:**	Customer's marital status (Married, Single, Divorced)
- **LoyaltyStatus:**	Current tier status in loyalty program (Star > Nova > Aurora)
- **EnrollmentDateOpening:**	Date when customer joined the loyalty program
- **CancellationDate:**	Date when customer left the program
- **Customer Lifetime Value:**	Total calculated monetary value of customer relationship
- **EnrollmentType:**	Method of joining loyalty program

# Business Understanding

Define the project's objectives and requirements by translating business goals into data science goals. 
This involves understanding the business problem, identifying success criteria, determining resource needs, and creating an initial project plan with stages, duration, and costs.

Business Success criteria: 
- “A 5% reduction in churn results in €50k monthly savings.”

Data mining Success criteria: 
- “Model accuracy ≥ 85% on test data.” 
- “Segments must be interpretable and actionable by marketing.”


# Data Understanding

In this section we inspect the shape, column names and data types of each dataset.

## General Look at the DataSet (FlightsDB)

In [5]:
flightsDB.shape

(608436, 9)

In [6]:
flightsDB.head(15)


Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
413052,2021,12,12/1/2021,2.0,2.0,9384.0,938.0,0.0,0.0
464105,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
681785,2021,12,12/1/2021,10.0,3.0,14745.0,1474.0,0.0,0.0
185013,2021,12,12/1/2021,16.0,4.0,26311.0,2631.0,3213.0,32.0
216596,2021,12,12/1/2021,9.0,0.0,19275.0,1927.0,0.0,0.0
486956,2021,12,12/1/2021,12.0,7.0,23967.0,2396.0,0.0,0.0
247514,2021,12,12/1/2021,17.0,7.0,23029.0,2302.0,0.0,0.0
711864,2021,12,12/1/2021,6.0,0.0,25995.0,2599.0,0.0,0.0
721372,2021,12,12/1/2021,11.0,3.0,30758.0,3075.0,0.0,0.0
762715,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
flightsDB.tail(15)

Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
999498,2019,12,12/1/2019,0.9,0.9,30283.2,3028.32,0.0,0.0
999513,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999524,2019,12,12/1/2019,13.5,4.5,22572.9,2257.29,0.0,0.0
999550,2019,12,12/1/2019,8.1,0.0,18168.3,1816.83,0.0,0.0
999589,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999631,2019,12,12/1/2019,3.6,1.8,12262.5,1226.25,0.0,0.0
999731,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999758,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999788,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999891,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0


The head and tail of the dataset already reveal some potential problems:

- NumFlights and NumFlightsWithCompanions are stored as floats and contain fractional values (e.g. 13.5), although they are counts.
- PointsAccumulated and PointsRedeemed are stored as floats. Should they be integers?

We will analyse this further using describe and info.


In [8]:
flightsDB.info()

<class 'pandas.core.frame.DataFrame'>
Index: 608436 entries, 413052 to 999986
Data columns (total 9 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Year                      608436 non-null  int64  
 1   Month                     608436 non-null  int64  
 2   YearMonthDate             608436 non-null  object 
 3   NumFlights                608436 non-null  float64
 4   NumFlightsWithCompanions  608436 non-null  float64
 5   DistanceKM                608436 non-null  float64
 6   PointsAccumulated         608436 non-null  float64
 7   PointsRedeemed            608436 non-null  float64
 8   DollarCostPointsRedeemed  608436 non-null  float64
dtypes: float64(6), int64(2), object(1)
memory usage: 46.4+ MB


From info we can see that:

- NumFlights and NumFlightsWithCompanions are stored as floats, although they are counts.
- PointsAccumulated and PointsRedeemed are stored as floats. Should they be integers?
- There are no missing values.

What will we do?

- Analyse with describe to get a different view.
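
Before deciding whether the float columns should become integers, it helps to know how many values actually have a fractional part. The following is a minimal sketch on made-up values that mimic the head and tail shown above; `toy` is a stand-in for flightsDB, not the real data.

```python
import pandas as pd

# Toy stand-in for flightsDB (made-up values mimicking the head/tail output)
toy = pd.DataFrame(
    {
        "NumFlights": [2.0, 0.0, 13.5, 8.1],
        "PointsAccumulated": [938.0, 0.0, 2257.29, 1816.83],
    }
)

# For each column, count the values that have a non-zero fractional part
fractional_counts = toy.apply(lambda col: int((col % 1 != 0).sum()))
print(fractional_counts)
```

A count of zero would mean the column can be cast to an integer type without losing information; a non-zero count means the fractional values need an explanation first.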

In [None]:
#To confirm that missing values don't exist
flightsDB.replace("", np.nan, inplace=True)
flightsDB.isna().sum()

In [None]:
flightsDB.describe().T

In [None]:
flightsDB.describe(include='object')

# "top" is the mode and "freq" is the frequency of the most frequent value
# "unique" is the number of unique values (36 different dates, because it is the first day of each month over 3 years)
# "count" is the number of non-null values

Neither the numeric nor the categorical describe output shows any unexpected value.

## Data Exploration and Analysis (FlightsDB)

### Unique, Max, Min 

In [None]:
print(flightsDB["Year"].unique())
print(flightsDB["Month"].unique())

From the code above we can see that our dataset only has records from 2019, 2020 and 2021, and that every month of the year is present.
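
Checking years and months separately does not guarantee that every year and month combination is present. A minimal sketch of that stricter check, on a made-up frame with one combination deliberately missing, could look like this (`toy` and its values are illustrative only):

```python
from itertools import product

import pandas as pd

# Toy stand-in for flightsDB with one (Year, Month) pair deliberately missing
toy = pd.DataFrame({"Year": [2019, 2019, 2020], "Month": [1, 2, 1]})

# Every pair we expect versus the pairs actually present in the data
expected = set(product([2019, 2020], [1, 2]))
observed = set(zip(toy["Year"], toy["Month"]))

missing = sorted(expected - observed)
print(missing)
```

On the real data the expected set would be built from the three years and the twelve months, and an empty `missing` list would confirm full coverage.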

In [None]:
(flightsDB["NumFlights"].max(), flightsDB["NumFlights"].min())
#from this we can see that there are some customers with 0 flights in a month and the maximum number of flights is 21 in a month

### Values Count

In [None]:
flightsDB["NumFlights"].value_counts()
#it looks like the most common number of flights in a month is 0, meaning that many customers don't fly every month

In [None]:
flightsDB["NumFlightsWithCompanions"].value_counts()
#similarly to NumFlights, the most common value is 0 but the maximum number of flights with companions is 9.9 (float?)

In [None]:
print(flightsDB["Year"].value_counts())
#we can see that the number of records for each year is equally distributed

print('-------------------------------------')

print(flightsDB["Month"].value_counts())
#we can see that the number of records for each month is equally distributed just like for the years

### Check Duplicate Values

In [None]:
#Check how many duplicates exist
print(flightsDB.duplicated().sum())

#check the percentage of duplicates in our DataFrame
print(flightsDB.duplicated().sum() / len(flightsDB) * 100)

**The percentage of duplicates is almost 50%!**

This number is misleading: duplicated() compares only the column values and ignores the index, so rows from different customers with identical activity in the same month (for example, no flights at all) are counted as duplicates while "Loyalty#" is the index. We therefore read the CSV file again into flightsDB, this time keeping "Loyalty#" as a regular feature, and repeat the check.
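
The effect of the index on the duplicate count can be reproduced on a tiny made-up example. In this sketch two different customers have identical activity in the same month; the identifiers and values are invented for illustration.

```python
import pandas as pd

# Two different customers with identical activity in the same month (made-up values)
toy = pd.DataFrame(
    {
        "Loyalty#": [111111, 222222],
        "Year": [2021, 2021],
        "Month": [12, 12],
        "NumFlights": [0.0, 0.0],
    }
)

# Loyalty# as the index: the identifier is ignored, so the second row looks duplicated
dup_with_index = int(toy.set_index("Loyalty#").duplicated().sum())

# Loyalty# as a column: the identifier takes part in the comparison
dup_with_column = int(toy.duplicated().sum())

print(dup_with_index, dup_with_column)
```

An alternative to reloading the file would be calling `reset_index()` on the existing DataFrame, which turns the index back into a regular column.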

In [None]:
flightsDB = pd.read_csv('DM_AIAI_FlightsDB.csv', sep = ",")
flightsDB.duplicated().sum() / len(flightsDB) * 100

With "Loyalty#" included in the comparison, only 0.48% of the rows are duplicated, which makes much more sense for our problem.

Given this small share, we can decide to drop the duplicates.

In [None]:
flightsDB[flightsDB["Loyalty#"] == 263267]
#Here we check the duplicates for Loyalty# 263267
#the DataFrame below shows all the data associated with this Loyalty# and we can see that some rows contain exactly the same information


There are 72 rows for this customer, meaning that each of the 36 unique monthly records (12 months over 3 years) appears twice.
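
To see every copy of a duplicated record rather than only the later occurrences, duplicated() can be called with keep=False. The sketch below uses a made-up excerpt for a single customer; the values are illustrative, not taken from the real file.

```python
import pandas as pd

# Toy stand-in for one customer's records; the first two rows are exact copies
toy = pd.DataFrame(
    {
        "Loyalty#": [263267, 263267, 263267],
        "Year": [2019, 2019, 2019],
        "Month": [1, 1, 2],
        "NumFlights": [3.0, 3.0, 0.0],
    }
)

# keep=False flags every member of a duplicated group, not only the later copies
all_copies = toy[toy.duplicated(keep=False)]

# Number of copies per distinct record
copies_per_record = all_copies.groupby(list(toy.columns)).size()
print(copies_per_record)
```

If every record of a customer appears exactly twice, all group sizes in `copies_per_record` equal 2.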

As said before, to be sure that we are not losing any information, "Loyalty#" has to be treated as a feature and not as an index. This is why flightsDB was reloaded above with "Loyalty#" as a regular column, and the remaining steps use that version.

After this analysis we decide to drop the duplicates, since they represent a minimal percentage of the total data and losing them will not be significant for the final objective of this work.

In [None]:
#we drop the duplicates from the DataFrame that has Loyalty# as a feature
flightsDB.drop_duplicates(inplace= True)

# Check that the duplicates were removed
flightsDB.duplicated().sum()

### New Values Count

After dropping the duplicates, it is important to check again the counts for each year and month, which looked too evenly distributed.

In [None]:
print(flightsDB["Year"].value_counts())

print('--------------------------------')

print(flightsDB["Month"].value_counts())


The counts changed, but they are still quite similar. We expect the same to happen with the counts of NumFlights and NumFlightsWithCompanions.

### Correlation between variables

Correlation is another important analysis. However, it does not make sense for all variables, so we create a new DataFrame with only the variables for which we want to check the correlation.

In [None]:
new = flightsDB[["Year", "Month", "NumFlights", "NumFlightsWithCompanions", "DistanceKM", "PointsAccumulated", "PointsRedeemed", "DollarCostPointsRedeemed"]]

new.corr(method="pearson")

It is difficult to draw conclusions from the raw table above, so we visualize the matrix as a heatmap, which is easier to read.

In [None]:
corr = new.corr(method="pearson").round(2)

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Visualize correlation matrix
fig = plt.figure(figsize=(10, 8))

sns.heatmap(
    corr,
    mask=mask,                # hide upper triangle
    annot=True,               # show values
    cmap="coolwarm",          # divergent color map
    center=0,                 # center colormap in 0
    linewidths=0.5,           # lines between cells to help visualization
    vmin=-1, vmax=1,          # fix scale
    square=True               # make cells square-shaped
)


plt.title("Correlation Matrix (Pearson)", fontsize=14, pad=15)
plt.tight_layout() # improve layout by reducing overlaps
plt.show()


The heatmap makes the correlation between each pair of variables much easier to read. Some variables are perfectly correlated, which suggests that we may not need to keep all of them in the rest of the work:

- PointsRedeemed and DollarCostPointsRedeemed
- DistanceKM and PointsAccumulated
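
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch on made-up data; the column names `a`, `b`, `c` and the 0.95 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: "b" is an exact multiple of "a", "c" is only weakly related to both
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [10.0, 20.0, 30.0, 40.0, 50.0],
        "c": [2.0, -1.0, 0.5, 3.0, -2.0],
    }
)

corr = toy.corr(method="pearson")

# Keep only the strictly lower triangle so each pair appears once
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Long format: one entry per pair, filtered by an illustrative threshold
threshold = 0.95  # assumption: adjust to the analysis at hand
pairs = lower.stack()
high_pairs = pairs[pairs.abs() >= threshold]
print(high_pairs)
```

Applied to the flights correlation matrix, the same filter would return the pairs that are candidates for dropping one of the two variables.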

## General Look at the Data (CustomerDB)

In [None]:
customerDB.shape

In [None]:
customerDB.head(10)

In [None]:
customerDB.tail(20)

The head and tail of the dataset already reveal some potential problems:

- Missing values in some features
- EnrollmentType takes the value "2021 Promotion", which looks like a specific campaign rather than a type of enrollment

We will analyse this further using describe and info.

It is also possible to see that some variables are redundant: Customer Name is just the combination of First Name and Last Name.
To solve this problem we will standardize these values in data preparation.
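
Whether Customer Name is fully determined by First Name and Last Name can be verified instead of assumed. A minimal sketch, using two rows from the preview above as toy data and assuming the names are joined by a single space:

```python
import pandas as pd

# Toy stand-in for the three name columns (two rows from the preview above)
toy = pd.DataFrame(
    {
        "First Name": ["Cecilia", "Dayle"],
        "Last Name": ["Householder", "Menez"],
        "Customer Name": ["Cecilia Householder", "Dayle Menez"],
    }
)

# Rebuild the full name and compare it with the stored one, row by row
rebuilt = toy["First Name"].str.strip() + " " + toy["Last Name"].str.strip()
is_redundant = bool((rebuilt == toy["Customer Name"].str.strip()).all())
print(is_redundant)
```

If the check holds on the full dataset, one of the two representations can be dropped without losing information.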

In [None]:
customerDB.info()

From info we can see that:

- There are missing values in Income, Customer Lifetime Value and CancellationDate.
- Missing values in Income can make sense when customers do not want to share their annual income, but they may also be input errors (this depends on interpretation).
- Missing values in CancellationDate make sense, as they correspond to customers who have not left the program.
- Missing values in Customer Lifetime Value do not seem to make sense: even a customer with no value for the company should have a Customer Lifetime Value of 0.

What will we do?

- Analyse with describe to get a different view.
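
A compact way to quantify the missing values is a table with the absolute and relative number of gaps per column. The sketch below uses a made-up frame with gaps in the same three columns; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for customerDB (made-up values with gaps in the same columns)
toy = pd.DataFrame(
    {
        "Income": [70146.0, 0.0, np.nan, np.nan],
        "CancellationDate": [None, "1/8/2021", None, "2/27/2019"],
        "Customer Lifetime Value": [3839.14, 3839.61, 3839.75, np.nan],
    }
)

# Absolute and relative number of missing values per column
missing_summary = pd.DataFrame(
    {
        "n_missing": toy.isna().sum(),
        "pct_missing": (toy.isna().mean() * 100).round(2),
    }
).sort_values("pct_missing", ascending=False)
print(missing_summary)
```

The percentage column makes it easier to judge later whether a feature should be imputed, kept as-is or dropped.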

In [None]:
#To confirm that missing values exist
customerDB.replace("", np.nan, inplace=True)
customerDB.isna().sum()

In [None]:
customerDB.describe()

From the numerical describe we can see that:

- Once again we have the column Unnamed that has no relevant values

From the rest of the information we can't find any other problem at first glance.

In [None]:
customerDB.describe(include='object')

From the object describe we can conclude that:

- there are no repeated Customer Names (count = unique = 16921);
- there is only one Country, Canada;
- other aspects will be analysed later if they turn out to be relevant.
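
Columns with a single distinct value, such as Country here, can be detected automatically. A minimal sketch on made-up rows (the frame `toy` is illustrative):

```python
import pandas as pd

# Toy stand-in for customerDB (made-up rows)
toy = pd.DataFrame(
    {
        "Country": ["Canada", "Canada", "Canada"],
        "City": ["Toronto", "Edmonton", "Toronto"],
        "Gender": ["female", "male", "male"],
    }
)

# Count distinct values per column, treating NaN as a value of its own
n_unique = toy.nunique(dropna=False)

# Columns with exactly one distinct value are constant
constant_columns = n_unique[n_unique == 1].index.tolist()
print(constant_columns)
```

A constant column cannot help to distinguish customers, so it is a natural candidate for removal during data preparation.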

## Data Exploration and Analysis (CustomerDB)

### Unique, Max, Min

In [None]:
print(customerDB["Country"].unique()) # with this we can see that only one country exists in the data base
print('-------------------------------------')
print(customerDB["Education"].unique())
print('-------------------------------------')
print(customerDB["Location Code"].unique())
print('-------------------------------------')
print(customerDB["Marital Status"].unique())
print('-------------------------------------')
print(customerDB["LoyaltyStatus"].unique())
print('-------------------------------------')
print(customerDB["EnrollmentType"].unique())

The results above show no unexpected values for the features analysed. We can also verify that all customers reside in Canada, but in different types of area, since there are different Location Codes.

### Values Count

In [None]:
print(customerDB["Postal code"].value_counts()) 
#check the frequency of each postal code and we notice that some postal codes are much more common than others

print('-------------------------------------')
print(customerDB["Gender"].value_counts()) 
#we conclude that male and female customers are almost equally represented in the database

print('-------------------------------------')
print(customerDB["Education"].value_counts()) 
#we can see that most customers have a Bachelor degree and few have a Master's

print('-------------------------------------')
print(customerDB["Location Code"].value_counts()) 
#the location codes are quite equally distributed

print('-------------------------------------')
print(customerDB["Marital Status"].value_counts()) 
#most customers are married and only a few are divorced

print('-------------------------------------')
print(customerDB["LoyaltyStatus"].value_counts()) 
#the loyalty tiers (Star, Nova, Aurora) are not equally represented: one tier is far more common and another is clearly the least common

print('-------------------------------------')
print(customerDB["EnrollmentType"].value_counts()) 
# most customers enrolled through the Standard type and very few through the 2021 Promotion, the difference is huge


#### Check Duplicates

In [None]:
customerDB.duplicated().sum()

The check shows that there are no duplicates.

But it is still important to check for duplicates without the name features.

In [None]:
customerDB_no_name = customerDB.drop(columns=["First Name", "Last Name", "Customer Name"])
customerDB_no_name.duplicated().sum()

The result is the same, so we can conclude that there are no duplicated rows in this DataFrame.

Contrary to what we saw with the Flights dataset, here it is not important to consider Loyalty# as a feature. Still, to be consistent when analysing our datasets, we will also add this column as a feature to this dataset.

In [None]:
customerDB = pd.read_csv('DM_AIAI_CustomerDB.csv', sep = ",")
# same step as in the beginning: the first column only holds the row numbers and is of no use
customerDB = customerDB.iloc[:, 1:] 

# to verify that the Loyalty# is now a feature and not an index anymore
customerDB.head()

### Correlation between variables

As before, correlation is an important analysis. However, it does not make sense for all variables, so we create a new DataFrame with only the variables for which we want to check the correlation.

In [None]:
# create a copy of the customerDB and select only the relevant columns for correlation analysis
new = customerDB.copy()
new = new[["Latitude", "Longitude", "Income", "Customer Lifetime Value", "EnrollmentDateOpening", "CancellationDate"]]

# converting date columns to datetime format
new['EnrollmentDateOpening'] = pd.to_datetime(new['EnrollmentDateOpening'], format='%m/%d/%Y', errors='coerce')
new['CancellationDate'] = pd.to_datetime(new['CancellationDate'], format='%m/%d/%Y', errors='coerce')

# using the two date columns converted before to create a new column with the customer duration in days
new['CustomerDurationDays'] = (new['CancellationDate'] - new['EnrollmentDateOpening']).dt.days

# choose the numerical columns for correlation analysis
cols = ["Latitude", "Longitude", "Income", "Customer Lifetime Value", "CustomerDurationDays"]
new[cols].corr(method='pearson')

It is difficult to draw conclusions from the raw table above, so we visualize the matrix as a heatmap, which is easier to read.

In [None]:
corr = new.corr(method="pearson").round(2)

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Visualize correlation matrix
fig = plt.figure(figsize=(10, 8))

sns.heatmap(
    corr,
    mask=mask,                # hide upper triangle
    annot=True,               # show values
    cmap="coolwarm",          # divergent color map
    center=0,                 # center colormap in 0
    linewidths=0.5,           # lines between cells to help visualization
    vmin=-1, vmax=1,          # fix scale
    square=True               # make cells square-shaped
)


plt.title("Correlation Matrix (Pearson)", fontsize=14, pad=15)
plt.tight_layout() # improve layout by reducing overlaps
plt.show()


The heatmap makes the correlation between each pair of variables much easier to read. The variables EnrollmentDateOpening and CustomerDurationDays are correlated, but not strongly enough to lead us to drop one of them. The same happens with Latitude and Longitude, which show a stronger correlation, but maybe still not enough to drop one of these variables.
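
One caveat when reading correlations that involve CustomerDurationDays: the duration is missing for every customer without a CancellationDate, and pandas excludes missing values pairwise when computing correlations, so those coefficients describe only customers who cancelled. The sketch below illustrates this on two made-up rows taken from the preview format.

```python
import pandas as pd

# Toy stand-in for the two date columns (the first customer never cancelled)
toy = pd.DataFrame(
    {
        "EnrollmentDateOpening": ["2/15/2019", "7/14/2017"],
        "CancellationDate": [None, "1/8/2021"],
    }
)

enrolled = pd.to_datetime(toy["EnrollmentDateOpening"], format="%m/%d/%Y", errors="coerce")
cancelled = pd.to_datetime(toy["CancellationDate"], format="%m/%d/%Y", errors="coerce")

# The difference is missing wherever CancellationDate is missing
duration_days = (cancelled - enrolled).dt.days
n_with_duration = int(duration_days.notna().sum())
print(duration_days.tolist(), n_with_duration)
```

Counting the non-missing durations on the real data shows how many customers actually contribute to those correlation coefficients.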


## Data Quality Check in both Datasets

Identify missing values