# Data Mining Project - Group XX 2025/2026

# Import Libraries

In [83]:
import sqlite3
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

from itertools import product
from ydata_profiling import ProfileReport

# for better resolution plots
%config InlineBackend.figure_format = 'retina'

# SVG can scale plots indefinitely without losing quality, but it is sometimes slower,
# so for now we use retina


sns.set()

# Data Exploration and Initial Analysis

## Loading the data

Import the datasets from CSV files, using commas as column separators and setting the unique customer identifier ("Loyalty#") as the index of both DataFrames. The metadata file uses semicolons as separators and has no header row.

In [84]:
flightsDB = pd.read_csv('DM_AIAI_FlightsDB.csv', sep = ",", index_col= "Loyalty#")
customerDB = pd.read_csv('DM_AIAI_CustomerDB.csv', sep = ",", index_col= "Loyalty#")
metaData = pd.read_csv('DM_AIAI_Metadata.csv', sep = ";", header= None)

Remove the 'Unnamed' column from customerDB: it only holds a sequential numbering of the rows, which is not needed now that "Loyalty#" is the index.

In [85]:
customerDB = customerDB.iloc[:, 1:]
customerDB

Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,Postal code,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,M2Z 4K1,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.490930,T3G 6Y6,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.282730,-123.120740,V6E 3D9,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,P1W 1K4,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.428730,-75.713364,J8Y 3Z5,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100012,Ethan,Thompson,Ethan Thompson,Canada,Quebec,Quebec City,46.759733,-71.141009,Y0C 7D6,male,Bachelor,Suburban,,Single,Star,2/27/2019,2/27/2019,,Standard
100013,Layla,Young,Layla Young,Canada,Alberta,Edmonton,53.524829,-113.546357,L3S 9Y3,female,Bachelor,Rural,,Married,Star,9/20/2017,9/20/2017,,Standard
100014,Amelia,Bennett,Amelia Bennett,Canada,New Brunswick,Moncton,46.051866,-64.825428,G2S 2B6,male,Bachelor,Rural,,Married,Star,11/28/2020,11/28/2020,,Standard
100015,Benjamin,Wilson,Benjamin Wilson,Canada,Quebec,Quebec City,46.862970,-71.133444,B1Z 8T3,female,College,Urban,,Married,Star,4/9/2020,4/9/2020,,Standard


## Metadata

In [86]:
# display(metaData)

**CustomerDB Database Variable Description**
- **Loyalty#:**  Unique customer identifier for loyalty program members
- **First Name:**   Customer's first name
- **Last Name:**   Customer's last name 
- **Customer Name:** Customer's full name (concatenated)
- **Country:**	Customer's country of residence
- **Province or State:**	Customer's province or state
- **City:**	Customer's city of residence
- **Latitude:**	Geographic latitude coordinate of customer location
- **Longitude:**	Geographic longitude coordinate of customer location
- **Postal code:**	Customer's postal/ZIP code
- **Gender:**	Customer's gender
- **Education:**	Customer's highest education level (Bachelor, College, etc.)
- **Location Code:**	Urban/Suburban/Rural classification of customer residence
- **Income:**	Customer's annual income
- **Marital Status:**	Customer's marital status (Married, Single, Divorced)
- **LoyaltyStatus:**	Current tier status in loyalty program (Star > Nova > Aurora)
- **EnrollmentDateOpening:**	Date when customer joined the loyalty program
- **CancellationDate:**	Date when customer left the program
- **Customer Lifetime Value:**	Total calculated monetary value of customer relationship
- **EnrollmentType:**	Method of joining loyalty program


**FlightsDB Database Variable Description**
- **Loyalty#:**	Unique customer identifier linking to CustomerDB
- **Year:**	Year of flight activity record
- **Month:**	Month of flight activity record (1-12)
- **YearMonthDate:**	First day of the month for the activity period
- **NumFlights:**	Total number of flights taken by customer in the month
- **NumFlightsWithCompanions:**	Number of flights where customer traveled with companions
- **DistanceKM:**	Total distance traveled in kilometers for the month
- **PointsAccumulated:**	Loyalty points earned by customer during the month
- **PointsRedeemed:**	Loyalty points spent/redeemed by customer during the month
- **DollarCostPointsRedeemed:**	Dollar value of points redeemed during the month

# Data Understanding

In this section we inspect the shape, column names and data types of each dataset.

## General Look at the DataSet (FlightsDB)  - Maria

In [87]:
flightsDB.shape

(608436, 9)

In [88]:
flightsDB.head(15)


Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
413052,2021,12,12/1/2021,2.0,2.0,9384.0,938.0,0.0,0.0
464105,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
681785,2021,12,12/1/2021,10.0,3.0,14745.0,1474.0,0.0,0.0
185013,2021,12,12/1/2021,16.0,4.0,26311.0,2631.0,3213.0,32.0
216596,2021,12,12/1/2021,9.0,0.0,19275.0,1927.0,0.0,0.0
486956,2021,12,12/1/2021,12.0,7.0,23967.0,2396.0,0.0,0.0
247514,2021,12,12/1/2021,17.0,7.0,23029.0,2302.0,0.0,0.0
711864,2021,12,12/1/2021,6.0,0.0,25995.0,2599.0,0.0,0.0
721372,2021,12,12/1/2021,11.0,3.0,30758.0,3075.0,0.0,0.0
762715,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0


In [89]:
flightsDB.tail(15)

Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
999498,2019,12,12/1/2019,0.9,0.9,30283.2,3028.32,0.0,0.0
999513,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999524,2019,12,12/1/2019,13.5,4.5,22572.9,2257.29,0.0,0.0
999550,2019,12,12/1/2019,8.1,0.0,18168.3,1816.83,0.0,0.0
999589,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999631,2019,12,12/1/2019,3.6,1.8,12262.5,1226.25,0.0,0.0
999731,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999758,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999788,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999891,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0


From the head and tail of the dataset we can already spot some potential problems:

- NumFlights and NumFlightsWithCompanions are stored as floats, and the tail even shows fractional values (e.g. 13.5 flights), although flight counts should be integers.
- PointsAccumulated and PointsRedeemed are also floats. Should they be integers?

We will analyse this further using describe and info.
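One way to quantify the float issue is to count, per column, how many values have a non-zero fractional part. The sketch below is only an illustration: `toy_flights` is a hypothetical stand-in built from a few values shown in the head and tail above, and `count_fractional` is a helper name introduced here, not part of the project.

```python
import pandas as pd

# Hypothetical toy frame standing in for flightsDB (values copied from the head/tail above).
toy_flights = pd.DataFrame(
    {
        "NumFlights": [2.0, 0.0, 13.5, 8.1],
        "NumFlightsWithCompanions": [2.0, 0.0, 4.5, 0.0],
        "PointsAccumulated": [938.0, 0.0, 2257.29, 1816.83],
    }
)


def count_fractional(df: pd.DataFrame) -> pd.Series:
    """Count, for each float column, the values with a non-zero fractional part."""
    floats = df.select_dtypes(include="float")
    # mod(1) keeps the fractional part; NaN is treated as "not fractional".
    return floats.mod(1).fillna(0).ne(0).sum()


fractional_counts = count_fractional(toy_flights)
print(fractional_counts)
```

Applied to the real `flightsDB`, a column whose count is 0 could be cast to an integer type safely, while a non-zero count would point to values that need investigating first.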


In [90]:
flightsDB.info()

<class 'pandas.core.frame.DataFrame'>
Index: 608436 entries, 413052 to 999986
Data columns (total 9 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   Year                      608436 non-null  int64  
 1   Month                     608436 non-null  int64  
 2   YearMonthDate             608436 non-null  object 
 3   NumFlights                608436 non-null  float64
 4   NumFlightsWithCompanions  608436 non-null  float64
 5   DistanceKM                608436 non-null  float64
 6   PointsAccumulated         608436 non-null  float64
 7   PointsRedeemed            608436 non-null  float64
 8   DollarCostPointsRedeemed  608436 non-null  float64
dtypes: float64(6), int64(2), object(1)
memory usage: 46.4+ MB


From info we can see that:

- NumFlights and NumFlightsWithCompanions are stored as float64.
- PointsAccumulated and PointsRedeemed are floats. Should they be integers?
- YearMonthDate is stored as object (a string) rather than as a date.
- There are no missing values.

What will we do?

- Analyse with describe to get a different view.

In [91]:
#To confirm that missing values don't exist
flightsDB.replace("", np.nan, inplace=True)
flightsDB.isna().sum()

Year                        0
Month                       0
YearMonthDate               0
NumFlights                  0
NumFlightsWithCompanions    0
DistanceKM                  0
PointsAccumulated           0
PointsRedeemed              0
DollarCostPointsRedeemed    0
dtype: int64

In [92]:
flightsDB.describe().T

,count,mean,std,min,25%,50%,75%,max
Year,608436.0,2020.0,0.816497,2019.0,2019.0,2020.0,2021.0,2021.0
Month,608436.0,6.5,3.452055,1.0,3.75,6.5,9.25,12.0
NumFlights,608436.0,3.908107,5.057889,0.0,0.0,0.0,7.2,21.0
NumFlightsWithCompanions,608436.0,0.983944,2.003785,0.0,0.0,0.0,0.9,11.0
DistanceKM,608436.0,7939.341419,10260.421873,0.0,0.0,856.4,15338.175,42040.0
PointsAccumulated,608436.0,793.777781,1025.918521,0.0,0.0,85.275,1533.7125,4204.0
PointsRedeemed,608436.0,235.251678,983.233374,0.0,0.0,0.0,0.0,7496.0
DollarCostPointsRedeemed,608436.0,2.324835,9.725168,0.0,0.0,0.0,0.0,74.0


In [93]:
flightsDB.describe(include='object')

# "top" is the mode and "freq" is the frequency of the most frequent value
# "unique" is the number of distinct values (36 different dates, because it is the first day of each month over 3 years)
# "count" is the number of non-null values

,YearMonthDate
count,608436
unique,36
top,12/1/2021
freq,16901


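From both the numeric and the categorical describe we don't notice any out-of-range values, although the 75th percentiles of NumFlights (7.2) and NumFlightsWithCompanions (0.9) confirm that these counts are not always integers.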

In [94]:
# flightsDB.dtypes

#### Check Duplicate Values

In [95]:
#Check how many duplicates exist
flightsDB.duplicated().sum()

np.int64(301411)

In [96]:
#Visualize our duplicates
flightsDB[flightsDB.duplicated()]

Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
762715,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
332716,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
904920,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
671534,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
618871,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
999788,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999891,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999911,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
999982,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0


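From this output we see that the rows flagged as duplicates all have different Loyalty# values. Because Loyalty# is the index, `duplicated()` does not take it into account, so rows from different customers with identical activity (mostly all-zero months) are counted as duplicates.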
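`DataFrame.duplicated()` compares column values only and ignores the index. The small sketch below illustrates the effect on a hypothetical three-row frame (`toy` and its values are made up for the example), using `reset_index()` to bring the identifier back as a column.

```python
import pandas as pd

# Hypothetical toy data: two different customers with identical all-zero activity.
toy = pd.DataFrame(
    {
        "Loyalty#": [762715, 332716, 413052],
        "NumFlights": [0.0, 0.0, 2.0],
        "DistanceKM": [0.0, 0.0, 9384.0],
    }
)

indexed = toy.set_index("Loyalty#")

# With Loyalty# as the index, only the remaining columns are compared.
dup_with_index = int(indexed.duplicated().sum())

# After reset_index(), Loyalty# is a regular column and takes part in the comparison.
dup_with_column = int(indexed.reset_index().duplicated().sum())

print(dup_with_index, dup_with_column)
```

On the real data, `flightsDB.reset_index().duplicated()` would be an alternative to reading the CSV a second time.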

In [97]:
#check the percentage of duplicates in our DataFrame
flightsDB.duplicated().sum() / len(flightsDB) * 100

np.float64(49.53865320263758)

!!!!    The percentage of duplicates is almost 50%    !!!!

This suggests that keeping Loyalty# as the index is the wrong way to check for duplicates, so we load the data again with Loyalty# as a regular feature and repeat the check with that feature included.

In [120]:
flightsDB_index = pd.read_csv('DM_AIAI_FlightsDB.csv', sep = ",")
flightsDB_index.duplicated().sum() / len(flightsDB_index) * 100

np.float64(0.4771249564457067)

With Loyalty# included, only 0.48% of the rows are duplicated, which makes much more sense for our problem.

Since this share is so small, we can decide to drop the duplicates.
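Dropping the duplicates could look like the sketch below. It is a minimal illustration on a hypothetical toy frame (`toy` is not the project data), and it assumes that keeping the first occurrence of each repeated row is acceptable.

```python
import pandas as pd

# Hypothetical toy data: the first two rows are exact duplicates.
toy = pd.DataFrame(
    {
        "Loyalty#": [263267, 263267, 263267, 413052],
        "YearMonthDate": ["6/1/2020", "6/1/2020", "5/1/2020", "12/1/2021"],
        "NumFlights": [0.0, 0.0, 0.0, 2.0],
    }
)

# Keep the first occurrence of each fully identical row and renumber the rows.
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)
removed = len(toy) - len(deduplicated)

print(f"Removed {removed} duplicated row(s); {len(deduplicated)} remain.")
```

Comparing the number of removed rows with the count reported by `duplicated().sum()` is a simple sanity check after the drop.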

In [121]:
flightsDB_index[flightsDB_index["Loyalty#"] == 263267]
# Here we check the duplicates for Loyalty# 263267:
# the DataFrame below shows all the data associated with this loyalty number, and some rows contain exactly the same information


,Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
1092,263267,2020,6,6/1/2020,0.0,0.0,0.0,0.0,0.0,0.0
3150,263267,2020,6,6/1/2020,0.0,0.0,0.0,0.0,0.0,0.0
14057,263267,2020,5,5/1/2020,0.0,0.0,0.0,0.0,0.0,0.0
25441,263267,2020,5,5/1/2020,0.0,0.0,0.0,0.0,0.0,0.0
37425,263267,2020,4,4/1/2020,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
560775,263267,2019,10,10/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
577675,263267,2019,11,11/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
577676,263267,2019,11,11/1/2019,0.0,0.0,0.0,0.0,0.0,0.0
594576,263267,2019,12,12/1/2019,0.0,0.0,0.0,0.0,0.0,0.0


This customer has 72 rows, meaning that each of the 36 monthly records (12 months over 3 years) appears twice.
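To find every customer in this situation, not just one, the rows can be counted per customer and month. The sketch below assumes that (Loyalty#, YearMonthDate) should identify a single row; `toy` is a hypothetical frame used only to keep the example runnable.

```python
import pandas as pd

# Hypothetical toy data: customer 263267 has each month twice, customer 413052 once.
toy = pd.DataFrame(
    {
        "Loyalty#": [263267, 263267, 263267, 263267, 413052, 413052],
        "YearMonthDate": ["5/1/2020", "5/1/2020", "6/1/2020", "6/1/2020", "5/1/2020", "6/1/2020"],
        "NumFlights": [0.0, 0.0, 0.0, 0.0, 2.0, 10.0],
    }
)

# Number of rows for each (customer, month) pair; more than 1 means a repeated record.
rows_per_key = toy.groupby(["Loyalty#", "YearMonthDate"]).size()
repeated_keys = rows_per_key[rows_per_key > 1]

# Customers that have at least one repeated month.
customers_with_repeats = (
    repeated_keys.index.get_level_values("Loyalty#").unique().tolist()
)
print(customers_with_repeats)
```

This check also catches pairs that repeat with different activity values, which `duplicated()` on full rows would not flag.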

## General Look at the Data (CustomerDB) - Maria

In [100]:
customerDB.shape

(16921, 19)

In [101]:
customerDB.head(10)

Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,Postal code,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,M2Z 4K1,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.49093,T3G 6Y6,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.28273,-123.12074,V6E 3D9,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,P1W 1K4,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.42873,-75.713364,J8Y 3Z5,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion
193662,Leatrice,Hanlin,Leatrice Hanlin,Canada,Yukon,Whitehorse,60.721188,-135.05684,Y2K 6R0,male,Bachelor,Rural,26262.0,Married,Star,5/7/2015,,3844.57,Standard
927943,Hue,Sellner,Hue Sellner,Canada,Ontario,Toronto,43.653225,-79.383186,P5S 6R4,female,College,Urban,0.0,Single,Star,6/9/2017,,3857.95,Standard
188893,Nakia,Cash,Nakia Cash,Canada,Ontario,Trenton,44.101128,-77.576309,K8V 4B2,male,Bachelor,Suburban,93272.0,Married,Star,12/8/2019,,3861.49,Standard
852392,Arlene,Conterras,Arlene Conterras,Canada,Quebec,Montreal,45.50169,-73.567253,H2Y 2W2,female,Bachelor,Suburban,93272.0,Married,Star,5/30/2018,,3861.49,Standard
866307,Dustin,Recine,Dustin Recine,Canada,Ontario,Toronto,43.653225,-79.383186,M8Y 4K8,male,Bachelor,Suburban,93272.0,Married,Star,10/14/2019,,3861.49,Standard


In [102]:
customerDB.tail(20)

Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,Postal code,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
999987,Layla,Murphy,Layla Murphy,Canada,New Brunswick,Fredericton,46.029263,-66.56515,R4H 2Y2,female,Bachelor,Urban,,Single,Star,3/7/2017,3/7/2017,,Standard
999988,Jana,Parker,Jana Parker,Canada,Quebec,Montreal,45.573672,-73.523012,N6B 1N3,male,College,Rural,,Single,Star,8/22/2017,8/22/2017,,Standard
999989,Ethan,Parker,Ethan Parker,Canada,Ontario,Trenton,44.075379,-77.550375,P8F 5C8,male,College,Rural,,Married,Star,9/12/2015,9/12/2015,,Standard
999990,Ryan,Anderson,Ryan Anderson,Canada,New Brunswick,Moncton,46.106617,-64.714267,B6P 6D0,female,College,Rural,,Married,Star,6/10/2019,6/10/2019,,Standard
999991,Olivia,Cote,Olivia Cote,Canada,New Brunswick,Fredericton,45.95,-66.652437,X3W 5N2,female,College,Suburban,,Married,Star,7/20/2019,7/20/2019,,Standard
999992,Ella,Roy,Ella Roy,Canada,Ontario,Toronto,43.706878,-79.437412,P6D 6N2,male,College,Suburban,,Single,Star,3/27/2021,3/27/2021,,Standard
999993,Elijah,Cook,Elijah Cook,Canada,British Columbia,Dawson Creek,55.701475,-120.181716,W6H 0Z7,female,College,Suburban,,Married,Star,1/27/2015,1/27/2015,,Standard
999994,Ethan,Chan,Ethan Chan,Canada,Ontario,Ottawa,45.365906,-75.723181,B2F 3E1,female,College,Rural,,Married,Star,5/5/2016,5/5/2016,,Standard
999995,Liam,Wong,Liam Wong,Canada,Ontario,Ottawa,45.471557,-75.704868,B3A 2R0,female,College,Suburban,,Married,Star,3/2/2020,3/2/2020,,Standard
999996,Isabella,Ross,Isabella Ross,Canada,Ontario,Toronto,43.690489,-79.436758,B4W 4M6,female,Bachelor,Suburban,,Single,Star,9/14/2018,9/14/2018,,Standard


From the head and tail of the dataset we can already spot some potential problems:

- The original file contained an 'Unnamed' column holding the number of each row, which we already removed after loading.
- Missing values in some features (Income, CancellationDate and Customer Lifetime Value).
- EnrollmentType takes the value "2021 Promotion", when it is supposed to be a type of enrollment.

We will analyse this further using describe and info.

It is also possible to see that some variables are redundant, such as Customer Name, First Name and Last Name.
To solve this problem we will standardize these values during data preparation.
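Whether Customer Name really is just First Name plus Last Name can be verified rather than assumed. The following is a minimal sketch on a hypothetical two-row frame (`toy_customers`), assuming the full name is the two parts joined by a single space.

```python
import pandas as pd

# Hypothetical toy frame with the three name columns (values from the head above).
toy_customers = pd.DataFrame(
    {
        "First Name": ["Cecilia", "Dayle"],
        "Last Name": ["Householder", "Menez"],
        "Customer Name": ["Cecilia Householder", "Dayle Menez"],
    }
)

# Rebuild the full name from its parts, ignoring stray surrounding whitespace.
rebuilt_name = (
    toy_customers["First Name"].str.strip()
    + " "
    + toy_customers["Last Name"].str.strip()
)

# True only if every rebuilt name matches the stored Customer Name.
is_redundant = bool(rebuilt_name.eq(toy_customers["Customer Name"].str.strip()).all())
print(is_redundant)
```

If the same check holds on `customerDB`, one of the representations can be dropped without losing information.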

In [103]:
customerDB.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16921 entries, 480934 to 100016
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   First Name               16921 non-null  object 
 1   Last Name                16921 non-null  object 
 2   Customer Name            16921 non-null  object 
 3   Country                  16921 non-null  object 
 4   Province or State        16921 non-null  object 
 5   City                     16921 non-null  object 
 6   Latitude                 16921 non-null  float64
 7   Longitude                16921 non-null  float64
 8   Postal code              16921 non-null  object 
 9   Gender                   16921 non-null  object 
 10  Education                16921 non-null  object 
 11  Location Code            16921 non-null  object 
 12  Income                   16901 non-null  float64
 13  Marital Status           16921 non-null  object 
 14  LoyaltyStatus        

From info we can see that:

- There are missing values in Income, Customer Lifetime Value and CancellationDate.
- Missing values in Income can make sense when customers do not want to share their annual income, but they may also be input errors (this depends on interpretation).
- NaN values in CancellationDate also make sense, as they mean that the customer has not left the program.
- For Customer Lifetime Value, we believe NaN values do not make sense: even if a customer has no value for the company, their Customer Lifetime Value should be 0.

What will we do?

- Analyse with describe to get a different view.

In [104]:
#To confirm that missing values exist
customerDB.replace("", np.nan, inplace=True)
customerDB.isna().sum()

First Name                     0
Last Name                      0
Customer Name                  0
Country                        0
Province or State              0
City                           0
Latitude                       0
Longitude                      0
Postal code                    0
Gender                         0
Education                      0
Location Code                  0
Income                        20
Marital Status                 0
LoyaltyStatus                  0
EnrollmentDateOpening          0
CancellationDate           14611
Customer Lifetime Value       20
EnrollmentType                 0
dtype: int64
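Income and Customer Lifetime Value each have 20 missing values, so it is worth checking whether they are missing in the same rows. The sketch below shows one possible check on a hypothetical toy frame (`toy` and its values are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 1 and 2 miss both Income and Customer Lifetime Value.
toy = pd.DataFrame(
    {
        "Income": [70146.0, np.nan, np.nan, 0.0],
        "Customer Lifetime Value": [3839.14, np.nan, np.nan, 3839.61],
    }
)

missing_income = toy["Income"].isna()
missing_clv = toy["Customer Lifetime Value"].isna()

# True if both columns are missing in exactly the same rows.
same_rows = bool((missing_income == missing_clv).all())

# Cross-tabulation of the two missingness flags (rows: Income, columns: CLV).
overlap = pd.crosstab(missing_income, missing_clv)
print(same_rows)
print(overlap)
```

If the rows coincide in `customerDB`, the two gaps probably share a cause and can be handled together in data preparation.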

In [105]:
customerDB.describe()

,Latitude,Longitude,Income,Customer Lifetime Value
count,16921.0,16921.0,16901.0,16901.0
mean,47.1745,-91.814768,37758.0384,7990.460188
std,3.307971,22.242429,30368.992499,6863.173093
min,42.984924,-135.05684,0.0,1898.01
25%,44.231171,-120.23766,0.0,3979.72
50%,46.087818,-79.383186,34161.0,5780.18
75%,49.28273,-74.596184,62396.0,8945.69
max,60.721188,-52.712578,99981.0,83325.38


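From the numerical describe we can see that:

- The 'Unnamed' column no longer appears, since it was removed after loading.
- Income has a minimum and a first quartile of 0, so at least a quarter of the customers have a recorded income of 0, which deserves a closer look later.

From the rest of the information we can't find any other problem at first glance.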

In [106]:
customerDB.describe(include='object')

,First Name,Last Name,Customer Name,Country,Province or State,City,Postal code,Gender,Education,Location Code,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,EnrollmentType
count,16921,16921,16921,16921,16921,16921,16921,16921,16921,16921,16921,16921,16921,2310,16921
unique,4941,15404,16921,1,11,29,75,2,5,3,3,3,2449,1260,2
top,Deon,Salberg,Cecilia Householder,Canada,Ontario,Toronto,V6E 3D9,female,Bachelor,Suburban,Married,Star,4/3/2015,7/7/2020,Standard
freq,13,4,1,16921,5468,3390,917,8497,10586,5716,9842,7761,34,8,15773


From the object describe we can conclude that:

- there are no repeated Customer Names (count = unique = 16921);
- there is only one Country, Canada;
- other aspects will be analysed later if they prove relevant.

#### Check Duplicates

In [107]:
customerDB.duplicated().sum()

np.int64(0)

Checking for duplicates, we verify that there are none.

However, it is still important to check for duplicates without the name features.

In [108]:
customerDB_no_name = customerDB.drop(columns=["First Name", "Last Name", "Customer Name"])
customerDB_no_name.duplicated().sum()

np.int64(0)

The result is the same, so we can conclude that there are no duplicated rows in this DataFrame.

# Data Quality Check - Maria and Margarida

To do on this section:
- Identifying missing values - **Margarida**
- Checking and correcting data types - **Maria** (general look at the data)
- Detecting and handling duplicate records (handling duplicates?? isn't it just detecting?) - **Maria** (general look at the data)
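For the data type item, the date columns are the clearest candidates, since they are currently read as strings. The sketch below shows one possible conversion on a hypothetical toy frame; it assumes the month/day/year layout seen in the outputs above (e.g. 2/15/2019) holds for every row.

```python
import pandas as pd

# Hypothetical toy frame with the two customer date columns stored as strings.
toy = pd.DataFrame(
    {
        "EnrollmentDateOpening": ["2/15/2019", "7/14/2017"],
        "CancellationDate": [None, "1/8/2021"],
    }
)

# Parse month/day/year strings; missing cancellation dates become NaT.
for col in ["EnrollmentDateOpening", "CancellationDate"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y")

print(toy.dtypes)
```

Passing an explicit `format` makes the parse fail loudly on unexpected values instead of silently guessing a layout.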

# Data Aggregation and Exploration - Maria

To do on this section:
- Summing and aggregating data by columns and rows
- Discussing the appropriateness of different operations
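As a starting point for this section, the monthly flight records could be summarised per customer. The sketch below is an assumption about what such an aggregation might look like, not a decided design: `toy` is hypothetical data, and the output column names (`total_flights`, `total_distance_km`, `active_months`) are introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy monthly records for two customers.
toy = pd.DataFrame(
    {
        "Loyalty#": [413052, 413052, 464105],
        "NumFlights": [2.0, 10.0, 0.0],
        "DistanceKM": [9384.0, 14745.0, 0.0],
    }
)

# One row per customer: totals plus the number of months with at least one flight.
per_customer = toy.groupby("Loyalty#").agg(
    total_flights=("NumFlights", "sum"),
    total_distance_km=("DistanceKM", "sum"),
    active_months=("NumFlights", lambda s: int((s > 0).sum())),
)
print(per_customer)
```

Sums suit additive quantities such as flights and distance; for other columns a different operation may be more appropriate, which is the discussion planned in this section.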

## Still to do:
1. Feature Engineering - Maria
2. Identify Strange Values - Lourenço