# **Table of contents**

* __[1. Import Data](#import)__
  * [1.1 Import the needed libraries and documents into Jupyter Notebook](#lib)
  * [1.2. Check for duplicates](#duplicates)
* __[2. Customer Data](#digital_contact)__
  * [2.1 Explore Data](#explore_digital_contact)
    * [2.1.1 Basic Exploration](#dt_basic_exploration)
    * [2.1.2 Statistical Exploration](#dt_statis_exploration)
    * [2.1.3 Visual Exploration](#dt_visual_exploration)
  * [2.2 Preprocess Data](#pprocess_digital_contact)
    * [2.2.1 Missing Values](#dt_missing_values)
    * [2.2.2 Normalizing the Data](#dt_normalizing)
  * [2.3 Modelling](#modelling_digital_contact)
    * [2.3.1 Identify the right number of clusters](#dt_clusters)
    * [2.3.2 Training the model with K-Means](#dt_kmeans)
    * [2.3.3. Visualizing in detail the clusters](#dt_visualize_clusters)
    * [2.3.4. Applying K-means after performing PCA](#dt_PCA_clusters)
    * [2.3.5. Applying DBSCAN](#dt_DBSCAN_clusters)
    * [2.3.6. Applying t-SNE](#dt_TSNE_clusters)
    * [2.3.7. Applying DBSCAN after performing t-SNE](#dt_TSNE_DBSCAN_clusters)


# 1. Importing the data

In [118]:
import pandas as pd

In [119]:
customers_df = pd.read_csv("data/DM_AIAI_CustomerDB.csv")
flights_df = pd.read_csv("data/DM_AIAI_FlightsDB.csv")

### Metadata
#### Customer
- *Loyalty#* - Unique customer identifier for loyalty program members;
- *First Name* - Customer's first name;
- *Last Name* - Customer's last name;
- *Customer Name* - Customer's full name (concatenated);
- *Country* - Customer's country of residence;
- *Province or State* - Customer's province or state;
- *City* - Customer's city of residence;
- *Latitude* - Geographic latitude coordinate of customer location;
- *Longitude* - Geographic longitude coordinate of customer location;
- *Postal code* - Customer's postal/ZIP code;
- *Gender* - Customer's gender;
- *Education* - Customer's highest education level (Bachelor, College, etc.);
- *Location Code* - Urban/Suburban/Rural classification of customer residence;
- *Income* - Customer's annual income;
- *Marital Status* - Customer's marital status (Married, Single, Divorced);
- *LoyaltyStatus* - Current tier status in loyalty program (Star > Nova > Aurora);
- *EnrollmentDateOpening* - Date when customer joined the loyalty program;
- *CancellationDate* - Date when customer left the program;
- *Customer Lifetime Value* - Total calculated monetary value of customer relationship;
- *EnrollmentType* - Method of joining loyalty program;


#### Flights

- *Loyalty#* - Unique customer identifier linking to CustomerDB;
- *Year* - Year of flight activity record;
- *Month* - Month of flight activity record (1-12);
- *YearMonthDate* - First day of the month for the activity period;
- *NumFlights* - Total number of flights taken by customer in the month;
- *NumFlightsWithCompanions* - Number of flights where customer traveled with companions;
- *DistanceKM* - Total distance traveled in kilometers for the month;
- *PointsAccumulated* - Loyalty points earned by customer during the month;
- *PointsRedeemed* - Loyalty points spent/redeemed by customer during the month;
- *DollarCostPointsRedeemed* - Dollar value of points redeemed during the month;

In [120]:
customers_df.head()

Unnamed: 0.1,Unnamed: 0,Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,...,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
0,0,480934,Cecilia,Householder,Cecilia Householder,Canada,Ontario,Toronto,43.653225,-79.383186,...,female,Bachelor,Urban,70146.0,Married,Star,2/15/2019,,3839.14,Standard
1,1,549612,Dayle,Menez,Dayle Menez,Canada,Alberta,Edmonton,53.544388,-113.49093,...,male,College,Rural,0.0,Divorced,Star,3/9/2019,,3839.61,Standard
2,2,429460,Necole,Hannon,Necole Hannon,Canada,British Columbia,Vancouver,49.28273,-123.12074,...,male,College,Urban,0.0,Single,Star,7/14/2017,1/8/2021,3839.75,Standard
3,3,608370,Queen,Hagee,Queen Hagee,Canada,Ontario,Toronto,43.653225,-79.383186,...,male,College,Suburban,0.0,Single,Star,2/17/2016,,3839.75,Standard
4,4,530508,Claire,Latting,Claire Latting,Canada,Quebec,Hull,45.42873,-75.713364,...,male,Bachelor,Suburban,97832.0,Married,Star,10/25/2017,,3842.79,2021 Promotion


In [121]:
flights_df.head()

Unnamed: 0,Loyalty#,Year,Month,YearMonthDate,NumFlights,NumFlightsWithCompanions,DistanceKM,PointsAccumulated,PointsRedeemed,DollarCostPointsRedeemed
0,413052,2021,12,12/1/2021,2.0,2.0,9384.0,938.0,0.0,0.0
1,464105,2021,12,12/1/2021,0.0,0.0,0.0,0.0,0.0,0.0
2,681785,2021,12,12/1/2021,10.0,3.0,14745.0,1474.0,0.0,0.0
3,185013,2021,12,12/1/2021,16.0,4.0,26311.0,2631.0,3213.0,32.0
4,216596,2021,12,12/1/2021,9.0,0.0,19275.0,1927.0,0.0,0.0


# 2. Data Exploration

## 2.1. Basic Analysis and basic preprocessing

In this section we check for: __`Duplicates`__, __`Data types`__, and __`Missing values and Anomalous values`__.



In [122]:
customers_df.shape

(16921, 21)

In [123]:
customers_df.columns

Index(['Unnamed: 0', 'Loyalty#', 'First Name', 'Last Name', 'Customer Name',
       'Country', 'Province or State', 'City', 'Latitude', 'Longitude',
       'Postal code', 'Gender', 'Education', 'Location Code', 'Income',
       'Marital Status', 'LoyaltyStatus', 'EnrollmentDateOpening',
       'CancellationDate', 'Customer Lifetime Value', 'EnrollmentType'],
      dtype='object')

### 2.1.1 Handling duplicates

We treat the `Loyalty#` column as the unique key, so we pass it as the `subset` when checking for duplicates.

In [124]:
customers_df.duplicated(subset="Loyalty#").sum()

164

- There are 164 rows whose `Loyalty#` already appears earlier in the data. Let's look at the affected rows.

In [125]:
duplicates_df = customers_df[customers_df.duplicated(subset="Loyalty#", keep=False)]
duplicates_df.sort_values(by="Loyalty#").head()

Unnamed: 0.1,Unnamed: 0,Loyalty#,First Name,Last Name,Customer Name,Country,Province or State,City,Latitude,Longitude,...,Gender,Education,Location Code,Income,Marital Status,LoyaltyStatus,EnrollmentDateOpening,CancellationDate,Customer Lifetime Value,EnrollmentType
1646,1646,101902,Hans,Schlottmann,Hans Schlottmann,Canada,Ontario,London,42.984924,-81.245277,...,female,College,Rural,0.0,Married,Aurora,1/7/2020,,6265.34,Standard
2668,2668,101902,Yi,Nesti,Yi Nesti,Canada,Ontario,Toronto,43.653225,-79.383186,...,female,Bachelor,Urban,79090.0,Married,Aurora,3/19/2020,,8609.16,Standard
15988,15988,106001,Maudie,Hyland,Maudie Hyland,Canada,New Brunswick,Fredericton,45.963589,-66.643112,...,female,Master,Suburban,14973.0,Divorced,Star,7/16/2015,,12168.74,Standard
700,700,106001,Ivette,Peifer,Ivette Peifer,Canada,Quebec,Montreal,45.50169,-73.567253,...,female,High School or Below,Suburban,10037.0,Single,Star,1/11/2016,,4914.04,Standard
13053,13053,106509,Stacy,Schwebke,Stacy Schwebke,Canada,Ontario,Toronto,43.653225,-79.383186,...,female,College,Suburban,0.0,Single,Star,6/12/2021,,4661.98,Standard
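
The rows above share a `Loyalty#` but describe different people, so it may be worth checking whether duplicated keys are exact copies or conflicting records before dropping anything. A minimal sketch on toy data (the frame `toy` and the helper `split_duplicate_keys` are hypothetical and not part of the notebook):

```python
import pandas as pd

def split_duplicate_keys(df: pd.DataFrame, key: str):
    """Return (exact_copies, conflicting) rows among those sharing a key."""
    # Every row whose key appears more than once.
    dup_rows = df[df.duplicated(subset=key, keep=False)]
    # Rows that match another row on all columns, not only on the key.
    exact_mask = dup_rows.duplicated(keep=False)
    return dup_rows[exact_mask], dup_rows[~exact_mask]

toy = pd.DataFrame({
    "Loyalty#": [1, 1, 2, 2, 3],
    "Customer Name": ["Ann", "Ann", "Bob", "Cy", "Dee"],
})
exact_copies, conflicting = split_duplicate_keys(toy, "Loyalty#")
print(len(exact_copies), len(conflicting))
```

Conflicting rows are the ones where keeping only the first occurrence discards a record that differs from the one retained.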


In [126]:
customers_df["Loyalty#"].value_counts(ascending=False)

Loyalty#
678205    3
750665    2
369638    2
615561    2
411734    2
         ..
532945    1
570531    1
111584    1
612339    1
100016    1
Name: count, Length: 16757, dtype: int64

- Let's drop the 164 duplicates, keeping the first row for each `Loyalty#` (the `drop_duplicates` default)

In [127]:
customers_df.shape

(16921, 21)

In [128]:
customers_df.drop_duplicates(subset="Loyalty#", inplace=True)

- Now let's set `Loyalty#` as the unique index of the DataFrame

In [129]:
customers_df.set_index("Loyalty#", inplace=True)
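
As a hedged alternative to the in-place calls above, the same two steps can be written so that the uniqueness assumption is checked rather than taken for granted. The sketch below uses a hypothetical toy frame; `verify_integrity=True` makes `set_index` raise if any key is still duplicated.

```python
import pandas as pd

toy = pd.DataFrame({
    "Loyalty#": [10, 10, 20],
    "City": ["Toronto", "London", "Hull"],
})

# keep="first" is the pandas default; spelling it out documents the choice.
deduped = toy.drop_duplicates(subset="Loyalty#", keep="first")

# verify_integrity=True raises ValueError if the new index has duplicates.
indexed = deduped.set_index("Loyalty#", verify_integrity=True)
print(indexed)
```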

### 2.1.3 Handling missing and anomalous values

In [130]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16757 entries, 480934 to 100016
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               16757 non-null  int64  
 1   First Name               16757 non-null  object 
 2   Last Name                16757 non-null  object 
 3   Customer Name            16757 non-null  object 
 4   Country                  16757 non-null  object 
 5   Province or State        16757 non-null  object 
 6   City                     16757 non-null  object 
 7   Latitude                 16757 non-null  float64
 8   Longitude                16757 non-null  float64
 9   Postal code              16757 non-null  object 
 10  Gender                   16757 non-null  object 
 11  Education                16757 non-null  object 
 12  Location Code            16757 non-null  object 
 13  Income                   16737 non-null  float64
 14  Marital Status       
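
The `info()` output reports 16737 non-null `Income` values out of 16757 rows. A compact way to see such gaps for every column at once is a missing-value summary; the sketch below runs on a hypothetical toy frame rather than on `customers_df`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Income": [70146.0, 0.0, np.nan, 97832.0],
    "Gender": ["female", "male", "male", None],
})

# Count and share of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)
print(missing)
```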

In [131]:
customers_df.describe(include="O").T

Unnamed: 0,count,unique,top,freq
First Name,16757,4935,Stacey,13
Last Name,16757,15263,Ypina,4
Customer Name,16757,16757,Cecilia Householder,1
Country,16757,1,Canada,16757
Province or State,16757,11,Ontario,5410
City,16757,29,Toronto,3354
Postal code,16757,75,V6E 3D9,911
Gender,16757,2,female,8421
Education,16757,5,Bachelor,10483
Location Code,16757,3,Suburban,5659


In [132]:
customers_df["Country"].value_counts()

Country
Canada    16757
Name: count, dtype: int64

- `First Name`, `Last Name` and `Customer Name` will not be useful for the cluster analysis, so we will drop them
- `Country` has a single value: all our customers are from Canada, so it will not help the cluster analysis either
- `Education` has 5 values, let's look into them
- `Marital Status` has 3 values, let's check them
- `Location Code` has 3 values, let's check them
- `LoyaltyStatus` has 3 values, let's check them
- `EnrollmentType` has 2 values, let's check them
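
The per-column checks listed above can also be run in one loop. This is a small sketch on a hypothetical toy frame; the column list is an assumption and would be extended to the columns named in the bullets.

```python
import pandas as pd

toy = pd.DataFrame({
    "Education": ["Bachelor", "College", "Bachelor"],
    "Marital Status": ["Married", "Single", "Married"],
})

columns_to_check = ["Education", "Marital Status"]  # extend as needed

# dropna=False also counts missing values, which a plain value_counts() hides.
counts = {col: toy[col].value_counts(dropna=False) for col in columns_to_check}
for col, vc in counts.items():
    print(vc, end="\n\n")
```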


In [133]:
customers_df["Education"].value_counts()

Education
Bachelor                10483
College                  4248
High School or Below      782
Doctor                    734
Master                    510
Name: count, dtype: int64

In [134]:
customers_df["Marital Status"].value_counts()

Marital Status
Married     9747
Single      4492
Divorced    2518
Name: count, dtype: int64

In [135]:
customers_df["Location Code"].value_counts()

Location Code
Suburban    5659
Rural       5615
Urban       5483
Name: count, dtype: int64

In [136]:
customers_df["LoyaltyStatus"].value_counts()

LoyaltyStatus
Star      7657
Nova      5671
Aurora    3429
Name: count, dtype: int64

- Let's check the cities, shortest names first

In [137]:
customers_by_city = customers_df.groupby("City").agg(count=("City", "count"))
customers_by_city.sort_values(by="City", key=lambda x: x.str.len()).head()

Unnamed: 0_level_0,count
City,Unnamed: 1_level_1
Hull,358
Banff,186
London,174
Regina,409
Ottawa,511


In [138]:
columns_to_remove = ["First Name", "Last Name", "Customer Name"]

In [139]:
customers_df.drop(columns=columns_to_remove, inplace=True)

### 2.1.4 Checking and updating DataTypes

In [140]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16757 entries, 480934 to 100016
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Unnamed: 0               16757 non-null  int64  
 1   Country                  16757 non-null  object 
 2   Province or State        16757 non-null  object 
 3   City                     16757 non-null  object 
 4   Latitude                 16757 non-null  float64
 5   Longitude                16757 non-null  float64
 6   Postal code              16757 non-null  object 
 7   Gender                   16757 non-null  object 
 8   Education                16757 non-null  object 
 9   Location Code            16757 non-null  object 
 10  Income                   16737 non-null  float64
 11  Marital Status           16757 non-null  object 
 12  LoyaltyStatus            16757 non-null  object 
 13  EnrollmentDateOpening    16757 non-null  object 
 14  CancellationDate     

- Let's drop the `Unnamed: 0` column, since `Loyalty#` is now the index


In [141]:
customers_df.drop(columns="Unnamed: 0", inplace=True)
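
An `Unnamed: 0` column typically appears when a CSV was written together with its index and then read back without declaring one. Assuming that is what happened here (the source files are not shown), passing `index_col=0` to `pd.read_csv` avoids creating the column in the first place. The sketch uses a hypothetical in-memory CSV:

```python
import io
import pandas as pd

csv_text = ",Loyalty#,City\n0,480934,Toronto\n1,549612,Edmonton\n"

# Default read: the unnamed first column becomes "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the index instead of keeping it as data.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```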

In [142]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16757 entries, 480934 to 100016
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Country                  16757 non-null  object 
 1   Province or State        16757 non-null  object 
 2   City                     16757 non-null  object 
 3   Latitude                 16757 non-null  float64
 4   Longitude                16757 non-null  float64
 5   Postal code              16757 non-null  object 
 6   Gender                   16757 non-null  object 
 7   Education                16757 non-null  object 
 8   Location Code            16757 non-null  object 
 9   Income                   16737 non-null  float64
 10  Marital Status           16757 non-null  object 
 11  LoyaltyStatus            16757 non-null  object 
 12  EnrollmentDateOpening    16757 non-null  object 
 13  CancellationDate         2288 non-null   object 
 14  Customer Lifetime Val

- Let's convert `CancellationDate` and `EnrollmentDateOpening` to datetime

In [143]:
customers_df["EnrollmentDateOpening"].head()

Loyalty#
480934     2/15/2019
549612      3/9/2019
429460     7/14/2017
608370     2/17/2016
530508    10/25/2017
Name: EnrollmentDateOpening, dtype: object

In [None]:
customers_df["EnrollmentDateOpening"] = pd.to_datetime(customers_df["EnrollmentDateOpening"])

In [145]:
customers_df["EnrollmentDateOpening"].head()

Loyalty#
480934   2019-02-15
549612   2019-03-09
429460   2017-07-14
608370   2016-02-17
530508   2017-10-25
Name: EnrollmentDateOpening, dtype: datetime64[ns]

In [147]:
customers_df["CancellationDate"]

Loyalty#
480934           NaN
549612           NaN
429460      1/8/2021
608370           NaN
530508           NaN
             ...    
100012     2/27/2019
100013     9/20/2017
100014    11/28/2020
100015      4/9/2020
100016     7/21/2020
Name: CancellationDate, Length: 16757, dtype: object

In [148]:
customers_df["CancellationDate_T"] = pd.to_datetime(customers_df["CancellationDate"], errors="coerce")

- Let's filter the dates that failed to convert

In [160]:
failed_mask = customers_df["CancellationDate"].notna() & customers_df["CancellationDate_T"].isna()
customers_df.loc[failed_mask, ["CancellationDate", "CancellationDate_T"]]

Unnamed: 0_level_0,CancellationDate,CancellationDate_T
Loyalty#,Unnamed: 1_level_1,Unnamed: 2_level_1
314558,2/29/2019,NaT
373118,2/29/2019,NaT


- 2019 was not a leap year, so 29 February 2019 does not exist; `errors="coerce"` turned both values into `NaT`
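
The raw values shown above follow a month/day/year pattern, so the parse can also be written with an explicit format. This is a sketch on a hypothetical toy series, assuming every value in the column uses that same pattern:

```python
import pandas as pd

raw = pd.Series(["1/8/2021", "2/29/2019", None], name="CancellationDate")

# An explicit month/day/year format avoids per-value format inference;
# errors="coerce" turns impossible dates into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Values that were present in the raw data but could not be parsed.
failed = raw[raw.notna() & parsed.isna()]
print(failed.tolist())
```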

In [162]:
customers_df["CancellationDate"] = customers_df["CancellationDate_T"]

In [164]:
customers_df.drop(columns="CancellationDate_T", inplace=True)

In [165]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16757 entries, 480934 to 100016
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Country                  16757 non-null  object        
 1   Province or State        16757 non-null  object        
 2   City                     16757 non-null  object        
 3   Latitude                 16757 non-null  float64       
 4   Longitude                16757 non-null  float64       
 5   Postal code              16757 non-null  object        
 6   Gender                   16757 non-null  object        
 7   Education                16757 non-null  object        
 8   Location Code            16757 non-null  object        
 9   Income                   16737 non-null  float64       
 10  Marital Status           16757 non-null  object        
 11  LoyaltyStatus            16757 non-null  object        
 12  EnrollmentDateOpening    16757 

## 2.2. Feature engineering

- From `EnrollmentDateOpening` we create `DaysSinceEnrollment`
- From `CancellationDate` we create `DaysSinceCancellation`
- We also create `EnrollmentDurationInDays`, the number of days between `EnrollmentDateOpening` and `CancellationDate` (missing for customers who have not cancelled)


In [168]:
today = pd.Timestamp.today()  # reference date; differs on every run
today

Timestamp('2025-10-01 08:19:58.686178')

In [172]:
customers_df["DaysSinceEnrollment"] = (today - customers_df["EnrollmentDateOpening"]).dt.days
customers_df["DaysSinceCancellation"] = (today - customers_df["CancellationDate"]).dt.days
customers_df["EnrollmentDurationInDays"] = (customers_df["CancellationDate"] - customers_df["EnrollmentDateOpening"]).dt.days

In [174]:
customers_df[["EnrollmentDateOpening", "CancellationDate", "DaysSinceEnrollment", "DaysSinceCancellation", "EnrollmentDurationInDays"]].head()

Unnamed: 0_level_0,EnrollmentDateOpening,CancellationDate,DaysSinceEnrollment,DaysSinceCancellation,EnrollmentDurationInDays
Loyalty#,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
480934,2019-02-15,NaT,2420,,
549612,2019-03-09,NaT,2398,,
429460,2017-07-14,2021-01-08,3001,1727.0,1274.0
608370,2016-02-17,NaT,3514,,
530508,2017-10-25,NaT,2898,,
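
Because the features above are measured from the run date, the numbers shift every time the notebook is executed. A reproducible variant pins the reference date. In this sketch `REFERENCE_DATE` is a hypothetical constant chosen to match the date of the run shown above, and the toy rows reuse two enrollment dates from the table:

```python
import pandas as pd

# Hypothetical fixed reference date; the notebook uses pd.Timestamp.today().
REFERENCE_DATE = pd.Timestamp("2025-10-01")

toy = pd.DataFrame({
    "EnrollmentDateOpening": pd.to_datetime(["2017-07-14", "2019-02-15"]),
    "CancellationDate": pd.to_datetime(["2021-01-08", None]),
})

toy["DaysSinceEnrollment"] = (REFERENCE_DATE - toy["EnrollmentDateOpening"]).dt.days
toy["DaysSinceCancellation"] = (REFERENCE_DATE - toy["CancellationDate"]).dt.days
toy["EnrollmentDurationInDays"] = (
    toy["CancellationDate"] - toy["EnrollmentDateOpening"]
).dt.days
print(toy)
```

Customers without a cancellation date keep `NaN` in the two cancellation-based columns, so those columns will need explicit handling before clustering.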


In [177]:
customers_df.drop(columns=["EnrollmentDateOpening", "CancellationDate"], inplace=True)

In [178]:
customers_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16757 entries, 480934 to 100016
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Country                   16757 non-null  object 
 1   Province or State         16757 non-null  object 
 2   City                      16757 non-null  object 
 3   Latitude                  16757 non-null  float64
 4   Longitude                 16757 non-null  float64
 5   Postal code               16757 non-null  object 
 6   Gender                    16757 non-null  object 
 7   Education                 16757 non-null  object 
 8   Location Code             16757 non-null  object 
 9   Income                    16737 non-null  float64
 10  Marital Status            16757 non-null  object 
 11  LoyaltyStatus             16757 non-null  object 
 12  Customer Lifetime Value   16737 non-null  float64
 13  EnrollmentType            16757 non-null  object 
 14  DaysS

## 2.3. Statistical Analysis

In [179]:
customers_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Latitude,16757.0,47.176825,3.307562,42.984924,44.231171,46.087818,49.28273,60.721188
Longitude,16757.0,-91.826873,22.244502,-135.05684,-120.23766,-79.383186,-74.596184,-52.712578
Income,16737.0,37749.877696,30370.336552,0.0,0.0,34148.0,62396.0,99981.0
Customer Lifetime Value,16737.0,7988.896536,6860.98228,1898.01,3980.84,5780.18,8940.58,83325.38
DaysSinceEnrolment,16757.0,2551.297905,718.675331,1371.0,1910.0,2525.0,3178.0,3900.0
DaysSinceCancellation,2286.0,2117.861767,500.382633,1371.0,1690.0,2088.5,2437.0,3900.0
EnrollmentDurationInDays,2286.0,361.130796,568.705455,-1924.0,242.0,244.0,540.75,2148.0
DaysSinceEnrollment,16757.0,2551.297905,718.675331,1371.0,1910.0,2525.0,3178.0,3900.0
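
Two things in the summary above look worth a follow-up: `EnrollmentDurationInDays` has a minimum of -1924, and `DaysSinceEnrolment` shows the same statistics as `DaysSinceEnrollment`, which suggests a leftover column from an earlier run. The sketch below shows one possible way to check both on a hypothetical toy frame. It is not the notebook's own cleaning step.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DaysSinceEnrolment": [3001, 2420, 2000],   # misspelled duplicate
    "DaysSinceEnrollment": [3001, 2420, 2000],
    "EnrollmentDurationInDays": [1274.0, np.nan, -1924.0],
})

# Drop the misspelled column only if it really duplicates the correct one.
if "DaysSinceEnrolment" in toy and toy["DaysSinceEnrolment"].equals(
    toy["DaysSinceEnrollment"]
):
    toy = toy.drop(columns="DaysSinceEnrolment")

# A negative duration means cancellation is recorded before enrollment.
negative_duration = toy[toy["EnrollmentDurationInDays"] < 0]
print(negative_duration)
```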


## 2.4. Visual exploration


# 3. Data Preprocessing

# 4. Feature engineering