# Customer Segmentation in a Lisbon Hotel Chain

The objective is to explore the historical customer information of a four-star hotel in Lisbon in order to segment customers and discover the distinguishing features of each group.
  
This should give marketing a better understanding of the customer groups, so that the hotel can engage with each of them more effectively. This information may affect several areas of customer interaction, e.g.:
- Marketing: channels, timings, reinforcement points, selling points, ...
- Sales: pricing, customer value, ...
- Reception: types of interaction, ...



## DataSet Description

Talk about the variables


## Setup and Import

In [20]:
import os
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# import category_encoders as ce
import collections
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

## Data Loading and Initial Analysis

In [21]:
# Build the path with os.path.join so it also works outside Windows
ds = pd.read_csv(os.path.join("data", "dataset.csv"), sep=";")
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111733 entries, 0 to 111732
Data columns (total 29 columns):
ID                      111733 non-null int64
Nationality             111733 non-null object
Age                     107561 non-null float64
DaysSinceCreation       111733 non-null int64
NameHash                111733 non-null object
DocIDHash               110732 non-null object
AverageLeadTime         111733 non-null int64
LodgingRevenue          111733 non-null float64
OtherRevenue            111733 non-null float64
BookingsCanceled        111733 non-null int64
BookingsNoShowed        111733 non-null int64
BookingsCheckedIn       111733 non-null int64
PersonsNights           111733 non-null int64
RoomNights              111733 non-null int64
DistributionChannel     111733 non-null object
MarketSegment           111733 non-null object
SRHighFloor             111733 non-null int64
SRLowFloor              111733 non-null int64
SRAccessibleRoom        111733 non-null int64
SRMe

In [22]:
# Display top 10 rows transposed to show all columns
ds.head(10).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
ID,1,2,3,4,5,6,7,8,9,10
Nationality,PRT,PRT,DEU,FRA,FRA,JPN,JPN,FRA,FRA,IRL
Age,52,,32,61,52,55,50,33,43,26
DaysSinceCreation,440,1385,1385,1385,1385,1385,1385,1385,1385,1385
NameHash,0x2C371FD6CE12936774A139FD7430C624F1C4D5109CE6...,0x198CDB98BF37B6E23F9548C56A88B00912D65A9AA0D6...,0xDA46E62F66936284DF2844EC4FC542D0DAD780C0EE0C...,0xC45D4CD22C58FDC5FD0F95315F6EFA5A6E7149187D49...,0xD2E3D5BFCA141865669F98D64CDA85AD04DEFF47F8A0...,0xA3CF1A4692BE0A17CFD3BFD9C07653556BDADF5F4BE7...,0x94DB830C90A6DA2331968CFC9448AB9A3CE07D7CFEDD...,0x165B609162C92BF563E96DB03539363F07E784C219A8...,0x44BB41EF2D87698E75B6FBB77A8815BF48DAA912C140...,0x9BEECEE0C18B0957C7424443643948E99A0EC8326EF9...
DocIDHash,0x434FD3D59469C73AFEA087017FAF8CA2296493AEABDE...,0xE3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B...,0x27F5DF762CCDA622C752CCDA45794923BED9F1B66300...,0x8E59572913BB9B1E6CAA12FA2C8B7BF387B1D1F3432E...,0x42BDEE0E05A9441C94147076EDDCC47E604DA5447DD4...,0x506065FBCE220DCEA4465C7310A84F04165BCB5906DC...,0x47E5E4B21585F1FD956C768E730604241B380EDFEA68...,0x6BB66BA80C726B9967988A889D83699B609D11C65AD7...,0x6C456E45A78A20BC794137AE326A81D587B6528B3944...,0x199C61A5442D08987001E170B74D244DF6AF1FC9AE92...
AverageLeadTime,59,61,0,93,0,58,0,38,0,96
LodgingRevenue,292,280,0,240,0,230,0,535,0,174
OtherRevenue,82.3,53,0,60,0,24,0,94,0,69
BookingsCanceled,1,0,0,0,0,0,0,0,0,0


In [23]:
# Summary statistics for numerical variables
# np.object was removed in NumPy 1.24; the builtin `object` selects the same columns
summary=ds.describe(exclude=[object])   # exclude only object columns, they are treated next
summary=summary.transpose()  # transpose the summary for easier reading
summary.head(len(summary))

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,111733.0,55867.0,32254.683151,1.0,27934.0,55867.0,83800.0,111733.0
Age,107561.0,45.639191,17.244952,-10.0,33.0,47.0,58.0,123.0
DaysSinceCreation,111733.0,595.026599,374.657382,36.0,288.0,522.0,889.0,1385.0
AverageLeadTime,111733.0,60.833147,85.11532,-1.0,0.0,21.0,95.0,588.0
LodgingRevenue,111733.0,283.851283,379.131556,0.0,0.0,208.0,393.3,21781.0
OtherRevenue,111733.0,64.682802,123.580715,0.0,0.0,31.0,84.0,8859.25
BookingsCanceled,111733.0,0.002282,0.080631,0.0,0.0,0.0,0.0,15.0
BookingsNoShowed,111733.0,0.0006,0.028217,0.0,0.0,0.0,0.0,3.0
BookingsCheckedIn,111733.0,0.737607,0.730889,0.0,0.0,1.0,1.0,76.0
PersonsNights,111733.0,4.328318,4.630739,0.0,0.0,4.0,6.0,116.0


In [24]:
# Summary statistics for non numerical variables
summary=ds.describe(include=[object],percentiles=None)   # builtin `object` replaces the removed np.object
summary=summary.transpose()  
summary.head(len(summary))

Unnamed: 0,count,unique,top,freq
Nationality,111733,199,FRA,16516
NameHash,111733,107584,0x15A713CE687991691A18F6CDC56ABE24979C73CF5D51...,75
DocIDHash,110732,103480,0xE3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B...,3032
DistributionChannel,111733,4,Travel Agent/Operator,91019
MarketSegment,111733,7,Other,63680


With 199 distinct values, Nationality probably has data-quality problems.
NameHash and DocIDHash both appear to contain duplicates, so these records may need special handling.
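
To put a number on the duplicates, something along these lines could be run against `ds`. This is only a sketch: the `toy` frame and the `duplicate_report` helper are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "NameHash": ["0xAA", "0xBB", "0xAA", "0xCC", "0xDD"],
    "DocIDHash": ["0x01", "0x02", "0x01", "0x02", None],
})

def duplicate_report(df, column):
    """Return (rows sharing a value with another row, number of repeated values).

    Missing values are ignored, so they are never counted as duplicates.
    """
    values = df[column].dropna()
    repeated = values[values.duplicated(keep=False)]
    return len(repeated), repeated.nunique()

name_rows, name_values = duplicate_report(toy, "NameHash")
doc_rows, doc_values = duplicate_report(toy, "DocIDHash")
print(f"NameHash: {name_rows} rows share {name_values} repeated value(s)")
print(f"DocIDHash: {doc_rows} rows share {doc_values} repeated value(s)")
```

Calling `duplicate_report(ds, "DocIDHash")` on the real data would show how many records are affected before deciding how to treat them.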

In [25]:
nationalities=list(set(ds.Nationality))
print(nationalities)
len(nationalities)

['KOR', 'TGO', 'ECU', 'BHS', 'AIA', 'COM', 'BLR', 'UGA', 'SEN', 'HRV', 'TZA', 'DNK', 'RWA', 'THA', 'AGO', 'PNG', 'ALB', 'SLE', 'DOM', 'CPV', 'CMR', 'NCL', 'PAN', 'CRI', 'KNA', 'TJK', 'JPN', 'DEU', 'DMA', 'ABW', 'LTU', 'VIR', 'EST', 'BEL', 'SOM', 'MAR', 'ASM', 'COK', 'WSM', 'VEN', 'VNM', 'GHA', 'HTI', 'CHN', 'KAZ', 'BMU', 'TMP', 'FIN', 'GIB', 'ISL', 'SUR', 'SWZ', 'CYP', 'ESP', 'UKR', 'RUS', 'KEN', 'GNQ', 'CZE', 'SPM', 'IRL', 'GEO', 'TON', 'BFA', 'AFG', 'NLD', 'MKD', 'GBR', 'ITA', 'TUR', 'PAK', 'DZA', 'MOZ', 'IOT', 'CYM', 'MUS', 'USA', 'IRQ', 'ISR', 'MNE', 'TCD', 'ARG', 'CAN', 'ETH', 'BIH', 'BGD', 'IRN', 'BGR', 'GIN', 'MDV', 'ERI', 'BEN', 'MLI', 'FJI', 'NRU', 'BHR', 'AUS', 'QAT', 'LKA', 'PRY', 'ATA', 'SYC', 'ROU', 'GRC', 'PRI', 'URY', 'LBY', 'MMR', 'YEM', 'VCT', 'FRO', 'JOR', 'MDG', 'KWT', 'GUF', 'PCN', 'NOR', 'PYF', 'LCA', 'LUX', 'GRD', 'TTO', 'ATG', 'FSM', 'MWI', 'GTM', 'KGZ', 'PER', 'HUN', 'LVA', 'SDN', 'NAM', 'BDI', 'SMR', 'BRB', 'SWE', 'LBN', 'ZWE', 'POL', 'FLK', 'PHL', 'EGY', 'NZL'

199

In [26]:
#Check the document hashes
print(ds["DocIDHash"].value_counts(sort=True).head(60))

0xE3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855    3032
0xA486FBACF4B4E5537B026743E3FDFE571D716839E758236F42950A61FE6B922B      31
0x2B17E9D2CCEF2EA0FE752EE345BEDFB06741FFC8ECECF45D6BBDBAF9A274FF52      24
0x469CF1F9CF8C790FFA5AD3F484F2938CBEFF6435BCFD734F687EC6D1E968F076      15
0x2A14D03A4827C67E0D39408F103DB417AD496DCE6158F8309E6281185C042003      14
0x9220D336F2DDD7B68F5066878889C7637EE28924B249F968F5EC82D895B108A7      12
0x3856085146F7BC27BD07BFC4CA1991ED4E65E179D7BDB7DBBA7E32620809C799      12
0xD2DBD6039916F6DB10C6564D8EB9A9116811435965D7D00E7DA292066B3ECE91      11
0x1BF60C4718497A0AB8B46FF00708D3250A484DDA0FDC0248999C782807195BCB      11
0x8FA8EB6D044E4F2C691C2091FAB27B92FEFE22122F419975703C3D5BA76AC4A2      10
0xA89022F442F23A6D7486C47C9F968BF35898B36F0EB3531804CA4613FF33DC45      10
0x10DFBA7DC4CFBBC6403B380AE137098B254DBBCBE5DEEB0C3B240E0F12F0C6D4      10
0x6B421376B94F3D1722979458A96DF486DEA0F9290CC05E9699F2762FD0DDA71D      10
0x1B16B1DF538BA12DC3F97ED

One document hash recurs far too often to belong to a single person. What should be done with the other repeated hashes?

In [33]:
# Select every record that carries the most frequent document hash
most_common_doc = ds["DocIDHash"].mode()[0]   # pandas mode ignores missing values
doc_explore = ds.loc[ds["DocIDHash"] == most_common_doc]

summary = doc_explore.describe(include="all")
summary = summary.transpose()
summary.head(len(summary))

#doc_explore.DistributionChannel.value_counts()
#doc_explore.MarketSegment.value_counts()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
ID,3032,,,,27711.0,24608.2,2.0,8240.25,20409.5,41602.8,111553.0
Nationality,3032,3.0,PRT,3030.0,,,,,,,
Age,171,,,,39.924,17.1889,2.0,29.0,39.0,53.5,83.0
DaysSinceCreation,3032,,,,939.059,330.527,38.0,690.75,1022.0,1219.0,1385.0
NameHash,3032,2826.0,0x5175AC9E84362C505AED3E76F20320BE69DD1C21AA67...,10.0,,,,,,,
DocIDHash,3032,1.0,0xE3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B...,3032.0,,,,,,,
AverageLeadTime,3032,,,,66.7932,86.6466,-1.0,4.0,30.0,103.25,588.0
LodgingRevenue,3032,,,,279.394,308.564,0.0,101.565,188.0,358.075,4255.0
OtherRevenue,3032,,,,74.2252,156.058,0.0,14.0,41.0,84.0,5105.5
BookingsCanceled,3032,,,,0.00758575,0.0905016,0.0,0.0,0.0,0.0,2.0


This may be a placeholder for a document that was not presented, or some other situation: the records have different ages and 2826 different name hashes.
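
One check that may help here: the most frequent value is identical to the SHA-256 digest of an empty input. Whether the hotel's system actually hashes with SHA-256 is an assumption (the dataset does not say), but if it does, this hash would mean that no document number was recorded at all. A minimal sketch using only the standard library:

```python
import hashlib

# Most frequent DocIDHash, copied from the value_counts output of cell In [26].
most_common_hash = "0xE3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855"

# SHA-256 digest of zero bytes, formatted like the hashes in the dataset.
# ASSUMPTION: the dataset uses SHA-256; this is inferred, not documented.
empty_digest = "0x" + hashlib.sha256(b"").hexdigest().upper()

is_empty_input_hash = most_common_hash == empty_digest
print(is_empty_input_hash)
```

If the comparison holds, treating these records as "document missing" rather than as one customer seems the safer reading.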

In [34]:
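# Check distribution channel and market segment
# Selecting several columns needs a list inside the brackets (double brackets);
# a bare tuple is looked up as a single column label and raises KeyError
doc_explore[["DistributionChannel", "MarketSegment"]].apply(pd.Series.value_counts)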

Nonetheless, we must consider which variables are only available after arrival: those cannot be used in the model, otherwise predictions could not be made before arrival.
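
One way to keep this separation explicit is to list the post-arrival columns once and drop them before modelling. The sketch below is illustrative only: `POST_ARRIVAL_COLUMNS`, `pre_arrival_view` and the toy frame are hypothetical names, and the list contains just `DocIDHash` because it is the only field the notes tie to arrival. The real list would need confirming with the hotel.

```python
import pandas as pd

# ASSUMPTION: hypothetical list of fields captured only at check-in.
POST_ARRIVAL_COLUMNS = ["DocIDHash"]

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "AverageLeadTime": [59, 61, 0],
    "DistributionChannel": ["Direct", "Travel Agent/Operator", "Direct"],
    "DocIDHash": ["0x01", None, "0x02"],
})

def pre_arrival_view(df, post_arrival=POST_ARRIVAL_COLUMNS):
    """Return a copy of df without the columns that only exist after arrival."""
    to_drop = [col for col in post_arrival if col in df.columns]
    return df.drop(columns=to_drop)

pre_arrival = pre_arrival_view(toy)
print(list(pre_arrival.columns))
```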

And documents are presented on arrival, but... 

In [18]:
doc_explore=ds.loc[ds['DocIDHash'] == "0xA486FBACF4B4E5537B026743E3FDFE571D716839E758236F42950A61FE6B922B"]

summary=doc_explore.describe(include="all")
summary=summary.transpose()
summary.head(len(summary))

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
ID,31,,,,51674.5,21078.0,12953.0,37829.5,48009.0,63716.5,98732.0
Nationality,31,1.0,PRT,31.0,,,,,,,
Age,31,,,,51.0,0.0,51.0,51.0,51.0,51.0,51.0
DaysSinceCreation,31,,,,602.645,241.352,142.0,460.0,588.0,745.5,1147.0
NameHash,31,9.0,0x8DF2AF984365949E7F4EAB2EBA9BF9CA8DF106B5F2A9...,20.0,,,,,,,
DocIDHash,31,1.0,0xA486FBACF4B4E5537B026743E3FDFE571D716839E758...,31.0,,,,,,,
AverageLeadTime,31,,,,13.5161,13.1779,0.0,4.0,10.0,20.0,50.0
LodgingRevenue,31,,,,778.16,1385.32,59.0,118.0,295.0,650.5,6991.0
OtherRevenue,31,,,,114.056,197.622,7.0,17.0,44.0,112.5,957.0
BookingsCanceled,31,,,,0.290323,0.82436,0.0,0.0,0.0,0.0,4.0


For this second document hash, some values are equal across the different records (the nationality and the age), and the remaining values look more consistent.
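
The same consistency check could be applied to every repeated document hash at once instead of inspecting them one by one. A rough sketch, assuming that "consistent" means a single nationality and at most one distinct age per hash (that definition and the toy data are assumptions, not part of the original analysis):

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "DocIDHash": ["0x01", "0x01", "0x01", "0x02", "0x02", "0x03"],
    "Nationality": ["PRT", "PRT", "PRT", "PRT", "FRA", "DEU"],
    "Age": [51.0, 51.0, 51.0, 30.0, 62.0, 40.0],
    "NameHash": ["0xAA", "0xAB", "0xAA", "0xBA", "0xBB", "0xCA"],
})

# One row per document hash: record count and number of distinct values.
per_doc = toy.groupby("DocIDHash").agg(
    records=("Nationality", "size"),
    nationalities=("Nationality", "nunique"),
    ages=("Age", "nunique"),
    names=("NameHash", "nunique"),
)

# ASSUMPTION: one nationality and at most one distinct age suggests one person.
# nunique skips missing values, so an all-missing Age counts as 0 distinct ages.
per_doc["consistent"] = (per_doc["nationalities"] == 1) & (per_doc["ages"] <= 1)

# Keep only the hashes that occur more than once.
repeated = per_doc[per_doc["records"] > 1]
print(repeated)
```

On the real data, `toy` would be replaced by `ds`, ideally after setting aside the hash that occurs 3032 times.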

In [None]:
Questions: some columns are averages; what do the binary ones represent? The requests of the last booking? What should we do with the records that share the same document hash?