# Customer Segmentation in a Lisbon Hotel Chain

The objective is to explore the historical customer data of a four-star hotel in Lisbon in order to segment its customers and identify the distinguishing features of each segment.
  
This should give marketing a better understanding of the customer groups so that it can engage with customers more effectively. The information may affect several areas of customer interaction, e.g.:

- Marketing: channels, timing, reinforcement points, selling points, ...
- Sales: pricing, customer value, ...
- Reception: types of interaction, ...



## Dataset Description

Talk about the variables


## Setup and Import

In [1]:
import os
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import category_encoders as ce
import collections
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

## Data Loading and Initial Analysis

In [6]:
ds = pd.read_csv(r'data\dataset.csv', sep=";")
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111733 entries, 0 to 111732
Data columns (total 29 columns):
ID                      111733 non-null int64
Nationality             111733 non-null object
Age                     107561 non-null float64
DaysSinceCreation       111733 non-null int64
NameHash                111733 non-null object
DocIDHash               110732 non-null object
AverageLeadTime         111733 non-null int64
LodgingRevenue          111733 non-null float64
OtherRevenue            111733 non-null float64
BookingsCanceled        111733 non-null int64
BookingsNoShowed        111733 non-null int64
BookingsCheckedIn       111733 non-null int64
PersonsNights           111733 non-null int64
RoomNights              111733 non-null int64
DistributionChannel     111733 non-null object
MarketSegment           111733 non-null object
SRHighFloor             111733 non-null int64
SRLowFloor              111733 non-null int64
SRAccessibleRoom        111733 non-null int64
SRMe

In [9]:
# Display top 10 rows transposed to show all columns
ds.head(10).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
ID,1,2,3,4,5,6,7,8,9,10
Nationality,PRT,PRT,DEU,FRA,FRA,JPN,JPN,FRA,FRA,IRL
Age,52,,32,61,52,55,50,33,43,26
DaysSinceCreation,440,1385,1385,1385,1385,1385,1385,1385,1385,1385
NameHash,0x2C371FD6CE12936774A139FD7430C624F1C4D5109CE6...,0x198CDB98BF37B6E23F9548C56A88B00912D65A9AA0D6...,0xDA46E62F66936284DF2844EC4FC542D0DAD780C0EE0C...,0xC45D4CD22C58FDC5FD0F95315F6EFA5A6E7149187D49...,0xD2E3D5BFCA141865669F98D64CDA85AD04DEFF47F8A0...,0xA3CF1A4692BE0A17CFD3BFD9C07653556BDADF5F4BE7...,0x94DB830C90A6DA2331968CFC9448AB9A3CE07D7CFEDD...,0x165B609162C92BF563E96DB03539363F07E784C219A8...,0x44BB41EF2D87698E75B6FBB77A8815BF48DAA912C140...,0x9BEECEE0C18B0957C7424443643948E99A0EC8326EF9...
DocIDHash,0x434FD3D59469C73AFEA087017FAF8CA2296493AEABDE...,0xE3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B...,0x27F5DF762CCDA622C752CCDA45794923BED9F1B66300...,0x8E59572913BB9B1E6CAA12FA2C8B7BF387B1D1F3432E...,0x42BDEE0E05A9441C94147076EDDCC47E604DA5447DD4...,0x506065FBCE220DCEA4465C7310A84F04165BCB5906DC...,0x47E5E4B21585F1FD956C768E730604241B380EDFEA68...,0x6BB66BA80C726B9967988A889D83699B609D11C65AD7...,0x6C456E45A78A20BC794137AE326A81D587B6528B3944...,0x199C61A5442D08987001E170B74D244DF6AF1FC9AE92...
AverageLeadTime,59,61,0,93,0,58,0,38,0,96
LodgingRevenue,292,280,0,240,0,230,0,535,0,174
OtherRevenue,82.3,53,0,60,0,24,0,94,0,69
BookingsCanceled,1,0,0,0,0,0,0,0,0,0


In [12]:
# Summary statistics for numerical variables
summary=ds.describe(exclude=[object])   # exclude only object columns; they are treated next (np.object was removed in NumPy 1.24)
summary=summary.transpose()  # transpose the summary for easier reading
summary.head(len(summary))

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
ID,111733,,,,55867.0,32254.7,1.0,27934.0,55867.0,83800.0,111733.0
Nationality,111733,199.0,FRA,16516.0,,,,,,,
Age,107561,,,,45.6392,17.245,-10.0,33.0,47.0,58.0,123.0
DaysSinceCreation,111733,,,,595.027,374.657,36.0,288.0,522.0,889.0,1385.0
NameHash,111733,107584.0,0x15A713CE687991691A18F6CDC56ABE24979C73CF5D51...,75.0,,,,,,,
DocIDHash,110732,103480.0,0xE3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B...,3032.0,,,,,,,
AverageLeadTime,111733,,,,60.8331,85.1153,-1.0,0.0,21.0,95.0,588.0
LodgingRevenue,111733,,,,283.851,379.132,0.0,0.0,208.0,393.3,21781.0
OtherRevenue,111733,,,,64.6828,123.581,0.0,0.0,31.0,84.0,8859.25
BookingsCanceled,111733,,,,0.00228223,0.0806315,0.0,0.0,0.0,0.0,15.0


In [13]:
# Summary statistics for non numerical variables
summary=ds.describe(include=[object],percentiles=None)   # np.object was removed in NumPy 1.24; the builtin object is equivalent
summary=summary.transpose()  
summary.head(len(summary))

Unnamed: 0,count,unique,top,freq
Nationality,111733,199,FRA,16516
NameHash,111733,107584,0x15A713CE687991691A18F6CDC56ABE24979C73CF5D51...,75
DocIDHash,110732,103480,0xE3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B...,3032
DistributionChannel,111733,4,Travel Agent/Operator,91019
MarketSegment,111733,7,Other,63680


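The dataset covers 199 distinct nationalities.

Both `NameHash` and `DocIDHash` contain duplicates: the 111,733 rows hold only 107,584 distinct `NameHash` values, and the 110,732 non-null `DocIDHash` entries hold only 103,480 distinct values. The most frequent `DocIDHash` appears 3,032 times; its visible prefix matches the SHA-256 digest of an empty string, so it may stand for a missing document ID rather than a single customer. These records will need to be handled before segmentation.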
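
A check for repeated `NameHash` and `DocIDHash` values could be sketched as follows. This is only an illustration, not the treatment applied later in the notebook. The `toy` frame and its hash values are invented stand-ins for `ds`, and `duplicate_report` is a hypothetical helper. Treating rows that share both hashes as the same customer is an assumption.

```python
import pandas as pd

# Toy stand-in for `ds`; the hash values are invented for illustration only.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "NameHash": ["0xA1", "0xB2", "0xA1", "0xC3", "0xD4"],
    "DocIDHash": ["0x11", "0x22", "0x11", "0x22", None],
})

def duplicate_report(df, column):
    """Summarise how many non-null values in `column` occur more than once."""
    counts = df[column].dropna().value_counts()
    repeated = counts[counts > 1]
    return {
        "unique_values": int(counts.size),       # distinct non-null values
        "repeated_values": int(repeated.size),   # values seen on 2+ rows
        "rows_involved": int(repeated.sum()),    # rows carrying a repeated value
    }

name_report = duplicate_report(toy, "NameHash")
doc_report = duplicate_report(toy, "DocIDHash")

# Assumption: rows sharing BOTH hashes most likely describe the same customer.
# Rows with a missing hash are dropped first so that nulls never match each other.
complete = toy.dropna(subset=["NameHash", "DocIDHash"])
same_customer = complete[
    complete.duplicated(subset=["NameHash", "DocIDHash"], keep=False)
]

print(name_report)
print(doc_report)
print(same_customer)
```

On the real data, the same helper could be called as `duplicate_report(ds, "DocIDHash")`. Inspecting the most frequent values with `ds["DocIDHash"].value_counts().head()` would show whether a few placeholder hashes account for most of the repeats.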