# Property comparison using clustering

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Specify figure layout
sns.set_context("talk", font_scale=1.5)

In [2]:
# load datasets
df_listings = pd.read_csv('../data/all_listings_cleaned_20210723.csv')
df_room_features = pd.read_csv('../data/room_features20210716.csv')

In [3]:
# Shape of both datasets
print(df_listings.shape)
print(df_room_features.shape)

(27679, 45)
(30227, 153)


In [5]:
# Merge both datasets
df_cluster = pd.merge(df_listings, df_room_features, on='listing_id', how='inner')
print(df_cluster.shape)
del df_cluster['Unnamed: 0']
df_cluster.head()

(27659, 197)


Unnamed: 0,listing_id,customer_id,state,contract_end,zip,region,subregion,holiday_region,property_type,subscription,...,underfloor_heating,vacuum_cleaner,walk-in_shower,wall_bed,wardrobe,wash_basin,water_bed,windbreak,window,cooking
0,97232bc1-cee6-54cc-9965-be13177051d3,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,Online,2022-05-01,182,Ostsee,Mecklenburgische Ostseeküste,Ostsee,holiday_apartment,active,...,0,1,0,0,1,1,0,0,0,1
1,b2e43b01-0a74-5270-95db-e4f649982e72,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,Online,2022-05-01,875,Allgäu,Oberallgäu,Oberallgäu,holiday_apartment,active,...,0,1,0,0,1,1,0,0,0,1
2,892f4fda-e0e8-5a9b-bbed-29f82942ff9a,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,Online,2022-05-01,237,Ostsee,Lübecker Bucht,Ostsee,holiday_apartment,active,...,0,1,0,0,1,1,0,0,0,1
3,e190ea0f-c688-5aa8-ae38-d0aeebdeec65,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,Online,2022-05-01,274,Nordsee,Cuxhaven & Umgebung,Nordsee,holiday_apartment,active,...,0,1,0,0,1,1,0,1,0,1
4,67e29a5f-1299-535c-ab06-d5c3ae750e9f,350d46c1-2a43-5053-a33c-40cd3a4c8b95,Online,2022-04-01,243,Ostsee,Geltinger Bucht,Ostsee,holiday_apartment,active,...,0,1,1,0,1,1,0,1,1,1


In [None]:
# Looking for all features
list(df_cluster.columns)

In [8]:
# Combine duplicated columns
df_cluster['dishwasher_x'] = np.where(df_cluster['dishwasher_x'] == 0, df_cluster['dishwasher_y'], df_cluster['dishwasher_x'])
df_cluster['dryer_x'] = np.where(df_cluster['dryer_x'] == 0, df_cluster['dryer_y'], df_cluster['dryer_x'])
df_cluster['terrace_x'] = np.where(df_cluster['terrace_x'] == 0, df_cluster['terrace_y'], df_cluster['terrace_x'])

# Delete duplicated columns
df_cluster.drop(['dishwasher_y', 'dryer_y', 'terrace_y'], axis=1, inplace=True)

# Rename columns
df_cluster.rename(columns={'dishwasher_x': 'dishwasher', 
                    'dryer_x': 'dryer', 'terrace_x': 'terrace',
                    'sun_umbrella_': 'sun_umbrella',
                    'colouring_book_/_pencils': 'colouring_book_pencils',
                    "child's_bed": 'childs_bed', 'awning_': 'awning',
                    'air_conditioning_': 'air_conditioning',
                    'CDs/_DVDs': 'CDs_DVDs', 'living_/_dining_room': 'living_dining_room',
                    'living_/_bedroom': 'living_bedroom', 'children`s_room': 'childrens_room',
                    'Library': 'library'}, inplace=True)

In [9]:
# Check whether any binary feature has 10 or fewer true values
np.any(df_cluster.loc[:,'option_allergic':].sum(axis=0) <= 10)

False

In [10]:
# Drop features that are unnecessary for clustering (property comparison)
df_cluster.drop(['state', 'contract_end', 'subscription', 'binding_inquiry'],axis=1, inplace=True)

In [11]:
df_cluster.shape

(27659, 189)

## Distance measures for mixed data: Gower’s dissimilarity

Clustering algorithms rely on a distance measure to decide whether two objects are similar, so a distance must be defined for every pair of objects before clustering can start. This becomes a problem when a data set contains mixed data, for instance numeric, binary, nominal and ordinal attributes. For example, how do you measure the similarity between a red car that weighs 1400 kg and a blue car that weighs 1200 kg? One solution is Gower's dissimilarity measure (GD), which computes a per-attribute dissimilarity between 0 and 1 (a range-scaled absolute difference for numeric attributes, a simple match/mismatch for categorical ones) and averages these over all attributes.
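To make the car example concrete, here is a minimal sketch of Gower's dissimilarity for two records. The function name `gower_distance` and the assumed weight range of 800 to 1800 kg are hypothetical and only serve the illustration; the `gower` package used in this notebook derives the ranges from the data itself.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower dissimilarity between two records given as dicts.

    numeric_ranges maps each numeric attribute to its (max - min) range
    over the whole data set; all other attributes are treated as categorical.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            # Numeric attribute: absolute difference scaled by the range
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            # Categorical attribute: 0 if equal, 1 otherwise
            total += 0.0 if a[key] == b[key] else 1.0
    # Average the per-attribute dissimilarities
    return total / len(a)


red_car = {"colour": "red", "weight_kg": 1400}
blue_car = {"colour": "blue", "weight_kg": 1200}

# Assumption for illustration: weights in the data set span 800..1800 kg
ranges = {"weight_kg": 1800 - 800}

d = gower_distance(red_car, blue_car, ranges)
print(d)
```

Under these assumptions the colour mismatch contributes 1, the weight difference contributes 200 / 1000 = 0.2, and the average is 0.6.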

In [13]:
# Calculate distance matrix
import gower
distance_matrix = gower.gower_matrix(df_cluster)

In [14]:
# Shape of distance matrix
distance_matrix.shape

(27659, 27659)

In [21]:
# Save distance matrix to a CSV file
from numpy import savetxt
savetxt('distance_matrix.csv', distance_matrix, delimiter=',')

# To reload the distance matrix from the CSV file later, uncomment:
# from numpy import loadtxt
# distance_matrix = loadtxt('distance_matrix.csv', delimiter=',')
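A (27659, 27659) matrix has about 765 million entries, so a text file is likely to be large and slow to parse. As a possible alternative, NumPy's binary `.npy` format keeps dtype and shape and reloads without text parsing. The sketch below uses a small random stand-in matrix and a hypothetical file name in a temporary directory; it is not the notebook's actual data.

```python
import os
import tempfile

import numpy as np

# Small stand-in for the real (27659, 27659) distance matrix
demo_matrix = np.random.default_rng(0).random((5, 5)).astype(np.float32)

# Hypothetical file name, written to a temporary directory for the demo
path = os.path.join(tempfile.mkdtemp(), "distance_matrix.npy")

np.save(path, demo_matrix)   # binary format, preserves dtype and shape
restored = np.load(path)     # round-trips without any text parsing

print(restored.shape, restored.dtype)
```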

## K-medoids and Partitioning Around Medoids

Now that we have a distance matrix, we need a clustering algorithm that can work directly with these distances. K-means is tied to Euclidean distance because it averages coordinates, which is not meaningful for mixed data. The Partitioning Around Medoids (PAM) algorithm only needs pairwise dissimilarities and therefore pairs naturally with the Gower distance. Like K-means, PAM is iterative, but each cluster is represented by a medoid instead of a centroid, and the procedure repeats until the medoids no longer change. The medoid of a cluster is the member with the smallest total dissimilarity to all other members, so it is always an actual observation from the data set.
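The idea can be sketched in a few lines of plain Python. Note that this is a simplified alternating variant (assign points, then re-pick medoids), not the full PAM swap search, and the function name `k_medoids` and the toy data are assumptions made for illustration.

```python
def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Simplified k-medoids on a precomputed square distance matrix.

    dist is a list of lists; initial_medoids is a list of row indices.
    Returns (medoids, labels).
    """
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if the cluster lost all its points
                new_medoids.append(old)
                continue
            # New medoid: member with the smallest total distance to the rest
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break  # medoids stopped moving
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Toy data: six points on a line, forming two obvious groups
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]

medoids, labels = k_medoids(dist, initial_medoids=[0, 1])
print(medoids, labels)
```

On this toy input the medoids settle on the middle point of each group (indices 1 and 4), even though both starting medoids lie in the first group.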

### Silhouette Width to select the optimal number of clusters

The silhouette width is a popular criterion for selecting the number of clusters. For each point, it compares the average dissimilarity to the other members of its own cluster with the average dissimilarity to the members of the closest neighboring cluster. The value ranges from -1 to 1, where a higher value means that points fit their own cluster better than the neighboring one. We therefore calculate the average silhouette width for a range of cluster numbers and choose the number where it is maximized.
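As a rough sketch of how the metric is computed from a precomputed distance matrix, the following plain-Python function averages s = (b - a) / max(a, b) over all points, where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster. The function name and toy data are illustrative assumptions, and at least two clusters are required; for real data, scikit-learn's `silhouette_score` accepts `metric="precomputed"`.

```python
def silhouette_width(dist, labels):
    """Mean silhouette width from a precomputed distance matrix.

    Assumes at least two distinct cluster labels.
    """
    n = len(labels)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Toy data: six points on a line, forming two obvious groups
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]

good = [0, 0, 0, 1, 1, 1]   # matches the two groups
bad = [0, 0, 1, 1, 1, 1]    # puts the point at 2 into the wrong group

s_good = silhouette_width(dist, good)
s_bad = silhouette_width(dist, bad)
print(s_good, s_bad)
```

The labeling that matches the two groups scores higher, which is the comparison one would repeat across candidate numbers of clusters.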

Kmedoids

## DBSCAN

The main idea of the DBSCAN algorithm is to locate regions of high density that are separated from one another by regions of low density.

Density at a point P: the number of points within distance Eps (ϵ) of P.
Dense region: a region whose core points each have at least a minimum number of points (MinPts) within distance ϵ.
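These two definitions can be checked directly on a distance matrix. The sketch below counts, for every point, how many points lie within ϵ and reports those that reach MinPts. The helper name `core_points` and the toy data are assumptions for illustration; the point itself is counted, which matches how scikit-learn documents `min_samples`.

```python
def core_points(dist, eps, min_pts):
    """Indices of points whose eps-neighborhood holds at least min_pts points.

    The point itself is included in the count.
    """
    cores = []
    for i, row in enumerate(dist):
        # Density at point i: number of points within distance eps
        density = sum(1 for d in row if d <= eps)
        if density >= min_pts:
            cores.append(i)
    return cores


# Toy data: a dense group of three, one isolated point, and a pair
points = [0, 1, 2, 50, 90, 91]
dist = [[abs(p - q) for q in points] for p in points]

cores = core_points(dist, eps=2, min_pts=3)
print(cores)
```

Only the three points in the dense group qualify here; the isolated point and the pair fall short of MinPts.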

In [141]:
from sklearn.cluster import DBSCAN

# Configuring the parameters of the clustering algorithm
dbscan_cluster = DBSCAN(eps=0.01, 
                        min_samples=100, 
                        metric="precomputed")

# Fitting the clustering algorithm
dbscan_cluster.fit(distance_matrix)

# Adding the results to a new column in the dataframe
## first try: column cluster --> eps=0.3; min_sample=5
df_cluster["cluster_001_100"] = dbscan_cluster.labels_

# Show head of new dataset
df_cluster.head()

# Export new CSV
df_cluster.to_csv('../data/clustering_20210723.csv')

Unnamed: 0,listing_id,customer_id,zip,region,subregion,holiday_region,property_type,option_allergic,option_non_smoking_only,option_holiday_with_your_pet,...,wash_basin,water_bed,windbreak,window,cooking,cluster_01_5,cluster_01_4,cluster_01_6,cluster_01_3,cluster_001_100
0,97232bc1-cee6-54cc-9965-be13177051d3,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,182,Ostsee,Mecklenburgische Ostseeküste,Ostsee,holiday_apartment,1,1,1,...,1,0,0,0,1,-1,-1,-1,-1,-1
1,b2e43b01-0a74-5270-95db-e4f649982e72,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,875,Allgäu,Oberallgäu,Oberallgäu,holiday_apartment,1,1,1,...,1,0,0,0,1,-1,-1,-1,-1,-1
2,892f4fda-e0e8-5a9b-bbed-29f82942ff9a,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,237,Ostsee,Lübecker Bucht,Ostsee,holiday_apartment,1,1,1,...,1,0,0,0,1,-1,-1,-1,-1,-1
3,e190ea0f-c688-5aa8-ae38-d0aeebdeec65,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,274,Nordsee,Cuxhaven & Umgebung,Nordsee,holiday_apartment,1,1,1,...,1,0,1,0,1,-1,-1,-1,-1,-1
4,67e29a5f-1299-535c-ab06-d5c3ae750e9f,350d46c1-2a43-5053-a33c-40cd3a4c8b95,243,Ostsee,Geltinger Bucht,Ostsee,holiday_apartment,1,1,0,...,1,0,1,1,1,-1,-1,-1,-1,-1


In [48]:
print(df_cluster.cluster_01_5.nunique())
print(df_cluster.cluster_01_5.unique())

93
[-1  0  7 72  1  2  3  4 53  5 73  6 40 63  9  8 45 25 10 11 12 38 24 13
 14 15 16 17 91 52 18 19 20 21 22 23 26 28 27 29 30 31 49 37 32 33 35 34
 36 39 55 67 41 42 43 44 46 65 47 48 50 51 54 56 64 57 58 59 60 61 62 89
 82 85 80 66 68 69 70 71 75 74 84 76 77 78 79 81 90 83 86 87 88]


In [137]:
print(df_cluster.cluster_01_4.nunique())
print(df_cluster.cluster_01_4.unique())

191
[ -1   0 141  10   1 113   2   3   4 182 186   5  18   6  31  14 181   7
   8   9  11  12 110  13 152  15  16  17  19 109 121  20  21  22  91  52
  23  24  25  26  27  77  28  29  30  32 148 124  33  34  35  36  37  38
  39 185  40 140  41 188  42  43  44  74  45  46  47 176 102  48  79  49
  50  51  53  54  55  56  57  58 143 129  59  60  61  62  63 117 173  64
 169  65  66  67  68  69  70  71  72  73  75  76  78  80  81  82  83  84
  85  86  87  88  89  90  92  93  94  95  96  97  98  99 100 101 130 103
 104 105 106 107 157 108 183 111 112 135 114 146 133 115 116 127 118 119
 120 184 122 123 125 126 128 131 132 134 136 187 137 138 139 142 171 144
 155 145 147 149 150 151 159 168 153 154 172 156 162 158 160 161 163 164
 165 166 167 178 170 174 175 177 179 180 189]


In [55]:
print(df_cluster.cluster_01_6.nunique())
print(df_cluster.cluster_01_6.unique())

53
[-1  0  4 39 16 40  1  2  3 38 20  5 19  6 49  7 51  8  9 10 11 12 13 14
 15 27 17 18 35 32 21 22 23 24 33 25 26 28 29 30 31 43 34 36 37 41 45 42
 50 44 46 47 48]


In [68]:
print(df_cluster.cluster_01_3.nunique())
print(df_cluster.cluster_01_3.unique())

413
[ -1   0  40 303   1 233   2 410 384   3   4   5   6   7   8   9  10 394
  11  12  13 403  14  15  48  16  72  17  19  18  20  21  22  23  24  25
  26  27  28  29  30  31  32  33  34  35 300  36  37  38  39 358  41  42
 377 109  43 329  44  45  46  47 163  50  49  51 328  52  53  54  55  56
 123  57 409  58 122  59  60  61  62  63  64  65  66  67  68  69  96  70
  71  73  74  75  89  76  77  78  79 194  80  81  82  83  84  85  86  87
 396 130  88  90  91  92  93  94  95  97  98  99 100 101 102 103 291 104
 105 106 107 262 352 108 110 111 112 113 114 115 190 387 116 117 118 136
 230 119 120 121 355 124 125 126 127 276 128 129 241 131 132 133 134 135
 137 375 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
 154 235 155 269 156 157 158 159 160 161 162 164 165 166 167 168 169 170
 171 172 173 174 175 200 176 177 340 178 179 180 181 182 183 184 185 186
 187 188 189 191 192 357 193 195 196 197 198 199 279 201 202 315 203 204
 205 206 207 317 208 209 210 211 212 213 214 21

In [145]:
df_cluster.cluster_01_5.value_counts()

-1     16460
 0     10587
 7        69
 4        14
 26       12
       ...  
 72        3
 17        3
 47        3
 75        2
 36        2
Name: cluster_01_5, Length: 93, dtype: int64
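When reading these counts, keep in mind that DBSCAN uses the label -1 for noise, so the 93 unique labels above correspond to 92 clusters plus noise, and 16460 of 27659 listings (about 59.5%) were not assigned to any cluster. A small helper along the following lines could summarize a label column; the function name and the demo labels are hypothetical.

```python
from collections import Counter


def summarize_labels(labels):
    """Number of real clusters and share of noise points in DBSCAN labels."""
    counts = Counter(labels)
    noise = counts.get(-1, 0)
    # The noise label -1 is not a cluster, so exclude it from the count
    n_clusters = len(counts) - (1 if -1 in counts else 0)
    return {"n_clusters": n_clusters, "noise_share": noise / len(labels)}


# Made-up labels for illustration only
demo = [-1, -1, -1, 0, 0, 1, 1, 1, -1, 0]
summary = summarize_labels(demo)
print(summary)
```

In the notebook this could be applied to a column such as `df_cluster["cluster_01_5"].tolist()` to compare the different parameter settings.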

In [147]:
df_cluster.head()

Unnamed: 0,listing_id,customer_id,zip,region,subregion,holiday_region,property_type,option_allergic,option_non_smoking_only,option_holiday_with_your_pet,...,wardrobe,wash_basin,water_bed,windbreak,window,cooking,cluster_01_5,cluster_01_4,cluster_01_6,cluster_01_3
0,97232bc1-cee6-54cc-9965-be13177051d3,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,182,Ostsee,Mecklenburgische Ostseeküste,Ostsee,holiday_apartment,1,1,1,...,1,1,0,0,0,1,-1,-1,-1,-1
1,b2e43b01-0a74-5270-95db-e4f649982e72,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,875,Allgäu,Oberallgäu,Oberallgäu,holiday_apartment,1,1,1,...,1,1,0,0,0,1,-1,-1,-1,-1
2,892f4fda-e0e8-5a9b-bbed-29f82942ff9a,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,237,Ostsee,Lübecker Bucht,Ostsee,holiday_apartment,1,1,1,...,1,1,0,0,0,1,-1,-1,-1,-1
3,e190ea0f-c688-5aa8-ae38-d0aeebdeec65,6e5e6ab0-34d3-5662-9259-7ae7eb021acb,274,Nordsee,Cuxhaven & Umgebung,Nordsee,holiday_apartment,1,1,1,...,1,1,0,1,0,1,-1,-1,-1,-1
4,67e29a5f-1299-535c-ab06-d5c3ae750e9f,350d46c1-2a43-5053-a33c-40cd3a4c8b95,243,Ostsee,Geltinger Bucht,Ostsee,holiday_apartment,1,1,0,...,1,1,0,1,1,1,-1,-1,-1,-1
