## Assignment: $k$ Means Clustering

## *Do two questions. - Did 1 and 2*

## **Jenny Schilling (xdj3kg)**


In [4]:
! git clone https://github.com/jennyschilling/kmc

fatal: destination path 'kmc' already exists and is not an empty directory.


In [16]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

**Q1.** This question is a case study for $k$ means clustering.

1. Load the `airbnb_hw.csv` data. Clean `Price` along with `Beds`, `Number of Reviews`, and `Review Scores Rating`.
2. Maxmin normalize the data and remove any `nan`'s (`KMeans` from `sklearn` doesn't accept `nan` input).
3. Use `sklearn`'s `KMeans` module to cluster the data by `Beds`, `Number of Reviews`, and `Review Scores Rating` for `k=6`.
4. Use `seaborn`'s `.pairplot()` to make a grid of scatterplots that show how the clustering is carried out in multiple dimensions.
5. Use `.groupby` and `.describe` to compute the average price for each cluster. Which clusters have the highest rental prices?
6. Use a scree plot to pick the number of clusters and repeat steps 4 and 5.

In [13]:
df = pd.read_csv('./kmc/data/airbnb_hw.csv')

# copy the columns of interest under shorter names
df['price'] = df['Price']
df['num_beds'] = df['Beds']
df['num_reviews'] = df['Number Of Reviews']
df['review_score'] = df['Review Scores Rating']

X = df.loc[:,['price', 'num_beds', 'num_reviews', 'review_score']]
print(X.shape)
X.describe()

# cleaning price
X['price'] = X['price'].str.replace(',', '').astype(int)
X.describe()

# cleaning beds
X['num_beds'] = X['num_beds'].fillna(1).astype(int)
X.describe()

# cleaning num of reviews
# not necessary

# cleaning review scores
null_reviews = df['review_score'].isnull()
has_reviews = df['review_score'] > 0
pd.crosstab(null_reviews, has_reviews)

# drop rows that still contain NaN, then check work
X = X.dropna()
X.describe()

(30478, 4)


| | price | num_beds | num_reviews | review_score |
|---|---|---|---|---|
| count | 22155.0 | 22155.0 | 22155.0 | 22155.0 |
| mean | 154.787633 | 1.556985 | 16.505439 | 91.99323 |
| std | 148.836621 | 1.043273 | 24.308241 | 8.850373 |
| min | 10.0 | 0.0 | 1.0 | 20.0 |
| 25% | 85.0 | 1.0 | 2.0 | 89.0 |
| 50% | 125.0 | 1.0 | 7.0 | 94.0 |
| 75% | 190.0 | 2.0 | 20.0 | 100.0 |
| max | 10000.0 | 16.0 | 257.0 | 100.0 |


In [15]:
# Maxmin normalize the features (nan's were already dropped above); price is left out of the clustering inputs
Z = X.drop('price',axis=1)
scaler = MinMaxScaler()
Z_scaled = scaler.fit_transform(Z)
Z.describe()
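
As a sanity check on this step, the sketch below uses a tiny hypothetical table (the values in `toy` are made up, not rows from `airbnb_hw.csv`) to show that `MinMaxScaler` computes $(x - \min)/(\max - \min)$ column by column. It also wraps the result in a DataFrame, since `fit_transform` returns a NumPy array without column names.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
toy = pd.DataFrame({
    "num_beds": [1, 2, 4],
    "num_reviews": [1, 11, 21],
    "review_score": [20, 60, 100],
})

# fit_transform returns a NumPy array, so restore the column names and index
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# The same transformation written out by hand
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
print(np.allclose(scaled.to_numpy(), manual.to_numpy()))
```

In the notebook, calling `.describe()` on a DataFrame built this way from `Z_scaled` would show every column running from 0 to 1, whereas `Z.describe()` still reports the original units.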

In [18]:
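# Use sklearn's KMeans module to cluster the data by Beds, Number of Reviews, and Review Scores Rating for k=6.
model = KMeans(n_clusters=6, max_iter=300, n_init = 10, random_state=0)
# NOTE: this fits on the unscaled Z, so Z_scaled from the previous cell goes unused;
# pass Z_scaled here to cluster on the maxmin-normalized features instead.
model = model.fit(Z)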
Z['cluster'] = model.labels_
Z.describe()

| | num_beds | num_reviews | review_score | cluster |
|---|---|---|---|---|
| count | 22155.0 | 22155.0 | 22155.0 | 22155.0 |
| mean | 1.556985 | 16.505439 | 91.99323 | 2.544708 |
| std | 1.043273 | 24.308241 | 8.850373 | 0.96416 |
| min | 0.0 | 1.0 | 20.0 | 0.0 |
| 25% | 1.0 | 2.0 | 89.0 | 2.0 |
| 50% | 1.0 | 7.0 | 94.0 | 3.0 |
| 75% | 2.0 | 20.0 | 100.0 | 3.0 |
| max | 16.0 | 257.0 | 100.0 | 5.0 |


In [None]:
# Use seaborn's .pairplot() to make a grid of scatterplots that show how the clustering is carried out in multiple dimensions.


In [None]:
# Use .groupby and .describe to compute the average price for each cluster. Which clusters have the highest rental prices?
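
This cell was also left empty. The pattern the question asks for can be sketched on a small hypothetical table: `toy` below holds made-up prices and cluster labels, standing in for `X['price']` and `model.labels_`.

```python
import pandas as pd

# Hypothetical prices and cluster labels, for illustration only
toy = pd.DataFrame({
    "price":   [100, 120, 300, 340, 80, 90],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# One row of summary statistics per cluster
summary = toy.groupby("cluster")["price"].describe()

# Sort so the most expensive cluster (by mean price) comes first
ranked = summary.sort_values("mean", ascending=False)
print(ranked[["count", "mean", "50%"]])
```

In the notebook, the same two lines should apply after `X['cluster'] = model.labels_`, since `X` and `Z` share an index. Comparing the median (`50%`) alongside the mean is worthwhile here because the maximum price of 10000 can pull a cluster's mean upward.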

In [None]:
# Use a scree plot to pick the number of clusters and repeat steps 4 and 5.
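
This cell was left empty as well. A scree plot tracks the within-cluster sum of squared errors, which `KMeans` exposes as `inertia_`, across candidate values of $k$. The sketch below is one way to build it; the helper `scree_values` and the synthetic `demo` array (three well-separated blobs) are hypothetical and only there to make the example runnable.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

def scree_values(data, k_max=10, seed=0):
    """Return the k-means inertia (SSE) for k = 1..k_max on min-max scaled data."""
    scaled = MinMaxScaler().fit_transform(data)
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled).inertia_
        for k in range(1, k_max + 1)
    ]

# Hypothetical stand-in data: three blobs in three dimensions
rng = np.random.default_rng(0)
centers = np.array([[1, 1, 1], [5, 5, 5], [9, 1, 5]])
demo = np.vstack([c + rng.normal(0, 0.3, size=(60, 3)) for c in centers])

sse = scree_values(demo, k_max=8)

# Scree plot: look for the k after which the curve flattens out
fig, ax = plt.subplots()
ax.plot(range(1, len(sse) + 1), sse, marker="o")
ax.set_xlabel("k")
ax.set_ylabel("SSE (inertia)")
ax.set_title("Scree plot")
```

On the Airbnb features, `scree_values(Z[['num_beds', 'num_reviews', 'review_score']])` would give the curve to inspect; the chosen $k$ can then be fed back into the pairplot and `.groupby` steps.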

**Q2.** This is a question about $k$ means clustering. We want to investigate how adjusting the "noisiness" of the data impacts the quality of the algorithm and the difficulty of picking $k$.

1. Run the code below, which creates five datasets: `df0_125`, `df0_25`, `df0_5`, `df1_0`, and `df2_0`. Each dataset is created by increasing the amount of `noise` (standard deviation) around the cluster centers, from `0.125` to `0.25` to `0.5` to `1.0` to `2.0`.

```
import numpy as np
import pandas as pd

def createData(noise,N=50):
    np.random.seed(100) # Set the seed for replicability
    # Generate (x1,x2,g) triples:
    X1 = np.array([np.random.normal(1,noise,N),np.random.normal(1,noise,N)])
    X2 = np.array([np.random.normal(3,noise,N),np.random.normal(2,noise,N)])
    X3 = np.array([np.random.normal(5,noise,N),np.random.normal(3,noise,N)])
    # Concatenate into one data frame
    gdf1 = pd.DataFrame({'x1':X1[0,:],'x2':X1[1,:],'group':'a'})
    gdf2 = pd.DataFrame({'x1':X2[0,:],'x2':X2[1,:],'group':'b'})
    gdf3 = pd.DataFrame({'x1':X3[0,:],'x2':X3[1,:],'group':'c'})
    df = pd.concat([gdf1,gdf2,gdf3],axis=0)
    return df

df0_125 = createData(0.125)
df0_25 = createData(0.25)
df0_5 = createData(0.5)
df1_0 = createData(1.0)
df2_0 = createData(2.0)
```

2. Make scatterplots of the `(x1, x2)` points by group for each of the datasets. As the `noise` goes up from 0.125 to 2.0, what happens to the visual distinctness of the clusters?
3. Create a scree plot for each of the datasets. Describe how the level of `noise` affects the scree plot (particularly the presence of a clear "elbow") and your ability to definitively select a $k$.
4. Explain the intuition of the elbow, using this numerical simulation as an example.
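
No answer was recorded for parts 3 and 4, so the following is only a sketch of how the comparison could be set up. It restates `createData` compactly (same seed, centers, and draw order as the code above) so the block runs on its own; `sse_by_k` and the `results` dictionary are hypothetical names. The idea is that the SSE falls sharply while each added cluster captures a real group, then falls slowly once extra clusters only split noise. The size of the drop from $k=2$ to $k=3$ relative to the drop from $k=3$ to $k=4$ is a rough measure of how sharp the elbow is.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def createData(noise, N=50):
    # Compact restatement of the assignment's generator
    np.random.seed(100)
    frames = []
    for cx, cy, g in [(1, 1, "a"), (3, 2, "b"), (5, 3, "c")]:
        x1 = np.random.normal(cx, noise, N)
        x2 = np.random.normal(cy, noise, N)
        frames.append(pd.DataFrame({"x1": x1, "x2": x2, "group": g}))
    return pd.concat(frames, axis=0)

def sse_by_k(df, k_max=6, seed=0):
    """Inertia (SSE) for k = 1..k_max on the (x1, x2) coordinates."""
    pts = df[["x1", "x2"]].to_numpy()
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pts).inertia_
        for k in range(1, k_max + 1)
    ]

results = {}
for noise in [0.125, 0.25, 0.5, 1.0, 2.0]:
    sse = sse_by_k(createData(noise))
    results[noise] = {
        "sse": sse,
        "drop_2_to_3": sse[1] - sse[2],  # gain from adding the third cluster
        "drop_3_to_4": sse[2] - sse[3],  # gain from adding a fourth cluster
    }
    print(noise, [round(v, 1) for v in sse])
```

Plotting each `results[noise]["sse"]` list against $k$ gives the five scree plots. At low noise the spacing between the three centers should dominate the SSE, so the third cluster removes most of it and the curve bends visibly at $k=3$; as the noise grows, the within-group spread becomes comparable to that spacing and the bend is expected to fade.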