<a href="https://colab.research.google.com/github/mvince33/Coding-Dojo/blob/main/week09/6_20_Code_along_Clustering_for_Supervised_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task: Cluster Houses

In this project we will:

1. Cluster houses by neighborhood
2. Visualize the neighborhood clusters
3. Use neighborhood clusters as a new feature for predictive modeling
4. Compare model evaluation with and without clusters as a feature



In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, mean_squared_error,
                             mean_absolute_error, r2_score)
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Useful Functions

In [None]:
def plot_2d_clusters(data, clusters, random_state=42):
  """Fit KMeans with the given number of clusters on 2D data and plot the clusters"""

  # kmeans
  km = KMeans(n_clusters=clusters, random_state=random_state)
  km.fit(data)

  # plot clusters, colored by cluster label
  plt.figure(figsize=(10,8))
  plt.scatter(x=data.iloc[:,0], y=data.iloc[:,1], c=km.labels_,
              cmap='tab20', s=2)
  ticks = np.sort(np.unique(km.labels_))
  plt.xlabel(data.columns[0])
  plt.ylabel(data.columns[1])
  plt.colorbar(ticks=ticks)
  plt.show()

# Function to calculate and plot the silhouette score and inertia of a KMeans model for various values of k

def plot_k_values(data, ks, random_state=42):
  """plot the silhouette score and inertia 
  of a KMeans model for various k values"""

  sils = []
  inertias = []

  for k in ks:
    km = KMeans(n_clusters=k, random_state=random_state)
    km.fit(data)
    sils.append(silhouette_score(data, km.labels_))
    inertias.append(km.inertia_)
      
  # plot inertia and silhouette score
  fig, axes = plt.subplots(2,1, figsize=(9,7))
  axes[0].set_xlabel('Number of Clusters')
  axes[0].set_ylabel('Inertia', color = 'blue')
  axes[0].plot(ks, inertias, color = 'blue', label='inertia', marker ='o')
  axes[0].grid()

  axes[1].plot(ks, sils, color = 'red', label='silhouette score', marker='+')
  axes[1].set_ylabel('Silhouette Score', color = 'red')
  axes[1].set_xlabel('Number of Clusters')
  axes[1].grid()

  plt.show()

def evaluate_regression(y_true, y_pred, name='model'):
  """Return R2, MAE and RMSE for the predictions as a one-column dataframe"""
  scores = pd.DataFrame(index=['R2','MAE','RMSE'],
                        columns=[name])
  scores.loc['R2', name] = r2_score(y_true, y_pred)
  scores.loc['MAE', name] = mean_absolute_error(y_true, y_pred)
  # RMSE is the square root of the mean squared error
  scores.loc['RMSE', name] = np.sqrt(mean_squared_error(y_true, y_pred))
  return scores


In [None]:
# Load Data
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vT9qetZw-uGS1u44KiW-XOJJkhmX0BKPdcsQ_X9cwTHlsTvlBHbEyA5G_D8r9knBbPOQ7My-W4pTfy2/pub?gid=2140088293&single=true&output=csv')
df.head()

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# numeric_only=True avoids an error in pandas 2.x if any column is non-numeric
df.corr(numeric_only=True)

### Let's compare the location of the houses with their price.

We can do this with some beautiful graphics using Plotly's `scatter_mapbox()` ([documentation](https://plotly.github.io/plotly.py-docs/generated/plotly.express.scatter_mapbox.html)).

In [None]:
# Let's use Plotly as a cool way to show houses on a map
import plotly.express as px

px.scatter_mapbox(df, lat='lat', lon='long', color='price',
                  mapbox_style="open-street-map", width=1000, height=800)

#### To use our data for clustering, we treat the target as unknown
- Let's split the data and use the features as unlabeled data (without the target)


## Using KMeans Clusters for Modeling by Adding the Cluster as a Feature

# Validation Split

In [None]:
# validation split
X = df.drop(columns = ['price'])
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

In [None]:
# Scale the data
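
One way this step could be filled in is sketched below. It assumes every feature column is numeric. The small synthetic frame only stands in for the `X_train` and `X_test` from the split above: `lat` and `long` come from the dataset, while `sqft_living` and all the values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (values are made up).
rng = np.random.default_rng(42)
X = pd.DataFrame({
    'lat': rng.uniform(47.1, 47.8, 200),
    'long': rng.uniform(-122.5, -121.3, 200),
    'sqft_living': rng.uniform(500, 5000, 200),  # hypothetical column
})
X_train, X_test = train_test_split(X, random_state=42, test_size=0.3)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train),
                              columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test),
                             columns=X_test.columns, index=X_test.index)
```

Wrapping the scaled arrays back into dataframes keeps the column names, which makes it easy to select `lat` and `long` later for clustering.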


# Tune K

### Let's form clusters based only on location (lat & long)

## Tune K Using Only Location Data

In [None]:
# Silhouette and Inertia Plots
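
As a numeric companion to the plots from `plot_k_values`, the sketch below collects the same two metrics in a table. The location data here is synthetic (three made-up neighborhoods), and `n_init=10` is set explicitly only so the behavior does not depend on the scikit-learn version; both are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled training locations: three made-up neighborhoods.
rng = np.random.default_rng(42)
centers = np.array([[47.2, -122.4], [47.5, -122.0], [47.7, -121.5]])
points = np.vstack([c + rng.normal(scale=0.03, size=(100, 2)) for c in centers])
location_train = pd.DataFrame(StandardScaler().fit_transform(points),
                              columns=['lat', 'long'])

# Fit one KMeans model per k and record inertia and silhouette score.
ks = range(2, 8)
rows = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(location_train)
    rows.append({'k': k,
                 'inertia': km.inertia_,
                 'silhouette': silhouette_score(location_train, km.labels_)})
k_scores = pd.DataFrame(rows).set_index('k')
print(k_scores)
```

On the real data, `location_train` would be the `lat` and `long` columns of the scaled training set.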


- There is no clear elbow in the inertia plot
- The silhouette score is high for k = 2 and k = 11
- We will first try 11 clusters, and then 2, which has the highest score


## Plot 2 Clusters

In [None]:
# Scatter Plot of Neighborhood in 2 Clusters


## Plot 4 Clusters

In [None]:
# Scatter Plot of Neighborhoods in 4 clusters


# Modeling

## Linear Regression Baseline without Clusters

In [None]:
# Without clusters
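
A baseline could look roughly like the sketch below. The data is synthetic, `sqft_living` is a hypothetical column, and `regression_scores` is a hypothetical stand-in for the notebook's `evaluate_regression` so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (all values are made up).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    'lat': rng.uniform(47.1, 47.8, n),
    'long': rng.uniform(-122.5, -121.3, n),
    'sqft_living': rng.uniform(500, 5000, n),  # hypothetical column
})
y = 200 * X['sqft_living'] + rng.normal(scale=50_000, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

def regression_scores(y_true, y_pred, name):
    """Return R2, MAE and RMSE as a one-column dataframe (hypothetical helper)."""
    return pd.DataFrame({name: [r2_score(y_true, y_pred),
                                mean_absolute_error(y_true, y_pred),
                                np.sqrt(mean_squared_error(y_true, y_pred))]},
                        index=['R2', 'MAE', 'RMSE'])

# Fit on the scaled training data and score both splits side by side.
base_reg = LinearRegression().fit(X_train_scaled, y_train)
base_scores = pd.concat(
    [regression_scores(y_train, base_reg.predict(X_train_scaled), 'Train Baseline'),
     regression_scores(y_test, base_reg.predict(X_test_scaled), 'Test Baseline')],
    axis=1)
print(base_scores)
```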


## Linear Regression Model with Clustering as Feature Extraction



### 2 Clusters

In [None]:
# Let's see what adding the KMeans clusters does!

# make copies of the data to add cluster feature to


# create subset of data with only latitude and longitude


# fit a kmeans model on just the training location data


# add clusters as a new feature in the training and testing data

# create a new model to fit on the data with the cluster feature


# evaluate the new model on the training and testing data


# combine the training and testing scores into one dataframe
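
The commented steps above might be filled in along the lines of the sketch below. It uses synthetic data with two made-up neighborhoods, a hypothetical `sqft_living` column, and a hypothetical helper named `add_cluster_feature`; scaling is left out for brevity, and only R2 is reported.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two made-up neighborhoods with different price levels.
rng = np.random.default_rng(42)
n = 400
west = rng.random(n) < 0.5
X = pd.DataFrame({
    'lat': np.where(west, 47.3, 47.7) + rng.normal(scale=0.03, size=n),
    'long': np.where(west, -122.4, -121.6) + rng.normal(scale=0.03, size=n),
    'sqft_living': rng.uniform(500, 5000, n),  # hypothetical column
})
y = 200 * X['sqft_living'] + np.where(west, 300_000, 0) + rng.normal(scale=20_000, size=n)

def add_cluster_feature(X_train, X_test, k, random_state=42):
    """Fit KMeans on training lat/long only and append labels to copies of both sets."""
    X_train_km, X_test_km = X_train.copy(), X_test.copy()
    loc_cols = ['lat', 'long']
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
    km.fit(X_train[loc_cols])
    X_train_km['cluster'] = km.labels_
    # Test rows are assigned to the clusters learned from the training data.
    X_test_km['cluster'] = km.predict(X_test[loc_cols])
    return X_train_km, X_test_km

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
X_train_km, X_test_km = add_cluster_feature(X_train, X_test, k=2)

# Fit a new model on the data that includes the cluster feature.
km_reg = LinearRegression().fit(X_train_km, y_train)
km_scores = pd.DataFrame({
    'Train 2 Clusters': [r2_score(y_train, km_reg.predict(X_train_km))],
    'Test 2 Clusters': [r2_score(y_test, km_reg.predict(X_test_km))],
}, index=['R2'])
print(km_scores)
```

Calling the helper with a different `k` repeats the experiment for another cluster count. With more than two clusters, the integer label is treated as an ordered number by a linear model, so one-hot encoding the label may be worth trying.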


### 4 Clusters

In [None]:
# Let's see what adding the KMeans clusters does!

# make copies of the data to add cluster feature to


# create subset of data with only latitude and longitude


# fit a kmeans model on just the training location data


# add clusters as a new feature in the training and testing data

# create a new model to fit on the data with the cluster feature


# evaluate the new model on the training and testing data


# combine the training and testing scores into one dataframe


# Results

Our neighborhood clusters improved our model performance!

# Next Steps: How could we continue to improve this model?
* Further tune K
* Use Ridge/Lasso/ElasticNet to add regularization
* Try models other than LinearRegression

# Limitations:

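To put KMeans into a pipeline as a transformer that adds cluster labels, we would have to create a custom transformer, because the built-in `KMeans.transform()` returns distances to the cluster centers rather than labels. Without a pipeline, we can't use this technique with GridSearchCV without some data leakage between folds, since the clusters would be fit on rows that later serve as validation data. The leakage is small, but it results in a slight loss of confidence in the scores produced by GridSearchCV.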
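
A minimal sketch of such a custom transformer follows. The class name `KMeansLabeler`, its parameters, the `sqft_living` column, and the synthetic data are all hypothetical; the point is only that the clusters are refit inside each cross-validation fold.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


class KMeansLabeler(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: appends a KMeans label fit on the location columns."""

    def __init__(self, n_clusters=2, loc_cols=('lat', 'long'), random_state=42):
        self.n_clusters = n_clusters
        self.loc_cols = loc_cols
        self.random_state = random_state

    def fit(self, X, y=None):
        # Clusters are learned only from the rows passed to fit (the training fold).
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X[list(self.loc_cols)])
        return self

    def transform(self, X):
        X_out = X.copy()
        X_out['cluster'] = self.kmeans_.predict(X[list(self.loc_cols)])
        return X_out


# Synthetic stand-in for the housing data (all values are made up).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    'lat': rng.uniform(47.1, 47.8, n),
    'long': rng.uniform(-122.5, -121.3, n),
    'sqft_living': rng.uniform(500, 5000, n),  # hypothetical column
})
y = 200 * X['sqft_living'] + rng.normal(scale=50_000, size=n)

# The number of clusters becomes a tunable pipeline hyperparameter.
pipe = make_pipeline(KMeansLabeler(), LinearRegression())
grid = GridSearchCV(pipe, {'kmeanslabeler__n_clusters': [2, 4, 8]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Because the transformer lives inside the pipeline, GridSearchCV fits KMeans on the training portion of each fold only, which is what removes the leakage.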

# Bonus, if there is extra time: 

Another way to make latitude and longitude non-linear would be to raise them to a higher power.  What if we added new features that were the latitude and longitude each squared?  How would that affect our model?
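
To experiment with this idea, something like the sketch below could serve as a starting point. The prices are synthetic and deliberately built to peak in the middle of the latitude range, so the result says nothing about the real housing data; `add_squared_location` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    'lat': rng.uniform(47.1, 47.8, n),
    'long': rng.uniform(-122.5, -121.3, n),
})
# Made-up prices that peak near the middle of the latitude range.
y = 500_000 - 2_000_000 * (X['lat'] - 47.45) ** 2 + rng.normal(scale=10_000, size=n)

def add_squared_location(X):
    """Return a copy of X with squared latitude and longitude columns added."""
    X_sq = X.copy()
    X_sq['lat_sq'] = X_sq['lat'] ** 2
    X_sq['long_sq'] = X_sq['long'] ** 2
    return X_sq

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

# Compare a plain linear fit with one that also sees the squared terms.
linear = LinearRegression().fit(X_train, y_train)
squared = LinearRegression().fit(add_squared_location(X_train), y_train)
print('R2 without squares:', linear.score(X_test, y_test))
print('R2 with squares:   ', squared.score(add_squared_location(X_test), y_test))
```

The squared terms let a linear model represent a price that rises and then falls along one coordinate, which a straight-line fit on `lat` and `long` alone cannot do.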