# Clustering National Home Markets

You work for a large national bank, who has a large lending business devoted toward providing loans to people who want to borrow to buy homes across the United States. The bank wants to have a model which can identify how similar, at any given time, the national real estate market is to other real estate periods which have occured in the past. After all, quantifying the nature of today's real estate market to those that have occured in the past will help the bank understand its lending risk, as well as the potential for new growth.

To that effect, you've decided to use the `KMeans` unsupervised learning algorithm to segment different periods in the U.S. market for national residential house prices.

In [22]:
# Import the modules
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans

## Read in the `national-home-sales.csv` file from the Resources folder and create a DataFrame. Set the “date” column to create the DatetimeIndex. Be sure to include parameters for `parse_dates` and `infer_datetime_format`.

In [23]:
# Read in the CSV file as a Pandas DataFrame
home_sales_df = pd.read_csv(
  Path("../Resources/national-home-sales.csv"),
  index_col="date", 
  parse_dates=True, 
  infer_datetime_format=True 
)

# Review the DataFrame
home_sales_df.head()

  home_sales_df = pd.read_csv(
  home_sales_df = pd.read_csv(


Unnamed: 0_level_0,inventory,homes_sold,median_sale_price
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2020-01-01,1250798,377964,289000
2020-02-01,1265253,405992,294000
2020-03-01,1316823,507324,303000
2020-04-01,1297460,436855,304000
2020-05-01,1289500,421351,299000


In [24]:
# Read in the CSV file as a Pandas DataFrame
home_sales_df = pd.read_csv(
  Path("../Resources/national-home-sales.csv")
)
#convert the date to datetime
home_sales_df['date'] = pd.to_datetime(home_sales_df['date'], format='%m/%d/%y')

#set it to the index
home_sales_df.set_index('date', inplace=True)
# Review the DataFrame
home_sales_df.head()



Unnamed: 0_level_0,inventory,homes_sold,median_sale_price
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2020-01-01,1250798,377964,289000
2020-02-01,1265253,405992,294000
2020-03-01,1316823,507324,303000
2020-04-01,1297460,436855,304000
2020-05-01,1289500,421351,299000


## Create two lists: one to hold the list of inertia scores and another for the range of k values (from 1 to 11) to analyze.

In [25]:
# Create a a list to store inertia values
inertia = []

# Create a a list to store the values of k
k = list(range(1, 11))
k

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

## Using a for-loop to evaluate each instance of k, define a K-means model, fit the K-means model based on the scaled DataFrame, and append the model’s inertia to the empty inertia list that you created in the previous step.

In [26]:
# Create a for-loop where each value of k is evaluated using the K-means algorithm
for i in k:
    k_model = KMeans(n_clusters = i, random_state=2)
    # Fit the model using the spread_df DataFrame
    k_model.fit(home_sales_df)  
    # Append the value of the computed inertia from the `inertia_` attribute of the KMeans model instance
    inertia.append(k_model.inertia_)

# YOUR CODE HERE
inertia

[8048111161391.518,
 3567121196527.0806,
 1894158408526.619,
 1373651829317.953,
 1140083301128.2942,
 917853817187.4152,
 830183087592.7928,
 673192783478.791,
 569538768521.654,
 474516707147.0744]

## Store the values for k and the inertia in a Dictionary called `elbow_data`. Use `elbow_data` to create a Pandas DataFrame called `df_elbow`.

In [30]:
# Create a Dictionary that holds the list values for k and inertia
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)
df_elbow.head()

# Create a DataFrame using the elbow_data Dictionary
df_elbow = pd.DataFrame({"k": k, "inertia": inertia})
df_elbow.head()



Unnamed: 0,k,inertia
0,1,8048111000000.0
1,2,3567121000000.0
2,3,1894158000000.0
3,4,1373652000000.0
4,5,1140083000000.0


## Using hvPlot, plot the `df_elbow` DataFrame to visualize the elbow curve.

In [32]:
# Plot the DataFrame
df_elbow.hvplot.line(
    x="k",
    y = "inertia",
    title= "Elbow Curve",
    xticks = k
)



## Perform the following tasks for each of the two most likely values of `k`:

* Define a K-means model using `k` to define the clusters, fit the model, make predictions, and add the prediction values to a copy of the scaled DataFrame and call it `spread_predictions_df`.

* Plot the clusters. The x-axis should reflect home "inventory", and the y-axis should reflect either the "median_sale_price" or "homes_sold" variable.

In [35]:
# Define the model with the lower value of k clusters
# Use a random_state of 1 to generate the model
model = KMeans(n_clusters=3, random_state=1)

# Fit the model
model.fit(home_sales_df)

# Make predictions
k_lower = model.predict(home_sales_df)
k_lower
# Create a copy of the DataFrame and name it as spread_df_predictions
home_sales_predictions_df = home_sales_df.copy()

# Add a class column with the labels to the spread_df_predictions DataFrame
home_sales_predictions_df['clusters_lower'] = k_lower

In [43]:
display(home_sales_predictions_df.head(50))
display(home_sales_predictions_df.tail(50))

Unnamed: 0_level_0,inventory,homes_sold,median_sale_price,clusters_lower
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-01-01,1250798,377964,289000,1
2020-02-01,1265253,405992,294000,1
2020-03-01,1316823,507324,303000,1
2020-04-01,1297460,436855,304000,1
2020-05-01,1289500,421351,299000,1
2020-06-01,1219863,587635,310000,1
2020-07-01,1165359,700733,323000,1
2020-08-01,1066903,652878,328000,1
2012-02-01,2078931,304737,160000,2
2012-03-01,2120173,394034,171000,2


Unnamed: 0_level_0,inventory,homes_sold,median_sale_price,clusters_lower
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-11-01,1803188,393948,234000,3
2015-12-01,1615706,489030,238000,0
2016-01-01,1604314,341903,228000,0
2016-02-01,1639331,365754,227000,0
2016-03-01,1719652,486821,239000,0
2016-04-01,1793818,526883,245000,3
2016-05-01,1831656,591629,252000,3
2016-06-01,1871580,661162,259000,3
2016-07-01,1890218,584704,257000,3
2016-08-01,1850571,614179,256000,3


In [44]:
# Plot the clusters
home_sales_predictions_df.hvplot.scatter(
    x="inventory",
    y="median_sale_price",
    by="clusters_lower"
).opts(yformatter="%.0f")

In [39]:
# Define the model with the higher value of k clusters
# Use a random_state of 1 to generate the model
model = KMeans(n_clusters=4, random_state=1)

# Fit the model
model.fit(home_sales_df)

# Make predictions
k_lower = model.predict(home_sales_df)
k_lower
# Create a copy of the DataFrame and name it as spread_df_predictions
home_sales_predictions_df = home_sales_df.copy()

# Add a class column with the labels to the spread_df_predictions DataFrame
home_sales_predictions_df['clusters_lower'] = k_lower

In [40]:
# Plot the clusters
# Plot the clusters
home_sales_predictions_df.hvplot.scatter(
    x="inventory",
    y="homes_sold",
    by="clusters_lower"
).opts(yformatter="%.0f")

## Answer the following question

* Considering the plot, what’s the best number of clusters to choose, or value of k? 
    >* Your Answer Here