## Clustering and Regressions

We have two basic data sources:

[Merged Sales] - This is sales data by zip code. It came from https://www.redfin.com/news/data-center/ and includes a lot of stats. Most of the fields are medians, so they tell us little about the outliers, which almost certainly skew some of the underlying data pretty heavily. Price per square foot gives a partial view of how the market is shaped. The data is broken out by zip code and month.
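To make the point about medians concrete, here is a minimal sketch with made-up sale prices (not taken from the Redfin data). A single very expensive sale moves the mean a long way but leaves the median where it was, which is why a medians-only table hides outliers.

```python
import numpy as np

# Hypothetical sale prices for one zip code in one month (illustrative values only).
prices = np.array([200_000, 210_000, 220_000, 230_000, 2_500_000])

median_price = np.median(prices)  # middle value, unaffected by the 2.5M sale
mean_price = prices.mean()        # pulled upward by the 2.5M sale

print(median_price)  # 220000.0
print(mean_price)    # 672000.0
```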

[2018_demographic data] - This is 2018 demographic info by zip code.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

### Import Data
#### Demographics

In [19]:
# Blank columns and string-valued columns were removed in Excel beforehand.
demographics = pd.read_csv("2018_demographic_data_edited.csv", delimiter=',')

print(demographics.shape)

# Fill missing values with each column's mean.
demo = demographics.fillna(demographics.mean())

(33120, 2151)
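One caveat with mean imputation: a column with no observed values has a NaN mean, so `fillna` leaves it empty, and the mean is also computed for the `zip` identifier. The sketch below shows one possible way to handle both on a toy table. Every column name except `zip` is hypothetical, and this is not necessarily what was done to the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the demographics table; column names other than "zip" are hypothetical.
toy = pd.DataFrame({
    "zip": [501, 1005, 1010, 1031],
    "median_age": [34.0, np.nan, 41.0, 45.0],
    "households": [100.0, 300.0, np.nan, 200.0],
    "empty_col": [np.nan, np.nan, np.nan, np.nan],
})

feature_cols = [c for c in toy.columns if c != "zip"]

# Drop columns with no observed values; their mean is NaN, so fillna cannot repair them.
features = toy[feature_cols].dropna(axis=1, how="all")

# Fill the remaining gaps with each column's mean, leaving the zip identifier untouched.
features = features.fillna(features.mean(numeric_only=True))

toy_filled = pd.concat([toy[["zip"]], features], axis=1)
print(toy_filled)
```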


#### Sales

In [36]:
sales = pd.read_csv("med_sale_price_yoy.csv", delimiter=',')
sales.rename(columns={"Zip Code": "zip"}, inplace=True)
sales.head()

Unnamed: 0,zip,Feb-16,Mar-16,Apr-16,May-16,Jun-16,Jul-16,Aug-16,Sep-16,Oct-16,...,Dec-19,Jan-20,Feb-20,Mar-20,Apr-20,May-20,Jun-20,Jul-20,Aug-20,Sep-20
0,501,,,,,,,,,,...,,,,,,,,,,
1,1005,15.40%,5.70%,-29.70%,-24.00%,33.70%,8.60%,5.70%,-9.10%,-4.10%,...,7.30%,-4.30%,6.30%,-7.90%,2.20%,-4.40%,12.40%,-1.30%,4.50%,1.30%
2,1010,,,,,,,,,,...,,,,,,,,,,
3,1031,705.00%,612.00%,-24.60%,-24.60%,-83.20%,-13.70%,-3.30%,-2.00%,27.60%,...,43.90%,-13.80%,70.90%,162.00%,126.90%,-1.00%,-37.80%,-48.30%,-51.10%,681.40%
4,1037,,,-23.90%,,102.70%,150.40%,150.40%,,,...,,,,,,,,,-24.40%,-14.50%
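The year-over-year values above are strings such as `15.40%`, so they need to be converted to numbers before any regression. A minimal sketch of one way to do that on a toy frame with the same layout follows. `percent_to_fraction` is a hypothetical helper, and it assumes the only non-numeric character is a trailing `%`.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout above; values are copied from the displayed rows.
toy_sales = pd.DataFrame({
    "zip": [501, 1005, 1031],
    "Feb-16": [np.nan, "15.40%", "705.00%"],
    "Mar-16": [np.nan, "5.70%", "612.00%"],
})

month_cols = [c for c in toy_sales.columns if c != "zip"]

def percent_to_fraction(col: pd.Series) -> pd.Series:
    """Convert strings such as '15.40%' to floats such as 0.154; NaN stays NaN."""
    return pd.to_numeric(col.str.rstrip("%"), errors="coerce") / 100.0

toy_sales[month_cols] = toy_sales[month_cols].apply(percent_to_fraction)
print(toy_sales)
```

With `errors="coerce"`, anything that still fails to parse becomes NaN instead of raising, so it is worth checking the NaN count before and after the conversion.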


## Clustering Algorithm
### K Means
#### Run Model

In [21]:
# Fit k-means with 6 clusters on every column of demo.
km = KMeans(n_clusters=6, init='k-means++')
clstrs = km.fit(demo)
print(clstrs.cluster_centers_.shape)
print(clstrs.labels_)

(6, 2151)
[1 3 3 ... 1 1 1]
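Two things are worth noting about the run above. The centers have 2151 columns, the same as `demo`, so the `zip` column appears to have been used as a feature. K-means also relies on Euclidean distance, so columns with large numeric ranges dominate the result unless the features are scaled. The sketch below shows one common setup on synthetic data: standardize first, and fix `random_state` so the labels are reproducible. The data and the choice of `StandardScaler` are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the demographic features: two columns on very different scales.
X = np.column_stack([
    rng.normal(50_000, 15_000, size=200),  # an income-like column (hypothetical)
    rng.normal(0.3, 0.1, size=200),        # a share-like column (hypothetical)
])

# Standardize so each column has mean 0 and unit variance before distances are computed.
X_scaled = StandardScaler().fit_transform(X)

# random_state fixes the initialization; n_init is set explicitly because
# its default differs across scikit-learn versions.
km = KMeans(n_clusters=6, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X_scaled)

print(km.cluster_centers_.shape)
print(np.bincount(labels))  # number of rows assigned to each cluster
```

On the real table, the identifier would be dropped first, for example with `demo.drop(columns=['zip'])`, and kept aside so the labels can be reattached afterwards.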


#### Add cluster labels and sales data

In [31]:
# Add the cluster label column.
demo['cluster'] = clstrs.labels_
print(demo.shape)

# Left-join the sales data on zip.
data = demo.set_index('zip').join(sales.set_index('zip'))

print(data.shape)

(33120, 2152)
(162174, 2173)
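The row count jumps from 33,120 to 162,174 after the join. A left join only adds rows when a key appears more than once in the right-hand table, so `zip` was evidently not unique in the sales frame used for this run. The sketch below reproduces that effect on toy tables and shows two ways to catch it. Column names other than `zip` and `cluster` are hypothetical.

```python
import pandas as pd

# Toy tables; "median_sale_price_yoy" is a hypothetical column name.
left = pd.DataFrame({"zip": [1005, 1031, 1037], "cluster": [1, 3, 3]})
right = pd.DataFrame({
    "zip": [1005, 1005, 1031],  # 1005 appears twice, e.g. one row per month
    "median_sale_price_yoy": [0.154, 0.057, 7.05],
})

# Count repeated keys on the right-hand side before joining.
n_dupes = int(right["zip"].duplicated().sum())
print(n_dupes)

# The same join pattern as above: the repeated key produces an extra row.
joined = left.set_index("zip").join(right.set_index("zip"))
print(joined.shape)

# merge(..., validate=...) raises instead of silently multiplying rows.
rejected = False
try:
    left.merge(right, on="zip", how="left", validate="one_to_one")
except pd.errors.MergeError as exc:
    rejected = True
    print("merge rejected:", exc)
```

If the duplicates turn out to be one row per zip and month, reshaping or aggregating the sales table to one row per zip before joining would keep the result at 33,120 rows.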


#### Separate into groups

#### Run Linear Regressions for each Group
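These last two steps are not filled in yet. As a rough sketch of one way they could go, the code below groups a synthetic table by cluster label and fits an ordinary least squares model within each group. The `feature` and `target` columns are hypothetical placeholders for whichever demographic predictors and sales metric end up being used.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the joined table; "feature" and "target" are hypothetical names.
n = 120
toy = pd.DataFrame({
    "cluster": rng.integers(0, 3, size=n),
    "feature": rng.normal(size=n),
})
toy["target"] = 2.0 * toy["feature"] + toy["cluster"] + rng.normal(scale=0.1, size=n)

results = {}
for cluster_id, group in toy.groupby("cluster"):
    # Drop rows with missing values; LinearRegression does not accept NaN.
    group = group.dropna(subset=["feature", "target"])
    X, y = group[["feature"]], group["target"]
    model = LinearRegression().fit(X, y)
    results[cluster_id] = {
        "n": len(group),
        "slope": model.coef_[0],
        "intercept": model.intercept_,
        "r2": model.score(X, y),
    }

# One row per cluster, for side-by-side comparison of the fits.
summary = pd.DataFrame(results).T
print(summary)
```

Fitting one model per cluster lets the slope and intercept differ between groups, at the cost of fewer observations per fit, so the per-cluster row counts are worth keeping next to the coefficients.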