# Zillow Project
## Clustering Module
By Michael P. Moran

## Table of contents
1. [Project Planning](#project-planning)
1. [Prepare Environment](#prepare-environment)
1. [Acquisition](#acquisition)
1. [Preparation](#preparation)
1. [Exploration](#exploration)
1. [Modeling](#modeling)

## TODO
- SQL query
    - [X] only include properties with a transaction in 2016 and/or 2017 (along with zestimate error and date of transaction).
- Removing lots
    - [ ] find a ratio between taxlandvalue and lotsizesquarefeet to exclude lots (there are some with 70k taxappraisalvalue and a lotsize of 30-40k square feet)
    - [X] fill taxdelinquencyflag with N for the NaNs.
    - [ ] what to do with taxdelinquencyyear? maybe combine the flag with the year to create a variable that reflects how long it has been delinquent and put a 0 for those that are not delinquent
    - [ ] combine calculatedsqft and lotsizesqft???
    - [ ] combine bedroomcnt and bathroomcnt?
    - [ ] encode delinquency column
    
### Presentation
- [ ] Topic will be 3 key takeaways from project
    - e.g., I created this really cool function
    - e.g., or garagecnt is determining the poolcnt

## Project Planning <a name="project-planning"></a>

### Goals

### Deliverables

### Data Dictionary & Domain Knowledge

- regionidcounty
    - 3101 --- 6037 (Los Angeles)
    - 1286 --- 6059 (Orange County)
    - 2061 --- 6111 (Ventura County) 

In [None]:
LA = 3101
ORANGE = 1286
VENTURA = 2061

### Hypotheses

1. Low calculatedsqft is correlated with a higher logerror
1. Low lotsizesqft is correlated with a higher logerror
1. Low taxvaluedollarcnt is correlated with a higher logerror
1. Low bedroomcnt is correlated with a higher logerror
    - 2 to 4 bedrooms have higher logerror
1. Low bathroomcnt is correlated with a higher logerror
    - 1 to 3 bathrooms have higher logerror
1. bedroomcnt and bathroomcnt are positively correlated
1. calculatedsqft and taxvaluedollarcnt are positively correlated
    - Yes. There is a 0.6 correlation coefficient
1. lotsizesqft and taxvaluedollarcnt are positively correlated
    - No. Tax value actually goes down with bigger lots (are these lots without houses??)
1. lotsizesqft is not driving logerror because it is really high for condos (I believe it includes the whole development the condo is on) but the logerror is low for them
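
The correlation hypotheses above can be checked with a Pearson correlation matrix. The sketch below is only an illustration: the `toy` frame holds made-up values, not rows from the Zillow data, so its coefficients will not match the 0.6 reported above.

```python
import pandas as pd

# Toy data for illustration only; these are not rows from the Zillow dataset.
toy = pd.DataFrame({
    "calculatedfinishedsquarefeet": [900, 1200, 1500, 2000, 2600, 3100],
    "taxvaluedollarcnt": [210_000, 250_000, 400_000, 380_000, 610_000, 700_000],
    "lotsizesquarefeet": [40_000, 6_000, 5_500, 7_000, 8_000, 30_000],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(method="pearson")

# Pull out the single pair named in the hypothesis.
sqft_vs_value = corr.loc["calculatedfinishedsquarefeet", "taxvaluedollarcnt"]
print(round(sqft_vs_value, 2))
```

On the real data the same call would be made on `df[continuous_cols]` once that list is defined.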


### Thoughts & Questions

- remove the condominiums and planned unit developments?

## Prepare Environment <a name="prepare-environment"></a>

In [None]:
import acquire_zillow
import prepare_zillow
import explore_zillow
from importlib import reload

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
# pd.set_option('display.width', 1000)

**Reload modules to capture changes**

In [None]:
acquire_zillow = reload(acquire_zillow)
prepare_zillow = reload(prepare_zillow)
explore_zillow = reload(explore_zillow)

## Acquisition <a name="acquisition"></a>

In [None]:
df = acquire_zillow.get_zillow_from_csv("zillow_data.csv")

## Preparation <a name="preparation"></a>

### Summarize Data

In [None]:
prepare_zillow.summarize(df)

### Handle Missing Values

Run the first function that returns missing value totals by column: Does the attribute have enough information (i.e. enough non-null values) to be useful? Choose your cutoff and remove columns where there is not enough information available. Document your cutoff and your reasoning.
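
A minimal sketch of the cutoff idea described above, assuming a simple "share of missing values" rule. The helper names `missing_by_col` and `drop_sparse_cols` and the 0.5 cutoff are hypothetical; the actual `prepare_zillow.df_missing_vals_by_col` implementation is not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd


def missing_by_col(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values for each column (hypothetical helper)."""
    n_missing = df.isna().sum()
    return pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": n_missing / len(df),
    }).sort_values("pct_missing", ascending=False)


def drop_sparse_cols(df: pd.DataFrame, max_pct_missing: float) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at or below the cutoff."""
    keep = df.columns[df.isna().mean() <= max_pct_missing]
    return df[keep]


# Toy frame for illustration only.
toy = pd.DataFrame({
    "bathroomcnt": [1.0, 2.0, 2.5, 3.0],
    "basementsqft": [np.nan, np.nan, np.nan, 500.0],
    "yearbuilt": [1950.0, np.nan, 1988.0, 2004.0],
})

print(missing_by_col(toy))

# The 0.5 cutoff is an arbitrary example value, not the project's documented choice.
trimmed = drop_sparse_cols(toy, max_pct_missing=0.5)
print(list(trimmed.columns))
```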

**Drop certain columns**

- Same information
    - calculatedbathnbr
        - because it has 99% of the same values as bathroomcnt
    - finishedsquarefeet12
        - because it has the same information as calculatedfinishedsquarefeet except for 9 rows.
    - structuretaxvaluedollarcnt
        - because it has the same info as taxvaluedollarcnt
        
- 100% or near 100% missing values
    - architecturalstyledesc
    - basementsqft
    - buildingclassdesc
    - decktypeid
    - finishedfloor1squarefeet
    - finishedsquarefeet13
    - finishedsquarefeet15
    - finishedsquarefeet50
    - finishedsquarefeet6
    - fireplacecnt
    - fireplaceflag
    - garagecarcnt
    - garagetotalsqft
    - hashottuborspa
    - numberofstories
    - poolsizesum
    - pooltypeid10
    - pooltypeid2
    - storydesc
    - taxdelinquencyyear (not sure how to impute this one)
    - threequarterbathnbr
    - typeconstructiondesc
    - yardbuildingsqft17
    - yardbuildingsqft26
    
- too difficult to impute
    - regionidneighborhood (almost 50% missing; not sure how to impute this)
    
- inferior information
    - fullbathcnt
        - because bathroomcnt has more fine-grained information; it includes half bathrooms, etc.

- unsure what to do with
    - airconditioningdesc
    - heatingorsystemdesc


**Impute 0 for certain columns**

- hashottuborspa
- poolcnt
- pooltypeid7

**Impute values for certain columns**
- most frequent value
    - buildingqualitytypeid (7)
    - propertyzoningdesc (LAR1)
    - regionidcity (12447)
    - regionidzip (?)
    - yearbuilt (1950)
- linear regression
    - lotsizesquarefeet
- constant
    - taxdelinquencyflag ("N")
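
The most-frequent-value and constant imputations listed above could look roughly like the following. This is a sketch on toy data; `impute_most_frequent` is a hypothetical helper, and the real logic lives in `prepare_zillow.prepare_zillow`, which is not shown here. The linear-regression imputation for lotsizesquarefeet is left out.

```python
import numpy as np
import pandas as pd


def impute_most_frequent(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Fill NaNs in each listed column with that column's most frequent value."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy frame for illustration only.
toy = pd.DataFrame({
    "yearbuilt": [1950.0, 1950.0, np.nan, 1988.0],
    "taxdelinquencyflag": ["Y", np.nan, np.nan, np.nan],
    "poolcnt": [1.0, np.nan, np.nan, 1.0],
})

# Most frequent value for yearbuilt.
toy = impute_most_frequent(toy, ["yearbuilt"])

# Constants: "N" for the delinquency flag, 0 for the pool count.
toy = toy.fillna({"taxdelinquencyflag": "N", "poolcnt": 0})
print(toy)
```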

**Drop rows**
- Those with NaN in columns with only few NaNs (not worth the time to impute)
    - taxvaluedollarcnt
    - landtaxvaluedollarcnt
    - taxamount
    - censustractandblock
    
those with nan in landtaxvaluedollarcnt

In [None]:
prepare_zillow.df_missing_vals_by_col(df)

### Handle Duplicates

### Fix Data Types

### Handle Outliers

### Run prepare function to do everything

In [None]:
df = prepare_zillow.prepare_zillow(df)

### Add columns

**add column with abs of logerror**

In [None]:
df["logerror_abs"] = df.logerror.abs()

## Exploration  <a name="exploration"></a>

**Bin logerror**

**by explicit bins**

In [None]:
bins = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)], closed="left")
# # df["logerror_bin"] = pd.cut(df.logerror_abs, bins=6, labels=[0, 1, 2, 3, 4, 5])
# logerror_bin = pd.cut(df.logerror_abs, bins=bins)
# intervals_to_labels = {str(index): i for i, index in enumerate(bins)}
# # df_tmp.drop(columns="logerror_bin")
# logerror_bin_label = logerror_bin.apply(lambda x: intervals_to_labels[str(x)])
# # for b, l in bins_to_labels:
# #     df[df.logerror_bin == b]["logerror_bin"] = i
# # df.logerror_bin.value_counts(dropna=False)
# # # df = df.astype({"logerror_bin": int})
# df["logerror_bin"] = logerror_bin_label
df["logerror_bin"] = explore_zillow.series_bin_with_labels(df.logerror_abs, bins, (0, 1, 2, 3, 4, 5))

**by quartile**

In [None]:
df["logerror_bin_quart"] = pd.qcut(df.logerror_abs, q=4)
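
For reference, here is one way the two binning approaches above can be written with plain pandas. `bin_with_labels` is a hypothetical stand-in; the real `explore_zillow.series_bin_with_labels` is not shown in this notebook and may work differently. The values in `toy` are made up.

```python
import pandas as pd


def bin_with_labels(s: pd.Series, bins: pd.IntervalIndex, labels: tuple) -> pd.Series:
    """Cut s into the given intervals, then rename each interval to its label."""
    binned = pd.cut(s, bins=bins)
    # The categories of `binned` are the intervals in `bins`, in order,
    # so the labels line up position by position.
    return binned.cat.rename_categories(list(labels))


# Toy absolute logerror values for illustration only.
toy = pd.Series([0.02, 0.5, 0.9, 1.3, 2.2, 3.7, 4.1, 5.5])

bins = pd.IntervalIndex.from_tuples(
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)], closed="left"
)
labels = bin_with_labels(toy, bins, (0, 1, 2, 3, 4, 5))

# Quartile bins: each bin holds roughly a quarter of the observations.
quart = pd.qcut(toy, q=4)

print(labels.tolist())
print(quart.value_counts())
```

Explicit bins keep the cut points interpretable, while quartile bins guarantee balanced group sizes, which is convenient for the chi-squared tests later on.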

**Create sample of df for exploration**

In [None]:
df_sample = df.sample(n=25_000, random_state=123)

**Create lists holding column names for continuous and categorical variables**

In [None]:
continuous_cols = ["calculatedfinishedsquarefeet",
                "latitude", "longitude", "lotsizesquarefeet",
                "yearbuilt", "taxvaluedollarcnt", "landtaxvaluedollarcnt", "logerror"]

contin_and_cat_cols = ["bathroomcnt", "bedroomcnt", "poolcnt"]

In [None]:
explore_zillow.df_plot_numeric(df_sample, continuous_cols, "logerror_bin")

In [None]:
explore_zillow.df_jitter_plot(df_sample, contin_and_cat_cols, continuous_cols, "logerror_bin")

In [None]:
# def relplot_num_and_cat(df: pd.DataFrame, x: str, y: str, hue: str) -> pd.DataFrame:
#     """
#     Write a function that will use seaborn's relplot to plot 2 numeric (ordered) variables
#     and 1 categorical variable. It will take, as input, a dataframe, column name indicated
#     for each of the following: x, y, & hue.
#     """
#     sns.relplot(x=x, y=y, hue=hue, data=df, alpha=0.8)
#     plt.show()

# relplot_num_and_cat(df_sample[df_sample.logerror_bin != bins[0]] , "longitude", "latitude", "logerror_bin")

In [None]:
# def swarmplot_num_and_cat(df: pd.DataFrame, X: str, Y: list, hue: str=None) -> None:
#     """
#     Write a function that will take, as input, a dataframe, a categorical column name,
#     and a list of numeric column names. It will return a series of subplots: a swarmplot
#     for each numeric column. X will be the categorical variable.
#     """
#     cols = 3
#     rows = round(len(Y) / cols) if len(Y) // cols > 0 else 1
    
#     plt.figure(figsize=(15, 15))
#     for i, y in enumerate(Y):
#         plt.subplot(rows, cols, i + 1)
#         sns.swarmplot(x=X, y=y, data=df, hue=hue, palette="Set2")
#     plt.show()

# swarmplot_num_and_cat(df_sample.sample(n=10_000), "logerror_bin",
#                       ["calculatedfinishedsquarefeet", "lotsizesquarefeet"])

In [None]:
def crosstab_cat(df: pd.DataFrame, cols: list) -> None:
    """
    Write a function that will take a dataframe and a list of categorical columns to plot
    each combination of variables in the chart type of your choice.
    """
    for outer in cols:
        for inner in cols:
            if outer == inner:
                continue
            plt.figure(figsize=(10, 8))
            ct = pd.crosstab(df[outer], df[inner], margins=True)#.apply(lambda r: r/r.sum(), axis=1)
            sns.heatmap(ct, cmap="YlGnBu", annot=True, cbar=False, fmt=".2f")
            #print(pd.crosstab(df[outer], df[inner], margins=True).apply(lambda r: r/r.sum(), axis=1))
            plt.show()
    
crosstab_cat(df_sample, contin_and_cat_cols + ["logerror_bin"])
#pd.crosstab(df.logerror_bin, df.bathroomcnt, margins=True)

### Train-Test Split

### JointPlot

### PairGrid

### Heatmap

### T-Tests

#### Logerror among counties
**Is logerror significantly different for properties in Los Angeles County vs
Orange County (or Ventura County)?**

**LA v. Orange**
- H0: logerror is not different for properties in LA County v. Orange County
    - Reject the null hypothesis. There is a significant difference.

In [None]:
df.regionidcounty.value_counts()
logerror_la = df[df.regionidcounty == LA].logerror
logerror_orange = df[df.regionidcounty == ORANGE].logerror
logerror_ventura = df[df.regionidcounty == VENTURA].logerror

In [None]:
explore_zillow.series_ttest(logerror_la, logerror_orange)

**LA v. Ventura**
- H0: logerror is not different for properties in LA County v. Ventura County
    - Reject the null hypothesis. There is a significant difference.

In [None]:
explore_zillow.series_ttest(logerror_la, logerror_ventura)

**Orange v. Ventura**
- H0: logerror is not different for properties in Orange County v. Ventura County
    - Fail to reject the null hypothesis.

In [None]:
explore_zillow.series_ttest(logerror_orange, logerror_ventura)

**Conclusions**

There are significant differences in logerror when comparing LA County to the other two. However, there is no significant difference in logerror between Orange and Ventura counties. So, I should include this variable as a feature but bin the counties based on whether they are in LA County or not.
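
The comparisons above rely on `explore_zillow.series_ttest`, whose implementation is not shown in this notebook. A minimal sketch of an equivalent two-sample test follows. It assumes Welch's t-test (unequal variances) and a 0.05 significance level, both of which are my choices for illustration; `ttest_two_samples` is a hypothetical name and the data is simulated.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_two_samples(a: pd.Series, b: pd.Series, alpha: float = 0.05) -> dict:
    """Welch's two-sample t-test (does not assume equal variances)."""
    t_stat, p_value = stats.ttest_ind(a.dropna(), b.dropna(), equal_var=False)
    return {"t": t_stat, "p": p_value, "reject_h0": bool(p_value < alpha)}


# Simulated logerror-like samples for illustration only.
rng = np.random.default_rng(123)
group_a = pd.Series(rng.normal(loc=0.00, scale=0.1, size=500))
group_b = pd.Series(rng.normal(loc=0.05, scale=0.1, size=500))

result = ttest_two_samples(group_a, group_b)
print(result)
```

With samples as large as the county subsets, even small mean differences can come out significant, so it is worth looking at the size of the difference alongside the p-value.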

#### Logerror based on tax delinquency status

**Is logerror significantly different for properties that are delinquent on their taxes vs those that are not?**
- H0: There is no difference in logerror for properties that are delinquent v. those that are not
    - Reject the null hypothesis. There is a significant difference in logerror between houses that are delinquent and those that are not.

In [None]:
logerror_delinq = df[df.taxdelinquencyflag == "Y"].logerror
logerror_not_delinq = df[df.taxdelinquencyflag == "N"].logerror

In [None]:
explore_zillow.series_ttest(logerror_delinq, logerror_not_delinq)

**Conclusions**

I will include taxdelinquencyflag as a feature because there are significant differences in logerror for properties that are delinquent v. those that are not. A possible theory is that a tax delinquent status indicates a possible flipped house, which could cause a significant change in the value.

#### Logerror based on yearbuilt

**Is logerror significantly different for properties built prior to 1960 than those built later than 2000?**
- H0: There is no difference in logerror between properties built before 1960 and those built later than 2000
     - Reject the null hypothesis. There is a significant difference.

In [None]:
logerror_pre1960 = df[df.yearbuilt < 1960].logerror
logerror_post2000 = df[df.yearbuilt > 2000].logerror

In [None]:
explore_zillow.series_ttest(logerror_pre1960, logerror_post2000)

**Conclusions**

I may want to include yearbuilt as a feature and bin it. Or I may want to create separate models based on when the house was built.

### Chi2 Tests

Because there are many discrete variables, you can use the chi-squared test to test proportions. If you split logerror into quartiles, you can expect the overall probability of falling into a single quartile to be 25%. Now, add another variable, like bedrooms (and you can bin these if you want fewer distinct values) and compare the probabilities of bedrooms with logerror quartiles. See the example in the Classification_Project notebook we reviewed on how to implement chi-squared.
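
As a rough illustration of the test described above, the sketch below builds a contingency table with `pd.crosstab` and passes it to `scipy.stats.chi2_contingency`. `chi2_independence` is a hypothetical helper, the 0.05 level is an assumption, and the toy categories are made up; `explore_zillow.series_chi2_test` itself is not shown here.

```python
import pandas as pd
from scipy import stats


def chi2_independence(a: pd.Series, b: pd.Series, alpha: float = 0.05) -> dict:
    """Chi-squared test of independence between two categorical series."""
    observed = pd.crosstab(a, b)
    chi2, p_value, dof, expected = stats.chi2_contingency(observed)
    return {"chi2": chi2, "p": p_value, "dof": dof, "reject_h0": bool(p_value < alpha)}


# Toy categories for illustration only: small homes mostly land in the "low"
# error group, large homes mostly in the "high" one.
toy = pd.DataFrame({
    "bedroom_bin": ["1-2"] * 40 + ["3-4"] * 40,
    "logerror_group": ["low"] * 30 + ["high"] * 10 + ["low"] * 10 + ["high"] * 30,
})

result = chi2_independence(toy.bedroom_bin, toy.logerror_group)
print(result)
```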

#### bedroomcnt v. logerror

- H0: The bins for bedroom count and absolute value of logerror are independent
    - Reject the null hypothesis. They are not independent.

**Bin bedroomcnt**

In [None]:
bedroomcnt_bins = pd.IntervalIndex.from_tuples([(0, 2), (2, 4), (4, 6), (6, 100)], closed="right")
# df["bedroomcnt_bin"] = explore_zillow.series_bin_with_labels(df.bedroomcnt, bedroomcnt_bins, )
df["bedroomcnt_bin"] = pd.cut(df.bedroomcnt, bins=bedroomcnt_bins)

In [None]:
explore_zillow.series_chi2_test(df.bedroomcnt_bin, df.logerror_bin_quart)

**Conclusions**

Bedroom count and logerror are related and have some dependency on each other.

#### bathroomcnt v. logerror

- H0: The bins for bathroom count and absolute value of logerror are independent
    - Reject the null hypothesis. They are not independent.

In [None]:
df.bathroomcnt.value_counts()

**Bin bathroomcnt**

In [None]:
bathroomcnt_bins = pd.IntervalIndex.from_tuples([(0, 2), (2, 4), (4, 6), (6, 8), (8, 100)], closed="right")
# df["bedroomcnt_bin"] = explore_zillow.series_bin_with_labels(df.bedroomcnt, bedroomcnt_bins, )
df["bathroomcnt_bin"] = pd.cut(df.bathroomcnt, bins=bathroomcnt_bins)

In [None]:
explore_zillow.series_chi2_test(df.bathroomcnt_bin, df.logerror_bin_quart)

**Conclusions**

Bathroom count and logerror are related and have some dependency on each other, and the dependency appears stronger than for bedroom count.

### Clustering

In [None]:
from sklearn.cluster import KMeans

#### logerror_abs alone

**Elbow Method**

In [None]:
def kmeans_elbow(X: pd.DataFrame, nclusters_width, **kwargs):
    inertias = []
    nclusters_range = range(1, nclusters_width + 1)
    for n in nclusters_range:
        kmeans = KMeans(n_clusters=n, **kwargs)
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
    
    kmeans_perf = pd.DataFrame(list(zip(nclusters_range, inertias)), columns=['n_clusters', 'ssd'])

    plt.scatter(kmeans_perf.n_clusters, kmeans_perf.ssd)
    plt.plot(kmeans_perf.n_clusters, kmeans_perf.ssd)

    plt.xticks(nclusters_range)
    plt.xlabel('Number of Clusters')
    plt.ylabel('Sum of Squared Distances')
    plt.title('The elbow method')
    plt.show()

In [None]:
# X = df[["logerror_abs"]]

# intertias = []
# for n in range(1, 11):
#     kmeans = KMeans(n_clusters=n)
#     kmeans.fit(X)
#     intertias.append(kmeans.inertia_)
    
# kmeans_perf = pd.DataFrame(list(zip(range(1, 11), intertias)), columns=['n_clusters', 'ssd'])

In [None]:
# plt.scatter(kmeans_perf.n_clusters, kmeans_perf.ssd)
# plt.plot(kmeans_perf.n_clusters, kmeans_perf.ssd)

# plt.xticks(range(1, 11))
# plt.xlabel('Number of Clusters')
# plt.ylabel('Sum of Squared Distances')
# plt.title('The elbow method')
# plt.show()

kmeans_elbow(df[["logerror_abs"]], 8, random_state=123)

In [None]:
def kmeans_fit_and_predict(X: pd.DataFrame, **kwargs) -> tuple:
    """Fit KMeans on X and return (predicted clusters, labels_, inertia_)."""
    kmeans = KMeans(**kwargs)
    kmeans.fit(X)
    return kmeans.predict(X), kmeans.labels_, kmeans.inertia_

df['cluster_target'], labels, interia = kmeans_fit_and_predict(df[["logerror_abs"]], n_clusters=4, random_state=123)
#     X = df[["logerror_abs"]]

#     kmeans = KMeans(n_clusters=4)
#     kmeans.fit(X)
#     df['logerror_abs_cluster'] = kmeans.predict(X)

In [None]:
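# Plot each property's logerror_abs on a flat line (constant y), colored by the
# cluster assigned in the previous cell.
sns.relplot(data=df.assign(zero=0), x='logerror_abs', y='zero', hue='cluster_target', legend="full")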
plt.show()

In [None]:
kmeans_elbow(df[["latitude", "longitude"]], 8, random_state=123)

In [None]:
df['latlong_cluster'], labels, interia = kmeans_fit_and_predict(df[["latitude", "longitude"]], n_clusters=4, random_state=123)
#     X = df[["logerror_abs"]]

#     kmeans = KMeans(n_clusters=4)
#     kmeans.fit(X)
#     df['logerror_abs_cluster'] = kmeans.predict(X)

In [None]:
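# relplot is figure-level and creates its own figure, so size it with `height`
# rather than calling plt.figure() first (which only leaves an empty figure behind).
sns.relplot(data=df, x='longitude', y="latitude", legend="full", hue='latlong_cluster', height=10)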
plt.show()

In [None]:
df.latlong_cluster.value_counts()

In [None]:
pd.get_dummies(df.latlong_cluster, prefix="latlong_cluster")

### Summarize Conclusions

- From PairPlot
    - logerror v. others
        - logerror is worse for homes with a relatively small calculatedsqft and lotsizesqft
        - logerror is worse in the middle of the latitude and longitude values
        - logerror is worse for homes with a relatively low taxvaluedollarcnt, landtaxvaluedollarcnt, and structuretaxvaluedollarcnt
        - worse for homes with fewer bedrooms and bathrooms
    - taxvaluedollarcnt
        - all the variables like this one (landtaxvaluedollarcnt, structuretaxvaluedollarcnt) have similar scatterplots when compared to other variables. They appear to be giving the same information.
    - calculatedfinishedsqft
        - slight correlation with yearbuilt
        - slight correlation with taxvaluedollarcnt (0.6 Pearson R - see HeatMap)
    - bathroomcnt
        - positive correlation with calculatedfinishedsquarefeet
    - bedroomcnt
        - positive correlation with calculatedfinishedsquarefeet
    - latitude
        - houses near the middle of the latitude range have a higher square footage and taxvalue
    - lotsizesquareft
        - as square footage increases, taxvalue appears to decrease (a little odd?)
        - bedroomcnt and bathroomcnt appear to decrease as well
- From RelPlot
    - Houses with a logerror > 1 appear to form clusters based on longitude and latitude

## Modeling <a name="modeling"></a>

### Feature Engineering & Selection

**FEATURES**
- [ ] **Standardize all independent variables**, including binned ones (have bins be 0.1, 0.2, 0.3, etc.)
- regionidcounty
    - [ ] Create dummy variable of la_county where 1 is yes and 0 is no.
- taxdelinquencyflag
- yearbuilt
    - [ ] Create a new variable that reflects the age of the house from 2017. Maybe bin the houses by 20 year intervals
- bedroomcnt + bathroomcnt
    - [ ] combine bedroomcnt and bathroomcnt and bin them
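
The feature ideas in the list above could be prototyped along these lines. This is a sketch on toy rows, not the final feature set: the new column names (`la_county`, `age`, `bed_bath_cnt`, `age_bin`) are placeholders I made up, and simply adding bedrooms and bathrooms is only one possible way to "combine" them.

```python
import pandas as pd

LA = 3101  # regionidcounty code for Los Angeles, as listed in the data dictionary

# Toy rows for illustration only.
toy = pd.DataFrame({
    "regionidcounty": [3101, 1286, 2061, 3101],
    "yearbuilt": [1950, 1987, 2005, 2016],
    "bedroomcnt": [2, 3, 4, 5],
    "bathroomcnt": [1.0, 2.0, 2.5, 4.0],
})

features = toy.assign(
    # 1 if the property is in LA County, 0 otherwise
    la_county=(toy.regionidcounty == LA).astype(int),
    # Age of the house as of 2017
    age=2017 - toy.yearbuilt,
    # One simple way to combine the two room counts
    bed_bath_cnt=toy.bedroomcnt + toy.bathroomcnt,
)

# 20-year age bins: [0, 20), [20, 40), ... up to [140, 160)
features["age_bin"] = pd.cut(features.age, bins=list(range(0, 161, 20)), right=False)
print(features)
```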

### Train & Test Model

### Summarize Conclusions