In [1]:
'''
# standard python modules
%matplotlib inline
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pydataset
import seaborn as sns
from math import sqrt
from sklearn.feature_selection import RFE, SelectKBest, chi2, f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
import statsmodels.api as sm
from scipy.stats import spearmanr

# my modules
import src.wrangle as wr
import src.evaluate as ev 
import src.put_it_together as pit

# setting random seed to 7
np.random.seed(7)

# turning off red warnings
import warnings
warnings.filterwarnings("ignore")

# module for fixing imported modules
from importlib import reload
'''



In [4]:
import src.put_it_together as pit
import src.wrangle as wr

In [5]:
simple_df = wr.zillow_2017()
complex_df = wr.zillow_2017(simple=False)

## Data Dictionary
**For the Simple Model**
| Column | Description |
| --- | ---|
| baths | Number of Bathrooms |
| bedrooms | Number of Bedrooms |
| sq_feet | Total finished square feet of home |
| fips | FIPS code for the county where the house is located (values listed below) |
| fips: 6037 | Los Angeles County |
| fips: 6059 | Orange County |
| fips: 6111 | Ventura County |
| tax_value | Tax-assessed value of the home, used as a proxy for its value to potential buyers |


**For the Complex Model**
| Column | Description |
| --- | ---|
| bath_adv | Number of Bathrooms, including partial bathrooms |
| bedrooms | Number of Bedrooms |
| lot size | The size of the lot on which the house sits |
| squared_sq_feet | The square footage of the home, squared, to give the model a polynomial feature |
| fips | FIPS code for the county where the house is located (values listed below) |
| fips: 6037 | Los Angeles County |
| fips: 6059 | Orange County |
| fips: 6111 | Ventura County |
| tax_value | Tax-assessed value of the home, used as a proxy for its value to potential buyers |

## First things first: cleaning our data
- We removed outliers, which ended up accounting for about 11% of the sample. Usually, we don't think of 11% of a population as being outliers. In this case, however, we want to predict prices for the kinds of houses that typical buyers will actually be looking to purchase.
- As such, we eliminated houses with any of the following features:
1. More than 5 bedrooms, or 1 bedroom
2. 5 or more bathrooms
3. More than 6,000 square feet
4. Lot sizes smaller than 750 square feet
5. Houses more than a century old

- While a few of our important variables do not look "pretty", I think they are much more workable than what we began with.
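The cleaning rules above can be sketched as a single boolean mask. This is only an illustration, not the code inside `wr.zillow_2017()`: the column names `lot_size` and `year_built`, the 2017 reference year, and the helper name `drop_atypical_homes` are all assumptions, and rule 1 is read here as keeping homes with 2 to 5 bedrooms.

```python
import pandas as pd


def drop_atypical_homes(df: pd.DataFrame, current_year: int = 2017) -> pd.DataFrame:
    """Keep only the rows that pass every cleaning rule (hypothetical helper)."""
    keep = (
        df["bedrooms"].between(2, 5)                 # assumed reading: keep 2 to 5 bedrooms
        & (df["baths"] < 5)                          # drop 5 or more bathrooms
        & (df["sq_feet"] <= 6000)                    # drop more than 6,000 square feet
        & (df["lot_size"] >= 750)                    # drop lots smaller than 750 square feet
        & (current_year - df["year_built"] <= 100)   # drop homes older than a century
    )
    return df[keep]


# Made-up rows: only the first one passes every rule.
homes = pd.DataFrame({
    "bedrooms": [3, 1, 6, 4, 3],
    "baths": [2, 1, 4, 5, 2],
    "sq_feet": [1500, 700, 5000, 3000, 1800],
    "lot_size": [6000, 5000, 9000, 7000, 500],
    "year_built": [1985, 1990, 2001, 1975, 1960],
})
cleaned = drop_atypical_homes(homes)
print(f"kept {len(cleaned)} of {len(homes)} rows")
```

Building one mask keeps each rule on its own line, so it is easy to check how much of the sample each rule removes by summing the individual conditions.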

## From the beginning, I wanted to compare two different sets of features in our modeling
- I wanted to see just how profoundly some of these features play off each other and influence our ability to predict home values.
- Those feature sets will be identified as the complex model and the simple model, the difference being that the complex model has more variables.

1. The simple model contains four features: bedrooms, baths, square feet of the house, and fips (county information).

2. The complex model contains the following features: bedrooms, bathrooms (including half bathrooms), lot size, square feet of the house, and the year the home was built.


### After all our cleaning was completed, we were left with 48,223 properties

### A look at the data after removing the outliers whose glaring problems our detection methods exposed

In [None]:
pit.explore_simple(simple_df)
pit.explore_complex(complex_df)

## Exploring our features' correlation with the target variable - "tax_value"

In [None]:
pit.correlate_viz(simple_df, complex_df, 'tax_value')

## Another look at correlation between our model features and our target variable, 'tax_value':
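As a rough idea of what a per-column Spearman test involves, here is a minimal sketch built on `scipy.stats.spearmanr`. It is not the implementation of `pit.spearman_test`; the helper name `spearman_report`, the 0.05 significance level, and the toy numbers are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import spearmanr


def spearman_report(df: pd.DataFrame, target: str, column: str, alpha: float = 0.05) -> dict:
    """Rank-correlate one column with the target (hypothetical helper)."""
    rho, p_value = spearmanr(df[column], df[target])
    return {
        "column": column,
        "rho": float(rho),
        "p_value": float(p_value),
        "significant": bool(p_value < alpha),
    }


# Made-up data in which value rises monotonically, but not linearly, with size.
toy = pd.DataFrame({
    "sq_feet": [900, 1200, 1500, 2100, 2600, 3200],
    "tax_value": [150_000, 210_000, 260_000, 400_000, 520_000, 800_000],
})
report = spearman_report(toy, "tax_value", "sq_feet")
print(report)
```

Spearman works on ranks, so it only asks whether the target moves in a consistent direction with the feature, which suits skewed variables like home values better than a straight-line Pearson test.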

In [None]:
# Test each feature against the target, skipping the target itself
for column in simple_df.columns.drop('tax_value'):
    pit.spearman_test(simple_df, 'tax_value', column)

In [None]:
# Same test for the complex feature set
for column in complex_df.columns.drop('tax_value'):
    pit.spearman_test(complex_df, 'tax_value', column)

### I suspected that we could not effectively handle the data with a single model because the samples would diverge by location. So, I investigated that suspicion and found that each county had unique features and distributions of property values.
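One quick way to check a suspicion like this is to summarize the target per county before plotting anything. The sketch below is an assumption about how that could look, not what `pit.fips_viz` does; the helper name `summarize_by_county` and the toy values are made up.

```python
import pandas as pd

# County names taken from the data dictionary above.
COUNTY_NAMES = {6037: "Los Angeles", 6059: "Orange", 6111: "Ventura"}


def summarize_by_county(df: pd.DataFrame, target: str = "tax_value") -> pd.DataFrame:
    """Summary statistics of the target for each fips code (hypothetical helper)."""
    summary = df.groupby("fips")[target].agg(["count", "median", "mean", "std"])
    return summary.rename(index=COUNTY_NAMES)


# Made-up values, only to show the shape of the output.
toy = pd.DataFrame({
    "fips": [6037, 6037, 6059, 6059, 6111, 6111],
    "tax_value": [300_000, 500_000, 600_000, 800_000, 400_000, 450_000],
})
summary = summarize_by_county(toy)
print(summary)
```

If the medians and spreads differ noticeably between counties, that is a hint that fitting a separate model per county may be worth trying.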

In [None]:
pit.fips_viz(simple_df)

In [None]:
compare_df = pit.simple_regression_workhorse(simple_df)

In [None]:
final_comparison_df = pit.complex_regression_workhorse(complex_df, compare_df)


In [None]:
final_comparison_df

## How does the simplest model perform? 

In [None]:
pit.pie_chart1(compare_df)

### It is pretty abysmal. Only 29% of the variation in tax_value, our proxy for house selling prices, is explained by the model's features.
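The "share of variation explained" figure is the R² score: one minus the ratio of the model's squared error to the squared error of always predicting the mean. A minimal sketch of that arithmetic, with made-up numbers and a hypothetical helper name `r_squared`:

```python
import numpy as np


def r_squared(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left over after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of predicting the mean
    return float(1.0 - ss_res / ss_tot)


actual = np.array([200_000, 300_000, 400_000, 500_000], dtype=float)
mean_baseline = np.full_like(actual, actual.mean())

baseline_r2 = r_squared(actual, mean_baseline)  # a mean-only baseline explains nothing
perfect_r2 = r_squared(actual, actual)          # perfect predictions explain everything
print(baseline_r2, perfect_r2)
```

An R² of 0.29 therefore means the model removes about 29% of the squared error that the mean-only baseline leaves behind.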

## Testing our best model(s) on out of sample data
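For readers unfamiliar with out-of-sample testing, the idea is to fit on one portion of the data and score on rows the model has never seen, comparing against a mean-only baseline. The sketch below uses synthetic data and a plain `LinearRegression`; it is an assumption-laden illustration, not the logic inside `pit.test_model`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: value grows with square footage, plus noise.
rng = np.random.default_rng(7)
sq_feet = rng.uniform(800, 4000, size=500)
tax_value = 150 * sq_feet + rng.normal(0, 50_000, size=500)

X = sq_feet.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, tax_value, test_size=0.2, random_state=7
)

# Fit only on the training rows, then score on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, predictions))
baseline_rmse = np.sqrt(
    mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
)
r2 = r2_score(y_test, predictions)
print(f"model RMSE: {rmse:,.0f}  baseline RMSE: {baseline_rmse:,.0f}  R^2: {r2:.2f}")
```

The baseline uses the training mean rather than the test mean, so no information from the held-out rows leaks into the comparison.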

In [None]:
pit.overload_pies(final_comparison_df)

In [None]:
final = pit.test_model(complex_df, simple_df)

In [None]:
final

In [None]:
pit.final_pies(final)

# Final Takeaway

- Our models are not good, as expected.
- I'm not sure that it is possible to effectively predict tax-assessed value beyond using the previous year's value.
- Tax-assessed value is probably not even a good proxy for home sale prices.
- Too many subjective measures are included, such as the tax assessor's biases, the buyer's biases, whatever exemptions were filed and when they were filed, and other confounding factors.