# Zillow 2017 Predictions
<br>

### Project Goals  
- Discover drivers of single-family home value for Zillow in 2017.
- Use those drivers to develop a machine learning model that accurately predicts home value.
- This information could be used in future years to help Zillow maximize profit.

In [1]:
# Data science imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Statistical/mathematical imports
from scipy import stats
from scipy.stats import pearsonr, spearmanr
from sklearn.model_selection import train_test_split
from math import sqrt
# ML imports
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression, RFE
# Premade functions
import acquire as a
import wrangle as w
# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Acquire <br>
- Data acquired from the SQL zillow database  
- It contained 52441 rows and 6 columns before cleaning  
- Each row represents a house in 2017 
- Each column represents a feature associated with the house  

In [3]:
# Use acquire function to import messy data
df = a.get_zillow_data()
# Let's take a peek
df.head(1)

| |bedroomcnt|bathroomcnt|calculatedfinishedsquarefeet|taxvaluedollarcnt|lotsizesquarefeet|yearbuilt|
|:--|:--|:--|:--|:--|:--|:--|
|0|5.0|4.0|2148.0|165392.0|10408.0|1976.0|


# Prepare <br>
- Renamed columns to make them easier to read  
- Checked for nulls in the data and dropped the rows that contained them  
- Checked that column data types were appropriate and changed them where necessary  
- Got rid of major outliers that skewed the data 
- Split data into train, validate and test, stratifying on 'tax_value'  
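The cleaning steps above could look roughly like the sketch below. It is not the actual `w.prep_zillow` implementation, which is not shown in this notebook. The column mapping comes from the before/after previews here; the function name `prep_sketch`, the choice of columns to filter, and the 1.5 × IQR outlier rule are assumptions made for illustration.

```python
import pandas as pd

# Column mapping taken from the raw and cleaned previews in this notebook.
RENAMES = {
    "bedroomcnt": "bedrooms",
    "bathroomcnt": "bathrooms",
    "calculatedfinishedsquarefeet": "house_sqft",
    "taxvaluedollarcnt": "tax_value",
    "lotsizesquarefeet": "lot_size_sqft",
    "yearbuilt": "year_built",
}

def prep_sketch(df, k=1.5):
    """Hypothetical stand-in for w.prep_zillow; the real function may differ."""
    # Rename columns, then drop any row that contains a null.
    df = df.rename(columns=RENAMES).dropna()
    # Cast whole-number columns to int; bathrooms stays float (half baths).
    int_cols = ["bedrooms", "house_sqft", "tax_value", "lot_size_sqft", "year_built"]
    df = df.astype({col: int for col in int_cols})
    # Assumed outlier rule: keep rows within k * IQR of the quartiles.
    for col in ["house_sqft", "tax_value", "lot_size_sqft"]:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - k * iqr, q3 + k * iqr)]
    return df
```

A larger `k` keeps more extreme homes in the data, so the threshold is worth revisiting if the model is meant to cover high-value properties.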

In [4]:
# Use prepare function to clean data 
df = w.prep_zillow(df)
# Quick peek at the cleaned data
df.head(1)

| |bedrooms|bathrooms|house_sqft|tax_value|lot_size_sqft|year_built|
|:--|:--|:--|:--|:--|:--|:--|
|0|5|4.0|2148|165392|10408|1976|


# Data Dictionary
<br>

- Use this as a reference for what each feature in the dataset represents

| Feature | Definition |
|:--------|:-----------|
|bedroomcnt|Specifies the number of bedrooms in the home|
|bathroomcnt|Specifies the number of bathrooms in the home|
|calculatedfinishedsquarefeet|Specifies the total finished square footage of the home|
|taxvaluedollarcnt|The total tax assessed value of the parcel|
|lotsizesquarefeet|Specifies total square footage of lot the home sits on|
|yearbuilt|Year that the home was built|


# Split data into train/validate/test sample dataframes
<br>

- 20% test, 80% train_validate  
- Then of the 80% train_validate: 30% validate, 70% train    
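The two-stage split described above works out to roughly 56% train, 24% validate, and 20% test of the full data. A minimal sketch follows, assuming scikit-learn's `train_test_split`; the function name `split_sketch` and the seed are hypothetical, and the real `w.split_zillow` may differ. Stratification is left out here because stratifying on a continuous target such as `tax_value` would first require binning it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_sketch(df, seed=123):
    """Hypothetical stand-in for w.split_zillow; seed is an arbitrary choice."""
    # Hold out 20% of the rows as the final test set.
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # Of the remaining 80%, 30% becomes validate and 70% becomes train.
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed
    )
    return train, validate, test

# Toy frame used only to show the resulting proportions.
demo = pd.DataFrame({"tax_value": np.arange(1000)})
demo_train, demo_validate, demo_test = split_sketch(demo)
print(len(demo_train), len(demo_validate), len(demo_test))
```

Fixing `random_state` keeps the three samples identical across notebook runs, which makes model comparisons repeatable.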

In [6]:
# splitting data into train, validate, and test
train, validate, test = w.split_zillow(df)
# Let's show some train data
train.head(3)

| |bedrooms|bathrooms|house_sqft|tax_value|lot_size_sqft|year_built|
|:--|:--|:--|:--|:--|:--|:--|
|31949|3|2.0|1404|176411|7704|1957|
|51617|3|2.0|1456|72444|5369|1954|
|42382|3|1.0|1055|64630|6550|1953|


# Exploration
<br>

- 
- 
-  

### question 1

In [None]:
# function for visual 1
# make sure to print out reject or fail to reject print statement

#### My takeaway from this is that   
<br>

For this question, both variables are continuous, so a Pearson correlation test (`pearsonr`) seems appropriate to determine whether there is a linear relationship.  
<br>

$H_0$ (Null Hypothesis): There is no linear correlation

$H_a$ (Alternative Hypothesis): There is a linear correlation between   
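
A Pearson test with a reject / fail-to-reject statement could be sketched as follows. The notebook does not yet name the two variables for this question, so the sketch assumes `house_sqft` and `tax_value` purely for illustration and runs on synthetic data rather than the real train sample. The significance level of 0.05 and the helper name `pearson_report` are also assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

ALPHA = 0.05  # assumed significance level; the notebook does not state one

def pearson_report(x, y, alpha=ALPHA):
    """Run a Pearson test and return (r, p, decision sentence)."""
    r, p = pearsonr(x, y)
    decision = "reject" if p < alpha else "fail to reject"
    message = f"We {decision} the null hypothesis (r = {r:.3f}, p = {p:.3g})."
    return r, p, message

# Synthetic stand-in for the train sample, not real Zillow data.
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3000, size=200)
demo = pd.DataFrame({
    "house_sqft": sqft,
    "tax_value": 150 * sqft + rng.normal(0, 50_000, size=200),
})

r, p, message = pearson_report(demo["house_sqft"], demo["tax_value"])
print(message)
```

If the relationship looks monotonic but not linear in the plot, `spearmanr` (already imported above) takes the same two arguments and may be the better fit.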

### question 2

In [None]:
# function for visual 2
# make sure to print out reject or fail to reject print statement

#### My takeaway from this is that   
<br>

For this question, both variables are continuous, so a Pearson correlation test (`pearsonr`) seems appropriate to determine whether there is a linear relationship.  
<br>

$H_0$ (Null Hypothesis): There is no linear correlation

$H_a$ (Alternative Hypothesis): There is a linear correlation between  

### question 3

In [None]:
# function for visual 3
# make sure to print out reject or fail to reject print statement

#### My takeaway from this is that   
<br>

For this question, both variables are continuous, so a Pearson correlation test (`pearsonr`) seems appropriate to determine whether there is a linear relationship.  
<br>

$H_0$ (Null Hypothesis): There is no linear correlation

$H_a$ (Alternative Hypothesis): There is a linear correlation between  

# Exploration Summary
<br>

- 
- 
- 


# Modeling
<br>

- 
-
- 

### Test Model  
<br>

- I am choosing the   
- I will now run my model on the test data to gauge how it will perform on unseen data.

### Modeling Wrap
<br>

- Test score outperforms the baseline, and I would recommend this model for production, as it beat the baseline by %
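
One way to compute the percentage by which a model beats the baseline is sketched below. It assumes the baseline predicts a single constant (for example, the mean `tax_value` of the train sample) and that RMSE is the comparison metric; neither choice is confirmed by this notebook. The helper names and the toy numbers are illustrative only and are not project results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def pct_improvement(y_true, y_pred, baseline_value):
    """Percent reduction in RMSE relative to a constant-prediction baseline."""
    baseline_pred = np.full(len(y_true), baseline_value)
    baseline_rmse = rmse(y_true, baseline_pred)
    model_rmse = rmse(y_true, y_pred)
    return 100 * (baseline_rmse - model_rmse) / baseline_rmse

# Toy numbers for illustration only.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])
improvement = pct_improvement(y_true, y_pred, baseline_value=200.0)
print(f"{improvement:.1f}% lower RMSE than baseline")
```

A negative result would mean the model performs worse than the baseline on that sample.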


# Conclusion
<br>

### Summary
<br>

- 
- 
- 
- 

### Recommendations
<br>

-  
-   
-   

### Next Steps
<br>

- If provided more time to work on this project I would 