# Zillow Clustering Project
Sophia Stewart<br>
Stephanie Jones<br>
Codeup | Data Science, Hopper Cohort<br>
Monday, January 10, 2020

# About the Project
#### Goal
Identify drivers of error in predicting home value for single family properties<br>
#### Why?
We want to improve our Zestimate home value predictions so that we can better serve those who buy and sell homes.<br>

# Data Dictionary
### DataFrames
| DFs | Meaning |
| :-------- | -------: |
| zillow   | Full, original dataframe retrieved from the Zillow MySQL database |
| train    | Sample (60%) of zillow used for exploring data and fitting/training models |
| validate | Sample (20%) of zillow used to evaluate multiple models |
| test     | Sample (20%) of zillow used to evaluate the best model |

### Variables
| Target | Meaning |
| :-------- | -------: |
| logerror | Our target variable; the Zestimate error which we want to minimize |


| Variables | Meaning |
| :-------- | -------: |
| tax_value | The property's tax assessed value |
| beds     | Number of bedrooms |
| baths    | Number of bathrooms, including fractional bathrooms |
| fullbaths | Number of full bathrooms |
| latitude | The property's latitude |
| longitude | The property's longitude |
| sq_ft    | Calculated total finished living area |
| yearbuilt | The year the property was built |
| age      | The age of the property |
| transactiondate | The date the property was sold |

<!-- | Clustering | Meaning |
| :-------- | -------: |
| beds_scaled | Standard-scaled `beds` |
| baths_scaled | Standard-scaled `baths` |
| sq_ft_scaled | Standard-scaled `sq_ft` |


| Modeling | Meaning |
| :-------- | -------: |
| x_train  | `train`, with scaled `tax_value`, `age`, `sqft` columns |
| y_train  | `train`, but only the target |
| x_validate | `validate`, with scaled `tax_value`, `age`, `sqft` columns |
| y_validate | `validate`, but only the target |
| x_test   | `test`, with scaled `tax_value`, `age`, `sqft` columns |
| y_test   | `test`, but only the target | -->


# Executive Summary
High `logerror` threatens our credibility as a primary source of home valuation predictions in the real estate market. We want to find what drives Zestimate `logerror` so that we can provide more accurate home valuations.
<br><br>
To do this, we will use clustering to identify patterns in our 2017 single-unit property data, then use those clusters as features in a model that predicts `logerror`. If we can predict `logerror`, we can use those predictions to adjust our estimates and produce more accurate home values.
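
To make the last point concrete, here is a minimal sketch of how a predicted `logerror` could be used to adjust a Zestimate. It assumes that `logerror` is defined as `log(Zestimate) - log(SalePrice)` with a natural log, which is how the public Zillow Prize data describes it; the function names and dollar amounts are hypothetical and are not part of this project's modules.

```python
import math


def logerror(zestimate: float, sale_price: float) -> float:
    """Assumed definition: log(Zestimate) - log(sale price), natural log."""
    return math.log(zestimate) - math.log(sale_price)


def corrected_estimate(zestimate: float, predicted_logerror: float) -> float:
    """Undo a predicted logerror to get an adjusted home value estimate."""
    return zestimate / math.exp(predicted_logerror)


# Hypothetical numbers, for illustration only
zestimate = 550_000.0
sale_price = 500_000.0

err = logerror(zestimate, sale_price)          # positive: the Zestimate was too high
adjusted = corrected_estimate(zestimate, err)  # a perfect logerror prediction recovers the sale price

print(f"logerror: {err:.4f}")
print(f"adjusted estimate: {adjusted:,.0f}")
```

In practice the predicted `logerror` would come from the model rather than from the true sale price, so the adjustment is only as good as that prediction.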

# Step 1 | Acquire and Wrangle
In our `wrangle.py` module you will find the following functions:
- `acquire_zillow()` acquires Zillow data from a csv file or from a SQL query (see the query inside the function in the module)
- `clean_zillow(df)` cleans the acquired data
    - filters out non-single-unit properties using `propertylandusetypeid`, `beds`, `baths`, and `sqft`
    - drops null rows, and drops columns with > 50% missing values
    - creates the `age` column from `yearbuilt`
    - drops any remaining null values
    - corrects dtypes for int values
    - drops `propertylandusetypeid`, `transactiondate`, `yearbuilt`, `unitcnt`
    - removes outliers
- `split_zillow(df)` splits the data into train, validate, and test dfs
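
A few of the cleaning steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the code in `wrangle.py`: the helper names, the 2017 reference year for `age`, and the 1.5 × IQR outlier rule are all assumptions, and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds max_missing."""
    keep = df.columns[df.isna().mean() <= max_missing]
    return df[keep]


def add_age(df: pd.DataFrame, reference_year: int = 2017) -> pd.DataFrame:
    """Add an age column; the reference year is an assumption."""
    df = df.copy()
    df["age"] = reference_year - df["yearbuilt"]
    return df


def remove_outliers_iqr(df: pd.DataFrame, cols: list, k: float = 1.5) -> pd.DataFrame:
    """Keep rows within [Q1 - k*IQR, Q3 + k*IQR] for every column in cols."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Made-up rows, for illustration only
demo = pd.DataFrame({
    "sq_ft": [1200, 1500, 1800, 1600, 1400, 25000],
    "yearbuilt": [1950, 1975, 1990, 2005, 1980, 2001],
    "mostly_null": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
})

cleaned = (
    demo.pipe(drop_sparse_columns)   # removes mostly_null (over 50% missing)
        .dropna()                    # drops any remaining rows with nulls
        .pipe(add_age)               # age = 2017 - yearbuilt
        .pipe(remove_outliers_iqr, cols=["sq_ft"])  # drops the 25000 sq_ft row
)
print(cleaned)
```

Chaining the helpers with `DataFrame.pipe` keeps each step small and lets the order of operations read top to bottom, much like the bullet list.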


```python
import wrangle as w

train, validate, test = w.split_zillow(w.clean_zillow(w.acquire_zillow()))

print(f'Train Shape: {train.shape}\nValidate Shape: {validate.shape}\nTest Shape: {test.shape}')
```

```
Train Shape: (20184, 9)
Validate Shape: (6729, 9)
Test Shape: (6729, 9)
```
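
For readers without access to `wrangle.py`, a three-way split can be sketched with pandas alone. The function name `split_frame`, the seed, and the demo frame are hypothetical; the default fractions are chosen to mirror the shapes printed above (about 60/20/20), and the actual `split_zillow` may be implemented differently.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, validate_frac: float = 0.2,
                test_frac: float = 0.2, seed: int = 123):
    """Shuffle df once, then slice it into train, validate, and test."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_test = int(round(len(df) * test_frac))
    n_validate = int(round(len(df) * validate_frac))
    test = shuffled.iloc[:n_test]
    validate = shuffled.iloc[n_test:n_test + n_validate]
    train = shuffled.iloc[n_test + n_validate:]
    return train, validate, test


# Made-up frame, for illustration only
demo = pd.DataFrame({"x": range(100)})
train_demo, validate_demo, test_demo = split_frame(demo)

print(train_demo.shape, validate_demo.shape, test_demo.shape)
```

Fixing the seed keeps the split reproducible, so exploration and model evaluation always see the same rows.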


# Conclusion
>Your conclusion summary should address the questions you raised in the opening of the project, which we would want to see at the end of every final notebook. Ideally, when the deliverable is a report, the summary should tie together your analysis, the drivers of the outcome, and how you would expect your ML model to perform in the future on unseen data, in layman's terms.

## Recommendations
>Your notebook should end with actionable recommendations based on your insights and analysis on a way to make a better model, such as a new feature or an algorithm or something you found that doesn't work.

## Next Steps
>Your conclusion should include next steps from a data science perspective that will assist in improving your research. Ideally, if you talk about trying more algorithms to improve performance, think about why you need to improve performance. And if the business calls for it, remember the best way to improve performance is to have better predictors/features. If you talk about gathering more data, being specific about what data you think will help you understand the problem better and why is the way to go!