![for sale image, from https://time.com/5835778/selling-home-coronavirus/](https://api.time.com/wp-content/uploads/2020/05/selling-home-coronavirus.jpg?w=800&quality=85)

# Project Title

## Overview

A one-paragraph overview of the project, including the business problem, data, methods, results and recommendations.

## Business Problem

Summary of the business problem you are trying to solve, and the data questions that you plan to answer to solve them.

Questions to consider:

- Who are your stakeholders?
- What are your stakeholders' pain points related to this project?
- Why are your predictions important from a business perspective?

## Data Understanding

Describe the data being used for this project.

Questions to consider:

- Where did the data come from, and how do they relate to the data analysis questions?
- What do the data represent? Who is in the sample and what variables are included?
- What is the target variable?
- What are the properties of the variables you intend to use?

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import sklearn.metrics as metrics
from random import gauss
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats as stats

%matplotlib inline

In [None]:
df = pd.read_csv('data/kc_house_data.csv')

In [None]:
df.columns

In [None]:
# Let's add a describe here
df.describe()

In [None]:
# check info: column dtypes and non-null counts
df.info()

In [None]:
X = df.drop('price', axis=1)
y = df['price']

In [None]:
numeric_X = X.select_dtypes(exclude=['object'])

In [None]:
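# correlate price with the numeric columns only; text columns cannot be correlated
corr = df.select_dtypes(include='number').corr()
price_corr = corr['price'].sort_values(ascending=False)
price_corr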

In [None]:
baseline_X = df['sqft_living']

In [None]:
baseline_model = sm.OLS(y, sm.add_constant(baseline_X))
baseline_result = baseline_model.fit()
baseline_result.summary()

In [None]:
model_multi_numeric = sm.OLS(y, sm.add_constant(numeric_X))
result_multi_numeric = model_multi_numeric.fit()
result_multi_numeric.summary()

In [None]:
# turn categorical into numeric
#date - turn into year
# address - look into dropping values outside king county
#decide on these two! ^^
#waterfront
#greenbelt
#nuisance
#view
#condition
#grade
#heat_source
#sewer_system


In [None]:
df['waterfront'].unique()

In [None]:
df['waterfront'] = df['waterfront'].map({'YES': 1, 'NO': 0})
df['greenbelt'] = df['greenbelt'].map({'YES': 1, 'NO': 0})
df['nuisance'] = df['nuisance'].map({'YES': 1, 'NO': 0})
df['view'] = df['view'].map({'NONE': 0, 'FAIR': 1, 'AVERAGE' : 2, 'GOOD' : 3, 'EXCELLENT': 4})
df['condition'] =  df['condition'].map({'Poor': 0, 'Fair': 1, 'Average': 2, 'Good': 3, 'Very Good': 4})
df['grade'] = df['grade'].map({'1 Cabin': 0, '2 Substandard': 1, '3 Poor': 2, '4 Low': 3, '5 Fair': 4,'6 Low Average': 5,'7 Average': 6,'8 Good': 7,'9 Better': 8,'10 Very Good': 9,'11 Excellent': 10,'12 Luxury': 11,'13 Mansion': 12})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
# df['year'] = df['date'].dt.year
df['zip'] = df['address'].str[-20:-15].astype(int)
df = df.drop(['date',], axis=1)

In [None]:
df.info()

In [None]:
df_sewer_system = pd.get_dummies(df['sewer_system'], prefix='sewer_system')
df_heat_source = pd.get_dummies(df['heat_source'], prefix='heat_source')
df = pd.concat([df, df_sewer_system, df_heat_source], axis=1)
df = df.drop(['sewer_system', 'heat_source'], axis=1)


In [None]:
# Load the zipcode CSV file into a separate dataframe
zipcodes = pd.read_csv('data/king-co-zip-table.csv')
zipcodes['ZIPCODE']
zipcodes.info()

In [None]:
mask = df['zip'].isin(zipcodes['ZIPCODE'])
df = df[mask]
df.info()

In [None]:
df_categorical = df.select_dtypes(exclude=['object'])
numeric_X_cat = df_categorical.drop('price', axis=1)
y = df_categorical['price']

model_multi_numeric_cat = sm.OLS(y, sm.add_constant(numeric_X_cat))
result_multi_numeric_cat = model_multi_numeric_cat.fit()
result_multi_numeric_cat.summary()

In [None]:
def haversine(lat1, lon1, lat2, lon2):
    """
    Calculates the Haversine (great-circle) distance in km between two points on the Earth's surface.
    Inputs are in decimal degrees and may be scalars or array-likes.
    """
    R = 6371  # radius of Earth in km
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    d = R * c
    return d

# Reference point labelled as Amazon headquarters (Seattle city center).
# NOTE: longitude -122.127222 lies east of Lake Washington, not in downtown
# Seattle, so verify these coordinates before relying on this feature.
city_lat = 47.641944
city_long = -122.127222
# Second reference point ("uni")
uni_lat = 47.654167
uni_long = -122.308056
df_categorical["distance_to_amazon"] = haversine(city_lat, city_long, df_categorical["lat"], df_categorical["long"])
df_categorical["distance_to_uni"] = haversine(uni_lat, uni_long, df_categorical["lat"], df_categorical["long"])
df_categorical.info()

In [None]:
df_categorical = df_categorical.select_dtypes(exclude=['object'])
numeric_X_cat = df_categorical.drop('price', axis=1)
y = df_categorical['price']

model_multi_numeric_cat = sm.OLS(y, sm.add_constant(numeric_X_cat))
result_multi_numeric_cat = model_multi_numeric_cat.fit()
result_multi_numeric_cat.summary()

In [None]:
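# include the intercept so the residuals come from the same specification as the model above
exog_cat = sm.add_constant(numeric_X_cat)
model = sm.OLS(endog=y, exog=exog_cat).fit()

model_preds = model.predict(exog_cat)
resids = y - model_preds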

fig, ax = plt.subplots()

ax.scatter(model_preds, resids)
ax.set_xlabel('predicted housing prices')
ax.set_ylabel('residual')
plt.suptitle('Residuals Vs. Predictions');

In [None]:
resids.hist(bins=50);

In [None]:
sm.qqplot(resids, line='r');

In [None]:
y.hist(bins=40)

In [None]:
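# check for outliers in price using the 1.5 * IQR rule

Q1 = y.quantile(0.25)
Q3 = y.quantile(0.75)
IQR = Q3 - Q1

# define the upper and lower thresholds
upper_thresh = Q3 + 1.5 * IQR
lower_thresh = Q1 - 1.5 * IQR

# remove rows where 'price' is an outlier
df = df[(df['price'] > lower_thresh) & (df['price'] < upper_thresh)]

# apply the same filter to the modeling frame and rebuild X and y,
# otherwise the cells below would still use the unfiltered data
df_categorical = df_categorical[(df_categorical['price'] > lower_thresh) & (df_categorical['price'] < upper_thresh)]
numeric_X_cat = df_categorical.drop('price', axis=1)
y = df_categorical['price']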

In [None]:
numeric_X_cat.info()

In [None]:
# a single axis is enough here; the second subplot was never used
fig, ax1 = plt.subplots(figsize=(8, 6))

ax1.scatter(numeric_X_cat['grade'], y)
ax1.set_xlabel('grade')
ax1.set_ylabel('price');

In [None]:
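from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the other predictors (not price on X)
vif_exog = sm.add_constant(numeric_X_cat).astype(float)
vif_table = pd.Series([variance_inflation_factor(vif_exog.values, i) for i in range(vif_exog.shape[1])], index=vif_exog.columns)
vif_table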

## Data Preparation

Describe and justify the process for preparing the data for analysis.

Questions to consider:

- Were there variables you dropped or created?
- How did you address missing values or outliers?
- Why are these choices appropriate given the data and the business problem?

What do I want to do to the data?

Drop `floors`: worst p-value and high standard error.

Drop `lat` and `long`, or use them to compute distance from key landmarks (need to get new data for this?).

Categorical data: needs cleaning (decide what to prioritize for `get_dummies`).
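For the nominal columns, one option is `pd.get_dummies` with `drop_first=True`. It drops one level per feature, so the remaining dummies are not perfectly collinear with the intercept that `sm.add_constant` adds. A minimal sketch on a made-up frame (the category values are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the nominal columns; the category values are made up.
toy = pd.DataFrame({
    "sewer_system": ["PUBLIC", "PRIVATE", "PUBLIC", "PRIVATE"],
    "heat_source": ["Gas", "Electricity", "Oil", "Gas"],
    "price": [500000, 420000, 610000, 455000],
})

# drop_first=True removes one reference level per feature, so each remaining
# dummy is interpreted relative to the dropped level.
encoded = pd.get_dummies(
    toy, columns=["sewer_system", "heat_source"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Ordinal columns such as `condition` and `grade` are probably better left as ordered integers, as in the mapping cell above, since dummies would discard their ordering.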


From readme: **If you are feeling overwhelmed or behind**, we recommend you **ignore** some or all of the following features:

* `date`
* `view`
* `sqft_above`
* `sqft_basement`
* `yr_renovated`
* `address`
* `lat`
* `long`

Need to train test split
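One way the split could look, sketched on synthetic data rather than the real file. The frame `toy`, the 25% test size and the `random_state` value are all assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data; the values are random, not real.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"sqft_living": rng.uniform(500, 4000, size=200)})
toy["price"] = 200 * toy["sqft_living"] + rng.normal(0, 50000, size=200)

X_toy = toy.drop("price", axis=1)
y_toy = toy["price"]

# Hold out 25% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

Splitting before any imputation, scaling or outlier filtering keeps information from the test rows out of the fitted model.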

Handling Missing Values

Handling Non-Numeric Data

(Handling any other weird data that needs cleaning)
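A small sketch of one possible approach to the missing-value notes above: count what is missing, drop rows with no target, then impute the rest. The toy frame and the choice of median and mode imputation are assumptions, not a decision already made for this project:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are made up for illustration.
toy = pd.DataFrame({
    "sqft_living": [1200.0, np.nan, 2100.0, 1750.0],
    "heat_source": ["Gas", None, "Electricity", "Gas"],
    "price": [450000.0, 380000.0, np.nan, 520000.0],
})

# Count missing values per column before deciding what to do.
missing_counts = toy.isna().sum()

# Rows without a target cannot be used for supervised fitting, so drop them.
cleaned = toy.dropna(subset=["price"]).copy()

# Impute the numeric feature with its median and the categorical one with its mode.
cleaned["sqft_living"] = cleaned["sqft_living"].fillna(cleaned["sqft_living"].median())
cleaned["heat_source"] = cleaned["heat_source"].fillna(cleaned["heat_source"].mode()[0])
print(missing_counts)
print(cleaned)
```

In a real pipeline the median and mode would be computed on the training split only and then reused on the test split.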

School district, walking score 

Parks/green space per 

Remove outliers -- based on address/lat long/distance

## Modeling

Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How did you analyze the data to arrive at an initial approach?
- How did you iterate on your initial approach to make it better?
- Why are these choices appropriate given the data and the business problem?

## Evaluation

The evaluation of each model should accompany the creation of each model, and you should be sure to evaluate your models consistently.

Evaluate how well your work solves the stated business problem. 

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model? Is it over or under fit?
- How well does your model/data fit any modeling assumptions?

For the final model, you might also consider:

- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?
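One way to get a rough sense of how well results might generalize is k-fold cross-validation. This is a sketch on synthetic data; the feature, the coefficients and the choice of five folds are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic single-feature data standing in for the housing set.
rng = np.random.default_rng(1)
X_sim = rng.uniform(500, 4000, size=(300, 1))
y_sim = 200 * X_sim[:, 0] + rng.normal(0, 50000, size=300)

# 5-fold CV: each fold is held out once while the model is fit on the rest.
scores = cross_val_score(LinearRegression(), X_sim, y_sim, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

A large spread between fold scores would suggest the fit depends heavily on which rows the model happens to see.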

### Baseline Understanding

- What does a baseline, model-less prediction look like?
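A common model-less baseline is to predict the mean price for every house and record the resulting error. A minimal sketch with made-up prices (the five values are assumptions, not rows from the dataset):

```python
import numpy as np

# Made-up sale prices for illustration.
prices = np.array([350000.0, 420000.0, 515000.0, 610000.0, 980000.0])

# The model-less baseline predicts the mean price for every house.
baseline_pred = np.full_like(prices, prices.mean())

# Errors of the baseline; a model should beat these to be worth using.
baseline_mae = np.mean(np.abs(prices - baseline_pred))
baseline_rmse = np.sqrt(np.mean((prices - baseline_pred) ** 2))
print(baseline_mae, baseline_rmse)
```

By construction this baseline has an R-squared of 0 on the data it was computed from, which is the reference point for every later model.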

In [None]:
# code here to arrive at a baseline prediction

### First Model

Before going too far down the data preparation rabbit hole, be sure to check your work against a first 'substandard' model! What is the easiest way for you to find out how hard your problem is?
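A first model can be as simple as one feature and a straight line, compared against the mean-only baseline. The sketch below uses synthetic data; the feature name `sqft`, the coefficients and the noise level are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic single-feature data; the true slope of 150 is an assumption.
rng = np.random.default_rng(7)
sqft = rng.uniform(500, 4000, size=(200, 1))
price = 150 * sqft[:, 0] + 100000 + rng.normal(0, 40000, size=200)

# Fit the simplest possible model: price as a linear function of one feature.
first_model = LinearRegression().fit(sqft, price)

# Compare against always predicting the mean price.
model_mae = mean_absolute_error(price, first_model.predict(sqft))
mean_mae = mean_absolute_error(price, np.full_like(price, price.mean()))
print(model_mae, mean_mae)
```

The gap between the two errors gives a quick read on how much signal a single feature carries before investing in more preparation.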

In [None]:
# code here for your first 'substandard' model

In [None]:
# code here to evaluate your first 'substandard' model

### Modeling Iterations

Now you can start to use the results of your first model to iterate - there are many options!

In [None]:
# code here to iteratively improve your models

In [None]:
# code here to evaluate your iterations

### 'Final' Model

In the end, you'll arrive at a 'final' model - aka the one you'll use to make your recommendations/conclusions. This likely blends any group work. It might not be the one with the highest scores, but instead might be considered 'final' or 'best' for other reasons.

In [None]:
# code here to show your final model

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project (future work)?


In [None]:
# code here to evaluate your final model
