# Introduction

|Given|Explanation|
|---|---|
|Main Objective| Use the provided King County Housing data to address a client's need.|
|Client| Erin Robinson|
|Client assignment| Find properties in poor neighborhoods that I can invest in (buy, sell) in a socially responsible way. I want my costs back plus a little profit.|

### Data set and thought process
The King County Housing data set contains $N = 21597$ sales records covering $n = 21420$ individual properties ($n = 176$ properties were sold more than once).
The data set has already been cleaned for the chosen client's assignment.
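
As a minimal sketch of how such counts can be derived, the snippet below uses made-up toy data; only the column name `id` is taken from the notebook. Note that the number of records minus the number of unique ids does not have to equal the number of resold properties, because a property sold three times contributes two extra records.

```python
import pandas as pd

# Toy data (made up for illustration): five sales records, property id 2 was sold twice
sales = pd.DataFrame({"id": [1, 2, 2, 3, 4],
                      "price": [200, 150, 180, 300, 250]})

n_records = len(sales)                     # N: all sales records
n_unique = sales["id"].nunique()           # n: individual properties
sales_per_id = sales["id"].value_counts()  # number of sales per property
n_resold = int((sales_per_id > 1).sum())   # properties sold more than once

print(n_records, n_unique, n_resold)
```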

Since no further definition of *poor neighborhoods* was given, I defined them as follows:
*poor neighborhoods* = zip code areas where at least 80% of the properties' prices lie at or below the second quartile (the median) of all property prices.
That is, the maximum buying price would be the median price over all properties. All analyses and plots use the subset of poor neighborhoods that follows from this definition. The data cleaning process and some preliminary data exploration can be found in [cleaning_data.ipynb](cleaning_data.ipynb).
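
The definition above could be sketched as follows. This is not the code from the cleaning notebook: the toy data are made up, and reading "second quartile" as "at or below the median of all prices" is an assumption. Only the column names `zipcode` and `price` come from the notebook.

```python
import pandas as pd

# Toy data (made up for illustration): three zip codes with five sale prices each
toy = pd.DataFrame({
    "zipcode": [98001] * 5 + [98002] * 5 + [98003] * 5,
    "price": [100, 120, 130, 150, 200,   # all five at or below the overall median
              110, 140, 300, 350, 500,   # three of five at or below
              310, 320, 330, 340, 600],  # none at or below
})

# Median price over all properties (second quartile)
overall_median = toy["price"].median()

# Share of properties per zip code priced at or below the overall median
share_at_or_below = (toy["price"] <= overall_median).groupby(toy["zipcode"]).mean()

# Assumed threshold from the definition: at least 80% of the properties
poor_zipcodes = share_at_or_below[share_at_or_below >= 0.8].index.tolist()
toy_poor = toy[toy["zipcode"].isin(poor_zipcodes)]

print(poor_zipcodes)
```

Computing the share per zip code first keeps the 80% threshold in one place, so it is easy to vary when checking how sensitive the selection is.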

### Hypotheses
The following hypotheses were chosen to decide which properties to focus on within the poor neighborhoods once those had been identified.
1. The lower the current condition, the lower the price
2. The farther away from the city center, the lower the price
3. The longer ago the last renovation, the lower the price

# Setting working environment

In [None]:
# IMPORT LIBRARIES
import pandas as pd         # data handling
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt # plotting

import plotly.express as px # plotting
from plotly.subplots import make_subplots
import plotly.graph_objects as go

import folium               # for heatmap on streetmap
from folium import plugins

In [None]:
# IMPORT DATA AS DATA FRAME
df = pd.read_csv('data/kch_poor_neighborhood_clean_data.csv')

# drop the first column, which holds the old indices
df.drop(columns=df.columns[0], inplace=True)
df.head(2)

# EDA

First, let's look at where the poor neighborhoods lie and visually inspect whether they are close to each other or form distinct areas. How many zip code areas are left after the data cleaning process?

In [None]:
# Zipcode areas that are left; i.e. are "poor neighborhoods"
df.zipcode.nunique()

Make a heatmap of price per $ft^2$ (in US dollars) to get an overview of how the properties are spread.
Tools and basic code for the heatmap are borrowed from [Qingkai](https://qingkaikong.blogspot.com/2016/06/using-folium-3-heatmap.html?m=1) and adjusted to my data.

In [None]:
url_base = 'http://server.arcgisonline.com/ArcGIS/rest/services/'
service = 'World_Street_Map/MapServer/tile/{z}/{y}/{x}'
tileset = url_base + service

m = folium.Map(location=[47.60621, -122.33207], zoom_start=10,
               control_scale=True, tiles=tileset, attr='USGS style')

m.add_child(plugins.HeatMap(zip(df['lat'], df['long'], df['price_sqft_living']), radius = 12))

Interim takeaway: Seattle city center has no properties that fit my definition of a poor neighborhood. Most of the properties are not on the waterfront.

## Distributions and descriptive statistics

How are condition, years since last renovation, total price and price per $ft^2$ of living space distributed over poor neighborhood properties?

First, create a new data frame for calculations involving time since last renovation. For many properties I do not know whether they were never renovated or whether that information is simply missing, so these properties are excluded from it.

In [None]:
# make new df (time since last renovation); copy to get an independent data frame
df_with_renovation = df.query('yrs_since_renovation != 0').copy()

In [None]:
# Distributions
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(20,8))
plt.suptitle('Distribution of condition, price, time since last renovation', fontsize=30)
fig.tight_layout(h_pad=4, pad=2)

sns.histplot(data=df, x='condition', ax=ax[0][0] )
ax[0][0].set_xlabel('Condition [1-5]', fontsize=20);
ax[0][0].set_ylabel('Count', fontsize=20);

sns.histplot(data=df_with_renovation, x='yrs_since_renovation', ax=ax[0][1]);
ax[0][1].set_xlabel('Years since renovation', fontsize=20);
ax[0][1].set_ylabel('');

sns.boxplot(data=df, x='price_sqft_living', ax=ax[1][0])
ax[1][0].set_xlabel('Price/sqft [$]', fontsize=20);

In [None]:
df.loc[:,['price_sqft_living','condition','dist_to_seattle']].describe()

In [None]:
df_with_renovation.loc[:,['yrs_since_renovation','price_sqft_living','condition','dist_to_seattle']].describe()

In [None]:
df['condition'].value_counts()

### [Condition](https://info.kingcounty.gov/assessor/esales/Glossary.aspx?type=r#b)
Most properties in poor neighborhoods have a condition of value "3" (count(3) = 4261, median = 3):
* Average - Some evidence of deferred maintenance and normal obsolescence with age in that a few minor repairs are needed, along with some refinishing. All major components still functional and contributing toward an extended life expectancy. Effective age and utility is standard for like properties of its class and usage.

Some properties in poor neighborhoods have a condition of value "2" (count(2) = 57):
* Fair - Badly worn. Much repair needed. Many items need refinishing or overhauling, deferred maintenance obvious, inadequate building utility and systems all shortening the life expectancy and increasing the effective age.

Properties with conditions 2 and 3 are promising candidates because (a) they can be renovated in a socially responsible way, i.e. without the unnecessary luxury upgrades that renovating properties of condition 4 or 5 would amount to, and (b) the required investment is not so large that the selling price would become too high to still be socially responsible (as would most probably be the case for properties of condition 1).

### Years since last renovation
There are $n = 128$ properties with information on the last renovation, so there are enough such properties to choose from. On average, they were last renovated around 30 years ago ($\mu = 29.63, median = 29.50$).

### Price per $ft^2$
Although we are now only looking at poor neighborhoods, there are still some extreme price outliers: the maximum price/$ft^2$ of 791.67 USD is roughly 4.6 times the average price/$ft^2$ of 173.52 USD.
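
As a hedged illustration of how such outliers can be flagged, the sketch below applies the common 1.5 × IQR rule, which is also what seaborn's boxplot uses by default for its whiskers. The prices are made up, and this rule is an assumption for illustration rather than a step taken in the analysis.

```python
import pandas as pd

# Toy prices per square foot (made up for illustration)
price_sqft = pd.Series([150, 160, 165, 170, 175, 180, 190, 800],
                       name="price_sqft_living")

# Interquartile range and upper fence of the 1.5 * IQR rule
q1, q3 = price_sqft.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the upper fence are flagged as outliers
outliers = price_sqft[price_sqft > upper_fence]

# How many times larger is the maximum than the mean?
ratio_max_to_mean = price_sqft.max() / price_sqft.mean()

print(outliers.tolist(), round(ratio_max_to_mean, 2))
```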

## Hypothesis Tests
|#|Hypothesis|
|---|---|
|1|The lower the current condition, the lower the price|
|2|The farther away from the city center, the lower the price|
|3|The longer ago the last renovation, the lower the price|

### Plot relationships

In [None]:
# convert condition to an ordered categorical for a distinct (discrete) color scheme
condition_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
df['condition'] = df['condition'].astype(condition_dtype)

# use the same dtype for both data frames so that labels cannot get out of order
df_with_renovation['condition'] = df_with_renovation['condition'].astype(condition_dtype)

In [None]:
# Create a scatter plot using Plotly
fig = px.scatter(data_frame=df_with_renovation, x='yrs_since_renovation', y='price_sqft_living',
                 color='condition', labels=dict(yrs_since_renovation='Years since renovation',
                                               price_sqft_living='Price/sqft [$]'),
                 title='Relationship between price/sqft, years since renovation and property condition for King County Housing '
                 'properties in poor neighborhoods', opacity=.7)

fig.show()

In [None]:
# Create a scatter plot using Plotly
fig = px.scatter(data_frame=df, x='dist_to_seattle', y='price_sqft_living',
                 color='condition', title='Relationship between price/sqft, distance to Seattle center and property condition',
                 labels=dict(dist_to_seattle='Distance to Seattle [km]',
                             price_sqft_living='Price/sqft [$]'),
                             opacity=.7)

fig.show()

### Test for correlations

In [None]:
# condition is ordinal, so use Spearman's rank correlation
df[["price_sqft_living", "condition"]].corr(method='spearman')

In [None]:
df[["price_sqft_living", "dist_to_seattle"]].corr()

In [None]:
df_with_renovation[["price_sqft_living", "yrs_since_renovation"]].corr()

# Results

__Overview__

|#|Hypothesis|Result|
|---|---|---|
|1|The lower the current condition, the lower the price|$\rho = .05$|
|2|The farther away from the city center, the lower the price|$r = -.09$|
|3|The longer ago the last renovation, the lower the price|$r = -.03$|

* The correlation between price/$ft^2$ and condition is close to nonexistent.
* The correlation between price/$ft^2$ and distance to Seattle city center is marginal at best.
* The correlation between price/$ft^2$ and time since last renovation is close to nonexistent.
* The direction of all three correlations is as hypothesized.
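
The results above mix Spearman's $\rho$ (for the ordinal condition rating) and Pearson's $r$. As a minimal sketch on made-up toy data, the snippet below shows that Spearman's $\rho$ is simply Pearson's $r$ computed on the ranks of the values, which is why it suits an ordinal variable. The column names follow the notebook; the numbers do not come from the data set.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "condition": [1, 2, 3, 3, 4, 5],
    "price_sqft_living": [100, 150, 120, 180, 160, 400],
})

# Pearson's r on the raw values
pearson_r = toy["condition"].corr(toy["price_sqft_living"])

# Spearman's rho: rank both columns (ties get the average rank), then Pearson's r
ranks = toy.rank()
spearman_rho = ranks["condition"].corr(ranks["price_sqft_living"])

# Cross-check against the call used in the notebook
rho_builtin = toy.corr(method="spearman").loc["condition", "price_sqft_living"]

print(round(pearson_r, 3), round(spearman_rho, 3), round(rho_builtin, 3))
```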

The data for poor neighborhoods are very homogeneous. There are some price outliers, but overall there is not much variation in price/$ft^2$ ($\mu = 173.53, median = 166.67, s = 49.34$). All properties are outside of Seattle, so there is not much variability in distance to the center. Time since renovation is also very homogeneous (right skewed, $\mu = 29.63, s = 15.31, median = 29.50$).
Additionally, there are very few properties with condition ratings of 1 and 2, which reduces their weight in the correlation.
This might explain why we don't see stronger correlations.
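
The effect described above is often called range restriction. The following simulation is a hedged sketch on synthetic data, not on the housing data: `x` stands in for a predictor such as distance and `y` for price, and all values and thresholds are made up. It only illustrates that a correlation shrinks when one looks at a narrow, homogeneous band of the predictor.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: y depends linearly on x, plus noise
n = 10_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Correlation over the full range of x
r_full = np.corrcoef(x, y)[0, 1]

# Correlation when only a narrow band of x is observed (restricted range)
mask = np.abs(x) < 0.5
r_restricted = np.corrcoef(x[mask], y[mask])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))
```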


## Recommendation for Client

Since there are no meaningful correlations, it does not seem necessary to exclude further zip code areas based on distance to the city center.
Regarding expected (renovation) costs, it makes sense to look more closely at properties with a KCH condition rating of 2 or 3 (a rating of 1 would most probably become very expensive, which would be reflected in the later selling price). Taking time since last renovation into account is only efficient if we focus on properties for which we have that information or which are not older than 40 years. Otherwise this information would have to be retrieved from KCH. It should also be kept in mind that the longer ago the last renovation, the more likely unforeseen problems (and costs) are to arise, independent of the KCH condition rating.

I specifically recommend that my client, Erin Robinson, consider investing in properties that meet the following criteria:
* located in one of the 22 *poor* zip code areas
* KCH condition of 2 or 3
* last renovation no more than 40 years ago
* total price at or below the median, $price(total) \leq median$, so that the property is still affordable after renovation (social responsibility)
* at least 3 bedrooms, to meet the needs of families (social responsibility)

The following properties meet all of these criteria:

In [None]:
# Need the data frame with all columns to look up the number of bedrooms
kch = pd.read_csv('data/kch_clean_data.csv')

In [None]:
possible_properties = df.query('(condition==2 or condition==3) and (yrs_since_renovation>0 and yrs_since_renovation<=40)')['id'].values

### Final properties - investment possibilities

In [None]:
final_properties = kch.query('id in @possible_properties and price<=price.median() and price_sqft_living<=price_sqft_living.median() and bedrooms>=3'
                             )[['id','date','price','price_sqft_living','sqft_living','bedrooms','bathrooms','yr_renovated','yr_built','floors','sqft_lot','lat','long']]

final_properties

In [None]:
url_base = 'http://server.arcgisonline.com/ArcGIS/rest/services/'
service = 'World_Street_Map/MapServer/tile/{z}/{y}/{x}'
tileset = url_base + service

m = folium.Map(location=[47.60621, -122.33207], zoom_start=10,
               control_scale=True, tiles=tileset, attr='USGS style')

m.add_child(plugins.HeatMap(zip(final_properties['lat'], final_properties['long'], final_properties['price_sqft_living']), radius = 12))