# Exploratory Data Analysis (EDA)

## Scope of this project
Get familiar with the _King County Housing Data_ and perform an **Exploratory Data Analysis** (EDA) with a focus on the specific requests of the stakeholder.

The stakeholder: <br>
Nicole Johnson, a buyer who is looking for a "Lively, central neighborhood, middle price range, right timing (within a year)"

---------------------------
# Table of Contents
1) Import and first impression of the dataset
2) Initial Hypotheses about the Dataset
3) Explore and clean the dataset
-------------------------------


## 1) Import and first impression of the dataset

First of all, we will load the data into the workspace as a _dataframe object_ using **_pandas_** and display its main characteristics.

In [None]:
# import the libraries we need for our analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
# import dataset
df = pd.read_csv('data/King_County_House_prices_dataset.csv', parse_dates=['date'])
df

In [None]:
# So, what size does the dataset have?
print("\n", f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns.", "\n")

# Now, let us take a look at the columns and their types:
df.info()

- For most of the columns the dtype looks reasonable, except for _sqft_basement_, which is of type "object". We would expect a "float", since the variable gives the size of the basement in square feet.
- _waterfront_ is a "float64" and not a boolean, as we might have expected (either the house has a waterfront or it does not). So, let's have a quick look:

In [None]:
df.waterfront.unique()


We see that, rather than TRUE/FALSE, the column is already encoded as 0/1 and also contains NaN (missing values).<br>
Speaking of NaN, how many missing values do we have in each column?

In [None]:
df.isna().sum()

Except for _waterfront_, _yr_renovated_ and _view_, the variables (i.e. columns) are complete in the sense that no missing values appear.<br>
In addition, let us check whether there are any duplicates or multiple entries which need to be cleaned.

In [None]:
df["id"].duplicated().value_counts()

Indeed, the _id_, which identifies each house, has 177 duplicated entries. So, are these really duplicates of complete rows, or does the house id just occur more than once (for example, because a house has been sold several times within the given time period)?

In [None]:
df.duplicated().unique()
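To tell repeated sales apart from accidental duplicates, it can also help to list every row whose id occurs more than once, sorted by id and date. A minimal sketch on made-up toy data (the frame `toy` and all its values are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy data (assumed values): house 1 was sold twice, house 2 once
toy = pd.DataFrame({
    "id": [1, 2, 1],
    "date": pd.to_datetime(["2014-05-02", "2014-06-10", "2015-01-15"]),
    "price": [300000, 450000, 320000],
})

# keep=False marks every occurrence of a repeated id, not only the later ones
repeated = toy[toy["id"].duplicated(keep=False)].sort_values(["id", "date"])

# Fully identical rows would be counted here; a resale differs in date and/or price
n_full_duplicates = toy.duplicated().sum()

print(repeated)
print("fully duplicated rows:", n_full_duplicates)
```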

As `df.duplicated()` does not return a single True value, there are no duplicates of complete rows. So, for now we will keep the repeated ids. <br>
Lastly, we will take a brief look at some basic descriptive statistics for each variable:

In [None]:
df.describe()

Here, although they are numeric, the statistics for variables such as _id_, _waterfront_, _zipcode_, _latitude_ and _longitude_ can be ignored.  
To get more insights, let's visualise some variables of the table which might be of special interest for our purpose (i.e. the requests of the stakeholder):
- the _price_
- the _year built_ (and _year renovated_ if applicable)
- the _living size_ and overall _(lot) size_ of the houses
- quality - specified via _condition_ and _grade_

In [None]:
# one boxplot per variable of interest, side by side
columns = ['price', 'yr_built', 'yr_renovated', 'sqft_living', 'sqft_lot', 'condition', 'grade']
titles = ['Price [USD]', 'Year Built', 'Year Renovated', 'size living [sqft]',
          'size lot [sqft]', 'condition', 'grade']

fig, axes = plt.subplots(1, len(columns), figsize=(18, 8))

for ax, column, title in zip(axes, columns, titles):
    sns.boxplot(ax=ax, data=df[column])
    ax.set_title(title)

We can see that the variables _price_, _sqft_living_ and _sqft_lot_ appear to be right-skewed, as they have a number of outliers towards higher values, whereas _yr_built_ and _condition_ seem to have a rather symmetrical distribution. For most of the houses, the _grade_ varies between 6 and 9, with a few outliers on both sides of the distribution. The variable _yr_renovated_, however, seems to be corrupted. So let's have a look at the values:

In [None]:
df.yr_renovated.unique()

So, there are both NaN and zeros, which we need to replace when proceeding with the data cleaning. We already know that there are 3,842 NaN, so let's count the zeros as well:

In [None]:
(df['yr_renovated'] == 0).sum()


So, out of 21,597 values we have 3,842 NaN plus 17,011 zeros, which leaves at most 744 actual renovation years, considering that there may still be some duplicates.
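One possible way to handle this during cleaning is to treat the zeros as missing values, so that only real renovation years remain. A minimal sketch on toy data, assuming that 0 means "never renovated or unknown" (this interpretation and the toy values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy column (assumed values) mimicking yr_renovated: NaN, zeros and real years
yr_renovated = pd.Series([0.0, np.nan, 1991.0, 0.0, 2005.0, np.nan])

n_nan = yr_renovated.isna().sum()      # missing entries
n_zero = (yr_renovated == 0).sum()     # zeros (NaN == 0 evaluates to False)
n_valid = len(yr_renovated) - n_nan - n_zero

# Replace zeros with NaN so that both cases count as "no renovation year"
cleaned = yr_renovated.replace(0, np.nan)

print(n_nan, n_zero, n_valid)
print(cleaned.notna().sum())
```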

## 2) Initial Hypotheses about the Dataset

Hypotheses related to the stakeholder: <br>
For the stakeholder to buy a house in a central neighborhood in the middle price range with the right timing (within a year), the house will
- H1) be older than 50 years and not have been renovated in the last 25 years, or
- H2) have a below-average grade, or
- H3) have a below-average living area
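Once reference values are fixed, the three hypotheses can be expressed as boolean masks. A minimal sketch on toy data, assuming 2015 as the reference year and the column mean as the "average" (both choices, the name `REFERENCE_YEAR` and the toy values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

REFERENCE_YEAR = 2015  # assumption: year in which age and renovation are evaluated

toy = pd.DataFrame({
    "yr_built": [1950, 2000, 1960, 1940],
    "yr_renovated": [np.nan, np.nan, 2005, 1980],  # NaN = no renovation recorded
    "grade": [6, 9, 7, 8],
    "sqft_living": [1200, 3000, 1800, 2000],
})

age = REFERENCE_YEAR - toy["yr_built"]
# A comparison with NaN is False, so houses without a recorded renovation
# count as "not renovated in the last 25 years"
recently_renovated = toy["yr_renovated"] >= REFERENCE_YEAR - 25

h1 = (age > 50) & ~recently_renovated
h2 = toy["grade"] < toy["grade"].mean()
h3 = toy["sqft_living"] < toy["sqft_living"].mean()

print(pd.DataFrame({"H1": h1, "H2": h2, "H3": h3}))
```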

## 3) Explore and clean the dataset



Before we start cleaning the data, it is useful to define the stakeholder's requests in terms of the available parameters:
- lively, central neighborhood:
- middle price range:
- right timing (within a year):

### Add/remove columns

In [None]:
# As we want to know the right timing of buying within a year, we generate a 'month' column
df['month'] = df.date.dt.month

# Add prices for lot and living area as price per square foot (price divided by size)
df['sqft_lot_per_sqft'] = df['price'] / df['sqft_lot']
df['sqft_living_per_sqft'] = df['price'] / df['sqft_living']

# For our purpose, we don't need sqft_above and sqft_basement, as these add up to sqft_living and we are only interested 
# in the overall size of the living area and overall lot size.
df.drop(['sqft_above','sqft_basement'], axis=1, inplace=True)
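The claim that _sqft_above_ and _sqft_basement_ add up to _sqft_living_ can be checked before the drop. Because _sqft_basement_ is stored as "object", it first has to be converted to numbers. A minimal sketch on toy data, assuming the column contains a non-numeric placeholder such as '?' (this placeholder is an assumption; the actual values should be inspected with `df.sqft_basement.unique()`):

```python
import pandas as pd

# Toy data (assumed values); sqft_basement holds strings, hence dtype object
toy = pd.DataFrame({
    "sqft_living": [1500, 2000, 1800],
    "sqft_above": [1000, 2000, 1200],
    "sqft_basement": ["500.0", "0.0", "?"],
})

# errors="coerce" turns entries that cannot be parsed into NaN
basement = pd.to_numeric(toy["sqft_basement"], errors="coerce")

# Compare only the rows where the basement size is known
known = basement.notna()
adds_up = (
    toy.loc[known, "sqft_above"] + basement[known] == toy.loc[known, "sqft_living"]
).all()

print(basement.dtype, adds_up)
```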

In [None]:
df
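The new _month_ column could later be used to compare price levels over the year. A minimal sketch on toy data, using the median price per month as one possible measure (the toy values and the choice of the median are assumptions for illustration):

```python
import pandas as pd

# Toy data (assumed values): four sales spread over three calendar months
toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-05-02", "2014-05-20", "2014-12-03", "2015-01-15"]),
    "price": [300000, 500000, 350000, 320000],
})
toy["month"] = toy["date"].dt.month

# Median price per calendar month; the median is less sensitive to outliers than the mean
monthly_median = toy.groupby("month")["price"].median()
cheapest_month = monthly_median.idxmin()

print(monthly_median)
print("month with the lowest median price:", cheapest_month)
```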