<div style="position: relative;">
<img src="https://user-images.githubusercontent.com/7065401/98728503-5ab82f80-2378-11eb-9c79-adeb308fc647.png"></img>

<h1 style="color: white; position: absolute; top:27%; left:10%;">
     INE Bootcamp
</h1>
<h2 style="color: white; position: absolute; top:36%; left:10%;">
    Data Analysis, Visualization and Predictive Modeling
</h2> 

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:58%; left:10%;">
    <b>David Mertz, Ph.D.</b>
</h3>

<h3 style="color: #ef7d22; font-weight: normal; position: absolute; top:63%; left:10%;">
    <b>Data Scientist</b>
</h3>
</div>

<div style="width: 100%; height: 200px; background-color: #222; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Data Analysis for Machine Learning
</h1>

<br><br> 
</div>

> We have already used scikit-learn in our basic polynomial fitting, but let us use it in a more systematic way and consider some issues that arise in real-world data science.

We'll work with a Kaggle dataset: [House Sales in King County, USA](https://www.kaggle.com/harlfoxem/housesalesprediction).

<img src="https://user-images.githubusercontent.com/7065401/110563684-6a7b3100-812a-11eb-961b-ec7d2f25c008.jpg" style="width:400px; float: right; margin: 0 40px 40px 40px;"/>

These are the features of the dataset:

* **id**: a notation for a house
* **date**: Date house was sold
* **price**: Price is prediction target
* **bedrooms**: Number of Bedrooms/House
* **bathrooms**: Number of Bathrooms/House
* **sqft_living**: square footage of the home
* **sqft_lot**: square footage of the lot
* **floors**: Total floors (levels) in house
* **waterfront**: House which has a view to a waterfront
* **view**: Has been viewed
* **condition**: How good the condition is ( Overall )
* **grade**: overall grade given to the housing unit, based on King County grading system
* **sqft_above**: square footage of house apart from basement
* **sqft_basement**: square footage of the basement
* **yr_built**: Built Year
* **yr_renovated**: Year when house was renovated
* **zipcode**: zip
* **lat**: Latitude coordinate
* **long**: Longitude coordinate
* **sqft_living15**: Living room area in 2015 (implies some renovations). This might or might not have affected the lot size area
* **sqft_lot15**: Lot size area in 2015 (implies some renovations)

Importing the required libraries:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
sns.set_theme()

<h2 style="font-weight: bold;">
    Exploratory Data Analysis
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Loading the dataframe:

In [None]:
df = pd.read_csv('data/kc_house_data.csv')
df.head()

<h2 style="font-weight: bold;">
    Getting a feeling for the data
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

The first step when analyzing data is cleaning: checking that we have loaded the data correctly and that the values are valid. This is a process that will involve multiple steps, but for now, we start with our _5 minute_ check:

In [None]:
df.shape

With `shape` we know that there are 21,613 rows, with 21 columns (features). Let's check for red flags on those features:

In [None]:
df.info()

`info` gives you a quick summary of both the type and the non-null count for each column. In this case the data seems correct: there are no missing values and the types are as expected.

Zip code is interesting.  House prices are often driven by zip code ("desirable" neighborhoods), but the numeric order of these codes has no pattern in relation to these prices.  The zip code is known as a "categorical variable" rather than a quantitative one.

In [None]:
df.zipcode.unique()

The date is currently encoded as a timestamp string, which is not directly useful.  Let us look at a few raw values, then convert them to datetimes:

In [None]:
list(df.date[:4])

In [None]:
days = pd.to_datetime(df.date)
days.head()

However, that form is also not yet directly useful since machine learning models want *numbers* to work with.  We can continue the conversion to get this into "nanoseconds since the epoch" (the beginning of Unix time in 1970).  The specific numbers are not important, only that they go up with the passage of time.  Let us add that as a potentially useful feature, since housing prices may well change over time.

In [None]:
df['since_epoch'] = days.astype('int64')  # 'int64' is explicit about the integer width
df.iloc[:5, -5:]
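As an aside, raw nanosecond counts are very large numbers. A minimal sketch of an alternative encoding counts whole days since the epoch instead. It assumes date strings laid out like the ones listed above; the two sample strings and the name `sample_dates` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample strings in the same layout as the `date` column
sample_dates = pd.Series(['20141013T000000', '20141209T000000'])

# Spell out the layout explicitly rather than relying on format inference
parsed = pd.to_datetime(sample_dates, format='%Y%m%dT%H%M%S')

# Whole days elapsed since the Unix epoch (1970-01-01)
days_since_epoch = (parsed - pd.Timestamp('1970-01-01')).dt.days
print(days_since_epoch.tolist())
```

Either encoding preserves the ordering of sales in time, which is the property the model can use.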

<h2 style="font-weight: bold;">
    High-level feature selection
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Our objective is to predict the price of a house based on the features that we know about the house. For example, we expect that a larger lot size and more bedrooms will go along with a higher price. It makes sense to drop the internal ID and the unencoded date. Let us drop zip code for the moment, but we will come back to it. Latitude and longitude measure something similar about geographic location, and we retain them.

Feature selection can be very important to an ML model. With pandas it is simple to exclude columns:

In [None]:
zipcodes = df.zipcode  # Save for later
df.drop(columns=['id', 'zipcode', 'date'], inplace=True)
df.head()

<h2 style="font-weight: bold;">
    Correlation between variables
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Some variables will have higher (positive or negative) correlation with the price. We know that the surface area of a house is positively correlated with its price: the larger the house, the higher the price. But what about others? We can build a simple correlation table to better understand the relationships between different variables:

In [None]:
df[['price', 'since_epoch', 'bedrooms', 'bathrooms', 'sqft_above', 'sqft_living', 'lat', 'long']].corr()

It is a bit interesting that latitude is somewhat correlated with price, and longitude essentially not at all. However, we probably expect that it is the *interaction* of these variables that measures an actual effect.  That is, the "rich neighborhood" is not necessarily the one farthest north, south, east, or west; quite likely it is some region in the middle, along both axes.
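A near-zero linear correlation does not mean a variable carries no information. The sketch below uses purely synthetic data (the column names `east_west` and `centered_sq` are invented for illustration) in which price peaks in the middle of one axis. The raw coordinate shows almost no correlation with price, while its square correlates strongly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic coordinate, symmetric around a "desirable" center at 0
east_west = rng.uniform(-1, 1, 2000)

# Price is highest in the middle and falls off toward both edges
price = 1.0 - east_west**2 + rng.normal(0, 0.05, 2000)

toy = pd.DataFrame({
    'east_west': east_west,
    'centered_sq': east_west**2,   # distance-from-center style feature
    'price': price,
})
toy_corr = toy.corr()['price']
print(toy_corr.round(2))
```

This is one reason polynomial or other non-linear models can extract signal from features that look useless in a correlation table.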

We can use a visualization to summarize these variables and their correlations:

In [None]:
corr = df.corr()
fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(corr, ax=ax, linewidths=0.01);

Let us think about the fairly strong correlation of `grade` and `price`.  Is it a linear relationship?

In [None]:
print(df.corr().loc['grade', 'price'])
df.plot.scatter(x='grade', y='price', figsize=(15, 4), color="darkblue", marker='.');

It feels like there is some connection between the ordinal grade and the quantitative price.  But we have quite a bit of variability of price within a grade.  Looking at price logarithmically makes the pattern a bit sharper, but still not entirely so.

In [None]:
df.plot.scatter(x='grade', y='price', figsize=(15, 4), 
                logy=True, color="darkblue", marker='.');

<h2 style="font-weight: bold;">
    More cleaning, identifying outliers
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Linear regression (along with other ML models) can be sensitive to outliers:

In [None]:
df.describe()

🤔 A house with 33 bedrooms? There's something going on here:

In [None]:
fig, ax = plt.subplots(figsize=(15, 3))
sns.boxplot(data=df[['bedrooms', 'bathrooms']], orient="h");

It makes sense for a (really expensive) house to have, let's say 10 bedrooms, but 33 seems like an error.

In [None]:
df[df['bedrooms'] == 33].T

33 bedrooms and only 1.75 bathrooms? 😅 clearly an error.

In [None]:
# Drop by condition rather than by a hard-coded row label
df.drop(df[df['bedrooms'] == 33].index, inplace=True)
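Spotting the 33 by eye works here, but a rule of thumb can flag candidates automatically. The sketch below applies the common 1.5 × IQR fence to a small made-up list of bedroom counts. The data and variable names are illustrative and not taken from the King County file.

```python
import pandas as pd

# Made-up bedroom counts with one implausible value
bedrooms_toy = pd.Series([3, 2, 4, 3, 5, 3, 2, 4, 33, 3])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = bedrooms_toy.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for a closer look
flagged = bedrooms_toy[(bedrooms_toy < lower) | (bedrooms_toy > upper)]
print(flagged.tolist())
```

Flagged values are candidates for inspection rather than automatic deletions; whether to drop them remains a judgment call.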

Now, what about those properties without bathrooms? That is strange, let's take a look:

In [None]:
df[df['bathrooms'] == 0]

Now that we look, it perhaps makes a little bit more sense. Maybe those are just warehouses or other types of storage facilities? Without more information it is difficult to make a decision. This is an important lesson: **domain expertise is fundamental when analyzing data**

We will not remove any additional houses at this point; we just keep a mental note of the suspicious absence of bathrooms listed. How are other variables doing?

In [None]:
fig, ax = plt.subplots(figsize=(15, 4))
sns.boxplot(data=df[['sqft_living','sqft_above', 'sqft_basement']], orient="h");

This probably requires a little bit more analysis, but let's proceed.

<h2 style="font-weight: bold;">
    One-hot encoding
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

The `zipcode` feature we dropped above raises a problem. Machine learning models often do not understand categorical features like zip code. To a machine learning algorithm, a `zipcode` value of 98199 is "greater than" 98102, which is greater than 98001. However, the more expensive houses are unlikely to follow this same order (nor the exact reverse).

In this data, there are 70 zip codes, and varying numbers of houses in each.

In [None]:
zipcodes.value_counts()

We can create "dummies" to represent these categorical values as new numeric variables.  This is what is called "one-hot" encoding.

In [None]:
pd.get_dummies(zipcodes)

We *could* add these 70 new features to our dataset, which would then be entirely numeric.  We do not do so at this point, however (not out of principle, just convenience of pedagogy).
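For reference, attaching dummy columns to a frame might look like the following sketch. It uses a tiny made-up table rather than the real data, and the `zip` prefix is an arbitrary naming choice.

```python
import pandas as pd

# Tiny made-up frame standing in for the housing data
toy = pd.DataFrame({
    'sqft_living': [1180, 2570, 770, 1960],
    'zipcode': [98178, 98125, 98028, 98125],
})

# One indicator column per distinct zip code
dummies = pd.get_dummies(toy['zipcode'], prefix='zip')

# Replace the categorical column with its indicator columns
encoded = pd.concat([toy.drop(columns='zipcode'), dummies], axis=1)
print(encoded)
```

Each row has exactly one "hot" indicator, so no artificial ordering between zip codes is implied.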

<h2 style="font-weight: bold;">
    Feature scaling and normalization
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

There is a final **IMPORTANT** point to discuss: "scaling" and "normalizing" features. It has a mathematical explanation, but basically, what we **do not** want is to have features whose units occupy dramatically different numeric ranges. For example:

In [None]:
cols = ['bedrooms', 'sqft_lot', 'since_epoch']
df[cols].head()

The values here are too dissimilar, which will make many algorithms perform poorly. We can scale these features to remove the units. 

Read more here: [Importance of Feature Scaling](http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html)

In [None]:
scaled = StandardScaler().fit_transform(df[cols])
df_scale = pd.DataFrame(scaled, columns=['scaled_br', 'scaled_lot', 'scaled_epoch'])
df_scale

In [None]:
fig, ax = plt.subplots(figsize=(15, 4))
sns.violinplot(data=df_scale, orient='h');

There are still some lot sizes that are far larger than is typical, but at least we do not have 18 orders of magnitude difference between the kinds of variables.  Most likely, this is accurate data, but includes some large farms or ranches along with ¼ acre city lots. All have a mean of zero and standard deviation of one in their scaled version.
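Under the hood, standard scaling is simply z = (x - mean) / std, computed per column with the population standard deviation. A small sketch on made-up lot sizes (the array `lot_toy` is illustrative) checks a manual calculation against `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up lot sizes, one column
lot_toy = np.array([[5650.0], [7242.0], [10000.0], [5000.0], [8080.0]])

# Manual z-score; np.std uses the population formula (ddof=0) by default
manual = (lot_toy - lot_toy.mean(axis=0)) / lot_toy.std(axis=0)

scaled_toy = StandardScaler().fit_transform(lot_toy)
print(np.allclose(manual, scaled_toy))
```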

Let's scale all of our data and separate features from target. The capital "X" for the (multiple) features, i.e. the independent variables, and the lowercase "y" for the (single) target are a very common convention harkening back to high school algebra.

In [None]:
X = df.drop('price', axis=1)   # everything except the price
y = df.price                   # just the price
X_scaled = StandardScaler().fit_transform(X)

<h2 style="font-weight: bold;">
    Train/test splits
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

There is another **very important** topic we skipped in our look at linear and polynomial fitting.  We earlier used *ALL* the data for both training and prediction (or scoring).  This leads to a problem called *overfitting*.  The model learns to memorize the data it is given rather than genuinely model the underlying behavior.

The way we deal with this problem is to split the data into two parts, one to perform the training with, the second to hold in reserve for evaluation of the quality of the model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=1)
print("X_train shape:", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape: ", y_test.shape)
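To make memorization concrete, here is a sketch on synthetic data (a noisy sine curve, unrelated to the housing file). A 1-nearest-neighbor regressor is used purely as an example of a model that can memorize: it reproduces its training data perfectly, yet scores noticeably worse on rows it has not seen.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)

# Synthetic data: a sine wave plus substantial noise
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)

# With one neighbor, each training point is its own nearest neighbor
memorizer = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)
train_score = memorizer.score(X_tr, y_tr)
test_score = memorizer.score(X_te, y_te)
print(f"Train score: {train_score:0.3f}")
print(f"Test score:  {test_score:0.3f}")
```

The gap between the two numbers is the part of the training score that came from memorizing noise.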

<h2 style="font-weight: bold;">
    Modeling
</h2>

![orange-divider](https://user-images.githubusercontent.com/7065401/98619088-44ab6000-22e1-11eb-8f6d-5532e68ab274.png)

Let's see now how our Linear Regression performs on our cleaned and scaled data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=1)
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Score: {score:0.3f}")

It **seems** only a little better than the straight linear regression in the initial lesson.  However, in reality it is quite a **lot** better, because this is a fair evaluation that uses a train/test split.  Scoring on held-out data will usually give a lower score than scoring on the training data itself, but the higher score of the unsplit data largely reflects overfitting, and such a model can fail dramatically when it sees novel data.
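A related fairness detail: in this notebook the scaler was fit on all rows before splitting, so the test rows influenced the scaling statistics slightly. One common alternative, sketched here on synthetic data from `make_regression` (not the housing data), is to wrap scaling and model in a pipeline so that the scaler is fit only on the training portion.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for real features and prices
X_syn, y_syn = make_regression(n_samples=500, n_features=5,
                               noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1)

# fit() learns the scaling statistics from the training rows only;
# score() reuses those statistics to transform the test rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
pipe_score = pipe.score(X_te, y_te)
print(f"Score: {pipe_score:0.3f}")
```

For plain linear regression the difference is typically small, but the habit matters more with models and preprocessing steps that are sensitive to scale.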

Let us also try with a polynomial, as we did in the earlier lesson.

In [None]:
X_poly = PolynomialFeatures(2).fit_transform(X_scaled)
X_polytrain, X_polytest, y_train, y_test = train_test_split(X_poly, y, random_state=1)

In [None]:
model = LinearRegression().fit(X_polytrain, y_train)
score = model.score(X_polytest, y_test)
print(f"Score: {score:0.3f}")

In [None]:
model.predict(X_polytest[:5])
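It may help to see what `PolynomialFeatures(2)` produces. For inputs `a` and `b` it emits a constant, the original values, and every degree-2 product: `1, a, b, a², ab, b²`. The sketch below shows this on a single made-up row, then on an illustrative 18-column input to show how quickly the feature count grows.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up row with two features, a=2 and b=3
row = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(2).fit_transform(row)
print(expanded)        # columns: 1, a, b, a^2, a*b, b^2

# n inputs become (n + 1)(n + 2) / 2 outputs at degree 2
wide = PolynomialFeatures(2).fit_transform(np.zeros((1, 18)))
print(wide.shape)
```

The cross terms such as `a*b` are what let a linear model capture interactions between features.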

Many models that are not linear, nor even polynomial, are also available.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor().fit(X_polytrain, y_train)
score = model.score(X_polytest, y_test)
print(f"Score: {score:0.3f}")

In [None]:
# only needed on scikit-learn < 1.0, where this estimator was still experimental
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
model = HistGradientBoostingRegressor().fit(X_polytrain, y_train)
score = model.score(X_polytest, y_test)
print(f"Score: {score:0.3f}")

<div style="width: 100%; height: 200px; background-color: #ef7d22; text-align: center; padding-top: 20px; margin-bottom: 40px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Exercises
</h1>

<br><br> 
</div>

Within scikit-learn itself, a number of datasets are made available.  One that is often used contains housing data similar to what we looked at here.  These datasets are not stored as Pandas DataFrames, but we can easily construct one from the attributes of a dataset object.  First take a look at the attributes provided using `dir(ca_housing)` and try to understand what each is.

In [None]:
from sklearn import datasets
ca_housing = datasets.fetch_california_housing()

We can construct a DataFrame, then perform the same cleanup and modeling as we did for the King County data.

In [None]:
df_ca = pd.DataFrame(ca_housing.data, columns=ca_housing.feature_names)
df_ca['TARGET'] = ca_housing.target
df_ca

Create and evaluate a model to predict California housing prices (from the 1990 census).

In [None]:
# your code goes here


<div style="width: 100%; height: 400px; background-color: #222; text-align: center; padding-top: 120px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    Review and questions
</h1>

<br><br> 
</div>

<div style="width: 100%; height: 400px; background-color: #ef7d22; text-align: center; padding-top: 120px;">
<br><br>

<h1 style="color: white; font-weight: bold;">
    <a style="color: white;" href="https://docs.google.com/forms/d/1FGx7gzZzOgahGF1X6ZOOo2nGMHbHpHIqMysdYg5_WBw/viewform?edit_requested=true" target="_blank">Evaluation</a>
</h1>

<br><br> 
</div>

---
<div style="position: relative; text-align: right;">
<img src="https://user-images.githubusercontent.com/7065401/98614301-dcf01780-22d6-11eb-9c8f-65ebfceac6f6.png" style="width: 130px; display: inline-block;"></img>

<img src="https://user-images.githubusercontent.com/7065401/98864025-08deda80-2448-11eb-9600-22aa17884cdf.png" style="height: 100%; max-height: inherit; position: absolute; top: 20%; left: 0px;"></img>
<br>

<h2 style="font-weight: bold;">
    David Mertz, Ph.D.
</h2>

<h3 style="color: #ef7d22; margin-top: 0.8em">
    Data Scientist
</h3>
<hr>
<br><br>

<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    david.mertz@gmail.com
</p>
<p style="font-size: 80%; text-align: right; margin: 10px 0px;">
    linkedin.com/in/dmertz/
</p>

</div>

<br><br><br>