## Part 1: Look at the Big Picture
**Summary:** In this mini-project we will use California census data to build a model of housing prices. Our dataset is the CA Housing Prices Dataset from the StatLib repository, which is based on the 1990 census. Some of the features in this dataset are population, income, and housing prices for each district. 

**Goal**: Your ML model should learn from this dataset in order to predict the median housing price in any district. 

**Where to begin?** Some common questions at work might be: How will this model be used? How will it benefit the company or people using it? What does the current (if any) solution look like (e.g., are people hand calculating this result and if so, how)?

**Frame the problem:** Is it supervised, unsupervised, hybrid, or reinforcement learning? Is it a classification or regression task, or something else? Should you use batch or online learning? What assumptions have you or others made about this task? (Note: we haven't covered all these concepts yet; this is just to give you an idea of how you can start thinking about setting up ML projects).

**Task:** We are given labeled training examples, so this will be supervised learning.
Because we are asked to predict a value (median housing price), this is a regression task. Specifically, this is a multiple regression task because we will use many features to make our prediction. We haven't covered this in class yet, but the next step is to choose our *performance measure* -- the typical performance measure for regression tasks is the *Root Mean Square Error (RMSE)* cost function. It measures the typical size of the model's prediction errors, giving more weight to large errors than to small ones.
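
To make the performance measure concrete, here is a minimal sketch of RMSE computed by hand with NumPy: square each prediction error, average the squares, and take the square root. The function name `rmse_demo` and the toy numbers are illustrative assumptions, not values from the housing dataset.

```python
import numpy as np

def rmse_demo(y_true, y_pred):
    """Root mean square error: square root of the mean squared prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical labels and predictions, in dollars
toy_labels = [200_000, 150_000, 300_000]
toy_predictions = [210_000, 150_000, 280_000]

print(rmse_demo(toy_labels, toy_predictions))
```

The individual errors here are 10,000, 0, and 20,000, and the RMSE comes out to roughly 12,910. That is above the plain average error of 10,000 because squaring gives the largest miss more weight.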

## Part 2: Load the Data

In [42]:
# import packages
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly to use urlretrieve
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [3]:
## Part 2: Get the Data
# Create a function to download housing.tgz: a compressed archive containing housing.csv, the comma-separated values (CSV) file with the dataset

# Specify URL & path names
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

# Creates 'datasets/housing' dir in workspace
# Downloads & extracts housing.tgz
def get_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)


In [4]:
get_housing_data()

In [5]:
# Load the data using pandas
# Returns a pandas DataFrame object containing all the data
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)


## Part 3: Visualize the Data

In [6]:
# Peek at the dataset using DataFrame's head() method to return the top 5 rows
housing = load_housing_data()
housing.head()


**About the Housing Dataset** 

Each row represents one district. 
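There are 10 attributes: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, ocean_proximity. Nine of them are candidate features; median_house_value is the target (label) we want to predict.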

**TODO**: Call the *info()* method on the Housing dataset. Do you notice anything about total_bedrooms and ocean_proximity? 

**TODO**: Check the categories and the number of districts in each for the *ocean_proximity* feature by using the *value_counts()* method. Note: You can select a feature by name, just as you would index into an array. For example: housing["latitude"].

In [8]:
housing["latitude"]

**TODO:** Now try calling the *describe()* method on the housing dataset. What does this return?

### Part 3a: Histograms
Another visualization option is to plot a histogram. A histogram will show the number of instances (y-axis) that have a given value range (x-axis). 

Some things to notice about the histograms:
<ol>
    <li> Median income: has been **preprocessed** -- scaled and capped with new range: [0.5..15]. The values are expressed in tens of thousands of dollars, so a value of 5 represents about $50,000.  </li>
    <li> Median age has also been capped. </li>
    <li> PROBLEM! Median house value was also capped! But this is our target attribute (our labels). When will this be a problem? </li>
    <li> The attributes have different scales. This could be a problem later. </li>
    <li> Several of these histograms are **tail-heavy** -- they extend farther to the right of the median than the left. This could make it harder for some ML algorithms to learn patterns. How do you think we can get around this? </li> 
</ol>

In [11]:
%matplotlib inline 
housing.hist(bins=50, figsize=(20,15))
plt.show()

### Part 3b: Create a Test Set 
Before creating more visualizations, let's set aside a portion of our dataset for testing. Typically, we can visualize the data to search for patterns and guess which models and features will be a good starting point. However, we don't want to introduce *data snooping bias*. In order to avoid this, we can reserve the testing set now, before loading more visualizations. That way we'll only visualize the training set that our ML algorithm will learn from. The most common data split is 80% training and 20% testing (test_size = 0.2). Another common split is 70% training, 20% testing, and 10% development/validation. 

In [12]:
# Split the dataset into a training set and testing set 
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)


The code above splits the dataset by purely random sampling, which is the most common approach. Ideally, however, we should split using stratified sampling. In this approach, the population is divided into homogeneous subgroups (strata), and the right number of instances is sampled from each stratum so that the test set is representative of the entire dataset.

For the housing prediction task, experts say that median income is the most important feature for predicting the median housing price (our prediction goal). So we can separate the median income into categories (strata) and then select the appropriate number of samples from each category to ensure correct representation in our training/testing sets. 
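
To see why stratification helps, the sketch below compares income-category proportions in a purely random test set and a stratified one. It makes several assumptions: the income values are synthetic (drawn from a log-normal distribution, not the real census data), the `toy` names are hypothetical, and it uses the `stratify` argument of `train_test_split` as a shortcut rather than the approach taken in the cells that follow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for median_income (hypothetical values, not the real data)
rng = np.random.default_rng(42)
toy = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                           labels=[1, 2, 3, 4, 5])

# Purely random split vs. a split stratified on the income category
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)
_, strat_test = train_test_split(toy, test_size=0.2, random_state=42,
                                 stratify=toy["income_cat"])

def cat_proportions(df):
    """Fraction of rows that fall in each income category."""
    return df["income_cat"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": cat_proportions(toy),
    "random": cat_proportions(random_test),
    "stratified": cat_proportions(strat_test),
})
print(comparison)
```

The stratified column should track the overall proportions almost exactly, while the random column can drift, especially for the smaller categories.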

In [13]:
# Create income_cat feature with 5 categories 
housing["income_cat"] = pd.cut(housing["median_income"], bins = [0., 1.5, 3.0, 4.5, 6., np.inf], labels = [1, 2, 3, 4, 5])
housing["income_cat"].hist()

Now we can do stratified sampling on the income category. 

In [14]:
# Perform stratified sampling on income_cat to create 80% training & 20% testing sets
split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index, test_index in split.split(housing, housing["income_cat"]): 
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Check income category proportions
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

    

In [15]:
# Remove the income_cat attribute to restore the original set of columns
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis = 1, inplace = True)


### Part 3c: Back to Visualizations
We'll save a copy of our training set and use it to create more data visualizations. 

In [16]:
housing = strat_train_set.copy()

In [17]:
# Use latitude & longitude to create a geographical scatterplot of all districts
housing.plot(kind = "scatter", x = "longitude", y = "latitude", alpha = 0.1)

In [None]:
### TODO ###
# What happens when you remove the alpha parameter in the code above? 


In [19]:
# Are housing prices related to location & population density? 

# param s: radius of each circle represents district's population
# param c: color represents price
# param cmap jet: use a predefined color map called jet (uses blue (low prices) to red (high prices))
housing.plot(kind = "scatter", x = "longitude", y = "latitude", alpha = 0.4,
            s = housing["population"]/100, label = "population", figsize = (10,7),
            c = "median_house_value", cmap = plt.get_cmap("jet"), colorbar = True)
plt.legend()

In [20]:
# Use pandas to check which attributes/features are correlated with median housing value
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize = (12,8))

In [21]:
# Take a closer look at median house value and median income
# What do you think the plotted horizontal lines represent? Will they be a problem?
housing.plot(kind = "scatter", x = "median_income", y = "median_house_value", alpha = 0.1)


## Part 4: Prepare Your Data for ML Algorithms

First, we'll revert to our clean training set. Then we'll separate predictors (features, attributes, x) and labels (output, target, y). 

In [22]:
housing = strat_train_set.drop("median_house_value", axis = 1)
housing_labels = strat_train_set["median_house_value"].copy()


### Part 4a: Handle Missing Values
At the beginning of Part 3, you might have noticed that *total_bedrooms* was missing some values. Some ML algorithms don't work well when features are missing, so we need to fix this before training. There are 3 common approaches for how to handle missing values:
<ol>
    <li> Get rid of the corresponding sample. </li>
    <li> Get rid of the whole feature. </li>
    <li> Set the missing values to some value: zero, the mean, median, etc. </li>
</ol>

This processing can be done using DataFrame's methods or with Scikit-Learn. 
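
As a rough illustration of the three options using DataFrame methods, here is a sketch on a tiny made-up frame. The `toy_housing` values and the `option1`/`option2`/`option3` names are hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing total_bedrooms value
toy_housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the samples (rows) that have a missing value
option1 = toy_housing.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole feature (column)
option2 = toy_housing.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the median of the feature
bedrooms_median = toy_housing["total_bedrooms"].median()
option3 = toy_housing.assign(
    total_bedrooms=toy_housing["total_bedrooms"].fillna(bedrooms_median)
)

print(option1.shape, option2.shape, option3["total_bedrooms"].tolist())
```

With option 3, the median should be computed on the training set and then reused to fill gaps in the test set, which is what fitting an imputer on the training data accomplishes.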

In [23]:
# Create a SimpleImputer
# Replace each attribute's missing values with the median of that attribute
imputer = SimpleImputer(strategy = "median")

# Because we can only compute median on numerical attributes
# Copy the data without the ocean_proximity categories (because they're text)
housing_num = housing.drop("ocean_proximity", axis = 1)

# Fit the imputer to the training data
# Imputer computes median of each attribute & stores result in its statistics_ instance variable
imputer.fit(housing_num)

print(imputer.statistics_)
#print(housing_num.median().values)

In [24]:
# Use the trained Imputer to transform training set 
# by replacing missing values with learned medians
# Returns NumPy array containing transformed features
X = imputer.transform(housing_num)
print(X)

In [25]:
# Put X back into a DataFrame
housing_tr = pd.DataFrame(X, columns = housing_num.columns, index = housing_num.index)
print(housing_tr)

### Part 4b: Text to Numbers
Most ML algorithms prefer to work with numbers instead of text, so we need to convert the *ocean_proximity* text into numerical categories. 

In [26]:
# Look at the ocean_proximity attribute/feature
housing_cat = housing[["ocean_proximity"]]
print(housing_cat.head(10))

In [28]:
# Represent text categories as numerical categories
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

print(housing_cat_encoded[:10])


In [29]:
# Print list of categories 
print(ordinal_encoder.categories_)

One problem with this type of representation is that some ML algorithms will assume nearby values are more similar than distant values, which isn't the case for *ocean_proximity* (e.g., in this case categories 0 & 4 are more similar than categories 0 & 1). 

To fix this, we can use a common ML representation called **one-hot encoding**. For this type of feature representation, we create one binary attribute per category, for example: one attribute equal to 1 when the category is INLAND and 0 otherwise, another attribute equal to 1 when the category is NEAR OCEAN and 0 otherwise, and so on for each. 
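
To show what one-hot vectors look like, here is a small NumPy-only sketch built by hand. It is an illustration of the representation, not the Scikit-Learn encoder used in the exercise below, and the `toy_` names and sample values are hypothetical.

```python
import numpy as np

# Hypothetical sample of ocean_proximity values
toy_proximity = np.array(["INLAND", "NEAR OCEAN", "INLAND", "<1H OCEAN"])

# The sorted unique categories become the columns of the one-hot matrix
toy_categories = np.unique(toy_proximity)

# Row i gets a 1 in the column that matches its category and 0 elsewhere
toy_one_hot = (toy_proximity[:, None] == toy_categories[None, :]).astype(int)

print(toy_categories)
print(toy_one_hot)
```

Each row contains exactly one 1, so no category is numerically "closer" to another than any other.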

In [None]:
### TODO ###
# Use Scikit-Learn's OneHotEncoder to convert categorical values into one-hot vectors

# Create a OneHotEncoder
cat_encoder = 

# Call the OneHotEncoder's fit_transform method on housing_cat
housing_cat_1hot = cat_encoder.

# Returns a SciPy sparse matrix
print(housing_cat_1hot)

In [None]:
### TODO ###
# Print the list of categories using the encoder's categories_ instance variable
print( )

### Part 4c: Feature Scaling & Transformation Pipelines
Feature scaling is one of the most important preprocessing steps you will perform. With few exceptions, ML algorithms don't perform well when the numerical input attributes have very different scales. There are 2 common approaches to get all features onto the same scale: 
<ol>
    <li> Normalization: also called min-max scaling. Values are shifted and rescaled to range from 0..1. Scikit-Learn provides the *MinMaxScaler* class for this. </li>
    <li> Standardization: subtract the mean and divide by the standard deviation, so each feature ends up with zero mean and unit variance. This approach doesn't force values to fall within a specific range, which can be a problem for some algorithms. However, it's less affected by outliers. Scikit-Learn provides the *StandardScaler* class for this.  </li>
</ol>
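
The two approaches can be compared side by side on a single made-up feature. This is a minimal sketch: the `toy_feature` values, including the deliberate outlier, are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one large outlier
toy_feature = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min), so values land in [0, 1]
toy_minmax = MinMaxScaler().fit_transform(toy_feature)

# Standardization: (x - mean) / std, giving zero mean and unit variance
toy_standard = StandardScaler().fit_transform(toy_feature)

print(toy_minmax.ravel())
print(toy_standard.ravel())
```

In the min-max output, the outlier maps to 1 and the four small values are squeezed into roughly the bottom 3% of the range. The standardized output is not bounded to any fixed interval.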

There are several data transformation steps you might have to execute in the correct order. Scikit-Learn provides a *Pipeline* class to help with this. 

In [35]:
# Small pipeline for numerical attributes
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy = "median")),
    ('std_scaler', StandardScaler())
])

housing_num_tr = num_pipeline.fit_transform(housing_num)
print(housing_num_tr)

Instead of modifying the categorical and numerical columns separately, we can use *ColumnTransformer* to apply the appropriate transformation to each column. 

In [37]:
# Get list of numerical column names
num_attribs = list(housing_num)

# Get list of categorical column names 
cat_attribs = ["ocean_proximity"]

# ColumnTransformer: numerical columns should be transformed using num_pipeline
#                    categorical columns should be transformed using OneHotEncoder
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs)
])

# Apply full pipeline to housing dataset
housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared)

## Part 5: Select & Train a Model
So far we've:
<ol>
    <li> Framed the problem/task </li>
    <li> Downloaded, explored, and visualized our data </li>
    <li> Preprocessed our training set (the test set will go through the same fitted pipeline later) </li>
</ol>

Now we just need to choose and train our model!
Let's use Linear Regression, which we'll be learning about soon in class. 

In [39]:
# Linear Regression Model
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)


In [41]:
# Try the model on a few instances
test_data = housing.iloc[:5]
test_data_labels = housing_labels.iloc[:5]
test_data_prepared = full_pipeline.transform(test_data)

# Predicted output (labels) by lin_reg model
print("Predictions:", lin_reg.predict(test_data_prepared))

# Actual labels
print("Labels:", list(test_data_labels))

And it's that easy! We fit the model and then used it to make predictions. But how good are the predictions compared to the actual labels? 

In [43]:
# Measure RMSE of the training set
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

What does this RMSE mean? Is our model good or bad? It says that our typical prediction error for a district's median housing price is about $69,050, which is probably too large for a real-world application. Because this error was measured on the training data itself, it is a good example of *underfitting*: either our features don't give us enough information to make good predictions, or the model isn't powerful enough. How do you think we can handle underfitting?

Note: normally we will use *cross-validation* to evaluate our models, but that will appear in future code projects.
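
As a preview, here is a hedged sketch of what cross-validation could look like with Scikit-Learn's `cross_val_score`. It runs on synthetic regression data rather than the housing set, and the `X_demo`/`y_demo` names, the coefficients, and the choice of 5 folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical stand-in for the prepared housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5-fold cross-validation. Scikit-Learn scorers follow "higher is better",
# so MSE comes back negated and we flip the sign before taking the root.
neg_mse_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                                 scoring="neg_mean_squared_error", cv=5)
cv_rmse_scores = np.sqrt(-neg_mse_scores)

print("RMSE per fold:", cv_rmse_scores)
print("Mean:", cv_rmse_scores.mean(), "Std:", cv_rmse_scores.std())
```

Each fold is held out once while the model trains on the rest, so the result is several RMSE estimates (and their spread) instead of a single number measured on the training data.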


## Part 6: Fine-tune Your Model
By this step in your project, you would have tried a few models and selected the one with the best RMSE. Once you've got the best model, you then want to *fine-tune* it. Depending on the model, that could mean doing any of the following: 

<ol>
    <li> Tune hyperparameters with a search algorithm, e.g., Grid Search or Randomized Search. </li>
    <li> Ensemble approach: combine the models that perform the best. </li>
    <li> Check and correct errors: Inspect the best performing models and check their errors. Depending on the errors you might drop less useful features, add more features, remove outliers, etc. </li>
</ol>
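
The first option, a hyperparameter search, might be sketched as follows. This example makes several assumptions: it swaps in a `Ridge` regressor (not used elsewhere in this notebook) because it has a regularization hyperparameter `alpha` to tune, it runs on synthetic data, and the candidate values in `param_grid` are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(1)
X_gs = rng.normal(size=(200, 4))
y_gs = X_gs @ np.array([1.5, 0.0, -2.0, 0.7]) + rng.normal(scale=0.5, size=200)

# Try every alpha in the grid, scoring each with 5-fold cross-validation
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X_gs, y_gs)

print("Best hyperparameters:", grid_search.best_params_)
print("Best cross-validated RMSE:", np.sqrt(-grid_search.best_score_))
```

Grid search evaluates every combination in the grid, so it suits small search spaces. A randomized search samples combinations instead and scales better when there are many hyperparameters.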

Once you've got the best possible model (including learning algorithm, hyperparameters, and features), then you want to evaluate it on your test set.

In [45]:
# Get predictors & labels from test set
# Run full_pipeline to transform data
# Evaluate final model on test set

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = lin_reg.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

print(final_rmse)


Note: a more powerful model, or better features, might have lowered this error further. 

## All done!
At a job you might be asked to present your results and models, including what steps you took, what worked and didn't work, assumptions, model limitations, etc. Then you'd launch your system, monitor it, and maintain it. 