**CPTR** 435 Machine Learning 


Name: Kaleb Tsegaye

This activity is adapted from the notebook provided for chapter 2 of *Hands-On Machine Learning with Scikit-Learn & TensorFlow 2nd ed* by Geron (2019).

For the original notebook and all other code/data from the book, see:
https://github.com/ageron/handson-ml2



# End-to-end Machine Learning project (Part I: Exploring the data set)

The purpose of this activity is to understand the workflow of a machine learning project from start to finish. The specific task and ML algorithms we see in this notebook are not as important as understanding the process that we go through to approach the problem. In the future, you will see that this process translates well to new problems and ML algorithms.

## Problem: Predict house prices

Suppose you are a data scientist working for a real estate company. Your task is to predict median house values in Californian districts, given a number of features from these districts.

The main steps you will go through are:
1. Learn about the problem
2. Get the data (and examine its structure)
3. Create a test set
4. Explore and visualize the training set
5. Prepare and clean the data for ML algorithms
6. Select/develop an ML algorithm
7. Tune the approach
8. Evaluate trained model
9. Deploy the resulting ML system


The data set is based on the 1990 California census data. For the purpose of the example, the book author (Geron) added a categorical attribute and removed some features. 

An *input* instance in this problem is a *block group* (referred to as a *district* in the book). A block group has a population of 600 to 3000 people. The *output* is the *median house price* for the *block group* (district).

## Classification vs. Regression

While the Iris problem was a *classification* problem (predict species for given iris), here the predicted output is not a class label, but median house value (output) for a given district (input). Since the range of the median house value is continuous, not a discrete class assignment, our task today is a *regression* problem.

## Supervised vs. Unsupervised

Since our data has the correct median house values (output) for each district, this problem is a *supervised learning* problem.

**Note (from Geron)**: You may find little differences between the code outputs in the book and in these Jupyter notebooks: these slight differences are mostly due to the random nature of many training algorithms: although I have tried to make these notebooks' outputs as constant as possible, it is impossible to guarantee that they will produce the exact same output on every platform. Also, some data structures (such as dictionaries) do not preserve the item order. Finally, I fixed a few minor bugs (I added notes next to the concerned cells) which lead to slightly different results, without changing the ideas presented in the book.

# Setup

First, let's make sure this notebook works well in both Python 2 and 3, import a few common modules, ensure Matplotlib plots figures inline, and set the root directory under which figures could be saved:

In [None]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

# Where to save the figures
PROJECT_ROOT_DIR = "."
#IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images")

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

# Get the data

While the Iris data set is conveniently included in ML libraries such as Scikit-Learn, most data we want to work with is not. With real world applications of ML, our data must be loaded from external files. In some cases the data may be in formats that are easy to read into our programs (e.g. XML, JSON, CSV, SQL). In other cases, there may be considerable preprocessing involved to first prepare the data so it may be loaded into our ML systems.
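As a small illustration of the "easy to read" formats mentioned above, the sketch below loads the same two records from CSV text and from JSON text using pandas. The records and column names are made up for illustration, and `io.StringIO` stands in for files on disk.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for data files (made-up values, for illustration only)
csv_text = "district,median_income\nA,3.5\nB,8.1\n"
json_text = '[{"district": "A", "median_income": 3.5}, {"district": "B", "median_income": 8.1}]'

# pandas readers accept file paths or file-like objects
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

# Both readers produce a DataFrame with the same columns
print(list(from_csv.columns), from_csv.shape)
print(list(from_json.columns), from_json.shape)
```

Either way we end up with a DataFrame, which is why the rest of this notebook does not depend on the original file format.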

Our data set today will be downloaded from the course instructor's github account and saved locally in a ``datasets`` subdirectory. Here we create a function to download and extract the data. 

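Note that ``urlretrieve`` by itself downloads the file every time it is called and overwrites any existing copy at the destination path. To avoid downloading the file again, a function can first check whether the file already exists at the destination path on your computer and skip the download if it does.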

https://docs.python.org/3.0/library/urllib.request.html#urllib.request.urlretrieve
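One way to download a file only when it is missing is sketched below. The helper name `download_if_missing` and its `retrieve` parameter are hypothetical (they are not part of the original notebook); the `retrieve` parameter exists only so the logic can be tried out without a network connection.

```python
import os
import urllib.request

def download_if_missing(url, dest_path, retrieve=urllib.request.urlretrieve):
    """Download url to dest_path unless a file already exists there.

    Returns True if a download was performed, False if it was skipped.
    (Hypothetical helper, shown as a sketch.)
    """
    # skip the download when the file is already on disk
    if os.path.isfile(dest_path):
        return False
    # create the destination directory if needed
    dest_dir = os.path.dirname(dest_path)
    if dest_dir:
        os.makedirs(dest_dir, exist_ok=True)
    retrieve(url, dest_path)
    return True
```

Calling the helper twice with the same destination path should trigger only one download; the second call returns `False`.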

In [None]:
import os
import urllib.request  # standard library in Python 3 (no need for the six package)

# URL for data file
DOWNLOAD_URL = "https://raw.githubusercontent.com/ackleywill/CPTR435/main/housing.csv"
# https://www.kaggle.com/datasets/camnugent/california-housing-prices

# local path where data file will be stored on computer (or in virtual environment)
HOUSING_PATH = os.path.join("datasets", "housing")

def fetch_housing_data(housing_url=DOWNLOAD_URL, housing_path=HOUSING_PATH):
    # create local directories for storing data files (if necessary)
    # NOTE: if running this in Colaboratory, these directories will not be
    # created on your computer, but in the virtual environment for the notebook
    # in colaboratory. It will only be available to this notebook, not others.
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)

    # build local path for data file
    csv_path = os.path.join(housing_path, "housing.csv")
    # download the data file only if it is not already present locally
    if not os.path.isfile(csv_path):
        urllib.request.urlretrieve(housing_url, csv_path)


Actually call our new function to download the data.

In [None]:
fetch_housing_data()

## Looking at how the data is structured

Before we start working with our data set(s), we want to learn how the data is structured, both its file format and the data structure it is read into.

The data file ``datasets/housing/housing.csv`` is a comma-separated values (CSV) text file. This is a common text-based file format. You can view the file with a text editor or load it into Excel. 

We'll let a panda do it.

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

Now we call our new function to load the data into our notebook using pandas and display the first 5 rows of our DataFrame (``head()``). This will show us what types of features we are working with in our data set.

In [None]:
housing = load_housing_data()
housing.head()

### Characteristics of our data

Pandas provides several functions for describing the nature of our data set. These include the data types used to store the values, and statistics about the data set.

For a description of our pandas DataFrame, we use the function ``info()``. This will tell us the number of rows (entries), number of columns, and the data type for each column.

In [None]:
housing.info()

There are 20,640 instances (districts) in our data set. While this is not a large data set compared with those used in production machine learning systems, it is much larger than our Iris data set (150 instances).


To see the number of times each unique value appears in a given column, we use the function ``value_counts()``.

In [None]:
housing["ocean_proximity"].value_counts()

The function ``describe()`` displays various statistics about each column in the DataFrame.

In [None]:
housing.describe()

We can use matplotlib to plot histograms for each column in our data set. The histograms are computed with the pandas ``hist`` function.

Do you see any patterns or interesting observations?
- How many households are in most districts?
- Where are most districts located (common latitude and longitude)? 
- Common median age? What are we really asking?
- Common median house value? Outliers?
- Common income range? Is it in dollars? What is max income?
- Notice anything strange about housing median age or median house value?

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15)) # calls matplotlib.pyplot.hist to plot histograms

plt.show()

# Create a test set

Before we do any further examination, we need to split our data set into *training* and *testing* sets. We will then set our test set aside and not look at it until we evaluate our ML system. 

If we wanted to be really proper, we would have split our data set before even looking at value counts, attribute statistics, and histograms. The less we know about the test set, the better. Otherwise, there is a temptation to "teach to the exam" when developing our ML approach. The problem with optimizing the approach for the test set is that our ML system may perform well on instances in our test set but poorly on future instances (i.e. it won't *generalize* well to new districts). 

As with the Iris data set, we will randomly assign instances (districts) to either the training or testing set. So that we get the same results when we rerun our code, we will set the random seed to the same value (42) each time. For this problem we will use an 80/20 split for training/testing.
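To make the idea of a seeded random split concrete, here is a minimal sketch using NumPy. The function name `split_indices` is hypothetical, and this is not how Scikit-Learn implements its split internally, so it will not select the same rows as ``train_test_split``.

```python
import numpy as np

def split_indices(n_instances, test_ratio, seed):
    """Return (train indices, test indices) from a seeded random shuffle."""
    rng = np.random.RandomState(seed)          # fixed seed -> same shuffle every run
    shuffled = rng.permutation(n_instances)    # random ordering of 0..n_instances-1
    test_size = int(n_instances * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

train_idx, test_idx = split_indices(n_instances=10, test_ratio=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 8 2
```

Because the seed is fixed, rerunning the cell yields the same train and test indices each time.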


In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
test_set.head()

Curious. We see several entries with NaN for total bedrooms. 

*How many of these entries are there?*

In [None]:
test_set['total_bedrooms'].isnull().sum()

How would we get just these NaN entries?

In [None]:
null_frames = test_set[test_set['total_bedrooms'].isnull()]

In [None]:
null_frames

*Do we have a similar situation in the training set?*

In [None]:
train_set['total_bedrooms'].isnull().sum()

*What happens if we go back and resplit our data set, but this time use a different random seed (e.g. 12 instead of 42)?*

## Using stratified sampling

When we randomly split our data set each instance is equally likely to end up in the test set. Treating each instance the same is appropriate in many cases.

Suppose that, for our task, we chat with experts who say that median income for a district is very important for predicting the median home value. Seems reasonable. Families with more income are likely to have more expensive homes.

Due to randomness, it's possible that the median income distribution in our training set differs from that in our test set. If the instances in our training set are not representative of the instances in our test set (and ultimately the real world), our ML system is likely to perform poorly.

Let's look at the current median income distribution among districts in our full data set.


In [None]:
housing["median_income"].hist()

So most districts are not fabulously wealthy, even in California.

When we split our data set into training and testing sets, we want each median income level to be represented in both sets in roughly the same proportion as in the full data set. To do this, we use a sampling technique called *stratified sampling*. Our instances are divided into subgroups called *strata*, and instances are randomly drawn from each stratum so that both the training set and the test set are representative of the overall data set.
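A minimal sketch of stratified sampling on a toy data set is shown below. It uses the ``stratify`` argument of ``train_test_split``, which is one way Scikit-Learn offers stratified splitting; the toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 instances in stratum "low", 20 in stratum "high" (made-up data)
toy = pd.DataFrame({
    "value": np.arange(100),
    "stratum": ["low"] * 80 + ["high"] * 20,
})

# stratify=... keeps the 80/20 mix of strata in both the train and test sets
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["stratum"], random_state=42)

print(toy_test["stratum"].value_counts())
```

With a 20% test set, the 20 test instances should contain 16 "low" and 4 "high" instances, matching the overall proportions.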

## Creating the strata

We want each stratum to have a reasonable number of instances. Small strata are more likely to be over- or under-represented in the test set. For our data set, we will create 5 strata based on median income. Our strategy will be: 
1. Category 1 ranges from 0 to 1.5
2. Category 2 ranges from 1.5 to 3
3. Category 3 ranges from 3 to 4.5
4. Category 4 ranges from 4.5 to 6
5. Category 5 ranges from 6 to infinity
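The categories above can be built with ``pd.cut``. By default its bins include the right edge and exclude the left edge, so a value of exactly 1.5 lands in category 1 and a value of exactly 6 lands in category 4 (a value of exactly 0 would fall outside all bins). The sketch below checks this on a few made-up income values.

```python
import numpy as np
import pandas as pd

# Made-up median income values chosen to sit on and around the bin edges
incomes = pd.Series([0.5, 1.5, 1.6, 3.0, 4.5, 6.0, 15.0])

cats = pd.cut(incomes,
              bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
              labels=[1, 2, 3, 4, 5])

print(cats.tolist())  # [1, 1, 2, 2, 3, 4, 5]
```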





In [None]:
housing["income_cat"] = pd.cut(housing['median_income'], 
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
housing["income_cat"].head()

In [None]:
housing["income_cat"].value_counts()

In [None]:
housing["income_cat"].hist()

In [None]:
housing.head()

Not all strata are the same size, but each has a reasonable number of instances.

## Performing the stratified sampling

Scikit-Learn provides the class ``StratifiedShuffleSplit`` that will help us perform a split using our strata that we just created.

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

This class can be used to create multiple different train/test splits. Here we only want one train/test split. If we were performing 10-fold cross-validation (we'll get to that later), we would want 10 different train/test splits.

The ``split()`` function iterates through each train/test split. So we use a for-loop to get the split even though there is only one split in this case. For each iteration, we get the indices of the instances for the training set (1D ndarray) and test set (1D ndarray).
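To see the iteration more clearly, the sketch below asks for three splits of a small toy data set (20 made-up instances, two equally sized classes) and prints the shapes of the index arrays for each split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 20 instances, 10 in class 0 and 10 in class 1 (made-up data)
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 10 + [1] * 10)

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# split() yields one (train indices, test indices) pair per requested split
splits = list(splitter.split(X, y))
for i, (train_index, test_index) in enumerate(splits):
    print(i, train_index.shape, test_index.shape, np.bincount(y[test_index]))
```

Each test set should hold 4 instances, 2 from each class. When only one split is requested, ``next(splitter.split(X, y))`` is an alternative to the for-loop.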




In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Create split object. Want 80% train, 20% test.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

# split.split() returns a generator, so we use a for loop to make it produce a split.
# split(X, y) -> yields (train indices, test indices), sampled so that each
# label in y (e.g. each income category) has roughly the same proportion
# in the train and test sets as in the full data set.
for train_index, test_index in split.split(housing, housing["income_cat"]):
    print('Training samples: {}, testing samples: {}'
          .format(train_index.shape, test_index.shape))
    
    # the indices returned by split() are positional, so select rows with iloc
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

## Verify stratified sampling

We've split our data set. Now let's check whether each stratum is represented in the test set in the same proportion as in the overall data set.

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)  # stratified sampled test set

In [None]:
housing["income_cat"].value_counts() / len(housing)    # overall data set (both train and test)

To see the effect of stratified sampling of our data set, let's compare our split to one without stratified sampling.

In [None]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

# regular, non-stratified sampling split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# create a data frame to show the proportion of each stratum 
compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()

# compute the percentage by which each stratum is over/under represented in the test sets
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

In [None]:
compare_props

Note that the first income category is very small compared with the others.

*What do you expect would happen after a stratified split if we adjusted the cutoff for category 1 to go from 0 to 2.0 instead of 0 to 1.5?*

The test set from the stratified sampling split is more representative of the overall data set than the non-stratified approach.
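The "%error" columns in the comparison table express how much a stratum is over- or under-represented in a test set relative to the overall data set. A minimal worked example of that arithmetic follows; the helper name `percent_error` is hypothetical and the proportions are made up, not taken from the housing data.

```python
def percent_error(sample_proportion, overall_proportion):
    """Relative over/under-representation of a stratum, in percent."""
    return 100 * sample_proportion / overall_proportion - 100

# Made-up proportions for illustration (not taken from the housing data)
over = percent_error(0.05, 0.04)    # stratum is 5% of the test set, 4% overall
under = percent_error(0.03, 0.04)   # stratum is 3% of the test set, 4% overall

print(round(over, 2))    # 25.0  -> over-represented by 25%
print(round(under, 2))   # -25.0 -> under-represented by 25%
```

A value near 0 means the test set mirrors the overall data set for that stratum.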

Now that we have split our data, we can remove the "income_cat" column we created to define the strata.

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# Discover and visualize the data to gain insights

Now that we have set aside the test set, we can spend more time exploring the training set. Our exploration helps us develop an effective approach for solving the task.

So we don't accidentally mess up our training set as we experiment and explore it, we will create a copy of it. 

In [None]:
housing = strat_train_set.copy()

## Visualizing geographic data

Since our districts have latitude and longitude we can visualize the locations of the districts with a scatter plot (x-axis is longitude, y-axis is latitude).

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")


Looks kind of like California. 

However, given the size of the data points, it's hard to see whether there are data points really close together. 

If we set ``alpha``=0.1 (default is 1), then the dots are mostly transparent. This will show us if there are many districts really close together in some areas.

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)


Now we can see that there are dense clusters of districts in a few areas.

## Visualize housing prices

Let's add more information to our plot.
- Make the size of the dot proportional to district population (option ``s``).
- Set dot color (option ``c``) based on median house value. Blue = low value. Red = high value.

The argument `sharex=False` fixes a display bug (the x-axis values and legend were not displayed). This is a temporary fix (see: https://github.com/pandas-dev/pandas/issues/10611). Thanks to Wilmer Arellano for pointing it out.

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()


## Looking for correlations

Features are the input to machine learning systems. In theory, there is no such thing as a bad feature. Given enough training data, ML algorithms will learn to ignore unhelpful features. However, in practice, there is often too little data. So, it is nice if we can identify the most important features for our task. 

Identifying the most informative features can be difficult or impractical for some problems. One method is to compute correlation coefficients between each pair of attributes in our data set.

In [None]:
corr_matrix = housing.corr(numeric_only=True)


In [None]:
corr_matrix

For this problem, we are particularly interested in how much the attributes are correlated with median house value. 

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

Correlation coefficients range between -1 and 1.
- close to 1 implies a strong *positive* correlation between the attributes (i.e. house value goes *up* as the attribute goes up)
- close to -1 implies a strong *negative* correlation between the attributes (i.e. house value goes *down* as the attribute goes up)
- close to 0 implies little or no linear relationship between the attributes

Note: Correlation coefficient only measures *linear* relationships between attributes, not *nonlinear* relationships.
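The sketch below illustrates this limitation with synthetic data: a perfectly linear relationship gives a coefficient of 1, while a perfectly predictable but symmetric quadratic relationship gives a coefficient of essentially 0.

```python
import numpy as np

# Synthetic inputs, symmetric around zero
x = np.linspace(-1, 1, 201)

y_linear = 2 * x + 1      # perfectly linear in x
y_quadratic = x ** 2      # fully determined by x, but not linear

# np.corrcoef returns the correlation matrix; [0, 1] is the x-vs-y entry
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(r_linear)
print(abs(r_quadratic) < 1e-9)
```

So a coefficient near 0 does not mean the attributes are unrelated, only that there is no linear trend.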

Which attribute has the strongest relationship with median house value?

Which attribute has little correlation with median house value?

## Plotting attributes

Another way to check for correlations between attribute pairs is to create scatter plots of attribute pairs. With 9 numerical attributes in our data set, we would have $9^2 = 81$ plots. For the sake of this example, we will just consider a subset of 4 attributes. 


In [None]:
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))


Each row has the same attribute along the y-axis. Each column has the same attribute along the x-axis.

The diagonal plots show histograms for the row/column attribute.

Once again, from these plots it seems that median income has the strongest correlation with median house value.


In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])


As we may have noticed when first looking at the attribute histograms, median house value appears to top out at \$500K. It is unlikely that this was the maximum value of a Californian house, even in 1990. More likely, when the data was collected, there were boxes for house value ranges and the last box was \$500K+. 

Is this a problem? Potentially. The system may not make accurate predictions for houses worth more than \$500K. To deal with this we have a couple of options:
1. Go back and collect accurate median house values for these districts. This may not be possible since the data set was collected in 1990.
2. Remove districts with a median house value of \$500K from the training and test sets. 
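The second option can be sketched with a boolean mask. The helper name `drop_capped` and the cap of 500000 used here are assumptions for illustration; check the actual capped value in the data (for example with ``housing["median_house_value"].max()``) before filtering.

```python
import pandas as pd

def drop_capped(df, column, cap):
    """Return a copy of df without rows whose value in column is at or above cap."""
    return df[df[column] < cap].copy()

# Made-up values: two districts sit at the assumed cap
toy = pd.DataFrame({"median_house_value": [120000, 500000, 340000, 500000]})

uncapped = drop_capped(toy, "median_house_value", cap=500000)
print(len(toy), len(uncapped))  # 4 2
```

The same filter would need to be applied to both the training and test sets so that they stay comparable.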


# Experimenting with attribute combinations

Another thing to look at before developing an ML system is combining attributes to create new, more informative features.

Consider the "total rooms" attribute. This is the total number of rooms in a district. When thinking about the median value for a house, the total number of rooms in a district does not seem like it would be that informative. Our correlation coefficients and scatter plots support this intuition. 

However, if we consider the *average number of rooms per house* in a district, that seems like it would have more correlation with median house price in a district. We can create a new attribute, "rooms per household", by combining our "total rooms" and "households" attributes.

In [None]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

In [None]:
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)

Hmmm...while "rooms per household" has a stronger correlation with house price than "total rooms", it's not much stronger.

*Why isn't it stronger?*


In [None]:
corr_matrix["total_rooms"].sort_values(ascending=False)

*What do you see?*

*What is the relationship between "total rooms" and "households", the two attributes used to compute "rooms per household"?*

It is a common frustration to have a "brilliant idea" for a new feature and find out it does not improve performance of an ML system. Typically, we find in this case that the new feature does not add information that was not already available with the existing features. 

However, sometimes our feature ideas do provide additional information. Consider "bedrooms per room". This attribute has a stronger correlation with median house value than "total bedrooms". Hence, our feature engineering efforts can pay off at times.

# Questions

#### 1.	How many instances does the data set have? How many attributes originally?



#### 2.	What is the importance of using stratified sampling?


#### 3.	In the example presented, which attribute was used in the stratified sampling process and why?


#### 4.	How many strata were used in the example? What was the range for each one?


#### 5.	Explain in a few words how the stratified sampling works.


#### 6.	Why is using correlations useful for gaining insights into data?


#### 7.	In the example presented, which attribute showed the highest correlation with the median house value?


#### 8.	What are attribute combinations? How can they help?