
### Hands-on Machine Learning and Data Science Project

**Over the next few weeks, you will have the opportunity to work through a full machine learning project from end to end!**

This will rely on all the modules we've introduced so far, plus some more that we'll teach you along the way. The next few weeks will be less about introducing new content and more about getting you comfortable applying what you've learned to real-world data.




# Setup

First, let's import a few common modules, make sure Matplotlib plots figures inline, and fix the random seed so that results are reproducible.

In [None]:
import sklearn

# Common imports
import numpy as np
import os
import pandas as pd

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
np.random.seed(42)



### Let's get a feel for the data we'll be working with



In [None]:
housing = pd.read_csv("https://raw.githubusercontent.com/AstraZeneca-Code-Club/intermediate_python/main/datasets/housing/housing.csv")
housing.head()

In [None]:
housing.info()

In [None]:
# remember we can get the breakdown of each value in a column like this
housing["ocean_proximity"].value_counts()

In [None]:
# ...and we can get some summary statistics like this...
housing.describe()

In [None]:
# Histograms are a great way to get an initial feeling for the data
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
# to make this notebook's output identical at every run
np.random.seed(42)


In [None]:
### split the data into a training and a test set


from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

In [None]:
test_set.head()

In [None]:
housing["median_income"].hist()

In [None]:
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

In [None]:
housing["income_cat"].value_counts()

In [None]:
housing["income_cat"].hist()

### A Gentle Introduction to Scikit-Learn

Scikit-learn is a very popular data science and machine learning library for Python. It contains many useful functions that make preparing your data more straightforward.

The first thing we need to think about is how we divide our data. To train a machine learning model, we need some data to train it on, and a held-out fraction to test how well the model predicts on unseen examples.
One of the best ways to do this (especially with small datasets) is stratified splitting: the data is sampled so that each category of an important variable (here, the income category we just created) appears in the training and test sets in the same proportion as in the full dataset.
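
To make the idea of stratification concrete, here is a minimal sketch that splits a made-up label array by hand using only NumPy. The function name `stratified_split` and the toy `labels` array are illustrative assumptions, not part of the housing data or of scikit-learn; the scikit-learn class used in the next cell does the same job for real data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy labels: 80% in category "a", 20% in category "b"
labels = np.array(["a"] * 80 + ["b"] * 20)

def stratified_split(labels, test_size, rng):
    """Return (train_idx, test_idx), sampling each category proportionally."""
    test_idx = []
    for category in np.unique(labels):
        idx = np.flatnonzero(labels == category)  # positions of this category
        rng.shuffle(idx)                          # shuffle within the category
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])             # take the same fraction from each
    test_idx = np.sort(np.array(test_idx))
    train_idx = np.setdiff1d(np.arange(len(labels)), test_idx)
    return train_idx, test_idx

train_idx, test_idx = stratified_split(labels, test_size=0.2, rng=rng)
test_share_b = np.mean(labels[test_idx] == "b")
print(len(train_idx), len(test_idx), test_share_b)
```

Because 20% is taken from each category separately, the share of "b" in the test set matches its share in the full array, whichever rows happen to be drawn.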

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# instantiate the scikit-learn splitter
# split() yields positional row indices for the train and test sets
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    # the indices are positions rather than labels, so select rows with iloc
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

In [None]:
# let's see how well the stratified sampling did for our test set labels vs the full set of labels
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

In [None]:
housing["income_cat"].value_counts() / len(housing)

In [None]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100

In [None]:
compare_props

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# Discover and visualize the data to gain insights

In [None]:
housing = strat_train_set.copy()

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)


In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.legend()

In [None]:
# One useful method built into pandas produces a correlation matrix. This shows us straight away which variables are correlated with each other
# numeric_only=True skips the text column ocean_proximity (requires pandas 1.5 or later)
corr_matrix = housing.corr(numeric_only=True)

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

In [None]:
# let's look a bit closer at the median income scatter plot, since this looks like a promising predictor.  Note the various threshold caps present in the data
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])

In [None]:
# we won't spend too much time on this, but for illustration I've shown how you might start combining attributes to produce additional metrics

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

In [None]:
corr_matrix = housing.corr(numeric_only=True)  # skip the text column ocean_proximity
corr_matrix["median_house_value"].sort_values(ascending=False)

In [None]:
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()

In [None]:
housing.describe()

# Prepare the data for Machine Learning algorithms, using techniques we have used in the past few weeks

In [None]:
# Exercise set 1
# 1) create two variables: `housing` and `housing_labels`, one of which contains only the data we want to predict from, one of which only contains the targets
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

# 2) calculate the median of the 'total_bedrooms' column, and use this to fill the null values for this column.
median = housing["total_bedrooms"].median()
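# assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)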


In [None]:
# Scikit-learn provides Pipelines to take your data from raw form to prepared.  We'll have a look at some of the most common components in a pipeline

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

Remove the text attribute because median can only be calculated on numerical attributes:

In [None]:
# create a subset of the training data containing only the numerical attributes
housing_num = housing.drop("ocean_proximity", axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])

In [None]:
# use the imputer from scikit-learn to calculate the median of every attribute
imputer.fit(housing_num)

In [None]:
# we can access the output using its statistics_ attribute
imputer.statistics_

Check that this is the same as manually computing the median of each attribute:

In [None]:
housing_num.median().values

Transform the training set:

In [None]:
# apply the transform method to apply the calculated median values to the training set
X = imputer.transform(housing_num)

In [None]:
# Let's make a separate dataframe out of the transformed data
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)

In [None]:
imputer.strategy

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)

In [None]:
housing_tr.head()

Now let's preprocess the categorical input feature, `ocean_proximity`:

In [None]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)


In [None]:
# one popular strategy is to one-hot encode categorical variables, so each distinct category is identified by a bit vector
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

By default, the `OneHotEncoder` class returns a SciPy sparse matrix, but we can convert it to a dense NumPy array if needed by calling the `toarray()` method:

In [None]:
housing_cat_1hot.toarray()

Alternatively, you can set `sparse_output=False` when creating the `OneHotEncoder` (in Scikit-Learn versions before 1.2 this parameter is called `sparse`):

In [None]:
cat_encoder = OneHotEncoder(sparse_output=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

In [None]:
cat_encoder.categories_
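
To see what the encoder is doing, here is a minimal hand-rolled sketch of one-hot encoding using only NumPy. The four sample values are made up for illustration, and this is not how `OneHotEncoder` is implemented internally; it only reproduces the idea of one bit vector per category.

```python
import numpy as np

# Hypothetical sample of category labels
values = np.array(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])

# np.unique returns the sorted distinct categories and, for each value,
# the index of its category in that sorted list
categories, codes = np.unique(values, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing with the codes gives one row per input value
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)  # ['INLAND' 'ISLAND' 'NEAR BAY']
print(one_hot)
```

Each row contains exactly one 1, in the column of that row's category.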

Now let's build a pipeline for preprocessing the numerical attributes:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

In [None]:
housing_num_tr

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

In [None]:
housing_prepared

In [None]:
housing_prepared.shape


### Now it's time for the fun bit - training some machine learning models on your dataset

Linear Regression models are about as straightforward as you can get: they fit a linear function that, given the input data, tries to predict the output. Fitting means choosing the coefficients that minimise a loss metric, here the sum of squared differences between the predictions and the actual values. Scikit-learn's `LinearRegression` solves this ordinary least squares problem directly, rather than adjusting the coefficients step by step with a gradient descent algorithm.
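
As a minimal sketch of what fitting a linear function means, the snippet below finds the least-squares slope and intercept for some made-up data using NumPy's `np.linalg.lstsq`. The data and variable names are assumptions for illustration only, and no claim is made that scikit-learn follows exactly these steps.

```python
import numpy as np

# Made-up data: y = 3*x + 2 plus a little random noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=50)

# Design matrix: one column for x, one column of ones for the intercept
A = np.column_stack([x, np.ones_like(x)])

# Least-squares solution: the coefficients minimising sum((y - A @ coef) ** 2)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coef

predictions = A @ coef
mse = np.mean((y - predictions) ** 2)
print(slope, intercept, mse)
```

The recovered slope and intercept should land close to the values used to generate the data, with the gap coming from the noise.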

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
# the preprocessing classes provide fit and transform methods; scikit-learn models provide fit and predict (for training and predicting respectively)
lin_reg.fit(housing_prepared, housing_labels)

In [None]:
# let's try the full preprocessing pipeline on a few training instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))

Compare against the actual values:

In [None]:
print("Labels:", list(some_labels))

In [None]:
some_data_prepared

In [None]:
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

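**Note**: since Scikit-Learn 0.22, you can get the RMSE directly by calling the `mean_squared_error()` function with `squared=False`. That argument was deprecated in version 1.4 in favour of a separate `root_mean_squared_error()` function, so use the latter on recent versions.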
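
To make the two error metrics used here concrete, this small worked example computes the MSE, RMSE and MAE by hand with NumPy on three made-up values (the numbers are purely illustrative).

```python
import numpy as np

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred          # [ 1.,  0., -2.]

mse = np.mean(errors ** 2)        # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)               # same units as the target
mae = np.mean(np.abs(errors))     # (1 + 0 + 2) / 3 = 1.0

print(mse, rmse, mae)
```

Squaring the errors makes the RMSE weight large errors more heavily than the MAE does, so the RMSE can never be smaller than the MAE on the same predictions.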

In [None]:
from sklearn.metrics import mean_absolute_error

lin_mae = mean_absolute_error(housing_labels, housing_predictions)
lin_mae

In [None]:
from sklearn.tree import DecisionTreeRegressor
# Exercise set 2
# 1) Instantiate a decision tree regressor and fit it to the training data, as we did above for the LinReg model
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

In [None]:
# 2) use your trained decision tree to make predictions on the dataset
housing_predictions = tree_reg.predict(housing_prepared)
# 3) calculate the root mean squared error for the decision tree model and compare it to the linear regression model
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse

# Fine-tune your model

One quick way to get a better idea of how your model is performing is to use cross-validation. Instead of training on a single block of data and testing on the held-out set, CV takes the training set and splits it into k folds. It then trains the model k times, using each fold once as the validation set and the other k-1 folds as training data.
This is a great way to detect overfitting (it does not prevent it): a model that scores far better on its own training data than on the held-out folds has overfitted. The standard deviation of the scores across the k folds also gives you an idea of how precise the performance estimate is.
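
The k-fold procedure can be sketched by hand in a few lines of NumPy. The helper `k_fold_indices`, the made-up straight-line data and the simple least-squares model are all assumptions for illustration. This sketch also shuffles the rows before splitting, which is a design choice of the example rather than a description of what `cross_val_score` does.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Made-up data: a noisy straight line
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)
A = np.column_stack([x, np.ones_like(x)])  # column of ones for the intercept

rmse_scores = []
for train_idx, val_idx in k_fold_indices(len(x), k=5):
    # Fit on the k-1 training folds only
    coef, *_ = np.linalg.lstsq(A[train_idx], y[train_idx], rcond=None)
    # Score on the fold that was held out
    errors = y[val_idx] - A[val_idx] @ coef
    rmse_scores.append(np.sqrt(np.mean(errors ** 2)))

rmse_scores = np.array(rmse_scores)
print("Mean:", rmse_scores.mean(), "Standard deviation:", rmse_scores.std())
```

Every row is used for validation exactly once, and the spread of the five scores indicates how much the estimate depends on which rows were held out.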

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [None]:
# let's define a quick utility function to make comparing our model performance easier
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)

In [None]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)



In [None]:
# Exercise set 3 -
# 1) Train a random forest regressor (look up the sklearn docs if you want to change parameters from the defaults)
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

In [None]:
# 2) calculate performance metrics for the random forest
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

In [None]:
# 3) use the cross-validation approach to evaluate the random forest model's performance on 10 folds of the training data
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

In [None]:
scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
pd.Series(np.sqrt(-scores)).describe()