![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand ch2 of the Hands-On ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book and feel free to ask in the messenger group as you go along.**

 ## Good Luck : )

## First, run the following cell, which repeats the first part of the project, so you can continue your work

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [2]:
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)

fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model

In [3]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression().fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [4]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

In [5]:
some_data_prepared = full_pipeline.transform(some_data) 

print('Prediction:', lin_reg.predict(some_data_prepared))
print()
print('label:', list(some_labels))

Prediction: [181746.54359616 290558.74973505 244957.50017771 146498.51061398
 163230.42393939]

label: [103000.0, 382100.0, 172600.0, 93400.0, 96500.0]


# measure this regression model’s RMSE on the whole training set
* using Scikit-Learn’s mean_squared_error() function:

In [6]:
from sklearn.metrics import mean_squared_error

In [7]:
housing_prediction = lin_reg.predict(housing_prepared)

lin_mse  = mean_squared_error(housing_labels, housing_prediction)  # (y_true, y_pred)
lin_rmse = np.sqrt(lin_mse)

lin_rmse

67593.20745775253

# Judge the RMSE result for this model
Write down your answer

An RMSE of about **67593** is large relative to typical house values, most of which fall between **120,000** and **256,000**. Because this error is measured on the training set itself, it points to underfitting: the model is too simple, or the features are not informative enough, to capture the underlying patterns in the data. Hence, it's advisable to try a more powerful model.

# Let’s train a Decision Tree Regressor model
## more powerful model

In [8]:
from sklearn.tree import DecisionTreeRegressor

In [9]:
tree_reg = DecisionTreeRegressor().fit(housing_prepared, housing_labels)

# Now evaluate the model on the training set
* using Scikit-Learn’s mean_squared_error() function:

In [10]:
housing_prediction = tree_reg.predict(housing_prepared)

tree_mse  = mean_squared_error(housing_labels, housing_prediction)  # (y_true, y_pred)
tree_rmse = np.sqrt(tree_mse)

tree_rmse

0.0

# Explain this result
Write down your answer

A training error of 0 does not mean the model is perfect. We evaluated the model on the same data it was trained on, and an unconstrained decision tree is flexible enough to memorize every training instance rather than learn general patterns. This strongly suggests overfitting, so the model must be checked on data it has not seen, for example with cross-validation.
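The gap between training error and error on unseen data can be illustrated with a small sketch. It uses synthetic data (an assumption made for illustration, not the housing set), and the names `X_demo`, `y_demo`, and `demo_tree` are hypothetical. Because the labels contain random noise, a genuine error of zero is impossible, yet an unconstrained tree still reaches it on the training split.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): a linear signal plus noise with std 2.0
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 2.0, size=500)

# Hold out 20% of the rows that the tree never sees during training
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# No depth limit, so the tree can grow until each leaf is pure
demo_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

train_rmse = np.sqrt(mean_squared_error(y_tr, demo_tree.predict(X_tr)))
val_rmse = np.sqrt(mean_squared_error(y_val, demo_tree.predict(X_val)))

print("train RMSE:", train_rmse)
print("validation RMSE:", val_rmse)
```

The training RMSE collapses to zero while the validation RMSE stays well above it, which is the signature of memorization rather than learning.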

# Evaluation Using Cross-Validation

1-split the training set into 10 distinct subsets then train and evaluate the Decision Tree model

In [11]:
from sklearn.model_selection import cross_val_score

In [12]:
scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring = "neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

2- display the resulting scores and calculate their mean and standard deviation

In [13]:
def display_scores(scores):
    print("Scores:", scores)
    print()
    print("Mean:", scores.mean())
    print()
    print("Standard deviation:", scores.std())
    
display_scores(tree_rmse_scores)    

Scores: [65078.25891532 70696.02750088 69317.03471236 71590.76310531
 73540.25698882 67980.28180135 67155.89636265 68593.47233815
 67226.35976615 70788.91383135]

Mean: 69196.72653223416

Standard deviation: 2371.663291454293
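To see what `cross_val_score` does and why the scores are negated before taking the square root, here is a minimal sketch that runs the folds by hand and compares the result. It uses synthetic data (an assumption for illustration), and `X_demo`, `y_demo`, and `manual_rmse` are hypothetical names. Scikit-Learn scorers follow a "greater is better" convention, so the `"neg_mean_squared_error"` scorer returns the MSE with its sign flipped.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption), not the housing set
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Passing the same splitter to both approaches guarantees identical folds
kfold = KFold(n_splits=10)

# Manual version: train on 9 folds, evaluate on the held-out fold
manual_rmse = []
for train_idx, val_idx in kfold.split(X_demo):
    fold_model = DecisionTreeRegressor(random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = fold_model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], preds)))
manual_rmse = np.array(manual_rmse)

# Library version: the scores are negative MSE values, one per fold
neg_mse = cross_val_score(DecisionTreeRegressor(random_state=0),
                          X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=kfold)
auto_rmse = np.sqrt(-neg_mse)

print("all scores <= 0:", bool(np.all(neg_mse <= 0)))
print("manual and library RMSE match:", bool(np.allclose(manual_rmse, auto_rmse)))
```

Each fold's model is trained from scratch, so every score reflects performance on rows that particular model never saw.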


3-repeat the same steps to compute the same scores for the Linear Regression model

*notice the difference between the results of the two models*

In [14]:
scores= cross_val_score(lin_reg, housing_prepared, housing_labels, scoring = "neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-scores)

display_scores(lin_rmse_scores)

Scores: [65000.67382615 70960.56056304 67122.63935124 66089.63153865
 68402.54686442 65266.34735288 65218.78174481 68525.46981754
 72739.87555996 68957.34111906]

Mean: 67828.38677377408

Standard deviation: 2468.0913950652284


## Let’s train one last model the RandomForestRegressor.

In [15]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor().fit(housing_prepared, housing_labels)

housing_prediction = forest_reg.predict(housing_prepared)

forest_mse  = mean_squared_error(housing_prediction, housing_labels)
forest_rmse = np.sqrt(forest_mse)

print(forest_rmse)

scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring = "neg_mean_squared_error", cv = 10)

18433.4121948378


# Repeat the same steps to compute the cross-validation scores, their mean, and their standard deviation for the Random Forest model

In [16]:
fore_rmse_scores = np.sqrt(-scores)

display_scores(fore_rmse_scores)

Scores: [47211.27972361 51336.73936059 49640.52536047 51775.79330693
 52407.94507835 47420.22663188 47481.30874502 50588.73395716
 49234.67655459 49980.91663789]

Mean: 49707.81453564808

Standard deviation: 1781.08677273244


# Save every model you experiment with
*using the joblib library*

In [17]:
import joblib

joblib.dump(lin_reg, "linear_regression.pkl")
joblib.dump(tree_reg, "decision_tree.pkl")
joblib.dump(forest_reg, "random_forest.pkl")

['random_forest.pkl']
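A saved model is only useful if it can be restored later. The following is a minimal sketch of the round trip with `joblib.dump` and `joblib.load`, assuming a small synthetic dataset and a temporary directory; the file name `demo_model.pkl` and the variable names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption), used only to have a fitted model to save
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -0.5]) + 3.0

model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("predictions match after reload:", same_predictions)
```

In the notebook itself, the same `joblib.load` call on one of the saved `.pkl` files would bring back the corresponding fitted model without retraining.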

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor
*It may take a long time*

In [18]:
from sklearn.model_selection import GridSearchCV

In [19]:
param_grid = [
 {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
 {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
 ]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv = 5, scoring = "neg_mean_squared_error", return_train_score = True)

grid_search.fit(housing_prepared, housing_labels)

GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')
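A grid like the one above yields 3 × 4 = 12 combinations from its first dictionary and 2 × 3 = 6 from its second, so 18 in total; with `cv=5` that means 90 training runs. Once a search has been fitted, its results can be read from `best_params_`, `best_score_`, and `cv_results_`. The sketch below shows this on a much smaller grid and synthetic data (both assumptions, chosen to keep it fast); `demo_grid` and `demo_search` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Count the combinations in the grid used by this notebook
housing_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]
n_combos = len(ParameterGrid(housing_grid))
print("combinations:", n_combos)

# Synthetic data and a reduced grid (assumptions) to keep the demo quick
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.3, size=120)

demo_grid = [{"n_estimators": [3, 10], "max_features": [2, 4]}]
demo_search = GridSearchCV(RandomForestRegressor(random_state=7), demo_grid,
                           cv=3, scoring="neg_mean_squared_error")
demo_search.fit(X_demo, y_demo)

# Best hyperparameters, then the RMSE of every combination tried
print("best params:", demo_search.best_params_)
cv_res = demo_search.cv_results_
for mean_score, params in zip(cv_res["mean_test_score"], cv_res["params"]):
    print(np.sqrt(-mean_score), params)

best_rmse = np.sqrt(-demo_search.best_score_)
```

Applied to `grid_search` in this notebook, the same attributes would show which combination won and how close the runners-up were.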

# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [20]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([6.71945705e-02, 6.11893135e-02, 4.32442666e-02, 1.52750568e-02,
       1.37113604e-02, 1.48345991e-02, 1.33369213e-02, 3.73376289e-01,
       3.94745130e-02, 1.14241628e-01, 7.03246705e-02, 8.84838709e-03,
       1.55804835e-01, 2.13966461e-04, 4.72486837e-03, 4.20475352e-03])

2-display these importance scores next to their corresponding attribute names:

In [21]:
extra_attribs       = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder         = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes          = num_attribs + extra_attribs + cat_one_hot_attribs

sorted(zip(feature_importances, attributes), reverse = True)

[(0.37337628897801783, 'median_income'),
 (0.15580483544091195, 'INLAND'),
 (0.11424162836515847, 'pop_per_hhold'),
 (0.0703246705384116, 'bedrooms_per_room'),
 (0.06719457054146899, 'longitude'),
 (0.061189313480718675, 'latitude'),
 (0.04324426659610013, 'housing_median_age'),
 (0.03947451302988654, 'rooms_per_hhold'),
 (0.015275056808788932, 'total_rooms'),
 (0.014834599086150537, 'population'),
 (0.013711360393585038, 'total_bedrooms'),
 (0.013336921289840504, 'households'),
 (0.008848387093206666, '<1H OCEAN'),
 (0.004724868373348916, 'NEAR BAY'),
 (0.004204753523340697, 'NEAR OCEAN'),
 (0.00021396646106457387, 'ISLAND')]

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [22]:
final_model = grid_search.best_estimator_

X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

2-run your full_pipeline to transform the data

In [23]:
X_test_prepared = full_pipeline.transform(X_test)

3-evaluate the final model on the test set

In [24]:
final_prediction = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_prediction)
final_rmse = np.sqrt(final_mse)

final_rmse

49924.83143250667

# compute a 95% confidence interval for the generalization error
*using scipy.stats.t.interval():*

In [25]:
from scipy import stats

In [26]:
squared_errors = (final_prediction - y_test) ** 2

np.sqrt(stats.t.interval(.95, len(squared_errors) - 1,
        loc=squared_errors.mean(),
        scale=stats.sem(squared_errors)))

array([47747.38671616, 52011.19734159])
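The interval returned by `stats.t.interval()` is the sample mean of the squared errors plus or minus a critical t-value times the standard error of the mean; the square root then converts the bounds from MSE to RMSE. Here is a minimal sketch of that calculation done by hand, assuming synthetic squared errors in place of the real test-set errors; `demo_squared_errors` and the other names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic squared errors (assumption), standing in for the test-set errors
rng = np.random.default_rng(3)
demo_squared_errors = rng.normal(0, 50000, size=1000) ** 2

confidence = 0.95
n = len(demo_squared_errors)
mean = demo_squared_errors.mean()
sem = stats.sem(demo_squared_errors)  # standard error of the mean

# Two-sided critical value of Student's t with n - 1 degrees of freedom
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)

# Interval on the MSE, converted to an interval on the RMSE
manual_interval = np.sqrt([mean - t_crit * sem, mean + t_crit * sem])

# The same interval from the library call used in the notebook
library_interval = np.sqrt(
    stats.t.interval(confidence, n - 1, loc=mean, scale=sem))

print("manual: ", manual_interval)
print("library:", library_interval)
```

Both approaches should agree, and the point estimate of the RMSE lies inside the interval.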

# Great Job!
# #shAI_Club