![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand ch2 from the hands-on ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book and feel free to ask about anything in the messenger group as you go along.**

 ## Good Luck : )

## First, run the following cell from the first part of the project to continue your work

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [2]:
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)
   
fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model 

In [3]:
# First, I am going to import the LinearRegression model
from sklearn.linear_model import LinearRegression

# Creating the LinearRegression model
linear = LinearRegression()

# Fitting the model to the training data
linear.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [4]:
#Let's see housing_prepared
housing_prepared
#As we can see, it is an array

array([[ 1.27258656, -1.3728112 ,  0.34849025, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.70916212, -0.87669601,  1.61811813, ...,  0.        ,
         0.        ,  1.        ],
       [-0.44760309, -0.46014647, -1.95271028, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 0.59946887, -0.75500738,  0.58654547, ...,  0.        ,
         0.        ,  0.        ],
       [-1.18553953,  0.90651045, -1.07984112, ...,  0.        ,
         0.        ,  0.        ],
       [-1.41489815,  0.99543676,  1.85617335, ...,  0.        ,
         1.        ,  0.        ]])

In [5]:
#A sample of the data
some_data = housing.iloc[:5]

#A sample of the labels
some_labels = housing_labels.iloc[:5]

some_data_pre = full_pipeline.transform(some_data)

In [6]:
some_data_predict = linear.predict(some_data_pre)
print('The predicted value: ', some_data_predict)
print('The actual label: ', list(some_labels))

The predicted value:  [181746.54359616 290558.74973505 244957.50017771 146498.51061398
 163230.42393939]
The actual label:  [103000.0, 382100.0, 172600.0, 93400.0, 96500.0]
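
To put numbers on how far off these five predictions are, here is a small sketch that uses only the values printed above (the predictions are rounded to two decimals here). It computes the absolute and relative error of each instance:

```python
import numpy as np

# Values copied from the output above (predictions rounded to two decimals)
predicted = np.array([181746.54, 290558.75, 244957.50, 146498.51, 163230.42])
actual = np.array([103000.0, 382100.0, 172600.0, 93400.0, 96500.0])

abs_error = np.abs(predicted - actual)   # error in the units of the label
pct_error = 100 * abs_error / actual     # error relative to the true value

for p, a, e, pct in zip(predicted, actual, abs_error, pct_error):
    print(f"predicted {p:>10.0f} | actual {a:>8.0f} | off by {e:>8.0f} ({pct:.0f}%)")
```

Each of the five predictions misses its label by more than 50,000, which already hints that the linear model is not fitting this data well.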


# Measure this regression model’s RMSE on the whole training set 
* using Scikit-Learn’s mean_squared_error() function:

In [7]:
from sklearn.metrics import mean_squared_error

In [8]:
# Let's first get the predicted values of all the data
housing_prediction = linear.predict(housing_prepared)

#Getting the RMSE on the whole training set (square root of the MSE)
linear_rmse = np.sqrt(mean_squared_error(housing_labels, housing_prediction))

#Let's see the RMSE
linear_rmse

67593.20745775253
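
As a reminder of what this number means, the RMSE is the square root of the mean of the squared differences between labels and predictions. The sketch below checks that definition against `mean_squared_error()` on three toy values that are made up for illustration (they are not taken from the housing data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, made up for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# RMSE written out by hand: sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same quantity through scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, the single error of 30 weighs far more than the two errors of 10, so the RMSE is sensitive to large misses.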

# Judge the RMSE result for this model 
Write down your answer 

**The result is not good: the RMSE is about 67,593, which is a large error for house values, so many predictions are far from the actual values.
A likely reason is that the model is too simple for this data, which is an underfitting situation.
How can we get a better result? We can make the model more complex or switch to a different model, since a linear model does not fit this data well.
Trying a more powerful model is the next step to address the underfitting.**


# Let’s train a Decision Tree Regressor model 
## more powerful model

In [9]:
#Importing DecisionTreeRegressor model
from sklearn.tree import DecisionTreeRegressor 

In [10]:
# Let's build the DecisionTreeRegressor model
DTG_model = DecisionTreeRegressor()

#Fitting the model to the training data
DTG_model.fit(housing_prepared,housing_labels)

# Now evaluate the model on the training set 
* using Scikit-Learn’s mean_squared_error() function:

In [11]:
# Let's first get the predicted values of all the data
housing_prediction = DTG_model.predict(housing_prepared)

#Getting the RMSE on the whole training set (square root of the MSE)
DTG_rmse = np.sqrt(mean_squared_error(housing_labels, housing_prediction))

#Let's see the RMSE
DTG_rmse

0.0

# Explain this result 
Write down your answer

**We got an RMSE of 0.0. At first glance this looks like a great result, as if the model predicts every value without any error, but that is not the case.
The model is complex enough to fit the training data exactly, so it has overfitted.
If we evaluate it on the test data, we should not expect a good result.
It performs well only on the training data because it has memorized it rather than learned patterns that generalize, so it will perform badly on unseen data.**

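
The same effect can be reproduced on a small synthetic dataset. This is only an illustrative sketch: the data comes from `make_regression` with added noise, not from the housing set, and the split sizes and seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression data (assumption: stands in for a real dataset)
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree keeps splitting until every training sample is fitted exactly
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print(f"train RMSE: {train_rmse:.2f}, validation RMSE: {val_rmse:.2f}")
```

The training error is zero because the tree reproduces every training label, while the error on the held-out part is clearly larger. That gap is what overfitting looks like in practice.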

# Evaluation Using Cross-Validation

1-split the training set into 10 distinct subsets then train and evaluate the Decision Tree model

In [12]:
from sklearn.model_selection import cross_val_score

In [13]:
# Using cross_val_score on the DTG model.
# Arguments: the model, the training data, the labels, the scoring metric, and the number of folds (cv)
scores = cross_val_score(DTG_model, housing_prepared, housing_labels, scoring='neg_mean_squared_error', cv=10)
DTG_rmse_scores = np.sqrt(-scores)
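
The minus sign in `np.sqrt(-scores)` is there because Scikit-Learn's cross-validation expects a score where greater is better, so the `neg_mean_squared_error` scorer returns the MSE with its sign flipped. A minimal sketch on synthetic data (the dataset and the 5 folds are arbitrary choices for illustration) shows the raw values:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only to inspect the sign of the scores
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=5)
rmse = np.sqrt(-neg_mse)  # flip the sign back before taking the square root

print("raw scores (negative MSE):", neg_mse)
print("RMSE per fold:", rmse)
```

Taking `np.sqrt(scores)` without the minus sign would produce `nan` values, since the raw scores are negative.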

In [14]:
def scoring(scores):
    print("Scores: ", scores)
    print("Mean: ", scores.mean())
    print("Standard Deviation: ", scores.std())

2- display the resultant scores and calculate its Mean and Standard deviation

In [15]:
scoring(DTG_rmse_scores)

Scores:  [64002.9650099  71592.45959843 68821.26977903 68974.2088521
 72863.25673388 67485.62272921 66614.24388118 67568.66729509
 66917.84674238 70840.23048664]
Mean:  68568.0771107848
Standard Deviation:  2500.7736488475452


3-repeat the same steps to compute the same scores for the Linear Regression model 

*notice the difference between the results of the two models*

In [16]:
# Using cross_val_score on the linear model.
# Arguments: the model, the training data, the labels, the scoring metric, and the number of folds (cv)
scores = cross_val_score(linear, housing_prepared, housing_labels, scoring='neg_mean_squared_error', cv=10)
linear_rmse_scores = np.sqrt(-scores)

In [17]:
scoring(linear_rmse_scores)

Scores:  [65000.67382615 70960.56056304 67122.63935124 66089.63153865
 68402.54686442 65266.34735288 65218.78174481 68525.46981754
 72739.87555996 68957.34111906]
Mean:  67828.38677377408
Standard Deviation:  2468.0913950652257
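
One way to read the difference between the two models is to compare the gap between their mean scores with the spread across folds. The sketch below uses the means and standard deviations printed above, rounded to two decimals:

```python
# Cross-validation results copied from the outputs above (rounded)
tree_mean, tree_std = 68568.08, 2500.77
lin_mean, lin_std = 67828.39, 2468.09

gap = tree_mean - lin_mean
print(f"Decision tree mean RMSE exceeds linear regression by {gap:.2f}")
print(f"Gap as a fraction of the tree's fold-to-fold std: {gap / tree_std:.2f}")
```

The gap is smaller than one standard deviation of either model's scores, so on this run the decision tree is not clearly better than linear regression, even though its training RMSE was 0.0.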


## Let’s train one last model the RandomForestRegressor.

In [18]:
#Importing RandomForestRegressor model
from sklearn.ensemble import RandomForestRegressor

#Creating RandomForestRegressor model
rf_model = RandomForestRegressor()


#Fitting RandomForestRegressor model
rf_model.fit(housing_prepared,housing_labels)

In [19]:
# Let's first get the predicted values of all the data
rf_housing_prediction = rf_model.predict(housing_prepared)

#Getting the RMSE on the whole training set (square root of the MSE)
rf_rmse = np.sqrt(mean_squared_error(housing_labels, rf_housing_prediction))

#Let's see the RMSE
rf_rmse

18340.091210262413

# Repeat the same steps to compute the scores, their mean, and their standard deviation for the Random Forest model

In [20]:
# Using cross_val_score on the rf model.
# Arguments: the model, the training data, the labels, the scoring metric, and the number of folds (cv)
rf_scores = cross_val_score(rf_model, housing_prepared, housing_labels, scoring='neg_mean_squared_error', cv=10)
rf_rmse_scores = np.sqrt(-rf_scores)

In [21]:
scoring(rf_rmse_scores)

Scores:  [47478.21282627 51920.08486426 49755.50555895 52183.81831557
 52510.78701487 46929.76935639 47391.52152942 50787.77331249
 49364.65720742 49788.62472512]
Mean:  49811.075471076736
Standard Deviation:  1952.4729437212661
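
To see the three models side by side, the sketch below collects the training RMSE and the cross-validation results printed so far (rounded to two decimals) in a small table. The column names are made up for this summary.

```python
import pandas as pd

# Numbers copied from the outputs above (rounded); column names are illustrative
summary = pd.DataFrame(
    {
        "train_rmse": [67593.21, 0.0, 18340.09],
        "cv_rmse_mean": [67828.39, 68568.08, 49811.08],
        "cv_rmse_std": [2468.09, 2500.77, 1952.47],
    },
    index=["LinearRegression", "DecisionTreeRegressor", "RandomForestRegressor"],
)

# A large positive gap means the model does much better on data it has seen
summary["cv_minus_train"] = summary["cv_rmse_mean"] - summary["train_rmse"]
print(summary.sort_values("cv_rmse_mean"))
```

The random forest has the lowest cross-validation error, but its training error is still far below its validation error, which suggests it is also overfitting the training set to some degree.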


# Save every model you experiment with 
*using the joblib library*

In [22]:
# Importing the joblib library
import joblib

#Saving the LinearRegression model
joblib.dump(linear, 'linear_model.sav')

#Saving the Decision Tree Regressor model
joblib.dump(DTG_model, 'tree_model.sav')

#Saving the RandomForestRegressor
joblib.dump(rf_model, 'rf_model.sav')

['rf_model.sav']
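
A saved model is only useful if it can be loaded back. The sketch below shows the round trip with `joblib.load()` on a tiny stand-in model. The file name `demo_model.sav` and the toy data are hypothetical, and the file is written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model trained on toy data (not the housing models above)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")  # hypothetical file name
    joblib.dump(model, path)        # save the fitted model
    restored = joblib.load(path)    # load it back as a new object

# The restored model should give the same predictions as the original
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same_predictions)
```

The same `joblib.load()` call should work for the three files saved above, provided the same library versions are installed when loading.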

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor 
*It may take a long time*

In [23]:
from sklearn.model_selection import GridSearchCV

In [24]:
#Creating the params
param_grid = [{'n_estimators': [30,40,50, 60, 70, 80], 'max_features':[2,4,6,8, 10]},
              {'bootstrap':[False], 'max_features':[2,3,4],'n_estimators':[3,10]}]


#Creating the model with random state of 101
forest_reg = RandomForestRegressor(random_state = 101)


#Creating the GridSearch
grid_search = GridSearchCV(forest_reg, param_grid,
                           cv = 5, 
                           scoring = 'neg_mean_squared_error',
                           return_train_score = True)


#Fitting the grid search with the training data
grid_search.fit(housing_prepared, housing_labels)
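
To get a feel for why this search may take a long time, the sketch below counts the combinations in the same `param_grid` with Scikit-Learn's `ParameterGrid` and multiplies by the number of folds:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = [
    {'n_estimators': [30, 40, 50, 60, 70, 80], 'max_features': [2, 4, 6, 8, 10]},
    {'bootstrap': [False], 'max_features': [2, 3, 4], 'n_estimators': [3, 10]},
]

n_combinations = len(ParameterGrid(param_grid))  # 6*5 from the first dict + 1*3*2 from the second
n_folds = 5                                      # cv=5 in the GridSearchCV call
n_fits = n_combinations * n_folds

print(f"{n_combinations} combinations x {n_folds} folds = {n_fits} model fits")
```

With the default `refit=True`, the search also trains the best combination one more time on the whole training set.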

Display the best hyperparameters along with the evaluation scores

In [25]:
#Let's see the best params from our grid search 
grid_search.best_params_

{'max_features': 6, 'n_estimators': 80}

In [26]:
#Let's see the best estimator from the grid model
grid_search.best_estimator_

In [27]:
#Let's see the cv of the grid search
cvres = grid_search.cv_results_

#Creating a for loop to loop on them
for mean_score, params in zip(cvres["mean_test_score"],cvres["params"]):
  print(np.sqrt(-mean_score), params)

52647.71818320534 {'max_features': 2, 'n_estimators': 30}
52199.067803908736 {'max_features': 2, 'n_estimators': 40}
52041.081807395756 {'max_features': 2, 'n_estimators': 50}
51835.76597582866 {'max_features': 2, 'n_estimators': 60}
51621.458142041374 {'max_features': 2, 'n_estimators': 70}
51589.31830462255 {'max_features': 2, 'n_estimators': 80}
50557.01216622696 {'max_features': 4, 'n_estimators': 30}
50112.930374175354 {'max_features': 4, 'n_estimators': 40}
49720.21470018021 {'max_features': 4, 'n_estimators': 50}
49563.380218093196 {'max_features': 4, 'n_estimators': 60}
49553.859730226875 {'max_features': 4, 'n_estimators': 70}
49489.12002049985 {'max_features': 4, 'n_estimators': 80}
49584.07098872235 {'max_features': 6, 'n_estimators': 30}
49245.281630288235 {'max_features': 6, 'n_estimators': 40}
49153.86720515648 {'max_features': 6, 'n_estimators': 50}
49048.007787820694 {'max_features': 6, 'n_estimators': 60}
49053.661562079724 {'max_features': 6, 'n_estimators': 70}
48977

# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [28]:
# Getting the feature importance
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([7.78450910e-02, 7.02308742e-02, 4.00780096e-02, 1.71141014e-02,
       1.60168544e-02, 1.73678730e-02, 1.56681037e-02, 3.32241929e-01,
       5.31805213e-02, 1.08498357e-01, 7.95187478e-02, 1.09370432e-02,
       1.53632371e-01, 1.83905065e-04, 2.86264757e-03, 4.62356976e-03])
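
Two quick sanity checks on this array, using the values exactly as printed above: the importances of a random forest are normalized, so they should sum to about 1, and there should be one value per column of `housing_prepared` (8 numeric attributes, 3 added by `CombinedAttributesAdder`, and the remaining ones from the one-hot encoding of `ocean_proximity`).

```python
import numpy as np

# Values copied from the output above
feature_importances = np.array([
    7.78450910e-02, 7.02308742e-02, 4.00780096e-02, 1.71141014e-02,
    1.60168544e-02, 1.73678730e-02, 1.56681037e-02, 3.32241929e-01,
    5.31805213e-02, 1.08498357e-01, 7.95187478e-02, 1.09370432e-02,
    1.53632371e-01, 1.83905065e-04, 2.86264757e-03, 4.62356976e-03])

n_numeric, n_added = 8, 3
n_one_hot = len(feature_importances) - n_numeric - n_added

print("sum of importances:", feature_importances.sum())
print("one-hot columns:", n_one_hot)
print("index of the most important feature:", feature_importances.argmax())
```

Any list of attribute names paired with these scores therefore needs 16 entries, in the same column order the pipeline produces.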

2-display these importance scores next to their corresponding attribute names:

In [29]:
# Name the columns in the order the pipeline emits them: numeric attributes,
# the attributes added by CombinedAttributesAdder, then the one-hot categories
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
for importance, name in sorted(zip(feature_importances, attributes), reverse=True):
    print(f"{importance:.4f}  {name}")

0.3322  median_income
0.1536  INLAND
0.1085  population_per_household
0.0795  bedrooms_per_room
0.0778  longitude
0.0702  latitude
0.0532  rooms_per_household
0.0401  housing_median_age
0.0174  population
0.0171  total_rooms
0.0160  total_bedrooms
0.0157  households
0.0109  <1H OCEAN
0.0046  NEAR OCEAN
0.0029  NEAR BAY
0.0002  ISLAND

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [30]:
# Getting the data of the test set

X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

2-run your full_pipeline to transform the data

In [31]:
# Use the full pipeline to transform the test data; we only call transform() on the test set, never fit().
X_test_prepared = full_pipeline.transform(X_test)

3-evaluate the final model on the test set

In [32]:
# Creating the final model from the best estimator of the grid search
final_model = grid_search.best_estimator_

#Predicting on the testing data
final_predictions = final_model.predict(X_test_prepared)

#Getting the rmse score of it
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))

#Print the result
final_rmse

48524.03240054631

# compute a 95% confidence interval for the generalization error 
*using scipy.stats.t.interval():*

In [33]:
from scipy import stats

In [34]:
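squared_errors = (final_predictions - y_test) ** 2
# Note: this interval is for the mean squared error; take np.sqrt() of both bounds to express it as an RMSE
stats.t.interval(.95, len(squared_errors) - 1, loc=squared_errors.mean(), scale=stats.sem(squared_errors))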

(2146865455.023457, 2562297985.7950788)
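
The interval above is expressed in squared units because it was computed on the squared errors. A minimal sketch, using only the two bounds and the test RMSE printed above, converts it to the same units as `final_rmse`:

```python
import numpy as np

# Bounds copied from the output above (interval for the mean squared error)
mse_interval = np.array([2146865455.023457, 2562297985.7950788])

# The square root is monotonic, so it maps the MSE bounds to RMSE bounds
rmse_lower, rmse_upper = np.sqrt(mse_interval)

final_rmse = 48524.03240054631  # test-set RMSE printed earlier
print(f"95% interval for the RMSE: [{rmse_lower:.2f}, {rmse_upper:.2f}]")
print("final_rmse inside the interval:", rmse_lower < final_rmse < rmse_upper)
```

This gives a range of roughly 46,300 to 50,600, which contains the test RMSE of about 48,524.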

# Great Job!
# #shAI_Club