![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand ch2 of the Hands-On ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book and feel free to ask about anything in the messenger group as you go along.**

## Good Luck : )

## First, run the following cell, which repeats the first part of the project, so you can continue your work

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [2]:
import os
import tarfile
import urllib.request  # urlretrieve lives in the request submodule, which must be imported explicitly
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
   
fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model 

In [3]:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [4]:
some_data = housing_prepared[:5]
some_labels = housing_labels.iloc[:5]

In [5]:
y_pred= linreg.predict(some_data)
y_pred

array([181746.54359616, 290558.74973505, 244957.50017771, 146498.51061398,
       163230.42393939])

# Measure this regression model’s RMSE on the whole training set 
* using Scikit-Learn’s mean_squared_error() function:

In [6]:
from sklearn.metrics import mean_squared_error

In [7]:
pred=linreg.predict(housing_prepared)

In [8]:
lin_rmse = mean_squared_error(housing_labels,pred, squared=False)
lin_rmse

67593.20745775253
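
As a quick check on what this number means: RMSE is the square root of the mean of the squared differences between labels and predictions. Here is a minimal sketch computed by hand; the three labels and predictions are made-up values for illustration, not taken from the housing data.

```python
import math

# Illustrative values only (not taken from the housing data set).
labels = [200000.0, 150000.0, 300000.0]
predictions = [210000.0, 140000.0, 330000.0]

# Square each error, average the squares, then take the square root.
squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)

print(round(rmse, 2))
```

Because the errors are squared before averaging, the single 30,000 miss dominates the result, which is why RMSE is sensitive to large errors.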

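# Comment on the RMSE result for this model 

The RMSE is about 67,593, so a typical prediction on the training set is off by roughly $67,600. An error this large on the very data the model was trained on suggests that the model is underfitting.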

# Let’s train a Decision Tree Regressor model 
## more powerful model

In [9]:
from sklearn.tree import DecisionTreeRegressor 

In [10]:
dt = DecisionTreeRegressor()
dt.fit(housing_prepared, housing_labels)

In [11]:
dtpred=dt.predict(housing_prepared)

# Now evaluate the model on the training set 
* using Scikit-Learn’s mean_squared_error() function:

In [12]:
dt_rmse = mean_squared_error(housing_labels,dtpred, squared=False)
dt_rmse

0.0

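# Explain this result 
Write down your answer:

0.0

I believe the model has badly overfit the training data: an unconstrained decision tree can memorize every training example, so its training error is 0. This says little about how the model will perform on new data.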
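
The effect can be reproduced on a small synthetic problem. This is a sketch, assuming a noisy sine curve as stand-in data (it is not the housing set): a tree with no depth limit reaches zero training error, while its error on held-out data stays well above zero.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

# With no depth limit the tree keeps splitting until each leaf holds a
# single training example, so it reproduces the training labels exactly.
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print(train_rmse, val_rmse)
```

The gap between the two numbers, not the training error alone, is the signal of overfitting.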

# Evaluation Using Cross-Validation

1- Split the training set into 10 distinct subsets (folds), then train and evaluate the Decision Tree model on them

In [13]:
from sklearn.model_selection import cross_val_score

In [14]:
tree_rmses = -cross_val_score(dt, housing_prepared, housing_labels,scoring="neg_root_mean_squared_error",cv=10)

2- Display the resulting scores and calculate their mean and standard deviation

In [15]:
pd.Series(tree_rmses).describe()

count       10.000000
mean     69299.861076
std       2793.848180
min      65579.044848
25%      67002.131314
50%      69381.412220
75%      71046.933783
max      73568.374974
dtype: float64
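
To make the mechanics concrete, here is a rough sketch of what `cross_val_score` does with `cv=10` for a regressor: it splits the data into 10 consecutive folds, fits a fresh clone of the model on 9 of them, and scores it on the remaining one. The data below is synthetic and the variable names are illustrative, not part of the exercise.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression()
manual_rmses = []
for train_idx, val_idx in KFold(n_splits=10).split(X):
    fold_model = clone(model)                      # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])     # train on 9 folds
    fold_pred = fold_model.predict(X[val_idx])     # evaluate on the held-out fold
    manual_rmses.append(np.sqrt(mean_squared_error(y[val_idx], fold_pred)))
manual_rmses = np.array(manual_rmses)

# The scorer returns negated RMSE (higher is better), hence the minus sign.
library_rmses = -cross_val_score(
    model, X, y, scoring="neg_root_mean_squared_error", cv=10)

# ddof=1 matches the sample standard deviation reported by pandas describe().
print(manual_rmses.mean(), manual_rmses.std(ddof=1))
```

Note that the original `model` is never fitted by this procedure; only the per-fold clones are.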

3- Repeat the same steps to compute the same scores for the Linear Regression model 

*notice the difference between the results of the two models*

In [16]:
l_rmses = -cross_val_score(linreg, housing_prepared, housing_labels, scoring="neg_root_mean_squared_error",cv=10)
pd.Series(l_rmses).describe()

count       10.000000
mean     67828.386774
std       2601.596761
min      65000.673826
25%      65472.168399
50%      67762.593108
75%      68849.373294
max      72739.875560
dtype: float64

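The Decision Tree's mean cross-validation RMSE (about 69,300) is higher than the Linear Regression's (about 67,828). Despite its zero training error, the tree generalizes slightly worse than the linear model.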

## Let’s train one last model the RandomForestRegressor.

In [17]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_rmses = -cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_root_mean_squared_error", cv=10)

# Repeat the same steps to compute the scores, with their mean and standard deviation, for the Random Forest model

In [18]:
pd.Series(forest_rmses).describe()

count       10.000000
mean     49748.594017
std       2065.691211
min      46743.263650
25%      47928.334305
50%      49864.644301
75%      51421.272216
max      52201.538248
dtype: float64

# Save every model you experiment with 
*using the joblib library*

In [19]:
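import joblib

# cross_val_score() fits clones, so forest_reg itself is still unfitted;
# fit it on the full training set before saving it.
forest_reg.fit(housing_prepared, housing_labels)

joblib.dump(linreg, 'regression_model.joblib')
joblib.dump(dt, 'dt_model.joblib')
joblib.dump(forest_reg, 'rf_model.joblib')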

['rf_model.joblib']
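
A saved model is only useful if it can be loaded back and still produces the same predictions. The sketch below shows that round trip on a toy model; the file name `toy_model.joblib` and the tiny data set are placeholders chosen for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Write the fitted model to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model carries the learned coefficients with it.
restored_pred = restored.predict(np.array([[4.0]]))
print(restored_pred)
```

Only a fitted estimator carries learned parameters, so it is worth checking that each model has been fitted before it is dumped.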

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor 
*It may take a long time*

In [None]:
from sklearn.model_selection import GridSearchCV

In [1]:
param_grid ={'n_estimators': [100, 150, 200, 250, 300],'max_depth': [4,5,6,7],}
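
To get a feel for why the search may take a long time, the number of model fits can be counted before running it. This sketch enumerates the combinations in the grid above with `itertools.product`; `cv_folds = 3` mirrors the `cv=3` argument used in this notebook.

```python
from itertools import product

param_grid = {'n_estimators': [100, 150, 200, 250, 300],
              'max_depth': [4, 5, 6, 7]}
cv_folds = 3

# Every combination of one value per hyperparameter.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

# Each combination is trained once per cross-validation fold.
n_fits = len(combinations) * cv_folds
print(len(combinations), n_fits)
```

With the default `refit=True`, `GridSearchCV` also trains the best combination one more time on the whole training set after the search.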

with the evaluation scores

In [2]:
from sklearn.model_selection import GridSearchCV  # imported here too, so this cell runs on its own

grid_search = GridSearchCV(forest_reg, param_grid, cv=3, scoring='neg_root_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

In [23]:
grid_search.best_params_

{'max_depth': 4, 'n_estimators': 200}

In [24]:
grid_search.best_score_

-67188.60246780199
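
The score is negative because the `neg_root_mean_squared_error` scorer returns the negated RMSE, so flipping the sign gives the RMSE itself. The score of every combination, not just the best one, is stored in `cv_results_`. The sketch below illustrates this on synthetic data with a deliberately tiny grid; `small_grid` and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=42), small_grid,
                      cv=3, scoring="neg_root_mean_squared_error")
search.fit(X, y)

# One mean score per hyperparameter combination, averaged over the folds.
results = search.cv_results_
for mean_score, params in zip(results["mean_test_score"], results["params"]):
    print(round(-mean_score, 3), params)

best_rmse = -search.best_score_  # flip the sign to read it as an RMSE
```

Comparing `best_rmse` with the cross-validation RMSE of the untuned model shows whether the chosen grid actually improved on the defaults.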

# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [25]:
final_model = grid_search.best_estimator_
final_model

In [30]:
feature_importances =final_model.feature_importances_

feature_importances.round(2)

array([0.  , 0.  , 0.01, 0.  , 0.  , 0.  , 0.  , 0.68, 0.  , 0.1 , 0.  ,
       0.  , 0.21, 0.  , 0.  , 0.  ])

2-display these importance scores next to their corresponding attribute names:

In [59]:
cat_encoder = full_pipeline.named_transformers_['cat']
cat_feature_names = cat_encoder.get_feature_names_out(cat_attribs)
cat_feature_names

array(['ocean_proximity_<1H OCEAN', 'ocean_proximity_INLAND',
       'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY',
       'ocean_proximity_NEAR OCEAN'], dtype=object)

In [61]:
# Column order produced by full_pipeline: the numerical attributes, then the three
# attributes added by CombinedAttributesAdder, then the one-hot encoded categories.
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
all_col = num_attribs + extra_attribs + list(cat_feature_names)

In [63]:
sorted(zip(feature_importances, all_col), reverse=True)

[(0.6788199764725953, 'median_income'),
 (0.20821394071744867, 'ocean_proximity_INLAND'),
 (0.10162290689607484, 'population_per_household'),
 (0.007807058493052813, 'housing_median_age'),
 (0.0010776993443373357, 'total_bedrooms'),
 (0.0008094180635703017, 'households'),
 (0.0006849001273619896, 'rooms_per_household'),
 (0.0003536260391544539, 'population'),
 (0.0003448189600092487, 'bedrooms_per_room'),
 (0.00020749420171235257, 'total_rooms'),
 (3.8955531918760824e-05, 'longitude'),
 (1.920515276404049e-05, 'latitude'),
 (0.0, 'ocean_proximity_NEAR OCEAN'),
 (0.0, 'ocean_proximity_NEAR BAY'),
 (0.0, 'ocean_proximity_ISLAND'),
 (0.0, 'ocean_proximity_<1H OCEAN')]
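
When pairing scores with names, keep in mind that `zip()` stops at the shorter input, so a name list with the wrong length mislabels or drops features without raising an error. A small guard catches this. In the sketch below, `rank_features` is a hypothetical helper and the scores and names are made up for illustration.

```python
# Hypothetical importance scores and feature names, for illustration only.
importances = [0.10, 0.60, 0.05, 0.25]
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]


def rank_features(importances, feature_names):
    """Return (score, name) pairs sorted from most to least important."""
    # zip() would silently truncate to the shorter input, so check first.
    if len(importances) != len(feature_names):
        raise ValueError(
            f"{len(importances)} scores but {len(feature_names)} names")
    return sorted(zip(importances, feature_names), reverse=True)


ranking = rank_features(importances, feature_names)
print(ranking[0])
```

The length check does not catch names listed in the wrong order, so the name list should still follow the column order that the preprocessing pipeline produces.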

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [None]:
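X_test = test_set.drop("median_house_value", axis=1)   # test_set comes from train_test_split above
y_test = test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)      # the model expects prepared features
final_predictions = final_model.predict(X_test_prepared)
final_rmse = mean_squared_error(y_test, final_predictions, squared=False)
print(final_rmse)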

2-run your full_pipeline to transform the data

In [74]:
housing_tst.head()

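|       | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | ocean_proximity |
|-------|-----------|----------|--------------------|-------------|----------------|------------|------------|---------------|-----------------|
| 20046 | -119.01 | 36.06 | 25.0 | 1505.0 | NaN | 1392.0 | 359.0 | 1.6812 | INLAND |
| 3024 | -119.46 | 35.14 | 30.0 | 2943.0 | NaN | 1565.0 | 584.0 | 2.5313 | INLAND |
| 15663 | -122.44 | 37.8 | 52.0 | 3830.0 | NaN | 1310.0 | 963.0 | 3.4801 | NEAR BAY |
| 20484 | -118.72 | 34.28 | 17.0 | 3051.0 | NaN | 1705.0 | 495.0 | 5.7376 | <1H OCEAN |
| 9814 | -121.93 | 36.62 | 34.0 | 2351.0 | NaN | 1063.0 | 428.0 | 3.725 | NEAR OCEAN |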


In [75]:
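housing_tst = test_set.drop("median_house_value", axis=1)
housing_labels_tst = test_set["median_house_value"].copy()
# Use transform(), not fit_transform(): the pipeline must keep the statistics
# it learned from the training set rather than being refit on the test set.
test_prep = full_pipeline.transform(housing_tst)
final_predictions = final_model.predict(test_prep)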
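
A preprocessing pipeline should learn its statistics (medians, means, standard deviations) from the training set only and then reuse them on the test set. The sketch below contrasts the two calls with a lone `StandardScaler` on toy arrays (illustrative values, not the housing data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays (illustrative only): the test values lie far from the training range.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[10.0], [12.0]])

scaler = StandardScaler().fit(train)   # mean and std come from the training data
test_scaled = scaler.transform(test)   # test data is scaled with those same values

# Refitting on the test data instead recentres it around its own mean.
refit_scaled = StandardScaler().fit_transform(test)

print(test_scaled.ravel(), refit_scaled.ravel())
```

After refitting, the two test points look like ordinary values around zero, so the model would no longer see how far they are from the data it was trained on.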

In [76]:
final_predictions

array([ 82358.14025756, 104190.63278717, 292056.32887693, ...,
       478359.67486793, 104190.63278717, 199472.34633982])

3-evaluate the final model on the test set

In [77]:
fin_rmse = mean_squared_error(housing_labels_tst, final_predictions,squared=False)
fin_rmse

71413.30659409052

# compute a 95% confidence interval for the generalization error 
*using scipy.stats.t.interval():*

In [78]:
from scipy import stats

In [80]:
confidence = 0.95
squared_errors = (final_predictions - housing_labels_tst) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([69101.07254641, 73652.986975  ])
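
The interval returned by `stats.t.interval()` is the sample mean of the squared errors plus or minus a critical t value times the standard error of that mean; taking the square root of both ends converts it to RMSE units. Here is a sketch of the same computation done by hand, using synthetic squared errors as a stand-in for the real ones.

```python
import numpy as np
from scipy import stats

# Synthetic squared errors (an assumption for illustration).
rng = np.random.default_rng(0)
squared_errors = rng.normal(loc=0.0, scale=50000.0, size=500) ** 2

confidence = 0.95
n = len(squared_errors)
mean = squared_errors.mean()
sem = stats.sem(squared_errors)                        # standard error of the mean
t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)   # two-sided critical value

# Interval on the mean squared error, converted to RMSE units by the square root.
manual_interval = np.sqrt([mean - t_crit * sem, mean + t_crit * sem])
library_interval = np.sqrt(
    stats.t.interval(confidence, n - 1, loc=mean, scale=sem))
print(manual_interval)
```

The square root is applied after the interval is built, so the interval is symmetric around the mean squared error but not around the RMSE.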

# Great Job!
# #shAI_Club