<td>
  <a href="https://colab.research.google.com/github/omarcevi/End2End_ML_Project/blob/master/end_to_end_machine_learning_project_blank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</td>

# **End to End Data Science / Machine Learning Project**

Credit to Aurélien Géron for the original notebook.

This notebook is a template for a machine learning project. It is a starting point for your own projects rather than a complete solution to your problem, so you will need to modify it to suit your needs.

## Docs and resources

* [Scikit-Learn](http://scikit-learn.org/stable/)
* [Pandas](https://pandas.pydata.org/)
* [Matplotlib](https://matplotlib.org/)
* [Seaborn](https://seaborn.pydata.org/)
* [NumPy](http://www.numpy.org/)
* [SciPy](https://www.scipy.org/)

In [None]:
print("Welcome to Machine Learning!")

This project requires Python 3.7 or above:

In [None]:
import sys

assert sys.version_info >= (3, 7)

It also requires Scikit-Learn ≥ 1.0.1:

In [None]:
from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

# Get the Data

*Welcome to AI Corp.! Your task is to predict median house values in Californian districts, given a number of features from these districts.*

## Download the Data

In [None]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/omarcevi/End2End_ML_Project/raw/master/datasets/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()

## Take a Quick Look at the Data Structure

In [None]:
# Check the head of the data
# .head() returns the first 5 rows of the data
## YOUR CODE HERE

## END CODE

In [None]:
# Check the info of the data
# .info() is a method that returns a concise summary of the dataframe
## YOUR CODE HERE

## END CODE

In [None]:
# Check the value counts of the ocean_proximity column
# you can select a column by using the column name as a key
# .value_counts() is a method that returns a Series containing counts of unique values
## YOUR CODE HERE

## END CODE

In [None]:
# Check the description of the data
# .describe() is a method that computes a summary of statistics pertaining to the DataFrame columns
## YOUR CODE HERE

## END CODE

The following cell creates the `images/end_to_end_project` folder (if it doesn't already exist) and defines the `save_fig()` function, which is used throughout this notebook to save the figures in high-res for the book.

In [None]:
# extra code – code to save the figures as high-res PNGs for the book

IMAGES_PATH = Path() / "images" / "end_to_end_project"
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [None]:
import matplotlib.pyplot as plt

# extra code – the next 5 lines define the default font sizes
plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

# Plot the histogram of the data
# .hist() is a method that plots a histogram
## YOUR CODE HERE

## END CODE
save_fig("attribute_histogram_plots")  # extra code
plt.show()

## Create a Test Set

To ensure that this notebook's outputs remain the same every time we run it, we need to set the random seed:

In [None]:
import numpy as np

np.random.seed(42)

Note: another source of randomness is the order of Python sets: it is based on Python's `hash()` function, which is randomly "salted" when Python starts up (this started in Python 3.3, to prevent some denial-of-service attacks). To remove this randomness, the solution is to set the `PYTHONHASHSEED` environment variable to `"0"` _before_ Python even starts up. Nothing will happen if you do it after that. Luckily, if you're running this notebook on Colab, the variable is already set for you.
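As a minimal sketch of what the seed does (meant to be run on its own, since it resets NumPy's global random state): reseeding with the same value replays the same sequence of pseudo-random numbers. The last lines simply read the current `PYTHONHASHSEED` setting; the variable names are illustrative only.

```python
import os
import numpy as np

np.random.seed(42)
first_draw = np.random.rand(3)

np.random.seed(42)  # reseeding with the same value restarts the same sequence
second_draw = np.random.rand(3)

same_sequence = bool(np.array_equal(first_draw, second_draw))
print(same_sequence)

# None means the variable is unset, in which case Python salts str hashes
# randomly each time it starts up
hash_seed = os.environ.get("PYTHONHASHSEED")
print(hash_seed)
```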

In [None]:
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
# train_test_split() is a function that splits arrays or matrices into random train and test subsets
## YOUR CODE HERE
train_set, test_set = 
## END CODE

In [None]:
# Cut the median income into 5 categories
# .cut() is a function that bins values into discrete intervals
## YOUR CODE HERE
housing["income_cat"] = 
## END CODE

In [None]:
# Plot the income category
# .value_counts() is a method that returns a Series containing counts of unique values
# .sort_index() is a method that sorts the object by labels (along an axis)
# .plot.bar() is a method that makes a bar plot
## YOUR CODE HERE

## END CODE
plt.xlabel("Income category")
plt.ylabel("Number of districts")
save_fig("housing_income_cat_bar_plot")  # extra code
plt.show()

You can get a single stratified split directly from `train_test_split()` by passing the `stratify` argument:

In [None]:
# Split the data into train and test sets using stratified sampling
# use stratify=housing["income_cat"] to make a stratified split based on the income category
## YOUR CODE HERE
strat_train_set, strat_test_set = 
## END CODE
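The idea behind stratified sampling can be sketched on toy data. This is only an illustration under stated assumptions: the `toy` DataFrame, its evenly spaced incomes, and the three bin edges are made up and are not the housing dataset or the bins you should use in the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 districts with evenly spaced median incomes
toy = pd.DataFrame({"median_income": np.linspace(0.5, 8.0, 100)})

# Bin the continuous income into 3 labeled categories (arbitrary bin edges)
toy["income_cat"] = pd.cut(toy["median_income"],
                           bins=[0., 2.5, 4.5, np.inf],
                           labels=[1, 2, 3])

# stratify= keeps the category proportions (almost) identical in both subsets
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, stratify=toy["income_cat"], random_state=42)

overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()

# Largest difference between the overall and test-set category proportions
max_gap = float(np.abs(overall.to_numpy() - in_test.to_numpy()).max())
print(max_gap)
```

With a purely random split, nothing constrains the category proportions of the test set, so the gap can be larger, especially on small datasets.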

In [None]:
# extra code – computes the data for Figure 2–10

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall %": income_cat_proportions(housing),
    "Stratified %": income_cat_proportions(strat_test_set),
    "Random %": income_cat_proportions(test_set),
}).sort_index()
compare_props.index.name = "Income Category"
compare_props["Strat. Error %"] = (compare_props["Stratified %"] /
                                   compare_props["Overall %"] - 1)
compare_props["Rand. Error %"] = (compare_props["Random %"] /
                                  compare_props["Overall %"] - 1)
(compare_props * 100).round(2)

In [None]:
# drop the income_cat column
# .drop() is a method that drops specified labels from rows or columns
## YOUR CODE HERE
for set_ in (strat_train_set, strat_test_set):
    
## END CODE

# Discover and Visualize the Data to Gain Insights

In [None]:
# Make a copy of the training set
# .copy() is a method that makes a copy of the object
## YOUR CODE HERE
housing = 
## END CODE

## Visualizing Geographical Data

In [None]:
# Plot the longitude and latitude of the data
# .plot() is a method that plots the data
# use kind="scatter" to make a scatter plot
## YOUR CODE HERE

## END CODE
save_fig("bad_visualization_plot")  # extra code
plt.show()

In [None]:
# Plot the longitude and latitude of the data with alpha=0.1
# use alpha=0.1 to make it easier to visualize the places where there is a high density of data points
## YOUR CODE HERE

## END CODE
save_fig("better_visualization_plot")  # extra code
plt.show()

In [None]:
# Here is a better way to visualize the geographical data:
# the marker size shows the population and the color shows the median house value
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True,
             s=housing["population"] / 100, label="population",
             c="median_house_value", cmap="jet", colorbar=True,
             legend=True, sharex=False, figsize=(10, 7))
save_fig("housing_prices_scatterplot")  # extra code
plt.show()

## Looking for Correlations

In [None]:
# Compute the standard correlation coefficient (Pearson's r)
# .corr() is a method that computes pairwise correlation of columns, excluding NA/null values
## YOUR CODE HERE
corr_matrix = 
## END CODE

In [None]:
# Check how strongly each attribute correlates with the median house value
# .sort_values() is a method that sorts the object by the values along an axis
## YOUR CODE HERE

## END CODE

In [None]:
from pandas.plotting import scatter_matrix

# Plot the scatter matrix of the data
# scatter_matrix() is a function that plots a matrix of scatter plots
## YOUR CODE HERE
attributes =


## END CODE
save_fig("scatter_matrix_plot")  # extra code
plt.show()

In [None]:
# Plot the scatter plot of the median income and the median house value
# use alpha=0.1 to make it easier to visualize the places where there is a high density of data points
## YOUR CODE HERE


## END CODE
save_fig("income_vs_house_value_scatterplot")  # extra code
plt.show()

## Experimenting with Attribute Combinations

In [None]:
# Create new features
## YOUR CODE HERE
housing["rooms_per_house"] = 
housing["bedrooms_ratio"] = 
housing["people_per_house"] = 
## END CODE

In [None]:
# Compute the standard correlation coefficient (Pearson's r) for new features
## YOUR CODE HERE


## END CODE

# Prepare the Data for Machine Learning Algorithms

Let's revert to the original training set and separate the target. Note that `strat_train_set.drop()` creates a copy of `strat_train_set` without the column; it doesn't actually modify `strat_train_set` itself, unless you pass `inplace=True`:

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
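The copy semantics of `drop()` can be checked on a small example. This is a sketch on a made-up two-row DataFrame (`toy`, `features` and `labels` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical toy data standing in for strat_train_set
toy = pd.DataFrame({"rooms": [3, 4],
                    "median_house_value": [100_000, 200_000]})

features = toy.drop("median_house_value", axis=1)  # new DataFrame without the target
labels = toy["median_house_value"].copy()

print(list(features.columns))  # the target column is gone from the copy
print(list(toy.columns))       # the original DataFrame still has it
```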

## Data Cleaning

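There are three common options for handling the missing values (NaN) in `total_bedrooms`: drop the affected districts (option 1), drop the whole attribute (option 2), or fill the missing values with some value such as the median (option 3):

```python
housing.dropna(subset=["total_bedrooms"], inplace=True)    # option 1

housing.drop("total_bedrooms", axis=1)       # option 2

median = housing["total_bedrooms"].median()  # option 3
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
```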

or you can use the `SimpleImputer` class from Scikit-Learn:

In [None]:
from sklearn.impute import SimpleImputer

# Create an imputer
# SimpleImputer is a class that imputes missing values
# use strategy="median" to replace missing values using the median along each column
## YOUR CODE HERE
imputer = 
## END CODE

Separate out the numerical attributes first, since the `"median"` strategy can only be computed on numerical attributes, not on text attributes like `ocean_proximity`:

In [None]:
# Separate the numerical attributes
# .select_dtypes() is a method that returns a subset of the DataFrame’s columns based on the column dtypes
## YOUR CODE HERE
housing_num = 
## END CODE

In [None]:
# Fit the imputer to housing_num
# .fit() is a method that fits the imputer to the data
## YOUR CODE HERE

## END CODE

In [None]:
# Check the statistics of the imputer
# .statistics_ is an attribute that returns the statistics of the imputer
## YOUR CODE HERE

## END CODE

Check that this is the same as manually computing the median of each attribute:

In [None]:
# Compute the median of each numerical attribute
# .median() is a method that returns the median of the values for the requested axis
# .values is an attribute that returns a NumPy representation of the given DataFrame
## YOUR CODE HERE

## END CODE
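The fit-then-compare steps above can be sketched on a toy DataFrame. The data below is made up for illustration; only the `SimpleImputer` calls mirror what the exercise asks for.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numerical data with one missing value
toy_num = pd.DataFrame({"total_bedrooms": [1.0, np.nan, 3.0, 10.0],
                        "households": [2.0, 4.0, 6.0, 8.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_num)  # learns one median per column, ignoring NaN

# The learned statistics should match the medians computed by pandas
matches = bool(np.allclose(imputer.statistics_, toy_num.median().values))

X = imputer.transform(toy_num)  # NumPy array with the NaN replaced by the median
print(matches)
print(X)
```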


Transform the training set:

In [None]:
# Transform the training set with the imputer
# .transform() is a method that transforms the data
## YOUR CODE HERE
X =
## END CODE

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)

## Handling Text and Categorical Attributes

Now let's preprocess the categorical input feature, `ocean_proximity`:

In [None]:
# Select the text attribute ocean_proximity and check its first rows
# .head(n) is a method that returns the first n rows
## YOUR CODE HERE
housing_cat = 

## END CODE

In [None]:
from sklearn.preprocessing import OneHotEncoder
# Create a OneHotEncoder to encode the text attributes
# OneHotEncoder is a class that encodes categorical features as a one-hot numeric array
## YOUR CODE HERE
cat_encoder = 
housing_cat_1hot = 
## END CODE

In [None]:
# Check the one-hot encoded output (a SciPy sparse matrix by default)
housing_cat_1hot

In [None]:
# Check the categories of the OneHotEncoder
# .categories_ is an attribute that returns the categories of the OneHotEncoder
## YOUR CODE HERE

## END CODE

In [None]:
# Check the input feature names of the OneHotEncoder
# use .feature_names_in_ to get the input feature names
## YOUR CODE HERE

## END CODE

In [None]:
# Check the output feature names of the OneHotEncoder
# use .get_feature_names_out() to get the output feature names
## YOUR CODE HERE

## END CODE

## Feature Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Create a MinMaxScaler to scale the numerical attributes
# MinMaxScaler is a class that transforms features by scaling each feature to a given range
## YOUR CODE HERE
min_max_scaler = 
housing_num_min_max_scaled = 
## END CODE

In [None]:
from sklearn.preprocessing import StandardScaler
# Create a StandardScaler to scale the numerical attributes
# StandardScaler is a class that standardizes features by removing the mean and scaling to unit variance
## YOUR CODE HERE
std_scaler = 
housing_num_std_scaled = 
## END CODE
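To see how the two scalers differ, here is a minimal sketch on a made-up feature with one large outlier (the values are illustrative, not taken from the housing data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one large outlier
toy_feature = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling maps the smallest value to 0 and the largest to 1 (default range)
min_max_scaled = MinMaxScaler().fit_transform(toy_feature)

# Standardization subtracts the mean and divides by the standard deviation
std_scaled = StandardScaler().fit_transform(toy_feature)

print(min_max_scaled.ravel().round(3))
print(std_scaled.ravel().round(3))
```

In the min-max output the outlier squeezes the other values close to 0, whereas standardization does not confine the values to a fixed range.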

In [None]:
# extra code – this cell generates Figure 2–17
fig, axs = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
housing["population"].hist(ax=axs[0], bins=50)
housing["population"].apply(np.log).hist(ax=axs[1], bins=50)
axs[0].set_xlabel("Population")
axs[1].set_xlabel("Log of population")
axs[0].set_ylabel("Number of districts")
save_fig("long_tail_plot")
plt.show()

## Custom Transformers

To create simple transformers:

In [None]:
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])

In [None]:
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_transformer.transform(np.array([[1., 2.], [3., 4.]]))

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

class ClusterSimilarity(BaseEstimator, TransformerMixin):
    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        self.kmeans_ = KMeans(self.n_clusters, random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # always return self!

    def transform(self, X):
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)
    
    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]

In [None]:
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
similarities = cluster_simil.fit_transform(housing[["latitude", "longitude"]],
                                           sample_weight=housing_labels)

In [None]:
# extra code – this cell generates Figure 2–19

housing_renamed = housing.rename(columns={
    "latitude": "Latitude", "longitude": "Longitude",
    "population": "Population",
    "median_house_value": "Median house value (ᴜsᴅ)"})
housing_renamed["Max cluster similarity"] = similarities.max(axis=1)

housing_renamed.plot(kind="scatter", x="Longitude", y="Latitude", grid=True,
                     s=housing_renamed["Population"] / 100, label="Population",
                     c="Max cluster similarity",
                     cmap="jet", colorbar=True,
                     legend=True, sharex=False, figsize=(10, 7))
plt.plot(cluster_simil.kmeans_.cluster_centers_[:, 1],
         cluster_simil.kmeans_.cluster_centers_[:, 0],
         linestyle="", color="black", marker="X", markersize=20,
         label="Cluster centers")
plt.legend(loc="upper right")
save_fig("district_cluster_plot")
plt.show()

## Transformation Pipelines

Now let's build a pipeline to preprocess the numerical attributes:

In [None]:
from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

In [None]:
from sklearn import set_config

set_config(display='diagram')

num_pipeline

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

In [None]:
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)

In [None]:
# You can run this and ignore
def monkey_patch_get_signature_names_out():
    """Monkey patch some classes which did not handle get_feature_names_out()
       correctly in Scikit-Learn 1.0.*."""
    from inspect import Signature, signature, Parameter
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import make_pipeline, Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    default_get_feature_names_out = StandardScaler.get_feature_names_out

    if not hasattr(SimpleImputer, "get_feature_names_out"):
        print("Monkey-patching SimpleImputer.get_feature_names_out()")
        SimpleImputer.get_feature_names_out = default_get_feature_names_out

    if not hasattr(FunctionTransformer, "get_feature_names_out"):
        print("Monkey-patching FunctionTransformer.get_feature_names_out()")
        orig_init = FunctionTransformer.__init__
        orig_sig = signature(orig_init)

        def __init__(*args, feature_names_out=None, **kwargs):
            orig_sig.bind(*args, **kwargs)
            orig_init(*args, **kwargs)
            args[0].feature_names_out = feature_names_out

        __init__.__signature__ = Signature(
            list(signature(orig_init).parameters.values()) + [
                Parameter("feature_names_out", Parameter.KEYWORD_ONLY)])

        def get_feature_names_out(self, names=None):
            if callable(self.feature_names_out):
                return self.feature_names_out(self, names)
            assert self.feature_names_out == "one-to-one"
            return default_get_feature_names_out(self, names)

        FunctionTransformer.__init__ = __init__
        FunctionTransformer.get_feature_names_out = get_feature_names_out

monkey_patch_get_signature_names_out()

In [None]:
from sklearn.pipeline import make_pipeline
def column_ratio(X):
    return X[:, [0]] / X[:, [1]]

def ratio_name(function_transformer, feature_names_in):
    return ["ratio"]  # feature names out

def ratio_pipeline():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        FunctionTransformer(column_ratio, feature_names_out=ratio_name),
        StandardScaler())

log_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    FunctionTransformer(np.log, feature_names_out="one-to-one"),
    StandardScaler())
cluster_simil = ClusterSimilarity(n_clusters=10, gamma=1., random_state=42)
default_num_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                                     StandardScaler())
preprocessing = ColumnTransformer([
        ("bedrooms", ratio_pipeline(), ["total_bedrooms", "total_rooms"]),
        ("rooms_per_house", ratio_pipeline(), ["total_rooms", "households"]),
        ("people_per_house", ratio_pipeline(), ["population", "households"]),
        ("log", log_pipeline, ["total_bedrooms", "total_rooms", "population",
                               "households", "median_income"]),
        ("geo", cluster_simil, ["latitude", "longitude"]),
        ("cat", cat_pipeline, make_column_selector(dtype_include=object)),
    ],
    remainder=default_num_pipeline)  # one column remaining: housing_median_age

In [None]:
housing_prepared = preprocessing.fit_transform(housing)
housing_prepared.shape

In [None]:
preprocessing.get_feature_names_out()

# Select and Train a Model

## Training and Evaluating on the Training Set

In [None]:
from sklearn.linear_model import LinearRegression
# create a pipeline with the preprocessing and the linear regression
# use the pipeline as if it was a regular regressor
## you can use fit(), predict(), score(), etc.
## YOUR CODE HERE
lin_reg = 


## END YOUR CODE

Let's try the full preprocessing pipeline on a few training instances:

In [None]:
# predict the first 5 values
# use the round() function to round the values to the nearest hundred
# use .predict() to predict the values
## YOUR CODE HERE
housing_predictions = 


## END YOUR CODE

Compare against the actual values:

In [None]:
# Compare the predictions to the actual values
## YOUR CODE HERE

## END YOUR CODE

In [None]:
from sklearn.metrics import mean_squared_error
# compute the root mean squared error
# use the mean_squared_error() function
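# set the squared parameter to False (deprecated since Scikit-Learn 1.4;
# on newer versions, take np.sqrt() of the mean squared error instead)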
## YOUR CODE HERE

## END YOUR CODE
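The RMSE is just the square root of the mean squared error, so it can be computed without relying on the `squared` parameter. A minimal sketch on made-up values (`y_true` and `y_pred` are illustrative arrays, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mse = mean_squared_error(y_true, y_pred)  # mean of the squared errors
rmse = float(np.sqrt(mse))                # back in the same units as the target

print(rmse)
```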

In [None]:
from sklearn.tree import DecisionTreeRegressor
# create a pipeline with the preprocessing and the decision tree regressor

## YOUR CODE HERE
tree_reg = 


## END YOUR CODE

In [None]:
# make predictions and compute the RMSE
## YOUR CODE HERE
housing_predictions = 


## END YOUR CODE

## Better Evaluation Using Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score

tree_rmses = -cross_val_score(tree_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)

In [None]:
pd.Series(tree_rmses).describe()

In [None]:
# extra code – computes the error stats for the linear model
lin_rmses = -cross_val_score(lin_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)
pd.Series(lin_rmses).describe()

**Warning:** the following cell may take a few minutes to run:

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = make_pipeline(preprocessing,
                           RandomForestRegressor(random_state=42))
forest_rmses = -cross_val_score(forest_reg, housing, housing_labels,
                                scoring="neg_root_mean_squared_error", cv=10)

In [None]:
pd.Series(forest_rmses).describe()

Let's compare this RMSE measured using cross-validation (the "validation error") with the RMSE measured on the training set (the "training error"):

In [None]:
forest_reg.fit(housing, housing_labels)
housing_predictions = forest_reg.predict(housing)
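# np.sqrt() of the MSE works on every Scikit-Learn version
# (squared=False is deprecated since Scikit-Learn 1.4)
forest_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))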
forest_rmse

The training error is much lower than the validation error, which usually means that the model has overfit the training set. Another possible explanation may be that there's a mismatch between the training data and the validation data, but it's not the case here, since both came from the same dataset that we shuffled and split in two parts.
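The gap between training error and validation error can be reproduced on synthetic data. This is a minimal sketch, assuming a made-up noisy sine target and an unconstrained `DecisionTreeRegressor`; neither comes from the housing dataset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: one feature, noisy sine target
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

tree = DecisionTreeRegressor(random_state=42)  # no depth limit: free to memorize

# Validation error: RMSE on held-out folds, averaged over 5 folds
val_rmses = -cross_val_score(tree, X, y,
                             scoring="neg_root_mean_squared_error", cv=5)
validation_rmse = float(val_rmses.mean())

# Training error: RMSE on the very data the tree was fitted on
tree.fit(X, y)
training_rmse = float(np.sqrt(np.mean((tree.predict(X) - y) ** 2)))

print(training_rmse, validation_rmse)
```

The tree memorizes the noise in the training data, so its training error is close to zero while its validation error is not.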

# Fine-Tune Your Model

## Grid Search

**Warning:** the following cell may take a few minutes to run:

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])
param_grid = [
    {'preprocessing__geo__n_clusters': [5, 8, 10],
     'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15],
     'random_forest__max_features': [6, 8, 10]},
]
grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(housing, housing_labels)

You can get the full list of hyperparameters available for tuning by looking at `full_pipeline.get_params().keys()`:

In [None]:
# extra code – shows part of the output of get_params().keys()
print(str(full_pipeline.get_params().keys())[:1000] + "...")
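The hyperparameter names follow a `step__parameter` pattern, with a double underscore between each nesting level. A small sketch on a hypothetical two-step pipeline (the step names `imputer` and `ridge` are made up for this example):

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Hypothetical pipeline, much smaller than full_pipeline
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("ridge", Ridge()),
])

# Nested parameters are exposed as "<step name>__<parameter name>"
param_names = sorted(toy_pipeline.get_params().keys())
print(param_names)

# The same names can be used to set a nested hyperparameter
toy_pipeline.set_params(ridge__alpha=10.0)
print(toy_pipeline["ridge"].alpha)
```

This is the naming scheme that `param_grid` relies on, for example `preprocessing__geo__n_clusters`.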

The best hyperparameter combination found:

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

Let's look at the score of each hyperparameter combination tested during the grid search:

In [None]:
cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)

# extra code – these few lines of code just make the DataFrame look nicer
cv_res = cv_res[["param_preprocessing__geo__n_clusters",
                 "param_random_forest__max_features", "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]
score_cols = ["split0", "split1", "split2", "mean_test_rmse"]
cv_res.columns = ["n_clusters", "max_features"] + score_cols
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)

cv_res.head()

## Randomized Search

In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

**Warning:** the following cell may take a few minutes to run:

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {'preprocessing__geo__n_clusters': randint(low=3, high=50),
                  'random_forest__max_features': randint(low=2, high=20)}

rnd_search = RandomizedSearchCV(
    full_pipeline, param_distributions=param_distribs, n_iter=10, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42)

rnd_search.fit(housing, housing_labels)
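The `randint` distributions used above draw integers from a half-open interval: `low` is included and `high` is excluded. A quick sketch of what the random search samples from (the sample size of 1000 is arbitrary):

```python
from scipy.stats import randint

# Same bounds as the n_clusters distribution; the upper bound is exclusive
n_clusters_dist = randint(low=3, high=50)
samples = n_clusters_dist.rvs(size=1000, random_state=42)

print(samples.min(), samples.max())
```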

In [None]:
# extra code – displays the random search results
cv_res = pd.DataFrame(rnd_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res = cv_res[["param_preprocessing__geo__n_clusters",
                 "param_random_forest__max_features", "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]
cv_res.columns = ["n_clusters", "max_features"] + score_cols
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)
cv_res.head()

## Analyze the Best Models and Their Errors

In [None]:
final_model = rnd_search.best_estimator_  # includes preprocessing
feature_importances = final_model["random_forest"].feature_importances_
feature_importances.round(2)

In [None]:
sorted(zip(feature_importances,
           final_model["preprocessing"].get_feature_names_out()),
           reverse=True)

## Evaluate Your System on the Test Set

In [None]:
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

final_predictions = final_model.predict(X_test)

# np.sqrt() of the MSE avoids squared=False, deprecated since Scikit-Learn 1.4
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
print(final_rmse)

All good! That's all for today! 😀

Congratulations! You already know quite a lot about Machine Learning. :)