# Assignment 1, Meenu Ravi, April 9 2021

* The notebook is taken from Geron's textbook.
* Data: California census data
* Objective: predict housing prices
  * **Supervised learning** problem - labeled data
  * **Regression problem**
    * **Multiple regression** problem - multiple features
    * **Univariate regression** problem - we are trying to predict a single label (if we were trying to predict multiple labels - **multivariate regression** problem)
* Performance measure:
    * **Root Mean Square Error** (RMSE): $\mathrm{RMSE}(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2}$
    * **Mean absolute error** (MAE): $\mathrm{MAE}(X,h) = \frac{1}{m} \sum_{i=1}^{m} |h(x^{(i)}) - y^{(i)}|$
    * RMSE is more sensitive to outliers than MAE. However, when outliers are exponentially rare (as in a bell-shaped curve), RMSE performs very well and is generally preferred.
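
The difference in outlier sensitivity can be seen on a toy example. The sketch below uses made-up prediction errors (not taken from the housing data) and compares both measures with and without a single large error; the helper names `rmse` and `mae` are illustrative only.

```python
import math

def rmse(errors):
    """Root mean square error of a list of prediction errors h(x) - y."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error of a list of prediction errors h(x) - y."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical prediction errors, chosen only for illustration
typical = [1.0, -1.0, 1.0, -1.0]
with_outlier = [1.0, -1.0, 1.0, -9.0]

print("typical:     ", rmse(typical), mae(typical))
print("with outlier:", rmse(with_outlier), mae(with_outlier))
```

The single large error raises the MAE from 1 to 3, but the RMSE from 1 to about 4.58, because squaring weights large errors more heavily.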

## Setup

In [197]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
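# Compare numerically: a plain string comparison would order "0.9" above "0.20"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)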

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Get the data

* Download data to disk

In [198]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

In [199]:
fetch_housing_data()

In [200]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [201]:
housing = load_housing_data()
np.random.seed(42)

## Split data into train and test set

In [202]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# Prepare the data for Machine Learning algorithms

In [203]:
from sklearn.model_selection import StratifiedShuffleSplit
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
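
The `pd.cut` call above assigns each district to an income category using right-closed bins, so a value equal to a bin edge falls into the lower category. A minimal plain-Python sketch of that binning rule follows; `income_category` is a hypothetical helper, and it assumes strictly positive incomes.

```python
from bisect import bisect_left

# Upper edges of the first four bins; the fifth bin is open-ended
upper_edges = [1.5, 3.0, 4.5, 6.0]

def income_category(median_income):
    """Return the category 1..5 for a positive median income.

    Bins are right-closed: (0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, inf).
    """
    return bisect_left(upper_edges, median_income) + 1

# A few sample incomes, including two values that sit exactly on a bin edge
for income in [0.5, 1.5, 2.7042, 6.0, 6.4214]:
    print(income, "->", income_category(income))
```

Stratifying on this category keeps the share of each income band roughly equal in the training and test sets.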

In [204]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

### Data cleaning
#### Dealing with missing values

In [205]:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()

#### Option 1: drop records with missing values

In [206]:
sample_incomplete_rows.dropna(subset=["total_bedrooms"])

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,income_cat


#### Option 2: drop variable that has incomplete values

In [207]:
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,population,households,median_income,ocean_proximity,income_cat
4629,-118.3,34.07,18.0,3759.0,3296.0,1462.0,2.2708,<1H OCEAN,2
6068,-117.86,34.01,16.0,4632.0,3038.0,727.0,5.1762,<1H OCEAN,4
17923,-121.97,37.35,30.0,1955.0,999.0,386.0,4.6328,<1H OCEAN,4
13656,-117.3,34.05,6.0,2155.0,1039.0,391.0,1.6675,INLAND,2
19252,-122.79,38.48,7.0,6837.0,3468.0,1405.0,3.1662,<1H OCEAN,3


#### Option 3: fill missing values with, for example, median or mean or some other constant

In [208]:
median = housing["total_bedrooms"].median()
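# Assign the result back instead of calling fillna(inplace=True) on a selected column
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median) # option 3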

In [209]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

Remove the text attribute because median can only be calculated on numerical attributes:

In [210]:
housing_num = housing.drop("ocean_proximity", axis=1)
# alternatively: housing_num = housing.select_dtypes(include=[np.number])

Transform the training set:

In [211]:
imputer.fit(housing_num)
X = imputer.transform(housing_num)

In [212]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)

In [213]:
housing_tr.loc[sample_incomplete_rows.index.values]

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,income_cat
4629,-118.3,34.07,18.0,3759.0,433.0,3296.0,1462.0,2.2708,2.0
6068,-117.86,34.01,16.0,4632.0,433.0,3038.0,727.0,5.1762,4.0
17923,-121.97,37.35,30.0,1955.0,433.0,999.0,386.0,4.6328,4.0
13656,-117.3,34.05,6.0,2155.0,433.0,1039.0,391.0,1.6675,2.0
19252,-122.79,38.48,7.0,6837.0,433.0,3468.0,1405.0,3.1662,3.0


In [214]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)

### Encode categorical input features


In [215]:
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

Unnamed: 0,ocean_proximity
17606,<1H OCEAN
18632,<1H OCEAN
14650,NEAR OCEAN
3230,INLAND
3555,<1H OCEAN
19480,INLAND
8879,<1H OCEAN
13685,INLAND
4937,<1H OCEAN
4861,<1H OCEAN


#### OrdinalEncoder

In [216]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])

#### OneHotEncoder

In [217]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

By default, the `OneHotEncoder` class returns a sparse matrix, which we can convert to a dense array if needed by calling the `toarray()` method:

In [218]:
housing_cat_1hot.toarray()

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

Alternatively, you can set `sparse=False` when creating the `OneHotEncoder` (in Scikit-Learn 1.2 and later, this parameter is named `sparse_output`):

In [219]:
cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
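
To make the two encodings concrete, here is a minimal sketch of what ordinal and one-hot encoding compute, written in plain Python rather than with Scikit-Learn. The sample values are a hypothetical subset of `ocean_proximity`, and the variable names are illustrative only. Categories are sorted, which mirrors the default ordering the encoders use for string categories.

```python
# Hypothetical sample of category values (a subset of ocean_proximity)
values = ["<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN"]

# Sorted unique categories define the code and column order
categories = sorted(set(values))
index_of = {cat: i for i, cat in enumerate(categories)}

# Ordinal encoding: one integer code per value
ordinal = [index_of[v] for v in values]

# One-hot encoding: one indicator column per category, exactly one 1 per row
one_hot = [[1.0 if index_of[v] == j else 0.0 for j in range(len(categories))]
           for v in values]

print(categories)
print(ordinal)
for row in one_hot:
    print(row)
```

Ordinal codes imply an order and a distance between categories that `ocean_proximity` does not have, which is the usual reason to prefer one-hot encoding for this kind of attribute.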

In [220]:
from sklearn.base import BaseEstimator, TransformerMixin

col_names = "total_rooms", "total_bedrooms", "population", "households"
rooms_ix, bedrooms_ix, population_ix, households_ix = [
    housing.columns.get_loc(c) for c in col_names]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

type(housing_extra_attribs)
housing_extra_attribs.shape

(16512, 12)

In [221]:
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity,income_cat,rooms_per_household,population_per_household
17606,-121.89,37.29,38.0,1568.0,351.0,710.0,339.0,2.7042,<1H OCEAN,2,4.625369,2.094395
18632,-121.93,37.05,14.0,679.0,108.0,306.0,113.0,6.4214,<1H OCEAN,5,6.00885,2.707965
14650,-117.2,32.77,31.0,1952.0,471.0,936.0,462.0,2.8621,NEAR OCEAN,2,4.225108,2.025974
3230,-119.61,36.31,25.0,1847.0,371.0,1460.0,353.0,1.8839,INLAND,2,5.232295,4.135977
3555,-118.59,34.23,17.0,6592.0,1525.0,4459.0,1463.0,3.0347,<1H OCEAN,3,4.50581,3.047847


In [222]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

In [223]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

## Extending notebook 3 for homework assignment 1

### Question 2: Extend l1_n3.ipynb to try a Support Vector Machine regressor (sklearn.svm.SVR) with various hyperparameters

In [224]:
from sklearn.svm import SVR

##### Grid Search CV

In [225]:
from sklearn.model_selection import GridSearchCV
param_grid = [{'C':[0.1,1,10],'kernel':['linear']},
            { 'C':[0.1,1,10],
              'kernel':['rbf'],
              'gamma':[1,0.1,0.01]
              }]
grid = GridSearchCV(SVR(),param_grid,cv=5,
                           return_train_score=True,scoring='neg_mean_squared_error', n_jobs=-1,verbose=True)

In [226]:
grid.fit(housing_prepared, housing_labels)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


GridSearchCV(cv=5, estimator=SVR(), n_jobs=-1,
             param_grid=[{'C': [0.1, 1, 10], 'kernel': ['linear']},
                         {'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01],
                          'kernel': ['rbf']}],
             return_train_score=True, scoring='neg_mean_squared_error',
             verbose=True)
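
The count of 12 candidates and 60 fits in the log above follows from how a list of grids is expanded: each dictionary becomes the Cartesian product of its value lists, and the results are concatenated. The sketch below reproduces that arithmetic in plain Python; `expand_grid` is a hypothetical helper, not part of Scikit-Learn.

```python
from itertools import product

param_grid = [
    {"C": [0.1, 1, 10], "kernel": ["linear"]},
    {"C": [0.1, 1, 10], "kernel": ["rbf"], "gamma": [1, 0.1, 0.01]},
]

def expand_grid(grids):
    """Return every parameter combination described by a list of grids."""
    candidates = []
    for grid in grids:
        keys = sorted(grid)
        # Cartesian product of the value lists of this one dictionary
        for combo in product(*(grid[k] for k in keys)):
            candidates.append(dict(zip(keys, combo)))
    return candidates

candidates = expand_grid(param_grid)
n_folds = 5
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

The linear grid contributes 3 candidates and the RBF grid 3 × 3 = 9, and each candidate is fitted once per fold.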

In [227]:
grid.best_params_

{'C': 10, 'kernel': 'linear'}

In [228]:
grid.best_estimator_

SVR(C=10, kernel='linear')

In [229]:
cvres = grid.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

117800.31017688812 {'C': 0.1, 'kernel': 'linear'}
108968.2694281854 {'C': 1, 'kernel': 'linear'}
81649.4933206491 {'C': 10, 'kernel': 'linear'}
118919.9953920614 {'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}
118885.04966623483 {'C': 0.1, 'gamma': 0.1, 'kernel': 'rbf'}
118905.45885493883 {'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}
118894.65726808914 {'C': 1, 'gamma': 1, 'kernel': 'rbf'}
118556.82411885096 {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
118743.74833054295 {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}
118658.95230678072 {'C': 10, 'gamma': 1, 'kernel': 'rbf'}
115419.67531992932 {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
117186.1192387479 {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}


In [230]:
pd.DataFrame(grid.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,param_gamma,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,19.681751,0.072417,2.862765,0.026643,0.1,linear,,"{'C': 0.1, 'kernel': 'linear'}",-13151120000.0,-13876410000.0,...,-13876910000.0,379113000.0,5,-14003980000.0,-13838350000.0,-13819170000.0,-13840670000.0,-13866330000.0,-13873700000.0,66844800.0
1,19.522096,0.072867,2.865562,0.034548,1.0,linear,,"{'C': 1, 'kernel': 'linear'}",-11250180000.0,-11829560000.0,...,-11874080000.0,332274700.0,2,-11981560000.0,-11849240000.0,-11822500000.0,-11853970000.0,-11843560000.0,-11870170000.0,56726470.0
2,18.324581,0.3431,2.847372,0.023801,10.0,linear,,"{'C': 10, 'kernel': 'linear'}",-6295450000.0,-6584026000.0,...,-6666640000.0,249619100.0,1,-6693891000.0,-6652226000.0,-6607054000.0,-6759641000.0,-6578965000.0,-6658355000.0,64021380.0
3,22.392199,0.571932,13.453108,0.255117,0.1,rbf,1.0,"{'C': 0.1, 'gamma': 1, 'kernel': 'rbf'}",-13399280000.0,-14156320000.0,...,-14141970000.0,385901500.0,12,-14266160000.0,-14110470000.0,-14078650000.0,-14101800000.0,-14137070000.0,-14138830000.0,66353110.0
4,22.997838,0.522585,13.632257,0.070388,0.1,rbf,0.1,"{'C': 0.1, 'gamma': 0.1, 'kernel': 'rbf'}",-13391490000.0,-14147830000.0,...,-14133660000.0,385782300.0,9,-14258120000.0,-14101940000.0,-14072470000.0,-14091540000.0,-14128620000.0,-14130540000.0,66319650.0
5,23.545629,0.398374,13.731434,0.058042,0.1,rbf,0.01,"{'C': 0.1, 'gamma': 0.01, 'kernel': 'rbf'}",-13395910000.0,-14152720000.0,...,-14138510000.0,385930400.0,11,-14262610000.0,-14107010000.0,-14076430000.0,-14097270000.0,-14133590000.0,-14135380000.0,66217010.0
6,21.809266,0.405403,13.654177,0.123813,1.0,rbf,1.0,"{'C': 1, 'gamma': 1, 'kernel': 'rbf'}",-13393780000.0,-14150690000.0,...,-14135940000.0,385754800.0,10,-14260420000.0,-14104520000.0,-14074540000.0,-14094380000.0,-14129650000.0,-14132700000.0,66280970.0
7,22.796033,0.3357,13.681777,0.06824,1.0,rbf,0.1,"{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}",-13320380000.0,-14060590000.0,...,-14055720000.0,383474500.0,6,-14185250000.0,-14013550000.0,-13995600000.0,-14016310000.0,-14051690000.0,-14052480000.0,68829270.0
8,23.771397,0.322662,13.757269,0.124707,1.0,rbf,0.01,"{'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}",-13359670000.0,-14112610000.0,...,-14100080000.0,384977700.0,8,-14224360000.0,-14067690000.0,-14039730000.0,-14056250000.0,-14096510000.0,-14096910000.0,66365020.0
9,21.734253,0.385688,13.798311,0.232454,10.0,rbf,1.0,"{'C': 10, 'gamma': 1, 'kernel': 'rbf'}",-13343530000.0,-14086260000.0,...,-14079950000.0,383645300.0,7,-14208550000.0,-14036150000.0,-14018830000.0,-14035030000.0,-14077340000.0,-14075180000.0,69430840.0


##### Randomized Search CV using linear kernel

In [231]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = [{'C':randint(low=1,high=100),'kernel':['linear']}
            ]
rnd_grid = RandomizedSearchCV(SVR(),param_dist,cv=5,n_iter=10,
                           return_train_score=True,scoring='neg_mean_squared_error', random_state=42,
                              n_jobs=-1,verbose=True)

In [232]:
rnd_grid.fit(housing_prepared, housing_labels)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5, estimator=SVR(), n_jobs=-1,
                   param_distributions=[{'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001CB5C96ABB0>,
                                         'kernel': ['linear']}],
                   random_state=42, return_train_score=True,
                   scoring='neg_mean_squared_error', verbose=True)

In [233]:
rnd_grid.best_params_

{'C': 93, 'kernel': 'linear'}

In [234]:
rnd_grid.best_estimator_

SVR(C=93, kernel='linear')

In [235]:
cvres = rnd_grid.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

72476.96714901099 {'C': 52, 'kernel': 'linear'}
71317.03930784999 {'C': 93, 'kernel': 'linear'}
78058.84138448106 {'C': 15, 'kernel': 'linear'}
71772.3154302096 {'C': 72, 'kernel': 'linear'}
72083.57989550327 {'C': 61, 'kernel': 'linear'}
75947.04494688548 {'C': 21, 'kernel': 'linear'}
71498.69440494395 {'C': 83, 'kernel': 'linear'}
71427.84045379868 {'C': 87, 'kernel': 'linear'}
71675.4378881904 {'C': 75, 'kernel': 'linear'}
71675.4378881904 {'C': 75, 'kernel': 'linear'}


In [236]:
pd.DataFrame(rnd_grid.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,20.262881,0.427441,2.850428,0.012795,52,linear,"{'C': 52, 'kernel': 'linear'}",-4931273000.0,-5392689000.0,-5445445000.0,...,-5252911000.0,243653200.0,8,-5300389000.0,-5185431000.0,-5203196000.0,-5332177000.0,-5182532000.0,-5240745000.0,62889480.0
1,22.900542,3.84937,3.18575,0.378959,93,linear,"{'C': 93, 'kernel': 'linear'}",-4751247000.0,-5320633000.0,-5259426000.0,...,-5086120000.0,256873100.0,1,-5134579000.0,-5007461000.0,-5040499000.0,-5155815000.0,-5023405000.0,-5072352000.0,60761160.0
2,25.380828,1.725362,3.672992,0.058954,15,linear,"{'C': 15, 'kernel': 'linear'}",-5742320000.0,-6048678000.0,-6319090000.0,...,-6093183000.0,241601100.0,10,-6105943000.0,-6069745000.0,-6040579000.0,-6203865000.0,-5994750000.0,-6082976000.0,70566220.0
3,25.002366,0.699607,3.218014,0.279223,72,linear,"{'C': 72, 'kernel': 'linear'}",-4818772000.0,-5345711000.0,-5337864000.0,...,-5151265000.0,250696500.0,6,-5197145000.0,-5081746000.0,-5108338000.0,-5228036000.0,-5082914000.0,-5139636000.0,61106450.0
4,24.621291,0.348647,3.602699,0.041311,61,linear,"{'C': 61, 'kernel': 'linear'}",-4865712000.0,-5362860000.0,-5383967000.0,...,-5196042000.0,247126100.0,7,-5239659000.0,-5129275000.0,-5148124000.0,-5277929000.0,-5128137000.0,-5184625000.0,62162950.0
5,24.262433,0.68047,3.642908,0.054657,21,linear,"{'C': 21, 'kernel': 'linear'}",-5445082000.0,-5760244000.0,-5987874000.0,...,-5767954000.0,238221800.0,9,-5804793000.0,-5727476000.0,-5716962000.0,-5862392000.0,-5683184000.0,-5758961000.0,65262670.0
6,24.152777,0.706027,3.600311,0.054333,83,linear,"{'C': 83, 'kernel': 'linear'}",-4779085000.0,-5335911000.0,-5286450000.0,...,-5112063000.0,253363400.0,3,-5160218000.0,-5035712000.0,-5063811000.0,-5187268000.0,-5044533000.0,-5098309000.0,62843890.0
7,23.619982,0.269852,3.55376,0.048452,87,linear,"{'C': 87, 'kernel': 'linear'}",-4769120000.0,-5331188000.0,-5276657000.0,...,-5101936000.0,254795700.0,2,-5151509000.0,-5024619000.0,-5055781000.0,-5174220000.0,-5034748000.0,-5088176000.0,62222460.0
8,24.204878,0.369766,3.6497,0.065131,75,linear,"{'C': 75, 'kernel': 'linear'}",-4801455000.0,-5344774000.0,-5318795000.0,...,-5137368000.0,252617000.0,4,-5180683000.0,-5067866000.0,-5092407000.0,-5214476000.0,-5070259000.0,-5125138000.0,60711800.0
9,20.97825,3.533354,3.145064,0.797861,75,linear,"{'C': 75, 'kernel': 'linear'}",-4801455000.0,-5344774000.0,-5318795000.0,...,-5137368000.0,252617000.0,4,-5180683000.0,-5067866000.0,-5092407000.0,-5214476000.0,-5070259000.0,-5125138000.0,60711800.0


##### Randomized Search CV using rbf kernel

In [238]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = [
            { 'C':randint(low=1,high=100),
              'kernel':['rbf'],
              'gamma':randint(low=1,high=10)
              }]
rnd_grid2 = RandomizedSearchCV(SVR(),param_dist,cv=5,n_iter=10,
                           return_train_score=True,scoring='neg_mean_squared_error', random_state=42,
                              n_jobs=-1,verbose=True)

In [239]:
rnd_grid2.fit(housing_prepared, housing_labels)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5, estimator=SVR(), n_jobs=-1,
                   param_distributions=[{'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001CB4BFA83D0>,
                                         'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001CB4BFA8CD0>,
                                         'kernel': ['rbf']}],
                   random_state=42, return_train_score=True,
                   scoring='neg_mean_squared_error', verbose=True)

In [240]:
cvres = rnd_grid2.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

118902.28337886001 {'C': 52, 'gamma': 8, 'kernel': 'rbf'}
118839.99060403393 {'C': 61, 'gamma': 5, 'kernel': 'rbf'}
118870.0554388319 {'C': 83, 'gamma': 7, 'kernel': 'rbf'}
118889.77017584811 {'C': 75, 'gamma': 8, 'kernel': 'rbf'}
118846.26919997799 {'C': 24, 'gamma': 3, 'kernel': 'rbf'}
118901.30913338567 {'C': 22, 'gamma': 5, 'kernel': 'rbf'}
118921.57470456738 {'C': 2, 'gamma': 8, 'kernel': 'rbf'}
118902.05838172584 {'C': 30, 'gamma': 6, 'kernel': 'rbf'}
118919.08578826314 {'C': 2, 'gamma': 5, 'kernel': 'rbf'}
118901.22836289476 {'C': 33, 'gamma': 6, 'kernel': 'rbf'}


In [241]:
rnd_grid2.best_params_

{'C': 61, 'gamma': 5, 'kernel': 'rbf'}

In [242]:
rnd_grid2.best_estimator_

SVR(C=61, gamma=5)

In [243]:
pd.DataFrame(rnd_grid2.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_gamma,param_kernel,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,33.882431,0.497326,16.055563,1.019672,52,8,rbf,"{'C': 52, 'gamma': 8, 'kernel': 'rbf'}",-13394560000.0,-14147630000.0,...,-14137750000.0,386714900.0,8,-14252100000.0,-14092010000.0,-14069450000.0,-14086020000.0,-14127000000.0,-14125310000.0,66106750.0
1,34.116124,1.354618,15.264254,1.408074,61,5,rbf,"{'C': 61, 'gamma': 5, 'kernel': 'rbf'}",-13380770000.0,-14129350000.0,...,-14122940000.0,386504200.0,1,-14235800000.0,-14071080000.0,-14053440000.0,-14071710000.0,-14112420000.0,-14108890000.0,66338590.0
2,35.702732,0.169887,17.159998,0.146655,83,7,rbf,"{'C': 83, 'gamma': 7, 'kernel': 'rbf'}",-13387220000.0,-14137430000.0,...,-14130090000.0,386846400.0,3,-14238770000.0,-14075700000.0,-14057350000.0,-14074320000.0,-14114650000.0,-14112160000.0,66033610.0
3,35.220819,0.1,17.17505,0.158861,75,8,rbf,"{'C': 75, 'gamma': 8, 'kernel': 'rbf'}",-13392260000.0,-14144140000.0,...,-14134780000.0,386377800.0,4,-14245660000.0,-14084250000.0,-14061920000.0,-14079430000.0,-14120330000.0,-14118320000.0,66443340.0
4,30.844777,1.309853,16.649433,0.178891,24,3,rbf,"{'C': 24, 'gamma': 3, 'kernel': 'rbf'}",-13383780000.0,-14131160000.0,...,-14124440000.0,385735600.0,2,-14246320000.0,-14079760000.0,-14060590000.0,-14079490000.0,-14118870000.0,-14117010000.0,67385130.0
5,38.864548,1.244252,16.834296,1.194202,22,5,rbf,"{'C': 22, 'gamma': 5, 'kernel': 'rbf'}",-13394470000.0,-14148250000.0,...,-14137520000.0,386567000.0,6,-14257380000.0,-14098040000.0,-14073810000.0,-14092100000.0,-14130770000.0,-14130420000.0,66090920.0
6,34.30856,4.869699,16.196616,2.126852,2,8,rbf,"{'C': 2, 'gamma': 8, 'kernel': 'rbf'}",-13399640000.0,-14156570000.0,...,-14142340000.0,385919400.0,10,-14266180000.0,-14110370000.0,-14078720000.0,-14101870000.0,-14137180000.0,-14138870000.0,66349220.0
7,31.310101,0.199826,14.532691,0.366582,30,6,rbf,"{'C': 30, 'gamma': 6, 'kernel': 'rbf'}",-13394520000.0,-14148020000.0,...,-14137700000.0,386676600.0,7,-14255960000.0,-14096350000.0,-14073040000.0,-14090300000.0,-14130140000.0,-14129160000.0,66051540.0
8,33.149252,0.956997,16.538941,1.377607,2,5,rbf,"{'C': 2, 'gamma': 5, 'kernel': 'rbf'}",-13399120000.0,-14156010000.0,...,-14141750000.0,385882500.0,9,-14265610000.0,-14109780000.0,-14078170000.0,-14101270000.0,-14136550000.0,-14138280000.0,66350060.0
9,30.061609,3.740213,11.892802,3.50963,33,6,rbf,"{'C': 33, 'gamma': 6, 'kernel': 'rbf'}",-13394640000.0,-14146850000.0,...,-14137500000.0,386608700.0,5,-14255660000.0,-14094560000.0,-14072470000.0,-14089440000.0,-14129980000.0,-14128420000.0,66317070.0


In [244]:
from sklearn.metrics import mean_squared_error
final_model = grid.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

In [245]:
final_rmse

78069.60329439449

In [246]:
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([75391.8615527 , 80658.49694054])

In [250]:
final_model = rnd_grid.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse

69209.34884560181

#### How does the best SVR predictor perform?


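The best SVR from the grid search (`C=10`, linear kernel) reaches a 5-fold cross-validation RMSE of about 81,649 and a test-set RMSE of about 78,070 (95% confidence interval roughly 75,392 to 80,658). The best SVR from the randomized search over the linear kernel (`C=93`) does better, with a cross-validation RMSE of about 71,317 and a test-set RMSE of about 69,209. Both are worse than the Random Forest Regressor in the original notebook 3, although the randomized-search result comes closer to it.

The linear kernel clearly beats the RBF kernel here: every RBF candidate, in both the grid search and the randomized search, has a cross-validation RMSE of roughly 115,000 to 119,000. In both linear-kernel searches the best C is the largest value tried (10 in the grid, 93 among the sampled values), which suggests that even larger values of C may be worth exploring. The randomized search found better hyperparameters than the grid search with a similar number of fits (50 versus 60), and its candidates covered a wider range of C (1 to 99 instead of 0.1 to 10). In terms of time, the SVR is slower than the Random Forest Regressor.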


### Question 3: Add a transformer in the preparation pipeline to select only the most important attributes

#### The transformer

In [252]:
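class TopFeatureSelector(BaseEstimator, TransformerMixin):
    """Keep the numeric columns whose absolute correlation with y exceeds threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, X, y):
        # X must be a DataFrame and y a Series sharing X's index.
        # Select numeric columns explicitly so that the correlation is only
        # computed on numeric data, whatever the pandas version.
        corr = X.select_dtypes(include=[np.number]).corrwith(y)
        self.important_features = corr.index[np.abs(corr) > self.threshold].tolist()
        return self

    def transform(self, X):
        # Return only the columns selected during fit
        return X[self.important_features]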
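
To illustrate the selection rule that `TopFeatureSelector` applies, the sketch below computes the Pearson correlation of each feature with the label on a small made-up dataset and keeps the features whose absolute correlation exceeds the threshold. It uses plain Python instead of pandas; the data and the `pearson` helper are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up features and label, for illustration only
features = {
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],    # rises with the label
    "age":    [5.0, 4.0, 3.0, 2.0, 1.0],    # falls with the label
    "noise":  [1.0, -1.0, 0.0, -1.0, 1.0],  # uncorrelated with the label
}
label = [10.0, 20.0, 30.0, 40.0, 50.0]

threshold = 0.1
selected = [name for name, column in features.items()
            if abs(pearson(column, label)) > threshold]
print(selected)
```

A negatively correlated feature such as `age` is kept because the rule compares the absolute value of the correlation, while `noise` is dropped.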

#### Insert a transformer into the preparation pipeline

In [254]:
feature_selection_pipeline = Pipeline([
    ('feature_selection', TopFeatureSelector(threshold=0.1))
])
feature_selection_pipeline

Pipeline(steps=[('feature_selection', TopFeatureSelector(threshold=0.1))])

In [256]:
top_features = feature_selection_pipeline.fit_transform(housing,housing_labels)

#### Showing the dataframe with the top features

In [258]:
top_features

Unnamed: 0,latitude,housing_median_age,total_rooms,median_income
17606,37.29,38.0,1568.0,2.7042
18632,37.05,14.0,679.0,6.4214
14650,32.77,31.0,1952.0,2.8621
3230,36.31,25.0,1847.0,1.8839
3555,34.23,17.0,6592.0,3.0347
...,...,...,...,...
6563,34.20,46.0,1271.0,4.9312
12053,33.88,40.0,1196.0,2.0682
13908,34.09,9.0,4855.0,3.2723
11159,33.82,31.0,1960.0,4.0625
