# Exercises

1. Evaluate the performance of a Support Vector Machine regressor (sklearn.svm.SVR) by experimenting with different hyperparameters such as kernel="linear" (varying the C hyperparameter) or kernel="rbf" (adjusting both the C and gamma hyperparameters).

2. Compare the model's performance when using RandomizedSearchCV versus GridSearchCV for hyperparameter optimization.

3. Investigate how model performance changes when you add a transformer to the preparation pipeline that keeps only the most important features.

4. Assess whether combining comprehensive data preparation and the final prediction process into a unified pipeline yields better performance compared to separate pipelines.

5. Use GridSearchCV to automatically explore various preparation options and measure their effect on model performance.


## <font color='red'>Exercise solutions</font>

Question 1: Experiment with a Support Vector Machine regressor (sklearn.svm.SVR), exploring different hyperparameters such as kernel="linear" (using different values for the C hyperparameter) or kernel="rbf" (using different values for both the C and gamma hyperparameters).

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Define the grid of hyperparameters to search
param_grid = [
    {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
    {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]

# Create an SVR model
svm_reg = SVR()

# Perform grid search using cross-validation
grid_search = GridSearchCV(svm_reg, param_grid, cv=5, scoring='neg_mean_squared_error', verbose=2)

# Fit the grid search to the prepared housing data and corresponding labels
grid_search.fit(housing_prepared, housing_labels)


Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END ..............................C=10.0, kernel=linear; total time=   7.7s
[CV] END ..............................C=10.0, kernel=linear; total time=   8.7s
[CV] END ..............................C=10.0, kernel=linear; total time=   8.6s
[CV] END ..............................C=10.0, kernel=linear; total time=   7.6s
[CV] END ..............................C=10.0, kernel=linear; total time=   8.7s
[CV] END ..............................C=30.0, kernel=linear; total time=   8.7s
[CV] END ..............................C=30.0, kernel=linear; total time=   7.5s
[CV] END ..............................C=30.0, kernel=linear; total time=   8.7s
[CV] END ..............................C=30.0, kernel=linear; total time=   8.7s
[CV] END ..............................C=30.0, kernel=linear; total time=   7.5s
[CV] END .............................C=100.0, kernel=linear; total time=   8.7s
[CV] END .............................C=100.0, 
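
The log above reports 50 candidates and 250 fits. As a rough check, the sketch below recounts those numbers from the grid itself using only the standard library. It assumes the same `param_grid` as in the cell above; `count_candidates` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from itertools import product

# Same grid as in the cell above: a list of dicts, each expanded independently
param_grid = [
    {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000., 10000., 30000.0]},
    {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]

def count_candidates(grid):
    """Count the combinations in a list of parameter dicts (Cartesian product per dict)."""
    return sum(len(list(product(*sub_grid.values()))) for sub_grid in grid)

n_candidates = count_candidates(param_grid)  # 8 linear + 7 * 6 rbf combinations
cv = 5
n_fits = n_candidates * cv                   # one fit per candidate and fold
print(n_candidates, n_fits)                  # 50 250
```

Because every added value multiplies the number of fits, widening the `rbf` grid is far more expensive than widening the `linear` one.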

In [None]:
import numpy as np

# best_score_ is the mean cross-validated score of the best candidate (here, the negative MSE)
negative_mse = grid_search.best_score_

# Negate the score and take the square root to obtain the root mean squared error
rmse = np.sqrt(-negative_mse)
rmse


70286.61835383571
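
The conversion from a `neg_mean_squared_error` score to an RMSE can be checked by hand on a tiny example. This is a minimal sketch: the values are hypothetical toy numbers, not housing data, and `rmse_from_neg_mse` is a helper name introduced only for this illustration.

```python
import math

def rmse_from_neg_mse(neg_mse):
    """Convert a 'neg_mean_squared_error' score back to an RMSE."""
    return math.sqrt(-neg_mse)

# Hypothetical toy values, not taken from the housing data
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared errors are 1, 0 and 4, so the MSE is 5/3
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
neg_mse = -mse  # scikit-learn negates the MSE so that higher scores are better
rmse = rmse_from_neg_mse(neg_mse)
print(rmse)
```

Note that `best_score_` averages the negative MSE over the folds, so its square root is the root of the mean MSE, which is generally not the same as the mean of the per-fold RMSE values.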

In [None]:
# Retrieve the best hyperparameters found during grid search
grid_search.best_params_


{'C': 30000.0, 'kernel': 'linear'}

Question 2: Try replacing GridSearchCV with RandomizedSearchCV.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal

# Define the distributions for hyperparameters to search over
# 'kernel' is either 'linear' or 'rbf'
# 'C' is sampled from a reciprocal (log-uniform) distribution over [20, 200000]
# 'gamma' is sampled from an exponential distribution with scale 1.0 (ignored by the linear kernel)
param_distribs = {
    'kernel': ['linear', 'rbf'],
    'C': reciprocal(20, 200000),
    'gamma': expon(scale=1.0),
}

# Create an SVR model
svm_reg = SVR()

# Perform randomized search using the specified distributions
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                verbose=2, random_state=42)

# Fit the randomized search to the prepared housing data and corresponding labels
rnd_search.fit(housing_prepared, housing_labels)


Fitting 5 folds for each of 50 candidates, totalling 250 fits
[CV] END C=629.7823295913721, gamma=3.010121430917521, kernel=linear; total time=   7.6s
[CV] END C=629.7823295913721, gamma=3.010121430917521, kernel=linear; total time=   8.5s
[CV] END C=629.7823295913721, gamma=3.010121430917521, kernel=linear; total time=   8.6s
[CV] END C=629.7823295913721, gamma=3.010121430917521, kernel=linear; total time=   7.6s
[CV] END C=629.7823295913721, gamma=3.010121430917521, kernel=linear; total time=   8.6s
[CV] END C=26290.20646430022, gamma=0.9084469696321253, kernel=rbf; total time=  14.7s
[CV] END C=26290.20646430022, gamma=0.9084469696321253, kernel=rbf; total time=  15.9s
[CV] END C=26290.20646430022, gamma=0.9084469696321253, kernel=rbf; total time=  15.1s
[CV] END C=26290.20646430022, gamma=0.9084469696321253, kernel=rbf; total time=  15.2s
[CV] END C=26290.20646430022, gamma=0.9084469696321253, kernel=rbf; total time=  14.8s
[CV] END C=84.14107900575871, gamma=0.059838768608680676, 

In [None]:
# Get the negative mean squared error from the best estimator found by randomized search
negative_mse = rnd_search.best_score_

# Convert the negative mean squared error back to root mean squared error
rmse = np.sqrt(-negative_mse)
rmse


In [None]:
# Retrieve the best hyperparameters found during randomized search
rnd_search.best_params_


In [None]:
import matplotlib.pyplot as plt

# Generate samples from an exponential distribution with scale parameter 1.0
expon_distrib = expon(scale=1.)
samples = expon_distrib.rvs(10000, random_state=42)

# Plot the original exponential distribution
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("Exponential distribution (scale=1.0)")
plt.hist(samples, bins=50)

# Plot the distribution of logarithms of the samples
plt.subplot(122)
plt.title("Log of this distribution")
plt.hist(np.log(samples), bins=50)

# Show the plots
plt.show()


In [None]:
# Generate samples from a reciprocal distribution with range [20, 200000]
reciprocal_distrib = reciprocal(20, 200000)
samples = reciprocal_distrib.rvs(10000, random_state=42)

# Plot the original reciprocal distribution
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("Reciprocal distribution (range=[20, 200000])")
plt.hist(samples, bins=50)

# Plot the distribution of logarithms of the samples
plt.subplot(122)
plt.title("Log of this distribution")
plt.hist(np.log(samples), bins=50)

# Show the plots
plt.show()
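
The reciprocal distribution is log-uniform: the logarithm of a sample is uniformly distributed, so every order of magnitude in the range is equally likely. The sketch below illustrates this with the standard library only. `sample_log_uniform` is a hypothetical helper for this illustration and is not how SciPy implements `reciprocal`.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_log_uniform(20, 200000, rng) for _ in range(10000)]

# The range [20, 200000] spans four decades; each should hold about a quarter of the samples
decades = [(20, 200), (200, 2000), (2000, 20000), (20000, 200000)]
fractions = [sum(lo <= s < hi for s in samples) / len(samples) for lo, hi in decades]
print(fractions)
```

This is why a reciprocal distribution suits a hyperparameter such as `C` when its right scale is unknown: the search spends as much effort between 20 and 200 as between 20,000 and 200,000.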


Question 3: Try adding a transformer to the preparation pipeline that selects only the most important features.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    """Function to get indices of top k elements in an array."""
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    """Custom transformer to select top k features based on their importances."""

    def __init__(self, feature_importances, k):
        """Initialize the transformer with feature importances and the number of top features to select."""
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        """Fit the transformer to the data."""
        # Get indices of top k features based on their importances
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self

    def transform(self, X):
        """Transform the data by selecting only the top k features."""
        return X[:, self.feature_indices_]
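
To see what the selection does without NumPy, here is a dependency-free sketch of the same idea: find the indices of the `k` largest importances, sort them so the column order is preserved, then keep only those columns. The function names and the importance values are hypothetical and chosen for illustration.

```python
import heapq

def indices_of_top_k_py(values, k):
    """Return the sorted indices of the k largest entries of values."""
    top = heapq.nlargest(k, range(len(values)), key=values.__getitem__)
    return sorted(top)

def select_columns(rows, indices):
    """Keep only the given column indices of each row."""
    return [[row[i] for i in indices] for row in rows]

# Hypothetical importances for five features
importances = [0.05, 0.40, 0.10, 0.30, 0.15]
top = indices_of_top_k_py(importances, 3)

# Two toy rows with five columns each
X = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
selected = select_columns(X, top)
print(top, selected)  # [1, 3, 4] [[2, 4, 5], [7, 9, 10]]
```

The NumPy version above uses `np.argpartition`, which avoids a full sort of the importances; the final `np.sort` only orders the `k` selected indices.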


In [None]:
# Define the number of top features to select
k = 5

# Get the indices of top k features based on their importances
top_k_feature_indices = indices_of_top_k(feature_importances, k)

# Display the attribute names of the top k features
np.array(attributes)[top_k_feature_indices]


In [None]:

# Alternatively, display the top k feature importances and their corresponding attribute names
sorted(zip(feature_importances, attributes), reverse=True)[:k]


In [None]:
from sklearn.pipeline import Pipeline

# Create a pipeline for preparation and feature selection
preparation_and_feature_selection_pipeline = Pipeline([
    ('preparation', full_pipeline),  # Include the full data preparation pipeline
    ('feature_selection', TopFeatureSelector(feature_importances, k))  # Include the top feature selector
])

# Apply the pipeline to select the top k features from the prepared housing data
housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)


In [None]:
# Display the first three instances produced by the preparation and feature selection pipeline
housing_prepared_top_k_features[0:3]


In [None]:
# Cross-check: the same top k columns taken directly from the fully prepared data should match
housing_prepared[0:3, top_k_feature_indices]


Question 4: Try creating a single pipeline that performs both the full data preparation and the final prediction.

In [None]:
# Pipeline for preparing the data, selecting top features, and training a Support Vector Machine regressor
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', full_pipeline),  # Preparation step including data preprocessing
    ('feature_selection', TopFeatureSelector(feature_importances, k)),  # Feature selection
    ('svm_reg', SVR(**rnd_search.best_params_))  # Support Vector Machine regressor with best hyperparameters
])


In [None]:
# Fit the pipeline to the training data along with their corresponding labels
prepare_select_and_predict_pipeline.fit(housing, housing_labels)


In [None]:
# Obtain predictions for a subset of the data using the prepared and trained pipeline
some_data = housing.iloc[:4]  # Subset of the housing data
some_labels = housing_labels.iloc[:4]  # Subset of the housing labels
print("Predictions:\t", prepare_select_and_predict_pipeline.predict(some_data))  # Print predictions
print("Labels:\t\t", list(some_labels))  # Print actual labels


Question 5: Use `GridSearchCV` to automatically investigate various preparation options.

In [None]:
# Ignore categories unseen during fitting: a training fold may lack a category that appears in its validation fold
full_pipeline.named_transformers_["cat"].handle_unknown = 'ignore'

# Define parameter grid for GridSearchCV
param_grid = [{
    'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importances) + 1))
}]

# Perform GridSearchCV for preparation and prediction pipeline
grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,
                                scoring='neg_mean_squared_error', verbose=2)
grid_search_prep.fit(housing, housing_labels)
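
The keys in `param_grid` use double underscores to address a parameter inside a nested step: `preparation__num__imputer__strategy` means the `strategy` parameter of the `imputer` step, inside the `num` transformer, inside the `preparation` step. The sketch below mimics that lookup on plain dictionaries. It is an illustration only: `set_nested_param` and the `config` dictionary are hypothetical stand-ins, not scikit-learn code.

```python
def set_nested_param(config, name, value):
    """Set config[a][b][c] = value for a name of the form 'a__b__c'."""
    *path, leaf = name.split('__')
    node = config
    for key in path:
        node = node[key]  # descend one nesting level per name component
    node[leaf] = value

# Hypothetical stand-in for the nested structure of the pipeline
config = {
    'preparation': {'num': {'imputer': {'strategy': 'median'}}},
    'feature_selection': {'k': 5},
}

set_nested_param(config, 'preparation__num__imputer__strategy', 'mean')
set_nested_param(config, 'feature_selection__k', 8)
print(config)
```

Since the grid search sets these parameters on the whole pipeline for every candidate, the imputation strategy and the number of selected features are tuned together with the model rather than fixed in advance.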


In [None]:
# Best parameters found by the GridSearchCV
grid_search_prep.best_params_
