# Regression Model

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:



*   A model that predicts a certain house's value, such as 423,000 Euros.
*   A model that predicts a certain tree's life expectancy, such as 23.2 years.
*   A model that predicts the amount of rain that will fall in a certain city over the next six hours, such as 0.18 inches.

Two common types of regression models are:


*   Linear regression, which finds the line that best fits label values to features.
*   Logistic regression, which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.
[source](https://developers.google.com/machine-learning/glossary#regression_model)



## Problem
You want to train a model that represents a linear relationship between the features and the target vector.

Dataset: **The California housing dataset**. This dataset can be fetched from the internet using scikit-learn.

Target variable: The **median house value** for California districts,
expressed in hundreds of thousands of dollars ($100,000).





## Solution
We will use some commonly used regression algorithms:

1.   Linear regression
2.   Lasso Regression (L1 Regularization)
3.   Ridge Regression (L2 Regularization)

First, let's start with loading the dataset, data visualization, and data preparation.







In [8]:
#load california housing dataset from scikit
#TODO

#load dataset
#TODO
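One possible way to fill in this cell, as a sketch rather than the reference solution: `fetch_california_housing` with `as_frame=True` returns an object whose `frame` attribute holds the features and the target in one DataFrame. The helper name `load_housing` is hypothetical, and the first call downloads the data, so it needs network access.

```python
from sklearn.datasets import fetch_california_housing


def load_housing():
    """Return (bunch, DataFrame) for the California housing data.

    Hypothetical helper; the first call downloads the dataset.
    """
    bunch = fetch_california_housing(as_frame=True)
    return bunch, bunch.frame  # frame = feature columns plus MedHouseVal


# Usage (requires network access on the first call):
# housing, df = load_housing()
# print(housing.DESCR)   # dataset description
# print(df.head())       # first five rows
```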


We can have a first look at the available description

In [9]:
# California housing dataset description using the DESCR attribute
#TODO

### First, let's get familiarized with the data that we have:

In [10]:
#let's take a look at the data frame head
#TODO


In this dataset, we have information regarding the demography (income, population, house occupancy) in the districts, the location of the districts (latitude, longitude), and general information regarding the houses in the districts (number of rooms, number of bedrooms, age of the house).

In [11]:
#show description of data
#TODO


Now, let’s have a look at the target to be predicted.

The target contains the median of the house value for each district. Therefore, this problem is a regression problem.

In [12]:
# Target variable
# TODO


We can see that:

*   the dataset contains 20,640 samples and 8 features
*   all features are numerical features encoded as floating-point numbers
*   there are no missing values:

In [13]:
# Missing data
#TODO


In [14]:
# data frame info
#TODO
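The checks above can be sketched as follows. To keep the example runnable offline, it uses a small synthetic stand-in with a few of the real column names; this is an assumption, and in the notebook the frame would come from the loaded dataset instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing frame (assumption, not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "MedInc": rng.uniform(0.5, 15.0, size=100),
    "HouseAge": rng.uniform(1.0, 52.0, size=100),
    "MedHouseVal": rng.uniform(0.15, 5.0, size=100),
})

missing_per_column = df.isna().sum()  # number of NaN values in each column
print(missing_per_column)
print(df.shape)                       # (number of samples, number of columns)
df.info()                             # dtypes and non-null counts per column
```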


Let’s have a quick look at the distribution of these features by plotting their histograms.

In [15]:
# distribution of features by plotting their histograms
# TODO



Now it is time to explore further. First of all, we want to visualize the geographical data with latitude and longitude. A good way to do this is to create a scatterplot of all the districts. Setting alpha to a low value such as 0.2 makes the points semi-transparent, so areas where many districts overlap appear darker and high-density regions are much easier to see. Try different alpha values to see the changes.

In [16]:
# plot the housing value with respect to longitude and latitude
# TODO
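A minimal sketch of such a plot, assuming matplotlib and using random synthetic coordinates in place of the real districts (so the result will not look like California). The column names mirror the scikit-learn dataset; the division by 100 for the marker size is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption); replace with the real frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Longitude": rng.uniform(-124.3, -114.3, size=500),
    "Latitude": rng.uniform(32.5, 42.0, size=500),
    "Population": rng.uniform(3, 5000, size=500),
    "MedHouseVal": rng.uniform(0.15, 5.0, size=500),
})

fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(
    df["Longitude"], df["Latitude"],
    s=df["Population"] / 100,  # marker area grows with population
    c=df["MedHouseVal"],       # color encodes the median house value
    cmap="viridis",
    alpha=0.2,                 # transparency reveals dense regions
)
fig.colorbar(points, ax=ax, label="MedHouseVal (x 100,000 USD)")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```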


Yes! Map of California! Please note that California's big cities, such as San Diego, Los Angeles, San Jose, and San Francisco, are located on the west coast!

In the plot above, the color shows the house value and the radius of each circle corresponds to the population of the area.

Based on this plot, we can conclude that:
1. Houses near the ocean are worth more, for example around San Diego, Los Angeles, San Jose, and San Francisco.
2. Houses in areas with high population density are also worth more, but the effect decreases as we move further away from the ocean.
3. There are some outliers.

## Searching for Correlations:
The housing dataset isn't that large, so we can easily compute the correlations between every pair of attributes using the "corr()" method. We will start by looking at how much each attribute is correlated with the median house value.

In [17]:
# Correlations between attributes
# TODO
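The correlation step described above can be sketched like this. The data is synthetic (an assumption made so the example runs offline), with a target constructed to depend mostly on income, so the ranking it prints is illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
med_inc = rng.uniform(0.5, 15.0, size=n)
house_age = rng.uniform(1.0, 52.0, size=n)
# Synthetic target: driven mostly by income, plus noise.
med_house_val = 0.4 * med_inc + rng.normal(0.0, 0.5, size=n)
df = pd.DataFrame({
    "MedInc": med_inc,
    "HouseAge": house_age,
    "MedHouseVal": med_house_val,
})

corr_matrix = df.corr()  # pairwise Pearson correlations
# Correlation of every column with the target, strongest positive first.
target_corr = corr_matrix["MedHouseVal"].sort_values(ascending=False)
print(target_corr)
```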


The correlation coefficient ranges from -1 to 1. Values close to 1 indicate a strong positive correlation, and values close to -1 indicate a strong negative correlation. Values close to 0 mean that there is no linear correlation, neither negative nor positive. You can see that the median income (MedInc) is the attribute most correlated with the median house value. Because of that, we will generate a more detailed scatterplot below:

In [18]:
# Correlation between MedInc and MedHouseVal
# TODO


## Random Sampling
We can perform random subsampling to reduce the number of data points for plotting, while still capturing the relevant characteristics.

**DataFrame.sample**(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False) [source](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)

Return a random sample of items from an axis of object. You can use random_state for reproducibility.

In [19]:
# TODO
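A small sketch of `DataFrame.sample` on a synthetic frame (the data is an assumption). It draws 100 rows and shows that reusing the same `random_state` gives the same rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in frame with 1,000 rows.
df = pd.DataFrame({
    "MedInc": np.linspace(0.5, 15.0, 1000),
    "MedHouseVal": np.linspace(0.15, 5.0, 1000),
})

subset = df.sample(n=100, random_state=0)        # 100 rows, without replacement
subset_again = df.sample(n=100, random_state=0)  # same seed, same rows
print(subset.head())
```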



We can make a final analysis by making a pair plot of all features and the target, dropping the longitude and latitude. We will quantize the target so that we can create a proper histogram.

## Feature Scaling
Feature scaling is one of the most important transformations you need to apply, since many machine learning algorithms perform poorly when the input numerical attributes have widely varying scales, which is the case in our current dataset. For example, the median incomes range from 0 to 15, but the total number of rooms ranges from 6 to 39,320. Note that scaling the target values is not required.

There are two common ways:


*   min-max scaling
*   standardization

We use standardization here, feel free to try min-max scaling too.



In [20]:
# standardization
# TODO
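Standardization can be sketched with scikit-learn's `StandardScaler`, which subtracts each column's mean and divides by its standard deviation. The two synthetic columns below are assumptions chosen to have very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic features on very different scales.
X = np.column_stack([
    rng.uniform(0.5, 15.0, size=200),
    rng.uniform(3.0, 35000.0, size=200),
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (x - column mean) / column std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

A common variation is to fit the scaler on the training split only and reuse it to transform the test split, which keeps test-set statistics out of training.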



## Split the dataset for testing and training

Here, we randomly split the data into training and testing sets using the train_test_split() function. 80% is kept for training and 20% for testing.

In [21]:
# Split the dataset for testing and training
# TODO
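A sketch of the 80/20 split on toy arrays; the `random_state` value of 42 is an arbitrary assumption that only makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```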


# 1. Linear Regression

Linear regression is a basic supervised learning algorithm that is widely used for making predictions. It is often taught in introductory statistics courses and is considered a fundamental technique in data analysis. Although it is straightforward and relatively simple compared to other machine learning algorithms, **linear regression remains valuable for predicting quantitative values such as home prices or ages**. Despite its simplicity, linear regression and its variations remain relevant and effective in practical applications.


## Fitting Linear Regression model

In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Instantiate a linear regression object
#TODO


# fitting model or training model:
#TODO


As you know, linear regression assumes that the relationship between the features and the target vector is approximately linear. That is, the effect (also called coefficient, weight, or parameter) of each feature on the target vector is constant. In our solution, we have trained our model using 8 features.

In [23]:
# Show features
# TODO


After we have fit our model, we can view the value of each parameter. The bias, or intercept, can be viewed using intercept_:

In [24]:
# View the intercept
# TODO


And coefficients are shown using coef_:

In [25]:
# View the feature coefficients
# TODO
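To see what `intercept_` and `coef_` hold, here is a sketch on a synthetic, noiseless target with a known intercept and known weights (the data is an assumption; the variable name `regressor` follows the later cells).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Noiseless linear target: intercept 2, weights 3 and -1.
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

regressor = LinearRegression()
regressor.fit(X, y)

print(regressor.intercept_)  # estimated bias
print(regressor.coef_)       # one estimated weight per feature
```

Because the target is an exact linear function of the features, the fitted parameters recover the values used to build it, up to floating-point error.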


In our dataset, the target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). Therefore, the median house value of the first district in the dataset, in dollars, is:

In [26]:
# First value in the target vector (y) multiplied by 100,000
# TODO


In [27]:
# Predict the target value of the first observation, multiplied by 100,000
# TODO


In [28]:
# Difference between actual and predicted values
# TODO


This means our model was off by $33,668 for this one observation! A single prediction says little about overall quality, so how can we be sure about this model?

## K-Fold Cross-Validation
As we know, we don't want to use the test set until we are confident about our model. But how can we test how our model performs if we can't use the test data? One way to do this is **K-Fold Cross-Validation**, which uses part of the training set for training and another part for validation. The following code randomly splits the training set into 10 subsets called folds. Then it trains and evaluates the model 10 times, each time holding out a different fold for validation and training on the other 9:

In [29]:
# import libraries
# TODO


# Define cross-validation strategy
# TODO


# uncomment:
# cv_scores = cross_val_score(regressor, X, y, cv=kf, scoring='neg_mean_squared_error')

# cv_predictions = cross_val_predict(regressor, X, y, cv=kf)

# test set
# uncomment
# cv_scores_test = cross_val_score(regressor, X_test, y_test, cv=kf, scoring='neg_mean_squared_error')
# y_pred_test = regressor.predict(X_test)


**note**: The term neg_mean_squared_error refers to a scoring method used in cross-validation and model evaluation in scikit-learn, where the goal is to minimize the Mean Squared Error (MSE). In scikit-learn, some metrics are defined as being maximized (higher is better), so for metrics like MSE, which are minimized (lower is better), the negative value is used to allow the cross-validation function to maximize a score by minimizing the error.

[more information on Metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html)

The *random_state* parameter in KFold cross-validation is used to control the randomness involved in the shuffling process of the data before splitting it into folds. It takes effect only when *shuffle=True*. Setting a random_state ensures that the same data split is used every time the code is run, which helps in achieving reproducibility of the results.
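The cross-validation setup described above can be sketched as follows. The data is synthetic (an assumption), while the variable names `regressor`, `kf`, `cv_scores`, and `cv_predictions` follow the commented lines in the cell.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

regressor = LinearRegression()
# 10 folds; shuffle=True is required for random_state to have an effect.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# One negated MSE per fold (higher, i.e. closer to 0, is better).
cv_scores = cross_val_score(regressor, X, y, cv=kf, scoring="neg_mean_squared_error")
# Out-of-fold prediction for every sample.
cv_predictions = cross_val_predict(regressor, X, y, cv=kf)

cv_mse = -cv_scores.mean()  # flip the sign to get the usual MSE
print(cv_mse)
```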



## Evaluate the model

In [30]:
# Calculate evaluation metrics for training set

# uncomment and complete
# TODO

# mae =
# mse =
# rmse =
# r2 =

# # Print evaluation metrics
# print(f"Mean Absolute Error (MAE): {mae}")
# print(f"Cross-Validated Mean Squared Error (MSE): {-np.mean(cv_scores)}")
# print(f"R-squared: {r2}")
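The four metrics can be computed with `sklearn.metrics`. This sketch uses a tiny hand-made pair of arrays (an assumption) so the numbers are easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # share of variance explained

print(f"MAE={mae}, MSE={mse}, RMSE={rmse:.4f}, R2={r2:.4f}")
```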


## Scores explanation (standardized data)
Since we scaled (using standardization) the features (independent variables) and the target (dependent variable), the error metrics are expressed in standardized units, that is, in standard deviations of the target, not in dollars. Standardization does not by itself bound MAE or MSE to the range 0 to 1. The values below are small because the errors are a fraction of one standard deviation.

**MAE**: 0.46106034365834575. On average, the model's predicted house prices are about 0.46 standardized units away from the actual values. This indicates that the model's predictions are fairly close to the actual values.

**MSE**: 0.3999298125600988. On average, the squared differences between the predicted and actual house prices across the cross-validation folds are approximately 0.40 in squared standardized units. This suggests that the model has a reasonably good fit, though not perfect.

**R-squared**: 0.6000701874399013, indicates approximately 60% of the variability in the house prices can be explained by the model's predictors. This indicates a moderate level of explanatory power, with the remaining 40% of the variability due to other factors or noise.
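The MSE and R-squared values reported above add up to 1, which is what we would expect under two assumptions: both are computed on the same predictions, and the target was standardized to variance 1. A short derivation, with $\hat{y}_i$ the prediction and $\bar{y}$ the mean of the target:

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    = 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)}
$$

With $\mathrm{Var}(y) = 1$ this reduces to $R^2 = 1 - \mathrm{MSE}$, and indeed $1 - 0.3999 \approx 0.6001$.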

## Conversion to Original Scale

Let's see the evaluation metrics with original house prices (no scaling) to better sense the scores.

In [31]:
# Mean (original data)
# TODO
# original_mean =
# print("Mean of original house prices = $", original_mean)

# Standard Deviation (original data)
# TODO
# original_std =
# print("Standard deviation of original house prices = $", original_std)

In [32]:
# Convert predictions and actuals back to the original scale
# uncomment:
# predictions_original_scale = cv_predictions * original_std + original_mean
# actuals_original_scale = y * original_std + original_mean

# Calculate evaluation metrics on original scale
# TODO

# Print evaluation metrics
# TODO
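The conversion back to the original scale can be sketched on synthetic prices (an assumption; the noisy `cv_predictions` array only stands in for real model output). It also shows how the metrics rescale: MAE by the standard deviation, MSE by its square.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
prices = rng.uniform(50_000, 500_000, size=300)  # synthetic prices in dollars
original_mean, original_std = prices.mean(), prices.std()

y = (prices - original_mean) / original_std           # standardized target
cv_predictions = y + rng.normal(scale=0.3, size=300)  # stand-in for model output

# Undo the standardization: multiply by std, then add the mean.
predictions_original_scale = cv_predictions * original_std + original_mean
actuals_original_scale = y * original_std + original_mean

mae_std = mean_absolute_error(y, cv_predictions)
mae_dollars = mean_absolute_error(actuals_original_scale, predictions_original_scale)
mse_std = mean_squared_error(y, cv_predictions)
mse_dollars = mean_squared_error(actuals_original_scale, predictions_original_scale)

print(f"MAE: {mae_std:.3f} standardized units = ${mae_dollars:,.0f}")
```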

In [33]:
import matplotlib.pyplot as plt

# Scatter plot of actual vs predicted values
# TODO



# 2. LASSO Regression (L1 Regularization)

L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), is a technique used in regression models to prevent overfitting and to perform feature selection.

**Regularization**: Regularization adds a penalty to the cost (loss) function (the function the model tries to minimize). This penalty discourages the model from fitting the training data too closely, which helps it generalize better to new data.

**L1 Penalty**: In L1 regularization, the penalty added to the cost function is proportional to the sum of the absolute values of the coefficients.

**Feature Selection**: One of the key features of L1 regularization is that it can shrink some coefficients to exactly zero. This means that the model effectively ignores those features, performing automatic feature selection. This is particularly useful when you have many features, some of which may be irrelevant.

**Controlling Overfitting**: By adding this penalty, L1 regularization prevents the model from becoming too complex and fitting the noise in the training data, thus reducing the risk of overfitting (an undesirable machine learning behavior that occurs when the machine learning model gives accurate predictions for training data but not for unseen data).


*   Example: Predicting medical expenses and identifying the most significant factors affecting costs.


**Note**: In the context of L1 and L2 regularization, "L1" and "L2" refer to different types of norm-based penalties applied to the regression coefficients to prevent overfitting and improve generalization.
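Written out, the two penalties look as follows. The scaling of the squared-error term shown here is the one documented for scikit-learn's `Lasso` and `Ridge` estimators; other texts may scale it differently. Here $w$ is the coefficient vector, $n$ the number of samples, and $\alpha$ the regularization strength.

$$
\text{Lasso (L1):}\quad \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_j \lvert w_j \rvert
$$

$$
\text{Ridge (L2):}\quad \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2,
\qquad \lVert w \rVert_2^2 = \sum_j w_j^2
$$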

## Train the Lasso Regression Model

In [34]:
from sklearn.linear_model import Lasso

# Train the Lasso model
# You can adjust alpha for regularization strength

# instantiate a lasso object
# TODO

# fit the model on train set
# TODO
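To illustrate the feature-selection effect, here is a sketch on synthetic data (an assumption) where only the first two of five features influence the target. The value `alpha=0.1` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

print(lasso.coef_)
n_zero = int(np.sum(lasso.coef_ == 0))  # coefficients driven exactly to zero
print(n_zero)
```

The coefficients of the three irrelevant features are set exactly to zero, while the two relevant ones are kept, slightly shrunk towards zero.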


## Make Predictions

In [35]:
# Make predictions by the trained lasso model
# TODO



## Evaluate the Model

In [36]:
# Calculate evaluation metrics (get help from the previous model evaluation)
# Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared
# TODO

# Print evaluation metrics
# TODO


## Cross-Validation

In [37]:
# Cross-validation (get help from previous model)
# TODO




## Grid Search for optimization
Grid search is a method used for hyperparameter tuning in machine learning. It systematically works through multiple combinations of parameter values, cross-validates each combination, and determines the set of parameters that gives the best performance. The main goal of grid search is to find the optimal hyperparameters for a given model to improve its accuracy or other performance metrics.

Here we use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) from sklearn.



## Tuning the Alpha Hyperparameter
You can tune the alpha parameter to find the optimal value for the dataset. The alpha hyperparameter determines the strength of regularization applied to the model.

### Effect of Alpha:

*   **High alpha**: A high alpha value increases the penalty on the coefficients, leading to more coefficients being shrunk to zero. This results in a simpler model with potentially fewer features (automatic feature selection).
*   **Low alpha**: A low alpha value reduces the penalty on the coefficients, making the model more similar to ordinary least squares regression with little to no regularization.

In [38]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
# param_grid = {'alpha': [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1, 10, 100]}

# Perform grid search with GridSearchCV
# TODO

# fit grid search
# TODO



In [39]:
# Best alpha
# TODO

# Train the Lasso model with the best alpha
# TODO

# fit the best lasso model
# TODO

# Make predictions and evaluate
# TODO
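The grid search over alpha can be sketched like this, using the parameter grid from the cell above on synthetic data (an assumption). The choices `cv=5` and `neg_mean_squared_error` are also assumptions, not necessarily what the exercise intends.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

param_grid = {"alpha": [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1, 10, 100]}

# Cross-validate every alpha in the grid and keep the best one.
grid_search = GridSearchCV(Lasso(), param_grid, cv=5, scoring="neg_mean_squared_error")
grid_search.fit(X, y)

best_alpha = grid_search.best_params_["alpha"]
best_lasso = grid_search.best_estimator_  # refit on all of X, y with best_alpha
y_pred = best_lasso.predict(X)
print(best_alpha)
```

On this low-noise synthetic data the smallest alpha in the grid wins, because stronger shrinkage only adds bias; on real data the best value can differ.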


## Best LASSO model evaluation

In [40]:
# Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared
# TODO



# 3. Ridge Regression (L2 Regularization)
L2 regularization, also known as Ridge regression, is a technique used in regression models to prevent overfitting and improve the model's generalization by adding a penalty term to the loss function. This penalty term is proportional to the square of the magnitude of the coefficients.

*   **L2 Penalty**: In L2 regularization, the penalty added to the cost function is proportional to the sum of the squares of the coefficients.

*   **Shrinkage**: The L2 penalty causes the coefficients to be "shrunk" towards zero, but not exactly to zero. This means that all features are kept in the model, but their impact is reduced.

*   **Controlling Overfitting**: By adding this penalty, L2 regularization prevents the model from becoming too complex and fitting the noise in the training data, thus reducing the risk of overfitting.


Example: Predicting housing prices with regularization to avoid overfitting on training data.

## Train the Ridge Regression Model

In [41]:
from sklearn.linear_model import Ridge

# instantiate a Ridge object
# You can adjust alpha for regularization strength
# TODO

# Train the Ridge model
# TODO
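To see the shrinkage effect, this sketch compares Ridge with ordinary least squares on synthetic data (an assumption); `alpha=10.0` is an arbitrary value chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

ols_norm = np.linalg.norm(ols.coef_)      # size of the unpenalized weights
ridge_norm = np.linalg.norm(ridge.coef_)  # smaller, because of the L2 penalty
print(ols.coef_)
print(ridge.coef_)
```

The Ridge coefficients are smaller overall than the least-squares ones, but none of them is exactly zero, in contrast to Lasso.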


## Make Predictions with Ridge model

In [42]:
# Make predictions
# TODO


## Evaluate Ridge Model

In [43]:
# Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared
# Calculate evaluation metrics
# TODO


## Tuning the Alpha Parameter using grid search

In [44]:
# Use the parameter grid from earlier

# Perform grid search
# TODO

# Best alpha
# TODO

# Train the Ridge model with the best alpha
# TODO

# Make predictions and evaluate
# TODO




## Best Ridge model evaluation

In [45]:
# Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared
# TODO