----

# **Demonstrate Ridge Regression**

## **Author**   :  **Muhammad Adil Naeem**

## **Contact**   :   **madilnaeem0@gmail.com**
<br>

----

This notebook uses the **Boston Housing dataset**, which is commonly used in regression analysis and machine learning projects to predict housing prices in the Boston area. It contains features (columns) such as crime rate, average number of rooms, and property tax rate, along with the target variable `MEDV`, the median value of owner-occupied homes (in $1000s).

### Key Columns:
- **CRIM**: Per capita crime rate by town.
- **ZN**: Proportion of residential land zoned for lots over 25,000 sq. ft.
- **INDUS**: Proportion of non-retail business acres per town.
- **CHAS**: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- **NOX**: Nitric oxides concentration (parts per 10 million).
- **RM**: Average number of rooms per dwelling.
- **AGE**: Proportion of owner-occupied units built before 1940.
- **DIS**: Weighted distances to five Boston employment centers.
- **RAD**: Index of accessibility to radial highways.
- **TAX**: Full-value property tax rate per $10,000.
- **PTRATIO**: Pupil-teacher ratio by town.
- **B**: 1000(Bk - 0.63)^2, where Bk is the proportion of Black people by town.
- **LSTAT**: Percentage of lower status of the population.
- **MEDV**: Median value of owner-occupied homes in $1000s.


# **Objectives:**

- We will apply Ridge Regression, which is linear regression with **L2 regularization**, and tune its regularization strength `alpha`.
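
For reference, here is a sketch of the objective that Ridge regression minimizes, written in the form used in scikit-learn's documentation. The symbols $X$ (feature matrix), $y$ (target vector), and $w$ (coefficient vector, excluding the intercept) are notation introduced here, not names from this notebook:

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

With $\alpha = 0$ this reduces to ordinary least squares; larger values of $\alpha$ shrink the coefficients toward zero.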


### **Importing Libraries**

In [57]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold, GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

### **Load the Dataset**

In [58]:
df = pd.read_csv('/content/BostonHousing.csv')

### **First 5 rows of Dataset**

In [59]:
df.head()

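| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |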


### **Shape of the Dataset**

In [60]:
print(f'This dataset consists of {df.shape[0]} rows and {df.shape[1]} columns.')

This dataset consists of 506 rows and 14 columns.


### **Information of the Dataset**

In [61]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   indus    506 non-null    float64
 3   chas     506 non-null    int64  
 4   nox      506 non-null    float64
 5   rm       501 non-null    float64
 6   age      506 non-null    float64
 7   dis      506 non-null    float64
 8   rad      506 non-null    int64  
 9   tax      506 non-null    int64  
 10  ptratio  506 non-null    float64
 11  b        506 non-null    float64
 12  lstat    506 non-null    float64
 13  medv     506 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB


### **Check Missing Values**

In [62]:
df.isnull().sum()

Unnamed: 0,0
crim,0
zn,0
indus,0
chas,0
nox,0
rm,5
age,0
dis,0
rad,0
tax,0
ptratio,0
b,0
lstat,0
medv,0


### **Fill missing values of `rm` using median**

In [63]:
df['rm'] = df['rm'].fillna(df['rm'].median())

### **Statistical Description of Numerical Columns**

In [64]:
df.describe()

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.283587 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702126 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.208 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.677083 | 12.5 | 18.1 | 0.0 | 0.624 | 6.61875 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


### **Create Dependent and Independent Variable**

In [65]:
X = df.drop(['medv'],axis=1)
y = df['medv']

### **Splitting Data into Train and Test Sets**

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

### **Ridge Model**

In [67]:
ridge_model = Ridge(alpha=1).fit(X_train,y_train)
ridge_model.intercept_

24.834188563554434

- This cell fits a Ridge regression model on the training data `X_train` and `y_train` with a regularization strength of `alpha=1`. The attribute `ridge_model.intercept_` holds the intercept (bias) of the fitted model, which Ridge does not penalize.
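
To see what `alpha` does, here is a minimal single-feature sketch in plain Python. It is not scikit-learn's implementation; the helper `fit_ridge_1d` and the toy data are hypothetical. The data is centered first so that only the slope is penalized, which mirrors how `Ridge` treats the intercept when `fit_intercept=True`:

```python
from statistics import fmean


def fit_ridge_1d(x, y, alpha=1.0):
    """Fit y ~ intercept + slope * x with an L2 penalty on the slope only."""
    x_mean, y_mean = fmean(x), fmean(y)
    # Centering x and y keeps the intercept out of the penalty term.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    # alpha in the denominator shrinks the slope toward zero.
    slope = sxy / (sxx + alpha)
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical toy data lying exactly on the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

for alpha in (0.0, 1.0, 10.0):
    intercept, slope = fit_ridge_1d(x, y, alpha)
    print(f"alpha={alpha}: intercept={intercept:.3f}, slope={slope:.3f}")
```

On this toy data the slope goes from 2.0 at `alpha=0` to about 1.818 at `alpha=1` and 1.0 at `alpha=10`, while the intercept moves to compensate.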

### **Let's Calculate Root Mean Squared Error and R2 Score**

In [68]:
y_pred = ridge_model.predict(X_test)

In [69]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

In [70]:
r2 = r2_score(y_test, y_pred)

In [71]:
print("RMSE:", rmse)
print("R^2 Score:", r2)

RMSE: 4.7533170036297925
R^2 Score: 0.6773533610261309


### **Coefficients of the Model**

In [72]:
ridge_model.coef_

array([-0.12385778,  0.03138564,  0.01905842,  2.54303001, -8.77787223,
        4.38065125, -0.01507134, -1.28357524,  0.24357931, -0.01079548,
       -0.83572093,  0.01350036, -0.53378837])

### **Let's Use GridSearchCV**

In [73]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

#### **Define a Grid**

In [74]:
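# Candidate regularization strengths: 0.0, 0.1, ..., 0.9
grid = {'alpha': np.arange(0, 1, 0.1)}
model = Ridge()
# Score each alpha by negated mean absolute error across the repeated folds
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=1)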

In [75]:
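results = search.fit(X_train, y_train)
# best_score_ is the negated MAE (scoring='neg_mean_absolute_error'), so it prints as a negative number
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)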

MAE: -3.502
Config: {'alpha': 0.7000000000000001}
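
The search above evaluates 10 candidate values of `alpha` on 10 folds repeated 3 times. The sketch below illustrates that bookkeeping in plain Python. It is not how scikit-learn assigns folds internally; the helper `repeated_kfold_indices` and the sample count of 20 are hypothetical:

```python
import random


def repeated_kfold_indices(n_samples, n_splits, n_repeats, seed=0):
    """Yield (train, validation) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        # Deal the shuffled indices into n_splits folds of near-equal size.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for k in range(n_splits):
            validation = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield train, validation


splits = list(repeated_kfold_indices(n_samples=20, n_splits=10, n_repeats=3))
n_alphas = 10  # 0.0, 0.1, ..., 0.9, as in the grid above
total_fits = n_alphas * len(splits)

print("splits per alpha:", len(splits))
print("cross-validation fits:", total_fits)
```

Each candidate is scored as the average over all 30 validation folds, which makes the comparison between neighboring `alpha` values less sensitive to one lucky or unlucky split.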


### **Refit the Model with the Best `alpha`**

In [76]:
ridge_model = Ridge(alpha=0.7).fit(X_train, y_train)
y_pred = ridge_model.predict(X_test)

In [77]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

In [78]:
print("RMSE:", rmse)
print("R^2 Score:", r2)

RMSE: 4.74374561060859
R^2 Score: 0.6786514308042815


- `RMSE` decreased slightly, from 4.753 to 4.744.
- `r2_score` increased slightly, from 0.6774 to 0.6787.

In [79]:
pd.Series(ridge_model.coef_, index=X_train.columns)

Unnamed: 0,0
crim,-0.124671
zn,0.031021
indus,0.024938
chas,2.595997
nox,-10.184433
rm,4.383117
age,-0.01396
dis,-1.304218
rad,0.24595
tax,-0.010625


This code creates a pandas Series from the coefficients of the fitted Ridge regression model (`ridge_model`). Here's what it does:

- **`ridge_model.coef_`:** This retrieves the coefficients of the features used in the Ridge regression model.
- **`pd.Series(..., index=X_train.columns)`:** This creates a Pandas Series where the values are the coefficients and the index is set to the names of the features (columns) in the `X_train` DataFrame.

### Purpose:
The resulting Series pairs each coefficient with its feature name, which makes the fitted model easier to read. A positive coefficient means the predicted `medv` increases as that feature increases with the other features held fixed; a negative coefficient means it decreases. Because the features are measured on different scales, coefficient magnitudes are not directly comparable across features.
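
As a small illustration, the sketch below ranks the coefficients by absolute value and reports their sign. It uses a plain dictionary so that it runs without the dataset; the values are copied from the `alpha=0.7` output shown above, and only the rows visible in that output are included:

```python
# Coefficients copied from the alpha=0.7 output (only the rows shown there).
coefs = {
    "crim": -0.124671,
    "zn": 0.031021,
    "indus": 0.024938,
    "chas": 2.595997,
    "nox": -10.184433,
    "rm": 4.383117,
    "age": -0.01396,
    "dis": -1.304218,
    "rad": 0.24595,
    "tax": -0.010625,
}

# Sort by magnitude, largest first; the sign gives the direction of the effect.
ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    direction = "raises" if value > 0 else "lowers"
    print(f"{name:>6}: {value:+.6f}  ({direction} predicted medv)")
```

With a pandas Series the same ordering could be obtained by sorting on the absolute values. The ranking reflects the units of each feature as much as its influence, so it should be read as a description of this fitted model rather than a measure of feature importance.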