
# Feature Engineering and Regularization

## Data Prep

In [None]:
import pandas as pd
import numpy as np
!pip install pymysql

In [None]:
# Import data from the MySQL database using the following info: host kaylee-db.cbgcswckszgl.us-east-1.rds.amazonaws.com, id cob, password 1212, db='data', port 3306

import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='kaylee-db.cbgcswckszgl.us-east-1.rds.amazonaws.com',
                             user='',
                             password='',
                             db='data',
                             cursorclass=pymysql.cursors.DictCursor)

try:
    # Create a cursor object
    with connection.cursor() as cursor:
        # Execute the SQL query to fetch all rows from the 'fish' table in the 'data' database
        cursor.execute("SELECT * FROM data.fish")

        # Fetch all the results
        results = cursor.fetchall()

        # Convert the results to a Pandas DataFrame
        fish_df = pd.DataFrame(results)

finally:
    # Close the connection
    connection.close()

# Print the DataFrame
print(fish_df)


In [None]:
fish = fish_df.to_numpy()
print(fish)

In [None]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(fish[:,:-1], fish[:,-1], random_state=42)

In [None]:
test_y.shape

## Preprocessing - Feature engineering

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
poly = PolynomialFeatures(include_bias=False)

poly.fit(train_X)
train_poly = poly.transform(train_X)

In [None]:
print(train_poly.shape)

In [None]:
train_poly[:2]

In [None]:
poly.get_feature_names_out()

In [None]:
# Now transform the test set with the same PolynomialFeatures object that was fitted on the train set.
test_poly = poly.transform(test_X)


In [None]:
test_poly[:2]

In [None]:
np.set_printoptions(suppress=True)
test_poly

## Train Multiple Regression

In [None]:
from sklearn.linear_model import LinearRegression

m0 = LinearRegression()
m0.fit(train_poly, train_y)
print(m0.score(train_poly, train_y))

In [None]:
print(m0.score(test_poly, test_y))


## But what if we increase the degree from the default 2 to 5 in PolynomialFeatures()?

In [None]:
poly_5 = PolynomialFeatures(degree = 5, include_bias=False)

poly_5.fit(train_X)
train_poly_5 = poly_5.transform(train_X)
test_poly_5 = poly_5.transform(test_X)

In [None]:
train_y

In [None]:
m1=LinearRegression()
m1.fit(train_poly_5, train_y)


In [None]:
print(m1.score(train_poly_5, train_y))
print(m1.score(test_poly_5, test_y))
# R-squared of -144.5? Something is wrong: the model has seriously overfitted the train set.
# In this case, test_poly_5 has only 14 records but 55 columns (all polynomial terms up to degree 5 of 3 variables).
# Here R^2 becomes negative: because of overfitting, the test predictions differ greatly from the actual values.

In [None]:
test_poly_5[0,:].shape
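
To see where the 55 columns come from, the sketch below enumerates every monomial of total degree 1 through `degree` for three inputs, using only the standard library. The feature names `x0`, `x1`, `x2` and the helper `polynomial_terms` are hypothetical; the helper mimics the kind of terms a polynomial expansion without a bias column produces and is not the scikit-learn implementation.

```python
from itertools import combinations_with_replacement
from math import comb


def polynomial_terms(names, degree):
    """List every monomial of total degree 1..degree (no bias term)."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms


names = ["x0", "x1", "x2"]  # three hypothetical input features
terms_2 = polynomial_terms(names, 2)
terms_5 = polynomial_terms(names, 5)

print(terms_2)
print(len(terms_2), len(terms_5))

# Closed-form count: C(n + d, d) monomials including the bias, minus 1 for the bias.
count_5 = comb(3 + 5, 5) - 1
print(count_5)
```

The number of columns grows quickly with the degree, which is why a small data set can end up with more features than records.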


#### Here $R^2$ becomes negative because $SS_{res}$ > $SS_{tot}$ due to overfitting

$$ \large R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} $$
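
As a small worked example of the formula, the sketch below computes $R^2$ by hand for made-up targets and three made-up sets of predictions. The helper `r_squared` and all numbers are illustrative assumptions, not values from the fish data.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed from plain Python lists."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]            # made-up targets
good_pred = [1.1, 1.9, 3.2, 3.8]         # close to the targets
mean_pred = [2.5, 2.5, 2.5, 2.5]         # always predicts the mean
wild_pred = [10.0, -5.0, 12.0, -8.0]     # far from the targets

r2_good = r_squared(y_true, good_pred)
r2_mean = r_squared(y_true, mean_pred)
r2_wild = r_squared(y_true, wild_pred)

print(r2_good, r2_mean, r2_wild)
```

Predicting the mean gives $R^2 = 0$, so a negative score means the model does worse on that data than a constant prediction would.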








Overfitting occurs when the model fits the training data too closely, capturing noise along with the underlying pattern, so it generalizes poorly to new data.
It can be mitigated by adding a regularization term that helps the model generalize.
Regularization adds a penalty on the weight parameters to the loss function.
L1 and L2 regularization are two common forms.

A larger $\lambda$ increases the strength of the regularization.

$$\text{Lasso Regression (L1 norm)}: \min_w \left( \text{Loss}(w) + \lambda \lVert w \rVert_1 \right)$$  
$$\text{Ridge Regression (L2 norm)}: \min_w \left( \text{Loss}(w) + \lambda \lVert w \rVert_2^2 \right)$$
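
The two penalty terms can be computed directly for a given weight vector. This is a minimal sketch with a made-up weight vector `w`; the helper names `l1_penalty` and `l2_penalty` are assumptions for illustration only.

```python
def l1_penalty(w, lam):
    """lambda * ||w||_1: sum of absolute weights."""
    return lam * sum(abs(wi) for wi in w)


def l2_penalty(w, lam):
    """lambda * ||w||_2^2: sum of squared weights."""
    return lam * sum(wi ** 2 for wi in w)


w = [3.0, -4.0, 0.0]  # made-up weights
for lam in (0.1, 1.0, 10.0):
    # The same weights cost more as lambda grows.
    print(lam, l1_penalty(w, lam), l2_penalty(w, lam))
```

With the same weights, a tenfold increase in $\lambda$ makes each penalty ten times larger, so the minimizer is pushed harder toward small weights.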

# Regularization

# Before applying regularization, standardization is necessary because regularization is scale sensitive

Standardization is a preprocessing technique used to scale the features of data so that they have a mean of 0 and a standard deviation of 1. It is represented as follows for each feature:


$$\large z=\frac{(x-\mu)}{\sigma}$$

    where:  z: standardized value,  x: original value,  μ: mean, σ: standard deviation.

Many machine learning algorithms perform better when the data is standardized. Standardization puts all features on a comparable scale, so the regularization penalty treats every coefficient on equal terms.
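
The formula can be checked by hand on a toy feature. The sketch below uses only the standard library and made-up values; it uses the population standard deviation, which is also what `StandardScaler` computes. Note that the test values are scaled with the mean and standard deviation learned from the training values, mirroring the fit-on-train, transform-both pattern.

```python
from statistics import fmean, pstdev


def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(v - mu) / sigma for v in values]


train = [10.0, 20.0, 30.0, 40.0]  # made-up training feature
test = [25.0, 50.0]               # made-up test feature

mu = fmean(train)      # mean learned from the training values only
sigma = pstdev(train)  # population standard deviation of the training values

train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)  # reuse the training statistics

print(train_z)
print(test_z)
print(fmean(train_z), pstdev(train_z))
```

The standardized training values have mean 0 and standard deviation 1; the test values generally do not, because they are scaled with the training statistics.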

In [None]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
ss.fit(train_poly_5)

train_scaled = ss.transform(train_poly_5)
test_scaled = ss.transform(test_poly_5)

## Lasso : L1 regularization

In [None]:
from sklearn.linear_model import Lasso

lasso = Lasso()
lasso.fit(train_scaled, train_y)
print(lasso.score(train_scaled, train_y))

In [None]:
print(lasso.score(test_scaled, test_y))

## Hyperparameter Choice
The $\lambda$ value is implemented as the `alpha` parameter in sklearn; default=1.0
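
To build intuition for what `alpha` does, consider the simplest possible case: one standardized feature and no intercept. Assuming scikit-learn's documented Lasso objective, $\frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$, the single coefficient has a closed-form solution known as soft thresholding. The data and the helper `soft_threshold` below are made up for illustration.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; return exactly 0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


# Made-up data: x has mean 0 and mean square 1, y has mean 0.
x = [-1.0, 1.0, -1.0, 1.0]
y = [-2.0, 2.0, -1.0, 1.0]
n = len(x)

# Unpenalized least-squares coefficient for a standardized feature.
rho = sum(xi * yi for xi, yi in zip(x, y)) / n

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
coefs = [soft_threshold(rho, a) for a in alphas]

for a, c in zip(alphas, coefs):
    print(a, c)
```

Small values of `alpha` barely change the coefficient, while any `alpha` at or above `|rho|` sets it to exactly zero. This is the mechanism that lets Lasso drop features entirely.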

In [None]:
import matplotlib.pyplot as plt


train_score_l = []
test_score_l = []

alpha_list = [0.001, 0.01, 0.1, 1, 10, 100]
for alpha in alpha_list:
    # lasso model
    lasso = Lasso(alpha=alpha)
    # train lasso model
    lasso.fit(train_scaled, train_y)
    # Save R-squares from train and test
    train_score_l.append(lasso.score(train_scaled, train_y))
    test_score_l.append(lasso.score(test_scaled, test_y))

In [None]:
train_score_l

In [None]:
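plt.plot(np.log10(alpha_list), train_score_l, label='train')
plt.plot(np.log10(alpha_list), test_score_l, label='test')
#plt.vlines(x=1, ymin=0.92, ymax=0.995, ls=':', color='red')
plt.xlabel('log10(alpha)')  # the x-axis shows log10 of alpha, not alpha itself
plt.ylabel('R^2')
plt.legend()
plt.show()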

In [None]:
lasso = Lasso(alpha=10)
lasso.fit(train_scaled, train_y)

print(lasso.score(train_scaled, train_y))
print(lasso.score(test_scaled, test_y))

In [None]:
# 40 out of 55 coefficients are reduced to 0.
print(len(lasso.coef_))
print(np.sum(lasso.coef_==0))

# Although 55 features were fed into the model, the lasso model only used 15 of them.
# Because of this property, lasso can also be used for feature selection.
# (Ridge shrinks coefficients toward 0 but does not, in general, set them exactly to 0.)

## Ridge : L2 regularization
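
For contrast with Lasso, the sketch below solves ridge regression for one feature without an intercept. Assuming scikit-learn's documented Ridge objective, $\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$, the single coefficient is $w = \sum x_i y_i / (\sum x_i^2 + \alpha)$. The data and the helper `ridge_coef_1d` are made up for illustration.

```python
def ridge_coef_1d(x, y, alpha):
    """Closed-form ridge coefficient for one feature, no intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


# Made-up data with mean-zero x and y.
x = [-1.0, 1.0, -1.0, 1.0]
y = [-2.0, 2.0, -1.0, 1.0]

alphas = [0, 0.001, 0.01, 0.1, 1, 10, 100]
ridge_coefs = [ridge_coef_1d(x, y, a) for a in alphas]

for a, c in zip(alphas, ridge_coefs):
    print(a, c)
```

The coefficient shrinks steadily as `alpha` grows but stays nonzero for any finite `alpha`, which is the key difference from the L1 penalty.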

In [None]:
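# Import Ridge first so the help lookup works, then show its documentation
from sklearn.linear_model import Ridge
Ridge?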

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge()
ridge.fit(train_scaled, train_y)
print(ridge.score(train_scaled, train_y))

In [None]:
print(ridge.score(test_scaled, test_y))

In [None]:
train_score_r = []
test_score_r = []

In [None]:
alpha_list = [0.001, 0.01, 0.1, 1, 10, 100]
# Increase alpha by a factor of 10 at each step, from 0.001 to 100,
# training a Ridge regression model for each value.

for alpha in alpha_list:
    # ridge model
    ridge = Ridge(alpha=alpha)
    # train ridge model
    ridge.fit(train_scaled, train_y)
    # Save R squares
    train_score_r.append(ridge.score(train_scaled, train_y))
    test_score_r.append(ridge.score(test_scaled, test_y))

In [None]:
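plt.plot(np.log10(alpha_list), train_score_r, label='train')
plt.plot(np.log10(alpha_list), test_score_r, label='test')
#plt.vlines(x=-1, ymin=0.955, ymax=0.999, ls=':', color='r')
plt.xlabel(r'$\log_{10}(\alpha)$')  # the x-axis shows log10 of alpha, not alpha itself
plt.ylabel('$R^2$')
plt.legend()
plt.show()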

In [None]:
ridge = Ridge()
ridge.fit(train_scaled, train_y)

print(ridge.score(train_scaled, train_y))
print(ridge.score(test_scaled, test_y))

In [None]:
# Fix the code in the cell above to optimize ridge regression by modifying the hyperparameter alpha here.





