## Q1 Load the dataset.

The file has no index column.
The last column is the target column.
The first row of the file contains the column ids.


Which dataset are you using for this exam?


NPPE1_ModelBuilding3.csv

In [None]:
# ----- Mount Drive and Load Dataset -----
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np

# Change the file name if you wish to use a different dataset.
data_path = '/content/drive/MyDrive/MLP/NPPE-1/Model-building/NPPE1_ModelBuilding3.csv'
df_model = pd.read_csv(data_path)
print("Dataset loaded. Columns:", df_model.columns)

Mounted at /content/drive
Dataset loaded. Columns: Index(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
       '13', '14'],
      dtype='object')


## Q2 Split the dataset into train dataset and test dataset in the following manner

  Use train_test_split to split the dataset into train and test sets with test_size equal to 0.3 (i.e., 30%) and random_state equal to 42. Leave the other parameters at their default values.

  Columns except the last column should be the feature matrix (X_train or X_test)
  
  Last column will be the label vector.

(Common instructions for Q.2, Q.3 and Q.4)

In [None]:
# ----- Split the Dataset into Features and Target -----
# The file has no index column, and the last column is the target.
X = df_model.iloc[:, :-1]   # All columns except the last one
y = df_model.iloc[:, -1]    # The last column as target

# Split into training and test sets (70% train, 30% test) using random_state=42
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Training set samples:", X_train.shape[0])

Training set samples: 2800


Train the ridge model on the training data with the following parameters:

alpha = 10

solver = 'saga'

tol = 1e-4

random_state = 42

Enter the value of $R^2$ score on the test dataset.

In [None]:
from sklearn.linear_model import Ridge

# Train Ridge model with given parameters
ridge = Ridge(alpha=10, solver='saga', tol=1e-4, random_state=42)
ridge.fit(X_train, y_train)
r2_ridge = ridge.score(X_test, y_test)
print("R^2 score of Ridge model on test set:", r2_ridge)

R^2 score of Ridge model on test set: 0.6613547575262211
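For a regressor such as Ridge, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes that quantity by hand with NumPy. The helper name `r2_score_manual` and the toy values are illustrative assumptions, not part of the exam dataset.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical, not taken from the dataset).
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.5, 5.5, 7.0, 9.0]
r2_toy = r2_score_manual(y_true_toy, y_pred_toy)
print("Manual R^2 on toy data:", r2_toy)
```

A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.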


## Q3 What is the index of the most important feature? Note that the index starts from 0. Ignore the intercept for this question.

In [None]:
coef = ridge.coef_
most_important_index = np.argmax(np.abs(coef))
print("Index of most important feature:", most_important_index)

Index of most important feature: 9


## Q4 What is the index of the least important feature? Note that the index starts from 0. Ignore the intercept for this question.


In [None]:
least_important_index = np.argmin(np.abs(coef))
print("Index of least important feature:", least_important_index)

Index of least important feature: 0
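Both importance questions above rank features by the absolute value of their coefficient, so a large negative weight counts as important. Here is a minimal sketch of that ranking on a made-up coefficient vector (`coef_toy` is a hypothetical stand-in for `ridge.coef_`). Note that comparing magnitudes is only meaningful when the features are on comparable scales.

```python
import numpy as np

# Hypothetical coefficients; the sign is irrelevant for importance.
coef_toy = np.array([0.05, -3.2, 1.7, -0.4])

magnitudes = np.abs(coef_toy)
most_important_toy = int(np.argmax(magnitudes))    # largest |coefficient|
least_important_toy = int(np.argmin(magnitudes))   # smallest |coefficient|

print("Most important index:", most_important_toy)
print("Least important index:", least_important_toy)
```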


## Q5 (Common Instructions for Q.5 and Q.6)
Use the SGDRegressor(random_state = 42) estimator with GridSearchCV. Tune over the following hyperparameters:
  - penalty: ['l1', 'l2']
  - alpha: [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
  - tol: [1e-4, 1e-3, 1e-2, 1e-1]

Use 5-fold cross-validation (cv = 5) and set scoring to neg_mean_absolute_error.

Use the best model from this hyperparameter tuning process to answer the following questions:

What is the best penalty?

In [None]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# Define hyperparameter grid
param_grid = {
    'penalty': ['l1', 'l2'],
    'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'tol': [1e-4, 1e-3, 1e-2, 1e-1]
}

sgd = SGDRegressor(random_state=42)
grid = GridSearchCV(sgd, param_grid, cv=5, scoring='neg_mean_absolute_error')
grid.fit(X_train, y_train)

best_penalty = grid.best_params_['penalty']
print("Best penalty from GridSearchCV:", best_penalty)

Best penalty from GridSearchCV: l2
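To get a feel for the size of this search, the sketch below enumerates the grid with the standard library only. It assumes the same three lists as the question; the names `param_grid_toy`, `candidates`, and `n_cv_fits` are illustrative.

```python
from itertools import product

# Same grid as in the question, copied here so the sketch is self-contained.
param_grid_toy = {
    'penalty': ['l1', 'l2'],
    'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    'tol': [1e-4, 1e-3, 1e-2, 1e-1],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = [
    dict(zip(param_grid_toy, values))
    for values in product(*param_grid_toy.values())
]
n_candidates = len(candidates)   # 2 * 5 * 4
n_cv_fits = n_candidates * 5     # each candidate is fit once per fold

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_cv_fits)
```

With the default refit=True, GridSearchCV performs one additional fit of the best candidate on the full training set, which is what `grid.best_estimator_` exposes.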


## Q6 What is the value of the mean absolute error on the test dataset?

In [None]:
# Predict on test set and compute Mean Absolute Error
y_pred_sgd = grid.best_estimator_.predict(X_test)
mae = mean_absolute_error(y_test, y_pred_sgd)
print("Mean Absolute Error on test set for best SGDRegressor:", mae)

Mean Absolute Error on test set for best SGDRegressor: 3.8131121797994014
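The mean absolute error is the average of $|y_i - \hat{y}_i|$ over the test samples. The sketch below computes it by hand on toy values (the helper `mae_manual` and the numbers are hypothetical). It also shows the sign flip implied by the neg_mean_absolute_error scoring used during the search, where higher (closer to zero) is better.

```python
import numpy as np

def mae_manual(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy values (hypothetical, not taken from the dataset).
y_true_toy = [10.0, 12.0, 15.0]
y_pred_toy = [11.0, 10.0, 15.5]

mae_toy = mae_manual(y_true_toy, y_pred_toy)
neg_mae_toy = -mae_toy   # what a 'neg_mean_absolute_error' scorer would report

print("MAE:", mae_toy)
print("Negated MAE (scorer convention):", neg_mae_toy)
```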


## Q7 (Common Instructions for Q.7 and Q.8)
Create a pipeline with PCA() as the transformer and Lasso as the estimator.
Use GridSearchCV to tune the hyperparameters of the pipeline on the training dataset:
  - n_components for PCA: [0.9, 0.95]
  - alpha for Lasso: [10, 1, 0.01, 0.001]
  - scoring: neg_mean_absolute_error
  - 5-fold cross-validation
  - n_jobs = -1 (negative one), which uses all available CPU cores for the search

(Note: Kindly ignore the warning.)

If we fit the pipeline on the training dataset, what will be the $R^2$ score on the test dataset?

In [None]:
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Create the pipeline with PCA and Lasso
pipeline = Pipeline([
    ('pca', PCA()),
    ('lasso', Lasso(random_state=42))
])

# Define parameter grid for PCA n_components and Lasso alpha
param_grid_pipeline = {
    'pca__n_components': [0.9, 0.95],
    'lasso__alpha': [10, 1, 0.01, 0.001]
}

grid_pipeline = GridSearchCV(pipeline, param_grid_pipeline, cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
grid_pipeline.fit(X_train, y_train)

# Evaluate R^2 score on test set using the best pipeline model
r2_pipeline = grid_pipeline.best_estimator_.score(X_test, y_test)
print("R^2 score on test dataset from PCA + Lasso pipeline:", r2_pipeline)

R^2 score on test dataset from PCA + Lasso pipeline: 0.6288625430197575
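A fractional n_components such as 0.9 or 0.95 asks PCA to keep enough components to explain that share of the total variance. The sketch below illustrates the selection rule as I understand it: keep the smallest number of components whose cumulative explained-variance ratio exceeds the fraction. The helper `components_for_fraction` and the toy ratios are assumptions for illustration, not scikit-learn code.

```python
import numpy as np

def components_for_fraction(explained_variance_ratio, fraction):
    """Smallest number of leading components whose cumulative ratio exceeds `fraction`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value strictly greater than `fraction`, plus one.
    return int(np.searchsorted(cumulative, fraction, side="right")) + 1

# Hypothetical per-component variance ratios, sorted in descending order.
ratios_toy = np.array([0.5, 0.3, 0.12, 0.08])

k_90 = components_for_fraction(ratios_toy, 0.90)
k_95 = components_for_fraction(ratios_toy, 0.95)

print("Components kept for 0.90:", k_90)
print("Components kept for 0.95:", k_95)
```

On a fitted pipeline, the number actually kept can be read from the PCA step's `n_components_` attribute.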


## Q8 How much variance is explained (eigenvalue) by the first principal component?

In [None]:
best_pca = grid_pipeline.best_estimator_.named_steps['pca']
first_pc_variance = best_pca.explained_variance_[0]
print("Eigenvalue (variance) of the first principal component:", first_pc_variance)


Eigenvalue (variance) of the first principal component: 1.1635075742239045
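The entries of `explained_variance_` correspond to the largest eigenvalues of the sample covariance matrix of the input (computed with n_samples - 1 degrees of freedom). The sketch below checks that relationship on synthetic data using NumPy only; `X_toy` is randomly generated and unrelated to the exam dataset.

```python
import numpy as np

# Synthetic data with different spread along each axis (hypothetical).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.5])

# Route 1: eigenvalues of the sample covariance matrix, largest first.
cov = np.cov(X_toy, rowvar=False)            # uses n_samples - 1 in the denominator
eigenvalues = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order

# Route 2: squared singular values of the centered data, scaled by n_samples - 1.
X_centered = X_toy - X_toy.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
variance_from_svd = singular_values ** 2 / (X_toy.shape[0] - 1)

print("Eigenvalues of covariance:", eigenvalues)
print("Variances from SVD:       ", variance_from_svd)
```

The first entry of either array is the variance along the first principal component, which is the quantity Q8 asks for on the real data.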


## Q9 Create a pipeline of the PolynomialFeatures as transformer and Lasso as an estimator with the following parameters:
  - For PolynomialFeatures:
    - interaction_only = False
    - degree = 2
  - For Lasso:
    - alpha = 1
    - warm_start = True
    - random_state = 0

Fit the pipeline on the training dataset and find the $R^2$ score on the test dataset.

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly_lasso_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, interaction_only=False)),
    ('lasso', Lasso(alpha=1, warm_start=True, random_state=0))
])
poly_lasso_pipeline.fit(X_train, y_train)
r2_poly_lasso = poly_lasso_pipeline.score(X_test, y_test)
print("R^2 score on test dataset from PolynomialFeatures + Lasso pipeline:", r2_poly_lasso)

R^2 score on test dataset from PolynomialFeatures + Lasso pipeline: 0.157678032410551


## Q10 If you eliminate 1 feature with recursive feature elimination, which feature will be eliminated?

Type the index of the eliminated feature (index starts from 0).

Use LinearRegression model with default parameters as an estimator.
Use processed training data.


In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
# Set n_features_to_select to total number of features minus 1 (i.e. eliminate 1 feature)
rfe = RFE(estimator=lin_reg, n_features_to_select=X_train.shape[1] - 1)
rfe.fit(X_train, y_train)
# The eliminated feature will have a ranking higher than 1 (selected features have rank 1)
eliminated_feature_index = np.where(rfe.ranking_ != 1)[0][0]
print("Index of the eliminated feature by RFE:", eliminated_feature_index)

Index of the eliminated feature by RFE: 2
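With the default step of 1 and only one feature to remove, RFE performs a single elimination round: it fits the estimator on all features and drops the one with the smallest importance, which for a linear model is, to my understanding, the smallest squared coefficient. The sketch below mimics that single round with an ordinary least-squares fit in NumPy on synthetic data. It is an illustration under those assumptions, not the RFE implementation.

```python
import numpy as np

# Synthetic data: feature 1 has no effect on the target (hypothetical setup).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (3.0 * X_toy[:, 0] + 0.0 * X_toy[:, 1] + 2.0 * X_toy[:, 2]
         + rng.normal(scale=0.1, size=200))

# Ordinary least squares with an intercept, via an appended column of ones.
design = np.column_stack([X_toy, np.ones(len(X_toy))])
solution, *_ = np.linalg.lstsq(design, y_toy, rcond=None)
coef_toy = solution[:-1]   # drop the intercept, as in the questions above

# One elimination round: remove the feature with the smallest squared coefficient.
eliminated_toy = int(np.argmin(coef_toy ** 2))
print("Feature eliminated in one round:", eliminated_toy)
```

This can disagree with the least important feature found from the Ridge coefficients, because the two models are fit with different penalties and solvers.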
