# **Ques 1**

K-Fold Cross Validation for Multiple Linear Regression (Least Square Error Fit)

Download the dataset regarding USA House Price Prediction from the following link:

https://drive.google.com/file/d/1O_NwpJT-8xGfU_-3llUl2sgPu0xllOrX/view?usp=sharing

Load the dataset and implement 5-fold cross-validation for multiple linear regression (using a least-squares error fit).

Steps:

a) Divide the dataset into input features (all columns except price) and the output variable (price).

b) Scale the values of input features.

c) Divide input and output features into five folds.

d) Run five iterations; in each iteration, use one fold as the test set and the remaining four folds as the training set. Find the beta (𝛽) matrix, the predicted values, and the R2_score for each iteration using a least-squares error fit.

e) Use the best (𝛽) matrix (the one for which R2_score is maximum) to train the regressor on 70% of the data and test its performance on the remaining 30%.
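
The least-squares fit asked for in step (d) can be sketched on synthetic data as follows. This is a minimal illustration, not the assignment solution: the helper names `fit_least_squares` and `r2`, the array `X_demo`, and the coefficients in `true_beta` are all made up for the example, and `np.linalg.lstsq` is used here in place of the explicit pseudo-inverse formula.

```python
import numpy as np

def fit_least_squares(X, y):
    """Return beta minimizing ||X_b @ beta - y||^2 (intercept is beta[0])."""
    X_b = np.c_[np.ones(X.shape[0]), X]          # prepend a column of ones
    beta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return beta

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption: 3 features, known coefficients, small noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_beta = np.array([5.0, 2.0, -1.0, 0.5])      # intercept first
y_demo = true_beta[0] + X_demo @ true_beta[1:] + rng.normal(scale=0.1, size=200)

beta_hat = fit_least_squares(X_demo, y_demo)
y_hat = np.c_[np.ones(200), X_demo] @ beta_hat
print("estimated beta:", np.round(beta_hat, 2))
print("R2 on the fitting data:", round(r2(y_demo, y_hat), 4))
```

When `X_b.T @ X_b` is invertible, this should agree with the normal-equation solution up to floating-point error.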


In [19]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,KFold
from sklearn.metrics import r2_score

# load data
df = pd.read_csv("USA_Housing.csv")

# (a) divide dataset into input and output
X = df.drop("Price",axis=1).values
y = df["Price"].values

# (b) scale input
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# (c) define 5 folds
kf = KFold(n_splits=5,shuffle=True,random_state=42)

# lists to store outcomes for each fold
r2_scores = []
betas = []

# (d) iterations through folds
for train_idx, test_idx in kf.split(X_scaled):
    X_train, X_test = X_scaled[train_idx], X_scaled[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Add column of ones for intercept
    # np.c_ is for column-wise concatenation
    X_train = np.c_[np.ones(X_train.shape[0]), X_train]
    X_test = np.c_[np.ones(X_test.shape[0]), X_test]

    # Beta = (X^T X)^+ X^T y; pinv is the pseudo-inverse, equal to the inverse when X^T X is invertible
    beta = np.linalg.pinv(X_train.T @ X_train) @ (X_train.T) @ (y_train)

    # Predictions
    y_pred = X_test @ beta

    # R² score
    r2 = r2_score(y_test, y_pred)

    r2_scores.append(r2)
    betas.append(beta)

# Best beta
best_idx = np.argmax(r2_scores)
best_beta = betas[best_idx]
best_r2 = r2_scores[best_idx]
print("r2_scores: ",[round(val,4) for val in r2_scores])
print("Best r2_score from CV:", round(best_r2,4))
print("Best beta from CV:", np.round(best_beta,4))  # built-in round() does not work here: beta is a NumPy array

# (e) 70/30 split; best_beta from CV is reused as-is (no refit), so only the 30% test part is scored
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Add column of ones for intercept
X_train = np.c_[np.ones(X_train.shape[0]), X_train]
X_test = np.c_[np.ones(X_test.shape[0]), X_test]

# Predict with the best CV beta: y_pred = X_test @ best_beta
y_pred_final = X_test @ best_beta

# r2_score
print("Final r2_score on 30% test:", round(r2_score(y_test, y_pred_final),4))

r2_scores:  [0.918, 0.9146, 0.9116, 0.9193, 0.9244]
Best r2_score from CV: 0.9244
Best beta from CV: [1.23161736e+06 2.30225051e+05 1.63956839e+05 1.21115121e+05
 7.83467200e+02 1.50662447e+05]
Final r2_score on 30% test: 0.9147
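
The cell above standardizes the whole dataset before splitting it into folds, so each test fold contributes to the scaling statistics. One possible alternative, sketched here on synthetic data, is to wrap the scaler and the regressor in a scikit-learn pipeline so that the scaler is refit on the training folds only. This is not what the cell above does; `X_demo`, `y_demo`, and the coefficients are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: 4 unscaled features, one of them irrelevant)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(300, 4))
y_demo = 3.0 + X_demo @ np.array([1.5, -2.0, 0.7, 0.0]) + rng.normal(scale=5.0, size=300)

# The pipeline refits StandardScaler inside every training fold
model = make_pipeline(StandardScaler(), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")

print("per-fold R2:", np.round(scores, 4))
print("mean R2:", round(scores.mean(), 4))
```

For ordinary least squares with an intercept, rescaling the features does not change the predictions, so the per-fold scores should match the scale-first approach closely; the pipeline matters more once a scale-sensitive step such as regularization or PCA is added.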


# **Ques 2**

Concept of Validation set for Multiple Linear Regression (Gradient Descent Optimization)

Consider the same dataset as in Q1. Rather than dividing the dataset into five folds, divide it into a training set (56%), a validation set (14%), and a test set (30%).

Consider four different values of the learning rate, i.e. {0.001, 0.01, 0.1, 1}.
Compute the values of the regression coefficients for each learning rate after 1000 iterations. For each set of regression coefficients, compute the R2_score on the validation and test sets and find the best set of regression coefficients.
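
The 56/14/30 split can be obtained with two successive calls to `train_test_split`: first hold out 30% for testing, then take 20% of the remaining 70% for validation (0.7 × 0.2 = 0.14, leaving 0.7 × 0.8 = 0.56 for training). A small sketch on a dummy array of 1000 rows (the names `X_demo` and `y_demo` are placeholders, not the housing data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 1000 rows so the percentages are easy to read off
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000)

# Step 1: 70% (train + validation) vs 30% (test)
X_temp, X_test, y_temp, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# Step 2: 20% of the 70% becomes validation (14% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.2, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

The three parts are disjoint, so the validation set can be used to pick the learning rate while the test set stays untouched until the final check.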

In [26]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load dataset
df = pd.read_csv("USA_Housing.csv")
X = df.drop("Price", axis=1).values
y = df["Price"].values

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Add column of ones for intercept
X_scaled = np.c_[np.ones(X_scaled.shape[0]), X_scaled]

# Train/Val/Test Split (56/14/30)
# First split: Train+Val (70%) and Test (30%)
X_temp, X_test, y_temp, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
# Second split: Train (56%) and Validation (14%)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.2, random_state=42)

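# Gradient Descent Function
def gradient_descent(X, y, lr, iters):
    """Batch gradient descent on the cost (1/(2n)) * ||X @ beta - y||^2."""
    n, m = X.shape
    beta = np.zeros(m)  # start from all-zero coefficients
    for _ in range(iters):
        y_pred = X @ beta
        gradient = (1/n) * (X.T @ (y_pred - y))  # gradient of the cost w.r.t. beta
        beta -= lr * gradient
    return beta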

# Try different learning rates
learning_rates = [0.001, 0.01, 0.1, 1]
results = []

for lr in learning_rates:
    beta = gradient_descent(X_train, y_train, lr=lr, iters=1000)

    # Validation & Test Predictions
    y_val_pred = X_val @ beta
    y_test_pred = X_test @ beta

    r2_val = r2_score(y_val, y_val_pred)
    r2_test = r2_score(y_test, y_test_pred)
    results.append((lr, beta, r2_val, r2_test))

# Find best learning rate (by validation R²)
best_lr, best_beta, best_r2_val, best_r2_test = max(results, key=lambda x: x[2])

print("Results:")
for lr, beta, r2_val, r2_test in results:
    print(f"LR={lr:<6} | Val R²={r2_val:.4f} | Test R²={r2_test:.4f}")

print("\nBest Learning Rate:", best_lr)
print("Best Beta Coefficients (rounded):", np.round(best_beta, 4))
print("Best Validation R²:", round(best_r2_val, 4))
print("Corresponding Test R²:", round(best_r2_test, 4))

Results:
LR=0.001  | Val R²=-0.8125 | Test R²=-0.9914
LR=0.01   | Val R²=0.9098 | Test R²=0.9147
LR=0.1    | Val R²=0.9098 | Test R²=0.9148
LR=1      | Val R²=0.9098 | Test R²=0.9148

Best Learning Rate: 0.01
Best Beta Coefficients (rounded): [1232562.5125  230048.7666  163686.935   121406.9411    3117.4736
  150655.9746]
Best Validation R²: 0.9098
Corresponding Test R²: 0.9147
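
The negative R² for LR=0.001 in the results above is consistent with gradient descent not having converged after 1000 iterations. The sketch below illustrates this on synthetic data. It rests on stated assumptions: `run_gd`, `X_demo`, and `y_demo` are hypothetical names, and the features are centered exactly on the sample used for fitting, which holds only approximately for the training split above. Under exact centering the intercept update decouples from the other coefficients, and after k steps the intercept equals (1 - (1 - lr)^k) times the mean of y, which is roughly 63% of its final value for lr = 0.001 and k = 1000.

```python
import numpy as np
from sklearn.metrics import r2_score

def run_gd(X, y, lr, iters):
    """Batch gradient descent on (1/(2n)) * ||X @ beta - y||^2, starting at zero."""
    n, m = X.shape
    beta = np.zeros(m)
    for _ in range(iters):
        beta -= lr * (X.T @ (X @ beta - y)) / n
    return beta

# Synthetic data (assumption: 3 features, standardized on this exact sample)
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 3))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
X_demo = np.c_[np.ones(500), Z]                      # intercept column first
y_demo = 1000.0 + Z @ np.array([50.0, -20.0, 10.0]) + rng.normal(scale=5.0, size=500)

beta_slow = run_gd(X_demo, y_demo, lr=0.001, iters=1000)
beta_fast = run_gd(X_demo, y_demo, lr=0.1, iters=1000)

# Fraction of the final intercept reached with the small learning rate
frac_reached = beta_slow[0] / y_demo.mean()
print("fraction reached:", round(frac_reached, 4))
print("closed form 1 - 0.999**1000:", round(1 - 0.999 ** 1000, 4))
print("R2 with lr=0.001:", round(r2_score(y_demo, X_demo @ beta_slow), 4))
print("R2 with lr=0.1:  ", round(r2_score(y_demo, X_demo @ beta_fast), 4))
```

Because the target mean is large relative to its spread, an intercept that has covered only part of the distance shifts every prediction far from the data, which pushes R² below zero even though the coefficients are moving in the right direction.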


# **Ques 3**

Pre-processing and Multiple Linear Regression

Download the dataset regarding Car Price Prediction from the following link:

https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data

1. Load the dataset with the following column names ["symboling", "normalized_losses",
"make", "fuel_type", "aspiration","num_doors", "body_style", "drive_wheels",
"engine_location", "wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke",
"compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"] and replace all ? values with NaN.

2. Replace all NaN values using central tendency imputation. Drop the rows with NaN values in the price column.

3. There are 10 columns in the dataset with non-numeric values. Convert these values to numeric values using the following scheme:

    (i) For "num_doors" and "num_cylinders": convert words (number names) to figures, e.g., two to 2.

    (ii) For "body_style" and "drive_wheels": use the dummy encoding scheme.

    (iii) For "make", "aspiration", "engine_location", and "fuel_type": use the label encoding scheme.

    (iv) For "fuel_system": replace values containing the string pfi with 1 and all other values with 0.

    (v) For "engine_type": replace values containing the string ohc with 1 and all other values with 0.

4. Divide the dataset into input features (all columns except price) and output variable (price). Scale all input features.

5. Train a linear regressor on 70% of the data (using the built-in linear regression function in Python) and test its performance on the remaining 30%.

6. Reduce the dimensionality of the feature set using the built-in PCA decomposition, then train a linear regressor again on 70% of the reduced data (using the same built-in linear regression function). Does it lead to any performance improvement on the test set?
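
The encoding schemes in step 3 can be illustrated on a tiny hand-made table before applying them to the real dataset. The DataFrame `toy` and its values are invented for this sketch (they only mimic the column names); the calls themselves are standard pandas and scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row table mimicking four of the columns
toy = pd.DataFrame({
    "num_doors": ["two", "four", "two"],
    "body_style": ["sedan", "hatchback", "sedan"],
    "make": ["audi", "bmw", "audi"],
    "fuel_system": ["mpfi", "2bbl", "spfi"],
})

# (i) number names -> figures
toy["num_doors"] = toy["num_doors"].map({"two": 2, "four": 4})

# (ii) dummy encoding: one 0/1 column per level, first level dropped
toy = pd.get_dummies(toy, columns=["body_style"], drop_first=True, dtype=int)

# (iii) label encoding: each distinct string gets an integer code
toy["make"] = LabelEncoder().fit_transform(toy["make"])

# (iv) flag: 1 if the value contains "pfi", else 0
toy["fuel_system"] = toy["fuel_system"].str.contains("pfi").astype(int)

print(toy)
```

Label encoding imposes an arbitrary ordering on the categories (alphabetical here), which a linear model will treat as a numeric scale; dummy encoding avoids that at the cost of extra columns.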

In [39]:
# Preprocessing + Multiple Linear Regression + PCA comparison
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score

# 1) Load dataset with given column names and replace '?' with NaN
cols = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
        "num_doors", "body_style", "drive_wheels", "engine_location", "wheel_base",
        "length", "width", "height", "curb_weight", "engine_type", "num_cylinders",
        "engine_size", "fuel_system", "bore", "stroke", "compression_ratio",
        "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]

df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",names=cols, na_values='?')
print(df.head())
print("Initial shape:", df.shape)
print("Columns with missing counts:\n", df.isna().sum())

# 2) Drop rows where price is NaN (we cannot use them as targets)
df = df.dropna(subset=["price"]).reset_index(drop=True)
print("\nAfter dropping rows with missing price, shape:", df.shape)

# Convert numeric-like columns to numeric (a safeguard: with na_values='?' they are normally parsed as numeric already)
numeric_cols = ["normalized_losses", "wheel_base", "length", "width", "height",
                "curb_weight", "engine_size", "bore", "stroke", "compression_ratio",
                "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"]
for c in numeric_cols:
    df[c] = pd.to_numeric(df[c], errors='coerce')

#  Central tendency imputation:
# - numeric: fill with mean
# - categorical: fill with mode
num_cols_present = [c for c in numeric_cols if c in df.columns]
for c in num_cols_present:
    if df[c].isna().any():
        mean_val = df[c].mean()
        df[c] = df[c].fillna(mean_val)

# remaining columns (the categorical ones, plus integer symboling): fill with mode
cat_cols = [c for c in df.columns if c not in num_cols_present]
for c in cat_cols:
    if df[c].isna().any():
        mode_val = df[c].mode(dropna=True)[0]
        df[c] = df[c].fillna(mode_val)  # assign back; chained fillna(..., inplace=True) warns in pandas 2.x

print("\nMissing counts after imputation:\n", df.isna().sum())

# 3) Convert the 10 non-numeric columns as requested
# (i) num_doors and num_cylinders: map word numbers to digits
num_map = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,"seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12}
# ensure lowercase for mapping
df["num_doors"] = df["num_doors"].str.lower().map(num_map)
df["num_cylinders"] = df["num_cylinders"].str.lower().map(num_map)

# if any NaNs remain (unexpected after imputation), fill with the mode
if df["num_doors"].isna().any():
    df["num_doors"] = df["num_doors"].fillna(int(df["num_doors"].mode()[0]))
if df["num_cylinders"].isna().any():
    df["num_cylinders"] = df["num_cylinders"].fillna(int(df["num_cylinders"].mode()[0]))

# (ii) body_style, drive_wheels: dummy encoding (first level dropped); dtype=int gives 0/1 columns instead of bool
df = pd.get_dummies(df, columns=["body_style", "drive_wheels"], drop_first=True, dtype=int)

# (iii) make, aspiration, engine_location, fuel_type: label encoding
label_cols = ["make", "aspiration", "engine_location", "fuel_type"]
le = LabelEncoder()
for c in label_cols:
    df[c] = le.fit_transform(df[c].astype(str))

# (iv) fuel_system: 1 if contains 'pfi' else 0
df["fuel_system"] = df["fuel_system"].str.contains("pfi", case=False, na=False).astype(int)

# (v) engine_type: 1 if contains 'ohc' else 0
df["engine_type"] = df["engine_type"].str.contains("ohc", case=False, na=False).astype(int)

# Confirm all columns numeric
non_numeric = df.select_dtypes(include=["object"]).columns.tolist()
print("\nNon-numeric columns remaining (should be empty):", non_numeric)

# 4) Features / target separation
X = df.drop("price", axis=1).values
y = df["price"].values.astype(float)

# IMPORTANT: Split before scaling to avoid data leakage (fit scaler on train only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features (fit on train only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# 5) Train a linear regressor on 70% train and test on 30% using sklearn
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_test = lr.predict(X_test_scaled)
r2_no_pca = r2_score(y_test, y_pred_test)
print(f"\nLinear Regression without PCA: Test R² = {r2_no_pca:.4f}")

# 6) PCA: reduce dimensionality, retrain, compare
# Fit PCA on training scaled features; keep 95% variance
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca  = pca.transform(X_test_scaled)

print("\nPCA components kept:", X_train_pca.shape[1], " (explained variance ratio sum =",
      round(pca.explained_variance_ratio_.sum(), 4), ")")

# Train linear regression on PCA features
lr_pca = LinearRegression()
lr_pca.fit(X_train_pca, y_train)
y_pred_test_pca = lr_pca.predict(X_test_pca)
r2_with_pca = r2_score(y_test, y_pred_test_pca)
print(f"Linear Regression with PCA (95% variance): Test R² = {r2_with_pca:.4f}")

# Conclusion
print("\nConclusion:")
if r2_with_pca > r2_no_pca:
    print(f"  PCA improved test R²: {r2_no_pca:.4f} -> {r2_with_pca:.4f}")
elif r2_with_pca < r2_no_pca:
    print(f"  PCA reduced test R²: {r2_no_pca:.4f} -> {r2_with_pca:.4f}")
else:
    print(f"  No change in test R²: {r2_no_pca:.4f}")



   symboling  normalized_losses         make fuel_type aspiration num_doors  \
0          3                NaN  alfa-romero       gas        std       two   
1          3                NaN  alfa-romero       gas        std       two   
2          1                NaN  alfa-romero       gas        std       two   
3          2              164.0         audi       gas        std      four   
4          2              164.0         audi       gas        std      four   

    body_style drive_wheels engine_location  wheel_base  ...  engine_size  \
0  convertible          rwd           front        88.6  ...          130   
1  convertible          rwd           front        88.6  ...          130   
2    hatchback          rwd           front        94.5  ...          152   
3        sedan          fwd           front        99.8  ...          109   
4        sedan          4wd           front        99.4  ...          136   

   fuel_system  bore  stroke compression_ratio horsepower  pea