1 Q1: K-Fold Cross Validation for Multiple Linear Regression (Least Square Error Fit)
Download the dataset regarding USA House Price Prediction from the following link:

https://drive.google.com/file/d/1O_NwpJT-8xGfU_-3llUl2sgPu0xllOrX/view?usp=sharing

Load the dataset and implement 5-fold cross-validation for multiple linear regression
(using least square error fit).
Steps:
a) Divide the dataset into input features (all columns except price) and output variable
(price).

b) Scale the values of input features.

c) Divide input and output features into five folds.

d) Run five iterations; in each iteration, consider one fold as the test set and the
remaining four folds as the training set. Find the beta (𝛽) matrix, predicted values,
and R2_score for each iteration using least square error fit.

e) Use the best value of the (𝛽) matrix (the one for which R2_score is maximum) to train
the regressor on 70% of the data and test the performance on the remaining 30% of the data.
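
For reference, the least-squares fit in step (d) and the R2 score can be written as follows. The notation is introduced here for illustration and is not taken from the assignment: X is assumed to be the training feature matrix with a leading column of ones, y the training prices, and the sums in the R2 formula run over the held-out fold, with the bar denoting the mean of that fold's prices.

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y,
\qquad
\hat{y}_{\text{test}} = X_{\text{test}} \hat{\beta},
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```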

In [None]:
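import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset and separate input features (X) from the output variable (y)
df = pd.read_csv('/content/USA_Housing.csv')
X = df.drop('Price', axis=1)
y = df['Price']

# Scale the input features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert scaled features back to a DataFrame for easier handling
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Display the first 5 rows of scaled features
display(X_scaled_df.head())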

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,1.02866,-0.296927,0.021274,0.088062,-1.317599
1,1.000808,0.025902,-0.255506,-0.722301,0.403999
2,-0.684629,-0.112303,1.516243,0.93084,0.07241
3,-0.491499,1.221572,-1.393077,-0.58454,-0.186734
4,-0.807073,-0.944834,0.846742,0.201513,-0.988387
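
As a rough illustration of what the scaling step does, the sketch below standardizes a small made-up array with plain NumPy. It assumes the default `StandardScaler` behavior (subtract the column mean, divide by the population standard deviation, i.e. `ddof=0`); the array values and the `X_demo` names are invented for the example and are not from the dataset.

```python
import numpy as np

# Made-up feature matrix (3 samples, 2 features); not taken from the housing data
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Column-wise z-score: (x - mean) / std, using the population std (ddof=0)
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0, ddof=0)
X_demo_scaled = (X_demo - col_mean) / col_std

print(X_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_scaled.std(axis=0))   # approximately 1 for each column
```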


In [None]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

# Add a column of ones for the intercept term
X_scaled_with_intercept = X_scaled_df.copy()
X_scaled_with_intercept.insert(0, 'intercept', 1)

# Convert to numpy arrays for matrix operations
X_scaled_np = X_scaled_with_intercept.values
y_np = y.values.reshape(-1, 1)

# Define the number of folds
n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# Lists to store results
beta_matrices = []
predicted_values = []
r2_scores = []

# Perform K-Fold Cross Validation
for fold, (train_index, test_index) in enumerate(kf.split(X_scaled_np)):
    print(f"Fold {fold+1}/{n_splits}")

    X_train, X_test = X_scaled_np[train_index], X_scaled_np[test_index]
    y_train, y_test = y_np[train_index], y_np[test_index]

    # Calculate beta (least squares estimate)
    # beta = (X_train_T * X_train)^-1 * X_train_T * y_train
    X_train_T = X_train.T
    beta = np.linalg.inv(X_train_T @ X_train) @ X_train_T @ y_train

    # Predict values
    y_pred = X_test @ beta

    # Calculate R2 score
    r2 = r2_score(y_test, y_pred)

    # Store results
    beta_matrices.append(beta)
    predicted_values.append(y_pred)
    r2_scores.append(r2)

    print(f"Beta Matrix:\n{beta}")
    print(f"R2 Score: {r2}\n")

# Display the R2 scores for each fold
print("R2 Scores for each fold:")
for fold, r2 in enumerate(r2_scores):
    print(f"Fold {fold+1}: {r2}")

Fold 1/5
Beta Matrix:
[[1232002.6748241 ]
 [ 230745.94073479]
 [ 163243.27314515]
 [ 120309.77397759]
 [   3011.45976111]
 [ 151552.63069359]]
R2 Score: 0.9179971706985147

Fold 2/5
Beta Matrix:
[[1232037.85755946]
 [ 229081.97914235]
 [ 165882.1605634 ]
 [ 121536.57475055]
 [   2092.4478622 ]
 [ 150874.99274586]]
R2 Score: 0.9145677884802818

Fold 3/5
Beta Matrix:
[[1231951.92563846]
 [ 230224.50511001]
 [ 162766.17455493]
 [ 121022.77324577]
 [   1247.16258975]
 [ 150234.77720419]]
R2 Score: 0.9116116385364478

Fold 4/5
Beta Matrix:
[[1232751.46486511]
 [ 229500.10043209]
 [ 165212.07110924]
 [ 122839.9376815 ]
 [   3063.71699324]
 [ 150917.88484984]]
R2 Score: 0.9193091764960816

Fold 5/5
Beta Matrix:
[[1.23161736e+06]
 [2.30225051e+05]
 [1.63956839e+05]
 [1.21115120e+05]
 [7.83467170e+02]
 [1.50662447e+05]]
R2 Score: 0.9243869413350316

R2 Scores for each fold:
Fold 1: 0.9179971706985147
Fold 2: 0.9145677884802818
Fold 3: 0.9116116385364478
Fold 4: 0.9193091764960816
Fold 5: 0.9243869413350316
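
Step (e) is not carried out in the cells above. A minimal sketch of one way to read it follows, run on synthetic data so that it is self-contained: pick the fold whose beta gave the highest R2, then score that beta on the 30% part of a 70/30 split. The synthetic data, the helpers `fit_least_squares` and `r2`, and the choice to evaluate the selected beta directly (rather than refitting on the 70%) are assumptions for illustration, not the assignment's prescribed method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled housing data (assumption, not the real dataset)
n, p = 500, 3
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + features
true_beta = np.array([[5.0], [2.0], [-1.0], [0.5]])
y_demo = X_demo @ true_beta + rng.normal(scale=0.1, size=(n, 1))


def fit_least_squares(X, y):
    """Hypothetical helper: least-squares beta via np.linalg.lstsq."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def r2(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Steps (c)-(d): 5-fold cross-validation, keeping beta and R2 per fold
folds = np.array_split(rng.permutation(n), 5)
betas, scores = [], []
for k in range(5):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    beta_k = fit_least_squares(X_demo[train_idx], y_demo[train_idx])
    betas.append(beta_k)
    scores.append(r2(y_demo[test_idx], X_demo[test_idx] @ beta_k))

# Step (e): take the beta with the highest fold R2 and score it on a 30% hold-out
best_beta = betas[int(np.argmax(scores))]
perm = rng.permutation(n)
holdout_idx = perm[int(0.7 * n):]
holdout_r2 = r2(y_demo[holdout_idx], X_demo[holdout_idx] @ best_beta)
print(f"Best fold R2: {max(scores):.4f}, hold-out R2: {holdout_r2:.4f}")
```

Because the selected beta was fitted on four of the five folds, part of the 30% hold-out overlaps its training rows, so the hold-out score in this sketch is likely to be somewhat optimistic.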

Concept of Validation Set for Multiple Linear Regression (Gradient Descent Optimization)

Use the same dataset as in Q1. Rather than dividing the dataset into five folds, divide it
into a training set (56%), a validation set (14%), and a test set (30%).

Consider four different values of the learning rate, i.e. {0.001, 0.01, 0.1, 1}. Compute the
regression coefficients for each learning rate after 1000 iterations.

For each set of regression coefficients, compute the R2_score on the validation and test sets
and find the best set of regression coefficients.
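
For reference, a batch gradient descent update for the squared-error cost can be written as below. The notation (m training rows, learning rate α, iteration index t, cost J) is introduced here for illustration; the 1/(2m) scaling of the cost is an assumption, chosen so that the gradient carries a 1/m factor.

```latex
J(\beta) = \frac{1}{2m} \lVert X\beta - y \rVert^2,
\qquad
\nabla J(\beta) = \frac{1}{m} X^{\top} (X\beta - y),
\qquad
\beta^{(t+1)} = \beta^{(t)} - \alpha \, \nabla J\bigl(\beta^{(t)}\bigr)
```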

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/USA_Housing.csv')

# Separate input features (X) and output variable (y)
X = df.drop('Price', axis=1)
y = df['Price']

# Display the first few rows of X and y to verify
display(X.head())
display(y.head())

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,79545.45857,5.682861,7.009188,4.09,23086.8005
1,79248.64245,6.0029,6.730821,3.09,40173.07217
2,61287.06718,5.86589,8.512727,5.13,36882.1594
3,63345.24005,7.188236,5.586729,3.26,34310.24283
4,59982.19723,5.040555,7.839388,4.23,26354.10947


Unnamed: 0,Price
0,1059034.0
1,1505891.0
2,1058988.0
3,1260617.0
4,630943.5


In [None]:
from sklearn.preprocessing import StandardScaler

# Instantiate a StandardScaler object
scaler = StandardScaler()

# Fit the scaler to the input features X and transform them
X_scaled = scaler.fit_transform(X)

# Convert the scaled features X_scaled into a pandas DataFrame
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Display the first few rows of the X_scaled_df DataFrame
display(X_scaled_df.head())

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,1.02866,-0.296927,0.021274,0.088062,-1.317599
1,1.000808,0.025902,-0.255506,-0.722301,0.403999
2,-0.684629,-0.112303,1.516243,0.93084,0.07241
3,-0.491499,1.221572,-1.393077,-0.58454,-0.186734
4,-0.807073,-0.944834,0.846742,0.201513,-0.988387


**Reasoning**:
Divide the scaled dataset into training (56%), validation (14%), and test (30%) sets and verify the shapes.



In [None]:
from sklearn.model_selection import train_test_split

# Split the data into training (56%) and a temporary set (44%)
X_train, X_temp, y_train, y_temp = train_test_split(X_scaled_df, y, test_size=0.44, random_state=42)

# Split the temporary set into validation (14% of total, ~31.8% of temp)
# and test (30% of total, ~68.2% of temp).
# test_size is the test share of the temporary set: 0.30 / 0.44 ≈ 0.68182
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=(0.30/0.44), random_state=42)

# Verify the shapes of the resulting sets
print("Training set shape (X, y):", X_train.shape, y_train.shape)
print("Validation set shape (X, y):", X_val.shape, y_val.shape)
print("Test set shape (X, y):", X_test.shape, y_test.shape)

Training set shape (X, y): (2800, 5) (2800,)
Validation set shape (X, y): (700, 5) (700,)
Test set shape (X, y): (1500, 5) (1500,)
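
The split proportions can be checked with a little arithmetic. The sketch below shows an alternative, assumed approach rather than what the cell above does: it derives integer set sizes from the row count (5000, as implied by the printed shapes) and slices a shuffled index array, which avoids relying on how `0.30/0.44` is rounded. The variable names and the seed are hypothetical.

```python
import numpy as np

n_rows = 5000  # row count implied by the printed shapes (2800 + 700 + 1500)

# Integer sizes for the 30% test and 14% validation sets; training takes the rest
n_test = round(0.30 * n_rows)
n_val = round(0.14 * n_rows)
n_train = n_rows - n_test - n_val

# Shuffle the row indices once, then slice them into three disjoint sets
rng = np.random.default_rng(42)
shuffled = rng.permutation(n_rows)
train_idx = shuffled[:n_train]
val_idx = shuffled[n_train:n_train + n_val]
test_idx = shuffled[n_train + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 2800 700 1500
```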


In [None]:
import numpy as np

def gradient_descent(X_train, y_train, learning_rate, n_iterations):
    """
    Performs gradient descent to find the regression coefficients (beta).

    Args:
        X_train (np.ndarray): Training features with intercept term.
        y_train (np.ndarray): Training target variable.
        learning_rate (float): The step size for updating beta.
        n_iterations (int): The number of iterations to run gradient descent.

    Returns:
        np.ndarray: The calculated beta coefficients after n_iterations.
    """
    # Initialize beta as a column vector of zeros
    # The shape should be (number of features + 1, 1) for the intercept
    beta = np.zeros((X_train.shape[1], 1))

    # Number of training examples
    m = len(y_train)

    for i in range(n_iterations):
        # Calculate predicted values
        y_pred = X_train @ beta

        # Calculate the error
        error = y_pred - y_train.reshape(-1, 1) # Reshape y_train to a column vector

        # Calculate the gradient
        gradient = (1/m) * X_train.T @ error

        # Update beta
        beta = beta - learning_rate * gradient

    return beta

# Add intercept to the scaled training features
X_train_with_intercept = X_train.copy()
X_train_with_intercept.insert(0, 'intercept', 1)
X_train_np = X_train_with_intercept.values # Convert to numpy array

# Convert y_train to numpy array
y_train_np = y_train.values

In [None]:
from sklearn.metrics import r2_score

# Define the learning rates to experiment with
learning_rates = [0.001, 0.01, 0.1, 1]

# Create empty lists to store results
beta_coefficients = []
validation_r2_scores = []

# Add an intercept term to the scaled validation features
X_val_with_intercept = X_val.copy()
X_val_with_intercept.insert(0, 'intercept', 1)

# Convert validation features and target to numpy arrays
X_val_np = X_val_with_intercept.values
y_val_np = y_val.values.reshape(-1, 1)

# Iterate through each learning rate
for lr in learning_rates:
    print(f"Training with learning rate: {lr}")

    # Train the model using gradient descent
    beta = gradient_descent(X_train_np, y_train_np, lr, n_iterations=1000)

    # Store the beta coefficients
    beta_coefficients.append(beta)

    # Predict values on the validation set
    y_val_pred = X_val_np @ beta

    # Calculate R2 score on the validation set
    r2 = r2_score(y_val_np, y_val_pred)

    # Store the R2 score
    validation_r2_scores.append(r2)

    print(f"Validation R2 Score: {r2}\n")

# Display the beta coefficients and R2 scores for each learning rate
print("Results for each learning rate:")
for i, lr in enumerate(learning_rates):
    print(f"Learning Rate: {lr}")
    print(f"Beta Coefficients:\n{beta_coefficients[i]}")
    print(f"Validation R2 Score: {validation_r2_scores[i]}\n")

Training with learning rate: 0.001
Validation R2 Score: -1.0427758764418797

Training with learning rate: 0.01
Validation R2 Score: 0.9199425193890369

Training with learning rate: 0.1
Validation R2 Score: 0.9199649194854793

Training with learning rate: 1
Validation R2 Score: 0.9199649194854793

Results for each learning rate:
Learning Rate: 0.001
Beta Coefficients:
[[776416.37536817]
 [139193.78459504]
 [105659.67290638]
 [ 63395.00837487]
 [ 23342.20138111]
 [ 89244.77187829]]
Validation R2 Score: -1.0427758764418797

Learning Rate: 0.01
Beta Coefficients:
[[1232115.94476263]
 [ 230612.08499469]
 [ 165302.14248619]
 [ 119674.76300683]
 [   3324.2502332 ]
 [ 151370.72079493]]
Validation R2 Score: 0.9199425193890369

Learning Rate: 0.1
Beta Coefficients:
[[1232180.27200919]
 [ 230645.88389435]
 [ 165328.94019375]
 [ 120045.00851908]
 [   2945.02108903]
 [ 151375.22971285]]
Validation R2 Score: 0.9199649194854793

Learning Rate: 1
Beta Coefficients:
[[1232180.27200919]
 [ 230645.883894
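
One possible reading of the negative validation R2 at learning rate 0.001 is that 1000 iterations are too few for such a small step, so the coefficients are still well short of their converged values. The sketch below illustrates this on an intercept-only model, where each update moves beta a fraction `lr` of the remaining distance to the target mean, so after t steps beta equals mean(y) * (1 - (1 - lr)^t). The target values are made up, and the intercept-only simplification is an assumption that only approximates the full model.

```python
import numpy as np

# Made-up targets; only their mean matters for an intercept-only model
y_demo = np.array([900_000.0, 1_200_000.0, 1_500_000.0])
y_mean = y_demo.mean()
n_iterations = 1000

ratios = {}  # learning rate -> fraction of the converged intercept reached
for lr in [0.001, 0.01, 0.1, 1]:
    beta0 = 0.0
    for _ in range(n_iterations):
        # Gradient of (1/2m) * sum((beta0 - y)^2) with respect to beta0
        gradient = np.mean(beta0 - y_demo)
        beta0 -= lr * gradient
    ratios[lr] = beta0 / y_mean
    closed_form = 1 - (1 - lr) ** n_iterations
    print(f"lr={lr}: beta0/mean(y) = {ratios[lr]:.4f}, closed form = {closed_form:.4f}")
```

With lr = 0.001 the ratio comes out near 0.63, which is roughly in line with the intercept of about 776,416 reported above against about 1,232,180 for the larger learning rates.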

In [None]:

import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset and separate input features (X) from the output variable (y)
df = pd.read_csv('/content/USA_Housing.csv')
X = df.drop('Price', axis=1)
y = df['Price']

display(X.head())
display(y.head())

# Scale the input features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
display(X_scaled_df.head())

# Split into training (56%), validation (14%), and test (30%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X_scaled_df, y, test_size=0.44, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=(0.30/0.44), random_state=42)

print("Training set shape (X, y):", X_train.shape, y_train.shape)
print("Validation set shape (X, y):", X_val.shape, y_val.shape)
print("Test set shape (X, y):", X_test.shape, y_test.shape)

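def gradient_descent(X_train, y_train, learning_rate, n_iterations):
    """Return the regression coefficients (beta) after n_iterations of gradient descent."""
    # Start from a zero column vector, one entry per column of X_train
    beta = np.zeros((X_train.shape[1], 1))
    m = len(y_train)  # number of training examples

    for i in range(n_iterations):
        y_pred = X_train @ beta
        error = y_pred - y_train.reshape(-1, 1)
        gradient = (1/m) * X_train.T @ error
        beta = beta - learning_rate * gradient

    return beta


# Add the intercept column to the scaled training features and convert to numpy arrays
X_train_with_intercept = X_train.copy()
X_train_with_intercept.insert(0, 'intercept', 1)
X_train_np = X_train_with_intercept.values
y_train_np = y_train.values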

# Learning rates to compare
learning_rates = [0.001, 0.01, 0.1, 1]

# Regression coefficients and validation R2 score for each learning rate
beta_coefficients = []
validation_r2_scores = []

# Add the intercept column to the scaled validation features and convert to numpy arrays
X_val_with_intercept = X_val.copy()
X_val_with_intercept.insert(0, 'intercept', 1)
X_val_np = X_val_with_intercept.values
y_val_np = y_val.values.reshape(-1, 1)

for lr in learning_rates:
    print(f"Training with learning rate: {lr}")

    # Train with gradient descent for 1000 iterations
    beta = gradient_descent(X_train_np, y_train_np, lr, n_iterations=1000)
    beta_coefficients.append(beta)

    # Score the coefficients on the validation set
    y_val_pred = X_val_np @ beta
    r2 = r2_score(y_val_np, y_val_pred)
    validation_r2_scores.append(r2)

    print(f"Validation R2 Score: {r2}\n")

# Summarize the coefficients and validation R2 score for each learning rate
print("Results for each learning rate:")
for i, lr in enumerate(learning_rates):
    print(f"Learning Rate: {lr}")
    print(f"Beta Coefficients:\n{beta_coefficients[i]}")
    print(f"Validation R2 Score: {validation_r2_scores[i]}\n")

# Select the coefficients with the highest validation R2 score
best_r2_index = np.argmax(validation_r2_scores)
best_beta = beta_coefficients[best_r2_index]

print(f"Best Validation R2 Score: {validation_r2_scores[best_r2_index]}")
print(f"Learning Rate for Best R2 Score: {learning_rates[best_r2_index]}")
print(f"Best Beta Coefficients:\n{best_beta}")

# Add the intercept column to the scaled test features and convert to numpy arrays
X_test_with_intercept = X_test.copy()
X_test_with_intercept.insert(0, 'intercept', 1)
X_test_np = X_test_with_intercept.values
y_test_np = y_test.values.reshape(-1, 1)

# Evaluate the selected coefficients on the held-out test set
y_test_pred = X_test_np @ best_beta
test_r2_score = r2_score(y_test_np, y_test_pred)

print(f"Test R2 Score with the best beta coefficients: {test_r2_score}")


Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,79545.45857,5.682861,7.009188,4.09,23086.8005
1,79248.64245,6.0029,6.730821,3.09,40173.07217
2,61287.06718,5.86589,8.512727,5.13,36882.1594
3,63345.24005,7.188236,5.586729,3.26,34310.24283
4,59982.19723,5.040555,7.839388,4.23,26354.10947


Unnamed: 0,Price
0,1059034.0
1,1505891.0
2,1058988.0
3,1260617.0
4,630943.5


Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
0,1.02866,-0.296927,0.021274,0.088062,-1.317599
1,1.000808,0.025902,-0.255506,-0.722301,0.403999
2,-0.684629,-0.112303,1.516243,0.93084,0.07241
3,-0.491499,1.221572,-1.393077,-0.58454,-0.186734
4,-0.807073,-0.944834,0.846742,0.201513,-0.988387


Training set shape (X, y): (2800, 5) (2800,)
Validation set shape (X, y): (700, 5) (700,)
Test set shape (X, y): (1500, 5) (1500,)
Training with learning rate: 0.001
Validation R2 Score: -1.0427758764418797

Training with learning rate: 0.01
Validation R2 Score: 0.9199425193890369

Training with learning rate: 0.1
Validation R2 Score: 0.9199649194854793

Training with learning rate: 1
Validation R2 Score: 0.9199649194854793

Results for each learning rate:
Learning Rate: 0.001
Beta Coefficients:
[[776416.37536817]
 [139193.78459504]
 [105659.67290638]
 [ 63395.00837487]
 [ 23342.20138111]
 [ 89244.77187829]]
Validation R2 Score: -1.0427758764418797

Learning Rate: 0.01
Beta Coefficients:
[[1232115.94476263]
 [ 230612.08499469]
 [ 165302.14248619]
 [ 119674.76300683]
 [   3324.2502332 ]
 [ 151370.72079493]]
Validation R2 Score: 0.9199425193890369

Learning Rate: 0.1
Beta Coefficients:
[[1232180.27200919]
 [ 230645.88389435]
 [ 165328.94019375]
 [ 120045.00851908]
 [   2945.02108903]
 [ 