
Load the dataset and implement 5-fold cross-validation for multiple linear regression
(using a least-squares fit).
Steps:
a) Divide the dataset into input features (all columns except price) and the output variable
(price).
b) Scale the values of the input features.
c) Divide the input features and the output variable into five folds.
d) Run five iterations; in each iteration use one fold as the test set and the remaining
four folds as the training set. Find the beta (𝛽) matrix, the predicted values, and the R2_score
for each iteration using the least-squares fit.
e) Use the best (𝛽) matrix (the one for which R2_score is maximum) to train the
regressor on 70% of the data and test its performance on the remaining 30%.
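
For reference, the quantities named in step (d) can be written out as follows. This is a sketch that assumes X carries a leading column of ones for the intercept and that ȳ is the mean of the targets in the fold being scored, which is how `r2_score` defines it.

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y, \qquad
\hat{y} = X_{\text{test}}\,\hat{\beta}, \qquad
R^{2} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^{2}}{\sum_{i} (y_i - \bar{y})^{2}}
```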

In [67]:
# 1. Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 2. Load the dataset
df = pd.read_csv('house_prices.csv')  # Make sure this matches your filename
print("Dataset shape:", df.shape)
print("Columns:", df.columns.tolist())

# 3. Features (X) and target (y)
X = df.drop(columns=['Price'])
y = df['Price'].values.reshape(-1, 1)

# 4. Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

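# 5. Prepare 5 folds of shuffled row indices
# (the shuffle is unseeded, so fold membership and the fold R² values change between runs;
# call np.random.seed(...) first if reproducible folds are needed)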
n = X_scaled.shape[0]
indices = np.arange(n)
np.random.shuffle(indices)
folds = np.array_split(indices, 5)

best_r2 = -np.inf
best_beta = None

# 6. 5-fold cross-validation
for i in range(5):
    test_idx = folds[i]
    train_idx = np.hstack([f for j, f in enumerate(folds) if j != i])

    X_train, X_test = X_scaled[train_idx], X_scaled[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Add intercept columns
    X_tr = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
    X_te = np.hstack([np.ones((X_test.shape[0], 1)), X_test])

    # Compute beta by solving the normal equations (XᵀX) β = Xᵀy;
    # np.linalg.solve avoids forming the explicit inverse
    beta = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y_train)
    y_pred = X_te @ beta
    r2 = r2_score(y_test, y_pred)
    print(f"Fold {i+1} R²: {r2:.4f}")

    if r2 > best_r2:
        best_r2 = r2
        best_beta = beta.copy()

print(f"\nBest R² from CV folds: {best_r2:.4f}")

# 7. Evaluate the best CV beta on a 70/30 split.
# Note: best_beta is reused as-is rather than refitted on the 70% portion, and the
# 30% test rows overlap the folds it was fitted on, so this R² is likely optimistic.
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)
X_te_full = np.hstack([np.ones((X_test_full.shape[0], 1)), X_test_full])

y_pred_full = X_te_full @ best_beta
r2_full = r2_score(y_test_full, y_pred_full)
print(f"70/30 split R² using best beta: {r2_full:.4f}")


Dataset shape: (5000, 6)
Columns: ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population', 'Price']
Fold 1 R²: 0.9123
Fold 2 R²: 0.9164
Fold 3 R²: 0.9196
Fold 4 R²: 0.9200
Fold 5 R²: 0.9199

Best R² from CV folds: 0.9200
70/30 split R² using best beta: 0.9148
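
As a cross-check on the manual loop, the same 5-fold procedure can be sketched with scikit-learn's own `KFold` and `LinearRegression`. The data here is synthetic (the names `X_demo`, `y_demo`, and `true_beta` are made up for illustration, not taken from `house_prices.csv`). Unlike the cell above, the scaler is fitted inside each training fold, which is one common way to keep test-fold statistics out of the preprocessing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (assumption: 5 numeric features, linear target).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 0.5])
y_demo = 10.0 + X_demo @ true_beta + rng.normal(scale=0.5, size=500)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    # Scaler and regressor are both fitted on the training fold only.
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_demo[train_idx], y_demo[train_idx])
    score = r2_score(y_demo[test_idx], model.predict(X_demo[test_idx]))
    fold_scores.append(score)
    print(f"Fold {fold} R²: {score:.4f}")

print(f"Mean R²: {np.mean(fold_scores):.4f}")
```

Reporting the mean across folds, rather than only the best fold, gives a less optimistic summary of how the model generalizes.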


2 Concept of a Validation Set for Multiple Linear Regression (Gradient Descent
Optimization)
Using the same dataset as Q1, rather than dividing the dataset into five folds, divide it
into a training set (56%), a validation set (14%), and a test set (30%).
Consider four learning rates: {0.001, 0.01, 0.1, 1}. Compute the
regression coefficients for each learning rate after 1000 iterations.
For each set of regression coefficients, compute the R2_score on the validation and test sets and find
the best set of regression coefficients.
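
One way to hit the 56/14/30 proportions exactly is to pass absolute row counts to `train_test_split` in two stages. The sketch below uses placeholder arrays (`X_demo`, `y_demo` are hypothetical, sized to match the 5000-row dataset) purely to show the resulting set sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (5000).
n_rows = 5000
X_demo = np.arange(n_rows * 2, dtype=float).reshape(n_rows, 2)
y_demo = np.arange(n_rows, dtype=float).reshape(-1, 1)

# Target sizes as absolute row counts, so rounding cannot shift the proportions.
n_test = round(0.30 * n_rows)
n_val = round(0.14 * n_rows)

# Stage 1: hold out the test set. Stage 2: carve the validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=n_test, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 2800 700 1500
```

Splitting off the test set first also means the test rows are fixed before any decision about the validation size is made.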

In [68]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv('house_prices.csv')
print("Dataset shape:", df.shape)
print("Columns:", df.columns.tolist())

# Define features (X) and target (y)
X = df.drop(columns=['Price'])
y = df['Price'].values.reshape(-1, 1)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

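# Split data into training (56%), validation (~14%), and test (~30%) sets:
# hold out 44% first, then send 68% of that hold-out to the test set
# (0.44 * 0.68 ≈ 29.9% test, 0.44 * 0.32 ≈ 14.1% validation)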
X_train, X_temp, y_train, y_temp = train_test_split(X_scaled, y, test_size=0.44, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.68, random_state=42)

# Add intercept term (bias) to the feature matrices
X_train = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
X_val = np.hstack([np.ones((X_val.shape[0], 1)), X_val])
X_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])

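# Batch gradient descent on the mean squared error J(β) = (1/m) ||Xβ - y||²
def gradient_descent(X, y, learning_rate, iterations):
    m = X.shape[0]           # number of training rows
    n = X.shape[1]           # number of coefficients (including the intercept)
    beta = np.zeros((n, 1))  # start from all-zero coefficients
    for _ in range(iterations):
        # Gradient of J with respect to β: (2/m) Xᵀ(Xβ - y)
        gradient = (2/m) * X.T @ (X @ beta - y)
        beta -= learning_rate * gradient
    return beta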

# Initialize variables to track the best model
best_r2_val = -np.inf
best_r2_test = -np.inf
best_beta = None
best_lr = None

# Test different learning rates
learning_rates = [0.001, 0.01, 0.1, 1]
iterations = 1000

for lr in learning_rates:
    print(f"Training with learning rate: {lr}")
    beta = gradient_descent(X_train, y_train, lr, iterations)

    # Predict on validation set
    y_val_pred = X_val @ beta
    r2_val = r2_score(y_val, y_val_pred)

    # Predict on test set
    y_test_pred = X_test @ beta
    r2_test = r2_score(y_test, y_test_pred)

    print(f"Validation R²: {r2_val:.4f}, Test R²: {r2_test:.4f}")

    # Select the model on validation R² only; the test R² is recorded for reporting
    if r2_val > best_r2_val:
        best_r2_val = r2_val
        best_r2_test = r2_test
        best_beta = beta
        best_lr = lr

# Output the best model details
print("\nBest Model:")
print(f"Learning Rate: {best_lr}")
print(f"Validation R²: {best_r2_val:.4f}")
print(f"Test R²: {best_r2_test:.4f}")
print("Best Beta Coefficients:")
print(best_beta)


Dataset shape: (5000, 6)
Columns: ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Avg. Area Number of Bedrooms', 'Area Population', 'Price']
Training with learning rate: 0.001
Validation R²: 0.6467, Test R²: 0.6531
Training with learning rate: 0.01
Validation R²: 0.9202, Test R²: 0.9133
Training with learning rate: 0.1
Validation R²: 0.9202, Test R²: 0.9133
Training with learning rate: 1
Validation R²: -inf, Test R²: -inf

Best Model:
Learning Rate: 0.1
Validation R²: 0.9202
Test R²: 0.9133
Best Beta Coefficients:
[[1232180.27200919]
 [ 230645.88389435]
 [ 165328.94019375]
 [ 120045.00851908]
 [   2945.02108903]
 [ 151375.22971285]]
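
The -inf scores at learning rate 1 are consistent with a standard property of gradient descent on a quadratic loss: the iteration converges only when the step size is below 2 / λ_max, where λ_max is the largest eigenvalue of the Hessian (2/m) XᵀX. The sketch below illustrates this on synthetic data. The arrays `X_demo` and `y_demo` and the coefficients in `true_beta` are assumptions made up for the demonstration, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000

# Hypothetical stand-in for the scaled features: five standardized columns,
# two of them correlated, plus an intercept column of ones.
Z = rng.normal(size=(m, 5))
Z[:, 3] = 0.7 * Z[:, 2] + 0.3 * Z[:, 3]
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
X_demo = np.hstack([np.ones((m, 1)), Z])
true_beta = np.array([[5.0], [3.0], [-2.0], [1.5], [1.0], [0.5]])
y_demo = X_demo @ true_beta + rng.normal(scale=0.1, size=(m, 1))

# Hessian of J(beta) = (1/m) * ||X beta - y||^2 is H = (2/m) X^T X.
H = (2.0 / m) * X_demo.T @ X_demo
lam_max = np.linalg.eigvalsh(H).max()
lr_limit = 2.0 / lam_max  # step sizes at or above this value do not converge
print(f"largest Hessian eigenvalue: {lam_max:.3f}, step-size limit: {lr_limit:.3f}")

def gradient_descent(X, y, learning_rate, iterations):
    # Same update rule as the notebook cell: beta <- beta - lr * (2/m) X^T (X beta - y)
    beta = np.zeros((X.shape[1], 1))
    for _ in range(iterations):
        beta -= learning_rate * (2.0 / X.shape[0]) * X.T @ (X @ beta - y)
    return beta

final_mse = {}
for lr in [0.001, 0.01, 0.1, 1]:
    beta = gradient_descent(X_demo, y_demo, lr, iterations=100)
    final_mse[lr] = float(np.mean((X_demo @ beta - y_demo) ** 2))
    verdict = "below limit" if lr < lr_limit else "at or above limit"
    print(f"lr={lr}: {verdict}, training MSE after 100 steps = {final_mse[lr]:.3e}")
```

With standardized features and an intercept column, λ_max is at least 2, so the limit is at most 1. Any correlation between features pushes it below 1, which is why a learning rate of 1 blows up while 0.1 does not.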


