
Load the dataset and implement 5-fold cross-validation for multiple linear regression
(using a least-squares fit).
Steps:
a) Divide the dataset into input features (all columns except price) and the output variable
(price).
b) Scale the values of the input features.
c) Divide the input features and the output variable into five folds.
d) Run five iterations; in each iteration, use one fold as the test set and the remaining
four folds as the training set. Find the beta (𝛽) matrix, the predicted values, and the R2 score
for each iteration using the least-squares fit.
e) Use the best (𝛽) matrix (the one for which the R2 score is maximum) to train the
regressor on 70% of the data, and test its performance on the remaining 30%.
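
As a small worked illustration of the least-squares fit in step (d), the sketch below solves for 𝛽 on a made-up, noise-free toy dataset (not USA_Housing.csv) in two ways: with the normal equation 𝛽 = (XᵀX)⁻¹Xᵀy, and with `np.linalg.lstsq`, which solves the same problem without forming an explicit inverse. The toy values and the helper `r2` are assumptions introduced only for this example.

```python
import numpy as np

# Toy data (made up for illustration): y = 1 + 2*x1 - 3*x2 exactly, no noise.
X = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so that beta[0] acts as the intercept.
X_b = np.c_[np.ones(len(X)), X]

# Normal equation: beta = (X^T X)^-1 X^T y
beta_normal = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Same least-squares problem, solved without an explicit matrix inverse.
beta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


print("beta (normal equation):", beta_normal)
print("beta (lstsq):          ", beta_lstsq)
print("R2 on the toy data:", r2(y, X_b @ beta_lstsq))
```

Because the toy target is an exact linear function of the inputs, both solvers should recover the intercept and the two slopes, and the R2 score should be 1 up to rounding.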

In [73]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

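# a) Split into input features (all columns except Price) and the target (Price).
data = pd.read_csv("USA_Housing.csv")
X = data.drop('Price', axis=1).values
y = data['Price'].values.reshape(-1, 1)

# b) Standardize every feature to zero mean and unit variance.
X = StandardScaler().fit_transform(X)

# c) Five contiguous, unshuffled folds of row indices.
folds = np.array_split(np.arange(len(X)), 5)

best_r2 = -np.inf  # R2 has no lower bound, so start below any possible score
best_beta = None

# d) Hold out each fold in turn and fit on the other four.
for i in range(5):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(5) if j != i])
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Prepend a column of ones so that beta[0] is the intercept.
    X_train_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]
    X_test_b = np.c_[np.ones((X_test.shape[0], 1)), X_test]

    # Normal equation: beta = (X^T X)^-1 X^T y
    beta = np.linalg.inv(X_train_b.T @ X_train_b) @ X_train_b.T @ y_train
    y_pred = X_test_b @ beta
    r2 = r2_score(y_test, y_pred)
    if r2 > best_r2:
        best_r2 = r2
        best_beta = beta

# e) 70/30 split in row order. best_beta is applied as-is to the last 30%;
#    it is not refit on the first 70%, so the training rows are not used here.
n = int(0.7 * len(X))
X_train, y_train = X[:n], y[:n]
X_test, y_test = X[n:], y[n:]
X_test_b = np.c_[np.ones((X_test.shape[0], 1)), X_test]
y_pred = X_test_b @ best_beta
print("Final Evaluation with best beta on 70/30 split")
print("R2 Score on test set:", r2_score(y_test, y_pred))
print("Best Beta Matrix:\n", best_beta.flatten())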


Final Evaluation with best beta on 70/30 split
R2 Score on test set: 0.917786034446557
Best Beta Matrix:
 [1.23144707e+06 2.29921558e+05 1.64523054e+05 1.19737507e+05
 1.12425659e+03 1.51317802e+05]
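
The cell above scales all rows before splitting and uses folds in row order. One possible variation, sketched below on synthetic data (a hypothetical stand-in, not USA_Housing.csv), shuffles the folds with scikit-learn's `KFold` and fits the scaler on the training folds only, so the held-out fold does not influence the scaling statistics. The data-generating coefficients and the seed are assumptions made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

# Synthetic stand-in for the housing data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 4.0])
y = 10.0 + X @ true_beta + rng.normal(scale=0.5, size=200)

fold_scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # Fit the scaler on the training folds only, then reuse it on the held-out fold.
    scaler = StandardScaler().fit(X[train_idx])
    X_train = np.c_[np.ones(len(train_idx)), scaler.transform(X[train_idx])]
    X_test = np.c_[np.ones(len(test_idx)), scaler.transform(X[test_idx])]

    # Least-squares fit without an explicit matrix inverse.
    beta, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)
    fold_scores.append(r2_score(y[test_idx], X_test @ beta))

print("R2 per fold:", np.round(fold_scores, 4))
print("Mean R2:", np.mean(fold_scores))
```

Reporting the mean R2 across folds, rather than only the best fold, gives a less optimistic estimate of how the model generalizes.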


2 Concept of Validation Set for Multiple Linear Regression (Gradient Descent Optimization)
Using the same dataset as Q1, rather than dividing the dataset into five folds, divide it
into a training set (56%), a validation set (14%), and a test set (30%).
Consider four different values of the learning rate, i.e. {0.001, 0.01, 0.1, 1}. Compute the
regression coefficients for each learning rate after 1000 iterations.
For each set of regression coefficients, compute the R2 score on the validation and test sets, and find
the best set of regression coefficients.
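
The 56% / 14% / 30% proportions correspond to a 70/30 split followed by an 80/20 split of the 70% portion (0.8 × 0.7 = 0.56 and 0.2 × 0.7 = 0.14). A minimal sketch of that two-stage split, using scikit-learn's `train_test_split` on 1000 hypothetical rows, is shown below; the shuffling and the `random_state` values are assumptions and differ from a split in row order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 hypothetical rows; each feature value doubles as a row id.
X = np.arange(1000, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

# Stage 1: carve off the 30% test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Stage 2: take 20% of the remaining 70% as validation (0.20 * 0.70 = 0.14).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The three subsets are disjoint by construction, so the validation set can be used to choose the learning rate while the test set stays untouched until the final report.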

In [74]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

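data = pd.read_csv("USA_Housing.csv")
X = data.drop('Price', axis=1).values
y = data['Price'].values.reshape(-1, 1)
X = StandardScaler().fit_transform(X)

# Contiguous split in row order: first 56% train, next 14% validation, last 30% test.
n = len(X)
train_end = int(0.56 * n)
val_end = int(0.7 * n)
X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]

# Prepend a column of ones so that beta[0] is the intercept.
X_train = np.c_[np.ones((X_train.shape[0], 1)), X_train]
X_val = np.c_[np.ones((X_val.shape[0], 1)), X_val]
X_test = np.c_[np.ones((X_test.shape[0], 1)), X_test]

lrates = [0.001, 0.01, 0.1, 1]
best_r2 = -np.inf  # R2 can fall below -1, so start from negative infinity
best_beta = None
for lr in lrates:
    # Batch gradient descent on the mean squared error, starting from beta = 0.
    beta = np.zeros((X_train.shape[1], 1))
    for _ in range(1000):
        # Gradient of (1/2n) * ||X beta - y||^2 with respect to beta.
        grad = (X_train.T @ (X_train @ beta - y_train)) / len(X_train)
        beta = beta - lr * grad
    r2_val = r2_score(y_val, X_val @ beta)
    r2_test = r2_score(y_test, X_test @ beta)
    print("Learning Rate:", lr)
    print("Validation R2:", r2_val, "Test R2:", r2_test)
    print("Beta:", beta.flatten())
    # Select on the validation score only; the test score is reported, not used for selection.
    if r2_val > best_r2:
        best_r2 = r2_val
        best_beta = beta
print("Best Beta:", best_beta.flatten())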


Learning Rate: 0.001
Validation R2: -0.9353469873109577 Test R2: -0.8082308505816143
Beta: [779956.15931298 148633.92656087 100071.44202261  73571.12255806
  22428.72038852  91893.41958638]
Learning Rate: 0.01
Validation R2: 0.9150931093041854 Test R2: 0.9174823125262497
Beta: [ 1.23240080e+06  2.31659714e+05  1.63606011e+05  1.18757266e+05
 -8.60666297e+00  1.50706774e+05]
Learning Rate: 0.1
Validation R2: 0.9151040123364315 Test R2: 0.917477081644098
Beta: [ 1.23244775e+06  2.31682635e+05  1.63635272e+05  1.19025219e+05
 -2.74956842e+02  1.50705906e+05]
Learning Rate: 1
Validation R2: 0.9151040123364313 Test R2: 0.9174770816440981
Beta: [ 1.23244775e+06  2.31682635e+05  1.63635272e+05  1.19025219e+05
 -2.74956842e+02  1.50705906e+05]
Best Beta: [ 1.23244775e+06  2.31682635e+05  1.63635272e+05  1.19025219e+05
 -2.74956842e+02  1.50705906e+05]
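
The negative R2 at learning rate 0.001 is consistent with gradient descent not having converged after 1000 iterations. Under the assumption that every feature column has exactly zero mean on the training rows, the intercept update decouples from the other coefficients: β0 ← β0 − lr·(β0 − ȳ), so starting from zero, β0 after t steps equals ȳ·(1 − (1 − lr)^t). For lr = 0.001 and t = 1000 that factor is roughly 0.63, close to the ratio of the two intercepts printed above (779956 / 1232448 ≈ 0.63). In the cell above the features are standardized on all rows, so the training means are only approximately zero and the match is approximate. The sketch below checks the closed form on synthetic data; the data, the seed, and the helper `gradient_descent` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # exactly centred on these rows
y = (50.0 + X @ np.array([4.0, -2.0, 1.0])
     + rng.normal(scale=0.1, size=n)).reshape(-1, 1)
X_b = np.c_[np.ones((n, 1)), X]  # column of ones for the intercept


def gradient_descent(X_b, y, lr, n_iter=1000):
    """Batch gradient descent on (1/2n) * ||X beta - y||^2, starting from zero."""
    beta = np.zeros((X_b.shape[1], 1))
    for _ in range(n_iter):
        grad = X_b.T @ (X_b @ beta - y) / len(X_b)
        beta = beta - lr * grad
    return beta


y_bar = y.mean()
for lr in (0.001, 0.01, 0.1, 1):
    beta = gradient_descent(X_b, y, lr)
    closed_form = y_bar * (1.0 - (1.0 - lr) ** 1000)
    print(f"lr={lr}: intercept={beta[0, 0]:.4f}, "
          f"closed form={closed_form:.4f}, target={y_bar:.4f}")
```

With the smallest learning rate the intercept reaches only about 63% of its target, so predictions are biased low by a large constant, which is enough to push R2 below zero.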


Pre-processing and Multiple Linear Regression
Download the Car Price Prediction dataset from the following link:
https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
1. Load the dataset with the following column names: ["symboling", "normalized_losses",
"make", "fuel_type", "aspiration", "num_doors", "body_style", "drive_wheels",
"engine_location", "wheel_base", "length", "width", "height", "curb_weight",
"engine_type", "num_cylinders", "engine_size", "fuel_system", "bore", "stroke",
"compression_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price"],
and replace all ? values with NaN.
2. Replace all NaN values using central-tendency imputation. Drop the rows with NaN
values in the price column.
3. There are 10 columns in the dataset with non-numeric values. Convert these values to
numeric values using the following scheme:
(i) For "num_doors" and "num_cylinders": convert number words to figures,
e.g. two to 2.
(ii) For "body_style" and "drive_wheels": use dummy encoding.
(iii) For "make", "aspiration", "engine_location", and "fuel_type": use label
encoding.
(iv) For "fuel_system": replace values containing the string pfi with 1 and all other values with 0.
(v) For "engine_type": replace values containing the string ohc with 1 and all other values with 0.
4. Divide the dataset into input features (all columns except price) and the output variable
(price). Scale all input features.
5. Train a linear regressor on 70% of the data (using Python's built-in linear regression
function) and test its performance on the remaining 30%.
6. Reduce the dimensionality of the feature set using the built-in PCA decomposition, then
train a linear regressor again on 70% of the reduced data (using the same built-in linear
regression function). Does this improve performance on the test set?
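
The encoding schemes in step 3 can be illustrated on a tiny made-up frame. The rows below are hypothetical and only mimic a few of the categorical columns. Label encoding is done here with `pd.factorize`, which assigns codes in order of first appearance; scikit-learn's `LabelEncoder` would assign them in sorted order instead.

```python
import pandas as pd

# Tiny made-up frame that mimics a few of the categorical columns.
toy = pd.DataFrame({
    "num_doors": ["two", "four", "four"],
    "fuel_system": ["mpfi", "2bbl", "spfi"],
    "make": ["audi", "bmw", "audi"],
    "body_style": ["sedan", "hatchback", "wagon"],
})

# (i) Number words -> integers.
toy["num_doors"] = toy["num_doors"].map({"two": 2, "four": 4})

# (iv) 1 if the value contains "pfi", else 0.
toy["fuel_system"] = toy["fuel_system"].str.contains("pfi").astype(int)

# (iii) Label encoding: one integer code per distinct value.
toy["make"] = pd.factorize(toy["make"])[0]

# (ii) Dummy encoding: k categories -> k-1 indicator columns (first one dropped).
toy = pd.get_dummies(toy, columns=["body_style"], drop_first=True, dtype=int)

print(toy)
```

Dropping the first dummy column avoids a redundant indicator: a row with zeros in every remaining `body_style_*` column belongs to the dropped category.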

In [75]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

cols = [
    "symboling", "normalized_losses", "make", "fuel_type", "aspiration",
    "num_doors", "body_style", "drive_wheels", "engine_location", "wheel_base",
    "length", "width", "height", "curb_weight", "engine_type", "num_cylinders",
    "engine_size", "fuel_system", "bore", "stroke", "compression_ratio",
    "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price",
]

# 1. Load the data; na_values='?' turns every ? into NaN.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=cols, na_values='?')
cat = ["make", "fuel_type", "aspiration", "num_doors", "body_style",
       "drive_wheels", "engine_location", "engine_type", "fuel_system", "num_cylinders"]
num = [c for c in cols if c not in cat + ["price"]]
df[num] = df[num].apply(pd.to_numeric, errors='coerce')

# 2. Drop rows with a missing price, then impute: mean for numeric columns,
#    mode for categorical columns.
df = df.dropna(subset=["price"])
for c in num:
    df[c] = df[c].fillna(df[c].mean())
for c in cat:
    df[c] = df[c].fillna(df[c].mode()[0])

# 3. Encode the ten non-numeric columns.
df["num_doors"] = df["num_doors"].map({"two": 2, "four": 4})
df["num_cylinders"] = df["num_cylinders"].map(
    {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "eight": 8, "twelve": 12})
df["fuel_system"] = df["fuel_system"].astype(str).str.contains("pfi").astype(int)
df["engine_type"] = df["engine_type"].astype(str).str.contains("ohc").astype(int)
df["make"] = pd.factorize(df["make"])[0]
df["aspiration"] = pd.factorize(df["aspiration"])[0]
df["engine_location"] = pd.factorize(df["engine_location"])[0]
df["fuel_type"] = pd.factorize(df["fuel_type"])[0]
# dtype=int keeps the dummy columns numeric on pandas versions that default to bool.
df = pd.get_dummies(df, columns=["body_style", "drive_wheels"], drop_first=True, dtype=int)

# 4. Separate features and target, then scale the features.
X = df.drop("price", axis=1).values
y = df["price"].astype(float).values
sc = StandardScaler()
X = sc.fit_transform(X)

# 5. 70/30 split in file order (no shuffling), then ordinary least squares.
n = len(X)
ntr = int(0.7 * n)
Xtr, Xte = X[:ntr], X[ntr:]
ytr, yte = y[:ntr], y[ntr:]
lr = LinearRegression().fit(Xtr, ytr)
yhat = lr.predict(Xte)
r2_plain = r2_score(yte, yhat)
print("R2 without PCA:", r2_plain)

print()

# 6. Keep enough principal components to explain 95% of the variance, then refit.
pca = PCA(n_components=0.95)
Xtr_p = pca.fit_transform(Xtr)
Xte_p = pca.transform(Xte)
lr_p = LinearRegression().fit(Xtr_p, ytr)
yhat_p = lr_p.predict(Xte_p)
r2_pca = r2_score(yte, yhat_p)
print("R2 with PCA:", r2_pca)
print("Improved:", r2_pca > r2_plain)


R2 without PCA: 0.18961326980816962

R2 with PCA: 0.3415833571426846
Improved: True
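
The scores above come from a split in file order, with the scaler fit on all rows. If the rows of a dataset are grouped (for instance by make), the last 30% may not resemble the first 70%, and a shuffled split can give a different picture. The sketch below shows one way to set that up with `train_test_split` and a scikit-learn pipeline, so that scaling and PCA are fit on the training rows only. It runs on synthetic data from `make_regression` (a hypothetical stand-in, not the car dataset), so its scores say nothing about the result above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded car features (not the real data).
X, y = make_regression(n_samples=200, n_features=20, n_informative=8,
                       noise=10.0, random_state=0)

# Shuffled 70/30 split instead of a split in row order.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)

# Each pipeline fits its scaler (and PCA) on the training rows only.
plain = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)
with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                         LinearRegression()).fit(X_tr, y_tr)

r2_plain = plain.score(X_te, y_te)    # R2 on the held-out rows
r2_pca = with_pca.score(X_te, y_te)
n_kept = with_pca.named_steps["pca"].n_components_

print("R2 without PCA:", r2_plain)
print("R2 with PCA:", r2_pca, "using", n_kept, "components")
```

PCA discards directions of low variance without looking at the target, so it can either help (by removing noisy directions) or hurt (by removing informative ones); comparing both pipelines on the same held-out rows is what answers the question in step 6.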
