We try out four different models: **linear regression, linear support vector regressor, random forest
regressor, and gradient boosting regressor**.

In [None]:
import pandas as pd
import os
import random

from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor


from sklearn.metrics import r2_score,mean_squared_error
from sklearn.model_selection import train_test_split,KFold,cross_val_score
from sklearn.preprocessing import StandardScaler

# current_script_dir = os.path.dirname(__file__)
# csv_path = os.path.join(current_script_dir,"../data/cal_housing.csv")

df = pd.read_csv("/content/cal_housing_tuned.csv")
# Model on the first 20,000 rows only; the later rows are held back for a final spot check.
df1 = df.iloc[:20000,:]

X = df1.drop("medianHouseValue",axis=1).values
y = df1["medianHouseValue"].values


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=44)



# Fit the scaler on the training split only, then apply the same transform to both splits.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

LinReg = LinearRegression()
svm = LinearSVR(max_iter=10000,C=11,random_state=42)
ranfor = RandomForestRegressor(n_estimators=102, random_state=42)
gradboost = GradientBoostingRegressor(n_estimators=90, learning_rate=0.2, max_depth=3, random_state=42)

models = [LinReg,svm,ranfor,gradboost]

for model in models:

    model.fit(X_train_scaled,y_train)
    y_pred = model.predict(X_test_scaled)

    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Performance of {model} :- ")
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}\n")


Performance of LinearRegression() :- 
Mean Squared Error: 2.04583384540475
R-squared: 0.6798062154405788

Performance of LinearSVR(C=11, max_iter=10000, random_state=42) :- 
Mean Squared Error: 2.066316844753074
R-squared: 0.6766004179144507

Performance of RandomForestRegressor(n_estimators=102, random_state=42) :- 
Mean Squared Error: 0.9358553149639114
R-squared: 0.8535291339658819

Performance of GradientBoostingRegressor(learning_rate=0.2, n_estimators=90, random_state=42) :- 
Mean Squared Error: 1.1294339240312008
R-squared: 0.8232321146912099



After trying different values and tuning the hyperparameters, we can see that **random forest regression** and **gradient boosting regression** clearly outperform the two linear models. Random forest also beats gradient boosting on both metrics (MSE of about 0.94 vs. 1.13, R-squared of about 0.85 vs. 0.82), so we select the **random forest regressor** model.
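The comparison above can also be collected into a small table so the ranking is explicit. The following is a minimal sketch, not the notebook's own code: it uses synthetic data from `make_regression` as a stand-in for the housing data, and the names `candidates`, `results`, and `best_model_name` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): swap in the real X and y from the notebook.
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Hypothetical candidate set; hyperparameters here are illustrative only.
candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "mse": mean_squared_error(y_te, pred),
        "r2": r2_score(y_te, pred),
    })

# Sort so the model with the lowest test MSE comes first.
results = pd.DataFrame(rows).sort_values("mse").reset_index(drop=True)
print(results)

best_model_name = results.loc[0, "model"]
print("Lowest test MSE:", best_model_name)
```

On synthetic linear data the ranking will likely differ from the housing results; the point is only the bookkeeping, not the winner.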

Next, we perform **k-fold cross-validation** to check that the model generalizes well and is not overfitting to one particular train/test split.
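One detail worth keeping in mind with cross-validation is where preprocessing is fitted. A minimal sketch follows, assuming synthetic data and hypothetical names (`pipe`, `kf_demo`, `cv_results`): wrapping the scaler and the model in a scikit-learn pipeline means the scaler is refitted on the training part of each fold, and `cross_validate` collects both metrics in a single pass.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): swap in the real X and y from the notebook.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# The pipeline refits StandardScaler inside every fold, so validation rows
# never contribute to the scaling statistics.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=30, random_state=42),
)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# One call evaluates both metrics on the same folds.
cv_results = cross_validate(
    pipe, X_demo, y_demo, cv=kf_demo,
    scoring=("neg_mean_squared_error", "r2"),
)
mse_per_fold = -cv_results["test_neg_mean_squared_error"]  # flip sign back to MSE
r2_per_fold = cv_results["test_r2"]

print("MSE per fold:", mse_per_fold)
print("R-squared per fold:", r2_per_fold)
```

For tree-based models the scaling step itself has little effect, so this mainly matters when the same setup is reused with scale-sensitive models such as `LinearSVR`.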

In [None]:

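# Note: the scaler is fitted on all of X before the data is split into folds.
# Tree-based models are largely insensitive to feature scaling, so this should not affect the random forest much.
scalers = StandardScaler()
X_scaled = scalers.fit_transform(X)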

ran = RandomForestRegressor(n_estimators=102, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Performing k-fold cross-validation
mse_scores = cross_val_score(ran, X_scaled, y, cv=kf, scoring='neg_mean_squared_error')
r2_scores = cross_val_score(ran, X_scaled, y, cv=kf, scoring='r2')

mse_scores = -mse_scores

for fold, (mse, r2) in enumerate(zip(mse_scores, r2_scores), 1):
    print(f"Fold {fold}:-")
    print(f"Mean Squared Error: {mse}")
    print(f"R-squared: {r2}\n")

print("Average Performance Across Folds:")
print(f"Mean Squared Error: {mse_scores.mean()}")
print(f"R-squared: {r2_scores.mean()}")



Fold 1:-
Mean Squared Error: 1.0542459376397064
R-squared: 0.8415318403339744

Fold 2:-
Mean Squared Error: 0.985438442904238
R-squared: 0.8444815878484031

Fold 3:-
Mean Squared Error: 0.9662284897782001
R-squared: 0.8474213614569975

Fold 4:-
Mean Squared Error: 0.9694928386070533
R-squared: 0.849735375558331

Fold 5:-
Mean Squared Error: 0.9523016326051417
R-squared: 0.8509171520170596

Average Performance Across Folds:
Mean Squared Error: 0.9855414683068678
R-squared: 0.8468174634429531


From the results above, we can see that the model's performance is consistent across all folds (R-squared stays between roughly 0.84 and 0.85), so we conclude that the model generalizes well.
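The consistency across folds can be quantified with the spread of the per-fold scores. This short sketch copies the fold values printed above into arrays (the names `r2_folds` and `mse_folds` are hypothetical) and reports the mean, sample standard deviation, and range for each metric.

```python
import numpy as np

# Per-fold scores copied from the cross-validation output above.
r2_folds = np.array([
    0.8415318403339744, 0.8444815878484031, 0.8474213614569975,
    0.849735375558331, 0.8509171520170596,
])
mse_folds = np.array([
    1.0542459376397064, 0.985438442904238, 0.9662284897782001,
    0.9694928386070533, 0.9523016326051417,
])

for name, scores in (("R-squared", r2_folds), ("MSE", mse_folds)):
    spread = scores.max() - scores.min()
    # ddof=1 gives the sample standard deviation over the 5 folds.
    print(f"{name}: mean={scores.mean():.4f}, "
          f"std={scores.std(ddof=1):.4f}, range={spread:.4f}")
```

A small standard deviation relative to the mean supports the claim that no single fold is behaving very differently from the others.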

Now, we will predict some values using known data that has not been shown to the model.

In [None]:
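# Sample 10 rows from 20000-20099; these rows were excluded from df1, so the model never saw them.
# No seed is set, so a different sample is drawn on each run.
indexes = random.sample(range(20000,20100),10)
test_data = df.iloc[indexes,0:8].values      # assumes the first 8 columns are the features
actual_values = df.iloc[indexes,-1].values   # assumes the last column is medianHouseValue
test_scaled = scaler.transform(test_data)    # reuse the scaler fitted on X_train
pred_values = ranfor.predict(test_scaled)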
print(pred_values)
print(actual_values)

[25.39814362 27.94220491 26.74732694 25.73716457 25.91640161 27.07138613
 26.19838268 26.94863507 26.00736428 25.20727169]
[25.59630927 26.92348855 28.60161536 23.27451039 25.19394569 27.34619805
 25.35899412 26.48657462 26.34904723 24.63482298]
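Comparing the two printed arrays by eye is hard, so a few summary numbers help. The sketch below copies the predicted and actual values from the output above into arrays (the names `pred_check` and `actual_check` are hypothetical) and computes the absolute errors. Ten rows is a small sample, so this is a sanity check rather than a performance estimate.

```python
import numpy as np

# Values copied from the printed output above.
pred_check = np.array([
    25.39814362, 27.94220491, 26.74732694, 25.73716457, 25.91640161,
    27.07138613, 26.19838268, 26.94863507, 26.00736428, 25.20727169,
])
actual_check = np.array([
    25.59630927, 26.92348855, 28.60161536, 23.27451039, 25.19394569,
    27.34619805, 25.35899412, 26.48657462, 26.34904723, 24.63482298,
])

abs_err = np.abs(pred_check - actual_check)
rmse = np.sqrt(np.mean((pred_check - actual_check) ** 2))

print(f"Mean absolute error: {abs_err.mean():.3f}")
print(f"Largest absolute error: {abs_err.max():.3f} (sample {abs_err.argmax()})")
print(f"RMSE: {rmse:.3f}")
```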
