<a href="https://colab.research.google.com/github/nicorunini/CCMACLRL_EXERCISES_COM232/blob/main/Exercise6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise 6: Choosing the best performing model on a dataset

Instructions:

- Use the Dataset File to train your model
- Use the Test File to generate your results
- Use the Sample Submission file to generate the same format
- Use all Regression models

Submit your results to:
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview



In [180]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

## Dataset File

In [181]:
train_data = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/train.csv?raw=true'
df = pd.read_csv(train_data)

## Test File

In [182]:
test_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/test.csv?raw=true'
dt=pd.read_csv(test_url)

In [183]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

## Sample Submission File

In [184]:
sample_submission_url ='https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/sample_submission.csv?raw=true'

sf=pd.read_csv(sample_submission_url)

In [185]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         1459 non-null   int64  
 1   SalePrice  1459 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB


In [186]:
X = df.drop(columns=["SalePrice", "Id"])
y = df["SalePrice"]

In [187]:

for col in X.select_dtypes(include=["int64", "float64"]).columns:
    X[col].fillna(X[col].median(), inplace=True)

for col in X.select_dtypes(include=["object"]).columns:
    X[col].fillna(X[col].mode()[0], inplace=True)

X = pd.get_dummies(X, drop_first=True)

# Train/test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# One-hot encode test data
dt = pd.get_dummies(dt, drop_first=True)

# Align test set columns with training set columns
dt = dt.reindex(columns=X.columns, fill_value=0)



The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X[col].fillna(X[col].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  X[col].fillna(X[col].mode()[0], inplace=True)


In [188]:
def cross_val(model, X, y, k=5):
    fold_size = len(X) // k
    scores = []

    for i in range(k):
        start = i * fold_size
        end = start + fold_size

        x_val = X.iloc[start:end]
        y_val = y.iloc[start:end]

        x_train = pd.concat([X.iloc[:start], X.iloc[end:]])
        y_train = pd.concat([y.iloc[:start], y.iloc[end:]])

        model.fit(x_train, y_train)
        scores.append(model.score(x_val, y_val))

    return sum(scores) / len(scores)


## 1. Train a KNN Regressor

In [189]:
KNN = KNeighborsRegressor(n_neighbors=22)
KNN.fit(x_train, y_train)
knn_score = KNN.score(x_test, y_test)
score_list["KNN Regressor"] = knn_score
print(f"Score is {knn_score}")

Score is 0.5980835491666139


- Perform cross validation

In [190]:
knn_cv = cross_val(KNN, X, y, k=5)
print(f"KNN Cross-Validation Score is {knn_cv}")
score_list["KNN Regressor"] = {"model": KNN, "score": knn_cv}

KNN Cross-Validation Score is 0.6085890609548947


## 2. Train a SVM Regression

In [191]:
SVM = SVR(kernel="rbf", C=100, gamma=0.1)
SVM.fit(x_train, y_train)
svm_score = SVM.score(x_test, y_test)
score_list["SVM Regressor"] = svm_score
print(f"Score is {svm_score}")

Score is -0.037962385609509264


- Perform cross validation

In [192]:
svm_cv = cross_val(SVM, X, y, k=5)
print(f"Score is {svm_cv}")
score_list["SVM Regressor"] = {"model": SVM, "score": svm_cv}

Score is -0.05202678413481774


## 3. Train a Decision Tree Regression

In [193]:
DT = DecisionTreeRegressor(random_state=1)
DT.fit(x_train, y_train)
dt_score = DT.score(x_test, y_test)
score_list["Decision Tree Regressor"] = dt_score
print(f"Score is {dt_score}")

Score is 0.751032883968054


- Perform cross validation

In [194]:
dt_cv = cross_val(DT, X, y, k=5)
print(f"Score is {dt_cv}")
score_list["Decision Tree Regressor"] = {"model": DT, "score": dt_cv}

Score is 0.7241490299673856


## 4. Train a Random Forest Regression

In [195]:
RF = RandomForestRegressor(n_estimators=200, random_state=1)
RF.fit(x_train, y_train)
rf_score = RF.score(x_test, y_test)
score_list["Random Forest Regressor"] = rf_score
print(f"Score is {rf_score}")

Score is 0.9060977490162936


In [202]:
rf_cv = cross_val(RF, X, y, k=5)
print(f"Score is {rf_cv}")

Score is 0.8556852157285582


## 5. Compare all the performance of all regression models

In [204]:
for alg, score in score_list:
    if isinstance(score, (int, float)):
        print(f"{alg} Score is {score:.4f}")
    else:
        print(f"{alg} has invalid score: {score}")


KNN Score is 0.5981
KNN Classifier Score is 0.5981
SVM Regressor has invalid score: {'model': SVR(C=100, gamma=0.1), 'score': -0.05202678413481774}
KNN Regressor has invalid score: {'model': KNeighborsRegressor(n_neighbors=22), 'score': 0.6085890609548947}
Random Forest Regressor Score is 0.9061
Decision Tree Regressor has invalid score: {'model': DecisionTreeRegressor(random_state=1), 'score': 0.7241490299673856}


## 6. Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [205]:
sample_submission_url ='https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/sample_submission.csv?raw=true'
sf=pd.read_csv(sample_submission_url)

id = sf.pop('Id')
y_pred = RF.predict(dt)

# Create a submission DataFrame
submission_df = pd.DataFrame({
    'Id': id,
    'SalePrice': y_pred
})

# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv
