In [1]:
import pandas as pd

In [25]:
df = pd.read_csv(r"C:\Users\harry\Downloads\Dataset - serus\train.csv")

In [3]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [5]:
with open(r"C:\Users\harry\Downloads\Dataset - serus\data_description.txt", 'r') as file:
    data_description = file.read()

print(data_description)

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

The dataset contains many features, both numerical and categorical, describing various aspects of residential homes.

The next step is to preprocess the data: handle missing values, convert categorical variables into numerical ones (encoding), and scale numerical values if the chosen model needs it.

I'll start by checking for missing values.

In [26]:
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
missing_values.sort_values(ascending=False)

PoolQC          1453
MiscFeature     1406
Alley           1369
Fence           1179
FireplaceQu      690
LotFrontage      259
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
BsmtExposure      38
BsmtFinType2      38
BsmtFinType1      37
BsmtCond          37
BsmtQual          37
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64

Nineteen features have missing values, ranging from a single row (Electrical) to 1,453 of the 1,460 rows (PoolQC). The missing data needs to be handled before a machine learning model can be trained on it. Common strategies include:

Removing the feature: If a feature has a high percentage of missing values, it might be best to remove the feature entirely from the dataset.

Removing the instances: If only a few instances have missing values for a particular feature, it might be best to remove these instances from the dataset.

Imputation: This involves filling in the missing values based on the other values in the same feature. This could be done by setting the missing value to the mean, median, or mode of the feature.

Prediction: This involves using a machine learning algorithm to predict the missing values. This is typically only done for features that are thought to be very important, as it can be quite involved and computationally expensive.
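The first three strategies above can be sketched on a small toy frame. This is a minimal illustration, not the preprocessing used on train.csv: the column names and values are hypothetical, and the 0.5 threshold is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "area": [50.0, np.nan, 70.0, 80.0, 60.0],
    "zone": ["RL", "RM", None, "RL", "RL"],
})

# Fraction of missing values per column
missing_frac = toy.isnull().mean()

# 1. Remove features whose missing fraction exceeds a threshold
kept = toy.loc[:, missing_frac <= 0.5]

# 2. Remove instances that still contain any missing value
dropped_rows = kept.dropna()

# 3. Impute: median for the numeric column, mode for the categorical one
imputed = kept.copy()
imputed["area"] = imputed["area"].fillna(imputed["area"].median())
imputed["zone"] = imputed["zone"].fillna(imputed["zone"].mode()[0])
```

Dropping rows keeps only complete observations at the cost of sample size, while imputation keeps every row but replaces the gaps with a typical value.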

For this dataset, I suggest removing features where more than half of the values are missing and imputing the rest. For imputation, I'll use the median for numerical features and the mode for categorical features.

I'll also drop the 'Id' column as it's not a feature that can help in predicting house prices.

Let's proceed with this.

In [7]:
from sklearn.impute import SimpleImputer

In [27]:
df = df.drop(['Id'], axis=1)

high_missing_val_threshold = 0.5  

high_missing_val_columns = missing_values[missing_values/len(df) > high_missing_val_threshold].index
df = df.drop(high_missing_val_columns, axis=1)


missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]

num_imputer = SimpleImputer(strategy='median')  # for numerical features
cat_imputer = SimpleImputer(strategy='most_frequent')  # for categorical features


num_features = df.select_dtypes(include=['int64', 'float64']).columns
cat_features = df.select_dtypes(include=['object']).columns

df[num_features] = num_imputer.fit_transform(df[num_features])

df[cat_features] = cat_imputer.fit_transform(df[cat_features])

df.isnull().sum().any()

False

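All missing values have been handled; none remain in the dataset. One caveat: according to the dataset's data description, NA in several categorical columns (for example FireplaceQu and the Garage columns) means the house lacks that feature rather than that the value is unknown, so mode imputation assigns those houses the most common category.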
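The imputers above were fitted on the full dataset, so the medians and modes also reflect rows that later end up in the test set. A variant that learns those statistics from the training rows only might look like the sketch below. The toy values are hypothetical, and `ColumnTransformer` is swapped in here as one convenient way to route columns to imputers; it is not what the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame (hypothetical values) with gaps in both column types
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, np.nan, 70.0, 75.0],
    "GarageType": ["Attchd", "Detchd", "Attchd", np.nan,
                   "Attchd", "Attchd", np.nan, "Detchd"],
})

# Split first, so the held-out rows play no part in fitting the imputers
train, test = train_test_split(toy, test_size=0.25, random_state=42)

imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["LotFrontage"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["GarageType"]),
])

# Learn medians and modes from the training rows only...
train_filled = imputer.fit_transform(train)
# ...then reuse those same statistics on the held-out rows
test_filled = imputer.transform(test)
```

On a dataset of this size the effect on the reported scores is probably small, but splitting before fitting any preprocessing step keeps the test set a cleaner estimate of performance on unseen data.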

Next, we need to encode the categorical features. Most scikit-learn estimators require numerical input, so the categorical features have to be converted into a numerical form.

There are several strategies for encoding categorical features:

One-hot encoding: This strategy creates a binary column for each category in the feature. This is typically used for nominal variables, i.e., variables where the order does not matter.

Ordinal encoding: This strategy converts each category in the feature to a single number. This is typically used for ordinal variables, i.e., variables where the order does matter.

Since it's not clear from the data description which categorical variables are ordinal, I'll use one-hot encoding for all categorical variables. This method has the advantage that it does not introduce an arbitrary order where there isn't one in the data.
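The difference between the two encodings can be seen on a toy frame. The column names and categories below are hypothetical; in particular, the order passed to `OrdinalEncoder` is an assumption made for the example and would have to come from domain knowledge in practice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical columns: "Street" is nominal, "Quality" has a natural order
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "Quality": ["Fair", "Good", "Excellent"],
})

# One-hot: one indicator column per category, no order implied
one_hot = pd.get_dummies(toy[["Street"]])

# Ordinal: the order is stated explicitly instead of being inferred
encoder = OrdinalEncoder(categories=[["Fair", "Good", "Excellent"]])
quality_codes = encoder.fit_transform(toy[["Quality"]]).ravel()
```

Here `one_hot` has the columns `Street_Grvl` and `Street_Pave`, while `quality_codes` maps Fair, Good, Excellent to 0, 1, 2.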

In [9]:
df_encoded = pd.get_dummies(df)

df_encoded.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,60.0,65.0,8450.0,7.0,5.0,2003.0,2003.0,196.0,706.0,0.0,...,0,0,0,1,0,0,0,0,1,0
1,20.0,80.0,9600.0,6.0,8.0,1976.0,1976.0,0.0,978.0,0.0,...,0,0,0,1,0,0,0,0,1,0
2,60.0,68.0,11250.0,7.0,5.0,2001.0,2002.0,162.0,486.0,0.0,...,0,0,0,1,0,0,0,0,1,0
3,70.0,60.0,9550.0,7.0,5.0,1915.0,1970.0,0.0,216.0,0.0,...,0,0,0,1,1,0,0,0,0,0
4,60.0,84.0,14260.0,8.0,5.0,2000.0,2000.0,350.0,655.0,0.0,...,0,0,0,1,0,0,0,0,1,0


The dataset has been successfully encoded. All categorical features have been replaced with their one-hot encoded counterparts, leading to a total of 276 columns in the dataset: 275 features plus the target, 'SalePrice'.

Next, we need to split the dataset into features (X) and the target (y), which is the 'SalePrice' in this case. After that, we'll split the data into a training set and a test set. The training set is used to train the machine learning model, while the test set is used to evaluate the model's performance.

In [10]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('SalePrice', axis=1)
y = df_encoded['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1168, 275), (292, 275), (1168,), (292,))

The dataset has been successfully split into a training set and a test set. The training set consists of 1168 instances and the test set consists of 292 instances.

The next step is to train a machine learning model on the training set. For this task, I suggest using the Gradient Boosting Regressor, a widely used model for regression tasks. It builds an ensemble of shallow decision trees sequentially, each one fitted to the errors of the trees before it, and tends to perform well on tabular data with little tuning.

Let's train a Gradient Boosting Regressor on the training set. We'll then evaluate it on the test set.

In [11]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

gbr = GradientBoostingRegressor(random_state=42)

gbr.fit(X_train, y_train)

y_pred = gbr.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
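# RMSE as the square root of the MSE; the `squared=False` argument is deprecated in recent scikit-learn releases
rmse = mean_squared_error(y_test, y_pred) ** 0.5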

print(mae, rmse)

(16555.576771413747, 26249.76139691673)

The model has been trained and evaluated. The Mean Absolute Error (MAE) on the test set is approximately $16,556, and the Root Mean Squared Error (RMSE) is approximately $26,250.

These metrics tell us that, on average, our model's predictions are about $16,556 away from the actual house prices. The RMSE is larger than the MAE because squaring weights large errors more heavily, so a few badly mispredicted houses pull it up.
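The relationship between the two metrics can be worked through by hand. The prices and predictions below are hypothetical and chosen so that one prediction is far off.

```python
import math

# Hypothetical prices and predictions, in dollars
y_true = [100_000, 200_000, 300_000, 400_000]
y_hat = [110_000, 190_000, 300_000, 440_000]

# Signed errors: 10k, -10k, 0, 40k
errors = [p - t for t, p in zip(y_true, y_hat)]

# MAE: mean of the absolute errors -> 15000.0
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean squared error -> about 21213.2
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
```

The single 40,000 error accounts for most of the squared-error total, which is why the RMSE sits well above the MAE in this example.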

While these numbers provide some insight into the model's performance, it would be useful to also calculate the R-squared (coefficient of determination) score, which provides a measure of how well future samples are likely to be predicted by the model. A higher R-squared score indicates a better fit of the model.

Let's calculate the R-squared score.

In [12]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)

print(r2)

0.9101667967198818


The R-squared score is approximately 0.91. This means that our model explains about 91% of the variance in the target variable, 'SalePrice'. This is quite a high score, indicating that our model has performed well on the test set.
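As a reference for reading this score, R-squared compares the model's squared error with that of a baseline that always predicts the mean. The helper below is a hypothetical plain-Python version of that definition, shown on toy values.

```python
def r_squared(y_true, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))  # model's squared error
    ss_tot = sum((t - mean) ** 2 for t in y_true)              # mean baseline's squared error
    return 1 - ss_res / ss_tot

y_true = [1, 2, 3, 4, 5]
perfect = r_squared(y_true, [1, 2, 3, 4, 5])    # 1.0
baseline = r_squared(y_true, [3, 3, 3, 3, 3])   # 0.0: always predicting the mean
reversed_ = r_squared(y_true, [5, 4, 3, 2, 1])  # -3.0: worse than the mean
```

A score of 1 is a perfect fit, 0 matches the mean baseline, and negative values mean the model is worse than that baseline.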

Although this is a good result, there may be room for improvement. We can try different types of models, tune hyperparameters, or add features to the data and see whether the performance improves.

In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rf = RandomForestRegressor(random_state=42)
svr = make_pipeline(StandardScaler(), SVR())
ridge = Ridge(random_state=42)

models = {'Random Forest': rf, 'Support Vector Regressor': svr, 'Ridge Regressor': ridge}

scores = {}
for model_name, model in models.items():
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)

    scores[model_name] = r2

scores

{'Random Forest': 0.8926381032487214,
 'Support Vector Regressor': -0.024557492196307873,
 'Ridge Regressor': 0.8878879430495766}

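The R-squared scores for the different regression models are as follows:

Random Forest: 0.89
Support Vector Regressor: -0.02
Ridge Regressor: 0.89

The Random Forest and Ridge Regressor models performed well, each with an R-squared score of about 0.89. The Support Vector Regressor performed poorly: its negative R-squared score means it did slightly worse than always predicting the mean sale price. A likely cause is that SVR's default settings (C=1.0, epsilon=0.1) are poorly matched to a target measured in hundreds of thousands of dollars; only the features were scaled here, not the target.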

This means that the Gradient Boosting Regressor that we used initially (with an R-squared score of 0.91) still provided the best performance.
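The SVR result can be probed by standardizing the target as well as the features. The sketch below is a hedged illustration on synthetic data, not a rerun on train.csv: the data come from `make_regression` and are rescaled to a price-like range (an assumption), and `TransformedTargetRegressor` is used here as one way to scale the target for fitting and map predictions back to dollars.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data, rescaled to a price-like range
X_toy, y_toy = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=42)
y_toy = 180_000 + 50_000 * (y_toy - y_toy.mean()) / y_toy.std()

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# Same setup as the notebook's SVR: features scaled, target left in dollars
plain_svr = make_pipeline(StandardScaler(), SVR())
plain_r2 = r2_score(y_te, plain_svr.fit(X_tr, y_tr).predict(X_te))

# Target standardized for fitting, predictions mapped back to dollars
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_r2 = r2_score(y_te, scaled_svr.fit(X_tr, y_tr).predict(X_te))
```

With the target left in dollars, the default `C=1.0` keeps the SVR's predictions close to a constant, so `plain_r2` stays near zero, while the version with a standardized target can fit the signal. Whether the same change is enough on the housing data would need to be checked there.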

In [16]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
import numpy as np

gbr_params = {'n_estimators': np.arange(100, 1000, 100),
              'learning_rate': [0.01, 0.1, 0.2, 0.3],
              'max_depth': np.arange(3, 10)}

rf_params = {'n_estimators': np.arange(100, 1000, 100),
             'max_depth': np.arange(3, 10),
             'min_samples_split': [2, 5, 10],
             'min_samples_leaf': [1, 2, 4]}

xgb_params = {'n_estimators': np.arange(100, 1000, 100),
              'learning_rate': [0.01, 0.1, 0.2, 0.3],
              'max_depth': np.arange(3, 10)}

gbr = GradientBoostingRegressor(random_state=42)
rf = RandomForestRegressor(random_state=42)
xgb = XGBRegressor(random_state=42)

models = {'Gradient Boosting Regressor': (gbr, gbr_params), 
          'Random Forest': (rf, rf_params), 
          'XGBoost Regressor': (xgb, xgb_params)}

tuned_scores = {}
for model_name, (model, params) in models.items():
    # Sample 10 parameter combinations and score each with 5-fold cross-validation
    random_search = RandomizedSearchCV(model, params, cv=5, n_iter=10, scoring='r2', random_state=42)
    random_search.fit(X_train, y_train)

    # best_estimator_ is refit on the whole training set with the best combination
    best_model = random_search.best_estimator_

    y_pred = best_model.predict(X_test)
    tuned_scores[model_name] = r2_score(y_test, y_pred)

tuned_scores

{'Gradient Boosting Regressor': 0.9012293731232879,
 'Random Forest': 0.8883818544514055,
 'XGBoost Regressor': 0.8714151319090934}
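The loop above records only test-set scores. A fitted `RandomizedSearchCV` also exposes the chosen parameters and their cross-validated score, which can be read without touching the test set. The sketch below shows this on synthetic data with a deliberately small search space; the data and parameter values are assumptions for the example, not the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem so the search runs quickly
X_toy, y_toy = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=42)

param_distributions = {
    "n_estimators": [50, 100],
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions,
    n_iter=4,          # number of sampled combinations
    cv=3,              # folds per combination
    scoring="r2",
    random_state=42,
)
search.fit(X_toy, y_toy)

best_params = search.best_params_  # sampled combination with the best mean CV score
best_cv_r2 = search.best_score_    # mean cross-validated R^2 for that combination
```

Because only `n_iter` combinations are sampled, a tuned model is not guaranteed to beat the defaults, which is consistent with the tuned Gradient Boosting score above (0.90) falling slightly below the untuned one (0.91).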

In [17]:
from sklearn.ensemble import VotingRegressor

gbr_best_params = {'n_estimators': 400, 'max_depth': 4, 'learning_rate': 0.1}
rf_best_params = {'n_estimators': 500, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 9}
xgb_best_params = {'n_estimators': 300, 'max_depth': 5, 'learning_rate': 0.1}

gbr = GradientBoostingRegressor(**gbr_best_params, random_state=42)
rf = RandomForestRegressor(**rf_best_params, random_state=42)
xgb = XGBRegressor(**xgb_best_params, random_state=42)

ensemble = VotingRegressor([('gbr', gbr), ('rf', rf), ('xgb', xgb)])

ensemble.fit(X_train, y_train)

y_pred = ensemble.predict(X_test)

r2 = r2_score(y_test, y_pred)
r2


0.9064136074316306
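With no weights given, `VotingRegressor` predicts the plain average of its base models' predictions. The sketch below checks this on synthetic data with two small base models; the data and model sizes are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)

X_toy, y_toy = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=42)

voter = VotingRegressor([
    ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=42)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
])
voter.fit(X_toy, y_toy)

# Each fitted base model predicts separately...
individual = np.column_stack([est.predict(X_toy) for est in voter.estimators_])
# ...and the ensemble prediction is their unweighted mean
manual_average = individual.mean(axis=1)
```

Averaging can only help when the base models make somewhat different errors, so an ensemble of similar tree-based models may land between its members rather than above them.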

In [18]:
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression


base_models = [('gbr', gbr), ('rf', rf), ('xgb', xgb)]

meta_model = LinearRegression()

stacking_regressor = StackingRegressor(estimators=base_models, final_estimator=meta_model)

stacking_regressor.fit(X_train, y_train)

y_pred = stacking_regressor.predict(X_test)

r2 = r2_score(y_test, y_pred)
r2

0.9103310532281225
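Unlike voting, stacking learns how much to trust each base model: the final estimator is fitted on out-of-fold predictions of the base models. With a `LinearRegression` meta-model those learned weights can be inspected directly. The sketch below uses synthetic data and two small base models as stand-ins for the notebook's three.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              StackingRegressor)
from sklearn.linear_model import LinearRegression

X_toy, y_toy = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=42)

stack = StackingRegressor(
    estimators=[
        ("gbr", GradientBoostingRegressor(n_estimators=50, random_state=42)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ],
    final_estimator=LinearRegression(),
    cv=3,  # folds used to produce the out-of-fold predictions
)
stack.fit(X_toy, y_toy)

# One learned coefficient per base model, in the order the models were given
weights = dict(zip(stack.named_estimators_, stack.final_estimator_.coef_))
```

A coefficient near zero suggests the meta-model gains little from that base model beyond what the others already provide.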