## House Prices - Advanced Regression Techniques
### Predict sales prices and practice feature engineering, RFs, and gradient boosting

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

## Practice Skills
Creative feature engineering 

Advanced regression techniques like random forest and gradient boosting

## Evaluation
### Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

### Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
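
To make the metric concrete, here is a minimal sketch of the RMSE between log prices. The function name `rmse_of_logs` and the example prices are made up for illustration; they are not part of the competition code.

```python
import math

def rmse_of_logs(y_true, y_pred):
    """Root-mean-squared error between log(predicted) and log(observed) prices."""
    squared_errors = [
        (math.log(pred) - math.log(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices: both predictions are 10% too high
cheap = rmse_of_logs([100_000], [110_000])
expensive = rmse_of_logs([500_000], [550_000])
print(cheap, expensive)
```

Both calls give the same error, log(1.1), even though the dollar errors differ by a factor of five. This is what it means for cheap and expensive houses to affect the result equally.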

### File descriptions

1. train.csv - the training set
2. test.csv - the test set
3. data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
4. sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

### Data fields
Here's a brief version of what you'll find in the data description file.

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.

### Group Objectives for the assignment
1. How do we collaborate on GitHub? Each member creates their own branch and keeps it updated with EDA and insights.
2. Explore the dataset by 5/12/24.
3. Perform EDA and prepare insights to share with the team.
4. How often should we meet? a) Every Wednesday, and b) Saturday from 8 a.m. to 12 a.m.

# EDA

In [88]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import math
import matplotlib.pyplot as plt

In [89]:
# Reading the train data
import pandas as pd
df1 = pd.read_csv("train.csv")
df1

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [90]:
df1.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


# Our target is the sale price of the house; we want to find the best predictors for it

In [91]:

y = df1["SalePrice"]
X = df1.drop("SalePrice", axis=1)

In [92]:
# Reading the test data
import pandas as pd
df2 = pd.read_csv("test.csv")
df2


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,6,2006,WD,Normal
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2006,WD,Abnorml
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,9,2006,WD,Abnorml
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


In [93]:
df2.describe()


Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,1459.0,1459.0,1232.0,1459.0,1459.0,1459.0,1459.0,1459.0,1444.0,1458.0,...,1458.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0,1459.0
mean,2190.0,57.378341,68.580357,9819.161069,6.078821,5.553804,1971.357779,1983.662783,100.709141,439.203704,...,472.768861,93.174777,48.313914,24.243317,1.79438,17.064428,1.744345,58.167923,6.104181,2007.769705
std,421.321334,42.74688,22.376841,4955.517327,1.436812,1.11374,30.390071,21.130467,177.6259,455.268042,...,217.048611,127.744882,68.883364,67.227765,20.207842,56.609763,30.491646,630.806978,2.722432,1.30174
min,1461.0,20.0,21.0,1470.0,1.0,1.0,1879.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,1825.5,20.0,58.0,7391.0,5.0,5.0,1953.0,1963.0,0.0,0.0,...,318.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0
50%,2190.0,50.0,67.0,9399.0,6.0,5.0,1973.0,1992.0,0.0,350.5,...,480.0,0.0,28.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,2554.5,70.0,80.0,11517.5,7.0,6.0,2001.0,2004.0,164.0,753.5,...,576.0,168.0,72.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,2919.0,190.0,200.0,56600.0,10.0,9.0,2010.0,2010.0,1290.0,4010.0,...,1488.0,1424.0,742.0,1012.0,360.0,576.0,800.0,17000.0,12.0,2010.0


In [94]:
#to split the dataset into training and testing subsets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [95]:

print(f"X_train is a DataFrame with {X_train.shape[0]} rows and {X_train.shape[1]} columns")
print(f"y_train is a Series with {y_train.shape[0]} values")

# We always should have the same number of rows in X as values in y
assert X_train.shape[0] == y_train.shape[0]

X_train is a DataFrame with 1095 rows and 80 columns
y_train is a Series with 1095 values
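
The 1095 rows come from the default split of `train_test_split`, which holds out 25% of the rows when no `test_size` is given. A small sketch with placeholder arrays (the names `X_demo` and `y_demo` are assumptions; only the row count of 1460 is taken from `train.csv`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as train.csv (1460)
X_demo = np.arange(1460).reshape(-1, 1)
y_demo = np.arange(1460)

# No test_size given, so scikit-learn falls back to its default of 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))
```

The split gives 1095 training rows and 365 held-out rows, which matches the shape reported above.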


In [96]:


# Declare relevant columns
relevant_columns = [
    'LotFrontage',  # Linear feet of street connected to property
    'LotArea',      # Lot size in square feet
    'Street',       # Type of road access to property
    'OverallQual',  # Rates the overall material and finish of the house
    'OverallCond',  # Rates the overall condition of the house
    'YearBuilt',    # Original construction date
    'YearRemodAdd', # Remodel date (same as construction date if no remodeling or additions)
    'GrLivArea',    # Above grade (ground) living area square feet
    'FullBath',     # Full bathrooms above grade
    'BedroomAbvGr', # Bedrooms above grade (does NOT include basement bedrooms)
    'TotRmsAbvGrd', # Total rooms above grade (does not include bathrooms)
    'Fireplaces',   # Number of fireplaces
    'FireplaceQu',  # Fireplace quality
    'MoSold',       # Month Sold (MM)
    'YrSold'        # Year Sold (YYYY)
]

# Reassign X_train so that it only contains relevant columns
X_train = X_train.loc[:, relevant_columns]

# Visually inspect X_train
X_train

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,FireplaceQu,MoSold,YrSold
1023,43.0,3182,Pave,7,5,2005,2006,1504,2,2,7,1,Gd,5,2008
810,78.0,10140,Pave,6,6,1974,1999,1309,1,3,5,1,Fa,1,2006
1384,60.0,9060,Pave,6,5,1939,1950,1258,1,2,6,0,,10,2009
626,,12342,Pave,5,5,1960,1978,1422,1,3,6,1,TA,8,2007
813,75.0,9750,Pave,6,6,1958,1958,1442,1,4,7,0,,4,2007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,78.0,9317,Pave,6,5,2006,2006,1314,2,3,6,1,Gd,3,2007
1130,65.0,7804,Pave,4,3,1928,1950,1981,2,4,7,2,TA,12,2009
1294,60.0,8172,Pave,5,7,1955,1990,864,1,2,5,0,,4,2006
860,55.0,7642,Pave,7,8,1918,1998,1426,1,3,7,1,Gd,6,2007


In [97]:
# check new shape

# X_train should have the same number of rows as before
assert X_train.shape[0] == 1095

# Now X_train should only have as many columns as relevant_columns
assert X_train.shape[1] == len(relevant_columns)

In [98]:
# find missing values
X_train.isna().sum()

LotFrontage     200
LotArea           0
Street            0
OverallQual       0
OverallCond       0
YearBuilt         0
YearRemodAdd      0
GrLivArea         0
FullBath          0
BedroomAbvGr      0
TotRmsAbvGrd      0
Fireplaces        0
FireplaceQu     512
MoSold            0
YrSold            0
dtype: int64

In [99]:
# to confirm how many houses have zero fireplaces
X_train[X_train["Fireplaces"] == 0]

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,FireplaceQu,MoSold,YrSold
1384,60.0,9060,Pave,6,5,1939,1950,1258,1,2,6,0,,10,2009
813,75.0,9750,Pave,6,6,1958,1958,1442,1,4,7,0,,4,2007
839,70.0,11767,Pave,5,6,1946,1995,1200,1,3,6,0,,5,2008
430,21.0,1680,Pave,6,5,1971,1971,987,1,2,4,0,,7,2008
513,71.0,9187,Pave,6,5,1983,1983,1080,1,3,5,0,,6,2007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,40.0,3951,Pave,6,5,2009,2009,1224,2,2,4,0,,6,2009
330,,10624,Pave,5,4,1964,1964,1728,2,6,10,0,,11,2007
1238,63.0,13072,Pave,6,5,2005,2005,1141,1,3,6,0,,3,2006
121,50.0,6060,Pave,4,5,1939,1950,1123,1,3,4,0,,6,2007


In [100]:
# rows where Fireplaces is zero and FireplaceQu is NaN
X_train[
    (X_train["Fireplaces"] == 0) &
    (X_train["FireplaceQu"].isna())
]

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,FireplaceQu,MoSold,YrSold
1384,60.0,9060,Pave,6,5,1939,1950,1258,1,2,6,0,,10,2009
813,75.0,9750,Pave,6,6,1958,1958,1442,1,4,7,0,,4,2007
839,70.0,11767,Pave,5,6,1946,1995,1200,1,3,6,0,,5,2008
430,21.0,1680,Pave,6,5,1971,1971,987,1,2,4,0,,7,2008
513,71.0,9187,Pave,6,5,1983,1983,1080,1,3,5,0,,6,2007
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,40.0,3951,Pave,6,5,2009,2009,1224,2,2,4,0,,6,2009
330,,10624,Pave,5,4,1964,1964,1728,2,6,10,0,,11,2007
1238,63.0,13072,Pave,6,5,2005,2005,1141,1,3,6,0,,3,2006
121,50.0,6060,Pave,4,5,1939,1950,1123,1,3,4,0,,6,2007


In [101]:
# replace the NaNs with "N/A" to indicate that "no fireplace" is a real category

X_train["FireplaceQu"] = X_train["FireplaceQu"].fillna("N/A")
X_train["FireplaceQu"].value_counts()

FireplaceQu
N/A    512
Gd     286
TA     236
Fa      26
Ex      19
Po      16
Name: count, dtype: int64
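
Before filling, it can help to confirm that a missing `FireplaceQu` always coincides with `Fireplaces == 0`. The sketch below does this on a small hypothetical frame (`toy` and its values are invented for illustration); the same comparison could be run on `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset mirroring the two fireplace columns
toy = pd.DataFrame({
    "Fireplaces": [0, 1, 0, 2],
    "FireplaceQu": [np.nan, "Gd", np.nan, "TA"],
})

# True when a quality rating is missing exactly for the houses with no fireplace
missing_only_when_zero = (
    toy["FireplaceQu"].isna() == (toy["Fireplaces"] == 0)
).all()

# Safe to treat the NaNs as a category of their own
toy["FireplaceQu"] = toy["FireplaceQu"].fillna("N/A")
print(missing_only_when_zero, toy["FireplaceQu"].tolist())
```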

In [102]:
# create a new column that flags which LotFrontage values are NaN
from sklearn.impute import MissingIndicator

#  Identify data to be transformed
# We only want missing indicators for LotFrontage
frontage_train = X_train[["LotFrontage"]]

#  Instantiate the transformer object
missing_indicator = MissingIndicator()

#  Fit the transformer object on frontage_train
missing_indicator.fit(frontage_train)

#  Transform frontage_train and assign the result
# to frontage_missing_train
frontage_missing_train = missing_indicator.transform(frontage_train)

# Visually inspect frontage_missing_train
frontage_missing_train

array([[False],
       [False],
       [False],
       ...,
       [False],
       [False],
       [False]])

In [103]:

# add the transformed data as a new column on Xtrain
X_train["LotFrontage_Missing"] = frontage_missing_train
X_train

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,FireplaceQu,MoSold,YrSold,LotFrontage_Missing
1023,43.0,3182,Pave,7,5,2005,2006,1504,2,2,7,1,Gd,5,2008,False
810,78.0,10140,Pave,6,6,1974,1999,1309,1,3,5,1,Fa,1,2006,False
1384,60.0,9060,Pave,6,5,1939,1950,1258,1,2,6,0,,10,2009,False
626,,12342,Pave,5,5,1960,1978,1422,1,3,6,1,TA,8,2007,True
813,75.0,9750,Pave,6,6,1958,1958,1442,1,4,7,0,,4,2007,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,78.0,9317,Pave,6,5,2006,2006,1314,2,3,6,1,Gd,3,2007,False
1130,65.0,7804,Pave,4,3,1928,1950,1981,2,4,7,2,TA,12,2009,False
1294,60.0,8172,Pave,5,7,1955,1990,864,1,2,5,0,,4,2006,False
860,55.0,7642,Pave,7,8,1918,1998,1426,1,3,7,1,Gd,6,2007,False


In [104]:
# To fill the Nan values with median

from sklearn.impute import SimpleImputer

#  frontage_train was created previously, so we don't
# need to extract the relevant data again

#  Instantiate a SimpleImputer with strategy="median"
imputer = SimpleImputer(strategy="median")

#  Fit the imputer on frontage_train
imputer.fit(frontage_train)

#  Transform frontage_train using the imputer and
# assign the result to frontage_imputed_train
frontage_imputed_train = imputer.transform(frontage_train)

# Visually inspect frontage_imputed_train
frontage_imputed_train

array([[43.],
       [78.],
       [60.],
       ...,
       [60.],
       [55.],
       [53.]])

In [105]:
# replacing the original data in LotFrontage with the new data


X_train["LotFrontage"] = frontage_imputed_train

# Visually inspect X_train
X_train

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,FireplaceQu,MoSold,YrSold,LotFrontage_Missing
1023,43.0,3182,Pave,7,5,2005,2006,1504,2,2,7,1,Gd,5,2008,False
810,78.0,10140,Pave,6,6,1974,1999,1309,1,3,5,1,Fa,1,2006,False
1384,60.0,9060,Pave,6,5,1939,1950,1258,1,2,6,0,,10,2009,False
626,70.0,12342,Pave,5,5,1960,1978,1422,1,3,6,1,TA,8,2007,True
813,75.0,9750,Pave,6,6,1958,1958,1442,1,4,7,0,,4,2007,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,78.0,9317,Pave,6,5,2006,2006,1314,2,3,6,1,Gd,3,2007,False
1130,65.0,7804,Pave,4,3,1928,1950,1981,2,4,7,2,TA,12,2009,False
1294,60.0,8172,Pave,5,7,1955,1990,864,1,2,5,0,,4,2006,False
860,55.0,7642,Pave,7,8,1918,1998,1426,1,3,7,1,Gd,6,2007,False


In [106]:
# to check there is no more Nan values
X_train.isna().sum()

LotFrontage            0
LotArea                0
Street                 0
OverallQual            0
OverallCond            0
YearBuilt              0
YearRemodAdd           0
GrLivArea              0
FullBath               0
BedroomAbvGr           0
TotRmsAbvGrd           0
Fireplaces             0
FireplaceQu            0
MoSold                 0
YrSold                 0
LotFrontage_Missing    0
dtype: int64
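
As an alternative to running `MissingIndicator` and `SimpleImputer` separately, `SimpleImputer` accepts `add_indicator=True` and returns the imputed column and the missing flag together. This is a sketch on a made-up array (`frontage_demo`), not the approach used in the cells above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical LotFrontage values with one missing entry
frontage_demo = np.array([[43.0], [78.0], [np.nan], [60.0]])

# add_indicator=True appends a 0/1 column marking the rows that were imputed
imputer_demo = SimpleImputer(strategy="median", add_indicator=True)
result = imputer_demo.fit_transform(frontage_demo)
print(result)
```

The first output column holds the values with the NaN replaced by the median of the observed entries (60), and the second column is 1 only for the row that was missing.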

In [107]:
# checking the data types in train data
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1095 entries, 1023 to 1126
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LotFrontage          1095 non-null   float64
 1   LotArea              1095 non-null   int64  
 2   Street               1095 non-null   object 
 3   OverallQual          1095 non-null   int64  
 4   OverallCond          1095 non-null   int64  
 5   YearBuilt            1095 non-null   int64  
 6   YearRemodAdd         1095 non-null   int64  
 7   GrLivArea            1095 non-null   int64  
 8   FullBath             1095 non-null   int64  
 9   BedroomAbvGr         1095 non-null   int64  
 10  TotRmsAbvGrd         1095 non-null   int64  
 11  Fireplaces           1095 non-null   int64  
 12  FireplaceQu          1095 non-null   object 
 13  MoSold               1095 non-null   int64  
 14  YrSold               1095 non-null   int64  
 15  LotFrontage_Missing  1095 non-null   boo

In [108]:
# to check for the non numeric columns

print(X_train["Street"].value_counts())
print()
print(X_train["FireplaceQu"].value_counts())
print()
print(X_train["LotFrontage_Missing"].value_counts())

Street
Pave    1091
Grvl       4
Name: count, dtype: int64

FireplaceQu
N/A    512
Gd     286
TA     236
Fa      26
Ex      19
Po      16
Name: count, dtype: int64

LotFrontage_Missing
False    895
True     200
Name: count, dtype: int64


In [109]:
# converting the categories in Street into binary values

#  import OrdinalEncoder from sklearn.preprocessing
from sklearn.preprocessing import OrdinalEncoder

#  Create a variable street_train that contains the
# relevant column from X_train
# (Use double brackets [[]] to get the appropriate shape)
street_train = X_train[["Street"]]

#  Instantiate an OrdinalEncoder
encoder_street = OrdinalEncoder()

#  Fit the encoder on street_train
encoder_street.fit(street_train)

# Inspect the categories of the fitted encoder
encoder_street.categories_[0]

array(['Grvl', 'Pave'], dtype=object)

In [110]:
#  transforming the street data into 1s and 0s

#  Transform street_train using the encoder and
# assign the result to street_encoded_train
street_encoded_train = encoder_street.transform(street_train)

# Flatten for appropriate shape
street_encoded_train = street_encoded_train.flatten()

# Visually inspect street_encoded_train
street_encoded_train

array([1., 1., 1., ..., 1., 1., 1.])

In [111]:

# Replace value of Street
X_train["Street"] = street_encoded_train

# Visually inspect X_train
X_train

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,FireplaceQu,MoSold,YrSold,LotFrontage_Missing
1023,43.0,3182,1.0,7,5,2005,2006,1504,2,2,7,1,Gd,5,2008,False
810,78.0,10140,1.0,6,6,1974,1999,1309,1,3,5,1,Fa,1,2006,False
1384,60.0,9060,1.0,6,5,1939,1950,1258,1,2,6,0,,10,2009,False
626,70.0,12342,1.0,5,5,1960,1978,1422,1,3,6,1,TA,8,2007,True
813,75.0,9750,1.0,6,6,1958,1958,1442,1,4,7,0,,4,2007,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,78.0,9317,1.0,6,5,2006,2006,1314,2,3,6,1,Gd,3,2007,False
1130,65.0,7804,1.0,4,3,1928,1950,1981,2,4,7,2,TA,12,2009,False
1294,60.0,8172,1.0,5,7,1955,1990,864,1,2,5,0,,4,2006,False
860,55.0,7642,1.0,7,8,1918,1998,1426,1,3,7,1,Gd,6,2007,False
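
A fitted encoder should be reused, not refit, on any data that was not part of training, so that the category-to-number mapping stays the same. A minimal sketch with hypothetical frames (`street_fit` and `street_new` are invented names and values):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one frame to fit on, one to transform afterwards
street_fit = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
street_new = pd.DataFrame({"Street": ["Grvl", "Pave"]})

enc = OrdinalEncoder()
enc.fit(street_fit)  # learns the sorted categories: Grvl -> 0, Pave -> 1

# Only transform (no fit) on the new data, then flatten to one dimension
encoded_new = enc.transform(street_new).flatten()
print(enc.categories_[0], encoded_new)
```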


In [112]:

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1095 entries, 1023 to 1126
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LotFrontage          1095 non-null   float64
 1   LotArea              1095 non-null   int64  
 2   Street               1095 non-null   float64
 3   OverallQual          1095 non-null   int64  
 4   OverallCond          1095 non-null   int64  
 5   YearBuilt            1095 non-null   int64  
 6   YearRemodAdd         1095 non-null   int64  
 7   GrLivArea            1095 non-null   int64  
 8   FullBath             1095 non-null   int64  
 9   BedroomAbvGr         1095 non-null   int64  
 10  TotRmsAbvGrd         1095 non-null   int64  
 11  Fireplaces           1095 non-null   int64  
 12  FireplaceQu          1095 non-null   object 
 13  MoSold               1095 non-null   int64  
 14  YrSold               1095 non-null   int64  
 15  LotFrontage_Missing  1095 non-null   boo

In [113]:
# converting the categories in LotFrontage_Missing into binary values

#  Instantiate an OrdinalEncoder for missing frontage
encoder_frontage_missing = OrdinalEncoder()

#  Fit the encoder on frontage_missing_train
encoder_frontage_missing.fit(frontage_missing_train)

# Inspect the categories of the fitted encoder
encoder_frontage_missing.categories_[0]

array([False,  True])

In [114]:
# transforming the LotFrontage_missing into 1s and 0s

#  Transform frontage_missing_train using the encoder and
# assign the result to frontage_missing_encoded_train
frontage_missing_encoded_train = encoder_frontage_missing.transform(frontage_missing_train)

# Flatten for appropriate shape
frontage_missing_encoded_train = frontage_missing_encoded_train.flatten()

# Visually inspect frontage_missing_encoded_train
frontage_missing_encoded_train

array([0., 0., 0., ..., 0., 0., 0.])

In [115]:

#  Replace value of LotFrontage_Missing
X_train["LotFrontage_Missing"] = frontage_missing_encoded_train

# Visually inspect X_train
X_train

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,FireplaceQu,MoSold,YrSold,LotFrontage_Missing
1023,43.0,3182,1.0,7,5,2005,2006,1504,2,2,7,1,Gd,5,2008,0.0
810,78.0,10140,1.0,6,6,1974,1999,1309,1,3,5,1,Fa,1,2006,0.0
1384,60.0,9060,1.0,6,5,1939,1950,1258,1,2,6,0,,10,2009,0.0
626,70.0,12342,1.0,5,5,1960,1978,1422,1,3,6,1,TA,8,2007,1.0
813,75.0,9750,1.0,6,6,1958,1958,1442,1,4,7,0,,4,2007,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,78.0,9317,1.0,6,5,2006,2006,1314,2,3,6,1,Gd,3,2007,0.0
1130,65.0,7804,1.0,4,3,1928,1950,1981,2,4,7,2,TA,12,2009,0.0
1294,60.0,8172,1.0,5,7,1955,1990,864,1,2,5,0,,4,2006,0.0
860,55.0,7642,1.0,7,8,1918,1998,1426,1,3,7,1,Gd,6,2007,0.0


In [116]:

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1095 entries, 1023 to 1126
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LotFrontage          1095 non-null   float64
 1   LotArea              1095 non-null   int64  
 2   Street               1095 non-null   float64
 3   OverallQual          1095 non-null   int64  
 4   OverallCond          1095 non-null   int64  
 5   YearBuilt            1095 non-null   int64  
 6   YearRemodAdd         1095 non-null   int64  
 7   GrLivArea            1095 non-null   int64  
 8   FullBath             1095 non-null   int64  
 9   BedroomAbvGr         1095 non-null   int64  
 10  TotRmsAbvGrd         1095 non-null   int64  
 11  Fireplaces           1095 non-null   int64  
 12  FireplaceQu          1095 non-null   object 
 13  MoSold               1095 non-null   int64  
 14  YrSold               1095 non-null   int64  
 15  LotFrontage_Missing  1095 non-null   flo

In [117]:

df_example = pd.DataFrame(frontage_missing_train, columns=["LotFrontage_Missing"])
df_example

Unnamed: 0,LotFrontage_Missing
0,False
1,False
2,False
3,True
4,False
...,...
1090,False
1091,False
1092,False
1093,False


In [118]:

df_example["LotFrontage_Missing"] = df_example["LotFrontage_Missing"].astype(int)
df_example

Unnamed: 0,LotFrontage_Missing
0,0
1,0
2,0
3,1
4,0
...,...
1090,0
1091,0
1092,0
1093,0


In [119]:
# checking the categorical data in FireplaceQu

# (0) import OneHotEncoder from sklearn.preprocessing
from sklearn.preprocessing import OneHotEncoder

# (1) Create a variable fireplace_qu_train
# extracted from X_train
# (double brackets due to shape expected by OHE)
fireplace_qu_train = X_train[["FireplaceQu"]]

# (2) Instantiate a OneHotEncoder with categories="auto",
# sparse_output=False, and handle_unknown="ignore"
ohe = OneHotEncoder(categories="auto", sparse_output=False, handle_unknown="ignore")

# (3) Fit the encoder on fireplace_qu_train
ohe.fit(fireplace_qu_train)

# Inspect the categories of the fitted encoder
ohe.categories_

[array(['Ex', 'Fa', 'Gd', 'N/A', 'Po', 'TA'], dtype=object)]

In [120]:

# Transform fireplace_qu_train using the encoder and
# assign the result to fireplace_qu_encoded_train
fireplace_qu_encoded_train = ohe.transform(fireplace_qu_train)

# Visually inspect fireplace_qu_encoded_train
fireplace_qu_encoded_train

array([[0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1.]])

In [121]:


# Make the transformed data into a dataframe
fireplace_qu_encoded_train = pd.DataFrame(
    # Pass in NumPy array
    fireplace_qu_encoded_train,
    # Set the column names to the categories found by OHE
    columns=ohe.categories_[0],
    # Set the index to match X_train's index
    index=X_train.index
)

# Visually inspect new dataframe
fireplace_qu_encoded_train

Unnamed: 0,Ex,Fa,Gd,N/A,Po,TA
1023,0.0,0.0,1.0,0.0,0.0,0.0
810,0.0,1.0,0.0,0.0,0.0,0.0
1384,0.0,0.0,0.0,1.0,0.0,0.0
626,0.0,0.0,0.0,0.0,0.0,1.0
813,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...
1095,0.0,0.0,1.0,0.0,0.0,0.0
1130,0.0,0.0,0.0,0.0,0.0,1.0
1294,0.0,0.0,0.0,1.0,0.0,0.0
860,0.0,0.0,1.0,0.0,0.0,0.0


In [122]:

#  Drop original FireplaceQu column
X_train.drop("FireplaceQu", axis=1, inplace=True)

# Visually inspect X_train
X_train

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,TotRmsAbvGrd,Fireplaces,MoSold,YrSold,LotFrontage_Missing
1023,43.0,3182,1.0,7,5,2005,2006,1504,2,2,7,1,5,2008,0.0
810,78.0,10140,1.0,6,6,1974,1999,1309,1,3,5,1,1,2006,0.0
1384,60.0,9060,1.0,6,5,1939,1950,1258,1,2,6,0,10,2009,0.0
626,70.0,12342,1.0,5,5,1960,1978,1422,1,3,6,1,8,2007,1.0
813,75.0,9750,1.0,6,6,1958,1958,1442,1,4,7,0,4,2007,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,78.0,9317,1.0,6,5,2006,2006,1314,2,3,6,1,3,2007,0.0
1130,65.0,7804,1.0,4,3,1928,1950,1981,2,4,7,2,12,2009,0.0
1294,60.0,8172,1.0,5,7,1955,1990,864,1,2,5,0,4,2006,0.0
860,55.0,7642,1.0,7,8,1918,1998,1426,1,3,7,1,6,2007,0.0


In [123]:
# to concatenate the new data frame with original X_train

#  Concatenate the new dataframe with current X_train
X_train = pd.concat([X_train, fireplace_qu_encoded_train], axis=1)

# Visually inspect X_train
X_train

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,...,Fireplaces,MoSold,YrSold,LotFrontage_Missing,Ex,Fa,Gd,N/A,Po,TA
1023,43.0,3182,1.0,7,5,2005,2006,1504,2,2,...,1,5,2008,0.0,0.0,0.0,1.0,0.0,0.0,0.0
810,78.0,10140,1.0,6,6,1974,1999,1309,1,3,...,1,1,2006,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1384,60.0,9060,1.0,6,5,1939,1950,1258,1,2,...,0,10,2009,0.0,0.0,0.0,0.0,1.0,0.0,0.0
626,70.0,12342,1.0,5,5,1960,1978,1422,1,3,...,1,8,2007,1.0,0.0,0.0,0.0,0.0,0.0,1.0
813,75.0,9750,1.0,6,6,1958,1958,1442,1,4,...,0,4,2007,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,78.0,9317,1.0,6,5,2006,2006,1314,2,3,...,1,3,2007,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1130,65.0,7804,1.0,4,3,1928,1950,1981,2,4,...,2,12,2009,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1294,60.0,8172,1.0,5,7,1955,1990,864,1,2,...,0,4,2006,0.0,0.0,0.0,0.0,1.0,0.0,0.0
860,55.0,7642,1.0,7,8,1918,1998,1426,1,3,...,1,6,2007,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [124]:

X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1095 entries, 1023 to 1126
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   LotFrontage          1095 non-null   float64
 1   LotArea              1095 non-null   int64  
 2   Street               1095 non-null   float64
 3   OverallQual          1095 non-null   int64  
 4   OverallCond          1095 non-null   int64  
 5   YearBuilt            1095 non-null   int64  
 6   YearRemodAdd         1095 non-null   int64  
 7   GrLivArea            1095 non-null   int64  
 8   FullBath             1095 non-null   int64  
 9   BedroomAbvGr         1095 non-null   int64  
 10  TotRmsAbvGrd         1095 non-null   int64  
 11  Fireplaces           1095 non-null   int64  
 12  MoSold               1095 non-null   int64  
 13  YrSold               1095 non-null   int64  
 14  LotFrontage_Missing  1095 non-null   float64
 15  Ex                   1095 non-null   flo

In [125]:
#Correlation 

correlation_matrix = X_train.corr()


In [126]:
# identifying the numeric columns most correlated with the SalePrice column

numerical_cols = df1.select_dtypes(include=['float64', 'int64']).columns
correlation_matrix = df1[numerical_cols].corr()
saleprice_correlation = correlation_matrix['SalePrice'].sort_values(ascending=False)
print("Top Numerical Features by Correlation:\n", saleprice_correlation.head(15))

Top Numerical Features by Correlation:
 SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
GarageArea      0.623431
TotalBsmtSF     0.613581
1stFlrSF        0.605852
FullBath        0.560664
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
YearRemodAdd    0.507101
GarageYrBlt     0.486362
MasVnrArea      0.477493
Fireplaces      0.466929
BsmtFinSF1      0.386420
Name: SalePrice, dtype: float64
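
The ranking above is computed on the raw `df1`. To rank the prepared features instead, the target can be joined back onto the feature frame by index before calling `corr()`. The sketch below uses a tiny invented dataset (`features` and `target` are placeholders); with the real data, `X_train.join(y_train)` would play the same role.

```python
import pandas as pd

# Hypothetical prepared features and target sharing the same index
features = pd.DataFrame(
    {"OverallQual": [5, 6, 7, 8], "YrSold": [2008, 2006, 2009, 2007]},
    index=[10, 11, 12, 13],
)
target = pd.Series(
    [120000, 150000, 200000, 260000], index=features.index, name="SalePrice"
)

# Join on the index, correlate, then keep only the SalePrice column
corr_with_price = (
    features.join(target)
    .corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_price)
```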


In [127]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaler.fit(X_train)
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train),
    # index is important to ensure we can concatenate with other columns
    index=X_train.index,
    columns=X_train.columns
)
X_train_scaled

Unnamed: 0,LotFrontage,LotArea,Street,OverallQual,OverallCond,YearBuilt,YearRemodAdd,GrLivArea,FullBath,BedroomAbvGr,...,Fireplaces,MoSold,YrSold,LotFrontage_Missing,Ex,Fa,Gd,N/A,Po,TA
1023,0.075342,0.008797,1.0,0.666667,0.500,0.963768,0.933333,0.220422,0.666667,0.250,...,0.333333,0.363636,0.50,0.0,0.0,0.0,1.0,0.0,0.0,0.0
810,0.195205,0.041319,1.0,0.555556,0.625,0.739130,0.816667,0.183685,0.333333,0.375,...,0.333333,0.000000,0.00,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1384,0.133562,0.036271,1.0,0.555556,0.500,0.485507,0.000000,0.174077,0.333333,0.250,...,0.000000,0.818182,0.75,0.0,0.0,0.0,0.0,1.0,0.0,0.0
626,0.167808,0.051611,1.0,0.444444,0.500,0.637681,0.466667,0.204974,0.333333,0.375,...,0.333333,0.636364,0.25,1.0,0.0,0.0,0.0,0.0,0.0,1.0
813,0.184932,0.039496,1.0,0.555556,0.625,0.623188,0.133333,0.208742,0.333333,0.500,...,0.000000,0.272727,0.25,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1095,0.195205,0.037472,1.0,0.555556,0.500,0.971014,0.933333,0.184627,0.666667,0.375,...,0.333333,0.181818,0.25,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1130,0.150685,0.030400,1.0,0.333333,0.250,0.405797,0.000000,0.310286,0.666667,0.500,...,0.666667,1.000000,0.75,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1294,0.133562,0.032120,1.0,0.444444,0.750,0.601449,0.666667,0.099849,0.333333,0.250,...,0.000000,0.272727,0.00,0.0,0.0,0.0,0.0,1.0,0.0,0.0
860,0.116438,0.029643,1.0,0.666667,0.875,0.333333,0.800000,0.205727,0.333333,0.375,...,0.333333,0.454545,0.25,0.0,0.0,0.0,1.0,0.0,0.0,0.0
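
`MinMaxScaler` rescales each column with (x - min) / (max - min), using the minimum and maximum seen during `fit`. For example, an `OverallQual` of 7 maps to (7 - 1) / (10 - 1) ≈ 0.667 if the fitted range is 1 to 10, which agrees with the table above. The sketch below checks the formula on made-up numbers and shows that data outside the fitted range can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: one array to fit on, one unseen
train_demo = np.array([[1.0], [5.0], [10.0]])
new_demo = np.array([[4.0], [12.0]])

scaler_demo = MinMaxScaler().fit(train_demo)
scaled_train = scaler_demo.transform(train_demo)
scaled_new = scaler_demo.transform(new_demo)  # reuses the fitted min and max

# The same result computed by hand from the formula
col_min = train_demo.min(axis=0)
col_max = train_demo.max(axis=0)
manual = (train_demo - col_min) / (col_max - col_min)
print(scaled_train.ravel(), scaled_new.ravel())
```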


In [128]:
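from sklearn.linear_model import LogisticRegression

# Note: LogisticRegression is a classifier, so it treats every distinct
# SalePrice as a separate class rather than as a continuous value.
# It is also fit on the unscaled X_train, not X_train_scaled.
logreg = LogisticRegression(fit_intercept=False, C=1e12, solver='liblinear')
model_log = logreg.fit(X_train, y_train)
model_log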



In [129]:
import numpy as np

y_hat_train = logreg.predict(X_train)

train_residuals = np.abs(y_train - y_hat_train)
print(pd.Series(train_residuals, name="Residuals (counts)").value_counts())
print()
print(pd.Series(train_residuals, name="Residuals (proportions)").value_counts(normalize=True))

Residuals (counts)
0         604
8000       15
5000       15
3000       11
7000       10
         ... 
87500       1
79790       1
33500       1
192750      1
60400       1
Name: count, Length: 199, dtype: int64

Residuals (proportions)
0         0.551598
8000      0.013699
5000      0.013699
3000      0.010046
7000      0.009132
            ...   
87500     0.000913
79790     0.000913
33500     0.000913
192750    0.000913
60400     0.000913
Name: proportion, Length: 199, dtype: float64
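
Because `SalePrice` is continuous and the competition metric is the RMSE of log prices, a regressor fitted on the log of the target is a more natural baseline than a classifier. The sketch below swaps in `LinearRegression` and uses synthetic data (the arrays, coefficients and noise level are all invented), so it illustrates the idea only and is not a result for this dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for scaled features and sale prices
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
log_price = 11.0 + X_demo @ np.array([0.8, 0.5, 0.2]) + rng.normal(0, 0.05, size=200)
y_demo = np.exp(log_price)

# Fit on log(price) so the model optimises the same scale as the metric
linreg = LinearRegression()
linreg.fit(X_demo, np.log(y_demo))

# Convert predictions back to dollars, then score with RMSE of logs
pred_price = np.exp(linreg.predict(X_demo))
rmse_log = np.sqrt(np.mean((np.log(pred_price) - np.log(y_demo)) ** 2))
print(rmse_log)
```

With the real data, `X_train_scaled` and `y_train` would take the place of the synthetic arrays, and the score should be computed on held-out rows, not on the rows used for fitting.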


In [130]:
# Re-split the raw data into training and testing subsets
# (this overwrites the preprocessed X_train with the original 80 columns)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)