# Meeting #15 - Feature Engineering

1. Handling Missing Values:
Practical Task:
Identify and impute missing values in a selected column using the median for numerical features or a placeholder for categorical features.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv(r'C:\Users\jorda\Documents\studies\DScourse\CourseMaterials\Pandas\data\train.csv')
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [3]:
# Identify columns with null values
null_cols = df.columns[df.isnull().sum() > 0]

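# Try to convert every column to numeric; values that fail become NaN
a = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).isnull().sum()

# Numerical columns: every value converted and none is missing.
# Caveat: numeric columns that contain NaN (e.g. LotFrontage) fail this test
# and end up in the categorical group below.
numerical_cols = a[a == 0].index

# Categorical columns: everything else
categorical_cols = df.drop(numerical_cols, axis=1).columns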

print(f'Numerical columns:\n{numerical_cols}\n')
print(f'Categorical columns:\n{categorical_cols}\n')

null_numerical_cols = [col for col in numerical_cols if col in null_cols]
null_categorical_cols = [col for col in categorical_cols if col in null_cols]

df[numerical_cols].isnull().sum()

Numerical columns:
Index(['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
       'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars',
       'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice'],
      dtype='object')

Categorical columns:
Index(['MSZoning', 'LotFrontage', 'Street', 'Alley', 'LotShape', 'LandContour',
       'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea', 'ExterQual',
       'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'BsmtFinType1', 'BsmtFinT

Id               0
MSSubClass       0
LotArea          0
OverallQual      0
OverallCond      0
YearBuilt        0
YearRemodAdd     0
BsmtFinSF1       0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
1stFlrSF         0
2ndFlrSF         0
LowQualFinSF     0
GrLivArea        0
BsmtFullBath     0
BsmtHalfBath     0
FullBath         0
HalfBath         0
BedroomAbvGr     0
KitchenAbvGr     0
TotRmsAbvGrd     0
Fireplaces       0
GarageCars       0
GarageArea       0
WoodDeckSF       0
OpenPorchSF      0
EnclosedPorch    0
3SsnPorch        0
ScreenPorch      0
PoolArea         0
MiscVal          0
MoSold           0
YrSold           0
SalePrice        0
dtype: int64

In [4]:
df[categorical_cols].isnull().sum()

MSZoning            0
LotFrontage       259
Street              0
Alley            1369
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType        872
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
BsmtQual           37
BsmtCond           37
BsmtExposure       38
BsmtFinType1       37
BsmtFinType2       38
Heating             0
HeatingQC           0
CentralAir          0
Electrical          1
KitchenQual         0
Functional          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageQual         81
GarageCond         81
PavedDrive          0
PoolQC           1453
Fence            1179
MiscFeature      1406
SaleType            0
SaleCondit

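With the classification above, no numerical column has missing values. This is by construction: any column containing NaN, including the numeric LotFrontage, MasVnrArea and GarageYrBlt, was grouped with the categorical columns. I will choose a genuinely categorical one, FireplaceQu.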

In [5]:
df['FireplaceQu'].value_counts()

FireplaceQu
Gd    380
TA    313
Fa     33
Ex     24
Po     20
Name: count, dtype: int64

In [6]:
# Replace NaN values with the column's mode (its most frequent value)
df['FireplaceQu'] = df['FireplaceQu'].fillna(df['FireplaceQu'].mode()[0])

In [7]:
df['FireplaceQu'].value_counts()

FireplaceQu
Gd    1070
TA     313
Fa      33
Ex      24
Po      20
Name: count, dtype: int64
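
The task also mentions median imputation for numerical features and a placeholder for categorical ones. The null counts above list LotFrontage with 259 missing values, and df.head() shows its values are floats, so it would be a natural candidate for the median. The following is a minimal sketch on a toy frame, not the real data: the FireplaceQu values are made up, and the placeholder string 'Missing' is a hypothetical choice. It also selects columns by dtype, so a numeric column that contains NaN is still treated as numeric.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the values are for illustration only.
toy = pd.DataFrame({
    'LotFrontage': [65.0, 80.0, np.nan, 60.0, 84.0],
    'FireplaceQu': ['Gd', np.nan, 'TA', np.nan, 'Gd'],
})

# Select columns by dtype, so numeric columns with NaN stay numeric.
num_cols = toy.select_dtypes(include='number').columns
cat_cols = toy.select_dtypes(exclude='number').columns

# Numerical feature: fill missing values with the median of the observed values.
lot_median = toy['LotFrontage'].median()
toy['LotFrontage'] = toy['LotFrontage'].fillna(lot_median)

# Categorical feature: fill missing values with an explicit placeholder
# ('Missing' is a hypothetical label, not something defined by the dataset).
toy['FireplaceQu'] = toy['FireplaceQu'].fillna('Missing')

print(toy)
```

Compared with mode imputation, a placeholder keeps the missing rows distinguishable from rows that really had the most frequent value, which may matter when a missing value carries meaning of its own.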

Theoretical Follow-up Question:
Why is it important to handle missing values in a dataset before applying machine learning models, and what might happen if you ignore them?

Handling missing values before applying machine learning models matters for the following reasons:
- Many algorithms, including many scikit-learn estimators, raise an error when the input contains NaN.
- Missing values can degrade model performance, because the model is trained on incomplete or inaccurate data.
- If missing values are silently replaced by a default such as zero, the distribution of the feature is distorted.
- If missing values are ignored, they may be interpreted in unforeseen ways. In a pandas DataFrame, for example, a null value is stored as NaN, which is a float even in a column of strings (see code below).

In [21]:
# Pick one missing entry from the column and inspect its Python type
nan = df.loc[df['GarageType'].isnull(), 'GarageType'].iloc[0]
nan_type = type(nan).__name__

print(f'The type of {nan} is \033[1m\033[4m{nan_type}\033[0m')

The type of nan is float
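
As a further illustration of the points above, here is a small sketch in plain NumPy with made-up values. It shows how a single NaN propagates through an ordinary aggregate, and how treating it as zero shifts the result compared with skipping it.

```python
import numpy as np

# Made-up feature values with one missing entry.
values = np.array([1.0, 2.0, np.nan, 4.0])

# An ordinary mean propagates the NaN, so the result is NaN.
plain_mean = np.mean(values)

# A NaN-aware mean skips the missing entry: (1 + 2 + 4) / 3.
nan_aware_mean = np.nanmean(values)

# Treating the missing entry as zero pulls the mean down: (1 + 2 + 0 + 4) / 4.
zero_filled_mean = np.mean(np.nan_to_num(values, nan=0.0))

print(plain_mean, nan_aware_mean, zero_filled_mean)
```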


2. Encoding Categorical Variables:
Practical Task:
Choose a categorical variable other than 'MSZoning' and 'Neighborhood', and apply one-hot encoding to it.

In [75]:
# Distinguish between low- and high-cardinality columns (threshold: 5 unique values)
n_unique = df[categorical_cols].nunique()

low_cardinality_cols = n_unique[n_unique < 5].index
high_cardinality_cols = n_unique[n_unique >= 5].index

print(f'Low cardinality columns:\n{low_cardinality_cols}\n')
print(f'High cardinality columns:\n{high_cardinality_cols}')

Low cardinality columns:
Index(['Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LandSlope',
       'MasVnrType', 'ExterQual', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
       'CentralAir', 'KitchenQual', 'GarageFinish', 'PavedDrive', 'PoolQC',
       'Fence', 'MiscFeature'],
      dtype='object')

High cardinality columns:
Index(['MSZoning', 'LotFrontage', 'LotConfig', 'Neighborhood', 'Condition1',
       'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl',
       'Exterior1st', 'Exterior2nd', 'MasVnrArea', 'ExterCond', 'Foundation',
       'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'Electrical',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageYrBlt', 'GarageQual',
       'GarageCond', 'SaleType', 'SaleCondition'],
      dtype='object')


In [79]:
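# One-Hot Encoding
column_to_encode = df['LotShape']

# The input is a single Series, so the `columns` argument is not needed.
# drop_first=True drops the first category (IR1), which becomes the baseline.
encoded_data = pd.get_dummies(column_to_encode, drop_first=True)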

encoded_data

Unnamed: 0,IR2,IR3,Reg
0,False,False,True
1,False,False,True
2,False,False,False
3,False,False,False
4,False,False,False
...,...,...,...
1455,False,False,True
1456,False,False,True
1457,False,False,True
1458,False,False,True


Theoretical Follow-up Question:
In what scenarios would one-hot encoding be preferable over label encoding for categorical variables, especially in the context of machine learning modeling?

One-hot encoding is preferable to label encoding in the following scenarios:
- Regression and other linear models, where integer labels would be interpreted as actual numerical values rather than as categories.
- When the values of the categorical feature are nominal, with no natural order.
- When using models that cannot treat numerical labels as labels, such as distance-based models, where the gap between two codes would count as a real distance.
- When the extra columns created by one-hot encoding improve the model's performance, which is practical mainly for low-cardinality features.
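
The difference can be seen on a toy LotShape column. This is a minimal sketch with made-up rows; label encoding is done here through pandas category codes, which is one of several ways to produce integer labels.

```python
import pandas as pd

# Made-up sample of the LotShape feature.
shape = pd.Series(['Reg', 'IR1', 'IR2', 'Reg', 'IR3'], name='LotShape')

# Label encoding: each category becomes an integer code (categories sorted alphabetically).
# A model may read these codes as magnitudes, e.g. Reg (3) > IR1 (0).
label_encoded = shape.astype('category').cat.codes

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = pd.get_dummies(shape, prefix='LotShape')

print(label_encoded.tolist())
print(one_hot)
```

With label encoding the feature stays a single column but gains an artificial order. With one-hot encoding each row has exactly one active indicator, at the cost of one column per category.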

3. Feature Scaling:
Practical Task:
Identify two numerical features other than 'TotalBsmtSF' and 'GrLivArea' and apply standard scaling to them.

In [9]:
# Code for Practical Task 3
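
The cell above was left empty. One possible way to approach the task is sketched below, assuming LotArea and GarageArea as the two features (both appear in the numerical column list above). It runs on a toy frame: the LotArea values are the ones shown in df.head(), while the GarageArea values are made up. The scaling is written in plain pandas; using the population standard deviation (ddof=0) matches what scikit-learn's StandardScaler computes.

```python
import pandas as pd

# Toy stand-in for two numerical columns. LotArea values come from df.head();
# GarageArea values are made up for illustration.
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250, 9550, 14260],
    'GarageArea': [500, 450, 600, 650, 800],
})

cols_to_scale = ['LotArea', 'GarageArea']

# Standard scaling: subtract the column mean, then divide by the column
# standard deviation (population version, ddof=0).
means = toy[cols_to_scale].mean()
stds = toy[cols_to_scale].std(ddof=0)
scaled = (toy[cols_to_scale] - means) / stds

print(scaled.round(3))
```

After scaling, each column has mean 0 and standard deviation 1, so the two features are on a comparable scale despite their different units.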

Theoretical Follow-up Question:
Explain the difference between normalization and standardization in feature scaling. In what situations might you prefer one technique over the other?
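
As a hedged sketch of the distinction this question asks about: normalization (min-max scaling) maps a feature onto a fixed range, usually [0, 1], while standardization rescales it to zero mean and unit standard deviation without bounding it. The values below are made up and include one outlier to show how each technique reacts.

```python
import pandas as pd

# Made-up feature with one large outlier.
x = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0], name='toy_feature')

# Normalization (min-max): the result always lies in [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean and unit standard deviation, unbounded range.
standardized = (x - x.mean()) / x.std(ddof=0)

print(pd.DataFrame({'normalized': normalized, 'standardized': standardized}).round(3))
```

In this toy case the outlier pushes the four ordinary normalized values below 0.05. Normalization is often chosen when a bounded range is required, while standardization is a common default for models that expect centered features, such as regularized linear models or PCA. Both depend on statistics that outliers can shift.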