In this notebook, I extract the numerical features from the Ames housing data set in order to practice object-oriented programming while fitting simple linear regression models and evaluating their performance. Restricting the feature set means the performance will not be optimal, but the focus here is on quickly obtaining a data set to work with and on developing skills along the whole pipeline needed to produce quality regression estimates.

In [1]:
import numpy as np
import pandas as pd

from utils import print_null_fracs

In [2]:
num_vars, all_vars = list(), list()
num_var_types = ['(Discrete)', '(Continuous)']
all_var_types = num_var_types + ['(Ordinal)', '(Nominal)']

# Data set...
df = pd.read_csv('AmesHousing.csv')
cols = list(df.columns)

# Variable description file...
var_descr_file = 'VariableDescriptions.txt'
with open(var_descr_file, 'r') as f:  # close the file once it has been read
    lines = f.readlines()

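for line in lines:
    # The variable name is everything before the first '(' on the line.
    var_name = line.split('(')[0].rstrip()

    # Any line tagged with a known variable type describes a variable.
    if any(var_type in line for var_type in all_var_types):
        all_vars.append(var_name)

    # Discrete and continuous variables are the numerical ones.
    if any(var_type in line for var_type in num_var_types):
        num_vars.append(var_name)
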
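# Map description-file names to CSV column names by position. This assumes
# the description file lists the variables in the same order as the CSV columns.
fix_var_names = dict(zip(all_vars, cols))
num_vars = [fix_var_names[num_var] for num_var in num_vars]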

# Keep only the numerical columns; copy so later edits leave df untouched.
num_df = df[num_vars].copy()

print_null_fracs(num_df)

Feature             Null Frac
Lot Frontage        0.1672
Mas Vnr Area        0.0078
BsmtFin SF 1        0.0003
BsmtFin SF 2        0.0003
Bsmt Unf SF         0.0003
Total Bsmt SF       0.0003
Bsmt Full Bath      0.0007
Bsmt Half Bath      0.0007
Garage Yr Blt       0.0543
Garage Cars         0.0003
Garage Area         0.0003
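
The `print_null_fracs` helper comes from a local `utils` module whose source is not shown in this notebook. As a rough sketch of how a table like the one above could be produced, assuming the helper simply reports the fraction of null entries per column and skips columns that have none, something along these lines would work. The names `null_fracs` and `print_null_fracs_sketch` are hypothetical and are not the author's code.

```python
import pandas as pd


def null_fracs(df: pd.DataFrame) -> pd.Series:
    """Fraction of null entries per column, keeping only columns that have any."""
    fracs = df.isnull().mean()  # the mean of a boolean mask is the fraction of True values
    return fracs[fracs > 0]


def print_null_fracs_sketch(df: pd.DataFrame) -> None:
    """Hypothetical stand-in for utils.print_null_fracs (assumed behavior)."""
    print(f"{'Feature':<20}{'Null Frac'}")
    for name, frac in null_fracs(df).items():
        print(f"{name:<20}{frac:.4f}")


# Small illustrative frame (made-up values, not the Ames data).
demo = pd.DataFrame({'a': [1.0, None, 3.0, 4.0], 'b': [1, 2, 3, 4]})
print_null_fracs_sketch(demo)
```

Separating the computation (`null_fracs`) from the printing makes the fractions easy to reuse later, for example when deciding which features to impute.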


I'm deferring the imputation of null values until later. For now, we drop the `Lot Frontage` feature entirely, since it has by far the largest null fraction (about 17%), and the code below also drops the `Order` column. We then locate every row in which any of the remaining features listed above is null and remove those rows.

In [3]:
num_df.drop(columns=['Order', 'Lot Frontage'], inplace=True)
num_cols = list(num_df.columns)

# Collect the index labels of every row with a null in at least one column.
drop_row_idxs = set()
for col in num_cols:
    null_mask = num_df[col].isnull()
    if null_mask.any():
        drop_row_idxs.update(num_df.index[null_mask])

num_df.drop(index=list(drop_row_idxs), inplace=True)
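
As a cross-check on the loop above, pandas' built-in `DataFrame.dropna` removes every row that contains at least one null, which should give the same result here. The sketch below compares the two approaches on a small made-up frame (the values are illustrative, not taken from the Ames data).

```python
import numpy as np
import pandas as pd

# Toy frame with nulls scattered over different rows and columns.
toy = pd.DataFrame({
    'Garage Cars': [2.0, np.nan, 1.0, 2.0],
    'Garage Area': [480.0, np.nan, 240.0, 500.0],
    'Mas Vnr Area': [0.0, 120.0, np.nan, 50.0],
})

# Loop-based approach: gather the index labels of rows with any null, then drop them.
drop_row_idxs = set()
for col in toy.columns:
    drop_row_idxs.update(toy.index[toy[col].isnull()])
looped = toy.drop(index=sorted(drop_row_idxs))

# Built-in approach: dropna() with its defaults drops rows containing any null.
dropped = toy.dropna()

print(looped.equals(dropped))
```

If the two frames match on the real data as well, the whole cell could be reduced to `num_df.dropna(inplace=True)`.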

In [4]:
f'(Number of rows, Number of features): {num_df.shape}'

'(Number of rows, Number of features): (2747, 33)'

Now we can write this null-free, purely numeric dataframe to a CSV file.

In [5]:
num_df.to_csv('AmesHousing_NumericalFeatures.csv', index=False)