# Feature Engineering

## Task 1: Removing Null Values

* Select just the columns from the train data frame that contain no missing values.
* Assign the resulting data frame, which contains only these columns, to df_no_mv.
* Use the variables display to become familiar with these columns.

In [1]:
import pandas as pd

data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]
test = data[1460:]

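# Count the missing values in each column of train.
train_null_counts = train.isnull().sum()
print(train_null_counts)
# Keep only the columns whose missing-value count is zero.
df_no_mv = train[train_null_counts[train_null_counts == 0].index]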

Order                0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       249
Lot Area             0
Street               0
Alley             1351
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type        11
Mas Vnr Area        11
Exter Qual           0
Exter Cond           0
                  ... 
Bedroom AbvGr        0
Kitchen AbvGr        0
Kitchen Qual         0
TotRms AbvGrd        0
Functional           0
Fireplaces           0
Fireplace Qu       717
Garage Type         74
Garage Yr Blt       75
Garage Finish       75
Garage Cars          0
Garage Area          0
Garage Qual
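
The column selection above can be cross-checked against `DataFrame.dropna(axis=1)`, which drops every column that contains at least one missing value. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are assumptions for illustration and are not taken from AmesHousing.txt.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not real AmesHousing data.
toy = pd.DataFrame({
    "Lot Area": [8450, 9600, 11250],
    "Alley": [np.nan, "Grvl", np.nan],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Approach used in this task: count nulls per column, keep columns whose count is 0.
null_counts = toy.isnull().sum()
no_mv_by_count = toy[null_counts[null_counts == 0].index]

# Alternative: drop any column that has at least one missing value.
no_mv_by_dropna = toy.dropna(axis=1)

print(list(no_mv_by_count.columns))
print(no_mv_by_count.equals(no_mv_by_dropna))
```

Both approaches should select the same columns; the null-count version has the side benefit of leaving the per-column counts available for later steps.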

## Task 2: Changing text columns to categorical

* Convert all of the text columns in train to the categorical data type.
* Select the Utilities column, return the categorical codes, and display the unique value counts for those codes: train['Utilities'].cat.codes.value_counts()

In [2]:
# Text columns are taken from df_no_mv, so only columns without missing values are converted.
text_cols = df_no_mv.select_dtypes(include=['object']).columns

# train was created as a slice of data; take an explicit copy so the
# assignments below do not raise SettingWithCopyWarning.
train = train.copy()

for col in text_cols:
    print(col+":", len(train[col].unique()))
    train[col] = train[col].astype('category')

train['Utilities'].cat.codes.value_counts()

MS Zoning: 6
Street: 2
Lot Shape: 4
Land Contour: 4
Utilities: 3
Lot Config: 5
Land Slope: 3
Neighborhood: 26
Condition 1: 9
Condition 2: 6
Bldg Type: 5
House Style: 8
Roof Style: 6
Roof Matl: 5
Exterior 1st: 14
Exterior 2nd: 16
Exter Qual: 4
Exter Cond: 5
Foundation: 6
Heating: 6
Heating QC: 4
Central Air: 2
Electrical: 4
Kitchen Qual: 5
Functional: 7
Paved Drive: 3
Sale Type: 9
Sale Condition: 5


0    1457
2       2
1       1
dtype: int64
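
The codes in a value count like the one above are integer positions into the column's list of categories, so they are easier to read once mapped back to their labels. The sketch below shows that mapping on a small made-up Series; the values in `utilities` are assumptions for illustration, not the actual contents of the Utilities column.

```python
import pandas as pd

# Hypothetical values for illustration only.
utilities = pd.Series(["AllPub", "NoSewr", "AllPub", "NoSeWa"]).astype("category")

# For string data, the categories are the sorted unique values.
print(list(utilities.cat.categories))

# Each code is the position of the row's label in that category list.
print(list(utilities.cat.codes))

# Build a lookup from code back to label.
code_to_label = dict(enumerate(utilities.cat.categories))
print(code_to_label)
```

Because the codes follow sort order rather than any meaningful ranking, they should not be treated as an ordinal scale unless the categories are explicitly ordered.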

## Task 3: Converting categorical columns to dummy columns (one-hot encoding)

* Convert all of the columns in text_cols from the train data frame into dummy columns.
* Delete the original columns from text_cols from the train data frame.

In [5]:
print("Original data frame contains: "+str(len(train.columns))+" columns")
# get_dummies returns a new data frame in which each column listed in text_cols is replaced by its dummy columns
dummy_cols = pd.get_dummies(train, columns=text_cols)
print("Output of get_dummies contains: "+str(len(dummy_cols.columns))+" columns")
# Note: when columns= is passed, get_dummies also drops the original categorical columns from its output

Original data frame contains: 82 columns
Output of get_dummies contains: 236 columns


In [None]:
# Another implementation, if explicit removal of the old columns is necessary
for col in text_cols:
    # prefix=col keeps the dummy column names unique when several columns share the same labels
    col_dummies = pd.get_dummies(train[col], prefix=col)
    train = pd.concat([train, col_dummies], axis=1)
    del train[col]
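
One detail worth checking when dummy columns are built one column at a time: different text columns can share the same labels, so dummy names without a prefix can collide after concatenation. The sketch below illustrates this on a made-up frame; the `toy` values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame with two columns that share labels.
toy = pd.DataFrame({
    "Exter Qual": ["Gd", "TA", "Gd"],
    "Kitchen Qual": ["TA", "Gd", "Ex"],
})

# Per-column dummies without a prefix: names are just the labels, so they repeat.
no_prefix = pd.concat([pd.get_dummies(toy[c]) for c in toy.columns], axis=1)
print(list(no_prefix.columns))

# Passing columns= prefixes every dummy with its source column name.
with_prefix = pd.get_dummies(toy, columns=list(toy.columns))
print(list(with_prefix.columns))
```

Duplicate column names make later selection by name ambiguous, which is why the prefixed form is usually the safer default.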

## Task 4: Improving numerical features that aren't useful for linear regression, such as years

* Problem: Year values don't directly represent how old a house is.
* Solution: Create a new column that gives the time elapsed between two events, such as a house being built and being sold.


In [None]:
train['years_until_remod'] = train['Year Remod/Add'] - train['Year Built']
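
The same idea extends to the built-to-sold interval mentioned in the bullets. The sketch below is illustrative only: the rows are made up, and `Yr Sold` is an assumed name for the sale-year column, since that column does not appear in the output shown in this notebook.

```python
import pandas as pd

# Hypothetical rows; 'Yr Sold' is an assumed column name for the sale year.
toy = pd.DataFrame({
    "Year Built": [1960, 2005, 2008],
    "Year Remod/Add": [1960, 2006, 2007],
    "Yr Sold": [2010, 2009, 2008],
})

# Elapsed-time features: differences between two year columns.
toy["years_until_remod"] = toy["Year Remod/Add"] - toy["Year Built"]
toy["years_before_sale"] = toy["Yr Sold"] - toy["Year Built"]

# A negative difference points to inconsistent source data and is worth
# inspecting before the feature is used in a model.
suspect = toy[(toy["years_until_remod"] < 0) | (toy["years_before_sale"] < 0)]

print(toy[["years_until_remod", "years_before_sale"]])
print(len(suspect))
```

Once the derived columns exist, the raw year columns are usually dropped so the model does not see the same information twice.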

## Task 5: Handling null values

### Two main approaches for handling missing values: remove or impute

#### Remove rows containing missing values for specific columns

* Pro: Rows containing missing values are removed, leaving only clean data for modeling.
* Con: Entire observations from the training set are removed, which can reduce overall prediction accuracy.
* Rule of thumb: if 50% or more of a column's values are missing, drop the column instead of removing rows.

#### Impute (or replace) missing values using a descriptive statistic from the column

* Pro: Missing values are replaced with potentially similar estimates, preserving the rest of the observation in the model.
* Con: Depending on the approach, we may be adding noisy data for the model to learn from.
* Rule of thumb: if up to 25% of the column values are missing, impute.
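
The two rules of thumb can be turned into a quick triage of columns by their fraction of missing values. This is a minimal sketch on a made-up frame: the `toy` data and the variable names `cols_to_drop` and `cols_to_impute` are assumptions for illustration, and the thresholds are the rough guidelines listed above rather than fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not real AmesHousing data.
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],    # 75% missing
    "Lot Frontage": [65.0, np.nan, 80.0, 70.0],   # 25% missing
    "Lot Area": [8450, 9600, 11250, 9550],        # no missing values
})

# Fraction of missing values per column (mean of a boolean mask).
missing_frac = toy.isnull().mean()

# Half or more missing: candidate for dropping the whole column.
cols_to_drop = missing_frac[missing_frac >= 0.5].index

# Some, but at most a quarter, missing: candidate for imputation.
cols_to_impute = missing_frac[(missing_frac > 0) & (missing_frac <= 0.25)].index

print(list(cols_to_drop))
print(list(cols_to_impute))
```

Columns that fall between the two thresholds are a judgment call and are deliberately left unclassified here.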

In [13]:
import pandas as pd

data = pd.read_csv('AmesHousing.txt', delimiter="\t")
train = data[0:1460]
test = data[1460:]

# train_null_counts is a Series indexed by column name; each value is the number of null rows in that column
train_null_counts = train.isnull().sum()

# keep the columns with at least 1 null value and at most 25% of the rows null (365 of 1460) - these are candidates for imputing the missing values
max_nulls = 0.25 * len(train)
cols_to_keep = train_null_counts[(train_null_counts > 0) & (train_null_counts <= max_nulls)].index

In [14]:
print(cols_to_keep)

Index(['Lot Frontage', 'Mas Vnr Type', 'Mas Vnr Area', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Garage Type', 'Garage Yr Blt',
       'Garage Finish', 'Garage Qual', 'Garage Cond'],
      dtype='object')


In [17]:
# get a data frame with just the columns selected above (some values missing, but no more than 25%)
df_missing_values = train[cols_to_keep]
print(df_missing_values.isnull().sum())

Lot Frontage      249
Mas Vnr Type       11
Mas Vnr Area       11
Bsmt Qual          40
Bsmt Cond          40
Bsmt Exposure      41
BsmtFin Type 1     40
BsmtFin SF 1        1
BsmtFin Type 2     41
BsmtFin SF 2        1
Bsmt Unf SF         1
Total Bsmt SF       1
Bsmt Full Bath      1
Bsmt Half Bath      1
Garage Type        74
Garage Yr Blt      75
Garage Finish      75
Garage Qual        75
Garage Cond        75
dtype: int64


In [22]:
for col in df_missing_values.columns:
    print("Column name: "+col+" Data type: "+str(df_missing_values[col].dtype))

Column name: Lot Frontage Data type: float64
Column name: Mas Vnr Type Data type: object
Column name: Mas Vnr Area Data type: float64
Column name: Bsmt Qual Data type: object
Column name: Bsmt Cond Data type: object
Column name: Bsmt Exposure Data type: object
Column name: BsmtFin Type 1 Data type: object
Column name: BsmtFin SF 1 Data type: float64
Column name: BsmtFin Type 2 Data type: object
Column name: BsmtFin SF 2 Data type: float64
Column name: Bsmt Unf SF Data type: float64
Column name: Total Bsmt SF Data type: float64
Column name: Bsmt Full Bath Data type: float64
Column name: Bsmt Half Bath Data type: float64
Column name: Garage Type Data type: object
Column name: Garage Yr Blt Data type: float64
Column name: Garage Finish Data type: object
Column name: Garage Qual Data type: object
Column name: Garage Cond Data type: object


In [23]:
# get the columns that are of datatype float
float_cols = df_missing_values.select_dtypes(include=['float'])

# Strategy 1: Impute with 0
# Returns a data frame with missing values replaced with 0.
# fill_with_zero = float_cols.fillna(0)

# Strategy 2: Impute with the mean
# Returns a data frame with missing values replaced with mean of that column.
float_cols = float_cols.fillna(float_cols.mean())
print(float_cols.isnull().sum())

Lot Frontage      0
Mas Vnr Area      0
BsmtFin SF 1      0
BsmtFin SF 2      0
Bsmt Unf SF       0
Total Bsmt SF     0
Bsmt Full Bath    0
Bsmt Half Bath    0
Garage Yr Blt     0
dtype: int64
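
To make the difference between the two strategies concrete, the sketch below applies both to a small made-up frame. The `toy` values and the names `filled_zero` and `filled_mean` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 80.0],
    "Mas Vnr Area": [0.0, 100.0, np.nan],
})

# Strategy 1: replace every missing value with 0.
filled_zero = toy.fillna(0)

# Strategy 2: replace each missing value with the mean of its own column.
# toy.mean() skips NaN by default, so the means come from the observed values.
filled_mean = toy.fillna(toy.mean())

print(filled_zero)
print(filled_mean)
```

Filling with 0 can be reasonable when a missing value plausibly means "none" (for example, no veneer area), while the mean keeps the column's average unchanged. When a separate test set is imputed, the means computed on the training set are normally reused rather than recomputed.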
