## Introduction to the Data Set
*  Import pandas, matplotlib, and numpy into the environment. Import the classes you need from scikit-learn as well.
*  Read __AmesHousing.txt__ into a pandas data frame.
*  For the following functions, we recommend creating them in the first few cells in the notebook. This way, you can add cells to the end of the notebook to do experiments and update the functions in these cells.
  *  Create a function named __transform_features()__ that, for now, just returns the __train__ data frame.
  *  Create a function named __select_features()__ that, for now, just returns the __Gr Liv Area__ and __SalePrice__ columns from the __train__ data frame.
  *  Create a function named __train_and_test()__ that, for now:
    *  Selects the first __1460__ rows from __data__ and assigns them to __train__.
    *  Selects the remaining rows from __data__ and assigns them to __test__.
    *  Trains a model using all numerical columns except the __SalePrice__ column (the target column) from the data frame returned from __select_features()__.
    *  Tests the model on the test set and returns the RMSE value.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
data = pd.read_csv('AmesHousing.txt', delimiter='\t')

In [3]:
def transform_features(df):
    train = df.iloc[:1460].copy()
    test = df.iloc[1460:].copy()
    return train, test

def select_features():
    return ['Gr Liv Area'], 'SalePrice'

def train_and_test(train, test, train_features, target):
    lr = LinearRegression()
    lr.fit(train[train_features],train[target])
    predict = lr.predict(test[train_features])
    mse = mean_squared_error(test[target],predict)
    RMSE = np.sqrt(mse)
    
    return RMSE

In [4]:
train, test = transform_features(data)
train_features, target = select_features()
RMSE = train_and_test(train, test, train_features, target)
RMSE

57088.251612639091
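
To make the returned number easier to interpret, here is a minimal sketch of what the RMSE calculation in `train_and_test()` amounts to. The prices below are hypothetical toy values, not rows from `AmesHousing.txt`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy prices (hypothetical values, chosen only for illustration)
actual = np.array([200000.0, 150000.0, 310000.0])
predicted = np.array([210000.0, 140000.0, 300000.0])

# RMSE written out: square the residuals, average them, take the square root
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))

# The same quantity computed the way train_and_test() does it
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is expressed in the same units as `SalePrice`, the baseline value above can be read as a typical prediction error of roughly 57,000 in sale-price units.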

## Feature Engineering
1. For all columns, drop any with 5% or more missing values. This will be revisited later.
1. For the text columns, drop any with 1 or more missing values. This will also be revisited later.
1. For the numerical columns, fill NaN with the following:
  *  If the column contains continuous data, fill with the mean for that column.
  *  If the column contains nominal/categorical data, fill with the mode for that column.
1. What new features can be created that better capture the information in some of the existing features?
1. Drop columns that aren't useful for ML.
1. Drop columns that leak info about the final sale.

In [5]:
# Show all columns that have empty values
train, test = transform_features(data)
missing = train.isnull().sum()
print(missing[missing>0])


Lot Frontage       249
Alley             1351
Mas Vnr Type        11
Mas Vnr Area        11
Bsmt Qual           40
Bsmt Cond           40
Bsmt Exposure       41
BsmtFin Type 1      40
BsmtFin SF 1         1
BsmtFin Type 2      41
BsmtFin SF 2         1
Bsmt Unf SF          1
Total Bsmt SF        1
Bsmt Full Bath       1
Bsmt Half Bath       1
Fireplace Qu       717
Garage Type         74
Garage Yr Blt       75
Garage Finish       75
Garage Qual         75
Garage Cond         75
Pool QC           1459
Fence             1163
Misc Feature      1400
dtype: int64


In [6]:
# Drop all columns from the train DF that have more than 5% missing values.
percent_missing = train.isnull().sum()/len(train)
drop_missing_cols = percent_missing[percent_missing > 0.05].sort_values()
train = train.drop(drop_missing_cols.index, axis=1)
print('Dropped Columns Due to > 5% missing values')
drop_missing_cols

Dropped Columns Due to > 5% missing values


Garage Type      0.050685
Garage Yr Blt    0.051370
Garage Finish    0.051370
Garage Qual      0.051370
Garage Cond      0.051370
Lot Frontage     0.170548
Fireplace Qu     0.491096
Fence            0.796575
Alley            0.925342
Misc Feature     0.958904
Pool QC          0.999315
dtype: float64

In [7]:
# Drop all text columns that have any missing values
text_cols = train.select_dtypes(include=['object'])
missing_values = text_cols.isnull().sum()
drop_missing_cols = missing_values[missing_values > 0]
train = train.drop(drop_missing_cols.index, axis=1)
print('Dropped Text Columns Due to >= 1 missing value')
drop_missing_cols

Dropped Text Columns Due to >= 1 missing value


Mas Vnr Type      11
Bsmt Qual         40
Bsmt Cond         40
Bsmt Exposure     41
BsmtFin Type 1    40
BsmtFin Type 2    41
dtype: int64

In [8]:
# Fill continuous numerical data with mean & fill ordinal/nominal (on) numerical data with mode
num_cols = train.select_dtypes(include=['int64','float64'])
on = []
con = []

# From reviewing the data set, I found that none of the ordinal/nominal columns has more than 16 distinct values
for col in num_cols.columns:
    if len(num_cols[col].unique()) > 16:
        con.append(col)
    else:
        on.append(col)

print('ordinal/nominal numerical data with empty values')
missing_values = train[on].isnull().sum()
print(missing_values[missing_values > 0])

print('\ncontinuous numerical data with empty values')
missing_values = train[con].isnull().sum()
print(missing_values[missing_values > 0])

train[on] = train[on].fillna(train[on].mode().iloc[0])
train[con] = train[con].fillna(train[con].mean())

ordinal/nominal numerical data with empty values
Bsmt Full Bath    1
Bsmt Half Bath    1
dtype: int64

continuous numerical data with empty values
Mas Vnr Area     11
BsmtFin SF 1      1
BsmtFin SF 2      1
Bsmt Unf SF       1
Total Bsmt SF     1
dtype: int64


In [9]:
# Verify that the resulting DataFrame has no missing values
missing_values = train.isnull().sum()
print(missing_values[missing_values > 0])

Series([], dtype: int64)


In [10]:
# Create new features that better capture the information.

# Use the 'Year Remod/Add' & 'Year Built' columns to calculate the number of years
# between when the house was built and when it was remodeled
years_until_remod = train['Year Remod/Add'] - train['Year Built']


# Similarly, 'Yr Sold' is more informative when converted
# to the age of the house when sold.
age_when_sold = train['Yr Sold'] - train['Year Built']

# Check for negative values
print(years_until_remod[years_until_remod<0])
print(age_when_sold[age_when_sold<0])

850   -1
dtype: int64
Series([], dtype: int64)


In [11]:
# Add newly created features and remove rows with negative values
train['years_until_remod'] = years_until_remod
train['age_when_sold'] = age_when_sold
train = train.drop([850], axis=0)

# No longer need original year columns
train = train.drop(['Yr Sold', 'Year Built'], axis=1)

In [12]:
# Drop columns that aren't useful for ML
train = train.drop(['Order', 'PID'], axis=1)

# These columns hold information that cannot be known
# prior to the sale, so they leak data about it.
train = train.drop(["Mo Sold", "Sale Condition", "Sale Type"], axis=1)

In [13]:
# Review data
print(train.shape)
train.info()

(1459, 60)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 0 to 1459
Data columns (total 60 columns):
MS SubClass          1459 non-null int64
MS Zoning            1459 non-null object
Lot Area             1459 non-null int64
Street               1459 non-null object
Lot Shape            1459 non-null object
Land Contour         1459 non-null object
Utilities            1459 non-null object
Lot Config           1459 non-null object
Land Slope           1459 non-null object
Neighborhood         1459 non-null object
Condition 1          1459 non-null object
Condition 2          1459 non-null object
Bldg Type            1459 non-null object
House Style          1459 non-null object
Overall Qual         1459 non-null int64
Overall Cond         1459 non-null int64
Year Remod/Add       1459 non-null int64
Roof Style           1459 non-null object
Roof Matl            1459 non-null object
Exterior 1st         1459 non-null object
Exterior 2nd         1459 non-null object
Mas V

Index(['MS SubClass', 'MS Zoning', 'Lot Area', 'Street', 'Lot Shape',
       'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood',
       'Condition 1', 'Condition 2', 'Bldg Type', 'House Style',
       'Overall Qual', 'Overall Cond', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Area',
       'Exter Qual', 'Exter Cond', 'Foundation', 'BsmtFin SF 1',
       'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC',
       'Central Air', 'Electrical', '1st Flr SF', '2nd Flr SF',
       'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath',
       'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr',
       'Kitchen Qual', 'TotRms AbvGrd', 'Functional', 'Fireplaces',
       'Garage Cars', 'Garage Area', 'Paved Drive', 'Wood Deck SF',
       'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch',
       'Pool Area', 'Misc Val', 'SalePrice', 'years_until_remod',
       'age_when_so

In [14]:
# Overwrite transform_features() so that it performs the above feature engineering steps

def transform_features(df):
    train = df.iloc[:1460].copy()
    test = df.iloc[1460:].copy()
    
    # Drop all columns that have more than 5% missing values.
    percent_missing = train.isnull().sum()/len(train)
    drop_missing_cols = percent_missing[percent_missing > 0.05].sort_values()
    train = train.drop(drop_missing_cols.index, axis=1)
    
    # Drop all text columns that have any missing values
    text_cols = train.select_dtypes(include=['object'])
    missing_values = text_cols.isnull().sum()
    drop_missing_cols = missing_values[missing_values > 0]
    train = train.drop(drop_missing_cols.index, axis=1)
    
    # Fill missing continuous numerical data with mean & 
    # fill missing ordinal/nominal numerical data with mode
    num_cols = train.select_dtypes(include=['int64','float64'])
    on = []
    con = []
    # From reviewing the data set, I found that none of the ordinal/nominal columns has more than 16 distinct values
    for col in num_cols.columns:
        if len(num_cols[col].unique()) > 16:
            con.append(col)
        else:
            on.append(col)
    train[on] = train[on].fillna(train[on].mode().iloc[0])
    train[con] = train[con].fillna(train[con].mean())
    
    # Create new features that better capture the information.
    # Use the 'Year Remod/Add' & 'Year Built' columns to calculate the number of years
    # between when the house was built and when it was remodeled
    train['years_until_remod'] = train['Year Remod/Add'] - train['Year Built']
    # Similarly, 'Yr Sold' is more informative when converted
    # to the age of the house when sold.
    train['age_when_sold'] = train['Yr Sold'] - train['Year Built']
    # Remove rows with negative values in either new feature
    # (in this split that is only row 850, found in the exploration above)
    train = train[(train['years_until_remod'] >= 0) & (train['age_when_sold'] >= 0)]
    # No longer need original year columns
    train = train.drop(['Yr Sold', 'Year Built'], axis=1)
    
    # Drop columns that aren't useful for ML
    train = train.drop(['Order', 'PID'], axis=1)

    # These columns hold information that cannot be known
    # prior to the sale, so they leak data about it.
    train = train.drop(["Mo Sold", "Sale Condition", "Sale Type"], axis=1)
    
    return train, test

In [15]:
# Quick Check that we get the same results as above
train, test = transform_features(data)
print(train.shape)
train.info()

(1459, 60)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1459 entries, 0 to 1459
Data columns (total 60 columns):
MS SubClass          1459 non-null int64
MS Zoning            1459 non-null object
Lot Area             1459 non-null int64
Street               1459 non-null object
Lot Shape            1459 non-null object
Land Contour         1459 non-null object
Utilities            1459 non-null object
Lot Config           1459 non-null object
Land Slope           1459 non-null object
Neighborhood         1459 non-null object
Condition 1          1459 non-null object
Condition 2          1459 non-null object
Bldg Type            1459 non-null object
House Style          1459 non-null object
Overall Qual         1459 non-null int64
Overall Cond         1459 non-null int64
Year Remod/Add       1459 non-null int64
Roof Style           1459 non-null object
Roof Matl            1459 non-null object
Exterior 1st         1459 non-null object
Exterior 2nd         1459 non-null object
Mas V

Index(['MS SubClass', 'MS Zoning', 'Lot Area', 'Street', 'Lot Shape',
       'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood',
       'Condition 1', 'Condition 2', 'Bldg Type', 'House Style',
       'Overall Qual', 'Overall Cond', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Area',
       'Exter Qual', 'Exter Cond', 'Foundation', 'BsmtFin SF 1',
       'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC',
       'Central Air', 'Electrical', '1st Flr SF', '2nd Flr SF',
       'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath',
       'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr',
       'Kitchen Qual', 'TotRms AbvGrd', 'Functional', 'Fireplaces',
       'Garage Cars', 'Garage Area', 'Paved Drive', 'Wood Deck SF',
       'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch',
       'Pool Area', 'Misc Val', 'SalePrice', 'years_until_remod',
       'age_when_so
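
Note that `transform_features()` as written returns `test` unchanged, so the two frames no longer share the same columns. One possible way to bring `test` in line is sketched below. The helper name `align_test_to_train` and its `max_unique` parameter are assumptions made for illustration and are not part of the notebook. The toy frames are also made up, and `max_unique=2` is used only so the tiny example exercises both fill rules.

```python
import numpy as np
import pandas as pd

def align_test_to_train(train_clean, test_raw, max_unique=16):
    """Hypothetical helper: reshape test to match an already-cleaned train set."""
    test = test_raw.copy()
    # Recreate the engineered features while the year columns still exist
    test['years_until_remod'] = test['Year Remod/Add'] - test['Year Built']
    test['age_when_sold'] = test['Yr Sold'] - test['Year Built']
    # Keep exactly the columns that survived in train, in the same order
    test = test[train_clean.columns].copy()
    # Fill numeric gaps with statistics computed on train, never on test
    for col in train_clean.select_dtypes(include=['number']).columns:
        if train_clean[col].nunique() > max_unique:
            fill_value = train_clean[col].mean()          # treated as continuous
        else:
            fill_value = train_clean[col].mode().iloc[0]  # treated as ordinal/nominal
        test[col] = test[col].fillna(fill_value)
    return test

# Toy stand-ins for the cleaned train set and the raw test set
train_clean = pd.DataFrame({
    'Gr Liv Area': [1000.0, 1500.0, 2000.0],
    'Bsmt Full Bath': [0.0, 1.0, 1.0],
    'Year Remod/Add': [2000, 2005, 2010],
    'SalePrice': [150000, 200000, 250000],
    'years_until_remod': [0, 5, 10],
    'age_when_sold': [10, 5, 1],
})
test_raw = pd.DataFrame({
    'Order': [4, 5],
    'Gr Liv Area': [np.nan, 1800.0],
    'Bsmt Full Bath': [np.nan, 0.0],
    'Year Built': [1990, 2000],
    'Year Remod/Add': [1995, 2000],
    'Yr Sold': [2008, 2009],
    'SalePrice': [180000, 220000],
})

test_clean = align_test_to_train(train_clean, test_raw, max_unique=2)
print(test_clean)
```

This sketch does not handle text columns that contain missing values or categories that appear only in `test`, so those cases would still need a decision before modelling.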

## Feature Selection
*  Generate a correlation heatmap matrix of the numerical features in the training data set.
  *  Which features correlate strongly with our target column, __SalePrice__?
  *  Calculate the correlation coefficients for the columns that seem to correlate well with __SalePrice__. Because we have a pipeline in place, it's easy to try different features and see which features result in a better cross validation score.
*  Which columns in the data frame should be converted to the categorical data type? All of the columns marked as __nominal__ from the [documentation](DataDocumentation.txt) are candidates for being converted to categorical. Here are some other things you should think about:
  *  If a categorical column has hundreds of unique values (or categories), should you keep it? When you dummy code this column, hundreds of columns will need to be added back to the data frame.
  *  Which categorical columns have a few unique values but more than 95% of the values in the column belong to a specific category? This would be similar to a low variance numerical feature (no variability in the data for the model to capture).
*  Which columns are currently numerical but need to be encoded as categorical instead (because the numbers don't have any semantic meaning)?
*  What are some ways we can explore which categorical columns "correlate" well with __SalePrice__?
  *  Read this post for some [potential strategies](https://machinelearningmastery.com/feature-selection-machine-learning-python/).
*  Update the logic for the __select_features()__ function. This function should take in the new, modified train and test data frames that were returned from __transform_features()__.
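
As a starting point for the questions above, the following is a rough sketch of one way the selection logic could look. The function name `select_features_sketch`, the thresholds (`corr_threshold`, `max_categories`, `max_share`), and the toy data are all assumptions chosen for illustration; suitable values for the Ames data would need to be found by experiment.

```python
import pandas as pd

def select_features_sketch(train, target='SalePrice', corr_threshold=0.4,
                           max_categories=10, max_share=0.95):
    """Hypothetical sketch: return (numeric_features, categorical_features)."""
    # Numerical features: keep those whose absolute correlation
    # with the target is above the threshold
    numeric = train.select_dtypes(include=['number'])
    corr = numeric.corr()[target].abs().drop(target)
    numeric_features = corr[corr > corr_threshold].index.tolist()

    # Text features: keep columns with few categories and no single
    # category that accounts for more than max_share of the rows
    categorical_features = []
    for col in train.select_dtypes(include=['object']).columns:
        shares = train[col].value_counts(normalize=True)  # sorted, largest first
        if len(shares) <= max_categories and shares.iloc[0] <= max_share:
            categorical_features.append(col)
    return numeric_features, categorical_features

# Toy frame: one informative column, one uninformative column,
# one constant text column, and one balanced text column
toy = pd.DataFrame({
    'Gr Liv Area': [1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400],
    'Junk': [1, -1, -1, 1, 1, -1, -1, 1],
    'Street': ['Pave'] * 8,
    'Zone': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
})
toy['SalePrice'] = toy['Gr Liv Area'] * 100

numeric_features, categorical_features = select_features_sketch(toy)
# Dummy code the surviving text columns so a linear model can use them
dummies = pd.get_dummies(toy[categorical_features])
print(numeric_features, categorical_features, list(dummies.columns))
```

The correlation matrix returned by `numeric.corr()` is also what a heatmap plot would take as input, so the same intermediate result can serve both the visual check and the threshold filter.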

## Train & Test
*  The optional __k__ parameter should accept integer values, with a default value of __0__.
*  When __k__ equals __0__, perform holdout validation (what we already implemented):
  *  Select the first __1460__ rows and assign to __train__.
  *  Select the remaining rows and assign to __test__.
  *  Train on __train__ and test on __test__.
  *  Compute the RMSE and return.
*  When __k__ equals __1__, perform simple cross validation:
  *  Shuffle the ordering of the rows in the data frame.
  *  Select the first __1460__ rows and assign to __fold_one__.
  *  Select the remaining rows and assign to __fold_two__.
  *  Train on __fold_one__ and test on __fold_two__.
  *  Train on __fold_two__ and test on __fold_one__.
  *  Compute the average RMSE and return.
*  When __k__ is greater than __1__, implement k-fold cross validation:
  *  Split the data into __k__ folds, training on __k - 1__ folds and testing on the remaining fold each time.
  *  Calculate the average RMSE value and return this value.
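
The three cases above could be sketched roughly as follows. The names `rmse_for_split` and `train_and_test_k`, and the `split` and `seed` parameters, are assumptions introduced for this example. The data is a synthetic, perfectly linear frame rather than the Ames set, so every RMSE comes out at essentially zero.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def rmse_for_split(train, test, features, target):
    """Fit on train, predict on test, and return the RMSE."""
    lr = LinearRegression()
    lr.fit(train[features], train[target])
    predictions = lr.predict(test[features])
    return np.sqrt(mean_squared_error(test[target], predictions))

def train_and_test_k(df, features, target='SalePrice', k=0, split=1460, seed=1):
    """Hypothetical sketch of the k=0 / k=1 / k>1 behaviour described above."""
    if k == 0:
        # Holdout validation: first `split` rows train, the rest test
        return rmse_for_split(df.iloc[:split], df.iloc[split:], features, target)
    if k == 1:
        # Simple cross validation: shuffle, split in two, swap roles, average
        shuffled = df.sample(frac=1, random_state=seed)
        fold_one, fold_two = shuffled.iloc[:split], shuffled.iloc[split:]
        rmses = [rmse_for_split(fold_one, fold_two, features, target),
                 rmse_for_split(fold_two, fold_one, features, target)]
        return np.mean(rmses)
    # k-fold cross validation: each fold is the test set exactly once
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    rmses = []
    for train_idx, test_idx in kf.split(df):
        rmses.append(rmse_for_split(df.iloc[train_idx], df.iloc[test_idx],
                                    features, target))
    return np.mean(rmses)

# Synthetic, exactly linear data (not the Ames data)
area = np.arange(20) * 50.0 + 1000
toy = pd.DataFrame({'Gr Liv Area': area, 'SalePrice': 100 * area + 5000})

holdout_rmse = train_and_test_k(toy, ['Gr Liv Area'], k=0, split=10)
simple_cv_rmse = train_and_test_k(toy, ['Gr Liv Area'], k=1, split=10)
kfold_rmse = train_and_test_k(toy, ['Gr Liv Area'], k=4)
print(holdout_rmse, simple_cv_rmse, kfold_rmse)
```

Fixing `random_state` keeps the shuffles repeatable, which makes it easier to compare feature sets against each other across runs.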