## Introduction to the Data Set
*  Import pandas, matplotlib, and numpy into the environment. Import the classes you need from scikit-learn as well.
*  Read __AmesHousing.txt__ into a pandas data frame.
*  For the following functions, we recommend creating them in the first few cells in the notebook. This way, you can add cells to the end of the notebook to do experiments and update the functions in these cells.
  *  Create a function named __transform_features()__ that, for now, just returns the __train__ data frame.
  *  Create a function named __select_features()__ that, for now, just returns the __Gr Liv Area__ and __SalePrice__ columns from the __train__ data frame.
  *  Create a function named __train_and_test()__ that, for now:
    *  Selects the first __1460__ rows from __data__ and assigns them to __train__.
    *  Selects the remaining rows from __data__ and assigns them to __test__.
    *  Trains a model using all numerical columns except the __SalePrice__ column (the target column) from the data frame returned from __select_features()__.
    *  Tests the model on the test set and returns the RMSE value.

In [55]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [56]:
data = pd.read_csv('AmesHousing.txt', delimiter='\t')
train = data.iloc[:1460].copy()
test = data.iloc[1460:].copy()

In [57]:
def transform_features(train):
    df = train.copy()
    return df

def select_features():
    return ['Gr Liv Area'], 'SalePrice'

def train_and_test(train, test, train_features, target):
    # Fit a linear model on the training rows
    lr = LinearRegression()
    lr.fit(train[train_features], train[target])
    # Predict on the held-out rows and score the predictions with RMSE
    predictions = lr.predict(test[train_features])
    mse = mean_squared_error(test[target], predictions)
    rmse = np.sqrt(mse)

    return rmse

In [58]:
train_features, target = select_features()
RMSE = train_and_test(train, test, train_features, target)
RMSE

57088.251612639091
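
For reference, the RMSE returned above is the square root of the mean squared error that `mean_squared_error` computes, where $y_i$ is the actual sale price of row $i$ in the test set, $\hat{y}_i$ is the model's prediction, and $n$ is the number of test rows:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the error is expressed in the same units as __SalePrice__, the value above means predictions on the test set are off by roughly 57,000 dollars in the root-mean-square sense.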

## Feature Engineering
*  As we mentioned earlier, we recommend adding some cells to explore and experiment with different features (before rewriting these functions).
*  The __transform_features()__ function shouldn't modify the __train__ data frame and instead return a new one entirely. This way, we can keep using __train__ in the experimentation cells.
*  Which columns contain less than 5% missing values?
  *  For numerical columns that meet this criterion, let's fill in the missing values using the most common value (the mode) for that column.
*  What new features can we create that better capture the information in some of the features?
  *  An example of this would be the __years_until_remod__ feature we created in the last mission.
*  Which columns need to be dropped for other reasons?
  *  Which columns aren't useful for machine learning?
  *  Which columns leak data about the final sale?
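
Once the experiments settle, the checklist above could be folded back into `transform_features()`. The following is only a minimal sketch under some assumptions: the `null_cutoff` parameter and the `leak_or_id` list are illustrative choices (the list mirrors the columns this notebook drops), and the `toy` frame is made-up data rather than rows from the Ames set.

```python
import numpy as np
import pandas as pd

def transform_features(df, null_cutoff=0.05):
    """Return a transformed copy of df; the input frame is left untouched."""
    out = df.copy()

    # Keep numerical columns with less than `null_cutoff` missing values.
    numeric = out.select_dtypes(include=[np.number])
    null_share = numeric.isnull().mean()
    out = numeric.loc[:, null_share < null_cutoff].copy()

    # Fill the remaining gaps with each column's mode (first mode on ties).
    out = out.fillna(out.mode().iloc[0])

    # New feature: years between construction and remodel.
    if {'Year Built', 'Year Remod/Add'}.issubset(out.columns):
        out['years_until_remod'] = out['Year Remod/Add'] - out['Year Built']
        out = out.drop(['Year Built', 'Year Remod/Add'], axis=1)

    # Drop identifiers and sale-date columns (assumed list, adjust as needed).
    leak_or_id = ['Order', 'PID', 'Mo Sold', 'Yr Sold']
    out = out.drop([c for c in leak_or_id if c in out.columns], axis=1)
    return out

# Toy data, made up purely for illustration.
toy = pd.DataFrame({
    'Order': range(1, 41),
    'Year Built': [1990] * 40,
    'Year Remod/Add': [2000] * 40,
    'Lot Frontage': [np.nan] * 10 + [60.0] * 30,              # 25% missing
    'Mas Vnr Area': [np.nan] + [0.0] * 30 + [100.0] * 9,      # 2.5% missing
    'SalePrice': [200000] * 40,
})
result = transform_features(toy)
print(result.columns.tolist())
```

Because the function works on a copy, `toy` still contains its original columns and missing values afterwards, which keeps the experimentation cells usable.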

In [59]:
# Make a copy of the train DF and show all the numerical columns
train_copy = transform_features(train)
numerical_train = train_copy.select_dtypes(include=['int64', 'float64'])
numerical_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 39 columns):
Order              1460 non-null int64
PID                1460 non-null int64
MS SubClass        1460 non-null int64
Lot Frontage       1211 non-null float64
Lot Area           1460 non-null int64
Overall Qual       1460 non-null int64
Overall Cond       1460 non-null int64
Year Built         1460 non-null int64
Year Remod/Add     1460 non-null int64
Mas Vnr Area       1449 non-null float64
BsmtFin SF 1       1459 non-null float64
BsmtFin SF 2       1459 non-null float64
Bsmt Unf SF        1459 non-null float64
Total Bsmt SF      1459 non-null float64
1st Flr SF         1460 non-null int64
2nd Flr SF         1460 non-null int64
Low Qual Fin SF    1460 non-null int64
Gr Liv Area        1460 non-null int64
Bsmt Full Bath     1459 non-null float64
Bsmt Half Bath     1459 non-null float64
Full Bath          1460 non-null int64
Half Bath          1460 non-null int64
Bedroom AbvGr      

In [60]:
# Get the numerical columns that have less than 5% missing values
isnull_percentage = numerical_train.isnull().sum()/numerical_train.shape[0]
cols_Numerical_LessThanFivePercentNulls = isnull_percentage[isnull_percentage < 0.05].index
print(cols_Numerical_LessThanFivePercentNulls)
numerical_train = train_copy[cols_Numerical_LessThanFivePercentNulls].copy()

# From these columns fill the missing values with the most common value (mode).
# Note: if a column has multiple modes, mode() returns them sorted and .iloc[0] picks the first (smallest) one
numerical_train = numerical_train.fillna(numerical_train.mode().iloc[0])
print('\nVerify there are no missing values in the DataFrame')
numerical_train.isnull().sum()

Index(['Order', 'PID', 'MS SubClass', 'Lot Area', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area',
       'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area',
       'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath',
       'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces',
       'Garage Cars', 'Garage Area', 'Wood Deck SF', 'Open Porch SF',
       'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val',
       'Mo Sold', 'Yr Sold', 'SalePrice'],
      dtype='object')

Verify there are no missing values in the DataFrame


Order              0
PID                0
MS SubClass        0
Lot Area           0
Overall Qual       0
Overall Cond       0
Year Built         0
Year Remod/Add     0
Mas Vnr Area       0
BsmtFin SF 1       0
BsmtFin SF 2       0
Bsmt Unf SF        0
Total Bsmt SF      0
1st Flr SF         0
2nd Flr SF         0
Low Qual Fin SF    0
Gr Liv Area        0
Bsmt Full Bath     0
Bsmt Half Bath     0
Full Bath          0
Half Bath          0
Bedroom AbvGr      0
Kitchen AbvGr      0
TotRms AbvGrd      0
Fireplaces         0
Garage Cars        0
Garage Area        0
Wood Deck SF       0
Open Porch SF      0
Enclosed Porch     0
3Ssn Porch         0
Screen Porch       0
Pool Area          0
Misc Val           0
Mo Sold            0
Yr Sold            0
SalePrice          0
dtype: int64
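
A small check of how ties are handled when filling with the mode may be useful here. The sketch below uses a made-up two-column frame (the names `a` and `b` are hypothetical) to show that `DataFrame.mode()` returns every tied mode in sorted order, so `.iloc[0]` selects the smallest one rather than an average.

```python
import pandas as pd

# Made-up data: column 'a' has two modes, column 'b' has one.
toy = pd.DataFrame({
    'a': [1.0, 1.0, 3.0, 3.0, None],   # modes: 1.0 and 3.0
    'b': [5.0, 5.0, 7.0, None, 8.0],   # mode: 5.0
})

modes = toy.mode()            # one row per tied mode, padded with NaN
fill_values = modes.iloc[0]   # first row = smallest mode of each column
filled = toy.fillna(fill_values)

print(modes)
print(filled)
```

If a different tie-breaking rule is preferred (for example the median of the tied modes), it would need to be computed explicitly before calling `fillna()`.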

In [61]:
# The following columns can be dropped as they are not useful for machine learning.
# Note: Mo Sold and Yr Sold might be valuable for converting the sale price into an
# inflation-adjusted value. For now I'm going to leave that alone.
# cols_drop = ['Order', 'PID', 'Mo Sold', 'Yr Sold']
numerical_train = numerical_train.drop(['Order', 'PID', 'Mo Sold', 'Yr Sold'], axis=1)

# The MS SubClass Column requires categorization/dummies
numerical_train['MS SubClass'] = numerical_train['MS SubClass'].astype('category')
col_dummies = pd.get_dummies(numerical_train['MS SubClass'])
numerical_train = pd.concat([numerical_train, col_dummies], axis=1)
numerical_train = numerical_train.drop(['MS SubClass'], axis=1)

# Replace the raw year columns, whose values don't relate directly to the target,
# with a derived feature that does: use the 'Year Remod/Add' and 'Year Built' columns
# to calculate the number of years between when the house was built and when it
# was remodeled.
numerical_train['years_until_remod'] = numerical_train['Year Remod/Add'] - numerical_train['Year Built']
numerical_train = numerical_train.drop(['Year Remod/Add', 'Year Built'], axis=1)

# Print the columns
numerical_train.columns

Index([         'Lot Area',      'Overall Qual',      'Overall Cond',
            'Mas Vnr Area',      'BsmtFin SF 1',      'BsmtFin SF 2',
             'Bsmt Unf SF',     'Total Bsmt SF',        '1st Flr SF',
              '2nd Flr SF',   'Low Qual Fin SF',       'Gr Liv Area',
          'Bsmt Full Bath',    'Bsmt Half Bath',         'Full Bath',
               'Half Bath',     'Bedroom AbvGr',     'Kitchen AbvGr',
           'TotRms AbvGrd',        'Fireplaces',       'Garage Cars',
             'Garage Area',      'Wood Deck SF',     'Open Porch SF',
          'Enclosed Porch',        '3Ssn Porch',      'Screen Porch',
               'Pool Area',          'Misc Val',         'SalePrice',
                        20,                  30,                  40,
                        45,                  50,                  60,
                        70,                  75,                  80,
                        85,                  90,                 120,
                    

## Feature Selection
*  Generate a correlation heatmap matrix of the numerical features in the training data set.
  *  Which features correlate strongly with our target column, __SalePrice__?
  *  Calculate the correlation coefficients for the columns that seem to correlate well with __SalePrice__. Because we have a pipeline in place, it's easy to try different features and see which features result in a better cross validation score.
*  Which columns in the data frame should be converted to the categorical data type? All of the columns marked as __nominal__ from the [documentation](DataDocumentation.txt) are candidates for being converted to categorical. Here are some other things you should think about:
  *  If a categorical column has hundreds of unique values (or categories), should you keep it? When you dummy code this column, hundreds of columns will need to be added back to the data frame.
  *  Which categorical columns have a few unique values but more than 95% of the values in the column belong to a specific category? This would be similar to a low variance numerical feature (no variability in the data for the model to capture).
*  Which columns are currently numerical but need to be encoded as categorical instead (because the numbers don't have any semantic meaning)?
*  What are some ways we can explore which categorical columns "correlate" well with __SalePrice__?
  *  Read this post for some [potential strategies](https://machinelearningmastery.com/feature-selection-machine-learning-python/).
*  Update the logic for the __select_features()__ function. This function should take in the new, modified train and test data frames that were returned from __transform_features()__.
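
One possible shape for the updated `select_features()` is sketched below. It is an illustration under assumptions, not the project's prescribed solution: the `corr_cutoff` and `share_cutoff` parameters, the helper name `low_variance_categoricals`, and the toy values are all made up for the example.

```python
import numpy as np
import pandas as pd

def select_features(df, target='SalePrice', corr_cutoff=0.4):
    """Return (feature names, target), keeping numerical columns whose
    absolute correlation with the target exceeds corr_cutoff."""
    numeric = df.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].abs().drop(target)
    strong = corr[corr > corr_cutoff].sort_values(ascending=False)
    return strong.index.tolist(), target

def low_variance_categoricals(df, share_cutoff=0.95):
    """List non-numerical columns where a single category covers more
    than share_cutoff of the rows."""
    text_cols = df.select_dtypes(exclude=[np.number]).columns
    return [col for col in text_cols
            if df[col].value_counts(normalize=True).iloc[0] > share_cutoff]

# Toy data, made up purely for illustration.
x = np.arange(100)
toy = pd.DataFrame({
    'Gr Liv Area': 10 * x + 500,                 # moves with the target
    'Misc Val': x % 2,                           # almost unrelated to the target
    'Street': ['Pave'] * 99 + ['Grvl'],          # 99% in one category
    'Lot Shape': ['Reg'] * 60 + ['IR1'] * 40,    # reasonably balanced
    'SalePrice': 100 * x + 50000,
})

features, target = select_features(toy)
flagged = low_variance_categoricals(toy)
print(features, target)
print(flagged)
```

The full matrix from `numeric.corr()` can also be drawn as a heatmap (for instance with `plt.imshow`) to inspect correlations between features as well as with the target. The cutoffs are tuning knobs, so it is worth rerunning the pipeline with a few values and comparing the resulting RMSE.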

## Train & Test
*  The optional __k__ parameter should accept integer values, with a default value of __0__.
*  When __k__ equals __0__, perform holdout validation (what we already implemented):
  *  Select the first __1460__ rows and assign to __train__.
  *  Select the remaining rows and assign to __test__.
  *  Train on __train__ and test on __test__.
  *  Compute the RMSE and return.
*  When __k__ equals __1__, perform simple cross validation:
  *  Shuffle the ordering of the rows in the data frame.
  *  Select the first __1460__ rows and assign to __fold_one__.
  *  Select the remaining rows and assign to __fold_two__.
  *  Train on __fold_one__ and test on __fold_two__.
  *  Train on __fold_two__ and test on __fold_one__.
  *  Compute the average RMSE and return.
*  When __k__ is greater than __1__, implement k-fold cross validation:
  *  Perform k-fold cross validation using __k__ folds.
  *  Calculate the average RMSE value and return this value.
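
The three cases above could be sketched as follows. This is a hedged illustration rather than a reference solution: the signature (a single data frame plus `features`, `target`, `split`, and `seed` parameters), the private helper `_rmse`, and the toy data are assumptions introduced for the example, and `KFold` from scikit-learn is used for the k-fold case.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def _rmse(train, test, features, target):
    """Fit on train, predict on test, and return the RMSE."""
    lr = LinearRegression()
    lr.fit(train[features], train[target])
    predictions = lr.predict(test[features])
    return np.sqrt(mean_squared_error(test[target], predictions))

def train_and_test(df, features, target='SalePrice', k=0, split=1460, seed=1):
    if k == 0:
        # Holdout validation: first `split` rows train, the rest test.
        return _rmse(df.iloc[:split], df.iloc[split:], features, target)

    if k == 1:
        # Simple cross validation: shuffle, split in two, swap roles, average.
        shuffled = df.sample(frac=1, random_state=seed)
        fold_one, fold_two = shuffled.iloc[:split], shuffled.iloc[split:]
        return np.mean([_rmse(fold_one, fold_two, features, target),
                        _rmse(fold_two, fold_one, features, target)])

    # k-fold cross validation: average the RMSE over k folds.
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    rmses = [_rmse(df.iloc[train_idx], df.iloc[test_idx], features, target)
             for train_idx, test_idx in kf.split(df)]
    return np.mean(rmses)

# Toy data, made up purely for illustration (perfectly linear relationship).
x = np.arange(40)
toy = pd.DataFrame({'Gr Liv Area': x, 'SalePrice': 100 * x + 1000})

holdout = train_and_test(toy, ['Gr Liv Area'], k=0, split=20)
simple = train_and_test(toy, ['Gr Liv Area'], k=1, split=20)
kfold = train_and_test(toy, ['Gr Liv Area'], k=4, split=20)
print(holdout, simple, kfold)
```

With the real data, the default `split=1460` matches the row count used throughout this notebook, and fixing `seed` keeps the shuffled folds reproducible between runs.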