In this module, we started by building intuition for model based learning, explored how the linear regression model worked, understood how the two different approaches to model fitting worked, and some techniques for cleaning, transforming, and selecting features. In this guided project, you can practice what you learned in this course by exploring ways to improve the models we built. 

Let's start by setting up a pipeline of functions that will let us quickly iterate on different models.

#### Instructions

- Import pandas, matplotlib, and numpy into the environment.
- Read AmesHousing.txt into a pandas data frame. Select the first 1460 rows from from data and assign to train. Select the remaining rows from data and assign to test.
- For the following functions, we recommend creating them in the first few cells in the notebook. This way, you can add cells to the end of the notebook to do experiments and update the functions in these cells.
- Create a function named transform_features() that, for now, just returns the train data frame.
- Create a function named select_features() that, for now, just returns the Gr Liv Area and SalePrice columns from the train data frame.
- Create a function named train_and_test() that, for now:
  - trains a model using all columns except the SalePrice column from the data frame returned from select_features()
  - tests the model on the test set using k-fold cross-validation and returns the fold-level RMSE values as well as the average RMSE value.

In [77]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('AmesHousing.txt', delimiter='\t')
data.head(4)

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000


In [78]:
# Separate the train and test data
train = data[0:1460]
test = data[1460:]

In [79]:
train['Utilities'].value_counts()

AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

In [80]:
# Create a function named transform_features() that, for now, just returns the train data frame.
# Drop the Utilities column, since you can see from above that 97 % of the values fall in the AllPub category and the rest 
# among two other codes, which is a case of low variance feature (no variability in the data for the model to capture.)
def transform_features():
    traindf = train
    testdf = test
    drop_cols = ['Order', 'PID', 'Mo Sold', 'Yr Sold', 'Utilities','Garage Yr Blt', 'Misc Feature']
    
    traindf.drop(drop_cols, axis=1, inplace=True)
    testdf.drop(drop_cols, axis=1, inplace=True)
    
    text_cols = train.select_dtypes(include=['object']).columns
    
    for col in text_cols:
        #traindf[col] = traindf[col].astype('category')
        #col_dummies = pd.get_dummies(traindf[col])
        #traindf = pd.concat([traindf,col_dummies],axis=1)
        
        #testdf[col] = testdf[col].astype('category')
        #col_dummies = pd.get_dummies(testdf[col])
        #testdf = pd.concat([testdf, col_dummies], axis=1)
    
    
    
    # create a new feature called years_until_remod using the 'Year Remod/Add' and the 'Year Built' features
    traindf['years_until_remod'] = traindf['Year Remod/Add'] - traindf['Year Built']
    testdf['years_until_remod'] = testdf['Year Remod/Add'] - testdf['Year Built']
    traindf = traindf.fillna(traindf.mean())
    testdf = testdf.fillna(testdf.mean())
    return traindf, testdf


IndentationError: expected an indented block (<ipython-input-80-31831635a289>, line 26)

In [81]:
# Create a function named select_features() that, for now:
#  just returns the Gr Liv Area and SalePrice columns from the train data frame.
features = ['Gr Liv Area', 'SalePrice']

def select_features():
    df = train[features]
    return df

In [82]:
# Create a function named train_and_test() that, for now:
#   trains a model using all columns except the SalePrice column from the data frame returned from select_features()
#   tests the model on the test set using k-fold cross-validation and returns the fold-level RMSE values as well as the average RMSE value.
#train.isnull().sum()
train_data_for_kfold, test_data = transform_features()
train_data_for_kfold.head(2)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # Remove the CWD from sys.path while we load stuff.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,Lot Config,Land Slope,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Val,Sale Type,Sale Condition,SalePrice,years_until_remod
0,20,RL,141.0,31770,Pave,,IR1,Lvl,Corner,Gtl,...,0,0,0,,,0,WD,Normal,215000,0
1,20,RH,80.0,11622,Pave,,Reg,Lvl,Inside,Gtl,...,0,120,0,,MnPrv,0,WD,Normal,105000,0


In [44]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold

model_features = train_data_for_kfold.columns.tolist()
model_features.remove('SalePrice')

target_col = 'SalePrice'

def train_and_test(train_cols, target_col, folds, train_df, test_df):
    model = LinearRegression()
    model.fit(train_df[train_cols], train_df[target_col])
    kfold_avg_rmses = {}
    fold_level_rmses = {}
    for fold in folds:
        kf = KFold(fold, shuffle=True, random_state=1)
        #lr = LinearRegression()
        mses = cross_val_score(model, train_df[train_cols], train_df[target_col], scoring ="neg_mean_squared_error", cv=kf)
        #mses = cross_val_predict(model, test_df[train_cols], test_df[target_col], scoring ="neg_mean_squared_error", cv=kf)
        rmses = [ np.sqrt(np.absolute(mse)) for mse in mses]
        avg_rmse = np.mean(rmses)
        fold_level_rmses[fold] = rmses
        kfold_avg_rmses[fold] = avg_rmse    
    
    return kfold_avg_rmses, fold_level_rmses

folds = [3, 5, 7, 9, 10, 11, 13, 15, 17, 19, 21, 23]

train_data_for_kfold.isnull().sum()
#train_data_for_kfold.columns.tolist()
#avg_rmses, rmses = train_and_test(model_features, target_col, folds, train_data_for_kfold, test_data)
#print(avg_rmses)


MS SubClass            0
MS Zoning              0
Lot Frontage         249
Lot Area               0
Street                 0
Lot Shape              0
Land Contour           0
Lot Config             0
Land Slope             0
Neighborhood           0
Condition 1            0
Condition 2            0
Bldg Type              0
House Style            0
Overall Qual           0
Overall Cond           0
Year Built             0
Year Remod/Add         0
Roof Style             0
Roof Matl              0
Exterior 1st           0
Exterior 2nd           0
Mas Vnr Type          11
Mas Vnr Area          11
Exter Qual             0
Exter Cond             0
Foundation             0
Bsmt Qual             40
Bsmt Cond             40
Bsmt Exposure         41
                    ... 
Fa                     0
Gd                     0
Po                     0
TA                     0
N                      0
P                      0
Y                      0
Ex                     0
GdPrv                  0


Let's now start to removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns. Update transform_features() so that any column from the data frame with more than 25% (or another cutoff value) missing values is dropped. You also need to remove any columns that leak information about the sale (e.g. like the year the sale happened). In general, the goal of this function is to:

- Remove features that we don't want to use in the model, just based on the number of missing values or data leakage.
- Transform features into the proper format (numerical to categorical, scaling numerical, filling in missing values, etc).
- Create new features by combining other features.

Next, you need to get more familiar with the remaining columns by reading the [data documentation](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) for each column, determining what transformations are necessary (if any), and more. As we mentioned earlier, succeeding in predictive modeling (and competitions like Kaggle) is highly dependent on the quality of features the model has. Libraries like scikit-learn has made it quick and easy to simply try and tweak many different models, but cleaning, selecting, and transforming features is still more of an art that requires a bit of human ingenuity.

#### Instructions
- As we mentioned earlier, add cells to the end of the notebook to explore and experiment with different features.
- The transform_features() function shouldn't modify the train data frame and instead return a new one entirely. This way, we can keep using train in the experimentation cells.
- Which columns need to be dropped immediately?
  - The PID column doesn't seem helpful for making predictions. Read the documentation to understand why (or display the column).
- Which columns contain a large number of missing values?
- Which columns leak data about the final sale?
- Which columns contain less than 25% missing values and how should they be filled in?
- Which columns in the data frame should be categorical? Here are some things you should think about:
  - If a categorical column has hundreds of unique values (or categories), should you keep it? When you dummy code this column, hundreds of columns will need to be added back to the data frame.
  - Which categorical columns have a few unique values but more than 95% of the values in the column belong to a specific category? This would be similar to a low variance numerical feature (no variability in the data for the model to capture).
  - Which columns are currently numerical but need to be encoded as categorical instead (because the numbers don't have any semantic meaning)?
- What new features can we create, that better capture the information in some of the features?
  - An example of this would be the years_until_remod feature we created in the last mission.
- Use this function to transform the test data frame as well.

In [30]:
float_cols = train.select_dtypes(include=['float'])
float_cols.isnull().sum()
    

Lot Frontage      249
Mas Vnr Area       11
BsmtFin SF 1        1
BsmtFin SF 2        1
Bsmt Unf SF         1
Total Bsmt SF       1
Bsmt Full Bath      1
Bsmt Half Bath      1
Garage Yr Blt      75
Garage Cars         0
Garage Area         0
dtype: int64