<img src="https://i.imgur.com/JDsOpVN.png" style="float: left; margin: 20px; height: 290px">

# Feature Selection 

---
Predicting House Prices with Linear Regression

**Author**: Miriam Sosa
1. [Define Features to Model](#Define-Features-to-Model)
    - [Model 1](#Model-1)
    - [Model 2](#Model-2)
    - [Model 3](#Model-3)

## Imports

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd           
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Read-in Data

In [2]:
train = pd.read_csv('../data/train_clean.csv', keep_default_na=False, na_values=[''])   
test = pd.read_csv('../data/test_clean.csv', keep_default_na=False, na_values=[''])

## Data Shape

In [3]:
train.shape, test.shape

((2051, 93), (878, 91))

## Define Features to Model

In [4]:
features = ['MS Zoning', 
            'Utilities', 
            'Neighborhood', 
            'Mas Vnr Type', 
            'Foundation', 
            'Heating', 
            'Central Air', 
            'Garage Type', 
            'Garage Cond', 
            'pool', 
            'cond_norm', 
            'cond_pos', 
            'LotFrontage', 
            'Year Built', 
            'BsmtFin SF 1', 
            'SF', 
            'TotRms AbvGrd', 
            'Garage Cars', 
            'Garage Area', 
            'outdoorSF']

## Model 1

In [5]:
# Drop dummy columns that appear in only one of the train and test sets, so both design matrices share the same columns

In [6]:
X = train[features]
X = pd.get_dummies(X, columns=['MS Zoning', 
                               'Utilities', 
                               'Neighborhood', 
                               'Mas Vnr Type', 
                               'Foundation', 
                               'Heating', 
                               'Central Air', 
                               'Garage Type', 
                               'Garage Cond', 
                               'pool', 
                               'cond_norm', 
                               'cond_pos'], drop_first=True)
X.drop(columns=['Neighborhood_GrnHill',
                'Neighborhood_Landmrk',
                'MS Zoning_C (all)',
                'Utilities_NoSeWa',
                'Heating_Wall'], inplace=True)
y = train['SalePrice']

# Hold out 25% of the rows (the train_test_split default) for validation
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
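
The columns dropped above are dummies that exist in only one of the two datasets. As a minimal sketch of how such mismatches could be found, the snippet below compares the dummy columns of two small made-up frames. `train_demo` and `test_demo` are hypothetical stand-ins, not the real data, and `drop_first` is left off to keep the toy output easy to read.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real train/test data.
train_demo = pd.DataFrame({
    'Neighborhood': ['NAmes', 'GrnHill', 'Blueste', 'NAmes'],
    'SF': [1500, 2100, 900, 1200],
})
test_demo = pd.DataFrame({
    'Neighborhood': ['NAmes', 'Blueste', 'BrDale'],
    'SF': [1400, 950, 1000],
})

train_dummies = pd.get_dummies(train_demo, columns=['Neighborhood'])
test_dummies = pd.get_dummies(test_demo, columns=['Neighborhood'])

# Dummy columns that appear in only one of the two frames.
only_in_train = sorted(set(train_dummies.columns) - set(test_dummies.columns))
only_in_test = sorted(set(test_dummies.columns) - set(train_dummies.columns))

print('Only in train:', only_in_train)
print('Only in test:', only_in_test)
```

Any column listed in either result would need to be dropped (or added as all zeros) before a model fitted on one frame can predict on the other.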

In [7]:
lr_1 = LinearRegression()
lr_1.fit(X_train, y_train)
lr_1.score(X_train, y_train), lr_1.score(X_test, y_test)

(0.8146497291776744, 0.8102039753657923)
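
The two values above are R² scores on the training split and the held-out split: `LinearRegression.score` returns the coefficient of determination. The sketch below reproduces that number by hand on synthetic data (`X_demo` and `y_demo` are made up for illustration) to show what the score measures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only; not the housing data.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X_demo, y_demo)
preds = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_demo - preds) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))
```

A training score close to the held-out score, as in the output above, suggests the model is not heavily overfit to the training split.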

In [8]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 65 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   LotFrontage           2051 non-null   float64
 1   Year Built            2051 non-null   int64  
 2   BsmtFin SF 1          2051 non-null   float64
 3   SF                    2051 non-null   int64  
 4   TotRms AbvGrd         2051 non-null   int64  
 5   Garage Cars           2051 non-null   float64
 6   Garage Area           2051 non-null   float64
 7   outdoorSF             2051 non-null   int64  
 8   MS Zoning_FV          2051 non-null   uint8  
 9   MS Zoning_I (all)     2051 non-null   uint8  
 10  MS Zoning_RH          2051 non-null   uint8  
 11  MS Zoning_RL          2051 non-null   uint8  
 12  MS Zoning_RM          2051 non-null   uint8  
 13  Utilities_NoSewr      2051 non-null   uint8  
 14  Neighborhood_Blueste  2051 non-null   uint8  
 15  Neighborhood_BrDale  

In [9]:
# Rebuild X from the test data. Train and test contain different category levels, so drop the dummy column that has no counterpart in train

In [10]:
X = test[features]
X = pd.get_dummies(X, columns=['MS Zoning', 
                               'Utilities', 
                               'Neighborhood', 
                               'Mas Vnr Type', 
                               'Foundation', 
                               'Heating', 
                               'Central Air', 
                               'Garage Type', 
                               'Garage Cond', 
                               'pool', 
                               'cond_norm', 
                               'cond_pos'], drop_first=True)
X.drop(columns=['Mas Vnr Type_CBlock'], inplace=True)
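
Dropping mismatched columns by hand works when the differences are known in advance. An alternative, shown here only as a sketch on hypothetical toy matrices (`X_train_demo`, `X_new_demo`), is to reindex the new data to the training columns. This also guarantees that the column order matches what the model was fitted on.

```python
import pandas as pd

# Hypothetical toy design matrices; the real ones come from pd.get_dummies.
X_train_demo = pd.DataFrame({
    'SF': [1500, 900],
    'Neighborhood_BrDale': [0, 1],
    'Neighborhood_NAmes': [1, 0],
})
X_new_demo = pd.DataFrame({
    'Neighborhood_NAmes': [1],
    'Neighborhood_Veenker': [0],  # level never seen in training
    'SF': [1400],
})

# Keep exactly the training columns, in training order:
# columns missing from the new data are filled with 0, extra columns are dropped.
X_aligned = X_new_demo.reindex(columns=X_train_demo.columns, fill_value=0)

print(X_aligned)
```

Filling with 0 is appropriate for dummy columns, where 0 means "not this category". It would not be a sensible default for a missing numeric feature.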

In [11]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 878 entries, 0 to 877
Data columns (total 65 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   LotFrontage           878 non-null    float64
 1   Year Built            878 non-null    int64  
 2   BsmtFin SF 1          878 non-null    int64  
 3   SF                    878 non-null    int64  
 4   TotRms AbvGrd         878 non-null    int64  
 5   Garage Cars           878 non-null    int64  
 6   Garage Area           878 non-null    int64  
 7   outdoorSF             878 non-null    int64  
 8   MS Zoning_FV          878 non-null    uint8  
 9   MS Zoning_I (all)     878 non-null    uint8  
 10  MS Zoning_RH          878 non-null    uint8  
 11  MS Zoning_RL          878 non-null    uint8  
 12  MS Zoning_RM          878 non-null    uint8  
 13  Utilities_NoSewr      878 non-null    uint8  
 14  Neighborhood_Blueste  878 non-null    uint8  
 15  Neighborhood_BrDale   8

In [12]:
# Generate predicted values for test data and export Kaggle Submission 1

In [13]:
kaggle_preds = lr_1.predict(X)

In [14]:
test.columns = [column.replace(' ', '_').lower() for column in test.columns]

In [15]:
test['saleprice'] = kaggle_preds

In [16]:
test[['id','saleprice']]

|     | id   | saleprice     |
|-----|------|---------------|
| 0   | 2658 | 141249.649573 |
| 1   | 2718 | 201656.892487 |
| 2   | 2414 | 192113.054592 |
| 3   | 1989 | 102145.967192 |
| 4   | 625  | 169399.433652 |
| ... | ...  | ...           |
| 873 | 1662 | 238704.111399 |
| 874 | 1234 | 209321.241136 |
| 875 | 1373 | 124622.378600 |
| 876 | 1672 | 133176.112644 |


In [17]:
test[['id','saleprice']].to_csv('../submissions/kaggle_sub_model_1.csv', index=False)

## Model 2

In [18]:
# Leave out the multi-level categorical variables to allow regularization in later approaches
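
The notebook does not show the regularized models themselves. As a rough sketch of one form such a later approach could take, the snippet below fits a ridge regression on standardized numeric features. The data is synthetic, and the choices of `Ridge` and `alpha=1.0` are assumptions made for illustration, not the author's method.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features on very different scales (illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([1, 10, 100, 1000, 5])
y_demo = X_demo @ np.array([2.0, 0.5, 0.1, 0.01, 1.0]) + rng.normal(size=200)

# Standardize first so the penalty treats every coefficient on the same scale.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X_demo, y_demo)
score = ridge.score(X_demo, y_demo)

print(score)
```

Scaling matters here because the ridge penalty depends on coefficient size, which in turn depends on the units of each feature.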

In [19]:
features2 = ['pool', 
             'cond_norm', 
             'cond_pos', 
             'LotFrontage', 
             'Year Built', 
             'BsmtFin SF 1', 
             'SF', 
             'TotRms AbvGrd', 
             'Garage Cars', 
             'Garage Area', 
             'outdoorSF']

In [20]:
X = train[features2]
y = train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
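
Because no `test_size` is passed, `train_test_split` falls back to its default and holds out 25% of the rows. A small sketch with placeholder arrays of the same length as the training data (`X_demo` and `y_demo` are made up) shows the resulting split sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the training data (2051).
X_demo = np.arange(2051 * 2).reshape(2051, 2)
y_demo = np.arange(2051)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Roughly three quarters of the rows go to training, the rest to validation.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` keeps the split identical across the three models, so their validation scores are computed on the same rows.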

In [21]:
lr_1 = LinearRegression()
lr_1.fit(X_train, y_train)
lr_1.score(X_train, y_train), lr_1.score(X_test, y_test)

(0.7165250436474981, 0.7668436248221406)

In [22]:
test = pd.read_csv('../data/test_clean.csv', keep_default_na=False, na_values=[''])

In [23]:
X = test[features2]

In [24]:
# Generate predicted values for test data and export Kaggle Submission 2

In [25]:
kaggle_preds = lr_1.predict(X)

In [26]:
test.columns = [column.replace(' ', '_').lower() for column in test.columns]

In [27]:
test['saleprice'] = kaggle_preds

In [28]:
kaggle_sub_data = test[['id','saleprice']]

In [29]:
kaggle_sub_data

|     | id   | saleprice     |
|-----|------|---------------|
| 0   | 2658 | 150562.384981 |
| 1   | 2718 | 218813.820755 |
| 2   | 2414 | 208208.607552 |
| 3   | 1989 | 98401.720816  |
| 4   | 625  | 176280.490586 |
| ... | ...  | ...           |
| 873 | 1662 | 257226.404002 |
| 874 | 1234 | 227648.425302 |
| 875 | 1373 | 120268.649577 |
| 876 | 1672 | 141326.494870 |


In [30]:
kaggle_sub_data[['id','saleprice']].to_csv('../submissions/kaggle_sub_model_2.csv', index=False)

## Model 3

In [31]:
# Eliminate collinear features and features that validation showed to be less valuable
# Remove Neighborhood dummies that do not appear in both train and test
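
The notebook does not show how the collinear features were identified. One common way to spot them, sketched here on synthetic data, is to look for feature pairs with a high absolute correlation. The column names echo the earlier feature list, but the values and the 0.8 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: garage area is built to track the number of garage cars.
rng = np.random.default_rng(1)
garage_cars = rng.integers(0, 4, size=200)
demo = pd.DataFrame({
    'Garage Cars': garage_cars,
    'Garage Area': garage_cars * 250 + rng.normal(scale=40, size=200),
    'LotFrontage': rng.normal(loc=70, scale=20, size=200),
})

corr = demo.corr().abs()

# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs > 0.8]

print(high_pairs)
```

When two features are this strongly correlated, keeping one of them or combining them into a single feature usually makes the coefficients more stable.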

In [32]:
features = ['Neighborhood', 
            'foundations', 
            'central_air', 
            'garage_index', 
            'pool', 'cond_norm', 
            'cond_pos', 
            'LotFrontage', 
            'Year Built', 
            'BsmtFin SF 1',
            'SF', 
            'outdoorSF']

In [33]:
X = train[features]
X = pd.get_dummies(X, columns=['Neighborhood'], drop_first=True)
X.drop(columns=['Neighborhood_GrnHill','Neighborhood_Landmrk'], inplace=True)
y = train['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [34]:
lr_1 = LinearRegression()
lr_1.fit(X_train, y_train)
lr_1.score(X_train, y_train), lr_1.score(X_test, y_test)

(0.8006867274665588, 0.8172411078068521)

In [35]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 36 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   foundations           2051 non-null   int64  
 1   central_air           2051 non-null   int64  
 2   garage_index          2051 non-null   float64
 3   pool                  2051 non-null   int64  
 4   cond_norm             2051 non-null   int64  
 5   cond_pos              2051 non-null   int64  
 6   LotFrontage           2051 non-null   float64
 7   Year Built            2051 non-null   int64  
 8   BsmtFin SF 1          2051 non-null   float64
 9   SF                    2051 non-null   int64  
 10  outdoorSF             2051 non-null   int64  
 11  Neighborhood_Blueste  2051 non-null   uint8  
 12  Neighborhood_BrDale   2051 non-null   uint8  
 13  Neighborhood_BrkSide  2051 non-null   uint8  
 14  Neighborhood_ClearCr  2051 non-null   uint8  
 15  Neighborhood_CollgCr 

In [36]:
# Generate predicted values for test data and export Kaggle Submission 3

In [37]:
test = pd.read_csv('../data/test_clean.csv', keep_default_na=False, na_values=[''])

In [38]:
X = test[features]
X = pd.get_dummies(X, columns=['Neighborhood'], drop_first=True)

In [39]:
kaggle_preds = lr_1.predict(X)

In [40]:
test.columns = [column.replace(' ', '_').lower() for column in test.columns]

In [41]:
test['saleprice'] = kaggle_preds

In [42]:
test[['id','saleprice']]

|     | id   | saleprice     |
|-----|------|---------------|
| 0   | 2658 | 131216.773153 |
| 1   | 2718 | 186819.777000 |
| 2   | 2414 | 191675.665500 |
| 3   | 1989 | 93167.097635  |
| 4   | 625  | 163122.040560 |
| ... | ...  | ...           |
| 873 | 1662 | 243880.101951 |
| 874 | 1234 | 205378.185587 |
| 875 | 1373 | 119148.855029 |
| 876 | 1672 | 137313.049936 |


In [44]:
test[['id','saleprice']].to_csv('../submissions/kaggle_sub_model_3.csv', index=False)