# Kaggle: Intermediate Machine Learning
### Introduction
Load the training and validation features in `X_train` and `X_valid`, along with the prediction targets in `y_train` and `y_valid`.  The test features are loaded in `X_test`.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [2]:
# Read the data
X_full = pd.read_csv('../Intro-to-ML/home-data-for-ml-course/train.csv', index_col='Id')
X_test_full = pd.read_csv('../Intro-to-ML/home-data-for-ml-course/test.csv', index_col='Id')

# Obtain target and features
y = X_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [3]:
# Looking at first few rows of data for the train set
X_train.head()

Id,LotArea,YearBuilt,1stFlrSF,2ndFlrSF,FullBath,BedroomAbvGr,TotRmsAbvGrd
619,11694,2007,1828,0,2,3,9
871,6600,1962,894,0,1,2,5
93,13360,1921,964,0,1,2,5
818,13265,2002,1689,0,2,3,7
303,13704,2001,1541,0,2,3,6


We define five Random Forest models with different hyperparameters.

In [4]:
# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]

To select the best model out of the five, we define a function `score_model()` below.  This function returns the mean absolute error (MAE) from the validation set.  Recall that the best model will obtain the lowest MAE.

In [5]:
# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

model_runs = {}
for i, model in enumerate(models, start=1):
    mae = score_model(model)
    model_runs[f'model_{i}'] = [mae, model]
    print(f"Model {i} \t MAE: {mae}")

Model 1 	 MAE: 24015.492818003917
Model 2 	 MAE: 23740.979228636657
Model 3 	 MAE: 23528.78421232877
Model 4 	 MAE: 23996.676789668687
Model 5 	 MAE: 23706.672864217904
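As a quick illustration of what the MAE measures, the sketch below computes it by hand and compares the result with `mean_absolute_error`. The prices and predictions are made-up values for illustration only, not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices and predictions (illustrative values only)
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# MAE is the mean of the absolute differences between truth and prediction
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Both values agree, which is why a lower MAE means the predictions are, on average, closer to the true prices.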


### Evaluate and select best model

In [6]:
# Pick the [mae, model] pair with the lowest MAE (compare on the MAE only)
best_model = min(model_runs.values(), key=lambda run: run[0])
best_model

[23528.78421232877,
 RandomForestRegressor(criterion='absolute_error', random_state=0)]

### Generate test predictions
Create a Random Forest model with the variable name `my_model`.

In [7]:
# Create Random Forest model from best_model
my_model = best_model[1]
my_model

RandomForestRegressor(criterion='absolute_error', random_state=0)

Fit the model to the full training data (training and validation sets combined), and generate predictions on the test set saved as `X_test`.

In [8]:
# Fit model to the full training data (training + validation rows)
my_model.fit(X, y)

# Generate predictions
preds_test = my_model.predict(X_test)

# Save predictions in the format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

In [9]:
output

,Id,SalePrice
0,1461,119433.08
1,1462,158367.50
2,1463,185351.21
3,1464,178343.12
4,1465,192898.29
...,...,...
1454,2915,86155.00
1455,2916,89050.00
1456,2917,156296.92
1457,2918,132232.50


### Missing Values

There are many ways data can end up with missing values. For example,
- A 2 bedroom house won't include a value for the size of a third bedroom.
- A survey respondent may choose not to share their income.

Many estimators, including most of those in scikit-learn, raise an error if you try to fit a model on data with missing values. So we need methods to deal with missing values before building the model.

In [10]:
# To keep things simple, we'll use only numerical predictors
# (drop() returns a new DataFrame, so we avoid an in-place edit on a derived frame)
X = X_full.select_dtypes(exclude=['object']).drop(columns=['SalePrice'])
X_test = X_test_full.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [11]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

(1168, 36)
LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64


In [12]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
cols_with_missing

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [13]:
# Number of rows in the training data
num_rows = X_train.shape[0]
print(num_rows)

# Number of columns in the training data that have missing values
num_cols_with_missing = len(cols_with_missing)
print(num_cols_with_missing)

# Total number of missing entries in the training data
tot_missing = X_train.isnull().sum().sum()
print(tot_missing)

1168
3
276


To compare different approaches to dealing with missing values, you'll use the same `score_dataset()` function from the tutorial.  This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.

In [14]:
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

### Approach 1: Drop columns with missing values
Preprocess the data in `X_train` and `X_valid` to remove columns with missing values.  Set the preprocessed DataFrames to `reduced_X_train` and `reduced_X_valid`, respectively.

In [15]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop those columns from both the training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# Confirm that no missing values remain in the training data
missing_vals = (reduced_X_train.isnull().sum())
print(missing_vals[missing_vals > 0])

Series([], dtype: int64)


Get MAE for this approach

In [16]:
print("MAE (Drop columns with missing values):",
      score_dataset(X_train=reduced_X_train, X_valid=reduced_X_valid, y_train=y_train, y_valid=y_valid))

MAE (Drop columns with missing values): 17837.82570776256
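One detail of this approach is worth seeing on a small example: the columns to drop are chosen from the training data only. The sketch below uses two hypothetical toy DataFrames (`train` and `valid` are made up for illustration) to show that the validation data can still contain missing values afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" has a gap in training,
# column "a" has a gap only in validation
train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [np.nan, 5.0, 6.0]})
valid = pd.DataFrame({"a": [4.0, np.nan], "b": [7.0, 8.0]})

# Columns are selected by looking at the training data only
cols_with_missing = [c for c in train.columns if train[c].isnull().any()]

reduced_train = train.drop(columns=cols_with_missing)
reduced_valid = valid.drop(columns=cols_with_missing)

print(cols_with_missing)
print(reduced_valid.isnull().sum())
```

If the validation set has gaps in columns that are complete in training, checking `reduced_X_valid` the same way as `reduced_X_train` would reveal them.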


### Approach 2: Imputation
Impute missing values with the mean value along each column.  Set the preprocessed DataFrames to `imputed_X_train` and `imputed_X_valid`.  Make sure that the column names match those in `X_train` and `X_valid`.

In [17]:
# Impute missing values with the column mean (SimpleImputer's default strategy)
imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

Get MAE for this approach

In [18]:
print("MAE (Imputation):",
      score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

MAE (Imputation): 18062.894611872147
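To make the mechanics of mean imputation concrete, here is a minimal sketch on hypothetical toy data (the values below are invented, only the column names are borrowed from the housing data). The imputer learns the column means from the training data and reuses them for the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing LotFrontage value in each set
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0],
                      "LotArea": [8000.0, 9000.0, 10000.0]})
valid = pd.DataFrame({"LotFrontage": [np.nan], "LotArea": [7000.0]})

# Fit on training data only, then apply the learned means to both sets
imputer = SimpleImputer(strategy="mean")
imputed_train = pd.DataFrame(imputer.fit_transform(train),
                             columns=train.columns, index=train.index)
imputed_valid = pd.DataFrame(imputer.transform(valid),
                             columns=valid.columns, index=valid.index)

print(imputed_train)
print(imputed_valid)
```

Passing `columns` and `index` to the `DataFrame` constructor restores both in one step, as an alternative to reassigning `.columns` afterwards.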


### Categorical Variables
There's a lot of non-numeric data out there. Here's how to use it for machine learning.

In [19]:
# Load the data
X = pd.read_csv('../Intro-to-ML/home-data-for-ml-course/train.csv', index_col='Id')
X_test = pd.read_csv('../Intro-to-ML/home-data-for-ml-course/test.csv', index_col='Id')

# Remove rows with missing target, separate target from features
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X['SalePrice']
X.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll drop the columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)

# Train/Validation split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [20]:
X_train.head()

Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
619,20,RL,11694,Pave,Reg,Lvl,AllPub,Inside,Gtl,NridgHt,...,108,0,0,260,0,0,7,2007,New,Partial
871,20,RL,6600,Pave,Reg,Lvl,AllPub,Inside,Gtl,NAmes,...,0,0,0,0,0,0,8,2009,WD,Normal
93,30,RL,13360,Pave,IR1,HLS,AllPub,Inside,Gtl,Crawfor,...,0,44,0,0,0,0,8,2009,WD,Normal
818,20,RL,13265,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Mitchel,...,59,0,0,0,0,0,7,2008,WD,Normal
303,20,RL,13704,Pave,IR1,Lvl,AllPub,Corner,Gtl,CollgCr,...,81,0,0,0,0,0,1,2006,WD,Normal


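To compare the different approaches, I'll use a `score_model()` function with the same body as `score_dataset()` above.  This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.  Note that it redefines the `score_model()` helper from the introduction with a different signature.

In [21]:
# Function to compare different approaches to categorical data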
def score_model(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

### Approach 1: Drop columns with categorical data

In [22]:
# Select columns with categorical data
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()

# Drop columns
drop_X_train = X_train.drop(categorical_cols, axis=1)
drop_X_valid = X_valid.drop(categorical_cols, axis=1)

In [23]:
# Get MAE for this approach
print('MAE (Drop categorical variables):',
      score_model(drop_X_train, drop_X_valid, y_train, y_valid))

MAE (Drop categorical variables): 17837.82570776256


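Before jumping into ordinal encoding, we'll investigate the dataset.  Specifically, we'll look at the `'Condition2'` column.  The code cell below prints the unique entries in both the training and validation sets.  Note that the validation data contains values (`'RRAn'` and `'RRNn'`) that never appear in the training data, so an ordinal encoder fitted on the training data alone would raise an error on them.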

In [24]:
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())

Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']

Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']
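The problem with unseen categories can be reproduced on a small scale. The sketch below uses hypothetical toy frames and shows two things: the default `OrdinalEncoder` raises a `ValueError` on a category it has not seen, and, as an alternative to dropping such columns (the approach taken in this notebook), `handle_unknown='use_encoded_value'` maps unseen categories to a sentinel. This assumes scikit-learn 0.24 or later; the sentinel `-1` is an arbitrary choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: 'RRAn' appears only in the validation set
train = pd.DataFrame({"Condition2": ["Norm", "Feedr", "Artery"]})
valid = pd.DataFrame({"Condition2": ["Norm", "RRAn"]})

# Default behaviour: unseen categories raise an error at transform time
strict = OrdinalEncoder()
strict.fit(train)
try:
    strict.transform(valid)
    raised = False
except ValueError:
    raised = True

# Alternative: encode unseen categories with a sentinel value
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tolerant.fit(train)
encoded = tolerant.transform(valid)

print(raised)
print(encoded)
```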


### Approach 2: Ordinal Encoding

In [25]:
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if set(X_valid[col]).issubset(set(X_train[col]))]

# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols) - set(good_label_cols))

print("Categorical columns that will be ordinal encoded:", good_label_cols)
print("\nCategorical columns that will be dropped:", bad_label_cols)

Categorical columns that will be ordinal encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'BldgType', 'HouseStyle', 'RoofStyle', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped: ['RoofMatl', 'Functional', 'Condition2']


Ordinal encode the data in `X_train` and `X_valid`.  Set the preprocessed DataFrames to `label_X_train` and `label_X_valid`, respectively.

In [26]:
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)

# Apply ordinal encoder
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])

In [27]:
label_X_train[good_label_cols].head()

Id,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,BldgType,...,ExterQual,ExterCond,Foundation,Heating,HeatingQC,CentralAir,KitchenQual,PavedDrive,SaleType,SaleCondition
619,3.0,1.0,3.0,3.0,0.0,4.0,0.0,16.0,2.0,0.0,...,0.0,4.0,2.0,1.0,0.0,1.0,2.0,2.0,6.0,5.0
871,3.0,1.0,3.0,3.0,0.0,4.0,0.0,12.0,4.0,0.0,...,3.0,4.0,1.0,1.0,2.0,0.0,3.0,2.0,8.0,4.0
93,3.0,1.0,0.0,1.0,0.0,4.0,0.0,6.0,2.0,0.0,...,3.0,2.0,0.0,1.0,0.0,1.0,3.0,2.0,8.0,4.0
818,3.0,1.0,0.0,3.0,0.0,1.0,0.0,11.0,2.0,0.0,...,2.0,4.0,2.0,1.0,0.0,1.0,2.0,2.0,8.0,4.0
303,3.0,1.0,0.0,3.0,0.0,0.0,0.0,5.0,2.0,0.0,...,2.0,4.0,2.0,1.0,0.0,1.0,2.0,2.0,8.0,4.0


In [28]:
print('MAE (Ordinal Encoder):',
      score_model(label_X_train, label_X_valid, y_train, y_valid))

MAE (Ordinal Encoder): 17098.01649543379


In [29]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

[('Street', 2),
 ('Utilities', 2),
 ('CentralAir', 2),
 ('LandSlope', 3),
 ('PavedDrive', 3),
 ('LotShape', 4),
 ('LandContour', 4),
 ('ExterQual', 4),
 ('KitchenQual', 4),
 ('MSZoning', 5),
 ('LotConfig', 5),
 ('BldgType', 5),
 ('ExterCond', 5),
 ('HeatingQC', 5),
 ('Condition2', 6),
 ('RoofStyle', 6),
 ('Foundation', 6),
 ('Heating', 6),
 ('Functional', 6),
 ('SaleCondition', 6),
 ('RoofMatl', 7),
 ('HouseStyle', 8),
 ('Condition1', 9),
 ('SaleType', 9),
 ('Exterior1st', 15),
 ('Exterior2nd', 16),
 ('Neighborhood', 25)]

### Investigate Cardinality
The output above shows, for each column with categorical data, the number of unique values in the column.  For instance, the `'Street'` column in the training data has two unique values: `'Grvl'` and `'Pave'`, corresponding to a gravel road and a paved road, respectively.

We refer to the number of unique entries of a categorical variable as the **cardinality** of that categorical variable.  For instance, the `'Street'` variable has cardinality 2.
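As a minimal sketch of the idea, the snippet below computes the cardinality of each column in a hypothetical toy DataFrame (the rows are invented for illustration). Since one-hot encoding creates one column per unique value, summing the cardinalities gives the number of columns a full one-hot encoding would produce.

```python
import pandas as pd

# Hypothetical toy data (not the housing data)
df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2", "Reg"],
})

# Cardinality = number of unique entries per column
cardinality = df.nunique().to_dict()

# One-hot encoding creates one column per unique value
total_one_hot_cols = sum(cardinality.values())

print(cardinality)
print(total_one_hot_cols)
```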

In [30]:
# Columns with cardinality greater than 10
high_cardinality_cols = [k for k,v in d.items() if v > 10]
print(high_cardinality_cols)

# How many columns are needed to one-hot encode the 'Neighborhood' variable in the training data?
num_cols_neighborhood = d['Neighborhood']
print(num_cols_neighborhood)

['Neighborhood', 'Exterior1st', 'Exterior2nd']
25


Next, we use **one-hot encoding**.  Instead of encoding every categorical variable in the dataset, we only one-hot encode columns with cardinality less than 10, since each high-cardinality column would add many new columns to the dataset.

In [31]:
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols) - set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Exterior1st', 'Neighborhood', 'Exterior2nd']


In [32]:
# One-hot encode the low-cardinality columns
# Note: `sparse_output` replaces the older `sparse` argument in scikit-learn >= 1.2
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all column names are strings (newer scikit-learn rejects mixed types)
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

In [33]:
print('MAE (One-Hot Encoding):',
      score_model(OH_X_train, OH_X_valid, y_train, y_valid))



MAE (One-Hot Encoding): 17525.345719178084




### Submit results

In [34]:
def load_data(train_path, test_path, target):
    """
    Function that does the following:
    1. Loads the train and test datasets
    2. Removes training rows with a missing target
    3. Concatenates train and test so they can be preprocessed together
    4. Separates the target from the features

    Returns the combined features, the target (NaN for test rows),
    and the original test set.
    """
    # Load the data
    X_train = pd.read_csv(train_path, index_col='Id')
    X_test = pd.read_csv(test_path, index_col='Id')

    # Remove training rows with missing target
    X_train.dropna(axis=0, subset=[target], inplace=True)

    # Concat datasets
    X = pd.concat([X_train, X_test], axis=0)

    # Separate target from features
    y = X[target]
    X.drop([target], axis=1, inplace=True)

    # Return combined features, target, and the original test set
    return X, y, X_test

In [35]:
# Data path and target variable
train_path = '../Intro-to-ML/home-data-for-ml-course/train.csv'
test_path = '../Intro-to-ML/home-data-for-ml-course/test.csv'
target = 'SalePrice'

# Load data
X, y, X_test = load_data(train_path, test_path, target)

# Shape
print(X.shape)
print(X_test.shape)
print(y.shape)

(2919, 79)
(1459, 79)
(2919,)


In [36]:
# Fill missing values with 0 in columns that have fewer than 10 missing entries
missing_vals = X.isnull().sum()
cols_fill_null = missing_vals[missing_vals < 10].index.tolist()
X[cols_fill_null] = X[cols_fill_null].fillna(0)

# Drop the remaining columns that still have missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)

In [37]:
print(X.shape)

(2919, 61)


In [38]:
features = X.select_dtypes('number').columns
cat_features = X.select_dtypes('object').columns

In [39]:
# Categorical features
cat_features

Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'ExterQual', 'ExterCond', 'Foundation', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional',
       'PavedDrive', 'SaleType', 'SaleCondition'],
      dtype='object')

In [40]:
missing_val_count_by_column = X[features].isnull().sum()
print(f"Missing numerical features: {missing_val_count_by_column[missing_val_count_by_column > 0]}")

missing_val_count_by_column = X[cat_features].isnull().sum()
print(f"\nMissing categorical features: {missing_val_count_by_column[missing_val_count_by_column > 0]}")

Missing numerical features: Series([], dtype: int64)

Missing categorical features: Series([], dtype: int64)


In [41]:
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in cat_features if X[col].nunique() < 10]

# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(cat_features) - set(low_cardinality_cols))

print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)

Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleCondition']

Categorical columns that will be dropped from the dataset: ['Exterior1st', 'SaleType', 'Neighborhood', 'Exterior2nd']


In [42]:
# Drop high-cardinality categorical columns before one-hot encoding
X.drop(high_cardinality_cols, axis=1, inplace=True)

In [43]:
# Convert all categorical columns to string datatypes
X[low_cardinality_cols] = X[low_cardinality_cols].astype(str)

In [44]:
def OH_encoding(data):
    # One-hot encode the low-cardinality columns (uses the global low_cardinality_cols)
    # Note: `sparse_output` replaces the older `sparse` argument in scikit-learn >= 1.2
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    OH_cols = pd.DataFrame(OH_encoder.fit_transform(data[low_cardinality_cols]))

    # One-hot encoding removed index; put it back
    OH_cols.index = data.index

    # Remove categorical columns (will replace with one-hot encoding)
    num_data = data.drop(low_cardinality_cols, axis=1)

    # Add one-hot encoded columns to numerical features
    X = pd.concat([num_data, OH_cols], axis=1)

    # Ensure all column names are strings (newer scikit-learn rejects mixed types)
    X.columns = X.columns.astype(str)
    return X

In [45]:
# Perform One-Hot Encoding
X = OH_encoding(X)
X.shape

(2919, 160)

In [46]:
# Concat target variable
X = pd.concat([X, y], axis=1)

# Split the combined DataFrame back into train and test sets
dtrain = X[X['SalePrice'].notna()]
dtest = X[X['SalePrice'].isna()]

In [47]:
print(dtrain.shape)
print(dtest.shape)

(1460, 161)
(1459, 161)
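The split above relies on the target being present for every training row and missing for every test row. A minimal sketch of the same pattern on hypothetical toy data (the frames and values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data: the test set has no SalePrice column
train = pd.DataFrame({"LotArea": [8000, 9000],
                      "SalePrice": [200000.0, 150000.0]}, index=[1, 2])
test = pd.DataFrame({"LotArea": [7000]}, index=[3])

# Concatenating fills SalePrice with NaN for the test rows
combined = pd.concat([train, test], axis=0)

# Rows with a target are training rows; rows without are test rows
is_train = combined["SalePrice"].notna()
train_back = combined[is_train]
test_back = combined[~is_train].drop(columns="SalePrice")

print(train_back)
print(test_back)
```

This only works because training rows with a missing target were removed before concatenation; otherwise they would be misclassified as test rows.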


In [48]:
y = dtrain['SalePrice']
X = dtrain.drop('SalePrice', axis=1)

test = dtest.drop('SalePrice', axis=1)

In [49]:
# Fit model to the training set
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Generate predictions
preds_test = model.predict(test)

# Save predictions in the format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)



In [50]:
output

,Id,SalePrice
0,1461,127045.50
1,1462,155456.25
2,1463,173272.20
3,1464,181311.40
4,1465,200091.56
...,...,...
1454,2915,85464.50
1455,2916,84751.11
1456,2917,159820.87
1457,2918,112961.00
