Now it's your turn to test your knowledge of **missing values** handling.


In [1]:
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex2 import *
print("Setup Complete")

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# To keep things simple, we'll use only numerical predictors
X = X_full.select_dtypes(exclude=['object'])
X_test = X_test_full.select_dtypes(exclude=['object'])

# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)

Use the next code cell to print the first five rows of the data.

In [3]:
X_train.head()

You can already see a few missing values in the first several rows.

# Step 1: Preliminary investigation

In [48]:
# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])
#print(missing_val_count_by_column)

### Part A



In [19]:
# Fill in the line below: How many rows are in the training data?
num_rows = X_train.shape[0]

# Count the missing entries in each column once and reuse the result below
missing_by_column = X_train.isnull().sum()

# Fill in the line below: How many columns in the training data
# have missing values?
num_cols_with_missing = len(missing_by_column[missing_by_column > 0])

# Fill in the line below: How many missing entries are contained in
# all of the training data?
tot_missing = missing_by_column.sum()

# Check your answers
step_1.a.check()

### Part B

Since there are relatively few missing entries in the data (the column with the greatest percentage of missing values is missing less than 20% of its entries), we can expect that dropping columns is unlikely to yield good results. This is because we'd be throwing away a lot of valuable data, and so imputation will likely perform better.
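The claim about the percentage of missing values per column can be checked directly. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from the housing data); the same two lines would apply to `X_train`.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not the housing dataset)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0, np.nan],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0, 2001.0],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# isnull() marks missing cells; the mean of a boolean column is the
# fraction of True values, so multiplying by 100 gives a percentage.
pct_missing = toy.isnull().mean() * 100

# Keep only the columns that have missing entries, largest share first
pct_missing = pct_missing[pct_missing > 0].sort_values(ascending=False)
print(pct_missing)
```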

To compare different approaches to dealing with missing values, you'll use the same `score_dataset()` function from the tutorial.  This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.

In [23]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
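
For reference, the MAE returned by `score_dataset()` is the average of the absolute differences between predicted and actual values. A minimal sketch with made-up prices (not model output) shows that a manual calculation agrees with `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up sale prices and predictions, for illustration only
y_true = np.array([200000, 150000, 320000])
y_pred = np.array([210000, 140000, 300000])

# MAE by hand: mean of |actual - predicted|
mae_manual = np.mean(np.abs(y_true - y_pred))

# Same quantity via scikit-learn
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```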

# Step 2: Drop columns with missing values

In this step, you'll preprocess the data in `X_train` and `X_valid` to remove columns with missing values.  Set the preprocessed DataFrames to `reduced_X_train` and `reduced_X_valid`, respectively.

In [28]:
# Fill in the line below: get names of columns with missing values
# (decided from the training data only, so both sets lose the same columns)
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Fill in the lines below: drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

# Check your answers
step_2.check()

Run the next code cell without changes to obtain the MAE for this approach.

In [29]:
print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
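
When dropping columns, the training and validation data need to end up with the same columns, otherwise the model is asked to predict on features it was not fitted on. The sketch below, on made-up data, picks the columns from the training set and drops them from both sets; the names `train` and `valid` are hypothetical stand-ins for `X_train` and `X_valid`.

```python
import numpy as np
import pandas as pd

# Toy data: only the training set has a missing value, in column "a"
train = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [4.0, 5.0, 6.0],
                      "c": [7.0, 8.0, 9.0]})
valid = pd.DataFrame({"a": [1.0, 2.0],
                      "b": [3.0, 4.0],
                      "c": [5.0, 6.0]})

# Decide which columns to drop by looking at the training data only
cols_with_missing = [col for col in train.columns if train[col].isnull().any()]

# Drop the same columns from both sets so their features line up
reduced_train = train.drop(columns=cols_with_missing)
reduced_valid = valid.drop(columns=cols_with_missing)

print(list(reduced_train.columns), list(reduced_valid.columns))
```

Had the columns been selected separately for each set, `valid` would have kept column `a` while `train` lost it.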

# Step 3: Imputation

### Part A

Use the next code cell to impute missing values with the mean value along each column.  Set the preprocessed DataFrames to `imputed_X_train` and `imputed_X_valid`.  Make sure that the column names match those in `X_train` and `X_valid`.

In [36]:
from sklearn.impute import SimpleImputer

# Fill in the lines below: imputation
imp = SimpleImputer()  # the default strategy is "mean"
imputed_X_train = pd.DataFrame(imp.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(imp.transform(X_valid))

# Fill in the lines below: imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

# Check your answers
step_3.a.check()

Run the next code cell without changes to obtain the MAE for this approach.

In [37]:
print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
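
By default, `SimpleImputer` returns a NumPy array, which is why the column names have to be restored afterwards. As an alternative sketch (not the exercise's required solution), the column names and the row index can be passed to the `pd.DataFrame` constructor in one step. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with a non-default index, similar in spirit to the 'Id' index
toy = pd.DataFrame(
    {"LotFrontage": [60.0, np.nan, 80.0],
     "LotArea": [8000.0, 9000.0, 10000.0]},
    index=[101, 205, 309],
)

imputer = SimpleImputer()  # default strategy is "mean"

# Rebuild the DataFrame with the original column names and index
imputed = pd.DataFrame(imputer.fit_transform(toy),
                       columns=toy.columns, index=toy.index)
print(imputed)
```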

### Part B

Compare the MAE from each approach.  

Given that there are so few missing values in the dataset, we'd expect imputation to perform better than dropping columns entirely. However, we see that dropping columns performs slightly better! While this can probably be attributed in part to noise in the dataset, another potential explanation is that the imputation method is not a great match for this dataset. That is, instead of filling in the mean value, it may make more sense to set every missing value to 0, to fill in the most frequently encountered value, or to use some other method.

For instance, consider the `GarageYrBlt` column (which indicates the year that the garage was built). It's likely that in some cases, a missing value indicates a house that does not have a garage. Does it make more sense to fill in the median value along each column in this case? Or could we get better results by filling in the minimum value along each column? It's not quite clear what's best here, but we can perhaps rule out some options immediately: for instance, setting missing values in this column to 0 is likely to yield horrible results!
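
To make the alternatives concrete, the sketch below applies several `SimpleImputer` strategies to a single made-up column of construction years (illustrative values, not the real `GarageYrBlt` data) and records what each one fills in for the missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One made-up column of years with a single missing entry (row 3)
years = np.array([[1950.0], [2000.0], [2000.0], [np.nan], [2010.0]])

filled = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaced the missing entry
    filled[strategy] = imputer.fit_transform(years)[3, 0]

# The "constant" strategy uses fill_value; 0 is a poor choice for a year
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
filled["constant"] = constant_imputer.fit_transform(years)[3, 0]

print(filled)
```

On this toy column, the mean is pulled down by the one old value, while the median and most frequent value coincide. Which strategy works best on the real data can only be settled by comparing validation MAE.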

# Step 4: Generate test predictions

In this final step, you'll use any approach of your choosing to deal with missing values.  Once you've preprocessed the training and validation features, you'll train and evaluate a random forest model.  Then, you'll preprocess the test data before generating predictions that can be submitted to the competition!

### Part A

Use the next code cell to preprocess the training and validation data.  Set the preprocessed DataFrames to `final_X_train` and `final_X_valid`.  **You can use any approach of your choosing here!**  In order for this step to be marked as correct, you need only ensure:
- the preprocessed DataFrames have the same number of columns,
- the preprocessed DataFrames have no missing values, 
- `final_X_train` and `y_train` have the same number of rows, and
- `final_X_valid` and `y_valid` have the same number of rows.

In [41]:
# Preprocessed training and validation features
imp = SimpleImputer(strategy="median")
final_X_train = pd.DataFrame(imp.fit_transform(X_train))
final_X_valid = pd.DataFrame(imp.transform(X_valid))

# Imputation removed column names; put them back
final_X_train.columns = X_train.columns
final_X_valid.columns = X_valid.columns

# Check your answers
step_4.a.check()
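
The four conditions for Part A can also be verified locally before calling the checker. The helper below, `check_preprocessed`, is a hypothetical function written for this sketch (it is not part of `learntools`), and it is shown on made-up data.

```python
import numpy as np
import pandas as pd

def check_preprocessed(final_train, final_valid, y_tr, y_va):
    """Hypothetical helper: assert the four Part A conditions."""
    assert final_train.shape[1] == final_valid.shape[1], "column counts differ"
    assert not final_train.isnull().any().any(), "training data has missing values"
    assert not final_valid.isnull().any().any(), "validation data has missing values"
    assert len(final_train) == len(y_tr), "training rows do not match y_train"
    assert len(final_valid) == len(y_va), "validation rows do not match y_valid"
    return True

# Toy stand-ins for final_X_train, final_X_valid, y_train, and y_valid
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
toy_valid = pd.DataFrame({"a": [7.0, 8.0], "b": [9.0, 10.0]})
toy_y_train = pd.Series([100, 200, 300])
toy_y_valid = pd.Series([400, 500])

print(check_preprocessed(toy_train, toy_valid, toy_y_train, toy_y_valid))
```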

Run the next code cell to train and evaluate a random forest model.

In [43]:
# Define and fit model
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_train, y_train)

# Get validation predictions and MAE
preds_valid = model.predict(final_X_valid)
print("MAE (Your approach):")
print(mean_absolute_error(y_valid, preds_valid))

### Part B

Use the next code cell to preprocess your test data.  Make sure that you use a method that agrees with how you preprocessed the training and validation data, and set the preprocessed test features to `final_X_test`.

Then, use the preprocessed test features and the trained model to generate test predictions in `preds_test`.

In order for this step to be marked correct, you need only ensure:
- the preprocessed test DataFrame has no missing values, and
- `final_X_test` has the same number of rows as `X_test`.

In [45]:
# Fill in the line below: preprocess test data
final_X_test = pd.DataFrame(imp.transform(X_test))
final_X_test.columns=X_test.columns

# Fill in the line below: get test predictions
preds_test = model.predict(final_X_test)

# Check your answers
step_4.b.check()

Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.

In [49]:
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)