
# **Lesson 02: Missing Values**

Missing values happen. Be prepared for this common challenge in real datasets.


## **Example: Melbourne Housing**


In this example, we work with the Melbourne Housing dataset. Our model uses information such as the number of rooms and land size to predict home prices.



In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data
data = pd.read_csv('/content/melb_data.csv')

# Select target
y = data.Price

# To keep things simple, we'll use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                      random_state=0)
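
Before choosing an approach, it can help to see which columns actually contain missing entries. The following is a minimal sketch on a hypothetical toy DataFrame (the column names and values are made up for illustration and are not taken from the real data); the same two lines could be applied to X_train.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the numerical predictors
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [150.0, np.nan, 300.0, 220.0],
    "BuildingArea": [np.nan, np.nan, 120.0, 95.0],
})

# Count missing entries per column, then keep only columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Columns that do not appear in the printed result have no missing entries in this toy frame.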

### **Define Function to Measure Quality of Each Approach**

We define a function score_dataset() to compare different approaches to dealing with missing values. It fits a random forest model on the training data and reports the mean absolute error (MAE) of its predictions on the validation data, so a lower score means a better approach.

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
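
To make the metric concrete, here is a small worked example of MAE on made-up numbers (the values are purely illustrative). It computes the mean of the absolute differences by hand and compares the result with mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Illustrative values only: three "true" prices and three predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE by hand: absolute errors are 10, 10 and 30, so the mean is 50 / 3
manual_mae = np.mean(np.abs(y_true - y_pred))

# The same quantity computed by scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because MAE is expressed in the same units as the target, a score from score_dataset() can be read as a typical prediction error in price units.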

### **Score from Approach 1 (Drop Columns with Missing Values)**

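Since the model is trained on one DataFrame and evaluated on another, both must end up with the same columns. We identify the columns with missing values in the training data and drop those same columns from both DataFrames.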

In [None]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))
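
One caveat of selecting columns from the training data alone is that a column can be complete in training yet still have gaps in validation. The sketch below illustrates this on hypothetical toy frames (columns a, b, c are made up) and shows one possible variation, checking both frames, which is an assumption on our part rather than the method used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames: "b" has a gap in training, "c" only in validation
toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0],
                          "b": [np.nan, 5.0, 6.0],
                          "c": [7.0, 8.0, 9.0]})
toy_valid = pd.DataFrame({"a": [1.0, 2.0],
                          "b": [3.0, 4.0],
                          "c": [np.nan, 6.0]})

# Same logic as above: pick columns using the training data only
cols_train_only = [col for col in toy_train.columns
                   if toy_train[col].isnull().any()]
reduced_valid = toy_valid.drop(cols_train_only, axis=1)

# Number of missing entries that survive in the reduced validation frame
remaining_missing = int(reduced_valid.isnull().sum().sum())

# Variation (an assumption, not the approach above): check both frames
cols_either = [col for col in toy_train.columns
               if toy_train[col].isnull().any() or toy_valid[col].isnull().any()]

print(cols_train_only, remaining_missing, cols_either)
```

Whether leftover gaps in the validation data cause a problem depends on the model and library version, so a quick check like remaining_missing is a cheap safeguard before scoring.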