## Intro
Notes on how to find and handle missing values and how to encode categorical data:
- detecting missing values
- keeping numeric columns only
- dealing with missing data (dropping columns, imputation)
- one-hot encoding

#### Detecting Missing Values
Python libraries represent missing numbers as NaN, which is short for "not a number".

You can detect which cells have missing values, and then count how many there are in each column with the command:

In [None]:
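# count the missing values in each column
missing_val_count_by_column = data.isnull().sum()
# show only the columns that have at least one missing value
print(missing_val_count_by_column[missing_val_count_by_column > 0])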
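
To make the counting step concrete, here is a minimal sketch on a toy DataFrame. The column names and values are made up for illustration and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [500.0, np.nan, 750.0, np.nan],
    "Suburb": ["A", "B", None, "C"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_val_count_by_column = data.isnull().sum()

# Keep only the columns that have at least one missing value.
columns_with_missing = missing_val_count_by_column[missing_val_count_by_column > 0]
print(columns_with_missing)
```

Note that isnull() treats both NaN and None as missing, so the text column is counted as well.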

---

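#### Keeping Numeric Columns Only

A simple first step is to separate the target from the predictors and keep only the non-text predictor columns: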

In [None]:
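# separate the target (Price) from the predictors
melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis=1)

# drop the text (object) columns and keep everything else
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])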
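
Excluding object columns keeps everything that is not text, which can include boolean or datetime columns. If strictly numeric columns are wanted, select_dtypes also accepts include='number'. The sketch below uses a made-up stand-in for melb_data; the columns other than Price are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for melb_data; only the Price column name comes from the notes.
melb_data = pd.DataFrame({
    "Price": [500000.0, 720000.0, 610000.0],
    "Rooms": [2, 3, 3],
    "Landsize": [120.0, 300.0, 210.5],
    "Suburb": ["Abbotsford", "Richmond", "Carlton"],
})

melb_target = melb_data.Price
melb_predictors = melb_data.drop(["Price"], axis=1)

# include="number" keeps integer and float columns and nothing else.
melb_numeric_predictors = melb_predictors.select_dtypes(include="number")
print(melb_numeric_predictors.columns.tolist())
```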

---

#### Dealing with Missing Data

1. Drop all the columns with missing values

In many cases, you'll have both a training dataset and a test dataset. You will want to drop the same columns in both DataFrames. In that case, you would write:

In [None]:
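# columns of the training data that contain at least one missing value
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]
# drop the same columns from both the training and the test data
reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)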
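
A small sketch of this approach on made-up data follows. It also shows one caveat: the list of columns is built from the training data only, so the test data can still contain missing values in columns that were complete in training.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test data; names and values are illustrative only.
original_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Landsize": [120.0, np.nan, 300.0],
    "Distance": [2.5, 7.0, 4.1],
})
test_data = pd.DataFrame({
    "Rooms": [3, 2],
    "Landsize": [200.0, 150.0],
    "Distance": [np.nan, 3.3],
})

# Columns with missing values are identified from the training data only.
cols_with_missing = [col for col in original_data.columns
                     if original_data[col].isnull().any()]

reduced_original_data = original_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)

# Both frames now share the same columns, but the test data is not
# guaranteed to be free of missing values.
print(reduced_test_data.isnull().sum())
```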

2. Imputation

Imputation fills in the missing value with some number. The imputed value won't be exactly right in most cases, but it usually gives more accurate models than dropping the column entirely.

In [None]:
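from sklearn.impute import SimpleImputer

# the default strategy replaces each missing value with the mean of its column
my_imputer = SimpleImputer()
# fit_transform returns a NumPy array, not a DataFrame
data_with_imputed_values = my_imputer.fit_transform(original_data)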
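
When there is a separate test dataset, a common pattern is to fit the imputer on the training data and reuse the fitted values on the test data. The sketch below assumes made-up train and test frames and wraps the results back into DataFrames so the column names are kept.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data; names and values are illustrative only.
train = pd.DataFrame({"Rooms": [2.0, 4.0, np.nan],
                      "Landsize": [100.0, np.nan, 300.0]})
test = pd.DataFrame({"Rooms": [np.nan, 3.0],
                     "Landsize": [250.0, np.nan]})

my_imputer = SimpleImputer()  # default strategy is the column mean

# fit_transform learns the column means from the training data and fills its gaps.
imputed_train = pd.DataFrame(my_imputer.fit_transform(train),
                             columns=train.columns, index=train.index)

# transform reuses the training means, so the test data is filled consistently.
imputed_test = pd.DataFrame(my_imputer.transform(test),
                            columns=test.columns, index=test.index)
print(imputed_test)
```

Here the missing Rooms value in the test data is filled with the training mean of Rooms, not with a value computed from the test data.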

3. An extension of imputation

Imputation is the standard approach, and it usually works well. However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset). Or rows with missing values may be unique in some other way. In that case, your model would make better predictions by considering which values were originally missing. Here's how it might look:

In [None]:
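import pandas as pd
from sklearn.impute import SimpleImputer

# make copy to avoid changing original data (when imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation: fit_transform returns a NumPy array, so restore the column names.
# new_data now has more columns than original_data (the *_was_missing indicators),
# so the names must come from new_data itself.
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                        columns=new_data.columns, index=new_data.index)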
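
As an alternative to building the indicator columns by hand, scikit-learn's SimpleImputer accepts add_indicator=True, which appends one indicator column for each feature that had missing values during fitting. This is a different route to a similar result, not the method used in these notes. The data and the "_was_missing" naming in the sketch are assumptions chosen to mirror the manual version.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data; names and values are illustrative only.
data = pd.DataFrame({
    "Rooms": [2.0, np.nan, 4.0],
    "Landsize": [100.0, 200.0, 300.0],
    "Distance": [np.nan, 5.0, 7.0],
})

# add_indicator=True appends indicator columns after the imputed features.
imputer = SimpleImputer(add_indicator=True)
values = imputer.fit_transform(data)

# Build column names by hand: the original columns first, then one indicator
# per column that had missing values, in the original column order.
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
names = list(data.columns) + [col + "_was_missing" for col in cols_with_missing]

result = pd.DataFrame(values, columns=names, index=data.index)
print(result)
```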

---

### One-Hot Encoding: the standard approach for categorical data

One-hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data.
To find the columns that need encoding, look at their types: pandas assigns a data type (called a dtype) to each column or Series.

In [None]:
train_predictors.dtypes.sample(10)  # show the dtype of 10 randomly chosen columns of train_predictors

The object dtype indicates that a column has text (there are other things it could theoretically be, but that's unimportant for our purposes). It's most common to one-hot encode these "object" columns, since they can't be plugged directly into most models. Pandas offers a convenient function called get_dummies to get one-hot encodings.

In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
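
To see what get_dummies produces, here is a minimal sketch on a made-up DataFrame with one numeric and one text column. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical predictors; names and values are illustrative only.
train_predictors = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["house", "unit", "house"],
})

# Numeric columns pass through unchanged; each distinct text value
# becomes its own column named <original column>_<value>.
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
print(one_hot_encoded_training_predictors.columns.tolist())
```

The Type column is replaced by Type_house and Type_unit, each marking the rows where that value appeared.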

In practice you will often work with multiple files, such as separate training and test datasets.
Scikit-learn is sensitive to the ordering of columns, so if the training and test datasets get misaligned, your results will be nonsense. This could happen if a categorical column had a different set of values in the training data than in the test data.
Ensure the test data is encoded in the same manner as the training data with the align command:

In [None]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(
    one_hot_encoded_test_predictors, join='left', axis=1)
# The align command makes sure the columns show up in the same order in both datasets
# (it uses column names to identify which columns line up in each dataset).
# The argument join='left' specifies that we will do the equivalent of SQL's left join.
# That means, if there are ever columns that show up in one dataset and not the other,
# we will keep exactly the columns from our training data.
# The argument join='inner' would do what SQL databases call an inner join,
# keeping only the columns showing up in both datasets. That's also a sensible choice.
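
The sketch below shows the left join on made-up data where the training and test sets contain different categories. With join='left', a column that exists only in the training data is added to the test data filled with missing values, so this example also passes fill_value=0 to align. That argument is an addition here, not something the notes above use.

```python
import pandas as pd

# Hypothetical predictors; the category sets differ between train and test.
train_predictors = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["house", "unit", "townhouse"],
})
test_predictors = pd.DataFrame({
    "Rooms": [3, 2],
    "Type": ["house", "flat"],
})

one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)

# join='left' keeps exactly the training columns, in the training order.
# fill_value=0 fills training-only columns in the test data with 0 instead of NaN.
final_train, final_test = one_hot_encoded_training_predictors.align(
    one_hot_encoded_test_predictors, join="left", axis=1, fill_value=0)

print(final_train.columns.tolist())
print(final_test.columns.tolist())
```

Type_flat appears only in the test data, so it is dropped. Type_townhouse and Type_unit appear only in the training data, so they are added to the test data as all-zero columns.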