## Machine Learning Course (part 2)

- This document summarizes the introductory [Machine Learning course from Kaggle](https://www.kaggle.com/learn/machine-learning).
- You can download the data required for the following exercises through the [Kaggle API](https://github.com/Kaggle/kaggle-api).

In [1]:
# download sample data
!kaggle datasets download --path ./data_files --unzip dansbecker/melbourne-housing-snapshot

Downloading melbourne-housing-snapshot.zip to ./data_files
100%|████████████████████████████████████████| 451k/451k [00:00<00:00, 1.14MB/s]



## Handling Missing Values
Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.

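Before choosing a strategy, it helps to see which columns actually contain missing values. The snippet below is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not taken from the Melbourne dataset); the same `isnull().sum()` call should work on `melb_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; all values are made up.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [80.0, np.nan, 120.0, np.nan],
    "YearBuilt": [1990.0, 2005.0, np.nan, 1975.0],
})

# Count missing entries per column and keep only columns that have any.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```
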
### Step 1 : Basic problem set-up

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split

# load dataset 
melb_data = pd.read_csv('./data_files/melb_data.csv')

melb_target = melb_data.Price
melb_predictors = melb_data.drop(['Price'], axis='columns')

# for the sake of keeping the example simple, we'll use only numeric predictors.
melb_numeric_predictors = melb_predictors.select_dtypes(exclude=['object'])

# divide our data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(melb_numeric_predictors,
                                                    melb_target,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=0)

### Step 2 : Create a function to measure the quality of different approaches to missing values

In [26]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(x_train, x_test, y_train, y_test):
    model = RandomForestRegressor(random_state=1)
    model.fit(x_train, y_train)
    preds = model.predict(x_test)
    return mean_absolute_error(y_test, preds)

### Step 3 : Get model score from dropping columns with missing values

In [38]:
cols_with_missing = [col for col in x_train.columns if x_train[col].isnull().any()]
reduced_x_train = x_train.drop(cols_with_missing, axis='columns')
reduced_x_test = x_test.drop(cols_with_missing, axis='columns')

print('Mean Absolute Error from dropping columns with missing values : ')
print(score_dataset(reduced_x_train, reduced_x_test, y_train, y_test))

Mean Absolute Error from dropping columns with missing values : 
191833.53711690864


### Step 4 : Get model score from imputation

In [73]:
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()

# re-cast imputed results to Pandas DataFrame 
imputed_x_train = pd.DataFrame(my_imputer.fit_transform(x_train))
imputed_x_test = pd.DataFrame(my_imputer.transform(x_test))
imputed_x_train.columns = x_train.columns
imputed_x_test.columns = x_test.columns

print('Mean Absolute Error from imputation : ')
print(score_dataset(imputed_x_train, imputed_x_test, y_train, y_test))

Mean Absolute Error from imputation : 
182349.8471234542


> #### fit_transform vs. transform
> `fit_transform` first learns the statistics needed for imputation (by default, the mean of each column) and then applies the transformation (replacing the missing values). The training set therefore needs both steps.
> For the test set, `transform` skips the learning step and reuses the means learned from the training data. This applies exactly the same replacement values to both sets and keeps information from the test set out of the fitting step.

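The difference can be seen on a tiny example. This is a sketch on made-up arrays (not the Melbourne data): the imputer learns the column mean from the training array, and `transform` then fills the test array with that same training mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data; the observed training values are 1.0 and 3.0.
train = np.array([[1.0], [3.0], [np.nan]])
test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer()                    # default strategy is "mean"
train_filled = imputer.fit_transform(train)  # learns the mean, then fills
test_filled = imputer.transform(test)        # reuses the training mean

print(imputer.statistics_)  # mean learned from the training column
print(test_filled)          # the missing test value is filled with that mean
```

The missing test value is replaced with the training mean (2.0), not with anything computed from the test array.
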
### Step 5 : Get model score from imputation with extra columns showing what was imputed

In [90]:
# make copy to avoid changing original data 
x_train_plus = x_train.copy()
x_test_plus = x_test.copy()

cols_with_missing = [col for col in x_train.columns if x_train[col].isnull().any()]

# make new columns indicating what will be imputed
for col in cols_with_missing:
    x_train_plus[col + '_was_missing'] = x_train_plus[col].isnull()
    x_test_plus[col + '_was_missing'] = x_test_plus[col].isnull()
    
my_imputer = SimpleImputer()
imputed_x_train_plus = pd.DataFrame(my_imputer.fit_transform(x_train_plus))
imputed_x_test_plus = pd.DataFrame(my_imputer.transform(x_test_plus))
imputed_x_train_plus.columns = x_train_plus.columns
imputed_x_test_plus.columns = x_test_plus.columns

print('Mean Absolute Error from imputation while tracking what was imputed: ')
print(score_dataset(imputed_x_train_plus, imputed_x_test_plus, y_train, y_test))

Mean Absolute Error from imputation while tracking what was imputed: 
182182.084975571

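As an alternative to building the `_was_missing` columns by hand, `SimpleImputer` accepts an `add_indicator=True` option that appends indicator columns itself. The sketch below uses a made-up toy frame and is not the approach taken in the course; it only illustrates that the option produces a comparable layout.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: column "a" has one missing value, column "b" has none.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# add_indicator=True appends one 0/1 column per feature that had
# missing values when the imputer was fitted.
imputer = SimpleImputer(add_indicator=True)
filled = imputer.fit_transform(toy)

print(filled.shape)  # two imputed columns plus one indicator column
print(filled)
```

The result is a NumPy array, so column names would need to be restored separately if a DataFrame is wanted.
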

---- 

## Using categorical data with one hot encoding
Categorical data is data that takes only a limited number of values. You will get an error if you try to plug these variables into most machine learning models in Python without "encoding" them first. Here we'll see the most popular method for encoding categorical variables: one-hot encoding.

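As a quick illustration of the idea, the sketch below one-hot encodes a made-up toy frame. The column names mirror the housing data but the values are invented, and `dtype=int` is passed only so the dummy columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (made-up values).
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"], "Rooms": [3, 2, 4]})

# Each distinct value of "Street" becomes its own 0/1 column;
# the numeric column "Rooms" is passed through unchanged.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```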
In [93]:
melb_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [95]:
# Download dataset
!kaggle competitions download -c house-prices-advanced-regression-techniques --path ./data_files --file train.csv
!kaggle competitions download -c house-prices-advanced-regression-techniques --path ./data_files --file test.csv

Downloading train.csv to ./data_files
100%|█████████████████████████████████████████| 450k/450k [00:01<00:00, 289kB/s]

Downloading test.csv to ./data_files
100%|████████████████████████████████████████| 441k/441k [00:00<00:00, 1.06MB/s]



### Step 1 : Basic problem set-up

In [113]:
# read the data
train_data = pd.read_csv('./data_files/train.csv')
test_data = pd.read_csv('./data_files/test.csv')

# drop houses where the target is missing
train_data.dropna(axis='rows', subset=['SalePrice'], inplace=True)

target = train_data.SalePrice

# since missing values aren't the focus of this tutorial, we use the simplest
# possible approach, which drops the columns that contain them.
cols_with_missing = [col for col in train_data.columns if train_data[col].isnull().any()]
candidate_train_predictors = train_data.drop(['Id', 'SalePrice'] + cols_with_missing, axis='columns')
candidate_test_predictors = test_data.drop(['Id'] + cols_with_missing, axis='columns')

# "cardinality" means the number of unique values in a column.
# we use it as our only way to select categorical columns here. This is convenient, though
# a little arbitrary.
low_cardinality_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].nunique() < 10 and
                                candidate_train_predictors[cname].dtype == "object"]
numeric_cols = [cname for cname in candidate_train_predictors.columns if 
                                candidate_train_predictors[cname].dtype in ['int64', 'float64']]

my_cols = low_cardinality_cols + numeric_cols

train_predictors = candidate_train_predictors[my_cols]
test_predictors = candidate_test_predictors[my_cols]

### Step 2 : get one-hot encodings using `get_dummies`

In [146]:
# by default, get_dummies only encodes columns with object, string, or category dtype;
# numeric columns are passed through unchanged
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)

print('original dataframe has Street column which has two unique values : {}'.format(train_predictors.Street.unique().tolist()))
print('one-hot encoded dataframe has two columns derived from Street column as follows : ')
print(one_hot_encoded_training_predictors[['Street_Pave', 'Street_Grvl']].head())

original dataframe has Street column which has two unique values : ['Pave', 'Grvl']
one-hot encoded dataframe has two columns derived from Street column as follows : 
   Street_Pave  Street_Grvl
0            1            0
1            1            0
2            1            0
3            1            0
4            1            0


### Step 3 : Compare MAE with two different approaches

In [150]:
from sklearn.model_selection import cross_val_score

def get_mae(x, y):
    # multiply by -1 to get a positive MAE, since sklearn's convention is to return the negated value
    return -1 * cross_val_score(RandomForestRegressor(n_estimators=50), x, y, scoring='neg_mean_absolute_error').mean()

predictors_without_categoricals = train_predictors.select_dtypes(exclude=['object'])

mae_without_categoricals = get_mae(predictors_without_categoricals, target)

mae_one_hot_encoded = get_mae(one_hot_encoded_training_predictors, target)

print('Mean Absolute Error when Dropping Categoricals: ' + str(int(mae_without_categoricals)))
print('Mean Absolute Error with One-Hot Encoding: ' + str(int(mae_one_hot_encoded)))

Mean Absolute Error when Dropping Categoricals: 18584
Mean Absolute Error with One-Hot Encoding: 18288

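The sign flip in `get_mae` follows from scikit-learn's convention that higher scores are better, so error metrics are reported negated. The sketch below shows this on synthetic data; `LinearRegression` and the random data are stand-ins chosen to keep the example small, not the model or dataset used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (made-up data for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=60)

# One score per fold; each is the negated MAE, so every value is <= 0.
scores = cross_val_score(LinearRegression(), X, y, cv=3,
                         scoring="neg_mean_absolute_error")

# Negating the average gives the usual positive MAE.
mae = -scores.mean()
print(scores)
print(mae)
```
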

### Step 4 : Applying to multiple files
Scikit-learn is sensitive to the ordering of columns, so if the training and test datasets get misaligned, your results will be nonsense. This can happen if a categorical column has a different set of values in the training data than in the test data, because `get_dummies` then produces different columns for each.
Ensure the test data is encoded in the same manner as the training data with the `align` method.

In [177]:
one_hot_encoded_training_predictors = pd.get_dummies(train_predictors)
one_hot_encoded_test_predictors = pd.get_dummies(test_predictors)
final_train, final_test = one_hot_encoded_training_predictors.align(one_hot_encoded_test_predictors, join='left', axis='columns')