# kaggle - Learn: Intermediate Machine Learning
- https://www.kaggle.com/learn/intermediate-machine-learning

## 4.- Pipelines
- A critical skill for deploying (and even testing) complex models with pre-processing.

### Intro
- There's a lot of non-numeric data out there. Here's how to use it for machine learning.
- What is a categorical variable? One that takes only a limited set of values, for example:
    - Options like: Never, Rarely, Most days, Every day.
    - Brand names like: Honda, Toyota, Ford.
- You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first.
- Three approaches for handling this type of data:
    1. Drop Categorical Variables: remove the categorical columns from the dataset.
        - This only works well if the columns don't contain useful information.
    2. Ordinal Encoding: assigns each unique value a different integer.
        - Assumes an indisputable ranking of the categories (ordinal variables).
        - e.g. Never (0), Rarely (1), Most days (2), Every day (3).
    3. One-Hot Encoding: creates one new column per category, marking its presence (1) or absence (0).
        - Makes no assumption about the ordering of the categories (nominal variables).
        - e.g. different colors or car brands, which have no intrinsic ranking.
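
To make the difference between approaches 2 and 3 concrete, here is a minimal sketch on a made-up DataFrame (the `toy` data and the column names are hypothetical, not part of the Melbourne dataset). It uses a plain dictionary for the ordinal mapping and `pd.get_dummies` for the one-hot columns, which is only one of several ways to do this.

```python
import pandas as pd

# Hypothetical toy data: one ordinal column and one nominal column
toy = pd.DataFrame({
    "breakfast": ["Never", "Rarely", "Most days", "Every day", "Rarely"],
    "brand": ["Honda", "Toyota", "Ford", "Honda", "Ford"],
})

# Ordinal encoding: the categories have a clear ranking, so map them to integers
frequency_rank = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
toy["breakfast_ord"] = toy["breakfast"].map(frequency_rank)

# One-hot encoding: no ranking between brands, so create one 0/1 column per brand
brand_onehot = pd.get_dummies(toy["brand"], prefix="brand", dtype=int)

encoded = pd.concat([toy[["breakfast_ord"]], brand_onehot], axis=1)
print(encoded)
```

Each row has exactly one `1` among the `brand_*` columns, while `breakfast_ord` keeps the ranking in a single column.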

## A Case
- First, load the data and split it into training and validation datasets.

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('files/min_melb_data.csv')

## JM: inspect (and remove) rows with NaN values
#df.info()       # Rooms has one missing value?
df[df['Rooms'].isnull()]        # index 1920: all NaNs except Suburb & Address
# TODO: find exactly which rows have only those 2 columns non-NaN

# Now delete this row
df.drop([1920], inplace=True)
#df[df['Rooms'].isnull()] 
# df.info()     # -> there are still a lot of cols with one missing value
df[df['Date'].isnull()]         # index 2130 
df.drop([2130], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2677 entries, 0 to 2678
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         2677 non-null   object 
 1   Address        2677 non-null   object 
 2   Rooms          2677 non-null   float64
 3   Type           2677 non-null   object 
 4   Price          2677 non-null   float64
 5   Method         2677 non-null   object 
 6   SellerG        2677 non-null   object 
 7   Date           2677 non-null   object 
 8   Distance       2677 non-null   float64
 9   Postcode       2677 non-null   float64
 10  Bedroom2       2677 non-null   float64
 11  Bathroom       2677 non-null   float64
 12  Car            2655 non-null   float64
 13  Landsize       2677 non-null   float64
 14  BuildingArea   1503 non-null   float64
 15  YearBuilt      1692 non-null   float64
 16  CouncilArea    2129 non-null   object 
 17  Lattitude      2677 non-null   float64
 18  Longtitu

In [25]:
## Continue w/
#   - separate target and predictors
#   - separate training and validation datasets
#   - handle missing values
#   - select categorical cols w low cardinality (nunique < 10)
#   - select numerical cols

# basic target and predictors
y = df.Price
X = df.drop(['Price'], axis=1)

# split train and validation datasets
X_train_f, X_valid_f, y_train, y_valid = train_test_split(X, y, random_state=0,
                                                      train_size=0.8,
                                                      test_size=0.2)
# X_train_f.info()
print(X_train_f.shape)

# Missing Values Handling: drops cols w/missing values (simplest approach)
cols_w_missing = [col for col in X_train_f.columns
                   if X_train_f[col].isnull().any()]     # cols_w_missing list
X_train_f.drop(cols_w_missing, axis=1, inplace=True)
X_valid_f.drop(cols_w_missing, axis=1, inplace=True)

print(X_train_f.shape)      # ok! we only lose 4 columns to NaNs

## "Cardinality" means the number of unique values in a column
# Make a list of categorical columns with relatively low cardinality
# (the < 10 threshold is convenient but arbitrary)
# naming note: cname is unclear; low_cname or col would read better
low_cardinality_cols = [cname for cname in X_train_f.columns
                        if X_train_f[cname].nunique() < 10 and
                        X_train_f[cname].dtype == 'object']

print(low_cardinality_cols)                     

# Select numerical columns (naming note: col would read better than cname)
numerical_cols = [cname for cname in X_train_f.columns
                  if X_train_f[cname].dtype in ['int64', 'float64']]
# ..dtype in ['int8', 'int16', 'int32', 'int64', 'float32', 'float64']
# or ..dtype != 'object'

# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
print(my_cols)
X_train = X_train_f[my_cols].copy()
X_valid = X_valid_f[my_cols].copy()

X_train.head(2)

(2141, 20)
(2141, 16)
['Type', 'Method', 'Regionname']
['Type', 'Method', 'Regionname', 'Rooms', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude', 'Propertycount']


| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 817 | h | VB | Southern Metropolitan | 3.0 | 10.7 | 3187.0 | 3.0 | 2.0 | 619.0 | -37.927 | 145.0267 | 6938.0 |
| 1366 | u | PI | Southern Metropolitan | 1.0 | 11.4 | 3163.0 | 1.0 | 1.0 | 0.0 | -37.8982 | 145.0625 | 7822.0 |
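
The cardinality filter matters because one-hot encoding adds one column per unique value. The sketch below illustrates this on made-up data (the `toy` frame and the `max_cardinality` threshold are assumptions chosen for illustration, not values from the notebook).

```python
import pandas as pd

# Hypothetical toy data: "Type" has few unique values, "Suburb" has many
toy = pd.DataFrame({
    "Type": ["h", "u", "h", "t", "u", "h"],
    "Suburb": ["Abbotsford", "Richmond", "Kew", "Carlton", "Fitzroy", "Brunswick"],
    "Rooms": [3, 1, 2, 4, 2, 3],
})

max_cardinality = 4  # arbitrary threshold for this sketch

# Count unique values in every non-numeric column
text_cols = toy.select_dtypes(exclude="number").columns
cardinality = toy[text_cols].nunique()

# Keep only the columns that would add few one-hot columns
low_cardinality_cols = cardinality[cardinality < max_cardinality].index.tolist()

print(cardinality.to_dict())
print(low_cardinality_cols)
```

Here `Suburb` would add six one-hot columns for six rows, so it is filtered out, while `Type` adds only three.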



Next, we obtain a list of all of the categorical variables in the training data.

We do this by checking the data type (or dtype) of each column. The object dtype indicates a column has text (there are other things it could theoretically be, but that's unimportant for our purposes). For this dataset, the columns with text indicate categorical variables.

In [26]:
# Get list of categorical variables
s = X_train.dtypes == 'object'  # boolean Series, True for object columns
#s[s].index, type(s[s].index)
object_cols = list(s[s].index)
# s.index
print(object_cols)

# JM: another way, with a list comprehension
o_c = [col for col in X_train.columns if X_train[col].dtype == 'object']
print(o_c)

print('Categorical variables:', object_cols)
X_train.columns

['Type', 'Method', 'Regionname']
['Type', 'Method', 'Regionname']
Categorical variables: ['Type', 'Method', 'Regionname']


Index(['Type', 'Method', 'Regionname', 'Rooms', 'Distance', 'Postcode',
       'Bedroom2', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude',
       'Propertycount'],
      dtype='object')
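
The `s[s].index` idiom used above can be hard to read at first: a boolean Series indexed by column name is used to filter itself, which leaves only the `True` entries. A minimal sketch on made-up data follows (the `toy` frame is hypothetical, and testing for "not numeric" instead of `== 'object'` is a substitution made here so that the sketch does not depend on how a given pandas version stores text).

```python
import pandas as pd

# Hypothetical toy data with two text columns and one numeric column
toy = pd.DataFrame({
    "Type": ["h", "u"],
    "Rooms": [3.0, 1.0],
    "Method": ["S", "VB"],
})

# Boolean Series indexed by column name: True where the column is not numeric
is_text = pd.Series(
    [not pd.api.types.is_numeric_dtype(dtype) for dtype in toy.dtypes],
    index=toy.columns,
)

# Filtering the Series by itself keeps only the True entries;
# their index labels are the names of the text columns
text_cols = list(is_text[is_text].index)
print(text_cols)
```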

### Define Function to Measure Quality of Each Approach
We define a function score_dataset() to compare the three different approaches to dealing with categorical variables. This function reports the mean absolute error (MAE) from a random forest model. In general, we want the MAE to be as low as possible!

In [27]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
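
As a quick check of what `mean_absolute_error` reports, the following sketch computes the MAE by hand for three made-up prices (the numbers are invented for illustration) and compares it with scikit-learn's result.

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical true prices and predictions
y_true = [500_000, 750_000, 1_000_000]
y_pred = [520_000, 700_000, 1_030_000]

# MAE is the average of the absolute differences between truth and prediction
manual_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

The absolute errors are 20,000, 50,000 and 30,000, so both values come out at roughly 33,333, in the same units as the target (here, price).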

## Score from Approach 1 (Drop Categorical Variables)
We drop the object columns with the select_dtypes() method.

In [28]:
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
209800.12821606253


## Score from Approach 2 (Ordinal Encoding)
Scikit-learn has an OrdinalEncoder class that can be used to get ordinal encodings. We fit the encoder on the categorical columns of the training data and then apply the same fitted encoder to the validation data; each column is encoded independently.

In [29]:
from sklearn.preprocessing import OrdinalEncoder

# Make copy to avoid changing original data 
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
#label_X_train.columns
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE from Approach 2 (Ordinal Encoding):") 
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE from Approach 2 (Ordinal Encoding):
201420.60514303483


In the code cell above, for each column, the encoder assigns each unique value to a different integer, following the sorted order of the values rather than any meaningful ranking. This is a common approach that is simpler than providing custom labels; however, we can expect an additional boost in performance if we provide better-informed labels for all ordinal variables.
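
One way to provide better-informed labels is the `categories` parameter of `OrdinalEncoder`, which takes one list of values per column in the desired order. The sketch below is only an illustration on made-up data (the `toy` frame is hypothetical; none of the Melbourne columns used above has an obvious ranking).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column
toy = pd.DataFrame({"breakfast": ["Rarely", "Every day", "Never", "Most days"]})

# Default behaviour: integers follow the sorted order of the strings
default_codes = OrdinalEncoder().fit_transform(toy[["breakfast"]]).ravel()

# Custom behaviour: integers follow the ranking we pass in
ranked_encoder = OrdinalEncoder(
    categories=[["Never", "Rarely", "Most days", "Every day"]]
)
ranked_codes = ranked_encoder.fit_transform(toy[["breakfast"]]).ravel()

print(default_codes)
print(ranked_codes)
```

With the default, "Every day" gets 0 and "Rarely" gets 3 simply because of alphabetical order; with the explicit list, the codes reflect increasing frequency.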

## Score from Approach 3 (One-Hot Encoding)
We use the OneHotEncoder class from scikit-learn to get one-hot encodings. There are a number of parameters that can be used to customize its behavior.
- We set handle_unknown='ignore' to avoid errors when the validation data contains classes that aren't represented in the training data, and
- setting sparse_output=False (named sparse in scikit-learn versions before 1.2) ensures that the encoded columns are returned as a numpy array (instead of a sparse matrix).

To use the encoder, we supply only the categorical columns that we want to be one-hot encoded. For instance, to encode the training data, we supply X_train[object_cols]. (object_cols in the code cell below is a list of the column names with categorical data, and so X_train[object_cols] contains all of the categorical data in the training set.)

In [30]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# (the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

print("MAE from Approach 3 (One-Hot Encoding):") 
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))



MAE from Approach 3 (One-Hot Encoding):
198747.04837686566


## Which approach is best?

In this case, dropping the categorical columns (Approach 1) performed worst, since it had the highest MAE score. As for the other two approaches, since the returned MAE scores are so close in value, there doesn't appear to be any meaningful benefit to one over the other.

In general, one-hot encoding (Approach 3) will typically perform best, and dropping the categorical columns (Approach 1) typically performs worst, but it varies on a case-by-case basis.

## Conclusion

The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!