### code snippets  | notes | references | resources

In [1]:
import pandas as pd
import numpy as np

In [3]:
# using the auto dataset
X = pd.read_csv('auto.csv')
X.shape

(201, 29)

In [4]:
X.head()

Unnamed: 0,symboling,normalized-losses,make,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,...,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,horsepower-binned,diesel,gas
0,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,9.0,111.0,5000.0,21,27,13495,11.190476,Medium,0,1
1,3,122,alfa-romero,std,two,convertible,rwd,front,88.6,0.811148,...,9.0,111.0,5000.0,21,27,16500,11.190476,Medium,0,1
2,1,122,alfa-romero,std,two,hatchback,rwd,front,94.5,0.822681,...,9.0,154.0,5000.0,19,26,16500,12.368421,Medium,0,1
3,2,164,audi,std,four,sedan,fwd,front,99.8,0.84863,...,10.0,102.0,5500.0,24,30,13950,9.791667,Medium,0,1
4,2,164,audi,std,four,sedan,4wd,front,99.4,0.84863,...,8.0,115.0,5500.0,18,22,17450,13.055556,Medium,0,1


In [5]:
# quick check for missing values
# if only a handful - you could just drop them
X.isna().sum().sum()

5

In [None]:
# Drop rows with missing values - if you want...
# alternatively you may interpolate or use some other method for filling things in
X.dropna(axis=0, inplace=True) 
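
A rough sketch of the fill-in alternatives mentioned in the comment above. The frame below is made up for illustration (it is not read from auto.csv), and the names `toy`, `filled_median` and `filled_interp` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration; not taken from auto.csv)
toy = pd.DataFrame({"horsepower": [100.0, np.nan, 120.0, 130.0],
                    "peak-rpm": [5000.0, 5200.0, np.nan, 5600.0]})

# Option 1: fill each column with its own median
filled_median = toy.fillna(toy.median())

# Option 2: linear interpolation between neighbouring rows
# (only sensible when row order carries meaning, e.g. a time series)
filled_interp = toy.interpolate(method="linear")

print(filled_median)
print(filled_interp)
```

Which option is appropriate depends on the data; with only a handful of missing rows, dropping them as above is often good enough.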

### Finding Categorical and Numerical Columns

Looking at the data set: should categorical columns be encoded before splitting, or only after?
The general consensus is that encoding should happen after splitting, so that nothing from the validation rows influences the fitted encoder. For a divergent opinion, see:
https://jamesmccaffrey.wordpress.com/2020/05/27/should-you-normalize-and-encode-data-before-train-test-splitting-or-after-splitting/
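
A minimal sketch of the "encode after splitting" approach, assuming a made-up single-column frame (`toy` is hypothetical, not the auto dataset). The encoder is fitted on the training rows only and then reused on the validation rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration)
toy = pd.DataFrame({"drive-wheels": ["rwd", "fwd", "4wd", "fwd",
                                     "rwd", "fwd", "fwd", "rwd"]})

train, valid = train_test_split(toy, test_size=0.25, random_state=0)

# Fit on the training rows only, so nothing about the validation rows leaks in
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Reuse the fitted encoder on the validation rows;
# a category never seen in training would become an all-zero row
valid_encoded = encoder.transform(valid).toarray()
print(encoder.categories_)
print(valid_encoded)
```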

In [25]:
# first it's helpful to see what columns contain what 
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X.columns if X[cname].nunique() < 3 and 
                        X[cname].dtype == "object"]
low_cardinality_cols

['aspiration', 'engine-location']

In [7]:
# Select numerical columns
numerical_cols = [cname for cname in X.columns if X[cname].dtype in ['int64', 'float64']]

In [8]:
# Get list of categorical variables
s = (X.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables are in columns:")
print(object_cols)
print('Numerical variables are in columns:')
print(numerical_cols)

Categorical variables are in columns:
['make', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'engine-type', 'num-of-cylinders', 'fuel-system', 'horsepower-binned']
Numerical variables are in columns:
['symboling', 'normalized-losses', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price', 'city-L/100km', 'diesel', 'gas']


In [9]:
# print unique values in the categorical columns
# NB - if there is a test set, compare the values in the test and training sets to
# ensure that every value in the test set also appears in the training set;
# otherwise the encoding steps will throw an error
for o in object_cols:
    print(f'column header "{o}" contains these unique values...')
    print(X[o].unique())
    print()

column header "make" contains these unique values...
['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mercury' 'mitsubishi' 'nissan' 'peugot'
 'plymouth' 'porsche' 'renault' 'saab' 'subaru' 'toyota' 'volkswagen'
 'volvo']

column header "aspiration" contains these unique values...
['std' 'turbo']

column header "num-of-doors" contains these unique values...
['two' 'four']

column header "body-style" contains these unique values...
['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']

column header "drive-wheels" contains these unique values...
['rwd' 'fwd' '4wd']

column header "engine-location" contains these unique values...
['front' 'rear']

column header "engine-type" contains these unique values...
['dohc' 'ohcv' 'ohc' 'l' 'rotor' 'ohcf']

column header "num-of-cylinders" contains these unique values...
['four' 'six' 'five' 'three' 'twelve' 'two' 'eight']

column header "fuel-system" contains these unique values...
['mpfi' '2bbl' 

Do the values in any of the categorical columns have an inherent ranking?

### ordinal encoding

In [None]:
# remap values manually by passing a dictionary to replace
# (replace returns a new DataFrame; assign the result if you want to keep it)
X.replace({"num-of-doors":{'two':2, 'four':4},
           "num-of-cylinders":{'four':4, 'six':6, 'five':5, 'three':3, 'twelve':12, 'two':2, 'eight':8}})

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

The documentation is thin on the categories parameter; for a clear explanation of how to use it, see:
https://datascience.stackexchange.com/questions/72343/encoding-with-ordinalencoder-how-to-give-levels-as-user-input
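
To illustrate why `categories` matters, here is a small sketch on a made-up column (`toy`, `order`, `auto_enc` and `manual_enc` are hypothetical names). Left to itself, the encoder sorts the labels alphabetically; passing one ordered list per column imposes the intended ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column (made up for illustration)
toy = pd.DataFrame({"num-of-cylinders": ["four", "six", "two", "twelve"]})

# Default: categories are sorted alphabetically, which scrambles the real order
auto_enc = OrdinalEncoder()
auto_codes = auto_enc.fit_transform(toy)

# Explicit: one list per column, ordered from lowest to highest
order = ["two", "four", "six", "twelve"]
manual_enc = OrdinalEncoder(categories=[order])
manual_codes = manual_enc.fit_transform(toy)

print(auto_codes.ravel())    # positions in alphabetical order
print(manual_codes.ravel())  # positions in the list supplied above
```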

In [21]:
from sklearn.preprocessing import OrdinalEncoder

# for one column using categories
#ordinal_encoder = OrdinalEncoder(categories=[['two','four']])
#X_copy = ordinal_encoder.fit_transform(X.loc[:,["num-of-doors"]])

# for using multiple - put the labels in order in the lists
door_cats = ['two', 'four']
cylinder_cats = ['two','three','four', 'five', 'six','eight','twelve']
horse_cats = ['Low', 'Medium', 'High']

# and then feed them to the encoder class and use the fit_transform method
ordinal_encoder = OrdinalEncoder(categories=[door_cats,cylinder_cats,horse_cats])
# (the number of columns assigned to must match the number of columns encoded)
ordinal_cols = ['num-of-doors', 'num-of-cylinders', 'horsepower-binned']
X[ordinal_cols] = ordinal_encoder.fit_transform(X[ordinal_cols])

# or to let the ordinal encoder label things automatically...
#label_X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])


In [24]:
X[['num-of-doors', 'num-of-cylinders','horsepower-binned']]

Unnamed: 0,num-of-doors,num-of-cylinders
0,0.0,2.0
1,0.0,2.0
2,0.0,4.0
3,1.0,2.0
4,1.0,3.0
...,...,...
196,1.0,2.0
197,1.0,2.0
198,1.0,4.0
199,1.0,4.0


In [None]:
# to retrieve the original values
# (pass the same columns, in the same order, that the encoder was fitted on)
ordinal_encoder.inverse_transform(X[['num-of-doors', 'num-of-cylinders', 'horsepower-binned']])

### separate X and y variables

In [None]:
y = X.price
X.drop(['price'], axis=1, inplace=True) 
# if assigning to another variable remove the inplace
# X = X_train.drop(['price'], axis=1)

In [None]:
# double check to make sure nothing is missing
y_miss = y.isna().sum()
x_miss = X.isna().sum().sum()
print(f'X missing values: {x_miss} y missing values: {y_miss}')

In [None]:
X.head()

In [None]:
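# X_train is created by the train_test_split cell further down
# 'price' now lives in y, so skip any column that is no longer in X_train
for n in numerical_cols:
    if n not in X_train.columns:
        continue
    print('column header', n, 'is numerical and has these stats:')
    print('mean', X_train[n].mean())
    print('median', X_train[n].median())
    print('std deviation', X_train[n].std())
    print()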

In [None]:
# filter out all categorical variables from the dataset
#drop_X_train = X_train.select_dtypes(exclude=['object'])

### Quantifying Missing values

In [None]:
# How many columns in the training data
# have missing values?
missing_count = (X_train.isnull().sum())
num_cols_with_missing = missing_count[missing_count > 0].count()

# How many missing entries are contained in
# all of the training data?
tot_missing = X_train.isna().sum().sum()
print(f'num_rows: {X_train.shape[0]} num_columns: {X_train.shape[1]}')
print(f'number of columns with missing values: {num_cols_with_missing}')
print(f'total number of missing values: {tot_missing}')
print()
print('columns with missing values + count of missing')
print(missing_count)

In [None]:
# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
cols_with_missing

### Removing Missing Values

https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data

In [None]:
# Drop columns with missing values from the training data (axis=1)
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_train

# alternate
# reduced_X_train = X_train.dropna(axis=1, how='any') # how='all' = only drop a column if every value in it is NA

In [None]:
# drop rows in training and validation data with missing values
# 0, or 'index' : Drop rows which contain missing values.
# 1, or 'columns' : Drop columns which contain missing values.
X_train.dropna(axis=0, inplace=True)
X_valid.dropna(axis=0, inplace=True)
# keep the targets aligned with the rows that remain
y_train = y_train.loc[X_train.index]
y_valid = y_valid.loc[X_valid.index]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
num_rows = X_train.shape[0]
num_columns = X_train.shape[1]
print('df is: ',num_rows, 'by', num_columns)

### Remove categorical columns with values that don't map onto testing data

In [None]:
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]

# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if 
                   set(X_valid[col]).issubset(set(X_train[col]))]
        
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
        
print('Categorical columns that will be ordinal encoded:', good_label_cols)
print('\nCategorical columns that could be dropped from the dataset:', bad_label_cols)
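
As an alternative to dropping the problematic columns, recent scikit-learn releases (0.24 and later, as far as I know) let `OrdinalEncoder` map unseen categories to a sentinel value. A minimal sketch on made-up data; `toy_train`, `toy_valid` and the choice of `-1` as the sentinel are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/validation columns (made up); 'rotor' never appears in training
toy_train = pd.DataFrame({"engine-type": ["ohc", "dohc", "ohc", "l"]})
toy_valid = pd.DataFrame({"engine-type": ["ohc", "rotor"]})

# Unseen categories are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(toy_train)

valid_codes = enc.transform(toy_valid)
print(valid_codes.ravel())
```

Whether a sentinel code is acceptable depends on the model; tree-based models usually cope with it, linear models may not.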

In [None]:
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))

# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])

### one hot encoding

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Apply one-hot encoder to each column with categorical data
# (scikit-learn >= 1.2 names this argument sparse_output; older releases use sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index

OH_cols_train # each categorical column is expanded into one 0/1 column per unique value,
              # e.g. a column with five different values becomes five indicator columns

In [None]:
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_train

In [None]:
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_train

### Use an imputer to fill in missing values

In [None]:
from sklearn.impute import SimpleImputer

### BASIC EXAMPLE ###
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train.select_dtypes(exclude=['object'])))
# Imputation removed column names; put them back
imputed_X_train.columns = X_train.select_dtypes(exclude=['object']).columns
imputed_X_train

In [None]:
### ADD A COLUMN TO IDENTIFY WHICH VALUES WERE MISSING ###

# Make copy to avoid changing original data (when imputing)
X_train_plus = X_train.select_dtypes(exclude=['object']).copy()

# Make new columns indicating what will be imputed
for col in imputed_X_train:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))

# Imputation removed column names; put them back
imputed_X_train_plus.columns = X_train_plus.columns

In [None]:
imputed_X_train_plus
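
`SimpleImputer` can also produce the "was missing" flags itself through its `add_indicator` parameter, which may save the manual loop above. A small sketch on made-up data (`toy`, `imputer` and `imputed` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration)
toy = pd.DataFrame({"bore": [3.0, np.nan, 3.4],
                    "stroke": [2.5, 3.0, 3.5]})

# add_indicator=True appends one 0/1 column for each feature
# that had missing values when the imputer was fitted
imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed = imputer.fit_transform(toy)

# Columns: imputed 'bore', 'stroke', then the indicator for 'bore'
print(imputed)
```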

### ====|PIPELINES|====

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
#from sklearn.impute import SimpleImputer
#from sklearn.preprocessing import OneHotEncoder

In [None]:
from sklearn.model_selection import train_test_split

# Read the data
data = pd.read_csv(r'C:\Users\John Rando\Documents\Code\Kaggle\melbourne-housing-snapshot\melb_data.csv')

# Separate target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [None]:
X_train.tail()

With the pipeline, we preprocess the training data and fit the model in a single line of code. (In contrast, without a pipeline, we have to do imputation, one-hot encoding, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables!)
With the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. (However, without a pipeline, we have to remember to preprocess the validation data before making predictions.)
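
The two points above can be seen on a very small scale. The sketch below uses made-up data and swaps in `LinearRegression` purely to keep it quick; `toy_X`, `toy_y`, `prep`, `toy_pipeline` and `new_rows` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration)
toy_X = pd.DataFrame({"rooms": [2.0, 3.0, np.nan, 4.0],
                      "type": ["h", "u", "h", np.nan]})
toy_y = pd.Series([500.0, 400.0, 550.0, 700.0])

prep = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="mean"), ["rooms"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["type"]),
])

toy_pipeline = Pipeline(steps=[("prep", prep), ("model", LinearRegression())])

# One call imputes, encodes and fits the model
toy_pipeline.fit(toy_X, toy_y)

# Raw, unprocessed rows go straight into predict();
# the category 't' was never seen during training
new_rows = pd.DataFrame({"rooms": [np.nan], "type": ["t"]})
toy_preds = toy_pipeline.predict(new_rows)
print(toy_preds)
```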

In [None]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

In [None]:
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

### Cross Validation

We obtain the cross-validation scores with the cross_val_score() function from scikit-learn. We set the number of folds with the cv parameter.

The scoring parameter chooses a measure of model quality to report: in this case, we chose negative mean absolute error (MAE). The docs for scikit-learn show a list of options.

It is a little surprising that we specify negative MAE. Scikit-learn has a convention where all metrics are defined so a high number is better. Using negatives here allows them to be consistent with that convention, though negative MAE is almost unheard of elsewhere.

We typically want a single measure of model quality to compare alternative models. So we take the average across experiments.
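
The sign convention is easy to check on synthetic data. This sketch uses `make_regression` and `LinearRegression` as stand-ins (they are assumptions for illustration, not the notebook's data or model); `X_demo`, `y_demo`, `neg_mae` and `mae_per_fold` are hypothetical names.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (generated, not from the notebook's datasets)
X_demo, y_demo = make_regression(n_samples=100, n_features=3,
                                 noise=10.0, random_state=0)

neg_mae = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=5, scoring="neg_mean_absolute_error")

# Scores come back negative (higher is better); flip the sign to get ordinary MAE
mae_per_fold = -neg_mae
print(mae_per_fold)
print(mae_per_fold.mean())
```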

In [None]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("MAE scores:\n", scores)

In [None]:
print("Average MAE score (across experiments):")
print(scores.mean())

In [None]:
def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of random forest model.
    
    Keyword argument:
    n_estimators -- the number of trees in the forest
    """
    # SimpleImputer's default mean strategy only works on numeric data,
    # so restrict X to the numerical columns selected above
    a_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])
    a_score = -1 * cross_val_score(a_pipeline, X[numerical_cols], y,
                                   cv=3,
                                   scoring='neg_mean_absolute_error')
    return a_score.mean()

In [None]:
# score models with different numbers of estimators, then plot the scores and look for the elbow
results = {}
for n in range(50, 450, 50):
    results[n] = get_score(n)

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(list(results.keys()), list(results.values()))
plt.show()
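
Besides reading the plot, the lowest-scoring setting can be pulled straight out of the dictionary. The numbers below are made up to mimic the `{n_estimators: MAE}` shape of `results`; `demo_results` and `best_n` are hypothetical names.

```python
# Hypothetical scores (made up) in the same {n_estimators: MAE} shape as `results`
demo_results = {50: 18350.0, 100: 18200.0, 150: 18150.0, 200: 18160.0}

# Lower MAE is better, so take the key with the smallest value
best_n = min(demo_results, key=demo_results.get)
print(best_n, demo_results[best_n])
```

The minimum and the elbow are not always the same point; if the curve flattens early, a smaller forest may be the more economical choice.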

In [None]:
from xgboost import XGBRegressor
