# Covered Topics

### Machine Learning Topics:
- Missing Values
- Categorical Data
- Pipelines
- Advanced Topics

### Guide
[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)

# Regression Example 1

You will work with data from the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course) to predict home prices in Iowa using 79 explanatory variables describing (almost) every aspect of the homes. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
# Read the data
X = pd.read_csv('data/train.csv', index_col='Id')
X_test_full = pd.read_csv('data/test.csv', index_col='Id')

X

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125


In [3]:
# Remove rows with missing target
X.dropna(axis=0, subset=['SalePrice'], inplace=True)

In [4]:
# Separate target from predictors
y = X.SalePrice              
X.drop(['SalePrice'], axis=1, inplace=True)

In [5]:
# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

In [6]:
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

In [7]:
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

### Categorical Data

Categorical data takes values from a limited set of categories, represented as numbers or strings.

Survey Question: How often do you eat breakfast?
<br>Responses:
- Never
- Rarely
- Most days
- Every day



Survey Question: On a scale of 0 to 2 (0 meaning bad, 1 meaning neutral, 2 meaning good), how was the course?
<br>Responses:
- 0
- 1
- 2

Survey Question: What type of car do you drive?
<br>Responses:
- Toyota
- Honda
- Ford

Three Approaches to handle Categorical Data:
1. Drop Categorical Variables
2. Label Encoding
  - This approach assumes an ordering of the categories: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).
  - This assumption makes sense in this example, because there is an indisputable ranking to the categories. Not all categorical variables have a clear ordering in the values, but we refer to those that do as **ordinal** variables. We refer to categorical variables without an intrinsic ranking as **nominal** variables.
3. One-hot encoding
  - creates new columns indicating the presence (or absence) of each possible value in the original data.
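
The two encoding approaches above can be sketched on a toy version of the survey data. This is a minimal illustration, not part of the housing workflow: the `survey` DataFrame is made up, and scikit-learn's `OrdinalEncoder` is used here for what the list calls label encoding, with the category order passed in explicitly.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical survey responses (illustrative data, not from the housing dataset)
survey = pd.DataFrame({
    "breakfast": ["Never", "Rarely", "Most days", "Every day", "Rarely"],
    "car": ["Toyota", "Honda", "Ford", "Honda", "Toyota"],
})

# Ordinal variable: pass the category order explicitly so the integer codes respect it
breakfast_order = [["Never", "Rarely", "Most days", "Every day"]]
ordinal_encoder = OrdinalEncoder(categories=breakfast_order)
breakfast_codes = ordinal_encoder.fit_transform(survey[["breakfast"]]).ravel()

# Nominal variable: one-hot encoding creates one indicator column per category
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
car_onehot = onehot_encoder.fit_transform(survey[["car"]]).toarray()

print(breakfast_codes)                # codes follow the order given above
print(onehot_encoder.categories_[0])  # column order of the one-hot matrix
print(car_onehot)
```

Without the explicit `categories` argument, the encoder would order the categories alphabetically, which would not match the intended ranking for an ordinal variable.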

### Missing Values

Three Approaches:
1. Drop Columns with Missing Values
2. Imputation
  - Imputation fills in the missing values with some number.
  - The imputed value won't be exactly right in most cases, but it usually leads to more accurate models than you would get from dropping the column entirely.
3. Imputation and Add Columns Denoting Missing Status
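
As a rough sketch of the three approaches, the snippet below applies each one to a small made-up frame. The column names are borrowed from the housing data, but the values in `toy` are illustrative assumptions. The third approach relies on the `add_indicator` option of `SimpleImputer`, which appends one indicator column per feature that had missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with gaps (values are made up for illustration)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450.0, 9600.0, np.nan, 9550.0],
    "YrSold": [2008, 2007, 2008, 2006],
})

# Approach 1: drop every column that contains a missing value
cols_with_missing = [col for col in toy.columns if toy[col].isnull().any()]
dropped = toy.drop(columns=cols_with_missing)

# Approach 2: fill each gap with the column mean
mean_imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(mean_imputer.fit_transform(toy), columns=toy.columns)

# Approach 3: impute and append indicator columns marking which entries were missing
indicator_imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputed_plus = indicator_imputer.fit_transform(toy)

print(dropped.columns.tolist())
print(imputed)
print(imputed_plus.shape)  # original columns plus one indicator per column with gaps
```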

### Pipelines

**Pipelines** are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

We use the `ColumnTransformer` class to bundle together different preprocessing steps. The code below:

- imputes missing values in **numerical** data, and
- imputes missing values and applies a one-hot encoding to **categorical** data.

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [9]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

In [10]:
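def evaluate(predictions, y_valid):
    # Absolute error of each prediction, in the units of the target
    errors = abs(predictions - y_valid)
    # Mean absolute percentage error (MAPE)
    mape = 100 * np.mean(errors / y_valid)
    accuracy = 100 - mape
    print('Model Performance')
    # Use the function's own argument rather than the global `preds`
    print('MAE: {}'.format(mean_absolute_error(y_valid, predictions)))
    # Note: errors are in the units of y_valid (dollars for SalePrice), despite the 'degrees' label
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    
    return accuracy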

In [11]:
# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

evaluate(preds, y_valid)

Model Performance
MAE: 17861.780102739725
Average Error: 17861.7801 degrees.
Accuracy = 90.13%.


90.12724117844509

### Improve the model
Now that you've trained a default model as a baseline, it's time to tinker with the parameters to see if you can get better performance!

A good place to start is the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) on the random forest regressor in scikit-learn. This tells us the most important settings are the number of trees in the forest (`n_estimators`) and the number of features considered when splitting a node (`max_features`).

Also, try other imputation strategies (for example `mean`, `median`, `most_frequent`, or `constant`), chosen according to the data itself.
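
One possible way to tinker systematically is a small grid search over a pipeline. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_regression` rather than the housing data, and the grid values and the names `demo_pipeline`, `param_grid`, and `search` are arbitrary choices, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: swap in X_train, y_train for real use)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter name>"
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "model__n_estimators": [50, 100],
    "model__max_features": [0.5, 1.0],
}

# Every combination is scored with 3-fold cross-validation on MAE
search = GridSearchCV(demo_pipeline, param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Best cross-validated MAE:", -search.best_score_)
```

Because the imputation strategy is part of the pipeline, it can be tuned in the same search as the forest's own parameters.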

In [12]:
from pprint import pprint 

In [13]:
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(model.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 0,
 'verbose': 0,
 'warm_start': False}


In [14]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='median')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=400,max_depth=70,random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

evaluate(preds, y_valid)

Model Performance
MAE: 17282.303193493153
Average Error: 17282.3032 degrees.
Accuracy = 90.38%.


90.37894472420624

# Regression Example 2

In [15]:
# import dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

import matplotlib.pyplot as plt  
import pandas as pd
import numpy as np

%matplotlib inline


In [16]:
# load data
data = load_boston()
# load into a dataframe
df = pd.DataFrame(data['data'], columns=data['feature_names'].tolist())
df['label'] = data['target']
df.head()

,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,label
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


`label` on the right represents the **median value of owner-occupied homes in each area, in thousands of dollars**. The idea behind this dataset is to use the values in the other columns to predict this median home value.

In [17]:
X = df.drop("label", axis=1)
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [18]:
# train model
model = LinearRegression()
model.fit(X_train, y_train)

# predict results
y_pred = model.predict(X_test)

### Regression Metrics
Three of the most common metrics for evaluating predictions on regression problems:

1. Mean Absolute Error
  - The Mean Absolute Error (or MAE) is the average of the absolute differences between predictions and actual values. It gives an idea of how wrong the predictions were.
  - The measure gives an idea of the magnitude of the error, but no idea of the direction (e.g. over or under predicting).
  - A value of 0 indicates no error or perfect predictions.
2. Mean Squared Error
  - The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a gross idea of the magnitude of error.
  - Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation. This is called the Root Mean Squared Error (or RMSE).
  - A value of 0 indicates no error or perfect predictions.
3. R$^2$
  - The R$^2$ (or R Squared) metric indicates the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination.
  - The best possible score is 1.0, which indicates a perfect fit. A constant model that always predicts the expected value of y, disregarding the input features, gets an R$^2$ score of 0.0.
  - The score can be negative, because a model can be arbitrarily worse than that constant baseline.
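
To make the definitions concrete, here is a small sketch that computes each metric by hand with NumPy and checks the R$^2$ edge cases against `sklearn.metrics.r2_score`. The arrays `y_true` and `y_hat` are made-up values chosen for easy arithmetic.

```python
import numpy as np
from sklearn import metrics

# Made-up targets and predictions, chosen for easy arithmetic
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # average absolute difference
mse = np.mean(errors ** 2)      # average squared difference
rmse = np.sqrt(mse)             # back in the units of the target

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# A constant model that always predicts the mean scores exactly 0
r2_constant = metrics.r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions in reversed order are worse than the constant model, so R^2 is negative
r2_reversed = metrics.r2_score(y_true, y_true[::-1].copy())

print(mae, mse, rmse, r2)
print(r2_constant, r2_reversed)
```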

[Displaying Accuracy](https://towardsdatascience.com/a-beginners-guide-to-linear-regression-in-python-with-scikit-learn-83a8f7ae2b4f)

In [19]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R^2:', metrics.r2_score(y_test, y_pred)) 

Mean Absolute Error: 3.211454138492077
Mean Squared Error: 16.714282873330262
Root Mean Squared Error: 4.0883105157669055
R^2: 0.7731254196270163


In [20]:
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])  
coeff_df

,Coefficient
CRIM,-0.120624
ZN,0.032051
INDUS,0.02664
CHAS,2.980532
NOX,-19.089522
RM,3.450683
AGE,0.002143
DIS,-1.47397
RAD,0.402614
TAX,-0.01602


This means that, holding the other features fixed, a unit increase in `NOX` is associated with a decrease of 19.09 units in the predicted median value of owner-occupied homes. Similarly, a unit increase in `RM` is associated with an increase of 3.45 units. Features with coefficients closer to 0 have very little effect, per unit change, on the median value of owner-occupied homes.
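
The reading of a coefficient as "change in prediction per unit increase" can be checked directly. The sketch below uses synthetic, noise-free data with known coefficients (an assumption made for illustration, unrelated to the Boston data) and compares the change in prediction after bumping one feature by 1 with the fitted coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients: y = 3*x0 - 2*x1 + 5 (no noise)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Increase the first feature by one unit while holding the second fixed
base = np.array([[1.0, 1.0]])
bumped = base.copy()
bumped[0, 0] += 1.0

change = demo_model.predict(bumped)[0] - demo_model.predict(base)[0]

print(demo_model.coef_)  # recovers the coefficients used to build y_demo
print(change)            # equals the first coefficient
```

Note that coefficient size depends on each feature's scale, so comparing magnitudes across features is most meaningful when the features are on similar scales.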

# Advanced Topics

### [Cross Validation](https://www.kaggle.com/alexisbcook/cross-validation)

In cross-validation, we run our modeling process on different subsets of the data to get multiple measures of model quality.

Cross-validation gives a more accurate measure of model quality, which is especially important if you are making a lot of modeling decisions. However, it can take longer to run, because it estimates multiple models (one for each fold).
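
A minimal sketch of cross-validation with a pipeline, assuming synthetic data from `make_regression` in place of the housing data; the name `cv_pipeline` and the choice of 5 folds are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: swap in the real features and target)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

cv_pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negative MAE so that higher is always better;
# multiply by -1 to get ordinary MAE values, one per fold
scores = -1 * cross_val_score(cv_pipeline, X_demo, y_demo,
                              cv=5, scoring="neg_mean_absolute_error")

print("MAE per fold:", scores)
print("Average MAE:", scores.mean())
```

Passing the whole pipeline to `cross_val_score` means the preprocessing is refit on each training fold, so no information from the held-out fold leaks into it.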

### [XGBoost](https://www.kaggle.com/alexisbcook/xgboost)

We have made predictions with the random forest method, which achieves better performance than a single decision tree simply by averaging the predictions of many decision trees.

[XGBoost](https://xgboost.readthedocs.io/en/latest/) is a leading software library for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). With careful parameter tuning, you can train highly accurate models.

**XGBoost** stands for extreme gradient boosting, which is an implementation of gradient boosting with several additional features focused on performance and speed. (Scikit-learn has another version of gradient boosting, but XGBoost has some technical advantages.)
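
To show the core idea behind gradient boosting, the sketch below hand-rolls a simple version for squared error with scikit-learn decision trees: each new tree is fit to the residuals of the current ensemble and added with a small learning rate. This is an illustration of the general technique on synthetic data, not XGBoost's implementation; the values of `learning_rate`, `n_rounds`, and `max_depth` are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the target
prediction = np.full(y_demo.shape, y_demo.mean())
trees = []
train_mse = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is the residual
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a shrunken copy of the new tree's predictions to the ensemble
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    train_mse.append(np.mean((y_demo - prediction) ** 2))

print("Training MSE before boosting:", train_mse[0])
print("Training MSE after boosting:", train_mse[-1])
```

Unlike a random forest, which averages independently trained trees, each tree here depends on the errors left by the trees before it.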