# Week 6 - Classification models  

## Part 3: Travel mode choice - Probit regression

In this part we will revisit our real-world problem of travel mode choice. 

The first part closely follows the previous notebook (part 2): loading data, preprocessing, train/test split, etc. In this part, however, we will consider a probit regression model. For the sake of simplicity, let's assume that we are only interested in distinguishing between car and non-car (a binary classification problem).

Let's start by running the cells for imports, data loading, preprocessing and the train/test split.

Import required libraries:

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import linear_model
import pystan
import pystan_utils

# fix random generator seed (for reproducibility of results)
np.random.seed(42)

# matplotlib style options
plt.style.use('ggplot')
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 10)

Load data:

In [2]:
# load csv
df = pd.read_csv("modechoice_data.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,individual,hinc,psize,ttme_air,invc_air,invt_air,gc_air,ttme_train,invc_train,invt_train,gc_train,ttme_bus,invc_bus,invt_bus,gc_bus,invc_car,invt_car,gc_car,mode_chosen
0,0,70.0,30.0,4.0,10.0,61.0,80.0,73.0,44.0,24.0,350.0,77.0,53.0,19.0,395.0,79.0,4.0,314.0,52.0,1.0
1,1,8.0,15.0,4.0,64.0,48.0,154.0,71.0,55.0,25.0,360.0,80.0,53.0,14.0,462.0,84.0,4.0,351.0,57.0,2.0
2,2,62.0,35.0,2.0,64.0,58.0,74.0,69.0,30.0,21.0,295.0,66.0,53.0,24.0,389.0,83.0,7.0,315.0,55.0,2.0
3,3,61.0,40.0,3.0,45.0,75.0,75.0,96.0,44.0,33.0,418.0,96.0,53.0,28.0,463.0,98.0,5.0,291.0,49.0,1.0
4,4,27.0,70.0,1.0,20.0,106.0,190.0,127.0,34.0,72.0,659.0,143.0,35.0,33.0,653.0,104.0,44.0,592.0,108.0,1.0


Preprocess data:

In [17]:
# separate between features/inputs (X) and target/output variables (y)
mat = df.values  # DataFrame.as_matrix() was deprecated and later removed from pandas
X = mat[:,2:-1]
print (X.shape)
y = mat[:,-1].astype("int")
print (y.shape)
ind = mat[:,1].astype("int")
print (ind.shape)

(394, 17)
(394,)
(394,)


### This part is important!

This is where we turn our previous 4-class problem into a binary classification problem: car vs non-car

In [18]:
# transform to binary problem: car vs non-car
y = (y == 4).astype("int")

In [19]:
# standardize input features
X_mean = X.mean(axis=0)
X_std = X.std(axis=0)
X = (X - X_mean) / X_std
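
The cell above standardizes with statistics computed over the full dataset, before the train/test split. A common alternative, sketched below under the assumption that the test set should stay unseen during preprocessing, is to estimate the mean and standard deviation on the training rows only. The array `X_demo` and the names `mean_tr` and `std_tr` are illustrative stand-ins, not part of the notebook.

```python
import numpy as np

rng = np.random.RandomState(0)
# hypothetical stand-in for the feature matrix (rows = individuals)
X_demo = rng.normal(loc=50.0, scale=20.0, size=(10, 3))

# split first ...
perm = rng.permutation(len(X_demo))
ix_tr, ix_te = perm[:7], perm[7:]

# ... then estimate the scaling statistics on the training rows only
mean_tr = X_demo[ix_tr].mean(axis=0)
std_tr = X_demo[ix_tr].std(axis=0)

X_tr = (X_demo[ix_tr] - mean_tr) / std_tr
X_te = (X_demo[ix_te] - mean_tr) / std_tr  # reuse the training statistics

print(X_tr.mean(axis=0).round(6))  # approximately zero by construction
```

With a dataset of this size the difference from full-data standardization is probably small, so treat this as a habit rather than a required fix.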

Train/test split:

In [20]:
train_perc = 0.66 # percentage of training data
split_point = int(train_perc*len(y))
perm = np.random.permutation(len(y))
ix_train = perm[:split_point]
ix_test = perm[split_point:]
X_train = X[ix_train,:]
X_test = X[ix_test,:]
y_train = y[ix_train]
y_test = y[ix_test]
print("num train: %d" % len(y_train))
print("num test: %d" % len(y_test))

num train: 260
num test: 134


Again, for the purpose of comparison, we run the logistic regression method from sklearn. But note that although sklearn has an implementation of logistic regression, it is not a Bayesian approach, nor does it support probit regression or some other variant that you may think is more appropriate for your particular problem. On the other hand, STAN offers us complete flexibility!

In [21]:
# create and fit logistic regression model
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)

# make predictions for test set
y_hat = logreg.predict(X_test)
print ("predictions:", y_hat)
print ("true values:", y_test)

# evaluate prediction accuracy
print ("Accuracy:", 1.0*np.sum(y_hat == y_test) / len(y_test))

predictions: [0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1
 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0 0 1
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 0]
true values: [1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1
 1 1 1 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1
 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0]
Accuracy: 0.746268656716418


Ok, time to implement binary logistic regression in STAN!

Your turn now :-)

Note: don't forget to include an explicit intercept parameter $\alpha$ in the model!
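
For reference, one way to write down the model being asked for. The notation is an assumption made here: $\mathbf{x}_n$ denotes the $n$-th row of the predictor matrix, $\sigma(\cdot)$ is the logistic sigmoid, and the priors are just one weakly informative choice (with 10 as the standard deviation, as in Stan's `normal`).

$$
\begin{aligned}
\alpha &\sim \mathcal{N}(0, 10^2) \\
\beta_d &\sim \mathcal{N}(0, 10^2), \quad d = 1, \dots, D \\
y_n &\sim \text{Bernoulli}\big(\sigma(\alpha + \mathbf{x}_n^\top \boldsymbol{\beta})\big), \quad n = 1, \dots, N \\
\sigma(z) &= \frac{1}{1 + e^{-z}}
\end{aligned}
$$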

In [22]:
# define Stan model
model_definition = """
data {
    int<lower=0> N;             // number of data items
    int<lower=1> D;             // number of predictors
    int<lower=1> C;             // number of classes
    matrix[N,D] X;              // predictor matrix
    int<lower=0,upper=1> y[N];  // classes vector
}
parameters {
    real alpha;     // intercept
    vector[D] beta; // coefficients for predictors
} 
model {
    alpha ~ normal(0,10); // prior on the intercept
    beta ~ normal(0,10);  // prior on the coefficients
    
    y ~ bernoulli_logit(alpha + X * beta); // likelihood
}
"""

Prepare input data for STAN, compile STAN program and run inference (MCMC):

In [24]:
# prepare data for Stan model
N, D = X_train.shape
C = int(y_train.max())
print ("N=%d, D=%d, C=%d" % (N,D,C))
data = {'N': N, 'D': D, 'C': C, 'X': X_train, 'y': y_train}

N=260, D=17, C=1


In [25]:
%%time
# create Stan model object
sm = pystan.StanModel(model_code=model_definition)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_4bbe9bdf5dd6ad3e2bb274bfb7a5fbf4 NOW.


CPU times: user 1.12 s, sys: 78.3 ms, total: 1.2 s
Wall time: 44.2 s


In [26]:
%%time
fit = sm.sampling(data=data, iter=1000, chains=4, algorithm="NUTS", seed=42, verbose=True)



CPU times: user 29.6 ms, sys: 36 ms, total: 65.5 ms
Wall time: 20.3 s


In [27]:
print(fit)

Inference for Stan model: anon_model_4bbe9bdf5dd6ad3e2bb274bfb7a5fbf4.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

           mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha      -0.6  3.6e-3   0.16  -0.92  -0.71   -0.6  -0.49  -0.29   2000    1.0
beta[0]    0.41  4.3e-3   0.19   0.01   0.29   0.41   0.53   0.81   2000    1.0
beta[1]     0.5  6.6e-3    0.3  -0.07   0.29    0.5   0.68   1.09   2000    1.0
beta[2]    0.65  3.9e-3   0.17   0.31   0.53   0.66   0.76    1.0   2000    1.0
beta[3]    5.46    0.12   3.64  -1.18   2.97   5.43   7.82  12.93    929    1.0
beta[4]    0.91    0.03   0.83   -0.7   0.35   0.89   1.47   2.58    976    1.0
beta[5]   -6.13    0.13   3.93 -14.27  -8.71  -6.13   -3.4   1.14    931    1.0
beta[6]    0.23  4.0e-3   0.18  -0.13   0.11   0.23   0.35   0.57   2000    1.0
beta[7]    -1.1    0.09   2.41  -5.73  -2.69  -1.13   0.54   3.69    708   1.01
beta[8]    2.

Extract samples from posterior, make predictions and compute accuracy (make sure that you understand all the code!):

In [28]:
samples = fit.extract(permuted=True)  # return a dictionary of arrays

In [29]:
# make predictions for test set
mu = np.mean(samples["alpha"].T + np.dot(X_test, samples["beta"].T), axis=1)
y_hat = (mu > 0).astype("int") # a positive linear predictor (log-odds) corresponds to P(y=1) > 0.5
print ("predictions:", y_hat)
print ("true values:", y_test)

# evaluate prediction accuracy
print ("Accuracy:", 1.0*np.sum(y_hat == y_test) / len(y_test))

predictions: [0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1
 0 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1 1 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0 0 1
 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 1 1 0 0]
true values: [1 1 0 1 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1
 1 1 1 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 1
 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 0]
Accuracy: 0.746268656716418


Nice, the Stan model reproduces the sklearn predictions and accuracy on this split!
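
The prediction cell above thresholds the posterior mean of the linear predictor at 0. A closely related option is to average the predicted probabilities over posterior draws and threshold at 0.5. The two rules usually give the same labels but need not. The sketch below uses randomly generated stand-ins (`alpha_s`, `beta_s`, `X_new`) in place of `samples["alpha"]`, `samples["beta"]` and `X_test`, so its numbers are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, mapping log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
S, N_test, D = 200, 5, 3                     # hypothetical sizes
alpha_s = rng.normal(-0.5, 0.2, size=S)      # stand-in for samples["alpha"]
beta_s = rng.normal(0.0, 0.5, size=(S, D))   # stand-in for samples["beta"]
X_new = rng.normal(size=(N_test, D))         # stand-in for X_test

# linear predictor for every (test point, posterior draw) pair: shape (N_test, S)
eta = alpha_s + np.dot(X_new, beta_s.T)

# posterior predictive probability of class 1: average probabilities over draws
p_hat = sigmoid(eta).mean(axis=1)
y_pred = (p_hat > 0.5).astype("int")

print(p_hat.round(3), y_pred)
```

Keeping `p_hat` around is also useful beyond accuracy, since it shows how confident the model is about each prediction.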

Ok, now let's try a **probit regression model in STAN**.

Can you implement it?

In [42]:
# define Stan model
model_definition1 = """
data {
    int<lower=0> N;             // number of data items
    int<lower=1> D;             // number of predictors
    int<lower=1> C;             // number of classes
    matrix[N,D] X;              // predictor matrix
    int<lower=0,upper=1> y[N];  // classes vector
}
parameters {
    real alpha;     // intercept
    vector[D] beta; // coefficients for predictors
} 
model {
    alpha ~ normal(0,10); // prior on the intercept
    beta ~ normal(1,10);  // prior on the coefficients (centered at 1 here; the logistic model above used 0)
    y ~ bernoulli(Phi_approx(alpha + X * beta)); // likelihood; Phi_approx is a fast approximation to the standard normal CDF Phi
}
"""

Prepare input data for STAN, compile STAN program and run inference (MCMC):

In [43]:
# prepare data for Stan model
N, D = X_train.shape
C = int(y_train.max())
print ("N=%d, D=%d, C=%d" % (N,D,C))
data = {'N': N, 'D': D, 'C': C, 'X': X_train, 'y': y_train}

N=260, D=17, C=1


In [44]:
%%time
# create Stan model object
sm = pystan.StanModel(model_code=model_definition1)

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_43be7a799db9b77d5070e1f684f8696f NOW.


CPU times: user 1.17 s, sys: 57.3 ms, total: 1.23 s
Wall time: 45.2 s


In [45]:
%%time
fit = sm.sampling(data=data, iter=1000, chains=4, algorithm="NUTS", seed=42, verbose=True)



CPU times: user 33.3 ms, sys: 38.4 ms, total: 71.7 ms
Wall time: 23 s


In [22]:
print(fit)

Inference for Stan model: anon_model_5ec6236228cf91a31e1fbcaf964ffc23.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

           mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
alpha     -0.37  2.1e-3   0.09  -0.56  -0.43  -0.37  -0.31  -0.18 2000.0    1.0
beta[0]    0.36  2.7e-3   0.12   0.13   0.28   0.36   0.44    0.6 2000.0    1.0
beta[1]    0.19  3.8e-3   0.17  -0.14   0.06   0.19    0.3   0.52 2000.0    1.0
beta[2]    0.39  2.3e-3    0.1   0.19   0.32   0.39   0.46    0.6 2000.0    1.0
beta[3]    4.23    0.07   2.13   0.47   2.74   4.07   5.56   8.71  887.0    1.0
beta[4]    0.93    0.02   0.49   0.04   0.59   0.91   1.24   1.96  936.0    1.0
beta[5]   -4.57    0.08    2.3  -9.45   -6.0  -4.42  -2.93  -0.52  887.0    1.0
beta[6]    0.13  2.4e-3   0.11  -0.08   0.05   0.12    0.2   0.34 2000.0    1.0
beta[7]    0.07    0.07   1.64  -3.19  -1.02   0.06   1.13   3.32  602.0    1.0
beta[8]    2.

Extract samples from posterior, make predictions and compute accuracy (make sure that you understand all the code!):

In [23]:
samples = fit.extract(permuted=True)  # return a dictionary of arrays

In [24]:
# make predictions for test set
mu = np.mean(samples["alpha"].T + np.dot(X_test, samples["beta"].T), axis=1)
y_hat = (mu > 0).astype("int")
print ("predictions:", y_hat)
print ("true values:", y_test)

# evaluate prediction accuracy
print ("Accuracy:", 1.0*np.sum(y_hat == y_test) / len(y_test))

predictions: [0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1
 1 1 0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 0 0 1
 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1]
true values: [1 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0
 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 1
 1 1 0 0 1 0 1 1 0 1 0 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1
 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 1 1]
Accuracy: 0.701492537313


How do your results compare with the version that uses the logistic sigmoid?

In some cases, using a probit function instead of the logistic sigmoid can make a significant difference. In other cases, it doesn't... You have to consider what makes more sense for the specific problem that you are trying to solve. Or we can simply try different approaches, which is perfectly fine: STAN makes it very easy to try all these variants.
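
When comparing link functions, accuracy alone can hide differences, because it ignores how confident each prediction is. One complementary measure is the mean log loss on held-out data, sketched below. The probability vectors are made-up illustrative values, not results from this notebook, and `mean_log_loss` is a hypothetical helper.

```python
import numpy as np

def mean_log_loss(p, y, eps=1e-12):
    """Average negative Bernoulli log-likelihood of labels y under probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1, 0, 1, 0])
p_model_a = np.array([0.9, 0.2, 0.6, 0.4])  # hypothetical predicted probabilities
p_model_b = np.array([0.6, 0.4, 0.6, 0.4])  # hypothetical, less confident

acc_a = np.mean((p_model_a > 0.5) == y_true)
acc_b = np.mean((p_model_b > 0.5) == y_true)
loss_a = mean_log_loss(p_model_a, y_true)
loss_b = mean_log_loss(p_model_b, y_true)

print("accuracy:", acc_a, acc_b)   # identical accuracies
print("log loss:", loss_a, loss_b) # lower is better
```

Here both hypothetical models classify every point correctly, yet the log loss still separates them.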