In [1]:
import numpy as np
import pandas as pd
import os
from pathfollowing import *
from datageneration import DataGeneration, SplitData

## Readme

This is a short tutorial on using the path-following algorithm proposed in Feng, Huijie, Yang Ning, and Jiwei Zhao. "Nonregular and Minimax Estimation of Individualized Thresholds in High Dimension with Binary Responses." arXiv preprint arXiv:1905.10888 (2019).

The main function `PathFollowing` takes a Python `dict` containing X, Y, and Z (as numpy.ndarray objects) and returns a numpy.ndarray containing the estimated coefficients. For implementation details, see Algorithms 2.1 and 2.2 in the paper.
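As a rough illustration of the input format, the sketch below builds such a `dict` by hand from NumPy arrays. The key names `'X'`, `'Y'`, and `'Z'` follow the way the examples in this tutorial index the data; the array shapes (a length-n vector for X and Y, an n-by-d matrix for Z) and the toy values are assumptions made for illustration, not a specification of what `PathFollowing` requires.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5  # toy sizes, chosen only for illustration

Z = rng.normal(size=(n, d))                   # covariate matrix, assumed shape (n, d)
X = rng.normal(size=n)                        # scalar variable, assumed shape (n,)
theta = np.array([0.6, 0.8, 0.0, 0.0, 0.0])   # illustrative unit-norm coefficients
Y = np.sign(X - Z @ theta + 0.1 * rng.normal(size=n))  # labels in {-1, +1}

# Dictionary with the three keys used throughout this tutorial
data = {'X': X, 'Y': Y, 'Z': Z}
print({key: value.shape for key, value in data.items()})
```

Run `?PathFollowing` to confirm the exact requirements before passing your own data.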

The function `DataGeneration` creates datasets under three different data-generating processes (binary response model, conditional mean model, and logistic model). For more details, see Section 6.2 of the paper.

Below we provide an example for each case.


### Example: estimation and prediction for binary response model with heteroskedastic noise
Here we simulate data from the binary response model
$$Y = \mathrm{sign}(X - \theta^T Z + u),$$
where $u \sim N(0,\sigma^2(1 + 2(X - \theta^T Z)^2))$.

The following chunk of code generates a dataset (a training set plus a test set), each with sample size n = 1000, dimension of Z d = 100, number of nonzero coefficients s = 10, and $\sigma = 0.8$. The coefficient vector $\theta$ is generated by first sampling s coordinates uniformly at random in \[1,2\] and then normalizing so that $\|\theta\|_2 = 1$.
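To make the model concrete, here is a minimal NumPy sketch of one way to simulate from it. This is not the `DataGeneration` implementation: the function name `simulate_binary_response` is hypothetical, and drawing X and Z as independent standard normals, placing the nonzero coordinates at random positions, and drawing their values from \[1,2\] are all assumptions (the package's printed `rho` setting suggests it uses a correlated design, which this sketch does not reproduce).

```python
import numpy as np

def simulate_binary_response(n, d, s, sig, rng):
    """Hypothetical helper: draw one sample from the heteroskedastic model."""
    # Sparse coefficient vector: s nonzero entries in [1, 2], then unit L2 norm
    theta = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    theta[support] = rng.uniform(1.0, 2.0, size=s)
    theta /= np.linalg.norm(theta)

    # Assumed design: independent standard normal X and Z
    Z = rng.normal(size=(n, d))
    X = rng.normal(size=n)

    # Noise standard deviation grows with the distance from the threshold
    margin = X - Z @ theta
    u = rng.normal(size=n) * sig * np.sqrt(1.0 + 2.0 * margin**2)

    Y = np.sign(margin + u)
    return {'X': X, 'Y': Y, 'Z': Z, 'true_coeff': theta}

rng = np.random.default_rng(0)
sim = simulate_binary_response(n=1000, d=100, s=10, sig=0.8, rng=rng)
print(sim['Z'].shape, np.count_nonzero(sim['true_coeff']))
```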

In [2]:
# Generate Data
n = 1000
d = 100
s = 10
Model = DataGeneration('Manski',n*2,d,s,sig = 0.8).create()
print(Model)

Manski model with n:2000,d:100,s:10,rho:0.5,normalize:True,sig:0.8


In [3]:
# use Model.generate() to generate a random sample,
# and then split it into training and test set.
dat_train,dat_test = SplitData(Model.generate())

In [4]:
# check the parameter definition and default values
?PathFollowing

In [5]:
# To use the path-following algorithm
coeff = PathFollowing(dat_train,1,Gkernel,0.5*np.sqrt(np.log(d)/n))

In [6]:
# To evaluate the performance
l2_error = np.linalg.norm(coeff - dat_train['true_coeff'])
# fraction of test points whose label is predicted correctly (accuracy, not error)
accuracy = np.mean(dat_test['Y'] == np.sign(dat_test['X'] - dat_test['Z'].dot(coeff)))
print('l2 estimation error is {:.4f}, test set classification accuracy is {:.4f}'.format(l2_error,accuracy))

l2 estimation error is 0.5575, test set classification accuracy is 0.7120
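The evaluation step can be wrapped in a small helper. The sketch below uses a hypothetical function `evaluate` and hand-made toy arrays (not output from `PathFollowing`) to show both quantities: the $\ell_2$ distance to the true coefficients and the fraction of test labels predicted correctly. The misclassification rate is one minus that fraction.

```python
import numpy as np

def evaluate(coeff, true_coeff, test):
    """Hypothetical helper: return (l2 error, accuracy, misclassification rate)."""
    l2_error = np.linalg.norm(coeff - true_coeff)
    # Predicted label is the sign of X minus the estimated threshold Z @ coeff
    y_hat = np.sign(test['X'] - test['Z'] @ coeff)
    accuracy = np.mean(test['Y'] == y_hat)
    return l2_error, accuracy, 1.0 - accuracy

# Toy test set with four observations and two covariates
toy_test = {
    'X': np.array([1.0, -1.0, 2.0, 0.5]),
    'Z': np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.0]]),
    'Y': np.array([1.0, -1.0, 1.0, 1.0]),
}
toy_coeff = np.array([1.0, 0.0])   # stand-in for an estimate
toy_truth = np.array([0.6, 0.8])   # stand-in for the true coefficients

l2, acc, err = evaluate(toy_coeff, toy_truth, toy_test)
print('l2 error {:.4f}, accuracy {:.4f}, error rate {:.4f}'.format(l2, acc, err))
# prints: l2 error 0.8944, accuracy 0.7500, error rate 0.2500
```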


### Example: estimation and prediction for the other two scenarios
For details of the data-generating processes, see the original paper.

#### Conditional mean model

In [9]:
n = 1000
d = 100
s = 10
Model = DataGeneration('ConditionalMean',n*2,d,s,sig = 0.5).create()
print(Model)
dat_train,dat_test = SplitData(Model.generate())
coeff = PathFollowing(dat_train,1,Gkernel,0.5*np.sqrt(np.log(d)/n))
l2_error = np.linalg.norm(coeff - dat_train['true_coeff'])
# fraction of test points whose label is predicted correctly (accuracy, not error)
accuracy = np.mean(dat_test['Y'] == np.sign(dat_test['X'] - dat_test['Z'].dot(coeff)))
print('l2 estimation error is {:.4f}, test set classification accuracy is {:.4f}'.format(l2_error,accuracy))

ConditionalMean model with n:2000,d:100,s:10,rho:0.5,normalize:True,sig:0.5,ratio:0.5,mid:2.0,bias:0
l2 estimation error is 0.3688, test set classification accuracy is 0.9290


#### Logistic model

In [10]:
n = 1000
d = 100
s = 10
Model = DataGeneration('Logistic',n*2,d,s).create()
print(Model)
dat_train,dat_test = SplitData(Model.generate())
coeff = PathFollowing(dat_train,1,Gkernel,0.5*np.sqrt(np.log(d)/n))
l2_error = np.linalg.norm(coeff - dat_train['true_coeff'])
# fraction of test points whose label is predicted correctly (accuracy, not error)
accuracy = np.mean(dat_test['Y'] == np.sign(dat_test['X'] - dat_test['Z'].dot(coeff)))
print('l2 estimation error is {:.4f}, test set classification accuracy is {:.4f}'.format(l2_error,accuracy))

Logistic model with n:2000,d:100,s:10,rho:0.5,normalize:True
l2 estimation error is 0.4053, test set classification accuracy is 0.7890
