In [1]:
# import Pkg
# Pkg.add("Ipopt")
# Pkg.add("PyCall")
# Pkg.add("Cbc")

In [11]:
# Load the basic packages
using CSV, Ipopt, DataFrames, JuMP
using Statistics
using PyCall
using Cbc

# Load the Python packages (imblearn and scikit-learn must be installed in the Python environment that PyCall uses)
mm = pyimport("imblearn.metrics")
sk = pyimport("sklearn")

┌ Info: Precompiling Cbc [9961bab8-2fa3-5c5a-9d89-47fab24efd76]
└ @ Base loading.jl:1242


PyObject <module 'sklearn' from '/Users/cynthiazeng/anaconda3/lib/python3.7/site-packages/sklearn/__init__.py'>

In [2]:
#Load algorithms
include("basic.jl")

getbalancedt (generic function with 1 method)

In [3]:
# Read in a dataset (the last column is the binary class to predict)
df = DataFrame(CSV.File("image-segmentation_path.csv"));
p = size(df, 2)

# Split the dataset (df[:, p] replaces the deprecated df[p] column lookup)
train_split = 0.8
train_valid_split = 0.75
X_all, X_test, y_all, y_test = sk.model_selection.train_test_split(Matrix(df[:, 1:p-1]), df[:, p], test_size = 1-train_split, random_state = 0);

#Normalize the data
scaler = sk.preprocessing.StandardScaler().fit(X_all)
X_all = scaler.transform(X_all)
X_test = scaler.transform(X_test)
X_train, X_valid, y_train,y_valid = sk.model_selection.train_test_split(X_all, y_all, test_size = 1-train_valid_split, random_state = 0)
df_all = (X_all,X_train,X_valid,X_test,y_all,y_train,y_valid,y_test)

# Define some important hyper-parameters (see the definition of k in the paper)
max_k = Int(round(size(y_all)[1]/sum(y_all)*0.75))
half_k = Int(round(size(y_all)[1]/sum(y_all)*0.5))
min_k = max(Int(round(size(y_all)[1]/sum(y_all)*0.25)),1)
all_k = [1,min_k,half_k,max_k]
zero_bound = 1e-8
epsilon = 0.05

#Train the robust logistic regression and return the parameters
beta,beta0 = robustlr(all_k,df_all,zero_bound,epsilon)

#Report the AUC and F1-score of robust logistic regression
res = getmetric(X_test,y_test,beta,beta0)
println("AUC of robust logistic regression: ",res[1])
println("F1-score of robust logistic regression: ",res[2])

#Train the robust SVM and return the parameters
beta,beta0 = robustsvm(all_k,df_all,zero_bound,epsilon)

#Report the AUC and F1-score of robust SVM
res = getmetric_svm(X_test,y_test,beta,beta0)
println("AUC of robust SVM:",res[1])
println("F1-score of robust SVM:",res[2])

│   caller = top-level scope at In[3]:8
└ @ Core In[3]:8



******************************************************************************
This program contains Ipopt, a library for large-scale nonlinear optimization.
 Ipopt is released as open source code under the Eclipse Public License (EPL).
         For more information visit http://projects.coin-or.org/Ipopt
******************************************************************************

This is Ipopt version 3.12.10, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:       20
Number of nonzeros in inequality constraint Jacobian.:     3674
Number of nonzeros in Lagrangian Hessian.............:    35089

Total number of variables............................:      209
                     variables with only lower bounds:       20
                variables with lower and upper bounds:        0
                     variables with only upper bounds:      167
Total number of equ

UndefVarError: UndefVarError: Cbc not defined
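
The metric helpers called above (getmetric, getmetric_svm) live in basic.jl, which is not shown here. As a rough illustration of what an F1-score computed from a linear model's parameters could look like, here is a minimal sketch. It assumes a plain logistic link, a 0.5 threshold, and class 1 as the positive class; `sigmoid`, `predict_labels`, `f1_score`, and the toy data are hypothetical and are not the functions from basic.jl.

```julia
# Sketch only: hypothetical helpers, not the functions defined in basic.jl.

# Logistic (sigmoid) link applied to a linear score.
sigmoid(z) = 1 / (1 + exp(-z))

# Predict 0/1 labels from coefficients `beta` and intercept `beta0`.
function predict_labels(X::AbstractMatrix, beta::AbstractVector, beta0::Real; threshold=0.5)
    probs = sigmoid.(X * beta .+ beta0)
    return Int.(probs .>= threshold)
end

# F1-score with class 1 treated as the positive (minority) class.
function f1_score(y_true::AbstractVector, y_pred::AbstractVector)
    tp = sum((y_true .== 1) .& (y_pred .== 1))   # true positives
    fp = sum((y_true .== 0) .& (y_pred .== 1))   # false positives
    fn = sum((y_true .== 1) .& (y_pred .== 0))   # false negatives
    denom = 2 * tp + fp + fn
    return denom == 0 ? 0.0 : 2 * tp / denom
end

# Toy example with made-up numbers.
X_toy = [2.0 0.0; -1.0 0.5; 0.5 1.0; -2.0 -1.0]
beta_toy = [1.0, 1.0]
y_toy = [1, 1, 1, 0]
y_hat = predict_labels(X_toy, beta_toy, 0.0)
println("F1-score on the toy data: ", f1_score(y_toy, y_hat))
```

On this toy data the second positive sample is missed, so precision is 1, recall is 2/3, and the F1-score comes out at 0.8.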

In [12]:
# Generate a balanced dataset for other classification methods to use
# (for most datasets, setting k to 75% of the majority-to-minority ratio works best)
k_need = Int(round(size(y_all)[1]/sum(y_all)*0.75))
df_balanced = getbalancedt(df_all,k_need,zero_bound,epsilon)

# Denormalize: map the features back to their original scale
p = size(df_balanced, 2)
X_origin = scaler.inverse_transform(df_balanced[:,1:p-1])
df_final = [X_origin df_balanced[:,p]]
println("Imbalance ratio of this new dataset: ",size(df_final)[1]/sum(df_final[:,p])-1)

This is Ipopt version 3.12.10, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:       20
Number of nonzeros in inequality constraint Jacobian.:     3674
Number of nonzeros in Lagrangian Hessian.............:    35089

Total number of variables............................:      209
                     variables with only lower bounds:       20
                variables with lower and upper bounds:        0
                     variables with only upper bounds:      167
Total number of equality constraints.................:        1
Total number of inequality constraints...............:      167
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:      167

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0 
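
Every k value used in this notebook (max_k, half_k, min_k, k_need) is a fixed fraction of the ratio between the total sample count and the number of class-1 samples. The arithmetic can be sketched on a toy label vector as follows; `k_candidates` and `y_toy` are hypothetical names introduced only for illustration.

```julia
# Sketch only: `k_candidates` mirrors the arithmetic in the cells above;
# it is a hypothetical helper and not part of basic.jl.
function k_candidates(y::AbstractVector)
    ratio = length(y) / sum(y)                    # samples per class-1 sample
    max_k  = round(Int, 0.75 * ratio)
    half_k = round(Int, 0.5 * ratio)
    min_k  = max(round(Int, 0.25 * ratio), 1)     # never let k drop below 1
    return [1, min_k, half_k, max_k]
end

y_toy = vcat(ones(Int, 2), zeros(Int, 22))        # 24 labels, 2 positives
println(k_candidates(y_toy))                      # ratio = 12
```

With 24 labels and 2 positives the ratio is 12, so the candidates are [1, 3, 6, 9]; k_need in the cell above corresponds to the last entry.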

In [15]:
df_final

144×20 Array{Float64,2}:
  14.0  146.0  9.0  0.111111  …   15.6667   0.487864  -2.10891   0.0
 160.0  110.0  9.0  0.0           39.3333   0.307444  -2.18724   0.0
  16.0  128.0  9.0  0.0            7.11111  0.561508  -0.985811  0.0
 143.0   24.0  9.0  0.0          141.667    0.169397  -2.34925   0.0
 150.0  158.0  9.0  0.0           12.2222   0.503086  -1.94345   0.0
 138.0  133.0  9.0  0.0       …    8.22222  0.558201  -1.07766   0.0
 118.0  126.0  9.0  0.0           28.6667   0.437078  -2.1588    0.0
  29.0  195.0  9.0  0.0           20.3333   0.342307   1.94304   0.0
  85.0  101.0  9.0  0.0           26.7778   0.404792  -1.5586    0.0
 163.0   68.0  9.0  0.0           68.2222   0.265053  -1.98431   0.0
 196.0   95.0  9.0  0.0       …   10.7778   0.512566  -1.74926   0.0
 157.0   85.0  9.0  0.0           26.8889   0.459024  -2.16091   0.0
 187.0   80.0  9.0  0.0           47.6667   0.24473   -1.9427    0.0
   ⋮                          ⋱                                     
 118.0  1
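
The imbalance ratio printed in the earlier cell is computed from the last column of df_final as (number of rows) / (number of class-1 rows) - 1, which equals the number of class-0 rows per class-1 row. A minimal sketch on made-up labels, with `imbalance_ratio` as a hypothetical helper:

```julia
# Sketch only: `imbalance_ratio` is a hypothetical helper.
# Labels are assumed to be coded as 0 (majority) and 1 (minority).
imbalance_ratio(y::AbstractVector) = length(y) / sum(y) - 1

y_toy = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

# 6 samples, 2 positives: 6/2 - 1 = 2 negatives per positive.
println("Imbalance ratio: ", imbalance_ratio(y_toy))

# The same value obtained by counting the two classes directly.
println(count(y_toy .== 0) / count(y_toy .== 1))
```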

In [17]:
df_balanced

144×20 Array{Float64,2}:
 -1.5025      0.388845   0.0   3.44182   …   0.301325   -0.482323  0.0
  0.520245   -0.22393    0.0  -0.290544     -0.510716   -0.532307  0.0
 -1.47479     0.0824576  0.0  -0.290544      0.632783    0.234345  0.0
  0.28472    -1.68778    0.0  -0.290544     -1.13204    -0.635689  0.0
  0.381701    0.593103   0.0  -0.290544      0.369837   -0.376739  0.0
  0.215448    0.167565   0.0  -0.290544  …   0.617899    0.175735  0.0
 -0.0616396   0.0484145  0.0  -0.290544      0.0727442  -0.514161  0.0
 -1.29468     1.2229     0.0  -0.290544     -0.353804    2.10329   0.0
 -0.518834   -0.377124   0.0  -0.290544     -0.0725688  -0.13116   0.0
  0.561808   -0.938834   0.0  -0.290544     -0.701513   -0.402812  0.0
  1.019      -0.479253   0.0  -0.290544  …   0.412502   -0.252823  0.0
  0.478681   -0.649468   0.0  -0.290544      0.171522   -0.515506  0.0
  0.894313   -0.734576   0.0  -0.290544     -0.792983   -0.37626   0.0
  ⋮                                      ⋱          
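
df_balanced holds the standardized features, while df_final holds the same rows mapped back to the original scale. The round trip can be sketched in pure Julia as below. This is a stand-in for the scikit-learn scaler used above, not a replacement for it: `fit_scaler`, `standardize`, and `unstandardize` are hypothetical names, and the toy matrix borrows the first three columns of three rows from the df_final display purely for illustration.

```julia
using Statistics

# Sketch only: hypothetical pure-Julia stand-ins for the scaler used above.
function fit_scaler(X::AbstractMatrix)
    mu = mean(X, dims=1)
    sigma = std(X, dims=1, corrected=false)   # population std (divide by n)
    sigma[sigma .== 0] .= 1.0                  # constant columns are only centered
    return (mu=mu, sigma=sigma)
end

standardize(s, X) = (X .- s.mu) ./ s.sigma     # original scale -> z-scores
unstandardize(s, Z) = Z .* s.sigma .+ s.mu     # z-scores -> original scale

X_toy = [14.0 146.0 9.0; 160.0 110.0 9.0; 16.0 128.0 9.0]
s = fit_scaler(X_toy)
Z = standardize(s, X_toy)
X_back = unstandardize(s, Z)
println(maximum(abs.(X_back .- X_toy)))        # round-trip error, close to zero
```

The constant third column maps to 0.0 after standardization, which matches the all-zero third column visible in the df_balanced output.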

In [19]:
df = DataFrame(CSV.File("../data/2015_S2.csv"));

In [26]:
df_balanced = getbalancedt(df,100,zero_bound,epsilon)

This is Ipopt version 3.12.10, running with linear solver mumps.
NOTE: Other linear solvers might be more efficient (see Ipopt documentation).

Number of nonzeros in equality constraint Jacobian...:       20
Number of nonzeros in inequality constraint Jacobian.:     3674
Number of nonzeros in Lagrangian Hessian.............:    35089

Total number of variables............................:      209
                     variables with only lower bounds:       20
                variables with lower and upper bounds:        0
                     variables with only upper bounds:      167
Total number of equality constraints.................:        1
Total number of inequality constraints...............:      167
        inequality constraints with only lower bounds:        0
   inequality constraints with lower and upper bounds:        0
        inequality constraints with only upper bounds:      167

iter    objective    inf_pr   inf_du lg(mu)  ||d||  lg(rg) alpha_du alpha_pr  ls
   0 

BoundsError: BoundsError: attempt to access 143-element Array{Int64,1} at index [1:2400]
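
The BoundsError above is consistent with getbalancedt receiving a raw table where the earlier cells pass the eight-element df_all tuple. That reading is an assumption, since basic.jl is not shown here. A minimal sketch of rebuilding such a tuple in pure Julia follows. `make_splits` is a hypothetical helper; it uses a plain random permutation, so it will not reproduce the exact rows chosen by scikit-learn's train_test_split, and the features would still need to be standardized as in the earlier cell.

```julia
using Random

# Sketch only: `make_splits` is a hypothetical helper. It returns the arrays in
# the same order as the `df_all` tuple built earlier in this notebook:
# (X_all, X_train, X_valid, X_test, y_all, y_train, y_valid, y_test).
function make_splits(X::AbstractMatrix, y::AbstractVector;
                     train_split=0.8, train_valid_split=0.75, seed=0)
    n = length(y)
    idx = randperm(MersenneTwister(seed), n)            # shuffled row indices
    n_all = round(Int, train_split * n)                 # train + validation rows
    n_train = round(Int, train_valid_split * n_all)     # training rows
    all_idx, test_idx = idx[1:n_all], idx[n_all+1:end]
    train_idx, valid_idx = all_idx[1:n_train], all_idx[n_train+1:end]
    return (X[all_idx, :], X[train_idx, :], X[valid_idx, :], X[test_idx, :],
            y[all_idx], y[train_idx], y[valid_idx], y[test_idx])
end

X_toy = reshape(collect(1.0:40.0), 20, 2)               # 20 rows, 2 features
y_toy = vcat(ones(Int, 4), zeros(Int, 16))              # 4 positives
splits = make_splits(X_toy, y_toy)
println(map(a -> size(a, 1), splits))                   # row counts per split
```

With a real dataset, `X` would be the feature columns as a Matrix and `y` the label column, and the resulting tuple would take the place of `df` in the call above.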