### Introduction ###
This notebook performs some basic Logistic Regression analysis of the Media 6 Degrees data from the [Doing Data Science : Straight Talk from the Frontline book][booklink] by Cathy O'Neil & Rachel Schutt published by O'Reilly Media. The data can be downloaded [here][datalink] and specifically I will be looking at the `dds_ch5_binary-class-dataset.txt` data.

The file contains data exploring whether users buy a product based on their habits on a website, I think. It is not clear from the book nor the data what exactly the features relate to. I decided to replicate the code given in the book as best as possible here, and then I also decided to repeat some of the analysis using the [MLJ](https://alan-turing-institute.github.io/MLJ.jl/stable/) library. I think implementing some of the functions gives you a better understanding of what is going on, however it is also important to understand how to use these libraries to their best effect.

I will be doing the data analysis using Julia.

[booklink]: https://www.oreilly.com/library/view/doing-data-science/9781449363871/
[datalink]: https://github.com/oreillymedia/doing_data_science

#### Logistic Regression ####
Let's briefly discuss logistic regression. Logistic Regression outputs values bounded by 0 & 1, and hence we can directly interpret them as probabilities. It relies on the inverse logit function

$$P(t)= \text{logit}^{-1}(t) = \frac{1}{1+e^{-t}} = \frac{e^{t}}{1+e^{t}}.$$

We can see that it maps t from $\mathbb R$ to $[0,1]$, but let's plot it to see the shape.

In [27]:
using Plots
plotly()
Plots.PlotlyBackend()
t = -10:0.1:10;
y = 1 ./(1 .+ exp.(.-t));
plot(t,y)

To define a logistic regression model we start with
$$P(c_{i}|x_{i}) = [\text{logit}^{-1}(\alpha + \beta^{\text{T}}x_{i})]^{c_{i}}[1-\text{logit}^{-1}(\alpha + \beta^{\text{T}}x_{i})]^{(1-c_{i})},$$
here $x_{i}$ is the vector of features of user $i$ and $c_{i}$ is the class ($1$ or $0$). We then have two possibilities:
$$P(c_{i}=1|x_{i}) = [\text{logit}^{-1}(\alpha + \beta^{\text{T}}x_{i})],$$
and
$$P(c_{i}=0|x_{i}) = [1-\text{logit}^{-1}(\alpha + \beta^{\text{T}}x_{i})].$$
To make this a linear model in the outcomes $c_{i}$ we can take the log of the odds ratio:
$$\frac{\log(P(c_{i}=1|x_{i}))}{1-P(c_{i}=1|x_{i})} = \alpha + \beta^{\text{T}}x_{i}.$$
We note that the logit function is defined as
$$\text{logit}(p) =  \log(\frac{p}{1-p} = \log(p) - \log(1-p),$$
and therefore
$$logit(P(c_{i}=1|x_{i})) = \alpha + \beta^{\text{T}}x_{i}.$$

In our case we have the logit of the probability that a user buys a product is being modelled as a linear function of the features. The value $\alpha$ is the base rate, the unconditional probability that a user buys a product, knowing nothing of the features. The values of $\beta$ determine the extent to which certain features are markers for increased or decreased likelihood of a user buying a product.

#### Estimating $\alpha$ and $\beta$ ####
As in most machine learning models we want to predict the values of $\alpha$ and $\beta$ given some training data. To estimate the parameters lets set $\Theta = {\alpha,\beta}$ and define the likelihood function as
$L(\Theta|X_{1},X_{2},...,X_{n}) = P(X|\Theta) = P(X_{1}|\Theta)...P(X_{n}|\Theta),$
where we assume that the data points $X_{i}$ are independent and $i = 1,\dots,n$ are our n users. In this case, our independence assumption corresponds to saying the behaviour of one user does not affect the behaviour of any other user.

We want to maximise the likelihood function, i.e. find the parameters that satisfy
$$\Theta_{\text{MLE}} = \text{argmax}_{\Theta} \prod_{i=1}^{n}P(X_{i}|\Theta).$$
Setting $p_{i} = 1/\left(1+e^{-(\alpha+\beta^{\text{T}}x_{i})}\right)$, then $P(X_{i}|\Theta) = p_{i}^{c_{i}}(1-p_{i})^{(1-c_{i})}$, and
$$\Theta_{\text{MLE}} = \text{argmax}_{\Theta} \prod_{i=1}^{n}p_{i}^{c_{i}}(1-p_{i})^{(1-c_{i})}.$$

We can compute this value using standard methods.

#### Model Evaluation####

As logistic regression is a probabilistic classifier, to predict a class for a new user $x$, we compute the value $P(c=1|x)$, if this is above some threshold then we classify the user as $c = 1$. If the value is below the threshold we'll classify the user as $c = 0$. One metric for measuring the performance of a classifier is to create the receiver operating characteristic (ROC) curve and calculate the area under the curve (AUC). The ROC curve is the number of true positives plotted against the number of false positives calculated as the threshold increases in value. A perfect predictor would have a $0$ false positive rate and a $1$ true positive rate at the optimal threshold, giving an AUC of $1$. For a completely random classifier we would expect the false positive rate to be equal to the true positive rate at all values of the threshold, giving an AUC of $0.5$. The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than randomly chosen negative instance.

The ROC curve is a good way of estimating the ranking performance of a classifier, but not the accuracy of the probability predictions. An interesting introduction to ROC curve analysis can be found [here](https://people.inf.elte.hu/kiss/13dwhdm/roc.pdf). Lets dive in to the exercise now!

The first step is to import the libraries I'll be using and then load up the data into a DataFrame. As I mentioned earlier, I'll first replicate the R code from the book before using MLJ to implement the model as well. For the initial fitting I'll use GLM.

In [3]:
using CSV
using DataFrames
using GLM
using MLJ
using Printf
using PyCall
using Statistics
using StatsBase
df = DataFrame(CSV.File("dds_ch5_binary-class-dataset.txt",delim="\t"));

┌ Info: Precompiling CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]
└ @ Base loading.jl:1260
┌ Info: Precompiling MLJ [add582a8-e3ab-11e8-2d5e-e98b27df1bc7]
└ @ Base loading.jl:1260
[ Info: Model metadata loaded from registry. 
┌ Info: Precompiling Plots [91a5bcdd-55d7-5caf-9e0b-520d859cae80]
└ @ Base loading.jl:1260
┌ Info: Precompiling PyCall [438e738f-606a-5dbb-bf0a-cddfbfd45ab0]
└ @ Base loading.jl:1260
┌ Info: Precompiling ORCA [47be7bcc-f1a6-5447-8b36-7eeeff7534fd]
└ @ Base loading.jl:1260
env: node: No such file or directory


Let's take a look at the data in a bit of detail. The labels are contained in the ```:y_buy``` column, and the other columns relate to the user features. Looking at the first 5 values shows a lot of 0 values, and when we call ```describe``` we can also see that some values are continuous, some are boolean, and some are counts.

In [4]:
first(df,5) |> pretty
println(propertynames(df))

┌[0m────────────────[0m┬[0m─────────────[0m┬[0m────────────────────[0m┬[0m───────────────────[0m┬[0m─────[0m ⋯
│[0m[1m at_buy_boolean [0m│[0m[1m at_freq_buy [0m│[0m[1m at_freq_last24_buy [0m│[0m[1m at_freq_last24_sv [0m│[0m[1m at_f[0m ⋯
│[0m[90m Int64          [0m│[0m[90m Int64       [0m│[0m[90m Int64              [0m│[0m[90m Int64             [0m│[0m[90m Int6[0m ⋯
│[0m[90m Count          [0m│[0m[90m Count       [0m│[0m[90m Count              [0m│[0m[90m Count             [0m│[0m[90m Coun[0m ⋯
├[0m────────────────[0m┼[0m─────────────[0m┼[0m────────────────────[0m┼[0m───────────────────[0m┼[0m─────[0m ⋯
│[0m 0.0            [0m│[0m 0.0         [0m│[0m 0.0                [0m│[0m 0.0               [0m│[0m 0.0 [0m ⋯
│[0m 0.0            [0m│[0m 0.0         [0m│[0m 0.0                [0m│[0m 0.0               [0m│[0m 0.0 [0m ⋯
│[0m 0.0            [0m│[0m 0.0         [0m│[0m 0.0                [0m│[0m 0

In [5]:
describe(df, :min, :max, :mean, :median, :std)

Unnamed: 0_level_0,variable,min,max,mean,median,std
Unnamed: 0_level_1,Symbol,Real,Real,Float64,Float64,Float64
1,at_buy_boolean,0.0,1.0,0.0426315,0.0,0.202027
2,at_freq_buy,0.0,15.0,0.052891,0.0,0.298157
3,at_freq_last24_buy,0.0,4.0,0.00194196,0.0,0.0516822
4,at_freq_last24_sv,0.0,36.0,0.117049,0.0,0.822048
5,at_freq_last24_sv_int_buy,0.0,36.0,0.0235417,0.0,0.483363
6,at_freq_sv,0.0,84.0,1.85278,1.0,2.92182
7,at_freq_sv_int_buy,0.0,75.0,0.271636,0.0,2.20629
8,at_interval_buy,0.0,174.625,0.210008,0.0,3.92202
9,at_interval_sv,0.0,184.917,5.82561,0.0,17.5954
10,at_interval_sv_int_buy,0.0,176.708,0.381641,0.0,3.88416


Let's start by splitting the data up features and labels, although we won't use all of the features to predict the labels we'll keep all of them in the feature matrix to start with. We'll also split these into training and test sets, with a 0.65 split.

In [6]:
y,X = unpack(df,==(:y_buy),!=(:y_buy));
train, test = partition(eachindex(y), 0.65, shuffle=true, rng=1234); # 65:35 split

Before training the model, we'll define a few functions. I felt that these function were not so well explained in the book, with minimal commenting. The first function calculates the weighted mean absolute error between the predicted values and the actual classes.

In [7]:
function getmae(p,y,b,doplot)
    # Normalise the predicted values to the range 0 -> 1.
    max_p = maximum(p);
    p_norm = p./max_p;
    # Convert from normalised value to histogram bin
    bin = max_p .* floor.(p_norm.*b)./b;
    # Put these values into a dataframe
    d = DataFrame(bin = bin,p = p,y = y);
    t = countmap(bin);
    # Get the unique values of the bins, we'll iterate over these
    u_bin = unique(bin);
    summ = DataFrame(bin =[],mean_p =[],mean_y =[],count = []);
    # Iterate over the bin, for the predicted values that lie within this bin
    # calculate the mean of these, calculate the mean true class labels of these as well
    # and calculate the number of instances in this bin.
    for u in u_bin
        inds = findall(x->x==u,bin);
        push!(summ,[u,mean(p[inds]),mean(y[inds]),length(inds)]);
    end
    num = 0;
    den = 0;
    for i in nrows(summ)
        # For each bin calculate n*(|mean(p(bin)) - mean(y(bin))|)
        # where n is the number of instances in the bin
        num += (summ[i,:count]*abs(summ[i,:mean_p] - summ[i,:mean_y]));
        # sum up the number of instances (data points).
        den += summ[i,:count];
    end
    # The wMAE is the ratio of these values
    wmae = num./den;
    # We can also plot mean p & y values in each bin against the number of data points in each bin.
    if (doplot)
        p1 = plot(summ[:,:count], summ[:,:mean_p], seriestype = :scatter, title = string("MAE = ",wmae),label = "Predicted");
        plot!(p1,summ[:,:count], summ[:,:mean_y], seriestype = :scatter,label = "True");
        display(p1)
    end
    return wmae
end

getmae (generic function with 1 method)

This next function generates the ROC curve, and calculates the AUC.

In [8]:
function getauc(y_pred,test)
    # Set the thresholds to use, as the unique predicted values and include 1 above and below
    # the range.
    thresh = sort(unique([minimum(y_pred) - 0.1 ;y_pred[:]; maximum(y_pred)+0.1]), rev=true);
    n_tp = [];
    n_fp = [];
    # Get the number of true positives and true negatives in the test set
    n_totaltruep = sum(test[:,:y_buy] .>0);
    n_totaltruef = sum(test[:,:y_buy] .<1);
    # Iterate through the thresholds calculating the true and false positives
    for t in thresh
        y_bin = y_pred .> t;
        push!(n_tp,sum((y_bin .>0) .& (test[:,:y_buy] .>0)))
        push!(n_fp,sum((y_bin .>0) .& (test[:,:y_buy] .<1)))
    end
    # Normalise for the ROC curve
    n_tp_norm = n_tp./n_totaltruep;
    n_fp_norm = n_fp./n_totaltruef;
    # Trapezoidal rule for calculating the AUC.
    auc = sum((n_fp_norm[2:end] .- n_fp_norm[1:end-1]).*(n_tp_norm[2:end] .+ n_tp_norm[1:end-1])./2.0);
    return auc,n_tp_norm,n_fp_norm
end

getauc (generic function with 1 method)

The final function we'll define is to perform cross validation and calculate the wMAE and AUC for each fold. Cross validation is the process of assigning a fold number to each user. The model is trained using users in folds 2:n and the performance is tested on the users in fold 1. This is then repeated with fold 2 as the test set, and so on. The average metrics can be calculated over all folds to give a better estimate of the model performance.

In [9]:
function getxval(invars,data,folds,mae_bins)
    # Assign the folds
    data[!,:fold] = Int64.(floor.(rand(nrows(data))*folds).+1);
    auc = [];
    wmae = [];
    fold = [];
    f = Term(Symbol("y_buy")) ~ sum(term.(Symbol.(invars)));
    for i in 1:folds
        # Split using the i'th fold.
        train = data[data.fold .!=i,:];
        test = data[data.fold .==i,:];
        # Fit and predict
        mod = glm(f, train, Binomial(), LogitLink());
        y_pred = StatsBase.predict(mod,test);
        push!(fold,i);
        # Calculate the metrics
        push!(wmae,getmae(y_pred,test[:,:y_buy],mae_bins,false));
        a,_ = getauc(y_pred,test)
        push!(auc,a);
    end
    return DataFrame(fold=fold,wmae=wmae,auc=auc)
end

getxval (generic function with 1 method)

No we've defined these functions we can train the model. The exercise in the book started by selecting a subset of features and trains the model using all of these. Subsequently, each feature from the subset is used to train a univariate logistic regression model and the performance is calculated using 10-fold cross validation. 

In [10]:
invars = lowercase.(["AT_BUY_BOOLEAN", "AT_FREQ_BUY", "AT_FREQ_LAST24_BUY", "AT_FREQ_LAST24_SV", "AT_FREQ_SV", "EXPECTED_TIME_BUY", "EXPECTED_TIME_SV", "LAST_BUY", "LAST_SV", "num_checkins"]);
f = Term(Symbol("y_buy")) ~ sum(term.(Symbol.(invars)));
mod = glm(f, df[train,:], Binomial(), LogitLink())
y_pred = StatsBase.predict(mod,df[test,:]);
A = getxval(invars,df,10,100);
auc_all = mean(A[:,:auc])
println(string("auc_all = ",auc_all))

auc_all = 0.8558448631888499


We'll test out a 10-fold cross validation here, using our subset of features to get a better estimate at the model performance.

In [11]:
A = getxval(invars,df,10,100);
println(string("Mean auc_all = ",mean(A[:,:auc])))
println(string("Sigma auc_all = ",std(A[:,:auc])))

Mean auc_all = 0.8552118392504191
Sigma auc_all = 0.06141305635745399


In [12]:
auc_mu = [];
auc_sig = [];
mae_mu = [];
mae_sig = [];
for i = 1:length(invars)
    A = getxval([invars[i]],df,10,100);
    push!(auc_mu,mean(A[:,:auc]));
    push!(auc_sig,std(A[:,:auc]));
    push!(mae_mu,mean(A[:,:wmae]));
    push!(mae_sig,std(A[:,:wmae]));
end
univar = DataFrame(var = invars,auc_mu=auc_mu,auc_sig=auc_sig,mae_mu=mae_mu,mae_sig=mae_sig)

Unnamed: 0_level_0,var,auc_mu,auc_sig,mae_mu,mae_sig
Unnamed: 0_level_1,String,Any,Any,Any,Any
1,at_buy_boolean,0.677859,0.0682013,0.00484321,0.00352954
2,at_freq_buy,0.670247,0.0438923,0.432685,0.237136
3,at_freq_last24_buy,0.505725,0.0111246,0.159584,0.291535
4,at_freq_last24_sv,0.622402,0.0443789,0.24167,0.276638
5,at_freq_sv,0.782871,0.0389873,0.271568,0.283953
6,expected_time_buy,0.522684,0.0260083,0.139977,0.30017
7,expected_time_sv,0.591086,0.0719549,0.00816677,0.000769281
8,last_buy,0.660039,0.0316149,0.035861,0.0130342
9,last_sv,0.808985,0.050895,0.0108524,0.0126308
10,num_checkins,0.572228,0.0408544,0.00112348,0.000706379


The above table shows that using the feature ```:last_sv``` gives a mean AUC of 0.82 (sigma = 0.04), which is not too dissimilar to using all of the features (mean AUC = 0.84, sigma = 0.05). This could suggest that a number of the features could be discarded from the model and still give equivalent performance. Fewer features can reduce any overfitting that is occurring.

Let's take a look at the wMAE plot for the ```:num_checkins``` variable. The wMAE is very low, and when plotting the mean predicted and true values in each bin we can see that difference is minimal, with the exception of those bins with relatively few data points.

In [13]:
f = Term(Symbol("y_buy")) ~ sum(term.(Symbol.([invars[10]])));
mod = glm(f, df[train,:], Binomial(), LogitLink())
y_pred = StatsBase.predict(mod,df[test,:]);


In [14]:
getmae(y_pred,df[test,:y_buy],100,true)

8.217284975435745e-5

The next part of the exercise is to incrementally include features, again from a subset of the total feature list, and estimate the performance. The features were specified in the exercise code, so I'm not sure if they were ordered or whether it was somewhat random. Let's take a look and see what happens to the metrics as we increase the number of features.

In [15]:
invars = lowercase.(["LAST_SV", "AT_FREQ_SV", "AT_FREQ_BUY", "AT_BUY_BOOLEAN", "LAST_BUY", "AT_FREQ_LAST24_SV", "EXPECTED_TIME_SV", "num_checkins", "EXPECTED_TIME_BUY", "AT_FREQ_LAST24_BUY"])
auc_mu = [];
auc_sig = [];
mae_mu = [];
mae_sig = [];
for i = 1:length(invars)
    A = getxval(invars[1:i],df,10,100);
    push!(auc_mu,mean(A[:,:auc]));
    push!(auc_sig,std(A[:,:auc]));
    push!(mae_mu,mean(A[:,:wmae]));
    push!(mae_sig,std(A[:,:wmae]));
end
kvar = DataFrame(auc_mu=auc_mu,auc_sig=auc_sig,mae_mu=mae_mu,mae_sig=mae_sig)

Unnamed: 0_level_0,auc_mu,auc_sig,mae_mu,mae_sig
Unnamed: 0_level_1,Any,Any,Any,Any
1,0.815733,0.0382346,0.00622961,0.0042377
2,0.835401,0.0529824,0.460318,0.335143
3,0.852924,0.055567,0.311956,0.203376
4,0.861305,0.0610648,0.299228,0.226448
5,0.86347,0.0423776,0.229312,0.235094
6,0.86287,0.0378146,0.235826,0.159008
7,0.864778,0.0309709,0.408496,0.359644
8,0.861331,0.0321193,0.339832,0.305737
9,0.863408,0.0324725,0.177487,0.0898247
10,0.862315,0.0241336,0.254379,0.139366


From the table above, it appears that there is a benefit to including a few more features, however the mean AUC appears to plateau after introducing 3 or so features. The relatively high sigma values make it difficult to compare these, however.

The final thing we'll do is to examine the ROC curves for model trained using 1 of 3 features: ```:last_sv```, ```:last_buy```, and ```:num_checkins```.

In [16]:
invars = lowercase.([ "LAST_SV", "LAST_BUY", "num_checkins"]);
f = Term(Symbol("y_buy")) ~ sum(term.(Symbol.([invars[1]])));
mod = glm(f, df[train,:], Binomial(), LogitLink())
y_pred = StatsBase.predict(mod,df[test,:]);
auc_1,n_tp_norm,n_fp_norm = getauc(y_pred,df[test,:]);
p1 = plot(n_fp_norm, n_tp_norm, seriestype = :line, title = "ROC Curves",label = string(invars[1],", AUC = ", @sprintf("%3.2f",auc_1)),xlabel="False Positive Rate",ylabel="True Positive Rate");
f = Term(Symbol("y_buy")) ~ sum(term.(Symbol.([invars[2]])));
mod = glm(f, df[train,:], Binomial(), LogitLink())
y_pred = StatsBase.predict(mod,df[test,:]);
auc_2,n_tp_norm,n_fp_norm = getauc(y_pred,df[test,:]);
plot!(p1,n_fp_norm, n_tp_norm, seriestype = :line,label = string(invars[2],", AUC = ", @sprintf("%3.2f",auc_2)));
f = Term(Symbol("y_buy")) ~ sum(term.(Symbol.([invars[3]])));
mod = glm(f, df[train,:], Binomial(), LogitLink())
y_pred = StatsBase.predict(mod,df[test,:]);
auc_3,n_tp_norm,n_fp_norm = getauc(y_pred,df[test,:]);
plot!(p1,n_fp_norm, n_tp_norm, seriestype = :line,label = string(invars[3],", AUC = ", @sprintf("%3.2f",auc_3)),legend=:bottomright);
display(p1)

From the plot we can see that the predictions using ```:last_sv``` only look reasonably good, with an AUC of 0.81. When we use ```:last_buy``` we can see that there is a large jump at a false positive rate of approximately 0.04. The next value is (1,1). The plots in the next cell shows how the true & false positive rate changes with the threshold as well as a histogram of the predicted values. What we can see is there are a large number of users that receive the predicted value. This suggests that this is a poor classifier. 

The ```:num_checkins``` is also a poor predictor when training the model solely on this feature. The ROC curve is almost equal to the straight line.

In [17]:
f = Term(Symbol("y_buy")) ~ sum(term.(Symbol.([invars[2]])));
mod = glm(f, df[train,:], Binomial(), LogitLink())
y_pred = StatsBase.predict(mod,df[test,:]);
auc_2,n_tp_norm,n_fp_norm = getauc(y_pred,df[test,:]);
thresh = sort(unique([minimum(y_pred) - 0.1 ;y_pred[:]; maximum(y_pred)+0.1]), rev=true);
p2 = plot(thresh,n_fp_norm,label = "False Positives");
plot!(thresh,n_tp_norm,label = "True Positives",xlabel="Threshold",ylabel="Rate")
display(p2)
p3 = histogram(y_pred,bins = 1000,label="Predicted Value")
display(p3)

#### Performing Logistic Regression using MLJ####
Previously I converted the R code in the exercise to Julia, however it is much more efficient to use in-built functions rather than, for example, coding cross-validation from scratch. So let's use the MLJ library to perform logistic regression and generate the ROC curves.

The first thing to do, after reloading the data, is to ensure the features (and labels) are in the correct ScientificTypes. In the previous implementation I didn't care, however the models in MLJ will only work if the correct type is specified. In this example we'll coerce all of the features to the ```Continuous``` type, except for the label which will a be Multiclass categorical variable.

In [18]:
df_mlj= DataFrame(CSV.File("dds_ch5_binary-class-dataset.txt",delim="\t"));
coerce!(df_mlj,:y_buy => Multiclass)
coerce!(df_mlj,Count => Continuous);
MLJ.schema(df_mlj)

┌[0m───────────────────────────[0m┬[0m────────────────────────────────[0m┬[0m───────────────[0m┐[0m
│[0m[22m _.names                   [0m│[0m[22m _.types                        [0m│[0m[22m _.scitypes    [0m│[0m
├[0m───────────────────────────[0m┼[0m────────────────────────────────[0m┼[0m───────────────[0m┤[0m
│[0m at_buy_boolean            [0m│[0m Float64                        [0m│[0m Continuous    [0m│[0m
│[0m at_freq_buy               [0m│[0m Float64                        [0m│[0m Continuous    [0m│[0m
│[0m at_freq_last24_buy        [0m│[0m Float64                        [0m│[0m Continuous    [0m│[0m
│[0m at_freq_last24_sv         [0m│[0m Float64                        [0m│[0m Continuous    [0m│[0m
│[0m at_freq_last24_sv_int_buy [0m│[0m Float64                        [0m│[0m Continuous    [0m│[0m
│[0m at_freq_sv                [0m│[0m Float64                        [0m│[0m Continuous    [0m│[0m
│[0m at_freq_sv_int_b

Let's unpack the variables in to X & y, again we'll keep all of the features for now, and then print all of the possible models we can implement.

Note how many of them are classifiers? That's because we have a categorical label.

In [19]:
y,X = unpack(df_mlj,==(:y_buy),!=(:y_buy));
for m in models(matching(X, y))
        println(rpad(m.name, 30), "($(m.package_name))")
end

AdaBoostClassifier            (ScikitLearn)
AdaBoostStumpClassifier       (DecisionTree)
BaggingClassifier             (ScikitLearn)
BayesianLDA                   (MultivariateStats)
BayesianLDA                   (ScikitLearn)
BayesianQDA                   (ScikitLearn)
BayesianSubspaceLDA           (MultivariateStats)
ConstantClassifier            (MLJModels)
DecisionTreeClassifier        (DecisionTree)
DeterministicConstantClassifier(MLJModels)
DummyClassifier               (ScikitLearn)
EvoTreeClassifier             (EvoTrees)
ExtraTreesClassifier          (ScikitLearn)
GaussianNBClassifier          (NaiveBayes)
GaussianNBClassifier          (ScikitLearn)
GaussianProcessClassifier     (ScikitLearn)
GradientBoostingClassifier    (ScikitLearn)
KNNClassifier                 (NearestNeighbors)
KNeighborsClassifier          (ScikitLearn)
LDA                           (MultivariateStats)
LGBMClassifier                (LightGBM)
LinearBinaryClassifier        (GLM)
LinearSVC                

Let's load up the ```LogisticClassifier``` from the MLJLinearModels as our logistic regression classifier. I could also have chosen the ```LogisticClassifier``` from ScikitLearn. I'll then perform a 10-fold cross-validation on the same subset of features as before. When comparing the average AUC to the previous section there is only a small difference and they are within 1 standard deviation of each other.

In [20]:
logreg = @load LogisticClassifier pkg="MLJLinearModels";
invars = lowercase.(["AT_BUY_BOOLEAN", "AT_FREQ_BUY", "AT_FREQ_LAST24_BUY", "AT_FREQ_LAST24_SV", "AT_FREQ_SV", "EXPECTED_TIME_BUY", "EXPECTED_TIME_SV", "LAST_BUY", "LAST_SV", "num_checkins"]);
X_1 = select(X,Symbol.(invars));
logreg_mach = machine(logreg, X_1, categorical(y));
train, test = partition(eachindex(y), 0.65, shuffle=true, rng=1234);
MLJ.fit!(logreg_mach, rows=train);
MLJ.evaluate!(logreg_mach,resampling=CV(nfolds=10),measure=[area_under_curve])

┌ Info: Training [34mMachine{LogisticClassifier} @858[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317


┌[0m──────────────────[0m┬[0m───────────────[0m┬[0m───────────────────────────────────────────────────────────────────────[0m┐[0m
│[0m[22m _.measure        [0m│[0m[22m _.measurement [0m│[0m[22m _.per_fold                                                            [0m│[0m
├[0m──────────────────[0m┼[0m───────────────[0m┼[0m───────────────────────────────────────────────────────────────────────[0m┤[0m
│[0m area_under_curve [0m│[0m 0.854         [0m│[0m [0.884, 0.816, 0.911, 0.811, 0.926, 0.902, 0.754, 0.814, 0.846, 0.88] [0m│[0m
└[0m──────────────────[0m┴[0m───────────────[0m┴[0m───────────────────────────────────────────────────────────────────────[0m┘[0m
_.per_observation = [missing]


We can again train our model using the indvidual features and calculate the mean AUC. The values match reasonably well with the previous method.

In [21]:
mlj_auc_mu=[];
mlj_auc_sigma=[];
for i in 1:length(invars)
    X_1 = select(X,Symbol(invars[i]));
    logreg_mach = machine(logreg, X_1, categorical(y));
    train, test = partition(eachindex(y), 0.65, shuffle=true, rng=1234);
    MLJ.fit!(logreg_mach, rows=train);
    cvvals = MLJ.evaluate!(logreg_mach,resampling=CV(nfolds=10),measure=[area_under_curve]);
    push!(mlj_auc_mu,cvvals.measurement[1]);
    push!(mlj_auc_sigma,std(cvvals.per_fold[1]));
end

┌ Info: Training [34mMachine{LogisticClassifier} @022[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @442[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @060[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @382[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @043[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @008[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachin

In [22]:
mlj_univar = DataFrame(invars=invars,mlj_auc_mu=mlj_auc_mu,mlj_auc_sigma=mlj_auc_sigma);
sort!(mlj_univar,:invars)

Unnamed: 0_level_0,invars,mlj_auc_mu,mlj_auc_sigma
Unnamed: 0_level_1,String,Any,Any
1,at_buy_boolean,0.674205,0.0715528
2,at_freq_buy,0.675944,0.0724728
3,at_freq_last24_buy,0.514206,0.0477182
4,at_freq_last24_sv,0.626825,0.0379018
5,at_freq_sv,0.778848,0.0563941
6,expected_time_buy,0.531775,0.0511604
7,expected_time_sv,0.59717,0.0405078
8,last_buy,0.662966,0.0864981
9,last_sv,0.809408,0.0508812
10,num_checkins,0.578356,0.0715092


We can combine the univariate performance metrics using the two methods in to a single table and compare the values. The values of mean(AUC) are very close for both methods and certainly within 1 sigma.

In [23]:
univarcomp = DataFrame(mlj_invars = mlj_univar[!,:invars], mlj_auc_mu = mlj_univar[!,:mlj_auc_mu],mlj_auc_sigma = mlj_univar[!,:mlj_auc_sigma],auc_mu = univar[!,:auc_mu],auc_sigma = univar[!,:auc_sig],invars = univar[!,:var])

Unnamed: 0_level_0,mlj_invars,mlj_auc_mu,mlj_auc_sigma,auc_mu,auc_sigma,invars
Unnamed: 0_level_1,String,Any,Any,Any,Any,String
1,at_buy_boolean,0.674205,0.0715528,0.677859,0.0682013,at_buy_boolean
2,at_freq_buy,0.675944,0.0724728,0.670247,0.0438923,at_freq_buy
3,at_freq_last24_buy,0.514206,0.0477182,0.505725,0.0111246,at_freq_last24_buy
4,at_freq_last24_sv,0.626825,0.0379018,0.622402,0.0443789,at_freq_last24_sv
5,at_freq_sv,0.778848,0.0563941,0.782871,0.0389873,at_freq_sv
6,expected_time_buy,0.531775,0.0511604,0.522684,0.0260083,expected_time_buy
7,expected_time_sv,0.59717,0.0405078,0.591086,0.0719549,expected_time_sv
8,last_buy,0.662966,0.0864981,0.660039,0.0316149,last_buy
9,last_sv,0.809408,0.0508812,0.808985,0.050895,last_sv
10,num_checkins,0.578356,0.0715092,0.572228,0.0408544,num_checkins


In [24]:
invars = lowercase.(["LAST_SV", "AT_FREQ_SV", "AT_FREQ_BUY", "AT_BUY_BOOLEAN", "LAST_BUY", "AT_FREQ_LAST24_SV", "EXPECTED_TIME_SV", "num_checkins", "EXPECTED_TIME_BUY", "AT_FREQ_LAST24_BUY"])
mlj_auc_mu = [];
mlj_auc_sigma = [];
for i = 1:length(invars)
    X_1 = select(X,Symbol.(invars[1:i]));
    logreg_mach = machine(logreg, X_1, categorical(y));
    train, test = partition(eachindex(y), 0.65, shuffle=true, rng=1234);
    MLJ.fit!(logreg_mach, rows=train);
    cvvals = MLJ.evaluate!(logreg_mach,resampling=CV(nfolds=10),measure=[area_under_curve]);
    push!(mlj_auc_mu,cvvals.measurement[1]);
    push!(mlj_auc_sigma,std(cvvals.per_fold[1]));
end
mlj_kvar = DataFrame(mlj_auc_mu=mlj_auc_mu,mlj_auc_sigma=mlj_auc_sigma);

┌ Info: Training [34mMachine{LogisticClassifier} @015[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @143[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @177[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @856[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @785[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @123[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachin

In [25]:
mlj_kvar |> pretty

┌[0m────────────────────[0m┬[0m──────────────────────[0m┐[0m
│[0m[1m mlj_auc_mu         [0m│[0m[1m mlj_auc_sigma        [0m│[0m
│[0m[90m Any                [0m│[0m[90m Any                  [0m│[0m
│[0m[90m Continuous         [0m│[0m[90m Continuous           [0m│[0m
├[0m────────────────────[0m┼[0m──────────────────────[0m┤[0m
│[0m 0.8094083307871109 [0m│[0m 0.05088116014667789  [0m│[0m
│[0m 0.8301109787229572 [0m│[0m 0.05321717400667095  [0m│[0m
│[0m 0.8461285715016634 [0m│[0m 0.05671154747415235  [0m│[0m
│[0m 0.8558152871767251 [0m│[0m 0.049116869393841356 [0m│[0m
│[0m 0.8554886274596777 [0m│[0m 0.04943146849773672  [0m│[0m
│[0m 0.8556645392263438 [0m│[0m 0.04977499913412082  [0m│[0m
│[0m 0.8572298005557343 [0m│[0m 0.05233837145960831  [0m│[0m
│[0m 0.8563541381073406 [0m│[0m 0.05335269536238837  [0m│[0m
│[0m 0.8561793635941839 [0m│[0m 0.0537325820854307   [0m│[0m
│[0m 0.8544794758905656 [0m│[0m 0.05495937936

The same trends are visible here as above when incrementally adding in features to the model, the performance does not increase after 3 or so features are included. We also get very similar plots below for the ROC curves for 3 of the features.

In [26]:
invars = lowercase.([ "LAST_SV", "LAST_BUY", "num_checkins"]);
X_1 = select(X,Symbol(invars[1]));
logreg_mach = machine(logreg, X_1, categorical(y));
MLJ.fit!(logreg_mach, rows=train);
y_hat = MLJ.predict(logreg_mach,X_1);
x_roc1,y_roc1 = roc_curve(y_hat,y);
auc_1 = sum((x_roc1[2:end] .- x_roc1[1:end-1]).*(y_roc1[2:end] .+ y_roc1[1:end-1])./2.0);
p1 = plot(x_roc1, y_roc1, seriestype = :line, title = "ROC Curves",label = string(invars[1],", AUC = ", @sprintf("%3.2f",auc_1)));


X_1 = select(X,Symbol(invars[2]));
logreg_mach = machine(logreg, X_1, categorical(y));
MLJ.fit!(logreg_mach, rows=train);
y_hat = MLJ.predict(logreg_mach,X_1);
x_roc2,y_roc2 = roc_curve(y_hat,y);
auc_2 = sum((x_roc2[2:end] .- x_roc2[1:end-1]).*(y_roc2[2:end] .+ y_roc2[1:end-1])./2.0);
plot!(p1,x_roc2, y_roc2, seriestype = :line,label = string(invars[2],", AUC = ", @sprintf("%3.2f",auc_2)));

X_1 = select(X,Symbol(invars[3]));
logreg_mach = machine(logreg, X_1, categorical(y));
MLJ.fit!(logreg_mach, rows=train);
y_hat = MLJ.predict(logreg_mach,X_1);
x_roc3,y_roc3 = roc_curve(y_hat,y);
auc_3 = sum((x_roc3[2:end] .- x_roc3[1:end-1]).*(y_roc3[2:end] .+ y_roc3[1:end-1])./2.0);
plot!(p1,x_roc3, y_roc3, seriestype = :line,label = string(invars[3],", AUC = ", @sprintf("%3.2f",auc_3)));
display(p1)

┌ Info: Training [34mMachine{LogisticClassifier} @296[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317
┌ Info: Training [34mMachine{LogisticClassifier} @581[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317


┌ Info: Training [34mMachine{LogisticClassifier} @246[39m.
└ @ MLJBase /Users/ncalvertuk/.juliapro/JuliaPro_v1.4.1-1/packages/MLJBase/b1egR/src/machines.jl:317


That's the end of this exercise showing an application of logistic regression as a binary classifier. Ideally, I'd like to try some feature selection techniques to better choose the features for the model. I did take a look at calculating the Pearsons correlation coefficient with limited success. I'd be keen to explore this more in the future.