# Tutorial: Regression with Flux

This tutorial introduces regression analysis, which estimates the relationship between an outcome variable and one or more independent variables. It covers two common forms, linear and logistic regression, both of which are widely used for prediction and forecasting.

This tutorial combines [Flux's regression tutorial](https://github.com/FluxML/model-zoo/tree/master/other/iris) with the [Bayesian linear regression example](https://probcomp.github.io/Gen/dev/getting_started/#Example-1) from the development version of Gen's documentation, which serves as an introductory topic for beginners in probabilistic programming.

Structure:

1. Linear regression
2. Logistic regression
3. Exercise
4. Bayesian linear regression using probabilistic programming
5. Exercise

## 1. Linear regression

Import the plotting package and Flux, together with Flux's throttle function `Flux.throttle(f, timeout)`, which ensures that the `callback` function runs at most once every `timeout` seconds.

In [None]:
using Plots
using Flux
using Flux: throttle
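
To make the effect of throttling concrete, here is a minimal plain-Julia sketch of the idea. `simple_throttle` is a hypothetical helper written for illustration only. It is not Flux's implementation and merely mimics the behaviour described above.

```julia
# Hypothetical helper: wrap `f` so that it runs at most once per `timeout` seconds.
function simple_throttle(f, timeout)
    last_run = -Inf                       # time of the last call that actually ran
    return function (args...)
        t = time()
        if t - last_run >= timeout
            last_run = t
            return f(args...)
        end
        return nothing                    # calls inside the window are skipped
    end
end

calls = Ref(0)
counted = simple_throttle(() -> (calls[] += 1), 10)

# Three calls in quick succession: only the first one runs.
for _ in 1:3
    counted()
end
println(calls[])
```

The counter ends at 1 because the second and third call fall inside the 10-second window.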

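Generate an artificial dataset for all linear regression experiments. The inputs are drawn from a standard normal distribution, so almost all of them fall within $[-3, 3]$. The outputs follow a line with intercept 50 and slope 100, disturbed by Gaussian noise with a standard deviation of 20.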

In [None]:
# Generate an artificial dataset
regX = randn(100)
regY = 50 .+ 100 * regX + 20 * randn(100);

# Plot the artificial dataset
scatter(regX, regY, fmt = :png, legend=:bottomright, label="Artificial Data")

### 1.1 Exact regression using linear algebra

Recalling the regression equations

$ Y_{i} = \beta_{0} + \beta_{1} X_{i} + \epsilon_{i}$

regression can be expressed as a matrix equation

$ Y = X \beta + \epsilon$

which we can easily express for our dataset.
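
For reference, the estimate computed in the next cell is the ordinary least-squares solution. It follows from minimising the squared residuals $\lVert Y - X\beta \rVert^2$ and setting the gradient with respect to $\beta$ to zero:

$ -2 X^\top (Y - X \beta) = 0 \quad\Rightarrow\quad X^\top X \hat{\beta} = X^\top Y \quad\Rightarrow\quad \hat{\beta} = (X^\top X)^{-1} X^\top Y$

This assumes that $X^\top X$ is invertible, which holds when the columns of $X$ (here the column of ones and the inputs) are linearly independent.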

In [None]:
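# Linear regression with Julia's built-in linear algebra
# Design matrix: a column of ones (intercept) next to the inputs (slope)
X = hcat(ones(length(regX)), regX)
Y = regY
# Solve the normal equations for the two coefficients
intercept, slope = inv(X'*X) * (X'*Y);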
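
As a cross-check, the explicit inverse can be compared with Julia's backslash operator, which solves the same least-squares problem without forming the inverse. The sketch below uses a small noise-free toy dataset (an assumption for illustration, with hypothetical variable names) so that the exact coefficients are known.

```julia
using LinearAlgebra

# Noise-free toy data (an assumption for illustration): y = 2 + 3x
x_toy = [0.0, 1.0, 2.0, 3.0]
y_toy = 2 .+ 3 .* x_toy

# Design matrix with a column of ones for the intercept
X_toy = hcat(ones(length(x_toy)), x_toy)

# Normal equations, as in the cell above
beta_normal = inv(X_toy' * X_toy) * (X_toy' * y_toy)

# Least-squares solve via the backslash operator
beta_backslash = X_toy \ y_toy

println(beta_normal)
println(beta_backslash)
```

Both approaches should agree up to floating-point error. The backslash form is usually preferred in practice because it avoids the explicit matrix inverse.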

Plot the results for later comparison to other approaches.

In [None]:
# Plot regression line
plot!((x) -> intercept + slope * x, -3, 3, label="fit_exact")

### 1.2 Linear Regression using Flux

Using the machine learning library [Flux](https://github.com/fluxml/flux.jl), we now express regression as a model with a single dense layer, which we train with gradient descent on the mean-squared error. Flux's `train!` function combines these parts into a single training routine, and a throttled `callback` prints the loss summed over the dataset while training runs.

In [None]:
# create data tuples
data = zip(regX, regY)

# Define the model
model = Dense(1, 1, identity)

# Mean-squared error
loss(x, y) = Flux.mse(model([x]), y)

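# Callback: print the loss summed over the whole dataset
evalcb = () -> @show(sum([loss(i[1], i[2]) for i in data]))

# Plain gradient descent with a learning rate of 0.1
opt = Descent(0.1)

# Train for 50 epochs. Each epoch wraps the callback in a fresh throttle,
# so the loss is printed at most once every 10 seconds within an epoch.
for i = 1:50
    Flux.train!(loss, params(model), data, opt, cb=throttle(evalcb, 10))
end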
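
To see what gradient descent on the mean-squared error does without any library, the following plain-Julia sketch fits a line by hand. It is an illustration under stated assumptions, not Flux's implementation: `fit_line` is a hypothetical helper, the data are noise-free, and the update uses the full batch, whereas the Flux loop above updates after every single sample.

```julia
using Statistics: mean

# Noise-free toy data (an assumption for illustration): y = 50 + 100x
xs = collect(range(-2, 2; length=50))
ys = 50 .+ 100 .* xs

# Hypothetical helper: full-batch gradient descent on the mean-squared error
function fit_line(xs, ys; lr=0.1, steps=500)
    w, b = 0.0, 0.0
    for _ in 1:steps
        residuals = w .* xs .+ b .- ys
        grad_w = 2 * mean(residuals .* xs)   # d(MSE)/dw
        grad_b = 2 * mean(residuals)         # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    end
    return w, b
end

w_fit, b_fit = fit_line(xs, ys)
println("slope: $w_fit, intercept: $b_fit")
```

On this toy data the fitted slope and intercept approach 100 and 50, the values used to generate it.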

Retrieve the weight $\theta$ and the bias from the trained model parameters.

In [None]:
(θ, bias) = collect(params(model))

Plot the trained model against the previously found 'exact' linear algebra-based model.

In [None]:
# Plot the line trained with Flux
plot!((x) -> bias[1] + θ[1] * x, -3, 3, label="fit_flux")

## 2. Logistic regression

Logistic regression models datasets with categorical outcomes, in the simplest case binary measurements. It is built on the logistic function of $x$:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

This function maps any real number to a probability between 0 and 1. With more than two classes, the softmax function plays the same role and assigns a probability to each class. These probabilities are then most commonly scored with the `crossentropy` loss function.

Switching to logistic regression, we now use the `Iris` dataset, a classical pattern recognition dataset consisting of 3 classes with 50 instances each. Each class represents a type of iris plant.
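
The logistic function is short enough to write out directly. The sketch below defines it in plain Julia under the hypothetical name `logistic` and evaluates it at a few points.

```julia
# Logistic (sigmoid) function: maps any real number to the interval (0, 1)
logistic(x) = 1 / (1 + exp(-x))

println(logistic(0.0))                    # the midpoint of the curve
println(logistic(4.0) + logistic(-4.0))   # symmetry: the two values sum to one
println(logistic(-10.0) < logistic(10.0)) # the function is increasing
```

An input of zero corresponds to a probability of 0.5, and large positive or negative inputs push the output towards 1 or 0.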

In [None]:
using Flux: crossentropy, normalise, onecold, onehotbatch
using Statistics: mean

In [None]:
# Get the labels & features of the Iris dataset
labels = Flux.Data.Iris.labels()
features = Flux.Data.Iris.features();

One-hot encoding turns the class labels into an array of binary values, which is easier for the model to work with: each sample gets a 1 for the category it belongs to and 0s for all other categories. For more explanation, please have a look at the [sklearn.preprocessing.OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) documentation. Before encoding the labels, we also normalise each feature to zero mean and unit standard deviation.

In [None]:
# Normalise features
normed_features = normalise(features, dims=2)

# Split into classes and add one hot labels
klasses = sort(unique(labels))
onehot_labels = onehotbatch(labels, klasses);
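
The layout that one-hot encoding produces can be reproduced in a few lines of plain Julia. `onehot_matrix` is a hypothetical helper for illustration, not Flux's `onehotbatch`, and the three labels are made up.

```julia
# Hypothetical helper: one row per class, one column per sample
function onehot_matrix(labels, classes)
    return [label == class for class in classes, label in labels]
end

toy_labels = ["setosa", "virginica", "setosa"]
toy_classes = sort(unique(toy_labels))

encoded = onehot_matrix(toy_labels, toy_classes)
display(encoded)
```

Every column contains exactly one `true` entry, in the row of the class that the sample belongs to.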

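To train on our dataset, we now have to split the data into a training and a test dataset. A common choice is to train on about 70% of the data and hold out the remaining 30% to evaluate the model later. Here we take two out of every three samples for training and every third sample for testing, which gives 100 training and 50 test samples.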

In [None]:
# Split into training and test sets
train_indices = [1:3:150; 2:3:150]

X_train = normed_features[:, train_indices]
y_train = onehot_labels[:, train_indices]

X_test = normed_features[:, 3:3:150]
y_test = onehot_labels[:, 3:3:150];
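
A quick sanity check on the index pattern confirms that the two sets are disjoint and together cover all 150 samples. The variable names below are hypothetical and independent of the cell above.

```julia
# Same index pattern as in the cell above
train_idx = [1:3:150; 2:3:150]
test_idx = collect(3:3:150)

println(length(train_idx))                               # number of training samples
println(length(test_idx))                                # number of test samples
println(isempty(intersect(train_idx, test_idx)))         # no sample is in both sets
println(sort([train_idx; test_idx]) == collect(1:150))   # every sample is used once
```

Taking every third sample, rather than the last 50, also spreads the test set across the whole file. This matters if the samples are ordered by class, which is the case in the standard Iris file.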

Construct the neural network in Flux as a single dense, fully-connected layer followed by a softmax activation function, so that the outputs are class probabilities. The `crossentropy` loss measures the classification performance on a logarithmic scale and penalises confident wrong predictions especially heavily. For further explanation have a look at the [ML-cheatsheet](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html).

In [None]:
# Example model with 4 features and 3 probabilities as outputs
model = Chain(
    Dense(4, 3),
    softmax
)

loss(x, y) = crossentropy(model(x), y);
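
The two building blocks of this model can be sketched in plain Julia for a single sample. `softmax_vec` and `cross_entropy` are hypothetical names chosen for illustration. Flux's own `softmax` and `crossentropy` additionally handle batches, which this sketch leaves out.

```julia
# Softmax: turn raw scores into probabilities that sum to one
function softmax_vec(scores)
    shifted = exp.(scores .- maximum(scores))   # subtract the maximum for numerical stability
    return shifted ./ sum(shifted)
end

# Cross-entropy between predicted probabilities and a one-hot target
cross_entropy(probs, target) = -sum(target .* log.(probs))

scores = [2.0, 0.5, -1.0]      # made-up raw outputs for three classes
target = [1.0, 0.0, 0.0]       # the true class is the first one

probs = softmax_vec(scores)
println(sum(probs))
println(cross_entropy(probs, target))
```

With a one-hot target, the loss reduces to the negative log-probability assigned to the true class, so it grows quickly when that probability is small.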

Use gradient descent to optimise the neural network for the `crossentropy` loss defined above. The `train!` function takes our `crossentropy` loss function, the parameters of the `model`, the data iterator and our gradient-descent optimiser (`Descent`). It performs one update per element of the data iterator, i.e. `200` iterations over the full training set in this case.

In [None]:
# Training configuration
optimiser = Descent(0.5)
data_iterator = Iterators.repeated((X_train, y_train), 200)

Flux.train!(loss, params(model), data_iterator, optimiser)

In [None]:
# Evaluation of trained model
accuracy(x, y) = mean(onecold(model(x)) .== onecold(y))

accuracy_score = accuracy(X_test, y_test)

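To examine the performance in more detail, we now calculate the `confusion_matrix`. Its rows correspond to the true classes and its columns to the predicted classes, so the diagonal counts the correct predictions and every off-diagonal entry counts one kind of misclassification: a 'false negative' for the true class and a 'false positive' for the predicted class. The smaller the off-diagonal entries are, the better our trained model.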

In [None]:
# Construct the confusion matrix for the model:
# rows are true classes, columns are predicted classes
function confusion_matrix(X, y)
    y_hat = onehotbatch(onecold(model(X)), 1:3)
    y * y_hat'
end

display(confusion_matrix(X_test, y_test))
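
The same counting can be written without one-hot matrices. The sketch below uses made-up class indices and a hypothetical helper `count_confusions` to show how the entries of a confusion matrix arise.

```julia
# Hypothetical helper: rows are true classes, columns are predicted classes
function count_confusions(true_classes, predicted_classes, n_classes)
    counts = zeros(Int, n_classes, n_classes)
    for (t, p) in zip(true_classes, predicted_classes)
        counts[t, p] += 1
    end
    return counts
end

# Made-up class indices for illustration
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

cm = count_confusions(y_true, y_pred, 3)
display(cm)

# Accuracy is the share of samples on the diagonal
println(sum(cm[i, i] for i in 1:3) / sum(cm))
```

In this toy example four of the six samples lie on the diagonal. The two off-diagonal entries are a class-1 sample predicted as class 2 and a class-3 sample predicted as class 1.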

## 3. Exercise

- Perform linear or logistic regression on a dataset from [RDatasets.jl](https://github.com/JuliaStats/RDatasets.jl). You can query the available datasets with the following commands:
    - `RDatasets.packages()`
    - `RDatasets.datasets()`
    - and e.g. `RDatasets.datasets("plm")`, which would be a good option for a linear regression model
- Play around with the conceptual framework of linear and logistic regression by testing different network types and observing how they behave (e.g. changes in training time, accuracy, etc.). The types of layers available in Flux are:
    - Convolutional layers, `Conv(size, in=>out)`, and `Conv(size, in=>out, relu)`
    - Recurrent layers, `RNN(in, out)`, `LSTM(in, out)`, and `GRU(in, out)`
    - Dropout layers, `Dropout(p, dims)`, and `AlphaDropout(p)`. Dropout layers reduce overfitting by randomly dropping nodes of the network with probability `p`, which forces the network to continuously adapt to this noisy training instead of overfitting to the one case at hand.
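
To illustrate the dropout idea from the list above, here is a minimal plain-Julia sketch. `dropout_sketch` is a hypothetical helper, not Flux's `Dropout` layer. It assumes the common "inverted dropout" convention, in which the surviving entries are scaled by `1 / (1 - p)` so that the expected value stays unchanged.

```julia
using Random

# Hypothetical helper: zero each entry with probability p and rescale the rest
function dropout_sketch(rng, x, p)
    keep = rand(rng, length(x)) .>= p      # true where the entry survives
    return x .* keep ./ (1 - p)
end

rng = MersenneTwister(42)
x = ones(8)
println(dropout_sketch(rng, x, 0.5))
```

With `p = 0.5`, every entry of the output is either 0 (dropped) or 2 (kept and rescaled). Which entries are dropped changes with every call.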

## 4. Bayesian linear regression through probabilistic programming (Outlook)

To implement Bayesian models efficiently, we rely on probabilistic programming languages such as `Gen` and `Turing`, which interface seamlessly with the rest of the Julia ecosystem. This explicitly includes Flux. At their core, these probabilistic programming systems provide a domain-specific language for expressing probabilistic models and sampling from probability distributions, together with integrated inference engines such as `Hamiltonian Monte-Carlo`, `Sequential Monte-Carlo`, `Variational Inference`, etc.

In [None]:
using Gen

At its core, probabilistic programming works with the abstraction of a 'generative model'. In Gen, a generative model is a Julia function annotated with `@gen`, which places it in Gen's dynamic domain-specific language (to be explained in the afternoon).

In [None]:
# Define the generative model, which Gen relies on
@gen function regression_model(regX::Vector{Float64})
    slope = @trace(normal(110, 10), :slope)
    intercept = @trace(normal(50, 2), :intercept)
    for (i, x) in enumerate(regX)
        @trace(normal(slope * x + intercept, 1), "y-$i")
    end
end
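
It can help to read the generative model as a recipe for producing data. The following plain-Julia sketch performs one forward run of the same recipe without Gen. `sample_regression` is a hypothetical helper, and unlike the `@gen` function it records no trace.

```julia
using Random

# Hypothetical helper: one forward run of the model above, without Gen
function sample_regression(rng, xs)
    slope = 110 + 10 * randn(rng)          # prior: normal(110, 10)
    intercept = 50 + 2 * randn(rng)        # prior: normal(50, 2)
    # Observations: normal(slope * x + intercept, 1)
    ys = [slope * x + intercept + randn(rng) for x in xs]
    return (slope = slope, intercept = intercept, ys = ys)
end

rng = MersenneTwister(1)
draw = sample_regression(rng, [-1.0, 0.0, 1.0])
println(draw.slope)
println(draw.intercept)
println(draw.ys)
```

Each call draws a different line from the priors and then noisy points around it. Inference runs this recipe in reverse: given the points, which lines are plausible?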

Construct an inference program, which takes in the dataset and runs an inference routine on our model.
For this we rely on the standard inference library at the core of Gen. It gives the scientist a higher-level
abstraction, which makes implementing ideas and inference routines a lot easier and more accessible to
non-experts. The inference program estimates 'slope' and 'intercept' from the observed data.

In [None]:
function regression_inference_program(regX::Vector{Float64}, regY::Vector{Float64}, num_iters::Int)
    # Create a choicemap to constrain the set of possible y coordinates
    # to the observed y values
    constraints = choicemap()
    for (i, y) in enumerate(regY)
        constraints["y-$i"] = y
    end
    
    # Generate an initial execution trace in which the execution is constrained by
    # the pre-defined 'constraints'
    (trace, _) = generate(regression_model, (regX,), constraints)
    
    # Iteratively update the slope and the intercept using the
    # in-built metropolis_hastings sampler
    for iter=1:num_iters
        (trace, _) = metropolis_hastings(trace, select(:slope))
        (trace, _) = metropolis_hastings(trace, select(:intercept))
    end
    
    # Retrieve 'slope' and 'intercept' values from final trace
    choices = get_choices(trace)
    return (choices[:slope], choices[:intercept])
end
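
The update loop above can be mimicked in plain Julia to show the mechanics of a Metropolis-Hastings step. This is a sketch under stated assumptions, not Gen's implementation: `mh_sketch` and `loglik` are hypothetical helpers, the data are noise-free, and each proposal is drawn from the prior, which is assumed here to match what a selection-based update does for this model. With prior proposals, the acceptance ratio reduces to the likelihood ratio.

```julia
using Random

# Log-likelihood of the data under the line, with unit observation noise
# (additive constants are dropped because they cancel in the acceptance ratio)
loglik(slope, intercept, xs, ys) = -0.5 * sum(abs2, ys .- (slope .* xs .+ intercept))

# Hypothetical helper: Metropolis-Hastings with proposals drawn from the priors
function mh_sketch(rng, xs, ys, num_iters)
    slope = 110 + 10 * randn(rng)
    intercept = 50 + 2 * randn(rng)
    for _ in 1:num_iters
        # Propose a new slope from its prior; accept with the likelihood ratio
        new_slope = 110 + 10 * randn(rng)
        if log(rand(rng)) < loglik(new_slope, intercept, xs, ys) - loglik(slope, intercept, xs, ys)
            slope = new_slope
        end
        # Same update for the intercept
        new_intercept = 50 + 2 * randn(rng)
        if log(rand(rng)) < loglik(slope, new_intercept, xs, ys) - loglik(slope, intercept, xs, ys)
            intercept = new_intercept
        end
    end
    return slope, intercept
end

# Noise-free toy data (an assumption for illustration): y = 50 + 100x
toy_xs = collect(range(-2, 2; length=40))
toy_ys = 50 .+ 100 .* toy_xs

slope_est, intercept_est = mh_sketch(MersenneTwister(1), toy_xs, toy_ys, 5000)
println("slope: $slope_est, intercept: $intercept_est")
```

The returned pair is a single posterior sample, so it varies from run to run, but on this toy data it should land close to a slope of 100 and an intercept of 50.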

In [None]:
# Run the inference program
(slope, intercept) = regression_inference_program(regX, regY, 10000)
println("slope: $slope, intercept: $intercept")

Plot the regression line given by the final sample of the inference program:

In [None]:
# Plot the line found by the inference program
plot!((x) -> intercept + slope * x, -3, 3, label="fit_bayes")

## 5. Exercise

- Think of a way to rewrite the logistic regression in a Bayesian style, as done in section 4.
- Think about the model above:
   - What influence do the priors have?
   - Which probability distribution best represents the prior knowledge about the model?
   - Experiment with the probability distributions in Gen. The available distributions include:
       - Bernoulli,            `bernoulli(prob_true::Real)`
       - Beta,                `beta(alpha::Real, beta::Real)`
       - Categorical,         `categorical(probs::Vector{Float64})`
       - Exponential,         `exponential(rate::Real)`
       - Gamma,               `gamma(shape::Real, scale::Real)`
       - Geometric,           `geometric(p::Real)`
       - Inverse Gamma,       `inv_gamma(shape::Real, scale::Real)`
       - Multivariate Normal, `mvnormal(mu::AbstractVector{Float64}, cov::AbstractMatrix{Float64})`
       - Poisson,             `poisson(lambda::Real)`