# Statistics in Julia

This notebook demonstrates basic statistics and machine learning in Julia: descriptive statistics, linear regression, and a random forest classifier.

In [None]:
using DataFrames, CSV, StatsBase, StatsModels, GLM, MLJ, Dates

In [None]:
sensors = CSV.read("data/bay_area_freeways.csv", DataFrame);

## Descriptive statistics

StatsBase provides most common and many uncommon descriptive statistical measures (Kullback-Leibler divergence, anyone?). For instance, the convenience function `describe()` summarizes a column (mean, extrema, and quartiles), `mean()` returns just the mean, and `cor()` computes the Pearson correlation between two columns.

In [None]:
describe(sensors.avg_speed_mph)

In [None]:
mean(sensors.avg_speed_mph)

In [None]:
cor(sensors.avg_speed_mph, sensors.avg_occ)
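
To see what these functions return without the sensor file, here is a minimal sketch on made-up numbers. The vectors `speeds`, `occupancy`, `p`, and `q` are illustrative assumptions, not values from the dataset.

```julia
using Statistics, StatsBase

# Made-up readings, for illustration only
speeds = [62.0, 58.5, 47.0, 65.0, 30.5]
occupancy = [0.05, 0.08, 0.15, 0.04, 0.30]

avg = mean(speeds)                              # arithmetic mean
quartiles = quantile(speeds, [0.25, 0.5, 0.75]) # 1st quartile, median, 3rd quartile
r = cor(speeds, occupancy)                      # Pearson correlation; negative here

# One of the less common measures: Kullback-Leibler divergence
# between two discrete probability distributions
p = [0.5, 0.5]
q = [0.9, 0.1]
kl = kldivergence(p, q)
```

In this toy data speed falls as occupancy rises, so the correlation comes out negative.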

## Combining descriptive statistics with groupby

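We can also compute descriptive statistics by group using the split-apply-combine pattern: `groupby` splits the data frame into groups, and `combine` applies a function to each group and assembles the results.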

In [None]:
combine(groupby(sensors, [:freeway_number, :direction]), :avg_speed_mph => mean)
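
`combine` can compute several statistics per group in one call and give the result columns explicit names. The sketch below uses a small made-up table; the output column names `mean_speed`, `sd_speed`, and `n` are arbitrary choices for this example.

```julia
using DataFrames, Statistics

# Made-up table with the same column names as the sensor data
toy = DataFrame(
    freeway_number = [101, 101, 101, 280, 280],
    direction = ["N", "N", "S", "N", "N"],
    avg_speed_mph = [60.0, 50.0, 65.0, 70.0, 66.0],
)

by_group = combine(
    groupby(toy, [:freeway_number, :direction]),
    :avg_speed_mph => mean => :mean_speed,  # source column => function => new name
    :avg_speed_mph => std => :sd_speed,     # NaN for groups with a single row
    nrow => :n,                             # number of rows in each group
)
```

Reporting the group size alongside the mean makes it easier to spot groups whose statistics rest on very few observations.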

## Fitting a linear regression

Before we can fit a linear regression, we need to join the metadata to the sensor data, so we have enough covariates. Then, we'll use the `lm` function from [GLM.jl](https://juliastats.org/GLM.jl/stable/) to estimate a linear regression.

In [None]:
meta = CSV.read("data/sensor_meta.csv", DataFrame)
sensors = leftjoin(sensors, meta, on=:station=>:ID)
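
A left join keeps every row of the left table and fills in `missing` where the right table has no match. As a result, the joined columns get an element type that allows `missing`. A minimal sketch with made-up tables (`readings` and `stations` are hypothetical names):

```julia
using DataFrames

# Made-up tables; station 3 has no metadata row
readings = DataFrame(station = [1, 2, 3], avg_speed_mph = [61.0, 48.5, 57.0])
stations = DataFrame(ID = [1, 2], Lanes = [4, 3])

# on = :station => :ID matches readings.station against stations.ID
joined = leftjoin(readings, stations, on = :station => :ID)

lanes_type = eltype(joined.Lanes)            # a Union that includes Missing
n_unmatched = count(ismissing, joined.Lanes) # rows without metadata
```

Counting the unmatched rows right after a join is a quick way to check whether the key columns line up as expected.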

In [None]:
# also create an hour-of-day and day-of-week variable
sensors.hour = Dates.hour.(sensors.timestamp)
sensors.day_of_week = Dates.dayname.(sensors.timestamp);
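
The date helpers used above can be checked on a single value. This sketch assumes the `timestamp` column was parsed as `DateTime`; the timestamp itself is made up.

```julia
using Dates

ts = DateTime(2021, 3, 1, 17, 30)  # made-up timestamp: 1 March 2021, 17:30

h = Dates.hour(ts)        # hour of day as an integer, 0 to 23
name = Dates.dayname(ts)  # English day name as a String

# The dot syntax broadcasts the same function over a whole column
hours = Dates.hour.([ts, ts + Hour(3)])
```

Because `dayname` returns strings, the regression below treats `day_of_week` as a categorical variable.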

In [None]:
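# Dummy-code day_of_week (reference level: Monday) and treat hour as categorical rather than as a linear trend
model = lm(@formula(avg_speed_mph ~ Lanes + day_of_week + hour), sensors, contrasts=Dict(:day_of_week=>DummyCoding(base="Monday"), :hour=>DummyCoding()))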

### Model fit statistics

The table printed above shows coefficients and their statistics, but no information about the model fit. We can examine the $R^2$, the number of observations, and so on with functions that take the model object as their argument.

In [None]:
r2(model)

In [None]:
nobs(model)
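
Other accessors follow the same pattern. The sketch below fits a one-variable regression on made-up data (`toy` and `toy_model` are hypothetical names) and queries a few of them.

```julia
using DataFrames, GLM

# Made-up data lying close to the line y = 2x
toy = DataFrame(x = [1.0, 2.0, 3.0, 4.0, 5.0], y = [2.1, 3.9, 6.2, 7.8, 10.1])
toy_model = lm(@formula(y ~ x), toy)

estimates = coef(toy_model)     # [intercept, slope]
se = stderror(toy_model)        # standard errors, in the same order
fit = r2(toy_model)             # R²
fit_adj = adjr2(toy_model)      # adjusted R², penalizes extra predictors
n = nobs(toy_model)             # number of observations used in the fit
```

Adjusted $R^2$ is the more useful figure when comparing models with different numbers of predictors, such as the hour-of-day dummies above.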

## Machine learning

Sometimes a linear regression might not be the right tool for the job. [MLJ.jl](https://github.com/alan-turing-institute/MLJ.jl) provides a common interface to many different machine learning models available in Julia. We're going to declare a sensor "congested" if its average speed is less than 55 mph, and build a random forest classifier to predict whether a sensor is congested.

In [None]:
# first, we will load the MLJ RandomForestClassifier from the DecisionTree package
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree

### Creating our independent and dependent variables

As before, we'll use Lanes, day of week, and hour to predict congestion. MLJ requires a data frame with the independent variables and a vector with the dependent variable. Since random forests split numeric features at thresholds, we create a new numeric day-of-week variable to preserve the ordering of the days.

In [None]:
sensors.day_of_week_number = Dates.dayofweek.(sensors.timestamp)
X = sensors[!, [:Lanes, :hour, :day_of_week_number]]
# Sometimes a column will have a data type that is Union{Int64, Missing} (i.e. integers with missing values),
# even if there are no missing values. Most MLJ models will not accept these columns. disallowmissing! will
# change the types of these columns to just Int64 without missings, and throw an error if there actually were
# any missing values.
disallowmissing!(X)

# We also need to tell MLJ that the target is a categorical (binary) variable, not a numeric one.
y = coerce(sensors.avg_speed_mph .< 55, Binary);
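
The effect of `disallowmissing!` described in the comments above can be seen on a small made-up column. The sketch also shows `dropmissing`, one possible way to handle a column that really does contain missing values; whether dropping rows is appropriate depends on the data.

```julia
using DataFrames

# A column whose type allows missing values but that contains none
df = DataFrame(Lanes = Union{Int, Missing}[4, 3, 5])
before = eltype(df.Lanes)  # Union{Missing, Int64} on a 64-bit system

disallowmissing!(df)       # narrows the column type in place
after = eltype(df.Lanes)   # Int64 on a 64-bit system

# A column that actually has a missing value: disallowmissing! would throw here,
# so this example drops the incomplete rows instead
with_gap = DataFrame(Lanes = [4, missing, 5])
cleaned = dropmissing(with_gap, :Lanes)
```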


### Preventing overfitting

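As with any machine learning model, we want to evaluate our estimated model on data that was not used for training, so that overfitting shows up as poor out-of-sample performance. MLJ provides the `evaluate` function, which supports several resampling strategies for estimating out-of-sample error, such as the 5-fold cross-validation used here.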

In [None]:
evaluate(RandomForestClassifier(), X, y; resampling=CV(nfolds=5), measure=auc)

In [None]:
?RandomForestClassifier

In [None]:
evaluate(RandomForestClassifier(max_depth=10, n_trees=20), X, y; resampling=CV(nfolds=5), measure=auc)
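
Cross-validation is only one of the resampling strategies `evaluate` accepts. The sketch below swaps in a single holdout split and a stratified cross-validation on synthetic stand-in data. The data (`X_toy`, `y_toy`) and the rule that generates the labels are made up for illustration, and the block assumes the MLJ interface package for DecisionTree is installed.

```julia
using MLJ, DataFrames, Random

# Load the model type as in the notebook
RandomForestClassifier = @load RandomForestClassifier pkg=DecisionTree verbosity=0

# Synthetic stand-in data: "congested" during hours 16 to 18 (made-up rule)
rng = MersenneTwister(42)
n = 300
X_toy = DataFrame(Lanes = rand(rng, 2:5, n), hour = rand(rng, 0:23, n))
y_toy = coerce((X_toy.hour .>= 16) .& (X_toy.hour .<= 18), OrderedFactor)

# Single 70/30 train/test split
holdout = evaluate(RandomForestClassifier(), X_toy, y_toy;
                   resampling=Holdout(fraction_train=0.7, shuffle=true, rng=1),
                   measure=auc, verbosity=0)

# 5-fold cross-validation that keeps the class balance similar in every fold
stratified = evaluate(RandomForestClassifier(), X_toy, y_toy;
                      resampling=StratifiedCV(nfolds=5, shuffle=true, rng=1),
                      measure=auc, verbosity=0)

holdout_auc = holdout.measurement[1]
stratified_auc = stratified.measurement[1]
```

Stratification tends to matter most when one class is rare, because an unstratified fold can end up with very few positive cases.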

In [None]:
mean(sensors.avg_speed_mph .< 55)
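
Taking the mean of a Boolean vector gives the share of `true` values, so the cell above reports the fraction of observations that are congested. A minimal sketch on made-up speeds shows how that base rate translates into the accuracy of always predicting the majority class:

```julia
using Statistics

speeds = [62.0, 48.0, 51.5, 70.0, 66.0]  # made-up values
congested = speeds .< 55                 # Boolean vector

base_rate = mean(congested)              # share of congested observations
# Accuracy of a trivial classifier that always predicts the more common class
majority_accuracy = max(base_rate, 1 - base_rate)
```

This baseline is a reference point for accuracy-style measures. AUC is interpreted differently, with 0.5 corresponding to a classifier that ranks observations at random.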