In [1]:
# Setting up a custom stylesheet in IJulia
# A .css file in the same folder as this notebook file
style = read("style.css", String) # Read the whole file as a String (no file handle left open)
HTML(style) # Output as HTML

# Using Julia for introductory statistics

## Introduction

Julia is well suited to general-purpose data analysis.  It has many built-in statistical functions, and many packages greatly extend its capabilities as a scientific programming language for statistics.

In this section we will take a look at introductory statistics using Julia 1.0.  There are differences between this version of Julia and version 0.6, and not all of the packages available for version 0.6 and earlier are ready for version 1.0.  At the time of recording, all of the packages used in this section compile and can be used.

In the first part of this section of the course, we will take a look at creating our own data for statistical analysis.  It is great to be able to generate simulated data, especially when you are just starting out and might not have access to proper datasets.

When viewing a new dataset, it is always good to start by describing it.  Human beings are not designed to look at large tables of data and understand what they are trying to tell us.  Summarizing, or descriptive, statistics help us gain insight into the data before we start to analyze it.

This section will also look at visualizing data.  In many cases, this allows for an even better understanding of the data.

The `HypothesisTests` and `GLM` packages allow us to do many common statistical tests and we will have a look at Student's _t_ test, linear regression models, and the $\chi^2$ test for independence.

We will conclude with a look at exporting our data in the form of a spreadsheet.  Let's start, though, by importing the packages that we will be using.

## Adding packages

If any of the packages listed below are not installed on your system, then add each one as follows, using `PyPlot` as an example.

```
using Pkg;
Pkg.add("PyPlot")
```

In [None]:
import Pkg

Pkg.add("Distributions")
Pkg.add("StatsBase")
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("HypothesisTests")
Pkg.add("StatsPlots")
Pkg.add("GLM")
Pkg.add("GR")

In [None]:
using Distributions    # Create random variables
using StatsBase        # Basic statistical support
using CSV              # Reading and writing CSV files
using DataFrames       # Create a data structure
using HypothesisTests  # Perform statistical tests
using StatsPlots       # Statistical plotting
using GLM              # Generalized linear models

We recommend using the GR backend instead of the PyPlot backend.

In [None]:
gr()                   # Use GR Backend

## Creating random variables

We mentioned in the introduction that the creation of simulated data is a great way to start learning how to use Julia for statistics.  In the code below, we create five variables with random data point values.

In [None]:
age = rand(18:80, 100);  # Uniform distribution
wcc = round.(rand(Distributions.Normal(12, 2), 100), digits = 1)  # Normal distribution & round to one decimal place
crp = round.(Int, rand(Distributions.Chisq(4), 100)) .* 10  # Chi-squared distribution with broadcasting & alternate round()
treatment = rand(["A", "B"], 100); # Uniformly weighted
result = rand(["Improved", "Static", "Worse"], 100);  # Uniformly weighted
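
Because the values above are drawn at random, every run of the notebook produces different data. A minimal sketch of how the draws could be made reproducible, assuming the standard-library `Random` module and an arbitrary seed value (`1234` and the names `age_first` and `age_second` are illustrative and not part of the original notebook):

```julia
using Random

Random.seed!(1234)             # Seed the default random number generator (arbitrary value)
age_first = rand(18:80, 100)   # 100 integer ages drawn uniformly from 18 to 80

Random.seed!(1234)             # Resetting the same seed...
age_second = rand(18:80, 100)  # ...repeats exactly the same draws

age_first == age_second        # The two vectors are identical
```

Setting a seed once at the top of a notebook is usually enough to get the same simulated dataset on each run with the same Julia version.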

## Descriptive statistics

While Julia provides common statistical functions such as `mean()` and `std()`, it is often more convenient to use the `describe()` function from the `StatsBase` package, which reports several summary statistics at once.

In [None]:
# Mean of the age variable
mean(age)

In [None]:
# Median of age variable
median(age)

In [None]:
# Standard deviation of age
std(age)

In [None]:
# Variance of age
var(age)

In [None]:
# Mean of wcc
mean(wcc)

In [None]:
# Standard deviation of wcc
std(wcc)

In [None]:
# Descriptive statistics of the age variable
StatsBase.describe(age)

In [None]:
# The summarystats() function omits the length and type
StatsBase.summarystats(wcc)
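
To make clear what `mean()`, `var()` and `std()` report, the sketch below recomputes them by hand on a small hypothetical sample (the vector `x` is made up for illustration). It assumes only the standard-library `Statistics` module, whose `var()` and `std()` use the sample (n - 1) denominator by default.

```julia
using Statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # Hypothetical sample

n = length(x)
m = sum(x) / n                     # Arithmetic mean
v = sum((x .- m) .^ 2) / (n - 1)   # Sample variance: squared deviations over n - 1
s = sqrt(v)                        # Standard deviation is the square root of the variance

# Each hand-computed value agrees with the library function
(m ≈ mean(x), v ≈ var(x), s ≈ std(x))
```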

## Creating a dataframe

When creating simulated data, it is best to store it in a dataframe object for easier manipulation.

In [None]:
data = DataFrame(Age = age, WCC = wcc, CRP = crp, Treatment = treatment, Result = result);

In [None]:
# Number of rows and columns
size(data)

In [None]:
# First six rows. Note that the head() method has been deprecated.
first(data, 6)

We can create dataframe objects by selecting only subjects according to their data point values for a particular variable. 

**Note that the slicing syntax `data[data[:Treatment] .== "A", :]` shown in the video lecture has been deprecated. Use the slicing method below.**

In [None]:
dataA = data[data[:, :Treatment] .== "A", :];  # Only patients in treatment group A
dataB = data[data[:, :Treatment] .== "B", :];  # Only patients in treatment group B

In [None]:
# Show first 5 rows from dataA
first(dataA, 5)

In [None]:
# Show last 5 rows from dataB
last(dataB, 5)
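
The row selection used for `dataA` and `dataB` relies on a Boolean mask: the broadcast comparison `.==` yields one `true` or `false` per row, and indexing with that mask keeps only the `true` rows. A minimal sketch on plain vectors, using made-up values rather than the notebook's data:

```julia
treatment = ["A", "B", "A", "B", "A"]  # Hypothetical group labels
age = [34, 51, 27, 68, 45]             # Hypothetical ages, one per label

mask = treatment .== "A"   # Element-wise comparison gives a Boolean vector
age_A = age[mask]          # Keep the positions where the mask is true
age_B = age[.!mask]        # Negate the mask to keep the remaining positions
```

The same idea carries over to a dataframe, where the mask goes in the row position and `:` selects all columns.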

## Descriptive statistics using the dataframe object

The `describe()` function will attempt to provide descriptive statistics for each column of a dataframe object.

In [None]:
describe(data)

We can count how often each element in the sample space of a categorical variable occurs by using the `combine()` function on a grouped dataframe.

**Note that the previously used `by()` method has been deprecated**. Use `combine()` with a grouped dataframe instead.

In [None]:
# Define grouped data
grouped_df = groupby(data, :Treatment);

In [None]:
# Counting the number of patients in groups A and B
combine(grouped_df, nrow => :N)

In [None]:
# Passing size() gives the same row counts, together with the number of variables (5 columns)
# size() returns a tuple containing the number of rows and columns
combine(size, grouped_df)
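
Conceptually, counting by group means tallying how often each label occurs. The sketch below does this by hand with a dictionary on a made-up label vector. It only illustrates the idea and is not how `DataFrames` implements `groupby()` and `combine()`.

```julia
treatment = ["A", "B", "A", "A", "B"]  # Hypothetical group labels

counts = Dict{String, Int}()
for t in treatment
    counts[t] = get(counts, t, 0) + 1  # Start at 0 for a new label, then add one
end

counts  # Maps each label to its number of occurrences
```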

The usual descriptive statistics of a numerical variable can be calculated after separation by a categorical variable.

In [None]:
# Mean age of groups A and B patients
combine(grouped_df, :Age => mean)

In [None]:
# Standard deviation of groups A and B patients
combine(grouped_df, :Age => std)

By passing the `describe()` function we can get all the descriptive statistics for each group.

In [None]:
combine(grouped_df, :Age => describe)

## Visualizing the data

The `StatsPlots` package works well with the `DataFrames` package: its `@df` macro passes the columns of a dataframe to a plotting function, so that columns can be referred to by symbols such as `:Age`.  In the code cell below, we look at the age distribution of the two treatment groups.

In [None]:
@df data density(:Age, group = :Treatment, title = "Distribution of ages by treatment group",
    xlab = "Age", ylab = "Distribution",
    legend = :topright)

We can do the same for the results groups.

In [None]:
@df data density(:Age, group = :Result, title = "Distribution of ages by result group",
    xlab = "Age", ylab = "Distribution",
    legend = :topright)

We can even discriminate between all of the groups.

In [None]:
@df data density(:Age, group = (:Treatment, :Result), title = "Distribution of ages by treatment and result groups",
    xlab = "Age", ylab = "Distribution",
    legend = :topright)

Let's create a box-and-whisker plot of the white cell count per treatment group and then per result group.

In [None]:
@df data boxplot(:Treatment, :WCC, lab = "WCC", title = "White cell count by treatment group",
    xlab = "Groups", ylab = "WCC")

In [None]:
@df data boxplot(:Result, :WCC, lab = "WCC", title = "White cell count by result group",
    xlab = "Groups", ylab = "WCC")

Finally, we will check on the correlation between the numerical variables using a correlation plot and a corner plot.

In [None]:
@df data corrplot([:Age :WCC :CRP], grid = false)  # No commas between the column names: this builds a 1×3 matrix of symbols

In [None]:
@df data cornerplot([:Age :WCC :CRP], grid = false, compact = true)
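
The plots show the correlations graphically. The coefficients themselves can also be computed directly. The sketch below assumes the standard-library `Statistics` module and uses three made-up variables (`x`, `y`, `z`) rather than the notebook's columns, with an arbitrary seed.

```julia
using Random, Statistics

Random.seed!(1)              # Arbitrary seed so the example is repeatable
x = randn(100)               # Hypothetical variable
y = 2 .* x .+ randn(100)     # Constructed to correlate positively with x
z = randn(100)               # Drawn independently of x and y

R = cor(hcat(x, y, z))       # 3×3 Pearson correlation matrix; each column is a variable
```

The diagonal of `R` is 1 (each variable with itself) and the matrix is symmetric, so only the entries above the diagonal carry new information.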

## Inferential statistics

We will begin by using Student's _t_ test to compare the mean of a numerical variable between two groups. 

In [None]:
# Difference in age between patients in groups A and B
HypothesisTests.EqualVarianceTTest(dataA[:, :Age], dataB[:, :Age])

In [None]:
# Only the p value for the difference in white cell count between patients in groups A and B
pvalue(EqualVarianceTTest(dataA[:, :WCC], dataB[:, :WCC]))

In [None]:
# Difference in C-reactive protein level between patients in groups A and B for unequal variances
UnequalVarianceTTest(dataA[:, :CRP], dataB[:, :CRP])
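
To show what the two tests compute, the sketch below works out the textbook _t_ statistics and degrees of freedom by hand for two small made-up samples, using only the standard-library `Statistics` module. The pooled version assumes equal variances, and the Welch version keeps the variances separate. All variable names are illustrative.

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # Hypothetical group 1
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # Hypothetical group 2
nx, ny = length(x), length(y)

# Equal-variance (Student) version: pool the two sample variances
sp2 = ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
t_pooled = (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
df_pooled = nx + ny - 2

# Unequal-variance (Welch) version: keep the variances separate
se2 = var(x) / nx + var(y) / ny
t_welch = (mean(x) - mean(y)) / sqrt(se2)
# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = se2^2 / ((var(x) / nx)^2 / (nx - 1) + (var(y) / ny)^2 / (ny - 1))
```

With equal group sizes the two statistics coincide, and only the degrees of freedom differ.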

We can create a variety of linear models using the `GLM.fit()` function.

In [None]:
# Simple model predicting CRP
fit(LinearModel, @formula(CRP ~ 1), data)

In [None]:
# Adding Age as a predictor variable
fit(LinearModel, @formula(CRP ~ Age), data)

In [None]:
# Adding Age and WCC as predictor variables
fit(LinearModel, @formula(CRP ~ Age + WCC), data)
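
A linear model with one predictor is an ordinary least-squares fit of an intercept and a slope. The sketch below solves that least-squares problem directly with the backslash operator on made-up data. It illustrates the underlying calculation and is not a description of how `GLM` performs the fit internally.

```julia
using LinearAlgebra

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # Hypothetical predictor
y = [3.1, 4.9, 7.2, 8.8, 11.0]   # Hypothetical response

X = hcat(ones(length(x)), x)     # Design matrix: a column of ones (intercept), then x
β = X \ y                        # Least-squares solution for a tall matrix
intercept, slope = β

residuals = y .- X * β           # Differences between observed and fitted values
```

Adding a second predictor, as in `CRP ~ Age + WCC`, corresponds to adding another column to the design matrix.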

We can conduct a $\chi^2$ test for independence using the `HypothesisTests.ChisqTest()` function.  First we need to look at the counts.  Below we count how many patients in groups A and B fall into each category of the Result variable.

Note again that we use `combine()` instead of the deprecated `by()` method.

In [None]:
combine(groupby(dataA, :Result), nrow => :N)

In [None]:
combine(groupby(dataB, :Result), nrow => :N)

In [None]:
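# Enter the counts from the two tables above (they change on every run, since the data are random)
# reshape() fills column by column, so each row is a treatment group and each column a result category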
observed = reshape([22, 17, 18, 18, 11, 14], (2, 3))
observed

In [None]:
ChisqTest(observed)
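
The Pearson $\chi^2$ statistic compares the observed counts with the counts expected under independence, which are the products of the row and column totals divided by the grand total. The sketch below computes it by hand for the same `observed` matrix, using only base Julia. The names `expected`, `chi2` and `dof` are illustrative.

```julia
observed = reshape([22, 17, 18, 18, 11, 14], (2, 3))  # Same counts as in the cell above

n = sum(observed)                        # Grand total
row_totals = sum(observed, dims = 2)     # 2×1 matrix of row sums
col_totals = sum(observed, dims = 1)     # 1×3 matrix of column sums

# Broadcasting a 2×1 against a 1×3 matrix gives the 2×3 table of expected counts
expected = row_totals .* col_totals ./ n

chi2 = sum((observed .- expected) .^ 2 ./ expected)          # Pearson statistic
dof = (size(observed, 1) - 1) * (size(observed, 2) - 1)      # (rows - 1) × (columns - 1)
```

The p value then follows from the upper tail of a $\chi^2$ distribution with `dof` degrees of freedom.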

## Exporting a CSV file

Finally, we can export our dataframe object as a CSV file, which can be opened in any spreadsheet program.

In [None]:
CSV.write("ProjectData_1_point_0.csv", data);

-----