In [36]:
# Setting up a custom stylesheet in IJulia
file = open("style.css") # A .css file in the same folder as this notebook file
style = read(file, String) # Read the file
HTML("$style") # Output as HTML

# Using Julia for introductory statistics

## Introduction

Julia is well suited to general purpose data analysis.  It has many built-in statistical functions and there are many packages that greatly extend the capabilities of Julia as a scientific programming language for statistics. In this section we will take a look at introductory statistics using Julia 1.0.  There are differences between this version of Julia and version 0.6.  Not all the packages that are available for version 0.6 and prior, are quite ready for version 1.0.  At the time of recording the packages that are used in this section do all compile and can be used.

In the first part of this section of the course, we will take a look at creating our own data for statistical analysis.  It is great to be able to generate simulated data, especially when you are just starting out and might not have access to proper datasets. When viewing a new dataset, it is alway good to start by describing it.  Human beings are not designed to look at large tables of data and understand what it is trying to tell us.  Using summarizing, or descriptive, statistics helps us to gain an insight into the data before we start to analyze it.

This section will also look at visualizing data.  It many cases, this allows for an even better understanding of the data.

The `HypothesisTests` and `GLM` packages allow us to do many common statistical tests and we will have a look at Student's _t_ test, linear regression models, and the $\chi^2$ test for independence.

We will conclude with a look at exporting our data in the form of a spreadsheet.  Let's start, though, by importing the packages that we will be using.

## Adding packages

If the packages that are listed below are not installed on your system, then do the following for each package, i.e. `PyPlot`.

```
using Pkg;
Pkg.add("PyPlot")
```

In [None]:
import Pkg

Pkg.add("Distributions")
Pkg.add("StatsBase")
Pkg.add("CSV")
Pkg.add("DataFrames")
Pkg.add("HypothesisTests")
#Pkg.add("StatsPlots")
Pkg.add("GLM")
Pkg.add("GR")

In [40]:
using Distributions    # Create random variables using common distributions
using StatsBase        # Basic statistical functions 
using CSV              # Reading and writing CSV files
using DataFrames       # Create and export data structure
using HypothesisTests  # Perform statistical tests
#using StatsPlots       # Statistical plotting
using GLM              # General linear models (linear regression)
using Plots

We recommend using GR backend instead of PyPlot backend.

In [41]:
gr()                   # Use GR Backend

Plots.GRBackend()

## Creating random variable

We mentioned in the introduction that the creation of simulated data is a great way to start learning how to use Julia for statistics.  In the code below, we create five variables with random data point values.

In [42]:
age = rand(18:80, 100);  # Uniform distribution
wcc = round.(rand(Distributions.Normal(12, 2), 100), digits = 1)  # Normal distribution & round to one decimal place
crp = round.(Int, rand(Distributions.Chisq(4), 100)) .* 10  # Chi-squared distribution with broadcasting & alternate round()
treatment = rand(["A", "B"], 100); # Uniformly weighted
result = rand(["Improved", "Static", "Worse"], 100);  # Uniformly weighted

## Descriptive statistics

While there are common statistical function in Julia such as `mean()` and `std()`, it is more convenient to use the `describe()` function from the `StatsBase` package.

In [None]:
# Mean of the age variable
mean(age)

In [None]:
# Median of age variable
median(age)

In [None]:
# Standard deviation of age
std(age)

In [None]:
# Variance of age
var(age)

In [None]:
# Mean of wcc
mean(wcc)

In [None]:
# Standard deviation of wcc
std(wcc)

In [65]:
# test 
foo = rand(Normal(80,10),100000000)
var(foo)

100.0198292874263

In [None]:
# Descriptive statistics of the age variable
StatsBase.describe(age)

In [None]:
# The summarystats() function omits the length and type
StatsBase.summarystats(wcc)

## Creating a dataframe

When creating simulated data, it is best to store it in a dataframe object for easier manipulation.

In [43]:
data = DataFrame(Age = age, WCC = wcc, CRP = crp, Treatment = treatment, Result = result);

In [None]:
# Number of rows and columns
size(data)

In [None]:
# First six rows. Note that head() method has been deprecated.
first(data, 6)

We can create dataframe objects by selecting only subjects according to their data point values for a particular variable. 

**Note that dataframe slicing using `data[data[:Treatment] .== "A", :]` has been deprecated from the video lecture. Use below slicing method.**

In [44]:
dataA = data[data[:, :Treatment] .== "A", :]   # Only patient in treatment group A
dataB = data[data[:, :Treatment] .== "B", :];  # Only patient in treatment group B

In [None]:
# Show first 5 rows from dataA
first(dataA, 5)

In [None]:
# Show last 5 rows from dataB
last(dataB, 5)

## Descriptive statistics using the dataframe object

The `describe()` function will attempt to provide descriptive statistics of the a data object.

In [None]:
describe(data)

We can count the number of the elements in the sample space of a categorical variable using the `combine()` function. 

In [None]:
# Define grouped data and create a data frame that spearates in A and B:
grouped_df = groupby(data, :Treatment);

In [30]:
# Counting the number of patients in groups A and B
combine(grouped_df, nrow => :N)

Unnamed: 0_level_0,Treatment,N
Unnamed: 0_level_1,String,Int64
1,A,44
2,B,56


In [31]:
# The size argument will give the same output other than adding the number of variables i.e. 5 columns
# size() returns tuple containing row and column numbers
combine(size, grouped_df)

Unnamed: 0_level_0,Treatment,x1
Unnamed: 0_level_1,String,Tuple…
1,A,"(44, 5)"
2,B,"(56, 5)"


The usual descriptive statistics of a numerical variable can be calculated after separation by a categorical variable.

In [32]:
# Mean age of groups A and B patients
combine(grouped_df, :Age => mean)

Unnamed: 0_level_0,Treatment,Age_mean
Unnamed: 0_level_1,String,Float64
1,A,44.25
2,B,47.8393


In [33]:
# Standard deviation of groups A and B patients
combine(grouped_df, :Age => std)

Unnamed: 0_level_0,Treatment,Age_std
Unnamed: 0_level_1,String,Float64
1,A,17.4664
2,B,18.1065


By using the `summarystats()` function we can get all the descriptive statistics.

In [34]:
combine(grouped_df, :Age => describe)

Summary Stats:
Length:         44
Missing Count:  0
Mean:           44.250000
Minimum:        18.000000
1st Quartile:   31.000000
Median:         42.000000
3rd Quartile:   56.250000
Maximum:        79.000000
Type:           Int64
Summary Stats:
Length:         56
Missing Count:  0
Mean:           47.839286
Minimum:        18.000000
1st Quartile:   29.750000
Median:         52.000000
3rd Quartile:   61.250000
Maximum:        80.000000
Type:           Int64


Unnamed: 0_level_0,Treatment,Age_describe
Unnamed: 0_level_1,String,Nothing
1,A,
2,B,


## Visualizing the data

The Plots package works well with the DataFrames package by allowing macro function from the latter.  In the code cell below, we look at the age distribution of the two treatment groups.

Note that `@df` macro (from StatsPlots) is used to pass the columns to the function.

In [35]:
@df data density(:Age, group = :Treatment, title = "Distribution of ages by treatment group",
    xlab = "Age", ylab = "Distribution",
    legend = :topright)

LoadError: LoadError: UndefVarError: @df not defined
in expression starting at In[35]:1

We can do the same for the results groups.

In [None]:
@df data density(:Age, group = :Result, title = "Distribution of ages by result group",
    xlab = "Age", ylab = "Distribution",
    legend = :topright)

We can even discriminate between all of the groups.

In [None]:
@df data density(:Age, group = (:Treatment, :Result), title = "Distribution of ages by treatment and result groups",
    xlab = "Age", ylab = "Distribution",
    legend = :topright)

Let's create a box-and-whisker plot of the white cell count per treatment group and then per result group.

In [None]:
@df data boxplot(:Treatment, :WCC, lab = "WCC", title = "White cell count by treatment group",
    xlab = "Groups", ylab = "WCC")

In [None]:
@df data boxplot(:Result, :WCC, lab = "WCC", title = "White cell count by result group",
    xlab = "Groups", ylab = "WCC")

Finally, we will check on the correlation between the numerical variables using a correlation plot and a corner plot.

In [None]:
@df data corrplot([:Age :WCC :CRP], grid = false)  # No comma's between arguments in list

In [None]:
@df data cornerplot([:Age :WCC :CRP], grid = false, compact = true)

## Inferential statistics

We will begin by using Student's _t_ test to compare the mean of a numerical variable between two groups. 

In [49]:
# Difference in age between patients in groups A and B
HypothesisTests.EqualVarianceTTest(dataA[:, :Age], dataB[:, :Age])

Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -1.08604
    95% confidence interval: (-8.245, 6.072)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7640

Details:
    number of observations:   [56,44]
    t-statistic:              -0.30106838138103614
    degrees of freedom:       98
    empirical standard error: 3.6072833555525734


In [50]:
# Only the p value for the difference in white cell count between patients in groups A and B
pvalue(EqualVarianceTTest(dataA[:, :WCC], dataB[:, :WCC]))

0.1566890596091755

In [51]:
# Difference in c-reactive protein level between patients in groups A and B for unequal variances
UnequalVarianceTTest(dataA[:, :CRP], dataB[:, :CRP])

Two sample t-test (unequal variance)
------------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -3.21429
    95% confidence interval: (-14.03, 7.598)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.5566

Details:
    number of observations:   [56,44]
    t-statistic:              -0.5899930320160106
    degrees of freedom:       97.03077710484648
    empirical standard error: 5.448006230348987


We can create a variety of linear models using the `GLM.fit()` function.

In [52]:
# Simple model predicting CRP
fit(LinearModel, @formula(CRP ~ 1), data) #mean of CRP will be the preductor

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

CRP ~ 1

Coefficients:
─────────────────────────────────────────────────────────────────────
             Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────
(Intercept)   40.7     2.74232  14.84    <1e-26    35.2586    46.1414
─────────────────────────────────────────────────────────────────────

In [53]:
# Adding Age as a predictor variable
fit(LinearModel, @formula(CRP ~ Age), data)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

CRP ~ 1 + Age

Coefficients:
─────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)  38.8218        7.98113  4.86    <1e-05  22.9835    54.6601
Age           0.0389595     0.15537  0.25    0.8025  -0.269368   0.347287
─────────────────────────────────────────────────────────────────────────

In [54]:
# Adding Age and WCC as predictor variables
fit(LinearModel, @formula(CRP ~ Age + WCC), data)

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

CRP ~ 1 + Age + WCC

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  47.1167      19.7515     2.39    0.0190   7.91556   86.3179
Age           0.0213542    0.160636   0.13    0.8945  -0.297464   0.340172
WCC          -0.614678     1.33776   -0.46    0.6469  -3.26976    2.04041
──────────────────────────────────────────────────────────────────────────

We can conduct a $\chi^2$ test for independence using the `HypothesisTests.ChisqTest()` function.  First we need to look at the counts.  Below we calculate the number of unique values for the Result variable sample space for patients in groups A and B.

Again note, that we use `combine()` instead of deprecated `by()` method.

In [55]:
combine(groupby(dataA, :Result), nrow => :N)

Unnamed: 0_level_0,Result,N
Unnamed: 0_level_1,String,Int64
1,Static,22
2,Worse,13
3,Improved,21


In [56]:
combine(groupby(dataB, :Result), nrow => :N)

Unnamed: 0_level_0,Result,N
Unnamed: 0_level_1,String,Int64
1,Static,18
2,Improved,18
3,Worse,8


In [58]:
# Enter the data in similar order here
observed = reshape([22, 13, 21, 18, 8, 1], (2, 3)) # follow the order of the first combine Static Worse Imporved
observed

2×3 Matrix{Int64}:
 22  21  8
 13  18  1

In [59]:
ChisqTest(observed)

Pearson's Chi-square Test
-------------------------
Population details:
    parameter of interest:   Multinomial Probabilities
    value under h_0:         [0.259109, 0.162578, 0.288721, 0.181158, 0.066628, 0.0418058]
    point estimate:          [0.26506, 0.156627, 0.253012, 0.216867, 0.0963855, 0.0120482]
    95% confidence interval: [(0.1687, 0.3835), (0.06024, 0.2751), (0.1566, 0.3715), (0.1205, 0.3353), (0.0, 0.2148), (0.0, 0.1305)]

Test summary:
    outcome with 95% confidence: fail to reject h_0
    one-sided p-value:           0.1465

Details:
    Sample size:        83
    statistic:          3.8414003208120864
    degrees of freedom: 2
    residuals:          [0.106519, -0.134473, -0.605451, 0.764344, 1.05029, -1.32592]
    std. residuals:     [0.225584, -0.225584, -1.33923, 1.33923, 1.79141, -1.79141]


## Exporting a CSV file

Finally we can export our dataframe object as a spreadsheet file.

In [None]:
CSV.write("ProjectData_1_point_0.csv", data);

-----