## Statistics
A solid understanding of statistics helps us understand our data better and lets us quantify how much confidence to place in any conclusions we draw from it.

In [1]:
using Statistics
using StatsBase
using RDatasets
using Plots
using StatsPlots
using KernelDensity
using Distributions
using LinearAlgebra
using HypothesisTests
using PyCall
using MLBase



In this notebook, we will use eruption data from the Old Faithful geyser. The data contain the wait time between consecutive eruptions and the length of each eruption.
<img src="data/faithful.JPG" width="400">

Let's get the data first...

In [2]:
D = dataset("datasets","faithful")
@show names(D)
D


In [3]:
describe(D)


In [4]:
eruptions = D[!,:Eruptions]
scatter(eruptions,label="eruptions")
waittime = D[!,:Waiting]
scatter!(waittime,label="wait time")


### Statistics plots
As you can see, the scatter plot above doesn't tell us much about the data. Let's try some statistical plots.

In [6]:
boxplot(["eruption length"],eruptions,legend=false,size=(200,400),whisker_width=1,ylabel="time in minutes")


Statistical plots such as a box plot (and a violin plot, as we will see in notebook `12. Visualization`) can provide a much better understanding of the data. Here, we immediately see that the median eruption length is about 4 minutes.
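The numbers a box plot draws can also be computed directly. The sketch below uses only the `Statistics` standard library and a small hypothetical sample (not the real `faithful` data) to show where the box and its middle line come from.

```julia
using Statistics

# Hypothetical eruption lengths in minutes (not the real `faithful` data)
sample = [1.8, 2.0, 3.3, 3.6, 4.0, 4.3, 4.5, 4.7, 4.9]

# The box spans the first to the third quartile; the line inside is the median
q1, med, q3 = quantile(sample, [0.25, 0.5, 0.75])
iqr_value = q3 - q1   # interquartile range, i.e. the height of the box

println("min = ", minimum(sample), ", q1 = ", q1, ", median = ", med,
        ", q3 = ", q3, ", max = ", maximum(sample))
```

Running the same `quantile` call on `eruptions` should give the values the box plot above is built from.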

The next plot we will see is a histogram plot.

In [7]:
histogram(eruptions,label="eruptions")


You can adjust the number of bins manually or by passing one of the autobinning functions.

In [8]:
?histogram



In [9]:
histogram(eruptions,bins=:sqrt,label="eruptions")

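To get a feel for what an autobinning rule does, here is a minimal sketch of two common rules. The sample size and data range are assumed values chosen to resemble the eruption data, and the plotting library may round bin edges to nicer numbers, so the bins it actually draws can differ slightly.

```julia
n = 272             # assumed number of observations
lo, hi = 1.6, 5.1   # assumed data range in minutes

# Square-root rule: use about sqrt(n) bins
nbins_sqrt = ceil(Int, sqrt(n))
width_sqrt = (hi - lo) / nbins_sqrt

# Sturges' rule, another common choice: about log2(n) + 1 bins
nbins_sturges = ceil(Int, log2(n)) + 1

println("sqrt rule: ", nbins_sqrt, " bins of width ", round(width_sqrt, digits=3))
println("Sturges' rule: ", nbins_sturges, " bins")
```

Under these assumptions the square-root rule gives a bin width close to 0.2, which is the factor used when the kernel density curve is rescaled for the `bins=:sqrt` histogram.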

### 🔵Kernel density estimates
Next, we will see how we can fit a kernel density estimation function to our data. We will make use of the `KernelDensity.jl` package. 

In [10]:
p=kde(eruptions)


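If we want the histogram and the kernel density curve to line up, we need to remember that in a density-normalized histogram every point contributes `1/((nb of elements) * (bin width))` to the height of its bin. A histogram of raw counts is therefore taller than the density by a factor of `(nb of elements) * (bin width)`, and the density has to be multiplied by that factor before it is plotted on top. Read more about kernel density estimates on the Wikipedia page: https://en.wikipedia.org/wiki/Kernel_density_estimation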

In [11]:
histogram(eruptions,label="eruptions")
plot!(p.x,p.density .* length(eruptions), linewidth=3,color=2,label="kde fit") # nb of elements*bin width


In [12]:
histogram(eruptions,bins=:sqrt,label="eruptions")
plot!(p.x,p.density .* length(eruptions) .*0.2, linewidth=3,color=2,label="kde fit") # nb of elements*bin width

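The rescaling can be checked without any plotting. The sketch below is a hand-written Gaussian kernel density estimate, not the `KernelDensity.jl` implementation; the sample, the bandwidth `h`, and the bin width `w` are all hypothetical.

```julia
# Gaussian kernel density estimate at a single point x with bandwidth h
gauss_kde(x, data, h) =
    sum(exp(-0.5 * ((x - xi) / h)^2) for xi in data) / (length(data) * h * sqrt(2π))

data = [1.8, 2.0, 2.1, 3.9, 4.1, 4.3, 4.5, 4.6]   # hypothetical sample
h = 0.3                                            # hypothetical bandwidth
grid = range(0.0, stop=7.0, length=701)            # evaluation grid with step 0.01
density = [gauss_kde(x, data, h) for x in grid]

# A density integrates to 1 (approximated here by a Riemann sum)
area = sum(density) * step(grid)

# To overlay the curve on a count histogram with bin width w,
# multiply the density by (number of elements) * (bin width)
w = 0.5
count_curve = density .* (length(data) * w)

println("area under the density = ", round(area, digits=4))
```

After rescaling, the area under `count_curve` equals `length(data) * w`, which is also the total area of the bars in a count histogram with that bin width.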

Next, we will take a look at one probability distribution, the normal distribution, and verify that it generates a bell curve.

In [13]:
myrandomvector = randn(100_000)
histogram(myrandomvector)
p=kde(myrandomvector)
plot!(p.x,p.density .* length(myrandomvector) .*0.1, linewidth=3,color=2,label="kde fit") # nb of elements*bin width


### 🔵Probability distributions
Another way to generate the same plot is to use the `Distributions` package: choose the probability distribution you want, then draw random numbers from it. As an example, we will use `d = Normal()` below.

In [14]:
d = Normal()
myrandomvector = rand(d,100000)
histogram(myrandomvector)
p=kde(myrandomvector)
plot!(p.x,p.density .* length(myrandomvector) .*0.1, linewidth=3,color=2,label="kde fit") # nb of elements*bin width


In [15]:
b = Binomial(40) 
myrandomvector = rand(b,1000000)
histogram(myrandomvector)
p=kde(myrandomvector)
plot!(p.x,p.density .* length(myrandomvector) .*0.5,color=2,label="kde fit") # nb of elements*bin width

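Drawing from a distribution can be demystified with a small sketch that uses only the standard library. It assumes that `Binomial(40)` means 40 trials with success probability 0.5, and the helper `draw_binomial` is a hypothetical name introduced here, not part of `Distributions.jl`.

```julia
using Random, Statistics

rng = MersenneTwister(1234)   # fixed seed so the run is repeatable

# One Binomial(n, p) draw is the number of successes in n independent coin flips
draw_binomial(rng, n, p) = count(_ -> rand(rng) < p, 1:n)

n, p = 40, 0.5
draws = [draw_binomial(rng, n, p) for _ in 1:100_000]

println("sample mean = ", mean(draws), " (theory: ", n * p, ")")
println("sample variance = ", var(draws), " (theory: ", n * p * (1 - p), ")")
```

The sample mean and variance should land close to the theoretical values `n * p` and `n * p * (1 - p)`, which is a quick sanity check for any sampler.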

Next, we will fit a distribution to a given set of numbers.

In [16]:
x = rand(1000)
d = fit(Normal, x)
myrandomvector = rand(d,1000)
histogram(myrandomvector,nbins=20,fillalpha=0.3,label="fit")
histogram!(x,nbins=20,linecolor = :red,fillalpha=0.3,label="myvector")


In [17]:
x = eruptions
d = fit(Normal, x)
myrandomvector = rand(d,1000)
histogram(myrandomvector,nbins=20,fillalpha=0.3)
histogram!(x,nbins=20,linecolor = :red,fillalpha=0.3)

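For a normal distribution the fit has a closed form. Assuming `fit(Normal, x)` returns the maximum-likelihood estimate (our understanding of `Distributions.jl`, worth checking against its documentation), the two parameters can be reproduced with the `Statistics` standard library. The sample below is hypothetical.

```julia
using Statistics

x = [1.8, 2.0, 3.3, 3.6, 4.0, 4.3, 4.5, 4.7]   # hypothetical sample

# Maximum-likelihood estimates for a normal distribution
mu_hat = mean(x)
sigma_hat = std(x, corrected=false, mean=mu_hat)   # divides by n, not n - 1

println("mu = ", mu_hat, ", sigma = ", sigma_hat)
```

Note that a fit always returns parameters, even when the data are clearly not normal, as with the two-peaked eruption lengths; comparing the histograms is what reveals the poor match.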

### 🔵Hypothesis testing
Next, we will perform hypothesis testing using the `HypothesisTests.jl` package.

In [18]:
?OneSampleTTest



In [19]:
myrandomvector = randn(1000)
OneSampleTTest(myrandomvector)


In [20]:
OneSampleTTest(eruptions)

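The statistic behind a one-sample t-test is short enough to write by hand. This is a sketch with a hypothetical helper `tstat` and a hypothetical sample; it assumes the null hypothesis is a population mean of `mu0`, with `0` as the default, and it stops at the statistic because the p-value needs the t distribution's CDF.

```julia
using Statistics

# One-sample t statistic for the null hypothesis "the population mean is mu0"
function tstat(x, mu0=0.0)
    standard_error = std(x) / sqrt(length(x))   # std uses the n - 1 correction
    return (mean(x) - mu0) / standard_error
end

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical sample
t = tstat(x)                     # compared against a t distribution
dof = length(x) - 1              # with n - 1 degrees of freedom

println("t = ", t, " with ", dof, " degrees of freedom")
```

With a null mean of 0, a sample of strictly positive values such as eruption lengths gives a large statistic, so rejecting the null is expected there; a different `mu0` makes for a more informative test.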

A note about p-values: we currently take the p-values of the Spearman and Pearson correlations from Python (`scipy.stats`). You can also implement your own by following the formula here:
https://stackoverflow.com/questions/53345724/how-to-use-julia-to-compute-the-pearson-correlation-coefficient-with-p-value

Hint: Sometimes Python and Julia do not communicate as desired. One problem that might come up, `Cannot load libmkl_intel_thread.dylib`, can be solved by:
```
using Conda
Conda.rm("mkl")
Conda.add("nomkl")
```

In [21]:
scipy_stats = pyimport("scipy.stats")
@show scipy_stats.spearmanr(eruptions,waittime)
@show scipy_stats.pearsonr(eruptions,waittime)


In [22]:
scipy_stats.pearsonr(eruptions,waittime)


In [23]:
corspearman(eruptions,waittime)


In [24]:
cor(eruptions,waittime)

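Both correlation coefficients can be written from their definitions, which also shows where a p-value would come from. The sketch below uses hypothetical data and hypothetical helper names (`pearson`, `ranks`, `spearman`), and it assumes there are no tied values; ties would need averaged ranks.

```julia
using Statistics

# Pearson correlation computed from its definition
function pearson(x, y)
    dx = x .- mean(x)
    dy = y .- mean(y)
    return sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2))
end

# Ranks for data without ties (tied values would need averaged ranks)
ranks(v) = invperm(sortperm(v))

# Spearman correlation is the Pearson correlation of the ranks
spearman(x, y) = pearson(ranks(x), ranks(y))

# Hypothetical eruption lengths and wait times, not the real data
x = [1.8, 3.3, 3.6, 2.3, 4.5, 2.9]
y = [54.0, 74.0, 79.0, 62.0, 85.0, 55.0]

r = pearson(x, y)
rho = spearman(x, y)

# Under the null hypothesis of no correlation, this statistic follows a
# t distribution with n - 2 degrees of freedom; the p-value comes from its tails
n = length(x)
t = r * sqrt((n - 2) / (1 - r^2))

println("pearson = ", r, ", spearman = ", rho, ", t = ", t)
```

Turning `t` into a p-value requires the CDF of the t distribution, which is not in the standard library; a package such as `Distributions` can supply it.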

In [25]:
scatter(eruptions,waittime,xlabel="eruption length",
    ylabel="wait time between eruptions",legend=false,grid=false,size=(400,300))


Interesting! This means that the next time you visit Yellowstone National Park to see the Old Faithful geyser and have to wait a long time for it to go off, you will likely get a longer eruption!

### 🔵AUC and Confusion matrix
Finally, we will cover basic evaluation tools you will need, such as the confusion matrix and ROC-based metrics (the building blocks of AUC scores). We use the `MLBase` package for that.

In [26]:
gt = [1, 1, 1, 1, 1, 1, 1, 2]
pred = [1, 1, 2, 2, 1, 1, 1, 1]
C = confusmat(2, gt, pred)   # compute confusion matrix
C ./ sum(C, dims=2)   # normalize per class
sum(diag(C)) / length(gt)  # compute correct rate from confusion matrix
correctrate(gt, pred)
C = confusmat(2, gt, pred)   

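A confusion matrix is just a table of counts, so it can be built by hand for the same labels. This sketch assumes the convention that rows index the true class and columns the predicted class.

```julia
using LinearAlgebra

gt   = [1, 1, 1, 1, 1, 1, 1, 2]
pred = [1, 1, 2, 2, 1, 1, 1, 1]

# C[i, j] counts samples whose true class is i and predicted class is j
k = 2
C = zeros(Int, k, k)
for (g, p) in zip(gt, pred)
    C[g, p] += 1
end

accuracy = sum(diag(C)) / length(gt)   # fraction of correct predictions

println(C)
println("accuracy = ", accuracy)
```

The diagonal holds the correct predictions, so everything off the diagonal is an error of one specific kind.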

In [27]:
gt = [1, 1, 1, 1, 1, 1, 1, 0];
pred = [1, 1, 0, 0, 1, 1, 1, 1]
ROC = MLBase.roc(gt,pred)
recall(ROC)
precision(ROC)

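Recall and precision follow directly from three counts. The sketch below recomputes them by hand for the same labels, treating `1` as the positive class and `0` as the negative class; the variable names are our own.

```julia
gt   = [1, 1, 1, 1, 1, 1, 1, 0]
pred = [1, 1, 0, 0, 1, 1, 1, 1]

tp = count((gt .== 1) .& (pred .== 1))   # true positives
fp = count((gt .== 0) .& (pred .== 1))   # false positives
fn = count((gt .== 1) .& (pred .== 0))   # false negatives

recall_value    = tp / (tp + fn)   # share of actual positives that were found
precision_value = tp / (tp + fp)   # share of predicted positives that are correct

println("recall = ", recall_value, ", precision = ", precision_value)
```

An AUC score needs real-valued prediction scores rather than hard 0/1 predictions, because it summarizes these counts across every possible threshold.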

# Finally...
After finishing this notebook, you should be able to:
- [ ] generate statistics plots such as box plot, histogram, and kernel densities
- [ ] generate distributions in Julia, and draw random numbers accordingly
- [ ] fit a given set of numbers to a distribution
- [ ] compute basic evaluation metrics such as AUC and confusion matrix
- [ ] run hypothesis testing
- [ ] compute correlations and p-values