# Vectors and _N_-Dimensional Arrays

In [None]:
v = [4, 5, 6]

In [None]:
u = [1.2, 4.5, 6.5]

In [None]:
w = ["dog", "cat", "bird"]

### Grow Vector in Place

In [None]:
a = Int[]                       # initialize empty vector to be filled with Ints

In [None]:
push!(a, 12)

In [None]:
push!(a, 1000)

In [None]:
push!(a, 3.14)                  # ERROR (InexactError): 3.14 cannot be converted to an Int

In [None]:
s = String[]                    # initialize empty vector of strings

In [None]:
push!(s, "this")
push!(s, "is")
push!(s, "a string vector")
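
The element type fixed when a vector is created governs what `push!` will accept. The following is a minimal sketch, assuming a Julia 1.x session: a value is converted to the vector's element type when the conversion is exact and rejected otherwise. The variable `err` is introduced here only for illustration.

```julia
a = Int[]                 # empty vector that can only hold Ints
push!(a, 12)
push!(a, 3.0)             # 3.0 converts exactly to the Int 3, so this succeeds

# 3.14 has no exact Int representation, so push! throws an InexactError
err = try
    push!(a, 3.14)
    nothing
catch e
    e
end

println(a)
println(typeof(err))
```

The failed `push!` leaves `a` unchanged, because the conversion is attempted before the vector grows.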

### Indexing a Vector

In [None]:
s[2]                            # get second element

In [None]:
s[3]                            # third element

In [None]:
s[1] = "that"                   # assign to first element

In [None]:
s

### Slicing a Vector

In [None]:
s[1:2]                          # get subset of vector (1st and 2nd elements)

In [None]:
s[2:end]                        # subset from 2nd element to last element
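
A short sketch of the slicing forms, assuming Julia 1.x. A range selects a contiguous run of elements, a vector of indices selects arbitrary positions, and a slice is a copy of the data rather than a window into it. The variable names are illustrative.

```julia
s = ["that", "is", "a string vector"]

first_two   = s[1:2]      # range 1:2 selects elements 1 and 2
from_second = s[2:end]    # `end` stands for the last index
first_third = s[[1, 3]]   # a vector of indices selects elements 1 and 3

first_two[1] = "changed"  # slices are copies, so `s` is unaffected
println(s[1])
```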

## _N_-Dimensional Arrays

In [None]:
A = Array{Float64}(5, 3)        # allocate 5-by-3 matrix (uninitialized: entries are arbitrary, not guaranteed zero)
                                # Julia 1.x spells this Array{Float64}(undef, 5, 3)

In [None]:
Z = zeros(5, 3)                 # 5-by-3 matrix with every entry set to 0.0
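
The difference between allocating and zero-filling can be sketched as follows, assuming Julia 1.x syntax (where the uninitialized constructor takes an explicit `undef` argument):

```julia
A = Array{Float64}(undef, 5, 3)   # memory is allocated, contents are whatever was there
Z = zeros(5, 3)                   # every entry is exactly 0.0

fill!(A, 0.0)                     # explicitly overwrite A before relying on its values
println(size(A))
println(A == Z)
```

An uninitialized array is slightly cheaper to create, which is only worth it when every entry will be overwritten anyway.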

In [None]:
B = [1 2 3;
     4 5 6;
     7 8 9]

### Indexing and Slicing Matrix (or _N_-dimensional array)

In [None]:
B[3, 1]                         # gets entry in third row first column

In [None]:
B[2, :]                         # gets all of second row

In [None]:
B[:, 3]                         # gets all of third column
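
The same indexing rules extend to blocks and to single-index access. This is a small sketch assuming Julia 1.x, where a row slice such as `B[2, :]` comes back as a one-dimensional vector. The names `row2`, `col3`, `block`, and `fourth` are illustrative.

```julia
B = [1 2 3;
     4 5 6;
     7 8 9]

row2  = B[2, :]       # second row, returned as a 1-dimensional Vector
col3  = B[:, 3]       # third column
block = B[1:2, 2:3]   # rows 1-2 of columns 2-3, a 2-by-2 Matrix

# arrays are stored column by column, so a single index walks down the columns
fourth = B[4]
println(fourth)
```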

## Reading Data from Flat File

In [None]:
d = readdlm("somedata.csv", ',')

In [None]:
d1 = readdlm("otherdata.tsv", '\t')

In [None]:
d2 = readcsv("somedata.csv")                                 # equivalent to readdlm() with ','

In [None]:
d3 = readcsv("somedata.csv", header = true)                  # treat first line as column headings

In [None]:
typeof(d3)

In [None]:
d3[1]

In [None]:
d3[2]

In [None]:
d4 = readcsv("somedata.csv", header = true, skipstart = 3)   # skip first 3 lines 

_Note_: The `readdlm()` family of functions accepts a variety of other optional arguments. To see them, type `?readdlm` at the Julia prompt or (even better) read the documentation online.
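
To make the `header = true` behavior concrete, here is a self-contained sketch. It assumes Julia 1.x, where `readdlm` lives in the `DelimitedFiles` standard library, and it writes a tiny made-up CSV to a temporary file rather than relying on `somedata.csv`.

```julia
using DelimitedFiles   # standard-library home of readdlm in Julia 1.x

# write a tiny CSV to a temporary file (contents are made up for illustration)
path, io = mktemp()
write(io, "age,score\n31,4.5\n27,3.9\n")
close(io)

# with header = true the result is a tuple: (data matrix, header row)
data, header = readdlm(path, ',', header = true)
println(header)
println(data)
rm(path)
```

This is why the cells above index the result with `[1]` (the data) and `[2]` (the column headings).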

---

---

# DataFrame Objects

### Advantages of `DataFrame` objects
- _de facto_ object for data analysis
- NA type for missing data
- factor-like type for model fitting 

In [None]:
using DataFrames

In [None]:
df1 = readtable("somedata.csv")

In [None]:
df2 = readtable("somedata.csv", makefactors = true)

In [None]:
df3 = readtable("otherdata.tsv", makefactors = true)

In [None]:
df4 = readtable("otherdata.tsv", nastrings = ["999"])

In [None]:
df5 = readtable("somedata.csv", nastrings = ["", "NA", "999"])

In [None]:
df1[:age]                     # entire `age` column

In [None]:
df1[3, :condition]            # third participant's `condition`

In [None]:
df1[4, :]                     # all of row 4

---

## Challenge Question 1

Using the `stagec` dataset from the rpart package in R, find the Gleason score of the oldest patient.

In [None]:
using RDatasets  

stagec = dataset("rpart", "stagec")       # might need to be run twice

---

# Descriptive Statistics
Useful Packages:
- StatsBase

In [None]:
x = randn(100)                # generate 100 draws from standard Gaussian

In [None]:
using StatsBase               # provides mode()

@show mean(x)
@show median(x)
@show mode(x)
@show std(x)
@show var(x)
@show minimum(x)
@show maximum(x);
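
The definitions behind these summaries are easiest to check on a small sample with known answers. The sketch below assumes Julia 1.x, where `mean`, `median`, `std`, and `var` live in the `Statistics` standard library; the data are made up.

```julia
using Statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # small made-up sample

m  = mean(x)
md = median(x)                   # average of the two middle values for even n
v  = var(x)                      # sample variance: divides by n - 1
vp = var(x, corrected = false)   # population variance: divides by n
s  = std(x)                      # square root of the sample variance
println((m, md, v, vp, s))
```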

### Descriptives with `DataFrame`

In [None]:
describe(stagec)                     # get many useful descriptive stats



In [None]:
@show mean(stagec[:Age])             # get mean of Age column
@show mean(stagec[:Gleason])         # returns NA (not what we want)
@show mean(dropna(stagec[:G2]))      # dropna() removes NA values
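
In Julia 1.x and current DataFrames releases, `NA` has been replaced by `missing`, and `skipmissing` plays the role of `dropna()`. The propagation behavior is the same, as this sketch with made-up values shows:

```julia
using Statistics

g2 = [10.5, missing, 14.2, 12.3]   # made-up values with one missing entry

m_all  = mean(g2)                  # missing propagates through the computation
m_skip = mean(skipmissing(g2))     # skip the missing entry, then average
println(m_all)
println(m_skip)
```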

---

---

# Inferential Statistics

Useful Packages:
- HypothesisTests (Binomial test, _t_-tests, $\chi^2$-test, and many more...)
- GLM (linear and generalized linear models)
- MixedModels (Multi-level (or Mixed-effects) models)

## Binomial Test

The binomial test is a statistical test of whether dichotomous data (e.g., "success" and "failure") deviate from an expected distribution. Binomial tests can be used to answer questions of this form: _If the true probability of success is $P$, how probable are data at least as extreme as those we have observed?_

In [None]:
# Coin Tossing Example:
# Simulate data from Binomial, test 
# hypothesis data came from fair coin

using HypothesisTests
using Distributions

binom = Binomial(1, 0.6)                        # initialize Distribution object Binom(1, 0.6) 

srand(137)                                      # set seed for reproducibility
x = rand(binom, 10)                             # take 10 random draws from our distribution

xbool = convert(Array{Bool,1}, x)               # cast x to vector of Booleans
BinomialTest(xbool, 0.5)                        # test null hypothesis that p = 0.5
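
For intuition, an exact p-value can be computed by hand from the binomial probability mass function. The sketch below uses one common definition of the two-sided p-value (the total probability of all outcomes no more likely than the observed one); `BinomialTest` may use a different convention, and the helper names `binom_pmf` and `binom_pvalue` are hypothetical.

```julia
# probability of exactly k successes in n trials with success probability p
binom_pmf(k, n, p) = binomial(n, k) * p^k * (1 - p)^(n - k)

# two-sided exact p-value: total probability of outcomes no more likely than k
function binom_pvalue(k, n, p)
    pk = binom_pmf(k, n, p)
    tol = 1 + 1e-9      # guards against floating-point ties
    return sum(binom_pmf(i, n, p) for i in 0:n if binom_pmf(i, n, p) <= pk * tol)
end

pval = binom_pvalue(8, 10, 0.5)   # 8 heads in 10 tosses of a supposedly fair coin
println(pval)
```

With 8 heads in 10 tosses, the outcomes counted are 0, 1, 2, 8, 9, and 10 heads, giving 112/1024.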


## Student's _T_-test 

The _t_-test is an extremely common statistical test for differences in means between two groups on some continuous variable. _T_-tests are often used to investigate the effects of some new treatment versus a control group.

In [None]:
# Life Expectancy Example:
# Simulate data from Gaussian, test whether smokers 
# and non-smokers have same life expectancy 

non_smokers_gaussian = Normal(75, 5)
smokers_gaussian = Normal(65, 7)

srand(137)

non_smokers = rand(non_smokers_gaussian, 30)     # take 30 random draws from Gaussian
smokers = rand(smokers_gaussian, 30)

EqualVarianceTTest(smokers, non_smokers)         # two-sample t-test (assumes equal variances)
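
The equal-variance statistic is $t = (\bar{x}_1 - \bar{x}_2) / \left(s_p \sqrt{1/n_1 + 1/n_2}\right)$, where $s_p^2$ is the pooled sample variance and the degrees of freedom are $n_1 + n_2 - 2$. A minimal hand computation, assuming Julia 1.x and using made-up data (the helper `pooled_t` is hypothetical):

```julia
using Statistics

function pooled_t(a, b)
    na, nb = length(a), length(b)
    # pooled variance: weighted average of the two sample variances
    sp2 = ((na - 1) * var(a) + (nb - 1) * var(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2     # t statistic and degrees of freedom
end

t, dof = pooled_t([1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 5.0, 6.0, 7.0])
println((t, dof))
```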

## Pearson's Correlation
Pearson's correlation is a measure of the linear relationship between two continuous variables. The resulting correlation coefficient, _r_, ranges from -1 to 1. Positive values of _r_ indicate a positive association between the variables (e.g., hours of studying and GPA), while negative values indicate a negative association (e.g., beers-per-week and GPA).

In [None]:
# correlation between two variables in vectors
x = randn(100)
y = randn(100)

cor(x, y)

In [None]:
# pairwise correlation of all variables in a matrix
A = randn(100, 5)
cor(A)

In [None]:
# correlation of variables in a DataFrame object
cor(stagec[:Age], stagec[:Grade])
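
The coefficient itself is the sum of cross-products of the centered variables, divided by the square root of the product of their sums of squares. A small sketch on made-up data, assuming Julia 1.x (the helper `pearson_r` is hypothetical and is checked against the built-in `cor`):

```julia
using Statistics

function pearson_r(x, y)
    dx = x .- mean(x)     # center each variable
    dy = y .- mean(y)
    return sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2))
end

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r_pos = pearson_r(x, 2 .* x .+ 1)   # perfect positive linear relationship
r_neg = pearson_r(x, -x)            # perfect negative linear relationship
r_mid = pearson_r(x, y)             # imperfect positive relationship
println((r_pos, r_neg, r_mid, cor(x, y)))
```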

---

## Challenge Question 2
Using the `stagec` dataset from above, determine if patients with tetraploid tumors have higher Gleason scores than patients with diploid tumors.

## Challenge Question 3
_Option 1_: Create a box-and-whisker plot of the two groups' data described in Question 2 above. <br>

_Option 2_: Create a plot of the distributions of the two groups' data described in Question 2.   

---

## Linear and Generalized Linear Models
Linear regression models and their generalizations represent one of the most powerful and fundamental classes of models in all of statistics. These models are tremendously useful for making inferences as well as for making predictions. They are ubiquitous across scientific disciplines. Furthermore, linear and generalized linear models also serve as the foundation for some of the most promising advances in machine learning and artificial intelligence.

### Linear Regression Models
Useful Packages:
- GLM
- MixedModels (for multi-level models)
- Mamba (for Bayesians)

In [None]:
using GLM

fm1 = lm(G2 ~ 1 + Age, stagec)                               # fit linear model regressing G2 on Age

In [None]:
coef(fm1)                                                    # get model coefficients (i.e., Betas)

In [None]:
stderr(fm1)                                                  # standard error of coefficients

In [None]:
confint(fm1)                                                 # confidence intervals for coefficients

In [None]:
predict(fm1)                                                 # predicted value for each observation (i.e., y_hat)

In [None]:
residuals(fm1)                                               # residuals (i.e., y - y_hat)

In [None]:
# fit indices
deviance(fm1)                                                

In [None]:
aic(fm1)                                                     

In [None]:
bic(fm1)

In [None]:
# make predictions with fitted model
new_data = DataFrame(Age = [10, 20, 30, 40, 50])
predict(fm1, new_data)
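
What `lm` computes for a model like this can be sketched with plain linear algebra: build a design matrix with an intercept column and solve the least-squares problem. The example assumes Julia 1.x and uses made-up data rather than `stagec`. It is not a description of GLM's internals.

```julia
using LinearAlgebra

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]        # made-up responses

X = [ones(length(x)) x]               # design matrix: intercept column plus predictor
beta = X \ y                          # least-squares coefficients
yhat = X * beta                       # fitted values
resid = y .- yhat                     # residuals

new_x = [6.0, 7.0]
pred = [ones(length(new_x)) new_x] * beta   # predictions for new observations
println(beta)
println(pred)
```

With an intercept in the model, the residuals sum to zero up to rounding error.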

In [None]:
# Adding predictors
fm2 = lm(G2 ~ 1 + Age + Gleason, stagec)                     # add Gleason score as predictor

In [None]:
deviance(fm2)                                                

In [None]:
aic(fm2)

In [None]:
bic(fm2)

In [None]:
fm3 = lm(G2 ~ 1 + Age + Gleason + Grade, stagec)            

In [None]:
# Adding interaction terms
fm4 = lm(G2 ~ 1 + Age + Gleason*EET, stagec)                     

### Binomial Logistic Regression
Binomial logistic regression models are used when fitting models to dichotomous outcome variables (e.g., 0 and 1). 

In [None]:
fm5 = glm(PgStat ~ 1 + Gleason, stagec, Binomial())          # defaults to logit-link

In [None]:
coef(fm5) 

In [None]:
stderr(fm5)

In [None]:
confint(fm5)

In [None]:
deviance(fm5)                                                

In [None]:
aic(fm5)                                                     

In [None]:
bic(fm5)
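
Coefficients from a logit-link model are on the log-odds scale. The sketch below shows how to move between log-odds and probabilities. The coefficients `b0` and `b1` are hypothetical values chosen for illustration, not estimates from `fm5`.

```julia
logistic(eta) = 1 / (1 + exp(-eta))    # inverse of the logit link
logit(p) = log(p / (1 - p))            # log-odds

# hypothetical intercept and slope, chosen only for illustration
b0, b1 = -3.0, 0.5

p_at_6 = logistic(b0 + b1 * 6)         # linear predictor is 0 here
odds_ratio = exp(b1)                   # multiplicative change in odds per unit of the predictor
println((p_at_6, odds_ratio))
```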

### Poisson Regression
Poisson regression is useful for modeling outcome variables that are in the form of count data.

In [None]:
prussian = dataset("pscl", "prussian")                       # load Prussian horse kick data

In [None]:
fm6 = glm(Y ~ 1 + Year + Corp, prussian, Poisson())          # defaults to log link 

In [None]:
coef(fm6) 
stderr(fm6)
confint(fm6)
deviance(fm6)                                                
aic(fm6)                                                     
bic(fm6)
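
With the log link, the linear predictor is the log of the expected count, so exponentiating a coefficient gives a rate ratio. A tiny sketch with hypothetical coefficients (not estimates from `fm6`):

```julia
b0, b1 = 0.2, 0.1                      # hypothetical coefficients

expected_count(x) = exp(b0 + b1 * x)   # inverse of the log link

# a one-unit increase in x multiplies the expected count by exp(b1)
rate_ratio = expected_count(4) / expected_count(3)
println(rate_ratio)
```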

---

## Challenge Question 4
A key assumption of linear regression is the normality of the error terms. A simple but effective way to check this assumption is to examine a plot of the distribution of the fitted model's residuals.

Create a plot to check the assumption of normality of error terms for model `fm3` above.


---

---

# Machine Learning

Useful Packages:
- DecisionTree
- RandomForest
- Lasso
- GLMNet (for ridge regression, lasso, and elastic net)
- LARS (lasso and elastic net)
- Clustering
- GradientBoosting
- XGBoost
- Mocha (deep neural nets)

## Bagging
Bagging is short for "bootstrap aggregation" and involves fitting many individual classification or regression trees, each to a bootstrap resample of the data, and combining their predictions. Thus, bagging can be used with both categorical and continuous outcomes. Averaging over many trees improves prediction accuracy over a single tree by reducing variance and, with it, the chance of overfitting your data (see bias/variance tradeoff).

In [None]:
using DecisionTree

In [None]:
# take only complete cases
stagec_comp = stagec[complete_cases(stagec), :]

is_tetraploid = stagec_comp[:Ploidy] .== "tetraploid"

stagec_comp[:tetra] = is_tetraploid

# must convert to Array
y = convert(Array{Float64,1}, stagec_comp[:G2])
X = convert(Array{Float64,2}, stagec_comp[[:Age, :Grade, :Gleason, :EET, :tetra]])

fm7 = build_forest(y, X, 5, 500)   # consider all 5 predictors at each split (bagging), grow 500 trees

In [None]:
apply_forest(fm7, [55.0, 3.0, 2.0, 1.0, 1.0])
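
The two ingredients of bagging, resampling with replacement and averaging, can be sketched without any tree library. In this illustration a least-squares line stands in for the regression tree, purely to keep the code short. The data are made up, the names `fit_line`, `bagged_predict`, and `nfits` are hypothetical, and this is not how DecisionTree implements `build_forest`.

```julia
using Random, Statistics, LinearAlgebra

# base learner: straight-line fit by least squares (stands in for a regression tree)
fit_line(x, y) = [ones(length(x)) x] \ y

function bagged_predict(x, y, xnew; nfits = 200, rng = MersenneTwister(137))
    n = length(x)
    preds = map(1:nfits) do _
        idx = rand(rng, 1:n, n)          # bootstrap sample: n draws with replacement
        b = fit_line(x[idx], y[idx])     # fit the base learner to the resample
        b[1] + b[2] * xnew               # its prediction at xnew
    end
    return mean(preds)                   # aggregate by averaging
end

x = collect(1.0:20.0)
y = 2 .* x .+ 1 .+ 0.1 .* sin.(x)        # made-up data: a line plus a small wiggle
yhat = bagged_predict(x, y, 10.0)
println(yhat)
```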

## Random Forest
Random forests were developed after bagging, and are a generalization of the idea of taking bootstrap samples from your dataset and fitting many trees. Random forests differ from bagged trees in that for each split point in the tree, the algorithm only considers some subset of the predictors as candidates on which to split. This has the effect of further reducing the correlation between trees beyond what is already achieved by the bootstrapping. This reduces overfitting and improves prediction accuracy for new data.

In [None]:
fm8 = build_forest(y, X, 3, 500)   # consider 3 randomly chosen predictors at each split, grow 500 trees
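
The only change from bagging is the set of predictors offered at each split. A sketch of that selection step, assuming Julia 1.x (the variable names are illustrative, and the actual sampling inside DecisionTree may differ):

```julia
using Random

rng = MersenneTwister(137)
p = 5                                   # number of predictors (columns of X)
m = 3                                   # predictors considered at each split

candidates = randperm(rng, p)[1:m]      # random subset, drawn afresh for every split
bagging_candidates = collect(1:p)       # bagging: every predictor is a candidate
println(candidates)
```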

In [None]:
# This is a quick function to obtain the mean-squared
# error of a fitted random forest (or bagged tree) model.

function mse(fitted, y, X)
    yhat = apply_forest(fitted, X)
    sqerr = (y .- yhat).^2
    out = mean(sqerr)
    return out
end

In [None]:
@show mse(fm7, y, X)
@show mse(fm8, y, X);

## Lasso
The lasso (Least Absolute Shrinkage and Selection Operator) is a form of regularized regression that penalizes the L1 norm of the vector of regression coefficients. This has the effect of shrinking the least important regression coefficients exactly to zero. It resembles ridge regression, which also shrinks regression coefficients towards zero, but ridge penalizes the squared L2 (Euclidean) norm and generally leaves coefficients small rather than exactly zero.

In [None]:
using Lasso
swiss = dataset("datasets", "swiss")

In [None]:
Xswiss = convert(Array{Float64, 2}, swiss[:, 2:5])
yswiss = convert(Array{Float64, 1}, swiss[:, 6])
fm9 = fit(LassoPath, Xswiss, yswiss)

In [None]:
fieldnames(fm9)

In [None]:
@show fm9.λ
full(fm9.coefs)
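
The contrast between the two penalties is clearest in the special case of orthonormal predictors, where (with the penalty scaled appropriately) each least-squares coefficient is shrunk independently. The sketch below illustrates that special case only. It is not the algorithm used by the Lasso package, and the coefficients are made up.

```julia
# lasso, orthonormal case: shift toward zero by lambda and clip at zero
soft_threshold(z, lambda) = sign(z) * max(abs(z) - lambda, 0)

# ridge, orthonormal case: proportional shrinkage, never exactly zero for nonzero z
ridge_shrink(z, lambda) = z / (1 + lambda)

ols = [3.0, -0.5, 0.2, -2.0]            # made-up least-squares coefficients
lasso = soft_threshold.(ols, 1.0)
ridge = ridge_shrink.(ols, 1.0)
println(lasso)
println(ridge)
```

Coefficients smaller in magnitude than `lambda` are set to zero by the lasso rule, which is the source of its variable-selection behavior.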

---

## Challenge Question 5
Using the `aldh2` dataset from the `gap` package in R, try fitting a few random forest (or bagged tree) models that predict whether a given patient is an alcoholic using their genetic information. <br>

What is the prediction accuracy of your best model? What were the meta-parameters of your best-fitting model? <br>

The data can be loaded using the code below.

In [None]:
aldh2 = dataset("gap", "aldh2")

---

---

# Recommended Resources

### Statistical Inference
- Casella and Berger (2002) _Statistical Inference_
- Wasserman (2004) _All of Statistics_

### Linear Models
- Gelman and Hill (2007) _Data Analysis Using Regression and Multilevel/Hierarchical Models_
- Rencher and Schaalje (2008) _Linear Models in Statistics_

### Generalized Linear Models
- Agresti (2002) _Categorical Data Analysis_
- Hosmer and Lemeshow (2000) _Applied Logistic Regression_

### Machine Learning
- Hastie, Tibshirani, & Friedman (2001) _The Elements of Statistical Learning_
- James, Witten, Hastie, & Tibshirani (2015) _An Introduction to Statistical Learning_
- Kuhn and Johnson (2013) _Applied Predictive Modeling_