# Vectors and _N_-Dimensional Arrays

In [1]:
v = [4, 5, 6]

3-element Array{Int64,1}:
 4
 5
 6

In [2]:
u = [1.2, 4.5, 6.5]

3-element Array{Float64,1}:
 1.2
 4.5
 6.5

In [3]:
w = ["dog", "cat", "bird"]

3-element Array{String,1}:
 "dog" 
 "cat" 
 "bird"

### Grow Vector in Place

In [4]:
a = Int[]                       # initialize empty vector to be filled with Ints

0-element Array{Int64,1}

In [5]:
push!(a, 12)

1-element Array{Int64,1}:
 12

In [6]:
push!(a, 1000)

2-element Array{Int64,1}:
   12
 1000

In [7]:
push!(a, 3.14)                  # ERROR: trying to push Float into vector of Ints

LoadError: LoadError: InexactError()
while loading In[7], in expression starting on line 1
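The failure above comes from element-type conversion: `push!` converts the new value to the vector's element type, and `3.14` has no exact `Int64` representation. The following is a minimal sketch of that behavior and two possible workarounds. The variable names are illustrative, and the exact error message varies between Julia versions.

```julia
ints = Int[]
push!(ints, 3.0)                  # works: 3.0 converts exactly to the Int 3

# push!(ints, 3.14) throws an InexactError; catch it to confirm
failed = try
    push!(ints, 3.14)
    false
catch err
    err isa InexactError
end

floats = Float64[]                # option 1: choose a floating-point element type up front
push!(floats, 12)                 # the Int 12 is stored as 12.0
push!(floats, 3.14)

rounded = Int[]                   # option 2: round explicitly before pushing
push!(rounded, round(Int, 3.14))  # stores 3
```

Which option is appropriate depends on whether the fractional part carries information you need to keep.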

In [8]:
s = String[]                    # initialize empty vector of strings

0-element Array{String,1}

In [9]:
push!(s, "this")
push!(s, "is")
push!(s, "a string vector")

3-element Array{String,1}:
 "this"           
 "is"             
 "a string vector"

### Indexing a Vector

In [10]:
s[2]                            # get second element

"is"

In [11]:
s[3]                            # third element

"a string vector"

In [12]:
s[1] = "that"                   # assign to first element

"that"

In [13]:
s

3-element Array{String,1}:
 "that"           
 "is"             
 "a string vector"

### Slicing a Vector

In [14]:
s[1:2]                          # get subset of vector (1st and 2nd elements)

2-element Array{String,1}:
 "that"
 "is"  

In [15]:
s[2:end]                        # subset from 2nd element to last element

2-element Array{String,1}:
 "is"             
 "a string vector"
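A few more slicing patterns may be useful here. This sketch rebuilds the vector `s` so it runs on its own; it illustrates that a range slice returns a new vector (a copy), so modifying the slice leaves the original untouched.

```julia
s = ["that", "is", "a string vector"]

part = s[1:2]               # slicing with a range makes a new vector
part[1] = "this"            # changes only the copy, not `s`

last_two = s[end-1:end]     # `end` can be used in arithmetic
every_other = s[1:2:end]    # start:step:stop picks elements 1 and 3
```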

## _N_-Dimensional Arrays

In [16]:
A = Array{Float64}(5, 3)        # allocate 5-by-3 matrix (uninitialized: entries are arbitrary)

5×3 Array{Float64,2}:
 2.35213e-314  2.35213e-314  2.35231e-314
 2.35213e-314  2.35213e-314  2.35231e-314
 2.35213e-314  2.35214e-314  2.35231e-314
 2.35213e-314  2.35231e-314  2.35231e-314
 2.35213e-314  2.35231e-314  2.35231e-314

In [17]:
Z = zeros(5, 3)                 # 5-by-3 matrix with every entry set to zero

5×3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0
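`Array{Float64}(5, 3)` only allocates memory, which is why its printed entries are arbitrary tiny numbers, whereas `zeros` fills the matrix explicitly. The sketch below assumes Julia 1.x, where the uninitialized constructor is spelled with an explicit `undef` argument, and shows a few related constructors.

```julia
A = Array{Float64}(undef, 5, 3)   # Julia 1.x spelling; contents are arbitrary until assigned
fill!(A, 0.0)                     # overwrite every entry so the values are well-defined

Z = zeros(5, 3)                   # allocated and zero-filled in one step
O = ones(2, 2)                    # 2-by-2 matrix of ones
F = fill(7.5, 2, 3)               # 2-by-3 matrix where every entry is 7.5
```

Uninitialized allocation is mainly worthwhile when every entry will be overwritten immediately afterwards.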

In [18]:
B = [1 2 3;
     4 5 6;
     7 8 9]

3×3 Array{Int64,2}:
 1  2  3
 4  5  6
 7  8  9

### Indexing and Slicing a Matrix (or _N_-Dimensional Array)

In [19]:
B[3, 1]                         # gets entry in third row first column

7

In [20]:
B[2, :]                         # gets all of second row

3-element Array{Int64,1}:
 4
 5
 6

In [21]:
B[:, 3]                         # gets all of third column

3-element Array{Int64,1}:
 3
 6
 9
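Row and column indices can also be ranges, which extracts a submatrix. The sketch below rebuilds `B` so it is self-contained, and also shows linear indexing, which walks the matrix column by column because Julia stores arrays in column-major order.

```julia
B = [1 2 3;
     4 5 6;
     7 8 9]

sub = B[1:2, 2:3]       # rows 1-2 and columns 2-3, a 2×2 matrix
last_row = B[end, :]    # `end` refers to the last index along that dimension
fourth = B[4]           # linear index 4 is the first entry of column 2
```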

## Reading Data from Flat File

In [22]:
d = readdlm("somedata.csv", ',')

11×4 Array{Any,2}:
   "id"    "age"  "gender"  "condition"    
  1      23       "m"       "bronchitis"   
  2      54       "m"       "heart disease"
  3      43       "f"       "diabetes"     
  4      27       "f"       "lupus"        
  5      52       "m"       "melanoma"     
  6      48       "f"       "influenza"    
  7      13       "m"       "asthma"       
  8      71       "f"       "arthritis"    
  9      59       "m"       "stroke"       
 10      31       "f"       "epilepsy"     

In [23]:
d1 = readdlm("otherdata.tsv", '\t')

11×4 Array{Any,2}:
   "id"    "age"  "gender"     "condition"    
  1      23       "m"       999               
  2      54       "m"          "heart disease"
  3      43       "f"          "diabetes"     
  4      27       "f"          "lupus"        
  5      52       "m"          "melanoma"     
  6      48       "f"          "influenza"    
  7      13       "m"       999               
  8      71       "f"          "arthritis"    
  9      59       "m"          "stroke"       
 10      31       "f"          "epilepsy"     

In [24]:
d2 = readcsv("somedata.csv")                                 # equivalent to readdlm() with ','

11×4 Array{Any,2}:
   "id"    "age"  "gender"  "condition"    
  1      23       "m"       "bronchitis"   
  2      54       "m"       "heart disease"
  3      43       "f"       "diabetes"     
  4      27       "f"       "lupus"        
  5      52       "m"       "melanoma"     
  6      48       "f"       "influenza"    
  7      13       "m"       "asthma"       
  8      71       "f"       "arthritis"    
  9      59       "m"       "stroke"       
 10      31       "f"       "epilepsy"     

In [25]:
d3 = readcsv("somedata.csv", header = true)                  # treat first line as column headings

(
Any[1 23 "m" "bronchitis"; 2 54 "m" "heart disease"; … ; 9 59 "m" "stroke"; 10 31 "f" "epilepsy"],

AbstractString["id" "age" "gender" "condition"])

In [26]:
typeof(d3)

Tuple{Array{Any,2},Array{AbstractString,2}}

In [27]:
d3[1]

10×4 Array{Any,2}:
  1  23  "m"  "bronchitis"   
  2  54  "m"  "heart disease"
  3  43  "f"  "diabetes"     
  4  27  "f"  "lupus"        
  5  52  "m"  "melanoma"     
  6  48  "f"  "influenza"    
  7  13  "m"  "asthma"       
  8  71  "f"  "arthritis"    
  9  59  "m"  "stroke"       
 10  31  "f"  "epilepsy"     

In [28]:
d3[2]

1×4 Array{AbstractString,2}:
 "id"  "age"  "gender"  "condition"

In [29]:
d4 = readcsv("somedata.csv", header = true, skipstart = 3)   # skip first 3 lines; the 4th line is then read as the header

(
Any[4 27 "f" "lupus"; 5 52 "m" "melanoma"; … ; 9 59 "m" "stroke"; 10 31 "f" "epilepsy"],

AbstractString["3" "43" "f" "diabetes"])

_Note_: There are a variety of other optional arguments you can pass to the `readdlm()` family of functions. You can see more of these by using `?readdlm` at the Julia prompt, or (even better) you can go read the documentation online.  
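As a self-contained illustration of the `header` option, the sketch below writes a tiny file and reads it back. It assumes Julia 1.x, where `readdlm` lives in the `DelimitedFiles` standard library (and `readcsv` is no longer available); the file contents and the scratch path are made up for the example.

```julia
using DelimitedFiles   # assumption: Julia 1.x, where readdlm is provided by this stdlib

path = tempname() * ".csv"                 # hypothetical scratch file
open(path, "w") do io
    write(io, "id,age,gender\n1,23,m\n2,54,m\n3,43,f\n")
end

# With header = true, readdlm returns a (data, header) tuple
data, header = readdlm(path, ',', header = true)

ages = Int.(data[:, 2])                    # pull out one column and convert it to Int
rm(path)                                   # clean up the scratch file
```

Because the columns mix numbers and strings, `data` has element type `Any`, just like the matrices shown above.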

---

---

# DataFrame Objects

### Advantages of `DataFrame` objects
- _de facto_ object for data analysis
- NA type for missing data
- factor-like type for model fitting 

In [30]:
using DataFrames

In [31]:
df1 = readtable("somedata.csv")

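| Row | id | age | gender | condition |
|-----|----|-----|--------|-----------|
| 1 | 1 | 23 | m | bronchitis |
| 2 | 2 | 54 | m | heart disease |
| 3 | 3 | 43 | f | diabetes |
| 4 | 4 | 27 | f | lupus |
| 5 | 5 | 52 | m | melanoma |
| 6 | 6 | 48 | f | influenza |
| 7 | 7 | 13 | m | asthma |
| 8 | 8 | 71 | f | arthritis |
| 9 | 9 | 59 | m | stroke |
| 10 | 10 | 31 | f | epilepsy |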


In [32]:
df2 = readtable("somedata.csv", makefactors = true)

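| Row | id | age | gender | condition |
|-----|----|-----|--------|-----------|
| 1 | 1 | 23 | m | bronchitis |
| 2 | 2 | 54 | m | heart disease |
| 3 | 3 | 43 | f | diabetes |
| 4 | 4 | 27 | f | lupus |
| 5 | 5 | 52 | m | melanoma |
| 6 | 6 | 48 | f | influenza |
| 7 | 7 | 13 | m | asthma |
| 8 | 8 | 71 | f | arthritis |
| 9 | 9 | 59 | m | stroke |
| 10 | 10 | 31 | f | epilepsy |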


In [33]:
df3 = readtable("otherdata.tsv", makefactors = true)

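| Row | id | age | gender | condition |
|-----|----|-----|--------|-----------|
| 1 | 1 | 23 | m | 999 |
| 2 | 2 | 54 | m | heart disease |
| 3 | 3 | 43 | f | diabetes |
| 4 | 4 | 27 | f | lupus |
| 5 | 5 | 52 | m | melanoma |
| 6 | 6 | 48 | f | influenza |
| 7 | 7 | 13 | m | 999 |
| 8 | 8 | 71 | f | arthritis |
| 9 | 9 | 59 | m | stroke |
| 10 | 10 | 31 | f | epilepsy |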


In [34]:
df4 = readtable("otherdata.tsv", nastrings = ["999"])

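| Row | id | age | gender | condition |
|-----|----|-----|--------|-----------|
| 1 | 1 | 23 | m | NA |
| 2 | 2 | 54 | m | heart disease |
| 3 | 3 | 43 | f | diabetes |
| 4 | 4 | 27 | f | lupus |
| 5 | 5 | 52 | m | melanoma |
| 6 | 6 | 48 | f | influenza |
| 7 | 7 | 13 | m | NA |
| 8 | 8 | 71 | f | arthritis |
| 9 | 9 | 59 | m | stroke |
| 10 | 10 | 31 | f | epilepsy |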


In [35]:
df5 = readtable("somedata.csv", nastrings = ["", "NA", "999"])

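| Row | id | age | gender | condition |
|-----|----|-----|--------|-----------|
| 1 | 1 | 23 | m | bronchitis |
| 2 | 2 | 54 | m | heart disease |
| 3 | 3 | 43 | f | diabetes |
| 4 | 4 | 27 | f | lupus |
| 5 | 5 | 52 | m | melanoma |
| 6 | 6 | 48 | f | influenza |
| 7 | 7 | 13 | m | asthma |
| 8 | 8 | 71 | f | arthritis |
| 9 | 9 | 59 | m | stroke |
| 10 | 10 | 31 | f | epilepsy |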


In [36]:
df1[:age]                     # entire `age` column

10-element DataArrays.DataArray{Int64,1}:
 23
 54
 43
 27
 52
 48
 13
 71
 59
 31

In [37]:
df1[3, :condition]            # third participant's `condition`

"diabetes"

In [38]:
df1[4, :]                     # all of row 4

| Row | id | age | gender | condition |
|-----|----|-----|--------|-----------|
| 1 | 4 | 27 | f | lupus |


---

## Challenge Question 1

Using the `stagec` dataset from the rpart package in R, find the Gleason score of the oldest patient.

In [40]:
using RDatasets  

stagec = dataset("rpart", "stagec")       # might need to be run twice

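| Row | PgTime | PgStat | Age | EET | G2 | Grade | Gleason | Ploidy |
|-----|--------|--------|-----|-----|----|-------|---------|--------|
| 1 | 6.1 | 0 | 64 | 2 | 10.26 | 2 | 4 | diploid |
| 2 | 9.4 | 0 | 62 | 1 | NA | 3 | 8 | aneuploid |
| 3 | 5.2 | 1 | 59 | 2 | 9.99 | 3 | 7 | diploid |
| 4 | 3.2 | 1 | 62 | 2 | 3.57 | 2 | 4 | diploid |
| 5 | 1.9 | 1 | 64 | 2 | 22.56 | 4 | 8 | tetraploid |
| 6 | 4.8 | 0 | 69 | 1 | 6.14 | 3 | 7 | diploid |
| 7 | 5.8 | 0 | 75 | 2 | 13.69 | 2 | NA | tetraploid |
| 8 | 7.3 | 0 | 71 | 2 | NA | 3 | 7 | aneuploid |
| 9 | 3.7 | 1 | 73 | 2 | 11.77 | 3 | 6 | diploid |
| 10 | 15.9 | 0 | 64 | 2 | 27.27 | 3 | 7 | tetraploid |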


---

# Descriptive Statistics
Useful Packages:
- StatsBase

In [41]:
x = randn(100)                # generate 100 draws from standard Gaussian

100-element Array{Float64,1}:
  1.81681 
 -0.106883
 -0.304894
  0.627641
 -0.646006
 -1.80745 
 -1.6134  
  0.847962
 -0.210035
  2.46234 
  1.37391 
  0.261386
  0.834375
  ⋮       
 -0.124578
 -2.00369 
 -0.391279
  0.177285
  0.917718
 -0.557587
  0.867045
  1.32187 
  0.79401 
  1.50355 
  0.97725 
  1.51719 

In [44]:
@show mean(x)
@show median(x)
@show mode(x)                 # all draws are distinct, so this just returns the first value
@show std(x)
@show var(x)
@show minimum(x)
@show maximum(x);

mean(x) = 0.18428586576159323
median(x) = 0.18172760982856606
mode(x) = 1.8168122293214692
std(x) = 0.9388154447625697
var(x) = 0.8813744393247417
minimum(x) = -2.2547483697966633
maximum(x) = 2.4623405681806694


### Descriptives with `DataFrame`

In [46]:
describe(stagec)                     # get many useful descriptive stats



PgTime
Min      0.3
1st Qu.  3.7
Median   5.9
Mean     6.323972602739725
3rd Qu.  7.9
Max      17.7
NAs      0
NA%      0.0%

PgStat
Min      0.0
1st Qu.  0.0
Median   0.0
Mean     0.3698630136986301
3rd Qu.  1.0
Max      1.0
NAs      0
NA%      0.0%

Age
Min      47.0
1st Qu.  59.0
Median   63.0
Mean     63.0
3rd Qu.  67.0
Max      75.0
NAs      0
NA%      0.0%

EET
Min      1.0
1st Qu.  1.75
Median   2.0
Mean     1.75
3rd Qu.  2.0
Max      2.0
NAs      2
NA%      1.37%

G2
Min      2.4
1st Qu.  9.215
Median   13.01
Mean     14.27525179856115
3rd Qu.  16.715
Max      54.93
NAs      7
NA%      4.79%

Grade
Min      1.0
1st Qu.  2.0
Median   3.0
Mean     2.6095890410958904
3rd Qu.  3.0
Max      4.0
NAs      0
NA%      0.0%

Gleason
Min      3.0
1st Qu.  5.0
Median   6.0
Mean     6.34965034965035
3rd Qu.  7.0
Max      10.0
NAs      3
NA%      2.05%

Ploidy
Length  146
Type    Pooled String
NAs     0
NA%     0.0%
Unique  3


In [49]:
@show mean(stagec[:Age])             # get mean of Age column
@show mean(stagec[:Gleason])         # returns NA (not what we want)
@show mean(dropna(stagec[:G2]))      # dropna() removes NA values

mean(stagec[:Age]) = 63.0
mean(stagec[:Gleason]) = NA
mean(dropna(stagec[:G2])) = 14.27525179856115


14.27525179856115
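The same pitfall exists in newer Julia versions, where `missing` plays the role of `NA` and `skipmissing` plays the role of `dropna`. The sketch below assumes Julia 1.x and uses a made-up vector of scores rather than the real `stagec` column.

```julia
# Assumption: Julia 1.x, where `missing` replaces the older `NA`.
gleason = [4, 8, 7, missing, 6]              # hypothetical scores, not the stagec data

naive = sum(gleason) / length(gleason)       # `missing` propagates through the arithmetic

vals = collect(skipmissing(gleason))         # keep only the observed values
clean_mean = sum(vals) / length(vals)        # mean over the 4 observed scores
```

As with `dropna`, the cleaned mean is computed over fewer observations than the column length, which is worth reporting alongside the estimate.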

---

---

# Inferential Statistics

Useful Packages:
- HypothesisTests (Binomial test, _t_-tests, $\chi^2$-test, and many more...)
- GLM (linear and generalized linear models)
- MixedModels (Multi-level (or Mixed-effects) models)

## Binomial Test

The binomial test checks whether dichotomous data (e.g., "success" and "failure") deviate from an expected distribution. Binomial tests can be used to answer questions of this form: _If the true probability of success is $P$, what is the probability of observing data at least as extreme as the data we have?_

In [50]:
# Coin Tossing Example:
# Simulate data from Binomial, test 
# hypothesis data came from fair coin

using HypothesisTests
using Distributions

binom = Binomial(1, 0.6)                        # initialize Distribution object Binom(1, 0.6) 

srand(137)                                      # set seed for reproducibility
x = rand(binom, 10)                             # take 10 random draws from our distribution

xbool = convert(Array{Bool,1}, x)               # cast x to vector of Booleans
BinomialTest(xbool, 0.5)                        # test null hypothesis that p = 0.5


Binomial test
-------------
Population details:
    parameter of interest:   Probability of success
    value under h_0:         0.5
    point estimate:          0.6
    95% confidence interval: (0.26237807660694507,0.8784477418801727)

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.7539062500000002 (not significant)

Details:
    number of observations: 10
    number of successes:    6
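The two-sided p-value reported above can be reproduced by hand from the binomial probability mass function. The sketch below uses one common definition of "at least as extreme" (sum the probabilities of every outcome that is no more likely than the observed one); the helper name `binom_pmf` is made up for this example, and the definition is an assumption about how the package computes its result, although it does match the printed value for this case.

```julia
# P(X = k) for n independent trials with success probability p
binom_pmf(k, n, p) = binomial(n, k) * p^k * (1 - p)^(n - k)

n, k, p0 = 10, 6, 0.5                 # 6 successes in 10 tosses, null hypothesis p = 0.5
observed = binom_pmf(k, n, p0)

# Sum the probability of every outcome that is no more likely than the observed one.
# The small relative tolerance guards against floating-point ties.
pvalue = sum(binom_pmf(j, n, p0) for j in 0:n
             if binom_pmf(j, n, p0) <= observed * (1 + 1e-7))
```

Here every outcome except exactly 5 successes qualifies, so the p-value is 1 - P(X = 5) = 772/1024, about 0.754.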


## Student's _T_-test 

The _t_-test is an extremely common statistical test for a difference in means between two groups on some continuous variable. _T_-tests are often used to compare the effect of a new treatment against a control group.

In [51]:
# Life Expectancy Example:
# Simulate data from Gaussian, test whether smokers 
# and non-smokers have same life expectancy 

non_smokers_gaussian = Normal(75, 5)
smokers_gaussian = Normal(65, 7)

srand(137)

non_smokers = rand(non_smokers_gaussian, 30)     # take 30 random draws from Gaussian
smokers = rand(smokers_gaussian, 30)

EqualVarianceTTest(smokers, non_smokers)         # two-sample t-test (assumes equal variances)

Two sample t-test (equal variance)
----------------------------------
Population details:
    parameter of interest:   Mean difference
    value under h_0:         0
    point estimate:          -8.22222496690523
    95% confidence interval: (-11.279591398549222,-5.164858535261239)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           1.3800085142432315e-6 (extremely significant)

Details:
    number of observations:   [30,30]
    t-statistic:              -5.3832511878465485
    degrees of freedom:       58
    empirical standard error: 1.5273715975706403
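To make the quantities in this output concrete, here is a minimal sketch of the equal-variance (pooled) two-sample _t_-statistic in plain Julia. The function name `pooled_t` and the toy data in the usage lines are made up for illustration; it is not the package's implementation.

```julia
# Pooled two-sample t-statistic: returns (t, degrees of freedom, standard error)
function pooled_t(x, y)
    nx, ny = length(x), length(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((xi - mx)^2 for xi in x)       # within-group sums of squares
    ssy = sum((yi - my)^2 for yi in y)
    df = nx + ny - 2
    sp2 = (ssx + ssy) / df                   # pooled variance estimate
    se = sqrt(sp2 * (1 / nx + 1 / ny))       # standard error of the mean difference
    return ((mx - my) / se, df, se)
end

t, df, se = pooled_t([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The same relationship holds in the output above: the point estimate divided by the empirical standard error gives the reported _t_-statistic, and the degrees of freedom are 30 + 30 - 2 = 58.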


## Pearson's Correlation
Pearson's correlation is a measure of the linear relationship between two continuous variables. The resulting correlation coefficient, _r_, ranges between -1 and 1. Positive values of _r_ indicate a positive association between the variables (e.g., hours of studying and GPA), while negative values indicate a negative association (e.g., beers-per-week and GPA).

In [52]:
# correlation between two variables in vectors
x = randn(100)
y = randn(100)

cor(x, y)

0.11194734281527548
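For reference, the coefficient is the sum of cross-products of the centered variables divided by the square root of the product of their sums of squares. A minimal sketch in plain Julia follows; `pearson_r` and the toy vectors are names invented for this example.

```julia
# Pearson correlation computed directly from its definition
function pearson_r(x, y)
    n = length(x)
    dx = x .- sum(x) / n                 # center each variable at its mean
    dy = y .- sum(y) / n
    return sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2))
end

hours = [1.0, 2.0, 3.0, 4.0, 5.0]        # hypothetical data
r_up = pearson_r(hours, 2 .* hours .+ 1) # perfectly increasing linear relation
r_down = pearson_r(hours, -hours)        # perfectly decreasing linear relation
```

Two independent standard Gaussian samples, as in the cell above, should give a value near zero, with some sampling noise.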

In [57]:
# pairwise correlation of all variables in a matrix
A = randn(100, 5)
cor(A)

5×5 Array{Float64,2}:
  1.0         0.0309501    -0.0593928    -0.130261    -0.24591    
  0.0309501   1.0          -0.0321801    -0.0366651    0.000577063
 -0.0593928  -0.0321801     1.0           0.00831268   0.000747505
 -0.130261   -0.0366651     0.00831268    1.0          0.154578   
 -0.24591     0.000577063   0.000747505   0.154578     1.0        

In [54]:
# correlation of variables in a DataFrame object
cor(stagec[:Age], stagec[:Grade])

-0.08010120432272935

---

## Challenge Question 2
Using the `stagec` dataset from above, determine if patients with tetraploid tumors have higher Gleason scores than patients with diploid tumors.

## Challenge Question 3
Create a plot of the distributions of the two groups' data described in Question 2 above.   

---

## Linear and Generalized Linear Models
Linear regression models and their generalizations represent one of the most powerful and fundamental classes of models in all of statistics. These models are tremendously useful for making inferences as well as for making predictions. They are ubiquitous across scientific disciplines. Furthermore, linear and generalized linear models also serve as the foundation for some of the most promising advances in machine learning and artificial intelligence.

### Linear Regression Models
Useful Packages:
- GLM
- MixedModels (for multi-level models)
- Mamba (for Bayesians)

In [58]:
using GLM

fm1 = lm(G2 ~ 1 + Age, stagec)                               # fit linear model regressing G2 on Age

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Array{Float64,2}}

Formula: G2 ~ 1 + Age

Coefficients:
             Estimate Std.Error  t value Pr(>|t|)
(Intercept)   5.18789    7.5146 0.690375   0.4911
Age          0.144359  0.118861  1.21452   0.2266


In [59]:
coef(fm1)                                                    # get model coefficients (i.e., Betas)

2-element Array{Float64,1}:
 5.18789 
 0.144359

In [60]:
stderr(fm1)                                                  # standard error of coefficients

2-element Array{Float64,1}:
 7.5146  
 0.118861

In [61]:
confint(fm1)                                                 # confidence intervals for coefficients

2×2 Array{Float64,2}:
 -9.67171    20.0475  
 -0.0906806   0.379399

In [62]:
predict(fm1)                                                 # predicted value for each observation (i.e., y_hat)

139-element Array{Float64,1}:
 14.4269
 13.7051
 14.1382
 14.4269
 15.1487
 16.0148
 15.7261
 14.4269
 14.5712
 13.5607
 15.293 
 14.86  
 14.7156
  ⋮     
 13.7051
 12.6946
 15.0043
 14.2825
 14.86  
 13.7051
 13.4164
 14.1382
 14.2825
 15.7261
 12.5502
 13.272 

In [63]:
residuals(fm1)                                               # residuals (i.e., y - y_hat)

139-element Array{Float64,1}:
  -4.16688 
  -3.71508 
 -10.5682  
   8.13312 
  -9.00868 
  -2.32483 
  -3.95611 
  12.8431  
   4.76876 
   1.25927 
  -5.07304 
   0.800042
   3.0744  
   ⋮       
  -2.01508 
  -2.79457 
  -1.99432 
  -9.47252 
  -0.149958
  -4.69508 
  -2.51637 
  -3.41816 
  -9.14252 
  31.1939  
  -2.96021 
  -4.26201 

In [64]:
# fit indices
deviance(fm1)                                                

9233.743609282785

In [65]:
aic(fm1)                                                     

983.7291928323847

In [66]:
bic(fm1)

992.5326146317767
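These fit indices can be reproduced from the deviance (the residual sum of squares) shown above. The sketch assumes the Gaussian log-likelihood evaluated at the maximum-likelihood variance estimate and counts three parameters (intercept, slope, and error variance); that parameter count is inferred from the printed numbers rather than taken from the GLM source.

```julia
n = 139                       # observations used by fm1 (rows with G2 present)
rss = 9233.743609282785       # deviance(fm1), i.e. the residual sum of squares
k = 3                         # assumed parameter count: intercept, slope, error variance

# Gaussian log-likelihood with the variance set to rss / n
loglik = -n / 2 * (log(2pi) + log(rss / n) + 1)

aic_by_hand = 2k - 2loglik
bic_by_hand = k * log(n) - 2loglik
```

Both indices share the `-2loglik` term and differ only in the complexity penalty, so for n = 139 the BIC penalizes each parameter more heavily (log(139) is roughly 4.93 versus 2 for AIC).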

In [67]:
# make predictions with fitted model
new_data = DataFrame(Age = [10, 20, 30, 40, 50])
predict(fm1, new_data)

5-element DataArrays.DataArray{Float64,1}:
  6.63149
  8.07508
  9.51867
 10.9623 
 12.4059 
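Prediction from a linear model is just the design matrix multiplied by the coefficient vector. The sketch below uses the rounded estimates printed for `fm1`, so the results agree with the `predict` output only up to rounding.

```julia
beta = [5.18789, 0.144359]             # rounded (Intercept) and Age estimates from fm1
ages = [10, 20, 30, 40, 50]

X = hcat(ones(length(ages)), ages)     # design matrix: intercept column, then Age
yhat = X * beta                        # predicted G2 for each new age
```

Note that ages 10 through 40 lie below the youngest patient in the data (47), so those predictions are extrapolations.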

In [68]:
# Adding predictors
fm2 = lm(G2 ~ 1 + Age + Gleason, stagec)                     # add Gleason score as predictor

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Array{Float64,2}}

Formula: G2 ~ 1 + Age + Gleason

Coefficients:
             Estimate Std.Error  t value Pr(>|t|)
(Intercept)  -8.91812    8.0433 -1.10876   0.2695
Age          0.176913  0.116485  1.51877   0.1312
Gleason       1.93296  0.503399  3.83982   0.0002


In [69]:
deviance(fm2)                                                

8189.27753135993

In [70]:
aic(fm2)

951.2692270563475

In [71]:
bic(fm2)

962.9198465992918

In [72]:
fm3 = lm(G2 ~ 1 + Age + Gleason + Grade, stagec)            

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Array{Float64,2}}

Formula: G2 ~ 1 + Age + Gleason + Grade

Coefficients:
             Estimate Std.Error  t value Pr(>|t|)
(Intercept)  -9.19135   8.15383 -1.12724   0.2617
Age          0.178597  0.117116  1.52496   0.1297
Gleason       1.76439  0.872139  2.02306   0.0451
Grade        0.472942   1.99455 0.237116   0.8129


In [73]:
# Adding interaction terms
fm4 = lm(G2 ~ 1 + Age + Gleason*EET, stagec)                     

DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredQR{Float64}},Array{Float64,2}}

Formula: G2 ~ 1 + Age + Gleason + EET + Gleason & EET

Coefficients:
               Estimate Std.Error   t value Pr(>|t|)
(Intercept)    -4.97759   16.2193 -0.306893   0.7594
Age            0.172599  0.118692   1.45417   0.1483
Gleason         1.12909   2.11073  0.534929   0.5936
EET            -2.25982   7.73058 -0.292321   0.7705
Gleason & EET   0.48899   1.17624  0.415723   0.6783


### Binomial Logistic Regression
Binomial logistic regression models are used when fitting models to dichotomous outcome variables (e.g., 0 and 1). 

In [74]:
fm5 = glm(PgStat ~ 1 + Gleason, stagec, Binomial())          # defaults to logit-link

DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: PgStat ~ 1 + Gleason

Coefficients:
             Estimate Std.Error  z value Pr(>|z|)
(Intercept)  -4.93489   1.06153 -4.64884    <1e-5
Gleason      0.685106  0.159095  4.30627    <1e-4


In [75]:
coef(fm5) 

2-element Array{Float64,1}:
 -4.93489 
  0.685106
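The coefficients of a logistic regression are on the log-odds scale. The sketch below converts them into a predicted probability and an odds ratio using the rounded estimates printed above; `invlogit` and `p_at` are helper names introduced for this example.

```julia
invlogit(eta) = 1 / (1 + exp(-eta))     # inverse of the logit link

b0, b1 = -4.93489, 0.685106             # rounded (Intercept) and Gleason estimates from fm5

p_at(gleason) = invlogit(b0 + b1 * gleason)   # predicted probability that PgStat = 1

p7 = p_at(7)                            # roughly 0.47 for a Gleason score of 7
odds_ratio = exp(b1)                    # multiplicative change in odds per one-point increase
```

An odds ratio of about 1.98 means each additional Gleason point roughly doubles the estimated odds of progression.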

In [76]:
stderr(fm5)

2-element Array{Float64,1}:
 1.06153 
 0.159095

In [77]:
confint(fm5)

2×2 Array{Float64,2}:
 -7.01546   -2.85433 
  0.373285   0.996927

In [78]:
deviance(fm5)                                                

166.63636935531574

In [79]:
aic(fm5)                                                     

170.63636935531554

In [80]:
bic(fm5)

176.56205861583535

### Poisson Regression
Poisson regression is useful for modeling outcome variables that are in the form of count data.

In [81]:
prussian = dataset("pscl", "prussian")                       # load Prussian horse kick data

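| Row | Y | Year | Corp |
|-----|---|------|------|
| 1 | 0 | 75 | G |
| 2 | 2 | 76 | G |
| 3 | 2 | 77 | G |
| 4 | 1 | 78 | G |
| 5 | 0 | 79 | G |
| 6 | 0 | 80 | G |
| 7 | 1 | 81 | G |
| 8 | 1 | 82 | G |
| 9 | 0 | 83 | G |
| 10 | 3 | 84 | G |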


In [82]:
fm6 = glm(Y ~ 1 + Year + Corp, prussian, Poisson())          # defaults to log link 

DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Poisson{Float64},GLM.LogLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + Year + Corp

Coefficients:
               Estimate Std.Error    z value Pr(>|z|)
(Intercept)    -1.81457   1.08726   -1.66893   0.0951
Year          0.0187643 0.0124307    1.50951   0.1312
Corp: I      3.85039e-9  0.353544 1.08908e-8   1.0000
Corp: II      -0.287682  0.381869  -0.753353   0.4512
Corp: III     -0.287682  0.381875  -0.753341   0.4512
Corp: IV      -0.693147  0.433007   -1.60078   0.1094
Corp: IX      -0.207639   0.37339  -0.556092   0.5781
Corp: V       -0.374693  0.391672  -0.956651   0.3387
Corp: VI      0.0606246  0.348312   0.174053   0.8618
Corp: VII     -0.287682  0.381876   -0.75334   0.4512
Corp: VIII    -0.826679  0.453155   -1.82427   0.0681
Corp: X      -0.0645385  0.359393  -0.179576   0.8575
Corp: XI       0.446287 

In [None]:
coef(fm6) 
stderr(fm6)
confint(fm6)
deviance(fm6)                                                
aic(fm6)                                                     
bic(fm6)
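With the log link, Poisson coefficients are easiest to read after exponentiating, which turns them into rate ratios. A small sketch using the `Year` estimate printed in the `fm6` coefficient table:

```julia
b_year = 0.0187643                     # Year coefficient copied from the fm6 output
rate_ratio = exp(b_year)               # multiplicative change in expected count per year
pct_change = 100 * (rate_ratio - 1)    # the same effect expressed as a percentage
```

This corresponds to an estimated increase of roughly 1.9% in the expected count per year, although the reported p-value (0.1312) indicates the trend is not statistically distinguishable from zero.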

---

## Challenge Question 4
A key assumption of linear regression is the normality of error terms. A simple but effective way to check this assumption is to examine a plot of the distribution of the fitted model's residuals.

Create a plot to check the assumption of normality of error terms for model `fm3` above.


---

---

# Machine Learning

Useful Packages:
- DecisionTree
- RandomForest
- Lasso
- GLMNet (for ridge regression, lasso, and elastic net)
- LARS (lasso and elastic net)
- Clustering
- GradientBoosting
- XGBoost
- Mocha (deep neural nets)

## Bagging
Bagging is short for "bootstrap aggregation". It involves fitting many individual classification or regression trees, each to a bootstrap sample of the data, and combining their predictions. Thus, bagging can be used with both categorical and continuous outcomes. Using many trees improves prediction accuracy over a single tree by reducing the variance of the fitted model, which decreases the chance of overfitting your data (see bias/variance tradeoff).
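The mechanics of bootstrap aggregation can be sketched in a few lines of plain Julia. To stay self-contained, this example swaps the regression tree for a much simpler base learner (a least-squares line), so it illustrates only the resample-fit-average loop, not tree fitting. The names `fit_line` and `bagged_predict` are invented for this sketch.

```julia
# Base learner (stand-in for a tree): least-squares line, returns (intercept, slope)
function fit_line(x, y)
    n = length(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((x .- mx) .* (y .- my)) / sum((x .- mx) .^ 2)
    return (my - slope * mx, slope)
end

# Bagging: fit the base learner to many bootstrap samples and average the predictions
function bagged_predict(x, y, xnew; nboot = 200)
    n = length(x)
    total = zeros(length(xnew))
    for _ in 1:nboot
        idx = rand(1:n, n)                    # bootstrap sample: n draws with replacement
        a, b = fit_line(x[idx], y[idx])       # fit to the resampled data
        total .+= a .+ b .* xnew              # accumulate this model's predictions
    end
    return total ./ nboot                     # aggregate by averaging
end

x = collect(1.0:20.0)
y = 2 .* x .+ 1                               # noiseless toy data
preds = bagged_predict(x, y, [0.0, 10.0])
```

With a stable learner such as a line, averaging changes little; the variance reduction matters most for unstable learners such as deep trees, which is why bagging is usually paired with them.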

In [None]:
using DecisionTree

In [None]:
# take only complete cases
stagec_comp = stagec[complete_cases(stagec), :]

is_tetraploid = stagec_comp[:Ploidy] .== "tetraploid"

stagec_comp[:tetra] = is_tetraploid

# must convert to Array
y = convert(Array{Float64,1}, stagec_comp[:G2])
X = convert(Array{Float64,2}, stagec_comp[[:Age, :Grade, :Gleason, :EET, :tetra]])

fm7 = build_forest(y, X, 5, 500)        # 5 candidate features per split (all of them, i.e. bagging), 500 trees

In [None]:
apply_forest(fm7, [55.0, 3.0, 2.0, 1.0, 1.0])   # predict G2 for one new patient; order: Age, Grade, Gleason, EET, tetra

## Random Forest
Random forests were developed after bagging, and are a generalization of the idea of taking bootstrap samples from your dataset and fitting many trees. Random forests differ from bagged trees in that for each split point in the tree, the algorithm only considers some subset of the predictors as candidates on which to split. This has the effect of further reducing the correlation between trees beyond what is already achieved by the bootstrapping. This reduces overfitting and improves prediction accuracy for new data.

In [None]:
fm8 = build_forest(y, X, 3, 500)        # 3 randomly chosen candidate features per split, 500 trees

In [None]:
# This is a quick function to obtain the mean-squared
# error of a fitted random forest (or bagged tree) model.

function mse(fitted, y, X)
    yhat = apply_forest(fitted, X)
    sqerr = (y .- yhat).^2
    out = mean(sqerr)
    return out
end

In [None]:
@show mse(fm7, y, X)
@show mse(fm8, y, X);

## Lasso
The lasso (Least Absolute Shrinkage and Selection Operator) is a form of regularized regression which penalizes the L1 norm of the vector of regression coefficients. This has the effect of shrinking the least important regression coefficients to exactly zero. It is similar to ridge regression, which also shrinks regression coefficients towards zero by penalizing the L2 (Euclidean) norm, although ridge generally does not set any coefficient exactly to zero.
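The difference between the two penalties is easiest to see in the special case of an orthonormal design matrix, which is assumed here purely for illustration. In that case, up to the scaling convention chosen for the penalty, the lasso estimate is the soft-thresholded least-squares estimate, while ridge rescales every coefficient by the same factor. The function names and the coefficient values below are made up.

```julia
# Lasso under an orthonormal design: move each coefficient toward zero by lambda,
# and set it to zero once it would cross zero
soft_threshold(b, lambda) = sign(b) * max(abs(b) - lambda, 0.0)

# Ridge under an orthonormal design: proportional shrinkage
ridge_shrink(b, lambda) = b / (1 + lambda)

ols = [3.0, -0.4, 0.2, -2.5]              # hypothetical least-squares estimates

lasso_coefs = soft_threshold.(ols, 0.5)   # the two small coefficients become zero
ridge_coefs = ridge_shrink.(ols, 0.5)     # every coefficient shrinks, none reaches zero
```

This is why the lasso performs variable selection while ridge does not: any least-squares coefficient smaller in magnitude than the threshold is dropped from the model entirely.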

In [None]:
using Lasso
swiss = dataset("datasets", "swiss")

In [None]:
Xswiss = convert(Array{Float64, 2}, swiss[:, 2:5])
yswiss = convert(Array{Float64, 1}, swiss[:, 6])
fm9 = fit(LassoPath, Xswiss, yswiss)

In [None]:
fieldnames(fm9)

In [None]:
@show fm9.λ
full(fm9.coefs)

---

## Challenge Question 5
Using the `aldh2` dataset from the `gap` package in R, try fitting a few random forest (or bagged tree) models that predict whether a given patient is an alcoholic using their genetic information.

What is the prediction accuracy of your best model? What were the meta-parameters of your best-fitting model?

The data can be loaded using the code below.

In [None]:
aldh2 = dataset("gap", "aldh2")

---

---

# Recommended Resources

### Statistical Inference
- Casella and Berger (2002) _Statistical Inference_
- Wasserman (2004) _All of Statistics_

### Linear Models
- Gelman and Hill (2007) _Data Analysis Using Regression and Multilevel/Hierarchical Models_
- Rencher and Schaalje (2008) _Linear Models in Statistics_

### Generalized Linear Models
- Agresti (2002) _Categorical Data Analysis_
- Hosmer and Lemeshow (2000) _Applied Logistic Regression_

### Machine Learning
- Hastie, Tibshirani, & Friedman (2001) _The Elements of Statistical Learning_
- James, Witten, Hastie, & Tibshirani (2015) _An Introduction to Statistical Learning_
- Kuhn and Johnson (2013) _Applied Predictive Modeling_