# Vectors and _N_-Dimensional Arrays

In [None]:
v = [4, 5, 6]

In [None]:
u = [1.2, 4.5, 6.5]

In [None]:
w = ["dog", "cat", "bird"]

### Grow Vector in Place

In [None]:
a = Int[]                       # initialize empty vector to be filled with Ints

In [None]:
push!(a, 12)

In [None]:
push!(a, 1000)

In [None]:
push!(a, 3.14)                  # ERROR: trying to push Float into vector of Ints
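The failing `push!` above happens because `3.14` cannot be converted to an `Int` without losing information. The following is a minimal sketch, assuming Julia 1.x (the names `ints` and `failed` are illustrative only). It shows that a float with an exact integer value is converted silently, while a fractional one raises an `InexactError`:

```julia
ints = Int[]                # empty vector that can only hold Ints

push!(ints, 12)
push!(ints, 3.0)            # 3.0 has an exact Int value, so it is stored as 3

# A fractional value cannot be converted losslessly, so push! throws.
failed = try
    push!(ints, 3.14)
    false
catch err
    err isa InexactError
end

println(ints)               # contents after the three attempts
println(failed)             # whether the last push! raised an InexactError
```

If mixed values are expected, declaring the vector as `Float64[]` up front avoids the error.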

In [None]:
s = String[]                    # initialize empty vector of strings

In [None]:
push!(s, "this")
push!(s, "is")
push!(s, "a string vector")

### Indexing a Vector

In [None]:
s[2]                            # get second element

In [None]:
s[3]                            # third element

In [None]:
s[1] = "that"                   # assign to first element

### Slicing a Vector

In [None]:
s[1:2]                          # get subset of vector (1st and 2nd elements)

In [None]:
s[2:end]                        # subset from 2nd element to last element

## _N_-Dimensional Arrays

In [None]:
A = Array{Float64}(5, 3)        # allocate 5-by-3 matrix (uninitialized: contents are arbitrary, not guaranteed zeros)

In [None]:
Z = zeros(5, 3)                 # 5-by-3 matrix guaranteed to be all zeros

In [None]:
B = [1 2 3;
     4 5 6;
     7 8 9]

### Indexing and Slicing Matrix (or _N_-dimensional array)

In [None]:
B[3, 1]                         # gets entry in third row first column

In [None]:
B[2, :]                         # gets all of second row

In [None]:
B[:, 3]                         # gets all of third column
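One detail worth seeing in action is how the kind of index affects the shape of the result. The sketch below assumes Julia 0.5 or later, where a scalar index drops its dimension. The matrix `M` and the other names are illustrative only.

```julia
M = [1 2 3;
     4 5 6;
     7 8 9]

row_vec = M[2, :]       # scalar row index: result is a 1-D vector
row_mat = M[2:2, :]     # range row index: result stays a 1-by-3 matrix
block   = M[1:2, 2:3]   # rows 1 to 2, columns 2 to 3

println(size(row_vec))
println(size(row_mat))
println(block)
```

Using a range such as `2:2` is a simple way to keep a row as a matrix when later code expects two dimensions.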

## Reading Data from Flat File

In [None]:
d = readdlm("somedata.csv", ',')

In [None]:
d1 = readdlm("otherdata.csv", '\t')

In [None]:
d2 = readcsv("somedata.csv")                                 # equivalent to readdlm() with ','

In [None]:
d3 = readcsv("somedata.csv", header = true)                  # treat first line as column headings

In [None]:
d4 = readcsv("somedata.csv", header = true, skipstart = 3)   # skip first 3 lines 
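To make the effect of `header = true` concrete, here is a small self-contained sketch. It assumes Julia 1.x, where `readdlm` lives in the `DelimitedFiles` standard library. The file name and its contents are made up for illustration.

```julia
using DelimitedFiles                     # ships with Julia 1.x

# Write a tiny, made-up CSV file to a temporary location.
path = joinpath(mktempdir(), "toy.csv")
write(path, "age,score\n63,7\n71,6\n58,8\n")

# With header = true, readdlm returns a (data, header) tuple.
data, header = readdlm(path, ',', header = true)

println(header)                          # column names from the first line
println(data)                            # remaining lines as a numeric matrix
```

Without `header = true`, the first line would be read as data and the result would no longer be purely numeric.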

---


# DataFrame Objects

### Advantages of `DataFrame` objects
- _de facto_ object for data analysis
- NA type for missing data
- factor-like type for model fitting 

In [None]:
using DataFrames

In [None]:
df1 = readtable("somedata.csv")

In [None]:
df2 = readtable("somedata.csv", makefactors = true)

In [None]:
df3 = readtable("somedata.csv", makefactors = true, separator = '\t')

In [None]:
df4 = readtable("somedata.csv", nastrings = ["999"])               # treat "999" as missing

In [None]:
df5 = readtable("somedata.csv", nastrings = ["", "NA", "999"])     # several missing-value codes

In [None]:
df1[:age]                     # entire `age` column
df1[3, :condition]            # third participant's `condition`
df1[4, :]                     # all of row 4

## Challenge Question 1

Using the `stagec` dataset from the rpart package in R, find the Gleason score of the oldest patient. As a hint, you may consider using the `indmax()` function.

In [None]:
using RDatasets  

stagec = dataset("rpart", "stagec")
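As an illustration of the hint, the sketch below applies the same idea to made-up vectors rather than to the `stagec` data. It assumes Julia 1.x, where `indmax` is named `argmax`. All names and values are hypothetical.

```julia
# Made-up ages and scores for three hypothetical patients.
ages   = [63, 71, 58]
scores = [7, 6, 8]

oldest = argmax(ages)        # index of the largest age
println(oldest)
println(scores[oldest])      # score belonging to the oldest patient
```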

---

# Descriptive Statistics
### Useful Packages:
- StatsBase

In [None]:
x = randn(100)                # generate 100 draws from standard Gaussian

In [None]:
using StatsBase                # provides mode()

mean(x)
median(x)
mode(x)
std(x)
var(x)
minimum(x)
maximum(x)

### Descriptives with `DataFrame`

In [None]:
describe(stagec)               # get many useful descriptive stats

mean(stagec[:Age])             # get mean of Age column

mean(stagec[:Gleason])         # returns NA (not what we want)
mean(dropna(stagec[:Gleason])) # dropna() removes NA values before averaging
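The way missing values propagate through a mean can be reproduced without any data set. The sketch below assumes Julia 1.x, where `missing` plays a role similar to `NA` and `skipmissing` a role similar to `dropna`. The vector `obs` is made up.

```julia
using Statistics

obs = [1.0, missing, 3.0]           # made-up data with one missing entry

println(mean(obs))                  # the missing value propagates through the mean
println(mean(skipmissing(obs)))     # mean of the non-missing entries only
```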

---


# Inferential Statistics

### Useful Packages:
- HypothesisTests (Binomial test, t-tests, $\chi^2$-test, and many more...)
- GLM (linear and generalized linear models)
- MixedModels (Multi-level (or Mixed-effects) models)

## Binomial Test

The binomial test is a statistical test of whether dichotomous data (e.g., "success" and "failure") deviate from an expected distribution. Binomial tests can be used to answer questions of this form: _If the true probability of success is $P$, how probable are data at least as extreme as those we have observed?_

In [None]:
# Coin Tossing Example:
# Simulate data from Binomial, test 
# hypothesis data came from fair coin

using HypothesisTests
using Distributions

binom = Binomial(1, 0.6)                        # initialize Distribution object Binom(1, 0.6) 

srand(137)                                      # set seed for reproducibility
x = rand(binom, 10)                             # take 10 random draws from our distribution

xbool = convert(Array{Bool,1}, x)               # cast x to vector of Booleans
BinomialTest(xbool, 0.5)                        # test null hypothesis that p = 0.5
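To show what an exact binomial test computes, here is a hand-rolled sketch using only Base Julia. It uses one common definition of the two-sided p-value: the total probability of all outcomes that are no more likely than the observed one. Library implementations may follow a different convention, so this is not a description of how `BinomialTest` works internally. The function names `binom_pmf` and `binom_pvalue` are hypothetical.

```julia
# Probability of exactly k successes in n trials with success probability p.
binom_pmf(k, n, p) = binomial(n, k) * p^k * (1 - p)^(n - k)

# Two-sided exact p-value: total probability of every outcome that is
# no more likely than the observed one under the null hypothesis.
function binom_pvalue(k, n, p0)
    observed = binom_pmf(k, n, p0)
    tol = 1e-7                       # guards against floating-point ties
    return sum(binom_pmf(j, n, p0) for j in 0:n
               if binom_pmf(j, n, p0) <= observed * (1 + tol))
end

p = binom_pvalue(8, 10, 0.5)         # 8 heads in 10 tosses, fair-coin null
println(p)
```

For 8 heads in 10 tosses, the outcomes counted are 0, 1, 2, 8, 9, and 10 heads, which together have probability 112/1024.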


## Student's _T_-test 

Student's _t_-test is an extremely common statistical test for a difference in means between two groups on some continuous variable. _T_-tests are often used to compare the effects of a new treatment against a control group.

In [None]:
# Life Expectancy Example:
# Simulate data from Gaussian, test whether smokers 
# and non-smokers have same life expectancy 

non_smokers_gaussian = Normal(75, 5)
smokers_gaussian = Normal(65, 7)

srand(137)

non_smokers = rand(non_smokers_gaussian, 30)     # take 30 random draws from Gaussian
smokers = rand(smokers_gaussian, 30)

EqualVarianceTTest(smokers, non_smokers)         # two-sample t-test (assumes equal variances)
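The statistic behind an equal-variance two-sample test can be written out in a few lines. The sketch below assumes Julia 1.x with the `Statistics` standard library. The function name `pooled_t` and the data are made up, and it computes only the test statistic and degrees of freedom, not a p-value.

```julia
using Statistics

# Pooled-variance two-sample t statistic and its degrees of freedom.
function pooled_t(x, y)
    nx, ny = length(x), length(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * var(x) + (ny - 1) * var(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, df
end

group_a = [1.0, 2.0, 3.0, 4.0, 5.0]       # made-up measurements
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t, df = pooled_t(group_a, group_b)
println(t)
println(df)
```

The sign of `t` follows the order of the arguments: it is negative here because the first group has the smaller mean.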

## Pearson's Correlation
Pearson's correlation is a measure of the linear relationship between two continuous variables. The resulting correlation coefficient, $r$, ranges between $-1$ and $1$. Positive values of $r$ indicate a positive association between the variables (e.g., hours of studying and GPA), while negative values indicate a negative association (e.g., beers-per-week and GPA).

In [None]:
# correlation between two variables in vectors
x = randn(100)
y = randn(100)

cor(x, y)


# pairwise correlation of all variables in a matrix
A = randn(100, 5)
cor(A)


# correlation of variables in a DataFrame object
cor(stagec[:Age], stagec[:Grade])
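The coefficient itself is the covariance of the two variables divided by the product of their standard deviations, $r = \mathrm{cov}(x, y) / (s_x s_y)$. The sketch below checks this against `cor` on made-up data. It assumes Julia 1.x with the `Statistics` standard library.

```julia
using Statistics

xs = [1.0, 2.0, 3.0, 4.0, 5.0]            # made-up paired observations
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# r = cov(x, y) / (sd(x) * sd(y))
r_manual = cov(xs, ys) / (std(xs) * std(ys))

println(r_manual)
println(cor(xs, ys))                      # should agree with the manual value
```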

## Challenge Question 2
Using the `stagec` dataset from above, determine whether patients with tetraploid tumors have higher Gleason scores than patients with diploid tumors.

---

## Linear and Generalized Linear Models
Linear regression models and their generalizations are among the most powerful and fundamental classes of models in statistics. They are useful both for making inferences and for making predictions, and they are ubiquitous across scientific disciplines. Linear and generalized linear models also serve as the foundation for many methods in machine learning and artificial intelligence.

### Linear Regression Models

In [None]:
using GLM

fm1 = lm(G2 ~ 1 + Age, stagec)                               # fit linear model regressing G2 on Age
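Conceptually, a formula such as `1 + Age` corresponds to a design matrix with a column of ones and a column of ages, and the coefficients are the least-squares solution. The following is a sketch of that idea in Base Julia on made-up data. It is not a description of how GLM computes its fit, and all names are illustrative.

```julia
predictor = [10.0, 20.0, 30.0, 40.0, 50.0]     # made-up predictor values
response  = 2.0 .+ 3.0 .* predictor            # made-up response lying exactly on a line

X = [ones(length(predictor)) predictor]        # design matrix: intercept column, then predictor
beta = X \ response                            # least-squares solution of X * beta ≈ response

println(beta)                                  # intercept first, then slope
```

Because the made-up response lies exactly on the line $2 + 3x$, the recovered intercept and slope are 2 and 3 up to rounding error.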

In [None]:
# basic info for fitted model
coef(fm1) 
stderr(fm1)
confint(fm1)
residuals(fm1)

In [None]:
# fit indices
deviance(fm1)                                                
aic(fm1)                                                     
bic(fm1)

In [None]:
# make predictions with fitted model
new_data = DataFrame(Age = [10, 20, 30, 40, 50])
predict(fm1, new_data)

In [None]:
# Adding predictors
fm2 = lm(G2 ~ 1 + Age + Gleason, stagec)                     # add Gleason score as predictor
deviance(fm2)                                                
aic(fm2)                                                     
bic(fm2)

In [None]:
fm3 = lm(G2 ~ 1 + Age + Gleason + Grade, stagec)            

In [None]:
# Adding interaction terms
fm4 = lm(G2 ~ 1 + Age + Gleason*EET, stagec)                 # Gleason*EET expands to Gleason + EET + Gleason&EET

### Binomial Logistic Regression

In [None]:
fm5 = glm(PgStat ~ 1 + Gleason, stagec, Binomial())          # defaults to logit-link
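With a logit link, the model is linear on the log-odds scale, and probabilities are recovered with the logistic function. The sketch below illustrates that mapping in Base Julia. The coefficients `b0` and `b1` are made up and are not the fitted values of `fm5`.

```julia
# Inverse of the logit link: maps a linear predictor to a probability.
logistic(eta) = 1 / (1 + exp(-eta))

b0, b1 = -4.0, 0.5                        # made-up coefficients, not fitted values
for gleason in (4, 8, 10)
    println(gleason, " => ", logistic(b0 + b1 * gleason))
end
```

Each unit increase in the predictor adds `b1` to the log-odds, so its effect on the probability depends on where along the curve it starts.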

In [None]:
coef(fm5) 
stderr(fm5)
confint(fm5)

In [None]:
deviance(fm5)                                                
aic(fm5)                                                     
bic(fm5)

### Poisson Regression

In [None]:
prussian = dataset("pscl", "prussian")                       # load Prussian horse kick data

In [None]:
fm6 = glm(Y ~ 1 + Year + Corp, prussian, Poisson())          # defaults to log link 

In [None]:
coef(fm6) 
stderr(fm6)
confint(fm6)
deviance(fm6)                                                
aic(fm6)                                                     
bic(fm6)