### What is the difference between a blind study and a double blind study?

A blind study is one in which the subject does not know which treatment they receive; a double blind study is one in which neither the subject nor the data collectors, e.g. doctors or nurses, know which treatment is administered.

### How would you determine that there is true correlation between two variables rather than simply random chance?

Use a hypothesis test. The purpose of hypothesis tests, also known as significance tests, is to help you learn whether random chance might be responsible for an observed phenomenon; the hypothesis that chance is responsible is called the null hypothesis. For a correlation, the null hypothesis is that the two variables are unrelated. A hypothesis test can be either one-way or two-way (also called one-tailed and two-tailed): a one-way test asks whether A is better than B; a two-way test asks whether A is different from B.

### How does a permutation test work?

A permutation test involves combining the results from groups A, B and C into a single dataset, then drawing from it a random sample (without replacement) the size of A, another the size of B and another the size of C. Whichever statistic was calculated for the original samples is calculated for the random samples, and the process is repeated many times to build up a permutation distribution. If the observed difference lies well within the range of the permuted differences, then it is within the range of what chance might produce; in other words, it is not statistically significant.
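The procedure can be sketched in Python for the simpler two-group case. This is a minimal illustration rather than a definitive implementation: the function name `permutation_test`, the choice of the difference in means as the statistic, and the sample measurements are all assumptions made for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns the observed difference and the fraction of shuffled
    differences that are at least as extreme (an estimated p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)  # combine the groups
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffling and slicing is equivalent to sampling without replacement.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical measurements for two groups.
a = [12.1, 11.8, 12.6, 12.9, 12.3, 12.7]
b = [10.2, 10.9, 10.4, 10.1, 10.8, 10.5]
observed, p_value = permutation_test(a, b)
```

Because every value in `a` exceeds every value in `b`, very few shuffles reproduce a difference as large as the observed one, so the estimated p-value is small.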

### What would a p-value of 0.01 suggest?

The p-value is the frequency with which the chance model produces a result at least as extreme as the observed result. So $p=0.01$ means that only 1\% of the results generated under the chance model are as extreme as the observed result. A significance level of $\alpha = 0.05$ (corresponding to 95\% confidence) is a common threshold, so a p-value of 0.01 would lead us to reject the null hypothesis and call the result statistically significant.

### Suppose there is a non-linear relationship between predictor and response, how could your regression model be extended to capture the non-linear effects?

Splines. Polynomial regression is also an option, but it can only capture a certain amount of non-linearity: it is fine for a quadratic relationship, but higher-order terms tend to introduce a "wiggliness" in the regression which isn't useful. Splines are piecewise polynomial segments, joined smoothly, that allow us to interpolate between data points, and are usually a better approach. The polynomial segments are connected at a series of fixed points in the predictor variable, called knots.

### How should you decide whether a model should have more or fewer predictors?

All else being equal, a simpler statistical model is preferable to a more complicated one. The AIC is a statistical metric that penalizes the addition of unnecessary predictors to a statistical model. For a data set of $n$ records and a model with $P$ variables,
$$
\text{AIC} = 2P + n \log \left( \frac{\text{RSS}}{n}\right),
$$
where RSS is the residual sum of squares
$$
\text{RSS} = \sum_{i=1}^n (Y_i - \hat{Y_i})^2.
$$
The model with the smallest AIC is best.
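As a rough illustration, the AIC formula can be evaluated directly from observed and fitted values. The helper names `rss` and `aic`, and the two sets of fitted values, are hypothetical and chosen only to show the penalty at work.

```python
import math


def rss(y, y_hat):
    """Residual sum of squares between observed and fitted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))


def aic(y, y_hat, n_predictors):
    """AIC = 2P + n * log(RSS / n); requires RSS > 0."""
    n = len(y)
    return 2 * n_predictors + n * math.log(rss(y, y_hat) / n)


# Hypothetical observed values and fitted values from two candidate models.
y = [3.0, 5.0, 7.0, 9.0, 11.0]
fit_small = [3.2, 4.9, 7.1, 8.8, 11.1]  # fitted by a model with 1 predictor
fit_large = [3.1, 4.9, 7.1, 8.9, 11.1]  # fitted by a model with 4 predictors

aic_small = aic(y, fit_small, n_predictors=1)
aic_large = aic(y, fit_large, n_predictors=4)
```

Here the larger model fits slightly better (smaller RSS), but the penalty term $2P$ outweighs the improvement, so the smaller model has the lower AIC.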

### Is there a distinction between machine learning and classical statistics?

There is very little practical difference between them, and the boundary is blurred. Machine learning is about creating efficient algorithms which can be applied to large data sets for the purpose of predicting outcomes. Regression and classification modelling are examples: both involve training a model on data with known outcomes and then using it to predict outcomes in an unseen data set, which is called supervised learning. Classical statistics, strictly speaking, is more concerned with probability theory and the underlying structure of the model; bagging and random forest methods, for example, are closer to statistics than to machine learning.

### What technique would you use if you needed to predict a binary outcome?

K-nearest neighbors (KNN) is one simple approach. In a data set, a neighbor is just a record that has similar predictors to the test record. The KNN method finds the K records that are most similar to the test record, looks at the outcomes of those K records, and takes the most common outcome as its prediction. Decision trees are another method for predicting a binary outcome. The idea of decision trees is to repeatedly divide and subdivide the data with the aim of making the outcome after each subdivision more and more homogeneous; you end up with a set of if-else rules that guide a test record to a predicted outcome.
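The KNN idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `knn_predict`, the use of Euclidean distance, and the toy training data are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the outcome for `query` by majority vote among its k nearest
    training records, measured by Euclidean distance."""
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    distances.sort(key=lambda pair: pair[0])  # closest records first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two predictors and a binary outcome.
train_X = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_X, train_y, query=(2.8, 3.1), k=3)
```

The query point sits among the records labelled 1, so the three nearest neighbors all vote for that outcome.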

### Why should predictors be standardized when using the K nearest neighbors?

The KNN method calculates the Euclidean distance from the test record to each record in the training set, then takes the K records with the smallest distance. This is problematic if the predictors are on different scales. Suppose we have two predictors in our model, $y \sim x + w$. The Euclidean distance between the test record $(x_0, w_0)$ and a training record $(x_i, w_i)$ is
$$
d_i=\sqrt{(x_0-x_i)^2 + (w_0-w_i)^2}.
$$
But if $0 \leq x \leq 1$ and $0 \leq w \leq 100{,}000$ then the second term dominates and
$$
d_i \approx \left|w_0-w_i\right|.
$$
This can be fixed by standardizing the variables, i.e. subtract the mean and divide by the standard deviation,
$$
w^* = \frac{w - \overline{w}}{s_w},
$$
and similarly for $x$. The standardized distance is
$$
d_i = \sqrt{(x_0^*-x_i^*)^2 + (w_0^* - w_i^*)^2}.
$$
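Standardization itself is a one-line transformation. The sketch below assumes the sample standard deviation is used as the divisor; the helper name `standardize` and the two predictor columns are hypothetical.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: subtract the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical predictors on very different scales.
x = [0.1, 0.4, 0.5, 0.9]
w = [20_000.0, 55_000.0, 61_000.0, 98_000.0]

x_std = standardize(x)
w_std = standardize(w)
```

After standardizing, both predictors have mean 0 and standard deviation 1, so neither dominates the distance calculation.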

### How would you decide which number to use as K?

For a binary outcome, an odd K is often chosen to avoid ties. If you're dealing with a data set that is highly structured with little noise then a small K works best; data sets with more noise and less structure call for a larger K. In practice K typically falls between 1 and 20, and candidate values can be compared on held-out data.

### In the design of a tree model, how is homogeneity measured?

Homogeneity, also known as class purity, of a partition can be measured by either the Gini impurity (not the Gini coefficient) or entropy. If the proportion of misclassified records in partition $A$ is $p$ then the Gini impurity is
$$
I(A) = p(1-p),
$$
and entropy is
$$
I(A) = -p \log_2 (p) - (1-p)\log_2(1-p).
$$
The proposed partition with the lowest Gini impurity or entropy is best for the tree model.
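Both measures are easy to compute directly from $p$. The function names below are hypothetical; the entropy function treats $p = 0$ and $p = 1$ as special cases, assuming the usual convention that $0 \log_2 0 = 0$.

```python
import math


def gini_impurity(p):
    """Gini impurity p(1 - p) for a partition with misclassified proportion p."""
    return p * (1 - p)


def entropy(p):
    """Entropy in bits; taken to be 0 when p is 0 or 1 (the limiting value)."""
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


# Both measures are largest at p = 0.5 and fall to zero for a pure partition.
impurities = [(p, gini_impurity(p), entropy(p)) for p in (0.0, 0.1, 0.5)]
```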

### What is meant by bootstrapping in data science?

Bootstrapping means repeatedly taking a random sample (with replacement) from a known data set and recomputing a statistic on each resample, in order to generate a sampling distribution for that statistic.
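A minimal sketch of the idea, assuming the statistic of interest is the mean. The function name `bootstrap_means` and the data values are hypothetical.

```python
import random
from statistics import mean, stdev


def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # rng.choices draws with replacement, so records can repeat in a resample.
    return [mean(rng.choices(data, k=n)) for _ in range(n_resamples)]


# Hypothetical observed sample.
data = [4.1, 5.6, 3.8, 6.2, 5.0, 4.7, 5.9, 4.4]
boot = bootstrap_means(data)

# The spread of the resampled means estimates the standard error of the mean.
standard_error = stdev(boot)
```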

### In the calculation of standard deviation, why is the divisor $n-1$ rather than $n$?

We divide by the degrees of freedom, $n-1$, which is the difference between the number of records and the number of constraints; here the single constraint is that the sample mean has already been estimated from the same data. As an analogy, consider the day of the week as a variable: if you know that it is not Monday, not Tuesday, not Wednesday, not Thursday, not Saturday and not Sunday, then you know it must be Friday, so there are six degrees of freedom for this variable. Dividing by $n$ instead would give a biased (too small) estimate of the variance, but most data science problems involve a large number of records, so the distinction is rarely important in practice.
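Python's standard library exposes both divisors, which makes the difference easy to see on a small sample. The data values here are hypothetical.

```python
from statistics import pstdev, stdev

# Hypothetical sample of eight records.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

s_sample = stdev(data)       # divides the sum of squares by n - 1
s_population = pstdev(data)  # divides the sum of squares by n
```

The $n-1$ version is always slightly larger, and the gap shrinks as $n$ grows.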

### How is Pearson's correlation coefficient calculated?

For variables $x_i$ and $y_i$, $i = 1, \dots, n$,
$$
P = \frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{(n-1) s_x s_y},
$$
where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$ respectively.
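The formula translates almost term for term into Python. The function name `pearson` is hypothetical, and `stdev` is the sample ($n-1$) standard deviation from the standard library.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return numerator / ((n - 1) * stdev(x) * stdev(y))


# Perfectly linear hypothetical data gives a coefficient of 1 (or -1).
r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3], [3, 2, 1])
```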

# Chapter 7: Unsupervised Learning

### What is cross-validation?

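Cross-validation splits the data into a number of different groups, called folds. For each fold, a model is trained on the data not in the fold and then tested on the data in the fold; the results are averaged over the folds to estimate how the model performs on unseen data.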
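The splitting step can be sketched as a generator of index lists. The function name `k_fold_indices` is hypothetical, and for simplicity the records are split in their existing order; in practice they would usually be shuffled first.

```python
def k_fold_indices(n_records, n_folds):
    """Yield (train_indices, test_indices) once for each fold."""
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_records // n_folds + (1 if i < n_records % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Ten records split into three folds.
folds = list(k_fold_indices(10, 3))
```

Each record appears in exactly one test fold, and never in the training set of that same fold.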

### What is the difference between supervised and unsupervised learning?

Statistical methods that extract meaning from data without training a model on labelled data are called unsupervised learning. Supervised learning is when a model is trained on labelled data to predict an outcome from a set of predictor variables.

### How does Principal Component Analysis work?

The idea of PCA is to replace a set of numerical predictor variables with a smaller set of predictors, called principal components, each of which is a weighted linear combination of the original predictors. The leading principal components explain most of the variability of the full set of predictors, thereby reducing the dimension of the data. For example, suppose there are two variables $X_1$ and $X_2$; there are then two principal components

\begin{align}
Z_1 = w_{1,1} X_1 + w_{1,2} X_2,\\
Z_2 = w_{2,1} X_1 + w_{2,2} X_2,
\end{align}

where the weights $w_{i,1}$, $w_{i,2}$ are called the component loadings. The first principal component, $Z_1$, is the linear combination that best explains the total variation; $Z_2$ is orthogonal to $Z_1$ and explains as much of the remaining variation as it can.
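For the two-variable case the loadings can be computed in closed form from the sample variances and covariance. The sketch below assumes that special case only (general PCA uses an eigen-decomposition of the covariance matrix); the function name `pca_2d` and the data are hypothetical.

```python
import math
from statistics import mean


def pca_2d(x1, x2):
    """Return ((w1, w2), variances) for the principal components of two variables.

    w1 and w2 are the loading vectors of Z_1 and Z_2; variances holds the
    variance explained by each component, largest first.
    """
    n = len(x1)
    m1, m2 = mean(x1), mean(x2)
    a = sum((v - m1) ** 2 for v in x1) / (n - 1)                    # var(X_1)
    c = sum((v - m2) ** 2 for v in x2) / (n - 1)                    # var(X_2)
    b = sum((u - m1) * (v - m2) for u, v in zip(x1, x2)) / (n - 1)  # cov(X_1, X_2)

    # Angle of the direction of greatest variance for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * b, a - c)
    w1 = (math.cos(theta), math.sin(theta))
    w2 = (-math.sin(theta), math.cos(theta))  # orthogonal to w1

    half_gap = math.hypot((a - c) / 2, b)
    variances = ((a + c) / 2 + half_gap, (a + c) / 2 - half_gap)
    return (w1, w2), variances


# Hypothetical, strongly correlated measurements.
x1 = [2.5, 0.5, 2.2, 1.9, 3.1]
x2 = [2.4, 0.7, 2.9, 2.2, 3.0]
(w1, w2), variances = pca_2d(x1, x2)
```

The two variances sum to the total variance of $X_1$ and $X_2$, with the first component taking the larger share.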

### How is the covariance matrix of two variables $x$ and $z$ calculated?

\begin{equation}
\hat{\Sigma} = \left[
\begin{matrix}
s_x^2 & s_{x,z}\\
s_{z,x} & s_z^2
\end{matrix}
\right],
\end{equation}

where the covariance is calculated as
\begin{equation}
s_{x,z} = \frac{\sum_{i=1}^n (x_i - \overline{x})(z_i - \overline{z})}{n-1}.
\end{equation}
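A short sketch of the same calculation in Python, using nested lists for the matrix. The function names `covariance` and `covariance_matrix` and the data are hypothetical.

```python
from statistics import mean


def covariance(x, z):
    """Sample covariance of x and z, with divisor n - 1."""
    x_bar, z_bar = mean(x), mean(z)
    return sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / (len(x) - 1)


def covariance_matrix(x, z):
    """2x2 covariance matrix: variances on the diagonal, covariance off it."""
    s_xz = covariance(x, z)
    return [[covariance(x, x), s_xz],
            [s_xz, covariance(z, z)]]


# Hypothetical data in which z is exactly twice x.
sigma = covariance_matrix([1, 2, 3, 4], [2, 4, 6, 8])
```

The matrix is symmetric because $s_{x,z} = s_{z,x}$, and the covariance of a variable with itself is its variance.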

### How would you go about splitting a data set of two variables $x$ and $y$ into four clusters?

Each record $(x_i,y_i)$ must be assigned to one of the clusters $k = 1, \dots, 4$, with $n_k$ the number of records in cluster $k$. The cluster mean is
\begin{equation}
(\overline{x_k},\overline{y_k}) = \left( \frac{1}{n_k} \sum_{i} x_i, \frac{1}{n_k} \sum_{i} y_i \right),
\end{equation}

for all $i$ in cluster $k$. The sum of squares of cluster $k$ is

\begin{equation}
\text{SS}_k = \sum_i \left[ (x_i - \overline{x_k})^2 + (y_i - \overline{y_k})^2 \right].
\end{equation}

Which cluster should $(x_i,y_i)$ be assigned to? The ideal assignment is such that the total sum of squares
\begin{equation}
\sum_{k=1}^4 \text{SS}_k
\end{equation}

is minimized. This is the K-means method, which is available in R via the `kmeans` function.
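One common way to search for such an assignment is Lloyd's algorithm, which alternates between assigning each record to its nearest cluster mean and recomputing the means. The Python sketch below is an illustration of that algorithm under simple assumptions (random initial centres, a fixed seed); it is not a description of how R's implementation works, and the function name and data points are hypothetical. The result is a local minimum, which is not guaranteed to be the global one.

```python
import random


def kmeans(points, k, n_iterations=100, seed=0):
    """Lloyd's algorithm for 2-D points.

    Returns (assignments, centers, total_ss), where total_ss is the
    sum over clusters of the within-cluster sum of squares.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct records
    assignments = None
    for _ in range(n_iterations):
        # Assignment step: each record goes to the nearest cluster mean.
        new_assignments = [
            min(range(k),
                key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
            for x, y in points
        ]
        # Update step: recompute each cluster mean from its members.
        for j in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == j]
            if members:  # an empty cluster keeps its previous centre
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
        if new_assignments == assignments:
            break  # no record changed cluster, so the algorithm has converged
        assignments = new_assignments
    total_ss = sum(
        (x - centers[a][0]) ** 2 + (y - centers[a][1]) ** 2
        for (x, y), a in zip(points, assignments)
    )
    return assignments, centers, total_ss


# Hypothetical records forming four loose groups.
points = [
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
    (5.0, 5.2), (5.1, 4.9), (4.8, 5.0),
    (1.0, 5.1), (1.1, 4.8), (0.9, 5.0),
    (5.0, 1.0), (5.2, 1.1), (4.9, 0.8),
]
assignments, centers, total_ss = kmeans(points, k=4)
```

Running the algorithm from several different seeds and keeping the result with the smallest `total_ss` is a common way to reduce the risk of a poor local minimum.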

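### How do hierarchical clustering methods decide how many clusters to use?

Hierarchical clustering does not fix the number of clusters itself: it builds a tree (dendrogram) of nested merges, and the number of clusters is set by where the tree is cut. The *Bayesian Information Criterion* (BIC) is used instead by model-based clustering, which fits models for a range of cluster counts and selects the one with the best BIC. BIC is similar to AIC in that there is a penalty for the number of parameters in the model.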