\section{Resampling: permutation testing and bootstrap}
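Resampling methods are frequentist tools that replace analytical derivations with computationally simulated sampling. Permutation testing simulates the distribution of a test statistic under the null hypothesis, and the bootstrap simulates repeated sampling from the population.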

\subsection{p-values and null hypothesis significance testing}
Null hypothesis significance testing (NHST) is one of the most widely used statistical methods.

For example, the two-sample $t$-test compares the means of two samples using the test statistic
\[
t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{s_1^2/N_1 + s_2^2/N_2}}
\]
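
As a minimal sketch, the statistic above can be computed directly with NumPy. The simulated samples and the function name \texttt{t\_statistic} are assumptions made for illustration, and $s_j^2$ is taken to be the unbiased sample variance (\texttt{ddof=1}).

\begin{minted}[breaklines]{Python}
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical samples, for illustration only
x1 = rng.normal(size=20)
x2 = rng.normal(size=25) + 0.5

def t_statistic(a, b):
    """Two-sample t statistic with separate variance estimates."""
    var_a = np.var(a, ddof=1)  # unbiased sample variance s_1^2
    var_b = np.var(b, ddof=1)  # unbiased sample variance s_2^2
    return (np.mean(a) - np.mean(b)) / np.sqrt(var_a / len(a) + var_b / len(b))

t_obs = t_statistic(x1, x2)
print("t =", t_obs)
\end{minted}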

The general NHST procedure
\begin{enumerate}
    \item Set up a null hypothesis $H_0$ and an alternative hypothesis $H_1$, and select a test statistic $t$.
    \item Evaluate the test statistic on the observed data, $t(X)$.
    \item Compute the p-value $p = \Pr(t \text{ is at least as extreme as } t(X) \mid H_0)$.
    \item Reject $H_0$ if the p-value is smaller than the pre-selected significance level.
\end{enumerate}

Permutation testing should only be used if no analytical solution is available

\subsection{Permutation testing}
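Under $H_0$ the group labels are exchangeable, so the null distribution of the test statistic can be simulated by randomly permuting the labels. With $M$ permutations yielding statistics $t_1, \dots, t_M$, of which $B$ are at least as extreme as $t(X)$, the p-value is estimated as
\[
p = \Pr(t_i \text{ is at least as extreme as } t(X) \mid H_0) \approx \frac{B+1}{M+1}
\]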

\begin{minted}[breaklines]{Python}
import numpy as np
import numpy.random as npr
npr.seed(42)
 
# Set the number of permutations
N_perm = 1000
 
# Generate two samples
x1 = npr.normal(size=20)
x2 = npr.normal(size=20) + 0.5
 
# Concatenate to form a whole
x = np.concatenate((x1, x2))
print("mean(x1):", np.mean(x1))

print("mean(x2):", np.mean(x2))

truediff = np.abs(np.mean(x1) - np.mean(x2))
 
# Repeatedly randomly permute to mix the groups
meandiffs = np.zeros(N_perm)
for i in range(N_perm):
    z = npr.permutation(x)
    meandiffs[i] = np.abs(np.mean(z[0:20]) - np.mean(z[20:]))
 
print('p-value:', (np.sum(truediff <= meandiffs)+1)/(len(meandiffs)+1))
\end{minted}

\subsection{Bootstrap sampling}
Bootstrap sampling addresses the problem of translating results from an individual observed sample to the larger population.

The Bootstrap principle: approximate $p(x)$ by sampling from the empirical data distribution $p_e(x)$ with replacement. 

Given the original data set $X = (x_1, \dots, x_N)$ with $N$ samples, draw $M$ pseudo-data-sets $X_i$, each consisting of $N$ samples drawn from $X$ with replacement. For each pseudo-data-set $X_i$, one obtains an estimate $\theta_i^*$ of the parameter $\theta$. A two-sided $1-\alpha$ confidence interval can then be obtained as $[h_{\alpha/2}, h_{1-\alpha/2}]$, where $h_{\alpha}$ denotes the $\alpha$ quantile of the bootstrap estimates $\theta_i^*$.

\begin{minted}[breaklines]{Python}
import numpy as np
import numpy.random as npr
 
npr.seed(42)
#Normally distributed data
x = npr.normal(size=100)
 
print(np.mean(x))

means = np.zeros(1000)
for i in range(1000):
    # Sample 100 indices uniformly from 0..99 with replacement
    I = npr.randint(100, size=100)
    means[i] = np.mean(x[I])
print(np.percentile(means, [2.5, 97.5]))
\end{minted}

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{Pictures/ScreenShot144.png}
\end{figure}

\subsection{Exercises}
\subsubsection{Playing with the example from the lecture}
Experiment with the example below, which was shown in the lecture. How large does the difference in the means need to be to yield a statistically significant result? How does this change when the number of samples is changed?

\begin{minted}[breaklines]{Python}
import numpy as np
import numpy.random as npr

npr.seed(42)

# Set the number of permutations
N_perm = 1000

# Generate two samples
x1 = npr.normal(size=20)
x2 = npr.normal(size=20) + 0.5

# Concatenate to form a whole
x = np.concatenate((x1, x2))
print("mean(x1):", np.mean(x1))
print("mean(x2):", np.mean(x2))
truediff = np.abs(np.mean(x1) - np.mean(x2))

# Repeatedly randomly permute to mix the groups
meandiffs = np.zeros(N_perm)
for i in range(N_perm):
    z = npr.permutation(x)
    meandiffs[i] = np.abs(np.mean(z[0:20]) - np.mean(z[20:]))
print('p-value:', (np.sum(truediff <= meandiffs)+1)/(len(meandiffs)+1))
\end{minted}

\subsubsection{Working with real data in Python}
1. Load the data set using the code below and print the values. Note the column names at the top.

2. Plot a histogram of the values of the ages of the data subjects. (Hint: extract the age column and use .values to turn it to a NumPy Array.)

3. Compute the correlation coefficient (np.corrcoef) of the age and systolic blood pressure (SBP) values in the data.

4. Create index vectors for male and female participants. Plot separate histograms of the cholesterol (CHOL) values of the male and female subjects.

\begin{minted}[breaklines]{Python}
import pandas as pd
import numpy as np
import numpy.random as npr

# load the data from CSV file using pandas
fram = pd.read_csv('http://www.helsinki.fi/~ahonkela/teaching/compstats1/fram.txt', sep='\t')
fram

%matplotlib inline
import matplotlib.pyplot as plt

plt.hist(fram['AGE'].values)

# note: .values is optional here, most numpy functions work without it too
np.corrcoef(fram['AGE'].values, fram['SBP'].values)

Imale = (fram['SEX'] == 'male').values
Ifemale = (fram['SEX'] == 'female').values
plt.hist(fram.iloc[Imale]['CHOL'].values)
plt.show()
plt.hist(fram.iloc[Ifemale]['CHOL'].values)
\end{minted}

\subsubsection{Simple permutation testing on real data}
Permutation testing of AGE of male and female participants, absolute difference of the means as the test statistic

1. Implement a permutation test to check if the ages (AGE) of male and female participants in the data set are statistically significantly different using the absolute difference of the means as the test statistic. How do you interpret the result?

2. Try a similar test with other variables in the data set. How do you interpret the results?

\begin{minted}[breaklines]{Python}
import pandas as pd
import numpy as np
import numpy.random as npr

# load the data from CSV file using pandas
fram = pd.read_csv('http://www.helsinki.fi/~ahonkela/teaching/compstats1/fram.txt', sep='\t')

# Create index vectors (np.array) for male and female samples
Ifemale = (fram['SEX'] == 'female').values
Imale = (fram['SEX'] == 'male').values

# Compute and print the means for males and females
malemean = np.mean(fram.iloc[Imale]['AGE'].values)
femalemean = np.mean(fram.iloc[Ifemale]['AGE'].values)
print(malemean, femalemean)

# Check the number of females for generating the permutations
numfemales = sum(Ifemale)

# Number of permutations to try
numrepeats = 1000
# Initialise an array to store the results in
mean_differences = np.zeros(numrepeats)
npr.seed(100)

for i in range(numrepeats):
    # Permute the indices
    I = npr.permutation(len(Ifemale))
    # Split to "women" and "men"
    permmean1 = np.mean(fram['AGE'].values[I[0:numfemales]])
    permmean2 = np.mean(fram['AGE'].values[I[numfemales:]])
    # Compute the absolute value of the difference
    mean_differences[i] = np.abs(permmean1 - permmean2)

(np.sum(np.abs(malemean - femalemean) <= mean_differences) + 1)/(len(mean_differences)+1)

malemean = np.mean(fram['CHOL'].values[Imale])
femalemean = np.mean(fram['CHOL'].values[Ifemale])
print(malemean, femalemean)

numrepeats = 1000
mean_differences = np.zeros(numrepeats)
npr.seed(100)

for i in range(numrepeats):
    I = npr.permutation(len(Ifemale))
    permmean1 = np.mean(fram['CHOL'].values[I[0:numfemales]])
    permmean2 = np.mean(fram['CHOL'].values[I[numfemales:]])
    mean_differences[i] = np.abs(permmean1 - permmean2)

max(mean_differences)
(np.sum(np.abs(malemean - femalemean) <= mean_differences) + 1)/(len(mean_differences)+1)
\end{minted}

\subsubsection{Bootstrap sampling}
1. Compute the bootstrap confidence interval for the mean of the variable 'SBP' in the example data set. Compare that with the theoretical interval (see e.g. http://onlinestatbook.com/2/estimation/mean.html).

2. Compute the bootstrap confidence interval for the median of the variable 'SBP' in the example data set.

3. Compute the bootstrap confidence interval for the correlation between the variables 'SBP' and 'CHOL' in the example data set.

\begin{minted}[breaklines]{Python}
#1
npr.seed(42)
sbp = fram['SBP']
#Number of bootstrap iterations
n_bootstrap = 10000
#number of elements in data vector
n = len(sbp)
#Bootstrap means, resample whole sample
bootstrap_means = np.array([np.mean(npr.choice(sbp, replace=True, size=n)) for i in range(n_bootstrap)])
#Confidence interval
print('1. bootstrap interval:', np.percentile(bootstrap_means, [2.5, 97.5]))

#Calculating theoretical interval
m = np.mean(sbp)
sd = np.std(sbp, ddof=1)  # sample standard deviation (divides by n-1)
c_int_theo = [m-1.96*sd/np.sqrt(n), m+1.96*sd/np.sqrt(n)]
print('1. theoretical interval:', c_int_theo)

#2
#Repeat for medians
n_bootstrap = 10000
bootstrap_medians = np.array([np.median(npr.choice(sbp, replace=True, size=n)) for i in range(n_bootstrap)])
print('2. bootstrap interval:', np.percentile(bootstrap_medians, [2.5, 97.5]))

#3
chol = fram['CHOL']
n_bootstrap = 10000
bootstrap_correlate = np.zeros(n_bootstrap)
for i in range(n_bootstrap):
    indices = npr.choice(range(n), replace=True, size=n)
    bootstrap_correlate[i] = np.corrcoef(sbp[indices], chol[indices])[0,1]

print('3. bootstrap interval:', np.percentile(bootstrap_correlate, [2.5, 97.5]))
\end{minted}

\subsubsection{Using bootstrap to study properties of estimators}
1. Simulate data sets of 1000 samples with zero mean and unit variance from the Gaussian and the Laplace distribution.

2. Estimate the mean, median and standard deviation and their bootstrap confidence intervals. What do you observe?

3. How do the results change when you change the initial data set size?

\begin{minted}[breaklines]{Python}
#1
#Generate the data sets
x = npr.randn(1000)
y = npr.laplace(loc=0.0, scale=1./np.sqrt(2.0), size=1000)

#bootstrap function
def bootstrap(data, n_bootstrap, statistic, conf):
    stat = statistic(data)
    n = len(data)
    # Bootstrap replicates of the statistic (named so as not to shadow the function)
    estimates = np.array([statistic(npr.choice(data, replace=True, size=n)) for i in range(n_bootstrap)])
    print(stat, np.percentile(estimates, [100*conf, 100*(1-conf)]))

statistics = [np.mean, np.median, np.std]
data_names = ['Gaussian', 'Laplace']
stat_names = ['Mean', 'Median', 'SD']
conf = 0.025
n_bootstrap=1000
#enumerate is a counter
for i, data in enumerate([x,y]):
    for j, statistic in enumerate(statistics):
        print(data_names[i], stat_names[j])
        bootstrap(data, n_bootstrap, statistic, conf)
\end{minted}

\subsubsection{More complex permutations}
1. Run permutation test on 'DBP' the same way as in Problem 1, i.e. without stratification to low and high weight groups. What is the p-value you obtain? Try to increase the number of permutations until the p-value seems reasonably accurate. (Note: this should not take more than a few seconds to run!)

2. Find the four subgroups of the data: males and females with FRW smaller/larger than median. How large are these groups?

3. Construct vectors of indices to high/low FRW groups.

4. Permute the male/female labels within both FRW groups but do not mix the FRW groups. Repeat to compute the p-value by checking how often in the permutation you obtain results as extreme as those in the real data. Compare your result with that from case 1.

\begin{minted}[breaklines]{Python}
import pandas as pd
import numpy as np
import numpy.random as npr

# load the data from CSV file using pandas
fram = pd.read_csv('http://www.helsinki.fi/~ahonkela/teaching/compstats1/fram.txt', sep='\t')

# Create index vectors (np.array) for male and female samples
Ifemale = (fram['SEX'] == 'female').values
Imale = (fram['SEX'] == 'male').values

malemean = np.mean(fram['DBP'].values[Imale])
femalemean = np.mean(fram['DBP'].values[Ifemale])
print(malemean, femalemean)

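# Number of females, needed to split the permuted indices into two groups
numfemales = np.sum(Ifemale)

numrepeats = 10000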
mean_differences = np.zeros(numrepeats)
npr.seed(100)

for i in range(numrepeats):
    I = npr.permutation(len(Ifemale))
    permmean1 = np.mean(fram['DBP'].values[I[0:numfemales]])
    permmean2 = np.mean(fram['DBP'].values[I[numfemales:]])
    mean_differences[i] = np.abs(permmean1 - permmean2)

(np.sum(np.abs(malemean - femalemean) <= mean_differences) + 1)/(len(mean_differences)+1)

medFRW = fram['FRW'].median()
Ifemale = (fram['SEX'] == 'female').values
Imale = (fram['SEX'] == 'male').values
highFRW = (fram['FRW'] > medFRW).values
lowFRW = (fram['FRW'] <= medFRW).values

malemean = np.mean(fram['DBP'].values[Imale])
femalemean = np.mean(fram['DBP'].values[Ifemale])
print(malemean, femalemean)

IhighFRW = np.where(highFRW)[0]
IlowFRW = np.where(lowFRW)[0]

numrepeats = 10000
mean_differences = np.zeros(numrepeats)
npr.seed(100)

numhighFRWfemales = np.sum(Ifemale & highFRW)
numlowFRWfemales = np.sum(Ifemale & lowFRW)

for i in range(numrepeats):
    I1 = npr.permutation(IhighFRW)
    I2 = npr.permutation(IlowFRW)
    permmean1 = np.mean(np.concatenate((fram['DBP'].values[I1[0:numhighFRWfemales]], 
                                        fram['DBP'].values[I2[0:numlowFRWfemales]])))
    permmean2 = np.mean(np.concatenate((fram['DBP'].values[I1[numhighFRWfemales:]],
                                        fram['DBP'].values[I2[numlowFRWfemales:]])))
    #print(permmean2, permmean1)
    mean_differences[i] = np.abs(permmean1 - permmean2)

(np.sum(np.abs(malemean - femalemean) <= mean_differences) + 1)/(len(mean_differences)+1)
\end{minted}

\section{Density estimation and cross validation}
Parametric models provide a common approach for obtaining a smooth probability model over some data. For example, given a sample $X$, we can fit a probability model $p(x)$ if we assume that $X$ follows the normal distribution.

Kernel density estimation: obtaining a smooth probability density from a sample of points without assuming a specific functional form for the density. 

\subsection{Kernel density estimation}
Given a sample $X = (x_1, x_2, \dots, x_n)$ and a kernel $K(x)$ that satisfies
\[
\int_{-\infty}^{\infty}K(x) dx = 1
\]
the kernel density estimate $f_h(x)$ is 
\[
f_h(x) = \frac{1}{nh}\sum_{i=1}^n K \Bigg( \frac{x - x_i}{h} \Bigg)
\]
where $h$ denotes the width of the kernel, often called kernel bandwidth.

Gaussian kernel
\[
K_{gauss}(x) = \frac{1}{\sqrt{2 \pi}} \exp{\Bigg( \frac{-x^2}{2} \Bigg) }
\]
Uniform kernel
\[
K_{uniform}(x) = \begin{cases}
1/2, & |x| < 1 \\
0, & \text{otherwise}
\end{cases}
\]
Epanechnikov kernel
\[
K_{Epanechnikov}(x) = \begin{cases}
\frac{3}{4}(1-x^2), & |x| < 1 \\
0, & \text{otherwise}
\end{cases}
\]
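
The three kernels can be sketched in NumPy as follows. The function names and the integration grid are assumptions made for illustration; the Riemann sum at the end only checks numerically that each kernel integrates to approximately one.

\begin{minted}[breaklines]{Python}
import numpy as np

def K_gauss(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def K_uniform(x):
    # 1/2 on (-1, 1), zero elsewhere
    return np.where(np.abs(x) < 1, 0.5, 0.0)

def K_epanechnikov(x):
    # 3/4 (1 - x^2) on (-1, 1), zero elsewhere
    return np.where(np.abs(x) < 1, 0.75 * (1 - x**2), 0.0)

# Riemann-sum check of the normalisation condition
grid = np.linspace(-8, 8, 160001)
dx = grid[1] - grid[0]
integrals = {}
for name, K in [("gauss", K_gauss), ("uniform", K_uniform),
                ("Epanechnikov", K_epanechnikov)]:
    integrals[name] = np.sum(K(grid)) * dx
    print(name, integrals[name])
\end{minted}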

Gaussian kernel
\begin{minted}[breaklines]{Python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#import data as d
d = pd.read_csv('http://www.helsinki.fi/~ahonkela/teaching/compstats1/toydata.txt').values

#plot the histogram (density=True replaces the removed normed=True argument)
plt.hist(d, 30, density=True)

#The K_gauss formula
def K_gauss(x):
    return 1/np.sqrt(2*np.pi)*np.exp(-0.5*x**2)
    
#f_h(x) kernel density formula 
def kernel_density(t, x, h):
    y = np.zeros(len(t))
    for i in range(len(t)):
        y[i] = np.mean(K_gauss((t[i] - x)/ h)) / h
    return y
 
t = np.linspace(-2, 10, 100)
plt.plot(t, kernel_density(t, d, 3.0), label='h=3.0')
plt.plot(t, kernel_density(t, d, 1.0), label='h=1.0')
plt.plot(t, kernel_density(t, d, 0.3), label='h=0.3')
plt.plot(t, kernel_density(t, d, 0.1), label='h=0.1')
plt.legend()
plt.show()
\end{minted}

\subsubsection{Multivariate density estimation}
$d$-dimensional cases:
\[
f_h(x) = \frac{1}{nh^d}\sum_{i=1}^n K \Bigg( \frac{x - x_i}{h} \Bigg) 
\]
\[
K_{gauss}(x) = (2\pi)^{-d/2}\exp \Bigg( \frac{-\|x\|^2}{2} \Bigg)
\]
\[
K_{uniform}(x) = \begin{cases}
\frac{1}{2^d}, & \text{if } \|x\|_{\infty} < 1 \\
0, & \text{otherwise}
\end{cases}
\]
where $\|x\|_{\infty} = \max_{i = 1, \dots, d} |x_i|$.
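
A minimal sketch of the $d$-dimensional estimator with both kernels is given below. The function names, the simulated two-dimensional data and the bandwidth $h = 0.5$ are assumptions chosen for illustration.

\begin{minted}[breaklines]{Python}
import numpy as np

def K_gauss_d(u):
    """Gaussian kernel evaluated on each row of u, shape (n, d)."""
    d = u.shape[1]
    return (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u**2, axis=1))

def K_uniform_d(u):
    """Uniform kernel: 2^(-d) where the max-norm of the row is below 1."""
    d = u.shape[1]
    inside = np.max(np.abs(u), axis=1) < 1
    return np.where(inside, 2.0 ** (-d), 0.0)

def kde_d(t, X, h, K):
    """Evaluate f_h at one point t, shape (d,), from data X, shape (n, d)."""
    n, d = X.shape
    return np.sum(K((t - X) / h)) / (n * h**d)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # hypothetical 2-dimensional sample
origin = np.zeros(2)
print("Gaussian kernel estimate:", kde_d(origin, X, 0.5, K_gauss_d))
print("Uniform kernel estimate: ", kde_d(origin, X, 0.5, K_uniform_d))
print("True density at origin:  ", 1 / (2 * np.pi))
\end{minted}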

\subsection{Cross-validation (CV)}
Cross-validation is a general, simple method for estimating unknown parameters from data.

\begin{enumerate}
    \item Randomly partition the data set $X = (x_1, x_2, \dots, x_n)$ into a training set $X_{train}$ ($n-m$ samples) and a test set $X_{test}$ ($m$ samples).
    \item Fit the model using $X_{train}$.
    \item Test the fitted model using $X_{test}$.
    \item Repeat $t$ times and average the results.
\end{enumerate}

Some example realisations of the CV procedure
\begin{enumerate}
    \item $k$-fold CV: a fixed partition into $k$ equal subsets, testing on every subset once.
    \item Leave-one-out (LOO) CV: $m=1$, $t=n$, testing on every sample once.
\end{enumerate}
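
The partitioning step of both realisations can be sketched with a small generator. The name \texttt{k\_fold\_indices} is hypothetical; setting $k = n$ yields the leave-one-out splits, and the sketch assumes $k \geq 2$.

\begin{minted}[breaklines]{Python}
import numpy as np

def k_fold_indices(n, k, rng):
    """Yield (train, test) index arrays for a random k-fold partition of range(n)."""
    perm = rng.permutation(n)
    folds = np.array_split(perm, k)  # k blocks of (nearly) equal size
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        yield train, test

rng = np.random.default_rng(0)
for train, test in k_fold_indices(10, 5, rng):
    print("test:", np.sort(test), "train size:", len(train))
\end{minted}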

\subsubsection{CV for kernel bandwidth selection}
\begin{minted}[breaklines]{Python}
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
 
 #cross-validation
def cross_validate_density(d):
    hs = np.linspace(0.1, 1.0, 10)
    logls = np.zeros(len(hs))
    for j in range(len(d)):
        for i in range(len(hs)):
            logls[i] += np.sum(np.log(kernel_density(d[j], 
            np.delete(d, j), hs[i])))
    return (hs, logls)
 
hs, logls = cross_validate_density(d)
 
plt.hist(d, 30, density=True)
h_opt = hs[np.argmax(logls)]
print("Optimal h:", h_opt)

t = np.linspace(-2, 10, 100)
plt.plot(t, kernel_density(t, d, h_opt))
plt.show()
\end{minted}

\subsection{Other methods for kernel bandwidth selection}
Density estimate $\hat{f}_h(x)$ and true density $f(x)$.

ISE, integrated squared error:
\[
ISE(h) = \int_{- \infty}^{\infty} \Bigg( \hat{f}_h (x) - f(x) \Bigg)^2 dx
\]
MISE, mean integrated squared error:
\[
MISE(h) = E \Bigg[ \int_{- \infty}^{\infty}(\hat{f}_h(x) - f(x))^2 dx \Bigg]
\]
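
When the true density is known, as in a simulation, the ISE can be approximated numerically, and the MISE by averaging over independent samples. The sketch below assumes a standard normal true density, a Gaussian kernel and a Riemann sum on a finite grid; all names and parameter values are illustrative.

\begin{minted}[breaklines]{Python}
import numpy as np

def gauss_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def kde(t, x, h):
    # Matrix of scaled differences, shape (len(t), len(x))
    u = (t[:, None] - x[None, :]) / h
    return np.mean(gauss_pdf(u), axis=1) / h

def ise(x, h, grid):
    """Riemann-sum approximation of ISE(h) against the standard normal density."""
    dx = grid[1] - grid[0]
    return np.sum((kde(grid, x, h) - gauss_pdf(grid)) ** 2) * dx

rng = np.random.default_rng(0)
x = rng.normal(size=200)
grid = np.linspace(-6, 6, 1201)
ise_values = {h: ise(x, h, grid) for h in [0.05, 0.2, 0.4, 0.8, 1.6]}
for h, value in ise_values.items():
    print("h =", h, "ISE =", value)

# MISE(h) approximated by averaging the ISE over independent samples
mise = np.mean([ise(rng.normal(size=200), 0.4, grid) for _ in range(20)])
print("MISE(0.4) =", mise)
\end{minted}

Very small bandwidths give a noisy estimate and very large ones oversmooth, so the ISE is expected to be smallest at an intermediate $h$.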

\subsection{Exercises}
\subsubsection{Simple 1D density estimation}
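Generate a random sample of 1000 points from the standard normal distribution (zero mean, unit variance) and estimate its density with a Gaussian kernel using several bandwidths $h$.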
\begin{minted}[breaklines]{Python}
%matplotlib inline
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt

#Kernel function
def gauss_kernel(x, mu, h):
    return 1/np.sqrt(2*np.pi)/h*np.exp(-0.5*(x-mu)**2/h**2)

#Single kernel density
def single_kernel_density(t, x, h):
    return np.mean(gauss_kernel(t, x, h))

#Kernel density function
def kernel_density_function(t, a, sigma):
    y = np.zeros(len(t))
    for i in range(len(t)):
        y[i] = single_kernel_density(t[i], a, sigma)
    return y

t = np.linspace(-5, 5, 100)
x = npr.randn(1000)
plt.hist(x, 30, density=True, alpha = 0.2)
plt.plot(t, gauss_kernel(t, 0.0, 1.0), label='True pdf')

y = kernel_density_function(t, x, 0.03)
plt.plot(t, y, label='h = 0.03')
y = kernel_density_function(t, x, 0.1)
plt.plot(t, y, label='h = 0.1')
y = kernel_density_function(t, x, 0.3)
plt.plot(t, y, label='h = 0.3')
y = kernel_density_function(t, x, 1.0)
plt.plot(t, y, label='h = 1.0')
plt.legend(loc='upper right')
\end{minted}

\subsubsection{Cross-validation to define the bandwidth $h$}
Apply cross-validation to find the optimal bandwidth in the previous exercise.
Plot the estimated density together with the true density and the normalised sample histogram.

\begin{minted}[breaklines]{Python}
#LOO function
def loocv_kernel_density(x):
    hs = np.linspace(0.1, 1.0, 100)
    logls = np.zeros(len(hs))
    for j in range(len(x)):
        for i in range(len(hs)):
            # [0] extracts the scalar from the length-1 result array
            logls[i] += np.log(kernel_density_function(np.array([x[j]]), np.delete(x, j), hs[i]))[0]
    return (hs, logls)

hs, logls = loocv_kernel_density(x)
myh = hs[np.argmax(logls)]
y = kernel_density_function(t, x, myh)
plt.plot(t, y, label='h = %.3f'%myh)
y_gauss = gauss_kernel(t, 0, 1.0)
plt.plot(t, y_gauss, label='True pdf')
plt.hist(x, 50, density=True, alpha = 0.2)
plt.legend(loc='upper right')
\end{minted}

\subsubsection{More density estimation}
1. Generate a few data sets with different numbers of samples: 30, 100, 300 with the same normal distribution as above and apply the same procedure. Does the optimal bandwidth change and if so, how?

2. Generate data from different distributions (at least uniform and Laplace) and apply the same procedure. What do you observe?

3. Generate data from the $d$-dimensional multivariate normal distribution with zero mean and unit covariance for $d = 3, 10, 20$.

4. Estimate the density using the above procedure. Evaluate it at the points $(0, \dots, 0)$ and $(1, \dots, 1)$ and compare with the ground truth. What do you observe?

\begin{minted}{Python}
#1

n_samples = np.array([30,100,300])
for n in n_samples:
    x = npr.randn(n)
    hs, logls = loocv_kernel_density(x)
    myh = hs[np.argmax(logls)]
    y = kernel_density_function(t, x, myh)
    plt.plot(t, y, label='h = %.3f'%myh)
    y_gauss = gauss_kernel(t, 0, 1.0)
    plt.plot(t, y_gauss, label='True pdf')
    plt.hist(x, 50, density=True, alpha = 0.2)
    plt.legend(loc='upper right')
    plt.show()

    
#2
from scipy.stats import laplace
n_samples = 300
x = npr.laplace(size=n_samples, scale=1/np.sqrt(2))
hs, logls = loocv_kernel_density(x)
myh = hs[np.argmax(logls)]
y = kernel_density_function(t, x, myh)
plt.plot(t, y, label='h = %.3f'%myh)
y_lap = laplace.pdf(t, scale=1./np.sqrt(2))
plt.plot(t, y_lap, label='True pdf')
plt.hist(x, 50, density=True, alpha = 0.2)
plt.legend(loc='upper right')
plt.title('Laplace')
plt.show()

x = npr.uniform(size=n_samples, low=-5., high=5.)
hs, logls = loocv_kernel_density(x)
myh = hs[np.argmax(logls)]
y = kernel_density_function(t, x, myh)
plt.plot(t, y, label='h = %.3f'%myh)
plt.plot(t, np.ones(len(t))/10, label='True pdf')
plt.hist(x, 50, density=True, alpha = 0.2)
plt.legend(loc='upper right')
plt.title('Uniform')
plt.show()

#3
#Multidimensional normal kernel
def dgauss_kernel(x, mu, h):
    d = len(x)
    return (2*np.pi*h**2)**(-d/2)*np.exp(-0.5*((x-mu)/h**2).dot(x-mu))

#Single kernel density
def dsingle_kernel_density(t, x, h):
    return np.mean([dgauss_kernel(t, e, h) for e in x])

#Full kernel density
def dkernel_density_function(t, a, sigma):
    y = dsingle_kernel_density(t, a, sigma)
    return y

#LOO cross-validation
def dloocv_kernel_density(x):
    hs = np.linspace(0.1, 1.0, 100)
    logls = np.zeros(len(hs))
    for j in range(len(x)):
        for i in range(len(hs)):
            # axis=0 removes the j-th sample (row); without it np.delete flattens the array
            logls[i] += np.log(dkernel_density_function(x[j], np.delete(x, j, axis=0), hs[i]))
    return hs[np.argmax(logls)]

ds = np.array([3,10,20])

for d in ds:
    x = npr.normal(size=[30, d])
    t0 = np.zeros(d)
    t1 = np.ones(d)
    myh = dloocv_kernel_density(x)
    print('d = ', d)
    # True density: standard normal pdf, i.e. one Gaussian kernel at the origin with h = 1
    print('True density and estimate at origin',
          dgauss_kernel(t0, np.zeros(d), 1.0),
          dkernel_density_function(t0, x, myh))
    print('True density and estimate at (1,...,1)',
          dgauss_kernel(t1, np.zeros(d), 1.0),
          dkernel_density_function(t1, x, myh))

\end{minted}

\subsubsection{Variants of cross-validation}
1. $k$-fold cross-validation: partition the data into $k$ disjoint blocks. Fit the density to $k-1$ blocks and compute the log-probability of the remaining block. Repeat $k$ times, leaving out each block for testing in turn. Implement $k$-fold cross-validation and test it with different values of $k$.

2. Monte Carlo cross-validation: randomly partition the data into a training set for fitting the density and a test set of the desired size for evaluating the probability. A split of 80\% training / 20\% testing is typical, but this can vary depending on data characteristics. Repeat until the desired level of accuracy has been reached. Implement Monte Carlo cross-validation and test it with different parameters.

3. Which method seems most useful in getting accurate results quickly?

\begin{minted}[breaklines]{Python}
#1

def disjoint(x, k):
    return np.arange(len(x)) % k

def k_foldcv(x, k):
    N = len(x)
    I = npr.permutation(N)
    hs = np.linspace(0.1, 1.0, 100)
    logls = np.zeros(len(hs))
    for j in range(k):
        testI = np.zeros(N, bool)  # np.bool was removed from NumPy; use the builtin bool
        testI[((j*N)//k):(((j+1)*N)//k)] = True
        trainI = ~testI
        for i in range(len(hs)):
            logls[i] += np.sum(np.log(kernel_density_function(x[I[testI]], x[I[trainI]], hs[i])))
    return (hs, logls)

x = npr.randn(100)

for k in [2, 5, 100]:
    hs, logls = k_foldcv(x, k)
    myh = hs[np.argmax(logls)]
    y = kernel_density_function(t, x, myh)
    plt.plot(t, y, label='h = %f'%myh)
    y_gauss = gauss_kernel(t, 0, 1.0)
    plt.plot(t, y_gauss, label='True pdf')
    plt.hist(x, 50, density=True, alpha = 0.2)
    plt.legend(loc='upper right')
    plt.show()
\end{minted}
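
The Monte Carlo variant in the second part of the exercise could be approached along the following lines. This is only a sketch: the function names, the number of random splits and the bandwidth grid are assumptions, and a vectorised Gaussian kernel density is redefined so that the block is self-contained.

\begin{minted}[breaklines]{Python}
import numpy as np

def kde(t, x, h):
    """Gaussian kernel density estimate at points t from the sample x."""
    u = (t[:, None] - x[None, :]) / h
    return np.mean(np.exp(-0.5 * u**2), axis=1) / (h * np.sqrt(2 * np.pi))

def monte_carlo_cv(x, hs, n_splits=50, test_fraction=0.2, rng=None):
    """Average test log-likelihood per bandwidth over random train/test splits."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    n_test = max(1, int(round(test_fraction * n)))
    logls = np.zeros(len(hs))
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = x[perm[:n_test]], x[perm[n_test:]]
        for i, h in enumerate(hs):
            logls[i] += np.mean(np.log(kde(test, train, h)))
    return logls / n_splits

rng = np.random.default_rng(0)
x = rng.normal(size=100)
hs = np.linspace(0.1, 1.0, 19)
scores = monte_carlo_cv(x, hs, rng=rng)
h_opt = hs[np.argmax(scores)]
print("Selected h:", h_opt)
\end{minted}

Unlike $k$-fold cross-validation, the number of splits and the test fraction can be chosen independently, which allows trading accuracy for computation time.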

\section{Machine learning resampling}
Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. 

Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. 

Model assessment: evaluating a model's performance 

Model selection: process of selecting the proper level of flexibility for a model 

\section{Cross-validation}
Test error: average error that results from using a statistical learning method to predict the response on a new observation.

\subsection{The validation set approach}
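Divide the set of observations randomly into two parts: a training set and a validation (hold-out) set. The model is fit on the training set, and its error on the validation set serves as an estimate of the test error.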

The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.

The validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set, because the model is trained on only a subset of the observations.
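
The variability of the validation estimate can be illustrated with a small simulation. The sketch below assumes hypothetical data with a quadratic trend and uses polynomial least squares as the learning method; repeating the random split shows how much the estimate changes from split to split.

\begin{minted}[breaklines]{Python}
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: quadratic trend plus Gaussian noise
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=n)

def validation_mse(x, y, degree, rng):
    """Fit a polynomial on a random half and return the MSE on the other half."""
    perm = rng.permutation(len(x))
    half = len(x) // 2
    train, val = perm[:half], perm[half:]
    coefs = np.polyfit(x[train], y[train], degree)
    return np.mean((y[val] - np.polyval(coefs, x[val])) ** 2)

# Ten different random splits per polynomial degree
results = {}
for degree in [1, 2]:
    results[degree] = [validation_mse(x, y, degree, rng) for _ in range(10)]
    print("degree", degree, np.round(results[degree], 2))
\end{minted}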

\subsection{Leave-one-out cross-validation}
In leave-one-out cross-validation (LOOCV), a single observation $(x_1, y_1)$ is used as the validation set, and the remaining observations $\{(x_2, y_2), \dots, (x_n, y_n)\}$ make up the training set. The squared error $MSE_1 = (y_1 - \hat{y}_1)^2$ provides an approximately unbiased estimate of the test error, but it is highly variable because it is based on a single observation. Repeating the procedure with each of the $n$ observations in turn as the validation set produces $n$ squared errors, $MSE_1, \dots, MSE_n$. The LOOCV estimate of the test MSE is their average:
\[
CV(n) = \frac{1}{n}\sum_{i=1}^n MSE_i
\]
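
A minimal sketch of the LOOCV estimate, i.e. the average of the $n$ squared leave-one-out errors, is shown below for polynomial least squares. The simulated data and the function name \texttt{loocv\_mse} are assumptions made for illustration.

\begin{minted}[breaklines]{Python}
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: linear trend plus Gaussian noise
n = 50
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def loocv_mse(x, y, degree):
    """Average of the n squared leave-one-out prediction errors."""
    n = len(x)
    errors = np.zeros(n)
    for i in range(n):
        keep = np.arange(n) != i              # all observations except i
        coefs = np.polyfit(x[keep], y[keep], degree)
        errors[i] = (y[i] - np.polyval(coefs, x[i])) ** 2   # MSE_i
    return np.mean(errors)

cv_n = loocv_mse(x, y, 1)
print("LOOCV estimate of the test MSE:", cv_n)
\end{minted}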

\subsection{$k$-fold cross-validation}
An alternative to LOOCV is $k$-fold CV. This involves randomly dividing the set of observations into $k$ groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining $k-1$ folds, giving the error estimate $MSE_1$. The procedure is repeated $k$ times, each time with a different fold as the validation set. This process results in $k$ estimates of the test error, $MSE_1, \dots, MSE_k$, and the $k$-fold CV estimate is computed by averaging these values:
\[
CV(k) = \frac{1}{k}\sum_{i=1}^k MSE_i
\]

When we perform cross-validation, our goal might be to determine how well a given statistical learning procedure can be expected to perform on independent data. But at other times we are interested only in the location of the minimum point in the estimated test MSE curve. 
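
Both uses can be sketched with a short example in which $k$-fold CV is computed for polynomial fits of increasing degree and the degree at the minimum of the estimated test MSE curve is selected. The data, the choice $k = 10$ and all names are assumptions made for illustration.

\begin{minted}[breaklines]{Python}
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: quadratic trend plus Gaussian noise
n = 200
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(size=n)

def k_fold_mse(x, y, degree, folds):
    """Average validation MSE over the given folds (list of index arrays)."""
    k = len(folds)
    fold_mse = np.zeros(k)
    for j in range(k):
        val = folds[j]
        train = np.concatenate(folds[:j] + folds[j + 1:])
        coefs = np.polyfit(x[train], y[train], degree)
        fold_mse[j] = np.mean((y[val] - np.polyval(coefs, x[val])) ** 2)
    return np.mean(fold_mse)

# Use the same random partition for every candidate degree
folds = np.array_split(rng.permutation(n), 10)
degrees = np.arange(1, 7)
cv_curve = np.array([k_fold_mse(x, y, deg, folds) for deg in degrees])
best_degree = degrees[np.argmin(cv_curve)]
print("CV curve:", np.round(cv_curve, 3))
print("Selected degree:", best_degree)
\end{minted}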

\subsection{Bias-variance trade-off for $k$-fold cross-validation}
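$k$-fold CV often gives more accurate estimates of the test error rate than does LOOCV. This has to do with a bias-variance trade-off: LOOCV has lower bias because each model is fit on $n-1$ observations, but its $n$ fitted models are trained on almost identical data, so their errors are highly correlated and their average has higher variance than the average of the less correlated $k$-fold estimates.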

Typically, given these considerations, one performs $k$-fold cross-validation using $k=5$ or $k=10$, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

\subsection{Cross-validation on classification problems}
In this setting, cross-validation works just as described earlier in this chapter, except that rather than using MSE to quantify test error, we instead use the number of misclassified observations. The LOOCV error rate takes the form
\[
CV(n) = \frac{1}{n}\sum_{i=1}^n Err_i
\]
where $Err_i = I(y_i \neq \hat{y}_i)$. 
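
As a sketch of the LOOCV error rate, the example below uses a nearest-centroid classifier on simulated two-class data. The classifier is a stand-in chosen for brevity rather than a method discussed in the text, and all names and data are hypothetical.

\begin{minted}[breaklines]{Python}
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical two-class data in two dimensions
n_per_class = 40
X = np.vstack([rng.normal(loc=0.0, size=(n_per_class, 2)),
               rng.normal(loc=2.0, size=(n_per_class, 2))])
y = np.repeat([0, 1], n_per_class)

def nearest_centroid_predict(X_train, y_train, x_new):
    """Predict the class whose training mean is closest to x_new."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.sum((centroids - x_new) ** 2, axis=1))]

def loocv_error_rate(X, y):
    """Fraction of observations misclassified when each is left out in turn."""
    n = len(y)
    err = np.zeros(n)
    for i in range(n):
        keep = np.arange(n) != i
        y_hat = nearest_centroid_predict(X[keep], y[keep], X[i])
        err[i] = float(y_hat != y[i])   # Err_i = I(y_i != y_hat_i)
    return np.mean(err)

rate = loocv_error_rate(X, y)
print("LOOCV error rate:", rate)
\end{minted}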

\section{The bootstrap}
The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method. 

Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.

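We randomly select $n$ observations with replacement from the data set in order to produce a bootstrap data set, $Z^{*1}$. We can use $Z^{*1}$ to produce a new bootstrap estimate for the quantity of interest $\alpha$, which we call $\hat{\alpha}^{*1}$. Repeating this procedure $B$ times gives bootstrap data sets $Z^{*1}, \dots, Z^{*B}$ and estimates $\hat{\alpha}^{*1}, \dots, \hat{\alpha}^{*B}$. The standard deviation of these bootstrap estimates is an approximation of the standard error of $\hat{\alpha}$.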
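
A minimal sketch of this procedure is given below. The estimator is taken to be the sample mean, an assumption made so that the bootstrap standard error can be compared with the usual formula $s/\sqrt{n}$; the function name \texttt{bootstrap\_se} and the simulated data are also hypothetical.

\begin{minted}[breaklines]{Python}
import numpy as np

def bootstrap_se(data, estimator, n_boot, rng):
    """Standard deviation of the estimator over n_boot bootstrap data sets."""
    n = len(data)
    estimates = np.zeros(n_boot)
    for b in range(n_boot):
        # Draw n observations with replacement from the original data
        sample = data[rng.integers(0, n, size=n)]
        estimates[b] = estimator(sample)
    return np.std(estimates, ddof=1)

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=2.0, size=100)   # hypothetical data set
se_boot = bootstrap_se(z, np.mean, 1000, rng)
se_formula = np.std(z, ddof=1) / np.sqrt(len(z))
print("bootstrap SE:", se_boot)
print("formula SE:  ", se_formula)
\end{minted}

The same function can be applied to estimators without a simple standard error formula, for example by passing \texttt{np.median} instead of \texttt{np.mean}.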


