## General instructions

Please answer the questions below in text blocks and code blocks, as appropriate for each question. Hints have already been provided for some programming questions. Add blocks if you need them (e.g. to explain your answers), and try to answer each question succinctly. Submit the completed notebook with every code block already executed, so that the answers are visible without re-running anything.

## Part 1: Basic Algebra and Matrices

### Task 1.1 (2 points)

Suppose that you are given a data matrix **X** that summarises the expenditure of 10 different hospitals across a 6-month period, where the hospitals are stored one per row and the months one per column. For each of the quantities below, write down the vector **v** for which the matrix-vector product **X * v** yields that quantity:

1. The average expenditure across hospitals for the whole 6-month period
2. The difference in the total expenditure between the first three and the last three months

In [None]:
import numpy as np
X = np.random.rand(10,6) # dummy data
print(X)
v1 =  
v2 = 
np.dot(X,v1.T)
np.dot(X,v2.T)
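
A minimal sketch of how a weight vector turns the matrix-vector product into a per-hospital summary, assuming the same 10 x 6 layout as above. The vector `v_total` is a hypothetical example (it is not one of the two requested vectors): a weight of 1 for every month gives each hospital's total expenditure.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 6))  # dummy data: 10 hospitals (rows) x 6 months (columns)

# Hypothetical example vector: a weight of 1 for every month
v_total = np.ones(6)

# Each entry of X @ v_total is the weighted sum of one hospital's row,
# so with all-ones weights it is that hospital's 6-month total
totals = X @ v_total
print(totals.shape)
print(np.allclose(totals, X.sum(axis=1)))
```

Changing the entries of the weight vector (their size and sign) changes which summary of each row is returned.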

### Task 1.2 (1 point)

Write a short piece of code that uses an eigendecomposition to determine the rank of the following matrix. Check your answer by computing the rank directly using the function np.linalg.matrix_rank():

In [None]:
import numpy as np

X = np.array([[ 2.,  9.,  7.,  9., 11.],
              [ 9.,  9.,  7.,  7., 11.],
              [ 7.,  4.,  7.,  9.,  1.],
              [ 8.,  2.,  9.,  4., -5.],
              [ 4.,  1.,  8.,  8., -6.],
              [ 6.,  7., 10.,  4.,  4.]])
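
One possible approach, sketched on a small hypothetical matrix `A` rather than the matrix above: because an eigendecomposition needs a square matrix, the sketch decomposes the symmetric Gram matrix `A.T @ A` and counts the eigenvalues that are clearly non-zero. The relative threshold `1e-10` is an assumption chosen for illustration, not a prescribed value.

```python
import numpy as np

# Hypothetical 4 x 3 example: the third column is the sum of the first two,
# so only two columns are linearly independent
A = np.array([[1., 2.,  3.],
              [4., 5.,  9.],
              [7., 8., 15.],
              [2., 0.,  2.]])

# The Gram matrix is square and symmetric, so eigvalsh applies
G = A.T @ A
eigvals = np.linalg.eigvalsh(G)

# Count eigenvalues above an assumed relative tolerance
tol = 1e-10 * eigvals.max()
rank_eig = int(np.sum(eigvals > tol))

print(rank_eig, np.linalg.matrix_rank(A))
```

The Gram matrix has the same rank as `A`, which is why counting its non-zero eigenvalues agrees with the direct computation.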

### Task 1.3 (2 points)

Give brief answers to the following questions: 

1. What is the 'curse of dimensionality' and why does it pose a problem for big data analytics?
2. When would you expect the curse of dimensionality to come into play: 'large n', 'large p' or 'large n and large p'? Explain your answer.

## Part 2: Machine learning and statistics

### Task 2.1 (1 point)

Explain how you would use cross-validation to optimise the parameters of a machine learning model (e.g. the regularisation parameter in a penalised linear model).

### Task 2.2 (1 point)

Suppose that you are the manager of a large health service aiming to test 100,000 people in your area for the SARS-CoV-2 virus. Assume that the prevalence in the population is 1% and ignore transmission of the virus for this exercise (i.e. assume that the prevalence is fixed). The first-generation nasal swab test has a sensitivity of approximately 75% and a specificity of 99.5%. An alternative antibody test is available that has greatly improved sensitivity (85%) and only slightly worse specificity (97.5%).

First, write a small block of Python code to estimate the accuracy of each test under the scenarios above.



In [None]:
N = 100000      # total number in the population
num_pos =       # number of positive cases
num_neg =       # number of negative cases

# first scenario
sens = 
spec = 

tp = 
tn = 
accuracy = 
print('scenario 1 =', accuracy)

# second scenario
sens = 
spec = 

tp = 
tn = 
accuracy = 
print('scenario 2 =', accuracy)
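
A minimal sketch of how accuracy follows from prevalence, sensitivity and specificity, using deliberately different, hypothetical numbers (a population of 1,000, 10% prevalence, 90% sensitivity, 80% specificity). The helper name `accuracy_from_rates` is an assumption introduced for illustration.

```python
def accuracy_from_rates(n_total, prevalence, sens, spec):
    """Fraction of correct test results, given prevalence and test properties."""
    num_pos = n_total * prevalence   # people who truly have the condition
    num_neg = n_total - num_pos      # people who truly do not
    tp = sens * num_pos              # true positives: positives correctly detected
    tn = spec * num_neg              # true negatives: negatives correctly cleared
    return (tp + tn) / n_total

# Hypothetical values, not the ones from the task
acc_example = accuracy_from_rates(1000, 0.10, 0.90, 0.80)
print('example accuracy =', acc_example)
```

Because most people are negative when prevalence is low, the specificity term tends to dominate the accuracy.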

### Task 2.3 (1 point)

Which of the tests would you prefer? Give reasons for your answer. Can you think of factors that would change your preference? 

### Task 2.4 (1 point)

What is the relationship between sample size, effect size and statistical power? 

### Task 2.5 (2 points)

Matrix decomposition techniques are important ways to reduce dimensionality in big data cohorts. Provide brief answers to the following questions: 

1. Explain how an eigendecomposition can be used to perform principal components analysis (PCA). What do the eigenvectors and eigenvalues represent in this context? 
2. Explain how linked independent component analysis works to integrate multi-modal data. What do the components reflect? How are the components different from just concatenating the data and running PCA?  

## Part 3: Analysis of Parkinson's disease dataset

For this part of the assignment, we will work with electronic measurements of voice characteristics from 42 people with early-stage Parkinson's disease. These participants were included in a six-month trial of a telemonitoring device for remote symptom progression monitoring. The motivation is that Parkinson's disease affects the characteristics of the voice in a way that might be associated with disease progression. See [here](https://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring) for a description of the data. Note that the UPDRS (Unified Parkinson's Disease Rating Scale) is a standard scale for rating the symptoms of Parkinson's disease across different domains.

For this assignment, we have split the dataset into two parts, which you can download here:

In [4]:
!wget -nc https://raw.githubusercontent.com/predictive-clinical-neuroscience/BigDataCourse/main/data/parkinsons_updrs_part1.csv
!wget -nc https://raw.githubusercontent.com/predictive-clinical-neuroscience/BigDataCourse/main/data/parkinsons_updrs_part2.csv


### Task 3.0 (1 point, bonus question)

Load the data and count the number of rows and columns.

### Task 3.1 (1 point)

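Your first task is to examine the basic demographic characteristics (age and sex) using sns.distplot(). Note that distplot is deprecated from seaborn 0.11 onwards; sns.histplot() is the suggested replacement. Then answer the following question: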

* Do you think the dataset is (approximately) representative of the general population of people with Parkinson's disease? Why or why not?

In [None]:
import seaborn as sns

# Plot age

# Plot sex


### Task 3.2 (1 point)

Your next task is to fit a GLM to predict motor symptom severity ('motor_UPDRS') from the 16 biomedical voice measurements, using only the first part of the Parkinson's dataset. Don't forget that the symptom severity does not have a zero mean. Print out the estimated regression coefficients.

In [None]:
# Select 'motor_UPDRS' from dataset
y1 = 

# Select relevant columns
cols = 

# Make the design matrix
M1 = 
print(M1)

# Calculate beta
beta1 = 
print(beta1)
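
A minimal sketch of fitting a linear model with an intercept by least squares, run on synthetic data rather than the Parkinson's dataset. All names here (`F`, `M`, `true_beta`) and the data itself are assumptions made for illustration; the column of ones in the design matrix is what lets the model capture a non-zero mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

# Synthetic features and coefficients (intercept first), for illustration only
F = rng.normal(size=(n, p))
true_beta = np.array([5.0, 1.0, -2.0, 0.5])

# Design matrix: a column of ones (intercept) followed by the features
M = np.column_stack([np.ones(n), F])
y = M @ true_beta + 0.1 * rng.normal(size=n)

# Least-squares estimate of the coefficients
beta, *_ = np.linalg.lstsq(M, y, rcond=None)
print(beta)
```

Using `np.linalg.lstsq` avoids forming an explicit matrix inverse; the pseudo-inverse `np.linalg.pinv(M) @ y` gives the same estimate here.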

### Task 3.3 (1 point)

Now evaluate how accurately this model can predict the true symptom scores. To do this, compute the correlation between the true and predicted symptom scores, as well as the explained variance score. Print these values.

Hint: the explained variance can be computed as 1-var(y-yhat)/var(y) where y and yhat are the true and predicted labels respectively.

In [None]:
# Calculate your predicted scores
yhat1 = 

# Make a scatter plot (optional)

# use np.corrcoef()
corr = 
ev = 
print()
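
The two measures from the hint can be sketched on synthetic scores as follows. The arrays `y` and `yhat` are made-up stand-ins for true and predicted labels, so the printed values say nothing about the Parkinson's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "true" scores and noisy "predictions", for illustration only
y = rng.normal(loc=20.0, scale=5.0, size=500)
yhat = y + rng.normal(scale=2.0, size=500)

# np.corrcoef returns a 2 x 2 matrix; the off-diagonal entry is the correlation
corr = np.corrcoef(y, yhat)[0, 1]

# Explained variance as given in the hint: 1 - var(y - yhat) / var(y)
ev = 1 - np.var(y - yhat) / np.var(y)

print(f"correlation = {corr:.3f}, explained variance = {ev:.3f}")
```

The explained variance can be negative when the residuals vary more than the labels themselves, whereas the correlation always lies between -1 and 1.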

### Task 3.4 (1 point)

Now compute the predictions on the second dataset using the coefficients estimated on the first dataset. Compute and print the correlation and explained variance as above.

### Task 3.5 (2 points)

Now, we are going to interpret these results. Please answer the following questions:

1. Can you see evidence for an association between symptom scores and the voice measurements? Is this a strong association? 
2. Can you see evidence of overfitting? Why or why not?

### Task 3.6 (1 point)

Now write a small piece of code to compute the accuracy the other way around (i.e. estimating GLM coefficients using the second dataset, then making predictions on the first dataset). 

### Task 3.7 (1 point)

Now compare these results with those you obtained above. Are they the same? If not, why do you think they are different?