# DS2 - Multivariate Analysis

# Assignment 1: Principal Component Analysis (10 points)

The file `aml-chr16.dat` contains RNA-seq data of 285 patients with AML. The index contains the patient ID and the last column contains the FAB category. Perform PCA on the data set and draw biplots of the data projected on the first three components. 

Notes:

* Make sure that the different AML categories can be distinguished in the plot. 
* Only draw lines/arrows for the 25 variables that are most associated with the selected eigenvectors, i.e. have the highest absolute values (loadings) in the eigenvectors.
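
As a hint for the second note, the sketch below shows one possible way to pick the variables whose arrows are drawn. It assumes the loadings are stored with one eigenvector per column and that "most associated" is measured by the length of a variable's arrow in the plane of the two plotted components; both the function name `top_loading_indices` and the small `demo` matrix are hypothetical, and other selection criteria (e.g. per-component absolute values) are equally defensible.

```python
import numpy as np

def top_loading_indices(loadings, pcs=(0, 1), k=25):
    """Return indices of the k variables with the largest loading magnitude.

    loadings: array of shape (n_variables, n_components), one eigenvector per column.
    pcs: the two components shown in the biplot.
    """
    selected = loadings[:, list(pcs)]
    # Length of each variable's arrow in the plane spanned by the chosen PCs.
    magnitude = np.linalg.norm(selected, axis=1)
    # argsort is ascending, so take the last k and reverse for descending order.
    return np.argsort(magnitude)[-k:][::-1]

# Hypothetical loadings for 5 variables on 2 components.
demo = np.array([[0.1, 0.0],
                 [0.9, 0.1],
                 [0.0, -0.7],
                 [0.2, 0.2],
                 [-0.5, 0.5]])
top2 = top_loading_indices(demo, pcs=(0, 1), k=2)
print(top2)
```

Only the returned indices need to be passed to the arrow-drawing code; the sample points are still plotted for all patients.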

Consider and discuss the following aspects:

* Is it necessary to normalize the data?
* How many eigenvectors are required to capture the most significant features of the data?
* What is the interpretation of a principal component in this context?
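
To make the normalization question concrete, here is a minimal sketch of PCA via the singular value decomposition, run with and without standardization. It uses synthetic data as a stand-in for the expression matrix; the function name `pca_svd` and the `demo` array are assumptions for illustration, not part of the assignment files.

```python
import numpy as np

def pca_svd(data, standardize=True):
    """PCA via SVD. Returns scores, loadings and explained-variance ratios."""
    centered = data - data.mean(axis=0)
    if standardize:
        std = centered.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # avoid division by zero for constant columns
        centered = centered / std
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                    # samples projected on the components
    loadings = vt.T                   # one eigenvector per column
    explained = s**2 / np.sum(s**2)   # fraction of variance per component
    return scores, loadings, explained

rng = np.random.default_rng(0)
# Synthetic stand-in: 50 samples, 4 independent variables on very different scales.
demo = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

scores_raw, loadings_raw, ratio_raw = pca_svd(demo, standardize=False)
scores_std, loadings_std, ratio_std = pca_svd(demo, standardize=True)

print(np.round(ratio_raw, 3))  # dominated by the largest-scale variable
print(np.round(ratio_std, 3))  # spread much more evenly
```

The cumulative sum of the explained-variance ratios (`np.cumsum`) is one common way to judge how many eigenvectors are worth keeping.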


# Assignment 2: Linear Discriminant Analysis (10 points)

The file `aml-PCA10.csv` contains the projections of the AML data on the first ten principal components. The order is the same as that of `aml-chr16.dat`. From this file, consider only the FAB categories 5 and 6. Perform LDA on these two groups, then plot the projection of the data onto the discriminant axis (x-axis) against the FAB category (y-axis).

Consider and discuss the following aspects:

* Is the assumption of equal covariance matrices reasonable?
* How good is the separation between the two groups?
* What is the interpretation of the discriminant axis in the context of this dataset?
* What is the interpretation of the discriminant axis in the context of the original data (expressions)?

Note: the inverse of the pooled covariance matrix can be obtained with `np.linalg.inv`
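
The note above can be sketched as follows for the two-group case. This is an illustration on synthetic data, assuming the classical Fisher formulation in which the discriminant axis is the inverse pooled covariance matrix applied to the difference of the group means; `fisher_lda_axis`, `group_a` and `group_b` are hypothetical names.

```python
import numpy as np

def fisher_lda_axis(a, b):
    """Discriminant axis w = inv(S_pooled) @ (mean_a - mean_b) for two groups."""
    n_a, n_b = len(a), len(b)
    cov_a = np.cov(a, rowvar=False)
    cov_b = np.cov(b, rowvar=False)
    # Pooled covariance: group covariances weighted by their degrees of freedom.
    pooled = ((n_a - 1) * cov_a + (n_b - 1) * cov_b) / (n_a + n_b - 2)
    w = np.linalg.inv(pooled) @ (a.mean(axis=0) - b.mean(axis=0))
    return w, pooled

rng = np.random.default_rng(1)
# Synthetic stand-in for two groups that differ along the first variable.
group_a = rng.normal(loc=[0.0, 0.0, 0.0], size=(40, 3))
group_b = rng.normal(loc=[3.0, 0.0, 0.0], size=(30, 3))

w, pooled = fisher_lda_axis(group_a, group_b)
proj_a = group_a @ w  # one scalar per sample on the discriminant axis
proj_b = group_b @ w
print(proj_a.mean(), proj_b.mean())
```

Comparing `np.cov` of each group with the pooled matrix gives a first, informal impression of whether the equal-covariance assumption is plausible.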

# Assignment 3: Multivariate Linear Regression (5 points)

The file `1g59.pdb` contains a structure of tRNA bound to a protein. Protein backbones are known to occupy a highly restricted conformational space; the question is whether the same holds for RNA (and DNA). To investigate this, we try to predict the position of the nucleoside connection point (N1 or N9, the base atom bonded to the sugar moiety) from the positions of the phosphorus atoms of the backbone. The code block below reads the file and extracts the positions of the phosphorus atoms and the connecting atoms. These are arranged in the regressor matrix X and the regressand matrix Y:

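```python
import numpy as np

# Keep ATOM records of chain B whose atom name is P, N1 or N9.
with open('1g59.pdb') as pdb:
    atoms = [
        line for line in pdb
        if (line.startswith('ATOM') and
            line[21] == 'B' and
            line[12:16] in (' P  ', ' N1 ', ' N9 '))
    ]

# Parse the x, y, z coordinate columns of each selected record.
X = np.array([(a[30:38], a[38:46], a[46:54]) for a in atoms]).astype(float)
Y = X[1:-3:2]  # every second row from index 1: taken as the connecting atoms
X = X[::2]     # every second row from index 0: taken as the phosphorus atoms
# Sliding windows of three consecutive rows of X.
X = np.stack((X[:-2], X[1:-1], X[2:]), axis=1)
# Express each window relative to its middle row.
X -= X[:, 1, :][:, np.newaxis, :]
# Keep the two flanking rows: six regressors per window.
X = X[:, [0,2], :].reshape((-1, 6))

print(X.shape, Y.shape)
```

```
(91, 6) (91, 3)
```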


In the next block, implement the multivariate linear regression model (without using sklearn) and investigate the accuracy of the approach. Please note this is a rather 'wicked problem', as it has not been fully precooked ;)

Consider and discuss the following aspects:

* How accurate is the model?
* How feasible is this approach to predict positions of atoms based only on the backbone of RNA/DNA?
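
As a starting point, the sketch below fits a multivariate linear model by ordinary least squares with plain NumPy and reports a root-mean-square deviation of the predicted positions. It runs on synthetic matrices with the same shapes as X and Y above; the helper names, the added intercept column, and the choice of RMSD as accuracy measure are assumptions, and the sketch deliberately does not address how the real X and Y should be preprocessed.

```python
import numpy as np

def fit_multivariate_ols(X, Y):
    """Least-squares fit of Y ~ [1, X] @ B; the first row of B is the intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(design, Y, rcond=None)
    return B

def predict(X, B):
    """Apply the fitted coefficients to a regressor matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ B

rng = np.random.default_rng(2)
# Synthetic stand-in with the shapes printed above: (91, 6) and (91, 3).
X_demo = rng.normal(size=(91, 6))
B_true = rng.normal(size=(7, 3))
noise = 0.1 * rng.normal(size=(91, 3))
Y_demo = np.column_stack([np.ones(91), X_demo]) @ B_true + noise

B_hat = fit_multivariate_ols(X_demo, Y_demo)
residuals = Y_demo - predict(X_demo, B_hat)
# RMSD over samples of the Euclidean distance between prediction and target.
rmsd = np.sqrt(np.mean(np.sum(residuals**2, axis=1)))
print(rmsd)
```

Because the error here is measured on the data used for fitting, it is optimistic; holding out part of the samples (or leave-one-out prediction) gives a fairer estimate of accuracy.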
