# Multi-variate analysis

Including principal component analysis and cluster analysis


## Installation of libraries and necessary software

Copy the files _ExampleFile.csv_ and _FcmClustPEst.R_ into the folder that contains this Jupyter notebook, or upload them via http://localhost:8888/tree

Install the necessary libraries (only needed once) by executing (shift-enter) the following cell:


In [None]:
install.packages("DAAG", repos='http://cran.us.r-project.org')
install.packages("MASS", repos='http://cran.us.r-project.org')
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biobase")
install.packages("e1071", repos='http://cran.us.r-project.org')
install.packages("matrixStats", repos='http://cran.us.r-project.org')

## Loading data and libraries
This requires that the installation above has finished without errors.

In [None]:
library(DAAG)
library(MASS)
library(Biobase)
library(e1071)
library(matrixStats)

# load data file (you need to place the file into the same folder)
ExampleData <- read.csv("ExampleFile.csv")
source("FcmClustPEst.R")







### Exercise 1
Carry out principal component analysis (PCA) and linear discriminant analysis (LDA, using ```footlgth``` as discriminator) on the ```possum``` data. Rows with missing values need to be removed beforehand. Plot the PCA scores with different colors for the locations where the possums were trapped (defined by ```site```). Compare the outcome to the scaling plot of the LDA.
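
The general workflow can be sketched on a different dataset. The following is a minimal illustration, assuming the built-in ```iris``` data and its ```Species``` column as the grouping variable (neither is part of this exercise); it is not the solution for the ```possum``` data.

```r
library(MASS)  # provides lda(); ships with standard R installations

# Illustration on the built-in iris data (assumption: NOT the possum data)
X <- iris[, 1:4]
grp <- iris$Species

# PCA on centered and scaled variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Fraction of the total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_explained, 1)

# LDA with the grouping factor as the discriminator
lda_fit <- lda(X, grouping = grp)
lda_scores <- predict(lda_fit, X)$x

# Side-by-side score plots, colored by group
par(mfrow = c(1, 2))
plot(pca$x[, 1:2], col = as.integer(grp), pch = 19, main = "PCA scores")
plot(lda_scores[, 1:2], col = as.integer(grp), pch = 19, main = "LDA scores")
```

The same calls should carry over to the exercise once the rows with missing values have been removed and the grouping variable has been chosen.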


In [None]:
data(possum)
par(mfrow=c(1,2))
## Numeric trait columns (column 5 onwards)
A <- possum[,5:ncol(possum)]
## Optional: distribution of missing values per row
#table(rowSums(is.na(A)))
## Number of rows without missing values
sum(complete.cases(A))
## data.frame without missing values
A <- possum[complete.cases(possum),]
B <- A[,5:ncol(possum)]
## PCA ...

## Linear discriminant analysis ...


##### Question I:  <u>What percentage of the variance is already explained by principal component 1?</u>

_Answer_

##### Question II:  <u>Which are the most discriminating traits?</u>

_Answer_

##### Question III:  <u>Which sites (provide numbers) can be separated in the score plot of the PCA? And in the LDA?</u>

_Answer_

##### Question IV:  <u>What would you use as discriminator in the linear discriminant analysis to get the best separation?</u>

_Answer_



### Exercise 2
Carry out hierarchical clustering, k-means and fuzzy c-means on ```geneData``` (the latter two with 10 clusters each).
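
How the three methods are called and what they return can be sketched on synthetic data. This is a minimal illustration, assuming a made-up matrix ```toy``` with three well-separated groups and 3 clusters; the linkage method and the fuzzifier value ```m = 2``` are choices made only for this example.

```r
library(e1071)  # provides cmeans()

set.seed(1)  # k-means and c-means start from random centers

# Hypothetical data: 3 groups of 20 rows each, 5 columns
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 5),
             matrix(rnorm(100, mean = 4), ncol = 5),
             matrix(rnorm(100, mean = 8), ncol = 5))

# Hierarchical clustering: distance matrix, tree, then cut into k groups
hc <- hclust(dist(toy), method = "average")
hc_groups <- cutree(hc, k = 3)

# k-means: every row is assigned to exactly one cluster
km <- kmeans(toy, centers = 3, nstart = 10)

# Fuzzy c-means: every row gets one membership value per cluster
fcm <- cmeans(toy, centers = 3, m = 2)

# Cross-tabulate the hard assignments and inspect the soft ones
table(hc_groups, km$cluster)
head(round(fcm$membership, 2))
```

The membership values of each row sum to 1, which is the main structural difference from the hard assignments returned by ```cutree``` and ```kmeans```.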



In [None]:
data("geneData")

# heatmap here:

# For the visualization copy the code from the script of the lecture
kmean.out <- kmeans(geneData,10)
cmeans.out <- cmeans(geneData, 10)


##### Question I:  <u>Read the help describing ```geneData```. What does this dataset contain?</u>

_Answer_

##### Question II:  <u>Why should fuzzy c-means be superior to k-means?</u>

_Answer_

##### Question III:  <u>How many parameters are required for fuzzy c-means? What are they called?</u>

_Answer_

##### Question IV:  <u>What differences do you see among the three clustering methods?</u>

_Answer_

##### Question V:  <u>What is a membership value?</u>

_Answer_



### Exercise 3
Extract the columns X114-X117 from _ExampleFile.csv_ and take the base-2 logarithm. Normalize the data by the median and apply the cluster analysis (all methods from the previous exercise) to the resulting data. The script might take a while to finish.

The file contains quantitative data from a proteomics experiment. Each row corresponds to a "peptide-spectrum match", i.e. a spectrum to which a peptide sequence was assigned. Columns X114-X117 are the quantifications for 4 different biological samples.
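
The preprocessing steps (logarithm, median normalization, removal of incomplete rows, row-wise scaling) can be sketched on made-up numbers. This is a minimal illustration, assuming a hypothetical random matrix ```raw``` with one missing value; base R's ```apply``` and ```sweep``` stand in here for ```colMedians``` so that the example needs no extra package.

```r
set.seed(42)

# Hypothetical intensities: 10 rows, 4 samples, one missing value
raw <- matrix(rlnorm(40, meanlog = 10), ncol = 4,
              dimnames = list(NULL, paste0("S", 1:4)))
raw[3, 4] <- NA

# Log-transform the intensities
log_data <- log2(raw)

# Subtract each column's median so that every sample is centered at 0
col_med <- apply(log_data, 2, median, na.rm = TRUE)
norm_data <- sweep(log_data, 2, col_med)

# Keep complete rows only, then standardize each row to mean 0 and sd 1
norm_red <- norm_data[complete.cases(norm_data), ]
std_data <- t(scale(t(norm_red)))

# Column medians after normalization (numerically zero)
apply(norm_data, 2, median, na.rm = TRUE)
```

Comparing boxplots of ```log_data``` and ```norm_data``` is one simple way to see what the normalization did to the sample distributions.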


In [None]:
# Show first lines of example file
head(ExampleData)

colnames(ExampleData)
ExampleDataLog <- as.matrix(log2(ExampleData[,19:22]))

# Normalization by median
NormalizedData <- t(t(ExampleDataLog) - colMedians(ExampleDataLog, na.rm=TRUE))

# remove rows with missing values for kmeans and cmeans
NormalizedRedData <- NormalizedData[complete.cases(NormalizedData),]

# heatmap here


# kmeans + cmeans (10 clusters)
StandardizedData <- t(scale(t(NormalizedRedData)))



##### Question I:  <u>What does the function colMedians give?</u>

_Answer_

##### Question II:  <u>What are the different protein accessions?</u>

_Answer_

##### Question III:  <u>Why do we need to take the logarithm?</u>

_Answer_

##### Question IV:  <u>How do we check whether the median normalization made sense?</u>

_Answer_

##### Question V:  <u>Which samples are most similar and how does this show?</u>

_Answer_

##### Question VI:  <u>Why do we have to _scale_ the data before using k-means and fuzzy c-means?</u>

_Answer_



### Exercise 4
Carry out fuzzy c-means using the parameter estimation from the lecture on ```StandardizedData```. Compare the results to the ones in the exercise above.

In [None]:
# Wrap the matrix in an ExpressionSet (the argument is named exprs)
PExpr <- new("ExpressionSet", exprs=as.matrix(StandardizedData))
parameters <- FcmClustPEst(PExpr)

# fuzzy c-means clustering here:


##### Question I:  <u>?</u>

_Answer_



### Exercise 5
TODO: simple hierarchical clustering on highly regulated data (VSClust metabolomics) and test different distance measures

##### Question I:  <u>?</u>

_Answer_

