# Eigengenes for flu progression

Despite much progress in recent centuries, infectious disease continues to be one of humanity's most persistent problems. The recent COVID-19 outbreak is of course no exception to that. To understand and ultimately contain infectious diseases, it is helpful to understand how the human immune system responds to them, especially at the cellular and molecular level.

In a study published in *PLoS Genetics*, a cohort of 16 healthy human volunteers received intranasal inoculation of influenza H3N2/Wisconsin, and 9 of these subjects developed mild to severe symptoms based on standardized symptom scoring [1]. In the week following inoculation, each subject's blood was drawn every 8 hours for microarray analysis. The resulting dataset contains 268 samples: 16 subjects with about 17 samples each. Subjects were classified as "asymptomatic" if the Jackson score (a symptom score that indicates actual infection) was less than 6 over the first five days of observation and viral shedding was not documented after the first 24 hours following inoculation. The other subjects were labeled "symptomatic".

In this practical session you will analyse this data set using PCA. You will chart the course of flu infection and try to identify genes that are involved.

> [1] Huang, Yongsheng, et al. Temporal dynamics of host molecular responses differentiate symptomatic and asymptomatic influenza A infection. PLoS Genetics 7.8 (2011).

Let's start by loading the data.

In [0]:
import pandas as pd

data = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/Machine_Learning_course_UGent_D012554_data/master/practicum/PCA/flu.csv")

In [0]:
data.head(5)

The microarray used in this study measures 11959 normalized gene expression values. There are three more columns in the data set:

- 'subject': identifies the volunteer
- 'type': the symptom annotation
- 'time': the timepoint of the microarray measurement

Let's plot a heatmap of the expression values for some of the genes of subject 2:

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(22,8))
sns.heatmap(data[data.subject==2].iloc[:,200:500],square=False,xticklabels=False,yticklabels=range(0,128,8))
plt.show()

In this heatmap the genes are on the x-axis (gene names not shown here) and the time (in hours) is on the y-axis. The color levels indicate the expression measurements of the genes.

*Do you see something interesting? Yes? What is it?*

---(Answer here)

---(Solution)


*Make the same plot for subject 1.*

In [0]:
plt.figure(figsize=(22,8))
###Start code here

###End code here
plt.show()

*Do you see something strange?*

---(Answer here)

---(Solution) 


*Pop the column 'subject' in a variable called `subjects`.*

*Pop the column 'type' in a variable called `types`.*

*Pop the column 'time' in a variable called `times`.*

In [0]:
###Start code here
subjects = 
types = 
times = 
###End code here
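As a reminder of what "popping" a column means, here is a minimal sketch on a toy DataFrame. The frame `toy` and its column names are hypothetical and unrelated to the flu data; the only point is that `DataFrame.pop()` returns the column and removes it from the frame in one step.

```python
import pandas as pd

# Toy frame with hypothetical column names (not the flu data).
toy = pd.DataFrame({
    "gene_a": [0.1, 0.4, 0.3],
    "gene_b": [1.2, 0.9, 1.1],
    "label": ["x", "y", "x"],
})

# pop() returns the column as a Series and deletes it from the frame in place.
labels = toy.pop("label")

# Only the numeric feature columns are left, which is what PCA needs.
remaining_columns = list(toy.columns)
print(remaining_columns)
print(labels.tolist())
```

After popping, the frame holds only feature columns, so it can be passed directly to a numerical method such as PCA.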

*When you apply PCA to the data set `data` to reduce the 11959 genes (the features), how many principal components do we need to explain at least 80% of the variance in the data? Check the scikit-learn documentation for help on how to obtain the explained variance ratio from the PCA module.*

In [0]:
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=100)
# Only the fitted model is needed here, so fit without keeping the projected data.
pca.fit(data)

###Start code here

###End code here
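One possible way to read off the number of components for a variance threshold is to accumulate the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below does this on synthetic data; the array `X`, the `scales` vector and the name `toy_pca` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 50 samples, 10 features with decreasing spread.
scales = np.array([10, 5, 3, 2, 1, 1, 0.5, 0.5, 0.2, 0.1])
X = rng.normal(size=(50, 10)) * scales

toy_pca = PCA(n_components=10).fit(X)

# Running total of the fraction of variance explained by the first k components.
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# argmax gives the first index where the threshold is reached; +1 turns it into a count.
# Caveat: if the threshold is never reached, argmax returns 0, so check cumulative[-1] first.
n_needed = int(np.argmax(cumulative >= 0.80)) + 1
print(n_needed, cumulative[n_needed - 1])
```

When fewer components are kept than the data allows, the cumulative sum may stay below the threshold, in which case `n_components` has to be increased before the count is meaningful.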

Let's reduce the number of principal components to just two and store them in a Pandas DataFrame called `data_projected`:

In [0]:
pca = PCA(n_components=2)
data_projected = pd.DataFrame(pca.fit_transform(data),columns=['PC1','PC2'])

*How much of the variance do these two principal components explain?*

In [0]:
###Start code here

###End code here

Let's add the variables `subjects`, `types` and `times` as columns to `data_projected`:

In [0]:
data_projected['subject'] = subjects
data_projected['type'] = types
data_projected['time'] = times

*Use the Seaborn `lmplot()` function to plot the projected data:*

In [0]:
###Start code here

###End code here
plt.show()

*Make the same plot, but color the data points by 'type' (use the 'hue' argument):*


In [0]:
###Start code here

###End code here
plt.show()

*What do you see?*

---(Answer)

---(Solution) 

*Color the microarray measurements by 'subject' and split by 'type' (use the 'col' argument):* 


In [0]:
###Start code here

###End code here
plt.show()

*Do the volunteers cluster together?*

---(Answer)

---(Solution) 

The first two principal components can separate many, but not all, of the symptomatic subjects from the asymptomatic ones. Probably 'time' has something to do with this...

Let's take a look at what we have found. We could say that we reduced the 11959 genes to just two 'eigengenes' (PC1 and PC2). These eigengenes explain most of the variation observed over all subjects and time points.
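To make the idea of an eigengene concrete: the value of a principal component for one sample is the projection of that (mean-centered) sample onto the component's weight vector. The following sketch checks this on synthetic data; `X` and `toy_pca` are hypothetical names, and the random matrix merely stands in for an expression table.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6))  # synthetic: 20 samples, 6 "genes"

toy_pca = PCA(n_components=2)
scores = toy_pca.fit_transform(X)

# Each score is the centered sample projected onto a component (a vector of gene weights).
manual_scores = (X - toy_pca.mean_) @ toy_pca.components_.T

matches = np.allclose(scores, manual_scores)
print(matches)
```

In other words, each eigengene value is a single number per sample that summarizes all genes through one fixed set of weights.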

*Use the seaborn function `catplot()` to plot the data set `data_projected` as a pointplot with column 'time' on the x-axis and the values of the first eigengene on the y-axis. Color the eigengene by column 'subject' and split the plot by column 'type':*

In [0]:
###Start code here

###End code here

plt.show()

*What do you see?*

---(Answer)

---(Solution)

*Do the same for the second eigengene PC2.*

In [0]:
###Start code here

###End code here

plt.show()

*Do we see something similar?*

---(Answer)

---(Solution)

*Now create the same plot for the first eigengene but remove the coloring by column 'subject'. This will create a summary plot where each point is the mean value of the first eigengene over all volunteers. The error bars estimate the variation between the volunteers (95% confidence interval):*

In [0]:
###Start code here

###End code here

plt.show()
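The numbers behind such a summary plot can be approximated by hand, which may help when interpreting the error bars. The sketch below computes a mean per time point and a normal-approximation 95% interval on a synthetic table. Note the assumption: seaborn's default interval is a bootstrap estimate, so the mean ± 1.96 × SEM used here is only a stand-in, and the frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic long-format table: 5 hypothetical subjects measured at 3 time points.
toy = pd.DataFrame({
    "subject": np.repeat(np.arange(5), 3),
    "time": np.tile([0, 8, 16], 5),
    "PC1": rng.normal(size=15),
})

# One row per time point: mean over subjects and standard error of that mean.
summary = toy.groupby("time")["PC1"].agg(["mean", "sem"])

# Normal-approximation 95% interval (an assumption; not seaborn's bootstrap).
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]
print(summary)
```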

*Do the same for the second eigengene.*

In [0]:
###Start code here

###End code here

By now you should have noticed how both eigengenes show very different behavior between the symptomatic and the asymptomatic subjects (on average). We can even see how long after inoculation with influenza H3N2/Wisconsin gene expression starts to change for the symptomatic subjects.

The eigengenes themselves of course don't have much biological meaning: each one is a weighted linear combination of the original genes. So, by looking at the magnitude of these weights we can learn about the contribution of each gene to each eigengene.
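In scikit-learn the weights of a fitted PCA are stored in the `components_` attribute, with one row per component and one column per feature. The sketch below ranks features by absolute weight on synthetic data in which one feature is deliberately given a much larger spread; the names `toy`, `toy_pca` and `f0` to `f3` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
feature_names = ["f0", "f1", "f2", "f3"]

# Synthetic data: feature f2 varies far more than the others.
X = rng.normal(size=(200, 4)) * np.array([1.0, 1.0, 20.0, 1.0])
toy = pd.DataFrame(X, columns=feature_names)

toy_pca = PCA(n_components=2).fit(toy)

# components_ has shape (n_components, n_features); row 0 holds the first component's weights.
first_weights = np.abs(toy_pca.components_[0])

# Rank features by the magnitude of their weight in the first component.
ranking = pd.Series(first_weights, index=feature_names).sort_values(ascending=False)
top_feature = ranking.index[0]
print(ranking)
```

Because the data are not rescaled here, the high-variance feature dominates the first component, which is a reminder that large weights can reflect scale as well as biological relevance.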

*Create a Pandas DataFrame called `pca_weights` with two columns:*

- *'genes' that contains the gene names*
- *'weights' that contains the absolute value of the weights in the first eigengene*

In [0]:
###Start code here

###End code here

print(pca_weights)

*Sort this DataFrame by column 'weights' in descending order*:

In [0]:
###Start code here

###End code here

*So? Did we find something interesting? How about the second eigengene? Find the 5 most interesting genes for the second eigengene.*

In [0]:
###Start code here

###End code here