## Protein-protein interaction graphs

### Getting started

Create an "R \[conda env:r_3.6\]" notebook in the "my_notebooks/week14" folder. Name this notebook "ppi_SARSCov2".

If you want to stop at any time, remember to "Save and Checkpoint" your notebook before doing "Close and Halt".

### Quick introduction to R data frames

The data frame is a special data type used to store dataset tables. Think of rows as cases, columns as variables. Each column is a vector.

Creating a data frame - note how each column is defined as a vector.

In [1]:
dfr1 = data.frame(ID=1:4,
                  FirstName=c("John","Jim","Jane","Jill"),
                  Beverage=c("Coffee","Tea","Tea","Coffee"),
                  Age=c(22,33,44,55) )

In [2]:
dfr1

ID,FirstName,Beverage,Age
1,John,Coffee,22
2,Jim,Tea,33
3,Jane,Tea,44
4,Jill,Coffee,55


What are the dimensions of this data frame? The first number is the number of rows, and the second number is the number of columns.

In [3]:
dim(dfr1)

There are many ways to get data out of a data frame.

In [4]:
dfr1[1,]   # First row, all columns

ID,FirstName,Beverage,Age
1,John,Coffee,22


In [5]:
dfr1[,1]   # First column, all rows

In [6]:
dfr1$Age   # Age column, all rows, 1st way
dfr1[,'Age'] # Age column, all rows, 2nd way
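
Because each column is a vector, pulling out a single column gives back a plain vector that can go straight into functions such as `mean()`. The sketch below rebuilds the example data frame so it runs on its own; the variable names `age_vec`, `age_vec2`, `age_df` and `age_df2` are only for illustration.

```r
# Rebuild the example data frame so this block is self-contained
dfr1 <- data.frame(ID = 1:4,
                   FirstName = c("John", "Jim", "Jane", "Jill"),
                   Beverage = c("Coffee", "Tea", "Tea", "Coffee"),
                   Age = c(22, 33, 44, 55),
                   stringsAsFactors = FALSE)

age_vec  <- dfr1$Age        # a numeric vector
age_vec2 <- dfr1[["Age"]]   # the same vector, selected by name with [[ ]]
print(mean(age_vec))        # average age

# Single brackets with only a column name keep the data frame structure
age_df  <- dfr1["Age"]                  # one-column data frame
age_df2 <- dfr1[, "Age", drop = FALSE]  # also a one-column data frame
print(is.data.frame(age_df))
```

Use the vector forms when you want to compute on the values, and the data frame forms when you want to keep a table.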

In [7]:
dfr1[1:2,3:4] # Rows 1 and 2, columns 3 and 4 - the beverage and age of John & Jim

Beverage,Age
Coffee,22
Tea,33


In [8]:
dfr1[c(1,3),] # Rows 1 and 3, all columns

,ID,FirstName,Beverage,Age
1,1,John,Coffee,22
3,3,Jane,Tea,44


We can get the row and column names.

In [9]:
colnames(dfr1)

In [10]:
rownames(dfr1)

We can also set the row or column names.

In [11]:
rownames(dfr1) = c("Person1","Person2","Person3","Person4")

In [12]:
dfr1

,ID,FirstName,Beverage,Age
Person1,1,John,Coffee,22
Person2,2,Jim,Tea,33
Person3,3,Jane,Tea,44
Person4,4,Jill,Coffee,55
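
Once row names are set, they can be used for indexing in the same way as column names. A minimal sketch, rebuilding the data frame so the block is self-contained (`jane_age` and `two_people` are illustrative names):

```r
# Rebuild the example data frame and assign the same row names as above
dfr1 <- data.frame(ID = 1:4,
                   FirstName = c("John", "Jim", "Jane", "Jill"),
                   Beverage = c("Coffee", "Tea", "Tea", "Coffee"),
                   Age = c(22, 33, 44, 55),
                   stringsAsFactors = FALSE)
rownames(dfr1) <- c("Person1", "Person2", "Person3", "Person4")

# One cell: row "Person3", column "Age"
jane_age <- dfr1["Person3", "Age"]
print(jane_age)

# Several rows and columns, all selected by name
two_people <- dfr1[c("Person1", "Person4"), c("FirstName", "Beverage")]
print(two_people)
```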


We can *subset* the data frame based on certain criteria. Who likes coffee? Note the double equals sign `==`, which tests for equality (a single `=` would be an assignment).

In [13]:
dfr1[dfr1$Beverage=='Coffee',]

,ID,FirstName,Beverage,Age
Person1,1,John,Coffee,22
Person4,4,Jill,Coffee,55
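
To see why this works, it helps to look at the comparison on its own: it produces one `TRUE`/`FALSE` value per row, and only the `TRUE` rows are kept. The sketch below also combines two conditions with `&`; the age cutoff of 30 is an arbitrary choice for illustration.

```r
# Rebuild the example data frame so this block is self-contained
dfr1 <- data.frame(ID = 1:4,
                   FirstName = c("John", "Jim", "Jane", "Jill"),
                   Beverage = c("Coffee", "Tea", "Tea", "Coffee"),
                   Age = c(22, 33, 44, 55),
                   stringsAsFactors = FALSE)

# The comparison returns a logical vector, one value per row
likes_coffee <- dfr1$Beverage == "Coffee"
print(likes_coffee)

# Combine conditions with & (and) or | (or) before subsetting
older_coffee <- dfr1[likes_coffee & dfr1$Age > 30, ]
print(older_coffee)
```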


## Reading a network into R

Upload the file "Gordon_Nat2020_SuppTable2.txt" to your "my_notebooks/week14" folder. This file contains the experimentally determined interactions between the proteins encoded by the SARS-CoV-2 genome and the proteins in human cells (an *interactome*; reference: Gordon et al., Nature 2020, https://www.nature.com/articles/s41586-020-2286-9). Each line in this file represents an interaction between a SARS-CoV-2 viral protein and a human protein.

We will first read this file into R as a data frame, then convert it to a network object.

In [14]:
net_df = read.table('Gordon_Nat2020_SuppTable2.txt',
                    sep='\t',
                    header=TRUE,
                    comment.char="",
                    quote="",
                    stringsAsFactors=FALSE)
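
The `comment.char=""` and `quote=""` arguments matter because the description columns contain free text. By default `read.table` treats `#` as the start of a comment and gives quote characters a special meaning, which can garble such fields. The sketch below shows the same call on a tiny made-up file written to a temporary location; the file contents and the names `tmp` and `toy` are hypothetical and not part of the real dataset.

```r
# Hypothetical three-line, tab-delimited file; contents are made up
tmp <- tempfile(fileext = ".txt")
writeLines(c("Bait\tPrey\tNote",
             "virusA\tP1\tbinds 'Ser-2' site",
             "virusA\tP2\tsee entry #42"),
           tmp)

toy <- read.table(tmp,
                  sep = "\t",             # columns are separated by tabs
                  header = TRUE,          # first line holds the column names
                  comment.char = "",      # treat '#' as ordinary text
                  quote = "",             # treat quote characters as ordinary text
                  stringsAsFactors = FALSE)
print(toy)

unlink(tmp)  # remove the temporary file
```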

How many interactions are there?

In [15]:
dim(net_df)

Let's take a look at the first three interactions.

In [16]:
net_df[1:3,]

Bait,Preys,PreyGene,MIST,Saint_BFDR,AvgSpec,FoldChange,Uniprot.Protein.ID,Uniprot.Protein.Description,Uniprot.Function,Structures..PDB.,Uniprot.Function.in.Disease
SARS-CoV2 E,O00203,AP3B1,0.9635501,0,4.67,46.67,AP3B1_HUMAN,AP-3 complex subunit beta-1 (Adaptor protein complex AP-3 subunit beta-1) (Adaptor-related protein complex 3 subunit beta-1) (Beta-3A-adaptin) (Clathrin assembly protein complex 3 beta-1 large chain),""" Subunit of non-clathrin- and clathrin-associated adaptor protein complex 3 (AP-3) that plays a role in protein sorting in the late-Golgi/trans-Golgi network (TGN) and/or endosomes. The AP complexes mediate both the recruitment of clathrin to membranes and the recognition of sorting signals within the cytosolic tails of transmembrane cargo molecules. AP-3 appears to be involved in the sorting of a subset of transmembrane proteins targeted to lysosomes and lysosome-related organelles. In concert with the BLOC-1 complex, AP-3 is required to target cargos into vesicles assembled at cell bodies for delivery into neurites and nerve terminals.""",,""" Hermansky-Pudlak syndrome 2 (HPS2) [MIM:608233]: A form of Hermansky-Pudlak syndrome, a genetically heterogeneous autosomal recessive disorder characterized by oculocutaneous albinism, bleeding due to platelet storage pool deficiency, and lysosomal storage defects. This syndrome results from defects of diverse cytoplasmic organelles including melanosomes, platelet dense granules and lysosomes. Ceroid storage in the lungs is associated with pulmonary fibrosis, a common cause of premature death in individuals with HPS. HPS2 differs from the other forms of HPS in that it includes immunodeficiency in its phenotype and patients with HPS2 have an increased susceptibility to infections. {ECO:0000269|PubMed:10024875}. Note=The disease is caused by mutations affecting the gene represented in this entry."""
SARS-CoV2 E,O60885,BRD4,0.9784884,0,2.67,26.67,BRD4_HUMAN,Bromodomain-containing protein 4 (Protein HUNK1),""" Chromatin reader protein that recognizes and binds acetylated histones and plays a key role in transmission of epigenetic memory across cell divisions and transcription regulation. Remains associated with acetylated chromatin throughout the entire cell cycle and provides epigenetic memory for postmitotic G1 gene transcription by preserving acetylated chromatin status and maintaining high-order chromatin structure (PubMed:23589332, PubMed:23317504, PubMed:22334664). During interphase, plays a key role in regulating the transcription of signal-inducible genes by associating with the P-TEFb complex and recruiting it to promoters. Also recruits P-TEFb complex to distal enhancers, so called anti-pause enhancers in collaboration with JMJD6. BRD4 and JMJD6 are required to form the transcriptionally active P-TEFb complex by displacing negative regulators such as HEXIM1 and 7SKsnRNA complex from P-TEFb, thereby transforming it into an active form that can then phosphorylate the C-terminal domain (CTD) of RNA polymerase II (PubMed:23589332, PubMed:19596240, PubMed:16109377, PubMed:16109376, PubMed:24360279). Promotes phosphorylation of 'Ser-2' of the C-terminal domain (CTD) of RNA polymerase II (PubMed:23086925). According to a report, directly acts as an atypical protein kinase and mediates phosphorylation of 'Ser-2' of the C-terminal domain (CTD) of RNA polymerase II; these data however need additional evidences in vivo (PubMed:22509028). In addition to acetylated histones, also recognizes and binds acetylated RELA, leading to further recruitment of the P-TEFb complex and subsequent activation of NF-kappa-B (PubMed:19103749). Also acts as a regulator of p53/TP53-mediated transcription: following phosphorylation by CK2, recruited to p53/TP53 specific target promoters (PubMed:23317504). 
{ECO:0000269|PubMed:16109376, ECO:0000269|PubMed:16109377, ECO:0000269|PubMed:19103749, ECO:0000269|PubMed:19596240, ECO:0000269|PubMed:22334664, ECO:0000269|PubMed:22509028, ECO:0000269|PubMed:23086925, ECO:0000269|PubMed:23317504, ECO:0000269|PubMed:23589332, ECO:0000269|PubMed:24360279}.; FUNCTION: [Isoform B]: Acts as a chromatin insulator in the DNA damage response pathway. Inhibits DNA damage response signaling by recruiting the condensin-2 complex to acetylated histones, leading to chromatin structure remodeling, insulating the region from DNA damage response by limiting spreading of histone H2AX/H2A.x phosphorylation. {ECO:0000269|PubMed:23728299}.""",2I8N;2LSP;2MJV;2N3K;2NCZ;2ND0;2ND1;2NNU;2OSS;2OUO;2YEL;2YEM;3MXF;3P5O;3SVF;3SVG;3U5J;3U5K;3U5L;3UVW;3UVX;3UVY;3UW9;3ZYU;4A9L;4BJX;4BW1;4BW2;4BW3;4BW4;4C66;4C67;4CFK;4CFL;4CL9;4CLB;4DON;4E96;4F3I;4GPJ;4HBV;4HBW;4HBX;4HBY;4HXK;4HXL;4HXM;4HXN;4HXO;4HXP;4HXR;4HXS;4IOO;4IOQ;4IOR;4J0R;4J0S;4J3I;4KV1;4KV4;4LR6;4LRG;4LYI;4LYS;4LYW;4LZR;4LZS;4MEN;4MEO;4MEP;4MEQ;4MR3;4MR4;4NQM;4NR8;4NUC;4NUD;4NUE;4O70;4O71;4O72;4O74;4O75;4O76;4O77;4O78;4O7A;4O7B;4O7C;4O7E;4O7F;4OGI;4OGJ;4PCE;4PCI;4PS5;4QB3;4QR3;4QR4;4QR5;4QZS;4UIX;4UIY;4UIZ;4UYD;4WHW;4WIV;4X2I;4XY9;4XYA;4YH3;4YH4;4Z1Q;4Z1S;4Z93;4ZC9;4ZW1;5A5S;5A85;5ACY;5AD2;5AD3;5BT4;5CFW;5COI;5CP5;5CPE;5CQT;5CRM;5CRZ;5CS8;5CTL;5CY9;5D0C;5D24;5D25;5D26;5D3H;5D3J;5D3L;5D3N;5D3P;5D3R;5D3S;5D3T;5DLX;5DLZ;5DW2;5DX4;5E0R;5EGU;5EI4;5EIS;5F5Z;5F60;5F61;5F62;5F63;5FBX;5H21;5HCL;5HLS;5HM0;5HQ5;5HQ6;5HQ7;5I80;5I88;5IGK;5JWM;5KDH;5KHM;5KJ0;5KU3;5LJ1;5LJ2;5LRQ;5LUU;5M39;5M3A;5MKZ;5MLI;5N2M;5NNC;5NND;5NNE;5NNF;5NNG;5O97;5OVB;5OWM;5OWW;5T35;5TI2;5TI3;5TI4;5TI5;5TI6;5TI7;5U28;5U2C;5U2E;5U2F;5UEO;5UEP;5UEQ;5UER;5UES;5UET;5UEU;5UEV;5UEX;5UEY;5UEZ;5UF0;5ULA;5UOO;5UVS;5UVT;5UVU;5UVV;5UVW;5UVX;5UVY;5UVZ;5V67;5VBO;5VBP;5VOM;5VZS;5W55;5WA5;5WMA;5WMD;5WMG;5WUU;5XHY;5XI2;5XI3;5XI4;5Y1Y;5Y8C;5Y8W;5Y8Y;5Y8Z;5Y93;5Y94;5YOU;5YOV;5YQX;5Z1R;5Z1S;5Z1T;5Z5T;5Z5U;5Z5V;5Z8G;5Z8R;5Z8Z;5Z90;5Z9C;5Z9K;6AFR;6AJV;6AJW;6AJX;6A
JY;6AJZ;6BN7;6BN8;6BN9;6BNB;6BNH;6BOY;6C7Q;6C7R;6CD4;6CD5;6CIS;6CIY;6CJ1;6CJ2;6CKR;6CKS;6CZU;6CZV;6DJC;6DL2;6DMJ;6DML;6DNE;6DUV;6E4A;6FFD;6FNX;6FO5;6FSY;6FT3;6FT4;6G0O;6G0P;6G0Q;6G0R;6G0S;6HDQ;6HOV;6I7X;6I7Y;6IN1;6MAU;6MH1;6MH7;6MNL;6PRT;6PS9;6PSB;6Q3Y;6Q3Z;6S25;6SE4,""" Note=A chromosomal aberration involving BRD4 is found in a rare, aggressive, and lethal carcinoma arising in midline organs of young people. Translocation t(15;19)(q14;p13) with NUTM1 which produces a BRD4-NUTM1 fusion protein. {ECO:0000269|PubMed:11733348, ECO:0000269|PubMed:12543779}."""
SARS-CoV2 E,P25440,BRD2,0.9065929,0,7.0,70.0,BRD2_HUMAN,Bromodomain-containing protein 2 (O27.1.1) (Really interesting new gene 3 protein),""" May play a role in spermatogenesis or folliculogenesis (By similarity). Binds hyperacetylated chromatin and plays a role in the regulation of transcription, probably by chromatin remodeling. Regulates transcription of the CCND1 gene. Plays a role in nucleosome assembly. {ECO:0000250, ECO:0000269|PubMed:18406326}.""",1X0J;2DVQ;2DVR;2DVS;2DVV;2E3K;2G4A;2YDW;2YEK;3AQA;3ONI;4A9E;4A9F;4A9H;4A9I;4A9J;4A9M;4A9N;4A9O;4AKN;4ALG;4ALH;4J1P;4MR5;4MR6;4QEU;4QEV;4QEW;4UYF;4UYG;4UYH;5BT5;5DFB;5DFC;5DFD;5DW1;5EK9;5HEL;5HEM;5HEN;5HFQ;5IBN;5IG6;5N2L;5O38;5O39;5O3A;5O3B;5O3C;5O3D;5O3E;5O3F;5O3G;5O3H;5O3I;5U5S;5U6V;5UEW;5XHE;5XHK;6CUI;6DB0;6DBC;6DDI;6DDJ;6E6J;6FFE;6FFF;6FFG;6I80;6I81;6K04;6K05;6MO7;6MO8;6MO9;6MOA,


We can see that the first column is the viral protein, the second column is the UniProt ID of the human protein, and the third column is the gene symbol of the human protein. The rest of the columns provide more information about the experimental result and information about the human protein.

The `graph_from_data_frame` function in the `igraph` package can convert a data frame into a graph/network object, provided that the first two columns are the nodes in each interaction.

In [17]:
library(igraph)
net = graph_from_data_frame(d=net_df,directed=FALSE)

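Note: if this cell fails with `could not find function "graph_from_data_frame"`, the installed `igraph` is probably older than version 1.0, where the function was named `graph.data.frame`. Updating `igraph` should resolve the error.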


The `net` variable is now a graph/network object. We can take a quick look at what's inside, and also get the edges and vertices (nodes) in the network.

In [None]:
net

In [None]:
E(net)       # The edges of the "net" object

In [None]:
V(net)       # The vertices (nodes) of the "net" object
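
To make the conversion concrete, here is a minimal sketch on a made-up three-row edge list. The names `toy_edges`, `toy_net` and the `score` column are hypothetical. The first two columns define the nodes of each edge, and any remaining columns are stored as edge attributes.

```r
library(igraph)

# Hypothetical toy interactions; names and scores are made up
toy_edges <- data.frame(from  = c("virusA", "virusA", "virusB"),
                        to    = c("P1", "P2", "P2"),
                        score = c(0.9, 0.8, 0.7),
                        stringsAsFactors = FALSE)

toy_net <- graph_from_data_frame(d = toy_edges, directed = FALSE)

print(vcount(toy_net))    # number of distinct nodes
print(ecount(toy_net))    # number of edges, one per row of the data frame
print(V(toy_net)$name)    # node names collected from the first two columns
print(E(toy_net)$score)   # the extra column, now an edge attribute
```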

We can get the *degree* of each node (the number of edges attached to it) and plot the degree distribution.

In [None]:
deg = degree(net)
deg

In [None]:
hist(deg)
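
The degree is simply a count of how often each node appears in the edge list, so it can be cross-checked without `igraph`. A minimal base R sketch on a made-up edge list (all names are hypothetical), assuming there are no self-interactions:

```r
# Hypothetical toy edge list; names are made up
toy_edges <- data.frame(Bait = c("virusA", "virusA", "virusB"),
                        Prey = c("P1", "P2", "P2"),
                        stringsAsFactors = FALSE)

# Each appearance of a node in either column is one edge end
toy_deg <- table(c(toy_edges$Bait, toy_edges$Prey))
print(toy_deg)

# Degree distribution: how many nodes have degree 1, degree 2, ...
print(table(toy_deg))
```

Every edge contributes to the degree of two nodes, so the degrees always sum to twice the number of edges.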

The `net` object can be plotted directly.

In [None]:
plot(net)

There are many options for changing how your network looks.

In [None]:
plot(net, vertex.label.color="black", vertex.label.cex=.5,vertex.size=2,vertex.label.dist=1)
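
Plot options can also take one value per node, which makes it possible to distinguish viral from human proteins. The sketch below is one way to do this on a made-up network; the colors, sizes and names (`toy_edges`, `toy_net`, `is_bait`, `node_colors`) are arbitrary choices for illustration.

```r
library(igraph)

# Hypothetical toy interactions; names are made up
toy_edges <- data.frame(from = c("virusA", "virusA", "virusB"),
                        to   = c("P1", "P2", "P2"),
                        stringsAsFactors = FALSE)
toy_net <- graph_from_data_frame(d = toy_edges, directed = FALSE)

# Nodes that appear in the first column are treated as the viral baits
is_bait <- V(toy_net)$name %in% toy_edges$from
node_colors <- ifelse(is_bait, "tomato", "skyblue")

set.seed(1)  # the default layout is randomized; fix the seed for a repeatable picture
plot(toy_net,
     vertex.color = node_colors,
     vertex.size = ifelse(is_bait, 12, 6),
     vertex.label.color = "black",
     vertex.label.cex = 0.8,
     vertex.label.dist = 1)
```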

## Finding functions of human proteins interacting with viral proteins

Gordon et al. found that human proteins that interact with different viral proteins are enriched for specific biological processes.  We will try to reproduce their result for interactors of the viral protein orf8.

First we want to get a *subnetwork* of all interactions that include orf8.

In [None]:
orf8_sub = net_df[net_df$Bait=='SARS-CoV2 orf8',]

In [None]:
orf8_sub[1:3,]

How many interactions are there?

In [None]:
dim(orf8_sub)
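
Before subsetting, it can be useful to count how many interactions each bait has. `table()` does this in one step. The sketch below uses a made-up data frame with the same `Bait` and `PreyGene` column names as the real table; the values are hypothetical.

```r
# Hypothetical toy table; bait and gene names are made up
toy_df <- data.frame(Bait = c("virusA orf8", "virusA E", "virusA orf8", "virusA M"),
                     PreyGene = c("G1", "G2", "G3", "G4"),
                     stringsAsFactors = FALSE)

# Number of interactions per bait, largest first
per_bait <- table(toy_df$Bait)
print(sort(per_bait, decreasing = TRUE))

# The subset for one bait has exactly as many rows as its count
orf8_toy <- toy_df[toy_df$Bait == "virusA orf8", ]
print(nrow(orf8_toy))
```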

Note that `orf8_sub` is a data frame, so we can again turn it into a graph/network object with the `graph_from_data_frame` function.

In [None]:
orf8_subnet = graph_from_data_frame(d=orf8_sub, directed=FALSE) 

In [None]:
plot(orf8_subnet)

The `PreyGene` column contains the gene symbols of the interacting proteins.

In [None]:
orf8_sub$PreyGene

We now load the `clusterProfiler` package, which provides functions to compute GO term enrichment, and the `org.Hs.eg.db` package, which provides the GO annotations of human genes.

In [None]:
library(clusterProfiler)
library(org.Hs.eg.db)

The `enrichGO` function takes a vector of gene symbols, compares them to the annotated gene functions in the human genome, and returns the GO terms that are enriched among the genes in your list. Here `ont="BP"` restricts the analysis to the Biological Process ontology. Recall from the last lecture that we need FDR correction because we are doing thousands of statistical tests. `BH` (Benjamini-Hochberg) is an FDR correction method.

In [None]:
ego <- enrichGO(orf8_sub$PreyGene, 
                OrgDb=org.Hs.eg.db, 
                keyType="SYMBOL", 
                ont="BP", 
                pvalueCutoff=0.05, 
                pAdjustMethod="BH", 
                qvalueCutoff=0.05, 
                readable=FALSE)
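
To see what the `BH` correction does to p-values, base R's `p.adjust` can be applied to a small made-up vector. The p-values below are hypothetical. The manual calculation is included only to show the idea: sort the p-values, multiply each by n/rank, then make sure the adjusted values never decrease with rank.

```r
# Hypothetical raw p-values from five tests
p_raw <- c(0.001, 0.01, 0.03, 0.04, 0.5)

# Benjamini-Hochberg adjusted p-values
p_bh <- p.adjust(p_raw, method = "BH")
print(p_bh)

# The same calculation by hand
n <- length(p_raw)
ord <- order(p_raw)                            # positions from smallest to largest p
scaled <- p_raw[ord] * n / seq_len(n)          # p * n / rank
monotone <- rev(cummin(rev(scaled)))           # enforce non-decreasing values by rank
p_manual <- pmin(1, monotone)[order(ord)]      # cap at 1, restore the original order
print(p_manual)
```

Tests with an adjusted value below the chosen cutoff (0.05 in the `enrichGO` call) are reported as significant.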

Let's take a look at the top 10 enriched GO terms. How do they compare with the results in Figure 3 and Extended Data Fig. 3 of the Gordon et al. paper?

In [None]:
ego[1:10,]
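
Each p-value in an enrichment table of this kind comes from an over-representation test, which is commonly computed with the hypergeometric distribution. The sketch below shows that calculation for a single GO term using base R. All four counts are made up for illustration and are not taken from the orf8 result.

```r
# Hypothetical counts for one GO term
N <- 20000  # annotated genes in the background
M <- 200    # background genes annotated with this GO term
n <- 50     # genes in our list
k <- 5      # genes in our list annotated with this GO term

# Probability of seeing k or more annotated genes when drawing n genes at random
p_enrich <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)
print(p_enrich)

# Cross-check: add up the probabilities of every outcome from k to n
p_check <- sum(dhyper(k:n, M, N - M, n))
print(p_check)
```

With these counts only 0.5 annotated genes would be expected by chance (50 × 200 / 20000), so observing 5 gives a small p-value.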