# Practice Session 03: Management of network data

In this session we will study an application of complex network analysis to medicine. We will start with the *diseasome*, a bi-partite network connecting known genetic diseases with the genes whose mutations are implicated in them [1]. In Section 2 we will obtain the projection of the diseasome into a disease-disease network that highlights relations among diseases.

The initial dataset `disease-genes.csv` in the data/ directory contains the following columns:

1. A disease **ID**
2. A disease **Name**
3. A comma-separated list of **Genes** involved in this disease
4. The **OMIM ID** (Online Mendelian Inheritance in Man) of this disease
5. A codification of the location of the genes in their **Chromosome**
6. A disease **Class** indicating the physiological system that is affected

[1] Goh, K. I., Cusick, M. E., Valle, D., Childs, B., Vidal, M., & Barabási, A. L. (2007). [The human disease network](http://www.pnas.org/content/104/21/8685). Proceedings of the National Academy of Sciences, 104(21), 8685-8690.

Author: <font color="blue">Your name here</font>

E-mail: <font color="blue">Your e-mail here</font>

Date: <font color="blue">The current date here</font>

## Evaluation

Because this session may take longer than expected, depending on your programming experience, the score is distributed as shown below, so you may decide how much of Section 2 to deliver accordingly. Even so, the professor encourages students to attempt Section 2, because interesting observations can be drawn from the disease-disease graph.

- Section 1: 70%
- Section 2: 30%

# 0. Code snippets you may need


## 0.1. Splitting a string and iterating on its parts

If you want to split a string into pieces, you can use the following. Suppose the variable `genes` contains a comma-separated list such as `CYP17A1, CYP17, P450C17`:

```python
gene_list = genes.split(",")
for gene in gene_list:
    gene = gene.strip()
    ...
```

The `str.strip()` method removes white space and newlines from the beginning and end of the string, so it's equivalent to `str.lstrip().rstrip()`.

You can also do this in one line of code, using the `map(f, v)` function, which applies function `f` to each element of the list `v` and returns the results:

```python
gene_list = list(map(str.strip, genes.split(",")))
```
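
As a quick check, both approaches can be run on the example string given above. This is only a small sketch: the names `gene_list_loop` and `gene_list_map` are introduced here for illustration and are not part of the assignment.

```python
# Example string taken from the text above.
genes = "CYP17A1, CYP17, P450C17"

# Loop version: split on commas, then strip each piece.
gene_list_loop = []
for gene in genes.split(","):
    gene_list_loop.append(gene.strip())

# One-line version using map().
gene_list_map = list(map(str.strip, genes.split(",")))

print(gene_list_loop)
print(gene_list_map)
```

Both lists should contain the same three gene names, without the leading spaces left behind by `split(",")`.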

## 0.2. Producing the intersection and union of lists

There are many ways of intersecting two lists in Python. One of the simplest is to convert them to sets and then compute the set intersection using the built-in `&` operator:

```python
def intersection(list1, list2):
    return list(set(list1) & set(list2))
```

If you want to test whether two lists have elements in common, you can check the length of their intersection (there are other ways). Remember that the length of a list is obtained with `len()`:

```python
if len(intersection(list1,list2)) > 0:
    ...
```

A list `c` that is the union of two lists `a,b` can be computed in Python with `c = list(set(a) | set(b))`.
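
The two operations can be tried on a pair of small lists. The sketch below uses made-up gene lists `a` and `b`, and a helper `union` that is not defined elsewhere in this notebook. Because sets are unordered, the results are sorted before printing.

```python
def intersection(list1, list2):
    """Return the elements present in both lists (order not guaranteed)."""
    return list(set(list1) & set(list2))


def union(list1, list2):
    """Return the elements present in either list, without duplicates."""
    return list(set(list1) | set(list2))


# Hypothetical gene lists, for illustration only.
a = ["POLG", "POLG1", "TCF1"]
b = ["TCF1", "HNF1A", "POLG"]

# sorted() gives a stable order for display, since sets are unordered.
print(sorted(intersection(a, b)))
print(sorted(union(a, b)))
```

Note that converting to sets also removes duplicates, so these helpers are only appropriate when repeated elements do not matter.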

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

# 1. The diseasome bi-partite graph

## 1.0. Tip: examine your input file

Before you begin, we highly recommend that you:

1. Copy the file ``disease-genes.csv`` to a local ``data/`` subdirectory in your practice folder 
2. Open this file in a spreadsheet program and look at it. Use ',' as the field separator.
3. Check its contents against the description given above.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

## 1.1. Read the disease-genes file in a dataframe

The following code, which you can leave as-is, reads the disease-genes file into a pandas dataframe.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

```python
# Feel free to add imports if you need them

import io
import csv
import pandas as pd
```

```python
# Leave this code as-is

INPUT_FILENAME = "data/disease-genes.csv"
OUTPUT_DISEASOME_FILENAME = "data/diseasome.csv"
```

```python
# Leave this code as-is

disease_genes = pd.read_csv(INPUT_FILENAME, sep=",")
disease_genes.set_index("ID", inplace=True)
# Ten first lines to check if the object has the right type of data in it.
disease_genes.head(10)
```
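
To see what this code does without the real file, the same calls can be run on a tiny in-memory sample. This is a sketch under assumptions: the two rows are invented, and the header names `OMIM ID` and `Chromosome` are guesses based on the column description above, so check them against the actual file.

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the six columns described above.
sample_csv = io.StringIO(
    'ID,Name,Genes,OMIM ID,Chromosome,Class\n'
    '1,Disease A,"GENE1, GENE2",100001,1p36,Metabolic\n'
    '2,Disease B,GENE3,100002,2q11,Cancer\n'
)

sample = pd.read_csv(sample_csv, sep=",")
sample.set_index("ID", inplace=True)

# The quoted Genes field stays in one column despite its inner comma.
print(sample.loc[1, "Genes"])
print(list(sample.columns))
```

After `set_index("ID", inplace=True)`, the `ID` column becomes the row index, so it no longer appears in `sample.columns`.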

## 1.2. Create the diseasome dataframe and file

Create a new dataframe named `diseasome` containing three columns: `disease`, `class`, and `gene_list`. These are extracted from the `disease_genes` dataframe as follows:

* `diseasome.disease` is `disease_genes.Name`
* `diseasome.class` is `disease_genes.Class`
* `diseasome.gene_list` is a Python list containing the genes stripped from `disease_genes.Genes`

The columns should be in the specified order.

You can do this in two steps:

1. Create a temporary dataframe `tmp_diseasome` with the three source columns. This can be done directly in Pandas by using [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html).
2. For each row in `tmp_diseasome`, create `diseasome.gene_list`, and then generate the disease-duplicate rows in the final `diseasome` dataframe.

Check the result by means of `diseasome.head(10)`.
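
The following sketch illustrates one possible way to combine columns with `pandas.concat` and build a list column, using an invented stand-in dataframe called `toy`. It is not the only valid approach, and the toy names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the disease_genes dataframe.
toy = pd.DataFrame(
    {
        "Name": ["Disease A", "Disease B"],
        "Genes": ["GENE1, GENE2", "GENE3"],
        "Class": ["Metabolic", "Cancer"],
    },
    index=pd.Index([1, 2], name="ID"),
)

# Step 1: put the three source columns side by side (axis=1) in the required order.
tmp = pd.concat([toy["Name"], toy["Class"], toy["Genes"]], axis=1)
tmp.columns = ["disease", "class", "genes"]

# Step 2: turn each comma-separated string into a list of stripped gene names.
tmp["gene_list"] = tmp["genes"].apply(
    lambda s: [g.strip() for g in s.split(",")]
)
toy_diseasome = tmp[["disease", "class", "gene_list"]]

print(toy_diseasome.head(10))
```

Because `class` is a reserved word in Python, that column has to be accessed as `toy_diseasome["class"]` rather than with attribute syntax.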

Finally, create a **tab-separated** CSV file (to be used in Cytoscape later on). As expected, **this file should have one row per disease-gene pair**, as follows:

```
    disease          class          gene
    Alpers syndrome   Neurological   POLG
    Alpers syndrome   Neurological   POLG1
    Hepatic adenoma   Cancer         TCF1
    Hepatic adenoma   Cancer         HNF1A
    Hepatic adenoma   Cancer         MODY3
    ...
```
This can be done by using [pandas.DataFrame.to_csv](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).
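
One possible way to obtain one row per disease-gene pair is `DataFrame.explode`, followed by `to_csv` with `sep="\t"`. The sketch below assumes a hypothetical `toy_diseasome` dataframe and returns the text instead of writing a file; in the notebook you would pass `OUTPUT_DISEASOME_FILENAME` as the first argument of `to_csv`.

```python
import pandas as pd

# Hypothetical dataframe with one row per disease and a list of genes.
toy_diseasome = pd.DataFrame(
    {
        "disease": ["Disease A", "Disease B"],
        "class": ["Metabolic", "Cancer"],
        "gene_list": [["GENE1", "GENE2"], ["GENE3"]],
    }
)

# explode() repeats the disease and class once per element of gene_list.
pairs = toy_diseasome.explode("gene_list").rename(columns={"gene_list": "gene"})

# With no path given, to_csv() returns the text instead of writing a file.
tsv_text = pairs.to_csv(sep="\t", index=False)
print(tsv_text)
```

Passing `index=False` keeps the row index out of the file, so the output has exactly the three columns shown above.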

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Q1. Replace this cell with your code to create the `diseasome` dataframe from the `disease_genes` dataframe and write it to the `diseasome.csv` file. Include the line `diseasome.head(10)`.</font>


## 1.3. Examine the file you generated

Open the ``diseasome.csv`` file in a spreadsheet program to make sure you generated it correctly.

## 1.4. Import this file in Cytoscape

Remember these files are imported with ``File > Import > Network from File ...`` and you must select only "tab" as separator in the advanced options. Then, you have to select:

* disease as a ``Source Node``
* gene as a ``Target Node`` 
* class as a ``Source Node Attribute``.

## 1.5. Draw this graph

Select **the largest connected component** of the graph. To do this, you can either:

* hold down the "shift" key while you draw a rectangle around it, or
* select a node and use the two-house "neighbor" button repeatedly.

Then, create a new graph from the selected largest component (``File > New > Network > From selected nodes, all edges``), and apply a graph layout to it, such as a force-directed layout.
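
Cytoscape is the tool for this step, but the size of the largest component can optionally be cross-checked in Python. The sketch below is a plain breadth-first search over an invented edge list; in practice the edges would be the disease-gene pairs you wrote to `diseasome.csv`. It assumes that no gene shares its name with a disease.

```python
from collections import defaultdict, deque

# Hypothetical disease-gene edges, for illustration only.
edges = [
    ("Disease A", "GENE1"),
    ("Disease A", "GENE2"),
    ("Disease B", "GENE2"),
    ("Disease C", "GENE3"),
]

# Build an undirected adjacency list.
neighbors = defaultdict(set)
for source, target in edges:
    neighbors[source].add(target)
    neighbors[target].add(source)

# Breadth-first search from every unvisited node to find the components.
visited = set()
component_sizes = []
for start in list(neighbors):
    if start in visited:
        continue
    visited.add(start)
    queue = deque([start])
    size = 0
    while queue:
        node = queue.popleft()
        size += 1
        for other in neighbors[node]:
            if other not in visited:
                visited.add(other)
                queue.append(other)
    component_sizes.append(size)

largest_fraction = max(component_sizes) / len(neighbors)
print(f"largest component: {100 * largest_fraction:.1f}% of nodes")
```

Here the node count includes both diseases and genes, which matches how Cytoscape counts nodes in the bi-partite graph.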

Next, style the genes with a white background (set white as the default fill color), and the remaining nodes according to the class of disease (Style/Node, Fill Color on Column "class", Discrete Mapping). If you right-click on "Mapping type" when creating a discrete mapping, you can use an automatic mapping generator to start with. Note that genes do not have a "class" and hence take the default color.
Make genes hexagons and all the remaining nodes rectangles.

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Q2. Save the image as diseasome.png and replace this cell with the code `![Diseasome graph](diseasome.png)` to display your graph.</font>

<font size="+1" color="red">Q3. Replace this cell with a brief commentary, in your own words, of what you see in this graph. Include your answer to at least these three questions: (1) What size was the largest component in terms of percentage of nodes of the graph? (2) What is the dominant type of disease in the largest component? and (3) Why are diseases of the same type close to each other in this graph?</font>

# 2. The disease-disease graph

The bi-partite diseasome is hard to visualize as it mixes diseases and genes. We will now try to visualize only the connections between diseases. Edges in the new graph will be labeled with the number of genes in common as a weight.

## 2.1. Create a disease-disease.csv file

The following code lists all the diseases that have at least one gene in common:

```python
for idx1, disease1 in diseasome.iterrows():
    gene_list_1 = ...
    for idx2, disease2 in diseasome.iterrows():
        if disease2["disease"] != disease1["disease"]:
            gene_list_2 = ...
            common_genes = intersection(gene_list_1, gene_list_2)
            if len(common_genes) > 0:
                print("diseases '%s' and '%s' have %d gene(s) in common" %
                      (disease1["disease"], disease2["disease"], len(common_genes)))
```

Modify this code to generate a tab-separated file like this one:

    disease1    disease2  ngenes1 ngenes2 class1    class2   ngcommon
    17-alpha... 17,20...  3       3       Endocrine Endocrine       3
    3-methyl... Optic ... 2       2       Metabolic Ophthamological 2
    Aarskog...  Mental... 3       3       multiple  Neurological    3
    ...

Where `ngenes1` and `ngenes2` correspond to `len(gene_list_1)` and `len(gene_list_2)` respectively.
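
A tab-separated file with this header can be written with the standard `csv` module, among other options. The sketch below writes to an in-memory buffer and uses a single invented row; in the notebook the rows would come from the nested loop, and the buffer would be replaced by a file opened on `OUTPUT_DISEASEDISEASE_FILENAME`.

```python
import csv
import io

# Hypothetical rows; in the notebook they would come from the nested loop.
rows = [
    ("Disease A", "Disease B", 2, 1, "Metabolic", "Cancer", 1),
]
header = ["disease1", "disease2", "ngenes1", "ngenes2", "class1", "class2", "ngcommon"]

# An in-memory buffer stands in for a file opened with open(..., "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)

print(buffer.getvalue())
```

Collecting the rows in a list first and writing them at the end keeps the file-writing code separate from the comparison logic.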

If you want to avoid having double edges (each pair listed once in each direction), change the condition `disease2["disease"] != disease1["disease"]` from `!=` (different) to `>` (greater than), so that each pair of diseases is kept only once.
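
The effect of the two conditions can be seen on a short list of invented disease names. This is only an illustration of the comparison; it assumes that disease names are unique, and it relies on Python comparing strings lexicographically.

```python
# Hypothetical disease names, for illustration only.
names = ["Disease B", "Disease A", "Disease C"]

# With !=, every pair is produced twice, once in each direction.
pairs_not_equal = [(a, b) for a in names for b in names if b != a]

# With >, only the direction where the second name sorts later is kept.
pairs_greater = [(a, b) for a in names for b in names if b > a]

print(len(pairs_not_equal))
print(len(pairs_greater))
```

With three names, the first list holds six ordered pairs and the second holds the three unordered pairs.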

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

```python
OUTPUT_DISEASEDISEASE_FILENAME = "data/disease-disease.csv"
```

<font size="+1" color="red">Q4. Replace this cell with your code to create the disease-disease.csv file; your code can span multiple cells</font>

## 2.2. Examine the file you generated

Open the ``disease-disease.csv`` file in a spreadsheet program to make sure you generated it correctly.

Check in particular that the number of genes is correct and the number of genes in common is correct.
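
Some of these checks can also be automated. The sketch below reads an invented two-row sample from memory and tests two properties that any correct file should satisfy; in practice you would pass `OUTPUT_DISEASEDISEASE_FILENAME` to `read_csv`. It does not replace inspecting a few rows by hand.

```python
import io

import pandas as pd

# Hypothetical file contents, for illustration only.
sample_tsv = io.StringIO(
    "disease1\tdisease2\tngenes1\tngenes2\tclass1\tclass2\tngcommon\n"
    "Disease A\tDisease B\t2\t1\tMetabolic\tCancer\t1\n"
    "Disease A\tDisease C\t2\t3\tMetabolic\tCancer\t2\n"
)
dd = pd.read_csv(sample_tsv, sep="\t")

# Two diseases cannot share more genes than the smaller of the two gene lists.
too_many = dd[dd["ngcommon"] > dd[["ngenes1", "ngenes2"]].min(axis=1)]

# Every row in the file should share at least one gene.
no_common = dd[dd["ngcommon"] < 1]

print(len(too_many), len(no_common))
```

Both counts should be zero; any row that appears in `too_many` or `no_common` points to a bug in the code that generated the file.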

## 2.3. Import this file into Cytoscape

To import this file remember to select the advanced options of the import, and select tab (and only tab) as the separator.

Now, for the columns you have to indicate their role, and rename the attribute columns so that Cytoscape knows when they are the same attribute.

1. `disease1`: Source Node
2. `disease2`: Target Node
3. `ngenes1`: Source Node Attribute - rename to "num_genes"
4. `ngenes2`: Target Node Attribute - rename to "num_genes"
5. `class1`: Source Node Attribute - rename to "class"
6. `class2`: Target Node Attribute - rename to "class"
7. `ngcommon`: Edge Attribute

**Warning:** If the network takes a long time to load or does not load in Cytoscape, it is very likely that you made a mistake during the generation of the graph. Double-check the output of your code to make sure that every pair of diseases you are including in the CSV file actually shares at least one gene.

## 2.4. Style and add simple annotations

Style the lines connecting nodes so that their thickness and color reflect the number of genes in common.

Color the nodes white by default, and use colors to represent the class of each disease (leave "Unclassified" and "multiple" as gray).

Add text annotations (secondary button > add > text annotation) to the first, second, and third largest connected components, with your observations (e.g., "The second largest component is dominated by diseases of type x"). Place the annotations next to the components they refer to (secondary button > edit > move annotation).

<font size="-1" color="gray">(Remove this cell when delivering.)</font>

<font size="+1" color="red">Q5. Save an image of the entire graph as disease-disease.png and replace this cell with `![Diseases graph](disease-disease.png)` to display your graph. **Include a color legend for disease type.**</font>

<font size="+1" color="red">Q6. Create a graph with the largest connected component and save its image as disease-disease-largest-cc.png and replace this cell with `![Largest connected component of diseases graph](disease-disease-largest-cc.png)` to display your graph. **Include a color legend for disease type.**</font>

<font size="+1" color="red">Q7. Replace this cell with a brief commentary, in your own words, of what you see in this graph. What interesting observations can you make about this graph?</font>

# DELIVER (individually)

Deliver a zip file containing:

* This notebook
* The ``diseasome.csv`` and ``diseasome.png`` files
* The ``disease-disease.csv`` and ``disease-disease.png`` files

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>
