# Data 620 - Project 2

Baron Curtin, Heather Geiger

## Introduction

In this assignment, we are asked to choose a 2-mode network dataset to simplify and describe.

Gene sets are one example of data that can be described as a 2-mode network: the genes form one mode and the sets they belong to form the other.

In human or model organism gene expression data, one often wants to find which genes are differentially expressed between two or more conditions.

If there is a particular single gene or short list of genes of interest, this data will be simple enough to work with.

Often, though, there are too many genes to review them all manually.

Finding the broader categories (like pathways) that these genes fit into makes the data much easier to interpret.

MSigDB, the Molecular Signatures Database, has organized a number of sets of genes to use for this type of analysis.

In this assignment, we will not be performing gene set enrichment analysis.

Rather, we will be exploring the reference used in this analysis (the gene set database) in greater detail.

We will use the GO (Gene Ontology) MF (molecular function) gene sets here.

## Dataset info

Each line in the tab-delimited gene set file describes one gene set.

The first column is the name of the gene set (e.g. "GO_CYCLIC_NUCLEOTIDE_PHOSPHODIESTERASE_ACTIVITY").

The second column is just a web link to more information on the gene set, and can be discarded.

It is followed by any number of additional fields, each containing the name of one gene in the gene set.
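
To make this layout concrete, here is a minimal sketch that splits one line of this form into its parts. The example line is invented for illustration (the set name, link, and gene symbols are placeholders, not real MSigDB entries), and `parse_gmt_line` is a hypothetical helper rather than part of any library.

```python
def parse_gmt_line(line):
    """Split one tab-delimited line into (set name, link, list of genes)."""
    fields = line.strip().split("\t")
    # Field 0 is the set name, field 1 the web link, the rest are genes.
    return fields[0], fields[1], fields[2:]

# Made-up example line: name, link, then one gene per field.
example_line = "EXAMPLE_GENE_SET\thttp://example.org/set\tGENE_A\tGENE_B\tGENE_C\n"

name, link, genes = parse_gmt_line(example_line)
print(name)
print(genes)
```

Because the number of gene fields varies from line to line, the file does not map cleanly onto a fixed-width table, which is why a dictionary of lists is a natural fit.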

The data is available for free to download from the MSigDB website.

Unfortunately, there are no links that work with download.file or similar commands like wget/curl.

Instead, you can paste the following link into a browser to start the download:

http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/7.0/c5.mf.v7.0.symbols.gmt

## Data loading and building the network

### Import libraries.

In [1]:
import numpy as np
import pandas as pd
import networkx as nx
import networkx.algorithms.bipartite as bipartite

### Read in and format data.

Read in the gene sets from the ".gmt" file.

For this, we borrow code from GSEApy (https://github.com/zqfang/GSEApy/blob/master/gseapy/parser.py).

This makes each gene set name a key in a dictionary.

Each value is the list of genes in that set.

In [23]:
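with open("c5.mf.v7.0.symbols.gmt") as genesets:
    # Split each line once: field 0 is the set name, field 1 the web link,
    # and the remaining fields are the genes in the set.
    split_lines = (line.strip().split("\t") for line in genesets)
    genesets_dict = {fields[0]: fields[2:] for fields in split_lines}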
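
Before building the network, it can help to sanity-check the parsed dictionary. The sketch below is one possible way to do that. `summarize_genesets` is a hypothetical helper, and `toy_genesets` is a made-up stand-in for `genesets_dict`, so the numbers it prints say nothing about the real MSigDB file.

```python
def summarize_genesets(genesets):
    """Return basic counts for a {gene set name: list of genes} dictionary."""
    sizes = [len(genes) for genes in genesets.values()]
    unique_genes = {gene for genes in genesets.values() for gene in genes}
    return {
        "n_gene_sets": len(genesets),
        "n_unique_genes": len(unique_genes),
        # default=0 avoids an error if the dictionary is empty.
        "min_set_size": min(sizes, default=0),
        "max_set_size": max(sizes, default=0),
    }

# Toy data (assumption): two sets that share one gene.
toy_genesets = {
    "SET_1": ["GENE_A", "GENE_B", "GENE_C"],
    "SET_2": ["GENE_B", "GENE_D"],
}

summary = summarize_genesets(toy_genesets)
print(summary)
```

Calling `summarize_genesets(genesets_dict)` on the real data would give the number of gene set nodes and gene nodes to expect in the graph.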

### Build network.

Add the gene sets as nodes to a network graph, tagging them with bipartite=0 to mark them as one of the two modes.

In [34]:
G = nx.Graph()
G.add_nodes_from(list(genesets_dict.keys()),bipartite=0)

Add the genes as nodes to the same graph, tagged with bipartite=1 as the other mode.

In [38]:
list_of_gene_lists = list(genesets_dict.values())
flattened_list_of_gene_lists = [item for sublist in list_of_gene_lists for item in sublist]
unique_genes = list(set(flattened_list_of_gene_lists))

G.add_nodes_from(unique_genes,bipartite=1)

Add edges between gene sets and genes by converting the dictionary to a list of (gene, gene set) tuples.

In [98]:
# Number of genes in each gene set.
gene_set_sizes = [len(genes) for genes in genesets_dict.values()]

# One (gene, gene set) tuple per membership.
gene_set_to_gene_tuples = [(gene, gene_set)
                           for gene_set, genes in genesets_dict.items()
                           for gene in genes]

G.add_edges_from(gene_set_to_gene_tuples)
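
Once the edges are added, it is worth confirming that the graph really is 2-mode. The following is a minimal sketch of such a check on a toy graph built the same way as above. `toy_genesets` and the graph `B` are made up for illustration, and the edge-count comparison assumes that no gene is listed twice within a set and that no gene shares a name with a gene set.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy data (assumption): two sets that share one gene.
toy_genesets = {
    "SET_1": ["GENE_A", "GENE_B", "GENE_C"],
    "SET_2": ["GENE_B", "GENE_D"],
}

# Build the graph the same way as above: sets are mode 0, genes are mode 1.
B = nx.Graph()
B.add_nodes_from(toy_genesets, bipartite=0)
B.add_nodes_from({g for genes in toy_genesets.values() for g in genes}, bipartite=1)
B.add_edges_from((gene, gene_set)
                 for gene_set, genes in toy_genesets.items()
                 for gene in genes)

# Recover the two modes from the node attribute.
set_nodes = {n for n, d in B.nodes(data=True) if d["bipartite"] == 0}
gene_nodes = set(B) - set_nodes

# The graph should be bipartite, and every edge should join a set to a gene.
is_two_mode = bipartite.is_bipartite(B)
every_edge_crosses = all((u in set_nodes) != (v in set_nodes) for u, v in B.edges())

# With no duplicate genes within a set, edges equal total memberships.
expected_edges = sum(len(genes) for genes in toy_genesets.values())

print(is_two_mode, every_edge_crosses)
print(len(set_nodes), len(gene_nodes), B.number_of_edges(), expected_edges)
```

The modes are read from the `bipartite` node attribute rather than inferred from the graph structure, since inference can be ambiguous when the graph is disconnected.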