# Download MSigDB Gene Sets

**Gregory Way 2018**

**Modified scripts originally written by Daniel Himmelstein (@dhimmel)**

_Most_ MSigDB gene sets (_version 6.1_) are now licensed under `CC BY 4.0` (the exceptions are KEGG, BioCarta, and AAAS/STKE). This notebook downloads and processes them.

In [1]:
import csv
import pandas as pd

In [2]:
# MSigDB version
version = '6.1'

In [3]:
# Download full MSigDB matrix
# NOTE - This file is not added to the repository because it contains
# gene sets with restrictive licenses
url_prefix = 'https://www.broadinstitute.org/gsea/resources/msigdb/'
url = '{}{}/msigdb.v{}.symbols.gmt'.format(url_prefix, version, version)
! wget --timestamping --no-verbose --directory-prefix 'data' $url

Last-modified header missing -- time-stamps turned off.
2018-03-26 20:34:00 URL:http://software.broadinstitute.org/gsea/resources/msigdb/6.1/msigdb.v6.1.symbols.gmt [17437810/17437810] -> "data/msigdb.v6.1.symbols.gmt" [1]
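
The downloaded `.gmt` file is tab-delimited: each row holds a gene set name, a description field (often a URL), and then a variable number of gene symbols. As a minimal sketch of how such a file could be read with the `csv` module imported above, the hypothetical helper `read_gmt` below yields one gene set per row. The helper name and the example rows are illustrative assumptions, not part of the original scripts or of MSigDB.

```python
import csv
import io


def read_gmt(handle):
    """Yield (name, description, genes) for each row of a GMT file.

    `read_gmt` is a hypothetical helper written for illustration only.
    """
    # QUOTE_NONE keeps any quote characters in the fields untouched
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip rows that lack a name, a description, and at least one gene
        if len(row) < 3:
            continue
        name, description, *genes = row
        # Drop empty fields left by trailing tabs
        genes = [gene for gene in genes if gene]
        yield name, description, genes


# Tiny in-memory example with made-up rows (not real MSigDB content)
example = io.StringIO(
    "SET_A\thttp://example.org/a\tGENE1\tGENE2\n"
    "SET_B\thttp://example.org/b\tGENE3\n"
)
gene_sets = {name: genes for name, _, genes in read_gmt(example)}
print(gene_sets)
```

To apply the same sketch to the real download, the `io.StringIO` object could be swapped for a file handle opened on `data/msigdb.v6.1.symbols.gmt`.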


In [4]:
# Many of the gene set collections contain sub-collections - download these as well
msigdb_dict = {
    'c1.all': 'positional gene sets',
    'c2.cgp': 'chemical and genetic perturbations',
    'c2.cp.reactome': 'Reactome gene sets',
    'c3.mir': 'microRNA targets',
    'c3.tft': 'transcription factor targets',
    'c4.cgn': 'cancer gene neighborhoods',
    'c4.cm': 'cancer modules',
    'c5.bp': 'GO biological processes',
    'c5.cc': 'GO cellular components',
    'c5.mf': 'GO molecular functions',
    'c6.all': 'oncogenic signatures',
    'c7.all': 'immunologic signatures'
}

for gene_set in msigdb_dict:
    url = '{}{}/{}.v{}.symbols.gmt'.format(url_prefix, version, gene_set, version)
    ! wget --timestamping --no-verbose --directory-prefix 'data' $url

Last-modified header missing -- time-stamps turned off.
2018-03-26 20:34:01 URL:http://software.broadinstitute.org/gsea/resources/msigdb/6.1/c1.all.v6.1.symbols.gmt [244078/244078] -> "data/c1.all.v6.1.symbols.gmt" [1]
Last-modified header missing -- time-stamps turned off.
2018-03-26 20:34:03 URL:http://software.broadinstitute.org/gsea/resources/msigdb/6.1/c2.cgp.v6.1.symbols.gmt [2734818/2734818] -> "data/c2.cgp.v6.1.symbols.gmt" [1]
Last-modified header missing -- time-stamps turned off.
2018-03-26 20:34:08 URL:http://software.broadinstitute.org/gsea/resources/msigdb/6.1/c2.cp.reactome.v6.1.symbols.gmt [330737/330737] -> "data/c2.cp.reactome.v6.1.symbols.gmt" [1]
Last-modified header missing -- time-stamps turned off.
2018-03-26 20:34:09 URL:http://software.broadinstitute.org/gsea/resources/msigdb/6.1/c3.mir.v6.1.symbols.gmt [234999/234999] -> "data/c3.mir.v6.1.symbols.gmt" [1]
Last-modified header missing -- time-stamps turned off.
2018-03-26 20:34:15 URL:http://software.broadinsti