In [1]:
import os

---

# CORUM

**2015 November 29-30**

Processing the downloaded CORUM file *CORUM_allComplexes.csv* into 2-column format (1<sup>st</sup> column &larr; complex name, 2<sup>nd</sup> column &larr; gene) with the script *proteincomplexformat.py*. The file lists complex members as UniProt IDs and, ostensibly, as Entrez IDs. Unfortunately, the Entrez column cannot be used directly, because CORUM maps some UniProt IDs to Entrez ID 0:

In [2]:
%%bash
grep ';SRp30c-SRp55 complex' /work/jyoung/DataDownload/CORUM_allComplexes.csv

6073;SRp30c-SRp55 complex;;Human;Q13247,Q13242;0,0;MI:0019- coimmunoprecipitation;15695522;11.04.03.01.10;"";"Tau exons 2 and 10, which are misregulated in neurodegenerative diseases, are partly regulated by silencers which bind a SRp30c-SRp55 complex that either recruits or antagonizes htra2beta1 (a variant of TRA2B).";""


<s>Plan to use the *corum()* function in *proteincomplexformat.py* to write IDs as UniProt and the Bioconductor R *org.Hs.eg.db* library to convert these to Entrez.</s> Unfortunately, *org.Hs.eg.db* only converts from Entrez to UniProt and not the other way around. Instead, write all the UniProt IDs from CORUM into a single file and have the [UniProt mapping service](http://www.uniprot.org/mapping/) perform the conversion. Note also that the CORUM file wraps some groups of IDs in parentheses:

In [3]:
%%bash
grep ';NK-3-Groucho complex;' /work/jyoung/DataDownload/CORUM_allComplexes.csv

3150;NK-3-Groucho complex;;Human;(Q99801,P78367),(Q04724,Q04725,Q04726,Q04727,Q9H808);(4824,579),(7088,7089,7090,7091,79816);MI:0096- pull down;10559189;11.02.03.04.03,70.10;"NK-3 homeodomain protein can associate wih the human Groucho homolog TLE in the absence of DNA. This interaction translocates Groucho proteins from the cytoplasm into the nucleus.";"";""


In [4]:
# Collect every UniProt ID listed for a human complex in the CORUM file.
os.chdir(os.path.join(os.sep, 'work', 'jyoung', 'DataDownload'))
uniprotIDs = set()
with open('CORUM_allComplexes.csv') as corumFile:
    header = corumFile.readline().rstrip().split(';')
    orgCol = header.index('organism')
    uniprotCol = header.index('subunits (UniProt IDs)')
    for line in corumFile:
        tokens = line.rstrip().split(';')
        if tokens[orgCol] == 'Human' and tokens[uniprotCol] != '':
            # Some ID groups are wrapped in parentheses; strip them before splitting.
            subunits = tokens[uniprotCol].replace('(', '').replace(')', '')
            uniprotIDs.update(subunits.split(','))

# Write one UniProt ID per line for upload to the UniProt mapping service.
os.chdir(os.path.join('..', 'DataProcessed'))
with open('CORUM_Human_UniProt.txt', 'w') as writeFile:
    for uid in uniprotIDs:
        writeFile.write(uid + '\n')

2,539 out of 2,555 identifiers from UniProtKB AC/ID were successfully mapped to 2,576 Entrez Gene (GeneID) IDs. The mapped IDs were saved as *CORUM_Hs_UniProt_mapped.tab* and the un-mapped IDs were saved into *CORUM_Hs_UniProt_notmapped.txt*. Any un-mapped IDs will be ignored when converting from UniProt to Entrez. The mapping is also not 1-to-1:

In [5]:
%%bash
wc -l /work/jyoung/DataProcessed/CORUM_Hs_UniProt_mapped.tab

2583 /work/jyoung/DataProcessed/CORUM_Hs_UniProt_mapped.tab


In [6]:
%%bash
cut -f1 /work/jyoung/DataProcessed/CORUM_Hs_UniProt_mapped.tab | sort | uniq | wc -l

2540


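The mapped file has 2,583 lines but only 2,540 distinct values in its first column, so some UniProt IDs correspond to more than one Entrez ID. Any UniProt ID with multiple corresponding Entrez IDs will be ignored and not converted.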

Modified *proteincomplexformat.py* to add a function that returns a dictionary converting UniProt IDs to Entrez IDs, and to process the CORUM file into 2-column format.
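The UniProt-to-Entrez dictionary could be built roughly as follows. This is only a sketch, not the code in *proteincomplexformat.py*: the name `build_uniprot_to_entrez` is hypothetical, and it assumes the mapped file is tab-separated with the UniProt ID in the first column, the Entrez ID in the second, and a single header row. UniProt IDs with more than one Entrez ID are dropped, as decided above.

```python
import csv
from collections import defaultdict


def build_uniprot_to_entrez(lines, has_header=True):
    """Return {UniProt ID: Entrez ID}, keeping only unambiguous mappings.

    `lines` is any iterable of tab-separated rows, e.g. an open file handle.
    Hypothetical helper; the column layout and header row are assumptions.
    """
    reader = csv.reader(lines, delimiter='\t')
    if has_header:
        next(reader, None)  # skip the assumed header row
    candidates = defaultdict(set)
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed rows
        candidates[row[0]].add(row[1])
    # Drop every UniProt ID that maps to more than one Entrez ID.
    return {uid: next(iter(ids))
            for uid, ids in candidates.items() if len(ids) == 1}
```

With an open handle on *CORUM_Hs_UniProt_mapped.tab*, the call would be `build_uniprot_to_entrez(handle)`. IDs listed in *CORUM_Hs_UniProt_notmapped.txt* never appear in the dictionary, so a lookup with `.get()` returns `None` for them.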

In [1]:
import proteincomplexformat

In [2]:
proteincomplexformat.corum('Human')

2-column output written as *CORUM_Human_Entrez.txt*.
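For reference, the per-complex conversion into 2-column rows might look like the sketch below. It is not the actual *corum()* implementation: `complex_to_rows` is a hypothetical name, and skipping duplicate Entrez IDs within one complex is a choice made for this sketch. Parentheses are stripped as in the earlier cell, and UniProt IDs missing from the dictionary (unmapped or ambiguous) are ignored.

```python
def complex_to_rows(name, subunit_field, uniprot_to_entrez):
    """Yield (complex name, Entrez ID) pairs for one CORUM complex.

    `subunit_field` is the raw UniProt subunit string from the CORUM file;
    `uniprot_to_entrez` is a plain dict of unambiguous mappings.
    Hypothetical helper for illustration only.
    """
    cleaned = subunit_field.replace('(', '').replace(')', '')
    seen = set()
    for uid in cleaned.split(','):
        entrez = uniprot_to_entrez.get(uid.strip())
        if entrez is None or entrez in seen:
            continue  # unmapped/ambiguous ID, or gene already emitted
        seen.add(entrez)
        yield name, entrez
```

Each yielded pair would then be written as one tab-separated line, for example `writeFile.write(name + '\t' + entrez + '\n')`.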