#### Deploying a new version of swissprot or uniprot

The module is now part of the pyproteinsExt package.

It can be used to
* build a new database
* query an existing database
    * CLI for easy threading
    * object mode for inspection in a Jupyter environment

Database building still needs to be documented.




In [1]:
import sys, re
%load_ext autoreload

# Development library
sys.path.append("/Users/guillaumelaunay/work/DVL/python3/pyproteinsExt/src")
sys.path.append("/Users/guillaumelaunay/work/DVL/python3/pyproteins/src")

sprotFile="/Users/guillaumelaunay/work/databases/uniprot_sprot_2018_11.fasta"
workDir="/Users/guillaumelaunay/tmp/FS_sprot"

In [2]:
import pyproteinsExt.psicquic as psq
import pyproteinsExt.biogrid as bg
import pyproteinsExt.database.uniprotFastaFS as DB
%autoreload 2

Acknowledged 4344 entries (/Users/guillaumelaunay/work/data/pfam)


In [3]:
biogridMapper=bg.BIOGRIDMAPPER()
biogridMapper.read("/Users/guillaumelaunay/work/databases/biogrid/UNIPROT.tab.txt")

# Building interaction source

## Resulting mitab
The merged file can be found at
`/Users/guillaumelaunay/work/databases/psicquicCache/merged_uniprot_safe.mitab`

It is required for steps 2 and later of [divisomeFactory](https://github.com/glaunay/divisomeFactory)

### Using sources
As of November 2018

For all but biogrid, root folder location is **/Users/guillaumelaunay/work/databases/psicquicCache/**


| Provider | File location | Nb entries | Swiss-Prot safe | UniProt safe |
| --- | :---: | --- | --- | --- |
| intact **IN** |`intact_physical_association_2018_11.mitab` | 417220 | 232195 | 327353 |
| mint **MT** | `mint_physical_association_2018_11.mitab` | 124464 | 112210 | 121491 |
| matrixdb **MX** | `matrixdb_physical_association_2018_11.mitab` | 33334 | 28480 | 29573 |
| biogrid **BG**| `/Users/guillaumelaunay/work/databases/biogrid/3.5.166/BIOGRID-MV-Physical-3.5.166.mitab.txt`| 244547 | 231916 | 232233 |

NB: BioGRID was provided already curated by the database; the others were obtained from their PSICQUIC web services. MatrixDB records had to be filtered for "spoke expansion".
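
The spoke-expansion filtering mentioned above is not shown in this notebook. A minimal sketch, assuming that expanded records carry the literal text `spoke expansion` somewhere in the MITAB line; the helper name `drop_spoke_expanded` and the sample lines are hypothetical:

```python
def drop_spoke_expanded(lines):
    """Yield MITAB lines that do not mention spoke expansion.

    Assumption: expanded binary records are tagged with the text
    'spoke expansion' (case-insensitive) somewhere in the line.
    """
    for line in lines:
        if "spoke expansion" not in line.lower():
            yield line


# Hypothetical, heavily shortened records for illustration only
sample = [
    "uniprotkb:idA1\tuniprotkb:idB1\t-",
    "uniprotkb:idA2\tuniprotkb:idB2\tpsi-mi:(spoke expansion)",
]
kept = list(drop_spoke_expanded(sample))
print(len(sample), "->", len(kept))
```

Matching on the whole line avoids depending on a specific MITAB column layout, at the cost of being slightly less strict.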

### Reading in the mitab sources
1. Count total
2. Discard any interaction not supported by a pair of up-to-date UniProt identifiers, i.e. identifiers present in the Swiss-Prot/UniProt 2018_11 release
3. Count valid

 * Predicate for a valid UniProt identifier, **set to Swiss-Prot or UniProt**

In [5]:
DBpathSwissprot = "/Users/guillaumelaunay/tmp/FS_uniprot_v3"
DBpathTrembl = "/Users/guillaumelaunay/tmp/vTrembl"

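def validUniprotPredicate(psqData):
    # An interaction is valid when each of its two interactors has at least
    # one uniprotkb alias found in the Swiss-Prot or TrEMBL database
    ok = False
    for i in range(0,2):
        ok = False
        for alias in psqData.interactors[i]:
            # Only uniprotkb aliases are looked up
            if alias[0] != "uniprotkb:":
                continue
            if DB.exists(alias[1], DBpathSwissprot) or DB.exists(alias[1], DBpathTrembl):
                ok = True
                break

        # One unsupported interactor is enough to reject the interaction
        if not ok:
            break

    return ok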
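
To see how the predicate behaves without the on-disk databases, here is a self-contained sketch that swaps `DB.exists` for a lookup in an in-memory set. The `Record` class, the `KNOWN` set and the `valid_pair` name are hypothetical stand-ins and not part of pyproteinsExt:

```python
from collections import namedtuple

# Hypothetical stand-in for a PSICQUIC record: two alias lists
Record = namedtuple("Record", ["interactors"])

# Hypothetical set of accessions treated as "present in the database"
KNOWN = {"Q15291", "Q921F2"}


def valid_pair(record, known=KNOWN):
    """True when each interactor has at least one known uniprotkb alias."""
    return all(
        any(prefix == "uniprotkb:" and acc in known for prefix, acc in aliases)
        for aliases in record.interactors
    )


good = Record(([("intact:", "EBI-592823"), ("uniprotkb:", "Q15291")],
               [("uniprotkb:", "Q921F2")]))
bad = Record(([("intact:", "EBI-592823")],
              [("uniprotkb:", "Q921F2")]))

print(valid_pair(good), valid_pair(bad))
```

The `all(any(...))` form expresses the same rule as the nested loops: both sides of the interaction need a supported identifier.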


#### Intact

In [6]:
psqObj_IN = psq.PSICQUIC(offLine=True)
psqObj_IN.read("/Users/guillaumelaunay/work/databases/psicquicCache/intact_physical_association_2018_11.mitab")
print(len(psqObj_IN))

417220


In [7]:
psqObjRedux_IN = psqObj_IN.filter(predicate=validUniprotPredicate)
print(len(psqObjRedux_IN))

327353


#### biogrid

In [8]:
psqObj_BG = psq.PSICQUIC(offLine=True)
psqObj_BG.read("/Users/guillaumelaunay/work/databases/biogrid/3.5.166/BIOGRID-MV-Physical-3.5.166.mitab.txt")
print(len(psqObj_BG))

psqObj_BG.convert(biogridMapper)

244546


In [9]:
psqObjRedux_BG = psqObj_BG.filter(predicate=validUniprotPredicate)
print(len(psqObjRedux_BG))

232233


#### matrixDB

In [10]:
psqObj_MX = psq.PSICQUIC(offLine=True)
psqObj_MX.read("/Users/guillaumelaunay/work/databases/psicquicCache/matrixdb_physical_association_2018_11.mitab")
print(len(psqObj_MX))

33334


In [11]:
psqObjRedux_MX = psqObj_MX.filter(predicate=validUniprotPredicate)
print(len(psqObjRedux_MX))

29573


#### Mint

In [12]:
psqObj_MT = psq.PSICQUIC(offLine=True)
psqObj_MT.read("/Users/guillaumelaunay/work/databases/psicquicCache/mint_physical_association_2018_11.mitab")
print(len(psqObj_MT))

124464


In [13]:
psqObjRedux_MT = psqObj_MT.filter(predicate=validUniprotPredicate)
print(len(psqObjRedux_MT))

121491
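
The counts printed in the cells above can be summarized as retention rates. This small sketch only reuses the numbers from those outputs; the dictionary layout is an illustrative choice:

```python
# (total, uniprot-safe) counts copied from the cell outputs above
counts = {
    "intact": (417220, 327353),
    "biogrid": (244546, 232233),
    "matrixdb": (33334, 29573),
    "mint": (124464, 121491),
}

for provider, (total, kept) in counts.items():
    print(f"{provider:9s} {kept:7d}/{total:7d} kept ({100 * kept / total:.1f}%)")

# Number of records that remain after filtering, all providers combined
kept_total = sum(kept for _, kept in counts.values())
print("records kept overall:", kept_total)
```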


#### MITAB normalizing
##### Applying a valid UniProt identifier to both interactors
Probably not needed since the filter was already applied, but some databases may put an internal identifier in the main interactor slot.

Example given with the tweaked IntAct mitab file `"/Users/guillaumelaunay/work/databases/psicquicCache/intact_physical_association_2018_11.mitab.normExample"`

In [14]:
DBpath = "/Users/guillaumelaunay/tmp/FS_uniprot_v3"
DBpathTrembl = "/Users/guillaumelaunay/tmp/vTrembl"
# If the main identifier is not DB safe, swap it with the first alternative safe one
def normalizePsqByDB(psqObj, _DBpath):
    for psqData in psqObj:

        for i in range(0,2):
            # refLoad holds the main (first) identifier once it is known to be unsafe
            refLoad = None
            for alias in psqData.interactors[i]:
                if DB.exists(alias[1], _DBpath) or DB.exists(alias[1], DBpathTrembl):
                    if not refLoad:
                        # The main identifier is already safe: nothing to do
                        break
                    psqData.swapInteractor(to=alias[1])
                    print("swapping from " + str(refLoad) + " to " + alias[1])
                    break
                if not refLoad:
                    refLoad = alias[1]
            

**e.g.:** the main identifier `'intact:', 'EBI-592823'` of the first interactor of interaction _0_ is not Swiss-Prot safe, but an alternative one, `'uniprotkb:', 'Q15291'`, is

In [27]:
psqObj_dev = psq.PSICQUIC(offLine=True)
psqObj_dev.read("/Users/guillaumelaunay/work/databases/psicquicCache/intact_physical_association_2018_11.mitab.normExample")
psqObj_dev[0].interactors

([('intact:', 'EBI-592823'),
  ('uniprotkb:', 'Q15291'),
  ('uniprotkb:', 'A8K272'),
  ('uniprotkb:', 'Q7Z6D8'),
  ('uniprotkb:', 'Q8NDZ7'),
  ('dip:', 'DIP-29224N')],
 [('uniprotkb:', 'Q921F2'),
  ('intact:', 'EBI-6876933'),
  ('uniprotkb:', 'Q3U591'),
  ('uniprotkb:', 'Q3V0E7')])
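
For orientation, the nested structure shown above can be derived from the first four MITAB columns (identifier A, identifier B, alternative identifiers A, alternative identifiers B). The following is a sketch under that assumption; `parse_interactors` is a hypothetical helper and the actual pyproteinsExt parser may differ:

```python
def parse_interactors(mitab_line):
    """Return ([aliases of A], [aliases of B]) as (prefix, value) tuples."""
    cols = mitab_line.rstrip("\n").split("\t")

    def split_ids(field):
        pairs = []
        for item in field.split("|"):
            if item in ("", "-"):  # '-' marks an empty MITAB field
                continue
            prefix, _, value = item.partition(":")
            pairs.append((prefix + ":", value))
        return pairs

    # Main identifier first, followed by the alternative identifiers
    return (split_ids(cols[0]) + split_ids(cols[2]),
            split_ids(cols[1]) + split_ids(cols[3]))


# Shortened line built from identifiers shown in this notebook
line = ("intact:EBI-592823\tuniprotkb:Q921F2\t"
        "uniprotkb:Q15291|dip:DIP-29224N\tintact:EBI-6876933")
a, b = parse_interactors(line)
print(a)
print(b)
```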

In [16]:
normalizePsqByDB(psqObj_dev, DBpath)
psqObj_dev[0].interactors

swapping from EBI-592823 to Q15291


([('uniprotkb:', 'Q15291'),
  ('intact:', 'EBI-592823'),
  ('uniprotkb:', 'A8K272'),
  ('uniprotkb:', 'Q7Z6D8'),
  ('uniprotkb:', 'Q8NDZ7'),
  ('dip:', 'DIP-29224N')],
 [('uniprotkb:', 'Q921F2'),
  ('intact:', 'EBI-6876933'),
  ('uniprotkb:', 'Q3U591'),
  ('uniprotkb:', 'Q3V0E7')])
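
The effect of the swap can be reproduced on plain lists. In this sketch `promote_first_safe` and the `safe` set are hypothetical; the real `swapInteractor` method may reorder aliases differently:

```python
def promote_first_safe(aliases, safe):
    """Swap the main identifier with the first alias found in `safe`.

    The list is modified in place; nothing happens when the main
    identifier is already safe or when no alias is safe.
    """
    for idx, (_, value) in enumerate(aliases):
        if value in safe:
            if idx > 0:
                aliases[0], aliases[idx] = aliases[idx], aliases[0]
            return aliases
    return aliases


safe = {"Q15291", "Q921F2"}  # hypothetical database content
interactor_a = [("intact:", "EBI-592823"), ("uniprotkb:", "Q15291"),
                ("uniprotkb:", "A8K272")]
promote_first_safe(interactor_a, safe)
print(interactor_a[0])
```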

#### Normalization of actual datasets

In [15]:
print("Normalizing IntAct dataset")
normalizePsqByDB(psqObjRedux_IN, DBpath)
print("Normalizing Mint dataset")
normalizePsqByDB(psqObjRedux_MT, DBpath)
print("Normalizing biogrid dataset")
normalizePsqByDB(psqObjRedux_BG, DBpath)
print("Normalizing matrixDB  dataset")
normalizePsqByDB(psqObjRedux_MX, DBpath)



Normalizing IntAct dataset
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P30042 to P0DPI2
swapping from P30042 to P0DPI2
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P62158 to P0DP23
swapping from P25854 to P0DH96
swapping from P25854 to P0DH96
swapping from P25854 to P0DH96
swapping fro

#### Merge, make NR & dump

In [16]:
total_psqObj = psq.PSICQUIC(offLine=True)

for psqSet in [psqObjRedux_IN, psqObjRedux_MT, psqObjRedux_BG, psqObjRedux_MX]:
    total_psqObj.records += psqSet.records

print( len(total_psqObj) )
total_psqObj.makeNR()
print( len(total_psqObj) )

710650
637100
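
The first number is consistent with the four filtered sets: 327353 + 121491 + 232233 + 29573 = 710650. What `makeNR` treats as redundant is not documented here. The sketch below assumes one plausible criterion, namely that two records are redundant when they share the same unordered pair of identifiers and the same publication. `make_nr`, `pair_and_pmid` and the sample records are hypothetical:

```python
def make_nr(records, key):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique


def pair_and_pmid(rec):
    """Assumed redundancy key: unordered identifier pair plus publication."""
    a, b, pmid = rec
    return (frozenset((a, b)), pmid)


# Hypothetical (id A, id B, publication) records
merged = [
    ("P1", "P2", "pubmed:1"),
    ("P2", "P1", "pubmed:1"),  # same pair reversed, same paper
    ("P1", "P2", "pubmed:2"),  # same pair, different paper
]
nr = make_nr(merged, pair_and_pmid)
print(len(merged), "->", len(nr))
```

Passing the key as a function keeps the deduplication loop independent of whichever redundancy rule is actually wanted.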


In [17]:
total_psqObj.dump("/Users/guillaumelaunay/work/databases/psicquicCache/merged_uniprot_safe.mitab")

'100 first items...\nuniprotkb:Q38922\tuniprotkb:Q84K38\tintact:EBI-4472335|uniprotkb:A8MRA2\tintact:EBI-16882550\tpsi-mi:rab1b_arath(display_long)|uniprotkb:RABB1B(gene name)|psi-mi:RABB1B(display_short)|uniprotkb:GB2(gene name synonym)|uniprotkb:RAB2C(gene name synonym)|uniprotkb:Ras-related protein GB2(gene name synonym)|uniprotkb:Ras-related protein Rab2C(gene name synonym)|uniprotkb:F4B14.130(orf name)|uniprotkb:At4g35860(locus name)\tpsi-mi:q84k38_arath(display_long)|uniprotkb:MNF13.19(gene name)|psi-mi:MNF13.19(display_short)|uniprotkb:MNF13_19(gene name synonym)|uniprotkb:At5g40640(locus name)\tpsi-mi:"MI:0112"(ubiquitin reconstruction)\tJones et al. (2014)\timex:IM-26362|pubmed:24833385|doi:10.1126/science.1251358\ttaxid:3702(arath)|taxid:3702("Arabidopsis thaliana (Mouse-ear cress)")\ttaxid:3702(arath)|taxid:3702("Arabidopsis thaliana (Mouse-ear cress)")\tpsi-mi:"MI:0915"(physical association)\tpsi-mi:"MI:0469"(IntAct)\tintact:EBI-16994127|imex:IM-26362-12611\tintact-miscore:

##### TODO: API and/or construction of the FSv2 Swiss-Prot database

**Warning:** everything below is most probably deprecated.

In [None]:
import pyproteinsExt.database.uniprotFastaFS_2 as DB
%autoreload 2

sprotFile="/Users/guillaumelaunay/work/databases/uniprot_sprot_2018_11.fasta.gz"
workDir="/Users/guillaumelaunay/tmp/FS_uniprot_v3"


In [None]:
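import seaborn
#DB.exists("P98160U", workDir)
y=DB.stat(workDir)
# NB: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the closest replacement
seaborn.distplot([x[1] for x in y])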

Subset development

In [None]:
import pyproteinsExt.database.uniprotFastaFS_2 as DB
%autoreload 2

sprotFile="/Users/guillaumelaunay/work/databases/uniprot_sprot_2018_11.fasta.gz"
workDir="/Users/guillaumelaunay/tmp/FS_DVL3"

u = 0
for elem in DB.fileCrawl(sprotFile):
    #print("===>" + elem['id'])
    DB._insertID(elem['id'], workDir, 5)
    DB._load(workDir, elem['id'], elem['content'])
    u+=1
    if u == 10000:
        break
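
For readers without the library at hand, the loop above relies on an iterator that yields one entry at a time with an `id` and a `content` field. A rough sketch of such an iterator, assuming standard UniProt FASTA headers (`>db|ACCESSION|ENTRY_NAME ...`) and a gzip-compressed file; `fasta_crawl` is a hypothetical name and not the library's `fileCrawl`:

```python
import gzip
import os
import tempfile


def fasta_crawl(path):
    """Yield {'id': accession, 'content': full FASTA record} per entry."""
    acc, chunk = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                if acc is not None:
                    yield {"id": acc, "content": "".join(chunk)}
                # Header layout assumed: >db|ACCESSION|ENTRY_NAME description
                acc, chunk = line.split("|")[1], []
            chunk.append(line)
    if acc is not None:
        yield {"id": acc, "content": "".join(chunk)}


# Tiny made-up file to exercise the iterator
demo = ">sp|X00001|DEMO1_TEST first\nMKT\nLLV\n>sp|X00002|DEMO2_TEST second\nMAA\n"
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.fasta.gz")
with gzip.open(tmp_path, "wt") as out:
    out.write(demo)

entries = list(fasta_crawl(tmp_path))
print([e["id"] for e in entries])
```

Because it is a generator, only one record is held in memory at a time, which matters for files the size of a full UniProt release.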

In [None]:
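# NB: `bag` is not defined anywhere in this notebook; it presumably maps accessions to their FASTA content
DB.load(workDir,  'Q6GZU9',  bag['Q6GZU9'])
DB.load(workDir,  'Q6GZU7',  bag['Q6GZU7'])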


In [None]:
import pyproteinsExt.database.uniprotFastaFS_2 as DB
%autoreload 2

sprotFile="/Users/guillaumelaunay/work/databases/uniprot_sprot_2018_11.fasta.gz"
workDir="/Users/guillaumelaunay/tmp/FS_uniprot_v3"

DB.batchBuild(workDir, sprotFile)
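
The on-disk layout used by `batchBuild` is not described in this notebook. As a rough illustration of how a filesystem-backed identifier store can work, the sketch below nests one directory level per leading character of the accession, up to a fixed depth. Everything here (`entry_path`, `store`, `exists`, the depth of 3) is an assumption made for the example and not the library's scheme:

```python
import os
import tempfile


def entry_path(root, accession, depth=3):
    """Map an accession to root/<c1>/<c2>/<c3>/<accession>.fasta."""
    return os.path.join(root, *accession[:depth], accession + ".fasta")


def store(root, accession, content):
    """Write one record, creating the nested directories as needed."""
    path = entry_path(root, accession)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as handle:
        handle.write(content)


def exists(root, accession):
    # The check is a single filesystem lookup, with no index to load
    return os.path.isfile(entry_path(root, accession))


root = tempfile.mkdtemp()
store(root, "Q6GZU9", ">demo\nMKT\n")  # made-up record content
print(exists(root, "Q6GZU9"), exists(root, "Q6GZU7"))
```

Spreading entries over nested directories keeps each directory small, which is one reason such a layout suits many independent lookups run in parallel.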