# Downloading DMS scores from MAVEdb

This notebook uses the MAVEdb API to get a list of DMS score datasets on human proteins and to download the associated CSV files.

**MAVEdb**: <https://www.mavedb.org/>

## Get list of DMS score datasets

The API is queried with the `requests` module.

In [1]:
import requests
import os
import csv
import re
import pandas as pd
import numpy as np
import glob

In [2]:
scoresets = requests.get('https://www.mavedb.org/api/scoresets/')
assert scoresets.status_code == 200

In [3]:
scoresets = scoresets.json()

Now build a dictionary of scoresets, keyed by scoreset ID, that stores the UniProt identifier of each target.
* DMS experiments that targeted the gene promoter (as opposed to the coding sequence) have the field `'uniprot': None` and are skipped.

In [4]:
scoresets[0]


{'creation_date': '2019-08-07',
 'modification_date': '2019-08-09',
 'urn': 'urn:mavedb:00000040-a-4',
 'publish_date': '2019-08-07',
 'created_by': '0000-0003-1474-605X',
 'modified_by': '0000-0003-1474-605X',
 'extra_metadata': {},
 'abstract_text': 'This study measured the effect of variants in yeast HSP90 under different combinations of temperature (30C or 36C) and presence/absence of salt (0.5 M NaCl). The results explore the adaptive potential of this essential gene.',
 'method_text': "Sequencing reads were filtered based on a minimum Phred quality score of 20 across all 36 bases. For each time point, the log2 ratio of each variant's count to the wild type count was calculated. The score of each variant was calculated as the slope of these log ratios to time in wild type generations. Scores of -0.5 are considered null-like.",
 'short_description': 'Deep mutational scan of all single mutants in a nine-amino acid region of Hsp90 (Hsp82) in Saccharomyces cerevisiae at 36C with 0.5 M

In [5]:
# dictionary of scoreset ID and uniprot ID
scoresets_dict = dict()
for entry in scoresets:
    if entry['target']['uniprot'] is not None and \
        entry['target']['reference_maps'][0]['genome']['organism_name'] == 'Homo sapiens':
        scoreset_id = entry['target']['scoreset']
        scoresets_dict[ scoreset_id ] = {'uniprot': entry['target']['uniprot']['identifier'], \
                                         'offset': entry['target']['uniprot']['offset'], \
                                        'url': 'https://www.mavedb.org/scoreset/' + scoreset_id + '/scores/'}
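
The loop above assumes that every entry has a non-empty `reference_maps` list. A minimal sketch of a more defensive version of the same filter is shown below. `is_human_protein_target` is a hypothetical helper name, and the toy entries only mimic the fields read above, not the full API response.

```python
def is_human_protein_target(entry):
    """Return True if the entry has a UniProt ID and its first reference map is human."""
    target = entry.get('target') or {}
    if target.get('uniprot') is None:
        return False  # e.g. experiments that targeted a promoter
    ref_maps = target.get('reference_maps') or []
    if not ref_maps:
        return False  # no genome annotation, so the organism cannot be checked
    organism = (ref_maps[0].get('genome') or {}).get('organism_name')
    return organism == 'Homo sapiens'

# toy entries (hypothetical), containing only the fields the filter reads
toy_entries = [
    {'target': {'uniprot': {'identifier': 'P12931', 'offset': 269},
                'reference_maps': [{'genome': {'organism_name': 'Homo sapiens'}}]}},
    {'target': {'uniprot': None,
                'reference_maps': [{'genome': {'organism_name': 'Homo sapiens'}}]}},
    {'target': {'uniprot': {'identifier': 'TOY0001', 'offset': 0},
                'reference_maps': []}},
]
kept = [e for e in toy_entries if is_human_protein_target(e)]
```

Only the first toy entry passes: the second has no UniProt record and the third has no reference map.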

In [6]:
[val for key, val in scoresets_dict.items() if key in list(scoresets_dict.keys())[:5]]

[{'uniprot': 'P12931',
  'offset': 269,
  'url': 'https://www.mavedb.org/scoreset/urn:mavedb:00000041-a-1/scores/'},
 {'uniprot': 'P61073',
  'offset': 1,
  'url': 'https://www.mavedb.org/scoreset/urn:mavedb:00000048-a-1/scores/'},
 {'uniprot': 'P37840',
  'offset': 0,
  'url': 'https://www.mavedb.org/scoreset/urn:mavedb:00000045-c-1/scores/'},
 {'uniprot': 'P42898',
  'offset': 0,
  'url': 'https://www.mavedb.org/scoreset/urn:mavedb:00000049-a-2/scores/'},
 {'uniprot': 'Q9NV35',
  'offset': 0,
  'url': 'https://www.mavedb.org/scoreset/urn:mavedb:00000056-a-1/scores/'}]

Download the CSV files from these URLs.

In [7]:
for key, val in scoresets_dict.items():
    os.system('wget -O data/MAVEdb/' + key.replace(':', '_') + '.csv ' + val['url'])
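
The `wget` call above requires `wget` to be installed and the `data/MAVEdb/` directory to exist, and it does not report failed downloads. As a hedged alternative, the sketch below does the same job with the Python standard library only. `scoreset_csv_path` and `download_scoresets` are hypothetical helper names, and the 60-second timeout is an arbitrary assumption.

```python
import os
import urllib.request

def scoreset_csv_path(scoreset_id, out_dir='data/MAVEdb'):
    """Local CSV path for a scoreset URN (colons replaced, as in the wget call)."""
    return os.path.join(out_dir, scoreset_id.replace(':', '_') + '.csv')

def download_scoresets(scoresets_dict, out_dir='data/MAVEdb', timeout=60):
    """Download the scores CSV of each scoreset; return the IDs that failed."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for scoreset_id, val in scoresets_dict.items():
        try:
            with urllib.request.urlopen(val['url'], timeout=timeout) as resp:
                content = resp.read()
        except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
            failed.append(scoreset_id)
            continue
        with open(scoreset_csv_path(scoreset_id, out_dir), 'wb') as fh:
            fh.write(content)
    return failed
```

Calling `download_scoresets(scoresets_dict)` would write one CSV per scoreset and return the list of IDs to retry or inspect.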

In [8]:
# write out the mapping between uniprot and MAVE scoreset ID for future use
with open('data/MAVEdb/mapping.tsv', 'w', newline='') as outstream:
    wr = csv.writer(outstream, delimiter = '\t')
    wr.writerow(['scoreset_id', 'uniprot', 'numAA_offset'])
    for key, val in scoresets_dict.items():
        wr.writerow([key, val['uniprot'], val['offset']])
        

## Process the downloaded scoreset files

* Add a column with the UniProt ID of the entry
* Parse the AA change text (e.g. 'p.Leu308Pro') into separate WT AA, MUT AA and AA position fields, using one-letter amino acid notation
* Merge everything into a single CSV file
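
As a minimal illustration of the parsing step for the simplest case only (a single substitution), the sketch below splits an HGVS protein string with one regular expression. `THREE_TO_ONE`, `SUBST_RE` and `parse_single_substitution` are hypothetical names. Compound variants, deletions, insertions and frameshifts are not handled here and simply return `None`.

```python
import re

# three-letter to one-letter codes; 'TER' (stop) is mapped to 'X' as in this notebook
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C', 'GLN': 'Q',
    'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I', 'LEU': 'L', 'LYS': 'K',
    'MET': 'M', 'PHE': 'F', 'PRO': 'P', 'SER': 'S', 'THR': 'T', 'TRP': 'W',
    'TYR': 'Y', 'VAL': 'V', 'TER': 'X',
}

# 'p.' + three-letter WT + position + (three-letter MUT or '=' for synonymous)
SUBST_RE = re.compile(r'^p\.([A-Za-z]{3})(\d+)([A-Za-z]{3}|=)$')

def parse_single_substitution(hgvs_pro, offset=0):
    """Parse e.g. 'p.Leu308Pro' into (WT_AA, MUT_AA, AA_pos), or None if unparseable."""
    m = SUBST_RE.match(hgvs_pro)
    if m is None:
        return None
    wt3, pos, mut3 = m.groups()
    wt = THREE_TO_ONE.get(wt3.upper())
    mut = wt if mut3 == '=' else THREE_TO_ONE.get(mut3.upper())
    if wt is None or mut is None:
        return None
    return (wt, mut, int(pos) + offset)

print(parse_single_substitution('p.Leu308Pro'))            # ('L', 'P', 308)
print(parse_single_substitution('p.Leu308Pro', offset=2))  # ('L', 'P', 310)
```

The `offset` argument plays the same role as the `offset` stored per scoreset: it is added to the parsed position to shift it to UniProt numbering.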

In [9]:
def parseAAmut(mut_string, offset):
    """
    Given standard HGVS mutation notation (e.g. 'p.Leu308Pro'), give tuple of
    (WT_AA, MUT_AA, AA_pos) where WT_AA and MUT_AA are One-letter AA codes and AA_pos 
    is an integer (amino acid position of the mutation)
    
    Input:
      mut_string: standard HGVS mutation notation
      offset: int, number of amino acids to add to the AA pos to adjust it to uniprot numbering
    Output:
      tuple of (WT_AA, MUT_AA, AA_pos). semicolon-delimited if multiple AA positions are affected.
    """
    d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N', 
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 
     'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M', 'TER': 'X'}
    # if contain frameshift or variants that can't be mapped to AA sequence consequence, ignore this
    if re.search('fs', mut_string) is not None:
        return (None, None, None)
    if mut_string.find('?') != -1:
        return (None, None, None)
    if mut_string.find('*') != -1 and mut_string.find('del') != -1 and mut_string.find('ins') != -1:
        return (None, None, None)
    cmpd = re.search(r'(?<=p\.)\[.*\]', mut_string)  # raw string; the dot after 'p' is matched literally
    if cmpd is not None: # compound AA change in the form e.g. p.[Tyr19Cys;Asn22Asp;=]
        cmpd = str(cmpd.group(0)).strip('[]')
        cmpd = cmpd.split(';')
    else:
        cmpd = [mut_string]
    # remove synonymous
    cmpd = [i for i in cmpd if i not in ['p.=', '_wt', 'p.?', '_sy', '=']]
    cmpd = [i for i in cmpd if i.find('?') == -1 and i.find('*') == -1 and i.find('del') == -1 and i.find('ins') == -1]
    if len(cmpd) == 0:
        return (None, None, None)
    AA_pos = list()
    WT_AA = list()
    MUT_AA = list()    
    for ms in cmpd:
        pos_match = re.search(r'[0-9]+', ms)
        # WT residue follows 'p.' or, inside a compound change, starts the string
        wt_match = re.search(r'(?<=p\.)[A-Za-z]+', ms) or re.search(r'^[A-Za-z]+', ms)
        mut_match = re.search(r'[A-Za-z=]+\Z', ms)
        if pos_match is None or wt_match is None or mut_match is None:
            # report variants that cannot be parsed and skip them
            print(mut_string)
            return (None, None, None)
        wt = wt_match.group(0).upper()
        mut = mut_match.group(0).upper()
        if mut == '=':  # synonymous change written as e.g. 'Leu308='
            mut = wt
        if wt not in d or mut not in d:
            # report unrecognised amino acid codes and skip the variant
            print(ms)
            return (None, None, None)
        AA_pos.append( str(int(pos_match.group(0)) + offset) )
        WT_AA.append( d[wt] )
        MUT_AA.append( d[mut] )
    return (';'.join(WT_AA), ';'.join(MUT_AA), ';'.join(AA_pos))
    
# loop through CSV files and read as pd dataframes,
# - add additional columns with uniprot ID and scoreset ID
# - retain only the relevant columns
# - append dataframe to a new file
for scoreset_id, val in scoresets_dict.items():
    fn = scoreset_id.replace(':', '_')
    db = pd.read_csv('data/MAVEdb/' + fn + '.csv', header = 4)
    db = db[['hgvs_pro', 'score']].copy()  # copy so that adding columns does not act on a slice
    db['scoreset'] = scoreset_id
    db['uniprot'] = val['uniprot']
    AAchange = [ parseAAmut(str(mut), int(val['offset'])) for mut in list(db['hgvs_pro']) ] # process `hgvs_pro` column with parseAAmut
    db[ ['WT_AA', 'MUT_AA', 'AA_pos'] ] = pd.DataFrame(AAchange, index = db.index)
    db = db[ (db['WT_AA'] != db['MUT_AA']) & (db['MUT_AA'] != 'X') ]   # only missense changes, no silent or nonsense mutations
    db.to_csv('data/MAVEdb/human_proteins.csv', mode='a', index=False)
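
Appending with `mode='a'` writes a header row for every scoreset, and re-running the cell keeps adding to the same file. One possible alternative, sketched below under the assumption that all per-scoreset tables fit in memory, is to collect the dataframes in a list and write them once. `combine_scoresets` is a hypothetical helper name and the toy frames contain made-up values.

```python
import pandas as pd

def combine_scoresets(frames, out_csv=None):
    """Concatenate per-scoreset dataframes; optionally write one CSV with a single header."""
    combined = pd.concat(frames, ignore_index=True)
    if out_csv is not None:
        combined.to_csv(out_csv, index=False)
    return combined

# toy frames (hypothetical values) standing in for the per-scoreset tables
frame_a = pd.DataFrame({'hgvs_pro': ['p.Leu308Pro'],
                        'score': [0.5],
                        'scoreset': ['toy-a']})
frame_b = pd.DataFrame({'hgvs_pro': ['p.Tyr19Cys', 'p.Asn22Asp'],
                        'score': [-1.2, 0.1],
                        'scoreset': ['toy-b', 'toy-b']})
combined = combine_scoresets([frame_a, frame_b])
print(combined.shape)  # (3, 3)
```

In the loop above, each `db` would be appended to a list instead of being written to disk, and `combine_scoresets` would be called once after the loop.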

In [10]:
# clean up the written table

# remove those extra headers from appending tables into the same CSV file
os.system("head -1 data/MAVEdb/human_proteins.csv > data/MAVEdb/tmp")
os.system('awk \'$1 !~ /hgvs/{ print }\' data/MAVEdb/human_proteins.csv >> data/MAVEdb/tmp')
os.system('mv data/MAVEdb/tmp data/MAVEdb/human_proteins.csv')
# remove dubious AA changes like 'p.?', 'p.=' or '_wt', and any remaining 'Ter' (stop) changes
human_proteins = pd.read_csv('data/MAVEdb/human_proteins.csv')
print(human_proteins.shape)
bool1 = (human_proteins.hgvs_pro.str.find('?') == -1)         
bool2 = (human_proteins.hgvs_pro.str.find('=') == -1)           
bool3 = (human_proteins.hgvs_pro.str.find('wt') == -1)          
bool4 = (human_proteins.hgvs_pro.str.find('Ter') == -1)
human_proteins = human_proteins[bool1 & bool2 & bool3 & bool4]
human_proteins = human_proteins[~ human_proteins.WT_AA.isna()]
print(human_proteins.shape)


(1139670, 7)
(1108702, 7)


In [11]:
human_proteins.to_csv('data/MAVEdb/human_proteins.csv', index=False)

# Then replace the entries for urn_mavedb_00000049-a-1 with those from urn_mavedb_00000049-a-1_pub.csv;
# these values are taken from the publication, which summarises all experimental conditions in mavedb:00000049.

It remains to be determined whether and how to use these values. The trouble is that the definition of 'score' varies from one dataset to another: in some, a larger score means higher activity; in others, it means a more damaging variant. Some are scaled between 0 and 1, while others are centered around 0. It is doubtful whether a standard normalization would force them into similar (enough) distributions.
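
One way to inspect this is to summarise the scores per scoreset and to try a per-scoreset z-score. This is a minimal sketch on toy data (the values below are made up, not taken from MAVEdb). Note that z-scoring only aligns location and scale; it does not fix datasets where the score runs in the opposite direction.

```python
import pandas as pd

# toy data (hypothetical): one scoreset scaled 0..1, one centered around 0
toy = pd.DataFrame({
    'scoreset': ['toy-a'] * 4 + ['toy-b'] * 4,
    'score': [0.0, 0.25, 0.75, 1.0, -2.0, -1.0, 1.0, 2.0],
})

# per-scoreset range and median, to see how differently the scores are scaled
summary = toy.groupby('scoreset')['score'].agg(['min', 'median', 'max'])
print(summary)

# per-scoreset z-score: subtract the mean and divide by the standard deviation
toy['score_z'] = toy.groupby('scoreset')['score'].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)
```

Applied to the merged table, the same `groupby('scoreset')` summary would show which scoresets differ in scale before any decision on normalization is made.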

The `DMSexp_prepare_data` notebook processes the data further and turns this into a binary classification (damaging/neutral) problem.