# Introduction

This exercise makes use of the database you created in `Exercise02` and the BEL statement parsers you wrote with regular expressions in `Reading_searching_sending.ipynb`.

In [1]:
import pandas as pd
import os, json, re, time
time.asctime()

'Thu Oct  6 10:38:18 2016'

In [2]:
base = os.path.join(os.environ['BUG_FREE_EUREKA_BASE'])
base

'C:\\Users\\Ihsan Muchsin\\Documents\\GitHub\\bug-free-eureka'

# Task 1

This exercise is about loading the HGNC data to create a dictionary from HGNC symbols to sets of enzyme IDs.

## 1.1 Load Data

Load json data from `/data/exercise02/hgnc_complete_set.json`.

In [3]:
data_path = os.path.join(base, 'data', 'exercise02', 'hgnc_complete_set.json')
with open(data_path) as f:
    hgnc_json = json.load(f)

## 1.2 Reorganize Data into `pd.DataFrame`

Identify the relevant subdictionaries under `dictionary -> response -> docs`. Load them into a data frame,
then create a new data frame with just the HGNC symbol and enzyme ID.

In [4]:
docs = hgnc_json['response']['docs']

df_hgnc = pd.DataFrame(docs)
list(df_hgnc.columns)

['_version_',
 'alias_name',
 'alias_symbol',
 'bioparadigms_slc',
 'ccds_id',
 'cd',
 'cosmic',
 'date_approved_reserved',
 'date_modified',
 'date_name_changed',
 'date_symbol_changed',
 'ena',
 'ensembl_gene_id',
 'entrez_id',
 'enzyme_id',
 'gene_family',
 'gene_family_id',
 'hgnc_id',
 'homeodb',
 'horde_id',
 'imgt',
 'intermediate_filament_db',
 'iuphar',
 'kznf_gene_catalog',
 'lncrnadb',
 'location',
 'location_sortable',
 'locus_group',
 'locus_type',
 'lsdb',
 'mamit-trnadb',
 'merops',
 'mgd_id',
 'mirbase',
 'name',
 'omim_id',
 'orphanet',
 'prev_name',
 'prev_symbol',
 'pseudogene.org',
 'pubmed_id',
 'refseq_accession',
 'rgd_id',
 'snornabase',
 'status',
 'symbol',
 'ucsc_id',
 'uniprot_ids',
 'uuid',
 'vega_id']

## 1.3 Build dictionary for lookup

Iterate over this data frame to build a dictionary of the form `{hgnc symbol: set of enzyme IDs}`. Call this dictionary `symbol2ec`.

In [5]:
df_hgnc[['symbol', 'enzyme_id']].head(10)

|  | symbol | enzyme_id |
|---|---|---|
| 0 | A1BG |  |
| 1 | A1BG-AS1 |  |
| 2 | A1CF |  |
| 3 | A2M |  |
| 4 | A2M-AS1 |  |
| 5 | A2ML1 |  |
| 6 | A2ML1-AS1 |  |
| 7 | A2ML1-AS2 |  |
| 8 | A2MP1 |  |
| 9 | A3GALT2 |  |


In [6]:
symbol2ec = {}

df_hgnc_sliced = df_hgnc[['symbol', 'enzyme_id']]

# Rows without an EC annotation hold NaN rather than a list; map those to an empty set.
for idx, symbol, enzyme_ids in df_hgnc_sliced.itertuples():
    if isinstance(enzyme_ids, list):
        symbol2ec[symbol] = set(enzyme_ids)
    else:
        symbol2ec[symbol] = set()
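
To see the shape of the result without loading the full HGNC file, here is a minimal sketch that builds the same kind of lookup straight from a list of records. The records and the helper name `build_symbol2ec` are made up for illustration. The sketch assumes that a record without an EC annotation simply lacks the `enzyme_id` key.

```python
def build_symbol2ec(docs):
    """Map each symbol to a set of EC numbers (empty set when none are listed)."""
    mapping = {}
    for doc in docs:
        # .get() returns None when a record carries no 'enzyme_id' key.
        enzyme_ids = doc.get('enzyme_id') or []
        mapping[doc['symbol']] = set(enzyme_ids)
    return mapping

# Made-up records for illustration; not taken from the HGNC file.
toy_docs = [
    {'symbol': 'GENE1', 'enzyme_id': ['2.7.11.1']},
    {'symbol': 'GENE2'},
    {'symbol': 'GENE3', 'enzyme_id': ['1.1.1.1', '2.7.1.1']},
]
toy_symbol2ec = build_symbol2ec(toy_docs)
print(toy_symbol2ec['GENE1'])  # {'2.7.11.1'}
print(toy_symbol2ec['GENE2'])  # set()
```

Storing an empty set instead of `None` means membership tests and loops work without a separate check for missing annotations.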

# Task 2

This subexercise is about validating protein and kinase activity statements in BEL. Refer to last Thursday's work in `Reading_searching_sending.ipynb`.

## 2.1 Valid HGNC

Write a function, `valid_hgnc(hgnc_symbol, symbol2ec_instance)`, that takes a name and the dictionary from Task 1.3 and returns whether the name is a valid HGNC symbol.

In [7]:
def valid_hgnc(hgnc_symbol, symbol2ec_instance):
    return (hgnc_symbol in symbol2ec_instance)

assert valid_hgnc('AKT1', symbol2ec)
assert not valid_hgnc('TUKU', symbol2ec)

## 2.2 Valid Kinase Activity

Write a function, `valid_kinase(hgnc_symbol, symbol2ec_instance)` that takes a name and the dictionary from Task 1.3 and returns whether this protein has kinase activity. Hint: an enzyme code reference can be found [here](http://brenda-enzymes.org/ecexplorer.php?browser=1&f[nodes]=132&f[action]=open&f[change]=153)

In [8]:
def valid_kinase(hgnc_symbol, symbol2ec_instance):
    if not valid_hgnc(hgnc_symbol, symbol2ec_instance):
        return False

    # EC class 2.7 covers transferases of phosphorus-containing groups, which includes kinases.
    # Check every EC number before giving up, and compare fields as strings
    # so that partial codes such as '2.7.11.-' do not break the check.
    for enzyme_id in symbol2ec_instance[hgnc_symbol] or []:
        if isinstance(enzyme_id, str) and enzyme_id.split('.')[:2] == ['2', '7']:
            return True

    return False

print(valid_kinase('AKT1', symbol2ec))
print(valid_kinase('ASALAJA', symbol2ec))
print(valid_kinase('AADAT', symbol2ec))

True
False
False
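
The EC check can be tried in isolation. The following is a minimal sketch with a hypothetical helper, `ec_has_prefix`, that compares the leading fields of an EC number against a prefix. It assumes EC numbers are dot-separated strings, as in the `enzyme_id` column. Note that class 2.7 is broader than protein kinases, so this is an approximation of "has kinase activity".

```python
def ec_has_prefix(ec_number, prefix=('2', '7')):
    """Return True if the leading EC fields equal `prefix`."""
    fields = ec_number.split('.')
    # Compare whole fields rather than characters, so '12.7.1.1' is not
    # mistaken for a member of class 2.7.
    return tuple(fields[:len(prefix)]) == tuple(prefix)

print(ec_has_prefix('2.7.11.1'))  # True
print(ec_has_prefix('2.6.1.7'))   # False
print(ec_has_prefix('2.7.11.-'))  # True, partial codes are handled as strings
print(ec_has_prefix('12.7.1.1'))  # False (not a real EC class, only a string test)
```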


## 2.3 Putting it all together

Write a function, `validate_bel_term(term, symbol2ec_instance)`, that parses a BEL term describing either a protein or the kinase activity of a protein, and validates it.

```python
def validate_bel_term(term, symbol2ec_instance):
    pass
```

### Examples

```python
>>> # check that the proteins have valid HGNC codes
>>> validate_bel_term('p(HGNC:APP)', symbol2ec)
True
>>> validate_bel_term('p(HGNC:ABCDEF)', symbol2ec)
False
>>> # check that kinase activity annotations are only on proteins that are
>>> # actually protein kinases (hint: check EC annotation)
>>> validate_bel_term('kin(p(HGNC:APP))', symbol2ec)
False
>>> validate_bel_term('kin(p(HGNC:AKT1))', symbol2ec)
True
```

In [9]:
def validate_bel_term(term, symbol2ec_instance):
    # Raw strings keep the backslashes intact; fullmatch rejects trailing characters.
    # The name class allows hyphens because symbols such as A1BG-AS1 contain them.
    prot_pattern = re.compile(r'p\(([A-Z]+):([A-Za-z0-9-]+)\)')
    kin_pattern = re.compile(r'kin\(p\(([A-Z]+):([A-Za-z0-9-]+)\)\)')

    prot_match = prot_pattern.fullmatch(term)
    if prot_match:
        return valid_hgnc(prot_match.group(2), symbol2ec_instance)

    kin_match = kin_pattern.fullmatch(term)
    if kin_match:
        return valid_kinase(kin_match.group(2), symbol2ec_instance)

    return False

print(validate_bel_term('p(HGNC:APP)', symbol2ec))
print(validate_bel_term('p(HGNC:ABCDEF)', symbol2ec))
print(validate_bel_term('kin(p(HGNC:APP))', symbol2ec))
print(validate_bel_term('kin(p(HGNC:AKT1))', symbol2ec))

True
False
False
True
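
The parsing step can also be separated from the validation step. Below is a minimal sketch, assuming only the two term shapes used in this exercise (`p(...)` and `kin(p(...))`). The helper `parse_term` and the pattern names are hypothetical; the function only extracts the parts and leaves the dictionary lookup to the caller.

```python
import re

# Named groups make the extracted parts easier to read than positional indices.
PROTEIN_RE = re.compile(r'p\((?P<namespace>[A-Z]+):(?P<name>[A-Za-z0-9-]+)\)')
# Reuse the protein pattern inside the kinase activity wrapper.
KINASE_RE = re.compile(r'kin\(' + PROTEIN_RE.pattern + r'\)')

def parse_term(term):
    """Return (function, namespace, name), or None if the term is not recognised."""
    for function, pattern in (('p', PROTEIN_RE), ('kin', KINASE_RE)):
        match = pattern.fullmatch(term)
        if match:
            return function, match.group('namespace'), match.group('name')
    return None

print(parse_term('p(HGNC:APP)'))        # ('p', 'HGNC', 'APP')
print(parse_term('kin(p(HGNC:AKT1))'))  # ('kin', 'HGNC', 'AKT1')
print(parse_term('p(HGNC:APP)xyz'))     # None, trailing characters are rejected
```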


# Task 3

This task is about manual curation of text. You will be guided through translating the following text into BEL statements as strings within a Python list.

## Document Definitions

Recall citations are written with source, title, then identifier as follows:

```
SET Citation = {"PubMed", "Nat Cell Biol 2007 Mar 9(3) 316-23", "17277771"}
```
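
Since the statements in this task are plain Python strings, the citation line can be assembled from its three parts. This is a minimal sketch with a hypothetical helper, `format_citation`; it only formats the string in the order given above (source, title, identifier) and does not check the values.

```python
def format_citation(source, title, identifier):
    """Build a SET Citation line in the order source, title, identifier."""
    # Doubled braces produce literal { and } in str.format.
    return 'SET Citation = {{"{}", "{}", "{}"}}'.format(source, title, identifier)

print(format_citation('PubMed', 'Nat Cell Biol 2007 Mar 9(3) 316-23', '17277771'))
# SET Citation = {"PubMed", "Nat Cell Biol 2007 Mar 9(3) 316-23", "17277771"}
```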

Use these annotations and these namespaces:

```
DEFINE NAMESPACE HGNC AS URL "http://resource.belframework.org/belframework/20131211/namespace/hgnc-human-genes.belns"

DEFINE ANNOTATION CellLocation AS LIST {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}
```


## Source Text

> The following statements are from the document "BEL Exercise" in edition 00001 of the PyBEL Journal.
> The kinase activity of A causes the increased abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 in the cytoplasm, 
> but only the increased expression of AKT serine/threonine kinase 1 in the endoplasmic reticulum. 
> Additionally, the abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 were found to be positively correlated in the cell nuclei.
> AKT serine/threonine kinase 2 increases GSK3 Beta in all of the nuclei, cytoplasm, and ER.

In [10]:
definition_statements = [
    'SET DOCUMENT name = "BEL Exercise"',
    'DEFINE NAMESPACE HGNC AS URL "http://resource.belframework.org/belframework/20131211/namespace/hgnc-human-genes.belns"',
    'DEFINE ANNOTATION CellLocation AS LIST {"cell nucleus", "cytoplasm", "endoplasmic reticulum"}',
]

In [11]:
# hint: there should be 11 statements from this text
your_statements = [
    'SET Citation = {"PyBEL Journal", "BEL Exercise", "00001"}',
    'SET Evidence = "The following statements are from the document "BEL Exercise" in edition 00001 of the PyBEL Journal. The kinase activity of A causes the increased abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 in the cytoplasm, but only the increased expression of AKT serine/threonine kinase 1 in the endoplasmic reticulum. Additionally, the abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 were found to be positively correlated in the cell nuclei. AKT serine/threonine kinase 2 increases GSK3 Beta in all of the nuclei, cytoplasm, and ER"',
    'SET CellLocation = "cytoplasm"',
    'kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT1)',
    'kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT2)',
    'UNSET CellLocation',
    'SET CellLocation = "endoplasmic reticulum"',
    'kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT1)',
    'UNSET CellLocation',
    'SET CellLocation = "cell nucleus"',
    'p(HGNC:AKT1) positiveCorrelation p(HGNC:AKT2)',
    'UNSET CellLocation',
    'SET CellLocation = {"endoplasmic reticulum", "cell nucleus", "cytoplasm"}',
    'p(HGNC:AKT2) increases p(HGNC:GSK3B)',
    'UNSET CellLocation',
    'UNSET Evidence',
    'UNSET Citation'
]

In [12]:
statements = definition_statements + your_statements

In [13]:
for statement in statements:
    # Skip control lines so that only subject-predicate-object statements are split.
    if not statement.startswith(('SET', 'UNSET', 'DEFINE')):
        print(statement.split(' '))

['kin(p(HGNC:PIK3CA))', 'increases', 'p(HGNC:AKT1)']
['kin(p(HGNC:PIK3CA))', 'increases', 'p(HGNC:AKT2)']
['kin(p(HGNC:PIK3CA))', 'increases', 'p(HGNC:AKT1)']
['p(HGNC:AKT1)', 'positiveCorrelation', 'p(HGNC:AKT2)']
['p(HGNC:AKT2)', 'increases', 'p(HGNC:GSK3B)']
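
The `SET` and `UNSET` lines matter because they define the context of every relation that follows them. The sketch below illustrates that idea with a hypothetical helper, `annotate_relations`, which pairs each relation with the annotations active at that point. It is a simplification: it treats every value as a single quoted string, so list values and `SET DOCUMENT` lines are not handled specially.

```python
def annotate_relations(statements):
    """Pair each relation with a snapshot of the annotations active at that point."""
    context = {}
    annotated = []
    for statement in statements:
        if statement.startswith('DEFINE'):
            continue
        if statement.startswith('UNSET '):
            # Drop the annotation; ignore it if it was never set.
            context.pop(statement[len('UNSET '):].strip(), None)
        elif statement.startswith('SET '):
            key, _, value = statement[len('SET '):].partition('=')
            context[key.strip()] = value.strip().strip('"')
        else:
            # Copy the context so later SET/UNSET lines do not change this entry.
            annotated.append((statement, dict(context)))
    return annotated

# Made-up statements for illustration.
toy_statements = [
    'SET CellLocation = "cytoplasm"',
    'p(HGNC:AKT1) increases p(HGNC:AKT2)',
    'UNSET CellLocation',
    'p(HGNC:AKT2) increases p(HGNC:GSK3B)',
]
for relation, context in annotate_relations(toy_statements):
    print(relation, context)
# p(HGNC:AKT1) increases p(HGNC:AKT2) {'CellLocation': 'cytoplasm'}
# p(HGNC:AKT2) increases p(HGNC:GSK3B) {}
```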


# Task 4

This task is again about regular expressions. Return to `Reading_searching_sending.ipynb` and find your regular expressions that parse the subject, predicate, and object from a statement like `p(HGNC:AKT1) pos p(HGNC:AKT2)`.

## 4.1 Validating Statements

Write a function `validate_bel_statement(statement, symbol2ec)` that takes a subject-predicate-object BEL statement as a string and determines whether its subject and object are valid.

In [14]:
def validate_bel_statement(statement, symbol2ec):
    # Control lines (SET, UNSET, DEFINE) are not subject-predicate-object statements.
    if statement.startswith(('SET', 'UNSET', 'DEFINE')):
        return False
    elem_ls = statement.split(' ')
    if len(elem_ls) != 3:
        return False
    return validate_bel_term(elem_ls[0], symbol2ec) and validate_bel_term(elem_ls[2], symbol2ec)
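
The cell above splits on spaces. A regular expression, as the task suggests, could do the same job. The following is a minimal sketch with a hypothetical helper, `split_statement`, and it assumes that neither the terms nor the predicate contain whitespace.

```python
import re

# Three whitespace-free tokens separated by whitespace.
STATEMENT_RE = re.compile(r'(?P<subject>\S+)\s+(?P<predicate>\S+)\s+(?P<object>\S+)')

def split_statement(statement):
    """Return (subject, predicate, object), or None if the shape does not fit."""
    match = STATEMENT_RE.fullmatch(statement.strip())
    return match.group('subject', 'predicate', 'object') if match else None

print(split_statement('p(HGNC:AKT1) pos p(HGNC:AKT2)'))
# ('p(HGNC:AKT1)', 'pos', 'p(HGNC:AKT2)')
print(split_statement('UNSET CellLocation'))
# None
```

Because `fullmatch` requires exactly three tokens, lines with fewer or more tokens return `None` instead of raising an `IndexError`.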

## 4.2 Validating Your Statements

Run this cell to validate the BEL statements you've written.

In [15]:
for statement in your_statements:
    valid = validate_bel_statement(statement, symbol2ec)
    print('{} is {}valid'.format(statement, '' if valid else 'in'))

SET Citation = {"PyBEL Journal", "BEL Exercise", "00001"} is invalid
SET Evidence = "The following statements are from the document "BEL Exercise" in edition 00001 of the PyBEL Journal. The kinase activity of A causes the increased abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 in the cytoplasm, but only the increased expression of AKT serine/threonine kinase 1 in the endoplasmic reticulum. Additionally, the abundance of AKT serine/threonine kinase 1 and AKT serine/threonine kinase 2 were found to be positively correlated in the cell nuclei. AKT serine/threonine kinase 2 increases GSK3 Beta in all of the nuclei, cytoplasm, and ER" is invalid
SET CellLocation = "cytoplasm" is invalid
kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT1) is valid
kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT2) is valid
UNSET CellLocation is invalid
SET CellLocation = "endoplasmic reticulum" is invalid
kin(p(HGNC:PIK3CA)) increases p(HGNC:AKT1) is valid
UNSET CellLocation is invalid
SET CellLo

## 4.3 Visualization

Use `pybel` to visualize the network.

In [17]:
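try:
    import pybel
    import networkx as nx  # the drawing call below uses the nx alias

    g = pybel.from_bel(statements)
    nx.draw_spring(g, with_labels=True)
except ImportError:
    # Catch only missing packages so that other errors are not hidden.
    print('PyBEL not installed')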

PyBEL not installed
