# Gene, Drug, and Disease Table

Create a large table for profiling the expression of genes in tissues that correspond to known drug treatments.

    I. Create MAB dataframe: Drugs - Gene - Disease
    II. Create mapping between CIViC disease and tissue 
    III. Create combined table: Drug - Disease - Gene - Tissue - p-val - l2fc

In [110]:
import pandas as pd

import mygene
import urllib2
from bs4 import BeautifulSoup

from rnaseq_lib.utils import mkdir_p
from rnaseq_lib.civic import create_civic_drug_disease_dataframe
from rnaseq_lib.tissues import get_gene_map

from progressbar import ProgressBar
bar = ProgressBar()

## I. Create MAB Dataframe

In [26]:
mab = pd.read_html('https://en.wikipedia.org/wiki/List_of_therapeutic_monoclonal_antibodies')[0]

In [27]:
mab.columns = mab.iloc[0]
mab = mab.drop(0, axis=0)

In [29]:
mab.head()

|   | Name | Trade name | Type | Source | Target | Use |
|---|------|------------|------|--------|--------|-----|
| 1 | 3F8 |  | mab | mouse | GD2 ganglioside | neuroblastoma |
| 2 | 8H9[1] |  | mab | mouse | B7-H3 | neuroblastoma, sarcoma, metastatic brain cancers |
| 3 | Abagovomab[2] |  | mab | mouse | CA-125 (imitation) | ovarian cancer |
| 4 | Abciximab | ReoPro | Fab | chimeric | CD41 (integrin alpha-IIb) | platelet aggregation inhibitor |
| 5 | Abituzumab[3] |  | mab | humanized | CD51 | cancer |


Filter out rows whose **Source** is `mouse`.

In [32]:
mab = mab[mab.Source != 'mouse']
mab.head()

|   | Name | Trade name | Type | Source | Target | Use |
|---|------|------------|------|--------|--------|-----|
| 4 | Abciximab | ReoPro | Fab | chimeric | CD41 (integrin alpha-IIb) | platelet aggregation inhibitor |
| 5 | Abituzumab[3] |  | mab | humanized | CD51 | cancer |
| 6 | Abrilumab[4] |  | mab | human | integrin α4β7 | inflammatory bowel disease, ulcerative colitis... |
| 7 | Actoxumab[5] |  | mab | human | Clostridium difficile | Clostridium difficile colitis |
| 8 | Adalimumab | Humira | mab | human | TNF-α | Rheumatoid arthritis, Crohn's Disease, Plaque ... |


## I.a Process MAB Dataframe

There is no easy way to programmatically filter **Use** for cancer indications or to convert every **Target** to a gene name, but part of the work can be automated.

In [36]:
mkdir_p('MAB-processing')
mab.to_csv('MAB-processing/mab-raw.tsv', sep='\t', encoding='utf-8')

Manually filter **Use** for cancer types and save the result as `MAB-processing/mab-cancer.tsv`.

In [59]:
mab_cancer = pd.read_csv('MAB-processing/mab-cancer.tsv', index_col=0, sep='\t')
print 'Number of candidate cancer drugs: {}'.format(mab_cancer.shape[0])

Number of candidate cancer drugs: 156


The genes in the **Target** column aren't standardized. Let's first separate the valid gene names from the invalid ones, then use the MyGene API to fill in as many of the rest as possible.

Collect all names in `gene_map`, which includes both Ensembl (ENS) IDs and "real" gene names.

In [55]:
gene_map = get_gene_map()
valid_genes = set(gene_map.keys()) | set(gene_map.values())

Separate out all drugs whose **Target** is not a valid gene name.

In [85]:
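# Copy the slice so that columns can be added or overwritten later without a SettingWithCopyWarning
invalid_df = mab_cancer[~mab_cancer.Target.isin(valid_genes)].copy()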
print 'Number of drugs with invalid genes: {}'.format(invalid_df.shape[0])

Number of drugs with invalid genes: 102
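
As a minimal sketch of the split above, the following uses a toy gene map and a toy table. `toy_gene_map` and `toy_mab` are hypothetical stand-ins for the output of `get_gene_map()` and for `mab_cancer`, and we assume here that the map goes from Ensembl IDs to gene symbols.

```python
import pandas as pd

# Hypothetical stand-in for get_gene_map(); assumed to map Ensembl IDs to symbols
toy_gene_map = {'ENSG00000141736': 'ERBB2', 'ENSG00000146648': 'EGFR'}
valid = set(toy_gene_map.keys()) | set(toy_gene_map.values())

# Hypothetical stand-in for mab_cancer
toy_mab = pd.DataFrame({
    'Name': ['DrugA', 'DrugB', 'DrugC'],
    'Target': ['ERBB2', 'HER1', 'ENSG00000146648'],
})

# Boolean mask: True where Target is already a recognized name
is_valid = toy_mab.Target.isin(valid)
toy_valid = toy_mab[is_valid]
toy_invalid = toy_mab[~is_valid].copy()  # copy so later column edits are safe

print(list(toy_invalid.Name))
```

Only `DrugB` ends up in `toy_invalid`, because `HER1` is neither a key nor a value of the toy map.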


### Standardize Gene Names
Use the MyGene API to find valid gene names.
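
Before each lookup, the free-text **Target** entry has to be reduced to a single query term. The following is a minimal sketch of that cleaning step in isolation. The helper name `clean_target` and the example strings are hypothetical, and we assume the parenthesis rule is meant to skip names that start with `CD`.

```python
def clean_target(target):
    """Reduce a free-text Target entry to a single query term."""
    # Keep the parenthesized alias, except for CD-prefixed names
    if '(' in target and not target.startswith('CD'):
        target = target.split('(')[1].split(')')[0]
    # Drop an uncertainty marker such as "EGFR ?" by keeping the first word
    if '?' in target:
        target = target.split()[0]
    # For "A/B" entries, keep only the first alternative
    if '/' in target:
        target = target.split('/')[0]
    # Trimming whitespace is an addition for this sketch
    return target.strip()


# Hypothetical inputs shaped like entries in the Target column
examples = ['mucin (MUC1)', 'CD41 (integrin alpha-IIb)', 'EGFR ?', 'HER2/neu']
cleaned = [clean_target(t) for t in examples]
```

Keeping the cleaning in one function makes it easy to check individual entries by hand before spending API calls on them.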

In [None]:
mg = mygene.MyGeneInfo()

# Iterate through invalid genes
genes = []
for invalid_gene in invalid_df.Target:
    
    # Process / clean input: keep the parenthesized alias unless the name starts with 'CD'
    if '(' in invalid_gene and not invalid_gene.startswith('CD'):
        invalid_gene = invalid_gene.split('(')[1].split(')')[0]
    if '?' in invalid_gene:
        invalid_gene = invalid_gene.split()[0]
    if '/' in invalid_gene:
        invalid_gene = invalid_gene.split('/')[0]
    
    try:
        hits = mg.query(invalid_gene, fields='symbol,ensembl.gene')['hits']
    except KeyError:
        hits = []
    
    # Iterate through hits for gene
    gene = None
    for hit in hits:
        if hit['symbol'] in valid_genes:
            gene = hit['symbol']
        elif hit['symbol'].upper() in valid_genes:
            gene = hit['symbol'].upper()
        
        # If no matching symbol is found, look for the Ensembl gene ID
        else:
            try:
                if hit['ensembl']['gene'] in valid_genes:
                    gene = hit['ensembl']['gene']
            # KeyError: no 'ensembl' field; TypeError: 'ensembl' is a list of entries
            except (KeyError, TypeError):
                pass
        
        # If we've found a match, break loop
        if gene:
            genes.append(gene)
            break
    
    # No gene found
    if not gene:
        genes.append(None)

print '\nMapped {} genes'.format(len([x for x in genes if x is not None]))

# Add mapped genes to dataframe, save original names
invalid_df['Original target'] = invalid_df['Target']
invalid_df['Target'] = genes
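
The hit-selection logic in the loop above can be checked without network access by running it on hand-written hits. This is a sketch only: `pick_gene` and the example hits are hypothetical, and the handling of a list-valued `ensembl` field is an assumption about how multiple Ensembl matches are returned.

```python
def pick_gene(hits, valid_genes):
    """Return the first name among the hits that is in valid_genes, else None."""
    for hit in hits:
        symbol = hit.get('symbol', '')
        if symbol in valid_genes:
            return symbol
        if symbol.upper() in valid_genes:
            return symbol.upper()
        # 'ensembl' may be absent, a dict, or (assumed) a list of dicts
        ensembl = hit.get('ensembl', [])
        if isinstance(ensembl, dict):
            ensembl = [ensembl]
        for entry in ensembl:
            if entry.get('gene') in valid_genes:
                return entry['gene']
    return None


# Hand-written hits shaped like the 'hits' list used in the loop above
example_hits = [
    {'symbol': 'Egfr'},  # matches only after upper-casing
    {'symbol': 'XYZ', 'ensembl': {'gene': 'ENSG_TOY_1'}},
    {'symbol': 'ABC', 'ensembl': [{'gene': 'ENSG_TOY_2'}, {'gene': 'ENSG_TOY_3'}]},
]

by_symbol = pick_gene(example_hits, {'EGFR'})
by_ensembl = pick_gene(example_hits, {'ENSG_TOY_3'})
no_match = pick_gene(example_hits, {'TP53'})
```

Returning `None` for an unmatched target keeps the result list aligned with the rows of `invalid_df`, which is what the column assignment relies on.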

Recombine the valid and invalid DataFrames, sort by **Target** and **Name**, and write the result to disk.

In [106]:
valid_df = mab_cancer[mab_cancer.Target.isin(valid_genes)]
concat = pd.concat([valid_df, invalid_df], axis=0).sort_values(['Target', 'Name']).reset_index(drop=True)
concat.to_csv('MAB-processing/mab-cancer-mapped-1st-pass.tsv', sep='\t', encoding='utf-8')

The last few unmapped genes will be mapped manually or removed.

In [109]:
len(concat[concat.Target.isnull()])

19
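
For the manual pass, it can help to bring the unmapped rows to the top of the table. A minimal sketch on a toy table (the rows are hypothetical), using the `na_position` argument of `DataFrame.sort_values`:

```python
import pandas as pd

# Hypothetical rows; a missing Target marks a drug that still needs mapping
toy = pd.DataFrame({
    'Name': ['DrugA', 'DrugB', 'DrugC'],
    'Target': ['ERBB2', None, 'EGFR'],
})

# Rows with a missing Target come first, then the rest sorted by Target and Name
review = toy.sort_values(['Target', 'Name'], na_position='first').reset_index(drop=True)

# Count of rows that still need a manual mapping
n_unmapped = int(review.Target.isnull().sum())
```

By default `sort_values` places missing values last, so this only changes where the rows to review appear, not which rows are present.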

### Scrape to Confirm Drug Target

We'll scrape Wikipedia for each drug's listed target to confirm that our automated gene selections match.

In [None]:
base = 'https://en.wikipedia.org/wiki/'

for drug in invalid_df.Name:
    
    # Strip footnote markers such as '[3]' from the drug name
    drug = drug.split('[')[0]
    
    # Look for wiki page
    try:
        page = urllib2.urlopen(base + drug)
    except urllib2.HTTPError:
        print 'Page not found for: {}'.format(drug)
        continue
    
    # Parse page
    soup = BeautifulSoup(page, 'html.parser')
    
    # Look for the infobox table
    name_box = soup.find('table', {'class': 'infobox'})
    
    if not name_box:
        print 'No table found for {}'.format(drug)
        continue
    
    # Look for the 'Target' label; the next line should be the drug's target
    lines = name_box.text.strip().split('\n')
    if 'Target' in lines:
        print drug, lines[lines.index('Target') + 1]
    else:
        print '{} has no listed Target'.format(drug)
    
    # Pause until the user presses Enter
    raw_input('Press Enter to continue')
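
The target lookup in the loop above can be isolated into a small helper and tried on plain text, with no page fetch. This is a sketch under assumptions: `find_target` and the example text are hypothetical, and skipping blank lines after the label is an addition that the loop does not perform.

```python
def find_target(infobox_text):
    """Return the first non-blank line after a 'Target' line, or None."""
    lines = [line.strip() for line in infobox_text.split('\n')]
    # Require a line that is exactly 'Target', not just a substring match
    if 'Target' not in lines:
        return None
    idx = lines.index('Target')
    # Skip blank lines between the label and its value
    for line in lines[idx + 1:]:
        if line:
            return line
    return None


# Hypothetical infobox text, shaped like name_box.text in the loop above
example_text = 'Monoclonal antibody\nType\nWhole antibody\nTarget\n\nEGFR\n'
found = find_target(example_text)
```

Matching on a whole line rather than a substring avoids false positives from words such as "Targeted" elsewhere in the infobox.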

This process corrected several misidentified genes. Next, we'll see whether a similar approach can recover cancer type and tissue information.

### Scrape to Find Drug Use and Tissue Match

## II. Create CIViC Disease and Tissue Mapping