# Gene, Drug, and Disease Table for Profiling

Create a large table for profiling the expression of genes in tissues that correspond to known drug treatments.

    I.   Create MAB dataframe
    II.  Create CIViC dataframe
    III. Create combined table: Drug - Gene - Tissue - p-val - l2fc
    IV.  Bipartite Graphs

In [14]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import os

import pandas as pd
import numpy as np
import networkx as nx

import mygene
import urllib2
from bs4 import BeautifulSoup

from rnaseq_lib.utils import mkdir_p
from rnaseq_lib.civic import create_civic_drug_disease_dataframe
from rnaseq_lib.tissues import get_gene_map
from rnaseq_lib.tissues import identify_tissue_from_str, grep_cancer_terms
from rnaseq_lib.web import find_gene_given_alias

from progressbar import ProgressBar
bar = ProgressBar()

# I. Create MAB Dataframe

In [15]:
mab = pd.read_html('https://en.wikipedia.org/wiki/List_of_therapeutic_monoclonal_antibodies')[0]

In [16]:
mab.columns = mab.iloc[0]
mab = mab.drop(0, axis=0)

In [17]:
mab.head()

Unnamed: 0,Name,Trade name,Type,Source,Target,Use
1,3F8,,mab,mouse,GD2 ganglioside,neuroblastoma
2,8H9[1],,mab,mouse,B7-H3,"neuroblastoma, sarcoma, metastatic brain cancers"
3,Abagovomab[2],,mab,mouse,CA-125 (imitation),ovarian cancer
4,Abciximab,ReoPro,Fab,chimeric,CD41 (integrin alpha-IIb),platelet aggregation inhibitor
5,Abituzumab[3],,mab,humanized,CD51,cancer


In [20]:
mab[mab.Source == 'mouse']

Unnamed: 0,Name,Trade name,Type,Source,Target,Use
1,3F8,,mab,mouse,GD2 ganglioside,neuroblastoma
2,8H9[1],,mab,mouse,B7-H3,"neuroblastoma, sarcoma, metastatic brain cancers"
3,Abagovomab[2],,mab,mouse,CA-125 (imitation),ovarian cancer
12,Afelimomab,,F(ab')2,mouse,TNF-α,sepsis
18,Altumomab pentetate,Hybri-ceaker,mab,mouse,CEA,colorectal cancer (diagnosis)
20,Anatumomab mafenatox,,Fab,mouse,TAG-72,non-small cell lung carcinoma
25,Arcitumomab,CEA-Scan,Fab',mouse,CEA,gastrointestinal cancers (diagnosis)
36,Bectumomab,LymphoScan,Fab',mouse,CD22,non-Hodgkin's lymphoma (detection)
37,Begelomab[23],,mab,mouse,DPP4,?
41,Besilesomab[24],Scintimun,mab,mouse,CEA-related antigen,inflammatory lesions and metastases (detection)


Filter out antibodies with a mouse source

In [32]:
mab = mab[mab.Source != 'mouse']
mab.head()

Unnamed: 0,Name,Trade name,Type,Source,Target,Use
4,Abciximab,ReoPro,Fab,chimeric,CD41 (integrin alpha-IIb),platelet aggregation inhibitor
5,Abituzumab[3],,mab,humanized,CD51,cancer
6,Abrilumab[4],,mab,human,integrin α4β7,"inflammatory bowel disease, ulcerative colitis..."
7,Actoxumab[5],,mab,human,Clostridium difficile,Clostridium difficile colitis
8,Adalimumab,Humira,mab,human,TNF-α,"Rheumatoid arthritis, Crohn's Disease, Plaque ..."


## I.a Process MAB Dataframe

Filtering **Use** for cancer and converting **Target** to a gene name cannot be fully automated, but part of the work can be done programmatically.

In [36]:
mkdir_p('MAB-processing')
mab.to_csv('MAB-processing/mab-raw.tsv', sep='\t', encoding='utf-8')

Manually filter **Use** for cancer types, saving the result as `mab-cancer.tsv`

In [59]:
mab_cancer = pd.read_csv('MAB-processing/mab-cancer.tsv', index_col=0, sep='\t')
print 'Number of candidate cancer drugs: {}'.format(mab_cancer.shape[0])

Number of candidate cancer drugs: 156


The genes in the **Target** column aren't standardized. First separate the valid genes from the invalid ones, then use the MyGene API to fill in the rest.

Collect all genes in the gene_map, which includes both ENS names and "real" gene names

In [10]:
gene_map = get_gene_map()
valid_genes = set(gene_map.keys() + gene_map.values())

Separate out all drugs with invalid genes

In [85]:
# Copy the slice so its columns can be reassigned later without a SettingWithCopyWarning
invalid_df = mab_cancer[~mab_cancer.Target.isin(valid_genes)].copy()
print 'Number of drugs with invalid genes: {}'.format(invalid_df.shape[0])

Number of drugs with invalid genes: 102


### Standardize Gene Names
Use MyGene API to find valid gene names

In [None]:
mg = mygene.MyGeneInfo()

# Iterate through invalid genes
genes = []
for invalid_gene in invalid_df.Target:
    
    # Process / clean input
    # Use the parenthesized alias, except for CD markers, which are queried as-is
    if '(' in invalid_gene and not invalid_gene.startswith('CD'):
        invalid_gene = invalid_gene.split('(')[1].split(')')[0]
    if '?' in invalid_gene:
        invalid_gene = invalid_gene.split()[0]
    if '/' in invalid_gene:
        invalid_gene = invalid_gene.split('/')[0]
    
    try:
        hits = mg.query(invalid_gene, fields='symbol,ensembl.gene')['hits']
    except KeyError:
        hits = []
    
    # Iterate through hits for gene
    gene = None
    for hit in hits:
        if hit['symbol'] in valid_genes:
            gene = hit['symbol']
        elif hit['symbol'].upper() in valid_genes:
            gene = hit['symbol'].upper()
        
        # If no matching symbol is found, fall back to the Ensembl gene ID
        else: 
            try:
                if hit['ensembl']['gene'] in valid_genes:
                    gene = hit['ensembl']['gene']
            # 'ensembl' may be missing (KeyError) or a list of entries (TypeError)
            except (KeyError, TypeError):
                pass
        
        # If we've found a match, break loop
        if gene:
            genes.append(gene)
            break
    
    # No gene found
    if not gene:
        genes.append(None)

print '\nMapped {} genes'.format(len([x for x in genes if x is not None]))

# Add mapped genes to dataframe, save original names
invalid_df['Original target'] = invalid_df['Target']
invalid_df['Target'] = genes
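
The string clean-up at the top of the loop can be isolated and checked on its own. The sketch below is one reading of those rules, not the notebook's exact code: `clean_target` is a hypothetical helper name, the CD-marker exemption reflects an assumption about what the original condition intended, and all example strings except the CD41 entry are made up for illustration.

```python
def clean_target(target):
    """Reduce a free-text Target entry to a single query term."""
    # Prefer the parenthesized alias, except for CD markers (assumed intent)
    if '(' in target and not target.startswith('CD'):
        target = target.split('(')[1].split(')')[0]
    # Uncertain entries such as "X ?" keep only their first token
    if '?' in target:
        target = target.split()[0]
    # Slash-separated alternatives keep only the first option
    if '/' in target:
        target = target.split('/')[0]
    return target.strip()


# Illustrative inputs; only the CD41 string comes from the table above
examples = ['receptor (EGFR)', 'HER2/neu', 'VEGFR2 ?', 'CD41 (integrin alpha-IIb)']
cleaned = [clean_target(t) for t in examples]
```

Keeping the clean-up in a function makes it possible to test the rules against known entries before spending API calls on them.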

Recombine the DataFrames, sort by **Target** and **Name**, and write out the result

In [106]:
valid_df = mab_cancer[mab_cancer.Target.isin(valid_genes)]
concat = pd.concat([valid_df, invalid_df], axis=0).sort_values(['Target', 'Name']).reset_index(drop=True)
concat.to_csv('MAB-processing/mab-cancer-mapped-1st-pass.tsv', sep='\t', encoding='utf-8')

The last few unmapped genes will be mapped manually or removed

In [109]:
len(concat[concat.Target.isnull()])

19

### Scrape to Confirm Drug Target

We'll scrape Wikipedia for drug targets to confirm that our automated gene selections match.

In [None]:
base = 'https://en.wikipedia.org/wiki/'

for drug in invalid_df.Name:
    
    drug = drug.split('[')[0] if '[' in drug else drug
    
    # Look for wiki page
    try:
        page = urllib2.urlopen(base + drug)
    except urllib2.HTTPError:
        print 'Page not found for: {}'.format(drug)
        continue
    
    # Parse page
    soup = BeautifulSoup(page, 'html.parser')
    
    # Look for table
    name_box = soup.find('table', {'class': 'infobox'})
    
    if not name_box:
        print 'No table found for {}'.format(drug)
        continue
    
    # Look for 'Target'; the next item should be the drug's target
    name = name_box.text.strip()
    if 'Target' in name:
        print drug, name.split('\n')[name.split('\n').index('Target') + 1]
    else:
        print '{} has no listed Target'.format(drug)
    
    # Pause for manual review
    raw_input('Press Enter to continue')

This process corrected several genes that were originally mis-identified or misclassified.
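
The infobox lookup in the scraping loop above relies on the value appearing on the line after the `Target` label. A minimal sketch of that step, working on plain text so it can be tested offline, is shown here. `target_from_infobox_text` is a hypothetical helper, and the sample text is a made-up stand-in for `name_box.text`, not real scraped output.

```python
def target_from_infobox_text(text):
    """Return the line that follows a 'Target' label, or None if absent."""
    lines = [line.strip() for line in text.strip().split('\n')]
    # Require an exact 'Target' line; a substring match such as 'Targets'
    # would make lines.index('Target') raise ValueError
    if 'Target' not in lines:
        return None
    idx = lines.index('Target')
    # The label may be the last line, in which case there is no value
    return lines[idx + 1] if idx + 1 < len(lines) else None


# Made-up infobox text for illustration only
sample = 'Monoclonal antibody\nType\nWhole antibody\nSource\nHumanized\nTarget\nCD51\n'
found = target_from_infobox_text(sample)
```

Checking for an exact `Target` line avoids the case where the word only appears inside a longer line and the index lookup fails.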

### Scrape to Find Tissue Match for Each Drug

In [None]:
# Assumes `drugs` (an iterable of drug names) and the helpers
# `get_info_from_wiki` and `flatten` are defined elsewhere
import time

out = []
for name in drugs:
    print name
    # Avoid possible request limit
    time.sleep(1)
    name = name.split('[')[0] if '[' in name else name
    name = '_'.join(name.split()) if ' ' in name else name

    # Scrape Wikipedia
    info = None
    try:
        info = get_info_from_wiki(name)
    except Exception:
        pass
    if not info:
        out.append('None\tNone\n')
        continue

    # Find sentences containing cancer terms
    grep = grep_cancer_terms(info.replace('\n', '.'))
    tissues, sentences = [], []
    generic_cancer_flag = None
    for sentence in grep:
        sentence = sentence.encode('utf-8')

        # Wikipedia meta entries that aren't relevant
        if '-gozu-' in sentence or '-pro-' in sentence or '-colo-' in sentence or '-govo-' in sentence:
            sentence = ''

        # Associate sentence with a tissue
        tissue_identity = identify_tissue_from_str(sentence)
        if tissue_identity:
            tissues.append(tissue_identity)
            sentences.append(sentence)
        elif 'cancer' in sentence.lower():
            generic_cancer_flag = True
            sentences.append(sentence)
        elif sentence:
            print '\tNo tissue found for evidence: {}'.format(sentence)

    # Record the tissues found, or fall back to a generic cancer label
    if not tissues and generic_cancer_flag:
        print '\tOnly generic cancer found for {}'.format(name)
        out.append('cancer\t{}\n'.format('.'.join(sentences)))
    elif tissues:
        out.append('{}\t{}\n'.format(', '.join(set(flatten(tissues))), '.'.join(sentences)))
    else:
        print 'Nothing found for: {}'.format(name)
        out.append('None\tNone\n')

9-22-17: 152 of 158 drugs in the table have a valid gene target and either a corresponding tissue or a generic-cancer flag.

# II. Create CIViC Disease and Tissue Mapping

Build/read CIViC table of cancers, drugs, and gene targets

In [3]:
civ_proc = 'CIViC-processing'
civic_db_path = os.path.join(civ_proc, 'input_df')

# If table exists, read instead of recreating
try:
    civic = pd.read_pickle(civic_db_path)
except:
    civic = create_civic_drug_disease_dataframe()
    mkdir_p(civ_proc)
    civic.to_pickle(civic_db_path)

In [4]:
print '{} rows'.format(civic.shape[0])
civic.head()

3290 rows


Unnamed: 0,Cancer,Drugs,Gene,Aliases,Variant-Name,Description
0,Non-small Cell Lung Carcinoma,Crizotinib,ALK,"ALK,NBLST3,CD246",EML4-ALK L1152R,"ALK amplifications, fusions and mutations have..."
1,Neuroblastoma,PF-2341066,ALK,"ALK,NBLST3,CD246",R1275Q,"ALK amplifications, fusions and mutations have..."
2,Neuroblastoma,PF-2341066,ALK,"ALK,NBLST3,CD246",R1275Q,"ALK amplifications, fusions and mutations have..."
3,Neuroblastoma,Lorlatinib,ALK,"ALK,NBLST3,CD246",R1275Q,"ALK amplifications, fusions and mutations have..."
4,Neuroblastoma,Crizotinib,ALK,"ALK,NBLST3,CD246",R1275Q,"ALK amplifications, fusions and mutations have..."


### II.a Process Dataframe

Find tissue labels for every row

In [5]:
tissue = []
no_match = set()
for cancer in civic.Cancer:
    ti = identify_tissue_from_str(cancer)
    if ti:
        tissue.append(', '.join(ti))
    elif cancer == 'Cancer':
        tissue.append('Cancer')
    elif 'Sarcoma' in cancer or 'Sheath Tumor' in cancer:
        tissue.append('Sarcoma')
    else:
        tissue.append(None)
        no_match.add(cancer.encode('utf-8'))
print 'Excluded cancers / diseases (mostly connective tissue): {}'.format(', '.join(no_match))

Excluded cancers / diseases (mostly connective tissue): Papillary Adenocarcinoma, Angiosarcoma, Myeloid And Lymphoid Neoplasms With Eosinophilia And Abnormalities Of PDGFRA, PDGFRB, And FGFR1, PTEN Harmatoma Tumor Syndrome, Langerhans-Cell Histiocytosis, Chuvash Polycythemia, Von Hippel-Lindau Disease, Desmoid Fibromatosis, NUT Midline Carcinoma, Peutz-Jeghers Syndrome, Tuberous Sclerosis, Pseudomyxoma Peritonei, Plexiform Neurofibroma, Systemic Mastocytosis, Epithelioid Hemangioendothelioma, Malignant Sertoli-Leydig Cell Tumor, Myelodysplastic Syndrome, Female Reproductive Organ Cancer, Solid Tumor, Inflammatory Myofibroblastic Tumor, Chordoma, Dermatofibrosarcoma Protuberans, Polycythemia Vera, Scrotum Paget's Disease, Chronic Myeloproliferative Disease, Acoustic Neuroma, Pericytoma, Bronchiolo-alveolar Adenocarcinoma, Thymic Carcinoma, Sezary's Disease, Histiocytoma, Waldenström's Macroglobulinemia, Liposarcoma, Myelofibrosis


Add tissue to table

In [6]:
civic['Tissue'] = tissue

Filtering

In [7]:
# Replace all empty cells with NaN
civic = civic.replace(r'\s+( +\.)|#', np.nan, regex=True).replace('',np.nan)

# Filter out rows with no drug 
civic = civic[~civic.Drugs.isnull()]
print 'Filtering for drugs: {} rows'.format(civic.shape[0])

# Filter out rows without a matching tissue
civic = civic[~civic.Tissue.isnull()]
print 'Filter for matching tissues: {} samples'.format(civic.shape[0])

Filtering for drugs: 2280 rows
Filter for matching tissues: 2239 samples


Remove duplicate rows that share the same **Cancer** and **Drugs** pair, keeping the first occurrence

In [8]:
pairs = set()
index_to_drop = []
for i in xrange(civic.shape[0]):
    c, d = civic.iloc[i][['Cancer', 'Drugs']]
    if (c, d) in pairs:
        index_to_drop.append(i)
    else:
        pairs.add((c, d))
print 'Found {} duplicate rows'.format(len(index_to_drop))

# Drop duplicate rows
civic = civic.drop(civic.index[index_to_drop])

Found 1420 duplicate rows
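
The same first-occurrence de-duplication can likely be expressed with pandas' built-in `drop_duplicates`. The sketch below is an alternative under that assumption, not the code that produced the count above. The toy rows copy the **Cancer**, **Drugs**, and **Gene** values from the first five CIViC rows shown earlier.

```python
import pandas as pd

# Toy rows mirroring the first five CIViC rows shown earlier
toy = pd.DataFrame({
    'Cancer': ['Non-small Cell Lung Carcinoma', 'Neuroblastoma',
               'Neuroblastoma', 'Neuroblastoma', 'Neuroblastoma'],
    'Drugs': ['Crizotinib', 'PF-2341066', 'PF-2341066',
              'Lorlatinib', 'Crizotinib'],
    'Gene': ['ALK'] * 5,
})

# Keep the first row for each (Cancer, Drugs) pair, as the loop does
deduped = toy.drop_duplicates(subset=['Cancer', 'Drugs'], keep='first')
n_dropped = len(toy) - len(deduped)
```

Only the repeated Neuroblastoma / PF-2341066 row is removed here; the other four pairs are distinct.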


### II.b Validate Genes

In [11]:
genes = []
for i in xrange(len(civic)):
    
    # Check if listed Gene is valid
    gene = None
    aliases = [civic.iloc[i].Gene] + civic.iloc[i].Aliases.split(',')
    for alias in aliases:
        if alias in valid_genes:
            gene = alias
            break

    # If no valid gene was found from aliases
    if not gene:
        print 'No valid gene found in {}, querying mygene'.format(aliases)
        for alias in aliases:      
            gene = find_gene_given_alias(alias)
            if gene:
                print 'Found valid gene name {} for alias {}'.format(gene, alias)
                break
            
    if not gene:
        print 'No valid gene found for: {}'.format(aliases[0])
    
    # Append gene or None
    genes.append(gene)

No valid gene found in [u'UGT1A', u'GNT1', u'UGT1', u'UGT1A', u'UGT1A@', u'UGT'], querying mygene
Found valid gene name UGT1A1 for alias UGT1A
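
The local alias check at the start of the loop can be sketched as a small function that needs no network access. `first_valid_alias` is a hypothetical name, and the `valid` set is a tiny stand-in for `valid_genes`. The guard for a non-string alias field is an added assumption, since empty cells were replaced with NaN during filtering.

```python
def first_valid_alias(gene, aliases, valid_genes):
    """Return the first of gene + comma-separated aliases found in valid_genes."""
    # A NaN alias field has no split method, so treat it as an empty list
    alias_list = aliases.split(',') if hasattr(aliases, 'split') else []
    for candidate in [gene] + [a.strip() for a in alias_list]:
        if candidate in valid_genes:
            return candidate
    return None


# Tiny stand-in for valid_genes, for illustration only
valid = {'ALK', 'UGT1A1'}
hit = first_valid_alias('ALK', 'ALK,NBLST3,CD246', valid)
miss = first_valid_alias('UGT1A', 'GNT1,UGT1,UGT1A,UGT1A@,UGT', valid)
```

A `None` result is the signal to fall back to a remote lookup such as `find_gene_given_alias`, as the loop above does for UGT1A.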


Update table with valid genes

In [None]:
civic.Gene = genes

In [None]:
civic.head()

In [None]:
civic.to_csv('CIViC-processing/civic-processed.tsv', sep='\t')

Output drug-gene list

In [None]:
drug_gene = civic[['Drugs', 'Gene']]

In [None]:
drug_gene.head()

In [13]:
civic[civic.Drugs == 'Crizotinib']

Unnamed: 0,Cancer,Drugs,Gene,Aliases,Variant-Name,Description,Tissue
0,Non-small Cell Lung Carcinoma,Crizotinib,ALK,"ALK,NBLST3,CD246",EML4-ALK L1152R,"ALK amplifications, fusions and mutations have...",Lung
4,Neuroblastoma,Crizotinib,ALK,"ALK,NBLST3,CD246",R1275Q,"ALK amplifications, fusions and mutations have...",Brain
9,Skin Melanoma,Crizotinib,ALK,"ALK,NBLST3,CD246",ALTERNATIVE TRANSCRIPT (ATI),"ALK amplifications, fusions and mutations have...",Skin-Head
22,Lung Adenocarcinoma,Crizotinib,ALK,"ALK,NBLST3,CD246",HIP1-ALK I1171N,"ALK amplifications, fusions and mutations have...",Lung
23,Anaplastic Large Cell Lymphoma,Crizotinib,ALK,"ALK,NBLST3,CD246",HIP1-ALK I1171N,"ALK amplifications, fusions and mutations have...",Blood
30,Breast Cancer,Crizotinib,ALK,"ALK,NBLST3,CD246",ALK FUSIONS,"ALK amplifications, fusions and mutations have...",Breast
40,Diffuse Large B-cell Lymphoma,Crizotinib,ALK,"ALK,NBLST3,CD246",ALK FUSIONS,"ALK amplifications, fusions and mutations have...",Blood
43,Colorectal Adenocarcinoma,Crizotinib,ALK,"ALK,NBLST3,CD246",ALK FUSIONS,"ALK amplifications, fusions and mutations have...",Colon-Small_intestine
52,Lung Large Cell Carcinoma,Crizotinib,ALK,"ALK,NBLST3,CD246",EML4-ALK I1171S,"ALK amplifications, fusions and mutations have...",Lung
55,Cancer,Crizotinib,ALK,"ALK,NBLST3,CD246",NPM-ALK,"ALK amplifications, fusions and mutations have...",Cancer


### Questions

Which gene is targeted by the most drugs?

In [134]:
gp = civic.groupby('Gene').Drugs
gp.describe().sort_values('count', ascending=False).head(10)

Gene,count,unique,top,freq
BRAF,78,47,Vemurafenib,18
KRAS,63,56,Erlotinib,2
EGFR,51,37,Erlotinib,5
ERBB2,49,32,Trastuzumab,6
PIK3CA,40,31,Pictilisib,4
ALK,39,22,Crizotinib,10
PTEN,22,19,Everolimus,3
FLT3,20,18,Sunitinib,3
KIT,16,10,Imatinib,6
FGFR1,15,9,BGJ398,4


Which drug has the most targets? (check for accuracy)

In [135]:
gp = civic.groupby('Drugs').Gene
gp.describe().sort_values('count', ascending=False).head(10)

Drugs,count,unique,top,freq
Vemurafenib,21,3,BRAF,18
Crizotinib,14,3,ALK,10
Erlotinib,10,5,EGFR,5
Dasatinib,10,6,BRAF,2
Cisplatin,10,10,KRAS,1
Afatinib,9,4,ERBB2,4
Everolimus,9,7,PTEN,3
Olaparib,9,4,BRCA1,5
Selumetinib (AZD6244),9,6,BRAF,3
Imatinib,9,3,KIT,6


Which cancer has the most drugs?

Which tissue has the most drugs?
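
One possible way to approach these two questions is to count distinct drugs per group, following the `groupby` pattern used above. This is a sketch on toy rows, not a result from the full table. The toy values are copied from the CIViC rows shown earlier, with the tissue labels taken from the Crizotinib table.

```python
import pandas as pd

# Toy rows copied from the CIViC tables shown earlier
toy = pd.DataFrame({
    'Cancer': ['Non-small Cell Lung Carcinoma', 'Neuroblastoma',
               'Neuroblastoma', 'Neuroblastoma', 'Lung Adenocarcinoma'],
    'Drugs': ['Crizotinib', 'PF-2341066', 'Lorlatinib',
              'Crizotinib', 'Crizotinib'],
    'Tissue': ['Lung', 'Brain', 'Brain', 'Brain', 'Lung'],
})

# Number of distinct drugs per cancer, largest first
drugs_per_cancer = toy.groupby('Cancer').Drugs.nunique().sort_values(ascending=False)

# Number of distinct drugs per tissue, largest first
drugs_per_tissue = toy.groupby('Tissue').Drugs.nunique().sort_values(ascending=False)
```

On the real table, **Tissue** can hold several comma-joined labels per row, so those rows would need to be split into one row per tissue before the per-tissue count is meaningful.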

# III. Combine Tables

# IV. Bipartite Graphs

In [100]:
# Map drugs to genes
drug_gene_map = {}
for drug in civic.Drugs.unique():
    drug_gene_map[drug] = civic[civic.Drugs == drug].Gene.tolist()


# Write out TSV for visualization in Gephi
bg_dir = 'Bipartite_graphs'
output = os.path.join(bg_dir, 'drug-gene.tsv')
mkdir_p(bg_dir)
with open(output, 'w') as f:
    for gene, drug in [(gene, drug) for drug in drug_gene_map for gene in drug_gene_map[drug]]:
        f.write('{}\t{}\n'.format(gene, drug))

# Create graph and add edges between all genes and drugs
#G = nx.Graph()
#pos = nx.spring_layout(G)
#nx.draw(G, with_labels=True)
#plt.show();
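
A minimal sketch of how the drug-gene graph could be built with `networkx` follows. It is an assumption about where the commented-out code was heading, not the notebook's final method. The small `drug_gene_map` uses drug-gene pairs that appear in the summary tables above, and nodes are tagged with their type so that a drug and a gene sharing a name would remain distinct.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Illustrative subset; each pair appears in the summary tables above
drug_gene_map = {
    'Vemurafenib': ['BRAF'],
    'Dasatinib': ['BRAF'],
    'Crizotinib': ['ALK'],
    'Erlotinib': ['EGFR', 'KRAS'],
}

G = nx.Graph()
for drug, genes in drug_gene_map.items():
    # bipartite=0 marks drug nodes, bipartite=1 marks gene nodes
    G.add_node(('drug', drug), bipartite=0)
    for gene in set(genes):
        G.add_node(('gene', gene), bipartite=1)
        G.add_edge(('drug', drug), ('gene', gene))

# Recover the two node sets from the node attribute
drug_nodes = {n for n, d in G.nodes(data=True) if d['bipartite'] == 0}
gene_nodes = set(G) - drug_nodes
is_two_sided = bipartite.is_bipartite(G)
```

Using `set(genes)` collapses repeated drug-gene pairs into a single edge, which matters because the full table lists the same pair once per cancer.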