# Extract-Transform-Load Script

Extract-Transform-Load scripts (ETLS) are common tools in data management. An ETLS gathers relevant data (both direct and inferred) from public databases and stores the important features in a schema, possibly different from the source schema, that suits a specific analysis.

## PubMed Central ETLS Example

This script will Extract data from the delimited text files (CSV/TSV) provided to us by Stanford, Transform the data into a format usable by GeneDive, and then Load the data into the GeneDive SQLite database.

Whenever new data is obtained for GeneDive, this process should be run against that dataset. 

In [1]:
import re
import sqlite3
from shutil import copy2

In [2]:
# Progress Bar I found on the internet.
# https://github.com/alexanderkuk/log-progress
from progress_bar import log_progress

## <span style="color:red">IMPORTANT!</span> You need to create folders and organize the data before starting

Below are many file names and directory names. Create the directories and place each file at the path given before running the notebook.

`GENE_GENE_PLOS_INTERACTIONS_FILE`, `GENE_GENE_PMC_INTERACTIONS_FILE`, `GENE_DRUG_INTERACTIONS_FILE`, and `GENE_DISEASE_INTERACTIONS_FILE` are TSV files from Emily. If they come with a .csv extension but are tab-separated, rename them to .tsv. If they are delimited some other way, change their extensions appropriately and change the value of `DELIMITER` below.
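If it is unclear how a file is delimited, a quick look at its header line can help before setting `DELIMITER`. The sketch below is not part of the original notebook: `guess_delimiter` and its candidate list are hypothetical, and it simply picks whichever candidate character occurs most often in the header, assuming no quoted fields contain embedded delimiters.

```python
def guess_delimiter(header_line, candidates=("\t", ",", ";", "|")):
    """Return the candidate character that occurs most often in the header line.

    Assumption: the header has no quoted fields with embedded delimiters.
    """
    counts = {c: header_line.count(c) for c in candidates}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("No candidate delimiter found in header")
    return best


# Illustrative header only; real files may have more columns.
sample_header = "journal\tarticle_id\tpubmed_id\tsentence_id\n"
DELIMITER_GUESS = guess_delimiter(sample_header)
```

In practice the header line would come from reading the first line of the file with `open(path).readline()`.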

`GOOD_PHARM_GKB_DB`, `GOOD_ALL_DB`, and `GOOD_PLOS_PMC_DB` are the current working, valid databases used in GeneDive. They will not be altered, but instead they will be copied and updated.

`PLOS_PMC_DB` and `ALL_DB` are the newly generated databases.

In [3]:
PLOS_PMC_NAME = "PLOS-PMC"
PHARMGKB_NAME = "PHARMGKB"
ALL_NAME = "ALL"

# TSV files containing PLOS-PMC data
GENE_GENE_PLOS_INTERACTIONS_FILE    = 'tsv_data/plos_pmc/plos.tsv'
GENE_GENE_PMC_INTERACTIONS_FILE     = 'tsv_data/plos_pmc/pmc.tsv'
GENE_DRUG_INTERACTIONS_FILE         = 'tsv_data/genedrug_relationship_100417_sfsu_with_excerpts.tsv'
GENE_DISEASE_INTERACTIONS_FILE      = 'tsv_data/genedisease_relationship_100417_sfsu_with_excerpts.tsv'

# TSV files containing Pharm-GKB data
PHARMGKB_INTERACTIONS_FILE          = 'tsv_data/pharmgkb/relationships.tsv'
PHARMGKB_CHEMICAL_IDS_FILE          = 'tsv_data/pharmgkb/ids/chemicals.tsv'
PHARMGKB_DRUGS_IDS_FILE             = 'tsv_data/pharmgkb/ids/drugs.tsv'
PHARMGKB_GENES_IDS_FILE             = 'tsv_data/pharmgkb/ids/genes.tsv'
PHARMGKB_PHENOTYPES_IDS_FILE        = 'tsv_data/pharmgkb/ids/phenotypes.tsv'

# These will be unaltered
GOOD_PHARM_GKB_DB = 'sqlite_data/good_data/data.pgkb.sqlite'
GOOD_ALL_DB = 'sqlite_data/good_data/data.all.sqlite'
GOOD_PLOS_PMC_DB = 'sqlite_data/good_data/data.plos-pmc.sqlite'

# These will be created/overwritten
PLOS_PMC_DB = 'sqlite_data/data.plos-pmc.sqlite' # This will be just the data from Emily's files
ALL_DB = 'sqlite_data/data.all.sqlite' # This is a combination of Emily's files and PharmGKB

# If excerpts already come wrapped in pound signs, set this to False
WRAP_EXCERPTS = True

DELIMITER = "\t"
EMILYS_FILES = [
#     {"filename":GENE_GENE_PLOS_INTERACTIONS_FILE,"type":"GeneGene"},
    {"filename":GENE_GENE_PMC_INTERACTIONS_FILE,"type":"GeneGene"},
    {"filename":GENE_DRUG_INTERACTIONS_FILE,"type":"GeneDrug"},
    {"filename":GENE_DISEASE_INTERACTIONS_FILE,"type":"GeneDisease"},
]

PHARMGKB_ID_FILES = [
#     {"filename":PHARMGKB_CHEMICAL_IDS_FILE, "type" : "chemicals"},
#     {"filename":PHARMGKB_DRUGS_IDS_FILE, "type" : "drugs"},
    {"filename":PHARMGKB_GENES_IDS_FILE, "type" : "genes"},
     {"filename":PHARMGKB_PHENOTYPES_IDS_FILE, "type" : "phenotypes"},
]

If `WRITE` is `False`, the script will run but not write anything to the database. This keeps it safe while you're nosing around, and can also be useful if you need to re-generate the complete typeahead/adjacency files.
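The cells that perform the actual writes are not shown in this part of the notebook, so the following is only a sketch of how such a flag might gate database writes. The helper `execute_if_enabled` and the `demo` table are hypothetical and exist only for illustration.

```python
import sqlite3


def execute_if_enabled(cursor, sql, params=(), write=True):
    """Run a statement only when writing is enabled; return True if it ran."""
    if not write:
        return False
    cursor.execute(sql, params)
    return True


# In-memory database so the example never touches real files.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo (id INTEGER, name TEXT)")

# Dry run: the insert is skipped.
execute_if_enabled(cur, "INSERT INTO demo VALUES (?, ?)", (1, "dry run"), write=False)
rows_after_dry_run = cur.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

# Real run: the insert is executed.
execute_if_enabled(cur, "INSERT INTO demo VALUES (?, ?)", (2, "real"), write=True)
rows_after_write = cur.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
conn.close()
```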

In [4]:
WRITE = True

This copies the known-good database files, then initializes the connections. The `databases` list will be looped through at the end, applying all the interactions to each database.

In [5]:
copy2(GOOD_PLOS_PMC_DB, PLOS_PMC_DB)
conn_plos_pmc = sqlite3.connect(PLOS_PMC_DB)
cursor_plos_pmc = conn_plos_pmc.cursor()

copy2(GOOD_ALL_DB, ALL_DB)
conn_all = sqlite3.connect(ALL_DB)
cursor_all = conn_all.cursor()


databases = [
#     {"conn": conn_plos_pmc, "cursor": cursor_plos_pmc, "name": PLOS_PMC_NAME}, 
    {"conn": conn_all, "cursor": cursor_all, "name": ALL_NAME}
]

## PharmGKB ids
This data is used to convert IDs in our database. Genes should use NCBI IDs, drugs should use PharmGKB IDs, and diseases should use MeSH IDs.
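To illustrate the disease conversion, here is a small sketch of pulling a MeSH ID out of an external-vocabulary string. The helper `extract_mesh_id` is hypothetical and the sample strings are made up for illustration. It uses `re.search`, so it would also find a MeSH entry that is not first in the field, which is a deliberate difference from a start-anchored `re.match`.

```python
import re

# Case-insensitive so that both "MeSH:" and "MESH:" prefixes are accepted.
MESH_PATTERN = re.compile(r"MESH:[0-9A-Za-z]+", re.IGNORECASE)


def extract_mesh_id(external_vocabulary):
    """Return the first MeSH ID in the field as upper case, or None if absent."""
    cleaned = external_vocabulary.replace('"', "")  # stray quotes can surround the field
    match = MESH_PATTERN.search(cleaned)
    return match.group(0).upper() if match else None


# Illustrative inputs; the real field layout may differ.
first = extract_mesh_id('"MeSH:D054556(example name)"')
later = extract_mesh_id("OTHER:123(example), MeSH:D000230(example)")
missing = extract_mesh_id("")
```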

In [6]:
pgkb_map = {'pgkb' : {}, 'mesh':{}, 'ncbi':{},}

for id_file in PHARMGKB_ID_FILES:
    gene_file = id_file["type"] == "genes"
    phenotype_file = id_file["type"] == "phenotypes"
    with open(id_file["filename"]) as file:
        try:
            header = None
            linenum = 0
            for line in file:
                linenum+=1
                pgkb = None
                ncbi = None
                mesh = None

                line = line.strip().split(DELIMITER)

                # Read the headers of the file and assign them to a dictionary {column_name: column_number}
                if linenum == 1:
                    header = {name.strip(): col for col, name in enumerate(line)}
                    print(header)                    
                    continue

                # set variables
                pgkb = line[header["PharmGKB Accession Id"]]

                if gene_file:
                    ncbi = line[header["NCBI Gene ID"]]
                elif phenotype_file:
                    if len(line) > header['External Vocabulary']: # Trailing empty columns are dropped by strip(), so the column may be absent
                        external = str(line[header['External Vocabulary']]).replace('"', "") # strip quotes, which otherwise break the regex match
                        match = re.match('MESH:[0-9A-Za-z]+',external,re.IGNORECASE)
                        if match is not None:
                            mesh = match.group(0).upper()

                # fill map
                pgkb_map["pgkb"][pgkb] = {}

                if ncbi is not None:
                    pgkb_map["ncbi"][ncbi] = {'pgkb':pgkb}
                    pgkb_map["pgkb"][pgkb]["ncbi"]  = ncbi                
                if mesh is not None:
                    pgkb_map["mesh"][mesh] = {'pgkb':pgkb}
                    pgkb_map["pgkb"][pgkb]["mesh"]  = mesh

        except Exception as e:
            print(line)
            raise e

{'PharmGKB Accession Id': 0, 'NCBI Gene ID': 1, 'HGNC ID': 2, 'Ensembl Id': 3, 'Name': 4, 'Symbol': 5, 'Alternate Names': 6, 'Alternate Symbols': 7, 'Is VIP': 8, 'Has Variant Annotation': 9, 'Cross-references': 10, 'Has CPIC Dosing Guideline': 11, 'Chromosome': 12, 'Chromosomal Start - GRCh37.p13': 13, 'Chromosomal Stop - GRCh37.p13': 14, 'Chromosomal Start - GRCh38.p7': 15, 'Chromosomal Stop - GRCh38.p7': 16}
{'PharmGKB Accession Id': 0, 'Name': 1, 'Alternate Names': 2, 'Cross-references': 3, 'External Vocabulary': 4}


In [7]:
# Just to test
pgkb_map["pgkb"]['PA447298']

{'mesh': 'MESH:D054556'}

## PLOS-PMC
Map the columns as they appear in the file to the correct values.
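The column mapping can be tried on a single header line before running it against a full file. The sketch below is a condensed version of that idea; the function name `build_header` is hypothetical, and the sample headers contain only a few illustrative columns.

```python
def build_header(header_line, delimiter="\t"):
    """Map column names to indices and add the normalized dgr1/dgr2 aliases."""
    header = {name.strip(): col for col, name in enumerate(header_line.split(delimiter))}
    if "geneids" in header and "disease_ids" in header:  # GeneDrug / GeneDisease layout
        header["dgr1"] = header["geneids"]
        header["dgr2"] = header["disease_ids"]
    elif "geneids1" in header and "geneids2" in header:  # GeneGene layout
        header["dgr1"] = header["geneids1"]
        header["dgr2"] = header["geneids2"]
    else:
        raise ValueError("column headers didn't contain expected values")
    return header


gene_gene = build_header("journal\tgeneids1\tgeneids2\tprobability\n")
gene_disease = build_header("journal\tgeneids\tdisease_ids\tprobability\n")
```

Because each name is stripped, the trailing newline on the last column does not end up in the key.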

In [8]:
# If the excerpt is not found, the excerpt-wrapping cell below will not run
excerptFound = False
interactions = {
    PLOS_PMC_NAME : [],
    PHARMGKB_NAME : [],
}
for data_file in EMILYS_FILES:
    
    gene_drug_file = data_file["type"] == "GeneDrug"
    gene_disease_file = data_file["type"] == "GeneDisease"
    gene_gene_file = data_file["type"] == "GeneGene"
    
    # Identify each DGR's type from the file type: Drug (r), Disease (d), or Gene (g)
    dgd_type1 = ""
    dgd_type2 = "g"
    if gene_drug_file:
        dgd_type1 = "r"
    elif gene_disease_file:
        dgd_type1 = "d"
    elif gene_gene_file:
        dgd_type1 = "g"
    else:
        raise ValueError('{type} is an unrecognized type in EMILYS_FILES'.format(type = data_file["type"]))
        
    with open(data_file["filename"]) as file:
        header = None
        linenum = 0  
        for line in file:
            linenum+=1
            
            # Read the headers of the file and assign them to a dictionary {column_name: column_number}
            if linenum == 1:
                header = {name.strip(): col for col, name in enumerate(line.split(DELIMITER))}
                
                # The GeneGene headers differ from Gene Drug and Gene Disease. This normalizes them.
                if "geneids" in header and "disease_ids" in header: # GeneDrug/GeneDisease
                    header["dgr1"] = header["geneids"]
                    header["dgr2"] = header["disease_ids"]                    
                    header["mention1_offset"] = header["mention1_offset_start"]
                    header["mention2_offset"] = header["mention2_offset_start"]
                elif "geneids1" in header and "geneids2" in header: # GeneGene
                    header["dgr1"] = header["geneids1"]
                    header["dgr2"] = header["geneids2"]
                else:
                    raise ValueError('{f} column headers didn\'t contain expected values'.format(f = data_file["filename"]))
                
                # If no excerpt column is provided, fall back to the article ID
                if "excerpt" in header:                    
                    excerptFound =  True
                elif "sentence" in header:                    
                    excerptFound =  True
                    header["excerpt"] = header["sentence"]
                else:
                    header["excerpt"] = header["article_id"]
                
                continue
                
            line = line.strip().split(DELIMITER)
            
            
            interaction = {
                "journal": line[header["journal"]], # no change
                "article_id": line[header["article_id"]], # no change
                "pubmed_id": (line[header["pubmed_id"]],), # make it a tuple, so that it can be looped over later
                "sentence_id": line[header["sentence_id"]], # no change
                "mention1_offset": line[header["mention1_offset"]], # the new data has mention1_offset_start and mention1_offset_end; only the start offset is used (start and end are often the same anyway)
                "mention2_offset": line[header["mention2_offset"]], # same principle as above, but for mention2
                "mention1": line[header["mention1"]], # no change
                "mention2": line[header["mention2"]], # no change
                "geneids1": line[header["dgr1"]], # there's a column named "geneids", but it never seems to contain more than one value "MESH:xxxxxxx"
                "geneids2": line[header["dgr2"]], # the column after "geneids" is called "disease_ids", and may be a suitable substitute for this geneids value
                "confidence": line[header["probability"]], # no change
                "excerpt": line[header["excerpt"]],
                "type1" : dgd_type1,
                "type2" : dgd_type2,
              }
            
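            try:
                # Validate that both offsets are integers before keeping the row
                mention1_offset = int(interaction["mention1_offset"])
                mention2_offset = int(interaction["mention2_offset"])
            except Exception:
                # Print the offending row using values that are guaranteed to exist
                print(line)
                print(interaction["mention1_offset"], interaction["mention2_offset"], interaction["excerpt"])
                raise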
    
            interactions[PLOS_PMC_NAME].append(interaction)



## PharmGKB
Map the columns as they appear in the file to the correct values.
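One detail of this mapping is that a row may carry several PubMed IDs separated by semicolons, or none at all. The sketch below shows one way to handle both cases; `parse_pmids` is a hypothetical helper, and treating an empty field the same as a missing column is an assumption of this sketch.

```python
def parse_pmids(fields, pmid_col):
    """Return the row's PMIDs as a tuple, or None when the column is missing or empty."""
    if pmid_col >= len(fields) or not fields[pmid_col].strip():
        return None
    return tuple(p.strip() for p in fields[pmid_col].split(";") if p.strip())


# Illustrative rows, already split on the delimiter.
with_ids = parse_pmids(["PA1", "Gene", "123;456"], 2)
short_row = parse_pmids(["PA1", "Gene"], 2)
empty_field = parse_pmids(["PA1", "Gene", ""], 2)
```

Returning a tuple keeps the value hashable and lets later code loop over it uniformly, whether a row has one ID or many.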

In [9]:
interactions[PHARMGKB_NAME] = []

geneid_type = {
    "Gene" : "g",
    "Disease" : "d",
    "Chemical" : "r",
    "Haplotype" : "d",
    "Variant" : "d",
    #"VariantLocation" : "C",
    
}

# If not in the prepend dictionary, the type will not be added and instead added to this set
types_ignored = set()
total_ignored = 0

# This prevents duplicates
seen_interactions = set()
duplicates_found = 0

interactions_flipped = 0
linenum = 0  
no_pubmed_ids = 0
# Tag each DGR with a type code: Gene (g), Drug (r), or Disease (d); see geneid_type above
with open(PHARMGKB_INTERACTIONS_FILE) as file:
    header = None

    for line in log_progress(file, every=1000, name=PHARMGKB_INTERACTIONS_FILE+" progress"):
        linenum+=1
        line = line.strip().split(DELIMITER)
        
        # Read the headers of the file and assign them to a dictionary {column_name: column_number}
        if linenum == 1:
            header = {name.strip(): col for col, name in enumerate(line)}
            continue
       
        dgr1 = line[header["Entity1_id"]]
        dgr2 = line[header["Entity2_id"]]
        
        # prepend the GeneIDs appropriately
        try:
            dgr_type1 = geneid_type[line[header["Entity1_type"]]]
        except KeyError:
            types_ignored.add(dgr1)
            total_ignored += 1
            continue
        try:
            dgr_type2 = geneid_type[line[header["Entity2_type"]]]
        except KeyError:
            types_ignored.add(dgr2)
            total_ignored += 1
            continue            

        # Replace PharmGKB GeneIDs with NCBI ids
        if dgr_type1 == "g":
            lookup = pgkb_map["pgkb"].get(dgr1)
            if lookup is None or "ncbi" not in lookup:
                raise ValueError("Cannot find {dgr}'s NCBI value in pgkb_map".format(dgr=dgr1))
            dgr1 = lookup["ncbi"]
        if dgr_type2 == "g":
            lookup = pgkb_map["pgkb"].get(dgr2)
            if lookup is None or "ncbi" not in lookup:
                raise ValueError("Cannot find {dgr}'s NCBI value in pgkb_map".format(dgr=dgr2))
            dgr2 = lookup["ncbi"]

        # Sanity check: both types must be exactly one of d (disease), g (gene), r (drug)
        if dgr_type1 not in ("d", "g", "r") or dgr_type2 not in ("d", "g", "r"):
            print("type1:\"{a}\", type2:\"{b}\"".format(a = dgr_type1, b= dgr_type2))
            raise ValueError("Unrecognized entity type on line {n}".format(n=linenum))
            
        interaction = {
            "journal": "PharmGKB", 
            "article_id": "0",
            "pubmed_id": "0",
            "sentence_id": "0", 
            "mention1_offset": "0", 
            "mention2_offset": "0",
            "mention1": line[header["Entity1_name"]],
            "mention2": line[header["Entity2_name"]],
            "geneids1": dgr1,
            "geneids2": dgr2,
            "confidence": "0.999",
            "excerpt": "Source: PharmGKB",
            "type1" : dgr_type1,
            "type2" : dgr_type2,
        }
        
        
        # Not all rows have a PMIDs column; accessing it on such rows raises IndexError, and those rows are skipped
        try:
            pubids = line[header["PMIDs"]].split(";")                
            interaction["pubmed_id"] = tuple(pubids)
        except IndexError:
            no_pubmed_ids += 1
            continue
            
                    
        # Remap Diseases to MESH
        if dgr_type1 == 'd':
            if 'PA' in dgr1 and dgr1 in pgkb_map["pgkb"] and "mesh" in pgkb_map["pgkb"][dgr1]:
                interaction["geneids1"] = pgkb_map["pgkb"][dgr1]['mesh']


        if dgr_type2 == 'd':
            if 'PA' in dgr2 and dgr2 in pgkb_map["pgkb"] and "mesh" in pgkb_map["pgkb"][dgr2]:
                interaction["geneids2"] = pgkb_map["pgkb"][dgr2]['mesh']
                
        interactions[PHARMGKB_NAME].append(interaction)

            

            
        
print("{total} interactions processed".format(total = linenum))
print("{flip} interactions flipped".format(flip = interactions_flipped))
print("{missing} didn't have Pubmed IDs".format(missing = no_pubmed_ids))
print("{ignore} interactions were ignored due to having at least one of these types:".format(ignore=total_ignored), types_ignored)


66690 interactions processed
0 interactions flipped
3754 didn't have Pubmed IDs
144 interactions were ignored due to having at least one of these types: {'Haplotype for PA166', 'null:PA134946555', 'Haplotype for PA128', 'null:PA33532', 'PA123:1184470420', 'PA124:1183684088', 'PA145:1183944268', 'null:PA35845', 'PA145:1183685159', 'PA128:1184470421', 'Haplotype for PA356', 'PA126:1183682177', 'PA128:1184470420', 'PA124:1183682639', 'PA397:1183685159', 'PA131:1183682177', 'PA128:1183681726', 'null:PA38407', 'PA128:1448995405'}


Remove any interactions in which a gene traces to multiple IDs or a required field is missing (`NULL`).
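As a minimal illustration of the ID part of this filter, the sketch below drops records whose ID fields are empty, contain `NULL`, or list several IDs separated by semicolons. The helper name and the sample records are hypothetical, and only the two ID fields are checked here.

```python
def has_multiple_or_missing_ids(record):
    """Return True when either ID field is empty, NULL, or lists several IDs."""
    for key in ("geneids1", "geneids2"):
        value = record[key]
        if not value or ";" in value or "NULL" in value:
            return True
    return False


# Illustrative records with only the fields this sketch needs.
records = [
    {"geneids1": "6439", "geneids2": "653509"},
    {"geneids1": "6439;6440", "geneids2": "7080"},  # ambiguous: two IDs
    {"geneids1": "NULL", "geneids2": "975"},        # missing ID
]
kept = [r for r in records if not has_multiple_or_missing_ids(r)]
```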

In [10]:

def invalidInteraction(i):
    t1 = i['type1']
    t2 = i['type2']
    id1 = i['geneids1']
    id2 = i['geneids2']
    
    if t1 not in ("d", "g", "r") or t2 not in ("d", "g", "r"):
        print("type1:\"{a}\", type2:\"{b}\"".format(a = t1, b= t2))
        raise ValueError(i)
        
    if ("PA" in id1 and "g" == t1) or ("PA" in id2 and "g" == t2):
        raise ValueError(i)
    
    return (';' in i['geneids1'] 
    or ';'    in i['geneids2']
    or 'NULL' in i['article_id'] 
    or 'NULL' in i['pubmed_id'] 
    or 'NULL' in i['sentence_id'] 
    or 'NULL' in i['mention1_offset'] 
    or 'NULL' in i['mention2_offset'] 
    or 'NULL' in i['mention1'] 
    or 'NULL' in i['mention2'] 
    or 'NULL' in i['geneids1'] or len(i['geneids1']) == 0
    or 'NULL' in i['geneids2'] or len(i['geneids2']) == 0
    or 'NULL' in i['confidence'] 
    or 'NULL' in i['excerpt'])
       
for source in interactions:
    totalInteractions = len(interactions[source])

    interactions[source] = [x for x in interactions[source] if not invalidInteraction(x)]

    newTotal = len(interactions[source])
    print(
    '''{source}:
    Total Interactions:     {total}
    Filtered Interactions:  {filtered}
    Remaining Interactions: {remaining}'''
          .format(source = source, total = totalInteractions,filtered=totalInteractions-newTotal , remaining =newTotal))


PLOS-PMC:
    Total Interactions:     2195438
    Filtered Interactions:  388222
    Remaining Interactions: 1807216
PHARMGKB:
    Total Interactions:     62791
    Filtered Interactions:  4
    Remaining Interactions: 62787


## Excerpt wrapping
GeneDive expects the target genes in the excerpt to be wrapped in pound signs. This is important because a sentence may mention the target gene multiple times, so we need to use the offset data here to make sure we tag the right mention.
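The wrapping step can be sketched as a small standalone function that treats the offsets as indices into the space-separated tokens. `wrap_mentions` is a hypothetical helper. Unlike a check on the first offset only, this sketch leaves the excerpt untouched if either token already contains a pound sign, rejects negative offsets, and wraps a token only once when both offsets are equal.

```python
def wrap_mentions(excerpt, offset1, offset2):
    """Wrap the tokens at the two offsets in pound signs; return the excerpt unchanged if unsafe."""
    tokens = excerpt.replace('"', "").split(" ")
    offsets = (offset1, offset2)
    # Offsets must point at existing tokens.
    if any(o < 0 or o >= len(tokens) for o in offsets):
        return excerpt
    # Skip excerpts that already arrive wrapped.
    if any("#" in tokens[o] for o in offsets):
        return excerpt
    for o in set(offsets):
        tokens[o] = "#" + tokens[o] + "#"
    return " ".join(tokens)


wrapped = wrap_mentions("Only one case was positive for both SP-A and SP-B .", 7, 9)
```

Here `wrapped` tags the tokens at positions 7 and 9, so a second occurrence of the same gene name elsewhere in the sentence would be left untagged.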

In [11]:
if excerptFound and WRAP_EXCERPTS:
    for i in interactions[PLOS_PMC_NAME]:
        #try:
        if i['journal'] != 'journal' and 'excerpt' in i:
            excerpt = i['excerpt']

            excerpt = re.sub('"', '', excerpt)
            tokens = excerpt.split(" ")
            offset1 = int(i['mention1_offset'])
            offset2 = int(i['mention2_offset'])
            
            if i["mention1"] == 'SP-B' or i["mention2"] == 'SP-B':
                print(i)
            
            if ( offset1 >= len(tokens) or offset2 >= len(tokens) ) or ( '#' in tokens[offset1]):
                continue

            tokens[offset1] = "".join(["#",tokens[offset1],"#"])
            tokens[offset2] = "".join(["#",tokens[offset2],"#"])

            i['excerpt'] = " ".join(tokens)
        #except Exception:
        #    print(i["article_id"])

{'journal': 'NULL', 'article_id': 'Tob_Induc_Dis_2004_Mar_15_2(1)_3-25.nxml.txt.nlp', 'pubmed_id': ('19570267',), 'sentence_id': '131', 'mention1_offset': '5', 'mention2_offset': '7', 'mention1': 'SP-B', 'mention2': 'SP-A', 'geneids1': '6439', 'geneids2': '653509', 'confidence': '0.89', 'excerpt': 'Furthermore , potential interactions of SP-A with SP-B in formation of tubular myelin , the symmetrical phospholipid arrays intervening between the secreted lamellar bodies and the air-liquid monolayer has been demonstrated [ 67,68 ] , suggesting vital importance of these proteins to the surfactant .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Exp_Ther_Med_2013_Apr_21_5(4)_1157-1160.nxml.txt.nlp', 'pubmed_id': ('23596483',), 'sentence_id': '91', 'mention1_offset': '20', 'mention2_offset': '30', 'mention1': 'RDS', 'mention2': 'SP-B', 'geneids1': '7263', 'geneids2': '6439', 'confidence': '0.82', 'excerpt': 'Several studies have demonstrated that deletion variants of intron

{'journal': 'NULL', 'article_id': 'Br_J_Cancer_2011_Aug_23_105(5)_673-681.nxml.txt.nlp', 'pubmed_id': ('21811254',), 'sentence_id': '132', 'mention1_offset': '3', 'mention2_offset': '9', 'mention1': 'SP-B', 'mention2': 'TTF-1', 'geneids1': '6439', 'geneids2': '7080', 'confidence': '0.689', 'excerpt': 'Expression patterns of TTF-1 , MAdL , SP-A and SP-B in adenocarcinomas of the lung', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Br_J_Cancer_2011_Aug_23_105(5)_673-681.nxml.txt.nlp', 'pubmed_id': ('21811254',), 'sentence_id': '114', 'mention1_offset': '7', 'mention2_offset': '9', 'mention1': 'SP-B', 'mention2': 'SP-A', 'geneids1': '6439', 'geneids2': '653509', 'confidence': '0.684', 'excerpt': 'Only one case was positive for both SP-A and SP-B .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Theranostics_2013_May_20_3(6)_409-419.nxml.txt.nlp', 'pubmed_id': ('23781287',), 'sentence_id': '161', 'mention1_offset': '17', 'mention2_offset': '19', 'mention1'

{'journal': 'NULL', 'article_id': 'Front_Immunol_2012_Jun_7_3_131.nxml.txt.nlp', 'pubmed_id': ('22701116',), 'sentence_id': '21', 'mention1_offset': '11', 'mention2_offset': '13', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.64', 'excerpt': 'SP-A and SP-D are large hydrophilic proteins , as opposed to SP-B and SP-C , the other two hydrophobic surfactant proteins found in the lungs .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Proteome_Sci_2005_Jun_7_3_5.nxml.txt.nlp', 'pubmed_id': ('15941475',), 'sentence_id': '221', 'mention1_offset': '3', 'mention2_offset': '5', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.639', 'excerpt': 'Structural studies on SP-B and SP-C in aqueous organic solvents and lipidsBiochim Biophys Acta199311682612708323965NavarreCDegandHBen nettKLCrawfordJSMortzEBoutryMSubproteomics : identification of plasma membrane proteins from the yeast Sacc

{'journal': 'NULL', 'article_id': 'Crit_Care_2005_Oct_5_9(6)_550-555.nxml.txt.nlp', 'pubmed_id': ('16356236',), 'sentence_id': '89', 'mention1_offset': '17', 'mention2_offset': '21', 'mention1': 'SP-B', 'mention2': 'SP-A', 'geneids1': '6439', 'geneids2': '653509', 'confidence': '0.604', 'excerpt': 'A possible explanation for the inability to detect SP-A in plasma may be its size , because SP-A is larger than SP-B , although the actual molecular weight of SP-A depends upon its glycosylation [ 45 ] .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Theranostics_2013_May_20_3(6)_409-419.nxml.txt.nlp', 'pubmed_id': ('23781287',), 'sentence_id': '12', 'mention1_offset': '11', 'mention2_offset': '13', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.603', 'excerpt': 'Pulmonary surfactant extracts such as Survanta contain hydrophobic surfactant proteins ( SP-B and SP-C ) that facilitate lipid folding and retention on lipid monola

{'journal': 'NULL', 'article_id': 'Arch_Med_Sci_2012_May_9_8(2)_286-295.nxml.txt.nlp', 'pubmed_id': ('22662002',), 'sentence_id': '38', 'mention1_offset': '16', 'mention2_offset': '20', 'mention1': 'IL-13', 'mention2': 'SP-B', 'geneids1': '3596', 'geneids2': '6439', 'confidence': '0.571', 'excerpt': 'In this study we aimed to investigate the possible association of TNF - -308 G/A , SP-B +1580 C/T , IL-13 -1055 C/T gene polymorphisms and latent adenovirus infection with COPD in an Egyptian population .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Int_J_Pediatr_2009_Mar_1_2009_170491.nxml.txt.nlp', 'pubmed_id': ('19946415',), 'sentence_id': '62', 'mention1_offset': '6', 'mention2_offset': '9', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.57', 'excerpt': 'The mRNA expression of VEGF , SP-B , and SP-C was quantified using the rtPCR technology ( BioRad , Germany ) , QTM SYBR Green Supermix ( BioRad , Germany ) , and a s

{'journal': 'NULL', 'article_id': 'Arch_Med_Sci_2012_May_9_8(2)_286-295.nxml.txt.nlp', 'pubmed_id': ('22662002',), 'sentence_id': '52', 'mention1_offset': '11', 'mention2_offset': '14', 'mention1': 'IL-13', 'mention2': 'SP-B', 'geneids1': '3596', 'geneids2': '6439', 'confidence': '0.534', 'excerpt': 'Identification of adenovirus C gene and genotyping of TNF - , SP-B , and IL-13 gene single nucleotide polymorphisms ( SNPs ) were done by real-time polymerase chain reaction ( real-time PCR ) .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Biochemistry_2011_Jun_7_50(22)_4867-4876.nxml.txt.nlp', 'pubmed_id': ('21553841',), 'sentence_id': '5', 'mention1_offset': '0', 'mention2_offset': '20', 'mention1': 'SP-B', 'mention2': 'SP-A', 'geneids1': '6439', 'geneids2': '653509', 'confidence': '0.533', 'excerpt': 'SP-A performs host defense activities and modulates the biophysical properties of surfactant in concerted action with surfactant protein B ( SP-B ) .', 'type1': 'g', 'ty

{'journal': 'NULL', 'article_id': 'Influenza_Other_Respir_Viruses_2013_Nov_26_7(6)_1218-1226.nxml.txt.nlp', 'pubmed_id': ('23710832',), 'sentence_id': '44', 'mention1_offset': '15', 'mention2_offset': '39', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.509', 'excerpt': 'Amino acid sequence SP-B-type peptide SP-B ( 1-25 ) FPIPLPYCWLCRALIKRIQAMIPKG SP-B ( 20-60 ) AMIPKGALAVAVAQVCRVVPLVAGGICQCLAERYSVILLDT SP-B ( 64-80 ) RMLPQLVCRLVLRCSMD KL4 KLLLLKLLLLKLLLLKLLLLK SP-C-type peptide SP-C ( 1-35 ) FGIPCCPVHLKRLLIVVVVVVLIVVVIVGALLMGL SP-C ( 1-12 ) FGIPCCPVHLKR SP-C ( 1-19 ) FGIPCCPVHLKRLLIVVVV SP-C ( 13-35 ) LLIVVVVVVLIVVVIVGALLMGL SP-CL11 PVHLKRLLLLLLLLLLL SP-CL16 PVHLKRLLLLLLLLLLLLLLLL K6L16 KKKKKKLLLLLLLLLLLLLLLL', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Arch_Med_Sci_2012_May_9_8(2)_286-295.nxml.txt.nlp', 'pubmed_id': ('22662002',), 'sentence_id': '187', 'mention1_offset': '16', 'mention2_offset': '20', 'mention1': '

{'journal': 'NULL', 'article_id': 'Influenza_Other_Respir_Viruses_2013_Nov_26_7(6)_1218-1226.nxml.txt.nlp', 'pubmed_id': ('23710832',), 'sentence_id': '44', 'mention1_offset': '15', 'mention2_offset': '29', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.493', 'excerpt': 'Amino acid sequence SP-B-type peptide SP-B ( 1-25 ) FPIPLPYCWLCRALIKRIQAMIPKG SP-B ( 20-60 ) AMIPKGALAVAVAQVCRVVPLVAGGICQCLAERYSVILLDT SP-B ( 64-80 ) RMLPQLVCRLVLRCSMD KL4 KLLLLKLLLLKLLLLKLLLLK SP-C-type peptide SP-C ( 1-35 ) FGIPCCPVHLKRLLIVVVVVVLIVVVIVGALLMGL SP-C ( 1-12 ) FGIPCCPVHLKR SP-C ( 1-19 ) FGIPCCPVHLKRLLIVVVV SP-C ( 13-35 ) LLIVVVVVVLIVVVIVGALLMGL SP-CL11 PVHLKRLLLLLLLLLLL SP-CL16 PVHLKRLLLLLLLLLLLLLLLL K6L16 KKKKKKLLLLLLLLLLLLLLLL', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Influenza_Other_Respir_Viruses_2013_Nov_26_7(6)_1218-1226.nxml.txt.nlp', 'pubmed_id': ('23710832',), 'sentence_id': '44', 'mention1_offset': '10', 'mention2_offset':

{'journal': 'NULL', 'article_id': 'Allergy_2010_Oct_65(10)_1256-1265.nxml.txt.nlp', 'pubmed_id': ('20337607',), 'sentence_id': '67', 'mention1_offset': '15', 'mention2_offset': '23', 'mention1': 'SP-B', 'mention2': 'CD81', 'geneids1': '6439', 'geneids2': '975', 'confidence': '0.471', 'excerpt': '( E ) Western blotting assessment of the levels of host cell marker proteins , CD81 , ICAM-1 , surfactant protein B ( SP-B ) , and MHC class II , in the EVs derived from LPS-treated and PBS-treated BALB/c mice .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Crit_Care_2011_Feb_10_15(1)_R57.nxml.txt.nlp', 'pubmed_id': ('21310059',), 'sentence_id': '266', 'mention1_offset': '6', 'mention2_offset': '19', 'mention1': 'RDS', 'mention2': 'SP-B', 'geneids1': '7263', 'geneids2': '6439', 'confidence': '0.471', 'excerpt': 'Exogenous surfactant preparation containing the hydrophobic SP-B and - C are nowadays widely used for replacement therapies in infantile RDS .', 'type1': 'g', 'type2'

{'journal': 'NULL', 'article_id': 'Clin_Exp_Otorhinolaryngol_2010_Mar_30_3(1)_13-17.nxml.txt.nlp', 'pubmed_id': ('20379396',), 'sentence_id': '92', 'mention1_offset': '12', 'mention2_offset': '17', 'mention1': 'SP-D', 'mention2': 'SP-B', 'geneids1': '6441', 'geneids2': '6439', 'confidence': '0.426', 'excerpt': 'The proteins that make up the remaining 10 % are SP-A , SP-B , SP-C , and SP-D .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Orphanet_J_Rare_Dis_2009_Dec_23_4_29.nxml.txt.nlp', 'pubmed_id': ('20030831',), 'sentence_id': '52', 'mention1_offset': '2', 'mention2_offset': '4', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.423', 'excerpt': 'ABCA3 , SP-B and SP-C genes were sequenced and analyzed by Ambrey Genetics .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Tob_Induc_Dis_2004_Mar_15_2(1)_3-25.nxml.txt.nlp', 'pubmed_id': ('19570267',), 'sentence_id': '176', 'mention1_offset': '12', 'mention2

{'journal': 'NULL', 'article_id': 'Lab_Invest_2011_Mar_15_91(3)_363-378.nxml.txt.nlp', 'pubmed_id': ('21079581',), 'sentence_id': '210', 'mention1_offset': '13', 'mention2_offset': '18', 'mention1': 'SP-D', 'mention2': 'SP-B', 'geneids1': '6441', 'geneids2': '6439', 'confidence': '0.398', 'excerpt': 'Indeed the undifferentiated AEPCs express both mRNA and surfactant proteins of SP-A , SP-B , pro-SP-C , and SP-D ( Figures 1c-f and 2d ) .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Crit_Care_2012_Nov_22_16(6)_238.nxml.txt.nlp', 'pubmed_id': ('23171712',), 'sentence_id': '126', 'mention1_offset': '13', 'mention2_offset': '15', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.395', 'excerpt': 'This surfactant preparation consisting of phospholipids ( 90 to 95 % ) and SP-B and SP-C ( 1 to 2 % ) was instilled for up to three doses ( totalling 600 mg/kg ) .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Ita

{'journal': 'NULL', 'article_id': 'Crit_Care_2005_Oct_5_9(6)_550-555.nxml.txt.nlp', 'pubmed_id': ('16356236',), 'sentence_id': '29', 'mention1_offset': '6', 'mention2_offset': '10', 'mention1': 'SP-D', 'mention2': 'SP-B', 'geneids1': '6441', 'geneids2': '6439', 'confidence': '0.365', 'excerpt': 'Four SPs , designated SP-A , SP-B , SP-C and SP-D , play an important role in surfactant homeostasis and protection against inhibition by plasma proteins or serum [ 10,11,14,15 ] .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Diabetes_Care_2011_May_22_34(Suppl_2)_S335-S341.nxml.txt.nlp', 'pubmed_id': ('21525479',), 'sentence_id': '122', 'mention1_offset': '7', 'mention2_offset': '9', 'mention1': 'SP-B', 'mention2': 'SP-A', 'geneids1': '6439', 'geneids2': '653509', 'confidence': '0.365', 'excerpt': 'Four surfactant proteins ( SPs ) ( SP-A , SP-B , SP-C , and SP-D ) are intimately associated with surfactant lipids in the lung ( 58 ) .', 'type1': 'g', 'type2': 'g'}
{'journal': 

{'journal': 'NULL', 'article_id': 'Gene_Ther_2010_Apr_7_17(4)_541-549.nxml.txt.nlp', 'pubmed_id': ('20054353',), 'sentence_id': '32', 'mention1_offset': '9', 'mention2_offset': '11', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.254', 'excerpt': 'The 5 flanking sequences and promoters for SP-A , SP-B , SP-C , SP-D , and cytokeratin 8 were amplified by PCR from human genomic DNA ( Promega , Madison , WI ; Table 1 ) .', 'type1': 'g', 'type2': 'g'}
{'journal': 'NULL', 'article_id': 'Gene_Ther_2010_Apr_7_17(4)_541-549.nxml.txt.nlp', 'pubmed_id': ('20054353',), 'sentence_id': '188', 'mention1_offset': '12', 'mention2_offset': '14', 'mention1': 'SP-C', 'mention2': 'SP-B', 'geneids1': '6440', 'geneids2': '6439', 'confidence': '0.246', 'excerpt': 'MLE-12 cells were cytoplasmically injected with plasmids carrying either the SP-A , SP-B , SP-C , SP-D , or keratin 8 promoter , the SV40 enhancer , or no eukaryotic promoter ( pCRII - DTS ) .', 'typ

{'journal': 'PMC', 'article_id': 'Exp_Ther_Med_2013_Apr_21_5(4)_1157-1160.nxml.txt.nlp', 'pubmed_id': ('23596483',), 'sentence_id': 'SENT23', 'mention1_offset': '5', 'mention2_offset': '0', 'mention1': 'respiratory failure', 'mention2': 'SP-B', 'geneids1': 'MESH:D012131', 'geneids2': '6439', 'confidence': '0.591', 'excerpt': '#SP-B# deficiency results in severe #respiratory failure# in term infants shortly after birth and the primary associated diseases are neonatal RDS and acinar dysplasia (5).', 'type1': 'd', 'type2': 'g'}
{'journal': 'PMC', 'article_id': 'Int_J_Chron_Obstruct_Pulmon_Dis_2007_Dec_2(4)_541-550.nxml.txt.nlp', 'pubmed_id': ('18268927',), 'sentence_id': 'SENT171', 'mention1_offset': '19', 'mention2_offset': '12', 'mention1': 'respiratory failure', 'mention2': 'SP-B', 'geneids1': 'MESH:D012131', 'geneids2': '6439', 'confidence': '0.584', 'excerpt': 'A gene variation within intron 4 of the surfactant protein B (#SP-B#) gene had been associated with #respiratory failure# in

{'journal': 'PMC', 'article_id': 'Exp_Ther_Med_2013_Apr_21_5(4)_1157-1160.nxml.txt.nlp', 'pubmed_id': ('23596483',), 'sentence_id': 'SENT95', 'mention1_offset': '9', 'mention2_offset': '8', 'mention1': 'protein deficiency', 'mention2': 'SP-B', 'geneids1': 'MESH:D011488', 'geneids2': '6439', 'confidence': '0.125', 'excerpt': 'The immunohistochemical results of autopsy lung tissue suggested #SP-B# #protein deficiency#, and the results of gene analysis indicated that an SP-B intron 4 variant caused SP-B protein deficiency.', 'type1': 'd', 'type2': 'g'}
{'journal': 'PMC', 'article_id': 'Br_J_Cancer_2011_Aug_23_105(5)_673-681.nxml.txt.nlp', 'pubmed_id': ('21811254',), 'sentence_id': 'SENT139', 'mention1_offset': '12', 'mention2_offset': '19', 'mention1': 'adenocarcinoma', 'mention2': 'SP-B', 'geneids1': 'MESH:D000230', 'geneids2': '6439', 'confidence': '0.111', 'excerpt': 'As depicted in Figure 4, a pleura carcinosis from a pulmonary #adenocarcinoma# was negative for both SP-A and #SP-B#.',

**Specific for PMC Data**

The PMC data did not include journal names, so we extract them from the `article_id` values: the journal is everything before the first token that looks like a year. Skip the next cell if journal titles were included.

In [12]:
for i in interactions[PLOS_PMC_NAME]:
    journal_split = i['article_id'].split("_")
    journal = ""

    # The journal name is every token before the first one that starts with
    # "19" or "20" (a year); if no year is found, drop only the last token.
    for x, token in enumerate(journal_split):
        if token[:2] in ("19", "20") or x == len(journal_split) - 1:
            journal = " ".join(journal_split[:x])
            break

    i['journal'] = journal
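
The journal parsing rule can also be tried in isolation. The following is a minimal sketch, assuming the `article_id` layout seen in the output above; the helper name `journal_from_article_id` is hypothetical and is not part of the original script.

```python
def journal_from_article_id(article_id):
    """Return the journal part of an article_id such as 'Crit_Care_2005_Oct_...'."""
    tokens = article_id.split("_")
    for index, token in enumerate(tokens):
        # Stop at the first token that looks like a year, or at the last token
        if token[:2] in ("19", "20") or index == len(tokens) - 1:
            return " ".join(tokens[:index])
    return ""


print(journal_from_article_id("Crit_Care_2005_Oct_5_9(6)_550-555.nxml.txt.nlp"))
```

Note that a journal abbreviation containing a token that itself starts with "19" or "20" would be cut short by this rule, so spot-checking the results on new data is worthwhile.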

Our insert statement. It probably does not need to be changed unless the `interactions` table schema changes.

In [13]:
INTERACTIONS_WRITE = '''insert into interactions ( journal, article_id, pubmed_id, sentence_id, mention1_offset, mention2_offset, mention1, mention2, geneids1, geneids2, probability, context, section, reactome , type1, type2) values ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? ,? ,?);'''
DELETE_PHARMGKB_DATA = '''DELETE FROM interactions WHERE journal = 'PharmGKB';'''
DELETE_ALL = '''DELETE FROM interactions'''
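
To illustrate how the `?` placeholders are filled, here is a small sketch against an in-memory database. The four-column table and the `DEMO_WRITE` statement are assumptions made for brevity; the real `interactions` table has 16 columns.

```python
import sqlite3

# Hypothetical, reduced table: the real interactions table has 16 columns
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE interactions (journal TEXT, mention1 TEXT, mention2 TEXT, probability REAL)"
)

DEMO_WRITE = "INSERT INTO interactions (journal, mention1, mention2, probability) VALUES (?, ?, ?, ?);"

# Each ? is bound to the value at the same position in the tuple
cursor.execute(DEMO_WRITE, ("PMC", "SP-A", "SP-B", 0.365))
conn.commit()

rows = cursor.execute(
    "SELECT journal, mention1, mention2, probability FROM interactions"
).fetchall()
print(rows)
conn.close()
```

Because the values are bound by position, the tuple passed to `execute` must list them in the same order as the column list in the statement.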

## Add the data to sqlite files

This will load the data into the plos-pmc database and the all database.

In [14]:
# Sort the DGRs so that the lower alphanumeric mention is the first one
def sort_interaction_dgrs(interaction):
    
    if interaction["mention1"].lower() > interaction["mention2"].lower():
        # store temp
        temp_mention1 = interaction["mention1"]
        temp_geneids1 = interaction["geneids1"]
        temp_type1 = interaction["type1"]
        temp_mention1_offset = interaction['mention1_offset']
        
        # move 2 to 1
        interaction["mention1"] = interaction["mention2"]
        interaction["geneids1"] = interaction["geneids2"]
        interaction["type1"] = interaction["type2"]
        interaction["mention1_offset"] = interaction["mention2_offset"]
        
        # move 1 to 2
        interaction["mention2"] = temp_mention1
        interaction["geneids2"] = temp_geneids1
        interaction["type2"] = temp_type1
        interaction["mention2_offset"] = temp_mention1_offset
        return True
    return False

current_interaction = 0
for sql in databases:
    try:
        statement = tuple()
        # Delete all data in interactions table
        sql["cursor"].execute(DELETE_ALL)
        sql["conn"].commit()
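        # sqlite3 reports rowcount as -1 for SELECT statements, so count the rows explicitly
        sql["cursor"].execute("SELECT COUNT(*) FROM interactions")
        remaining_rows = sql["cursor"].fetchone()[0]
        if remaining_rows > 0:
            raise ValueError("There are more rows than expected: " + str(remaining_rows))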
        
        # Select interactions based on which SQL we're generating
        if sql["name"] == PLOS_PMC_NAME:
            cur_interactions = interactions[PLOS_PMC_NAME]
        elif sql["name"] == ALL_NAME:
            cur_interactions = interactions[PHARMGKB_NAME] + interactions[PLOS_PMC_NAME]
        else:
            raise NameError("Could not find "+sql["name"])
            
        for interaction in log_progress(cur_interactions, every=1000, name=sql["name"]+" database progress"):
            current_interaction = interaction
            
            # If pubmed_id is not a tuple, stop
            if type(interaction['pubmed_id']) is not tuple:
                raise TypeError("pubmed_id " + str(interaction['pubmed_id']) + " is not a tuple")
                
            # Put the alphabetically lower mention first (swaps the fields in place)
            sort_interaction_dgrs(interaction)
                
            for pubid in interaction['pubmed_id']:
                statement = (
                    interaction['journal'],         # journal
                    interaction['article_id'],      # article_id 
                    pubid,                          # pubmed_id
                    interaction['sentence_id'],     # sentence_id
                    interaction['mention1_offset'], # mention1_offset
                    interaction['mention2_offset'], # mention2_offset
                    interaction['mention1'],        # mention1
                    interaction['mention2'],        # mention2
                    interaction['geneids1'],        # geneids1
                    interaction['geneids2'],        # geneids2
                    interaction['confidence'],      # probability
                    interaction['excerpt'],         # context
                    "Unknown",                      # section
                    0,                              # reactome
                    interaction['type1'],           # type1
                    interaction['type2'],           # type2
                )


                sql["cursor"].execute(INTERACTIONS_WRITE,statement)

        if WRITE:
            sql["conn"].commit()
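    except Exception:
        # Dump the failing statement and interaction, then re-raise with the original traceback
        print(INTERACTIONS_WRITE, statement)
        print(current_interaction)
        if isinstance(current_interaction, dict):
            print('pubmed_id:', current_interaction['pubmed_id'])
            print("type of pubmed_id:", type(current_interaction['pubmed_id']))
        raise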



print("All databases complete")

All databases complete
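
The mention ordering done by `sort_interaction_dgrs` can be sketched more compactly with tuple swaps. This is an illustrative alternative, not the function used above; the name `order_mentions` is hypothetical, and the sample record is taken from the first row of the output shown earlier.

```python
def order_mentions(interaction):
    """Swap the two DGRs in place so that mention1 sorts first, ignoring case."""
    if interaction["mention1"].lower() <= interaction["mention2"].lower():
        return False
    for field in ("mention", "geneids", "type"):
        first, second = field + "1", field + "2"
        interaction[first], interaction[second] = interaction[second], interaction[first]
    # Offsets use a different naming pattern (mention1_offset, not mention_offset1)
    interaction["mention1_offset"], interaction["mention2_offset"] = (
        interaction["mention2_offset"],
        interaction["mention1_offset"],
    )
    return True


example = {
    "mention1": "SP-D", "mention2": "SP-B",
    "geneids1": "6441", "geneids2": "6439",
    "type1": "g", "type2": "g",
    "mention1_offset": "6", "mention2_offset": "10",
}
swapped = order_mentions(example)
print(swapped, example["mention1"], example["mention2"])
```

Calling the function a second time on the same record returns `False` and leaves it unchanged, which makes the ordering step safe to repeat.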


In [15]:
# Close every database connection, not only the last one used in the loop
for sql in databases:
    sql["conn"].close()
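
The whole load step (clear the table, confirm it is empty, write one row per PubMed ID, commit, close) can be rehearsed on an in-memory database before touching the real files. This is a minimal sketch under stated assumptions: the three-column table and the `interactions_demo` records are hypothetical, and it uses `executemany` and `contextlib.closing` where the script above uses a per-row `execute` and an explicit `close()`.

```python
import sqlite3
from contextlib import closing

# Hypothetical, reduced schema and data for illustration only
interactions_demo = [
    {"pubmed_id": ("16356236",), "mention1": "SP-B", "mention2": "SP-D"},
    {"pubmed_id": ("20054353", "21525479"), "mention1": "SP-A", "mention2": "SP-B"},
]

with closing(sqlite3.connect(":memory:")) as conn:
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE interactions (pubmed_id TEXT, mention1 TEXT, mention2 TEXT)")
    cursor.execute("INSERT INTO interactions VALUES ('stale', 'x', 'y')")

    # Clear the table, then confirm it is empty with COUNT(*)
    cursor.execute("DELETE FROM interactions")
    remaining = cursor.execute("SELECT COUNT(*) FROM interactions").fetchone()[0]

    # One row per PubMed ID, mirroring the loop over interaction['pubmed_id']
    rows = [
        (pubid, item["mention1"], item["mention2"])
        for item in interactions_demo
        for pubid in item["pubmed_id"]
    ]
    cursor.executemany("INSERT INTO interactions VALUES (?, ?, ?)", rows)
    conn.commit()
    loaded = cursor.execute("SELECT COUNT(*) FROM interactions").fetchone()[0]

print(remaining, loaded)
```

Wrapping the connection in `closing` ensures it is closed even if an insert raises, which avoids leaving a database file locked after a failed run.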