# Extract-Transform-Load Script

Extract-Transform-Load Scripts (ETLS) are common tools in data management. The purpose of ETLS is to gather relevant data (both direct and inferred) from public databases and capture important features in a possibly different data structure schema for specific analysis.

## PubMed Central ETLS Example

This script will Extract data from the CSV files provided to us by Stanford, Transform the data into a format usable by GeneDive, and then Load the data into the GeneDive sqlite database.

Whenever new data is obtained for GeneDive, this process should be run against that dataset. 

In [1]:
import re
import sqlite3
from shutil import copy2

In [2]:
# Progress Bar I found on the internet.
# https://github.com/alexanderkuk/log-progress
from progress_bar import log_progress

## <span style="color:red">IMPORTANT!</span> You need to create folders and organize the data before starting

Below are many file names and directory names. You have to create the directories and put the files correctly in them.

`GENE_GENE_INTERACTIONS_FILE`, `GENE_DRUG_INTERACTIONS_FILE`, and `GENE_DISEASE_INTERACTIONS_FILE` are TSV files from Emily. If they come with a .csv extension, and they are tab seperated, rename them. If they are deliminated some other way, change their extensions appropriately and change the value of `DELIMITER` below.

`GOOD_PHARM_GKB_DB`, `GOOD_ALL_DB`, and `GOOD_PLOS_PMC_DB` are the current working, valid databases used in GeneDive. They will not be altered, but instead they will be copied and updated.

`PLOS_PMC_DB` and `ALL_DB` are the newly generated databases.

In [3]:
PLOS_PMC_NAME = "PLOS-PMC"
PHARMGKB_NAME = "PHARMGKB"
ALL_NAME = "ALL"

# TSV files containing PLOS-PMC data
GENE_GENE_PLOS_INTERACTIONS_FILE    = 'tsv_data/plos_pmc/plos_with_excerpts.tsv'
GENE_GENE_PMC_INTERACTIONS_FILE     = 'tsv_data/plos_pmc/pmc_with_excerpts.tsv'
GENE_DRUG_INTERACTIONS_FILE         = 'tsv_data/genedrug_relationship_100417_sfsu_with_excerpts.tsv'
GENE_DISEASE_INTERACTIONS_FILE      = 'tsv_data/genedisease_relationship_100417_sfsu_with_excerpts.tsv'

# TSV files containing Pharm-GKB data
PHARMGKB_INTERACTIONS_FILE          = 'tsv_data/pharmgkb/relationships.tsv'
PHARMGKB_CHEMICAL_IDS_FILE          = 'tsv_data/pharmgkb/ids/chemicals.tsv'
PHARMGKB_DRUGS_IDS_FILE             = 'tsv_data/pharmgkb/ids/drugs.tsv'
PHARMGKB_GENES_IDS_FILE             = 'tsv_data/pharmgkb/ids/genes.tsv'
PHARMGKB_PHENOTYPES_IDS_FILE        = 'tsv_data/pharmgkb/ids/phenotypes.tsv'

# These will be unaltered
GOOD_PHARM_GKB_DB = 'sqlite_data/good_data/data.pgkb.sqlite'
GOOD_ALL_DB = 'sqlite_data/good_data/data.all.sqlite'
GOOD_PLOS_PMC_DB = 'sqlite_data/good_data/data.plos-pmc.sqlite'

# These will be created/overwritten
PLOS_PMC_DB = 'sqlite_data/data.plos-pmc.sqlite' # This will be just the data from emilies files
ALL_DB = 'sqlite_data/data.all.sqlite' # This is a combination of emilies files and PharmGKB

# if excepts alrady come wrapped with pound signs, set this to false
WRAP_EXCERPTS = True

DELIMITER = "\t"
EMILYS_FILES = [
    {"filename":GENE_GENE_PLOS_INTERACTIONS_FILE,"type":"GeneGene", "source" : "PLOS"},
    {"filename":GENE_GENE_PMC_INTERACTIONS_FILE,"type":"GeneGene", "source" : "PMC"},
    {"filename":GENE_DRUG_INTERACTIONS_FILE,"type":"GeneDrug", "source" : "Emily"},
    {"filename":GENE_DISEASE_INTERACTIONS_FILE,"type":"GeneDisease", "source" : "Emily"},
]

PHARMGKB_ID_FILES = [
     {"filename":PHARMGKB_CHEMICAL_IDS_FILE, "type" : "chemicals"},
     {"filename":PHARMGKB_DRUGS_IDS_FILE, "type" : "drugs"},
    {"filename":PHARMGKB_GENES_IDS_FILE, "type" : "genes",},
     {"filename":PHARMGKB_PHENOTYPES_IDS_FILE, "type" : "phenotypes"},
]


MESH_VALUE = "mesh"
PHARMGKB_VALUE = "pgkb"
NCBI_VALUE = "ncbi"



If write is false, the script will run but not write anything to the database. This keeps it safe while you're nosing around, and can also be useful if you need to re-generate the complete typeahead/adjacency files.

In [4]:
WRITE = True

This copies the fields, then initializes the connections. The `databases` array will be looped through at the end, applying all the interactions to each database.

In [5]:
copy2(GOOD_PLOS_PMC_DB, PLOS_PMC_DB)
conn_plos_pmc = sqlite3.connect(PLOS_PMC_DB)
cursor_plos_pmc = conn_plos_pmc.cursor()

copy2(GOOD_ALL_DB, ALL_DB)
conn_all = sqlite3.connect(ALL_DB)
cursor_all = conn_all.cursor()


databases = [
#     {"conn": conn_plos_pmc, "cursor": cursor_plos_pmc, "name": PLOS_PMC_NAME}, 
    {"conn": conn_all, "cursor": cursor_all, "name": ALL_NAME}
]

## PharmGKB ids
This data is used to convert ids in our database. Genes should be NCBI ids, Drugs should be PharmGKB, and Diseases should be Mesh

In [6]:
id_map = {}

for id_file in PHARMGKB_ID_FILES:
    with open(id_file["filename"]) as file:
        try:
            header = None
            linenum = 0
            for line in file:
                linenum+=1
                pgkb = None
                ncbi = None
                mesh = None

                # deliminate the lines
                line = line.strip().split(DELIMITER)

                # Read the headers of the file and assign them to a dictionary {column_name: column_number}
                if linenum == 1:
                    header = {name.strip(): col for col, name in enumerate(line)}
                    print(header)                    
                    continue

                


                # set variables
                pgkb = line[header["PharmGKB Accession Id"]]

                if "NCBI Gene ID" in header:
                    ncbi = line[header["NCBI Gene ID"]]
                if 'External Vocabulary'in header:
                    if len(line) > header['External Vocabulary']: # If a line doesn't have data on the end, it wont be in the delimination
                        external = str(line[header['External Vocabulary']]).replace('"', "") # weird bug where quotes break regex
                        match = re.match('MESH:[0-9A-Za-z]+',external,re.IGNORECASE)
                        if match != None:
                            mesh = match.group(0).upper()

                # fill map   
                values = {PHARMGKB_VALUE: pgkb}                
                if ncbi is not None:
                    values[NCBI_VALUE] = ncbi
                if mesh is not None:
                    values[MESH_VALUE] = mesh
        
                id_map[pgkb] = values
                
                if ncbi is not None:
                    id_map[ncbi] = values
                if mesh is not None:
                    id_map[mesh] = values
                

        except Exception as e:
            print(line)
            raise e

{'PharmGKB Accession Id': 0, 'Name': 1, 'Generic Names': 2, 'Trade Names': 3, 'Brand Mixtures': 4, 'Type': 5, 'Cross-references': 6, 'SMILES': 7, 'InChI': 8, 'Dosing Guideline': 9, 'External Vocabulary': 10, 'Clinical Annotation Count': 11, 'Variant Annotation Count': 12, 'Pathway Count': 13, 'VIP Count': 14, 'Dosing Guideline Sources': 15, 'Top Clinical Annotation Level': 16, 'Top FDA Label Testing Level': 17, 'Top Any Drug Label Testing Level': 18, 'Label Has Dosing Info': 19, 'Has Rx Annotation': 20}
{'PharmGKB Accession Id': 0, 'Name': 1, 'Generic Names': 2, 'Trade Names': 3, 'Brand Mixtures': 4, 'Type': 5, 'Cross-references': 6, 'SMILES': 7, 'InChI': 8, 'Dosing Guideline': 9, 'External Vocabulary': 10, 'Clinical Annotation Count': 11, 'Variant Annotation Count': 12, 'Pathway Count': 13, 'VIP Count': 14, 'Dosing Guideline Sources': 15, 'Top Clinical Annotation Level': 16, 'Top FDA Label Testing Level': 17, 'Top Any Drug Label Testing Level': 18, 'Label Has Dosing Info': 19, 'Has Rx

In [8]:
# Just to test
id_map['PA447298']

{'pgkb': 'PA447298', 'mesh': 'MESH:D054556'}

## PLOS-PMC
Map the columns as they appear in the file to the correct values.

In [9]:
# If the exercept is not found, don't run the Excerpt wrapping cell below
excerptFound = False
interactions = {
    PLOS_PMC_NAME : [],
    PHARMGKB_NAME : [],
}
for data_file in EMILYS_FILES:
    
    gene_drug_file = data_file["type"] == "GeneDrug"
    gene_disease_file = data_file["type"] == "GeneDisease"
    gene_gene_file = data_file["type"] == "GeneGene"
    
    # Identifying each DGR based on the file type a Drug (r) / Chemical (c), Disease (d), or Gene (g),
    dgd_type1 = ""
    dgd_type2 = "g"
    if gene_drug_file:
        dgd_type1 = "r"
    elif gene_disease_file:
        dgd_type1 = "d"
    elif gene_gene_file:
        dgd_type1 = "g"
    else:
        raise ValueError('{type} is an unrecognized type in EMILYS_FILES'.format(type = data_file["type"]))
        
    with open(data_file["filename"], encoding='utf-8') as file :
        header = None
        linenum = 0  
        for line in file:
            linenum+=1
            
            # Read the headers of the file and assign them to a dictionary {column_name: column_number}
            if linenum == 1:
                header = {name.strip(): col for col, name in enumerate(line.split(DELIMITER))}
                
                # The GeneGene headers differ from Gene Drug and Gene Disease. This normalizes them.
                if "geneids" in header and "disease_ids" in header: # GeneDrug/GeneDisease
                    header["dgr1"] = header["geneids"]
                    header["dgr2"] = header["disease_ids"]                    
                    header["mention1_offset"] = header["mention1_offset_start"]
                    header["mention2_offset"] = header["mention2_offset_start"]
                elif "geneids1" in header and "geneids2" in header: # GeneGene
                    header["dgr1"] = header["geneids1"]
                    header["dgr2"] = header["geneids2"]
                else:
                    raise ValueError('{f} column headers didn\'t contain expected values'.format(f = data_file["filename"]))
                
                # if no excerpts provided, substituted with article name
                needsTokens = False
                if "excerpt" in header:                    
                    excerptFound =  True
                elif "sentence" in header:                    
                    excerptFound =  True
                    header["excerpt"] = header["sentence"]
                else:
                    header["excerpt"] = header["article_id"]
                
                continue
                
            line = line.strip().split(DELIMITER)
            
            section = "Unknown"
            if "section" in data_file:
                section = data_file["section"]            
            
            interaction = {
                "journal": line[header["journal"]], # no change
                "article_id": line[header["article_id"]], # no change
                "pubmed_id": (line[header["pubmed_id"]],), # make it a tuple, so that it can be looped over later
                "sentence_id": line[header["sentence_id"]], # no change
                "mention1_offset": line[header["mention1_offset"]], # new data describes a mention1_offset_start and mention1_offset_end -- I arbitarily chose to just assign offset_start here (offset start and end are often the same anyway) 
                "mention2_offset": line[header["mention2_offset"]], # same principle as above, but for mention2
                "mention1": line[header["mention1"]], # no change
                "mention2": line[header["mention2"]], # no change
                "geneids1": line[header["dgr1"]], # there's a column named "geneids", but it never seems to contain more than one value "MESH:xxxxxxx"
                "geneids2": line[header["dgr2"]], # the column after "geneids" is called "disease_ids", and may be a suitable substitute for this geneids value
                "confidence": line[header["probability"]], # no change
                "excerpt": line[header["excerpt"]],
                "type1" : dgd_type1,
                "type2" : dgd_type2,
              }
            
            if data_file["source"] == "PMC" and interaction["journal"] == "NULL":
                interaction["journal"] = data_file["source"]
            
            try:
                mention1_offset = int(interaction["mention1_offset"])
                mention2_offset = int(interaction["mention2_offset"])
                excerpt = line[header["excerpt"]]                
            except Exception as e:
                print(line)
                print(mention1_offset,excerpt)
                raise e
    
            interactions[PLOS_PMC_NAME].append(interaction)



## PharmGKB
Map the columns as they appear in the file to the correct values.

In [22]:
interactions[PHARMGKB_NAME] = []
unmapped = {}

def addToUnmapped(dgr, dgr_type_original, dgr_type_target):
    unmapped[dgr] = "{o} -/-> {t}".format(o = dgr_type_original, t = dgr_type_target)
    
geneid_type = {
    "Gene" : "g",
    "Disease" : "d",
    "Chemical" : "c",
    "Drug" : "r",
    "Haplotype" : "d",
    # "Variant" : "d",
    #"VariantLocation" : "C",
    
}

types = set()

# If not in the prepend dictionary, the type will not be added and instead added to this set
types_ignored = set()
total_ignored = 0

# This prevents duplicates
seen_interactions = set()
duplicates_found = 0

linenum = 0  
no_pubmed_ids = 0
# Prepending to identify each DGR so that they can be identified as a Gene (no-prepend), Drug (C), or Disease (D)
with open(PHARMGKB_INTERACTIONS_FILE) as file:
    header = None

    for line in log_progress(file, every=1000, name=PHARMGKB_INTERACTIONS_FILE+" progress"):
        linenum+=1
        line = line.strip().split(DELIMITER)
        
        # Read the headers of the file and assign them to a dictionary {column_name: column_number}
        if linenum == 1:
            header = {name.strip(): col for col, name in enumerate(line)}
            continue
       
        dgr1 = line[header["Entity1_id"]]
        dgr2 = line[header["Entity2_id"]]
        
        # prepend the GeneIDs appropriately
        try:
            type1 = line[header["Entity1_type"]]
            dgr_type1 = geneid_type[type1]
            if type1 not in types:
                types.add(type1)
        except KeyError:
            types_ignored.add(type1)
            total_ignored += 1
            continue
        try:
            type2 = line[header["Entity2_type"]]
            dgr_type2 = geneid_type[type2]
            if type2 not in types:
                types.add(type2)
        except KeyError:
            types_ignored.add(type2)
            total_ignored += 1
            continue            

        # Replace PharmGKB GeneIDs with NCBI ids
        if dgr_type1 == "g":
            if dgr1 not in id_map or NCBI_VALUE not in id_map[dgr1]:
                addToUnmapped(dgr1, type1, NCBI_VALUE)
            else:
                dgr1 = id_map[dgr1][NCBI_VALUE]
        if dgr_type2 == "g":
            if dgr2 not in id_map or NCBI_VALUE not in id_map[dgr2]:
                addToUnmapped(dgr2, type2, NCBI_VALUE)
            else:
                dgr2 = id_map[dgr2][NCBI_VALUE]
            

        if str(dgr_type1) not in "dgrc" or str(dgr_type2) not in "dgrc":
            print("type1:\"{a}\", type2:\"{b}\"".format(a = dgr_type1, b= dgr_type2))
            raise ValueError(i)
            
        interaction = {
            "journal": "PharmGKB", 
            "article_id": "0",
            "pubmed_id": "0",
            "sentence_id": "0", 
            "mention1_offset": "0", 
            "mention2_offset": "0",
            "mention1": line[header["Entity1_name"]],
            "mention2": line[header["Entity2_name"]],
            "geneids1": dgr1,
            "geneids2": dgr2,
            "confidence": "0.999",
            "excerpt": "Source: PharmGKB",
            "type1" : dgr_type1,
            "type2" : dgr_type2,
        }
        
        
        # Not all lines will have PMIDs, and will error out if you try to access it
        try:
            pubids = line[header["PMIDs"]].split(";")                
            interaction["pubmed_id"] = tuple(pubids)
        except IndexError:
            no_pubmed_ids += 1
            continue
            
                    
        # Remap Diseases to MESH
        if dgr_type1 == "d":
            if dgr1 not in id_map or MESH_VALUE not in id_map[dgr1]:
                addToUnmapped(dgr1, type1, MESH_VALUE)
            else:
                dgr1 = id_map[dgr1][MESH_VALUE]
        if dgr_type2 == "d":
            if dgr2 not in id_map or MESH_VALUE not in id_map[dgr2]:
                addToUnmapped(dgr2, type2, MESH_VALUE)
            else:
                dgr2 = id_map[dgr2][MESH_VALUE]
                
               
        interactions[PHARMGKB_NAME].append(interaction)

            

            
print("{total} interactions processed".format(total = linenum))
print("{missing} didn't have Pubmed IDs".format(missing = no_pubmed_ids))
print("{ignore} interactions were ignored due to having at least one of these types:".format(ignore=total_ignored), types_ignored)

VBox(children=(HTML(value=''), IntProgress(value=1, bar_style='info', max=1)))

66690 interactions processed
3754 didn't have Pubmed IDs
20747 interactions were ignored due to having at least one of these types: {'Variant', 'VariantLocation'}


In [23]:
id_map["PA166114942"]

{'pgkb': 'PA166114942'}

Remove any interactions for which the a gene traces to multiple IDs.

In [24]:

def invalidInteraction(i):
    t1 = i['type1']
    t2 = i['type2']
    id1 = i['geneids1']
    id2 = i['geneids2']
    
    if t1 not in "dgr" or t2 not in "dgr":
        print("type1:\"{a}\", type2:\"{b}\"".format(a = t1, b= t2))
        raise ValueError(i)
        
    if ("PA" in id1 and "g" == t1) or ("PA" in id2 and "g" == t2):
        raise ValueError(i)
    
    return (';' in i['geneids1'] 
    or ';'    in i['geneids2']
    or 'NULL' in i['article_id'] 
    or 'NULL' in i['pubmed_id'] 
    or 'NULL' in i['sentence_id'] 
    or 'NULL' in i['mention1_offset'] 
    or 'NULL' in i['mention2_offset'] 
    or 'NULL' in i['mention1'] 
    or 'NULL' in i['mention2'] 
    or 'NULL' in i['geneids1'] or len(i['geneids1']) == 0
    or 'NULL' in i['geneids2'] or len(i['geneids2']) == 0
    or 'NULL' in i['confidence'] 
    or 'NULL' in i['excerpt'])
       
for source in interactions:
    totalInteractions = len(interactions[source])

    interactions[source] = [x for x in interactions[source] if not invalidInteraction(x)]

    newTotal = len(interactions[source])
    print(
    '''{source}:
    Total Interactions:     {total}
    Filtered Interactions:  {filtered}
    Remaining Interactions: {remaining}'''
          .format(source = source, total = totalInteractions,filtered=totalInteractions-newTotal , remaining =newTotal))


PLOS-PMC:
    Total Interactions:     3832749
    Filtered Interactions:  703112
    Remaining Interactions: 3129637
type1:"c", type2:"g"


ValueError: {'journal': 'PharmGKB', 'article_id': '0', 'pubmed_id': ('24695352',), 'sentence_id': '0', 'mention1_offset': '0', 'mention2_offset': '0', 'mention1': 'lamivudine', 'mention2': 'CYP2A6', 'geneids1': 'PA450163', 'geneids2': '1548', 'confidence': '0.999', 'excerpt': 'Source: PharmGKB', 'type1': 'c', 'type2': 'g'}

## Excerpt wrapping
GeneDive expects the target genes in the excerpt to be wrapped in pound signs. This is important because a sentence may mention the target gene multiple times, so we need to use the offset data her to make sure we tag the right mention.

In [None]:
total_skipped = 0
if excerptFound and WRAP_EXCERPTS:
    for i in interactions[PLOS_PMC_NAME]:
        #try:
        if i['journal'] != 'journal' and 'excerpt' in i:
            excerpt = i['excerpt']

            excerpt = re.sub('"', '', excerpt)
            tokens = excerpt.split(" ")
            offset1 = int(i['mention1_offset'])
            offset2 = int(i['mention2_offset'])
            

            
            if ( offset1 >= len(tokens) or offset2 >= len(tokens) ) or ('#'+i['mention1']+'#' in excerpt or '#'+i['mention2']+'#' in excerpt):
                total_skipped+=1
                continue

            tokens[offset1] = "".join(["#",tokens[offset1],"#"])
            tokens[offset2] = "".join(["#",tokens[offset2],"#"])

            i['excerpt'] = " ".join(tokens)
        #except Exception:
        #    print(i["article_id"])
print("Total skipped " + str(total_skipped))

**Specific for PMC Data**

We didn't get Journal Data - we need to extract it from the article titles. Comment out the next section if journal titles were included.

In [None]:
for i in interactions[PLOS_PMC_NAME]:
    journal_split = i['article_id'].split("_")
    x = 0
    length = len(journal_split)
    journal = ""
    
    while x < length:
        if journal_split[x][:2] == "19" or journal_split[x][:2] == "20" or x == length -1:
            journal = " ".join(journal_split[:x])
            break
        x+= 1
    
    break

    i['journal'] = journal

Our insert statement - probably don't need to touch this

In [25]:
INTERACTIONS_WRITE = '''insert into interactions ( journal, article_id, pubmed_id, sentence_id, mention1_offset, mention2_offset, mention1, mention2, geneids1, geneids2, probability, context, section, reactome , type1, type2) values ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? ,? ,?);'''
DELETE_PHARMGKB_DATA = '''DELETE FROM interactions WHERE journal = "PharmGKB";'''
DELETE_ALL = '''DELETE FROM interactions'''

## Add the data to sqlite files

This will load the data into the plos-pmc database and the all database.

In [27]:
# Sort the DGRs so that the lower alphanumeric mention is the first one
def sort_interaction_dgrs(interaction):
    
    if interaction["mention1"].lower() > interaction["mention2"].lower():
        # store temp
        temp_mention1 = interaction["mention1"]
        temp_geneids1 = interaction["geneids1"]
        temp_type1 = interaction["type1"]
        temp_mention1_offset = interaction['mention1_offset']
        
        # move 2 to 1
        interaction["mention1"] = interaction["mention2"]
        interaction["geneids1"] = interaction["geneids2"]
        interaction["type1"] = interaction["type2"]
        interaction["mention1_offset"] = interaction["mention2_offset"]
        
        # move 1 to 2
        interaction["mention2"] = temp_mention1
        interaction["geneids2"] = temp_geneids1
        interaction["type2"] = temp_type1
        interaction["mention2_offset"] = temp_mention1_offset
        return True
    return False

current_interaction = 0
for sql in databases:
    try:
        statement = tuple();
        # Delete all data in interactions table
        sql["cursor"].execute(DELETE_ALL)
        sql["conn"].commit()
        sql["cursor"].execute("Select * from interactions")
        if sql["cursor"].rowcount > 0:
            raise ValueError("There is more rows than expected" + sql["cursor"].rowcount)
        
        # Select interactions based on which SQL we're generating
        if sql["name"] == PLOS_PMC_NAME:
            cur_interactions = interactions[PLOS_PMC_NAME]
        elif sql["name"] == ALL_NAME:
            cur_interactions = interactions[PHARMGKB_NAME] + interactions[PLOS_PMC_NAME]
        else:
            raise NameError("Could not find "+sql["name"])
            
        for interaction in log_progress(cur_interactions, every=10000, name=sql["name"]+" database progress"):
            current_interaction = interaction
            
            # If not a tuple, end it
            if type(interaction['pubmed_id']) is not tuple:
                raise TypeError("pubmed_id "+interaction['pubmed_id']+ "is not a tuple")
                
            sort_interaction_dgrs(interaction)
            dgr_type2=interaction['type1']
            dgr_type1= interaction['type2']
                
            for pubid in interaction['pubmed_id']:
                statement = (
                    interaction['journal'],         # journal
                    interaction['article_id'],      # article_id 
                    pubid,                          # pubmed_id
                    interaction['sentence_id'],     # sentence_id
                    interaction['mention1_offset'], # mention1_offset
                    interaction['mention2_offset'], # mention2_offset
                    interaction['mention1'],        # mention1
                    interaction['mention2'],        # mention2
                    interaction['geneids1'],        # geneids1
                    interaction['geneids2'],        # geneids2
                    interaction['confidence'],      # probability
                    interaction['excerpt'],         # context
                    "Unknown",         # section
                    0,                              # reactome
                    interaction['type1'],           # type1
                    interaction['type2'],           # type2
                )


                sql["cursor"].execute(INTERACTIONS_WRITE,statement)

        if WRITE:
            sql["conn"].commit()
    except Exception as e:
        print(INTERACTIONS_WRITE, statement)
        print(current_interaction)
        print('pubmed_id:',interaction['pubmed_id'])
        print("type of pubmed_id:", type(interaction['pubmed_id']))        
        raise e



print("All databases complete")

VBox(children=(HTML(value=''), IntProgress(value=0, max=3171825)))

All databases complete


In [28]:
sql["conn"].close()