# Extract-Transform-Load Script

Extract-Transform-Load scripts (ETLS) are common tools in data management. An ETLS gathers relevant data (both direct and inferred) from public databases and captures the important features in a schema, possibly different from the source schema, that suits a specific analysis.

## PubMed Central ETLS Example

This script Extracts data from the tab-delimited CSV files provided to us by Stanford, Transforms the data into a format usable by GeneDive, and then Loads the data into the GeneDive SQLite database.

Whenever new data is obtained for GeneDive, this process should be run against that dataset. 

In [111]:
import re
import sqlite3

In [112]:
INTERACTIONS_FILE = "genedisease_relationship_100417_sfsu.csv"
DELIMITER = "\t"
DATABASE = "data.sqlite"

If `WRITE` is `False`, the script will run but nothing will be committed to the database. This keeps it safe while you're exploring, and can also be useful if you need to re-generate the complete typeahead/adjacency files.

In [113]:
WRITE = True
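The dry-run behaviour described above can be sketched against an in-memory database. The `finish` helper, the `demo` table, and the `demo_conn` connection are hypothetical names used only for illustration; the sketch assumes statements are executed either way and that only the final commit depends on the flag.

```python
import sqlite3


def finish(conn, write):
    """Commit pending changes when write is True, otherwise discard them."""
    if write:
        conn.commit()
    else:
        conn.rollback()


# Hypothetical in-memory database, used only for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("create table demo (value text)")
demo_conn.commit()

demo_conn.execute("insert into demo (value) values (?)", ("dry run",))
finish(demo_conn, write=False)  # rolled back: the row is discarded
rows_after_dry_run = demo_conn.execute("select count(*) from demo").fetchone()[0]

demo_conn.execute("insert into demo (value) values (?)", ("real run",))
finish(demo_conn, write=True)  # committed: the row is kept
rows_after_write = demo_conn.execute("select count(*) from demo").fetchone()[0]
```

Closing a connection without committing also discards pending changes, which is what this notebook relies on; the explicit rollback here only makes that intent visible.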

In [114]:
conn = sqlite3.connect(DATABASE)
cursor = conn.cursor()

Map the columns as they appear in the file to the correct values.

In [115]:
interactions = []

with open(INTERACTIONS_FILE) as file:
    for line in file:
        line = line.rstrip("\n")  # drop the newline without eating data on a final line that lacks one
        line = line.split(DELIMITER)
        
        interaction = {
          "journal": line[0], # no change
          "article_id": line[1], # no change
          "pubmed_id": line[2], # no change
          "sentence_id": line[3], # no change
          "mention1_offset": line[4], # new data describes a mention1_offset_start and mention1_offset_end -- I arbitrarily chose to just assign offset_start here (offset start and end are often the same anyway)
          "mention2_offset": line[6], # same principle as above, but for mention2
          "mention1": line[8], # no change
          "mention2": line[9], # no change
          "geneids1": line[10], # there's a column named "geneids", but it never seems to contain more than one value "MESH:xxxxxxx"
          "geneids2": line[11], # the column after "geneids" is called "disease_ids", and may be a suitable substitute for this geneids value
          "probability": line[12], # no change
          "excerpt": line[1] # no excerpts provided, substituted with article name    
        }
        
        interactions.append(interaction)
print(interactions[5])

{'journal': 'NULL', 'article_id': 'Mol_Cancer_2009_Aug_25_8_66.nxml.txt.nlp', 'pubmed_id': '19706164', 'sentence_id': 'SENT5', 'mention1_offset': '17', 'mention2_offset': '3', 'mention1': 'glioblastoma', 'mention2': 'p53', 'geneids1': 'MESH:D005909', 'geneids2': '7157', 'probability': '0.999', 'excerpt': 'Mol_Cancer_2009_Aug_25_8_66.nxml.txt.nlp'}
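As a sketch, the same column mapping could be written with the standard `csv` module, which takes care of the delimiter and line endings. The `COLUMNS` table mirrors the indices used above; the `parse_interactions` helper is a hypothetical name, and the sample row is made up from the record printed above (the values in the two offset-end columns are assumed).

```python
import csv
import io

# Field name -> column index, mirroring the mapping used in the cell above.
COLUMNS = {
    "journal": 0,
    "article_id": 1,
    "pubmed_id": 2,
    "sentence_id": 3,
    "mention1_offset": 4,
    "mention2_offset": 6,
    "mention1": 8,
    "mention2": 9,
    "geneids1": 10,
    "geneids2": 11,
    "probability": 12,
    "excerpt": 1,  # no excerpts provided, so the article name stands in
}


def parse_interactions(handle, delimiter="\t"):
    """Read delimited rows from an open file and map them to dictionaries."""
    # QUOTE_NONE keeps stray quote characters in the text as literal data.
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return [{name: row[index] for name, index in COLUMNS.items()} for row in reader]


# Made-up sample row (13 tab-separated columns), for illustration only.
sample = "\t".join([
    "NULL", "Mol_Cancer_2009_Aug_25_8_66.nxml.txt.nlp", "19706164", "SENT5",
    "17", "17", "3", "3", "glioblastoma", "p53", "MESH:D005909", "7157", "0.999",
]) + "\n"

parsed = parse_interactions(io.StringIO(sample))
```

Keeping the indices in one dictionary makes it easier to adjust the mapping when a new dataset arrives with a different column order.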


Remove any interactions in which a gene traces to multiple IDs, as well as any interactions with a NULL value in a field other than the journal.

In [118]:
interactions = [i for i in interactions if ( ';' not in i['geneids1'] and ';' not in i['geneids2'])]
# Every field except journal must be free of NULL values
REQUIRED_FIELDS = ['article_id', 'pubmed_id', 'sentence_id', 'mention1_offset', 'mention2_offset', 'mention1', 'mention2', 'geneids1', 'geneids2', 'probability', 'excerpt']
interactions = [i for i in interactions if all('NULL' not in i[field] for field in REQUIRED_FIELDS)]

GeneDive expects the target genes in the excerpt to be wrapped in pound signs. This is important because a sentence may mention the target gene multiple times, so we need to use the offset data here to make sure we tag the right mention.

In [95]:
# for i in interactions:
#     #try:
#     print(i)
#     if (i['journal'] != 'journal'):
#         excerpt = i['excerpt']
    
#         excerpt = re.sub('"', '', excerpt)
#         tokens = excerpt.split(" ")
#         offset1 = int(i['mention1_offset'])
#         offset2 = int(i['mention2_offset'])
        
#         tokens[offset1] = "".join(["#",tokens[offset1],"#"])
#         tokens[offset2] = "".join(["#",tokens[offset2],"#"])
        
#         i['excerpt'] = " ".join(tokens)
#     #except Exception:
#     #    print(i["article_id"])
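The tagging step can be sketched as a small function. It assumes, as the commented-out cell does, that the offsets are zero-based token indices into the excerpt split on single spaces. The `tag_mentions` name and the example sentence are hypothetical, and unlike the cell above this version tags a token only once when both offsets are equal.

```python
def tag_mentions(excerpt, offset1, offset2):
    """Wrap the tokens at the two offsets in pound signs."""
    tokens = excerpt.replace('"', "").split(" ")
    # A set avoids wrapping the same token twice when the offsets coincide.
    for offset in {offset1, offset2}:
        tokens[offset] = "#" + tokens[offset] + "#"
    return " ".join(tokens)


# Made-up sentence: "p53" appears at indices 0 and 3, and only index 3 is tagged.
tagged = tag_mentions("p53 loss alters p53 signalling in glioblastoma", 3, 6)
```

Here `tagged` is `"p53 loss alters #p53# signalling in #glioblastoma#"`: the first `p53` is left alone because the offset points at the second one.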

**Specific for PMC Data**

We didn't get journal data, so we need to extract it from the article names (the `article_id` field). Comment out the next section if journal titles were included.

In [96]:
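for i in interactions:
    tokens = i['article_id'].split("_")
    # Fall back to the full name so journal is always a string, even if no year token is found
    journal = " ".join(tokens)

    # The journal name is everything before the first token that starts with a year prefix
    for x, token in enumerate(tokens):
        if token[:2] == "19" or token[:2] == "20":
            journal = " ".join(tokens[:x])
            break

    i['journal'] = journal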
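As a worked example, the same idea can be written as a function with a stricter year check. This is a sketch, not the script's own method: the `journal_from_article_id` name and the `YEAR_TOKEN` pattern are hypothetical, and treating only four-digit tokens starting with 19 or 20 as years is an assumption about how the article names are built.

```python
import re

# A token counts as a year only if it is exactly four digits starting with 19 or 20.
YEAR_TOKEN = re.compile(r"(19|20)\d{2}")


def journal_from_article_id(article_id):
    """Return the tokens before the first year token, joined by spaces."""
    tokens = article_id.split("_")
    for index, token in enumerate(tokens):
        if YEAR_TOKEN.fullmatch(token):
            return " ".join(tokens[:index])
    # No year token found: fall back to the whole identifier, as a string.
    return " ".join(tokens)


journal_name = journal_from_article_id("Mol_Cancer_2009_Aug_25_8_66.nxml.txt.nlp")
```

For the article name shown earlier, `journal_name` is `"Mol Cancer"`.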

Our insert statement. You probably don't need to touch this.

In [97]:
INTERACTIONS_WRITE = '''insert into interactions ( journal, article_id, pubmed_id, sentence_id, mention1_offset, mention2_offset, mention1, mention2, geneids1, geneids2, probability, context, section, reactome ) values ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? );'''

In [98]:
for interaction in interactions:
    # Bind the values as parameters so that quotes inside a field cannot break the statement
    values = (
        interaction['journal'],
        interaction['article_id'],
        interaction['pubmed_id'],
        interaction['sentence_id'],
        interaction['mention1_offset'],
        interaction['mention2_offset'],
        interaction['mention1'],
        interaction['mention2'],
        interaction['geneids1'],
        interaction['geneids2'],
        interaction['probability'],
        interaction['excerpt'],  # stored in the context column
        "Unknown",  # section
        0  # reactome
    )
    
    cursor.execute(INTERACTIONS_WRITE, values)

if WRITE:
    conn.commit()

conn.close()

OperationalError: no such table: interactions
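
The `OperationalError` above means the database opened at `DATABASE` has no `interactions` table. One possible cause is that `sqlite3.connect` creates a new, empty database when the file does not exist at the given path. A minimal sketch of creating the table before inserting follows; the column names come from the insert statement, but the column types are guesses rather than the actual GeneDive schema, and the in-memory database and the `schema_conn` name are used only for illustration.

```python
import sqlite3

# Column names come from the insert statement; the types are assumptions.
CREATE_INTERACTIONS = """
create table if not exists interactions (
    journal text,
    article_id text,
    pubmed_id text,
    sentence_id text,
    mention1_offset integer,
    mention2_offset integer,
    mention1 text,
    mention2 text,
    geneids1 text,
    geneids2 text,
    probability real,
    context text,
    section text,
    reactome integer
);
"""

schema_conn = sqlite3.connect(":memory:")
schema_conn.execute(CREATE_INTERACTIONS)

# List the tables now present in the database.
tables = [
    row[0]
    for row in schema_conn.execute("select name from sqlite_master where type = 'table'")
]
schema_conn.close()
```

If the real GeneDive database already defines this table, checking that `DATABASE` points at the right file is likely the better fix than creating a new table.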