## Optional Homework #1 (for the updated JSON file)

For anyone looking to try out their skills, try one or all of these exercises:

The data structure used by clinicaltrials.ucsf.edu includes a number of fields and formatting choices that keep the website operational but aren't ideal for our needs. However, if we want our modeling work to be useful, we need an easy way to join the output of our NLP model to the production JSON data structure. The production data structure is a dictionary of dictionaries whose primary key is the clinical trial "nct_number": an 11-character unique identifier made up of the string "NCT" followed by 8 digits (e.g., 'NCT12345678'). Our NLP data structure should therefore also be indexed by this identifier.

Additionally, only certain trials are visible on the website, so we will need to know which trials are considered "visible" and "joinable" in order to categorize only the visible trials. The variable names in the JSON are 'is_joinable' and 'is_visible'.
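Because both structures are keyed by NCT number, joining model output back onto the production data comes down to a dictionary lookup. Here is a minimal sketch on made-up data. The names `production`, `predictions`, and `predicted_cluster` are hypothetical, and the sketch assumes that `is_visible` and `is_joinable` are booleans and that a trial must have both set to be kept.

```python
import re

# An NCT number is the literal "NCT" followed by 8 digits (11 characters total).
NCT_PATTERN = re.compile(r"NCT\d{8}")

def is_valid_nct(key):
    """Return True if key looks like an NCT number, e.g. 'NCT12345678'."""
    return isinstance(key, str) and NCT_PATTERN.fullmatch(key) is not None

def join_predictions(production, predictions):
    """Attach a prediction to a copy of each visible, joinable trial."""
    joined = {}
    for nct, trial in production.items():
        if not is_valid_nct(nct):
            continue
        # Assumption: both flags are booleans and both must be true.
        if not (trial.get("is_visible") and trial.get("is_joinable")):
            continue
        if nct in predictions:
            # "predicted_cluster" is a hypothetical field name.
            joined[nct] = {**trial, "predicted_cluster": predictions[nct]}
    return joined

# Toy data, not taken from the real export.
production = {
    "NCT12345678": {"is_visible": True, "is_joinable": True},
    "NCT87654321": {"is_visible": False, "is_joinable": True},
}
predictions = {"NCT12345678": "Cancer", "NCT87654321": "Other"}
joined = join_predictions(production, predictions)
print(joined)
```

Building a new dictionary rather than editing `production` in place keeps the production data untouched while we experiment.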

#### Exercise 1
Your first task is to build a smaller test dataset that removes some of the fields we do not care about. Create a new data structure, and for each trial, copy over only the data that you believe would be useful as inputs to our model (plus the variables mentioned above) and save it to disk. Remember that modeling works best when we focus on the details that change appreciably between our units of analysis (i.e., trials). In other words, avoid retaining variables like "sponsor" or "recruitment status", because they won't really contribute much to our models.

#### Exercise 2
Many of the most interesting fields (like 'summary') are riddled with formatting characters. This is great for hosting a website, but it is just noise for our NLP work. For each of the free-text fields you've extracted in Exercise 1, remove all of the HTML formatting (Hint: The Python package BeautifulSoup can be very useful here!).

#### Exercise 2a
If you'd like to extend Exercise 2, it would be useful to use spaCy, NLTK, or something similar to break the free-text sections from Exercise 2 into individual sentences and store them as elements in a list.

#### Exercise 3
Our goal for our modeling project is to categorize each of the trials into a unique category. The language of clinicaltrials.ucsf.edu defines "clusters" as the primary category, and "conditions" as the sub-category. Each trial can be classified as relating to multiple "conditions", and each condition can belong to multiple "clusters." This complicates our modeling task appreciably. To simplify our project, we will assume that all "conditions" are adequately assigned (not a bad assumption, since very few conditions are "Other") and we will assume that any clinical trial that has at least 1 "cluster" not equal to "Other" can be categorized into that non-Other cluster. In other words, we are only interested in classifying trials that cannot be otherwise categorized.

For exercise 3, please create a field in our data structure that indicates whether a specific trial has ONLY "other" clusters (even if there are more than 1). 

## Exercise 1

In [2]:
import json
json_data_file = "/Users/rthombley/Downloads/__data_exprt.json"

In [3]:
with open(json_data_file, 'rb') as ct_raw:
    jdata = json.load(ct_raw)

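Let's start by exploring some of the data.
We can start by printing out a few of the first keys of the jdata dictionary (these are the NCT codes). Note that the slice `[1:10]` skips the key at index 0, so it returns nine keys, starting with the second one.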

In [4]:
[x for x in jdata.keys()][1:10]

['NCT00000134',
 'NCT00000136',
 'NCT00000139',
 'NCT00000142',
 'NCT00000143',
 'NCT00000168',
 'NCT00000173',
 'NCT00000409',
 'NCT00000410']

Now let's pick one of these keys and inspect the dictionary contents.

In [5]:
jdata['NCT00000134']

{'completion_date': '1995-03-01',
 'conditions': [{'clusters': ['Infectious Diseases'],
   'name': 'HIV/AIDS',
   'slug': 'hiv-aids'},
  {'clusters': ['Other'],
   'name': 'Cytomegalovirus Retinitis',
   'slug': 'cytomegalovirus-retinitis'}],
 'description_html': '<p>CMV retinitis is the most common intraocular infection in patients with AIDS and is estimated to affect 35 to 40 percent of patients with AIDS. Untreated CMV retinitis is a progressive disorder, the end result of which is total retinal destruction and blindness. At the time of this trial, drugs approved by the United States Food and Drug Administration (FDA) for the treatment of CMV retinitis were ganciclovir (Cytovene) and foscarnet (Foscavir). Although most retinitis responds well to initial therapy with systemically administered drugs, given enough time, nearly all patients will suffer a relapse of the retinitis. Relapsed retinitis generally responds to reinduction and maintenance therapy, but the interval between succe

Ok - so, looking through this, I think I want to use the following elements in our analytic dataset:
* The raw condition tree
* description_html
* eligibility_by_age_open_to_18_and_over
* eligibility_by_age_open_under_18
* eligibility_by_sex_all
* eligibility_by_sex_female
* eligibility_by_sex_male
* eligibility_exclusion_html
* eligibility_inclusion_html
* eligibility_healthy_volunteers
* eligibility_tags
* is_joinable
* is_visible
* keywords
* summary
* title_brief
* title_official

From these elements, I'll also create a few derived elements:
* Parsed conditions & cluster lists
* Assigned category label
* on_website flag
* pediatric flag
* adult flag
* female only flag
* male only flag
* lists for the parsed tokens
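
As a rough illustration of the flag-style derived elements listed above, here is a sketch of how they might be computed from the retained fields. The function name `derive_flags` and the output key names are hypothetical. It assumes every field it reads holds a boolean (the truncated output above does not show their types), that "on website" means both visible and joinable, and that "female only" means the female field is set while the male field is not.

```python
def derive_flags(trial):
    """Return derived flags for one trial dictionary (a sketch under stated assumptions)."""
    # Assumption: each field read here is a boolean, or missing (treated as False).
    female = bool(trial.get("eligibility_by_sex_female"))
    male = bool(trial.get("eligibility_by_sex_male"))
    return {
        "on_website": bool(trial.get("is_visible")) and bool(trial.get("is_joinable")),
        "pediatric": bool(trial.get("eligibility_by_age_open_under_18")),
        "adult": bool(trial.get("eligibility_by_age_open_to_18_and_over")),
        "female_only": female and not male,
        "male_only": male and not female,
    }

# Toy trial, not taken from the real export.
toy_trial = {
    "is_visible": True,
    "is_joinable": False,
    "eligibility_by_age_open_under_18": False,
    "eligibility_by_age_open_to_18_and_over": True,
    "eligibility_by_sex_female": True,
    "eligibility_by_sex_male": False,
}
flags = derive_flags(toy_trial)
print(flags)
```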

So, let's begin by extracting the elements of interest

In [6]:
elements_of_interest = ['conditions','description_html', 'eligibility_by_age_open_to_18_and_over','eligibility_by_age_open_under_18',
                       'eligibility_by_sex_all','eligibility_by_sex_female','eligibility_by_sex_male',
                       'eligibility_exclusion_html','eligibility_inclusion_html','eligibility_healthy_volunteers',
                       'eligibility_tags','is_joinable','is_visible','keywords','summary','title_brief','title_official']
analytic_data = {}
for nct in jdata.keys():
    analytic_data[nct] = {k:jdata[nct][k] for k in jdata[nct].keys() if k in elements_of_interest}
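
Exercise 1 also asks us to save the subset to disk. One way to do that is sketched below with the standard `json` module on a toy dictionary. The helper names `save_trials` and `load_trials` and the file name `analytic_data.json` are made up for illustration, and the sketch writes to a temporary directory rather than any real project path.

```python
import json
import os
import tempfile

def save_trials(trials, path):
    """Write the trial dictionary to disk as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as out_file:
        json.dump(trials, out_file, indent=2, ensure_ascii=False)

def load_trials(path):
    """Read a trial dictionary back from disk."""
    with open(path, "r", encoding="utf-8") as in_file:
        return json.load(in_file)

# Toy data, not taken from the real export.
sample = {"NCT12345678": {"is_visible": True, "keywords": ["example"]}}
out_path = os.path.join(tempfile.mkdtemp(), "analytic_data.json")
save_trials(sample, out_path)
round_trip = load_trials(out_path)
print(round_trip == sample)
```

Reading the file straight back is a cheap check that nothing in the structure was lost or altered by serialization.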

Now we've built our dictionary subset, let's take a look at what we've got:

In [7]:
analytic_data['NCT00000134']

{'conditions': [{'clusters': ['Infectious Diseases'],
   'name': 'HIV/AIDS',
   'slug': 'hiv-aids'},
  {'clusters': ['Other'],
   'name': 'Cytomegalovirus Retinitis',
   'slug': 'cytomegalovirus-retinitis'}],
 'description_html': '<p>CMV retinitis is the most common intraocular infection in patients with AIDS and is estimated to affect 35 to 40 percent of patients with AIDS. Untreated CMV retinitis is a progressive disorder, the end result of which is total retinal destruction and blindness. At the time of this trial, drugs approved by the United States Food and Drug Administration (FDA) for the treatment of CMV retinitis were ganciclovir (Cytovene) and foscarnet (Foscavir). Although most retinitis responds well to initial therapy with systemically administered drugs, given enough time, nearly all patients will suffer a relapse of the retinitis. Relapsed retinitis generally responds to reinduction and maintenance therapy, but the interval between successive relapses progressively short

#### Exercise 2
Now that we've subset our dataset, let's start cleaning it up a little bit. For our purposes, there are 4 immediately obvious things wrong with the HTML fields.
1. There are HTML tags (`<p>`). We will strip these with a regular expression.
2. There are newline characters (`\n`). We will also handle these with a regex.
3. There are HTML character entities (`&lt;`). These exist because the angle brackets (<,>) are reserved characters in HTML, so they have to be escaped in well-formed HTML. The html library has the unescape() function, which converts these character entities to their Unicode equivalents.
4. Lastly, we need to normalize all of the whitespace (i.e., multiple spaces are replaced by a single space).


In [8]:
import re
import html

find_html = re.compile(r'<.*?>')   # any HTML tag, matched non-greedily
find_spaces = re.compile(r'\s+')   # any run of whitespace, including newlines

def removeHTML(raw_html):
    # Strip tags before unescaping; otherwise an escaped "&lt;...&gt;" would be removed as if it were a tag
    cleantext = find_html.sub('', raw_html)
    cleantext = html.unescape(cleantext)
    # Collapse newlines and repeated spaces into a single space
    cleantext = find_spaces.sub(' ', cleantext)
    return cleantext.strip()

# Test it out:
html_test = "<html><head></head><h2>Hello\n world!</h2>"
print(removeHTML(html_test))

Hello world!
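
Regular expressions can stumble on unusual markup (for example, a `>` inside an attribute value). As an alternative sketch, the standard library's `html.parser.HTMLParser` can collect just the text nodes. The BeautifulSoup package mentioned in the hint is another option. The names `TextExtractor` and `strip_html` are hypothetical helpers, not part of any library.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True means entities such as &lt; are decoded for us.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(raw_html):
    """Return the text content of raw_html with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Join with a space so text from adjacent block tags does not run together.
    text = ' '.join(parser.chunks)
    return re.sub(r'\s+', ' ', text).strip()

cleaned = strip_html('<p>Age &lt; 18</p>\n<p>No   prior therapy</p>')
print(cleaned)
```

Joining the chunks with a space is a trade-off: it keeps text from neighbouring paragraphs apart, but it also inserts a space around inline tags such as `<b>`, which may or may not matter for the downstream NLP.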


In [9]:
elements_with_html = ['description_html','eligibility_exclusion_html','eligibility_inclusion_html']

for trial in analytic_data.keys():
    for element in elements_with_html:
        analytic_data[trial][element] = removeHTML(analytic_data[trial][element]) 

In [10]:
analytic_data['NCT00000134']

{'conditions': [{'clusters': ['Infectious Diseases'],
   'name': 'HIV/AIDS',
   'slug': 'hiv-aids'},
  {'clusters': ['Other'],
   'name': 'Cytomegalovirus Retinitis',
   'slug': 'cytomegalovirus-retinitis'}],
 'description_html': 'CMV retinitis is the most common intraocular infection in patients with AIDS and is estimated to affect 35 to 40 percent of patients with AIDS. Untreated CMV retinitis is a progressive disorder, the end result of which is total retinal destruction and blindness. At the time of this trial, drugs approved by the United States Food and Drug Administration (FDA) for the treatment of CMV retinitis were ganciclovir (Cytovene) and foscarnet (Foscavir). Although most retinitis responds well to initial therapy with systemically administered drugs, given enough time, nearly all patients will suffer a relapse of the retinitis. Relapsed retinitis generally responds to reinduction and maintenance therapy, but the interval between successive relapses progressively shortens

#### Exercise 2a
If you'd like to extend Exercise 2, it would be useful to use spaCy, NLTK, or something similar to break the free-text sections from Exercise 2 into individual sentences and store them as elements in a list.

In [88]:
import spacy

nlp = spacy.load('en_core_web_sm')
for trial in analytic_data.keys():
    for element in ['description_html','eligibility_exclusion_html','eligibility_inclusion_html','summary']: 
        new_index = 'parsed_' + element
        doc = nlp(analytic_data[trial][element])
        # Note that if we were keeping this data structure in memory, it would probably be best to keep the
        # doc object in our data structure, rather than the parsed sentences (doc.sents is a one-shot generator).
        # Something like: analytic_data[trial][new_index] = doc
        analytic_data[trial][new_index] = [sent.text for sent in doc.sents]
        

In [144]:
analytic_data['NCT00000409']


{'conditions': [{'clusters': ['Other'],
   'name': 'Spondylolisthesis',
   'slug': 'spondylolisthesis'},
  {'clusters': ['Other'],
   'name': 'Spinal Stenosis',
   'slug': 'spinal-stenosis'},
  {'clusters': ['Other'], 'name': 'Low Back Pain', 'slug': 'low-back-pain'}],
 'description_html': 'Low back pain is considered one of the most widely experienced health problems in the U.S. and the world. It is the second most frequent condition, after the common cold, for which patients see a physician or lose days from work. Estimated costs to those who are severely disabled from low back pain range from $30-70 billion annually. Rates of spinal surgery in the U.S. have increased sharply over time, and researchers have documented 15-fold geographic variation in rates of these surgeries. In many cases, where one lives and who one sees for the condition appear to determine the rates of surgery. Despite these trends, there is little evidence proving the effectiveness of these therapies over nonsurg

#### Exercise 3
Our goal for our modeling project is to categorize each of the trials into a unique category. The language of clinicaltrials.ucsf.edu defines "clusters" as the primary category, and "conditions" as the sub-category. Each trial can be classified as relating to multiple "conditions", and each condition can belong to multiple "clusters." This complicates our modeling task appreciably. To simplify our project, we will assume that all "conditions" are adequately assigned (not a bad assumption, since very few conditions are "Other") and we will assume that any clinical trial that has at least 1 "cluster" not equal to "Other" can be categorized into that non-Other cluster. In other words, we are only interested in classifying trials that cannot be otherwise categorized.

For exercise 3, please create a field in our data structure that indicates whether a specific trial has ONLY "other" clusters (even if there are more than 1). 

Let's take a look at the conditions field for trial NCT00000409. All of its conditions fall under the "Other" cluster.

In [94]:
jdata['NCT00000409']['conditions']

[{'clusters': ['Other'],
  'name': 'Spondylolisthesis',
  'slug': 'spondylolisthesis'},
 {'clusters': ['Other'], 'name': 'Spinal Stenosis', 'slug': 'spinal-stenosis'},
 {'clusters': ['Other'], 'name': 'Low Back Pain', 'slug': 'low-back-pain'}]

In [98]:
jdata['NCT00000134']['conditions']

[{'clusters': ['Infectious Diseases'], 'name': 'HIV/AIDS', 'slug': 'hiv-aids'},
 {'clusters': ['Other'],
  'name': 'Cytomegalovirus Retinitis',
  'slug': 'cytomegalovirus-retinitis'}]

This will be easier to work with if we flatten it out of the nested list-of-dictionaries format. Let's first put the first 11 trials into words to see what we are looking at.

In [146]:
trials = []
out_str = ''
for i,y in enumerate(jdata.keys()):
    if i > 10:
        break
    #trials[y] = set()
    out_str = "Trial #{} is:".format(y)
    for c in jdata[y]['conditions']:
        for cl in c['clusters']:
            out_str += " a(n) {} trial under the {} cluster,".format(c['name'],cl) 
    trials.append(out_str[0:-1] + '.')
    out_str = ''
for i in trials:
    print(i)

Trial #NCT00000126 is: a(n) Neuropathy trial under the Brain and Nerves cluster.
Trial #NCT00000134 is: a(n) HIV/AIDS trial under the Infectious Diseases cluster, a(n) Cytomegalovirus Retinitis trial under the Other cluster.
Trial #NCT00000136 is: a(n) HIV/AIDS trial under the Infectious Diseases cluster, a(n) Cytomegalovirus Retinitis trial under the Other cluster.
Trial #NCT00000139 is: a(n) Keratitis, Herpetic trial under the Other cluster, a(n) Ocular Herpes Simplex trial under the Other cluster, a(n) Herpes Simplex trial under the Other cluster.
Trial #NCT00000142 is: a(n) HIV/AIDS trial under the Infectious Diseases cluster, a(n) CMV Cytomegalovirus Retinitis trial under the Other cluster.
Trial #NCT00000143 is: a(n) Cytomegalovirus Retinitis trial under the Other cluster, a(n) HIV/AIDS trial under the Infectious Diseases cluster.
Trial #NCT00000168 is: a(n) HIV/AIDS trial under the Infectious Diseases cluster, a(n) Cytomegalovirus Retinitis trial under the Other cluster.
Trial #

So 126 is pretty straightforward: it's only assigned to a single cluster, "Brain and Nerves".

134 & 136 are both a little more difficult. They are HIV/AIDS trials, which are assigned to the "Infectious Diseases" cluster, but they are also labeled as cytomegalovirus retinitis trials, which have been categorized as "Other".

For our purposes, we are going to REMOVE any "other" clusters from consideration if additional clusters are defined. This would mean, for our purposes, 134 & 136 will both be assigned, unambiguously, to the "Infectious Diseases" cluster.

What about trials like 173? Here, it's categorized as both Alzheimer's and Mild Cognitive Impairment under the "Mental Health" cluster, and Alzheimer's under the "Brain and Nerves" cluster. This is a more challenging scenario. For our purposes, we will retain both unique clusters (Mental Health / Brain and Nerves) as shared labels. The trial data will be duplicated and assigned to each cluster. This may have downstream consequences, but let's set things up this way for now.
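
The two rules above (drop "Other" when another cluster exists, and duplicate a trial across its remaining clusters) can be sketched on toy data as follows. The function names `assign_labels` and `expand_by_label` are hypothetical, and the condition lists are trimmed-down versions of the structures shown earlier.

```python
def assign_labels(conditions):
    """Return (labels, is_other) for one trial's list of condition dictionaries."""
    labels = {cluster.upper() for cond in conditions for cluster in cond["clusters"]}
    if labels == {"OTHER"}:
        # "Other" is the only cluster, so this trial still needs classifying.
        return labels, True
    # At least one real cluster exists (or there are none at all): drop "Other".
    labels.discard("OTHER")
    return labels, False

def expand_by_label(nct, labels):
    """Duplicate a trial once per label, as (nct, label) rows in sorted order."""
    return [(nct, label) for label in sorted(labels)]

# Toy condition lists modelled on the examples above.
hiv_trial = [{"clusters": ["Infectious Diseases"]}, {"clusters": ["Other"]}]
back_pain_trial = [{"clusters": ["Other"]}, {"clusters": ["Other"]}]
two_cluster_trial = [{"clusters": ["Mental Health", "Brain and Nerves"]}]

print(assign_labels(hiv_trial))
print(assign_labels(back_pain_trial))
rows = expand_by_label("NCT12345678", assign_labels(two_cluster_trial)[0])
print(rows)
```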

We'll start by parsing just the unique clusters assigned to each trial

In [140]:
for k in analytic_data.keys():
    # Collect the unique cluster names (upper-cased) across all of the trial's conditions
    analytic_data[k]['labels'] = {
        cl.upper()
        for c in analytic_data[k]['conditions']
        for cl in c['clusters']
    }

And now, we'll remove any "Other" clusters from the label set if the trial has at least one other cluster to fall back on.

In [141]:
for k in analytic_data.keys():
    labels = analytic_data[k]['labels']
    analytic_data[k]['was_other_removed'] = False
    analytic_data[k]['is_other'] = False

    if 'OTHER' in labels:
        if len(labels) > 1:
            # Could be assigned to something besides other
            labels.remove('OTHER')
            analytic_data[k]['was_other_removed'] = True
        else:
            # "Other" is the only cluster: this is the flag Exercise 3 asks for
            analytic_data[k]['is_other'] = True

    # True when more than one non-"Other" cluster remains
    analytic_data[k]['has_multiple_labels'] = len(labels) > 1


In [142]:
["{}: => {}".format(k, ','.join(list(analytic_data[k]['labels']))) for k in analytic_data.keys() if analytic_data[k]['has_multiple_labels']]

['NCT00000173: => BRAIN AND NERVES,MENTAL HEALTH',
 'NCT00000657: => INFECTIOUS DISEASES,BRAIN AND NERVES,MENTAL HEALTH',
 'NCT00000658: => CANCER,INFECTIOUS DISEASES',
 'NCT00000660: => CANCER,INFECTIOUS DISEASES',
 'NCT00000703: => CANCER,INFECTIOUS DISEASES',
 'NCT00000751: => INFECTIOUS DISEASES,REPRODUCTIVE HEALTH',
 'NCT00000793: => BRAIN AND NERVES,INFECTIOUS DISEASES',
 'NCT00000801: => CANCER,INFECTIOUS DISEASES',
 'NCT00000807: => CANCER,INFECTIOUS DISEASES',
 'NCT00000808: => INFECTIOUS DISEASES,REPRODUCTIVE HEALTH',
 'NCT00000862: => INFECTIOUS DISEASES,REPRODUCTIVE HEALTH',
 'NCT00000867: => INFECTIOUS DISEASES,BRAIN AND NERVES,MENTAL HEALTH',
 'NCT00000869: => INFECTIOUS DISEASES,REPRODUCTIVE HEALTH',
 'NCT00000917: => INFECTIOUS DISEASES,REPRODUCTIVE HEALTH',
 'NCT00000944: => INFECTIOUS DISEASES,REPRODUCTIVE HEALTH',
 'NCT00000954: => CANCER,INFECTIOUS DISEASES',
 'NCT00000960: => INFECTIOUS DISEASES,REPRODUCTIVE HEALTH',
 'NCT00000994: => CANCER,INFECTIOUS DISEASES',
 

In [170]:
analytic_data['NCT00000658']['parsed_summary'][0:2]

["To determine the impact of dose intensity on tumor response and survival in patients with HIV-associated non-Hodgkin's lymphoma (NHL).",
 'HIV-infected patients are at increased risk for developing intermediate and high-grade NHL.']
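
One practical note if this enriched structure is later written to disk: the `labels` field built above is a Python set, which the `json` module cannot serialize directly. Here is a minimal sketch of one workaround using the `default` hook of `json.dumps`. The helper name `encode_sets` and the toy record are hypothetical.

```python
import json

def encode_sets(obj):
    """Fallback for json.dumps: turn a set into a sorted list."""
    if isinstance(obj, set):
        return sorted(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Toy record, not taken from the real export.
sample = {"NCT12345678": {"labels": {"CANCER", "INFECTIOUS DISEASES"}, "is_other": False}}
encoded = json.dumps(sample, default=encode_sets)
decoded = json.loads(encoded)
print(decoded["NCT12345678"]["labels"])
```

Sorting the labels makes the saved file deterministic, at the cost of the field coming back as a list rather than a set when it is reloaded.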