## Optional Homework #1 (for the updated JSON file)

For anyone looking to try out their skills, try one or all of these exercises:

Start with the data file "NEW_clinical_trial_data.json". DO NOT USE the in-memory version of the data generated by the Jupyter Notebook "Clinical Trial Data Extraction.ipynb".

The data structure used by clinicaltrials.ucsf.edu includes a number of fields and formatting choices that are important for keeping their webpage operational but aren't ideal for our needs. However, if we want our modeling work to be useful, we need a way to easily join the output of our NLP model to the production JSON data structure. The production data structure is a dictionary of dictionaries, with the primary key being the clinical trial "nct_number": an 11-character unique identifier made up of the string "NCT" followed by eight digits (e.g., 'NCT12345678'). Our NLP data structure should therefore also be indexed by this identifier.
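
As a rough illustration of why indexing by NCT number matters, here is a minimal sketch of joining model output back onto a production-style dictionary. The names `join_predictions`, `NCT_PATTERN`, and `predicted_cluster` are hypothetical and do not come from the production schema, and the toy records stand in for the real JSON.

```python
import re

# Assumed ID format: "NCT" followed by eight digits (11 characters in total).
NCT_PATTERN = re.compile(r"^NCT\d{8}$")

def join_predictions(production, predictions):
    """Return a copy of `production` with a hypothetical 'predicted_cluster'
    field added to every trial that has a prediction.
    Malformed or unknown IDs in `predictions` are skipped."""
    joined = {nct: dict(fields) for nct, fields in production.items()}
    for nct, label in predictions.items():
        if NCT_PATTERN.match(nct) and nct in joined:
            joined[nct]["predicted_cluster"] = label
    return joined

# Toy data standing in for the production JSON and for the model output.
production = {
    "NCT12345678": {"title_brief": "Toy trial A", "is_visible": True},
    "NCT87654321": {"title_brief": "Toy trial B", "is_visible": False},
}
predictions = {"NCT12345678": "Cancer", "NCT00000000": "Heart"}

joined = join_predictions(production, predictions)
```

Because both structures share the same key, the join is a plain dictionary lookup and the original production records are left untouched.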

Additionally, only certain trials are visible on the website, so we need to know which trials are considered "visible" and "joinable" in order to categorize only those trials. The variable names in the JSON are 'is_joinable' and 'is_visible'.

#### Exercise 1
Your first task is to build a smaller test dataset that removes some of the fields we do not care about. Create a new data structure and, for each trial, copy over only the data that you believe would be useful as inputs to our model (plus the variables mentioned above), then save it to disk. Remember that modeling works best when we focus on the details that change appreciably between our units of analysis (i.e., trials). In other words, your choice of variables to retain should not include things like "sponsor" or "recruitment status", because they won't contribute much to our models.

In [None]:
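import json
# load the production data: a dict of trials keyed by NCT number
with open('NEW_clinical_trial_data.json','r') as f:
    data = json.load(f)
keys = list(data.keys())
# keep only the trials that are both visible and joinable on the website
mydata = {k:v for k,v in data.items() if v['is_visible'] and v['is_joinable']}
mykeys = list(mydata.keys())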

In [59]:
to_include = ['nct_number','summary','summary_html','title_brief','title_official',
              'keywords','conditions','description_html',
              'eligibility_html','eligibility_inclusion_html','eligibility_exclusion_html',
              'links','eligibility_summary_short', 'is_visible','is_joinable']
# a dict of studies indexed by NCT number containing a dict of data fields
filtered = {k:{i:j for i,j in v.items() if i in to_include} for k,v in data.items()}
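
The exercise also asks for the reduced dataset to be saved to disk. A minimal sketch of that step, using toy records and a hypothetical helper `save_trials` and output file name (neither is part of the original notebook), might look like this:

```python
import json
import os
import tempfile

def save_trials(trials, path):
    """Write a dict of trials (keyed by NCT number) to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(trials, f, indent=2, ensure_ascii=False)

# Toy records standing in for the filtered structure.
toy = {
    "NCT12345678": {"title_brief": "Toy trial A", "is_visible": True, "is_joinable": True},
    "NCT87654321": {"title_brief": "Toy trial B", "is_visible": True, "is_joinable": False},
}

# Keep only trials flagged as both visible and joinable.
visible = {k: v for k, v in toy.items() if v["is_visible"] and v["is_joinable"]}

# Hypothetical output location; a temp directory keeps the example side-effect free.
out_path = os.path.join(tempfile.gettempdir(), "filtered_trials_example.json")
save_trials(visible, out_path)

# Round-trip check: reload the file and compare with what was written.
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
```

Writing JSON keeps the file directly comparable with the production data, since both remain dictionaries keyed by NCT number.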

#### Exercise 2
Many of the most interesting fields (like 'summary') are riddled with formatting characters. This is great for hosting a website, but it is just noise for our NLP work. For each of the free-text fields you've extracted in Exercise 1, remove all of the HTML formatting (Hint: The Python package BeautifulSoup can be very useful here!).

In [60]:
from bs4 import BeautifulSoup
# start = 20
# count = 10
# for i in range(start,start+count):
#     test_text = filtered[list(filtered.keys())[i]]['summary_html']
#     if '&' not in test_text:
#         i -=1
#         continue
#     print(test_text)
#     sp = BeautifulSoup(test_text, 'html.parser')
#     print('...')
#     print(sp.get_text())
#     print('\n')

# for study,attrs in filtered.items():
#     for term in attrs:
#         sp = BeautifulSoup(str(term), 'html.parser').get_text()
#         if isinstance(v, (str,bytes)):
#             filtered[k] = sp.get_text() 
#         else:
#             print(v)
#             input()

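# strip HTML, replace newlines with spaces, and lowercase every string field
# (this also lowercases the 'nct_number' value; the dict keys keep their original case)
filtered = {k:{label:BeautifulSoup(term, 'html.parser').get_text().replace('\n',' ').lower() if isinstance(term, str) else term
               for label,term in v.items()} 
            for k,v in filtered.items()}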


#### Exercise 2a
If you'd like to extend Exercise 2, it would be useful to use spaCy, NLTK, or something similar to break the free-text sections from Exercise 2 into individual sentences and store them as elements in a list.

In [62]:
import spacy
from tqdm import tqdm
nlp = spacy.load('en_core_web_sm') # loaded for a proper sentence splitter, but not used below
# or nltk.sent_tokenize()

# split long data fields into sentences
# (naive: splitting on '.' also breaks on decimals and abbreviations)
filtered_sent_tok = {k:{label:term.split('.') if isinstance(term, str) else term
               for label,term in v.items()} 
            for k,v in tqdm(filtered.items())}
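
Splitting on every '.' also cuts decimals such as "2.5" in half and leaves empty strings behind. As a lightweight alternative that needs no extra packages, here is a sketch of a regex-based splitter. It is not the spaCy or NLTK tokenizer, only a stand-in that splits on whitespace following sentence-ending punctuation, and `split_sentences` is a hypothetical helper name.

```python
import re

# Split on whitespace that follows '.', '!' or '?'.
# A '.' inside a number like "2.5" is not followed by whitespace, so it is kept.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Return the non-empty, stripped sentences found in `text`."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]

example = "patients must be 18 or older. dose is 2.5 mg daily. prior therapy allowed?"
sentences = split_sentences(example)
```

This still mis-splits abbreviations that are followed by a space (for example "e.g. "), which is where a trained sentence tokenizer would do better.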





#### Exercise 3
Our goal for our modeling project is to categorize each of the trials into a unique category. The language of clinicaltrials.ucsf.edu defines "clusters" as the primary category and "conditions" as the sub-category. Each trial can be classified as relating to multiple "conditions", and each condition can belong to multiple "clusters". This complicates our modeling task appreciably. To simplify our project, we will assume that all "conditions" are adequately assigned (not a bad assumption, since very few conditions are "other"), and we will assume that any clinical trial that has at least one "cluster" not equal to "Other" can be categorized into that non-other cluster. In other words, we are only interested in classifying trials that cannot be otherwise categorized.

For Exercise 3, please create a field in our data structure that indicates whether a specific trial has ONLY "other" clusters (even if there is more than one).

In [70]:
# traverse the structure collecting all clusters, conditions, and cluster/condition pairs
# at the same time, add a field 'is_other' that labels studies belonging exclusively to the 'Other' cluster

conditions = set()
clusters = set()
pairs = set()
slugs_to_names = dict()

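for k,v in tqdm(filtered.items()):
    l_cond = v['conditions'] # a list of dicts
    trial_clusters = set() # every cluster reachable from this trial's conditions
    for cond in l_cond:
        slugs_to_names[cond['slug']] = cond['name']
        conditions.add(cond['slug'])
        clusters |= set(cond['clusters']) # a list
        trial_clusters |= set(cond['clusters'])
        pairs |= set('{}:{}'.format(cond['slug'], clust) for clust in cond['clusters'])
    # True only when the trial has at least one cluster and every cluster is 'Other';
    # flagging a trial as soon as one condition is 'Other'-only would mislabel mixed trials
    v['is_other'] = trial_clusters == {'Other'}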
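
One way to read the "ONLY other" requirement is that the union of clusters across all of a trial's conditions must be exactly {'Other'}. The sketch below checks that reading on toy conditions. The helper name `has_only_other_clusters` is hypothetical, and treating a trial with no clusters at all as not "other" is an assumption.

```python
def has_only_other_clusters(conditions, other_label="Other"):
    """Return True when every cluster across all conditions is `other_label`.

    `conditions` is a list of dicts, each with a 'clusters' list.
    A trial with no clusters at all returns False (an assumption).
    """
    clusters = {c for cond in conditions for c in cond.get("clusters", [])}
    return clusters == {other_label}

# Toy trials: only 'Other', mixed across conditions, and no clusters.
only_other = [{"slug": "rare-a", "clusters": ["Other"]},
              {"slug": "rare-b", "clusters": ["Other"]}]
mixed = [{"slug": "rare-a", "clusters": ["Other"]},
         {"slug": "asthma", "clusters": ["Lung"]}]
empty = []

results = [has_only_other_clusters(t) for t in (only_other, mixed, empty)]
```

The `mixed` case is the important one: a single condition that maps only to "Other" should not flag the trial when another condition places it in a real cluster.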
        





In [190]:
# plot the relationships of the clusters/conditions. More complete code in separate script, under Condition_deps_graph/.

import graphviz

dot = graphviz.Graph(format='svg', engine='neato', 
                     graph_attr={'overlap':'true',
                                 'splines':'line',
#                                  'Damping':'0.5',
#                                  'overlap_shrink':'True',
#                                  'pack':'True',
                                 'quadtree':'2',
                                 'minlen':'5'
                                },
                    node_attr={'sep':'0.055','margin':'0.055'})

nodes_added = set()

for j in pairs:
    if 'Other' not in j:
        one,two = j.split(':', 1) # condition slug, cluster name
        if one not in nodes_added:
            dot.node(one)
            nodes_added.add(one) # set(one) would add the string's individual characters
        if two not in nodes_added:
            dot.node(two, fillcolor='lightskyblue', style='filled')
            nodes_added.add(two)
        dot.edge(one, two)

In [191]:
dot.render('out.gv')

'out.gv.svg'