## Optional Homework #1

For anyone looking to try out their skills, try one or all of these exercises:

Start with the data file "clinical_trial_data.json" or the in-memory version of the data, as generated by the Jupyter Notebook "Clinical Trial Data Extraction.ipynb".

#### Exercise 1
Each of the clinical trials listed includes some standard text headers such as "Summary", "Keywords", "You can join if..." For this task, you will need to remove these standard headers from the data structure.

In [4]:
import re, json
%pdb

filename = 'clinical_trial_data.json'
# data is a dict of {topic: [[topic, sub_condition_name, study_name, study_link, study_data], ...]}
# data[key].append([key, sub_condition_name, study_name, study_link, study_data])

# filename = '__data_export.json'  # provided you have downloaded the full data source

with open(filename, 'r') as f:
    data = json.load(f)

Automatic pdb calling has been turned ON


In [5]:
phrases_to_remove = ('Summary', 'Keywords', 'You can join if...')  # must match the header lines verbatim

def remove_phrases_from_line_list(a_list, phrases, inplace=False):
    """Remove the lines of a_list that exactly match one of phrases."""
    phrases = set(phrases)
    # keep every line that is not a known standard text header
    kept = [line for line in a_list if line not in phrases]
    if inplace:
        a_list[:] = kept  # overwrite the caller's list and return nothing
        return None
    return kept

# usage for one study's lines (the last item of each record in data[key]):
# cleaned_lines = remove_phrases_from_line_list(data['cancer'][0][-1], phrases_to_remove)
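
The function in the cell above handles the lines of a single study. As a minimal sketch, the same filter could be applied to every study in the nested dict without mutating the loaded data. This assumes each record is a five-item list whose last item is the list of text lines, as the comment in the loading cell describes; `toy_data`, `HEADERS`, and `strip_headers` are hypothetical names used only for illustration.

```python
# Hypothetical toy data in the assumed layout:
# {topic: [[topic, sub_condition_name, study_name, study_link, study_data], ...]}
toy_data = {
    'cancer': [
        ['cancer', 'Leukemia', 'Study A', 'https://example.org/a',
         ['Summary', 'A short summary.', 'Keywords', 'Leukemia']],
    ],
}

HEADERS = frozenset(('Summary', 'Keywords', 'You can join if...'))


def strip_headers(data, headers=HEADERS):
    """Return a new dict with header lines removed from every study's text."""
    cleaned = {}
    for topic, records in data.items():
        cleaned[topic] = [
            # copy the metadata fields, filter only the last field (the lines)
            record[:-1] + [[line for line in record[-1] if line not in headers]]
            for record in records
        ]
    return cleaned


cleaned = strip_headers(toy_data)
print(cleaned['cancer'][0][-1])
```

Building a new dict keeps the original lines available for later cells that still expect the headers to be present.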

#### Exercise 2
The clinical trial text is currently stored line by line (i.e., every carriage return/newline starts a new element in the text data list), but not necessarily sentence by sentence (i.e., every natural sentence ending, such as a period or exclamation point, starts a new element). For each trial, merge the lines, then use spaCy or NLTK to split the merged text into sentences. This will create a more "NLP friendly" dataset.

In [6]:
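import nltk  # sent_tokenize needs the Punkt tokenizer data, installed via nltk.download()

def join_and_sent_segment(a_list):
    """Merge a study's lines into one string and split it into sentences."""
    longstr = '\n'.join(a_list)
    return nltk.sent_tokenize(longstr) # potentially use colons as sentence tokens as well?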
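
The question about colons in the comment above can be explored without any NLP library. The following is a naive, standard-library sketch, not how NLTK's tokenizer works: it simply splits after sentence-ending punctuation that is followed by whitespace, so abbreviations such as "Dr." will be split incorrectly. The function name `naive_sent_split` and the `split_on_colon` flag are hypothetical.

```python
import re


def naive_sent_split(lines, split_on_colon=False):
    """Merge lines into one string, then split after ., ! or ? (optionally :)."""
    # join the non-empty lines with single spaces
    text = ' '.join(line.strip() for line in lines if line.strip())
    enders = '.!?:' if split_on_colon else '.!?'
    # split on whitespace that directly follows one of the ending characters
    pattern = r'(?<=[' + re.escape(enders) + r'])\s+'
    return [sentence for sentence in re.split(pattern, text) if sentence]


example_lines = [
    'This is a phase 1/2 study',
    'of KTE-C19. It enrolls',
    'children! Inclusion criteria: age under 21.',
]
plain = naive_sent_split(example_lines)
with_colons = naive_sent_split(example_lines, split_on_colon=True)
print(plain)
print(with_colons)
```

Comparing the two results on a few real trials is a quick way to judge whether treating colons as boundaries helps or just fragments the criteria lists.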

#### Exercise 3
Because each trial includes a section of pre-built keywords, and these keywords are important and vary in token length, it would be helpful to 'set them aside'. Build a new data structure that separates the keywords from the rest of the raw text data and saves them in a list, alongside the other trial-level metadata (i.e., trial link, trial name, etc.).
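
Before the fuller class-based attempt in the next cell, here is a minimal sketch of the core idea: walk the lines once, collect everything between the 'Keywords' header and the next known header, and keep the rest as raw text. The header list `KNOWN_HEADERS`, the function `split_keywords`, and the record layout are assumptions made for illustration, not the structure the exercise requires.

```python
# Assumed set of section headers; extend it to match the real data.
KNOWN_HEADERS = ('Summary', 'Official Title', 'Keywords', 'Eligibility', 'Details')


def split_keywords(lines, headers=KNOWN_HEADERS):
    """Return (keywords, remaining_lines) for one study's list of lines."""
    keywords, rest = [], []
    in_keywords = False
    for line in lines:
        stripped = line.strip()
        if stripped in headers:
            # a header either opens the keyword section or closes it
            in_keywords = (stripped == 'Keywords')
            if not in_keywords:
                rest.append(line)  # keep every header except 'Keywords'
            continue
        (keywords if in_keywords else rest).append(line)
    return keywords, rest


example_lines = ['Summary', 'This is a single arm study.', 'Keywords',
                 'Leukemia', 'KTE-C19', 'Eligibility', 'You can join if...']
keywords, rest = split_keywords(example_lines)

# keep the keywords next to the trial-level metadata (hypothetical values)
record = {'title': 'Study A', 'url': 'https://example.org/a',
          'keywords': keywords, 'text': rest}
print(record['keywords'])
```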

In [21]:
"""
better idea:
use a default-dict instead of a data class
properties are keys, values are lists of lines

creation involves looping over lines and searching for matched prop-label patterns
if a line matches, store that prop in the dict
store the remaining lines in an 'unlabeled_lines' property
"""
from collections import defaultdict

from tqdm import tqdm

class MyDefaultDict(defaultdict):
    """a default dict with dot access for keys"""
    # see https://stackoverflow.com/questions/2352181/how-to-use-a-dot-to-access-members-of-dictionary
    # can access/set/get/delete properties with dot notation, and new properties default to None
    __getattr__ = defaultdict.get
    __setattr__ = defaultdict.__setitem__
    __delattr__ = defaultdict.__delitem__
    

class StudyData(MyDefaultDict):
    """a container for clinical trial data and metadata"""
    # NOTE: subclassing default dict like this means accessing nonexistent properties no longer throws an error
    
    PROP_REGEX = {
        'topic': 'Topic:',
        'subtopic': 'subtopic:',
        'title': 'Title:',
        'url': 'Link:',
        'summary': 'Summary',
        'official_title': 'Official Title',
        'keywords': 'Keywords',
        'eligibility_criteria_pos': ' You can join if… ',
        'eligibility_criteria_neg': " You CAN'T join if... ",
        'principal_investigators': 'Principal Investigators:',
        'study_design': 'Study Design:',
        'eligibility': 'Eligibility', # section header, will not contain any data
        'details': 'Details', # section header, will not contain any data

    }

    
    @classmethod 
    def parse_fulltext(cls, list_of_str, progbar=True):
        """parses a list of strings into a defaultdict of field: [lines] based on fields defined in PROP_REGEX"""
        prop_label = None
        prop_items = []
        properties_list = MyDefaultDict()
        unlabeled_lines = []
        # loop over the lines in the full text
        for line in (tqdm(list_of_str, desc='Parsing fulltext') if progbar else list_of_str):
            for k, v in cls.PROP_REGEX.items():
                # if any line matches a property tag, set the flag to save the subsequent lines
                if re.match(v, line):
                    if prop_label and prop_items:  # save only labels with following line items
                        properties_list[prop_label] = prop_items  # always store a list, as the docstring and the final save do
                    prop_label = k
                    prop_items = []
                    break
            else:
                # no match, save the subsequent lines
                if prop_label:
                    prop_items.append(line)
                else:
                    unlabeled_lines.append(line)
        # save the final list if applicable
        if prop_label:
            properties_list[prop_label] = prop_items
        if unlabeled_lines:
            properties_list['unlabeled_lines'] = unlabeled_lines
        return properties_list
                
    
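    def __init__(self, fulltext=None, **kwargs):
        super().__init__()
        fulltext = [] if fulltext is None else fulltext  # avoid a shared mutable default argument
        self.topic = None
        self.subtopic = None
        self.title = None
        self.url = None
        self.fulltext = fulltext
        self.misc_props = dict()  # supports arbitrary miscellaneous properties
        self.data = self.parse_fulltext(fulltext)
        for k, v in kwargs.items():
            # hasattr() is always True here because __getattr__ is dict.get, so test key membership
            if k in self:
                self[k] = v
            else:
                self.misc_props[k] = v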
    
    
#     def __str__(self):
#         out = str(super.__str__(self)) + '\n'
#         out += 'topic: {}\nsubtopic: {}\ntitle: {}\nurl: {}\n'.format(topic, subtopic, title, url)
#         out += '\n'.join(['{}: {}'.format(k, v) 
#                           for k,v in self.misc_props.items()]) + '\n'
#         out += '\n'.join(['{}: {}'.format(k, v) 
#                           for k,v in self.data.items()])
#         return out.strip(' \n')

                
    

todo list:
    - Find a list of:
        - known structure tags (you can join if..., eligibility, keywords, ...)
            - flatten everything
            - count occurrences
            - drop unique terms
        - abbreviations (see below)
        - total vocabulary
    - Expand abbreviations ('orbital occlusion (OO)... OO measures ...')
    - Impute structure from tags (vs treat everything as text blob, but maintain linebreaks as delimiters)
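
Two items from the list above can be sketched with the standard library alone. The first helper flattens all lines, counts them, and drops lines seen only once, which leaves candidates for structure tags. The second looks for a parenthesized upper-case short form whose letters match the initials of the preceding words, then expands later standalone uses. All names here (`candidate_tags`, `find_abbreviations`, `expand_abbreviations`) are hypothetical, and the initials heuristic is an assumption that will miss abbreviations built any other way.

```python
import re
from collections import Counter

ABBR_RE = re.compile(r'\(([A-Z]{2,10})\)')  # e.g. "(OO)"
WORD_RE = re.compile(r"[A-Za-z][A-Za-z'-]*")


def candidate_tags(studies, min_count=2):
    """Flatten all lines, count them, and drop lines seen fewer than min_count times."""
    counts = Counter(line.strip() for lines in studies for line in lines)
    return {line: n for line, n in counts.items() if n >= min_count}


def find_abbreviations(text):
    """Map each parenthesized short form to the preceding words whose initials match it."""
    found = {}
    for match in ABBR_RE.finditer(text):
        abbr = match.group(1)
        words = WORD_RE.findall(text[:match.start()])
        candidate = words[-len(abbr):]
        initials = ''.join(word[0] for word in candidate).upper()
        if initials == abbr:
            found[abbr] = ' '.join(candidate)
    return found


def expand_abbreviations(text, abbreviations):
    """Replace standalone short forms, leaving the defining "(OO)" untouched."""
    for abbr, long_form in abbreviations.items():
        pattern = r'(?<!\()\b' + re.escape(abbr) + r'\b(?!\))'
        text = re.sub(pattern, long_form, text)
    return text


studies = [['Summary', 'text a', 'Keywords', 'x'],
           ['Summary', 'text b', 'Keywords', 'y']]
tags = candidate_tags(studies)

sentence = 'We measured orbital occlusion (OO) daily. OO measures improved.'
abbreviations = find_abbreviations(sentence)
expanded = expand_abbreviations(sentence, abbreviations)
print(tags, abbreviations, expanded, sep='\n')
```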


In [13]:
# data structure is a dict of {'topic': [['topic', 'subtopic', 'title', 'url', list_of_full_text], ...]}

# data[key].append([key, sub_condition_name, study_name, study_link, study_data])
def print_str_items(iterable, prefix=''):
    len_cutoff = 80
    indent = 2
    for i in iterable:
        if isinstance(i, str):
            print(prefix + (i[:len_cutoff] + '...' if len(i) > len_cutoff else i))
        else:
            print(prefix)
            print_str_items(i, prefix + ' '*indent + '|')
# print_str_items(data['cancer'])

In [22]:
topic, subtopic, title, url, dt = data['cancer'][0]
b = StudyData(topic=topic, subtopic=subtopic, title=title, url=url, fulltext=dt)
print(b)

Parsing fulltext: 100%|██████████| 46/46 [00:00<00:00, 19425.89it/s]

defaultdict(None, {'topic': 'cancer', 'subtopic': 'Acute Lymphoblastic Leukemia', 'title': 'A Multi-Center Study Evaluating KTE-C19 in Pediatric and Adolescent Subjects With Relapsed/Refractory B-precursor Acute Lymphoblastic Leukemia', 'url': 'https://clinicaltrials.ucsf.edu/trial/NCT02625480', 'fulltext': ['Summary', 'This is a single arm, open-label, multi-center, phase 1/2 study, to determine the safety and efficacy of KTE-C19, an autologous anti-CD19 chimeric antigen receptor (CAR)-positive T cell therapy, in relapsed/refractory B-precursor acute lymphoblastic leukemia (ALL) in pediatric or adolescent subjects.', 'Official Title', 'A Phase 1-2 Multi-Center Study Evaluating the Safety and Efficacy of KTE C19 in Pediatric and Adolescent Subjects With Relapsed/Refractory B-precursor Acute Lymphoblastic Leukemia (r/r ALL) (ZUMA-4)', 'Keywords', 'Acute Lymphoblastic Leukemia', 'Leukemia', 'Precursor Cell Lymphoblastic Leukemia-Lymphoma', 'Leukemia, Lymphoid', 'KTE-C19', 'Eligibility', 


