MeSH Descriptors
================

This notebook contains code to parse and clean the health and medical terms from the NIH Medical Subject Headings. The original files can be found on their FTP site [here](ftp://nlmpubs.nlm.nih.gov/online/mesh/MESH_FILES/xmlmesh/).

There are some existing resources for dealing with MeSH files. These include:

* [Working with MeSH Files in Python](https://code.tutsplus.com/tutorials/working-with-mesh-files-in-python-linking-terms-and-numbers--cms-28587) - a rudimentary approach to parsing the available .bin files.
* [mesh-tree](https://github.com/scienceai/mesh-tree) - a JavaScript (Node.js) library that parses and provides many useful functions for handling MeSH files.

In [None]:
import os
import pandas as pd
import numpy as np
import json
import xml.etree.ElementTree as ET
import xmltodict
from datetime import datetime

pd.options.display.max_colwidth = 100
pd.options.display.max_columns = 999

In [None]:
%matplotlib inline
#NB I open a standard set of directories

#Paths

#Get the top path
top_path = os.path.dirname(os.getcwd())

#Create the path for external data
ext_data = os.path.join(top_path,'data/external')

#Raw data path (for html downloads)
raw_data = os.path.join(top_path,'data/raw')

#Processed data path
proc_data = os.path.join(top_path,'data/processed')

fig_path = os.path.join(top_path,'reports/figures')

#Get date for saving files
today = datetime.utcnow()

today_str = "_".join([str(x) for x in [today.month, today.day, today.year]])

## Approach 1 - xmltodict

This approach uses the very handy `xmltodict` library, which unsurprisingly parses an XML file into a Python dict.

In [None]:
with open(ext_data + '/desc2018.xml', 'r') as f:
    desc_2018_xml = f.read()

In [None]:
desc_2018_json = xmltodict.parse(desc_2018_xml)

This essentially does everything that we need. From here we can create maps between various attributes of the terms, to use for analysis. 
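
As a rough illustration of such a map, the sketch below builds a lookup from _DescriptorUI_ to the descriptor name. The `sample` dict is a hand-written stand-in for the output of `xmltodict.parse`, its values are illustrative rather than taken from the real file, and the helper name `descriptor_ui_to_name` is hypothetical.

```python
# Hand-written stand-in for the dict returned by xmltodict.parse();
# the record values are illustrative, not real MeSH records.
sample = {
    'DescriptorRecordSet': {
        'DescriptorRecord': [
            {'DescriptorUI': 'D000001',
             'DescriptorName': {'String': 'Example Term A'}},
            {'DescriptorUI': 'D000002',
             'DescriptorName': {'String': 'Example Term B'}},
        ]
    }
}


def descriptor_ui_to_name(parsed):
    """Map each DescriptorUI to its DescriptorName string (hypothetical helper)."""
    records = parsed['DescriptorRecordSet']['DescriptorRecord']
    return {r['DescriptorUI']: r['DescriptorName']['String'] for r in records}


ui_to_name = descriptor_ui_to_name(sample)
print(ui_to_name)
```

The same pattern should carry over to `desc_2018_json`, assuming every record has a `DescriptorName` element.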

As there is no Python API for interfacing with the MeSH services, a useful thing to do might be to create a wrapper class for the MeSH tree. An idea of how this might look and be used is shown here. It would essentially serve as a class to download and parse the latest MeSH files, and to provide convenience functions for creating mappings.

In [None]:
class MeSHDescriptors():
    def __init__(self, mesh_descriptor_dict=None, file=None, url=None):
        """MeSHDescriptors
        Read, parse or download MeSH descriptor XML files.
        """
        if mesh_descriptor_dict is not None:
            self.descriptors = mesh_descriptor_dict
        elif file is not None:
            self.descriptors = self.read_mesh_xml(file)
        elif url is not None:
            self.descriptors = self.read_remote_xml(url)
        else:
            raise ValueError('Provide one of mesh_descriptor_dict, file or url.')

    def read_mesh_xml(self, file):
        """read_mesh_xml
        Reads and parses from XML file.
        """
        with open(file, 'rb') as f:
            # Return the parsed dict so that __init__ can store it
            return xmltodict.parse(f.read())

    def read_remote_xml(self, url):
        """read_remote_xml
        Downloads and parses an XML file from a URL.
        Minimal sketch: assumes the URL points directly at an uncompressed XML file.
        """
        from urllib.request import urlopen
        with urlopen(url) as response:
            return xmltodict.parse(response.read())

    def descriptor_ui_2_tree_number(self):
        """descriptor_ui_2_tree_number
        Create a mapping between DescriptorUI and TreeNumber fields.
        """
        mapper = {}
        for d in self.descriptors['DescriptorRecordSet']['DescriptorRecord']:
            k = d['DescriptorUI']
            if k is None:
                continue
            # TreeNumberList may be missing, in which case the value is None
            v = d.get('TreeNumberList')
            if v is not None:
                # xmltodict gives a str for one TreeNumber and a list for several
                v = v.get('TreeNumber')
            mapper[k] = v
        return mapper

    def to_json(self, file_path):
        """to_json
        Serialize the parsed descriptors as JSON.
        """
        with open(file_path, 'w') as f:
            json.dump(self.descriptors, f)

In [None]:
mesh_descriptors = MeSHDescriptors(desc_2018_json)

As an initial example, we can create a mapping between the _DescriptorUI_ and the _TreeNumber_.

In [None]:
dui_tree_number_map = mesh_descriptors.descriptor_ui_2_tree_number()

In [None]:
dui_tree_number_map['D013334']
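
One thing to watch for when using this map: by default `xmltodict` returns a plain string when an element occurs once and a list when it is repeated, so the values can be a `str`, a `list` or `None`. The sketch below normalises them to lists. The helper name `as_list` and the sample values are hypothetical.

```python
def as_list(value):
    """Return [] for None, wrap a lone value in a list, pass lists through."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Illustrative values shaped like the output of descriptor_ui_2_tree_number()
raw_map = {
    'D000001': 'A01.111',
    'D000002': ['B02.222', 'C03.333'],
    'D000003': None,
}

normalised = {k: as_list(v) for k, v in raw_map.items()}
print(normalised)
```

With every value a list, downstream code can loop over tree numbers without checking types first.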

From here, further mappings can be built in the same way to make fuller use of the information available from the descriptors. To support this, we export the dict representation of the original XML to JSON format.

In [None]:
with open(proc_data + '/mesh_descriptions_{}.json'.format(today_str), 'w') as f:
    json.dump(desc_2018_json, f)

## Approach 2 - XML to DataFrame

This was the original approach to parsing the MeSH term XML file. It seems irrelevant now that the `xmltodict` method is in use; however, I have left it here for interest.

In [None]:
# Adapted from 
# http://www.austintaylor.io/lxml/python/pandas/xml/dataframe/2016/07/08/convert-xml-to-pandas-dataframe/
# The original did not account for structures where the last children shared names but not parents as 
# occurs in this dataset. This gives messier names, but all the information.

class XML2DataFrame:

    def __init__(self, xml_data):
#         parser = ET.XMLParser(encoding="utf-8")
#         self.root = ET.fromstring(xml_data, parser=parser)
        self.root = ET.XML(xml_data)

    def parse_root(self, root):
        return [self.parse_element(child, 'Root') for child in iter(root)]

    def parse_element(self, element, parent_name, parsed=None):
        if parsed is None:
            parsed = dict()
        for key in element.keys():
            parsed[parent_name + key] = element.attrib.get(key)
        if element.text:
            h_key = parent_name + element.tag
#             if h_key in parsed:
#                 h_key = h_key + '_1'
            parsed[h_key] = element.text
        for child in list(element):
            self.parse_element(child, element.tag, parsed)
        return parsed

    def process_data(self):
        structure_data = self.parse_root(self.root)
        return pd.DataFrame(structure_data)
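
To see how the column names come about, here is a small sketch of the same flattening idea using only the standard library. The XML snippet is hand-written and illustrative, and `flatten` is a hypothetical stand-in for `parse_element`. Unlike the class above, it skips whitespace-only text, so container elements do not produce their own keys.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative snippet in the shape of a MeSH descriptor file
xml_data = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000000</DescriptorUI>
    <DescriptorName><String>Example Term</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>A01.111</TreeNumber>
      <TreeNumber>B02.222</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def flatten(element, parent_name, parsed=None):
    """Flatten an element tree into one dict keyed by parent tag + own tag."""
    if parsed is None:
        parsed = {}
    for key, value in element.attrib.items():
        parsed[parent_name + key] = value
    # Only keep text that is more than indentation whitespace
    if element.text and element.text.strip():
        parsed[parent_name + element.tag] = element.text
    for child in element:
        flatten(child, element.tag, parsed)
    return parsed


root = ET.fromstring(xml_data)
rows = [flatten(child, 'Root') for child in root]
print(rows)
```

Note that the two `TreeNumber` elements share the key `TreeNumberListTreeNumber`, so the second overwrites the first. The same applies to any repeated element flattened this way.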

In [None]:
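# Build the DataFrame from the XML string read in earlier
desc_2018_df = XML2DataFrame(desc_2018_xml).process_data()
desc_2018_df.head()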

In [None]:
desc_2018_df.columns

In [None]:
desc_2018_df.head(1)

In [None]:
desc_2018_df.drop([
       'AllowableQualifierQualifierReferredTo',
       'AllowableQualifiersListAllowableQualifier',
       'ConceptConceptName', 'ConceptConceptRelationList',
       'ConceptListConcept',
       'ConceptRelatedRegistryNumberList', 'ConceptRelationListConceptRelation',
       'DescriptorRecordAllowableQualifiersList',
       'DescriptorRecordConceptList', 
       'DescriptorRecordDateCreated', 'DescriptorRecordDateEstablished',
       'DescriptorRecordDateRevised', 'DescriptorRecordDescriptorName',
       'DescriptorRecordPharmacologicalActionList',
       'DescriptorRecordPreviousIndexingList',
       'DescriptorRecordTreeNumberList', 'DescriptorReferredToDescriptorName',
       'PharmacologicalActionDescriptorReferredTo',
       'PharmacologicalActionListPharmacologicalAction',
       'QualifierReferredToQualifierName',
       'RootDescriptorRecord',
       'TermDateCreated',
       'TermListTerm',
       'TermThesaurusIDlist','ECINDescriptorReferredTo',
       'ECINQualifierReferredTo',
       'ECOUTDescriptorReferredTo',
       'ECOUTQualifierReferredTo',
       'EntryCombinationECIN',
       'EntryCombinationECOUT'],
        axis=1, inplace=True)

In [None]:
desc_2018_df.head(1)

In [None]:
desc_2018_df.rename(columns={'AllowableQualifierAbbreviation': 'QualifierAbbreviation',
                            'ConceptConceptUI': 'ConceptUI',
                            'ConceptListPreferredConceptYN': 'PreferredConceptYN',
                            'ConceptRelationConcept1UI': 'Concept1UI',
                            'ConceptRelationConcept2UI': 'Concept2UI',
                            'ConceptRelationListRelationName' : 'ConceptRelationName',
                            'PreviousIndexingListPreviousIndexing': 'PreviousIndexing',
                            'EntryCombinationListEntryCombination': 'EntryCombination',
                            'RelatedRegistryNumberListRelatedRegistryNumber': 'RelatedRegistryNumber',
                            'SeeRelatedDescriptorDescriptorReferredTo': 'DescriptorReferredTo',
                            'SeeRelatedListSeeRelatedDescriptor': 'SeeRelatedDescriptor',
                            'TermListConceptPreferredTermYN': 'PreferredTermYN',
                            'TermListIsPermutedTermYN': 'IsPermutedTermYN',
                            'ThesaurusIDlistThesaurusID': 'ThesaurusID',
                            'TreeNumberListTreeNumber': 'TreeNumber'}, inplace=True)

In [None]:
# desc_2018_df['TreeNumber'][pd.isnull(desc_2018_df['TreeNumber'])] = ['U01', 'U02']
desc_2018_df = desc_2018_df[~pd.isnull(desc_2018_df['TreeNumber'])]

MeSH tree numbers take the form "A01.343.124.243", with up to 12 levels, where the first letter denotes the coarsest category. To find each term's position in the hierarchy, we split its tree number on "." and count the resulting parts.
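
For a single tree number, the idea can be sketched as follows, using the example code from the text. The variable names are illustrative.

```python
tree_number = 'A01.343.124.243'

# Split on '.' to get one part per level of the hierarchy
parts = tree_number.split('.')

category = parts[0][0]  # first letter of the first part: the coarsest category
depth = len(parts)      # number of levels in this tree number

print(category, depth)
```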

In [None]:
# One list of parts per tree number
code_splits = desc_2018_df['TreeNumber'].str.split('.').tolist()

In [None]:
# mesh_tree_codes = ['.'.join(c) for c in code_splits]
code_lengths = [len(c) for c in code_splits]
max_code_length = max(code_lengths)
# desc_2018_df['MeshTreeCode'] = mesh_tree_codes

In [None]:
print(max_code_length)

In [None]:
# reset

# for c in desc_2018_df.columns:
#     if 'tree' in c:
#         desc_2018_df.drop(c, axis=1, inplace=True)

In [None]:
desc_2018_df['tree_number_0'] = [c[0][0] for c in code_splits]

In [None]:
code_splits[200]

Let's add columns for each code order, so we can group terms together under common codes later.

In [None]:
for i in range(1, max_code_length):
    tree_lvl_codes = []
    for c in code_splits:
        if len(c) >= i:
            tree_lvl_codes.append('.'.join(c[:i]))
        else:
            tree_lvl_codes.append(np.nan)
    desc_2018_df['tree_number_{}'.format(i)] = tree_lvl_codes
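
The per-level columns are cumulative prefixes of the tree number. A minimal pure-Python sketch of that step is below. `level_prefixes` is a hypothetical helper, and `None` stands in for the `NaN` used in the DataFrame.

```python
def level_prefixes(tree_number, n_levels):
    """Return the prefix of tree_number at each level, or None past its depth."""
    parts = tree_number.split('.')
    return [
        '.'.join(parts[:i]) if len(parts) >= i else None
        for i in range(1, n_levels + 1)
    ]


print(level_prefixes('A01.343.124', 4))
```

Note that the cell above iterates over `range(1, max_code_length)`, so it creates columns for levels 1 to `max_code_length - 1`.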

We want to map the codes to actual terms, so starting with the 0th level, we map terms obtained manually from the MeSH website.

In [None]:
# from https://meshb.nlm.nih.gov/treeView
tree_0_map = {
    'A': 'anatomy',
    'B': 'organisms',
    'C': 'diseases',
    'D': 'chemicals and drugs',
    'E': 'analytical, diagnostic, and therapeutic techniques, and equipment',
    'F': 'psychiatry and psychology',
    'G': 'phenomena and processes',
    'H': 'disciplines and occupations',
    'I': 'anthropology, education, sociology, and social phenomena',
    'J': 'technology, industry, and agriculture',
    'K': 'humanities',
    'L': 'information science',
    'M': 'named groups',
    'N': 'health care',
    'V': 'publication characteristics',
    'Z': 'geographicals'
}

In [None]:
desc_2018_df['tree_string_0'] = desc_2018_df['tree_number_0'].map(tree_0_map)

Some of the original strings are inverted around a comma (for example, "Abdomen, Acute"). To help match them against documents, we should put them back in natural word order.

In [None]:
# desc_2018_df.to_csv(proc_data + '/mesh_codes_cleaned_{}.csv'.format(today_str), index=False)
desc_2018_df = pd.read_csv(proc_data + '/mesh_codes_cleaned_{}.csv'.format('5_3_2018')).drop('Unnamed: 0', axis=1)

In [None]:
def process_string(string):
    string = string.split(', ')
    string = ' '.join(string[::-1])
    return string.lower()
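
A quick check of what this does, with the function repeated so the snippet runs on its own. The wrapper `safe_process_string` is a hypothetical addition for columns containing missing values, which `process_string` cannot handle because a float has no `split` method.

```python
def process_string(string):
    string = string.split(', ')
    string = ' '.join(string[::-1])
    return string.lower()


def safe_process_string(value):
    """Apply process_string to strings only; pass anything else through unchanged."""
    if isinstance(value, str):
        return process_string(value)
    return value


inverted = process_string('Abdomen, Acute')
plain = process_string('Calcimycin')
missing = safe_process_string(None)
print(inverted, '|', plain, '|', missing)
```

Strings with more than one comma are reversed part by part, which may not always give a natural reading.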

In [None]:
for c in desc_2018_df.columns:
    if 'String' in c:
        print(c)

In [None]:
desc_2018_df['ConceptNameString'][:10]

In [None]:
desc_2018_df['DescriptorNameString'][:10]

In [None]:
desc_2018_df['QualifierNameString'][:10]

In [None]:
desc_2018_df['TermString'][:10]

In [None]:
desc_2018_df['ConceptStringProcessed'] = desc_2018_df['ConceptNameString'].apply(process_string)
desc_2018_df['DescriptorStringProcessed'] = desc_2018_df['DescriptorNameString'].apply(process_string)
# desc_2018_df['QualifierStringProcessed'] = desc_2018_df['QualifierNameString'].apply(process_string)
desc_2018_df['TermStringProcessed'] = desc_2018_df['TermString'].apply(process_string)

For each level `i`, take the full tree numbers and the processed strings, but only for rows where the column for level `i + 1` is NaN, i.e. terms whose tree number goes no deeper than level `i`. Set the index of that subset to the tree number and convert it to a dict that maps codes to strings. Then map that dict onto the `tree_number_i` column, which gives the name of each term's ancestor at level `i`.

In [None]:
def expand_string_tree(df, string_column, max_code_length=13):
    # Operate on the DataFrame passed in rather than the global desc_2018_df
    for i in range(1, max_code_length - 1):
        tree_name_map = df[['TreeNumber', string_column]][pd.isnull(df['tree_number_{}'.format(i + 1)])].set_index('TreeNumber').to_dict()
        tree_name_map = tree_name_map[string_column]
        tree_name_map.pop(np.nan, None)
        df['tree_{}_{}'.format(string_column, i)] = df['tree_number_{}'.format(i)].map(tree_name_map, na_action='ignore')
    df['tree_{}_{}'.format(string_column, max_code_length - 1)] = np.nan
    return df

In [None]:
for c in ['ConceptStringProcessed', 'DescriptorStringProcessed', 'TermStringProcessed']:
    desc_2018_df = expand_string_tree(desc_2018_df, c)
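
The effect of this step can be sketched without pandas: for each term, look up the name attached to each prefix of its tree number. The codes and names below are illustrative, and `ancestor_names` is a hypothetical helper rather than part of the notebook.

```python
# Illustrative tree numbers and names, not real MeSH records
code_to_name = {
    'A01': 'example parent',
    'A01.111': 'example child',
    'A01.111.222': 'example grandchild',
}


def ancestor_names(tree_number, code_to_name):
    """Return the name found at each level of tree_number (None if absent)."""
    parts = tree_number.split('.')
    names = []
    for i in range(1, len(parts) + 1):
        prefix = '.'.join(parts[:i])
        names.append(code_to_name.get(prefix))
    return names


print(ancestor_names('A01.111.222', code_to_name))
```

A prefix with no entry in the lookup comes back as `None`, which corresponds to a missing value in the DataFrame version.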

In [None]:
# desc_2018_df.to_csv(proc_data + '/mesh_codes_cleaned_{}.csv'.format(today_str), index=False)

After this there are some broken codes, due to duplicate entries in the tree, but these are relatively few in number.

In [None]:
desc_2018_df['tree_order'] = code_lengths

Finally, export as JSON.

In [None]:
reoriented = desc_2018_df.set_index('DescriptorRecordDescriptorUI')

In [None]:
concept_string_dict = reoriented.to_dict(orient='index')

In [None]:
reoriented.to_json(proc_data + '/mesh_codes_processed_DUI_{}.json'.format(today_str), orient='index')

A second iteration of this is needed, where the tree is built on the tree numbers rather than on one of the terms.

Possible structure that we might want to obtain later:

```
{'A': {'level': 0,
       'term': 'anatomy',
       'children': {'A01': {...
                           }
                    ...
                   }
       ... 
      }
 ...
}
                   
```
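
One way such a structure might be assembled from a flat mapping of tree numbers to terms is sketched below. The function `build_tree`, its arguments and the sample data are all hypothetical. The top level is keyed by category letter at level 0, as in the outline above.

```python
def build_tree(code_to_term, category_terms):
    """Nest {tree_number: term} pairs into level/term/children dicts."""
    tree = {}
    for code in sorted(code_to_term):
        letter = code[0]
        # Level 0 node for the category letter
        node = tree.setdefault(
            letter,
            {'level': 0, 'term': category_terms.get(letter), 'children': {}},
        )
        parts = code.split('.')
        # Walk down one prefix at a time, creating nodes as needed
        for i in range(1, len(parts) + 1):
            prefix = '.'.join(parts[:i])
            node = node['children'].setdefault(
                prefix,
                {'level': i, 'term': code_to_term.get(prefix), 'children': {}},
            )
    return tree


# Illustrative input, not real MeSH records
code_to_term = {'A01': 'term x', 'A01.111': 'term y', 'B02': 'term z'}
category_terms = {'A': 'anatomy', 'B': 'organisms'}

tree = build_tree(code_to_term, category_terms)
print(tree['A']['children']['A01']['children']['A01.111'])
```

Intermediate prefixes that have no term of their own still get a node, with `term` set to `None`.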

In [None]:
reoriented = desc_2018_df.set_index('TreeNumber')

In [None]:
concept_string_dict = reoriented.to_dict(orient='index')

In [None]:
reoriented.to_json(proc_data + '/mesh_codes_processed_tree_number_{}.json'.format(today_str), orient='index')

In [None]:
desc_2018_df = pd.read_json('../data/processed/mesh_codes_processed_5_4_2018.json')

In [None]:
desc_2018_df.set_index('TreeNumber').to_json('../data/processed/mesh_codes_processed_5_8_2018.json', orient='index')

In [None]:
desc_2018_df[desc_2018_df['ConceptNameString'].str.contains('Volition')]

In [None]:
df = pd.read_json('../data/processed/mesh_codes_processed_5_4_2018.json', orient='index')

In [None]:
df = df.reset_index()

In [None]:
df.rename(columns={'index': 'DescriptorRecordDescriptorUI'}, inplace=True)

In [None]:
df.columns

In [None]:
df.head()