Import the necessary libraries.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

Load the CSV of UMLS code and symptom pairs.

In [2]:
CSV_FILEPATH = 'dataset/symptom-umls-code_pairs.csv'
data = pd.read_csv(CSV_FILEPATH)
data

,umls,symptom
0,C0008031,pain chest
1,C0392680,shortness of breath
2,C0012833,dizziness
3,C0004093,asthenia
4,C0085639,fall
...,...,...
1902,C0741453,bedridden
1903,C0242453,prostatism
1904,C0232257,systolic murmur
1905,C0871754,frail
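
Before scraping, it can help to check that every value in the umls column looks like a UMLS concept identifier: a "C" followed by seven digits, as in the rows above. The following is a minimal sketch that assumes this format holds for the whole file; `find_malformed_codes` is a hypothetical helper, not part of the original notebook.

```python
import re

# Assumption: every code is a UMLS concept identifier of the form "C" + 7 digits.
CUI_PATTERN = re.compile(r"C\d{7}")


def find_malformed_codes(codes):
    """Return the codes that do not match the assumed identifier format."""
    return [code for code in codes if not CUI_PATTERN.fullmatch(str(code))]


# Illustrative input: two well-formed codes and two deliberately broken ones.
sample_codes = ["C0008031", "C0392680", "c0012833", "C123"]
malformed = find_malformed_codes(sample_codes)
print(malformed)
```

In the notebook this could be called as `find_malformed_codes(data['umls'])`; an empty list would mean every code has the expected shape.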


Extract the entries to lists.

In [3]:
symptom_names = data['symptom'].tolist()
symptom_codes = data['umls'].tolist()
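
The scraping step issues one request per row, so a code that appears on several rows would be fetched several times. A small sketch for spotting such repeats before scraping is shown here; `group_names_by_code` is a hypothetical helper, and the sample values are illustrative rather than taken from the dataset.

```python
from collections import defaultdict


def group_names_by_code(names, codes):
    """Map each code to the list of symptom names that share it."""
    names_by_code = defaultdict(list)
    for name, code in zip(names, codes):
        names_by_code[code].append(name)
    return names_by_code


# Illustrative values only; "chest pain" is made up to show a repeated code.
names = ["pain chest", "chest pain", "dizziness"]
codes = ["C0008031", "C0008031", "C0012833"]

grouped = group_names_by_code(names, codes)
# Keep only the codes that map to more than one symptom name.
shared = {code: group for code, group in grouped.items() if len(group) > 1}
print(shared)
```

With the notebook's own lists, the call would be `group_names_by_code(symptom_names, symptom_codes)`.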

Scrape a description for each symptom from <a href="https://www.ncbi.nlm.nih.gov/medgen/">NCBI MedGen</a>, using its UMLS code as the search term.

In [4]:
ROOT_URL = 'https://www.ncbi.nlm.nih.gov/medgen/'
symptom_data = []

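for symptom_name, term in zip(symptom_names, symptom_codes):

    # Query MedGen with the UMLS code as the search term.
    scraped_url = "{}?term={}".format(ROOT_URL, term)
    # requests has no default timeout, so set one to avoid hanging forever.
    r = requests.get(scraped_url, timeout=30)

    soup = BeautifulSoup(r.content, 'html5lib')
    # The description, when the page has one, sits in this div.
    symptom_info = soup.find("div", {"class": "portlet_content ln"})

    # Default to None so symptoms without a description still get an entry.
    description = source_name = source_link = None
    if symptom_info:
        # Replace non-breaking spaces with regular spaces.
        description = symptom_info.text.replace('\xa0', ' ')
        # The first link in the description names its source.
        source = symptom_info.find("a")
        if source:
            source_name = source.text
            source_link = source.get('href')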

    entry = {
        "symptom": symptom_name,
        "code": term,
        "description": description,
        "source_name": source_name,
        "source_link": source_link,
        "root_url": scraped_url
    }

    symptom_data.append(entry)
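
The loop above mixes fetching and parsing, which makes the parsing hard to check without network access. One possible way to separate the two is sketched below. The names `parse_symptom_page` and `fetch_symptom_page`, the 30-second timeout, and the half-second pause are all assumptions made for illustration, and the sketch uses the built-in `html.parser` instead of `html5lib` to avoid an extra dependency.

```python
import time

import requests
from bs4 import BeautifulSoup


def parse_symptom_page(html):
    """Extract the description and its source from a result page, if present."""
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find("div", {"class": "portlet_content ln"})
    if info is None:
        return {"description": None, "source_name": None, "source_link": None}
    link = info.find("a")
    return {
        "description": info.text.replace("\xa0", " "),
        "source_name": link.text if link else None,
        "source_link": link.get("href") if link else None,
    }


def fetch_symptom_page(session, url, timeout=30, delay=0.5):
    """Fetch one page and return its HTML, or None if the request fails."""
    time.sleep(delay)  # assumed pause between requests, not a documented limit
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return response.text


# Hypothetical HTML fragment standing in for a real result page.
sample_html = (
    '<div class="portlet_content ln">A sensation of lightheadedness. '
    '[from <a href="http://ncit.nci.nih.gov">NCI</a>]</div>'
)
parsed = parse_symptom_page(sample_html)
print(parsed)
```

Inside the loop, `fetch_symptom_page(requests.Session(), scraped_url)` would supply the HTML, and a `None` result could be recorded as a missing description rather than stopping the run.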

Preview the list of dictionaries produced by scraping.

In [5]:
symptom_data

[{'symptom': 'pain chest',
  'code': 'C0008031',
  'description': 'An unpleasant sensation characterized by physical discomfort (such as pricking, throbbing, or aching) localized to the chest. [from HPO]',
  'source_name': 'HPO',
  'source_link': 'http://www.human-phenotype-ontology.org',
  'root_url': 'https://www.ncbi.nlm.nih.gov/medgen/?term=C0008031'},
 {'symptom': 'shortness of breath',
  'code': 'C0392680',
  'description': None,
  'source_name': None,
  'source_link': None,
  'root_url': 'https://www.ncbi.nlm.nih.gov/medgen/?term=C0392680'},
 {'symptom': 'dizziness',
  'code': 'C0012833',
  'description': 'A sensation of lightheadedness, unsteadiness, turning, spinning or rocking. [from NCI]',
  'source_name': 'NCI',
  'source_link': 'http://ncit.nci.nih.gov',
  'root_url': 'https://www.ncbi.nlm.nih.gov/medgen/?term=C0012833'},
 {'symptom': 'asthenia',
  'code': 'C0004093',
  'description': 'A state characterized by a feeling of weakness and loss of strength leading to a general

Convert the list of dictionaries into a data frame.

In [6]:
symptom_code_df = pd.DataFrame(symptom_data)
symptom_code_df

,symptom,code,description,source_name,source_link,root_url
0,pain chest,C0008031,An unpleasant sensation characterized by physi...,HPO,http://www.human-phenotype-ontology.org,https://www.ncbi.nlm.nih.gov/medgen/?term=C000...
1,shortness of breath,C0392680,,,,https://www.ncbi.nlm.nih.gov/medgen/?term=C039...
2,dizziness,C0012833,"A sensation of lightheadedness, unsteadiness, ...",NCI,http://ncit.nci.nih.gov,https://www.ncbi.nlm.nih.gov/medgen/?term=C001...
3,asthenia,C0004093,A state characterized by a feeling of weakness...,HPO,http://www.human-phenotype-ontology.org,https://www.ncbi.nlm.nih.gov/medgen/?term=C000...
4,fall,C0085639,"A sudden movement downward, usually resulting ...",NCI,http://ncit.nci.nih.gov,https://www.ncbi.nlm.nih.gov/medgen/?term=C008...
...,...,...,...,...,...,...
1902,bedridden,C0741453,Confined to bed (by illness). [from NCI],NCI,http://ncit.nci.nih.gov,https://www.ncbi.nlm.nih.gov/medgen/?term=C074...
1903,prostatism,C0242453,"Lower urinary tract symptom, such as slow urin...",MeSH,http://www.nlm.nih.gov/pubs/factsheets/mesh.html,https://www.ncbi.nlm.nih.gov/medgen/?term=C024...
1904,systolic murmur,C0232257,"A heart murmur limited to systole, i.e., betwe...",HPO,http://www.human-phenotype-ontology.org,https://www.ncbi.nlm.nih.gov/medgen/?term=C023...
1905,frail,C0871754,,,,https://www.ncbi.nlm.nih.gov/medgen/?term=C087...
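
Some rows, such as "shortness of breath" and "frail", come back without a description. A short sketch for measuring how many rows are affected and which sources supplied the rest is shown here. It runs on a small stand-in frame with illustrative values; in the notebook the same expressions would be applied to `symptom_code_df`.

```python
import pandas as pd

# Stand-in frame with the columns of interest; the values are illustrative.
sample_df = pd.DataFrame(
    {
        "symptom": ["pain chest", "shortness of breath", "dizziness"],
        "description": ["An unpleasant sensation...", None, "A sensation of..."],
        "source_name": ["HPO", None, "NCI"],
    }
)

# Rows where scraping found no description.
is_missing = sample_df["description"].isna()
missing_count = int(is_missing.sum())
missing_symptoms = sample_df.loc[is_missing, "symptom"].tolist()

# Number of descriptions contributed by each source (missing values excluded).
source_counts = sample_df["source_name"].value_counts().to_dict()

print(missing_count, missing_symptoms)
print(source_counts)
```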


Save the data frame to a CSV file.

In [7]:
# Use a separate name so the input path defined earlier is not overwritten.
OUTPUT_CSV_FILEPATH = 'dataset/symptom-description.csv'
symptom_code_df.to_csv(OUTPUT_CSV_FILEPATH, index=False)
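
Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed index column. Missing descriptions are written as empty fields and come back as NaN. The sketch below illustrates both points on a one-row stand-in frame, using an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

# One-row stand-in for the scraped frame, with a missing description.
sample_df = pd.DataFrame(
    {"symptom": ["shortness of breath"], "code": ["C0392680"], "description": [None]}
)

buffer = io.StringIO()  # in-memory stand-in for the CSV file
sample_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())
print(reloaded["description"].isna().tolist())
```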