# Clinical Concept Extraction

This notebook provides an example of extracting clinical concepts from a medical record.

## Imports

Let's begin by importing the key utilities from the `cce` library, along with a few helpers from the Python standard library.

In [1]:
## Standard Library Imports
import os
import json
from functools import partial

## Custom Library Imports
from cce.util.patterns import ICD10_AUTO_LABELS, AUTO_LABELS
from cce.util.data_loaders import AutoLabeler, get_autolabels

## Synthetic Data

We have provided a synthetic clinical note, generated by ChatGPT, to demonstrate the expected input format.

In [2]:
## Load Synthetic Note
synthetic_note_file = "../data/resources/synthetic-notes/1.txt"
if not os.path.exists(synthetic_note_file):
    raise FileNotFoundError("Could not find synthetic note file.")
with open(synthetic_note_file,"r") as the_file:
    synthetic_note = the_file.read().strip()

## Show Note
print(synthetic_note)

[[[ENCOUNTER ICD-10 CODES]]]
[[E11.9: Type 2 diabetes mellitus without complications]]
[[H35.81: Mild nonproliferative diabetic retinopathy, bilateral]]

[[[PROGRESS NOTE]]]
The patient with a known history of type 2 diabetes mellitus presented today for a routine eye examination. Reports stable blood glucose levels and compliance with prescribed medications. No recent changes in systemic health.

Visual acuity measured at 20/30 in both eyes without correction. Dilated fundus examination revealed mild nonproliferative diabetic retinopathy (NPDR) characterized by microaneurysms and intraretinal hemorrhages. Macular edema was not detected.

Patient counseled on the importance of glycemic control and regular eye examinations to monitor diabetic retinopathy progression. Scheduled for follow-up in six months.

[[[PROBLEM LIST]]]
[[E11.9: Type 2 diabetes mellitus without complications]]
[OVERVIEW]
The patient has type 2 diabetes mellitus without any current complications.

[ASSESSMENT & PLAN

## Extractor Parameterization

The concept extraction pipeline has two stages. The first stage is facilitated by the `AutoLabeler` module, which looks for relevant concepts in free text. The second stage is housed entirely within the `get_autolabels` function, which applies postprocessing steps such as the consolidation of consecutive surgical concept spans.

#### `AutoLabeler` Parameters

* `auto_labels`: Free text search patterns. Default should be the `AUTO_LABELS` variable from the `cce.util.patterns` module. If `None`, no free-text search will be executed.
* `icd10_auto_labels`: ICD-10 code search patterns. Default should be the `ICD10_AUTO_LABELS` variable from the `cce.util.patterns` module. If `None`, no ICD-10 code search will be executed.
* `handle_icd10_expand`: Either `None`, `"all"` or `"unseen"`. Looks for ICD-10 E codes and, if any are found, attempts to add the following concepts: DR (Generic), ME, DM, Nephropathy, Neuropathy, Heart Attack, Stroke. If this parameter is `None`, none of these concepts are added. If this parameter is `"all"`, every concept in the list is added. If this parameter is `"unseen"`, only concepts not already identified by the free-text search are added.
* `handle_icd10_codes`: Either `None`, `"all"` or `"unseen"`. Look for ICD-10 codes provided in the `icd10_auto_labels` mapping. If this parameter is `None`, no concepts will be added based on ICD-10 codes found in the note. If this parameter is `"all"`, all concepts found using ICD-10 codes will be added to the output. If this parameter is `"unseen"`, concepts found by ICD-10 codes will only be added to the output if not already found in the free-text search.
* `handle_icd10_strings`: Either `None`, `"all"` or `"unseen"`. Behaves similarly to `handle_icd10_codes`, except that matches to the free-text search patterns are sought within the short descriptive strings associated with each ICD-10 code.
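
The three `handle_*` parameters share the same `None` / `"all"` / `"unseen"` semantics. The snippet below is a minimal sketch of how those modes could combine two sets of concepts. It is not the `cce` implementation: `combine_concepts` and the toy concept dictionaries are hypothetical and exist only to illustrate the behavior described above.

```python
def combine_concepts(free_text, icd10, mode):
    """Hypothetical helper: merge ICD-10-derived concepts into free-text concepts.

    mode=None     -> ignore ICD-10-derived concepts entirely
    mode="all"    -> add every ICD-10-derived concept
    mode="unseen" -> add only concepts the free-text search did not find
    """
    if mode not in (None, "all", "unseen"):
        raise ValueError(f"Unsupported mode: {mode!r}")
    # Copy the span lists so the inputs are never mutated
    merged = {label: list(spans) for label, spans in free_text.items()}
    if mode is None:
        return merged
    for label, spans in icd10.items():
        if mode == "unseen" and label in free_text:
            continue  # concept was already found in free text
        merged.setdefault(label, []).extend(spans)
    return merged


# Toy inputs (illustrative labels and character spans only)
free_text = {"DM": [(45, 62)]}
icd10 = {"DM": [(31, 34)], "ME": [(89, 95)]}

results = {mode: combine_concepts(free_text, icd10, mode)
           for mode in (None, "all", "unseen")}
for mode, merged in results.items():
    print(mode, merged)
```

With `"unseen"`, the ICD-10 evidence for "DM" is skipped because the free-text search already found that concept, while "ME" is still added.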

#### `get_autolabels` Parameters

* `autolabeler`: Initialized `AutoLabeler` module.
* `formatted`: Boolean. Indicates whether the text passed to the function contains section headers (e.g., "[[[PROBLEM LIST]]]"). Set `False` if only providing free-text without additional note formatting.
* `handle_surgical`: Either `None`, `"split"`, or `"merge"`. If `"split"`, retina surgery spans will be split into separate spans (e.g., "PPV, lensectomy" will have a "PPV" span and a "lensectomy" span). If `"merge"`, we will attempt to consolidate retina surgery spans that are separated by conjunctions and common filler (e.g., "PPV combined with lensectomy" will be a single span). If `None`, no additional postprocessing logic will be applied. We recommend using the `"merge"` setting.
* `handle_anti_vegf`: Boolean. Clinical notes may contain a long list of Anti-VEGF procedures contained within a table-like format. If this parameter is set to `True`, we attempt to recognize these long tables and merge them into a single span instead of separate spans.
* `resolve_retinopathy_hierarchy`: Boolean. If `True`, remove DR (Generic) spans if a more specific representation (PDR or NPDR) is found.
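
To make the `resolve_retinopathy_hierarchy` option concrete, here is a minimal sketch of one possible reading: drop every generic DR span as soon as any more specific span exists. This is an assumption about the behavior, not the library's code; the real implementation may, for example, only drop generic spans that overlap a specific one. The `resolve_hierarchy` function is hypothetical, and the `"A3 - PDR"` label is a guess at the naming scheme (only `"A1 - DR (Generic)"` and `"A2 - NPDR"` appear in this notebook's output).

```python
GENERIC_LABEL = "A1 - DR (Generic)"
# "A3 - PDR" is a hypothetical label used for illustration only
SPECIFIC_LABELS = ("A2 - NPDR", "A3 - PDR")


def resolve_hierarchy(annotations):
    """Hypothetical sketch: remove generic DR spans when NPDR or PDR spans exist."""
    has_specific = any(annotations.get(label) for label in SPECIFIC_LABELS)
    if has_specific:
        return {label: spans for label, spans in annotations.items()
                if label != GENERIC_LABEL}
    return dict(annotations)


with_specific = {
    "A1 - DR (Generic)": [[743, 763, "diabetic retinopathy"]],
    "A2 - NPDR": [[546, 550, "NPDR"]],
}
generic_only = {"A1 - DR (Generic)": [[743, 763, "diabetic retinopathy"]]}

resolved_specific = resolve_hierarchy(with_specific)
resolved_generic = resolve_hierarchy(generic_only)
print(sorted(resolved_specific))
print(sorted(resolved_generic))
```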

In [3]:
## Initialize Concept Extractor
extractor = AutoLabeler(auto_labels=AUTO_LABELS, ## Free-Text Search Patterns
                        icd10_auto_labels=ICD10_AUTO_LABELS, ## ICD-10 Search Patterns
                        handle_icd10_expand="unseen", ## How to handle diabetes mellitus matches (i.e., add common comorbidities)
                        handle_icd10_codes="all", ## How to handle matched ICD-10 codes
                        handle_icd10_strings="all", ## How to handle matches within ICD-10 Code Descriptions
                       )

## Pass Concept Extractor to Helper That Includes Postprocessing Logic
extractor_p = partial(get_autolabels,
                      autolabeler=extractor,
                      formatted=True,
                      handle_surgical="merge",
                      handle_anti_vegf=True,
                      resolve_retinopathy_hierarchy=False)

## Extraction

The parameterized extractor is applied to the synthetic note text. The output is a dictionary with three keys:

* `"document_id"`: Will be null. This can be updated manually if you have an identifier associated with the document.
* `"text"`: The document which was processed.
* `"annotations"`: Dictionary where each key is a clinical concept label and each value is a list of `[start, end, matched_text]` spans, with `start` and `end` given as character offsets into `"text"`.
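
Judging by the sample output in this notebook, each span appears to be a `[start, end, matched_text]` triple whose offsets index into the document text with an exclusive end. The sketch below checks that assumption on the first two lines of the synthetic note. The `find_mismatched_spans` helper is hypothetical and not part of `cce`; the spans are copied from the printed results.

```python
# First two lines of the synthetic note shown earlier
text = ("[[[ENCOUNTER ICD-10 CODES]]]\n"
        "[[E11.9: Type 2 diabetes mellitus without complications]]")

# A few spans copied from the notebook output
annotations = {
    "A1 - DR (Generic)": [[31, 36, "E11.9"]],
    "F1 - Diabetes Mellitus": [[31, 34, "E11"], [45, 62, "diabetes mellitus"]],
}


def find_mismatched_spans(text, annotations):
    """Hypothetical helper: return spans whose offsets do not reproduce the matched text."""
    mismatches = []
    for label, spans in annotations.items():
        for start, end, matched in spans:
            if text[start:end] != matched:
                mismatches.append((label, start, end, matched))
    return mismatches


mismatches = find_mismatched_spans(text, annotations)
print(mismatches)  # an empty list means every span lines up with the text
```

A check like this can be useful after any preprocessing that changes the note text (for example, stripping whitespace), since such changes would shift the offsets.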

In [4]:
## Apply Extractor
concepts = extractor_p(synthetic_note)

## Show Output Keys
print(concepts.keys())

## Show Results
print(json.dumps(concepts["annotations"], indent=2, sort_keys=True))

dict_keys(['document_id', 'text', 'annotations'])
{
  "A1 - DR (Generic)": [
    [
      31,
      36,
      "E11.9"
    ],
    [
      743,
      763,
      "diabetic retinopathy"
    ],
    [
      1627,
      1647,
      "diabetic retinopathy"
    ]
  ],
  "A2 - NPDR": [
    [
      102,
      139,
      "nonproliferative diabetic retinopathy"
    ],
    [
      507,
      544,
      "nonproliferative diabetic retinopathy"
    ],
    [
      546,
      550,
      "NPDR"
    ],
    [
      1170,
      1207,
      "nonproliferative diabetic retinopathy"
    ],
    [
      1247,
      1284,
      "nonproliferative diabetic retinopathy"
    ]
  ],
  "B1 - ME": [
    [
      89,
      95,
      "H35.81"
    ],
    [
      614,
      627,
      "Macular edema"
    ]
  ],
  "B1 - ME <<AUTO>>": [
    [
      31,
      36,
      "E11.9"
    ]
  ],
  "F1 - Diabetes Mellitus": [
    [
      31,
      34,
      "E11"
    ],
    [
      45,
      62,
      "diabetes mellitus"
    ],
    [
      