# ICD-11

The data (release 2023-01) was downloaded on 2024-01-22 from https://icd.who.int/browse11/l-m/en by clicking `info` -> `spreadsheet file`.

The spreadsheet contains three types of entries:
1. chapter
2. block
3. category

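The spreadsheet has no column that can serve directly as an `ontology_id`, so we use the numeric ID at the end of the linearization URI.
Some URIs end in `/other` or `/unspecified`, which denote alternative entries under the same numeric ID. We keep these rows and abbreviate the suffixes to `o` and `u` respectively, appended to the numeric ID.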
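
As a rough illustration of this rule, the sketch below maps a few URIs to identifiers using only the standard library. The numeric ID `123456789` is invented for the example, and `uri_to_id` is a hypothetical helper name, not something provided by the WHO data.

```python
import re
from typing import Optional

# Suffix abbreviations described above: "other" -> "o", "unspecified" -> "u".
SUFFIXES = {"other": "o", "unspecified": "u"}


def uri_to_id(uri: str) -> Optional[str]:
    """Return the trailing numeric ID of a URI, plus "o" or "u" if a suffix is present."""
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", uri)
    if match is None:
        return None
    number, suffix = match.groups()
    # suffix is None when the URI ends in the plain numeric ID
    return number + SUFFIXES.get(suffix, "")


# Hypothetical URI: the numeric ID is made up for illustration.
base = "http://id.who.int/icd/release/11/2023-01/mms/123456789"
print(uri_to_id(base))                   # 123456789
print(uri_to_id(base + "/other"))        # 123456789o
print(uri_to_id(base + "/unspecified"))  # 123456789u
```

Returning `None` for a URI that does not match is a choice made for this sketch; it keeps unmatched rows easy to spot with `isna()`.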

In [16]:
import pandas as pd
import re

In [17]:
df = pd.read_excel("icd_11.xlsx")

In [18]:
df = df[["Linearization (release) URI", "Code", "Title"]]
df.head()

,Linearization (release) URI,Code,Title
0,http://id.who.int/icd/release/11/2023-01/mms/1...,,Certain infectious or parasitic diseases
1,http://id.who.int/icd/release/11/2023-01/mms/5...,,- Gastroenteritis or colitis of infectious origin
2,http://id.who.int/icd/release/11/2023-01/mms/1...,,- - Bacterial intestinal infections
3,http://id.who.int/icd/release/11/2023-01/mms/2...,1A00,- - - Cholera
4,http://id.who.int/icd/release/11/2023-01/mms/4...,1A01,- - - Intestinal infection due to other Vibrio


In [19]:
df.rename(
    columns={"Code": "code", "Title": "name", "Linearization (release) URI": "URI"},
    inplace=True,
)

In [20]:
def extract_code(url: str) -> str:
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", url)
    if match:
        code = match.group(1)
        suffix = match.group(2)
        if suffix == "other":
            code += "o"
        elif suffix == "unspecified":
            code += "u"
        return code
    else:
        return "No code found"

In [21]:
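# Find the parent of each term. Depth is encoded by the leading "- " prefixes.
def term_depth(term):
    # Count only the leading dashes, so hyphens inside a title are ignored
    return re.match(r"[-\s]*", term).group(0).count("-")


def find_parent(position, all_terms):
    parent_depth = term_depth(all_terms[position]) - 1

    # Search upwards for the nearest term with one less leading dash
    for i in range(position - 1, -1, -1):
        if term_depth(all_terms[i]) == parent_depth:
            return all_terms[i].strip("- ").strip()
    return None


# Look terms up by position: list.index() would return the first match for duplicate titles
all_terms = df["name"].tolist()
df["parents"] = [find_parent(i, all_terms) for i in range(len(all_terms))]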
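
The upward search rescans earlier rows for every term. A single pass that keeps a stack of the current ancestors is one possible alternative. The sketch below is not the notebook's method: `depth_of` and `parent_positions` are hypothetical names, and the last title is invented to show a hyphen inside a name.

```python
import re
from typing import List, Optional


def depth_of(title: str) -> int:
    """Count only the leading dashes; hyphens inside the title are ignored."""
    return re.match(r"[-\s]*", title).group(0).count("-")


def parent_positions(titles: List[str]) -> List[Optional[int]]:
    """Return, for each title, the list position of its parent (None for top level)."""
    parents: List[Optional[int]] = []
    stack: List[int] = []  # positions of the current ancestors, shallowest first
    for position, title in enumerate(titles):
        depth = depth_of(title)
        del stack[depth:]  # drop entries at this depth or deeper
        parents.append(stack[-1] if stack else None)
        stack.append(position)
    return parents


titles = [
    "Certain infectious or parasitic diseases",
    "- Gastroenteritis or colitis of infectious origin",
    "- - Bacterial intestinal infections",
    "- - - Cholera",
    "- - - Intestinal infection due to other Vibrio",
    "- - - Non-cholera example (hypothetical title)",
]
print(parent_positions(titles))  # [None, 0, 1, 2, 2, 2]
```

Working with positions rather than titles avoids ambiguity when two rows share the same title.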

In [22]:
df["ontology_id"] = df["URI"].apply(extract_code)

In [23]:
df.drop("URI", inplace=True, axis=1)

In [24]:
# Remove only the leading "- " depth prefix; hyphens inside a title are kept
df["name"] = df["name"].str.replace(r"^[-\s]+", "", regex=True).str.strip()

In [25]:
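# Map each parent title to its ontology_id.
# Titles that occur more than once resolve to the last matching row.
title_to_ontology = dict(zip(df["name"], df["ontology_id"]))

df["parents"] = df["parents"].apply(title_to_ontology.get)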

In [26]:
df.set_index("ontology_id", inplace=True)

In [27]:
df

ontology_id,code,name,parents
1435254666,,Certain infectious or parasitic diseases,
588616678,,Gastroenteritis or colitis of infectious origin,1435254666
135352227,,Bacterial intestinal infections,588616678
257068234,1A00,Cholera,135352227
416025325,1A01,Intestinal infection due to other Vibrio,135352227
...,...,...,...
1956913761,XD36Q1,"Infusion Pumps, Syringe",1529373361
783787054,XD1N14,"Infusion Pumps, Syringe, Nuclear Magnetic Reso...",1529373361
1524741217,XD80Z7,Medical/medicinal gas systems and relative acc...,1838822834
280385798,XD4U38,General purpose electrocardiographs,1838822834
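
Because every row stores the `ontology_id` of its parent, the full chain of ancestors can be recovered by following `parents` repeatedly. The sketch below uses a plain dictionary built from the first five rows shown above; `ancestors` is a hypothetical helper name.

```python
# ontology_id -> parent ontology_id, copied from the first five rows above
parents = {
    "1435254666": None,
    "588616678": "1435254666",
    "135352227": "588616678",
    "257068234": "135352227",
    "416025325": "135352227",
}


def ancestors(ontology_id, parents):
    """Return the ancestors of a term, nearest first, ending at the chapter."""
    chain = []
    current = parents.get(ontology_id)
    while current is not None:
        chain.append(current)
        current = parents.get(current)
    return chain


print(ancestors("257068234", parents))
# ['135352227', '588616678', '1435254666']
```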


In [28]:
df.to_parquet("icd-11-2023-01.parquet")