# ICD-11

The data (version 01/2023) was downloaded on 2024-01-22 from https://icd.who.int/browse11/l-m/en via `info` -> `spreadsheet file`.

The spreadsheet contains three types of entries:
1. chapter
2. block
3. category

The spreadsheet has no dedicated `ontology_id` column, so we derive one from the numeric ID at the end of the linearization URI.
Some URIs end in `/other` or `/unspecified`, which mark the "other" and "unspecified" variants of an entity. We keep these rows and append `o` or `u` to the ID, respectively.
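
As a rough illustration of this naming scheme, the sketch below maps a few URIs to identifiers. The URIs are made-up examples in the shape the spreadsheet appears to use, and `uri_to_id` is a hypothetical helper name used only here.

```python
import re

# Hypothetical helper: take the trailing numeric ID of a linearization URI and
# append "o" / "u" for the "other" / "unspecified" variants.
def uri_to_id(uri):
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", uri)
    if match is None:
        return None
    suffix = {"other": "o", "unspecified": "u"}.get(match.group(2), "")
    return match.group(1) + suffix

# Made-up URIs, for illustration only
base = "http://id.who.int/icd/release/11/2023-01/mms"
print(uri_to_id(f"{base}/1435254666"))              # 1435254666
print(uri_to_id(f"{base}/1435254666/other"))        # 1435254666o
print(uri_to_id(f"{base}/1435254666/unspecified"))  # 1435254666u
```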

In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
df = pd.read_excel("icd_11.xlsx")

In [None]:
df = df[["Linearization (release) URI", "Code", "Title"]]
df.head()

In [None]:
df.rename(
    columns={"Code": "code", "Title": "name", "Linearization (release) URI": "URI"},
    inplace=True,
)

In [None]:
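def extract_code(url: str) -> str:
    """Return the trailing numeric ID of a URI, suffixed with "o" or "u"
    when the URI ends in /other or /unspecified."""
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", url)
    if match:
        code = match.group(1)
        suffix = match.group(2)
        if suffix == "other":
            code += "o"
        elif suffix == "unspecified":
            code += "u"
        return code
    # Sentinel for URIs that do not end in a numeric ID
    return "No code found"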

In [None]:
# Find the parent of each term: the nearest preceding title that is one level
# shallower. Depth is encoded as leading "- " markers, so only leading dashes
# are counted and hyphens inside a title are ignored.
def find_parents(all_terms):
    parents = []
    last_at_depth = {}  # depth -> most recent cleaned title seen at that depth
    for term in all_terms:
        prefix = re.match(r"(?:-\s*)*", term).group(0)
        depth = prefix.count("-")
        last_at_depth[depth] = term[len(prefix):].strip()
        parents.append(last_at_depth.get(depth - 1))
    return parents


df["parents"] = find_parents(df["name"].tolist())
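
To make the parent lookup concrete, here is a minimal, self-contained sketch on a toy list of titles. The titles are invented, and the helper `depth_of` and the variables `last_seen` and `parents` are illustrative names, not part of the notebook. It assumes, as the code above does, that nesting depth is encoded by leading dashes.

```python
import re

# Toy titles in spreadsheet order; leading "- " markers encode nesting depth.
titles = [
    "Chapter A",
    "- Block A1",
    "- - Non-specific category",
    "- - Another category",
    "- Block A2",
]

LEADING_DASHES = re.compile(r"(?:-\s*)*")

def depth_of(title):
    # Count only the leading dashes, so hyphens inside a title are ignored
    return LEADING_DASHES.match(title).group(0).count("-")

last_seen = {}  # depth -> most recent cleaned title at that depth
parents = {}    # cleaned title -> cleaned parent title (None for top level)
for title in titles:
    depth = depth_of(title)
    name = title[LEADING_DASHES.match(title).end():].strip()
    parents[name] = last_seen.get(depth - 1)
    last_seen[depth] = name

for name, parent in parents.items():
    print(f"{name!r} -> {parent!r}")
```

Each title is assigned the most recent earlier title that sits one level higher, so both toy categories point to `Block A1` and both blocks point to `Chapter A`.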

In [None]:
df["ontology_id"] = df["URI"].apply(extract_code)

In [None]:
df.drop("URI", inplace=True, axis=1)

In [None]:
# Strip only the leading depth markers; hyphens inside a title are kept
df["name"] = df["name"].str.replace(r"^(?:-\s*)+", "", regex=True).str.strip()

In [None]:
title_to_ontology = dict(zip(df["name"], df["ontology_id"]))

df["parents"] = df["parents"].apply(title_to_ontology.get)
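
Because this lookup is keyed by title, rows that share a title collapse into a single dictionary entry, and the later row wins. The sketch below, on invented toy data, shows one possible way to spot such collisions before relying on the mapping. The names `names`, `ids`, and `duplicates` are illustrative only.

```python
from collections import Counter

# Toy data: two rows share the same title but have different IDs
names = ["Cholera", "Typhoid fever", "Cholera"]
ids = ["1", "2", "3"]

# Later duplicates overwrite earlier ones
title_to_id = dict(zip(names, ids))

# Titles that occur more than once would resolve to an ambiguous ID
duplicates = [name for name, count in Counter(names).items() if count > 1]

print(title_to_id)
print(duplicates)
```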

In [None]:
df.set_index("ontology_id", inplace=True)

In [None]:
df

In [None]:
df.to_parquet("icd-11-2023-01.parquet")