# Classifying text lines

A classifier is trained to classify single text lines of a report. For example, a line can be classified as containing a diagnosis (`dm`) or the history (`his`) of a patient. This task served as a preprocessing step for the later steps of structured information extraction: only lines classified as containing a diagnosis are fed to a downstream classifier, which extracts the exact diagnosis. The step might be unnecessary with modern transformers that can handle longer text inputs, but it could still help by feeding them only relevant input. More importantly, the old approach trained and evaluated classifier 2 (MS-Diag) only on reports containing `dm`, so keeping this step gives a more accurate picture of the whole pipeline.
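
The filtering idea can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: `predict_line_class` is a hypothetical stand-in for the trained line classifier (here a simple keyword rule), `select_lines` is an illustrative helper name, and the two report lines are shortened toy examples.

```python
from typing import Callable, List


def predict_line_class(line: str) -> str:
    """Hypothetical stand-in for the trained line classifier (keyword rule, illustration only)."""
    return "dm" if "Multiple Sklerose" in line else "his"


def select_lines(lines: List[str], classify: Callable[[str], str], keep: str = "dm") -> List[str]:
    """Return only the lines whose predicted class equals `keep`."""
    return [line for line in lines if classify(line) == keep]


# Toy report lines, for illustration only
report = [
    "Schubförmige Multiple Sklerose (RRMS)",
    "Verlauf: rezidivierende Schübe",
]

# Only the diagnosis line would be passed on to the downstream classifier
dm_lines = select_lines(report, predict_line_class)
print(dm_lines)
```

Passing the classifier as a function keeps the filtering step independent of the model that produces the predictions.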

The files containing the necessary information are inside the `data/raw/labelling` directory. It contains manually labelled reports from different sessions.

The original label classes are:

| category        | subcategory  | abbreviation |
|-----------------|--------------|--------------|
| diagnosis       | MS diagnosis | dm           |
|                 | other        | do           |
| current state   |              | cu           |
| history         |              | his          |
| symptoms        | MS related   | sym          |
|                 | other        | so           |
| MRI             | results      | mr           |
| lab             | results      | labr         |
|                 | other        | labo         |
| medication      | MS related   | medms        |
|                 | other        | medo         |
| test, treatment | results      | tr           |
|                 | other        | to           |
| header          |              | head         |
| unknown         |              | unk          |

I will group the classes according to the original approach and drop observations with no text. Observations with no label are assigned to `unk`.

In [6]:
import sys
import os
sys.path.append(os.getcwd()+"/../..")

from src import paths
from src.utils import line_Label_label2id

import pandas as pd

from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict

In [7]:
def load_line_labelling():
    """Load the labelled lines from the nested csv files in the different "imported_time" directories.

    Labelled reports have a "rev.csv" ending and are in a "_Marc" subdirectory. There should be only
    one labelled entry per rid; further files for the same rid are skipped.

    Returns:
        pd.DataFrame: Dataframe with columns: "text", "class", "rid"
    """
    df_list = []
    seen_rids = set()

    for root, _dirs, files in os.walk(paths.DATA_PATH_LABELLED):
        if "Marc" not in root:
            continue
        for file in files:

            # Get the research id from the filename
            rid = file.split("_")[0]

            # Skip files that are not labelled reports, mri files, and rids that were already loaded
            if not file.endswith("rev.csv") or "mri" in file or rid in seen_rids:
                continue

            # Read the csv file; report unreadable files instead of aborting
            try:
                _df = pd.read_csv(os.path.join(root, file))
            except Exception as err:
                print("Error with file: ", file, err)
                continue

            # Add the rid to the dataframe and remember that it has been loaded
            _df["rid"] = rid
            seen_rids.add(rid)
            df_list.append(_df)

    print("Number of reports: ", len(df_list))
    return pd.concat(df_list)[["text", "class", "rid"]]
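
The file selection rule used in `load_line_labelling` can be tried out in isolation. The sketch below restates that rule for a single file name; `rid_if_labelled` is an illustrative helper name and the file names are made up, since the real naming scheme is only known to start with the rid followed by an underscore.

```python
from typing import Optional, Set


def rid_if_labelled(filename: str, seen_rids: Set[str]) -> Optional[str]:
    """Return the research id if the file is a new labelled report, else None."""
    rid = filename.split("_")[0]
    if not filename.endswith("rev.csv") or "mri" in filename or rid in seen_rids:
        return None
    return rid


# Hypothetical file names, for illustration only
files = ["A1_report_rev.csv", "A1_report2_rev.csv", "B2_mri_rev.csv", "C3_report.csv"]

seen: Set[str] = set()
loaded = []
for name in files:
    rid = rid_if_labelled(name, seen)
    if rid is not None:
        seen.add(rid)
        loaded.append(name)

print(loaded)
```

Because the first file encountered for a rid wins, which duplicate is kept depends on the traversal order of `os.walk`.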

In [8]:
def clean_line_text(df: pd.DataFrame):
    """
    For the Transformer to work, every row needs text in the "text" column. Rows without text are dropped.
    """
    df = df.dropna(subset=["text"])

    return df

def clean_line_class(df: pd.DataFrame):
    """Cleans the dataframe labels in "class".
    1) Remove whitespace from the beginning and end of each label
    2) Correct spelling mistakes
    3) Fill NaN values with "unk"
    4) Create a new column "class_agg" with the aggregated classes of the original approach.

    Args:
        df (pd.DataFrame): Input dataframe

    Returns:
        pd.DataFrame: Dataframe with cleaned "class" and new "class_agg" column
    """

    # Class mapping spelling mistakes
    class_mapping_spelling = {
        'memds': 'medms',
        'hs': 'his',
        'm': 'mr',
    }
    class_mapping_agg = {
        'his': 'his_sym_cu',
        'sym': 'his_sym_cu',
        'cu': 'his_sym_cu',
        'labr': 'labr_labo',
        'labo': 'labr_labo',
        'to': 'to_tr',
        'tr': 'to_tr',
        'medo': 'medo_unk_do_so',
        'unk': 'medo_unk_do_so',
        'do': 'medo_unk_do_so',
        'so': 'medo_unk_do_so',
    }

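    # Work on a copy so that the caller's dataframe is not modified (avoids SettingWithCopyWarning)
    df = df.copy()

    # Cleaning the class column
    df['class'] = df['class'].str.strip() \
                             .replace(class_mapping_spelling) \
                             .fillna("unk")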

    # Creating a new column with the aggregated classes
    df['class_agg'] = df['class'].replace(class_mapping_agg)

    return df

def load_clean_line_df():
    """Loads and cleans the dataframe from the load_line_labelling function.
    """
    df = load_line_labelling()
    df = clean_line_text(df)
    df = clean_line_class(df)

    return df[["rid", "text", "class", "class_agg"]]
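
To make the order of the cleaning steps explicit, here is a plain-Python restatement of the same logic for a single label. It is a sketch for illustration and unit testing, not a replacement for the pandas code; `normalise_label` and `aggregate_label` are illustrative names, while the two mappings are copied from `clean_line_class`.

```python
from typing import Optional

SPELLING = {"memds": "medms", "hs": "his", "m": "mr"}
AGGREGATION = {
    "his": "his_sym_cu", "sym": "his_sym_cu", "cu": "his_sym_cu",
    "labr": "labr_labo", "labo": "labr_labo",
    "to": "to_tr", "tr": "to_tr",
    "medo": "medo_unk_do_so", "unk": "medo_unk_do_so",
    "do": "medo_unk_do_so", "so": "medo_unk_do_so",
}


def normalise_label(raw: Optional[str]) -> str:
    """Strip whitespace, fix known misspellings and default missing labels to "unk"."""
    if raw is None:
        return "unk"
    label = raw.strip()
    return SPELLING.get(label, label)


def aggregate_label(label: str) -> str:
    """Map a cleaned label to its aggregated class; unmapped labels stay unchanged."""
    return AGGREGATION.get(label, label)


for raw in [" hs", "memds", None, "dm"]:
    clean = normalise_label(raw)
    print(repr(raw), "->", clean, "->", aggregate_label(clean))
```

Labels without an entry in the aggregation mapping (`dm`, `mr`, `head`, `medms`) pass through unchanged, which is why they appear under the same name in `class_agg`.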

In [9]:
df = load_clean_line_df()
display(df.head(5))

# Class distribution
print("class_agg values: ", df.class_agg.value_counts())
print("class values: ", df["class"].value_counts())

# Number of reports with missing values in class_agg
print("Number of reports with missing values in class_agg: ", df[df["class_agg"].isnull()].shape[0])

Number of reports:  74


|   | rid | text | class | class_agg |
|---|-----|------|-------|-----------|
| 0 | 3A79B7BD-39B6-4AE5-82C5-ADAB09B34A41 | Schubförmige Multiple Sklerose (RRMS), (ES/ED ... | dm | dm |
| 1 | 3A79B7BD-39B6-4AE5-82C5-ADAB09B34A41 | Klinisch:- residuell: spastisches Gangbild rec... | sym | his_sym_cu |
| 2 | 3A79B7BD-39B6-4AE5-82C5-ADAB09B34A41 | Verlauf: - 05/1999- 08/2011 rezidivierende Sch... | his | his_sym_cu |
| 3 | 3A79B7BD-39B6-4AE5-82C5-ADAB09B34A41 | Diagnostisch: | head | head |
| 4 | 3A79B7BD-39B6-4AE5-82C5-ADAB09B34A41 | INDENT - 1998 Liquor: oligoklonale Banden posi... | labr | labr_labo |


class_agg values:  class_agg
his_sym_cu        312
mr                270
head              236
medms             213
labr_labo         184
medo_unk_do_so    139
to_tr             114
dm                 74
Name: count, dtype: int64
class values:  class
mr       270
head     236
medms    213
his      210
labr     166
tr        77
dm        74
sym       61
unk       50
do        42
cu        41
to        37
medo      35
labo      18
so        12
Name: count, dtype: int64
Number of reports with missing values in class_agg:  0
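
As a consistency check, the aggregated counts can be recomputed from the raw class counts printed above. The sketch below hard-codes those printed counts and reuses the aggregation mapping from `clean_line_class`; it does not read the data itself.

```python
from collections import Counter

# Raw class counts as printed above
class_counts = {
    "mr": 270, "head": 236, "medms": 213, "his": 210, "labr": 166,
    "tr": 77, "dm": 74, "sym": 61, "unk": 50, "do": 42, "cu": 41,
    "to": 37, "medo": 35, "labo": 18, "so": 12,
}

# Same aggregation mapping as in clean_line_class
class_mapping_agg = {
    "his": "his_sym_cu", "sym": "his_sym_cu", "cu": "his_sym_cu",
    "labr": "labr_labo", "labo": "labr_labo",
    "to": "to_tr", "tr": "to_tr",
    "medo": "medo_unk_do_so", "unk": "medo_unk_do_so",
    "do": "medo_unk_do_so", "so": "medo_unk_do_so",
}

# Sum the raw counts per aggregated class; unmapped classes keep their name
agg_counts = Counter()
for label, count in class_counts.items():
    agg_counts[class_mapping_agg.get(label, label)] += count

print(dict(agg_counts))
print("total lines:", sum(agg_counts.values()))
```

The recomputed values match the printed `class_agg` distribution, for example 210 + 61 + 41 = 312 for `his_sym_cu`, and the total is 1542 lines.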


In [10]:
# Save the dataframe
df.to_csv(os.path.join(paths.DATA_PATH_PREPROCESSED, "line-label/line-label_clean.csv"))

In [11]:
# Split by rid so that all lines of a report end up in the same split
# (roughly 64% train, 16% val and 20% test of the reports)
rids = df.rid.unique()
rids_train, rids_test = train_test_split(rids, test_size=0.2, random_state=42)
rids_train, rids_val = train_test_split(rids_train, test_size=0.2, random_state=42)

df_train = df[df.rid.isin(rids_train)]
df_val = df[df.rid.isin(rids_val)]
df_test = df[df.rid.isin(rids_test)]

# Save the train, val and test dataframes
df_train.to_csv(os.path.join(paths.DATA_PATH_PREPROCESSED, "line-label/line-label_clean_train.csv"))

df_val.to_csv(os.path.join(paths.DATA_PATH_PREPROCESSED, "line-label/line-label_clean_val.csv"))

df_test.to_csv(os.path.join(paths.DATA_PATH_PREPROCESSED, "line-label/line-label_clean_test.csv"))
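
Splitting by `rid` rather than by line keeps every report in exactly one split, so no lines of a test report are seen during training. The sketch below illustrates the two-stage split and the disjointness check without any dependencies. It uses `random.shuffle` instead of scikit-learn's `train_test_split`, so the assignment of individual reports differs from the one above; `split_ids` is an illustrative helper and the rids are made up.

```python
import math
import random


def split_ids(ids, test_size, seed):
    """Shuffle ids and split off a `test_size` fraction (rounded up)."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_test = math.ceil(len(ids) * test_size)
    return ids[n_test:], ids[:n_test]


# Made-up report ids; 74 matches the number of labelled reports above
rids = [f"rid-{i:02d}" for i in range(74)]

rids_train, rids_test = split_ids(rids, test_size=0.2, seed=42)
rids_train, rids_val = split_ids(rids_train, test_size=0.2, seed=42)

# No report may appear in more than one split
assert not set(rids_train) & set(rids_val)
assert not set(rids_train) & set(rids_test)
assert not set(rids_val) & set(rids_test)

print(len(rids_train), len(rids_val), len(rids_test))
```

In this sketch the 74 reports are divided into 47 train, 12 validation and 15 test reports. The number of lines per split is not fixed by these proportions, because reports differ in length.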

In [12]:
# Load all reports (without line labels); they are used later to construct the dataset for classifier 2
df_all = pd.read_csv(paths.DATA_PATH_PREPROCESSED/"midatams/seantis_kisim.csv")
df_all = df_all[["research_id", "text"]].rename(columns={"research_id": "rid"})
df_all["class_agg"] = None
df_all

|   | rid | text | class_agg |
|---|-----|------|-----------|
| 0 | 016B6D16-2BBA-4C05-A8E4-30F761C95813 | Diagnosen allgemein | |
| 1 | 016B6D16-2BBA-4C05-A8E4-30F761C95813 | INDENT Schubförmige Multiple Sklerose EM 03.06... | |
| 2 | 016B6D16-2BBA-4C05-A8E4-30F761C95813 | INDENT 03/2021: klinisch nicht aktiv, radiolog... | |
| 3 | 016B6D16-2BBA-4C05-A8E4-30F761C95813 | INDENT 06/2019 Schub mit Hypästhesie der Beine... | |
| 4 | 016B6D16-2BBA-4C05-A8E4-30F761C95813 | INDENT MRI WS 06/2019: kontrastmittelaufnehmen... | |
| ... | ... | ... | ... |
| 5664 | C3997B98-9A42-4A63-A6EE-FC2BEDB78D14 | INDENT Soor-Stomatitis, ED 07.02.2020 | |
| 5665 | C3997B98-9A42-4A63-A6EE-FC2BEDB78D14 | INDENT Bakterieller Harnwegsinfekt, ED 11.02.2020 | |
| 5666 | C3997B98-9A42-4A63-A6EE-FC2BEDB78D14 | INDENT Stomatitis aphthosa mit schmerzhafter A... | |
| 5667 | C3997B98-9A42-4A63-A6EE-FC2BEDB78D14 | INDENT Re-Soor-Stomatitis, ED 11.03.2020 | |


In [13]:
# Create HuggingFace Dataset
dataset = DatasetDict({
    "train": Dataset.from_pandas(df_train),
    "val": Dataset.from_pandas(df_val),
    "test": Dataset.from_pandas(df_test),
})

# Drop unnecessary columns
dataset = dataset.remove_columns(["class", "__index_level_0__"])

# Add all reports as an additional, unlabelled split
dataset["all"] = Dataset.from_pandas(df_all)

# Map class names to label ids (None for the unlabelled "all" split)
dataset = dataset.map(lambda example: {"labels":[line_Label_label2id.get(e, None) for e in example["class_agg"]]}, batched=True)

# Save the dataset
dataset.save_to_disk(os.path.join(paths.DATA_PATH_PREPROCESSED, "line-label/line-label_clean_dataset"))

Map:   0%|          | 0/917 [00:00<?, ? examples/s]

Map:   0%|          | 0/258 [00:00<?, ? examples/s]

Map:   0%|          | 0/367 [00:00<?, ? examples/s]

Map:   0%|          | 0/5669 [00:00<?, ? examples/s]

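
The label creation step relies on `line_Label_label2id`, which is imported from `src.utils` and not shown here. The sketch below uses a hypothetical mapping built from the sorted aggregated class names to show how a batched map function behaves, including for unlabelled rows. The ids in the real mapping may differ.

```python
# Hypothetical mapping; the real `line_Label_label2id` is defined in src.utils and may differ
classes = ["dm", "head", "his_sym_cu", "labr_labo", "medms", "medo_unk_do_so", "mr", "to_tr"]
label2id = {label: idx for idx, label in enumerate(classes)}
id2label = {idx: label for label, idx in label2id.items()}


def add_labels(batch):
    """Batched map function: translate class names to ids, None for unlabelled rows."""
    return {"labels": [label2id.get(name) for name in batch["class_agg"]]}


# A batch is a dict of columns, each holding a list of values
batch = {"class_agg": ["dm", "mr", None]}
print(add_labels(batch))
```

Using `dict.get` instead of indexing means that rows without a class, such as those in the `all` split, receive `None` rather than raising a `KeyError`.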