# MS Preprocessing
To stay comparable with the old project, I reuse parts of its data preprocessing. Specifically, I use the seantis_kisim.csv file that they prepared. Their approach:

1. Extract the longest diagnosis per rid (the one with the most lines) from the CSV; if the rid had a manually line-labelled text, they used that instead.
2. This results in a dataset with one text line per row and a label for each line.

Further processing:

3. Merge diagnoses.csv and the line-labelled dataset by rid. Clean the labels. Correct or drop SPMS and PPMS labels that are wrong.
4. Get a list of eligible rids (rids whose text has at least one dm line).
5. Df1: concatenate all dm lines per eligible rid as the text.
6. Df2: concatenate all text lines per eligible rid.
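
Steps 4 to 6 can be sketched on a toy frame. This is only an illustration: the rids, the text lines, the short label names and the non-dm class names (`his`, `labr`) are made up; only the column names and the `dm` class come from this notebook.

```python
import pandas as pd

# Toy line-level data: one row per text line, with a line class and a diagnosis label.
# All values are hypothetical.
toy = pd.DataFrame({
    "rid": ["a", "a", "a", "b", "b", "c"],
    "text": ["RRMS, ED 2010", "Verlauf:", "EDSS 2.0", "Anamnese", "Labor", "PPMS, ED 2015"],
    "class2": ["dm", "his", "dm", "his", "labr", "dm"],
    "labels": ["rrms", "rrms", "rrms", "rrms", "rrms", "ppms"],
})

# Step 4: a rid is eligible if it has at least one "dm" line
eligible = toy.loc[toy["class2"] == "dm", "rid"].unique()
toy = toy[toy["rid"].isin(eligible)]

# Steps 5 and 6: join the lines per rid and keep the first label of each rid
agg = {"text": "\n".join, "labels": "first"}
df1_toy = toy[toy["class2"] == "dm"].groupby("rid").agg(agg).reset_index()
df2_toy = toy.groupby("rid").agg(agg).reset_index()

print(df1_toy["text"].tolist())  # dm lines only
print(df2_toy["text"].tolist())  # all lines
```

Because eligibility is applied before both aggregations, the two frames always contain the same rids; rid `b` has no dm line and disappears from both.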

In [61]:
import pandas as pd
import torch
import sys
import os
sys.path.append(os.getcwd()+"/../..")

from src import paths
from src.utils import ms_label2id

from datasets import DatasetDict, Dataset

from sklearn.model_selection import train_test_split

In [62]:
# Line-labelled dataset: predictions of the line classifier, one entry per rid
data = torch.load(paths.RESULTS_PATH/"line-label/line-label_medbert-512_token_finetuned_all.pt")

# Flatten to one row per text line with its predicted class (class2) and rid
data_list = []
for obs in data:
    _df = pd.DataFrame(obs["text"], columns=["text"])
    _df["class2"] = obs["preds"]
    _df["rid"] = obs["rid"]
    data_list.append(_df)

data_df = pd.concat(data_list)

# Persist as CSV so the following cells can start from the file
data_df.to_csv(paths.DATA_PATH_PREPROCESSED/"ms-diag/line-label_medbert-512_token_finetuned_all.csv", index=False)

In [63]:
# Line-labelled dataset from classifier 1 and the diagnosis labels
df_text = pd.read_csv(os.path.join(paths.DATA_PATH_PREPROCESSED, "ms-diag", "line-label_medbert-512_token_finetuned_all.csv"))[["rid", "text", "class2"]]
df_labels = pd.read_csv(os.path.join(paths.DATA_PATH_SEANTIS, "diagnoses.csv"))

# The old approach only used confirmed diagnoses
df_labels = df_labels[df_labels["diagnosis_reliability"] == "confirmed"]
df_labels = df_labels[["research_id", "disease"]].rename(columns={"disease": "labels", "research_id": "rid"})

In [64]:
# Merge with diagnoses.csv
df_merged = pd.merge(df_text, df_labels, on="rid", how="inner")
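
An inner merge silently drops rids that appear on only one side, and it duplicates text lines if a rid has more than one row in the label table. A minimal sketch of how both could be checked, using made-up toy frames (`text_toy` and `labels_toy` are hypothetical stand-ins for `df_text` and `df_labels`):

```python
import pandas as pd

# Hypothetical stand-ins for df_text and df_labels
text_toy = pd.DataFrame({"rid": ["a", "a", "b"], "text": ["l1", "l2", "l3"]})
labels_toy = pd.DataFrame({"rid": ["a", "c"], "labels": ["x", "y"]})

# An outer merge with indicator=True shows which side each row came from
outer = pd.merge(text_toy, labels_toy, on="rid", how="outer", indicator=True)
dropped_text_rids = sorted(outer.loc[outer["_merge"] == "left_only", "rid"].unique())
dropped_label_rids = sorted(outer.loc[outer["_merge"] == "right_only", "rid"].unique())

# More than one label row per rid would multiply the text lines in the merge
has_duplicate_labels = labels_toy["rid"].duplicated().any()

inner = pd.merge(text_toy, labels_toy, on="rid", how="inner")
print(dropped_text_rids, dropped_label_rids, has_duplicate_labels, len(inner))
```

Here rid `b` has text but no label and rid `c` has a label but no text, so only the two lines of rid `a` survive the inner merge.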

In [65]:
# English labels (the three target classes); everything else needs a manual check
english_labels = {
    "relapsing_remitting_multiple_sclerosis",
    "secondary_progressive_multiple_sclerosis",
    "primary_progressive_multiple_sclerosis",
}
other_labels = set(df_merged["labels"].unique()) - english_labels

In [66]:
other_labels

{'Multiple Sklerose',
 'Multiple Sklerose a.e. primär progredient',
 'Multiple Sklerose mit a.e. primär-progredientem Verlauf',
 'Multiple Sklerose mit am ehesten progredientem Verlauf',
 'Progressive multifokale Leukencephalopathie (PML)',
 'Schubförmig remittierende Multiple Sklerose ',
 'Schubförmig remittierende Multiple Sklerose (RRMS)',
 'Schubförmige Multiple Sklerose',
 'St.n. symptomatischer Epilepsie mit einfach fokalen und komplex fokalen Anfällen'}

In [67]:
# Inspect the text of every rid that has a non-English label
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

for rid, rid_data in df_merged.groupby("rid"):
    if rid_data.labels.isin(other_labels).any():
        print(rid_data.labels.unique())
        print(rid_data["text"].str.cat(sep = " "))

['Multiple Sklerose a.e. primär progredient']
1. Multiple Sklerose a.e. primär progredient, EM 2011, ED 03/2015, EDSS 3.5 INDENT klinisch: nicht aktiv, radiologisch: nicht aktiv, Progression: nein (nach Lublin et al. 2013) INDENT aktuell: im langfristigen Verlauf langsame Besserung der spastisch-ataktischen Gangstörung  Verlauf: INDENT 1997 Episode mit aufsteigenden Sensibilitätsstörungen der Beine. Abklärungen erfolgten in der Neurologie USZ (LP, MRI, evozierte Potentiale), eine Diagnose wurde jedoch nicht gestellt  INDENT 1997-2011 Anamnestisch keine neurologische Symptomatik eruierbar INDENT 2011-2016 langsam progrediente, rechtsbetonte spastisch-ataktische Gangstörung, distal betonte Pallhypästhesie und eingeschränktes Lageempfinden der unteren Extremität Diagnostik: Bildgebung INDENT 06.02.2015 MRI Wirbelsäule: Multietagere multiple T2 hyperintense Läsionen cervicothorakal, V.a. Gd-Aufnahme auf Niveau BWK 5 - BWK 7 INDENT 18.02.2015 MRI Gehirn: Multiple T2w- und FLAIRw-hyperintens

In [68]:
# Remap non-English labels where the subtype is unambiguous
map_dict = {
    "Multiple Sklerose a.e. primär progredient": "primary_progressive_multiple_sclerosis",
    "Multiple Sklerose mit a.e. primär-progredientem Verlauf": "primary_progressive_multiple_sclerosis",
    "Schubförmig remittierende Multiple Sklerose (RRMS)": "relapsing_remitting_multiple_sclerosis",
}
# Restrict the replacement to the labels column so text lines are never rewritten
df_merged["labels"] = df_merged["labels"].replace(map_dict)
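
`DataFrame.replace` with a flat dict matches whole cell values in every column, so a text line that is identical to a label string would be rewritten as well. The following sketch contrasts a frame-wide replacement with one scoped to the labels column; the toy frame is made up, only the label strings are taken from above.

```python
import pandas as pd

mapping = {"Multiple Sklerose a.e. primär progredient": "primary_progressive_multiple_sclerosis"}

# Toy frame in which one text line happens to equal the label string
toy = pd.DataFrame({
    "text": ["Multiple Sklerose a.e. primär progredient", "EDSS 3.5"],
    "labels": ["Multiple Sklerose a.e. primär progredient"] * 2,
})

frame_wide = toy.replace(mapping)                                # touches text and labels
column_only = toy.assign(labels=toy["labels"].replace(mapping))  # touches labels only

print(frame_wide["text"].tolist())
print(column_only["text"].tolist())
```

Only exact whole-cell matches are affected, so longer text lines that merely contain the label string stay unchanged in both variants.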

In [69]:
# Keep only rows with English labels; labels that could not be remapped are dropped
df_merged = df_merged[df_merged["labels"].isin(english_labels)]

In [42]:
# The mapping was done manually, so check that the label matches the text for low-count classes such as SPMS
for rid, rid_data in df_merged[df_merged["labels"] == "secondary_progressive_multiple_sclerosis"].groupby("rid"):
    print(rid)
    print(rid_data["text"].str.cat(sep = "\n"))
    print("\n")

2A9F4832-B09D-470A-B05F-519854310DBB
Schubförmige Multiple Sklerose (ED 1990), EDSS 6.5, V.a. sekundär-progredienten Verlauf
INDENT aktuell: seit Ende 03/2018 aufgrund schwerer Belastungssituation am Arbeitsplatz zunehmende schmerzhafte Krämpfe aller Extremitäten rechts- und beinbetont, Zunahme der Beinschwäche, zusätzlich Gedächtnis- und Konzentrationsstörungen, am Eintrittstag bei allgemeinem Schwächegefühl nicht mehr aufstehen können
INDENT anamnestisch: Gehstrecke zuvor 150 m mit Gehhilfe, vermehrte Sturzanfälligkeit, Harninkontinenz, leichte Fatigue-Symptomatik
INDENT klinisch: rechts- und beinbetonte spastische Tetraparese, Gang- und Standataxie, Sensibilitätsminderung der rechten Extremität, neurogene Blasenfunktionsstörung 
INDENT Verlauf:
INDENT EM: Müdigkeit schon als Kind
INDENT ED 1990: beide Beine plötzlich gelähmt
INDENT 05/2009: deutliche Verschlechterung der vorbestehenden Gang- und Standunsicherheit durch starken Schwankschwindel
INDENT 2010-2011: Gehhilfe
INDENT 2012:

In [75]:
# rids whose SPMS label does not match the text (flagged during the manual check above)
spms_wrong = [
    "2A9F4832-B09D-470A-B05F-519854310DBB",
    "39D432B0-902B-49D9-B727-12EDC053B09E",
    "AF834D8D-F7DB-4B22-BB01-29F10EE6A828",
    "B886879A-5109-46FD-A2B0-9DCA2DA733F8",
    "C0784569-1E15-4FBE-A4B2-F9473975D199",
]
# Number of SPMS rids before the exclusion
df_merged[df_merged["labels"] == "secondary_progressive_multiple_sclerosis"].rid.unique().shape
# Because of this exclusion we end up with fewer training examples than in their original approach

(13,)

In [76]:
# Drop entries with wrong label
df_merged = df_merged[~df_merged["rid"].isin(spms_wrong)]

In [77]:
# Check primary_progressive_multiple_sclerosis
for rid, rid_data in df_merged[df_merged["labels"] == "primary_progressive_multiple_sclerosis"].groupby("rid"):
    print(rid)
    print(rid_data["text"].str.cat(sep = "\n"))
    print("\n")

0A7AF4C2-ADD6-4B72-9869-4D190928F8C3
v.v Primar progrediente multiple Sklerose


11A4453B-76B6-4558-8126-EEFEA6935C91
1. Multiple Sklerose a.e. primär progredient, EM 2011, ED 03/2015, EDSS 3.5
INDENT klinisch: nicht aktiv, radiologisch: nicht aktiv, Progression: nein (nach Lublin et al. 2013)
INDENT aktuell: im langfristigen Verlauf langsame Besserung der spastisch-ataktischen Gangstörung 
Verlauf:
INDENT 1997 Episode mit aufsteigenden Sensibilitätsstörungen der Beine. Abklärungen erfolgten in der Neurologie USZ (LP, MRI, evozierte Potentiale), eine Diagnose wurde jedoch nicht gestellt 
INDENT 1997-2011 Anamnestisch keine neurologische Symptomatik eruierbar
INDENT 2011-2016 langsam progrediente, rechtsbetonte spastisch-ataktische Gangstörung, distal betonte Pallhypästhesie und eingeschränktes Lageempfinden der unteren Extremität
Diagnostik:
Bildgebung
INDENT 06.02.2015 MRI Wirbelsäule: Multietagere multiple T2 hyperintense Läsionen cervicothorakal, V.a. Gd-Aufnahme auf Niveau BWK 5 - 

In [78]:
# Eligible research_ids, i.e. those with at least one "dm" class2
eligible_rids = df_merged[df_merged["class2"] == "dm"]["rid"].unique()

# Filter df_merged for eligible rids
df_merged = df_merged[df_merged["rid"].isin(eligible_rids)]

In [79]:
# Df1: contains all dm entries per rid keeping the labels
df1 = df_merged[df_merged["class2"] == "dm"].groupby("rid").agg({"text": "\n".join, "labels": "first"}).reset_index()

# Df2: contains all text entries per rid keeping the labels
df2 = df_merged.groupby("rid").agg({"text": "\n".join, "labels": "first"}).reset_index()

In [80]:
len(df1) == len(df2) # Check if same number of rids

True

In [81]:
# Train/val/test split, stratified by label: 30% test, then 10% of the remaining data as validation
df1train, df1test = train_test_split(df1, test_size=0.3, random_state=42, stratify=df1["labels"])
df1train, df1val = train_test_split(df1train, test_size=0.1, random_state=42, stratify=df1train["labels"])

df2train, df2test = train_test_split(df2, test_size=0.3, random_state=42, stratify=df2["labels"])
df2train, df2val = train_test_split(df2train, test_size=0.1, random_state=42, stratify=df2train["labels"])
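
The resulting split sizes can be worked out by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training side; `split_sizes` is a helper written for this illustration, not part of the project. The 154 rids are the sum of the label counts in the Summary (133 + 13 + 8).

```python
import math

def split_sizes(n, test_size=0.3, val_size=0.1):
    """Return (train, val, test) sizes for the two-stage split used above."""
    n_test = math.ceil(test_size * n)      # first split: test set, rounded up
    n_rest = n - n_test                    # remainder goes to train + val
    n_val = math.ceil(val_size * n_rest)   # second split: validation set, rounded up
    return n_rest - n_val, n_val, n_test

print(split_sizes(154))  # (96, 11, 47)
```

These numbers agree with the example counts shown in the progress bars when the datasets are created (96, 11 and 47), which works out to roughly 62% train, 7% validation and 31% test.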

In [82]:
# Create HuggingFace Dataset
def create_hf_dataset(train: pd.DataFrame, val: pd.DataFrame, test: pd.DataFrame) -> DatasetDict:
    """Create a HuggingFace DatasetDict from train, val and test dataframes.

    Remaps labels to ids and drops unnecessary columns.

    Args:
        train (pd.DataFrame): Training dataframe
        val (pd.DataFrame): Validation dataframe
        test (pd.DataFrame): Test dataframe

    Returns:
        DatasetDict: HuggingFace DatasetDict with "train", "val" and "test" splits
    """
    dataset = DatasetDict({
        "train": Dataset.from_pandas(train),
        "val": Dataset.from_pandas(val),
        "test": Dataset.from_pandas(test),
    })

    # Map the labels to ids
    dataset = dataset.map(lambda e: {"labels": [ms_label2id[label] for label in e["labels"]]}, batched=True)

    # Drop the leftover pandas index column (__index_level_0__) and the rid column
    dataset = dataset.remove_columns(["__index_level_0__", "rid"])

    return dataset

dataset1 = create_hf_dataset(df1train, df1val, df1test)
dataset2 = create_hf_dataset(df2train, df2val, df2test)

# Save the dataset
dataset1.save_to_disk(os.path.join(paths.DATA_PATH_PREPROCESSED, "ms-diag/ms_diag_only_dm"))
dataset2.save_to_disk(os.path.join(paths.DATA_PATH_PREPROCESSED, "ms-diag/ms_diag_all_text"))

Map:   0%|          | 0/96 [00:00<?, ? examples/s]

Map:   0%|          | 0/11 [00:00<?, ? examples/s]

Map:   0%|          | 0/47 [00:00<?, ? examples/s]
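
With `batched=True`, the function passed to `dataset.map` receives a dict of column lists instead of a single example, which is why the lambda in `create_hf_dataset` iterates over `e["labels"]`. A plain-Python sketch of that step follows; `label2id_example` is a hypothetical mapping, since the real `ms_label2id` lives in `src.utils` and its ids may differ.

```python
# Hypothetical mapping; the real ms_label2id in src.utils may assign other ids.
label2id_example = {
    "primary_progressive_multiple_sclerosis": 0,
    "relapsing_remitting_multiple_sclerosis": 1,
    "secondary_progressive_multiple_sclerosis": 2,
}

def encode_labels(batch):
    """Mimic the batched map function: a dict of lists in, a dict of lists out."""
    return {"labels": [label2id_example[label] for label in batch["labels"]]}

# A batch as the map function would see it (made-up texts)
batch = {
    "text": ["line 1", "line 2"],
    "labels": [
        "relapsing_remitting_multiple_sclerosis",
        "primary_progressive_multiple_sclerosis",
    ],
}
encoded = encode_labels(batch)
print(encoded)
```

A label that is missing from the mapping raises a `KeyError`, so any label that survived the filtering without being one of the three English classes would surface here.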


# Summary

In [60]:
# Label distribution
print("Label distribution all:")
print(df2.labels.value_counts(), "\n\n")

print("Label distribution dm:")
print(df1.labels.value_counts(), "\n\n")

Label distribution all:
labels
relapsing_remitting_multiple_sclerosis      133
primary_progressive_multiple_sclerosis       13
secondary_progressive_multiple_sclerosis      8
Name: count, dtype: int64 


Label distribution dm:
labels
relapsing_remitting_multiple_sclerosis      133
primary_progressive_multiple_sclerosis       13
secondary_progressive_multiple_sclerosis      8
Name: count, dtype: int64 


