# Analysis of Melda Dataset (US, Pathological Swallowing)
TLDR: 484 swallows of patients with dysphagia

Pros:
- swallows extracted
- patient information

- full swallows + split swallows mp4 files
- X-ray videos of pathological swallows from Dr. Melda Kunduc
- folder structure:
    - patient id mp4 folders
        - 484 identifiers
        - variants:
            - 1 full mp4 video
            - 1 full mp4 video + swallow split videos + Excel file explaining the split + .sc file + CRF
              - .sc files: swallow cut tmp
    - abnormals folder
        - 10 mp4/mpg files each having abnormal description in filename (long videos)
    - DICOM folder
        - 7 folders with id in name
            - each has MKV files
    - pt1_anon - pt8_anon folders
        - video of split swallows
        - each has mp4 files, excel files with split info, some also mask files
    - De-identified_demographic_Sheet.xlsx
        - de-identified demographics, apparently with some information about diseases (contents not yet checked in detail)
    - folder_list.xlsx
        - list of patient id folders
    - from_melda.xlsx
        - mapping of patient id to video ids and their metadata
        - columns: Patient ID	Video Name	Visit	Height	Width	if CRF	if Extracted
    - from_melda_events.xlsx
        - mapping of video ids to video snippet ids + length info
        - columns: Patient ID	Video Name	Event Name	length	if annotated	swallow type	quality
    - from_melda_events_with_rating.xlsx
        - better than above
        - mapping of video ids to snippet ids + length + swallow type
        - columns: Patient ID	Video Name	Event Name	length	if annotated	swallow type	quality	comments
    - spreadsheet.ipynb

- Google Docs: Swallowing Dataset Annotation
    - Patho VFSS Info


Q:
- patient info?
- De-identified demographic sheet: Is each row one patient (480 rows)? Is the row id related to the patient id / folder list id?
- What are the pt1_anon - pt8_anon folders? Is there a mapping in the Google Doc?
- What are the CRF videos stored next to the normal ones?
- Should all videos still be extracted and used?
- Age is missing 22 times, although the birth year is present. Why?
- Exclude data where age or gender is missing?
- Use dysphagia patients only, or also include other pathologies?
- Why do the Excel files record more swallow splits than there are videos in the folders? Because the other swallows are not frontal videos?
- Why are there negative frame counts (4 times)?
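
The negative frame counts from the last question can be located with a small check like the one below. This is only a sketch on made-up rows: the column names `From frame` and `To frame` are the ones used in the split Excel files read further down, while `find_negative_durations` and the sample values are hypothetical.

```python
import pandas as pd

def find_negative_durations(split_table: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose end frame lies before their start frame."""
    durations = split_table["To frame"] - split_table["From frame"]
    negative = durations < 0
    # Keep the offending rows and attach the (negative) duration for inspection
    return split_table.loc[negative].assign(Duration=durations[negative])

# Made-up example rows, not taken from the dataset
example = pd.DataFrame({"From frame": [10, 200, 50], "To frame": [60, 150, 90]})
negative_rows = find_negative_durations(example)
print(negative_rows)
```

A row like the second one usually points to swapped start and end frames, but that is a guess that would have to be confirmed against the videos.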

In [None]:
import pandas as pd
import os
import cv2
import matplotlib.pyplot as plt
from tqdm import tqdm

In [None]:
dataset_path = r"\\fauad.fau.de\shares\ANKI\Projects\Swallowing\Data\from_Melda"

In [None]:

def analyze_patient_data(base_directory):
    data = []

    folders = os.listdir(base_directory)
    # only check folders in base directory, also only with numbers
    folders = ([f for f in folders if f.isnumeric()])
    # Iterate over each patient folder
    for patient_id in tqdm(folders):
        patient_folder = os.path.join(base_directory, patient_id)
        if not os.path.isdir(patient_folder):
            continue  # Skip if not a folder

        # Initialize patient data
        patient_data = {
            "PatientID": patient_id,
            "VisitCount": 0,
            "ExcelRowCounts": [],  # List of row counts per video
            "FullVideosTotalFrames": [],  # Frame counts for full videos
            "FullVideosNames": [],  # Names of full videos
            "SwallowFrames": [],  # Frame counts for swallow videos
            "NegativeFramesVideos": "",  # Video IDs with negative frames
        }

        # Iterate over each file in the patient folder
        for file in os.listdir(patient_folder):
            file_path = os.path.join(patient_folder, file)

            # Analyze full videos
            if file.endswith(".mp4"):
                video = cv2.VideoCapture(file_path)
                frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
                video.release()

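                # Assumption: full visit videos are named "..._fixed.mp4" or "..._fixed_CRF8.mp4".
                # They are checked first because "_fixed_CRF8" also contains "_fixed_".
                stem = os.path.splitext(file)[0]
                if stem.endswith(("_fixed", "_fixed_CRF8")):  # Full visit videos
                    patient_data["VisitCount"] += 1
                    patient_data["FullVideosTotalFrames"].append(frame_count)
                    patient_data["FullVideosNames"].append(file)
                elif "_fixed_" in file:  # Swallow videos
                    patient_data["SwallowFrames"].append(frame_count)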

            # Analyze Excel/CSV files
            if (file.endswith(".xlsx") or file.endswith(".csv")) and not "$" in file: # filter out "hidden" ~$ tmp files
                try:
                    # Read file
                    if file.endswith(".xlsx"):
                        visit_data = pd.read_excel(file_path, engine='openpyxl')
                    else:
                        visit_data = pd.read_csv(file_path)

                    # Count rows and append to ExcelRowCounts
                    row_count = len(visit_data)
                    patient_data["ExcelRowCounts"].append(row_count)

                    # Check for negative frames
                    if "From frame" in visit_data.columns and "To frame" in visit_data.columns:
                        frame_durations = visit_data["To frame"] - visit_data["From frame"]
                        negative_frames_videos = visit_data.loc[frame_durations < 0]
                        if not negative_frames_videos.empty:
                            if patient_data["NegativeFramesVideos"]:
                                patient_data["NegativeFramesVideos"] += f", {file}"
                            else:
                                patient_data["NegativeFramesVideos"] = file
                except Exception as e:
                    print(f"Error reading file {file_path}: {e}")
                    continue  # Skip problematic files

        # Add to the overall data
        data.append(patient_data)

    # Create DataFrame from the collected data
    df = pd.DataFrame(data)

    return df

df_folders = analyze_patient_data(dataset_path)
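
The file classification inside the loop depends entirely on the naming scheme. The sketch below isolates that rule so it can be checked without touching the network share. It assumes that full visit videos end in `_fixed` or `_fixed_CRF8` and that swallow clips carry a suffix after `_fixed_`; `classify_video` and the example filenames are hypothetical.

```python
import os

def classify_video(filename):
    """Classify a file as 'full', 'swallow' or 'other' from its name alone."""
    stem, ext = os.path.splitext(filename)
    if ext.lower() != ".mp4":
        return "other"
    # Full videos first: "_fixed_CRF8" would also match the "_fixed_" test below
    if stem.endswith(("_fixed", "_fixed_CRF8")):
        return "full"
    if "_fixed_" in stem:
        return "swallow"
    return "other"

# Made-up filenames that follow the assumed naming scheme
for name in ["123_Visit1_fixed.mp4", "123_Visit1_fixed_CRF8.mp4",
             "123_Visit1_fixed_2.mp4", "123_Visit1.xlsx"]:
    print(name, "->", classify_video(name))
```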



In [None]:
df_folders

In [None]:
'''
def analyze_patient_data(base_directory):
    data = []
    folders = os.listdir(base_directory)
    # only check folders in base directory, also only with numbers
    folders = ([f for f in folders if f.isnumeric()])
    # Iterate over each patient folder
    for patient_id in tqdm(folders):
        #print(patient_id)
        patient_folder = os.path.join(base_directory, patient_id)
        if not os.path.isdir(patient_folder):
            continue  # Skip if not a folder

        # Initialize patient data
        patient_data = {
            "PatientID": patient_id,
            "VisitCount": 0,
            "IsProcessed": False,
            "ProcessedSwallows": 0,
            "FramesPerSwallow": [],
            "AverageFramesPerSwallow": None,
            "Paradigms": [],
            "IsCRF8": False,
        }

        # Iterate over each file in the patient folder
        for file in os.listdir(patient_folder):
            if file.endswith(".mp4"):
                # Count visits based on video filenames
                if "_Visit" in file:
                    patient_data["VisitCount"] += 1
                    if "CRF8" in file:
                        patient_data["IsCRF8"] = True

            if (file.endswith(".xlsx") or file.endswith(".csv")) and not "$" in file: # filter out "hidden" ~$ tmp files
                # Identify processed data
                if "_Visit" in file:
                    patient_data["IsProcessed"] = True

                    # Read the file (CSV or Excel) to extract swallow information
                    file_path = os.path.join(patient_folder, file)
                    if file.endswith(".xlsx"):
                        visit_data = pd.read_excel(file_path,  engine='openpyxl')
                    else:
                        visit_data = pd.read_csv(file_path)

                    # Extract number of swallows and paradigms
                    patient_data["ProcessedSwallows"] += len(visit_data)
                    if "From frame" in visit_data and "To frame" in visit_data:
                        frames = visit_data["To frame"] - visit_data["From frame"]
                        # check for negative values
                        if (frames < 0).sum() > 0:
                            pass #print(f"Negative frame values in {file_path} of patient {patient_id}")
                        patient_data["FramesPerSwallow"].extend(frames)
                        patient_data["AverageFramesPerSwallow"] = frames.mean()
                    if "Paradigm" in visit_data:
                        patient_data["Paradigms"].extend(visit_data["Paradigm"].unique())

        # Add to the overall data
        data.append(patient_data)
        #break
    # Create DataFrame from the collected data
    df = pd.DataFrame(data)

    # Clean up Paradigms column to be unique lists
    df["Paradigms"] = df["Paradigms"].apply(lambda x: list(set(x)) if isinstance(x, list) else [])

    return df

# Directory containing the patient data
df_folders_excel = analyze_patient_data(dataset_path)
'''

In [None]:
df_folders

In [None]:
# number of swallow clips per patient (SwallowFrames holds one frame count per clip)
df_folders["NumSwallowFrames"] = df_folders["SwallowFrames"].apply(len)
# total number of swallows listed in the Excel files (sum of rows over all Excel files)
df_folders["NumTotalExcelSwallows"] = df_folders["ExcelRowCounts"].apply(sum)
df_folders

How much patient data already processed?

In [None]:
df_folders["IsProcessed"] = df_folders["SwallowFrames"].apply(lambda x: len(x) > 0)
df_folders_processed = df_folders[df_folders["IsProcessed"]]
df_folders["IsProcessed"].value_counts()

Videos per patient distribution

In [None]:
df_folders["VisitCount"].value_counts()


In [None]:
# Plot the distribution of visits per patient
plt.figure(figsize=(10, 6))
plt.hist(df_folders["VisitCount"], bins=range(1, df_folders["VisitCount"].max() + 2), edgecolor='black')
# add x and y labels
plt.xlabel("Number of Visits")
plt.ylabel("Number of Patients")
plt.title("Distribution of Visits per Patient")

for the processed data: ...

how many swallows per patient?

In [None]:
# Plot the distribution of processed swallows per patient
plt.figure(figsize=(10, 6))
plt.hist(df_folders_processed["NumSwallowFrames"], bins=range(1, df_folders_processed["NumSwallowFrames"].max() + 2), edgecolor='black')
# add x and y labels
plt.xlabel("Number of Swallows")
plt.ylabel("Number of Patients")
plt.title("Distribution of Swallows per Patient")

how many frames per swallow?

In [None]:
df_folders_processed

In [None]:
# need to explode frames list col
df_frames = df_folders_processed.explode("SwallowFrames")
df_frames["SwallowFrames"] = df_frames["SwallowFrames"].astype(int)
df_frames["SwallowFrames"].describe()

In [None]:
# filter out values above 150
df_frames = df_frames[df_frames["SwallowFrames"] < 150]

In [None]:
# plot distribution of frames per swallow
plt.figure(figsize=(10, 6))
plt.hist(df_frames["SwallowFrames"], bins=range(1, df_frames["SwallowFrames"].max() + 2), edgecolor='black')
plt.xlabel("Number of Frames")
plt.ylabel("Number of Swallows")
plt.title("Distribution of Frames per Swallow")

TODO: also compute seconds per video and fps
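
Seconds per video follow from the frame count divided by the frame rate. A rough sketch of that calculation is shown below. `clip_duration_seconds` and `read_frames_and_fps` are hypothetical helpers; `cv2.CAP_PROP_FRAME_COUNT` and `cv2.CAP_PROP_FPS` are OpenCV properties, but the values they report come from the container metadata and may be unreliable for some files.

```python
import math

def clip_duration_seconds(frame_count, fps):
    """Duration in seconds; NaN if the reported frame rate is unusable."""
    if fps <= 0:
        return math.nan
    return frame_count / fps

def read_frames_and_fps(video_path):
    """Read frame count and frame rate from a video file via OpenCV."""
    import cv2  # imported lazily so the pure helper above works without OpenCV
    video = cv2.VideoCapture(video_path)
    frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = float(video.get(cv2.CAP_PROP_FPS))
    video.release()
    return frame_count, fps

print(clip_duration_seconds(90, 30.0))  # 3.0
```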

Where does NumSwallowFrames differ from NumTotalExcelSwallows?

In [None]:
# check where numswallowframes != numtotalexcelswallows
df_folders[df_folders["NumSwallowFrames"] != df_folders["NumTotalExcelSwallows"]]
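
To see how large the mismatch is per patient, and not only where it occurs, a difference column helps. The sketch below uses made-up numbers; `NumSwallowFrames` and `NumTotalExcelSwallows` are the columns created above, while `MissingClips` is a hypothetical name.

```python
import pandas as pd

# Made-up values standing in for df_folders
toy = pd.DataFrame({
    "PatientID": ["001", "002", "003"],
    "NumSwallowFrames": [5, 3, 0],
    "NumTotalExcelSwallows": [5, 7, 4],
})
# Positive values: swallows listed in Excel without a matching clip on disk
toy["MissingClips"] = toy["NumTotalExcelSwallows"] - toy["NumSwallowFrames"]
mismatched = toy[toy["MissingClips"] != 0]
print(mismatched[["PatientID", "MissingClips"]])
```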

### Analyze demographics

In [None]:
# Load data
df_patient_info = pd.read_csv('Patho_VFSS_Info_1205.csv')
df_patient_info.tail()

In [None]:
# column explanations
df_patient_info[df_patient_info["UID"].isna()]
# drop explanation rows
df_patient_info = df_patient_info.dropna(subset=["UID"])

Check NaN

In [None]:
# which columns have nan values?
df_patient_info.isna().sum()

In [None]:
df_patient_info.head()

preprocess gender

In [None]:
df_patient_info["Gender"] = df_patient_info["Gender"].replace({"0": "M", "1": "F"})
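
Note that `replace` with the string keys `"0"` and `"1"` only matches if the column actually holds strings. Whether `Gender` is read from the CSV as text or as numbers has not been verified here, so the following sketch shows a variant that handles both. `decode_gender` and the sample values are hypothetical; the coding 0 = male, 1 = female is the one used in this notebook.

```python
import pandas as pd

def decode_gender(column: pd.Series) -> pd.Series:
    """Map 0/1 codes (stored as text or numbers) to 'M'/'F'; anything else becomes NaN."""
    codes = pd.to_numeric(column, errors="coerce")
    return codes.map({0.0: "M", 1.0: "F"})

# Made-up sample with mixed representations of the same codes
sample = pd.Series(["0", 1, None, "1", 0.0])
decoded = decode_gender(sample)
print(decoded.tolist())
```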

preprocess age

In [None]:
# process columns with multiple values per cell
# Step 1: Convert the Age column to lists
def process_age(value):
    if pd.isna(value):  # Check for NaN
        return None  # Keep missing ages as None so they can be counted with isna()
    elif isinstance(value, str):
        # Split on commas (surrounding spaces are stripped) and keep only purely numeric tokens
        return [int(x.strip()) for x in value.replace(',', ', ').split(', ') if x.strip().isdigit()]
    else:
        return [int(value)]

df_patient_info['Age'] = df_patient_info['Age'].apply(process_age)

# Step 2: Expand rows so each age value gets its own row
df_patient_age = df_patient_info.explode('Age')
# drop nan values
df_patient_age = df_patient_age.dropna(subset=["Age"])

# Step 3: Create an ordering index
df_patient_age['Index'] = df_patient_age.groupby('UID').cumcount() + 1

# Step 4: Convert the Age column to integers
#df_patient_age = df_patient_age[['UID','Age','Year of Birth']].copy()
# convert col to int
df_patient_age['Age'] = df_patient_age['Age'].astype(int)
df_patient_age
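
The parsing above splits on commas only, so a cell such as `"63 64"` (space-separated) yields no ages at all. If such cells exist, which has not been checked, a regex-based variant is more tolerant. This is a sketch, not a drop-in replacement: `parse_ages` is a hypothetical name, and it would also split a decimal such as `"63.5"` into two numbers.

```python
import math
import re

def parse_ages(value):
    """Return a list of integer ages, or None if the cell is empty."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    if isinstance(value, str):
        # Every run of digits becomes one age, whatever separates the runs
        return [int(token) for token in re.findall(r"\d+", value)]
    return [int(value)]

print(parse_ages("63, 64"))  # [63, 64]
print(parse_ages("63 64"))   # [63, 64]
print(parse_ages(70.0))      # [70]
```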

### patient info

In [None]:
# how many patients?
len(df_patient_info["UID"].unique())

age

In [None]:
# summary statistics of age
print("Age Stats:")
print(df_patient_age["Age"].describe())
# na values are age = 0
print(f"\nPatients with missing age: {df_patient_info['Age'].isna().sum()}")

In [None]:
# Plot the age distribution

# Determine the bin range
min_age = df_patient_age["Age"].min()
max_age = df_patient_age["Age"].max()
bins = range(min_age, max_age + 5, 5)  # Bin width of 5 years; the last edge reaches at least max_age

# Plot histogram for all data combined (no gender differentiation)
plt.figure(figsize=(10, 6))
plt.hist(
    df_patient_age["Age"],
    bins=bins,
    density=True,
    edgecolor='black'
)
plt.ylim(0, 0.05)  # Adjust based on expected density range; matches the tick range below
plt.yticks([0, 0.01, 0.02, 0.03, 0.04, 0.05])  # Standardized ticks

# Add titles and labels
plt.title("Normalized Age Distribution")
plt.xlabel("Age")
plt.ylabel("Density")
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Show the plot
plt.show()
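
When fixed-width bin edges are built with `range`, the stop value has to lie a full bin width beyond the largest age. Otherwise the last edge can fall below the maximum, and `plt.hist` silently drops values outside the edges. A minimal sketch of a helper that guarantees coverage is given below; `age_bin_edges` and the example ages are hypothetical.

```python
def age_bin_edges(min_age, max_age, width=5):
    """Bin edges from min_age in steps of `width`, with the last edge >= max_age."""
    return list(range(min_age, max_age + width, width))

edges = age_bin_edges(20, 93)
print(edges[0], edges[-1])  # 20 95
```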

gender

In [None]:
# show gender stats (M = male, F = female; coded as 0 and 1 in the raw data)
print(f"Missing Gender: {df_patient_info['Gender'].isna().sum()}")
print(df_patient_info["Gender"].value_counts())
print(round(df_patient_info["Gender"].value_counts() / len(df_patient_info["Gender"].dropna()), 2))

age distribution per gender


In [None]:
# Adjust bins to be consistent for both genders
plt.figure(figsize=(10, 6))

# Determine a consistent bin range for both genders
min_age = df_patient_age["Age"].min()
max_age = df_patient_age["Age"].max()
bins = range(min_age, max_age + 5, 5)  # Bin width of 5 years; the last edge reaches at least max_age

for gender in df_patient_age["Gender"].unique():
    subset = df_patient_age[df_patient_age["Gender"] == gender]
    plt.hist(subset["Age"], bins=bins, alpha=0.6, density=True, label=f"{gender}", edgecolor='black')

plt.ylim(0, 0.05)  # Adjust based on expected density range; matches the tick range below
plt.yticks([0, 0.01, 0.02, 0.03, 0.04, 0.05])  # Standardized ticks

plt.title("Normalized Age Distribution by Gender with Consistent Bins")
plt.xlabel("Age")
plt.ylabel("Density")
plt.legend(title="Gender")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

In [None]:
df_patient_info

Acquisition years, estimated by adding the age(s) to the birth year

In [None]:
# add to the df_patient_age column acquisition date
df_patient_age["Acquisition Date"] = (df_patient_age["Year of Birth"] + df_patient_age["Age"]).astype(int)
df_patient_age

In [None]:
# plot distribution of acquisition dates per year
df_patient_age['Acquisition Date'].value_counts().sort_index().plot(kind='bar')

In [None]:
df_patient_info

types of medical diagnosis

In [None]:
# Convert the string column into lists
def process_conditions(value):
    if pd.isna(value):  # Handle NaN values
        return []
    # Split on commas and strip extra whitespace, remove empty strings
    return [item.strip() for item in value.split(',') if item.strip()]

df_patient_info["Associated Medical diagnosis"] = df_patient_info["Associated Medical diagnosis"].apply(process_conditions)
#df_patient_info

count of conditions

In [None]:
df_medical_conditions = df_patient_info.explode("Associated Medical diagnosis")
df_medical_conditions["Associated Medical diagnosis"].value_counts()

In [None]:
# how many unique medical diagnosis?
print(f"Unique Medical Diagnosis: {df_medical_conditions['Associated Medical diagnosis'].explode().nunique()}")

# how many diagnosis conditions contain the word "swallowing" or "dysphagia"?
print(f"Unique Swallowing Conditions: {df_medical_conditions['Associated Medical diagnosis'].drop_duplicates().str.contains('swallowing|dysphagia', case=False, na=False).sum()}")

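# how many patients have at least one diagnosis that contains "swallowing" or "dysphagia"?
# (count unique UIDs, because after explode a patient has one row per diagnosis)
swallowing_mask = df_medical_conditions["Associated Medical diagnosis"].str.contains('swallowing|dysphagia', case=False, na=False)
print(f"Patients with swallowing diagnosis: {df_medical_conditions.loc[swallowing_mask, 'UID'].nunique()} (out of {df_patient_info['UID'].nunique()})")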
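
After `explode`, a patient has one row per diagnosis, so counting matching rows and counting matching patients are different things. The toy example below illustrates the difference. The UIDs and diagnoses are made up; only the column names come from this notebook.

```python
import pandas as pd

# Made-up patients: the first has two matching diagnoses, the third has none at all
toy = pd.DataFrame({
    "UID": [1, 2, 3],
    "Associated Medical diagnosis": [
        ["Dysphagia", "Swallowing disorder"],
        ["Stroke"],
        [],
    ],
})
exploded = toy.explode("Associated Medical diagnosis")
matches = exploded["Associated Medical diagnosis"].str.contains(
    "swallowing|dysphagia", case=False, na=False
)
matching_rows = int(matches.sum())
matching_patients = exploded.loc[matches, "UID"].nunique()
print(matching_rows, matching_patients)  # 2 1
```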


how many conditions per patient?

In [None]:
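# how many conditions per patient?
# count() ignores the NaN row that explode creates for an empty list, so patients without a diagnosis count as 0
df_medical_conditions.groupby("UID")["Associated Medical diagnosis"].count().value_counts().sort_index()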

In [None]:
df_medical_conditions