==============================================================
### Data Source:

- From Kaggle https://www.kaggle.com/datasets/chaitanyakck/medical-text

==============================================================
### Research published using this data:
- Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches https://arxiv.org/abs/2211.16285
- Enhancing Multi-Class Disease Classification: Neoplasms, Cardiovascular, Nervous System, and Digestive Disorders Using Advanced LLMs https://arxiv.org/pdf/2411.12712
- From NLI to Classification: Entailment Learning for Low-Resource Text Classification https://www.anlp.jp/proceedings/annual_meeting/2025/pdf_dir/P6-8.pdf

==============================================================

### Owner:
- https://huggingface.co/datasets/TimSchopf/medical_abstracts
- https://github.com/sebischair/Medical-Abstracts-TC-Corpus/tree/main

==============================================================

### Data Preparation & Loading from Kaggle

In [None]:
import os
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')


import warnings
warnings.filterwarnings("ignore")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Create a directory for Kaggle API key
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Set correct permissions
!chmod 600 ~/.kaggle/kaggle.json

cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory
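The `cp` error above indicates that `kaggle.json` was not yet in the working directory when the cell ran. The following is a minimal Python sketch of the same setup step that fails with an explicit message when the key is missing. The function name `install_kaggle_key` and its parameters are illustrative assumptions, not part of the Kaggle tooling.

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_key(src="kaggle.json", home=None):
    """Copy the Kaggle API key into <home>/.kaggle and restrict its permissions."""
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(
            f"{src} not found; upload kaggle.json before running this step"
        )
    # Default to the current user's home directory (hypothetical parameter).
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    # Owner read/write only, equivalent to `chmod 600`.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_key()` after the upload step should place the key where the `!cp` and `!chmod` commands would have put it.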


In [None]:
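from google.colab import files

# Upload kaggle.json (the Kaggle API key) from your local machine
files.upload()

# Place the key where the Kaggle CLI looks for it and restrict its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset
!kaggle datasets download -d chaitanyakck/medical-text

# Unzip the dataset
!unzip medical-text.zip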

Saving kaggle.json to kaggle.json
Dataset URL: https://www.kaggle.com/datasets/chaitanyakck/medical-text
License(s): CC0-1.0
Downloading medical-text.zip to /content
 43% 5.00M/11.6M [00:00<00:00, 52.2MB/s]
100% 11.6M/11.6M [00:00<00:00, 93.7MB/s]
Archive:  medical-text.zip
  inflating: test.dat                
  inflating: train.dat               
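
Outside Colab, the extraction step can also be done from Python with the standard-library `zipfile` module. This is a sketch under that assumption; the helper name `extract_dataset` and its default arguments are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path="medical-text.zip", dest="."):
    """Extract every member of the archive into dest and return the member names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest)
    return names
```

Based on the log above, `extract_dataset()` would be expected to report `test.dat` and `train.dat`.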


In [None]:
print(os.listdir())

['.config', 'medical-text.zip', 'drive', 'test.dat', 'kaggle.json', 'train.dat', 'sample_data']


In [None]:
# Check the first few lines of train.dat
with open("train.dat", "r", encoding="utf-8") as file:
    for i in range(10):  # Print first 10 lines
        print(file.readline().strip())

4	Catheterization laboratory events and hospital outcome with direct angioplasty for acute myocardial infarction To assess the safety of direct infarct angioplasty without antecedent thrombolytic therapy, catheterization laboratory and hospital events were assessed in consecutively treated patients with infarctions involving the left anterior descending (n = 100 patients), right (n = 100), and circumflex (n = 50) coronary arteries. The groups of patients were similar for age (left anterior descending coronary artery, 59 years; right coronary artery, 58 years; circumflex coronary artery, 62 years), patients with multivessel disease (left anterior descending coronary artery, 55%; right coronary artery, 55%; circumflex coronary artery, 64%), and patients with initial grade 0/1 antegrade flow (left anterior descending coronary artery, 79%; right coronary artery, 84%; circumflex coronary artery, 90%). Cardiogenic shock was present in eight patients with infarction of the left anterior desce

In [None]:
# Check the first few lines of test.dat
with open("test.dat", "r", encoding="utf-8") as file:
    for i in range(10):  # Print first 10 lines
        print(file.readline().strip())

Excision of limbal dermoids. We reviewed the clinical files of 10 patients who had undergone excision of unilateral epibulbar limbal dermoids. Preoperatively, all of the affected eyes had worse visual acuity (P less than .02) and more astigmatism (P less than .01) than the contralateral eyes. Postoperatively, every patient was cosmetically improved. Of the eight patients for whom both preoperative and postoperative visual acuity measurements had been obtained, in six it had changed minimally (less than or equal to 1 line), and in two it had improved (less than or equal to 2 lines). Surgical complications included persistent epithelial defects (40%) and peripheral corneal vascularization and opacity (70%). These complications do not outweigh the cosmetic and visual benefits of dermoid excision in selected patients.
Bell's palsy. A diagnosis of exclusion. In cases of acute unilateral facial weakness, a careful and systematic evaluation is necessary to identify the cause. Idiopathic facia
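
The two previews above suggest that lines in `train.dat` start with a numeric label followed by a tab, while lines in `test.dat` contain only the abstract. A small sketch for checking that assumption without printing whole abstracts follows; `preview_lines` and its parameters are illustrative names, not part of the original notebook.

```python
from itertools import islice


def preview_lines(path, n=10, width=80):
    """Return (has_label, snippet) for the first n lines of a file.

    has_label is True when the line starts with an integer followed by a tab,
    which is the layout the train.dat preview appears to use.
    """
    rows = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in islice(handle, n):
            head, sep, _ = line.partition("\t")
            has_label = bool(sep) and head.strip().isdigit()
            rows.append((has_label, line.strip()[:width]))
    return rows
```

If the assumption holds, every row from `train.dat` would come back with `has_label` set to `True` and every row from `test.dat` with `False`.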

In [None]:
# Load train dataset
train_file = "train.dat"

# Read the file and split each line into label and abstract at the first tab
data = []
with open(train_file, "r", encoding="utf-8") as file:
    for line in file:
        parts = line.strip().split("\t", 1)  # Split once so tabs inside the abstract are kept
        if len(parts) == 2:  # Skip lines without a label/abstract pair
            label = int(parts[0])  # Disease category (1-5)
            text = parts[1]  # Abstract
            data.append((label, text))

# Convert to DataFrame
df = pd.DataFrame(data, columns=["Category", "Medical_Abstract"])


# Mapping of category numbers to disease conditions
category_mapping = {
    1: "Neoplasms",
    2: "Digestive System Diseases",
    3: "Nervous System Diseases",
    4: "Cardiovascular Diseases",
    5: "General Pathological Conditions"
}

# Replace numbers with category names
df["Category_Name"] = df["Category"].map(category_mapping)

df.head(10)

   Category  Medical_Abstract                                   Category_Name
0  4         Catheterization laboratory events and hospital...  Cardiovascular Diseases
1  5         Renal abscess in children. Three cases of rena...  General Pathological Conditions
2  2         Hyperplastic polyps seen at sigmoidoscopy are ...  Digestive System Diseases
3  5         Subclavian artery to innominate vein fistula a...  General Pathological Conditions
4  4         Effect of local inhibition of gamma-aminobutyr...  Cardiovascular Diseases
5  1         Infection during chronic epidural catheterizat...  Neoplasms
6  5         Mediastinal tracheostomy using a pectoralis ma...  General Pathological Conditions
7  5         Tumefactive fibroinflammatory lesion of the ex...  General Pathological Conditions
8  3         Multiple representations contribute to body kn...  Nervous System Diseases
9  5         Increasing asthma prevalence in a rural New Ze...  General Pathological Conditions
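
As an alternative to the manual loop, the same file could probably be read with `pandas.read_csv`. This is a sketch that assumes every line holds exactly one tab (label, then abstract); lines with extra tabs would make `read_csv` raise a parsing error. The function name `load_labeled` and the label check are additions for illustration, not part of the original notebook.

```python
import csv

import pandas as pd

CATEGORY_NAMES = {
    1: "Neoplasms",
    2: "Digestive System Diseases",
    3: "Nervous System Diseases",
    4: "Cardiovascular Diseases",
    5: "General Pathological Conditions",
}


def load_labeled(path):
    """Load a tab-separated label/abstract file into a DataFrame."""
    frame = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["Category", "Medical_Abstract"],
        quoting=csv.QUOTE_NONE,  # treat quote characters in abstracts as plain text
        encoding="utf-8",
    )
    # Fail early on labels that have no entry in the mapping.
    unknown = set(frame["Category"]) - set(CATEGORY_NAMES)
    if unknown:
        raise ValueError(f"unexpected category labels: {sorted(unknown)}")
    frame["Category_Name"] = frame["Category"].map(CATEGORY_NAMES)
    return frame
```

The explicit label check is a design choice: `Series.map` would otherwise turn an unmapped label into a missing value without any warning.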


In [None]:
# Check dataset size
print("Train Data Shape:", df.shape)

# Count the number of abstracts per category
print(df["Category_Name"].value_counts())

Train Data Shape: (14438, 3)
Category_Name
General Pathological Conditions    4805
Neoplasms                          3163
Cardiovascular Diseases            3051
Nervous System Diseases            1925
Digestive System Diseases          1494
Name: count, dtype: int64
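
The counts above show a noticeable class imbalance: the largest category has roughly 3.2 times as many abstracts as the smallest. The sketch below derives each category's share and that ratio from the printed counts; the helper names `class_shares` and `imbalance_ratio` are illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {
    "General Pathological Conditions": 4805,
    "Neoplasms": 3163,
    "Cardiovascular Diseases": 3051,
    "Nervous System Diseases": 1925,
    "Digestive System Diseases": 1494,
}


def class_shares(counts):
    """Return the fraction of all abstracts that falls into each category."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def imbalance_ratio(counts):
    """Return the size of the largest category divided by the smallest."""
    return max(counts.values()) / min(counts.values())


for name, share in class_shares(counts).items():
    print(f"{name}: {share:.1%}")
print(f"Imbalance ratio: {imbalance_ratio(counts):.2f}")
```

With a DataFrame at hand, `df["Category_Name"].value_counts(normalize=True)` gives the same shares directly.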


In [None]:
dataset_path = "/content/drive/My Drive/medical_abstract/dataset"
os.makedirs(dataset_path, exist_ok=True)  # Create the target folder if it does not exist yet

save_path = os.path.join(dataset_path, "medical_abstract.csv")
df.to_csv(save_path, index=False)
print(f"Saved merged dataset to {save_path}")

Saved merged dataset to /content/drive/My Drive/medical_abstract/dataset/medical_abstract.csv
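
After saving, it can be worth confirming that the CSV reads back to the same table, since abstracts contain commas and quotes. The sketch below does this round-trip check. `save_and_verify` is a hypothetical helper, and it assumes no abstract is empty or equal to a string that pandas treats as missing (such as `NA`).

```python
import os

import pandas as pd


def save_and_verify(frame, directory, filename="medical_abstract.csv"):
    """Write frame to CSV and confirm that reading it back gives the same table."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    # Compare values only; dtypes may differ slightly between pandas versions.
    pd.testing.assert_frame_equal(
        frame.reset_index(drop=True), reloaded, check_dtype=False
    )
    return path
```

A call such as `save_and_verify(df, dataset_path)` would raise an `AssertionError` if the saved file did not match the DataFrame.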


- The data is now ready for analysis. The final output is saved to Google Drive; change the path if you prefer to save it to your local machine.