In [None]:
import pandas as pd
import re

In [None]:
import kagglehub
path = kagglehub.dataset_download("tboyle10/medicaltranscriptions")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'medicaltranscriptions' dataset.
Path to dataset files: /kaggle/input/medicaltranscriptions


In [None]:
# Build the CSV path from the download location returned by kagglehub above.
dataset = pd.read_csv(f"{path}/mtsamples.csv")

In [None]:
dataset.tail(1)

Unnamed: 0.1,Unnamed: 0,description,medical_specialty,sample_name,transcription,keywords
4998,4998,"Acute allergic reaction, etiology uncertain, however, suspicious for Keflex.",Allergy / Immunology,Allergy Evaluation Consult,"HISTORY: , A 34-year-old male presents today self-referred at the recommendation of Emergency Room physicians and his nephrologist to pursue further allergy evaluation and treatment. Please refer to chart for history and physical, as well as the medical records regarding his allergic reaction treatment at ABC Medical Center for further details and studies. In summary, the patient had an acute event of perioral swelling, etiology uncertain, occurring on 05/03/2008 requiring transfer from ABC Medical Center to XYZ Medical Center due to a history of renal failure requiring dialysis and he was admitted and treated and felt that his allergy reaction was to Keflex, which was being used to treat a skin cellulitis dialysis shunt infection. In summary, the patient states he has some problems with tolerating grass allergies, environmental and inhalant allergies occasionally, but has never had anaphylactic or angioedema reactions. He currently is not taking any medication for allergies. He is taking atenolol for blood pressure control. 
No further problems have been noted upon his discharge and treatment, which included corticosteroid therapy and antihistamine therapy and monitoring.,PAST MEDICAL HISTORY:, History of urticaria, history of renal failure with hypertension possible source of renal failure, history of dialysis times 2 years and a history of hypertension.,PAST SURGICAL HISTORY:, PermCath insertion times 3 and peritoneal dialysis.,FAMILY HISTORY: , Strong for heart disease, carcinoma, and a history of food allergies, and there is also a history of hypertension.,CURRENT MEDICATIONS: , Atenolol, sodium bicarbonate, Lovaza, and Dialyvite.,ALLERGIES: , Heparin causing thrombocytopenia.,SOCIAL HISTORY: , Denies tobacco or alcohol use.,PHYSICAL EXAMINATION: ,VITAL SIGNS: Age 34, blood pressure 128/78, pulse 70, temperature is 97.8, weight is 207 pounds, and height is 5 feet 7 inches.,GENERAL: The patient is healthy appearing; alert and oriented to person, place and time; responds appropriately; in no acute distress.,HEAD: Normocephalic. No masses or lesions noted.,FACE: No facial tenderness or asymmetry noted.,EYES: Pupils are equal, round and reactive to light and accommodation bilaterally. Extraocular movements are intact bilaterally.,EARS: The tympanic membranes are intact bilaterally with a good light reflex. The external auditory canals are clear with no lesions or masses noted. Weber and Rinne tests are within normal limits.,NOSE: The nasal cavities are patent bilaterally. The nasal septum is midline. There are no nasal discharges. No masses or lesions noted.,THROAT: The oral mucosa appears healthy. Dental hygiene is maintained well. No oropharyngeal masses or lesions noted. No postnasal drip noted.,NECK: The neck is supple with no adenopathy or masses palpated. The trachea is midline. The thyroid gland is of normal size with no nodules.,NEUROLOGIC: Facial nerve is intact bilaterally. 
The remaining cranial nerves are intact without focal deficit.,LUNGS: Clear to auscultation bilaterally. No wheeze noted.,HEART: Regular rate and rhythm. No murmur noted.,IMPRESSION: ,1. Acute allergic reaction, etiology uncertain, however, suspicious for Keflex.,2. Renal failure requiring dialysis.,3. Hypertension.,RECOMMENDATIONS: ,RAST allergy testing for both food and environmental allergies was performed, and we will get the results back to the patient with further recommendations to follow. If there is any specific food or inhalant allergen that is found to be quite high on the sensitivity scale, we would probably recommend the patient to avoid the offending agent to hold off on any further reactions. At this point, I would recommend the patient stopping any further use of cephalosporin antibiotics, which may be the cause of his allergic reaction, and I would consider this an allergy. Being on atenolol, the patient has a more difficult time treating acute anaphylaxis, but I do think this is medically necessary at this time and hopefully we can find specific causes for his allergic reactions. An EpiPen was also prescribed in the event of acute angioedema or allergic reaction or sensation of impending allergic reaction and he is aware he needs to proceed directly to the emergency room for further evaluation and treatment recommendations after administration of an EpiPen.",


In [None]:
import re

def extract_uppercase_labels(text):
    if pd.isna(text) or not isinstance(text, str):
        return []
    # Same section-label list that basiccleaning (defined in a later cell) normalizes.
    pattern = r'(FINAL DIAGNOSES|CHIEF COMPLAINT|INDICATION|PROCEDURE|REASON FOR VISIT|HISTORY OF PRESENT ILLNESS|PAST MEDICAL HISTORY|SUBJECTIVE|DESCRIPTION|DIAGNOSIS|PREOPERATIVE DIAGNOSIS|2-D STUDY|CC)'
    # Matching is case-insensitive, so hits are upper-cased before de-duplication.
    labels = re.findall(pattern, text, re.IGNORECASE)
    return [label.upper() for label in labels]

# Apply the function to the 'transcription' column and collect all labels
all_labels = dataset['transcription'].apply(extract_uppercase_labels)

# Flatten the per-row lists into a set, then sort the unique labels
unique_labels = sorted({label for sublist in all_labels for label in sublist})

print("Unique uppercase labels found in 'transcription' column:")
for label in unique_labels:
    print(label)

Unique uppercase labels found in 'transcription' column:
2-D STUDY
CC
CHIEF COMPLAINT
DESCRIPTION
DIAGNOSIS
FINAL DIAGNOSES
HISTORY OF PRESENT ILLNESS
INDICATION
PAST MEDICAL HISTORY
PREOPERATIVE DIAGNOSIS
PROCEDURE
REASON FOR VISIT
SUBJECTIVE
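
The listing above only shows which labels occur at least once. A minimal sketch of counting how often each label appears, assuming a shortened label list and made-up sample notes (both hypothetical, not taken from the dataset):

```python
import re
from collections import Counter

# Hypothetical shortened label list; the notebook's full list could be substituted.
LABEL_PATTERN = re.compile(
    r"\b(CHIEF COMPLAINT|PAST MEDICAL HISTORY|SUBJECTIVE|PROCEDURE)\b"
)

def count_labels(texts):
    """Count occurrences of each section label across an iterable of texts."""
    counts = Counter()
    for text in texts:
        if isinstance(text, str):  # skip NaN / None entries
            counts.update(LABEL_PATTERN.findall(text))
    return counts

# Made-up notes in the comma-separated style of the transcription column.
sample_notes = [
    "SUBJECTIVE:, Patient reports cough.,PAST MEDICAL HISTORY:, Asthma.",
    "CHIEF COMPLAINT:, Headache.,PAST MEDICAL HISTORY:, None.",
    None,
]

label_counts = count_labels(sample_notes)
for label, n in label_counts.most_common():
    print(label, n)
```

On the real data, `count_labels(dataset['transcription'])` would presumably give a sense of which section headers are common enough to be worth normalizing.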


In [None]:
if "Unnamed: 0" in dataset.columns:
    dataset.drop("Unnamed: 0", axis=1, inplace=True)

In [None]:
def basiccleaning(text):
    if pd.isna(text):
        return ""
    text = str(text)

    text = re.sub(r'\s+', ' ', text)             # collapse whitespace runs (newlines included)
    text = re.sub(r',\s*,+', ', ', text)         # collapse repeated commas
    text = re.sub(r'([,.;:-])\1+', r'\1', text)  # collapse repeated punctuation
    # Make sure each section label is followed by ": ". Word boundaries and
    # case-sensitive matching keep ordinary words (e.g. the "cc" inside
    # "occasionally", or a lowercase "procedure") from being rewritten.
    text = re.sub(r'\b(FINAL DIAGNOSES|CHIEF COMPLAINT|INDICATION|PROCEDURE|REASON FOR VISIT|HISTORY OF PRESENT ILLNESS|PAST MEDICAL HISTORY|SUBJECTIVE|DESCRIPTION|DIAGNOSIS|PREOPERATIVE DIAGNOSIS|2-D STUDY|CC)\b\s*:?,?',
                  r'\1: ', text)

    text = text.strip()
    return text
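
A minimal sketch of how a label pattern behaves on ordinary prose, comparing a case-insensitive pattern without word boundaries against a case-sensitive, word-bounded one. The shortened label list and the sample sentence are hypothetical:

```python
import re

LABELS = r"(PROCEDURE|DIAGNOSIS|CC)"  # hypothetical shortened label list

# Case-insensitive, no word boundaries: also matches inside ordinary words.
loose = re.compile(LABELS + r"\s*:?,?", re.IGNORECASE)
# Case-sensitive with word boundaries: matches only standalone uppercase labels.
strict = re.compile(r"\b" + LABELS + r"\b\s*:?,?")

sample = "CC:,Rash. He occasionally itches after the procedure."

loose_result = loose.sub(r"\1: ", sample)
strict_result = strict.sub(r"\1: ", sample)

print(loose_result)
print(strict_result)
```

The loose pattern rewrites the "cc" inside "occasionally" and the lowercase "procedure", while the strict one only touches the leading `CC` label.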

In [None]:
# basiccleaning already maps NaN to "", so every text column gets the same treatment.
for col in ['description', 'medical_specialty', 'sample_name', 'transcription', 'keywords']:
    dataset[col] = dataset[col].apply(basiccleaning)

In [None]:
def merge_fields(row):
    parts = [
        f"Sample Name: {row['sample_name']}",
        f"Medical Specialty: {row['medical_specialty']}",
        f"Description: {row['description']}",
        f"Transcription: {row['transcription']}",
        f"Keywords: {row['keywords']}"
    ]
    return "\n".join(parts)

In [None]:
dataset["full_text"] = dataset.apply(merge_fields, axis=1)

In [None]:
print(dataset["full_text"])

0       Sample Name: Allergic Rhinitis\nMedical Specia...
1       Sample Name: Laparoscopic Gastric Bypass Consu...
2       Sample Name: Laparoscopic Gastric Bypass Consu...
3       Sample Name: 2-D Echocardiogram - 1\nMedical S...
4       Sample Name: 2-D Echocardiogram - 2\nMedical S...
                              ...                        
4994    Sample Name: Chronic Sinusitis\nMedical Specia...
4995    Sample Name: Kawasaki Disease - Discharge Summ...
4996    Sample Name: Followup on Asthma\nMedical Speci...
4997    Sample Name: Asthma in a 5-year-old\nMedical S...
4998    Sample Name: Allergy Evaluation Consult\nMedic...
Name: full_text, Length: 4999, dtype: object


In [None]:
import re
import pandas as pd

def deep_clean(text):
    text = str(text)

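    # fix stray commas left over from the CSV export
    text = re.sub(r',\s*,+', ', ', text)
    text = re.sub(r':\s*,', ': ', text)

    # turn literal backslash-n sequences into real newlines
    text = text.replace('\\n', '\n')

    # collapse every whitespace run, newlines included, into a single space
    text = re.sub(r'\s+', ' ', text)

    return text.strip()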


In [None]:
dataset['full_text']=dataset['full_text'].apply(deep_clean)

In [None]:
print(dataset['full_text'])

0       Sample Name: Allergic Rhinitis Medical Special...
1       Sample Name: Laparoscopic Gastric Bypass Consu...
2       Sample Name: Laparoscopic Gastric Bypass Consu...
3       Sample Name: 2-D Echocardiogram - 1 Medical Sp...
4       Sample Name: 2-D Echocardiogram - 2 Medical Sp...
                              ...                        
4994    Sample Name: Chronic Sinusitis Medical Special...
4995    Sample Name: Kawasaki Disease - Discharge Summ...
4996    Sample Name: Followup on Asthma Medical Specia...
4997    Sample Name: Asthma in a 5-year-old Medical Sp...
4998    Sample Name: Allergy Evaluation Consult Medica...
Name: full_text, Length: 4999, dtype: object


In [None]:
import re

def clean_minimal(text):
    if not isinstance(text, str):
        return ""

    # 1. Convert escaped \n into real new lines
    text = text.replace("\\n", "\n")

    # 2. Remove repeated commas (CSV noise)
    text = re.sub(r",\s*,+", ", ", text)

    # 3. Fix CSV artifacts like ",MEDICATIONS:"
    text = re.sub(r",\s*([A-Z][A-Za-z ]+:)", r" \1", text)

    # 4. Remove cases like ": ," → ": "
    text = re.sub(r":\s*,", ": ", text)

    # 5. Collapse every whitespace run (newlines included) into a single space
    text = re.sub(r"\s+", " ", text).strip()

    return text
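
Because `\s+` also matches newlines, the collapse step in the cleaning functions removes the line breaks that `merge_fields` put between fields, which matches the printed `full_text` output. If those line breaks should survive (for example so that a newline separator can act as a split point later), one possible variant is sketched below. `collapse_keep_newlines` is a hypothetical helper, not part of the notebook:

```python
import re

def collapse_keep_newlines(text):
    """Collapse runs of spaces/tabs but keep line breaks (hypothetical helper)."""
    text = re.sub(r"[ \t]+", " ", text)     # squeeze horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around each newline
    text = re.sub(r"\n{3,}", "\n\n", text)  # allow at most one blank line
    return text.strip()

sample = "Sample Name:  Asthma \nDescription:   Followup\n\n\n\nKeywords: asthma"

flattened = re.sub(r"\s+", " ", sample).strip()  # what a plain \s+ collapse produces
preserved = collapse_keep_newlines(sample)

print(repr(flattened))
print(repr(preserved))
```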


In [None]:
dataset["full_text"] = dataset["full_text"].apply(clean_minimal)

In [None]:
print(dataset['full_text'])

0       Sample Name: Allergic Rhinitis Medical Special...
1       Sample Name: Laparoscopic Gastric Bypass Consu...
2       Sample Name: Laparoscopic Gastric Bypass Consu...
3       Sample Name: 2-D Echocardiogram - 1 Medical Sp...
4       Sample Name: 2-D Echocardiogram - 2 Medical Sp...
                              ...                        
4994    Sample Name: Chronic Sinusitis Medical Special...
4995    Sample Name: Kawasaki Disease - Discharge Summ...
4996    Sample Name: Followup on Asthma Medical Specia...
4997    Sample Name: Asthma in a 5-year-old Medical Sp...
4998    Sample Name: Allergy Evaluation Consult Medica...
Name: full_text, Length: 4999, dtype: object


In [None]:
!pip install langchain-text-splitters


Collecting langchain-text-splitters
  Downloading langchain_text_splitters-1.0.0-py3-none-any.whl.metadata (2.6 kB)
Downloading langchain_text_splitters-1.0.0-py3-none-any.whl (33 kB)
Installing collected packages: langchain-text-splitters
Successfully installed langchain-text-splitters-1.0.0


Chunks and Metadata

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 80,
    separators = ["\n\n", "\n", ". ", "; ", ": ", " "]
)
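
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified stand-in that cuts fixed-width character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on the listed separators and merges pieces up to the size limit, so its chunks usually end at separator boundaries and can be shorter). `sliding_chunks` is a hypothetical helper for illustration only:

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=80):
    """Cut fixed-width windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# 1000 characters of dummy text
demo_text = "".join(str(i % 10) for i in range(1000))
demo_chunks = sliding_chunks(demo_text)

print([len(c) for c in demo_chunks])
# The tail of one chunk is repeated at the head of the next
print(demo_chunks[0][-80:] == demo_chunks[1][:80])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.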


In [None]:
chunks = []
metadata_list = []

for idx, row in dataset.iterrows():
    text = row["full_text"]
    if not isinstance(text, str) or len(text.strip()) == 0:
        continue

    doc_chunks = text_splitter.split_text(text)

    for i, chunk in enumerate(doc_chunks):
        chunks.append(chunk)
        metadata_list.append({
            "doc_id": idx,
            "chunk_id": i,
            "sample_name": row.get("sample_name", ""),
            "medical_specialty": row.get("medical_specialty", ""),
            "description": row.get("description", ""),
            "keywords": row.get("keywords", ""),
            "source": "medical_transcriptions"
        })


In [None]:
processed_chunks = pd.DataFrame({
    "chunk_text": chunks,
    "metadata": metadata_list
})

processed_chunks.to_json(
    "processed_medical_chunks.json",
    orient="records",
    indent=2
)

print("Saved processed_medical_chunks.json with", len(processed_chunks), "chunks.")


Saved processed_medical_chunks.json with 43420 chunks.


In [None]:
import json
with open("processed_medical_chunks.json", "r") as f:
    chunks = json.load(f)

print(chunks[0])
print("Total chunks:", len(chunks))


{'chunk_text': 'Sample Name: Allergic Rhinitis Medical Specialty: Allergy / Immunology Description: A 23-year-old white female presents with complaint of allergies. Transcription: SUBJECTIVE: This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also', 'metadata': {'doc_id': 0, 'chunk_id': 0, 'sample_name': 'Allergic Rhinitis', 'medical_specialty': 'Allergy / Immunology', 'description': 'A 23-year-old white female presents with complaint of allergies.', 'keywords': 'allergy / immunology, allergic rhinitis, allergies, asthma, nasal sprays, rhinitis, nasal, erythematous, allegra, sprays, allergic,', 'source': 'medical_transcriptions'}}
Total chunks: 43420
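
Since every record carries `doc_id` and `chunk_id`, the chunks of one transcription can be collected again in order. A minimal sketch using made-up records in the same shape as the saved JSON (`group_by_doc` is a hypothetical helper):

```python
from collections import defaultdict

# Made-up records mirroring the structure of processed_medical_chunks.json
records = [
    {"chunk_text": "second part", "metadata": {"doc_id": 0, "chunk_id": 1}},
    {"chunk_text": "first part", "metadata": {"doc_id": 0, "chunk_id": 0}},
    {"chunk_text": "other doc", "metadata": {"doc_id": 1, "chunk_id": 0}},
]

def group_by_doc(records):
    """Map each doc_id to its records, sorted by chunk_id."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["metadata"]["doc_id"]].append(rec)
    for recs in grouped.values():
        recs.sort(key=lambda r: r["metadata"]["chunk_id"])
    return dict(grouped)

by_doc = group_by_doc(records)
print([r["chunk_text"] for r in by_doc[0]])
```

Note that neighbouring chunks overlap, so joining the grouped texts would repeat the overlapping characters rather than reproduce the original document exactly.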


Text Embeddings

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [c['chunk_text'] for c in chunks]
embs = model.encode(texts, show_progress_bar=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.13.0-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.7 kB)
Downloading faiss_cpu-1.13.0-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.6 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.13.0


In [None]:
import faiss
import numpy as np

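dimension = embs.shape[1]
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 index

# FAISS expects a contiguous float32 array of shape (n_vectors, dimension)
index.add(np.ascontiguousarray(embs, dtype="float32"))
faiss.write_index(index, "medical_faiss.index")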
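
`IndexFlatL2` is an exact index: a search compares the query against every stored vector and, as far as I know, reports squared Euclidean distances, smallest first. The pure-Python sketch below mimics that behaviour on toy 2-D vectors. It is only an illustration of what the index computes, not FAISS code, and `brute_force_search` is a hypothetical helper:

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(vectors, query, k=2):
    """Return (distances, ids) of the k stored vectors closest to the query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy "embeddings"; position i plays the role of the i-th chunk
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]]
distances, ids = brute_force_search(vectors, [0.9, 0.1], k=2)

print(ids)
print(distances)
```

With the real index, `index.search(query_vectors, k)` returns the same kind of pair (distances, ids). Because the embeddings were added in the order of `texts`, each returned id can be used as a position into the loaded chunk records.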
