

#INSTRUCTIONS:
Store the training_20180910 folder at the top level of your Google Drive (MyDrive) for this notebook to work.

If you only want to see the results, look through results.json, which the cell below pulls from the GitHub repo.

In [33]:
!wget https://raw.githubusercontent.com/jonahkaye/medical_clustering/main/results.json -O results.json

--2023-10-05 19:41:37--  https://raw.githubusercontent.com/jonahkaye/medical_clustering/main/results.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 82310 (80K) [text/plain]
Saving to: ‘results2.json’


2023-10-05 19:41:37 (10.9 MB/s) - ‘results2.json’ saved [82310/82310]



In [1]:
#REQUIREMENTS
!pip install transformers
!pip install torch

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Colle

In [18]:
# DATA PROCESSING
import re
import pandas as pd
import os

def is_valid_sentence(sentence):
    """Helper function to check if a sentence is valid (not too short and not just punctuation)."""
    return len(sentence) > 1 and not all(char in '.#,' for char in sentence)

def extract_sections(file_path):
    # List of section titles we care about
    section_titles = [
        "Allergies:",
        "Chief Complaint:",
        "Major Surgical or Invasive Procedure:",
        "History of Present Illness:",
        "Review of Systems:",
        "Past Medical History:",
        "Social History:",
        "Family History:",
        "Physical Exam:",
        "Pertinent Results:",
        "Brief Hospital Course:",
        "Medications on Admission:",
        "Discharge Medications:",
        "Discharge Disposition:",
        "Discharge Diagnosis:",
        "Discharge Condition:"
    ]

    # Read the file
    with open(file_path, 'r') as f:
        lines = f.readlines()

    # Initialize a dictionary to store section texts
    section_texts = {title: "" for title in section_titles}

    current_section = None
    for line in lines:
        line_stripped = line.strip()
        # Check if the line matches any section title
        if line_stripped in section_titles:
            current_section = line_stripped
            continue  # skip the title line itself
        # If inside a section, append the line to the section's text
        elif current_section:
            section_texts[current_section] += line

    # Removing excess newlines for each section
    for key, val in section_texts.items():
        section_texts[key] = val.strip()

    return section_texts

def create_dataframe_from_sections(file_path):
    # Extract sections from the file
    sections = extract_sections(file_path)

    # Keys we are interested in for regular sentence splits
    target_keys = [
        "Major Surgical or Invasive Procedure:",
        "History of Present Illness:",
        "Past Medical History:",
        "Brief Hospital Course:"
    ]

    # Keys we are interested in for comma and newline splits
    special_keys = [
        "Chief Complaint:",
        "Discharge Diagnosis:"
    ]

    # Data to populate the dataframe
    data = {
        'sentence': [],
        'section': [],
        'file': [],
        'type': []
    }

    # Helper function to split text into sentences
    def split_into_sentences(text):
        # Split on periods, question marks
        return re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s*', text)

    # For regular keys
    for key in target_keys:
        sentences = split_into_sentences(sections[key])
        for sentence in sentences:
            cleaned_sentence = sentence.replace("\n", " ").strip()
            if cleaned_sentence and is_valid_sentence(cleaned_sentence):  # Modified condition
                data['sentence'].append(cleaned_sentence)
                data['section'].append(key)
                data['file'].append(file_path)
                data['type'].append("underlying factor")

    # For special keys, split by comma and newline
    for key in special_keys:
        sentences = re.split(r'[\n,]', sections[key])
        for sentence in sentences:
            cleaned_sentence = sentence.replace("\n", " ").strip()
            if cleaned_sentence and is_valid_sentence(cleaned_sentence):  # Modified condition
                data['sentence'].append(cleaned_sentence)
                data['section'].append(key)
                data['file'].append(file_path)
                data['type'].append("condition")

    # Convert data into dataframe
    df = pd.DataFrame(data)
    return df

def process_all_files_in_directory(directory_path):
    # List all files in the given directory
    all_files = os.listdir(directory_path)

    # Filter out files that aren't text files
    text_files = [file for file in all_files if file.endswith('.txt')]

    # Process each text file, then concatenate once (cheaper than concatenating inside the loop)
    frames = [
        create_dataframe_from_sections(os.path.join(directory_path, text_file))
        for text_file in text_files
    ]

    # With no text files, return an empty dataframe with the expected columns
    if not frames:
        return pd.DataFrame(columns=['sentence', 'section', 'file', 'type'])

    return pd.concat(frames, ignore_index=True)

In [11]:
# EMBEDDINGS
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
from transformers import AutoModel, AutoTokenizer
import logging
import json
import torch


logging.getLogger('transformers.modeling_utils').setLevel(logging.ERROR)


class Embedder:
    def __init__(self):
        model_name = "emilyalsentzer/Bio_ClinicalBERT"
        self.model = AutoModel.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Check if GPU is available and move the model to GPU
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def create_embedding_bert(self, text):
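        # Truncate to BERT's 512-token limit so very long sentences do not raise an error
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, max_length=512)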

        # Move inputs to the same device as the model
        inputs = {name: tensor.to(self.device) for name, tensor in inputs.items()}

        with torch.no_grad():  # Disable gradient calculations during inference
            outputs = self.model(**inputs)
            embedding = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
            embedding = embedding.flatten()
        return embedding

    def embed_texts_in_df(self, df, text_column):
        embeddings = []
        count = 0
        for text in df[text_column]:
            print(f"embedded text {count}")
            embedding = self.create_embedding_bert(text)
            embeddings.append(embedding)
            count = count + 1

        df['embedding'] = [json.dumps(embed.tolist()) for embed in embeddings]
        return df

    def get_k_most_similar_factors_for_condition_vector(self, embedded_df, input_vector, file_paths, k=5):
        # Filter for underlying factors from the specified files
        factors = embedded_df[(embedded_df['type'] == "underlying factor") & (embedded_df['file'].isin(file_paths))]

        similarities = []

        # Reshape the input vector for compatibility with cosine_similarity
        input_vector_reshaped = input_vector.reshape(1, -1)

        for _, factor_row in factors.iterrows():
            factor_text = factor_row['sentence']
            factor_embedding = np.array(factor_row['embedding']).reshape(1, -1)

            # cosine_similarity returns a 1x1 array; keep the scalar so sorting is unambiguous
            similarity = cosine_similarity(input_vector_reshaped, factor_embedding)[0, 0]
            similarities.append((factor_text, similarity))

        # Sort by similarity and select top k
        similarities.sort(key=lambda x: x[1], reverse=True)
        top_k_factors = similarities[:k]

        # Return the results (could be associated with a condition text or an ID for reference if needed)
        return top_k_factors



In [24]:
# CLUSTERING
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from scipy.spatial.distance import euclidean
import matplotlib.pyplot as plt
import warnings

# Suppress FutureWarnings from sklearn
warnings.simplefilter(action='ignore', category=FutureWarning)

def calculate_angle(a, b, c):
	ba = a - b  # vector from b to a
	bc = c - b  # vector from b to c
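	cosine_angle = np.dot(ba, bc) / (np.linalg.norm(ba) * np.linalg.norm(bc))
	# Clip to [-1, 1] so floating-point rounding cannot push arccos out of its domain
	angle = np.arccos(np.clip(cosine_angle, -1.0, 1.0))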
	return np.degrees(angle)

def elbow(embedding_df, random_seed=42, plot_results=False):
	matrix = np.vstack(embedding_df.embedding.values)
	distortions = []
	max_clusters = min(20, len(embedding_df) - 1)
	K_range = range(1, max_clusters + 1)
	for k in K_range:
		kmeanModel = KMeans(n_clusters=k, random_state=random_seed, n_init=10)  # Set seed and multiple initializations
		kmeanModel.fit(matrix)
		distortions.append(kmeanModel.inertia_)

	# Calculating the angles
	angles = []
	for i in range(1, len(distortions)-1):
		a = np.array([K_range[i-1], distortions[i-1]])
		b = np.array([K_range[i], distortions[i]])
		c = np.array([K_range[i+1], distortions[i+1]])
		angles.append(calculate_angle(a, b, c))

	# Finding the elbow
	optimal_k = angles.index(max(angles)) + 2  # +2 because index 0 corresponds to k=2

	# Plotting the elbow graph
	if plot_results:
		plt.figure(figsize=(10,5))
		plt.plot(K_range, distortions, 'bx-')
		plt.xlabel('Number of clusters')
		plt.ylabel('Distortion')
		plt.title('The Elbow Method showing the optimal number of clusters')
		plt.axvline(x=optimal_k, color='r', linestyle='--')
		plt.show()

	return optimal_k

def cluster(embedding_df, k, condition_embedding, plot_results=False):

	embedding_df = embedding_df.copy()
	# Stack them into a matrix for clustering
	matrix = np.vstack(embedding_df.embedding.values)

	# Cluster using K-means
	n_clusters = k
	kmeans = KMeans(n_clusters=n_clusters, init="k-means++", random_state=42)
	kmeans.fit(matrix)
	embedding_df["Cluster"] = kmeans.labels_
	perplexity_value = min(15, len(embedding_df) - 1)  # -1 to ensure it's less than n_samples

	 # Calculate cluster centroids
	centroids = np.array([matrix[embedding_df["Cluster"] == i].mean(axis=0) for i in range(n_clusters)])

	# Find the closest cluster to the condition
	distances_to_condition = np.linalg.norm(centroids - condition_embedding, axis=1)
	closest_cluster = np.argmin(distances_to_condition)

	# Visualize using t-SNE
	if plot_results:
		tsne = TSNE(
			n_components=2, perplexity=perplexity_value, random_state=42, init="random", learning_rate=200
		)
		vis_dims2 = tsne.fit_transform(matrix)

		x = vis_dims2[:, 0]
		y = vis_dims2[:, 1]

		# One color per cluster: k can differ from any fixed-length list of color names
		colors = plt.cm.tab20(np.linspace(0, 1, n_clusters))
		for category, color in enumerate(colors):
			mask = (embedding_df.Cluster == category).to_numpy()
			xs = x[mask]
			ys = y[mask]
			plt.scatter(xs, ys, color=color, alpha=0.3)

			# Mark the mean position of each cluster in the 2D projection
			avg_x = xs.mean()
			avg_y = ys.mean()

			plt.scatter(avg_x, avg_y, marker="x", color=color, s=100)
		plt.title("Factor clusters visualized in 2D using t-SNE")
		plt.show()

	print(f"The cluster closest to the condition is cluster {closest_cluster + 1}.")
  	# Extract embeddings and sentences for the closest cluster
	cluster_embeddings = matrix[embedding_df["Cluster"] == closest_cluster]
	cluster_sentences = embedding_df[embedding_df["Cluster"] == closest_cluster]["sentence"]

	# Calculate distances of each sentence in the cluster to the centroid
	centroid = centroids[closest_cluster]
	distances_to_centroid = np.linalg.norm(cluster_embeddings - centroid, axis=1)

	# Get indices of the 10 most similar sentences
	most_similar_idx = np.argsort(distances_to_centroid)[:10]

	# Extract and print the most similar sentences
	most_similar_sentences = cluster_sentences.iloc[most_similar_idx]
	for sentence in most_similar_sentences:
		print(f"- {sentence}")

	return most_similar_sentences

In [5]:
# IDENTIFYING CONDITIONS
from google.colab import drive
drive.mount('/content/drive')

def load_embedded_df(csv_path):
    try:
        df = pd.read_csv(csv_path)
        df['embedding'] = df['embedding'].apply(lambda x: np.array(json.loads(x)))
        return df
    except FileNotFoundError:
        return pd.DataFrame()

def normalize_condition(sentence):
    if not isinstance(sentence, str):
        print(f"Warning: Non-string input to normalize_condition: {sentence}")
        sentence = str(sentence)  # Convert non-string inputs to string
    sentence = sentence.lower()  # Convert to lowercase
    sentence = re.sub(r'[^a-z\s]', '', sentence)  # Remove non-alphabetic characters
    sentence = re.sub(r'\s+', ' ', sentence).strip()  # Remove extra spaces
    return sentence

def find_overlapping_conditions_with_files(df):
    condition_df = df[df['type'] == 'condition']

    seen_sentences = {}
    overlapping_conditions = {}

    for index, row in condition_df.iterrows():
        original_sentence = row['sentence']
        sentence = normalize_condition(original_sentence)  # Normalize the sentence

        # Ignore conditions with "primary" or "secondary"
        if "primary" in sentence or "secondary" in sentence:
            continue

        file_path = row['file']

        if sentence in seen_sentences:
            if file_path not in seen_sentences[sentence]:
                seen_sentences[sentence].append(file_path)
            overlapping_conditions[sentence] = seen_sentences[sentence]
        else:
            seen_sentences[sentence] = [file_path]

    return overlapping_conditions


def extract_factors_for_condition(embedded_df, input_vector, file_paths):
	"""Extract embeddings for underlying factors associated with a condition across specific files."""
	# Filter for underlying factors from the specified files
	factors = embedded_df[(embedded_df['type'] == "underlying factor") & (embedded_df['file'].isin(file_paths))]
	return factors


Mounted at /content/drive


In [26]:
# MAIN DATA ANALYSIS
embedder_instance = Embedder()
embedded_df = load_embedded_df("/content/drive/MyDrive/embedded_sentences.csv")
if embedded_df.empty:
    file_path = "/content/drive/MyDrive/training_20180910"
    df = process_all_files_in_directory(file_path)
    embedded_df = embedder_instance.embed_texts_in_df(df, 'sentence')
    embedded_df.to_csv("/content/drive/MyDrive/embedded_sentences.csv", index=False)

overlapping = find_overlapping_conditions_with_files(embedded_df)
conditions_to_factors = {}
for condition, associated_files in overlapping.items():
    if len(associated_files) > 2:  # Checking for more than 2 associated files
        print(f"\nCondition: {condition}")
        print("Associated files:", ", ".join(associated_files))
        factors_df = extract_factors_for_condition(embedded_df, condition, associated_files)
        k = elbow(factors_df)
        print(f"Optimal k: {k}")

        condition_embedding = embedder_instance.create_embedding_bert(condition)
        underlying_factors = cluster(factors_df, k, condition_embedding, plot_results=False) # change to true to see cluster
        conditions_to_factors[condition] = underlying_factors
        # remove this to go through every single condition. THIS WILL TAKE A WHILE



Condition: abdominal pain
Associated files: /content/drive/MyDrive/training_20180910/100039.txt, /content/drive/MyDrive/training_20180910/101665.txt, /content/drive/MyDrive/training_20180910/103722.txt, /content/drive/MyDrive/training_20180910/106939.txt, /content/drive/MyDrive/training_20180910/108809.txt, /content/drive/MyDrive/training_20180910/112832.txt, /content/drive/MyDrive/training_20180910/115143.txt, /content/drive/MyDrive/training_20180910/161477.txt, /content/drive/MyDrive/training_20180910/168915.txt, /content/drive/MyDrive/training_20180910/178143.txt, /content/drive/MyDrive/training_20180910/184696.txt, /content/drive/MyDrive/training_20180910/190444.txt
Optimal k: 4
The cluster closest to the condition is cluster 2.
- Reportedly was having diarrhea at Group Home, though pt states he has not had bowel movement in 4 days, no hematochezia.
- Patient was in her USOH until Sunday evening when she developed abdominal pain, non-bloody vomiting, and nausea.
- She last reports

In [29]:
import json
import pandas as pd

def serializer(obj):
    """Serialize non-default datatypes (like Pandas Series) to JSON."""
    if isinstance(obj, pd.Series):
        return obj.to_dict()
    raise TypeError(f'Object of type {obj.__class__.__name__} is not JSON serializable')

# Write the condition -> underlying factors mapping to disk
with open('results.json', 'w', encoding='utf-8') as f:
    json.dump(conditions_to_factors, f, ensure_ascii=False, indent=4, default=serializer)


#REPORT:

This is a first pass at a solution to the problem of analyzing underlying factors for associated medical conditions.

I understood this problem to be a data analysis problem (as opposed to a predictive problem in which I would have been tasked with training a model). The task then was to find a way to parse out the relationship between conditions that recur across multiple files in the dataset and underlying factors within those files.

To get the conditions, I parsed the text within the Chief Complaint and Discharge Diagnosis sections. I split the text by newlines and commas, and treated each resulting piece as a condition.

To get the underlying factors, I parsed the text within the Major Surgical or Invasive Procedure, History of Present Illness (HPI), Past Medical History (PMHx), and Brief Hospital Course sections and split the text into sentences. I treated each sentence as capturing an underlying factor.
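
As a small illustration of the two splitting strategies, here is a minimal sketch. The sample strings are made up, the function names are hypothetical, and the sentence pattern is a simplified stand-in (it requires whitespace after the period or question mark), not the exact regex used in the data processing cell.

```python
import re

def split_conditions(text):
    """Split a short section (e.g. Chief Complaint) on commas and newlines."""
    parts = re.split(r'[\n,]', text)
    return [p.strip() for p in parts if p.strip()]

def split_sentences(text):
    """Split narrative text after '.' or '?' followed by whitespace (simplified)."""
    parts = re.split(r'(?<=[.?])\s+', text)
    return [p.replace("\n", " ").strip() for p in parts if p.strip()]

# Made-up sample text, for illustration only
conditions = split_conditions("Abdominal pain, nausea\nHypertension")
factors = split_sentences("Patient reports nausea. No fever?\nDenies chest pain.")
print(conditions)
print(factors)
```

Each item in `conditions` would be tagged as a condition and each item in `factors` as an underlying factor.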

Then, I embedded every sentence using a clinical embeddings model (Bio_ClinicalBERT), mean-pooling the token vectors, and stored those in a dataframe. Next, I identified conditions that appear in more than two files, aggregated all the underlying factors from those files, ran k-means to cluster the factors, and determined which cluster centroid was closest to the condition's embedding. The sentences in that cluster that sit nearest its centroid represent my attempt at connecting conditions to their underlying factors.
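
The cluster-selection step can be sketched on toy data as follows. This is only an illustration under stated assumptions: the 2-D vectors and hand-assigned labels stand in for real embeddings and k-means output, and `closest_cluster_members` is a hypothetical helper name, not a function from the notebook.

```python
import numpy as np

def closest_cluster_members(embeddings, labels, condition_embedding, top_n=10):
    """Return the id of the cluster whose centroid is nearest the condition,
    plus that cluster's row indices ranked by distance to its own centroid."""
    cluster_ids = np.unique(labels)
    centroids = np.array([embeddings[labels == c].mean(axis=0) for c in cluster_ids])

    # Euclidean distance from each centroid to the condition embedding
    nearest_pos = np.argmin(np.linalg.norm(centroids - condition_embedding, axis=1))
    nearest = cluster_ids[nearest_pos]

    # Rank the members of that cluster by distance to the cluster centroid
    member_idx = np.flatnonzero(labels == nearest)
    dists = np.linalg.norm(embeddings[member_idx] - centroids[nearest_pos], axis=1)
    return int(nearest), member_idx[np.argsort(dists)][:top_n]

# Toy 2-D "embeddings" with hand-assigned cluster labels (assumed, for illustration)
toy_embeddings = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.0, 5.4], [5.0, 5.1]])
toy_labels = np.array([0, 0, 1, 1, 1])
toy_condition = np.array([4.5, 5.0])

cluster_id, ranked = closest_cluster_members(toy_embeddings, toy_labels, toy_condition)
print(cluster_id, ranked.tolist())
```

Note that members are ranked by distance to the centroid rather than to the condition itself, which mirrors what the clustering cell does.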

#ANALYSIS:

To be honest, I wasn't very happy with the results. I thought clustering the embeddings would be a creative approach to the problem, but the clinical embeddings model I used did not consistently represent different medical concepts in meaningfully distinct ways. The clusters never looked cleanly separated, and a visual check of the results was inconsistent.

One of the reasons I approached the problem using embeddings was that I thought using NER tagging to reduce the free-text sections to medical keywords would actually obscure a lot. "No evidence of heart disease" != "heart disease", but plain NER would reduce both to the same thing. You could add negation detection, but that is more complicated.
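
To make the negation problem concrete, here is a rough sketch of a trigger-based negation check. The trigger list and the `is_negated` helper are hypothetical and deliberately tiny; a real system would need a much larger lexicon and would have to handle scope, which is part of why this is more complicated than plain NER.

```python
import re

# Hypothetical, deliberately short trigger list (an assumption for illustration)
NEGATION_TRIGGERS = ["no evidence of", "no", "denies", "without", "negative for"]

def is_negated(sentence, concept):
    """Return True if a negation trigger appears shortly before the concept."""
    s = sentence.lower()
    pos = s.find(concept.lower())
    if pos == -1:
        return False
    # Look only at the few words immediately preceding the concept
    window = " ".join(s[:pos].split()[-4:])
    return any(re.search(rf"\b{re.escape(t)}\b", window) for t in NEGATION_TRIGGERS)

print(is_negated("No evidence of heart disease.", "heart disease"))
print(is_negated("History of heart disease.", "heart disease"))
```

The first sentence is flagged as negated and the second is not, even though a keyword tagger would extract "heart disease" from both.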

Nonetheless, some of the results do look good with a gut check.

Here is an example of a "good" result from the embeddings approach.


**Condition: coronary artery disease**

**Underlying Factors**.
- Hyperlipidemia #.
- Hyperlipidemia.
- Hypertension.
- Hypertension #.
- Hyperparathyroidism #.
- Cataracts bilaterally - s/p surgery for both 11.
- Hypertension.
- Osteoarthritis #.
- Steroid induced hyperglycemia 6.
- Peptic ulcer disease.

Here is what GPT-4 said about the relevance of these as underlying factors:

**Related Factors**:
*Hyperlipidemia*: Yes, it's a significant risk factor for CAD.

*Hypertension*: Absolutely, chronic high blood pressure can damage arteries, making them more susceptible to the accumulation of plaques.

*Hyperparathyroidism*: It can be related indirectly. Hyperparathyroidism can lead to hypercalcemia, which has been associated with heart disease in some studies, potentially through the calcification of the arterial wall.

*Steroid-induced hyperglycemia*: Steroids can induce hyperglycemia, and chronic high blood sugar levels can lead to damage to the arteries, potentially contributing to CAD.

*Peptic ulcer disease*: While not directly related to CAD, it's worth noting that some risk factors (like the use of certain anti-inflammatory medications or stress) and lifestyle aspects might overlap between peptic ulcer disease and CAD.

**Less Directly Related or Unrelated Factors**:
*Cataracts bilaterally - s/p surgery for both 11*: This doesn’t seem directly related to CAD, although both conditions might coexist in older populations.

*Osteoarthritis*: While osteoarthritis itself is not a direct risk factor for CAD, some lifestyle limitations due to osteoarthritis (like reduced physical activity) or medications used might have indirect relationships.

So that's not bad. But let's go through another example that is less clear and shows why this is a tough problem:

**Condition: gastroesophageal reflux disease**

**Underlying Factors**.
- On postop day 4, she developed ecchymosis around incision with extension to right hip and lower abdomen.
- It was noted that she had a very thin posterior fascia and peritoneum.
- KUB revealed persistenly dilated loops of small bowel and air fluid levels.
- Liver duplex was done twice noting a small amount of perihepatic fluid, no intra or extra-hepatic bile duct dilatation.
- 9.   A KUB was done showing distended loops of small bowel.
- She also started to drain sero-sanguinous fluid from her incision.
- Abd CT ([**4-25**])revealed distended small bowel with decompressed distal small bowel and relatively decompressed colon consistent with small bowel obstruction.
- KUBs continued to show distended loops with air fluid levels despite BMs.  Nausea/vomiting and diarrhea continued.
- Postop diagnosis was small bowel obstruction.
- Catheterization revealed diffuse left anterior descending narrowing with new narrowing at branching of ostium which is not amenable to stenting without stenting over circumflex.

**Here is what GPT-4 had to say about the relevance of these as underlying factors:**

Ecchymosis around incision: Not a direct underlying factor for GERD but indicates a postoperative condition.

Thin posterior fascia and peritoneum: Not a direct factor for GERD but could be relevant in surgical considerations.

Persistently dilated loops of small bowel and air fluid levels: Indicates a gastrointestinal issue like obstruction, which is not a direct cause of GERD but could be related to overall gastrointestinal health.

Liver duplex findings: Without evidence of liver disease or bile reflux, this might not be a direct factor for GERD.

KUB showing distended loops of small bowel: Indicates a gastrointestinal issue, potentially obstruction or ileus, which is not a direct cause of GERD.

Draining sero-sanguinous fluid from incision: Indicates a postoperative issue, not directly related to GERD.

Abdominal CT findings of small bowel obstruction: While a significant gastrointestinal issue, it’s not a direct cause of GERD.

KUB findings and small bowel obstruction: Indicates a significant gastrointestinal issue, not a direct cause of GERD.

Catheterization findings: Related to coronary or heart issues and isn’t a direct cause of GERD.

Analysis:
Items 3, 5, 7, and 8 all relate to issues with the small bowel, potentially obstruction, which can impact gastrointestinal function and might indirectly relate to GERD if they impact gastric emptying or if they are related to a systemic issue affecting gastrointestinal motility. However, they are not direct causes or underlying factors for GERD.

Item 4 (Liver duplex) might be relevant if there were findings related to bile reflux or if liver disease were present, which could impact gastrointestinal function and potentially be related to GERD, but the provided note does not indicate this.

Item 9 (Catheterization findings) is not directly related to GERD but is crucial for understanding overall health and risk, especially if the patient were to undergo surgical intervention for GERD in the future.

Items 1, 2, and 6 are not directly related to GERD but are relevant for understanding the patient’s postoperative status and potential risk factors for other complications.

**As you can see, the relationship here is more questionable.** Partly, it's hard to interpret because the free-text format of the underlying factors (as opposed to NER-tagged terms) makes them tough to understand. But the relationship itself is also much more ambiguous.

#IMPROVEMENTS:

If I had more time to improve this, I would try running NER on all the underlying factors first, then embedding the extracted entities, and see how the analysis changed.

On the code side, the k-means clustering takes a long time (the elbow search fits up to 20 models per condition), so I would try to speed that up or find a different approach. Maybe just taking the top embeddings among each condition's underlying factors, aggregated over a few files, would be sufficient.

I think a data analysis approach is probably not the best way to go about this. It would be interesting to turn everything into NER-tagged items and then build something actually predictive, so that testing could be done more robustly.
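
A minimal sketch of that cheaper alternative, assuming "top embeddings" means the factor vectors with the highest cosine similarity to the condition embedding. The toy vectors and the `top_k_by_cosine` name are illustrative, and the sketch assumes no embedding is the zero vector.

```python
import numpy as np

def top_k_by_cosine(condition_vec, factor_matrix, k=5):
    """Return indices and scores of the k factor rows most similar to the condition."""
    # Normalize so that a dot product equals cosine similarity
    cond = condition_vec / np.linalg.norm(condition_vec)
    factors = factor_matrix / np.linalg.norm(factor_matrix, axis=1, keepdims=True)
    scores = factors @ cond
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy 2-D vectors standing in for sentence embeddings
toy_condition = np.array([1.0, 0.0])
toy_factors = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])

idx, sims = top_k_by_cosine(toy_condition, toy_factors, k=2)
print(idx.tolist(), sims.round(3).tolist())
```

This is a single matrix product per condition, so it avoids fitting any clustering model, at the cost of losing the grouping of factors that clustering provides.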