# **1. Data Preprocessing**

**Objective**: Load, clean, and structure the Common Data Elements (CDE) JSON data into a tabular format suitable for embedding and analysis.

**Input**:  
- JSON file (`SearchExport.json`) with nested CDE metadata. Download URL: https://cde.nlm.nih.gov/cde/search

**Output**:  
- A flattened CSV file (`NLM_CDE_JSON2DF.csv`) containing:
  - `tinyId`
  - `designation`
  - `definition`
  - `permissible values`
  - `stewardOrg`


In [None]:
import json
import pandas as pd

# Load JSON data
with open("CDE_data/SearchExport.json", "r", encoding="utf-8", errors="ignore") as file:
    data = json.load(file)

# Parse JSON into rows
df_list = []
for index, item in enumerate(data):
    tinyId = item["tinyId"]
    stewardOrg = item["stewardOrg"]["name"]

    # Extract the preferred designation(s); fall back to the first designation if none is tagged
    designations = [
        d["designation"]
        for d in item["designations"]
        if "Preferred Question Text" in d.get("tags", [])
    ]
    if not designations:
        designations = [item["designations"][0]["designation"]]

    # Use the first definition, or a placeholder when none exists
    definitions = item.get("definitions") or []
    definition = definitions[0]["definition"] if definitions else "No definition available"

    # Format permissible values as "value: meaning" pairs; tolerate records without a value list
    permissible_values = [
        f"{v['permissibleValue']}: {v.get('valueMeaningName', v.get('valueMeaningDefinition', ''))}"
        for v in item.get("valueDomain", {}).get("permissibleValues", [])
        if "permissibleValue" in v
    ]
    permissible_values_str = ", ".join(permissible_values) if permissible_values else "No permissible values"

    for designation in designations:
        df_list.append({
            "index": index,
            "tinyId": tinyId,
            "designation": designation,
            "definition": definition,
            "permissible values": permissible_values_str,
            "stewardOrg": stewardOrg
        })

# Save DataFrame
df = pd.DataFrame(df_list)
df.to_csv('NLM_CDE_JSON2DF.csv', index=False)
print("Data has been successfully saved to 'NLM_CDE_JSON2DF.csv'")
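
To make the flattening rules concrete, the sketch below applies the same logic to a single record. The record is hypothetical: its field names follow what the parsing loop reads, but every value is made up, and `flatten_record` is an illustrative helper rather than part of the pipeline.

```python
# Hypothetical record shaped like the fields the parsing loop reads; all values are invented.
sample = {
    "tinyId": "abc123",
    "stewardOrg": {"name": "Example Org"},
    "designations": [
        {"designation": "Subject age", "tags": []},
        {"designation": "What is your age?", "tags": ["Preferred Question Text"]},
    ],
    "definitions": [],
    "valueDomain": {"permissibleValues": [
        {"permissibleValue": "1", "valueMeaningName": "Yes"},
        {"permissibleValue": "0", "valueMeaningDefinition": "Negative response"},
        {"valueMeaningName": "Entry without a code"},  # skipped: no permissibleValue key
    ]},
}


def flatten_record(item):
    """Return one row per preferred designation, following the rules of the parsing loop."""
    designations = [
        d["designation"]
        for d in item["designations"]
        if "Preferred Question Text" in d.get("tags", [])
    ] or [item["designations"][0]["designation"]]

    definitions = item.get("definitions") or []
    definition = definitions[0]["definition"] if definitions else "No definition available"

    values = [
        f"{v['permissibleValue']}: {v.get('valueMeaningName', v.get('valueMeaningDefinition', ''))}"
        for v in item.get("valueDomain", {}).get("permissibleValues", [])
        if "permissibleValue" in v
    ]
    return [
        {
            "tinyId": item["tinyId"],
            "designation": designation,
            "definition": definition,
            "permissible values": ", ".join(values) if values else "No permissible values",
            "stewardOrg": item["stewardOrg"]["name"],
        }
        for designation in designations
    ]


rows = flatten_record(sample)
print(rows)
```

Only the designation tagged `Preferred Question Text` produces a row, the empty `definitions` list yields the placeholder text, and the entry without a `permissibleValue` key is dropped from the joined string.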


# **2. Embedding Generation**

**Objective**: Convert the combined text of each row (the `combined` column) into a numerical vector using OpenAI's `text-embedding-3-small` embedding model.

**Input**:
- Cleaned DataFrame (`NLM_CDE_JSON2DF.csv`)

**Output**:
- A new column `embedding` in the DataFrame, storing 1D embedding vectors for each row.


In [None]:
import openai
import pandas as pd

# Load processed data
df = pd.read_csv("/content/drive/My Drive/CDE_CMDR_MK/NLM_CDE_JSON2DF.csv")

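openai.api_key = "your-openai-api-key"  # placeholder; avoid committing a real key

# The CSV written in step 1 has no `combined` column, so build it if it is missing.
# Assumption: the text to embed joins designation, definition, and permissible values.
if "combined" not in df.columns:
    text_cols = ["designation", "definition", "permissible values"]
    df["combined"] = df[text_cols].fillna("").astype(str).agg(" | ".join, axis=1)

def generate_embedding(text, model="text-embedding-3-small"):
    """Return the embedding vector for `text`, or None for missing values."""
    if not isinstance(text, str):
        if pd.isna(text):
            return None
        text = str(text)  # coerce numbers and other scalars to text before embedding
    text = text.replace("\n", " ")
    # Legacy interface (openai<1.0); newer SDK versions use client.embeddings.create instead
    response = openai.Embedding.create(input=text, model=model)
    return response['data'][0]['embedding']

df['embedding'] = df['combined'].apply(generate_embedding)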
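
Calling the API once per row issues one request per CDE. The embeddings endpoint also accepts a list of inputs, so requests can be grouped. The sketch below shows only the batching logic; `embed_in_batches`, `embed_fn`, and the batch size are assumptions for illustration, and `fake_embed` is a stand-in so the example runs without network access.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed `texts` in chunks and return one vector per input, in the original order.

    `embed_fn` is a hypothetical callable that takes a list of strings and
    returns a list of vectors in the same order (for example, a thin wrapper
    around the embeddings API call used above).
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        # Same newline cleanup as generate_embedding, applied per batch
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    """Stand-in embedder: encodes each text as [length, 0.0] so the sketch is runnable offline."""
    return [[float(len(text)), 0.0] for text in batch]


vectors = embed_in_batches(["alpha", "beta\ngamma", "x"], fake_embed, batch_size=2)
print(vectors)
```

Three inputs with `batch_size=2` produce two calls to `embed_fn`, and the returned list still lines up with the input order, so it can be assigned straight back to a DataFrame column.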


# **3. Clustering**

**Objective**: Group semantically similar CDE entries using HDBSCAN based on cosine similarity of embeddings.

**Input**:
- Embedding vectors (`embedding` column)

**Output**:
- Cluster labels assigned to each row
- Evaluation metrics for each `min_cluster_size` setting:
  - Silhouette Score
  - Dunn Index
  - Davies-Bouldin Index

In [None]:
import hdbscan
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.metrics.pairwise import cosine_distances
from scipy.spatial.distance import cdist

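def compute_dunn_index(embeddings, labels):
    """Dunn index variant: smallest centroid-to-centroid distance divided by the largest cluster diameter."""
    unique_clusters = np.unique(labels[labels != -1])
    if len(unique_clusters) < 2:
        return None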
    centroids = np.array([embeddings[labels == c].mean(axis=0) for c in unique_clusters])
    inter_cluster = cdist(centroids, centroids)
    np.fill_diagonal(inter_cluster, np.inf)
    min_inter = inter_cluster.min()
    max_intra = max(np.max(cdist(embeddings[labels == c], embeddings[labels == c])) for c in unique_clusters)
    return min_inter / max_intra if max_intra > 0 else None

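# Prepare embeddings and the pairwise cosine-distance matrix
embeddings = np.array(df['embedding'].tolist())
cos_dist = cosine_distances(embeddings).astype(np.float64)

min_cluster_sizes = [5, 10, 15, 20, 25, 50, 75, 100, 250]

for size in min_cluster_sizes:
    # metric="precomputed" tells HDBSCAN the input is a distance matrix, not feature vectors
    clusterer = hdbscan.HDBSCAN(min_cluster_size=size, metric="precomputed")
    labels = clusterer.fit_predict(cos_dist)
    valid = labels != -1  # exclude noise points (label -1) from scoring
    # The scores below are only defined for two or more clusters
    if len(np.unique(labels[valid])) >= 2:
        silhouette = silhouette_score(embeddings[valid], labels[valid])
        dunn = compute_dunn_index(embeddings[valid], labels[valid])
        db = davies_bouldin_score(embeddings[valid], labels[valid])
        dunn_str = f"{dunn:.3f}" if dunn is not None else "n/a"
        print(f"min_cluster_size={size} | Silhouette: {silhouette:.3f}, Dunn: {dunn_str}, DBI: {db:.3f}")
    else:
        print(f"min_cluster_size={size} | Not enough valid clusters")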
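
As a quick check of the Dunn index used above, the sketch below recomputes it on a toy 2D dataset where the answer can be worked out by hand. It follows the same centroid-based variant (smallest distance between cluster centroids over the largest within-cluster diameter) but uses NumPy only; the function names and data points are illustrative assumptions.

```python
import numpy as np


def pairwise_euclidean(a, b):
    """Euclidean distance between every row of `a` and every row of `b`."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def dunn_index_centroid(points, labels):
    """Centroid-based Dunn index; noise points (label -1) are ignored."""
    clusters = np.unique(labels[labels != -1])
    if len(clusters) < 2:
        return None
    centroids = np.array([points[labels == c].mean(axis=0) for c in clusters])
    inter = pairwise_euclidean(centroids, centroids)
    np.fill_diagonal(inter, np.inf)  # ignore each centroid's distance to itself
    max_intra = max(
        pairwise_euclidean(points[labels == c], points[labels == c]).max()
        for c in clusters
    )
    return inter.min() / max_intra if max_intra > 0 else None


# Two tight clusters 4 units apart, each with diameter 1, plus one noise point
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1, -1])

dunn = dunn_index_centroid(points, labels)
print(dunn)
```

The centroids sit at (0, 0.5) and (4, 0.5), so the smallest inter-cluster distance is 4, the largest diameter is 1, and the index is 4. Higher values indicate clusters that are compact relative to their separation.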


# **4. Cluster Labeling**

**Objective**: Assign human-readable labels to each cluster by summarizing representative cluster content with a language model.

**Input**:
- Grouped text for each cluster
- OpenAI's GPT model (`gpt-3.5-turbo`)

**Output**:
- A `Cluster Name` column with a concise LLM-generated label for each cluster

In [None]:
def generate_cluster_prompt(cluster):
    combined_text = "\n".join(cluster['combined'])
    return f"Generate a meaningful name for this cluster based on the following combined text:\n\n{combined_text}\n\nProvide a concise cluster name."

def generate_cluster_name(cluster_label, cluster_data):
    prompt = generate_cluster_prompt(cluster_data)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100
    )
    return response['choices'][0]['message']['content'].strip()

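# `clusterer` is the model left over from the last loop iteration above (min_cluster_size=250);
# refit with the preferred setting first if a different one scored better.
df["Cluster Label"] = clusterer.labels_
# Name only real clusters; HDBSCAN noise points (label -1) get a fixed placeholder name (an assumption)
cluster_names = {label: generate_cluster_name(label, group) for label, group in df.groupby("Cluster Label") if label != -1}
cluster_names[-1] = "Noise"
df["Cluster Name"] = df["Cluster Label"].map(cluster_names)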

df.to_csv("CDE_NLM_Cluster_Names.csv", index=False)
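
Joining every row of a large cluster into one prompt can exceed the model's context window. One possible way to keep the prompt bounded is to cap both the number of entries and the total character count, as sketched below. The helper name and the limits are assumptions, and taking the first entries is only the simplest choice; a random sample or the rows closest to the cluster centroid would be alternatives.

```python
def build_bounded_prompt(texts, max_items=20, max_chars=4000):
    """Build a cluster-naming prompt from at most `max_items` entries and about `max_chars` characters."""
    selected = []
    used = 0
    for text in texts[:max_items]:
        if used + len(text) > max_chars:
            break  # stop before the character budget is exceeded
        selected.append(text)
        used += len(text)
    combined_text = "\n".join(selected)
    return (
        "Generate a meaningful name for this cluster based on the following combined text:\n\n"
        f"{combined_text}\n\n"
        "Provide a concise cluster name."
    )


# Illustrative input: 50 short entries, of which only the first 3 are kept
texts = [f"entry {i}" for i in range(50)]
prompt = build_bounded_prompt(texts, max_items=3)
print(prompt)
```

The returned string has the same wording as the prompt built in `generate_cluster_prompt`, so it could be passed to the chat call unchanged, for example with `list(cluster['combined'])` as the input.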


# **5. Classification Model**

**Objective**: Train a classifier to predict cluster labels from embeddings, verifying the quality of unsupervised clusters.

**Input**:
- Embeddings (`embedding`)
- Cluster Names (`Cluster Name` as labels)

**Output**:
- Classification performance (Accuracy and Classification Report)


In [None]:
import ast
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load data
df = pd.read_csv('CDE_NLM_Complete.csv')

# Embeddings come back from the CSV as strings, so parse each one into a list of floats
X = np.array(df['embedding'].apply(ast.literal_eval).tolist())
y = df['Cluster Name'].values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
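
One detail that is easy to miss in this step: a list stored in a CSV cell is read back as text, not as a list of floats. The sketch below illustrates the round trip with the standard library `csv` module and made-up values; `pd.read_csv` likewise returns such a column as strings, which is why the embeddings need to be parsed before they are stacked into a feature matrix.

```python
import ast
import csv
import io

# Write one row whose `embedding` cell holds a Python list (values are made up)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["tinyId", "embedding"])
writer.writerow(["abc123", [0.1, -0.2, 0.3]])  # the list is stored as its string form

# Read it back: the cell is now plain text
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["embedding"]
print(type(raw).__name__, raw)

# ast.literal_eval safely turns the text back into a list of floats
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, parsed)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of parsing.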
