<a href="https://colab.research.google.com/github/neelsoumya/python_machine_learning/blob/main/EHR_data_unsupervised_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A notebook for teaching unsupervised machine learning with electronic health record (EHR) data.

In [1]:
pip install pandas scikit-learn seaborn matplotlib openml


## 🧠 Teaching Points

### ✅ Concepts to cover:
- **Preprocessing**: Dropping NAs, scaling
- **Dimensionality reduction**: PCA + t-SNE
- **Clustering**: k-Means, DBSCAN, hierarchical clustering
- **Clinical interpretation**: What does each cluster represent?
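
For the clinical-interpretation step, one common approach is to summarise each cluster by its feature means in the original units. The sketch below uses made-up toy data (the two patient groups and their values are hypothetical, not taken from the diabetes dataset) and is only meant to show the shape of such a profile table.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two invented patient groups (short vs. long stays).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "time_in_hospital": np.concatenate([rng.normal(2, 0.5, 50), rng.normal(10, 0.5, 50)]),
    "num_medications": np.concatenate([rng.normal(5, 1, 50), rng.normal(25, 1, 50)]),
})

# Cluster on standardized features so both columns carry equal weight.
X = StandardScaler().fit_transform(toy)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Report per-cluster means in the original (unscaled) units for readability.
profile = toy.assign(cluster=labels).groupby("cluster").mean()
print(profile.round(1))
```

Reading the rows of `profile` side by side is usually the starting point for asking what kind of patient each cluster might represent.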

---

### 🔬 Exploratory Ideas:
- Color by **readmission** or **gender**
- Add **medication** or **diagnosis** categories
- Compare **k-means** vs **DBSCAN**
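
A minimal sketch of the k-means vs. DBSCAN comparison, assuming synthetic two-moons data rather than the EHR table. The `eps` and `min_samples` values are illustrative choices for this toy set and would need retuning on real data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic, non-convex clusters: a case where the two algorithms differ.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)  # -1 marks noise

# Adjusted Rand index: 1.0 means perfect agreement with the known grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
n_noise = int((dbscan_labels == -1).sum())

print(f"k-means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f} ({n_noise} points labelled as noise)")
```

On real EHR data there is no ground-truth grouping, so the comparison would rest on internal measures and on whether the clusters make clinical sense.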

---

### 📝 Optional Exercises for Students
- Try different `n_clusters` in **k-means** — what’s the best number?
- Use **DBSCAN** or **AgglomerativeClustering** instead.
- Visualize the **PCA plot** alongside **t-SNE** — which is more informative?
- Explore correlations with **hospital readmission** or **comorbidities**.
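
For the `n_clusters` exercise, one possible approach is to compare silhouette scores across candidate values. This is a sketch on synthetic blobs with four known centres (an assumption made so the answer is checkable); on the EHR data the curve is likely to be much flatter.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated groups (hypothetical centres).
centers = [[-5, -5], [-5, 5], [5, -5], [5, 5]]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=0.5, random_state=0)

# Fit k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette:", best_k)
```

The silhouette score is only one heuristic; the elbow of the inertia curve or domain knowledge may point to a different choice.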


In [3]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import openml

# Step 1: Load dataset from OpenML
# ASSUMPTION: 4541 is taken to be the OpenML ID of "Diabetes130US"
# (Diabetes 130-US hospitals). Confirm the ID on openml.org before relying on it.
diabetes_data = openml.datasets.get_dataset(4541)
df, *_ = diabetes_data.get_data()

print(df.head())

# Step 2: Select numeric + relevant features for clustering
cols = ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications',
        'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']

# Fail early with a clear message if the download is not the expected table
missing = [c for c in cols if c not in df.columns]
if missing:
    raise KeyError(f"Dataset is missing expected columns: {missing}")
data = df[cols].dropna()

# Step 3: Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data)

# Step 4: Dimensionality reduction
# PCA cannot return more components than input features (8 here), so keep 5.
pca = PCA(n_components=5, random_state=42).fit_transform(X_scaled)
# t-SNE is slow on large tables; subsample rows first if this step takes too long.
tsne = TSNE(n_components=2, perplexity=30, learning_rate=200, random_state=42).fit_transform(pca)

# Step 5: Clustering (n_init set explicitly so results match across scikit-learn versions)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
clusters = kmeans.fit_predict(pca)

# Step 6: Visualize t-SNE + clusters
plt.figure(figsize=(10, 6))
sns.scatterplot(x=tsne[:, 0], y=tsne[:, 1], hue=clusters, palette='tab10')
plt.title("t-SNE of EHR Data with K-Means Clusters")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.legend(title="Cluster")
plt.show()
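
PCA can return at most as many components as there are input features, so with eight selected columns the component count has to be eight or fewer. One way to pick it, sketched here on synthetic data with eight correlated features (the latent structure and the 90% threshold are assumptions for illustration), is to look at the cumulative explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 observed features driven by 3 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 8))
X_scaled = StandardScaler().fit_transform(X)

# With n_components unset, PCA keeps min(n_samples, n_features) = 8 components.
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 90%.
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)
print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 90% variance:", n_keep)

# Refit with the chosen size to get the reduced representation.
reduced = PCA(n_components=n_keep).fit_transform(X_scaled)
```

The same check can be run on the scaled EHR features before deciding how many components to pass on to t-SNE and k-means.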