
## Step 1 — Load the Preprocessed Dataset

We begin Phase 3 by loading the preprocessed dataset generated in Phase 2.  
Its features are already cleaned and numerically encoded, so it is ready for K-Means clustering.
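
Before clustering, it can help to confirm that the data really is clean and encoded. The following is a minimal sketch rather than part of the original pipeline: it rebuilds the first five rows printed by the loading cell as an inline CSV so it runs without the file, and the helper name `summarize_readiness` is hypothetical.

```python
import io

import pandas as pd

# First five rows of the preprocessed data, copied from the notebook preview.
SAMPLE_CSV = """patient_id,age,gender,pack_years,radon_exposure,asbestos_exposure,secondhand_smoke_exposure,copd_diagnosis,alcohol_consumption,family_history,lung_cancer
100000,2,1,0.660248,2,0,0,1,1,0,No
100001,0,0,0.127785,2,0,1,1,1,1,Yes
100002,2,0,0.004055,1,1,1,1,0,0,Yes
100003,2,0,0.44064,0,0,1,0,1,0,Yes
100004,0,0,0.444313,1,1,0,1,0,1,Yes
"""


def summarize_readiness(frame):
    """Hypothetical helper: count missing values and list non-numeric columns."""
    missing = int(frame.isna().sum().sum())
    non_numeric = frame.select_dtypes(exclude="number").columns.tolist()
    return missing, non_numeric


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing, non_numeric = summarize_readiness(sample)

print("Missing values:", missing)
print("Non-numeric columns:", non_numeric)
```

On the full dataset the same helper could be called as `summarize_readiness(df)`. Any non-numeric column it reports has to be dropped or encoded before K-Means is applied.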


In [None]:
import pandas as pd

df = pd.read_csv("lung_cancer_preprocessed.csv")
print(df.shape)
df.head()



(50000, 11)


Unnamed: 0,patient_id,age,gender,pack_years,radon_exposure,asbestos_exposure,secondhand_smoke_exposure,copd_diagnosis,alcohol_consumption,family_history,lung_cancer
0,100000,2,1,0.660248,2,0,0,1,1,0,No
1,100001,0,0,0.127785,2,0,1,1,1,1,Yes
2,100002,2,0,0.004055,1,1,1,1,0,0,Yes
3,100003,2,0,0.44064,0,0,1,0,1,0,Yes
4,100004,0,0,0.444313,1,1,0,1,0,1,Yes


## Phase 3 – Part 2: K-Means Clustering

In this part, we apply K-Means clustering to discover natural groups in the preprocessed lung cancer dataset.  
We will:

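1. Prepare the feature matrix by removing the patient ID (an identifier that carries no information) and the class label (clustering is unsupervised, so the label must not influence the groups).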
2. Standardize all numeric attributes so that they are on a comparable scale.
3. Apply K-Means with three different values of K (K = 2, 3, 4).
4. Evaluate and compare the clustering results using:
   - Inertia (within-cluster sum of squares, WCSS), compared across K with the Elbow method
   - Silhouette coefficient
5. Interpret the clusters and discuss what they reveal about patient risk profiles.
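
Steps 2 to 4 can be sketched as a single loop. This is an illustrative sketch, not the notebook's own code: it uses synthetic data from `make_blobs` as a stand-in for the patient features (an assumption, chosen so the block runs on its own), and it sets `n_init=10` explicitly so the result does not depend on the scikit-learn version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: 9 numeric features).
X_demo, _ = make_blobs(n_samples=600, n_features=9, centers=3, random_state=42)

# Standardize so every feature contributes on a comparable scale.
X_demo_scaled = StandardScaler().fit_transform(X_demo)

results = {}
for k in (2, 3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = model.fit_predict(X_demo_scaled)
    results[k] = {
        "inertia": model.inertia_,  # within-cluster sum of squares
        "silhouette": silhouette_score(X_demo_scaled, labels),
    }

for k, scores in results.items():
    print(f"K={k}: inertia={scores['inertia']:.1f}, silhouette={scores['silhouette']:.3f}")
```

Inertia generally falls as K grows, so it is read for the point where the drop flattens (the elbow), while a higher silhouette indicates better-separated clusters.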


In [12]:
import pandas as pd

# 1) Prepare feature matrix for K-Means
# Drop the ID and the class label: neither should influence the clustering
cols_to_drop = ['patient_id', 'lung_cancer']  # adjust if your column names differ
feature_cols = [c for c in df.columns if c not in cols_to_drop]

X = df[feature_cols]

print("Feature columns used for clustering:")
print(feature_cols)

print("\nShape of X:", X.shape)
print("\nData types:")
print(X.dtypes)

print("\nFirst 5 rows of features:")
X.head()


Feature columns used for clustering:
['age', 'gender', 'pack_years', 'radon_exposure', 'asbestos_exposure', 'secondhand_smoke_exposure', 'copd_diagnosis', 'alcohol_consumption', 'family_history']

Shape of X: (50000, 9)

Data types:
age                            int64
gender                         int64
pack_years                   float64
radon_exposure                 int64
asbestos_exposure              int64
secondhand_smoke_exposure      int64
copd_diagnosis                 int64
alcohol_consumption            int64
family_history                 int64
dtype: object

First 5 rows of features:


Unnamed: 0,age,gender,pack_years,radon_exposure,asbestos_exposure,secondhand_smoke_exposure,copd_diagnosis,alcohol_consumption,family_history
0,2,1,0.660248,2,0,0,1,1,0
1,0,0,0.127785,2,0,1,1,1,1
2,2,0,0.004055,1,1,1,1,0,0
3,2,0,0.44064,0,0,1,0,1,0
4,0,0,0.444313,1,1,0,1,0,1


In [13]:
# ---------------------------------------------------------
# Step 2: Scale the features before applying K-Means
# K-Means is sensitive to feature scale, so we standardize.
# ---------------------------------------------------------

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame for easy display
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

print("Shape after scaling:", X_scaled_df.shape)
X_scaled_df.head()


Shape after scaling: (50000, 9)


Unnamed: 0,age,gender,pack_years,radon_exposure,asbestos_exposure,secondhand_smoke_exposure,copd_diagnosis,alcohol_consumption,family_history
0,1.042532,1.003406,0.556701,1.226026,-0.993978,-1.002684,0.997363,-0.002446,-0.99932
1,-1.38275,-0.996606,-1.288746,1.226026,-0.993978,0.997324,0.997363,-0.002446,1.00068
2,1.042532,-0.996606,-1.717576,0.001053,1.006058,0.997324,0.997363,-1.225532,-0.99932
3,1.042532,-0.996606,-0.204431,-1.223919,-0.993978,0.997324,-1.002643,-0.002446,-0.99932
4,-1.38275,-0.996606,-0.191703,0.001053,1.006058,-1.002684,0.997363,-1.225532,1.00068
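
Standardization replaces each value x with z = (x - mean) / std, computed per column. The sketch below is a small check of that behaviour, assuming three of the feature columns and the five preview rows shown earlier; it confirms that `StandardScaler` output has mean 0 and standard deviation 1 per column and matches the manual formula.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative frame (values taken from the first five preview rows).
demo = pd.DataFrame({
    "age": [2, 0, 2, 2, 0],
    "pack_years": [0.660248, 0.127785, 0.004055, 0.44064, 0.444313],
    "radon_exposure": [2, 2, 1, 0, 1],
})

scaled = StandardScaler().fit_transform(demo)

# StandardScaler divides by the population standard deviation (ddof=0).
manual = (demo - demo.mean()) / demo.std(ddof=0)

print("Column means:", scaled.mean(axis=0).round(6))
print("Column stds: ", scaled.std(axis=0).round(6))
print("Matches manual z-score:", np.allclose(scaled, manual.to_numpy()))
```

Because the scaler is fitted on only five rows here, the scaled numbers differ from the ones in the notebook output, which come from all 50,000 rows.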


In [14]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Apply K-Means with K=2.
# Fit on the scaled array X_scaled rather than X_scaled_df, so that label
# columns added below can never leak into a later fit as features.
k2 = KMeans(n_clusters=2, random_state=42)
clusters_2 = k2.fit_predict(X_scaled)

# Add cluster labels to the DataFrame (for display only)
X_scaled_df["cluster_k2"] = clusters_2

# Evaluation metrics
wcss_k2 = k2.inertia_          # within-cluster sum of squares (used for the Elbow method)
silhouette_k2 = silhouette_score(X_scaled, clusters_2)

print("K=2 Results:")
print("WCSS:", wcss_k2)
print("Silhouette Score:", silhouette_k2)

# Preview
X_scaled_df.head()


K=2 Results:
WCSS: 412338.27081007516
Silhouette Score: 0.08571893148660199


Unnamed: 0,age,gender,pack_years,radon_exposure,asbestos_exposure,secondhand_smoke_exposure,copd_diagnosis,alcohol_consumption,family_history,cluster_k2
0,1.042532,1.003406,0.556701,1.226026,-0.993978,-1.002684,0.997363,-0.002446,-0.99932,1
1,-1.38275,-0.996606,-1.288746,1.226026,-0.993978,0.997324,0.997363,-0.002446,1.00068,1
2,1.042532,-0.996606,-1.717576,0.001053,1.006058,0.997324,0.997363,-1.225532,-0.99932,0
3,1.042532,-0.996606,-0.204431,-1.223919,-0.993978,0.997324,-1.002643,-0.002446,-0.99932,0
4,-1.38275,-0.996606,-0.191703,0.001053,1.006058,-1.002684,0.997363,-1.225532,1.00068,1
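
A cluster assignment is easier to interpret through cluster sizes and per-cluster feature means in the original units than through scaled values. The sketch below shows one way to do that, and also estimates the silhouette on a random subsample via the `sample_size` parameter of `silhouette_score`, which can help on large datasets. The data is synthetic (an assumption made so the block is self-contained) and the column names are borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two binary flags and one continuous feature.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "copd_diagnosis": rng.integers(0, 2, size=2000),
    "family_history": rng.integers(0, 2, size=2000),
    "pack_years": rng.random(2000),
})

demo_scaled = StandardScaler().fit_transform(demo)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(demo_scaled)

# Cluster sizes: a very unbalanced split is worth noticing before interpreting.
sizes = pd.Series(labels).value_counts().sort_index()

# Per-cluster means in the original (unscaled) units are easier to read.
profile = demo.groupby(labels).mean()

# Silhouette estimated on a random subsample; the exact score compares
# every pair of points, which gets slow as the number of rows grows.
sil_sampled = silhouette_score(demo_scaled, labels, sample_size=500, random_state=42)

print(sizes)
print(profile)
print("Sampled silhouette:", round(float(sil_sampled), 3))
```

In the notebook, the equivalent profile would be `X.groupby(clusters_2).mean()`, since `X` holds the unscaled features. The sampled silhouette is an estimate and will differ slightly from the exact value.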
