# D212 Data Mining II Performance Assessment, Task-1

Submitted by Muhammad Ilyas, Student ID 011143032, for WGU's MSDA program

## Part I: Research Question
### A1: Proposal of Question

"Can we identify distinct patient groups based on their demographic information and medical conditions to tailor specific treatment strategies or programs for each group, ultimately aiming to improve patient outcomes and reduce readmission rates?"

This question focuses on leveraging hierarchical clustering to segment patients into meaningful groups that share similar characteristics. By identifying these clusters, hospitals can personalize treatment plans, allocate resources more efficiently, and potentially mitigate factors leading to readmissions.
### A2: Defined Goal

A reasonable goal within the scope of the scenario and the available data could be:

"Identify distinct patient clusters based on demographic details, medical conditions, and hospital interaction patterns to create targeted patient care programs aimed at reducing readmission rates and improving overall patient satisfaction."

This goal aligns with the scenario's objective of understanding patient characteristics and utilizes the available data, which includes demographic information, medical conditions, hospital interactions, and patient readmission status. The aim is to leverage hierarchical clustering to segment patients effectively, allowing hospitals to tailor interventions and care plans to specific patient groups for better outcomes.

## Part II: Technique Justification

### B1: Explanation of Clustering Technique

#### Data Assessment: 
Hierarchical clustering iteratively merges or splits clusters based on similarities/dissimilarities between data points (patients) across various attributes (demographics, medical conditions, hospital interactions).
#### Distance Calculation: 
It measures distances between data points using chosen metrics (Euclidean, Manhattan, etc.) to determine similarity.
#### Hierarchical Structure: 
The algorithm creates a dendrogram showing relationships between clusters and their members. The expected outcome includes identifying patient clusters with similar characteristics, aiding in segmenting patients into distinct groups based on shared attributes.
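
The merge process described above can be sketched on a toy example. This is a minimal illustration, not the analysis itself: the `points` array is hypothetical, and Ward linkage with Euclidean distance is assumed because it is the method used later in this report.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two obvious groups of three points each
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
    [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
])

# Ward linkage on Euclidean distances; each row of the result records one merge
# as (cluster_a, cluster_b, merge_distance, size_of_new_cluster)
merges = linkage(points, method="ward")

# Cut the hierarchy into two flat clusters
labels = fcluster(merges, t=2, criterion="maxclust")

print(merges.shape)  # six points produce five merges
print(labels)
```

The third column of `merges` is the height at which each merge appears in a dendrogram, so a large jump in that column marks a natural place to cut the tree.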

### B2: Summary of Technique Assumption
Hierarchical clustering assumes that the data points being clustered can be represented by a hierarchy. It also assumes that the distance metric used to measure similarity between data points is meaningful and accurately represents their similarities in the given context.
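
To illustrate why the distance metric has to be meaningful, the sketch below compares Euclidean distances before and after standardization. The three patients and their values are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical patients: [age in years, income in dollars]
patients = np.array([
    [25.0, 50_000.0],
    [80.0, 50_500.0],   # very different age, similar income
    [26.0, 58_000.0],   # similar age, different income
])

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# Raw units: the income column dominates the distance
raw_01 = euclidean(patients[0], patients[1])
raw_02 = euclidean(patients[0], patients[2])

# Standardized units: both columns contribute on a comparable scale
scaled = StandardScaler().fit_transform(patients)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_01:.1f}  d(0,2)={raw_02:.1f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

In raw units the 55-year age gap barely registers next to the income gap, while after scaling the two differences carry similar weight.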

### B3: Packages or Libraries List

#### scikit-learn: 
Offers hierarchical clustering through AgglomerativeClustering, cluster evaluation through silhouette_score, and preprocessing tools such as StandardScaler. The linkage matrix and dendrogram plot come from scipy.cluster.hierarchy rather than scikit-learn.
#### matplotlib/seaborn: 
For visualizing dendrograms and cluster distributions, aiding in the interpretation of cluster structures.
#### pandas: 
Essential for data manipulation and preprocessing, enabling data cleaning and transformation before clustering.

Each of these libraries supports a different part of the analysis. SciPy builds the linkage matrix and dendrogram, scikit-learn provides AgglomerativeClustering, scaling, and the silhouette metric, matplotlib/seaborn handle visualization, and pandas handles data preprocessing so the data is in a suitable format for clustering.

In [None]:
# Import necessary libraries
import pandas as pd  # For data manipulation and analysis
from pandas.api.types import CategoricalDtype  # For handling categorical data types
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For enhanced data visualization
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram  # For hierarchical clustering and dendrogram plotting
from sklearn.metrics import silhouette_score  # For evaluating clustering performance

# Read the dataset into a DataFrame
df = pd.read_csv('./medical_clean.csv', index_col=0)  # Assumes the dataset is in a CSV file named 'medical_clean.csv'
df.info()  # Displays information about the DataFrame such as column names, data types, and non-null counts

In [None]:
pd.set_option("display.max_columns", None)
df.head(5)

## Part III: Data Preparation
### C1: Data processing (goal)
A relevant preprocessing goal is handling missing values. Clustering algorithms often struggle with missing data. Imputation methods or dropping incomplete rows might be necessary to ensure the clustering algorithm can work effectively with complete data.
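
A minimal sketch of checking for and handling missing values follows. The `toy` frame is a hypothetical stand-in for a few patient columns, and the two options shown (dropping rows, median imputation) are common choices rather than the only ones.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for a few patient columns
toy = pd.DataFrame({
    "Age": [34, np.nan, 71, 58],
    "Income": [42_000.0, 38_500.0, np.nan, 61_000.0],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Count missing values per column before deciding how to handle them
missing_counts = toy.isna().sum()
print(missing_counts)

# Option 1: drop rows that are incomplete in the clustering columns
dropped = toy.dropna(subset=["Age", "Income"])

# Option 2: fill numeric gaps with the column median
imputed = toy.fillna({
    "Age": toy["Age"].median(),
    "Income": toy["Income"].median(),
})
```

Dropping rows is simplest when few rows are affected; imputation keeps the sample size but adds values that were never observed.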

### C2: Data Variables
Here are some variables from the dataset categorized as continuous or categorical:

#### Continuous: 
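Age, Income, VitD_levels, Doc_visits, Full_meals_eaten, TotalCharge, Additional_charges, Item1-Item8. (Doc_visits and Full_meals_eaten are integer counts and Item1-Item8 are survey ratings; they are treated as numeric here.)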
#### Categorical: 
City, State, County, Area, TimeZone, Job, Marital, Gender, ReAdmis, Services, Initial_admin, HighBlood, Stroke, Complication_risk, Overweight, Arthritis, Diabetes, Hyperlipidemia, BackPain, Anxiety, Allergic_rhinitis, Reflux_esophagitis, Asthma, Soft_drink.
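
One way to check a split like this programmatically is to separate columns by dtype. The sketch below uses a hypothetical four-column frame; on the real dataset, numeric codes such as a zip code would still need to be reclassified by hand.

```python
import pandas as pd

# Hypothetical frame with two numeric and two text columns
toy = pd.DataFrame({
    "Age": [34, 52, 71],
    "Income": [42_000.0, 38_500.0, 61_000.0],
    "Gender": ["Female", "Male", "Male"],
    "ReAdmis": ["No", "Yes", "No"],
})

# Numeric dtypes are candidates for distance-based clustering;
# everything else needs encoding first
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```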
### C3: Steps for Analysis
#### Handling Missing Values

In [None]:
# Dropping rows with missing values for important columns
important_cols = ['Age', 'Income', 'VitD_levels', 'Doc_visits', 'Full_meals_eaten']
cleaned_data = df.dropna(subset=important_cols)

#### Encoding Categorical Variables

In [None]:
# Using pandas get_dummies for one-hot encoding categorical variables
categorical_cols = ['Area', 'TimeZone', 'Job', 'Marital', 'Gender', 'Services']
cleaned_data = pd.get_dummies(cleaned_data, columns=categorical_cols)

#### Scaling Continuous Variables

In [None]:
from sklearn.preprocessing import StandardScaler

# Scaling continuous variables
scaler = StandardScaler()
continuous_cols = ['Age', 'Income', 'VitD_levels', 'Doc_visits', 'Full_meals_eaten']
cleaned_data[continuous_cols] = scaler.fit_transform(cleaned_data[continuous_cols])

### C4: Cleaned Dataset

In [None]:
cleaned_data.to_csv('task1_full_clean.csv', index=False)

## Part IV: Analysis
### D1: Output and Intermediate Calculations

I will use the following steps:

#### Import Libraries:
Import the necessary libraries, including matplotlib.pyplot for plotting and scipy.cluster.hierarchy for hierarchical clustering.

#### Select Numeric Columns for Clustering:
Identify and select only the numeric columns from the dataset to be used for clustering. This is a common practice when applying clustering algorithms.

#### Perform Hierarchical Clustering:
Use the Ward method to perform hierarchical clustering on the selected numeric data. The result is stored in a linkage matrix.

#### Plot the Dendrogram:
Create a dendrogram plot based on the hierarchical clustering results. The plot displays the relationships between data points or clusters. Adjustments can be made to focus on a specific number of clusters for better visualization.

The overall purpose of this code is to provide insights into the structure and relationships within the numeric data through hierarchical clustering and visualize the results using a dendrogram. The choice of parameters in the dendrogram plot can be adjusted to meet specific analysis requirements.
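
Reading the number of clusters off a dendrogram can also be approximated numerically. The sketch below is a heuristic run on hypothetical synthetic data, not the procedure used on the patient data: it looks for the largest jump between consecutive merge heights in the linkage matrix and cuts the tree there.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Hypothetical data: three well-separated blobs in two dimensions
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

linked_toy = linkage(toy, method="ward")

# Column 2 of the linkage matrix holds the merge distance (dendrogram height)
heights = linked_toy[:, 2]
gaps = np.diff(heights)

# After merge i (0-indexed) there are n_samples - (i + 1) clusters left,
# so cutting just before the largest jump gives this many clusters
n_samples = toy.shape[0]
suggested_k = n_samples - (int(np.argmax(gaps)) + 1)

# Flat cluster labels for that cut
labels = fcluster(linked_toy, t=suggested_k, criterion="maxclust")
print(suggested_k)
```

On real data the largest gap is rarely this clear-cut, so the heuristic is best used alongside a visual check of the dendrogram.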

### D2: Code Execution

In [None]:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

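# Selecting only numeric columns for clustering
# Note: this keeps every float64/int64 column, including ones that were not
# standardized above (e.g. TotalCharge, Additional_charges)
numeric_columns = cleaned_data.select_dtypes(include=['float64', 'int64']).columns
numeric_data = cleaned_data[numeric_columns]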


# Perform hierarchical clustering
linked = linkage(numeric_data, method='ward')

# Plot the dendrogram
plt.figure(figsize=(12, 8))
dendrogram(linked, truncate_mode='lastp', p=50)  # Adjust 'p' for better visualization
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index or Cluster Size')
plt.ylabel('Distance')
plt.show()

In [None]:
from sklearn.cluster import AgglomerativeClustering

# numeric_data is the cleaned, preprocessed feature matrix from the previous cell
# The number of clusters is chosen from the dendrogram above

# Initialize the clustering model with the identified number of clusters
n_clusters = 5
model = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')

# Fit the model to your data
clusters = model.fit_predict(numeric_data)


In [None]:
# Attach cluster labels to a copy of the clustered features
data_with_clusters = numeric_data.copy()
data_with_clusters['Cluster'] = clusters

# Analyze each cluster
cluster_means = data_with_clusters.groupby('Cluster').mean()
print(cluster_means)

## Part V: Data Summary and Implications
### E1: Quality of Clusters

In [None]:
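from sklearn.metrics import silhouette_score

# Score the feature matrix only: data_with_clusters also holds the 'Cluster'
# label column, which would leak the labels into the distance calculation
silhouette_avg = silhouette_score(numeric_data, clusters)
print(f"Silhouette Score: {silhouette_avg}")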

In [None]:
from sklearn.cluster import KMeans

# Fit KMeans on the same features with the same number of clusters to obtain a WCSS value
# Note: inertia_ describes this KMeans partition, not the hierarchical labels above
kmeans = KMeans(n_clusters=n_clusters)
kmeans.fit(numeric_data)
wcss = kmeans.inertia_
print(f"WCSS: {wcss}")

A silhouette score of 0.368 indicates a moderate level of separation between the clusters. This score falls within the range of -1 to 1, where values closer to 1 suggest better-defined clusters with distinct boundaries. A score around 0.368 implies that while the clusters are discernible, there might be some overlapping or ambiguity in the assignments. It's not a high separation, but it's indicative of reasonable clustering, especially considering real-world data where perfect separation might not always be achievable.
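
To put a single silhouette value in context, it helps to compare scores across several cluster counts. The sketch below does this on hypothetical synthetic data with three blobs; the data, the range of `k`, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Hypothetical data: three blobs, 30 points each
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
toy = np.vstack([c + rng.normal(scale=0.6, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(toy)
    # Pass the features only; the labels go in as the second argument
    scores[k] = silhouette_score(toy, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```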

The within-cluster sum of squares (WCSS) is approximately 2.14 trillion. WCSS is the total of the squared distances from each data point to its cluster centroid, so it measures how compact the clusters are: a lower value means points sit closer to their centroids. The absolute value is hard to interpret on its own because it depends on the units and scale of the features; a figure in the trillions mainly reflects numeric columns that were not standardized before clustering. Comparing WCSS across different numbers of clusters or different models is more informative (Schurz, 2023).
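
A comparison across cluster counts can be sketched by computing WCSS directly from hierarchical labels. The `wcss` helper and the synthetic data below are hypothetical and written for illustration; the helper simply applies the definition of the metric.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def wcss(features, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for label in np.unique(labels):
        members = features[labels == label]
        total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total

rng = np.random.default_rng(2)

# Hypothetical data: three blobs, 30 points each
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
toy = np.vstack([c + rng.normal(scale=0.6, size=(30, 2)) for c in centers])

# Build the tree once, then cut it at several cluster counts
linked_toy = linkage(toy, method="ward")
wcss_by_k = {
    k: wcss(toy, fcluster(linked_toy, t=k, criterion="maxclust"))
    for k in range(1, 7)
}

for k, value in wcss_by_k.items():
    print(f"k={k}: WCSS={value:.1f}")
```

Because each cut of the same tree refines the previous one, WCSS can only fall as `k` grows; the useful signal is where the drop levels off.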

Taken together, the two metrics point to clusters that are discernible but neither sharply separated nor tightly compact.

### E2: Results and Implications

The clustering analysis identified five groups of patients with similar characteristics. This understanding could be used for various strategic purposes:

#### Tailoring treatments: 
Design specific treatment plans for each cluster based on their unique characteristics.
#### Resource allocation: 
Allocate resources more efficiently by targeting specific patient groups with tailored services.
#### Marketing strategies: 
Develop targeted marketing campaigns for each patient cluster to enhance patient engagement.

### E3: Limitation

A limitation could be the moderate silhouette score and the relatively high WCSS, indicating that while clusters are identifiable, they might not be as well-separated or compact as desired. This might affect the precision of tailoring treatments or services for each cluster.

### E4: Course of Action
Considering the moderate separation of clusters and the relatively large spread of data within clusters, it's recommended to:

#### Refine clustering techniques: 
Experiment with different clustering algorithms or parameters to achieve better-defined and more compact clusters.
#### Validate clusters: 
Use additional validation techniques or include more relevant features to enhance the separation between clusters and improve cluster homogeneity.
#### Iterative improvement: 
Continuously refine the clustering model based on new data or additional insights to better tailor treatments and services for each patient cluster.

By refining the clustering model and ensuring better separation between clusters, the hospital can enhance the effectiveness of tailored treatments and optimize resource allocation, thereby improving overall patient care and organizational efficiency.
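
One way to experiment with clustering parameters is to compare linkage methods on the same features. The sketch below runs four SciPy linkage methods on hypothetical synthetic data and scores each with the silhouette coefficient; the data and the fixed choice of three clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)

# Hypothetical data: three blobs, 30 points each
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
toy = np.vstack([c + rng.normal(scale=0.6, size=(30, 2)) for c in centers])

results = {}
for method in ("ward", "complete", "average", "single"):
    tree = linkage(toy, method=method)
    labels = fcluster(tree, t=3, criterion="maxclust")
    results[method] = silhouette_score(toy, labels)

for method, score in results.items():
    print(f"{method}: silhouette={score:.3f}")
```

On clean synthetic blobs the methods tend to agree; differences between them usually only become informative on real, noisier data.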

### G: Sources for Third-Party Code
No third-party code was used.

### H: Sources
James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An introduction to statistical learning: with Applications in Python. Springer.