# Clustering Iris Flower Species

Let's walk through a machine learning workflow that groups iris flowers by their measurements using K-Means clustering, then compares the resulting clusters with the known species labels.

The dataset comes from [Kaggle](https://www.kaggle.com/datasets/uciml/iris).

## Data Attributes
1. `sepal_length`: The length of the sepal (in cm).
2. `sepal_width`: The width of the sepal (in cm).
3. `petal_length`: The length of the petal (in cm).
4. `petal_width`: The width of the petal (in cm).
5. `species`: The species of the iris flower (e.g., setosa, versicolor, virginica).

## Step 1: Load and Explore the Data

Load the dataset from a CSV file and understand its structure.

Process:
- Use `pandas` to read the CSV file.
- Display the first few rows, summary statistics, and information about the dataset.


In [None]:
import pandas as pd

# Load the dataset from the CSV file
file_path = 'iris-flower.csv'
data = pd.read_csv(file_path)

# Display basic information about the dataset
print(data.head())
print(data.describe())
print(data.info())


## Step 2: Preprocess the Data

Prepare the data for clustering by scaling the features.

Process:
- Extract the feature columns (drop the target column `species`).
- Use `StandardScaler` to scale each feature to mean 0 and variance 1.

In [None]:
from sklearn.preprocessing import StandardScaler

# Extract features and scale the data
features = data.drop('species', axis=1)  # Assuming 'species' is the target column
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
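
To make the scaling step concrete, here is a dependency-free sketch of what standardization computes for a single column: subtract the column mean and divide by the population standard deviation. The `standardize` helper and the sample values are hypothetical and only for illustration; they are not part of the dataset or of scikit-learn.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores for one feature column (population std, i.e. ddof=0)."""
    mu = fmean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

# Hypothetical sepal lengths in cm, for illustration only
sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]
z_scores = standardize(sepal_length)

print([round(z, 3) for z in z_scores])
print(round(fmean(z_scores), 6), round(pstdev(z_scores), 6))
```

After scaling, the column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that K-Means relies on.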

## Step 3: Apply K-Means Clustering

Cluster the data into groups using the K-Means algorithm.

Process:
- Use `KMeans` from `sklearn.cluster` to apply K-Means clustering.
- Fit the model to the scaled features and predict cluster labels.
- Add the cluster labels to the original dataset.

In [None]:
from sklearn.cluster import KMeans

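# Apply K-Means clustering
# n_init=10 keeps the best of 10 initializations; random_state makes the run repeatable
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)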
kmeans.fit(scaled_features)
clusters = kmeans.labels_

# Add the cluster labels to the original data
data['cluster'] = clusters
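
For intuition about what fitting does, the sketch below implements the basic assign-then-update loop (Lloyd's algorithm) in plain Python. It is a simplification and not scikit-learn's implementation: the initial centers are passed in by hand, and the function names and toy points are hypothetical.

```python
import math

def assign(points, centers):
    """Return the index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of its members; empty clusters keep their center."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def lloyd(points, centers, max_iter=100):
    """Alternate update and assignment until the labels stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Hypothetical 2-D toy points forming two well-separated groups
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centers = lloyd(toy_points, centers=[(0.0, 0.0), (10.0, 10.0)])

print(toy_labels)   # [0, 0, 0, 1, 1, 1]
print(toy_centers)
```

Because the result depends on where the centers start, library implementations typically run several initializations and keep the one with the lowest within-cluster sum of squares.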


## Step 4: Visualize the Clusters

Visualize the clustering results and compare them with the actual labels.

Process:
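- Use `matplotlib` to draw two scatter plots of the first two scaled features (sepal length and sepal width, given the column order listed above): one colored by actual species and one colored by cluster label.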

In [None]:
import matplotlib.pyplot as plt

# Plot the clusters
plt.figure(figsize=(10, 5))

# Actual labels plot
plt.subplot(1, 2, 1)
plt.scatter(scaled_features[:, 0], scaled_features[:, 1], c=data['species'].astype('category').cat.codes, cmap='viridis')
plt.title('Actual Labels')

# K-Means clusters plot
plt.subplot(1, 2, 2)
plt.scatter(scaled_features[:, 0], scaled_features[:, 1], c=data['cluster'], cmap='viridis')
plt.title('K-Means Clusters')

plt.show()

## Step 5: Evaluate the Clustering

Evaluate the clustering results using the Adjusted Rand Index (ARI).

Process:
- Use `adjusted_rand_score` from sklearn.metrics to compare the cluster labels with the actual labels.


In [None]:
from sklearn.metrics import adjusted_rand_score

# Evaluate the clustering
ari = adjusted_rand_score(data['species'].astype('category').cat.codes, data['cluster'])
print(f'Adjusted Rand Index: {ari}')
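
The ARI counts pairs of samples that the two labelings treat the same way and corrects that count for chance agreement. The sketch below computes it from pair counts in plain Python to show the arithmetic; the function name and toy labels are hypothetical, and it assumes at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI; assumes len(labels_true) >= 2."""
    n = len(labels_true)
    # Pairs that share both a true class and a predicted cluster
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true class, and pairs that share a predicted cluster
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (max_index - expected)

# Same grouping with swapped label names
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Grouping that splits every true class in half
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])

print(same_grouping, crossed_grouping)
```

The first call scores 1 even though the label names are swapped, which is why ARI suits clustering: cluster IDs are arbitrary. The second call scores about -0.5, below the chance level of 0.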


## Step 6: Create Cluster to Species Mapping

Map each cluster to the most common species within that cluster.

Process:
- For each cluster, find the most common species.
- Create a mapping from cluster labels to species names.

In [None]:
# Create Cluster to Species Mapping
species_names = data['species'].unique()
cluster_to_species = {}

for cluster in range(kmeans.n_clusters):
    species_in_cluster = data[data['cluster'] == cluster]['species']
    most_common_species = species_in_cluster.value_counts().idxmax()
    cluster_to_species[cluster] = most_common_species

print('Cluster to species mapping:')
print(cluster_to_species)
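
A majority vote can assign the same species to two clusters, which would leave another species without a cluster. The sketch below shows the vote with `collections.Counter` and a simple check for that situation; the helper name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def majority_mapping(cluster_labels, species_labels):
    """Map each cluster ID to the most frequent species among its members."""
    members = defaultdict(Counter)
    for cluster, species in zip(cluster_labels, species_labels):
        members[cluster][species] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in members.items()}

# Hypothetical labels for nine samples
toy_clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2]
toy_species = ['setosa', 'setosa', 'setosa',
               'versicolor', 'versicolor', 'virginica', 'versicolor',
               'virginica', 'virginica']

mapping = majority_mapping(toy_clusters, toy_species)
# Species that no cluster voted for; an empty set means the mapping covers every species
unmapped = set(toy_species) - set(mapping.values())

print(mapping)
print(unmapped)
```

If `unmapped` is not empty, the clusters do not line up one-to-one with the species and the mapped predictions should be read with caution.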

## Step 7: Predict the Cluster for New Data and Map to Species Name

Predict the cluster for a new data point and map the cluster to the species name.

Process:
- Create a new data point and scale it.
- Predict the cluster using the trained K-Means model.
- Map the predicted cluster to the species name using the created mapping.

In [None]:
# Predict the Cluster for New Data
new_data = pd.DataFrame([[5.0, 3.5, 1.5, 0.2]], columns=features.columns)
new_data_scaled = scaler.transform(new_data)
predicted_cluster = kmeans.predict(new_data_scaled)
predicted_species = cluster_to_species[predicted_cluster[0]]

print(f'Predicted cluster for new data: {predicted_cluster[0]}')
print(f'Predicted species for new data: {predicted_species}')
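
Predicting a cluster for a new point amounts to finding the nearest cluster center in the scaled feature space. A minimal sketch of that lookup follows; the center coordinates and the new point are hypothetical values, not output from the fitted model.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center with the smallest Euclidean distance to point."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))

# Hypothetical cluster centers in scaled units (four features each)
toy_centers = [
    (-1.0, 0.8, -1.3, -1.25),
    (0.0, -0.9, 0.35, 0.3),
    (1.1, 0.1, 1.0, 1.0),
]
# Hypothetical new sample, already scaled
new_point = (-1.0, 1.0, -1.3, -1.3)

predicted = nearest_center(new_point, toy_centers)
print(predicted)  # 0
```

The new point must be scaled with the same fitted scaler as the training data, otherwise its distances to the centers are not comparable.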

## Step 8: Analyze the Cluster Centers

Understand the characteristics of each cluster by analyzing the cluster centers.

Process:
- Inverse transform the cluster centers to the original scale.
- Create a DataFrame of the cluster centers.

In [None]:
# Analyze the cluster centers
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=features.columns)
print('Cluster Centers:')
print(cluster_centers_df)
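
Because standardization is a linear map, inverse-transforming a cluster center gives the same value as averaging that cluster's members in the original units. The one-feature sketch below checks this with hypothetical petal lengths and a hypothetical cluster membership.

```python
from statistics import fmean, pstdev

# Hypothetical petal lengths in cm
petal_length = [1.4, 1.5, 4.5, 4.7, 5.9]
mu, sigma = fmean(petal_length), pstdev(petal_length)
scaled = [(x - mu) / sigma for x in petal_length]

# Hypothetical cluster made of the last three samples
member_idx = [2, 3, 4]

# Center in scaled units, then mapped back with x = z * sigma + mu
scaled_center = fmean([scaled[i] for i in member_idx])
center_cm = scaled_center * sigma + mu

# Mean of the same members computed directly in cm
direct_mean_cm = fmean([petal_length[i] for i in member_idx])

print(round(center_cm, 4), round(direct_mean_cm, 4))
```

Both values agree, so the rows of `cluster_centers_df` can be read as the average measurements of the flowers in each cluster.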

## Step 9: Compare with Original Labels

Create a confusion matrix to compare the clusters with the original labels.

Process:
- Use `confusion_matrix` from `sklearn.metrics` to cross-tabulate the actual species labels (rows) against the cluster labels (columns).

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Create the confusion matrix
conf_matrix = confusion_matrix(data['species'].astype('category').cat.codes, data['cluster'])

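# Plot the confusion matrix as a heatmap
# Rows follow the sorted category order used by .cat.codes; columns are raw cluster IDs
species_order = list(data['species'].astype('category').cat.categories)
cluster_ids = [f'Cluster {i}' for i in range(conf_matrix.shape[1])]
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=cluster_ids, yticklabels=species_order)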
plt.xlabel('Predicted Clusters')
plt.ylabel('Actual Species')
plt.title('Confusion Matrix')
plt.show()

## Understanding the Confusion Matrix
- Rows represent the actual classes (species).
- Columns represent the predicted clusters.

Each element at position (i, j) is the number of samples of actual class i that were assigned to cluster j. Cluster IDs are arbitrary, so the diagonal is meaningful only when cluster j happens to correspond to species j.
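
The table itself is just a count of (actual class, cluster) pairs. The sketch below builds it in plain Python with hypothetical labels, chosen so that the cluster IDs do not match the class codes.

```python
def contingency_table(actual, predicted, n_classes, n_clusters):
    """Count samples for every (actual class, cluster) pair."""
    table = [[0] * n_clusters for _ in range(n_classes)]
    for i, j in zip(actual, predicted):
        table[i][j] += 1
    return table

# Hypothetical integer-coded labels for seven samples
actual = [0, 0, 1, 1, 1, 2, 2]
predicted = [2, 2, 0, 0, 1, 1, 1]

table = contingency_table(actual, predicted, n_classes=3, n_clusters=3)
for row in table:
    print(row)
# [0, 0, 2]
# [2, 1, 0]
# [0, 2, 0]
```

Here class 0 sits entirely in cluster 2, so the large counts fall off the diagonal even though the grouping is mostly consistent.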

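## Interpreting the Confusion Matrix

The counts below come from one run in which clusters 0, 1, and 2 lined up with Setosa, Versicolor, and Virginica. A different initialization can permute the cluster IDs and shift the counts slightly.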

1. First Row (Actual: Setosa):

- [50, 0, 0]: All 50 samples of Setosa were correctly clustered into a single cluster (predicted cluster 0).
- This indicates that the K-Means algorithm did a perfect job in identifying Setosa samples.

2. Second Row (Actual: Versicolor):

- [0, 38, 12]: Out of 50 samples of Versicolor, 38 were correctly clustered into one cluster (predicted cluster 1), but 12 were incorrectly clustered into another cluster (predicted cluster 2).
- This indicates that while the algorithm identified most Versicolor samples correctly, there is some confusion with Virginica samples.

3. Third Row (Actual: Virginica):

- [0, 14, 36]: Out of 50 samples of Virginica, 36 were correctly clustered into one cluster (predicted cluster 2), but 14 were incorrectly clustered into another cluster (predicted cluster 1).
- This shows that there is a notable overlap/confusion between Versicolor and Virginica samples.

## Summary
- Setosa: Perfectly clustered (50 out of 50 correctly clustered).
- Versicolor: Majority correctly clustered (38 out of 50), with some confusion (12 samples) with Virginica.
- Virginica: Majority correctly clustered (36 out of 50), with some confusion (14 samples) with Versicolor.
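
The summary can be condensed into one number by matching clusters to species one-to-one and counting the matched samples. The sketch below tries every column permutation of the matrix quoted above, which is feasible for three clusters; the helper name is hypothetical, and this one-to-one matching is an alternative to the majority vote used in Step 6.

```python
from itertools import permutations

# Confusion matrix from the run described above (rows: species, columns: clusters)
matrix = [
    [50, 0, 0],
    [0, 38, 12],
    [0, 14, 36],
]

def best_alignment(matrix):
    """Return the cluster assigned to each species row and the matched sample count."""
    n = len(matrix)
    best = max(permutations(range(n)),
               key=lambda perm: sum(matrix[i][perm[i]] for i in range(n)))
    matched = sum(matrix[i][best[i]] for i in range(n))
    return best, matched

perm, matched = best_alignment(matrix)
total = sum(map(sum, matrix))

print(perm)                       # (0, 1, 2)
print(matched, total)             # 124 150
print(round(matched / total, 4))  # 0.8267
```

With this matching, 124 of the 150 samples land in the cluster paired with their species, and all 26 mismatches fall between Versicolor and Virginica.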