# Diamonds Dataset - Clustering with Machine Learning
This notebook demonstrates how to train an unsupervised machine learning model to cluster diamonds by their features. It also covers choosing the number of clusters with the Elbow Method and illustrates underfitting and overfitting with visualizations.

## Step 1: Load the Dataset
We load the diamonds dataset and preprocess it using techniques applied previously, such as encoding categorical variables and scaling numerical variables.

In [None]:
import seaborn as sns
import pandas as pd
from sklearn.preprocessing import StandardScaler
diamonds = sns.load_dataset('diamonds')

# Apply one-hot encoding to the categorical columns (dtype=float keeps every column numeric)
diamonds_encoded = pd.get_dummies(diamonds, columns=['cut', 'color', 'clarity'], drop_first=True, dtype=float)

# List the numerical columns to standardize
numerical_cols = ['carat', 'depth', 'table', 'price', 'x', 'y', 'z']

# Instantiate the StandardScaler
scaler = StandardScaler()

# Apply scaling to the numerical columns
diamonds_encoded[numerical_cols] = scaler.fit_transform(diamonds_encoded[numerical_cols])
diamonds_encoded.head()
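
To make the two preprocessing steps concrete, here is a minimal sketch on a hypothetical four-row toy frame (the values are invented for illustration and are not taken from the diamonds dataset). It shows which dummy columns `drop_first=True` leaves behind and checks that the scaled column ends up with mean 0 and unit standard deviation.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "cut": ["Ideal", "Good", "Premium", "Good"],
    "carat": [0.3, 0.7, 1.1, 1.5],
})

# drop_first=True drops one level per encoded column; for plain string
# columns that is the alphabetically first level ("Good" here)
toy_encoded = pd.get_dummies(toy, columns=["cut"], drop_first=True)

# StandardScaler rescales a column to mean 0 and unit (population) standard deviation
toy_encoded[["carat"]] = StandardScaler().fit_transform(toy_encoded[["carat"]])

print(toy_encoded.columns.tolist())
print(toy_encoded["carat"].mean(), toy_encoded["carat"].std(ddof=0))
```

The dropped level becomes the implicit baseline: a row with zeros in every remaining `cut_*` column belongs to it.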

## Step 2: Model Training - KMeans Clustering
We use the KMeans algorithm to cluster the diamonds. KMeans is an unsupervised learning algorithm that groups data into a predefined number of clusters based on feature similarity.

### Explanation of KMeans:
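- **Initialization**: The algorithm selects 'k' starting centroids. The classic version picks 'k' random points; scikit-learn's `KMeans` defaults to k-means++ seeding, which spreads the starting centroids apart.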
- **Assignment**: Each point is assigned to the nearest centroid, forming clusters.
- **Update**: The centroids are updated by calculating the mean of the assigned points.
- **Repeat**: The assignment and update steps repeat until convergence.
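
The four steps above can be sketched in a few lines of NumPy. This is an illustrative toy version with plain random initialization, not scikit-learn's implementation; the function name `kmeans_sketch` and the synthetic two-blob data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal version of the assignment/update loop; illustrative only."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: index of the nearest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: mean of each cluster (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (hypothetical data)
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

On data this cleanly separated, each blob should end up in its own cluster; on real data the result can depend on the starting centroids, which is why library implementations use smarter seeding or several restarts.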

### Why KMeans?
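KMeans is simple and scales well to large datasets, which makes it a common first choice for finding groups of similar records. It works best when the clusters are roughly spherical and similar in size, and it requires the number of clusters to be chosen in advance.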

In [None]:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Define the model with a predefined number of clusters (e.g., 5 clusters)
kmeans = KMeans(n_clusters=5, random_state=42)

# Fit the model to the dataset
clusters = kmeans.fit_predict(diamonds_encoded)

# Add the cluster labels to the dataset
diamonds_encoded['cluster'] = clusters
diamonds_encoded.head()

## Step 3: Visualizing the Clusters
We visualize the clusters using a scatter plot based on the carat and price features. The colors represent different clusters identified by the KMeans algorithm.

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(diamonds_encoded['carat'], diamonds_encoded['price'], c=diamonds_encoded['cluster'], cmap='viridis', alpha=0.6)
plt.title('Clusters of Diamonds Based on Carat and Price')
plt.xlabel('Carat (Scaled)')
plt.ylabel('Price (Scaled)')
plt.colorbar(label='Cluster')
plt.show()

## Step 4: Evaluation Metrics - Elbow Method
To choose the number of clusters, we use the Elbow Method, which plots the Within-Cluster Sum of Squares (WCSS) for different values of 'k'. WCSS generally decreases as 'k' grows, so we look for the 'elbow': the point beyond which adding more clusters yields only a small further reduction.

In [None]:
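# Exclude the cluster labels added in Step 2 so they are not used as features
features = diamonds_encoded.drop(columns='cluster', errors='ignore')

wcss = []
for k in range(1, 11):
    # Use a separate name so the Step 2 model stored in `kmeans` is not overwritten
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(features)
    wcss.append(model.inertia_)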

# Plot the WCSS
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
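
To show what the plotted quantity measures, the sketch below recomputes WCSS by hand and compares it with the `inertia_` attribute of a fitted model. It uses synthetic blobs from `make_blobs` rather than the diamonds data, and `n_init=10` is set explicitly as an assumption so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three blobs (illustrative, not the diamonds data)
X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=42)

blob_model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_blobs)

# WCSS: squared distance from each point to the centroid of its assigned cluster, summed
assigned_centroids = blob_model.cluster_centers_[blob_model.labels_]
wcss_manual = np.sum((X_blobs - assigned_centroids) ** 2)

print(wcss_manual, blob_model.inertia_)
```

The two numbers should agree up to floating-point error, which confirms that each point on the elbow curve is a total squared distance to the assigned centroids.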

## Step 5: Underfitting and Overfitting Explained
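Underfitting and overfitting are common issues in machine learning. They are usually discussed for supervised models, so this step uses a small regression example on synthetic data rather than the diamonds clusters: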

- **Underfitting**: When the model is too simple to capture the underlying patterns, it performs poorly on both training and testing data.
- **Overfitting**: When the model learns the noise in the training data, it performs well on training but poorly on testing data.

### Visualizing Underfitting and Overfitting

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# Generate synthetic data
np.random.seed(0)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Underfitting with Linear Model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred_linear = linear_model.predict(X)

# Overfitting with a High Degree Polynomial Model (fit on the training split, like the linear model)
poly = PolynomialFeatures(degree=15)
X_poly_train = poly.fit_transform(X_train)
poly_model = LinearRegression()
poly_model.fit(X_poly_train, y_train)
y_pred_poly = poly_model.predict(poly.transform(X))

# Plot
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Data')
plt.plot(X, y_pred_linear, color='red', label='Linear Model (Underfitting)')
plt.plot(X, y_pred_poly, color='green', label='Polynomial Model (Overfitting)')
plt.title('Underfitting and Overfitting Visualization')
plt.legend()
plt.show()
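
The plot gives a visual impression; comparing training and test error gives a numeric one. The sketch below rebuilds the same kind of noisy sine data and reports the mean squared error of both models on each split. The variable names and the use of `make_pipeline` are choices made for this example. How large the gap between training and test error turns out to be depends on the data and the split, so treat the printed numbers as something to inspect rather than a fixed result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine curve, similar to the synthetic data in the cell above
rng = np.random.default_rng(0)
X_sine = np.linspace(0, 10, 100).reshape(-1, 1)
y_sine = np.sin(X_sine).ravel() + rng.normal(0, 0.1, X_sine.shape[0])

X_tr, X_te, y_tr, y_te = train_test_split(X_sine, y_sine, test_size=0.2, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "degree-15 polynomial": make_pipeline(PolynomialFeatures(degree=15), LinearRegression()),
}

errors = {}
for name, candidate in candidates.items():
    candidate.fit(X_tr, y_tr)
    # Store (training MSE, test MSE) for each model
    errors[name] = (
        mean_squared_error(y_tr, candidate.predict(X_tr)),
        mean_squared_error(y_te, candidate.predict(X_te)),
    )
    print(f"{name}: train MSE={errors[name][0]:.4f}, test MSE={errors[name][1]:.4f}")
```

A model that underfits shows high error on both splits. A model that overfits shows a training error well below its test error.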

## Step 6: Practice Questions
1. Why is it important to scale numerical features before training a model?
2. What is the purpose of the Elbow Method in KMeans clustering?
3. How does one-hot encoding differ from label encoding, and when should each be used?
4. Explain how underfitting can be identified during model evaluation.
5. Describe a scenario where overfitting might occur and how you could prevent it.

## Conclusion
This notebook demonstrated the use of KMeans clustering to group diamonds by their features. We visualized the clusters, used the Elbow Method to choose the number of clusters, and illustrated underfitting and overfitting with a regression example. These ideas carry over to most machine learning workflows: prepare the features, fit a model, and check whether its complexity matches the data.

### Additional Advanced Section: Visualizing Clusters with PCA

**Explanation**
- **PCA**: Principal Component Analysis (PCA) projects the data into a lower-dimensional space while retaining as much of the variance (information) as possible. In this case, we reduce the dataset to two components and then plot the clusters.
- **Why use it**: A two-component projection summarizes all features at once, so the scatter plot is less cluttered than a plot of two hand-picked features.

In [None]:
from sklearn.decomposition import PCA

# Apply PCA to reduce dimensions to 2
pca = PCA(n_components=2)
pca_components = pca.fit_transform(diamonds_encoded.drop('cluster', axis=1))

# Create a new DataFrame with PCA components
pca_df = pd.DataFrame(data=pca_components, columns=['PCA1', 'PCA2'])
pca_df['cluster'] = diamonds_encoded['cluster'].to_numpy()  # assign by position, not by index label

# Plot the clusters based on PCA components
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PCA1', y='PCA2', hue='cluster', data=pca_df, palette='viridis', alpha=0.6)
plt.title('Clusters of Diamonds (PCA Reduced Dimensions)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Cluster')
plt.show()


### How It Works:
- **PCA Transformation**: The PCA object reduces the dataset dimensions to 2, retaining the maximum variance.
- **DataFrame Creation**: A new DataFrame is created to hold the principal components (PCA1 and PCA2) and the cluster labels.
- **Scatter Plot**: The clusters are visualized in the reduced 2D space, making it less cluttered and easier to interpret.


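This visualization shows how well the clusters separate, or overlap, along the two directions of greatest variance. Overlap in the 2D plot does not necessarily mean the clusters overlap in the full feature space, because the projection discards the remaining components.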
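
How much to trust a two-component plot depends on how much variance those components retain. A minimal sketch of that check, using the `explained_variance_ratio_` attribute of a fitted `PCA` object on hypothetical synthetic data (five features built from two underlying directions plus small noise):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 5-feature data in which most variance lies along two directions
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2)) * [3.0, 2.0]
noise = rng.normal(scale=0.1, size=(500, 5))
X_synth = latent @ rng.normal(size=(2, 5)) + noise

pca_demo = PCA(n_components=2).fit(X_synth)

# Fraction of total variance captured by each component, and by both together
print(pca_demo.explained_variance_ratio_)
print(pca_demo.explained_variance_ratio_.sum())
```

Running the same check on the `pca` object fitted above would indicate how much of the diamonds data the 2D plot represents. A low total suggests the plot hides much of the structure.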