<a href="https://colab.research.google.com/github/hg24abd/Clustering-and-Fitting/blob/main/clustering_and_fitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering and Regression Analysis of the Iris Dataset

This notebook analyzes the Iris dataset using:
1. **Exploratory data analysis** to understand the dataset.
2. **Histogram plot** to visualize the distribution of sepal lengths.
3. **Elbow plot** to choose the number of clusters.
4. **K-means clustering** to group the samples into clusters.
5. **Scatter plot** to visualize the clusters.
6. **Linear regression** to examine the relationship between sepal length and petal length.

Each step is wrapped in a reusable function, and each plot is also saved as a PNG file.

## Step 1: Import Required Libraries

We use `pandas` for data handling, `matplotlib` and `seaborn` for visualization, and `scikit-learn` for scaling, clustering, and regression.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from google.colab import files

In [None]:
# Set Seaborn Style for Better Aesthetics
sns.set(style="whitegrid", palette="muted", font_scale=1.2)

## Step 2: Load the Dataset

The Iris dataset contains sepal and petal measurements for three species of iris flowers. We upload the data file, load it into a DataFrame, and assign descriptive column names, since the raw file has no header row.

In [None]:
# Load the Iris Dataset
def load_iris_dataset(filepath):
    """
    Load the Iris dataset from the provided file path.

    Args:
        filepath (str): Path to the Iris dataset file.

    Returns:
        pd.DataFrame: Loaded dataset with column names.
    """
    columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
    return pd.read_csv(filepath, header=None, names=columns)

In [None]:
# Upload the data file (Colab only), then load it with the function defined above
uploaded = files.upload()
file_path = 'iris_cluster_vibes.data'  # must match the name of the uploaded file
iris_df = load_iris_dataset(file_path)
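
Before moving on, it can help to confirm that the file parsed as expected. The sketch below uses a hypothetical helper, `validate_iris_frame` (not part of the original notebook), that checks the column names, numeric dtypes, and missing values. It runs on four sample rows in the same comma-separated, header-less layout that `load_iris_dataset` assumes.

```python
import io

import pandas as pd

COLUMNS = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

def validate_iris_frame(df):
    """Return a list of problems found in a loaded Iris DataFrame (hypothetical helper)."""
    problems = []
    if list(df.columns) != COLUMNS:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    # All four measurement columns should have parsed as numbers
    for col in COLUMNS[:-1]:
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")
    missing = int(df.isna().sum().sum())
    if missing:
        problems.append(f"{missing} missing values")
    return problems

# Four rows in the header-less CSV layout, used only for illustration
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
sample_df = pd.read_csv(sample, header=None, names=COLUMNS)
problems = validate_iris_frame(sample_df)
print(problems or "no problems found")
```

In the notebook itself, the same check would be `validate_iris_frame(iris_df)` right after loading.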

## Step 3: Exploratory Data Analysis

We explore the dataset to understand its structure and contents:
1. Print dataset information.
2. Display summary statistics.
3. Check the unique species available.

In [None]:
# Exploratory Analysis
def exploratory_analysis(df):
    """
    Perform basic exploratory analysis and display summary statistics.

    Args:
        df (pd.DataFrame): The dataset.
    """
    print("Dataset Info:")
    df.info()  # info() prints its report directly and returns None
    print("\nStatistical Summary:")
    print(df.describe())
    print("\nUnique Species:")
    print(df['species'].unique())

# Perform exploratory analysis
exploratory_analysis(iris_df)
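
Because the dataset is labelled by species, a per-species breakdown is a natural follow-up to the overall summary. The sketch below uses a hypothetical helper, `summarize_by_species`, on a small illustrative frame; in the notebook it would be called on `iris_df`.

```python
import pandas as pd

# Small illustrative frame; in the notebook this would be iris_df
demo_df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'petal_length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'species': ['Iris-setosa', 'Iris-setosa',
                'Iris-versicolor', 'Iris-versicolor',
                'Iris-virginica', 'Iris-virginica'],
})

def summarize_by_species(df):
    """Return row counts and per-species means of the numeric columns (hypothetical helper)."""
    counts = df['species'].value_counts().sort_index()
    means = df.groupby('species').mean(numeric_only=True)
    return counts, means

counts, means = summarize_by_species(demo_df)
print(counts)
print(means.round(2))
```

Unequal counts or very different means between species are worth noting before clustering, since they affect how well unsupervised clusters can line up with the labels.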

## Step 4: Histogram for Sepal Length

We create a histogram to visualize the distribution of sepal lengths across the dataset.

In [None]:
# Plot histogram with enhanced style
def plot_histogram(df, column):
    """
    Plot a histogram for the specified column.

    Args:
        df (pd.DataFrame): The dataset.
        column (str): Column name to plot the histogram.
    """
    plt.figure(figsize=(10, 6))
    sns.histplot(df[column], bins=15, kde=True, color="skyblue", edgecolor='black', linewidth=1.5)
    plt.title(f"Distribution of {column}", fontsize=16)
    plt.xlabel(column, fontsize=14)
    plt.ylabel("Frequency", fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.savefig(f"{column}_histogram.png", bbox_inches='tight')  # file name follows the plotted column
    plt.show()

# Plot the histogram for sepal length
plot_histogram(iris_df, 'sepal_length')
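
To read exact bin counts rather than estimating them from the plot, the binning can be reproduced numerically. This is a minimal sketch that assumes equal-width bins spanning the data range, which is what an integer `bins` argument requests; the nine values are for illustration only.

```python
import numpy as np

# Illustrative sepal lengths (cm); in the notebook this would be iris_df['sepal_length']
values = np.array([5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1])

# 15 equal-width bins between the minimum and maximum value
counts, edges = np.histogram(values, bins=15)

# Each bin is half-open [left, right), except the last, which includes its right edge
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")
```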

## Step 5: Elbow Method for Clustering

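The elbow method helps determine a suitable number of clusters for k-means. We plot the inertia (the sum of squared distances from each sample to its nearest cluster center) for k = 1 to 10 and look for the point where the curve stops dropping sharply.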

In [None]:
# Elbow method for K-Means with enhanced plot style
def plot_elbow_method(df, columns):
    """
    Determine the optimal number of clusters using the Elbow Method.

    Args:
        df (pd.DataFrame): The dataset.
        columns (list): Columns used for clustering.
    """
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(df[columns])

    inertia = []
    for k in range(1, 11):
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)  # explicit n_init keeps results consistent across scikit-learn versions
        kmeans.fit(scaled_data)
        inertia.append(kmeans.inertia_)

    plt.figure(figsize=(10, 6))
    plt.plot(range(1, 11), inertia, marker='o', linestyle='-', color='#FF6347', linewidth=2)
    plt.title("Elbow Method to Determine Optimal Clusters", fontsize=16)
    plt.xlabel("Number of Clusters", fontsize=14)
    plt.ylabel("Inertia", fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.xticks(range(1, 11))
    plt.savefig("elbow_plot.png", bbox_inches='tight')
    plt.show()

# Plot the elbow method
plot_elbow_method(iris_df, ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
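
To make the inertia definition concrete, the sketch below recomputes it by hand and compares it with the `inertia_` attribute that scikit-learn reports. The data is synthetic (three Gaussian blobs generated with NumPy) and `manual_inertia` is a hypothetical helper, so the numbers say nothing about Iris itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: three Gaussian blobs in four dimensions
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [4, 4, 4, 4], [8, 0, 8, 0]], dtype=float)
data = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 4)) for c in centers])
scaled = StandardScaler().fit_transform(data)

def manual_inertia(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = points - centroids[labels]
    return float(np.sum(diffs ** 2))

inertias = {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled)
    inertias[k] = manual_inertia(scaled, km.cluster_centers_, km.labels_)
    # The hand-computed value should agree with scikit-learn's attribute
    assert np.isclose(inertias[k], km.inertia_, rtol=1e-6)

print({k: round(v, 1) for k, v in inertias.items()})
```

One useful reference point: for k = 1 on standardized data, the inertia equals the number of samples times the number of features (150 × 4 = 600 here), because each standardized column has zero mean and unit variance.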

## Step 6: K-Means Clustering

We apply k-means clustering to group the samples into clusters. The features are first standardized to zero mean and unit variance, so that no single measurement dominates the distance calculation.

In [None]:
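# Perform K-Means clustering on standardized features
def perform_kmeans_clustering(df, columns, n_clusters):
    """
    Standardize the given columns and add a k-means 'Cluster' label column.

    Args:
        df (pd.DataFrame): The dataset (modified in place and returned).
        columns (list): Columns used for clustering.
        n_clusters (int): Number of clusters to form.
    """
    scaled_data = StandardScaler().fit_transform(df[columns])

    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    df['Cluster'] = kmeans.fit_predict(scaled_data)
    return df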

# Apply k-means clustering
iris_df = perform_kmeans_clustering(iris_df, ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], n_clusters=3)
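
Since the dataset also carries species labels, one way to judge the clusters is to cross-tabulate them against the species. The sketch below assumes that scikit-learn's bundled copy of Iris (`load_iris`) is an acceptable stand-in for the uploaded file; in the notebook the equivalent call would be `pd.crosstab(iris_df['species'], iris_df['Cluster'])`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris copy stands in for the uploaded file
iris = load_iris()
features = StandardScaler().fit_transform(iris.data)
species = pd.Series(iris.target_names[iris.target], name='species')

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)
clusters = pd.Series(labels, name='Cluster')

# Rows are true species, columns are cluster labels (label numbers are arbitrary)
table = pd.crosstab(species, clusters)
print(table)
```

A row concentrated in a single column means that species was recovered cleanly; a row split across columns shows where k-means mixes species.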

## Step 7: Scatter Plot for Clusters

We create a scatter plot to visualize how data points are grouped into clusters based on sepal length and petal length.

In [None]:
# Scatter plot for clusters with unique and bright colors
def plot_cluster_scatter(df, x_col, y_col, cluster_col):
    """
    Create a scatter plot to visualize clusters.

    Args:
        df (pd.DataFrame): The dataset.
        x_col (str): Column for x-axis.
        y_col (str): Column for y-axis.
        cluster_col (str): Column for cluster labels.
    """
    plt.figure(figsize=(10, 6))
    scatter = plt.scatter(df[x_col], df[y_col], c=df[cluster_col], cmap='Set2', alpha=0.8, s=100, edgecolor='black')
    plt.colorbar(scatter, label='Cluster')
    plt.title("Scatter Plot of Clusters", fontsize=16)
    plt.xlabel(x_col, fontsize=14)
    plt.ylabel(y_col, fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.savefig("cluster_scatter_plot.png", bbox_inches='tight')
    plt.show()

# Plot the scatter plot for clusters
plot_cluster_scatter(iris_df, 'sepal_length', 'petal_length', 'Cluster')
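
Cluster centers are often drawn on top of such a scatter plot. Because k-means was fitted on standardized features, the centers have to be mapped back to the original units first. The sketch below keeps the scaler and model objects (which `perform_kmeans_clustering` does not return) and again assumes scikit-learn's bundled Iris copy as a stand-in for the uploaded file.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris copy stands in for the uploaded file
X = load_iris().data  # columns: sepal length, sepal width, petal length, petal width (cm)

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaler.fit_transform(X))

# Centroids live in standardized space; map them back to centimetres
centroids_cm = scaler.inverse_transform(kmeans.cluster_centers_)

# Keep only the two plotted axes: sepal length (column 0) and petal length (column 2)
centroid_xy = centroids_cm[:, [0, 2]]
print(np.round(centroid_xy, 2))
```

The resulting coordinates could then be overlaid with a call such as `plt.scatter(centroid_xy[:, 0], centroid_xy[:, 1], marker='X', s=250, color='red')` before `plt.show()`.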

## Step 8: Linear Regression Analysis

We analyze the relationship between sepal length and petal length using linear regression. The regression line shows the trend in the data.

In [None]:
# Linear regression with stylish plot
def perform_linear_regression(df, x_col, y_col):
    """
    Perform linear regression and plot the regression line.

    Args:
        df (pd.DataFrame): The dataset.
        x_col (str): Independent variable.
        y_col (str): Dependent variable.
    """
    X = df[[x_col]].values
    y = df[y_col].values

    model = LinearRegression()
    model.fit(X, y)
    predictions = model.predict(X)

    plt.figure(figsize=(10, 6))
    plt.scatter(X, y, color="deepskyblue", label="Data Points", alpha=0.7, s=100, edgecolor='black')
    plt.plot(X, predictions, color="darkorange", linewidth=3, label="Regression Line")
    plt.title(f"Linear Regression: {y_col} vs {x_col}", fontsize=16)
    plt.xlabel(x_col, fontsize=14)
    plt.ylabel(y_col, fontsize=14)
    plt.legend(fontsize=12)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.savefig("regression_plot.png", bbox_inches='tight')
    plt.show()

    return model

# Perform linear regression and keep the fitted model for later inspection
regression_model = perform_linear_regression(iris_df, 'sepal_length', 'petal_length')
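
The plot shows the trend, but the fitted model also exposes numbers that quantify it. The sketch below reads the slope, intercept, and R² from a fitted `LinearRegression` and cross-checks the slope against the closed-form least-squares formula for a single predictor. It assumes scikit-learn's bundled Iris copy as a stand-in for the uploaded file.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

# Assumption: scikit-learn's bundled Iris copy stands in for the uploaded file
data = load_iris().data
sepal_length = data[:, 0]
petal_length = data[:, 2]

model = LinearRegression().fit(sepal_length.reshape(-1, 1), petal_length)
slope = float(model.coef_[0])
intercept = float(model.intercept_)
r_squared = float(model.score(sepal_length.reshape(-1, 1), petal_length))

# Closed-form least-squares slope for one predictor: cov(x, y) / var(x)
slope_check = np.cov(sepal_length, petal_length, ddof=0)[0, 1] / np.var(sepal_length)

print(f"petal_length = {slope:.3f} * sepal_length + {intercept:.3f}")
print(f"R^2 = {r_squared:.3f}")
```

With a single predictor, R² equals the squared Pearson correlation between the two columns, so it gives a direct measure of how strong the linear relationship is.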

## Conclusion

This notebook applied clustering and regression analysis to the Iris dataset:
1. A histogram visualized the distribution of sepal length.
2. An elbow plot of inertia guided the choice of the number of clusters.
3. K-means clustering on standardized features grouped the dataset into three clusters.
4. Linear regression modelled the relationship between sepal length and petal length with a fitted trend line.