## Section 1: Project Setup and Data Ingestion
Our first step is to set up the environment by importing the necessary Python libraries. We'll use `pandas` and `numpy` for data manipulation, `matplotlib` and `seaborn` for static visualizations, and `plotly` for interactive plots. We also set some default styles for our plots to ensure they are clear and aesthetically pleasing.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

sns.set_style('whitegrid')
plt.style.use('seaborn-v0_8-talk')

print("Libraries imported successfully.")

With the libraries loaded, we can now ingest our dataset. For this project, we are using a publicly available, anonymized insurance dataset, which we load directly into a pandas DataFrame from a raw GitHub URL.

In [None]:
# Raw CSV hosted on GitHub
url = 'https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv'

try:
    df = pd.read_csv(url)
except Exception as e:
    # Report the problem, then re-raise so later cells do not fail
    # with a confusing NameError on `df`
    print(f"Error loading dataset: {e}")
    raise
else:
    print("Dataset loaded successfully.")
    print(f"Dataset shape: {df.shape}")

## Section 2: Data Cleaning and Initial Inspection
Before diving into analysis, it's crucial to get a high-level overview of the data. We'll use `.head()` to see the first few rows, `.info()` to understand the data types and non-null counts, and `.describe()` to get a statistical summary of the numerical columns. This helps us quickly identify the structure and scale of our data.

In [None]:
print("\n--- First 5 Rows of the Dataset ---")
print(df.head())

print("\n--- Dataset Information ---")
df.info()

print("\n--- Statistical Summary ---")
print(df.describe())

Data quality is paramount for any reliable analysis. A key step in data cleaning is checking for missing values. We'll use `.isnull().sum()` to count the nulls in each column. This dataset is clean, so every count should come back as zero and no imputation or row removal is needed.

In [None]:
print("\n--- Missing Values per Column ---")
print(df.isnull().sum())
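
If the null counts were not all zero, the gaps would need handling before clustering. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative assumptions, not rows from the insurance data) showing the same check plus two common remedies: dropping incomplete rows or filling a numeric gap with the column median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the insurance data) with one missing BMI value
toy = pd.DataFrame({
    "age": [19, 33, 45, 33],
    "bmi": [27.9, np.nan, 30.1, 22.7],
    "smoker": ["yes", "no", "no", "no"],
})

# Same check as in the notebook: number of nulls per column
null_counts = toy.isnull().sum()
print(null_counts)

# Option 1: drop every row that contains at least one null
dropped = toy.dropna()

# Option 2: keep all rows and fill the numeric gap with the column median
filled = toy.fillna({"bmi": toy["bmi"].median()})
print(filled)
```

Which option is appropriate depends on how many rows are affected and whether the missingness is random; neither is needed when the counts are all zero.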

## Section 3: Exploratory Data Analysis (EDA)
Now we move to Exploratory Data Analysis (EDA). The first step is to understand the distribution of our key numerical features. By plotting histograms for attributes like Age, BMI, and Charges, we can visualize their shape, central tendency, and spread, which provides initial insights into our customer base.

In [None]:
print("\n--- Starting Exploratory Data Analysis ---")

fig, axes = plt.subplots(1, 3, figsize=(21, 7))
fig.suptitle('Distribution of Key Customer Attributes', fontsize=20)

sns.histplot(df['age'], kde=True, ax=axes[0], color='skyblue')
axes[0].set_title('Age Distribution')

sns.histplot(df['bmi'], kde=True, ax=axes[1], color='salmon')
axes[1].set_title('BMI Distribution')

sns.histplot(df['charges'], kde=True, ax=axes[2], color='lightgreen')
axes[2].set_title('Insurance Charges Distribution')

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

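To understand how our numerical variables relate to one another, we'll create a correlation heatmap. Each cell shows the Pearson correlation coefficient (the default for `DataFrame.corr()`) between a pair of features, ranging from -1 (perfect negative linear relationship) to 1 (perfect positive linear relationship). Values near 0 indicate little linear association, although a nonlinear relationship may still exist.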

In [None]:
plt.figure(figsize=(12, 8))
numeric_df = df.select_dtypes(include=np.number)
correlation_matrix = numeric_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features', fontsize=16)
plt.show()
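
To make the numbers in the heatmap less opaque, here is a small sketch that computes a Pearson coefficient by hand and compares it with NumPy's result. The helper `pearson_r` and the synthetic `age` and `charges` arrays are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = x_centered @ y_centered
    denominator = np.sqrt((x_centered @ x_centered) * (y_centered @ y_centered))
    return numerator / denominator

# Synthetic data: charges rise roughly linearly with age, plus noise
rng = np.random.default_rng(0)
age = rng.uniform(18, 64, size=200)
charges = 250.0 * age + rng.normal(0, 3000, size=200)

r_manual = pearson_r(age, charges)
r_numpy = np.corrcoef(age, charges)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

Both values should agree to floating-point precision, which is a quick way to confirm what each heatmap cell represents.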

## Section 4: Feature Engineering & Preprocessing
Machine learning models require all input features to be numerical. Our dataset contains categorical columns (`sex`, `smoker`, `region`) that we need to convert. We will use **one-hot encoding** for this, which creates new binary (0 or 1) columns for each category.

Additionally, distance-based algorithms like K-Means are sensitive to the scale of the data. Features with large ranges (like `charges`) can dominate the clustering process. To prevent this, we will standardize every column of the encoded DataFrame, including the one-hot indicator columns, using `StandardScaler` so that each has a mean of 0 and a standard deviation of 1.

In [None]:
from sklearn.preprocessing import StandardScaler

# Create a copy for preprocessing
df_processed = df.copy()

# One-hot encode categorical features
df_processed = pd.get_dummies(df_processed, columns=['sex', 'smoker', 'region'], drop_first=True)

# Scale the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df_processed)

print("Data preprocessed and scaled successfully.")
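
The two preprocessing steps can be traced on a tiny example. The sketch below uses a hypothetical four-row frame (`toy` is an assumption, not the real data) to show which columns `drop_first=True` keeps, and applies the standardization formula directly. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default, whereas pandas' `.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "charges": [16884.92, 4449.46, 8240.59, 10797.34],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "southeast"],
})

# drop_first=True removes the first category of each column
# (here 'no' and 'northwest'), which then acts as the baseline
encoded = pd.get_dummies(toy, columns=["smoker", "region"], drop_first=True, dtype=int)
print(encoded.columns.tolist())

# Standardize each column: subtract the mean, divide by the population std
values = encoded.to_numpy(dtype=float)
z_scores = (values - values.mean(axis=0)) / values.std(axis=0)
scaled = pd.DataFrame(z_scores, columns=encoded.columns)
print(scaled.round(2))
```

Scaling the indicator columns along with the continuous ones, as the notebook does, is a design choice: it puts every feature on the same footing, at the cost of making the 0/1 columns harder to read afterwards.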

## Section 5: Dimensionality Reduction with PCA
Our preprocessed data now has several dimensions (features), which are difficult to visualize together and may contain redundant information. We will use **Principal Component Analysis (PCA)** to reduce the dimensionality. PCA transforms the data into a smaller set of uncorrelated variables called 'principal components', ordered by how much of the original variance each one captures. For this project, we'll reduce the data to two principal components so that we can plot the customer clusters in a 2D scatter plot. Two components do not necessarily retain most of the variance, so it is worth checking `pca.explained_variance_ratio_` before reading too much into the plot.

In [None]:
from sklearn.decomposition import PCA

# Apply PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_features)

# Create a DataFrame with the principal components
df_pca = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

print("PCA completed. Data reduced to 2 components.")
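
How much information survives the reduction can be measured rather than assumed. The sketch below fits PCA on a synthetic standardized matrix (the data, `X`, and `pca_demo` are illustrative assumptions) and reads the share of variance captured by each component from `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: six observed columns driven by two latent factors plus noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.5 * rng.normal(size=(300, 6))

# Standardize, mirroring the preprocessing step in the notebook
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca_demo = PCA(n_components=2)
components = pca_demo.fit_transform(X)

# Fraction of total variance captured by each retained component
ratios = pca_demo.explained_variance_ratio_
print("Variance explained per component:", ratios.round(3))
print("Cumulative:", ratios.sum().round(3))
```

Running the same two `print` lines against the notebook's fitted `pca` object shows how faithful the 2D projection is; a low cumulative value means clusters found in the projection may not reflect structure in the full feature space.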

## Section 6: Model Development - K-Means Clustering
With our data prepared, we can now apply the K-Means clustering algorithm. A critical step in K-Means is to determine the optimal number of clusters, `k`. We will use the **Elbow Method** for this.

The Elbow Method involves running the K-Means algorithm for a range of `k` values and calculating the Within-Cluster Sum of Squares (WCSS) for each. WCSS is the sum of the squared distances between each data point and its assigned cluster's center; scikit-learn exposes it as the fitted model's `inertia_` attribute. We then plot these WCSS values against `k`. The 'elbow' on the plot, the point where the rate of decrease in WCSS slows down significantly, is considered a good estimate for the optimal `k`.

In [None]:
from sklearn.cluster import KMeans

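# Use the Elbow Method to find the optimal number of clusters
# Select the component columns explicitly so this cell can be re-run safely
# after a 'Cluster' column has been added to df_pca
pca_features = df_pca[['PC1', 'PC2']]
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(pca_features)
    wcss.append(kmeans.inertia_)  # inertia_ is scikit-learn's name for WCSS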

# Plot the Elbow Method graph
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()
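
The WCSS definition above can be checked directly against scikit-learn. This sketch computes the sum of squared distances from each point to its assigned center and compares it with `inertia_`; the helper `wcss_from_labels` and the synthetic `blobs` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def wcss_from_labels(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))

# Three synthetic, well-separated 2D blobs
rng = np.random.default_rng(0)
blobs = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(blobs)
manual = wcss_from_labels(blobs, model.labels_, model.cluster_centers_)
print(manual, model.inertia_)
```

The two printed values should match up to floating-point error. Because WCSS always decreases as `k` grows, the elbow is a judgment call about diminishing returns rather than a strict minimum.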

## Section 7: Final Model Training and Visualization
The elbow plot suggests that the optimal number of clusters is around 3 or 4. We will choose `k=4` to potentially capture more granular segments. Now, we'll train the final K-Means model with our chosen `k` and assign a cluster label to each customer.

In [None]:
# Set the optimal number of clusters
optimal_k = 4

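# Train the final K-Means model on the two principal components only,
# so re-running this cell ignores any previously added 'Cluster' column
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', max_iter=300, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(df_pca[['PC1', 'PC2']])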

# Add the cluster labels to our PCA dataframe
df_pca['Cluster'] = cluster_labels

print(f"Final model trained with {optimal_k} clusters.")

With the cluster labels assigned, we can now visualize the segments. We will create a scatter plot of the two principal components, with each point colored according to its assigned cluster. This plot provides a clear, intuitive view of how the customer groups are separated.

In [None]:
# Visualize the clusters
plt.figure(figsize=(12, 8))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=df_pca, palette='viridis', s=100, alpha=0.8)
plt.title('Customer Segments Visualized with PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Customer Segment')
plt.grid(True)
plt.show()

## Section 8: Cluster Profiling and Interpretation
The final and most critical step is to understand what defines each cluster. We will add the cluster labels back to our original, unscaled DataFrame. Then, by grouping the data by cluster and calculating the average of each numeric feature, we can create a profile for each segment. This analysis is what transforms the abstract clusters into actionable business personas.

In [None]:
# Add cluster labels back to the original dataframe
df['Cluster'] = cluster_labels

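# Analyze the characteristics of each cluster.
# numeric_only=True skips the text columns (sex, smoker, region), which would
# otherwise raise a TypeError in pandas 2.x
cluster_profile = df.groupby('Cluster').mean(numeric_only=True)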

print("\n--- Customer Segment Profiles ---")
print(cluster_profile)
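
Numeric means leave out the categorical columns, which often matter most for a persona. One possible extension, sketched here on a hypothetical six-row frame (`toy` and its values are assumptions, not results from the model), adds the segment size and the share of smokers per cluster alongside the numeric averages.

```python
import pandas as pd

# Hypothetical labeled data standing in for the clustered customers
toy = pd.DataFrame({
    "age": [23, 25, 51, 58, 40, 44],
    "charges": [2100.0, 2500.0, 30000.0, 42000.0, 7000.0, 8200.0],
    "smoker": ["no", "no", "yes", "yes", "no", "yes"],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Average of the numeric features per cluster
numeric_profile = toy.groupby("Cluster")[["age", "charges"]].mean()

# Share of smokers per cluster: the mean of a boolean column is a proportion
smoker_share = (toy["smoker"] == "yes").groupby(toy["Cluster"]).mean().rename("smoker_share")

# Number of customers per cluster, to show how large each segment is
sizes = toy.groupby("Cluster").size().rename("n_customers")

profile = pd.concat([sizes, numeric_profile, smoker_share], axis=1)
print(profile)
```

The same pattern applies to `sex` and `region` on the real DataFrame, for example with `pd.crosstab(df['Cluster'], df['region'], normalize='index')`.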