<!-- Main Title with color and center alignment -->
<h1 style="color:darkblue; text-align:center; font-family:Arial;">💎 Diamonds Dataset Analysis 💎</h1>

<!-- Subtitle with italic and color -->
<h3 style="color:darkred; text-align:center; font-family:Verdana;">
Data Modification: <em>EDA, Cleaning, and Transformation</em>
</h3>

<!-- Description paragraph with font size and color -->
<p style="color:black; font-size:16px; font-family:Georgia; text-align:justify;">
This notebook focuses on exploring, cleaning, and transforming the diamonds dataset. 
We will perform step-by-step analysis including data inspection, cleaning, creating new features, 
and summarizing insights for visualization and further analysis.
</p>


<!-- Section Header with color and underline -->
<h2 style="color:darkblue; font-family:Verdana;">
1. Setup
</h2>
<hr style="border:1px solid darkblue;">

<!-- Description with italic and different font -->
<p style="font-family:Georgia; font-size:15px; color:black;">
Import the required libraries and load the <strong>diamonds</strong> dataset.
</p>


In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Set style for better-looking plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

# Load the diamonds dataset
diamonds = sns.load_dataset("diamonds")

print("Dataset loaded successfully!")
print(f"Shape: {diamonds.shape}")


<!-- Section Header with color and underline -->
<h2 style="color:darkblue; font-family:Verdana;">
2. Exploratory Data Analysis (EDA)
</h2>
<hr style="border:1px solid darkblue;">

<!-- Description with styled font -->
<p style="font-family:Georgia; font-size:15px; color:black;">
Let's explore the dataset structure, data types, and basic statistics.
</p>


In [None]:
# Display basic information about the dataset
print("=" * 60)
print("DATASET SHAPE")
print("=" * 60)
print(f"Rows: {diamonds.shape[0]:,}")
print(f"Columns: {diamonds.shape[1]}")
print()

print("=" * 60)
print("FIRST FEW ROWS")
print("=" * 60)
diamonds.head(10)


In [None]:
# Data types and basic info
print("=" * 60)
print("DATA TYPES AND INFO")
print("=" * 60)
print(diamonds.dtypes)
print()
# info() prints its report directly and returns None, so it is not wrapped in print()
diamonds.info()


In [None]:
# Check for missing values
print("=" * 60)
print("MISSING VALUES")
print("=" * 60)
missing = diamonds.isnull().sum()
missing_pct = (missing / len(diamonds)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing Percentage': missing_pct
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
if len(missing_df) > 0:
    print(missing_df)
else:
    print("No missing values found!")


In [None]:
# Unique values for categorical columns
print("=" * 60)
print("UNIQUE VALUES FOR CATEGORICAL COLUMNS")
print("=" * 60)
categorical_cols = diamonds.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    unique_vals = diamonds[col].unique()
    print(f"\n{col}:")
    print(f"  Unique count: {len(unique_vals)}")
    print(f"  Values: {list(unique_vals)}")
    print(f"  Value counts:")
    print(diamonds[col].value_counts().to_string())


In [None]:
# Basic statistics for numerical columns
print("=" * 60)
print("NUMERICAL SUMMARY STATISTICS")
print("=" * 60)
diamonds.describe()


<!-- Section Header with color and underline -->
<h2 style="color:darkblue; font-family:Verdana;">
3. Data Cleaning
</h2>
<hr style="border:1px solid darkblue;">

<!-- Description with styled font -->
<p style="font-family:Georgia; font-size:15px; color:black;">
Handle missing values, set proper data types, and rename columns if needed.
</p>


In [None]:
# Create a copy for cleaning
df_clean = diamonds.copy()

print("Starting data cleaning process...")
print(f"Original shape: {df_clean.shape}")


In [None]:
# Handle missing values (if any)
# Since diamonds dataset typically has no missing values, we'll document the approach
if df_clean.isnull().sum().sum() > 0:
    print("Missing values detected. Handling them...")
    # For numerical columns, we could use median or mean
    # For categorical columns, we could use mode or 'Unknown'
    # For this dataset, we'll document the approach but likely won't need it
    print("Missing value handling strategy:")
    print("  - Numerical: Fill with median (robust to outliers)")
    print("  - Categorical: Fill with mode or 'Unknown'")
else:
    print("No missing values to handle. Dataset is complete.")
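
<p style="font-family:Georgia; font-size:15px; color:black;">
The cell above only describes the fill strategy. A minimal sketch of how it could be applied is shown below, assuming a small hypothetical frame named <code>toy</code> with deliberately missing entries (the diamonds data itself has none).
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; not taken from the diamonds data.
toy = pd.DataFrame({
    "carat": [0.3, np.nan, 0.9, 1.2],
    "cut": ["Ideal", "Ideal", None, "Good"],
})

filled = toy.copy()

# Numerical columns: fill with the column median (robust to outliers)
num_cols = filled.select_dtypes(include=[np.number]).columns
for col in num_cols:
    filled[col] = filled[col].fillna(filled[col].median())

# Non-numerical columns: fill with the most frequent value, else 'Unknown'
cat_cols = filled.select_dtypes(exclude=[np.number]).columns
for col in cat_cols:
    mode = filled[col].mode(dropna=True)
    fill_value = mode.iloc[0] if not mode.empty else "Unknown"
    filled[col] = filled[col].fillna(fill_value)

print(filled)
```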


In [None]:
# Set proper data types
# Convert categorical columns to category type for better performance and memory usage
print("\nSetting proper data types...")

# Categorical columns that should be ordered
ordered_categories = {
    'cut': ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'],
    'color': ['J', 'I', 'H', 'G', 'F', 'E', 'D'],  # D is best, J is worst
    'clarity': ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']  # IF is best, I1 is worst
}

for col, order in ordered_categories.items():
    if col in df_clean.columns:
        df_clean[col] = pd.Categorical(df_clean[col], categories=order, ordered=True)
        print(f"  - {col}: Converted to ordered categorical")

# Check for any other object columns that should be categorical
other_categorical = df_clean.select_dtypes(include=['object']).columns
for col in other_categorical:
    if col not in ordered_categories:
        df_clean[col] = df_clean[col].astype('category')
        print(f"  - {col}: Converted to categorical")

print("\nData types after conversion:")
print(df_clean.dtypes)
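
<p style="font-family:Georgia; font-size:15px; color:black;">
To illustrate what the ordered conversion buys, the sketch below uses a short hypothetical series of cut grades: once the order is declared, comparisons, sorting, and min/max follow grade quality rather than alphabetical order.
</p>

```python
import pandas as pd

cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Hypothetical sample of cut grades, stored as an ordered categorical
cuts = pd.Series(
    pd.Categorical(["Ideal", "Fair", "Premium", "Good"],
                   categories=cut_order, ordered=True)
)

# Comparisons use the declared order, not string ordering
at_least_premium = cuts >= "Premium"

# Sorting and min/max also respect the declared order
sorted_cuts = cuts.sort_values().tolist()
worst, best = cuts.min(), cuts.max()

print(at_least_premium.tolist())
print(sorted_cuts)
print(worst, best)
```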

In [None]:
# Check for outliers in numerical columns
print("\nChecking for potential outliers...")
numerical_cols = df_clean.select_dtypes(include=[np.number]).columns

for col in numerical_cols:
    Q1 = df_clean[col].quantile(0.25)
    Q3 = df_clean[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = ((df_clean[col] < lower_bound) | (df_clean[col] > upper_bound)).sum()
    if outliers > 0:
        print(f"  - {col}: {outliers:,} potential outliers ({outliers/len(df_clean)*100:.2f}%)")
    else:
        print(f"  - {col}: No outliers detected")

print("\nNote: Outliers in diamond data may be legitimate (e.g., very large or expensive diamonds).")
print("We'll keep them unless they represent data entry errors.")
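
<p style="font-family:Georgia; font-size:15px; color:black;">
One way to separate legitimate extremes from likely data entry errors is to look for physically impossible values, such as a dimension of zero. The sketch below assumes a hypothetical three-row frame; the column names match the dataset, but the values are made up.
</p>

```python
import pandas as pd

# Hypothetical rows; the second has impossible zero dimensions
toy = pd.DataFrame({
    "carat": [0.31, 1.00, 0.71],
    "x": [4.34, 0.00, 5.73],
    "y": [4.35, 6.50, 5.70],
    "z": [2.75, 0.00, 3.52],
})

dims = ["x", "y", "z"]

# A diamond with positive weight cannot have a zero or negative dimension
impossible = (toy[dims] <= 0).any(axis=1)

print(f"Rows with impossible dimensions: {impossible.sum()}")
print(toy[impossible])
```

<p style="font-family:Georgia; font-size:15px; color:black;">
Rows flagged this way could be dropped or have their dimensions set to missing before derived columns such as volume are computed; which option fits depends on the analysis.
</p>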

<!-- Section Header with color and underline -->
<h2 style="color:darkblue; font-family:Verdana;">
4. Data Transformation
</h2>
<hr style="border:1px solid darkblue;">

<!-- Description with styled font -->
<p style="font-family:Georgia; font-size:15px; color:black;">
Create helpful derived columns (ratios, bins, flags) and summarize the data.
</p>


In [None]:
# Create helpful derived columns
print("Creating derived columns...")

# 1. Price per carat (important metric for diamond value)
df_clean['price_per_carat'] = df_clean['price'] / df_clean['carat']
print("  - Created 'price_per_carat': price divided by carat weight")

# 2. Volume (approximate, using x * y * z)
df_clean['volume'] = df_clean['x'] * df_clean['y'] * df_clean['z']
print("  - Created 'volume': x * y * z (cubic mm)")

# 3. Depth percentage, recomputed from the dimensions so it can be compared
#    with the existing 'depth' column: depth = (z / mean(x, y)) * 100
df_clean['depth_calculated'] = (df_clean['z'] / ((df_clean['x'] + df_clean['y']) / 2)) * 100
print("  - Created 'depth_calculated': calculated depth percentage")

# 4. Table percentage: already provided by the 'table' column (width of the
#    top facet relative to the diamond's widest point), so nothing to compute
print("  - 'table' column already exists as table percentage")

# 5. Size category based on carat
df_clean['size_category'] = pd.cut(
    df_clean['carat'],
    bins=[0, 0.5, 1.0, 2.0, float('inf')],
    labels=['Small', 'Medium', 'Large', 'Very Large']
)
# pd.cut bins are right-inclusive by default, so 0.5 is Small and 1.0 is Medium
print("  - Created 'size_category': Small (<=0.5), Medium (>0.5 to 1.0), Large (>1.0 to 2.0), Very Large (>2.0)")

# 6. Price category
df_clean['price_category'] = pd.qcut(
    df_clean['price'],
    q=4,
    labels=['Budget', 'Mid-Range', 'Premium', 'Luxury']
)
print("  - Created 'price_category': Quartile-based price categories")

# 7. Flag for ideal cut
df_clean['is_ideal_cut'] = (df_clean['cut'] == 'Ideal').astype(int)
print("  - Created 'is_ideal_cut': Binary flag for ideal cut diamonds")

# 8. Flag for best color (D, E, F are considered colorless/premium)
df_clean['is_premium_color'] = df_clean['color'].isin(['D', 'E', 'F']).astype(int)
print("  - Created 'is_premium_color': Binary flag for premium color grades (D, E, F)")

# 9. Flag for best clarity (IF, VVS1, VVS2 are considered flawless/near-flawless)
df_clean['is_premium_clarity'] = df_clean['clarity'].isin(['IF', 'VVS1', 'VVS2']).astype(int)
print("  - Created 'is_premium_clarity': Binary flag for premium clarity grades")

print(f"\nNew shape: {df_clean.shape}")
print(f"New columns: {df_clean.shape[1] - diamonds.shape[1]} additional columns created")
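
<p style="font-family:Georgia; font-size:15px; color:black;">
Because bin edges decide which category a boundary value such as 0.5 or 1.0 carat falls into, the sketch below compares the default right-inclusive behavior of <code>pd.cut</code> with <code>right=False</code> on a few hypothetical carat values.
</p>

```python
import pandas as pd

# Hypothetical carat weights chosen to sit on and around the bin edges
carats = pd.Series([0.3, 0.5, 0.51, 1.0, 1.01, 2.0, 2.5])
bins = [0, 0.5, 1.0, 2.0, float("inf")]
labels = ["Small", "Medium", "Large", "Very Large"]

# Default (right=True): intervals are (0, 0.5], (0.5, 1.0], (1.0, 2.0], (2.0, inf]
right_closed = pd.cut(carats, bins=bins, labels=labels)

# right=False: intervals are [0, 0.5), [0.5, 1.0), [1.0, 2.0), [2.0, inf)
left_closed = pd.cut(carats, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    "carat": carats,
    "right=True": right_closed.astype(str),
    "right=False": left_closed.astype(str),
})
print(comparison)
```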


In [None]:
# Display the new columns
print("New columns created:")
new_cols = [col for col in df_clean.columns if col not in diamonds.columns]
print(new_cols)
print("\nSample of new columns:")
df_clean[['carat', 'price', 'price_per_carat', 'size_category', 'price_category', 
          'is_ideal_cut', 'is_premium_color', 'is_premium_clarity']].head(10)
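
<p style="font-family:Georgia; font-size:15px; color:black;">
The recomputed depth is only useful if it is actually compared with the recorded one. A minimal sketch of such a check follows, assuming a hypothetical frame <code>toy</code> and an arbitrary tolerance of one percentage point; it also guards against division by zero when both <code>x</code> and <code>y</code> are zero.
</p>

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one has zero dimensions
toy = pd.DataFrame({
    "x": [3.95, 4.05, 0.00],
    "y": [3.98, 4.07, 0.00],
    "z": [2.43, 2.31, 0.00],
    "depth": [61.5, 56.9, 62.0],
})

# Replace a zero denominator with NaN so the result is missing, not inf
mean_xy = (toy["x"] + toy["y"]) / 2
toy["depth_calculated"] = toy["z"] / mean_xy.replace(0, np.nan) * 100

# Absolute gap between recorded and recomputed depth
toy["depth_diff"] = (toy["depth_calculated"] - toy["depth"]).abs()

# Hypothetical tolerance: flag gaps larger than one percentage point
toy["depth_mismatch"] = toy["depth_diff"] > 1.0

print(toy.round(2))
```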


In [None]:
# Summary by cut quality
print("=" * 60)
print("SUMMARY BY CUT QUALITY")
print("=" * 60)
cut_summary = df_clean.groupby('cut', observed=False).agg({
    'price': ['mean', 'median', 'count'],
    'carat': 'mean',
    'price_per_carat': 'mean'
}).round(2)
cut_summary.columns = ['Avg Price', 'Median Price', 'Count', 'Avg Carat', 'Avg Price/Carat']
cut_summary


In [None]:
# Summary by color grade
print("=" * 60)
print("SUMMARY BY COLOR GRADE")
print("=" * 60)
color_summary = df_clean.groupby('color', observed=False).agg({
    'price': ['mean', 'median', 'count'],
    'carat': 'mean',
    'price_per_carat': 'mean'
}).round(2)
color_summary.columns = ['Avg Price', 'Median Price', 'Count', 'Avg Carat', 'Avg Price/Carat']
color_summary


In [None]:
# Summary by clarity grade
print("=" * 60)
print("SUMMARY BY CLARITY GRADE")
print("=" * 60)
clarity_summary = df_clean.groupby('clarity', observed=False).agg({
    'price': ['mean', 'median', 'count'],
    'carat': 'mean',
    'price_per_carat': 'mean'
}).round(2)
clarity_summary.columns = ['Avg Price', 'Median Price', 'Count', 'Avg Carat', 'Avg Price/Carat']
clarity_summary


In [None]:
# Summary by size category
print("=" * 60)
print("SUMMARY BY SIZE CATEGORY")
print("=" * 60)
size_summary = df_clean.groupby('size_category', observed=False).agg({
    'price': ['mean', 'median', 'count'],
    'carat': ['mean', 'min', 'max'],
    'price_per_carat': 'mean'
}).round(2)
size_summary.columns = ['Avg Price', 'Median Price', 'Count', 'Avg Carat', 'Min Carat', 'Max Carat', 'Avg Price/Carat']
size_summary
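
<p style="font-family:Georgia; font-size:15px; color:black;">
The summaries above rename the aggregated columns by position, which breaks silently if the aggregation list changes. As an alternative sketch, pandas named aggregation sets each output name next to its source column and function; the frame <code>toy</code> and the output names below are hypothetical.
</p>

```python
import pandas as pd

# Hypothetical miniature of the cleaned data
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Good", "Good"],
    "price": [500, 700, 400, 1000],
    "carat": [0.3, 0.4, 0.5, 1.0],
})
toy["price_per_carat"] = toy["price"] / toy["carat"]

# Named aggregation: output_name=(source_column, function)
summary = toy.groupby("cut").agg(
    avg_price=("price", "mean"),
    median_price=("price", "median"),
    count=("price", "count"),
    avg_carat=("carat", "mean"),
    avg_price_per_carat=("price_per_carat", "mean"),
).round(2)

print(summary)
```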


In [None]:
# Pivot table: Average price by cut and color
print("=" * 60)
print("PIVOT TABLE: AVERAGE PRICE BY CUT AND COLOR")
print("=" * 60)
pivot_cut_color = pd.pivot_table(
    df_clean,
    values='price',
    index='cut',
    columns='color',
    aggfunc='mean',
    observed=False
).round(0)
pivot_cut_color


In [None]:
# Pivot table: Average price per carat by cut and clarity
print("=" * 60)
print("PIVOT TABLE: AVERAGE PRICE PER CARAT BY CUT AND CLARITY")
print("=" * 60)
pivot_cut_clarity = pd.pivot_table(
    df_clean,
    values='price_per_carat',
    index='cut',
    columns='clarity',
    aggfunc='mean',
    observed=False
).round(0)
pivot_cut_clarity


In [None]:
# Value counts for categorical variables
print("=" * 60)
print("VALUE COUNTS FOR CATEGORICAL VARIABLES")
print("=" * 60)

print("\nCut distribution:")
print(df_clean['cut'].value_counts().sort_index())

print("\nColor distribution:")
print(df_clean['color'].value_counts().sort_index())

print("\nClarity distribution:")
print(df_clean['clarity'].value_counts().sort_index())

print("\nSize category distribution:")
print(df_clean['size_category'].value_counts().sort_index())

print("\nPrice category distribution:")
print(df_clean['price_category'].value_counts().sort_index())


In [None]:
# Cross-tabulation: Cut vs Color
print("=" * 60)
print("CROSS-TABULATION: CUT vs COLOR")
print("=" * 60)
pd.crosstab(df_clean['cut'], df_clean['color'], margins=True)
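
<p style="font-family:Georgia; font-size:15px; color:black;">
Raw counts are hard to compare across cuts of very different sizes. The sketch below, on a hypothetical five-row frame, uses the <code>normalize</code> argument of <code>pd.crosstab</code> to show the share of each color within each cut instead.
</p>

```python
import pandas as pd

# Hypothetical cut/color pairs
toy = pd.DataFrame({
    "cut": ["Ideal", "Ideal", "Ideal", "Good", "Good"],
    "color": ["D", "E", "E", "D", "J"],
})

# normalize="index" divides each row by its total, so every row sums to 1
row_share = pd.crosstab(toy["cut"], toy["color"], normalize="index")

print(row_share.round(2))
```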


In [None]:
# Summary statistics for premium flags
print("=" * 60)
print("PREMIUM FEATURES SUMMARY")
print("=" * 60)
premium_summary = pd.DataFrame({
    'Ideal Cut': [df_clean['is_ideal_cut'].sum(), f"{df_clean['is_ideal_cut'].mean()*100:.1f}%"],
    'Premium Color (D/E/F)': [df_clean['is_premium_color'].sum(), f"{df_clean['is_premium_color'].mean()*100:.1f}%"],
    'Premium Clarity (IF/VVS)': [df_clean['is_premium_clarity'].sum(), f"{df_clean['is_premium_clarity'].mean()*100:.1f}%"]
}, index=['Count', 'Percentage'])
premium_summary


<!-- Section Header with color and underline -->
<h2 style="color:darkblue; font-family:Verdana;">
5. Visualization
</h2>
<hr style="border:1px solid darkblue;">

<!-- Short Description with styled font -->
<p style="font-family:Georgia; font-size:15px; color:black;">
Use <strong>Seaborn</strong> and <strong>Matplotlib</strong> to visualize distributions, comparisons, and relationships with clear titles and labels.
</p>
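
<p style="font-family:Georgia; font-size:15px; color:black;">
As one possible starting point, the sketch below draws a labeled price-versus-carat scatter plot on hypothetical toy data using plain Matplotlib. The log scales are an assumption made for readability, not something this notebook prescribes; Seaborn's <code>scatterplot</code> with a <code>hue</code> argument would give a similar figure.
</p>

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values, not taken from the diamonds data
toy = pd.DataFrame({
    "carat": [0.3, 0.5, 0.7, 1.0, 1.5, 2.0],
    "price": [400, 1200, 2500, 5000, 9500, 15000],
    "cut": ["Ideal", "Good", "Ideal", "Premium", "Good", "Premium"],
})

fig, ax = plt.subplots(figsize=(8, 5))

# One scatter series per cut so each gets its own legend entry
for cut, group in toy.groupby("cut"):
    ax.scatter(group["carat"], group["price"], label=cut, alpha=0.7)

ax.set_xscale("log")
ax.set_yscale("log")
ax.set_title("Price vs. carat by cut (toy data)")
ax.set_xlabel("Carat (log scale)")
ax.set_ylabel("Price in USD (log scale)")
ax.legend(title="Cut")
fig.tight_layout()
```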


<!-- Section Header with color and underline -->
<h2 style="color:darkblue; font-family:Verdana;">
6. Deeper Analysis: Price per Carat Controlling for Size
</h2>
<hr style="border:1px solid darkblue;">

<!-- Description with styled font -->
<p style="font-family:Georgia; font-size:15px; color:black;">
The previous summaries showed counterintuitive patterns because diamonds with better color or clarity grades tend to be smaller, and size has a strong effect on price. Comparing price per carat within each size category separates the effect of grade from the effect of size.
</p>
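
<p style="font-family:Georgia; font-size:15px; color:black;">
The reversal described here can be reproduced on a small scale. In the hypothetical frame below (made-up prices, and plain price rather than price per carat for simplicity), color D has the lower overall average because most D stones are small, yet D is more expensive than J within each size group.
</p>

```python
import pandas as pd

# Hypothetical prices: D is mostly small stones, J is mostly large stones
toy = pd.DataFrame({
    "color": ["D", "D", "D", "J", "J", "J"],
    "size_category": ["Small", "Small", "Large", "Small", "Large", "Large"],
    "price": [1000, 1100, 8000, 700, 6000, 6500],
})

# Unadjusted comparison: average price by color only
overall = toy.groupby("color")["price"].mean()

# Size-adjusted comparison: average price by color within each size group
by_size = toy.groupby(["color", "size_category"])["price"].mean().unstack()

print("Overall average price by color:")
print(overall.round(2))
print("\nAverage price by color within each size category:")
print(by_size.round(2))
```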


In [None]:
# Price per carat by color, controlling for size category
print("=" * 60)
print("PRICE PER CARAT BY COLOR (CONTROLLING FOR SIZE)")
print("=" * 60)
color_size_price = df_clean.groupby(['color', 'size_category'], observed=False)['price_per_carat'].mean().unstack()
print(color_size_price.round(2))
print("\nNote: Within each size category, better color (D/E/F) should show higher price per carat")


In [None]:
# Price per carat by clarity, controlling for size category  
print("=" * 60)
print("PRICE PER CARAT BY CLARITY (CONTROLLING FOR SIZE)")
print("=" * 60)
clarity_size_price = df_clean.groupby(['clarity', 'size_category'], observed=False)['price_per_carat'].mean().unstack()
print(clarity_size_price.round(2))
print("\nNote: Within each size category, better clarity (IF/VVS) should show higher price per carat")


In [None]:
# Focus on medium-sized diamonds (>0.5 to 1.0 carat), matching the 'Medium' bin of size_category
print("=" * 60)
print("ANALYSIS FOR MEDIUM DIAMONDS (0.5-1.0 carat)")
print("=" * 60)
medium_diamonds = df_clean[(df_clean['carat'] > 0.5) & (df_clean['carat'] <= 1.0)]
print(f"Number of diamonds in this range: {len(medium_diamonds):,} ({len(medium_diamonds)/len(df_clean)*100:.1f}% of total)")

print("\nPrice per carat by color (0.5-1.0 carat range):")
color_medium = medium_diamonds.groupby('color', observed=False)['price_per_carat'].agg(['mean', 'count']).round(2)
color_medium.columns = ['Avg Price/Carat', 'Count']
print(color_medium.sort_index())


In [None]:
print("\nPrice per carat by clarity (0.5-1.0 carat range):")
clarity_medium = medium_diamonds.groupby('clarity', observed=False)['price_per_carat'].agg(['mean', 'count']).round(2)
clarity_medium.columns = ['Avg Price/Carat', 'Count']
print(clarity_medium.sort_index())


In [None]:
# Visual comparison: Average carat size by color and clarity
print("=" * 60)
print("AVERAGE CARAT SIZE BY COLOR AND CLARITY")
print("=" * 60)
print("\nThis shows why the unadjusted averages were misleading:")
print("\nAverage carat by color:")
print(df_clean.groupby('color', observed=False)['carat'].mean().sort_index().round(3))
print("\nAverage carat by clarity:")
print(df_clean.groupby('clarity', observed=False)['carat'].mean().sort_index().round(3))


<!-- Section Header with color and underline -->
<h2 style="color:darkblue; font-family:Verdana; border-bottom:2px solid darkblue; padding-bottom:5px;">
Summary of Data Modification Steps
</h2>

<!-- Subheader -->
<h3 style="color:darkred; font-family:Verdana; margin-top:15px;">
Steps Completed
</h3>

<!-- Description with styled font and spacing -->
<div style="font-family:Georgia; font-size:15px; color:#333; line-height:1.6;">

<p><strong>1. Data Loading:</strong> Loaded the diamonds dataset using Seaborn (53,940 rows, 10 columns)</p>

<p><strong>2. Exploratory Data Analysis:</strong><br>
&nbsp;&nbsp;- Examined dataset shape, data types, and structure<br>
&nbsp;&nbsp;- Checked for missing values (none found)<br>
&nbsp;&nbsp;- Analyzed unique values for categorical columns (cut, color, clarity)<br>
&nbsp;&nbsp;- Reviewed numerical summary statistics
</p>

<p><strong>3. Data Cleaning:</strong><br>
&nbsp;&nbsp;- Set proper data types: Converted categorical columns to ordered categories<br>
&nbsp;&nbsp;&nbsp;&nbsp;- Cut: Fair &lt; Good &lt; Very Good &lt; Premium &lt; Ideal<br>
&nbsp;&nbsp;&nbsp;&nbsp;- Color: J (worst) to D (best)<br>
&nbsp;&nbsp;&nbsp;&nbsp;- Clarity: I1 (worst) to IF (best)<br>
&nbsp;&nbsp;- Checked for outliers (found some, but kept them as they may be legitimate)<br>
&nbsp;&nbsp;- No missing values to handle
</p>

<p><strong>4. Data Transformation:</strong><br>
&nbsp;&nbsp;- Created derived columns:<br>
&nbsp;&nbsp;&nbsp;&nbsp;- <code>price_per_carat</code>: Price divided by carat weight<br>
&nbsp;&nbsp;&nbsp;&nbsp;- <code>volume</code>: x * y * z (cubic mm)<br>
&nbsp;&nbsp;&nbsp;&nbsp;- <code>depth_calculated</code>: Calculated depth percentage<br>
&nbsp;&nbsp;&nbsp;&nbsp;- <code>size_category</code>: Binned carat into Small, Medium, Large, Very Large<br>
&nbsp;&nbsp;&nbsp;&nbsp;- <code>price_category</code>: Quartile-based price categories (Budget, Mid-Range, Premium, Luxury)<br>
&nbsp;&nbsp;&nbsp;&nbsp;- <code>is_ideal_cut</code>: Binary flag for ideal cut diamonds<br>
&nbsp;&nbsp;&nbsp;&nbsp;- <code>is_premium_color</code>: Binary flag for premium color grades (D, E, F)<br>
&nbsp;&nbsp;&nbsp;&nbsp;- <code>is_premium_clarity</code>: Binary flag for premium clarity grades (IF, VVS1, VVS2)
</p>

<p><strong>5. Data Summarization:</strong><br>
&nbsp;&nbsp;- Groupby aggregations by cut, color, clarity, and size category<br>
&nbsp;&nbsp;- Pivot tables showing relationships between cut/color and cut/clarity<br>
&nbsp;&nbsp;- Value counts for all categorical variables<br>
&nbsp;&nbsp;- Cross-tabulations for cut vs color<br>
&nbsp;&nbsp;- Summary of premium features
</p>

<p>The cleaned and transformed dataset is now ready for visualization and further analysis.</p>
</div>


<!-- Section Header with color and underline -->
<h2 style="color:darkblue; font-family:Verdana;">
7. Insights &amp; Summary
</h2>
<hr style="border:1px solid darkblue;">

<!-- Short Description with styled font -->
<p style="font-family:Georgia; font-size:15px; color:black;">
Summarize key findings, highlight patterns in price, cut, color, and clarity, and note potential business insights such as which diamonds may be overpriced or underpriced.
</p>
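
<p style="font-family:Georgia; font-size:15px; color:black;">
One simple way to look for possibly overpriced or underpriced diamonds is to compare each stone's price per carat with the median of its peer group. The sketch below assumes hypothetical data, a peer group defined by cut and size category, and an arbitrary 25% threshold; it is a screening heuristic, not a valuation model.
</p>

```python
import pandas as pd

# Hypothetical price-per-carat values for two peer groups
toy = pd.DataFrame({
    "cut": ["Ideal"] * 4 + ["Good"] * 4,
    "size_category": ["Medium"] * 8,
    "price_per_carat": [4000, 4200, 4100, 6500, 3000, 3100, 2000, 3050],
})

# Median price per carat of each diamond's peer group, aligned row by row
group_cols = ["cut", "size_category"]
group_median = toy.groupby(group_cols)["price_per_carat"].transform("median")

# Ratio above 1 means pricier than peers, below 1 means cheaper
toy["rel_to_peers"] = toy["price_per_carat"] / group_median

threshold = 0.25  # hypothetical cutoff: 25% away from the peer median
toy["pricing_flag"] = "typical"
toy.loc[toy["rel_to_peers"] > 1 + threshold, "pricing_flag"] = "possibly overpriced"
toy.loc[toy["rel_to_peers"] < 1 - threshold, "pricing_flag"] = "possibly underpriced"

print(toy.round(2))
```

<p style="font-family:Georgia; font-size:15px; color:black;">
On the real data, the peer group would likely also need color and clarity, and small groups should be treated with caution because their medians are unstable.
</p>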

