# CSCI 381: Applied Data Science: Quiz 1

Mauricio Monje
June 10, 2025

## Data Summary

In this section, we summarize the diamonds dataset, including the number of use cases (records), the number of attributes, and the data type of each attribute. The dataset is loaded directly from my GitHub repository.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

data_url = "https://github.com/mmonj/CSCI-381-Applied-Data-Science/raw/refs/heads/main/quiz1/data/diamonds.csv"
df_original = pd.read_csv(data_url, index_col=0)

# take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
df_diamonds = df_original.drop_duplicates().copy()

print(f"Dropped {df_original.shape[0] - df_diamonds.shape[0]} duplicates")

num_use_cases, num_attributes_per_use_case = df_diamonds.shape
print(f"Total number of use cases: {num_use_cases}")
print(f"Number of attributes per use case: {num_attributes_per_use_case}")

print("\nData types of the attributes:")
print(df_diamonds.dtypes)

print("\nFirst 5 rows of dataframe:")
print(df_diamonds.head())

**Summary:**
- The dataset contains over 50,000 diamond records, each with several attributes such as carat, cut, color, clarity, depth, table, price, and dimensions (x, y, z).
- Data types are a mix of numeric (e.g., carat, price) and categorical (e.g., cut, color, clarity) variables.
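The numeric/categorical split described above can also be derived programmatically rather than read off the `dtypes` printout. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative stand-ins, not rows quoted from the dataset); the same two `select_dtypes` calls should work on `df_diamonds`.

```python
import pandas as pd

# Toy stand-in for df_diamonds; the values are illustrative only.
toy = pd.DataFrame(
    {
        "carat": [0.23, 0.21, 0.31],
        "cut": ["Ideal", "Premium", "Good"],
        "price": [326, 326, 335],
    }
)

# Columns with a numeric dtype (int or float)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else (here: string-valued columns)
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Keeping these two lists around avoids hard-coding attribute names in the plotting loops.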

## Exploratory Data Analysis (EDA)

- Numeric variables (carat, depth, table, price, x, y, z) are analyzed with histograms and boxplots to understand their distribution and skewness and to detect outliers.
- Categorical variables (cut, color, clarity) are analyzed with count plots to understand category distribution and with boxplots of price to examine relationship to value.

In [None]:
print("Summary statistics for numeric attributes:")
df_diamonds.describe()

### Numeric Variable Analysis

For each numeric attribute, we present histograms and boxplots using both Matplotlib and Seaborn to visualize distributions and identify outliers.

- **Histograms** were used here for the diamonds dataset because we want to see how things like carat, price, and dimensions are distributed. For example, we can quickly see if most diamonds are small or if there are a lot of expensive ones. This helps us understand if the data is skewed or if there are common value ranges.
- **Boxplots** are useful here because the diamonds dataset can have outliers (like a few very large or expensive diamonds). Boxplots make it easy to spot these outliers and compare the spread and center of each numeric attribute, which is important for understanding the variety in diamond characteristics.
- Using both Matplotlib and Seaborn lets us double-check our findings and see the data in different styles, which can help to discern patterns.
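What the histograms and boxplots show visually can also be checked numerically. The sketch below computes sample skewness and flags points outside the 1.5 × IQR fences, which is the default whisker rule for Matplotlib and Seaborn boxplots. The helper name `iqr_outlier_mask` and the toy prices are assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy prices with one very expensive diamond; illustrative values only.
toy_price = pd.Series([300, 320, 350, 400, 450, 500, 18000])

mask = iqr_outlier_mask(toy_price)
n_outliers = int(mask.sum())
skewness = toy_price.skew()  # positive values indicate a right-skewed distribution

print(f"Outliers flagged: {n_outliers}")
print(f"Skewness: {skewness:.2f}")
```

On the real data, `iqr_outlier_mask(df_diamonds["price"]).sum()` would give the number of points drawn beyond the whiskers in the price boxplot.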

In [None]:
attributes = ["carat", "depth", "table", "price", "x", "y", "z"]

for attr in attributes:
    # histogram matplotlib
    plt.figure(figsize=(6, 4))
    plt.hist(df_diamonds[attr], bins=30, edgecolor="black")
    plt.title(f"Histogram of {attr} (Matplotlib)")
    plt.xlabel(attr)
    plt.ylabel("Frequency")
    plt.grid(True)
    plt.show()

    # histogram seaborn
    plt.figure(figsize=(6, 4))
    sns.histplot(df_diamonds[attr], bins=30, kde=True)
    plt.title(f"Histogram of {attr} (Seaborn)")
    plt.xlabel(attr)
    plt.ylabel("Frequency")
    plt.show()

    # boxplot matplotlib
    plt.figure(figsize=(6, 1.5))
    plt.boxplot(df_diamonds[attr], vert=False)
    plt.title(f"Boxplot of {attr} (Matplotlib)")
    plt.xlabel(attr)
    plt.show()

    # boxplot seaborn
    plt.figure(figsize=(6, 1.5))
    sns.boxplot(x=df_diamonds[attr])
    plt.title(f"Boxplot of {attr} (Seaborn)")
    plt.xlabel(attr)
    plt.show()

### Categorical Variable Analysis

For each categorical attribute, we present bar charts and count plots to visualize category frequencies, and boxplots of price by category to explore relationships with price.

In [None]:
attributes = ["cut", "color", "clarity"]

for attr in attributes:
    # bar chart matplotlib
    plt.figure(figsize=(6, 4))
    df_diamonds[attr].value_counts().plot(kind="bar")
    plt.title(f"Count of {attr} categories (Matplotlib)")
    plt.xlabel(attr)
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()

    # countplot seaborn
    plt.figure(figsize=(6, 4))
    sns.countplot(x=attr, data=df_diamonds, order=df_diamonds[attr].value_counts().index)
    plt.title(f"Count of {attr} categories (Seaborn)")
    plt.xlabel(attr)
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()

    # boxplot of price by category with Seaborn
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=attr, y="price", data=df_diamonds, order=df_diamonds[attr].value_counts().index)
    plt.title(f"Price distribution by {attr} (Seaborn)")
    plt.xlabel(attr)
    plt.ylabel("Price")
    plt.tight_layout()
    plt.show()

**Findings:**
- 'carat' and 'price' are right-skewed; most diamonds are small and inexpensive, but a few large or expensive ones stretch the upper tail.
- 'depth' and 'table' are roughly normally distributed.
- 'x', 'y', 'z' follow similar patterns as 'carat' but include outliers.
- Categorical features like 'cut', 'color', and 'clarity' show uneven class distributions (e.g., 'Ideal' cut is most common).
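- Price varies widely within every clarity and color grade, and the raw medians do not rise consistently with better grades, plausibly because better-graded diamonds in this dataset tend to be smaller, so carat confounds the comparison.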
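One way to check how price relates to a quality grade is to compare group medians of price alongside group medians of carat, since diamond size can differ between grades. This is a minimal sketch; the `toy` frame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the diamonds dataset.
toy = pd.DataFrame(
    {
        "clarity": ["SI2", "SI2", "VS1", "VS1", "IF", "IF"],
        "carat": [1.20, 1.00, 0.70, 0.80, 0.30, 0.40],
        "price": [5000, 4200, 3000, 3400, 900, 1300],
    }
)

# Median carat and price per clarity grade, side by side
medians = toy.groupby("clarity")[["carat", "price"]].median()
print(medians)
```

Applied to `df_diamonds`, the same `groupby` call would show whether a price difference between grades coincides with a difference in typical carat.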

## Inferences

### 1. What proportion of diamonds are between 0.30 and 1.08 carats?

In [None]:
lower = 0.30
upper = 1.08

count_in_range = df_diamonds[(df_diamonds["carat"] >= lower) & (df_diamonds["carat"] <= upper)].shape[0]
proportion = count_in_range / df_diamonds.shape[0]

print(f"Number in range: {count_in_range}")
print(f"Proportion in range: {proportion:.4f}")
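The same range check can be written more compactly with `Series.between`, which is inclusive on both ends by default, and the mean of the resulting boolean mask is the proportion. A small sketch on made-up carat values (the `toy_carat` series is hypothetical):

```python
import pandas as pd

# Made-up carat values; two fall outside [0.30, 1.08].
toy_carat = pd.Series([0.23, 0.30, 0.70, 1.08, 1.50])

in_range = toy_carat.between(0.30, 1.08)  # both endpoints included by default
proportion = in_range.mean()  # fraction of True values

print(f"Number in range: {in_range.sum()}")
print(f"Proportion in range: {proportion:.2f}")
```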

### 2. How many of the diamonds have equal x and y dimensions?

In [None]:
result = df_diamonds[df_diamonds["x"] == df_diamonds["y"]]
print(f"Diamonds with x == y: {result.shape[0]}")

### 3. How many of the diamonds have a carat value less than the mean carat value?

In [None]:
mean_carat = df_diamonds["carat"].mean()
below_mean = df_diamonds[df_diamonds["carat"] < mean_carat].shape[0]
print(f"Mean carat: {mean_carat:.3f}")
print(f"Diamonds with carat < mean: {below_mean}")

### 4. How many diamonds have a Premium cut or better?

In [None]:
premium_or_better = df_diamonds[df_diamonds["cut"].isin(["Premium", "Ideal"])]
print(f"Diamonds with Premium cut or better: {premium_or_better.shape[0]}")
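The `isin` filter above relies on knowing that Ideal is the only grade above Premium. An alternative sketch, assuming the quality ordering Fair < Good < Very Good < Premium < Ideal, encodes that order in an ordered categorical so that "Premium or better" becomes a direct comparison. The `toy_cut` values are made up for illustration.

```python
import pandas as pd

# Assumed quality order, worst to best
cut_order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]

# Made-up cut labels for illustration
toy_cut = pd.Series(["Ideal", "Good", "Premium", "Fair", "Very Good", "Ideal"])

# Ordered categoricals support <, <=, >, >= against a category label
ordered_cut = pd.Series(pd.Categorical(toy_cut, categories=cut_order, ordered=True))
premium_or_better_count = int((ordered_cut >= "Premium").sum())

print(f"Premium or better: {premium_or_better_count}")
```

An ordered categorical also makes pandas and Seaborn plots follow the quality order instead of alphabetical or frequency order.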

### 5. Which diamond has the highest price per carat? What is its value?

In [None]:
df_diamonds["price_per_carat"] = df_diamonds["price"] / df_diamonds["carat"]
idx_max = df_diamonds["price_per_carat"].idxmax()
max_row = df_diamonds.loc[idx_max]
print("Diamond with highest price per carat:")
print(max_row)
print(f"Highest price per carat: {max_row['price_per_carat']:.2f}")

### 6. Boxplots of diamond price for each cut (Matplotlib and Seaborn)

In [None]:
# boxplot by cut with matplotlib
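# pass the axes explicitly; otherwise DataFrame.boxplot opens its own figure and leaves an empty one behind
fig, ax = plt.subplots(figsize=(8, 5))
df_diamonds.boxplot(column="price", by="cut", ax=ax)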
plt.title("Diamond Price by Cut (Matplotlib)")
plt.suptitle("")
plt.xlabel("Cut")
plt.ylabel("Price")
plt.show()

# boxplot by cut using seaborn.
plt.figure(figsize=(8, 5))
sns.boxplot(x="cut", y="price", data=df_diamonds, order=["Fair", "Good", "Very Good", "Premium", "Ideal"])
plt.title("Diamond Price by Cut (Seaborn)")
plt.xlabel("Cut")
plt.ylabel("Price")
plt.show()

Boxplots show that Ideal and Premium cuts have a wide range of prices, but the median price is not always highest for the best cut. Outliers occur in all categories and price variation is significant within each cut.

### 7. Scatter plot of price vs. carat

In [None]:
# scatter plot price vs. carat with matplotlib
plt.figure(figsize=(7, 5))
plt.scatter(df_diamonds["carat"], df_diamonds["price"], alpha=0.3, s=10)
plt.title("Price vs. Carat (Matplotlib)")
plt.xlabel("Carat")
plt.ylabel("Price")
plt.show()

# scatter plot price vs. carat with seaborn
plt.figure(figsize=(7, 5))
sns.scatterplot(x="carat", y="price", data=df_diamonds, alpha=0.3, s=10)
plt.title("Price vs. Carat (Seaborn)")
plt.xlabel("Carat")
plt.ylabel("Price")
plt.show()

There is a strong positive relationship between carat and price, but the relationship is nonlinear. Price increases rapidly for larger carats, and there is a large spread at each carat value.
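One hedged way to quantify this kind of nonlinearity is to fit a straight line in log-log space: if price behaves roughly like a power law in carat, the slope estimates the exponent, and a slope above 1 means price grows faster than proportionally with carat. The sketch below uses synthetic data generated with an arbitrary exponent (`true_exponent`, the scale factor, and the noise level are all assumptions, not values measured from the diamonds dataset).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: price = scale * carat**exponent with multiplicative noise.
# All constants here are arbitrary and chosen only for illustration.
true_exponent = 1.7
carat = rng.uniform(0.2, 3.0, size=500)
price = 4000 * carat**true_exponent * np.exp(rng.normal(0, 0.1, size=500))

# Linear fit of log(price) on log(carat); the slope estimates the exponent
slope, intercept = np.polyfit(np.log(carat), np.log(price), deg=1)

print(f"Estimated exponent: {slope:.2f}")
```

On the real data, the same fit would use `np.log(df_diamonds["carat"])` and `np.log(df_diamonds["price"])`.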

## Conclusion

In this project, we explored the diamonds dataset by summarizing the data, creating visualizations, and answering specific questions about the diamonds. Most diamonds in the dataset are small and relatively inexpensive. Price rises with carat, but nonlinearly: larger diamonds can be disproportionately more expensive. The cut, color, and clarity of a diamond also relate to its price, but having the best cut doesn't always mean the highest price. We also noticed some odd data points, such as diamonds with a value of zero for some dimensions, which likely reflect errors made when the data was collected. Overall, the dataset shows how carat, cut, color, and clarity relate to diamond prices.
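The zero-valued dimensions mentioned above can be counted with a row-wise check across `x`, `y`, and `z`. This is a minimal sketch on a made-up frame (the `toy` values are illustrative); the same mask expression should apply to `df_diamonds`.

```python
import pandas as pd

# Made-up dimensions; two rows contain at least one zero.
toy = pd.DataFrame(
    {
        "x": [3.95, 0.0, 4.05],
        "y": [3.98, 4.10, 0.0],
        "z": [2.43, 2.50, 0.0],
    }
)

# True for rows where any of x, y, z is exactly zero
zero_dim_mask = (toy[["x", "y", "z"]] == 0).any(axis=1)
zero_dim_count = int(zero_dim_mask.sum())

print(f"Rows with a zero dimension: {zero_dim_count}")
```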

## References

- [ggplot2 diamonds documentation](https://ggplot2.tidyverse.org/reference/diamonds.html) - for attribute definitions, expected value ranges, and dataset context.
- [GIA 4Cs of Diamond Quality](https://www.gia.edu/diamond-quality-factor) - for understanding carat, cut, color, and clarity grading standards and their impact on value.
