Exploratory analysis:
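# Imports used throughout (diamonds_df is assumed to be loaded already)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Plot styling
sns.set_style("whitegrid")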


# Relationship between Carat (Weight) and Price

# Let's explore how a diamond's weight (carat) influences its price.
plt.figure(figsize=(10, 6))
sns.scatterplot(x='carat', y='price', data=diamonds_df, alpha=0.1)
plt.title('Price vs. Carat')
plt.xlabel('Carat')
plt.ylabel('Price')
plt.show()


# Observation 1: Effect of Cut Quality on Price

# Now, let’s examine how the quality of the cut impacts the price.
# We’ll use a boxplot to compare price distributions across different cut grades.
plt.figure(figsize=(10, 6))
sns.boxplot(
    x='cut',
    y='price',
    data=diamonds_df,
    order=['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']  # Order by increasing quality
)
plt.title('Price Distribution by Cut Quality')
plt.xlabel('Cut Quality')
plt.ylabel('Price')
plt.show()


# Observation 2: Correlation Between Numeric Features

# Next, let’s look at how all the numeric attributes relate to each other.
# We’ll calculate and visualize the correlation matrix using a heatmap.
plt.figure(figsize=(10, 7))
corr_matrix = diamonds_df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numeric Features')
plt.show()


# Observation 3: Takeaways from the Correlation Heatmap

# The price of a diamond is very strongly linked to its carat weight (correlation of 0.92)
# and its physical dimensions (x, y, z). Carat and the dimensions are also almost perfectly
# correlated with each other, which makes sense, since heavier diamonds tend to be larger.
# In contrast, features like depth and table have only a weak connection to price.
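
The near-perfect correlation between carat and the dimensions follows from geometry: weight scales with volume, roughly x * y * z. The sketch below illustrates this on synthetic stones, not on the diamonds data. The size range, the proportions, and the factor 0.0061 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stones = 1000

# Synthetic dimensions in mm (assumed ranges, for illustration only)
length = rng.uniform(4.0, 9.0, n_stones)
width = length * rng.normal(1.0, 0.01, n_stones)            # nearly round outline
depth_mm = length * 0.62 * rng.normal(1.0, 0.02, n_stones)  # depth about 62% of length

# Weight proportional to volume; 0.0061 is an illustrative factor
carat_demo = 0.0061 * length * width * depth_mm

corr_carat_length = np.corrcoef(carat_demo, length)[0, 1]
corr_carat_depth = np.corrcoef(carat_demo, depth_mm)[0, 1]
print(f"corr(carat, length) = {corr_carat_length:.3f}")
print(f"corr(carat, depth)  = {corr_carat_depth:.3f}")
```

Even though weight grows with the cube of size, the linear correlation stays high over a realistic size range, which is why carat, x, y, and z carry largely overlapping information.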


# Step 2: Create a Sample Dataset for Modeling

# Finally, we’ll create a smaller random sample from the original dataset
# to make modeling and experimentation faster. Using a random_state ensures reproducibility.
diamonds_model = diamonds_df.sample(n=12500, random_state=42)
print(f"\nCreated 'diamonds_model' sample with shape: {diamonds_model.shape}")
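
As a small illustration of what random_state buys, the sketch below draws samples from a toy frame (hypothetical data, not the diamonds set) and checks that the same seed returns the same rows.

```python
import pandas as pd

toy_df = pd.DataFrame({"value": range(100)})

# Same seed: identical rows in identical order
sample_a = toy_df.sample(n=10, random_state=42)
sample_b = toy_df.sample(n=10, random_state=42)

# Different seed: a different draw is expected
sample_c = toy_df.sample(n=10, random_state=7)

same_seed_identical = sample_a.equals(sample_b)
different_seed_identical = sample_a.equals(sample_c)
print("Same seed gives identical sample:", same_seed_identical)
print("Different seed gives identical sample:", different_seed_identical)
```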



Linear regression:


# Step 1: Define the Proper Order of Categorical Features

# We’ll specify the natural order of quality for each categorical variable.
# This ensures our encoding reflects the true ranking of quality levels.
cut_order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']


# Step 2: Create a Copy for Modeling

# To keep our original sample intact, we’ll work on a copy.
df_model1 = diamonds_model.copy()


# Step 3: Encode Categorical Features into Numeric Form
# Since machine learning models require numeric input, we’ll use an OrdinalEncoder
# that assigns integer values based on the defined order above.
encoder = OrdinalEncoder(categories=[cut_order, color_order, clarity_order], dtype=int)
df_model1[['cut', 'color', 'clarity']] = encoder.fit_transform(df_model1[['cut', 'color', 'clarity']])

print("\nData after Ordinal Encoding (first 5 rows):")
print(df_model1.head())
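
To see why the explicit category order matters, here is a minimal sketch on a four-row toy column (the frame and variable names are hypothetical). With the order supplied, the codes follow quality; without it, OrdinalEncoder sorts the labels alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

cut_levels = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
toy = pd.DataFrame({'cut': ['Ideal', 'Fair', 'Premium', 'Good']})

# Explicit order: codes reflect the quality ranking
ordered_encoder = OrdinalEncoder(categories=[cut_levels], dtype=int)
ordered_codes = ordered_encoder.fit_transform(toy[['cut']]).ravel().tolist()
print(ordered_codes)  # [4, 0, 3, 1]

# Default order: labels present in the data, sorted alphabetically
default_encoder = OrdinalEncoder(dtype=int)
default_codes = default_encoder.fit_transform(toy[['cut']]).ravel().tolist()
print(default_codes)  # [2, 0, 3, 1]
```

In the default encoding, 'Ideal' lands below 'Premium', which would misrepresent the ranking to a linear model.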


# Step 4: Define Features (X) and Target (y)

# Our goal is to predict diamond price, so it will serve as the target variable (y).
# The remaining columns are our predictors (X).
X = df_model1.drop('price', axis=1)
y = df_model1['price']


# Step 5: Split the Data into Training and Test Sets

# We’ll use an 80/20 split to train the model on most of the data
# and evaluate its performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


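# Step 6: Standardize the Features
# Scaling puts all features on a comparable scale (mean = 0, std = 1). Plain linear
# regression predictions do not depend on it, but it makes the coefficients comparable
# and matters for the regularized models (Lasso, Ridge) trained later on the same data.
# The scaler is fitted on the training set only and then applied to the test set,
# so no information from the test set leaks into preprocessing.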
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
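
The fit-on-train, transform-on-test pattern can be checked on synthetic data. In this sketch (made-up columns on very different scales), the training columns come out with mean 0 and std 1 exactly, while the test columns are only close to that, because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic feature matrix: two columns on very different scales
demo_train = rng.normal(loc=[1.0, 4000.0], scale=[0.5, 1500.0], size=(800, 2))
demo_test = rng.normal(loc=[1.0, 4000.0], scale=[0.5, 1500.0], size=(200, 2))

demo_scaler = StandardScaler()
demo_train_scaled = demo_scaler.fit_transform(demo_train)  # learns mean/std from train
demo_test_scaled = demo_scaler.transform(demo_test)        # reuses the train statistics

print("train mean/std:", demo_train_scaled.mean(axis=0).round(3), demo_train_scaled.std(axis=0).round(3))
print("test mean/std: ", demo_test_scaled.mean(axis=0).round(3), demo_test_scaled.std(axis=0).round(3))
```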


# Step 7: Train a Linear Regression Model
# We’ll start with a simple Linear Regression model to understand
# how well the numeric and encoded features can explain diamond prices.
model_lr = LinearRegression()
model_lr.fit(X_train_scaled, y_train)


# Step 8: Evaluate Model Performance
# Using the trained model, we’ll make predictions on the test set
# and calculate two key metrics:
# - R²: the share of price variance the model explains (1.0 is a perfect fit)
# - RMSE: the typical size of a prediction error, in the same units as price;
#   squaring penalizes large errors more heavily than small ones
y_pred_lr = model_lr.predict(X_test_scaled)
r2_lr = r2_score(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))

print("\n--- Model 1: Full Linear Regression Results ---")
print(f"R-squared (R²): {r2_lr:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_lr:.2f}")
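
For reference, both metrics can be computed by hand. The sketch below uses four made-up values (not model output) and applies the definitions R² = 1 - SS_res / SS_tot and RMSE = sqrt(mean squared error).

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 2.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0

r2_manual = 1.0 - ss_res / ss_tot                      # 0.9
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))  # sqrt(0.5), about 0.7071

print(f"R²: {r2_manual:.4f}")
print(f"RMSE: {rmse_manual:.4f}")
```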


# Step 9: Store Results for Future Comparison
# We’ll save the model’s performance scores to compare with future models.
model_scores = {
    'Full Linear Regression': {'R²': r2_lr, 'RMSE': rmse_lr}
}


PCA, Lasso and Ridge:


# Step 1: Select Only Continuous (Numeric) Features
# For PCA, we’ll focus on the continuous numerical features that describe
# a diamond’s physical properties — these are the ones most suitable for
# dimensionality reduction.
continuous_features = ['carat', 'depth', 'table', 'x', 'y', 'z']
y_pca = diamonds_model['price']  # The target remains the same
X_continuous = diamonds_model[continuous_features]


# Step 2: Standardize the Data
# PCA is sensitive to feature scale, so we standardize first; otherwise the features
# with the largest numeric ranges would dominate the components.
scaler_pca = StandardScaler()
X_continuous_scaled = scaler_pca.fit_transform(X_continuous)


# Step 3: Apply PCA (Principal Component Analysis)
# We’ll reduce the six features to 2 principal components: the two orthogonal
# directions along which the standardized data varies the most.
# Note: the scaler and PCA are fitted on the whole sample here, before the train/test
# split in Step 4, so the test rows influence the components. Fitting both on the
# training split only would give a stricter evaluation.
pca = PCA(n_components=2)
X_pca_components = pca.fit_transform(X_continuous_scaled)

print(f"\nExplained variance by 2 PCA components: {pca.explained_variance_ratio_.sum():.4f}")
# This value shows how much of the total variance (information) 
# the first two components managed to preserve.
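
The explained variance ratio is each covariance eigenvalue divided by the sum of all eigenvalues. The sketch below reproduces that calculation with NumPy on synthetic data shaped loosely like this problem: four features driven by one shared factor plus two unrelated ones. The data and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 500
size_factor = rng.normal(size=n_rows)

# Four features driven by one shared factor, plus two unrelated features
correlated = size_factor[:, None] + 0.1 * rng.normal(size=(n_rows, 4))
unrelated = rng.normal(size=(n_rows, 2))
features = np.hstack([correlated, unrelated])

# Standardize, then take the eigenvalues of the covariance matrix (largest first)
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(standardized, rowvar=False))[::-1]

explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio.round(3))
print(f"First two components: {explained_ratio[:2].sum():.3f}")
```

When several features are nearly redundant, a single component absorbs most of their variance, so two components can retain a large share of the total.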


# Step 4: Train/Test Split for the PCA Components
# We’ll split the reduced data into training and testing sets, just like before.
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(
    X_pca_components, y_pca, test_size=0.2, random_state=42
)


# Step 5: Train a Linear Regression Model on PCA Data
# Now, we train a simple Linear Regression model using the 2 PCA components.
# This helps us see how well a lower-dimensional representation performs.
model_pca_lr = LinearRegression()
model_pca_lr.fit(X_train_pca, y_train_pca)


# Step 6: Evaluate PCA Model Performance
y_pred_pca = model_pca_lr.predict(X_test_pca)
r2_pca = r2_score(y_test_pca, y_pred_pca)
rmse_pca = np.sqrt(mean_squared_error(y_test_pca, y_pred_pca))

print("\n--- Model 2: PCA (2 Components) Linear Regression Results ---")
print(f"R-squared (R²): {r2_pca:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_pca:.2f}")

# Save results for later comparison
model_scores['PCA Regression'] = {'R²': r2_pca, 'RMSE': rmse_pca}
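
One way to keep scaling and PCA strictly inside the training data is a scikit-learn pipeline, which fits every step on the training split only. This is a minimal sketch on synthetic data, not the workflow used above. The names X_demo, y_demo, and pca_pipeline are hypothetical, and the score it prints says nothing about the diamonds data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 6))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=1000)

# Split first, so the test rows never reach the scaler or the PCA fit
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pca_pipeline.fit(X_tr, y_tr)          # scaler and PCA see the training rows only
r2_demo = pca_pipeline.score(X_te, y_te)
print(f"Test R² with 2 components: {r2_demo:.4f}")
```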

# Step 7: Lasso Regression (L1 Regularization)
# Lasso adds an L1 penalty that can shrink the coefficients of less important
# features exactly to zero, effectively performing feature selection while
# controlling overfitting.
model_lasso = Lasso(alpha=1.0, random_state=42)  # alpha controls the penalty strength
model_lasso.fit(X_train_scaled, y_train)

# Evaluate the Lasso model
y_pred_lasso = model_lasso.predict(X_test_scaled)
r2_lasso = r2_score(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))

print("\n--- Model 3: Lasso Regression Results ---")
print(f"R-squared (R²): {r2_lasso:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_lasso:.2f}")

model_scores['Lasso Regression'] = {'R²': r2_lasso, 'RMSE': rmse_lasso}



# Step 8: Ridge Regression (L2 Regularization)
# Ridge adds an L2 penalty: it shrinks coefficients toward zero without setting
# them exactly to zero, which helps stabilize the estimates when features are
# highly correlated (as carat, x, y, and z are here).
model_ridge = Ridge(alpha=1.0, random_state=42)
model_ridge.fit(X_train_scaled, y_train)

# Evaluate the Ridge model
y_pred_ridge = model_ridge.predict(X_test_scaled)
r2_ridge = r2_score(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))

print("\n--- Model 4: Ridge Regression Results ---")
print(f"R-squared (R²): {r2_ridge:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse_ridge:.2f}")

model_scores['Ridge Regression'] = {'R²': r2_ridge, 'RMSE': rmse_ridge}
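
The difference between the two penalties is easiest to see on data where some features are pure noise. The sketch below uses synthetic data (an assumption, not the diamonds sample) in which only the first two of five features drive the target. Lasso zeroes out the noise features, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two features drive the target; the other three are noise
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

print("Lasso coefficients:", lasso_demo.coef_.round(3))
print("Ridge coefficients:", ridge_demo.coef_.round(3))
print("Coefficients set to zero by Lasso:", int(np.sum(lasso_demo.coef_ == 0)))
```

The alpha value here is deliberately large to make the effect visible. In practice it would be tuned, for example by cross-validation.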



Comparison:

# Put the scores into a nice DataFrame for comparison
scores_df = pd.DataFrame(model_scores).T
scores_df = scores_df.sort_values(by='R²', ascending=False)

print("\n--- Final Model Comparison ---")
print(scores_df)

Which model had the highest accuracy?

The three models trained on the full feature set (Linear, Ridge, and Lasso) achieved the best accuracy overall. Among them, Ridge Regression performed marginally better, reaching an R² of 0.9205. The standard Linear Regression and Lasso Regression models were almost identical to each other in performance.
In contrast, the PCA Regression model performed noticeably worse, with an R² of just 0.8526, a clear drop in accuracy.

Why did this happen?

Full Model (Winner):
The full models (Linear, Ridge, and Lasso) performed best because they had access to all the relevant features of the dataset. These models used carat, depth, table, all three physical dimensions (x, y, z), and the quality attributes: cut, color, and clarity.
Having this complete set of information gave the models a much fuller picture of what drives diamond prices. The small edge for Ridge is consistent with regularization stabilizing the coefficients of the highly correlated features (carat, x, y, z). The gap is small enough, however, that it could also be noise from a single train/test split, so it is only weak evidence that the plain Linear model was overfitting.
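
Whether a gap this small reflects a real effect can be probed by comparing scores across several splits instead of one. Below is a hedged sketch using cross_val_score on synthetic data with nearly collinear features. The data and the names X_demo, y_demo, and cv_results are assumptions for illustration, not results from the diamonds sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 1000
size_factor = rng.normal(size=n_rows)

# Three nearly collinear features plus one independent feature (synthetic)
X_demo = np.column_stack([
    size_factor + 0.05 * rng.normal(size=n_rows),
    size_factor + 0.05 * rng.normal(size=n_rows),
    size_factor + 0.05 * rng.normal(size=n_rows),
    rng.normal(size=n_rows),
])
y_demo = 4.0 * size_factor + X_demo[:, 3] + rng.normal(scale=1.0, size=n_rows)

cv_results = {}
for name, estimator in [("Linear", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    # Scaling lives inside the pipeline, so each fold is scaled on its own training part
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
    cv_results[name] = scores
    print(f"{name}: mean R² = {scores.mean():.4f}, std across folds = {scores.std():.4f}")
```

If the difference between the mean scores is smaller than the spread across folds, the two models are not distinguishable on this evidence.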

PCA Model (Loser):
The PCA-based model struggled because we intentionally limited its knowledge. It was trained on only two principal components derived from the six continuous features, and it never saw the categorical features cut, color, and clarity, which are critical factors in determining price.
In other words, the PCA model was trying to predict diamond prices with much of the key information missing, which naturally led to weaker performance.