In [None]:
import pandas as pd

# Load the dataset
file_path = '/kaggle/input/the-largest-diamond-dataset-currely-on-kaggle/diamonds.csv'
diamonds_df = pd.read_csv(file_path)

# Display the first few rows of the dataset for an overview
diamonds_df.head()


In [None]:
diamonds_df.info()

In [None]:
# Initial Data Analysis: Descriptive statistics and distributions

# Descriptive Statistics
descriptive_stats = diamonds_df.describe()

# Distribution of Categorical Features
categorical_features = diamonds_df.select_dtypes(include=['object']).columns
categorical_distribution = diamonds_df[categorical_features].describe()

descriptive_stats, categorical_distribution


Basic Introspection of the Dataset:

Cut:

Represents the diamond's shape (for example Round, Oval, or Emerald); the quality of the cut is recorded separately under Cut Quality. Includes standard cuts and the 'Cushion Modified' cut.

Color:

The grading scale runs from D (colorless) to Z (yellowish); this dataset contains grades D through M plus a 'fancy' category. Color is a subtle feature but important in valuation.

Clarity:

Indicates the presence of inclusions and blemishes. Clarity is a key factor in evaluating diamond quality.

Carat Weight:

Refers to the diamond's mass. Larger carat weight can significantly increase a diamond's value.

Cut Quality:

Based on the GIA Cut Grading System. A crucial factor in determining a diamond's brilliance and value.

Lab:

Indicates the certification lab (GIA, IGI, HRD). Certification authenticity affects value.

Polish and Symmetry:

Reflects the finishing touches on the diamond, affecting its sparkle and overall appearance.

Eye-Clean:

Describes whether inclusions are visible to the naked eye. Affects the perceived quality of the diamond.

Culet Size and Condition:

Pertains to the bottom point of the diamond. Ideal culets maximize light reflection; chipping affects value.

Fancy Color Attributes:

Concerns colored diamonds, their hues, and intensities. Colored diamonds have gained popularity and value.

Fluorescence:

Refers to how diamonds react to UV light. Affects appearance and sometimes value.

Dimensions and Proportions:

Includes depth percentage, table percentage, and absolute measurements. These factors influence a diamond's light reflection and overall aesthetics.

Girdle Thickness:

Impacts how a diamond is set and its overall profile. Varies from extremely thin to extremely thick.

Total Sales Price:

The final price in dollars. This is likely the target variable for predictive modeling.

Initial Data Analysis
Let's conduct an initial analysis to understand the distribution and basic statistics of these features. This analysis will provide insights into the characteristics of the dataset, which is essential before proceeding to anomaly detection.

Initial Data Analysis Results:


Descriptive Statistics of Numerical Features:

Carat Weight: Ranges from 0.08 to 19.35, with a median of 0.50, indicating a wide variety of diamond sizes.

Depth Percent: Varies greatly from 0 to 98.7, which might indicate potential outliers or data entry errors.

Table Percent: Also shows a wide range from 0 to 94, suggesting the need for further investigation into potential anomalies.

Measurements (Length, Width, Depth): Have a broad range, reflecting the diversity in diamond sizes and shapes.

Total Sales Price: Ranges from 200 to over 1.44 million, indicating a diverse dataset in terms of value.

Distribution of Categorical Features:

Cut: 11 unique types, with 'Round' being the most common.

Color: 11 grades, 'E' being the most frequent.

Clarity: 11 categories, with 'SI1' as the most common.

Cut Quality: 6 levels, 'Excellent' being the most frequent.

Lab: 3 main labs, with GIA being the predominant one.

Symmetry and Polish: Various grades, mostly 'Excellent'.

Eye Clean: 5 grades, but 'unknown' is the most frequent.

Culet Size and Condition: Various sizes and conditions, with 'None' and 'unknown' being the most common respectively.

Fancy Color Attributes: A wide range of colors and intensities, but 'unknown' dominates these columns.

Fluorescence: Different levels of intensity, with 'None' being the most common.


Observations:
The dataset is rich with diverse characteristics of diamonds, reflecting real-world variability.
Some columns have a significant number of 'unknown' entries, particularly in the fancy color and eye-clean categories.
The wide ranges in certain numerical features suggest the presence of potential outliers.

Next Steps:
Proceeding with anomaly detection is advisable to identify and address outliers, especially in numerical columns like depth percent, table percent, and measurements.
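
A depth or table percentage of 0 cannot describe a real cut stone, so zeros in these columns more plausibly encode missing values than genuine measurements. A minimal sketch of counting them follows; it uses a small made-up frame in place of `diamonds_df`, and the values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for diamonds_df; the values are illustrative only.
sample_df = pd.DataFrame({
    "depth_percent": [61.5, 0.0, 62.3, 98.7, 0.0],
    "table_percent": [57.0, 58.0, 0.0, 94.0, 56.0],
    "meas_length":   [5.10, 4.35, 0.0, 6.40, 5.72],
})

suspect_columns = ["depth_percent", "table_percent", "meas_length"]

# Count the rows where each column is exactly zero.
zero_counts = (sample_df[suspect_columns] == 0).sum()

# Express the counts as a percentage of all rows.
zero_share = (zero_counts / len(sample_df) * 100).round(1)

print(pd.DataFrame({"zeros": zero_counts, "percent": zero_share}))
```

Running the same two lines on `diamonds_df` would show how many rows are affected before deciding whether to treat the zeros as missing.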

Outlier Analysis

Data Skewness: The dataset is skewed with a mix of many small diamonds and a few large, expensive ones.

Melee Diamonds: These are diamonds ≤ 0.2 carats, often used for their reflective properties rather than size. They make up a small portion of the dataset.

Minimal Impact of Very Small Diamonds: Removing diamonds smaller than a certain threshold (e.g., ≤ 0.15 carats) only eliminates a small number of rows, indicating that they do not significantly skew the data.

Upper-End Analysis: The presence of very large diamonds (e.g., 18 carats) might seem unusual, but there are reasons to keep them in the analysis:

* The data collection is unlikely to be erroneous.
* It's uncertain how these outliers will affect model results, so removing them upfront would be premature.
* Some models are more robust to outliers, and data transformation can mitigate outlier effects.
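
The claim that very small diamonds make up only a small share of rows can be checked directly. The sketch below assumes a hypothetical helper, `share_at_or_below`, and made-up carat weights; on the real data the same function would be applied to `diamonds_df['carat_weight']`.

```python
import pandas as pd

# Hypothetical carat weights; illustrative only, not taken from the dataset.
carat_weight = pd.Series([0.08, 0.15, 0.18, 0.30, 0.50, 0.50, 0.71, 1.01, 2.00, 19.35])

def share_at_or_below(weights: pd.Series, threshold: float) -> float:
    """Return the percentage of rows whose weight is <= threshold."""
    return float((weights <= threshold).mean() * 100)

# Compare the candidate cut-off (0.15 ct) with the melee definition (0.20 ct).
for threshold in (0.15, 0.20):
    share = share_at_or_below(carat_weight, threshold)
    print(f"<= {threshold:.2f} ct: {share:.1f}% of rows")
```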

Outlier Analysis Approach:

Understanding Skewness:
We'll look at the skewness of the numerical features, especially those related to size and price, to understand the distribution better.

Boxplot Analysis:
Generate boxplots for key numerical features to visually inspect for outliers, especially at the upper end of the distributions.

Handling Outliers:
Because the dataset is unlikely to contain collection errors, we should be cautious about removing outliers outright. Instead, we might consider data transformation methods if necessary. We'll also look at the smaller diamonds and the rationale for not removing the lower-end outliers.

Focus on All Columns:
We'll conduct this analysis across all relevant numerical columns to ensure a comprehensive understanding of the dataset's characteristics.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Skewness is only defined for numeric data, so restrict to numeric columns first
numeric_columns = diamonds_df.select_dtypes(include=['float64', 'int64']).columns
numeric_skewness = diamonds_df[numeric_columns].skew()

# Boxplots for key numerical features
plt.figure(figsize=(20, 10))

# List of key numerical features for boxplots
key_features = ['carat_weight', 'depth_percent', 'table_percent', 'meas_length', 'meas_width', 'meas_depth', 'total_sales_price']

for i, feature in enumerate(key_features):
    plt.subplot(2, 4, i+1)
    sns.boxplot(y=diamonds_df[feature])
    plt.title(f'Boxplot of {feature}')

plt.tight_layout()
plt.show()

numeric_skewness[key_features]


Outlier Analysis Results:


Skewness of Key Numerical Features:

* Carat Weight: Highly positively skewed (Skewness = 6.04), indicating a concentration of smaller diamonds and a few larger ones.
* Depth Percent: Negatively skewed (Skewness = -5.13). This suggests some irregularities, possibly due to data entry errors or unique diamond cuts.
* Table Percent: Also negatively skewed (Skewness = -4.54), indicating a similar pattern of potential irregularities.
* Measurements (Length, Width, Depth): All are positively skewed, especially depth (Skewness = 24.15), indicating a concentration of smaller-sized diamonds with a few larger ones.
* Total Sales Price: Highly positively skewed (Skewness = 19.41), reflecting the presence of a few very high-priced diamonds among mostly lower-priced ones.

Boxplot Observations:

The boxplots reveal a number of outliers, particularly on the higher end of the scale for carat weight and total sales price.

Depth percent and table percent show a spread of data points that might not necessarily be outliers but could indicate a variety of diamond shapes and cuts.

Interpretation and Next Steps:

The dataset's skewness corroborates the observation of a mixture of many small and a few large, expensive diamonds.

Given the skewness and the boxplot observations, it seems prudent to retain these outliers as they represent legitimate variations in diamond characteristics and prices, and are not due to data collection errors.

Transforming the data, especially for features like carat weight and total sales price, may help in modeling these skewed distributions more effectively.

Since the data is "well-curated" and the outliers are likely representative of real-world scenarios (like large diamonds), preserving the original data distribution might be more valuable for understanding and predicting real-world outcomes.
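
One transformation that keeps every row while compressing a long right tail is the logarithm. The sketch below is a minimal illustration on synthetic log-normal prices (an assumption made for the example, not a property of the dataset) and compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed prices drawn from a log-normal distribution.
prices = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5_000))

# log1p computes log(1 + x), which is also safe for values equal to zero.
log_prices = np.log1p(prices)

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# expm1 inverts log1p, so predictions can be mapped back to dollars.
recovered = np.expm1(log_prices)
```

If a model is trained on the transformed target, its predictions need the inverse transform before they are compared with prices in dollars.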

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Extracting categorical and numerical columns from the dataset
categorical_columns = [feature for feature in diamonds_df.columns if diamonds_df[feature].dtypes == 'O']
numerical_columns = [feature for feature in diamonds_df.columns if diamonds_df[feature].dtypes != 'O']
numerical_columns.remove('Unnamed: 0')  # Removing the 'Unnamed: 0' column as it's just an index

# List of features for analysis
features = numerical_columns + categorical_columns
target = ['total_sales_price']

# EDA: Pairplot for numerical features with hue based on 'cut_quality'
sns.pairplot(diamonds_df[numerical_columns + ['cut_quality']], hue="cut_quality")
plt.show()

# EDA: Bar plots for categorical features against total sales price
for cat in categorical_columns:
    plt.figure(figsize=(10, 6))
    sns.barplot(x='total_sales_price', y=cat, data=diamonds_df)
    plt.xticks(rotation=75)
    plt.title("Total Sales Price vs " + cat)
    plt.show()

# EDA: Scatter plots for numerical features against total sales price with hue based on 'cut_quality'
for num in numerical_columns:
    if num != 'total_sales_price':  # Avoiding plotting the target against itself
        sns.relplot(x='total_sales_price', y=num, hue='cut_quality', data=diamonds_df)
        plt.xticks(rotation=75)
        plt.title("Total Sales Price vs " + num)
        plt.show()

# EDA: Distribution plots for numerical features
for num in numerical_columns:
    sns.kdeplot(diamonds_df[num])
    plt.xticks(rotation=75)
    plt.title("Distribution of " + num)
    plt.show()

# Distribution of the target variable
sns.kdeplot(diamonds_df['total_sales_price'], gridsize=100)
plt.title('Distribution of Total Sales Price')
plt.show()




EDA - Cut Analysis
Let's start with an analysis of the 'cut' feature. We'll create a count plot to visualize the distribution of different diamond cuts and then calculate the percentage of each cut type in the dataset.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Count plot for the distribution of diamond cuts
plt.figure(figsize=(12, 6))
sns.set_palette(palette="ch:s=.25,rot=-.25", n_colors=14)
sns.countplot(x='cut', data=diamonds_df, order=diamonds_df['cut'].value_counts().index)
plt.title('Distribution of Diamond Cuts')
plt.xticks(rotation=45)
plt.show()

# Calculate the percentage of each cut type
cut_counts = diamonds_df['cut'].value_counts()
total_diamonds = len(diamonds_df)
cut_percentages = (cut_counts / total_diamonds) * 100

cut_percentages


Cut Analysis Results:

Distribution of Diamond Cuts:

Round Cut: The most prevalent, comprising 72.06% of all diamonds in the dataset.
Oval Cut: The second most common, accounting for 6.31%.
Other cuts like Emerald, Pear, and Princess follow, each representing smaller percentages of the total.

Insights:
The dominance of the round cut aligns with market trends, where round diamonds are often preferred for their brilliance and traditional appeal.
The diversity in cuts reflects a range of preferences and uses in the diamond market.

Color Analysis 

In [None]:
# Count plot for the distribution of diamond colors
plt.figure(figsize=(12, 6))
sns.set_palette(palette="light:#edf5dc", n_colors=11)
colors_order = ['D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'fancy']
sns.countplot(x='color', data=diamonds_df, order=colors_order)
plt.title('Distribution of Diamond Colors')
plt.show()


The count plot illustrates the distribution of diamond colors in the dataset. The colors range from D (colorless) to M (yellowish) and include fancy colored diamonds.

Observations:
The plot reveals the prevalence of certain color grades over others, but the specific counts and percentages are not shown in this visualization.
This distribution provides insights into the popularity and availability of different color grades in the diamond market.
Next Step:
We can proceed with the analysis of the 'clarity' feature, which will involve examining the distribution of clarity grades among the diamonds in the dataset. 
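
Since the count plot does not print the numbers behind the bars, a small table of counts and percentages in grade order can complement it. This is a sketch on a made-up sample of the color column; `colors_order` matches the list used for the plot.

```python
import pandas as pd

# Hypothetical sample of the 'color' column; illustrative only.
colors = pd.Series(["E", "D", "E", "F", "G", "E", "fancy", "D", "H", "F"])

colors_order = ['D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'fancy']

# Count each grade, then put the rows in grading order (absent grades get 0).
color_counts = colors.value_counts().reindex(colors_order, fill_value=0)
color_percent = (color_counts / len(colors) * 100).round(2)

color_summary = pd.DataFrame({"count": color_counts, "percent": color_percent})
print(color_summary)
```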

Clarity Analysis 

In [None]:
# Count plot for the distribution of diamond clarity
plt.figure(figsize=(12, 6))
sns.set_palette(palette="ch:s=.55,rot=-.75", n_colors=14)
clarities_order = ['FL', 'IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'SI3', 'I1', 'I2', 'I3']
sns.countplot(x='clarity', data=diamonds_df, order=clarities_order)
plt.title('Distribution of Diamond Clarity')
plt.show()


The count plot presents the distribution of clarity grades among the diamonds in the dataset. The grades range from FL (Flawless) to I3 (Included), providing a comprehensive view of clarity variations.

Observations:
Similar to the color analysis, this plot shows the prevalence of certain clarity grades over others, offering insights into the quality distribution of the diamonds in the market.

Carat Weight Analysis

In [None]:
# Scatter plot to show the relationship between carat weight and total sales price
plt.figure(figsize=(15, 9))
sns.scatterplot(data=diamonds_df, x="carat_weight", y="total_sales_price")
plt.title('Carat Weight vs Total Sales Price')
plt.show()

# Investigating common carat sizes and price breakpoints
# Identifying gaps around $1000 in total sales price
gap_analysis = diamonds_df.loc[diamonds_df['total_sales_price'].between(500, 1700)]
plt.figure(figsize=(15, 9))
sns.countplot(data=gap_analysis, x='total_sales_price', color='green')
plt.title('Distribution of Total Sales Price (Between $500 and $1700)')
plt.xticks(rotation=90)
plt.show()


Carat Weight vs Total Sales Price:
The scatter plot illustrates the relationship between carat weight and total sales price.
There are several vertical lines in the plot, indicating common carat sizes. These may reflect psychological weight thresholds in the diamond market, where stones are cut to reach popular sizes.
The plot also shows a general trend of increasing price with increasing carat weight, though with considerable variability.
Price Breakpoints Analysis:
The count plot for the total sales price (ranging from 500 to 1700) does not show any significant gaps or breakpoints around $1000. The values appear continuous over this range.
This continuity suggests that there is no distinct breakpoint in pricing within this specific range.
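
A count plot draws one bar per distinct price, which can make gaps hard to judge. One alternative is to bin prices into fixed-width intervals and look for empty bins. The sketch below uses made-up prices and an assumed bin width of $100.

```python
import pandas as pd

# Hypothetical prices in the 500-1700 range; illustrative only.
prices = pd.Series([520, 610, 640, 980, 1005, 1010, 1250, 1490, 1500, 1699])

# Bin edges every $100: 500, 600, ..., 1700.
bin_edges = list(range(500, 1701, 100))

# right=False makes each bin closed on the left: [500, 600), [600, 700), ...
# Note that a price of exactly 1700 would fall outside the last bin here.
price_bins = pd.cut(prices, bins=bin_edges, right=False)

# sort=False keeps the bins in price order; empty bins are reported as 0.
bin_counts = price_bins.value_counts(sort=False)
empty_bins = bin_counts[bin_counts == 0]

print(bin_counts)
print("Empty bins:", list(empty_bins.index))
```

A run of empty or near-empty bins would point to a pricing gap; evenly filled bins support the reading that prices are continuous over the range.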

In [None]:
# Segmenting one-carat diamonds and examining their price ranges and attributes
one_carat_diamonds = diamonds_df.loc[diamonds_df['carat_weight'].between(0.9, 1.2)]

# Analyzing one-carat diamonds within a specific price range (between $1800 and $12000)
one_carat_price_range = one_carat_diamonds.loc[one_carat_diamonds['total_sales_price'].between(1800, 12000)]

# Boxen plots for different attributes of one-carat diamonds within the specified price range
# Cut analysis
plt.figure(figsize=(15, 6))
sns.set_palette(palette="colorblind")
sns.boxenplot(data=one_carat_price_range, x="cut", y="total_sales_price", order=one_carat_price_range['cut'].value_counts().index)
plt.title('One Carat Diamonds (Between $1800 and $12k) - Cut Analysis')
plt.show()

# Color analysis
plt.figure(figsize=(15, 3))
sns.set_palette(palette="light:#edf5dc", n_colors=11)
colors = ['D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'fancy']
sns.boxenplot(data=one_carat_price_range, x="color", y="total_sales_price", order=colors)
plt.title('One Carat Diamonds (Between $1800 and $12k) - Color Analysis')
plt.show()

# Clarity analysis
plt.figure(figsize=(15, 3))
sns.set_palette(palette='Blues', n_colors=12)
clarities = ['FL', 'IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'SI3', 'I1', 'I2', 'I3']
sns.boxenplot(data=one_carat_price_range, x="clarity", y="total_sales_price", order=clarities)
plt.title('One Carat Diamonds (Between $1800 and $12k) - Clarity Analysis')
plt.show()

# Cut quality analysis
plt.figure(figsize=(15, 3))
sns.set_palette(palette='Greens', n_colors=5)
cut_qualities = ['Excellent', 'Very Good', 'Good', 'Fair', 'Ideal']
sns.boxenplot(data=one_carat_price_range, x="cut_quality", y="total_sales_price", order=cut_qualities)
plt.title('One Carat Diamonds (Between $1800 and $12k) - Cut Quality Analysis')
plt.show()


Segment Analysis Results: One Carat Diamonds (Price Range 1800 to 12,000)

Cut Analysis:
The boxen plot for different cuts shows a variation in price distribution. Some cuts like Round, Princess, and Oval seem to have a wider price range.

Color Analysis:
The color analysis indicates a trend in price variability across different color grades. Notably, D and E color diamonds have higher median prices.

Clarity Analysis:
Clarity grades show distinct price distributions. Higher clarity grades like FL, IF, VVS1, and VVS2 tend to have higher median prices.

Cut Quality Analysis:
There is a noticeable variation in price based on cut quality. Diamonds with 'Excellent' and 'Very Good' cut qualities show higher price ranges compared to others.

Insights:
These analyses provide valuable insights into how cut, color, clarity, and cut quality influence the prices of one-carat diamonds within this specific price range.

The variability in prices based on these attributes highlights the complex interplay of factors that determine a diamond's value.
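
Statements about medians are easier to verify with a table than by reading them off a boxen plot. The sketch below computes the count and median price per clarity grade on a made-up one-carat segment; the prices are illustrative only.

```python
import pandas as pd

# Hypothetical one-carat segment; values are illustrative only.
segment = pd.DataFrame({
    "clarity": ["VVS1", "VS1", "SI1", "VVS1", "SI1", "VS1", "SI1"],
    "total_sales_price": [9000, 6500, 4000, 11000, 4400, 7100, 3600],
})

clarities = ['FL', 'IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'SI3', 'I1', 'I2', 'I3']

# Keep only the grades that occur in the segment, in grading order.
present = [grade for grade in clarities if grade in set(segment["clarity"])]

price_by_clarity = (
    segment.groupby("clarity")["total_sales_price"]
    .agg(["count", "median"])
    .reindex(present)
)
print(price_by_clarity)
```

Including the count next to the median helps avoid over-reading grades that are represented by only a handful of stones.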

Given the extensive EDA we've already conducted on the 4Cs (Cut, Color, Clarity, and Carat Weight), we can now explore some of the other interesting aspects of the dataset. Here are a few ideas:

Lab Analysis:
Investigate the distribution and impact of different diamond grading labs (GIA, IGI, HRD) on prices.

Polish and Symmetry:
Analyze how polish and symmetry ratings influence the total sales price.

Fancy Color Diamonds:
Explore the distribution and price implications of fancy colored diamonds.

Fluorescence Analysis:
Examine the impact of fluorescence on diamond prices.

Culet Size and Condition:
Investigate how the culet size and condition affect the diamond's value.

In [None]:
# Lab Analysis: Distribution and impact on prices
plt.figure(figsize=(12, 6))
sns.boxenplot(x='lab', y='total_sales_price', data=diamonds_df)
plt.title('Impact of Grading Lab on Total Sales Price')
plt.show()

# Polish and Symmetry Analysis
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
sns.boxenplot(x='polish', y='total_sales_price', data=diamonds_df)
plt.title('Impact of Polish on Total Sales Price')

plt.subplot(1, 2, 2)
sns.boxenplot(x='symmetry', y='total_sales_price', data=diamonds_df)
plt.title('Impact of Symmetry on Total Sales Price')
plt.show()

# Fancy Color Diamonds Analysis
plt.figure(figsize=(12, 6))
sns.boxenplot(x='fancy_color_dominant_color', y='total_sales_price', data=diamonds_df)
plt.title('Impact of Fancy Color on Total Sales Price')
plt.xticks(rotation=45)
plt.show()

# Fluorescence Analysis
plt.figure(figsize=(12, 6))
sns.boxenplot(x='fluor_intensity', y='total_sales_price', data=diamonds_df)
plt.title('Impact of Fluorescence on Total Sales Price')
plt.xticks(rotation=45)
plt.show()

# Culet Size and Condition Analysis
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
sns.boxenplot(x='culet_size', y='total_sales_price', data=diamonds_df)
plt.title('Impact of Culet Size on Total Sales Price')
plt.xticks(rotation=45)

plt.subplot(1, 2, 2)
sns.boxenplot(x='culet_condition', y='total_sales_price', data=diamonds_df)
plt.title('Impact of Culet Condition on Total Sales Price')
plt.xticks(rotation=45)
plt.show()


1. Lab Analysis:
The boxen plots show variation in total sales prices across different grading labs (GIA, IGI, HRD). Some labs appear to have a higher median price, possibly indicating a perceived higher quality or market preference.
2. Polish and Symmetry Analysis:
Polish: Different polish grades show varied impacts on total sales price, with some grades associated with higher prices.
Symmetry: Similar to polish, symmetry grades also influence the price, with certain grades fetching higher prices.
3. Fancy Color Diamonds Analysis:
Fancy colored diamonds show a diverse range of prices. Some colors appear to command higher prices, reflecting rarity or market demand for specific colors.
4. Fluorescence Analysis:
The impact of fluorescence intensity on diamond prices is illustrated. There's noticeable variability in price based on fluorescence characteristics.
5. Culet Size and Condition Analysis:
Culet Size: Different culet sizes impact the price, with some sizes being associated with higher or lower prices.
Culet Condition: The condition of the culet also influences the price, indicating the importance of this feature in overall diamond valuation.
Insights:
These analyses reveal the multifaceted nature of diamond valuation, where factors beyond the 4Cs also play significant roles.
Understanding these additional attributes can provide a more rounded view of what influences diamond prices.

Applying robust scaling for the numerical features

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

numeric_features = diamonds_df.select_dtypes(include=['float64', 'int64']).columns
numeric_features = numeric_features.drop('Unnamed: 0')  # Exclude ID column

# min_max_scaler = MinMaxScaler()
# diamonds_df_scaled = diamonds_df.copy()

# numeric_features_to_scale = [feature for feature in numeric_features if feature != 'total_sales_price']

# # Applying Min-Max Scaling
# diamonds_df_scaled[numeric_features_to_scale] = min_max_scaler.fit_transform(diamonds_df[numeric_features_to_scale])

# # Displaying the first few rows of the scaled dataset (excluding total_sales_price)
# diamonds_df_scaled[numeric_features_to_scale].head()

from sklearn.preprocessing import RobustScaler

# Initialize the Robust Scaler
robust_scaler = RobustScaler()

# Select numeric features for scaling, excluding the target variable
numeric_features_to_scale = [feature for feature in numeric_features if feature != 'total_sales_price']

# Apply Robust Scaler to the numeric features
diamonds_df_robust_scaled = diamonds_df.copy()
diamonds_df_robust_scaled[numeric_features_to_scale] = robust_scaler.fit_transform(diamonds_df[numeric_features_to_scale])

# Check the first few rows of the scaled data
print(diamonds_df_robust_scaled[numeric_features_to_scale].head())
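
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile). The sketch below reproduces that arithmetic by hand with NumPy on made-up carat weights and contrasts it with min-max scaling, to show why a single very large stone matters less for the robust version.

```python
import numpy as np

# Hypothetical carat weights with one extreme value; illustrative only.
carat = np.array([0.30, 0.40, 0.50, 0.70, 1.00, 19.35])

# Robust scaling: centre on the median, scale by the interquartile range.
median = np.median(carat)
q1, q3 = np.percentile(carat, [25, 75])
iqr = q3 - q1
robust_scaled = (carat - median) / iqr

# Min-max scaling: the extreme value defines the whole range.
minmax_scaled = (carat - carat.min()) / (carat.max() - carat.min())

print("robust :", np.round(robust_scaled, 3))
print("min-max:", np.round(minmax_scaled, 3))
```

In the min-max output the five ordinary stones are squeezed into a narrow band near zero, whereas the robust version keeps them spread out and leaves only the extreme stone far from the rest.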



Applying One Hot Encoding for Categorical Features

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Identifying categorical columns
categorical_columns = diamonds_df.select_dtypes(include=['object']).columns

# Applying One-Hot Encoding to categorical variables
# (scikit-learn >= 1.2 names this argument `sparse_output`; older releases use `sparse`)
one_hot_encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_categorical_data = one_hot_encoder.fit_transform(diamonds_df[categorical_columns])

# Creating a DataFrame for encoded categorical features
encoded_categorical_df = pd.DataFrame(encoded_categorical_data, columns=one_hot_encoder.get_feature_names_out(categorical_columns))

# Concatenating the encoded categorical features with the scaled numeric features
diamonds_df_preprocessed = pd.concat([diamonds_df_robust_scaled.drop(categorical_columns, axis=1), encoded_categorical_df], axis=1)

# Displaying the first few rows of the preprocessed dataset
diamonds_df_preprocessed.head()
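
Dropping the first category means one level per feature has no indicator column and acts as the baseline. The sketch below illustrates the effect with `pd.get_dummies(drop_first=True)`, the pandas analogue of `drop='first'` (swapped in here to keep the example small), on a made-up `lab` column.

```python
import pandas as pd

# Hypothetical 'lab' column; illustrative only.
labs = pd.DataFrame({"lab": ["GIA", "IGI", "HRD", "GIA"]})

# Categories are sorted alphabetically (GIA, HRD, IGI); the first one is dropped.
encoded = pd.get_dummies(labs, columns=["lab"], drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in both `lab_HRD` and `lab_IGI` therefore represents the dropped baseline, GIA.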


Splitting the data and fitting a baseline Linear Regression model

In [None]:
from sklearn.model_selection import train_test_split

# Define the features and target variable
X = diamonds_df_preprocessed.drop(['total_sales_price', 'Unnamed: 0'], axis=1)  # Exclude target and ID column
y = diamonds_df_preprocessed['total_sales_price']

# Split the dataset into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
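
In the cells above, the scaler and encoder are fitted on the full dataset before the split, so the test rows contribute to the medians, IQRs, and category lists. One common alternative is to split first and wrap the preprocessing in a `Pipeline`, so that it is fitted on the training rows only. The following is a sketch on synthetic data; the column names follow the dataset, but the values, the toy price formula, and the choice of `handle_unknown="ignore"` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

rng = np.random.default_rng(42)
n_rows = 200

# Synthetic stand-in for the diamonds data; illustrative only.
toy_df = pd.DataFrame({
    "carat_weight": rng.uniform(0.2, 3.0, n_rows),
    "lab": rng.choice(["GIA", "IGI", "HRD"], n_rows),
})
toy_df["total_sales_price"] = 4000 * toy_df["carat_weight"] + rng.normal(0, 200, n_rows)

X_toy = toy_df[["carat_weight", "lab"]]
y_toy = toy_df["total_sales_price"]

# Split BEFORE any fitting so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", RobustScaler(), ["carat_weight"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["lab"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# fit() learns the scaling statistics and categories from the training rows only.
pipeline.fit(X_tr, y_tr)
test_r2 = pipeline.score(X_te, y_te)
print(f"Test R^2: {test_r2:.3f}")
```

The same pipeline object can be passed to cross-validation or grid search, which then refit the preprocessing inside every fold.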


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Linear Regression model
linear_model = LinearRegression()

# Fit the model on the training data
linear_model.fit(X_train, y_train)


In [None]:
# Predict on the test set
y_pred = linear_model.predict(X_test)

# Calculate RMSE and R^2
# (the square root is taken explicitly; recent scikit-learn releases no longer accept `squared=False`)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)

print("RMSE:", rmse)
print("R^2:", r2)
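
For reference, both metrics can be computed by hand. RMSE is the square root of the mean squared residual, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A small worked example with made-up prices and predictions:

```python
import numpy as np

# Hypothetical true prices and predictions; illustrative only.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_hat = np.array([1100.0, 1900.0, 3200.0, 3800.0])

residuals = y_true - y_hat  # [-100, 100, -200, 200]

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)                 # 100000
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 5000000
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.2f}")  # RMSE: 158.11
print(f"R^2:  {r2_manual:.2f}")    # R^2:  0.98
```

RMSE is expressed in the same units as the target (dollars here), while R^2 is unitless, so the two answer different questions about the fit.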


In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest Regressor
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model on the training data
random_forest_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = random_forest_model.predict(X_test)

# Calculate RMSE and R^2 for Random Forest
# (the square root is taken explicitly; recent scikit-learn releases no longer accept `squared=False`)
rmse_rf = mean_squared_error(y_test, y_pred_rf) ** 0.5
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest RMSE:", rmse_rf)
print("Random Forest R^2:", r2_rf)


In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize the Gradient Boosting Regressor
gradient_boosting_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Fit the model on the training data
gradient_boosting_model.fit(X_train, y_train)

# Predict on the test set
y_pred_gb = gradient_boosting_model.predict(X_test)

# Calculate RMSE and R^2 for Gradient Boosting
# (the square root is taken explicitly; recent scikit-learn releases no longer accept `squared=False`)
rmse_gb = mean_squared_error(y_test, y_pred_gb) ** 0.5
r2_gb = r2_score(y_test, y_pred_gb)

print("Gradient Boosting RMSE:", rmse_gb)
print("Gradient Boosting R^2:", r2_gb)


In [None]:
import xgboost as xgb

# Initialize the XGBoost Regressor
xgb_model = xgb.XGBRegressor(
    objective='reg:squarederror',
    colsample_bytree=0.3,
    learning_rate=0.1,
    max_depth=5,
    alpha=10,
    n_estimators=100,
)

# Fit the model on the training data
xgb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = xgb_model.predict(X_test)

# Calculate RMSE and R^2 for XGBoost
# (the square root is taken explicitly; recent scikit-learn releases no longer accept `squared=False`)
rmse_xgb = mean_squared_error(y_test, y_pred_xgb) ** 0.5
r2_xgb = r2_score(y_test, y_pred_xgb)

print("XGBoost RMSE:", rmse_xgb)
print("XGBoost R^2:", r2_xgb)



In [None]:
from sklearn.preprocessing import RobustScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import SGDRegressor, Lasso, Ridge
from sklearn.svm import SVR
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split, learning_curve, RandomizedSearchCV
from sklearn.metrics import r2_score,make_scorer,mean_squared_error

In [None]:
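# Fit each candidate with default settings and compare test-set R^2
# (verbose=0 silences CatBoost's per-iteration training log)
models = [KNeighborsRegressor(), SGDRegressor(), Lasso(), Ridge(), CatBoostRegressor(verbose=0), XGBRegressor()]
for model in models:
    model.fit(X_train, y_train)
    ypred = model.predict(X_test)
    score = r2_score(y_test, ypred)
    print("model: {}  score {:.4f}".format(type(model).__name__, score))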

In [None]:
from sklearn.model_selection import GridSearchCV
from catboost import CatBoostRegressor

# Define the parameter grid
param_grid = {
    'depth': [4, 6, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'iterations': [30, 50, 100],
    'l2_leaf_reg': [1, 3, 5]
}

# Initialize the CatBoost Regressor
catboost_model = CatBoostRegressor()

# Initialize the Grid Search
grid_search = GridSearchCV(estimator=catboost_model, param_grid=param_grid, 
                           cv=3, n_jobs=-1, verbose=2, scoring='r2')

# Fit the grid search to the data
grid_search.fit(X_train, y_train)


In [None]:
# Best parameters found by the grid search
best_params = grid_search.best_params_
print("Best parameters:", best_params)

# Best model, refitted on the full training set
best_model = grid_search.best_estimator_

# Evaluate the best model
y_pred = best_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("Best Model R^2:", r2)
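
The single best score does not show how close the other parameter combinations came. `GridSearchCV` stores the mean cross-validated score of every combination in `cv_results_`, which can be turned into a ranked table. The sketch below uses `Ridge` on synthetic data as a lightweight stand-in for the CatBoost search; the data and the `alpha` grid are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic regression problem; illustrative only.
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 1.0, 100.0]},
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results = (
    pd.DataFrame(search.cv_results_)[
        ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(results)
```

When the top few rows differ by less than their standard deviation, the choice among them is unlikely to matter much on unseen data.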
