In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
import time

#  Data Loading
data = pd.read_csv("generated_data100.csv")



# Splitting Data
X = data.drop(columns=['Y'])
y = data['Y']
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

#  XGBoost Configuration
# Choose appropriate configurations
configurations = [
    {'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.8},  # Configuration 1
    {'learning_rate': 0.13, 'max_depth': 5, 'subsample': 0.7},  # Configuration 2
    {'learning_rate': 0.11, 'max_depth': 4, 'subsample': 0.6}, # Configuration 3
    {'learning_rate': 0.12, 'max_depth': 6, 'subsample': 0.9}  # Configuration 4
]

#  Model Training and Evaluation
results = []
for i, config in enumerate(configurations):
    start_time = time.time()
    model = XGBRegressor(**config)
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    holdout_pred = model.predict(X_holdout)
    # RMSE = sqrt(MSE); the `squared=False` argument is deprecated since scikit-learn 1.4 and later removed
    train_rmse = mean_squared_error(y_train, train_pred) ** 0.5
    holdout_rmse = mean_squared_error(y_holdout, holdout_pred) ** 0.5
    elapsed_time = time.time() - start_time
    results.append({
        'Configuration': i + 1,
        'Training RMSE': train_rmse,
        'Holdout RMSE': holdout_rmse,
        'Time': elapsed_time
    })

#  Display Results
results_df = pd.DataFrame(results)
print(results_df)



   Configuration  Training RMSE  Holdout RMSE      Time
0              1   88954.631383  5.761404e+06  0.064533
1              2   37931.868084  5.727710e+06  0.115397
2              3  119720.027272  5.766739e+06  0.683827
3              4    9294.385874  6.455653e+06  0.127351
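
For reference, the RMSE columns above are the square root of the mean squared error, so they are expressed in the same units as the target `Y`. A minimal NumPy sketch of the metric follows; the `rmse` helper name and the toy arrays are illustrative and not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, chosen only to make the arithmetic easy to check by hand.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rmse(y_true, y_pred))  # sqrt((1 + 0 + 4) / 3)
```

Because the error is squared before averaging, a few large misses on the holdout rows can dominate the value.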


Interpretation:

Configuration 1:

Training RMSE: 88954.63
Holdout RMSE: 5.761404e+06
Time taken: 0.064533 seconds
Interpretation: Configuration 1 has a training RMSE of about 8.9e+04 but a holdout RMSE of about 5.8e+06, roughly 65 times higher. A gap of this size indicates severe overfitting: the model fits the training rows closely but generalizes poorly to unseen data. It is the fastest configuration in the table, but its performance on unseen data is not satisfactory.








Configuration 2:

Training RMSE: 37931.87
Holdout RMSE: 5.727710e+06
Time taken: 0.115397 seconds
Interpretation: Configuration 2 has a lower training RMSE than Configuration 1, indicating a closer fit to the training data. Its holdout RMSE is the lowest of the four configurations, but only by about 0.6% relative to Configuration 1, and it is still about 150 times the training RMSE, so generalization remains poor. The run took longer than Configuration 1 (0.115 vs. 0.065 seconds) without a meaningful improvement on unseen data.








Configuration 3:

Training RMSE: 119720.03
Holdout RMSE: 5.766739e+06
Time taken: 0.683827 seconds
Interpretation: Configuration 3 has the highest training RMSE of the four configurations, indicating the loosest fit to the training data. The holdout RMSE remains high and is close to that of Configuration 1, so generalization is equally poor. At 0.68 seconds it is also the slowest run in the results table, although timings this short can vary noticeably from run to run.












Configuration 4:

Training RMSE: 9294.39
Holdout RMSE: 6.455653e+06
Time taken: 0.127351 seconds
Interpretation: Configuration 4 has by far the lowest training RMSE, indicating the closest fit to the training data. However, it also has the highest holdout RMSE of the four configurations, nearly 700 times its training RMSE, which makes it the most overfit. Its run time is similar to that of Configuration 2. It gives the best fit to the training data but generalizes worst to unseen data.
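
The gap between training and holdout error can be made explicit by dividing the holdout RMSE by the training RMSE for each configuration. The sketch below does this with the values copied from the printed results table; the `gap_ratio` name is introduced here for illustration only.

```python
# (training RMSE, holdout RMSE) per configuration, copied from the printed results table.
results = {
    1: (88954.631383, 5.761404e+06),
    2: (37931.868084, 5.727710e+06),
    3: (119720.027272, 5.766739e+06),
    4: (9294.385874, 6.455653e+06),
}

# A ratio near 1 means similar error on seen and unseen rows;
# a large ratio means the model fits the training rows far better than the holdout rows.
gap_ratio = {cfg: holdout / train for cfg, (train, holdout) in results.items()}

for cfg, ratio in sorted(gap_ratio.items(), key=lambda item: item[1]):
    print(f"Configuration {cfg}: holdout/train RMSE ratio = {ratio:.1f}")
```

When every configuration shows a ratio far above 1, the limiting factor is more likely the amount of data than the choice of hyperparameters.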






Configuration 2 and Configuration 4 stand out from the others, each for a different reason. Their key metrics:

Configuration 2:

Training RMSE: 37931.87
Holdout RMSE: 5.727710e+06
Time taken: 0.115397 seconds

Configuration 4:

Training RMSE: 9294.39
Holdout RMSE: 6.455653e+06
Time taken: 0.127351 seconds

Configuration 4 has the lower training RMSE, while Configuration 2 has the lower holdout RMSE, indicating better generalization to unseen data. Their run times are nearly identical and well within an acceptable range.

Therefore, Configuration 2 and Configuration 4 are taken as the top two choices: Configuration 2 on holdout performance and Configuration 4 on training fit. Note that Configuration 4 has the highest holdout RMSE of the four, so its place rests on training fit alone, and every configuration overfits heavily on this dataset.
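
On a small dataset, a single train/holdout split gives a noisy estimate of generalization error. One common alternative is k-fold cross-validation. The sketch below is an assumption about how that could look, not part of the original analysis: `cv_rmse` is a hypothetical helper, and the demo uses synthetic data with `LinearRegression` as a stand-in so that it runs without the CSV files; in the notebook, `XGBRegressor(**config)` would be passed instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def cv_rmse(model, X, y, n_splits=5, seed=42):
    """Return the mean and standard deviation of RMSE across k folds."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # scikit-learn scorers follow "higher is better", so RMSE is returned negated.
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring="neg_root_mean_squared_error")
    fold_rmse = -scores
    return fold_rmse.mean(), fold_rmse.std()

# Synthetic stand-in data: 100 rows, 3 features, linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

mean_rmse, std_rmse = cv_rmse(LinearRegression(), X_demo, y_demo)
print(f"CV RMSE: {mean_rmse:.3f} +/- {std_rmse:.3f}")
```

Reporting the spread across folds alongside the mean shows whether a difference between two configurations is larger than the fold-to-fold noise.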




**1000**

In [11]:

#  Data Loading
data = pd.read_csv("generated_data1000.csv")

#  Splitting Data
X = data.drop(columns=['Y'])
y = data['Y']
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

#  XGBoost Configuration
# Choose appropriate configurations
configurations = [
    {'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.8},   # Configuration 1
    {'learning_rate': 0.13, 'max_depth': 5, 'subsample': 0.7},  # Configuration 2
    {'learning_rate': 0.11, 'max_depth': 4, 'subsample': 0.6},  # Configuration 3
    {'learning_rate': 0.12, 'max_depth': 6, 'subsample': 0.9}   # Configuration 4
]

#  Model Training and Evaluation
results = []
for i, config in enumerate(configurations):
    start_time = time.time()
    model = XGBRegressor(**config)
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    holdout_pred = model.predict(X_holdout)
    # RMSE = sqrt(MSE); the `squared=False` argument is deprecated since scikit-learn 1.4 and later removed
    train_rmse = mean_squared_error(y_train, train_pred) ** 0.5
    holdout_rmse = mean_squared_error(y_holdout, holdout_pred) ** 0.5
    elapsed_time = time.time() - start_time
    results.append({
        'Configuration': i + 1,
        'Training RMSE': train_rmse,
        'Holdout RMSE': holdout_rmse,
        'Time': elapsed_time
    })

#  Display Results
results_df = pd.DataFrame(results)
print(results_df)


   Configuration  Training RMSE   Holdout RMSE      Time
0              1  167114.715310  235156.398451  0.057089
1              2   65869.397144  140429.078594  0.097032
2              3  144519.200546  174648.736855  0.078459
3              4   46143.596215  135106.439047  0.136616


Interpretation

Configuration 1:

Training RMSE: 167114.72
Holdout RMSE: 235156.40
Time taken: 0.0571 seconds
Interpretation: Configuration 1 has the highest training RMSE and the highest holdout RMSE of the four configurations, which suggests that its shallow trees (max_depth 3) underfit relative to the others. Its holdout RMSE is about 1.4 times its training RMSE, a far smaller gap than on the previous dataset. It is the fastest configuration to run.

Configuration 2:

Training RMSE: 65869.40
Holdout RMSE: 140429.08
Time taken: 0.0970 seconds
Interpretation: Configuration 2 has a much lower training RMSE than Configuration 1, indicating a closer fit to the training data, and a much lower holdout RMSE, indicating better generalization to unseen data. It takes longer to run than Configurations 1 and 3, but still finishes in under 0.1 seconds.

Configuration 3:

Training RMSE: 144519.20
Holdout RMSE: 174648.74
Time taken: 0.0785 seconds
Interpretation: Configuration 3 falls between Configurations 1 and 2 on both training and holdout RMSE, and has the smallest gap between the two (holdout is about 1.2 times training). It is the second-fastest configuration.

Configuration 4:

Training RMSE: 46143.60
Holdout RMSE: 135106.44
Time taken: 0.1366 seconds
Interpretation: Configuration 4 has the lowest training and holdout RMSE of all configurations, giving both the closest fit to the training data and the best generalization to unseen data. Its holdout RMSE is nearly 3 times its training RMSE, so some overfitting remains. It is the slowest configuration, though at 0.14 seconds the difference is negligible.

Considering both training and holdout RMSE, the two best configurations are Configuration 4 and Configuration 2. Configuration 4 has the lowest RMSE values overall, and Configuration 2 has the second-lowest holdout RMSE, about 4% above Configuration 4. Both take longer to run than Configurations 1 and 3, but every run finishes in under 0.14 seconds, so run time is not a deciding factor here. Therefore, Configuration 2 and Configuration 4 are preferred for this dataset.
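
Timings this short are sensitive to background load, so a single measurement can mislead when configurations are compared on speed. A hedged sketch of one way to stabilise them is to repeat the call several times and report the median. The `median_runtime` helper and the dummy workload are illustrative assumptions; in the notebook, the callable would wrap `model.fit(X_train, y_train)`. Note also that the `Time` column in the loop above covers fitting, prediction, and scoring together, not training alone.

```python
import statistics
import time

def median_runtime(func, repeats=5):
    """Call func() `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Dummy workload standing in for a model fit.
def workload():
    return sum(i * i for i in range(50_000))

print(f"median runtime: {median_runtime(workload):.6f} s")
```

The median is less affected than the mean by a single slow run, for example the first call, which may include one-off start-up costs.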

**100000**

In [12]:

#  Data Loading
data = pd.read_csv("generated_data100000.csv")

#  Splitting Data
X = data.drop(columns=['Y'])
y = data['Y']
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

#  XGBoost Configuration
# Choose appropriate configurations
configurations = [
    {'learning_rate': 0.1, 'max_depth': 3, 'subsample': 0.8},  # Configuration 1
    {'learning_rate': 0.13, 'max_depth': 5, 'subsample': 0.7},  # Configuration 2
    {'learning_rate': 0.11, 'max_depth': 4, 'subsample': 0.6},  # Configuration 3
    {'learning_rate': 0.12, 'max_depth': 6, 'subsample': 0.9}  # Configuration 4
]

#  Model Training and Evaluation
results = []
for i, config in enumerate(configurations):
    start_time = time.time()
    model = XGBRegressor(**config)
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    holdout_pred = model.predict(X_holdout)
    # RMSE = sqrt(MSE); the `squared=False` argument is deprecated since scikit-learn 1.4 and later removed
    train_rmse = mean_squared_error(y_train, train_pred) ** 0.5
    holdout_rmse = mean_squared_error(y_holdout, holdout_pred) ** 0.5
    elapsed_time = time.time() - start_time
    results.append({
        'Configuration': i + 1,
        'Training RMSE': train_rmse,
        'Holdout RMSE': holdout_rmse,
        'Time': elapsed_time
    })

# Display Results
results_df = pd.DataFrame(results)
print(results_df)


   Configuration  Training RMSE  Holdout RMSE      Time
0              1  961191.166893  1.037401e+06  0.518167
1              2  615949.143331  9.250205e+05  0.738011
2              3  886038.193185  1.048876e+06  0.643398
3              4  443927.045945  8.521661e+05  2.243807


Interpretation

Configuration 1:

Training RMSE: 961,191.17
Holdout RMSE: 1,037,401
Time taken: 0.518167 seconds

Interpretation: Configuration 1 has the highest training RMSE of the four configurations, which points to underfitting rather than overfitting. Its holdout RMSE is only about 8% higher than its training RMSE, so the model behaves consistently on seen and unseen data, but at a high error level. It is the fastest configuration at 0.518167 seconds.

Configuration 2:

Training RMSE: 615,949.14
Holdout RMSE: 925,020.5
Time taken: 0.738011 seconds

Interpretation: Configuration 2 has a much lower training RMSE than Configuration 1, indicating a better fit to the training data. Its holdout RMSE is also lower, the second lowest of the four, indicating good generalization. The run time is somewhat higher at 0.738011 seconds.

Configuration 3:

Training RMSE: 886,038.19
Holdout RMSE: 1,048,876
Time taken: 0.643398 seconds

Interpretation: Configuration 3 has a high training RMSE, close to that of Configuration 1, again pointing to underfitting. Its holdout RMSE is the highest of the four configurations, marginally above Configuration 1. The run time is moderate at 0.643398 seconds.

Configuration 4:

Training RMSE: 443,927.05
Holdout RMSE: 852,166.1
Time taken: 2.243807 seconds

Interpretation: Configuration 4 has the lowest training RMSE and the lowest holdout RMSE, about 8% below Configuration 2 on holdout, indicating the best generalization of the four. It also has the largest gap between training and holdout RMSE (holdout is about 1.9 times training) and by far the longest run time, roughly three times that of Configuration 2.





In summary, Configuration 4 achieves the lowest training and holdout RMSE, but takes about three times as long to run as Configuration 2. Configuration 2 has the second-lowest holdout RMSE (about 8.5% above Configuration 4) with a moderate run time. Weighing accuracy against run time, Configuration 2 is selected as the best configuration, while Configuration 4 is the second-best choice.
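
The trade-off behind this choice can be quantified directly from the results table: how much holdout error Configuration 4 saves relative to Configuration 2, and how much extra run time it costs. The variable names below are illustrative.

```python
# Values copied from the printed results table for the 100000 dataset.
holdout_rmse = {2: 9.250205e+05, 4: 8.521661e+05}
run_time = {2: 0.738011, 4: 2.243807}

# Relative reduction in holdout RMSE when moving from Configuration 2 to Configuration 4.
rmse_reduction = (holdout_rmse[2] - holdout_rmse[4]) / holdout_rmse[2]

# How many times longer Configuration 4 takes than Configuration 2.
time_factor = run_time[4] / run_time[2]

print(f"Holdout RMSE reduction: {rmse_reduction:.1%}")
print(f"Run time factor: {time_factor:.2f}x")
```

Whether that reduction is worth the extra run time depends on how the model will be used.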






