# Comparison of Data Splitting Ratios using ```Train_Test_Split```

### Introduction

In this notebook, we explore the impact of different data splits on model performance using the Wine dataset from ```scikit-learn```. By training a logistic regression model with 60:20:20 and 70:15:15 splits, and by considering what happens when no validation set is used, we can observe how the distribution of data between training, validation, and testing affects accuracy and reliability. These experiments provide insight into best practices for model evaluation and will inform strategies for improving the performance and generalization of the capstone project model.


In [6]:
# Imports

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

### Loading and Splitting Data (70% 15% 15%)

In [7]:
# Load the default wine dataset
data = load_wine()
X = data.data
y = data.target

# Split data into train (70%), validation (15%) and test (15%)sets
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.1765, random_state=42, stratify=y_train_val
)

# 0.1765 * 0.85 ≈ 0.15, so validation is ~15% of total


### Training and Assessing Models (70% 15% 15%)

In [8]:
# Fit the logistic regression model
model = LogisticRegression(max_iter=10000, random_state=42)
model.fit(X_train, y_train)

# Evaluate the validation set
val_preds = model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_preds)
print(f"Validation Accuracy: {val_accuracy:.4f}")

# Evaluate the test set
test_preds = model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_preds)
print(f"Test Accuracy: {test_accuracy:.4f}")

# Detailed classification report on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test, test_preds, target_names=data.target_names))

Validation Accuracy: 1.0000
Test Accuracy: 0.9630

Classification Report on Test Set:
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00         9
     class_1       0.92      1.00      0.96        11
     class_2       1.00      0.86      0.92         7

    accuracy                           0.96        27
   macro avg       0.97      0.95      0.96        27
weighted avg       0.97      0.96      0.96        27



### Loading and Splitting Data (60% 20% 20%)

In [10]:
# Split data into train (60%), validation (20%) and test (20%)sets
X_train_val1, X_test1, y_train_val1, y_test1 = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train1, X_val1, y_train1, y_val1 = train_test_split(
    X_train_val1, y_train_val1, test_size=0.25, random_state=42, stratify=y_train_val1
)

# 0.25 * 0.8 ≈ 0.2, so validation is ~2% of total

### Training and Assessing Models (60% 20% 20%)

In [11]:
# Fit the logistic regression model
model1 = LogisticRegression(max_iter=10000, random_state=42)
model1.fit(X_train1, y_train1)

# Evaluate the validation set
val_preds1 = model.predict(X_val1)
val_accuracy1 = accuracy_score(y_val1, val_preds1)
print(f"Validation Accuracy: {val_accuracy1:.4f}")

# Evaluate the test set
test_preds1 = model.predict(X_test1)
test_accuracy1 = accuracy_score(y_test1, test_preds1)
print(f"Test Accuracy: {test_accuracy1:.4f}")

# Detailed classification report on test set
print("\nClassification Report on Test Set:")
print(classification_report(y_test1, test_preds1, target_names=data.target_names))

Validation Accuracy: 0.9722
Test Accuracy: 0.9722

Classification Report on Test Set:
              precision    recall  f1-score   support

     class_0       1.00      1.00      1.00        12
     class_1       0.93      1.00      0.97        14
     class_2       1.00      0.90      0.95        10

    accuracy                           0.97        36
   macro avg       0.98      0.97      0.97        36
weighted avg       0.97      0.97      0.97        36



### Reflection on Data Splits and Model Performance

**Impact of the 60:20:20 split on accuracy**  
With the 60:20:20 split, the model achieved a validation accuracy of **0.9722** and a test accuracy of **0.9722**. This shows that even with slightly less training data, the logistic regression model was still able to generalize very well. The larger validation and test sets provide more stable performance estimates, which helps increase confidence that the model is not overfitting. Interestingly, the performance was slightly higher on the test set compared to the 70:15:15 split, suggesting that the larger evaluation sets may have given a clearer picture of true generalization.

**Performance with a 70:15:15 split**  
With the 70:15:15 split, the model had a validation accuracy of **1.0000** and a test accuracy of **0.9630**. The additional training data appears to have helped the model fit the data well, as shown by the perfect validation accuracy. However, the test accuracy was marginally lower than in the 60:20:20 case, which may be due to the smaller test set size creating slightly more variability in evaluation.

**What happens if the validation set is omitted**  
If no validation set is used and the workflow relies only on training and test sets, we lose a critical tool for tuning and model selection. The model would be trained on the training set and evaluated only once on the test set. This creates two issues: (1) we wouldn’t know whether changes to the model (e.g., hyperparameter adjustments) genuinely improve performance or just happen to work better on the test set by chance, and (2) we risk overfitting to the test set by repeatedly checking results there. Without a validation set, the test accuracy no longer represents a fair estimate of generalization—it becomes part of the model selection process.

**Applying lessons to the capstone project**  
From experimenting with these different splits, a few lessons stand out for improving capstone model performance and reliability:  
- **Balance training and evaluation data:** More training data usually helps model accuracy, but sufficient validation/testing data is needed for trustworthy evaluation. Choosing the right split depends on dataset size; with small datasets, k-fold cross-validation can be even more effective.  
- **Keep a validation set for tuning:** It ensures that hyperparameter tuning and feature selection do not “leak” into the test evaluation. This makes the final test accuracy a genuine measure of generalization.  
- **Experiment with different splits and model types:** Comparing performance under various conditions helps identify whether improvements are consistent or due to chance. This strengthens confidence in the model’s robustness.  
- **Prioritize generalization over raw accuracy:** The goal of a capstone project is not just to maximize accuracy on one dataset, but to build a model that performs reliably on new, unseen data. Thoughtful use of validation and test splits is key to achieving that.
