# Split data

In this notebook, we split the data into training and testing sets. The result is four datasets, `X_train`, `X_test`, `y_train`, and `y_test`, which are stored in `../data/processed/` together with unprocessed copies of the two feature sets. These datasets are used for training and evaluating the machine learning model.


In [10]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Load the cleaned up data
df = pd.read_csv("../data/interim/cleaned_data.csv")

target_variable = "volume_per_ha"

## Handle categorical features

The dataset contains some categorical features, such as `id` (which refers to a tree species) and `yield_class`. Fortunately, both are already stored as numeric values, so we don't need to apply any encoding.

The following code snippet does nothing for now, but we keep it in case other categorical features are added in the future.

In [19]:
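# No categorical features need encoding at the moment
categorical_features = []

# With an empty list, the line below leaves df unchanged
label_encoder = LabelEncoder()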
df[categorical_features] = df[categorical_features].apply(label_encoder.fit_transform)
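
If string-valued categorical features are added later, one option is `OrdinalEncoder`, which is designed for feature columns (whereas `LabelEncoder` is intended for target labels). The following is only a sketch under assumptions: the column `soil_type` and its values are invented, and the encoder is fitted on the training rows only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: `soil_type` is an invented string-valued feature.
train = pd.DataFrame({"soil_type": ["clay", "sand", "loam", "clay"]})
test = pd.DataFrame({"soil_type": ["sand", "peat"]})  # "peat" is unseen in training

categorical_features = ["soil_type"]

# Categories not seen during fitting are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_encoded = train.copy()
test_encoded = test.copy()
train_encoded[categorical_features] = encoder.fit_transform(train[categorical_features])
test_encoded[categorical_features] = encoder.transform(test[categorical_features])

print(train_encoded["soil_type"].tolist())
print(test_encoded["soil_type"].tolist())
```

Fitting on the training rows only keeps the category-to-number mapping independent of the test data.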

## How to split the data

Because the data is not a time series, a simple random train-test split is sufficient. `train_test_split` is called with its default settings, which hold out 25% of the rows for testing and use the remaining 75% for training.

In [12]:
# Define features and target
X = df.drop(columns=[target_variable])
y = df[target_variable]

# Split into train and test sets (scikit-learn's default: 75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(X, y)

print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

Training set size: 3314
Test set size: 1105
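
The sizes printed above correspond to a 75/25 split. If a different ratio or a repeatable split is wanted, `test_size` and `random_state` can be passed explicitly. The sketch below uses invented toy data, and the seed `42` is an arbitrary assumption, not a value taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned data; the values are invented for illustration.
df_demo = pd.DataFrame({"age": range(100), "volume_per_ha": range(100, 200)})

X_demo = df_demo.drop(columns=["volume_per_ha"])
y_demo = df_demo["volume_per_ha"]

# test_size sets the held-out fraction explicitly; random_state (an arbitrary
# seed) makes the shuffle repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("Training set size:", X_tr.shape[0])
print("Test set size:", X_te.shape[0])
```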


## Identify quality issues

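Next, we check the data for quality issues. The check flags text columns whose values have more than 10 distinct string lengths, which can indicate inconsistent formatting. An empty report means no such columns were found.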

In [20]:
def identify_quality_issues(df):
    """Flag text columns whose values vary widely in string length"""
    issues = {}

    # Check for format inconsistencies
    for col in df.select_dtypes(include=["object"]).columns:
        unique_patterns = df[col].astype(str).str.len().value_counts()
        if len(unique_patterns) > 10:  # Many different lengths suggest format issues
            issues[f"{col}_format_inconsistency"] = len(unique_patterns)

    return issues


# Run quality assessment
quality_report = identify_quality_issues(df)
print("Data Quality Issues Found:")
for issue, count in quality_report.items():
    if count > 0:
        # count is the number of distinct string lengths, not a row count
        print(f"  {issue}: {count} distinct value lengths")

Data Quality Issues Found:
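
To see what the length heuristic flags, it can be run on a small invented frame. This is only a sketch: the columns `plot_code` and `species` are hypothetical, and `dtype=object` is set explicitly so that the columns are picked up by `select_dtypes(include=["object"])` regardless of the pandas version.

```python
import pandas as pd


def identify_quality_issues(df):
    """Flag text columns whose values vary widely in string length"""
    issues = {}
    for col in df.select_dtypes(include=["object"]).columns:
        unique_patterns = df[col].astype(str).str.len().value_counts()
        if len(unique_patterns) > 10:
            issues[f"{col}_format_inconsistency"] = len(unique_patterns)
    return issues


# Invented data: `plot_code` has 12 distinct string lengths, `species` only one.
demo = pd.DataFrame(
    {
        "plot_code": pd.Series(["x" * n for n in range(1, 13)], dtype=object),
        "species": pd.Series(["oak", "ash"] * 6, dtype=object),
        "age": range(12),
    }
)

demo_report = identify_quality_issues(demo)
print(demo_report)
```

Only `plot_code` exceeds the threshold of 10 distinct lengths, so it is the only column reported; numeric columns such as `age` are never inspected.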


## Statistical preprocessing pipeline

Since we want to apply the same preprocessing steps to both the training and testing data, we create a preprocessing pipeline. The pipeline is fitted on the training data only and then applied to both the training and testing data. It has three steps:

1. **Imputation of missing values:** this should change nothing because rows with missing values were already dropped, but it keeps the pipeline robust if missing values appear later.
2. **Scaling of numerical features:** puts the features on comparable scales, which helps models that are sensitive to feature scale.
3. **Selection of the best features:** can help reduce overfitting. While tuning the model, I found that all features can be used without noticeable overfitting, so `k` is set to 7, which keeps every feature.
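
To illustrate step 2: the pipeline in this notebook uses `RobustScaler`, which centers each feature on its median and divides by its interquartile range. The sketch below compares it with `StandardScaler` on a handful of invented numbers containing one outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Invented values: four ordinary observations and one outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler: (x - median) / IQR, here median = 3 and IQR = 4 - 2 = 2.
robust = RobustScaler().fit_transform(values).ravel()

# StandardScaler: (x - mean) / std, both of which are pulled by the outlier.
standard = StandardScaler().fit_transform(values).ravel()

print("robust:  ", robust)
print("standard:", standard)
```

With `RobustScaler` the four ordinary values stay well spread out, while `StandardScaler` squeezes them into a narrow band because the outlier inflates the standard deviation.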

In [21]:
# Learn all preprocessing parameters from the training data only to avoid data leakage
def create_preprocessing_pipeline(X_train, y_train):
    """Create preprocessing pipeline fitted on training data"""

    # 1. Missing value imputation
    imputer = SimpleImputer(strategy="mean")
    X_train_imputed = pd.DataFrame(
        imputer.fit_transform(X_train),
        columns=X_train.columns,  # restore column names
        index=X_train.index,  # restore original index
    )

    # 2. Feature scaling
    # RobustScaler is used instead of StandardScaler because it is less sensitive to outliers
    scaler = RobustScaler()
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train_imputed),
        columns=X_train_imputed.columns,
        index=X_train_imputed.index,
    )

    # 3. Feature selection
    selector = SelectKBest(f_regression, k=7)  # k=7 keeps all 7 features
    X_train_selected = pd.DataFrame(
        selector.fit_transform(X_train_scaled, y_train),
        columns=X_train_scaled.columns[selector.get_support()],
        index=X_train_scaled.index,
    )

    # Return fitted preprocessors and transformed data
    preprocessors = {
        "imputer": imputer,
        "scaler": scaler,
        "selector": selector,
    }

    return X_train_selected, preprocessors


def apply_preprocessing_pipeline(X_test, preprocessors):
    """Apply training preprocessing to test data"""

    # Apply in same order as training
    X_test_imputed = pd.DataFrame(
        preprocessors["imputer"].transform(X_test),
        columns=X_test.columns,
        index=X_test.index,
    )
    X_test_scaled = pd.DataFrame(
        preprocessors["scaler"].transform(X_test_imputed),
        columns=X_test_imputed.columns,
        index=X_test_imputed.index,
    )
    # Use selected columns from training
    X_test_selected = pd.DataFrame(
        preprocessors["selector"].transform(X_test_scaled),
        columns=X_test_scaled.columns[preprocessors["selector"].get_support()],
        index=X_test_scaled.index,
    )

    return X_test_selected


# Usage
X_train_processed, fitted_preprocessors = create_preprocessing_pipeline(
    X_train, y_train
)
X_test_processed = apply_preprocessing_pipeline(X_test, fitted_preprocessors)

# Count missing values per column in the processed training set
X_train_processed.isna().sum()

id                0
yield_class       0
age               0
average_height    0
dbh               0
taper             0
trees_per_ha      0
dtype: int64
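
The same three steps could alternatively be chained with scikit-learn's `Pipeline`, which fits every step on the training data in one call. This is a sketch of that alternative and not the approach used in this notebook; the toy data, the 40/10 split, and `k=2` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Invented toy data: the target depends mainly on `age`.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["age", "dbh", "taper"])
y_demo = 2.0 * X_demo["age"] + rng.normal(scale=0.1, size=50)
X_demo.iloc[0, 1] = np.nan  # one missing value for the imputer to fill

X_tr, X_te = X_demo.iloc[:40], X_demo.iloc[40:]
y_tr = y_demo.iloc[:40]

pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", RobustScaler()),
        ("selector", SelectKBest(f_regression, k=2)),
    ]
)

# Fit on the training data only, then reuse the fitted steps on the test data.
X_tr_arr = pipeline.fit_transform(X_tr, y_tr)
X_te_arr = pipeline.transform(X_te)

# Restore column names and indices, as the manual functions above do.
selected = X_tr.columns[pipeline.named_steps["selector"].get_support()]
X_tr_processed = pd.DataFrame(X_tr_arr, columns=selected, index=X_tr.index)
X_te_processed = pd.DataFrame(X_te_arr, columns=selected, index=X_te.index)

print(list(selected))
```

A single `Pipeline` object can also be stored and reused later, which avoids passing a dictionary of fitted preprocessors around.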

## Save the train and test datasets

In [23]:
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)
X_train_processed.to_csv("../data/processed/X_train.csv", index=False)
X_test_processed.to_csv("../data/processed/X_test.csv", index=False)
X_train.to_csv("../data/processed/X_train_unprocessed.csv", index=False)
X_test.to_csv("../data/processed/X_test_unprocessed.csv", index=False)
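
Because the files are written with `index=False`, the original row index is discarded, and features and target line up by row position only when they are read back. The sketch below shows the round trip on invented values; the temporary directory is an assumption standing in for `../data/processed/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented data with a shuffled index, as produced by a random split.
X_demo = pd.DataFrame({"age": [10, 20, 30], "dbh": [0.1, 0.2, 0.3]}, index=[7, 2, 5])
y_demo = pd.Series([100.0, 200.0, 300.0], index=[7, 2, 5], name="volume_per_ha")

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    X_demo.to_csv(out_dir / "X_train.csv", index=False)
    y_demo.to_csv(out_dir / "y_train.csv", index=False)

    X_loaded = pd.read_csv(out_dir / "X_train.csv")
    # The target file has a single column; squeeze turns it back into a Series.
    y_loaded = pd.read_csv(out_dir / "y_train.csv").squeeze("columns")

print(X_loaded.index.tolist())
print(y_loaded.tolist())
```

After reloading, both objects carry a fresh 0-based index, so row `i` of the features still corresponds to row `i` of the target.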