# Split data

In this notebook, we split the data into training and test sets. The result is four datasets: `X_train`, `X_test`, `y_train`, and `y_test`, which are stored in `../data/processed/`. These datasets are used for training and evaluating the machine learning model.


In [256]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
import numpy as np

# Load the cleaned up data
df = pd.read_csv('../data/interim/cleaned_data.csv')



## Handle categorical features

Since the dataset has one categorical feature (`Area`), we apply label encoding to it. This translates each category into an integer code that the model can use.

In [257]:
label_encoder = LabelEncoder()
df.Area = label_encoder.fit_transform(df.Area)
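
To make the effect of label encoding concrete, here is a minimal sketch on toy data. The area names are illustrative assumptions, not values taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the area names are illustrative, not taken from the dataset
toy = pd.DataFrame({"Area": ["Chile", "Austria", "Chile", "Brazil"]})

encoder = LabelEncoder()
toy["Area_encoded"] = encoder.fit_transform(toy["Area"])

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original strings from the integer codes
restored = encoder.inverse_transform(toy["Area_encoded"])
```

Note that the integer codes follow the alphabetical order of the labels, so they imply an ordering that has no real meaning. Tree-based models usually tolerate this, while linear models may read it as a numeric relationship.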

## How to split the data

Target variable: `Average Temperature °C`
Features: All other features of the dataset

Categorical data on the dataset: `Area`

We will use the year to do a temporal split since we want to predict values in the future. This means we will use data from the past to predict data from the future.

We split at the year 2010: rows with `Year` before 2010 form the training set, and rows from 2010 onward form the test set. This leaves a good amount of data for training and keeps the subsequent years for testing.

In [258]:
split_year = 2010
feature_columns = df.columns.drop(['Average Temperature °C'])

train_data = df[df['Year'] < split_year]
test_data = df[df['Year'] >= split_year]

X_train = train_data[feature_columns]
y_train = train_data['Average Temperature °C']
X_test = test_data[feature_columns]
y_test = test_data['Average Temperature °C']

print("Training data size: ", train_data.shape[0])
print("Test data size: ", test_data.shape[0])
print("Percent of total data (train): ", train_data.shape[0] / df.shape[0] * 100)
print("Percent of total data (test): ", test_data.shape[0] / df.shape[0] * 100)

Training data size:  4469
Test data size:  2496
Percent of total data (train):  64.16367552045944
Percent of total data (test):  35.83632447954056
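
The same temporal split can be tried on a small toy frame to check that no test year leaks into the training set. This is only a sketch: the values below are made up, and only the column names `Year`, `Area`, and `Average Temperature °C` mirror the notebook.

```python
import pandas as pd

# Made-up values for two areas over four years
toy = pd.DataFrame({
    "Year": [2008, 2009, 2010, 2011, 2008, 2009, 2010, 2011],
    "Area": [0, 0, 0, 0, 1, 1, 1, 1],
    "Average Temperature °C": [0.4, 0.5, 0.7, 0.6, 0.3, 0.6, 0.8, 0.9],
})

split_year = 2010
train = toy[toy["Year"] < split_year]
test = toy[toy["Year"] >= split_year]

# Sanity checks: every training year precedes every test year,
# and the two parts together cover all rows exactly once
assert train["Year"].max() < test["Year"].min()
assert len(train) + len(test) == len(toy)
```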


## Identify quality issues

Next, we check the data for quality issues: negative fire counts, years that lie in the future, and inconsistent formats in text columns.

In [259]:
def identify_quality_issues(df):
    """Comprehensive data quality assessment"""
    issues = {}

    if 'Forest fires' in df.columns:
        negative_fires = df[df['Forest fires'] < 0]
        issues['negative_fires'] = len(negative_fires)
    if 'Savanna fires' in df.columns:
        negative_savanna = df[df['Savanna fires'] < 0]
        issues['negative_savanna'] = len(negative_savanna)

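    # Check for future years. 'Year' holds integer years, so compare directly:
    # pd.to_datetime would read the integers as nanoseconds since 1970 and
    # assigning the result back would overwrite the column in the caller's df.
    if 'Year' in df.columns:
        current_year = datetime.datetime.now().year
        future_dates = df[df['Year'] > current_year]
        issues['future_dates'] = len(future_dates)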
    
    # Check for format inconsistencies
    for col in df.select_dtypes(include=['object']).columns:
        unique_patterns = df[col].astype(str).str.len().value_counts()
        if len(unique_patterns) > 10:  # Many different lengths suggest format issues
            issues[f'{col}_format_inconsistency'] = len(unique_patterns)
    
    return issues

# Run quality assessment
quality_report = identify_quality_issues(df)
print("Data Quality Issues Found:")
for issue, count in quality_report.items():
    if count > 0:
        print(f"  {issue}: {count} records")

Data Quality Issues Found:
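
One detail worth knowing when checking a `Year` column that holds integer years: `pd.to_datetime` interprets bare integers as nanoseconds since 1970-01-01, not as calendar years. The sketch below illustrates this on toy values (the three years are assumptions for illustration) and shows two ways around it.

```python
import datetime
import pandas as pd

# Toy values; 2100 stands in for an implausible future year
years = pd.Series([1990, 2015, 2100])

# Integers are read as nanoseconds since the epoch, so every value lands in 1970
as_datetime = pd.to_datetime(years, errors="coerce")
parsed_years = as_datetime.dt.year.tolist()

# Comparing the integers directly avoids the conversion altogether
current_year = datetime.datetime.now().year
future_count = int((years > current_year).sum())

# If datetimes are needed, an explicit format parses the values as calendar years
as_calendar = pd.to_datetime(years.astype(str), format="%Y", errors="coerce")
calendar_years = as_calendar.dt.year.tolist()
```

With the default conversion a future-year check can never fire, because every parsed year is 1970.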


## Alternative: one-hot encoding

Instead of label encoding, `Area` could also be one-hot encoded. The cell below sketches this approach, with the encoder fitted on the training data only. It is commented out and not used in the rest of the notebook.

In [260]:
# def apply_one_hot_encoding(X_train, categorical_features):
#     """Apply one-hot encoding to categorical features"""
    
#     encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    
#     # Fit encoder on training data
#     encoded_features = encoder.fit_transform(X_train[categorical_features])
    
#     # Get feature names
#     feature_names = encoder.get_feature_names_out(categorical_features)
    
#     # Create DataFrame with encoded features
#     encoded_df = pd.DataFrame(encoded_features, columns=feature_names, index=X_train.index)
    
#     return encoded_df, encoder

# # Example usage
# categorical_features = ['Area']

# X_train_encoded, fitted_encoder = apply_one_hot_encoding(X_train, categorical_features)
# X_test_encoded = fitted_encoder.transform(X_test[categorical_features])

## Statistical preprocessing pipeline

Since we want to apply the same preprocessing steps to both the training and testing data, we will create a preprocessing pipeline. This pipeline will be fitted on the training data and then applied to both the training and testing data.

In [261]:
# CORRECT: Learn all parameters from training data only
def create_preprocessing_pipeline(X_train, y_train):
    """Create preprocessing pipeline fitted on training data"""
    
    # 1. Missing value imputation
    imputer = SimpleImputer(strategy='mean')
    X_train_imputed = pd.DataFrame(
        imputer.fit_transform(X_train),
        columns=X_train.columns,    # restore column names
        index=X_train.index         # restore original index
    )
    
    # 2. Feature scaling
    scaler = StandardScaler()
    X_train_scaled = pd.DataFrame(
        scaler.fit_transform(X_train_imputed),
        columns=X_train_imputed.columns,
        index=X_train_imputed.index
    )

    # 3. Feature selection
    selector = SelectKBest(f_regression, k=18) # 31 total features including target variable
    X_train_selected = pd.DataFrame(
        selector.fit_transform(X_train_scaled, y_train),
        columns=X_train_scaled.columns[selector.get_support()],
        index=X_train_scaled.index
    )

    # Return fitted preprocessors and transformed data
    preprocessors = {
        'imputer': imputer,
        'scaler': scaler,
        'selector': selector
    }

    return X_train_selected, preprocessors

def apply_preprocessing_pipeline(X_test, preprocessors):
    """Apply training preprocessing to test data"""
    
    # Apply in same order as training
    X_test_imputed = pd.DataFrame(
        preprocessors['imputer'].transform(X_test),
        columns=X_test.columns,
        index=X_test.index
    )
    X_test_scaled = pd.DataFrame(
        preprocessors['scaler'].transform(X_test_imputed),
        columns=X_test_imputed.columns,
        index=X_test_imputed.index
    )
    X_test_selected = pd.DataFrame(
        preprocessors['selector'].transform(X_test_scaled),
        columns=X_test_scaled.columns[preprocessors['selector'].get_support()],
        index=X_test_scaled.index
    )

    return X_test_selected

# Usage
X_train_processed, fitted_preprocessors = create_preprocessing_pipeline(X_train, y_train)
X_test_processed = apply_preprocessing_pipeline(X_test, fitted_preprocessors)

# Count remaining missing values per selected feature
X_train_processed.isna().sum()

Year                               0
Savanna fires                      0
Forest fires                       0
Drained organic soils (CO2)        0
Pesticides Manufacturing           0
Food Transport                     0
Forestland                         0
Food Household Consumption         0
Food Retail                        0
Food Packaging                     0
Food Processing                    0
Fertilizers Manufacturing          0
IPPU                               0
Manure applied to Soils            0
Manure Management                  0
Fires in humid tropical forests    0
On-farm energy use                 0
Urban population                   0
dtype: int64
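
The same three steps could also be chained with scikit-learn's `Pipeline`, which keeps the fit-on-train, transform-on-test discipline in a single object. The following is a sketch under assumptions, not the notebook's method: the toy data, the column names `a` to `e`, and `k=2` are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on columns "a" and "c" only
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(40, 5)), columns=list("abcde"))
X_toy.iloc[3, 1] = np.nan  # one missing value in column "b"
y_toy = 2.0 * X_toy["a"] - 2.0 * X_toy["c"] + rng.normal(scale=0.1, size=40)

X_fit, X_new = X_toy.iloc[:30], X_toy.iloc[30:]
y_fit = y_toy.iloc[:30]

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_regression, k=2)),
])

# All statistics (means, scales, feature scores) are learned from X_fit only
fit_array = pipeline.fit_transform(X_fit, y_fit)
new_array = pipeline.transform(X_new)

# Restore the names of the selected columns and the original row index
selected = X_fit.columns[pipeline.named_steps["selector"].get_support()]
X_fit_processed = pd.DataFrame(fit_array, columns=selected, index=X_fit.index)
X_new_processed = pd.DataFrame(new_array, columns=selected, index=X_new.index)
```

A single pipeline object is also easier to persist and reuse than a dictionary of separately fitted steps, although the explicit version makes each intermediate result easier to inspect.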

## Save the train and test datasets

In [262]:
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)
X_train_processed.to_csv("../data/processed/X_train.csv", index=False)
X_test_processed.to_csv("../data/processed/X_test.csv", index=False)