# Label Preparation and Final Dataset Creation (Part 4)

## Overview
This notebook focuses on preparing the target variables (Labels) for the machine learning model. The steps include:
1.  **Label Formatting:** Cleaning the crop names and pivoting the crop yield data into a "wide" format in which each crop type is a separate target column.
2.  **Data Integration:** Merging the prepared Labels with the Features (X) created in Part 3.
3.  **Final Export:** Saving the consolidated dataset `XY_version1.parquet` for modeling.

In [1]:
import pandas as pd
import numpy as np

### 1. Load Features (X) and Raw Labels
We load the feature set created in the previous step and the raw yield data.

In [2]:
# Load Features created in Part 3
X = pd.read_parquet('Parquet/x_features.parquet')

# Load Raw Yield Data
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

### 2. Standardize Item Names
Crop names become column names after pivoting, so we sanitize them first: remove every character that is not a letter, digit, or space, replace spaces with underscores, and convert to lowercase.

In [3]:
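# Drop every character that is not a letter, digit, or space
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
# Replace spaces literally (regex=False) with underscores, then lowercase
label_yield['item'] = label_yield['item'].str.replace(" ", "_", regex=False).str.lower()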

# Display sample to verify cleaning
label_yield.head()

Unnamed: 0,area,item,year,label
0,Afghanistan,maize_corn,1970-12-31,1475.7
1,Afghanistan,maize_corn,1971-12-31,1340.0
2,Afghanistan,maize_corn,1972-12-31,1565.2
3,Afghanistan,maize_corn,1973-12-31,1617.0
4,Afghanistan,maize_corn,1974-12-31,1617.0
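
As a quick illustration of the cleaning rules, the sketch below applies the same three steps to a few raw names. The raw strings are hypothetical examples chosen to exercise the rules; they are not taken from the source file.

```python
import pandas as pd

# Hypothetical raw crop names, used only to exercise the cleaning rules
raw = pd.Series([
    "Maize (corn)",
    "Cassava, fresh",
    "Other vegetables, fresh n.e.c.",
    "Soya beans",
])

cleaned = (
    raw.str.replace(r"[^0-9a-zA-Z ]", "", regex=True)  # drop punctuation
       .str.replace(" ", "_", regex=False)             # spaces -> underscores
       .str.lower()                                    # lowercase
)
print(cleaned.tolist())

# Sanity check: cleaning should not collapse two distinct crops into one name
assert raw.nunique() == cleaned.nunique()
```

The final assertion is worth keeping on real data, because two raw names that differ only in punctuation or case would otherwise end up in the same column after pivoting.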


### 3. Pivot Yield Data (Create Y)
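The dataset is currently in a "long" format (one row per area, crop, and year). We pivot it to a "wide" format with one row per year and area, where each crop becomes a separate column. This structure allows us to model specific crops or multi-output targets. Note that `pivot_table` aggregates duplicate (year, area, item) rows with the mean by default, so any duplicates in the raw data are averaged silently.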

In [4]:
# Extract Year as integer
label_yield['year'] = pd.to_datetime(label_yield['year']).dt.year

# Pivot the table
Y = label_yield.pivot_table(
    index=['year','area'],  # Unique identifier for row
    columns='item',         # Create columns for each crop
    values='label'          # The Yield value
).reset_index()
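
The long-to-wide reshape can be seen on a small toy frame. This is a minimal sketch: the `toy` frame and the `n_dupes` check are illustrative additions, and the yield values are copied from the sample outputs in this notebook.

```python
import pandas as pd

# Toy long-format data: one row per (year, area, item)
toy = pd.DataFrame({
    "year":  [1970, 1970, 1970, 1971],
    "area":  ["Afghanistan", "Afghanistan", "Albania", "Afghanistan"],
    "item":  ["maize_corn", "wheat", "wheat", "maize_corn"],
    "label": [1475.7, 956.3, 1537.7, 1340.0],
})

# pivot_table averages duplicate keys by default (aggfunc="mean"),
# so count duplicates before pivoting
n_dupes = toy.duplicated(subset=["year", "area", "item"]).sum()
print(f"Duplicate (year, area, item) rows: {n_dupes}")

wide = toy.pivot_table(
    index=["year", "area"],
    columns="item",
    values="label",
).reset_index()
wide.columns.name = None  # drop the leftover "item" axis label

print(wide)
```

Combinations that are missing from the long data, such as maize for Albania in 1970 here, show up as `NaN` in the wide table.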

### 4. Rename Columns
We prefix the crop columns with `Y_` to clearly distinguish them from feature columns in the final dataset.

In [5]:
# Dynamically generate column list
current_cols = Y.columns.tolist()

# Identify crop columns (those that represent items)
crop_cols = [c for c in current_cols if c not in ['year', 'area']]

# Create new column mapping
new_col_names = ['year', 'area'] + [f'Y_{c}' for c in crop_cols]
Y.columns = new_col_names

# Display structure
Y.head()

Unnamed: 0,year,area,Y_bananas,Y_barley,Y_cassava_fresh,Y_cucumbers_and_gherkins,Y_maize_corn,Y_oil_palm_fruit,Y_other_vegetables_fresh_nec,Y_potatoes,Y_rice,Y_soya_beans,Y_sugar_beet,Y_sugar_cane,Y_tomatoes,Y_watermelons,Y_wheat
0,1970,Afghanistan,,1174.6,,,1475.7,,6127.8,9536.4,1811.9,,14090.9,22000.0,,7229.4,956.3
1,1970,Albania,,1077.8,,,2071.8,,12278.3,5469.3,2970.5,,23638.9,,12333.3,,1537.7
2,1970,Algeria,,668.5,,,1023.5,,4891.3,6254.2,1581.0,,19719.9,,9449.6,8977.0,624.5
3,1970,Angola,10000.0,,3555.6,,912.0,9523.8,6515.7,6296.3,1198.0,,,50932.6,3076.9,,854.6
4,1970,Antigua_and_Barbuda,1500.0,,4000.0,4615.4,2400.0,,6250.0,,,,,37272.7,3437.5,,
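
An alternative sketch of the same renaming step uses `DataFrame.rename` with a function, which prefixes columns by name and so does not depend on `year` and `area` being the first two columns. `Y_demo` and `key_cols` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy frame with the same layout as Y after the pivot
Y_demo = pd.DataFrame({
    "year": [1970, 1970],
    "area": ["Afghanistan", "Albania"],
    "maize_corn": [1475.7, 2071.8],
    "wheat": [956.3, 1537.7],
})

key_cols = ["year", "area"]

# Prefix every non-key column with "Y_"; key columns pass through unchanged
Y_renamed = Y_demo.rename(columns=lambda c: c if c in key_cols else f"Y_{c}")
print(Y_renamed.columns.tolist())
```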


### 5. Merge Features and Labels
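We perform an inner join between the Feature set (X) and the Label set (Y) on `year` and `area` to create the final analytical base table. An inner join keeps only the (year, area) pairs present in both tables, so rows without a match on either side are dropped.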

In [6]:
XY = X.merge(Y, on=['year', 'area'], how='inner')

# Output the shape to verify merge
print(f"Final dataset shape: {XY.shape}")

Final dataset shape: (6458, 84)
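
Because an inner join drops unmatched rows silently, it can help to count them first. The sketch below is one possible diagnostic on toy data, using the `indicator` and `validate` options of `DataFrame.merge`; the frames and the `feature_a` column are hypothetical stand-ins for the real X and Y.

```python
import pandas as pd

keys = ["year", "area"]

# Hypothetical stand-ins for X and Y
X_demo = pd.DataFrame({
    "year": [1970, 1970, 1971],
    "area": ["Afghanistan", "Albania", "Afghanistan"],
    "feature_a": [0.1, 0.2, 0.3],  # hypothetical feature column
})
Y_demo = pd.DataFrame({
    "year": [1970, 1971, 1970],
    "area": ["Afghanistan", "Afghanistan", "Algeria"],
    "Y_maize_corn": [1475.7, 1340.0, 1023.5],
})

# Outer merge with an indicator column shows where each key pair came from;
# validate raises MergeError if keys are duplicated on either side
check = X_demo.merge(
    Y_demo, on=keys, how="outer", indicator=True, validate="one_to_one"
)
counts = check["_merge"].value_counts()
print(counts)

# The inner join keeps only the rows marked "both"
XY_demo = X_demo.merge(Y_demo, on=keys, how="inner", validate="one_to_one")
print(f"Rows kept by the inner join: {len(XY_demo)}")
```

A large `left_only` or `right_only` count often points to inconsistent area names between the two tables rather than genuinely missing data.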


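### 6. Export the Final Dataset
Finally, we save the merged dataset to Parquet for use in modeling.

In [7]: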
# Save to Parquet
XY.to_parquet('Parquet/XY_version1.parquet')
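
For a downstream modeling step, a single crop can be selected from the wide table by dropping the rows where that crop's label is missing. This is a minimal sketch on a toy frame, assuming the `Y_` prefix convention used above; `XY_demo`, `feature_a`, and the choice of target are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the exported XY table
XY_demo = pd.DataFrame({
    "year": [1970, 1970, 1970],
    "area": ["Afghanistan", "Albania", "Angola"],
    "feature_a": [0.1, 0.2, 0.3],  # hypothetical feature column
    "Y_wheat": [956.3, 1537.7, 854.6],
    "Y_bananas": [np.nan, np.nan, 10000.0],
})

target = "Y_bananas"
label_cols = [c for c in XY_demo.columns if c.startswith("Y_")]

# Keep only rows where the chosen crop has a recorded yield
subset = XY_demo.dropna(subset=[target])

features = subset.drop(columns=label_cols)  # all non-label columns
y = subset[target]                          # the single-crop target

print(features)
print(y)
```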