# Label Preparation and Final Dataset Creation (Part 4)

## Objective
This notebook handles the final piece of the puzzle: the "Labels" (Target Variables). In Part 3, we built the features (inputs). Now, we need to structure the crop yield data so the model knows what to predict.

**Key Steps:**
1.  **Load Data:** Import the features from Part 3 and the raw yield data.
2.  **Clean & Pivot:** Convert the yield data from "long" format (one row per crop) to "wide" format (one column per crop).
3.  **Merge:** Combine the Features (X) and Labels (Y) into one final dataset.
4.  **Save:** Export the ready-to-use dataset as `XY_v3.parquet`.

In [25]:
import pandas as pd
import numpy as np

### 1. Load Features and Raw Labels
We import our feature set (`x_features_v3.parquet`) and the raw crop yield data (`label_yield.parquet`). The raw yield data is currently in a "long" format, meaning each crop's yield sits in its own row rather than its own column.

In [26]:
# Load Features created in Part 3
X = pd.read_parquet('Parquet/x_features_v3.parquet')

# Load Raw Yield Data
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

### 2. Standardize Crop Names
Crop names become column names after pivoting, so they should be clean and consistent. We strip special characters from the `item` column, replace spaces with underscores, and lowercase the result, converting names like "Maize (corn)" to `maize_corn`. This avoids awkward or error-prone column names later on.

In [27]:
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
label_yield['item'] = label_yield['item'].str.replace(" ", "_", regex=False).str.lower()

# Display sample to verify cleaning
label_yield.head()

Unnamed: 0,area,item,year,label
0,Afghanistan,maize_corn,1970-12-31,1475.7
1,Afghanistan,maize_corn,1971-12-31,1340.0
2,Afghanistan,maize_corn,1972-12-31,1565.2
3,Afghanistan,maize_corn,1973-12-31,1617.0
4,Afghanistan,maize_corn,1974-12-31,1617.0
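
The cleaning rule can be checked in isolation before it is applied to the full table. The sketch below runs the same two replacements on a few hypothetical raw names (the `raw` series is made up for illustration and is not read from `label_yield.parquet`).

```python
import pandas as pd

# Hypothetical raw crop names, used only to illustrate the cleaning rule
raw = pd.Series([
    "Maize (corn)",
    "Soya beans",
    "Other vegetables, fresh n.e.c.",
    "Rice",
])

cleaned = (
    raw.str.replace(r"[^0-9a-zA-Z ]", "", regex=True)  # drop everything except letters, digits, spaces
       .str.replace(" ", "_", regex=False)              # spaces -> underscores
       .str.lower()                                     # lowercase for consistency
)

print(cleaned.tolist())
# ['maize_corn', 'soya_beans', 'other_vegetables_fresh_nec', 'rice']
```

Note that punctuation is removed rather than replaced, so "n.e.c." collapses to `nec` instead of producing extra underscores.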


### 3. Create Target Columns (Pivoting)
We need a separate column for each crop so we can predict them individually.
* **Current Format (Long):** One row per country, crop, and year.
* **New Format (Wide):** One row per country and year, with columns like `Y_rice`, `Y_wheat`, `Y_maize_corn` (the `Y_` prefix is added in the next step).

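We use the `pivot_table` function to transform the data structure. Its default aggregation is the mean, so any duplicate (`year`, `area`, `item`) rows would be averaged into a single value rather than raising an error.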

In [28]:
# Extract Year as integer
label_yield['year'] = pd.to_datetime(label_yield['year']).dt.year

# Pivot the table
Y = label_yield.pivot_table(
    index=['year','area'],  # Unique identifier for row
    columns='item',         # Create columns for each crop
    values='label'          # The Yield value
).reset_index()
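
To make the long-to-wide step concrete, here is a minimal sketch on a toy table. The frame `long_df` and its values are invented for illustration. It also shows two side effects worth knowing: crops missing for a country-year become `NaN`, and duplicate keys are silently averaged.

```python
import pandas as pd

# Toy long-format data (hypothetical values); the last two rows share the same key
long_df = pd.DataFrame({
    "year":  [1970, 1970, 1970, 1971, 1971],
    "area":  ["A", "A", "B", "A", "A"],
    "item":  ["rice", "wheat", "rice", "rice", "rice"],
    "label": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Count duplicate (year, area, item) keys before pivoting
n_dupes = long_df.duplicated(subset=["year", "area", "item"]).sum()
print(f"Duplicate keys: {n_dupes}")

# Pivot: one row per (year, area), one column per crop; duplicates are averaged
wide = long_df.pivot_table(
    index=["year", "area"], columns="item", values="label"
).reset_index()
wide.columns.name = None  # drop the leftover 'item' label on the column axis

print(wide)
```

In this toy case the two 1971 rice rows for area `A` are merged into their mean (50.0), and area `B` has `NaN` for wheat. Running the duplicate count on the real table first is one way to confirm that no averaging takes place.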

### 4. Rename Columns
To avoid confusion between our *features* (inputs) and our *targets* (outputs), we add a prefix `Y_` to all the new crop columns. For example, the target for Rice becomes `Y_rice`.

In [29]:
# Dynamically generate column list
current_cols = Y.columns.tolist()

# Identify crop columns (those that represent items)
crop_cols = [c for c in current_cols if c not in ['year', 'area']]

# Build the new column names (assumes 'year' and 'area' are the first two columns after reset_index)
new_col_names = ['year', 'area'] + [f'Y_{c}' for c in crop_cols]
Y.columns = new_col_names

# Display structure
Y.head()

Unnamed: 0,year,area,Y_bananas,Y_barley,Y_cassava_fresh,Y_cucumbers_and_gherkins,Y_maize_corn,Y_oil_palm_fruit,Y_other_vegetables_fresh_nec,Y_potatoes,Y_rice,Y_soya_beans,Y_sugar_beet,Y_sugar_cane,Y_tomatoes,Y_watermelons,Y_wheat
0,1970,Afghanistan,,1174.6,,,1475.7,,6127.8,9536.4,1811.9,,14090.9,22000.0,,7229.4,956.3
1,1970,Albania,,1077.8,,,2071.8,,12278.3,5469.3,2970.5,,23638.9,,12333.3,,1537.7
2,1970,Algeria,,668.5,,,1023.5,,4891.3,6254.2,1581.0,,19719.9,,9449.6,8977.0,624.5
3,1970,Angola,10000.0,,3555.6,,912.0,9523.8,6515.7,6296.3,1198.0,,,50932.6,3076.9,,854.6
4,1970,Antigua_and_Barbuda,1500.0,,4000.0,4615.4,2400.0,,6250.0,,,,,37272.7,3437.5,,
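
An alternative that does not depend on column order is to rename by name instead of assigning a new list. This is a sketch on a hypothetical frame (`Y_toy`), not the notebook's own cell.

```python
import pandas as pd

# Hypothetical pivoted table with key columns and two crop columns
Y_toy = pd.DataFrame({
    "year": [1970],
    "area": ["A"],
    "rice": [1.0],
    "wheat": [2.0],
})

key_cols = ["year", "area"]

# Prefix every non-key column with 'Y_'; key columns keep their names
Y_toy = Y_toy.rename(columns=lambda c: c if c in key_cols else f"Y_{c}")

print(Y_toy.columns.tolist())
# ['year', 'area', 'Y_rice', 'Y_wheat']
```

Because each column is mapped individually, the result is the same even if the key columns are not the first two.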


### 5. Merge Features and Labels
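Now we combine everything. We perform an **inner join** between our Feature table (X) and our Label table (Y) on `year` and `area`. This keeps only the country-year pairs present in both tables, so every row in the final dataset has a feature record and a label record. Individual `Y_` columns can still be `NaN` when no yield is recorded for that crop in that country and year.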

In [30]:
XY = X.merge(Y, on=['year', 'area'], how='inner')

# Output the shape to verify merge
print(f"Final dataset shape: {XY.shape}")

Final dataset shape: (6631, 75)
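
It can be useful to see how many rows the inner join discards and how many targets remain empty afterwards. The following sketch uses two small hypothetical tables (`X_toy`, `Y_toy`) and pandas' `indicator=True` option on an outer merge to count unmatched keys.

```python
import pandas as pd

# Hypothetical feature and label tables with partially overlapping keys
X_toy = pd.DataFrame({
    "year": [1970, 1971, 1972],
    "area": ["A", "A", "A"],
    "rain_annual": [100.0, 110.0, 120.0],
})
Y_toy = pd.DataFrame({
    "year": [1971, 1972, 1973],
    "area": ["A", "A", "A"],
    "Y_rice": [5.0, None, 7.0],
})

# Outer merge with an indicator column shows where each key came from
check = X_toy.merge(Y_toy, on=["year", "area"], how="outer", indicator=True)
print(check["_merge"].value_counts())

# The inner join keeps only keys found in both tables
XY_toy = X_toy.merge(Y_toy, on=["year", "area"], how="inner")
n_missing = XY_toy["Y_rice"].isna().sum()
print(f"Rows kept: {len(XY_toy)}, missing Y_rice: {n_missing}")
```

Here one key exists only in the features and one only in the labels, so the inner join keeps two rows, and one of those still has no rice yield. The same check on the real `X` and `Y` would show how much data the join removes.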


### 6. Save Final Dataset
We export the merged dataframe to `XY_v3.parquet`. This file contains everything needed to train the model in the next step.

In [31]:
# Save to Parquet
XY.to_parquet('Parquet/XY_v3.parquet')

In [32]:

# Prevent pandas from hiding columns
pd.set_option('display.max_columns', None)

# Show first 20 rows for Thailand
XY[XY['area'] == 'Thailand'].head(20)

Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_2y,avg_yield_maize_corn_3y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_2y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_potatoes_1y,avg_yield_potatoes_2y,avg_yield_potatoes_3y,avg_yield_rice_1y,avg_yield_rice_2y,avg_yield_rice_3y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_2y,avg_yield_sugar_cane_3y,avg_yield_wheat_1y,avg_yield_wheat_2y,avg_yield_wheat_3y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_2y,avg_yield_oil_palm_fruit_3y,avg_yield_barley_1y,avg_yield_barley_2y,avg_yield_barley_3y,avg_yield_soya_beans_1y,avg_yield_soya_beans_2y,avg_yield_soya_beans_3y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_2y,avg_yield_sugar_beet_3y,avg_yield_watermelons_1y,avg_yield_watermelons_2y,avg_yield_watermelons_3y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_2y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_tomatoes_1y,avg_yield_tomatoes_2y,avg_yield_tomatoes_3y,avg_yield_bananas_1y,avg_yield_bananas_2y,avg_yield_bananas_3y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_2y,avg_yield_cassava_fresh_3y,rain_annual,rain_sin,rain_cos,solar_annual,solar_sin,solar_cos,temp_annual,temp_sin,temp_cos,pesticides_lag1,fertilizer_lag1,latitude,longitude,Y_bananas,Y_barley,Y_cassava_fresh,Y_cucumbers_and_gherkins,Y_maize_corn,Y_oil_palm_fruit,Y_other_vegetables_fresh_nec,Y_potatoes,Y_rice,Y_soya_beans,Y_sugar_beet,Y_sugar_cane,Y_tomatoes,Y_watermelons,Y_wheat
5818,1982,Thailand,2353.8,2291.05,2197.466667,6233.6,6419.95,6408.166667,9987.5,8875.0,8895.866667,1952.1,1919.95,1886.933333,43488.8,36813.3,38131.8,,,,10116.5,9880.45,10013.966667,,,,1052.0,1000.9,1006.466667,,,,15123.1,15068.05,14501.033333,7516.0,8325.1,7987.566667,7373.8,8484.25,7012.533333,12307.7,12115.4,12076.933333,14273.9,14511.4,14330.4,1115.56,-331.152235,-455.368398,,,,25.199167,7.949383,-14.802748,,18.580125,15.87,100.99,11923.1,,16361.4,7021.2,2298.8,10261.8,6656.3,11898.8,1888.0,1123.6,,49240.5,8655.1,15020.0,
5819,1983,Thailand,2298.8,2326.3,2293.633333,6656.3,6444.95,6498.733333,11898.8,10943.15,9882.933333,1888.0,1920.05,1909.3,49240.5,46364.65,40955.7,,,,10261.8,10189.15,10007.566667,,,,1123.6,1087.8,1041.8,,,,15020.0,15071.55,15052.033333,7021.2,7268.6,7890.466667,8655.1,8014.45,8541.2,11923.1,12115.4,12051.3,16361.4,15317.65,15128.066667,1036.59,-509.811515,-330.281927,,,,26.1075,12.035723,-17.9876,,18.704576,15.87,100.99,11769.2,,18655.4,7635.6,2267.4,8361.8,6695.3,14216.0,2035.1,1149.3,,42288.8,11372.1,16079.6,
5820,1984,Thailand,2267.4,2283.1,2306.666667,6695.3,6675.8,6528.4,14216.0,13057.4,12034.1,2035.1,1961.55,1958.4,42288.8,45764.65,45006.033333,,,,8361.8,9311.8,9580.033333,,,,1149.3,1136.45,1108.3,,,,16079.6,15549.8,15407.566667,7635.6,7328.4,7390.933333,11372.1,10013.6,9133.666667,11769.2,11846.15,12000.0,18655.4,17508.4,16430.233333,1110.93,-515.340027,-374.690205,,,,25.77,12.96887,-21.662971,,27.797375,15.87,100.99,11884.6,,14968.8,7183.2,2430.5,8694.8,6795.3,10471.0,2067.0,1275.4,,44540.8,7828.6,15604.8,
5821,1985,Thailand,2430.5,2348.95,2332.233333,6795.3,6745.3,6715.633333,10471.0,12343.5,12195.266667,2067.0,2051.05,1996.7,44540.8,43414.8,45356.7,,,,8694.8,8528.3,9106.133333,,,,1275.4,1212.35,1182.766667,,,,15604.8,15842.2,15568.133333,7183.2,7409.4,7280.0,7828.6,9600.35,9285.266667,11884.6,11826.9,11858.966667,14968.8,16812.1,16661.866667,963.27,-343.283458,-396.494198,19.6375,11.492844,-6.229845,25.6475,6.760364,-13.014972,,25.61347,15.87,100.99,12153.8,,13994.1,7856.4,2571.9,11445.5,7352.9,12284.1,2060.8,1284.8,,47180.3,11355.1,14293.6,
5822,1986,Thailand,2571.9,2501.2,2423.266667,7352.9,7074.1,6947.833333,12284.1,11377.55,12323.7,2060.8,2063.9,2054.3,47180.3,45860.55,44669.966667,,,,11445.5,10070.15,9500.7,,,,1284.8,1280.1,1236.5,,,,14293.6,14949.2,15326.0,7856.4,7519.8,7558.4,11355.1,9591.85,10185.266667,12153.8,12019.2,11935.866667,13994.1,14481.45,15872.766667,1030.77,-346.883906,-347.883479,20.714167,14.875446,-5.373865,25.99,11.939017,-9.788524,,25.354604,15.87,100.99,12278.5,,12664.5,7547.2,2373.8,11582.3,7040.0,12455.8,2052.2,1263.6,,44128.2,9998.3,14339.6,1193.2
5823,1987,Thailand,2373.8,2472.85,2458.733333,7040.0,7196.45,7062.733333,12455.8,12369.95,11736.966667,2052.2,2056.5,2060.0,44128.2,45654.25,45283.1,1193.2,,,11582.3,11513.9,10574.2,,,,1263.6,1274.2,1274.6,,,,14339.6,14316.6,14746.0,7547.2,7701.8,7528.933333,9998.3,10676.7,9727.333333,12278.5,12216.15,12105.633333,12664.5,13329.3,13875.8,906.18,-251.897043,-430.068164,21.35,10.451364,-8.926333,25.884167,7.079935,-10.678447,,31.555306,15.87,100.99,12340.0,,14266.1,7547.2,2048.6,10531.2,7023.8,7841.0,2014.7,1113.4,,47005.3,10000.0,14415.1,1250.0
5824,1988,Thailand,2048.6,2211.2,2331.433333,7023.8,7031.9,7138.9,7841.0,10148.4,10860.3,2014.7,2033.45,2042.566667,47005.3,45566.75,46104.6,1250.0,1221.6,,10531.2,11056.75,11186.333333,,,,1113.4,1188.5,1220.6,,,,14415.1,14377.35,14349.433333,7547.2,7547.2,7650.266667,10000.0,9999.15,10451.133333,12340.0,12309.25,12257.433333,14266.1,13465.3,13641.566667,1201.19,-415.700684,-339.372755,20.1925,11.724691,-8.486582,25.9875,7.238621,-15.782107,,35.131065,15.87,100.99,12353.8,,14421.3,7649.3,2617.6,10693.0,7007.9,8593.8,2146.5,1317.7,,47650.8,10000.0,14490.6,554.6
5825,1989,Thailand,2617.6,2333.1,2346.666667,7007.9,7015.85,7023.9,8593.8,8217.4,9630.2,2146.5,2080.6,2071.133333,47650.8,47328.05,46261.433333,554.6,902.3,999.266667,10693.0,10612.1,10935.5,,,,1317.7,1215.55,1231.566667,,,,14490.6,14452.85,14415.1,7649.3,7598.25,7581.233333,10000.0,10000.0,9999.433333,12353.8,12346.9,12324.1,14421.3,14343.7,13783.966667,1215.08,-275.732618,-480.163017,20.199167,9.939639,-5.837884,25.665,10.485556,-13.079569,,43.465309,15.87,100.99,12197.0,1013.8,15230.1,7649.3,2569.0,12083.3,6914.1,8333.3,2085.3,1338.3,,55598.0,10000.0,14566.0,725.4
5826,1990,Thailand,2569.0,2593.3,2411.733333,6914.1,6961.0,6981.933333,8333.3,8463.55,8256.033333,2085.3,2115.9,2082.166667,55598.0,51624.4,50084.7,725.4,640.0,843.333333,12083.3,11388.15,11102.5,1013.8,,,1338.3,1328.0,1256.466667,,,,14566.0,14528.3,14490.566667,7649.3,7649.3,7615.266667,10000.0,10000.0,10000.0,12197.0,12275.4,12296.933333,15230.1,14825.7,14639.166667,1051.13,-277.526524,-475.521754,20.306667,6.742326,-3.126858,25.886667,10.205024,-10.63089,,46.574313,15.87,100.99,12746.4,1183.1,13915.9,7629.6,2409.0,12414.3,6965.0,9000.0,1955.6,1299.8,,48894.2,10000.0,15074.0,610.7
5827,1991,Thailand,2409.0,2489.0,2531.866667,6965.0,6939.55,6962.333333,9000.0,8666.65,8642.366667,1955.6,2020.45,2062.466667,48894.2,52246.1,50714.333333,610.7,668.05,630.233333,12414.3,12248.8,11730.2,1183.1,1098.45,,1299.8,1319.05,1318.6,,,,15074.0,14820.0,14710.2,7629.6,7639.45,7642.733333,10000.0,10000.0,10000.0,12746.4,12471.7,12432.4,13915.9,14573.0,14522.433333,1161.95,-365.901735,-384.348614,20.354167,9.391652,-6.747691,26.3125,11.037344,-11.874383,18849.0,59.665657,15.87,100.99,11802.9,718.1,13745.3,7677.8,2711.7,12746.6,6954.4,8333.3,2253.4,1368.6,,52342.9,23622.9,14444.4,486.1
