# Label Preparation and Final Dataset Creation (Part 4)

## Objective
This notebook handles the final piece of the puzzle: the "Labels" (Target Variables). In Part 3, we built the features (inputs). Now, we need to structure the crop yield data so the model knows what to predict.

**Key Steps:**
1.  **Load Data:** Import the features from Part 3 and the raw yield data.
2.  **Clean & Pivot:** Convert the yield data from "long" format (rows) to "wide" format (columns).
3.  **Merge:** Combine the Features (X) and Labels (Y) into one final dataset.
4.  **Save:** Export the ready-to-use dataset as `XY_v2.parquet`.

In [1]:
import pandas as pd
import numpy as np

### 1. Load Features and Raw Labels
We import our feature set (`x_features_v2.parquet`) and the raw crop yield data (`label_yield.parquet`). The raw yield data is currently in a "long" format, which means different crops are stacked in rows.

In [2]:
# Load Features created in Part 3
X = pd.read_parquet('Parquet/x_features_v2.parquet')

# Load Raw Yield Data
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

### 2. Standardize Crop Names
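Each crop name becomes a column name after pivoting, so the names need to be clean and consistent. We strip every character in the `item` column that is not a letter, digit, or space, then replace spaces with underscores and lowercase the result, converting names like "Maize (corn)" to `maize_corn`. This keeps punctuation and spaces out of the column names we create later.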

In [3]:
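# Drop every character that is not a letter, digit, or space
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
# Replace spaces with underscores (literal match) and lowercase
label_yield['item'] = label_yield['item'].str.replace(" ", "_", regex=False).str.lower()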

# Display sample to verify cleaning
label_yield.head()

|   | area | item | year | label |
|---|---|---|---|---|
| 0 | Afghanistan | maize_corn | 1970-12-31 | 1475.7 |
| 1 | Afghanistan | maize_corn | 1971-12-31 | 1340.0 |
| 2 | Afghanistan | maize_corn | 1972-12-31 | 1565.2 |
| 3 | Afghanistan | maize_corn | 1973-12-31 | 1617.0 |
| 4 | Afghanistan | maize_corn | 1974-12-31 | 1617.0 |
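The two cleaning steps can be checked on a handful of names before running them on the full table. The snippet below is a minimal sketch: the `raw` series is hypothetical sample input, not the actual contents of `label_yield`.

```python
import pandas as pd

# Hypothetical raw names, chosen only to exercise the two cleaning steps
raw = pd.Series(["Maize (corn)", "Soya beans", "Other vegetables, fresh n.e.c."])

cleaned = (
    raw.str.replace(r"[^0-9a-zA-Z ]", "", regex=True)  # drop punctuation
       .str.replace(" ", "_", regex=False)             # spaces -> underscores
       .str.lower()
)
print(cleaned.tolist())

# Stripping characters can make two different names collide,
# so it is worth confirming the cleaned names are still unique.
print(cleaned.is_unique)
```

If the uniqueness check ever fails on real data, two crops would end up sharing one column after pivoting, so it is cheap insurance to run it once.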


### 3. Create Target Columns (Pivoting)
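We need a separate column for each crop so we can predict them individually.
* **Current Format (Long):** One row per area, crop, and year.
* **New Format (Wide):** One row per area and year, with one column per crop. After the renaming in step 4 these become `Y_rice`, `Y_wheat`, `Y_maize_corn`, and so on.

We use the `pivot_table` function to transform the data structure. Note that `pivot_table` aggregates with the mean by default, so duplicate (`year`, `area`, `item`) rows would be averaged silently rather than raising an error. Area-year combinations with no data for a crop become `NaN`.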

In [4]:
# Extract Year as integer
label_yield['year'] = pd.to_datetime(label_yield['year']).dt.year

# Pivot the table
Y = label_yield.pivot_table(
    index=['year','area'],  # Unique identifier for row
    columns='item',         # Create columns for each crop
    values='label'          # The Yield value
).reset_index()
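To make the long-to-wide transformation concrete, here is a small sketch on made-up data. The frame `toy` and its values are purely illustrative; only the column names mirror `label_yield`.

```python
import pandas as pd

# Toy long-format data (values are made up for illustration)
toy = pd.DataFrame({
    "area":  ["A", "A", "B", "B"],
    "item":  ["rice", "wheat", "rice", "rice"],
    "year":  ["1970-12-31", "1970-12-31", "1970-12-31", "1971-12-31"],
    "label": [10.0, 20.0, 30.0, 40.0],
})
toy["year"] = pd.to_datetime(toy["year"]).dt.year

# pivot_table averages duplicate keys by default, so count them first
n_dupes = toy.duplicated(subset=["year", "area", "item"]).sum()
print(f"duplicate keys: {n_dupes}")

wide = toy.pivot_table(
    index=["year", "area"], columns="item", values="label"
).reset_index()
wide.columns.name = None  # drop the leftover "item" axis label
print(wide)
```

Area `B` has no wheat record, so its `wheat` cells come out as `NaN`. The same thing happens in the real data for crops a country does not report.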

### 4. Rename Columns
To avoid confusion between our *features* (inputs) and our *targets* (outputs), we add a prefix `Y_` to all the new crop columns. For example, the target for Rice becomes `Y_rice`.

In [5]:
# Dynamically generate column list
current_cols = Y.columns.tolist()

# Identify crop columns (those that represent items)
crop_cols = [c for c in current_cols if c not in ['year', 'area']]

# Create new column mapping
new_col_names = ['year', 'area'] + [f'Y_{c}' for c in crop_cols]
Y.columns = new_col_names

# Display structure
Y.head()

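|   | year | area | Y_bananas | Y_barley | Y_cassava_fresh | Y_cucumbers_and_gherkins | Y_maize_corn | Y_oil_palm_fruit | Y_other_vegetables_fresh_nec | Y_potatoes | Y_rice | Y_soya_beans | Y_sugar_beet | Y_sugar_cane | Y_tomatoes | Y_watermelons | Y_wheat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970 | Afghanistan | NaN | 1174.6 | NaN | NaN | 1475.7 | NaN | 6127.8 | 9536.4 | 1811.9 | NaN | 14090.9 | 22000.0 | NaN | 7229.4 | 956.3 |
| 1 | 1970 | Albania | NaN | 1077.8 | NaN | NaN | 2071.8 | NaN | 12278.3 | 5469.3 | 2970.5 | NaN | 23638.9 | NaN | 12333.3 | NaN | 1537.7 |
| 2 | 1970 | Algeria | NaN | 668.5 | NaN | NaN | 1023.5 | NaN | 4891.3 | 6254.2 | 1581.0 | NaN | 19719.9 | NaN | 9449.6 | 8977.0 | 624.5 |
| 3 | 1970 | Angola | 10000.0 | NaN | 3555.6 | NaN | 912.0 | 9523.8 | 6515.7 | 6296.3 | 1198.0 | NaN | NaN | 50932.6 | 3076.9 | NaN | 854.6 |
| 4 | 1970 | Antigua_and_Barbuda | 1500.0 | NaN | 4000.0 | 4615.4 | 2400.0 | NaN | 6250.0 | NaN | NaN | NaN | NaN | 37272.7 | 3437.5 | NaN | NaN |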
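The positional assignment to `Y.columns` works because `reset_index()` places `year` and `area` first. As an alternative sketch, the same prefixing can be done with an explicit name-to-name mapping, which does not depend on column order. The frame `wide` below is a made-up stand-in for `Y` before renaming.

```python
import pandas as pd

# Toy wide table standing in for Y before renaming (values are made up)
wide = pd.DataFrame(
    {"year": [1970], "area": ["A"], "rice": [10.0], "wheat": [20.0]}
)

key_cols = ["year", "area"]

# Map each crop column to its prefixed name; key columns are left alone
mapping = {c: f"Y_{c}" for c in wide.columns if c not in key_cols}
renamed = wide.rename(columns=mapping)
print(renamed.columns.tolist())
```

Either approach gives the same result here; the mapping version simply fails less surprisingly if the column order ever changes upstream.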


### 5. Merge Features and Labels
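Now we combine everything. We perform an **inner join** between our Feature table (X) and our Label table (Y) on `year` and `area`. This keeps only the (`year`, `area`) pairs present in both tables, so every row in the final dataset has a feature record and a label record. Individual `Y_` columns can still be `NaN` for crops that an area did not report in a given year.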

In [6]:
XY = X.merge(Y, on=['year', 'area'], how='inner')

# Output the shape to verify merge
print(f"Final dataset shape: {XY.shape}")

Final dataset shape: (6631, 81)
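An inner join drops unmatched rows without reporting them. One way to see what is lost is an outer join with `indicator=True`, shown here as a sketch on made-up frames (`X_toy` and `Y_toy` are hypothetical and much smaller than the real tables). The `validate="one_to_one"` argument additionally raises an error if either side contains duplicate (`year`, `area`) keys.

```python
import pandas as pd

# Made-up feature and label tables with partially overlapping keys
X_toy = pd.DataFrame(
    {"year": [1970, 1971, 1972], "area": ["A", "A", "A"], "feat": [1.0, 2.0, 3.0]}
)
Y_toy = pd.DataFrame(
    {"year": [1971, 1972, 1973], "area": ["A", "A", "A"], "Y_rice": [5.0, 6.0, 7.0]}
)

# The outer join keeps every key; _merge records which side each row came from
audit = X_toy.merge(
    Y_toy, on=["year", "area"], how="outer",
    indicator=True, validate="one_to_one",
)
print(audit["_merge"].value_counts())

# Rows marked "both" are exactly what an inner join would keep
XY_toy = audit[audit["_merge"] == "both"].drop(columns="_merge")
print(f"Inner-join equivalent shape: {XY_toy.shape}")
```

Rows marked `left_only` are feature rows without labels, and `right_only` rows are labels without features. Large counts in either group usually point to mismatched area names or year ranges.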


### 6. Save Final Dataset
We export the merged dataframe to `XY_v2.parquet`. This file contains everything needed to train the model in the next step.

In [7]:
# Save to Parquet
XY.to_parquet('Parquet/XY_v2.parquet')

In [8]:

# Prevent pandas from hiding columns
pd.set_option('display.max_columns', None)

# Show first 20 rows for Thailand
XY[XY['area'] == 'Thailand'].head(20)

Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_3y,avg_yield_maize_corn_5y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_other_vegetables_fresh_nec_5y,avg_yield_potatoes_1y,avg_yield_potatoes_3y,avg_yield_potatoes_5y,avg_yield_rice_1y,avg_yield_rice_3y,avg_yield_rice_5y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_3y,avg_yield_sugar_cane_5y,avg_yield_wheat_1y,avg_yield_wheat_3y,avg_yield_wheat_5y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_3y,avg_yield_oil_palm_fruit_5y,avg_yield_barley_1y,avg_yield_barley_3y,avg_yield_barley_5y,avg_yield_soya_beans_1y,avg_yield_soya_beans_3y,avg_yield_soya_beans_5y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_3y,avg_yield_sugar_beet_5y,avg_yield_watermelons_1y,avg_yield_watermelons_3y,avg_yield_watermelons_5y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_cucumbers_and_gherkins_5y,avg_yield_tomatoes_1y,avg_yield_tomatoes_3y,avg_yield_tomatoes_5y,avg_yield_bananas_1y,avg_yield_bananas_3y,avg_yield_bananas_5y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_3y,avg_yield_cassava_fresh_5y,sum_rain_winter,sum_rain_spring,sum_rain_summer,sum_rain_autumn,sum_rain_annual,avg_solar_winter,avg_solar_spring,avg_solar_summer,avg_solar_autumn,avg_solar_annual,avg_temp_winter,avg_temp_spring,avg_temp_summer,avg_temp_autumn,avg_temp_annual,pesticides_lag1,fertilizer_lag1,latitude,longitude,Y_bananas,Y_barley,Y_cassava_fresh,Y_cucumbers_and_gherkins,Y_maize_corn,Y_oil_palm_fruit,Y_other_vegetables_fresh_nec,Y_potatoes,Y_rice,Y_soya_beans,Y_sugar_beet,Y_sugar_cane,Y_tomatoes,Y_watermelons,Y_wheat
5818,1982,Thailand,2353.8,2197.466667,2086.82,6233.6,6408.166667,6405.22,9987.5,8895.866667,9506.94,1952.1,1886.933333,1841.4,43488.8,38131.8,40210.5,,,,10116.5,10013.966667,10526.08,,,,1052.0,1006.466667,969.92,,,,15123.1,14501.033333,13719.5,7516.0,7987.566667,7690.6,7373.8,7012.533333,6071.66,12307.7,12076.933333,11966.16,14273.9,14330.4,14551.56,13.22,277.48,495.59,329.27,1115.56,,,,,,22.73,28.18,25.583333,24.303333,25.199167,,18.580125,15.87,100.99,11923.1,,16361.4,7021.2,2298.8,10261.8,6656.3,11898.8,1888.0,1123.6,,49240.5,8655.1,15020.0,
5819,1983,Thailand,2298.8,2293.633333,2203.04,6656.3,6498.733333,6455.22,11898.8,9882.933333,9789.54,1888.0,1909.3,1900.8,49240.5,40955.7,39495.76,,,,10261.8,10007.566667,10136.96,,,,1123.6,1041.8,1047.34,,,,15020.0,15052.033333,14179.26,7021.2,7890.466667,7654.84,8655.1,8541.2,6902.14,11923.1,12051.3,11950.78,16361.4,15128.066667,14972.82,40.41,147.26,420.47,428.45,1036.59,,,,,,23.15,30.103333,26.676667,24.5,26.1075,,18.704576,15.87,100.99,11769.2,,18655.4,7635.6,2267.4,8361.8,6695.3,14216.0,2035.1,1149.3,,42288.8,11372.1,16079.6,
5820,1984,Thailand,2267.4,2306.666667,2231.72,6695.3,6528.4,6515.22,14216.0,12034.1,10560.48,2035.1,1958.4,1916.78,42288.8,45006.033333,41184.94,,,,8361.8,9580.033333,9733.1,,,,1149.3,1108.3,1058.46,,,,16079.6,15407.566667,14920.54,7635.6,7390.933333,7723.9,11372.1,9133.666667,8212.96,11769.2,12000.0,11984.62,18655.4,16430.233333,15601.6,31.93,152.61,528.89,397.5,1110.93,,,,,,22.856667,30.083333,26.51,23.63,25.77,,27.797375,15.87,100.99,11884.6,,14968.8,7183.2,2430.5,8694.8,6795.3,10471.0,2067.0,1275.4,,44540.8,7828.6,15604.8,
5821,1985,Thailand,2430.5,2332.233333,2315.76,6795.3,6715.633333,6597.36,10471.0,12195.266667,10867.16,2067.0,1996.7,1966.0,44540.8,45356.7,41939.34,,,,8694.8,9106.133333,9415.86,,,,1275.4,1182.766667,1110.02,,,,15604.8,15568.133333,15368.1,7183.2,7280.0,7698.04,7828.6,9285.266667,8964.86,11884.6,11858.966667,11961.54,14968.8,16661.866667,15801.68,27.9,180.03,445.4,309.94,963.27,19.08,22.12,19.033333,18.316667,19.6375,23.693333,28.2,26.096667,24.6,25.6475,,25.61347,15.87,100.99,12153.8,,13994.1,7856.4,2571.9,11445.5,7352.9,12284.1,2060.8,1284.8,,47180.3,11355.1,14293.6,
5822,1986,Thailand,2571.9,2423.266667,2384.48,7352.9,6947.833333,6746.68,12284.1,12323.7,11771.48,2060.8,2054.3,2000.6,47180.3,44669.966667,45347.84,,,,11445.5,9500.7,9776.08,,,,1284.8,1236.5,1177.02,,,,14293.6,15326.0,15224.22,7856.4,7558.4,7442.48,11355.1,10185.266667,9316.94,12153.8,11935.866667,12007.68,13994.1,15872.766667,15650.72,28.8,251.81,368.31,381.85,1030.77,20.763333,23.44,20.113333,18.54,20.714167,24.686667,28.986667,25.663333,24.623333,25.99,,25.354604,15.87,100.99,12278.5,,12664.5,7547.2,2373.8,11582.3,7040.0,12455.8,2052.2,1263.6,,44128.2,9998.3,14339.6,1193.2
5823,1987,Thailand,2373.8,2458.733333,2388.48,7040.0,7062.733333,6907.96,12455.8,11736.966667,12265.14,2052.2,2060.0,2020.62,44128.2,45283.1,45475.72,1193.2,,,11582.3,10574.2,10069.24,,,,1263.6,1274.6,1219.34,,,,14339.6,14746.0,15067.52,7547.2,7528.933333,7448.72,9998.3,9727.333333,9841.84,12278.5,12105.633333,12001.84,12664.5,13875.8,15328.84,4.51,263.4,391.18,247.09,906.18,20.196667,24.043333,20.926667,20.233333,21.35,24.076667,28.26,26.143333,25.056667,25.884167,,31.555306,15.87,100.99,12340.0,,14266.1,7547.2,2048.6,10531.2,7023.8,7841.0,2014.7,1113.4,,47005.3,10000.0,14415.1,1250.0
5824,1988,Thailand,2048.6,2331.433333,2338.44,7023.8,7138.9,6981.46,7841.0,10860.3,11453.58,2014.7,2042.566667,2045.96,47005.3,46104.6,45028.68,1250.0,,,10531.2,11186.333333,10123.12,,,,1113.4,1220.6,1217.3,,,,14415.1,14349.433333,14946.54,7547.2,7650.266667,7553.92,10000.0,10451.133333,10110.82,12340.0,12257.433333,12085.22,14266.1,13641.566667,14909.78,23.74,273.39,426.22,477.84,1201.19,20.223333,22.436667,20.183333,17.926667,20.1925,23.413333,28.596667,26.873333,25.066667,25.9875,,35.131065,15.87,100.99,12353.8,,14421.3,7649.3,2617.6,10693.0,7007.9,8593.8,2146.5,1317.7,,47650.8,10000.0,14490.6,554.6
5825,1989,Thailand,2617.6,2346.666667,2408.48,7007.9,7023.9,7043.98,8593.8,9630.2,10329.14,2146.5,2071.133333,2068.24,47650.8,46261.433333,46101.08,554.6,999.266667,,10693.0,10935.5,10589.36,,,,1317.7,1231.566667,1250.98,,,,14490.6,14415.1,14628.74,7649.3,7581.233333,7556.66,10000.0,9999.433333,9836.4,12353.8,12324.1,12202.14,14421.3,13783.966667,14062.96,25.88,380.68,454.56,353.96,1215.08,19.963333,22.226667,19.79,18.816667,20.199167,24.246667,28.456667,26.09,23.866667,25.665,,43.465309,15.87,100.99,12197.0,1013.8,15230.1,7649.3,2569.0,12083.3,6914.1,8333.3,2085.3,1338.3,,55598.0,10000.0,14566.0,725.4
5826,1990,Thailand,2569.0,2411.733333,2436.18,6914.1,6981.933333,7067.74,8333.3,8256.033333,9901.6,2085.3,2082.166667,2071.9,55598.0,50084.7,48312.52,725.4,843.333333,,12083.3,11102.5,11267.06,1013.8,,,1338.3,1256.466667,1263.56,,,,14566.0,14490.566667,14420.98,7649.3,7615.266667,7649.88,10000.0,10000.0,10270.68,12197.0,12296.933333,12264.62,15230.1,14639.166667,14115.22,7.29,308.88,453.35,281.61,1051.13,20.036667,21.956667,19.756667,19.476667,20.306667,24.6,28.726667,25.723333,24.496667,25.886667,,46.574313,15.87,100.99,12746.4,1183.1,13915.9,7629.6,2409.0,12414.3,6965.0,9000.0,1955.6,1299.8,,48894.2,10000.0,15074.0,610.7
5827,1991,Thailand,2409.0,2531.866667,2403.6,6965.0,6962.333333,6990.16,9000.0,8642.366667,9244.78,1955.6,2062.466667,2050.86,48894.2,50714.333333,48655.3,610.7,630.233333,866.78,12414.3,11730.2,11460.82,1183.1,,,1299.8,1318.6,1266.56,,,,15074.0,14710.2,14577.06,7629.6,7642.733333,7604.52,10000.0,10000.0,9999.66,12746.4,12432.4,12383.14,13915.9,14522.433333,14099.58,18.29,298.5,422.33,422.83,1161.95,20.016667,22.24,20.47,18.69,20.354167,25.016667,29.256667,26.303333,24.673333,26.3125,18849.0,59.665657,15.87,100.99,11802.9,718.1,13745.3,7677.8,2711.7,12746.6,6954.4,8333.3,2253.4,1368.6,,52342.9,23622.9,14444.4,486.1
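The Thailand rows show that some targets stay empty after the merge (for example, `Y_barley` in the early 1980s). Before training on a given crop, it can help to count how many labelled rows each target has and to drop rows where that target is missing. The sketch below uses a made-up frame, `XY_toy`, in place of the real `XY`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for XY with partially missing targets
XY_toy = pd.DataFrame({
    "year":     [1982, 1983, 1984],
    "area":     ["T", "T", "T"],
    "Y_rice":   [1.0, 2.0, 3.0],
    "Y_barley": [np.nan, np.nan, np.nan],
    "Y_wheat":  [np.nan, np.nan, 4.0],
})

# Count non-missing labels per target column
target_cols = [c for c in XY_toy.columns if c.startswith("Y_")]
coverage = XY_toy[target_cols].notna().sum()
print(coverage)

# Keep only rows that have a label for the crop being modelled
train_wheat = XY_toy.dropna(subset=["Y_wheat"])
print(train_wheat.shape)
```

How missing targets should be handled is a modelling decision for the training step; this only illustrates how to measure the gap.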
