# Part 4: Preparing the Target Labels

## Goal
This is the final step of data preparation. In Part 3, I created the "Features" (the inputs). Now, I need to format the "Labels" (the crop yields we want to predict).

Here is the plan:
1.  **Load** the features and the raw yield data.
2.  **Pivot** the yield data so that each crop gets its own column (e.g., `Y_rice`, `Y_wheat`).
3.  **Merge** the Features (X) and Labels (Y) into one final dataset.
4.  **Save** everything so it's ready for the Machine Learning model.

In [52]:
import pandas as pd
import numpy as np

### 1. Loading the Data
I'll start by loading the features file I made in the previous notebook (`x_features_v3.parquet`) and the original raw yield data (`label_yield.parquet`).

In [53]:
# Loading the Features from Part 3
X = pd.read_parquet('Parquet/x_features_v3.parquet')

# Loading the raw yield data
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

### 2. Cleaning Crop Names
Just like before, I need to make sure the crop names are clean and consistent (lowercase, no spaces) so they match perfectly when I create columns for them.

In [54]:
# Drop special characters (keep only letters, digits, and spaces)
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
# Replace spaces with underscores and convert to lowercase
label_yield['item'] = label_yield['item'].str.replace(" ", "_").str.lower()

label_yield.head()

Unnamed: 0,area,item,year,label
0,Afghanistan,maize_corn,1970-12-31,1475.7
1,Afghanistan,maize_corn,1971-12-31,1340.0
2,Afghanistan,maize_corn,1972-12-31,1565.2
3,Afghanistan,maize_corn,1973-12-31,1617.0
4,Afghanistan,maize_corn,1974-12-31,1617.0
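
To make the two cleaning steps concrete, here is a minimal sketch on a few example names. The raw strings below are hypothetical inputs I picked for illustration; they are not read from `label_yield`.

```python
import pandas as pd

# Hypothetical raw crop names, used only to illustrate the cleaning steps
raw = pd.Series(["Maize (corn)", "Soya beans", "Other vegetables, fresh n.e.c."])

# Step 1: drop every character that is not a letter, digit, or space
step1 = raw.str.replace(r"[^0-9a-zA-Z ]", "", regex=True)

# Step 2: turn spaces into underscores, then lowercase everything
cleaned = step1.str.replace(" ", "_").str.lower()

print(cleaned.tolist())
```

The order matters: punctuation is removed first so that no stray characters are left in the final column names.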


### 3. Pivoting the Data
Currently, the data is in "long" format (one row per country, crop, and year). I need to pivot it to "wide" format so that each row represents one year/country pair and each crop gets its own column.

In [55]:
# Convert the date (e.g. 1970-12-31) to an integer year
label_yield['year'] = pd.to_datetime(label_yield['year']).dt.year

# Pivot the table (pivot_table averages duplicate year/area/item rows by default)
Y = label_yield.pivot_table(
    index=['year','area'],  # The rows
    columns='item',         # The new columns
    values='label'          # The values to fill in
).reset_index()
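
Here is a small sketch of the same long-to-wide pivot on toy data, with a duplicate check added in front. The check is my own addition, not part of the original pipeline: `pivot_table` uses `aggfunc='mean'` by default, so repeated year/area/item rows would be averaged without any warning. The values and the names `long_df`, `n_dupes`, and `wide` are made up for illustration.

```python
import pandas as pd

# Toy long-format data (made-up values)
long_df = pd.DataFrame({
    "year":  [1970, 1970, 1970, 1971],
    "area":  ["A", "A", "B", "A"],
    "item":  ["rice", "wheat", "rice", "rice"],
    "label": [10.0, 20.0, 30.0, 40.0],
})

# Count repeated (year, area, item) keys; pivot_table would average them silently
n_dupes = long_df.duplicated(subset=["year", "area", "item"]).sum()
print(f"Duplicate keys: {n_dupes}")

# Same pivot as above: one row per year/area, one column per crop
wide = long_df.pivot_table(
    index=["year", "area"],
    columns="item",
    values="label",
).reset_index()

print(wide)
```

Crops that a country does not report for a given year show up as `NaN` in the wide table, which is why the real output has so many empty cells.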

### 4. Renaming Columns
To make things clear, I'll add a `Y_` prefix to the crop columns. This way, when I merge everything, I'll easily know that `Y_rice` is the target variable I'm trying to predict.

In [56]:
# Get current columns
current_cols = Y.columns.tolist()

# Filter out the index columns
crop_cols = [c for c in current_cols if c not in ['year', 'area']]

# Rename: Keep year/area as is, add Y_ to the crops
new_col_names = ['year', 'area'] + [f'Y_{c}' for c in crop_cols]
Y.columns = new_col_names

Y.head()

Unnamed: 0,year,area,Y_bananas,Y_barley,Y_cassava_fresh,Y_cucumbers_and_gherkins,Y_maize_corn,Y_oil_palm_fruit,Y_other_vegetables_fresh_nec,Y_potatoes,Y_rice,Y_soya_beans,Y_sugar_beet,Y_sugar_cane,Y_tomatoes,Y_watermelons,Y_wheat
0,1970,Afghanistan,,1174.6,,,1475.7,,6127.8,9536.4,1811.9,,14090.9,22000.0,,7229.4,956.3
1,1970,Albania,,1077.8,,,2071.8,,12278.3,5469.3,2970.5,,23638.9,,12333.3,,1537.7
2,1970,Algeria,,668.5,,,1023.5,,4891.3,6254.2,1581.0,,19719.9,,9449.6,8977.0,624.5
3,1970,Angola,10000.0,,3555.6,,912.0,9523.8,6515.7,6296.3,1198.0,,,50932.6,3076.9,,854.6
4,1970,Antigua and Barbuda,1500.0,,4000.0,4615.4,2400.0,,6250.0,,,,,37272.7,3437.5,,
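
The renaming above depends on `year` and `area` being the first two columns. As an alternative sketch (not the approach used in this notebook), a dictionary passed to `rename` gives the same result without relying on column order. `Y_demo` and `key_cols` are hypothetical names for this example.

```python
import pandas as pd

# Toy wide table standing in for Y (made-up values)
Y_demo = pd.DataFrame({"year": [1970], "area": ["A"], "rice": [10.0], "wheat": [20.0]})

key_cols = ["year", "area"]

# Build an explicit old-name -> new-name mapping for the crop columns only
mapping = {c: f"Y_{c}" for c in Y_demo.columns if c not in key_cols}
Y_demo = Y_demo.rename(columns=mapping)

print(Y_demo.columns.tolist())
```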


### 5. Merging Features and Targets
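Now I combine the Features (X) and the Labels (Y) with an inner join on `year` and `area`. This keeps only the year/country pairs that exist in both tables, so every row in the final dataset has both the weather/input data and a matching yield row. Individual crop columns can still be empty (`NaN`) where a country has no yield recorded for that crop.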

In [57]:
XY = X.merge(Y, on=['year', 'area'], how='inner')

# Checking the final size
print(f"Final dataset shape: {XY.shape}")

Final dataset shape: (6589, 75)
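
An inner join drops unmatched rows silently, so it can be useful to see how many rows fall out on each side. The sketch below shows one way to audit that on toy data using an outer merge with `indicator=True`. The frames `X_demo` and `Y_demo` are made up, and this check is an optional extra, not something the pipeline above runs.

```python
import pandas as pd

# Toy features and labels with partially overlapping years (made-up values)
X_demo = pd.DataFrame({"year": [1970, 1971, 1972], "area": ["A", "A", "A"],
                       "rain_annual": [1.0, 2.0, 3.0]})
Y_demo = pd.DataFrame({"year": [1971, 1972, 1973], "area": ["A", "A", "A"],
                       "Y_rice": [10.0, 11.0, 12.0]})

# Outer merge with an indicator column shows where each row came from
audit = X_demo.merge(Y_demo, on=["year", "area"], how="outer", indicator=True)
counts = audit["_merge"].value_counts()
print(counts)

# The inner join keeps only the "both" rows; validate guards against duplicate keys
XY_demo = X_demo.merge(Y_demo, on=["year", "area"], how="inner", validate="one_to_one")
print(f"Inner join shape: {XY_demo.shape}")
```

Rows marked `left_only` have features but no labels, and `right_only` rows have labels but no features. Both groups are excluded from the final dataset.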


### 6. Saving the Final Files
I'll save the final dataset as a Parquet file (for speed) and also as a CSV file as requested.

In [58]:
# Save to Parquet
XY.to_parquet('Parquet/XY_v3.parquet')

# Save to CSV
XY.to_csv('Data/final_ML_dataset.csv', index=False)

print("Files saved successfully.")

Files saved successfully.
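
A quick way to confirm that a saved file is usable is to read it back and compare it with the original. The sketch below does this for a CSV written to a temporary folder. The frame `df` and the file name `demo.csv` are hypothetical; the same idea would apply to the Parquet file with `pd.read_parquet`.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for XY (made-up values, with one missing yield)
df = pd.DataFrame({"year": [1970, 1971], "area": ["A", "B"], "Y_rice": [10.5, None]})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == df.shape
same_columns = reloaded.columns.tolist() == df.columns.tolist()
print(f"Same shape: {same_shape}, same columns: {same_columns}")
```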


In [59]:
# Show a sample (China, mainland) to double check the columns
pd.set_option('display.max_columns', None)

XY[XY['area'] == 'China, mainland'].head(20)

Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_2y,avg_yield_maize_corn_3y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_2y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_potatoes_1y,avg_yield_potatoes_2y,avg_yield_potatoes_3y,avg_yield_rice_1y,avg_yield_rice_2y,avg_yield_rice_3y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_2y,avg_yield_sugar_cane_3y,avg_yield_wheat_1y,avg_yield_wheat_2y,avg_yield_wheat_3y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_2y,avg_yield_oil_palm_fruit_3y,avg_yield_barley_1y,avg_yield_barley_2y,avg_yield_barley_3y,avg_yield_soya_beans_1y,avg_yield_soya_beans_2y,avg_yield_soya_beans_3y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_2y,avg_yield_sugar_beet_3y,avg_yield_watermelons_1y,avg_yield_watermelons_2y,avg_yield_watermelons_3y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_2y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_tomatoes_1y,avg_yield_tomatoes_2y,avg_yield_tomatoes_3y,avg_yield_bananas_1y,avg_yield_bananas_2y,avg_yield_bananas_3y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_2y,avg_yield_cassava_fresh_3y,rain_annual,rain_sin,rain_cos,solar_annual,solar_sin,solar_cos,temp_annual,temp_sin,temp_cos,pesticides_lag1,fertilizer_lag1,latitude,longitude,Y_bananas,Y_barley,Y_cassava_fresh,Y_cucumbers_and_gherkins,Y_maize_corn,Y_oil_palm_fruit,Y_other_vegetables_fresh_nec,Y_potatoes,Y_rice,Y_soya_beans,Y_sugar_beet,Y_sugar_cane,Y_tomatoes,Y_watermelons,Y_wheat
1346,1982,"China, mainland",3051.3,3065.1,3038.466667,18008.0,17087.15,16852.433333,10279.2,10761.35,10886.4,4328.0,4230.9,4236.633333,53863.5,50738.65,47833.833333,2109.1,2000.25,2046.666667,,,,2583.3,2382.1,2417.933333,1163.4,1131.65,1097.933333,14600.6,14427.8,12804.166667,16397.3,15855.35,15839.833333,12000.0,11602.95,11600.033333,22826.1,22608.7,22479.866667,14000.0,13130.0,13817.966667,15217.4,14909.6,14606.4,763.66,-154.249206,-390.791077,,,,1.715,-25.425118,-56.315518,,156.550263,35.0,103.0,15461.5,2807.0,15319.1,11707.3,3269.1,,16974.8,10889.8,4891.3,1073.7,14543.9,56506.8,22500.0,14777.8,2451.7
1347,1983,"China, mainland",3269.1,3160.2,3133.1,16974.8,17491.4,17049.7,10889.8,10584.5,10804.166667,4891.3,4609.65,4451.033333,56506.8,55185.15,52661.366667,2451.7,2280.4,2150.733333,,,,2807.0,2695.15,2523.733333,1073.7,1118.55,1112.333333,14543.9,14572.25,14466.5,14777.8,15587.55,15496.166667,11707.3,11853.65,11637.733333,22500.0,22663.05,22572.466667,15461.5,14730.75,13907.166667,15319.1,15268.25,15046.1,373.92,-150.207421,-147.599971,,,,2.114167,-30.921604,-54.170596,,161.214356,35.0,103.0,15514.3,2500.0,15833.3,11904.8,3623.5,,19765.7,11375.4,5096.1,1291.1,16901.2,47655.4,23265.3,16000.0,2801.7
1348,1984,"China, mainland",3623.5,3446.3,3314.633333,19765.7,18370.25,18249.5,11375.4,11132.6,10848.133333,5096.1,4993.7,4771.8,47655.4,52081.1,52675.233333,2801.7,2626.7,2454.166667,,,,2500.0,2653.5,2630.1,1291.1,1182.4,1176.066667,16901.2,15722.55,15348.566667,16000.0,15388.9,15725.033333,11904.8,11806.05,11870.7,23265.3,22882.65,22863.8,15514.3,15487.9,14991.933333,15833.3,15576.2,15456.6,498.32,-121.770022,-209.564926,,,,1.499167,-37.270309,-55.28121,,173.656104,35.0,103.0,15000.0,2750.0,15833.3,12500.0,3960.2,,21904.8,11654.2,5372.6,1330.6,16518.4,54284.4,23200.0,15932.9,2969.1
1349,1985,"China, mainland",3960.2,3791.85,3617.6,21904.8,20835.25,19548.433333,11654.2,11514.8,11306.466667,5372.6,5234.35,5120.0,54284.4,50969.9,52815.533333,2969.1,2885.4,2740.833333,,,,2750.0,2625.0,2685.666667,1330.6,1310.85,1231.8,16518.4,16709.8,15987.833333,15932.9,15966.45,15570.233333,12500.0,12202.4,12037.366667,23200.0,23232.65,22988.433333,15000.0,15257.15,15325.266667,15833.3,15833.3,15661.9,503.52,-129.600966,-276.504565,15.56,0.482853,-33.275618,1.980833,-29.518631,-56.33614,,175.985965,35.0,103.0,13425.9,2647.1,15000.0,12790.7,3607.2,13090.9,17823.8,10795.0,5256.3,1360.5,15926.1,53418.8,21818.2,16807.9,2936.7
1350,1986,"China, mainland",3607.2,3783.7,3730.3,17823.8,19864.3,19831.433333,10795.0,11224.6,11274.866667,5256.3,5314.45,5241.666667,53418.8,53851.6,51786.2,2936.7,2952.9,2902.5,13090.9,,,2647.1,2698.55,2632.366667,1360.5,1345.55,1327.4,15926.1,16222.25,16448.566667,16807.9,16370.4,16246.933333,12790.7,12645.35,12398.5,21818.2,22509.1,22761.166667,13425.9,14212.95,14646.733333,15000.0,15416.65,15555.533333,493.31,-175.716034,-244.910755,16.9325,1.590153,-34.537873,2.139167,-28.135778,-57.518787,,143.871773,35.0,103.0,17376.3,2290.9,15217.4,13023.3,3705.1,13584.9,20371.9,10000.0,5337.6,1400.0,15979.2,52923.4,21785.7,16893.7,3040.2
1351,1987,"China, mainland",3705.1,3656.15,3757.5,20371.9,19097.85,20033.5,10000.0,10397.5,10816.4,5337.6,5296.95,5322.166667,52923.4,53171.1,53542.2,3040.2,2988.45,2982.0,13584.9,13337.9,,2290.9,2469.0,2562.666667,1400.0,1380.25,1363.7,15979.2,15952.65,16141.233333,16893.7,16850.8,16544.833333,13023.3,12907.0,12771.333333,21785.7,21801.95,22267.966667,17376.3,15401.1,15267.4,15217.4,15108.7,15350.233333,432.73,-104.713563,-246.029676,17.083333,-3.225878,-34.35861,2.119167,-29.544438,-53.854031,,143.760996,35.0,103.0,13261.4,2828.3,14347.8,13241.4,3920.5,13267.3,19731.8,10305.6,5413.0,1442.8,16371.7,55201.6,21551.7,16870.4,2982.9
1352,1988,"China, mainland",3920.5,3812.8,3744.266667,19731.8,20051.85,19309.166667,10305.6,10152.8,10366.866667,5413.0,5375.3,5335.633333,55201.6,54062.5,53847.933333,2982.9,3011.55,2986.6,13267.3,13426.1,13314.366667,2828.3,2559.6,2588.766667,1442.8,1421.4,1401.1,16371.7,16175.45,16092.333333,16870.4,16882.05,16857.333333,13241.4,13132.35,13018.466667,21551.7,21668.7,21718.533333,13261.4,15318.85,14687.866667,14347.8,14782.6,14855.066667,492.24,-66.729932,-303.627124,16.6075,-3.150878,-30.045374,2.84,-26.985785,-49.858686,,188.750133,35.0,103.0,12897.1,3061.2,14130.4,13409.1,3928.1,12673.3,19103.0,11510.7,5286.8,1434.1,17195.1,53099.6,22000.0,17046.8,2967.9
1353,1989,"China, mainland",3928.1,3924.3,3851.233333,19103.0,19417.4,19735.566667,11510.7,10908.15,10605.433333,5286.8,5349.9,5345.8,53099.6,54150.6,53741.533333,2967.9,2975.4,2997.0,12673.3,12970.3,13175.166667,3061.2,2944.75,2726.8,1434.1,1438.45,1425.633333,17195.1,16783.4,16515.333333,17046.8,16958.6,16936.966667,13409.1,13325.25,13224.6,22000.0,21775.85,21779.133333,12897.1,13079.25,14511.6,14130.4,14239.1,14565.2,581.63,-158.192282,-280.221451,15.226667,-1.940077,-33.691284,2.2925,-32.133528,-53.768531,,209.445037,35.0,103.0,12669.1,2608.7,14666.7,13775.3,3878.1,12978.7,18656.3,11002.5,5508.5,1269.4,16241.9,50854.3,22766.7,17714.3,3043.0
1354,1990,"China, mainland",3878.1,3903.1,3908.9,18656.3,18879.65,19163.7,11002.5,11256.6,10939.6,5508.5,5397.65,5402.766667,50854.3,51976.95,53051.833333,3043.0,3005.45,2997.933333,12978.7,12826.0,12973.1,2608.7,2834.95,2832.733333,1269.4,1351.75,1382.1,16241.9,16718.5,16602.9,17714.3,17380.55,17210.5,13775.3,13592.2,13475.266667,22766.7,22383.35,22106.133333,12669.1,12783.1,12942.533333,14666.7,14398.55,14381.633333,552.43,-140.711926,-264.569697,15.689167,0.854519,-34.189527,1.6225,-35.162899,-52.673697,,208.08671,35.0,103.0,13381.7,2516.9,13913.0,14977.8,4524.0,12325.6,19393.0,11316.6,5726.1,1455.1,21668.7,57117.5,24671.1,19279.3,3194.1
1355,1991,"China, mainland",4524.0,4201.05,4110.066667,19393.0,19024.65,19050.766667,11316.6,11159.55,11276.6,5726.1,5617.3,5507.133333,57117.5,53985.9,53690.466667,3194.1,3118.55,3068.333333,12325.6,12652.15,12659.2,2516.9,2562.8,2728.933333,1455.1,1362.25,1386.2,21668.7,18955.3,18368.566667,19279.3,18496.8,18013.466667,14977.8,14376.55,14054.066667,24671.1,23718.9,23145.933333,13381.7,13025.4,12982.633333,13913.0,14289.85,14236.7,568.58,-138.925517,-289.937118,17.020833,-6.611852,-38.950623,2.2625,-32.028694,-48.419263,144460.0,220.304523,35.0,103.0,14903.7,2545.5,14359.0,16221.3,4578.3,13658.5,18068.5,10560.0,5640.2,1379.5,20791.4,58345.1,26612.9,19047.6,3100.4
