# Part 3: Feature Engineering & Data Prep

## Goal
In this notebook, I'm building the final dataset for the machine learning model. The idea is to create "features" that help predict crop yield.

I'm going to create three main types of features:
1.  **Historical Yields (Lags):** Looking at how well the crop did in previous years (last year's yield, plus 2- and 3-year averages).
2.  **Weather Cycles:** Instead of only annual averages, I'll use Sine and Cosine weights over the 12 months to capture seasonal patterns (Winter, Spring, etc.) in rain, solar radiation, and temperature.
3.  **Farming Info & Location:** Adding data on fertilizers, pesticides, and GPS coordinates.

In [80]:
import pandas as pd
import numpy as np
from functools import reduce

### 1. Loading the Data
First, I'll load the clean datasets I made in Part 1 (`nasa_df` for weather and `label_yield` for crops). I'll also do a quick cleanup on the crop names to make sure they are consistent.

In [81]:
# Loading the parquet files
nasa_df = pd.read_parquet('Parquet/nasa_df.parquet')
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

# Cleaning crop names (removing special chars and lowercasing)
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
label_yield['item'] = label_yield['item'].str.replace(" ", "_").str.lower()

# Getting a list of all crops to loop through later
crop_list = list(label_yield['item'].unique())
print(crop_list)

['maize_corn', 'other_vegetables_fresh_nec', 'potatoes', 'rice', 'sugar_cane', 'wheat', 'oil_palm_fruit', 'barley', 'soya_beans', 'sugar_beet', 'watermelons', 'cucumbers_and_gherkins', 'tomatoes', 'bananas', 'cassava_fresh']
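
As a quick sanity check, the two cleanup steps can be tried on a few sample strings. This is a minimal sketch: the raw names below are hypothetical inputs, guessed from what the cleaned names look like, not values read from the dataset.

```python
import pandas as pd

# Hypothetical raw crop names (assumed for illustration only)
raw = pd.Series(["Maize (corn)", "Other vegetables, fresh n.e.c.", "Cassava, fresh"])

# Same two steps as the cell above:
# 1) drop everything except letters, digits and spaces
# 2) replace spaces with underscores and lowercase
cleaned = (
    raw.str.replace(r"[^0-9a-zA-Z ]", "", regex=True)
       .str.replace(" ", "_", regex=False)
       .str.lower()
)
print(cleaned.tolist())
```

Punctuation is removed before the spaces are replaced, so names like "n.e.c." collapse to `nec` instead of leaving stray underscores.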


### 2. Creating "Lag" Features (Past Yields)
Since crop yields usually follow trends, knowing how a crop performed in the same area in previous years is really helpful.

I wrote a function `past_n_year_avg` to calculate:
* **Lag 1:** Yield from exactly 1 year ago.
* **Lag 2 & 3:** Average yield over the last 2 and 3 years.

This basically gives the model a "memory" of recent history.

In [82]:
def past_n_year_avg(df, crop_type, n):
    """
    Calculates the average yield for the past N years.
    If we don't have N full years of history, it returns NaN.
    """
    d = df[df['item'] == crop_type].copy()
    d['year'] = pd.to_datetime(d['year']).dt.year
    d = d.sort_values(['area', 'year'])

    def compute_avg(g):
        yrs = g['year'].values
        lbl = g['label'].values
        res = []

        for y in yrs:
            # look at only the past N years
            mask = (yrs >= y - n) & (yrs < y)
            vals = lbl[mask]

            # strict check: we need exactly N data points
            if len(vals) == n:
                res.append(vals.mean())
            else:
                res.append(np.nan)

        return pd.Series(res, index=g.index)

    # Apply the logic grouped by area
    d[f'avg_yield_{crop_type}_{n}y'] = (
        d.groupby('area', group_keys=False)
         .apply(compute_avg, include_groups=False)
    )

    return d[['year', 'area', f'avg_yield_{crop_type}_{n}y']]
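
To see what the strict window check does when a year is missing, here is a small standalone sketch of the same idea. The helper name `strict_trailing_mean` and the toy numbers are made up for illustration; it mirrors the masking logic above but is not the notebook's function.

```python
import numpy as np

def strict_trailing_mean(years, values, n):
    """Mean of the n years before each year; NaN unless all n years are present."""
    years = np.asarray(years)
    values = np.asarray(values, dtype=float)
    out = []
    for y in years:
        # Same window as above: the n calendar years strictly before y
        window = values[(years >= y - n) & (years < y)]
        out.append(window.mean() if len(window) == n else np.nan)
    return np.array(out)

# Toy series with 2003 missing (values are invented)
years = [2000, 2001, 2002, 2004, 2005]
yields = [10.0, 12.0, 14.0, 18.0, 20.0]

lag2 = strict_trailing_mean(years, yields, n=2)
print(lag2)
```

Only 2002 gets a value (the mean of 2000 and 2001). The gap in 2003 turns both 2004 and 2005 into NaN, because each of their 2-year windows contains just one observation.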


#### Running Lags for All Crops
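Now I'll loop through the list of crops and the window sizes (1, 2, and 3 years) and apply the function. Then I'll merge everything into one dataframe called `features_lag_yield`. Since this is a chain of left joins, the first frame in the list (maize, 1-year) decides which year/area rows are kept.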

In [83]:
# Generate lags for every crop
dfs = []
for crop in crop_list:
    for n in [1, 2, 3]:
        dfs.append(past_n_year_avg(label_yield, crop, n))

# Merge all the results together
features_lag_yield = reduce(
    lambda left, right: pd.merge(left, right, how='left', on=['year', 'area']),
    dfs
)
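
One thing to keep in mind with a chained `how='left'` merge is that the first frame decides which `(year, area)` pairs survive. The sketch below uses two toy frames with made-up values to compare it with an outer join. Whether anything is actually dropped in the real data depends on whether every area has rows for the first crop.

```python
from functools import reduce
import pandas as pd

# Toy per-crop frames (invented values); area "B" has no maize rows
maize = pd.DataFrame({"year": [2001], "area": ["A"], "maize_1y": [1.0]})
rice = pd.DataFrame({"year": [2001, 2001], "area": ["A", "B"], "rice_1y": [2.0, 3.0]})
frames = [maize, rice]

def merge_all(frames, how):
    # Fold the list into one frame, joining on the shared keys
    return reduce(lambda l, r: pd.merge(l, r, how=how, on=["year", "area"]), frames)

left = merge_all(frames, "left")
outer = merge_all(frames, "outer")
print(len(left), len(outer))
```

With the left join, area "B" disappears because it is not in the first frame. The outer join keeps it and fills the maize column with NaN.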


### 3. Weather Features (Sin/Cos Transformation)

Instead of manually grouping months into seasons like "Winter" or "Summer", I'm using **Sine and Cosine transformations**.

The idea is to:
1.  **Pivot** the data so each month is a column.
2.  **Calculate Harmonics:** Use Sin/Cos weights to capture the yearly cycle (seasonality) for Rain, Solar, and Temp.
3.  **Lag by 1 Year:** Shift everything so we are using *last year's* weather to predict this year's crop.
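
To get a feel for what the Sin/Cos components encode, here is a minimal sketch on a synthetic 12-month series. The temperature curve is invented (mean 15, swing of 10, warmest in month 7). It shows that the two dot products together hold the size and timing of the yearly cycle, while a flat series gives zero for both.

```python
import numpy as np

months = np.arange(1, 13)
theta = 2 * np.pi * months / 12
sin_w, cos_w = np.sin(theta), np.cos(theta)

# Made-up monthly temperatures: mean 15, amplitude 10, peak in July (month 7)
temp = 15 + 10 * np.cos(2 * np.pi * (months - 7) / 12)

s = temp @ sin_w   # "sin" feature
c = temp @ cos_w   # "cos" feature

# The sum of sin^2 (or cos^2) over the 12 months is 6, hence the division
amplitude = np.hypot(s, c) / 6
# Angle of the (c, s) vector converted back to a month (0 would mean December)
peak_month = (np.arctan2(s, c) * 12 / (2 * np.pi)) % 12

flat = np.full(12, 15.0)   # no seasonality at all
print(round(float(amplitude), 3), round(float(peak_month), 3))
print(np.allclose([flat @ sin_w, flat @ cos_w], 0))
```

Here the amplitude comes back as 10 and the peak month as 7. The annual mean (15) does not show up in either component, which is why the annual sum or mean is kept as a separate feature.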

In [84]:
def prep_harmonic_weather_lag1year(nasa_df, var_list=('rain', 'solar', 'temp')):
    """
    Creates seasonal features using Sin/Cos to capture cyclic patterns.
    Returns: Annual Mean/Sum, Sin Component, Cos Component.
    """
    # Work on a copy so the caller's dataframe is not modified
    nasa_df = nasa_df.copy()
    nasa_df['date'] = pd.to_datetime(nasa_df['date'])
    nasa_df['year'] = nasa_df['date'].dt.year
    nasa_df['month'] = nasa_df['date'].dt.month

    # Weights for the 12 months
    months = np.arange(1, 13)
    sin_weights = np.sin(2 * np.pi * months / 12)
    cos_weights = np.cos(2 * np.pi * months / 12)

    all_features = None

    for var in var_list:
        # Pivot so we have columns 1-12 for each year/area
        p = nasa_df.pivot_table(
            index=['area','year'],
            columns='month',
            values=var
        ).reset_index()

        # Shift year by +1 (Lagging)
        p['year'] = p['year'] + 1

        # Fill missing month columns with NaN just in case
        for m in range(1, 13):
            if m not in p.columns:
                p[m] = np.nan

        # Grab the data as a matrix
        matrix_data = p[range(1, 13)].values

        # --- Calculating the Features ---

        # 1. Base Feature (Sum for rain, Mean for temp/solar)
        if var == 'rain':
            p[f'{var}_annual'] = np.sum(matrix_data, axis=1)
        else:
            p[f'{var}_annual'] = np.mean(matrix_data, axis=1)
        
        # If any month is missing, set annual to NaN
        mask = np.isnan(matrix_data).any(axis=1)
        p.loc[mask, f'{var}_annual'] = np.nan

        # 2. Seasonality (Dot Product with weights)
        p[f'{var}_sin'] = matrix_data @ sin_weights
        p[f'{var}_cos'] = matrix_data @ cos_weights

        # Handle NaNs for sin/cos too
        p.loc[mask, f'{var}_sin'] = np.nan
        p.loc[mask, f'{var}_cos'] = np.nan

        # Keep only the columns we need
        cols_keep = ['area', 'year', f'{var}_annual', f'{var}_sin', f'{var}_cos']
        p_final = p[cols_keep]

        if all_features is None:
            all_features = p_final
        else:
            all_features = all_features.merge(p_final, on=['area', 'year'], how='outer')

    return all_features


In [85]:
# Run the weather processing
nasa_f = prep_harmonic_weather_lag1year(nasa_df, var_list=['rain','solar','temp'])

# Check what columns we got
print(nasa_f.columns.tolist())

#nasa_f.to_csv('Data/nasa_df.csv', index=False)

['area', 'year', 'rain_annual', 'rain_sin', 'rain_cos', 'solar_annual', 'solar_sin', 'solar_cos', 'temp_annual', 'temp_sin', 'temp_cos']


### 4. Adding Coordinates
I'm pulling Latitude and Longitude out of the filenames in `Data/temperature_csv`. This gives the model a rough sense of where each country sits, so countries that are close together end up with similar coordinate values.

In [86]:
import os

folder_path = 'Data/temperature_csv'
data = []

# Loop through all files in the folder
for f in os.listdir(folder_path):
    if f.endswith('.csv'):
        parts = f.split("_")
        
        # Parse from the back to handle country names with spaces/underscores
        # Format: Country_Name_Lat_Long_Type_Start_End.csv
        lat = parts[-5]
        long = parts[-4]
        country = " ".join(parts[:-5])
        
        data.append({
            'area': country, 
            'latitude': float(lat), 
            'longitude': float(long)
        })

# Create DataFrame
latlong = pd.DataFrame(data)

# Show result
latlong.head()

Unnamed: 0,area,latitude,longitude
0,Afghanistan,34.53,69.17
1,Albania,41.33,19.82
2,Algeria,28.03,1.66
3,Angola,-11.2,17.87
4,Antigua and Barbuda,17.12,-61.85
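
The "parse from the back" trick can be checked on a single name. This is a minimal sketch: the helper `parse_coords` is hypothetical, and the `temp_1981_2020` tail of the filename is invented to match the `Country_Name_Lat_Long_Type_Start_End.csv` pattern.

```python
def parse_coords(filename):
    """Split a 'Country_Name_Lat_Long_Type_Start_End.csv' style name from the back."""
    parts = filename.split("_")
    return {
        # Everything before the last five fields is the country name
        "area": " ".join(parts[:-5]),
        "latitude": float(parts[-5]),
        "longitude": float(parts[-4]),
    }

# Hypothetical filename; only the country and coordinates match the table above
row = parse_coords("Antigua_and_Barbuda_17.12_-61.85_temp_1981_2020.csv")
print(row)
```

Counting from the end keeps multi-word country names intact. It does assume the last three fields never contain an underscore themselves, otherwise the indices shift.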


### 5. Farming Inputs (Fertilizers & Pesticides)
Here I'm adding data on fertilizer and pesticide usage.

**Important:** I'm shifting this data by 1 year, so each row uses the *previous* year's inputs. The logic is that the inputs applied before the harvest season are the ones that matter, and last year's values are already known at prediction time.

In [87]:
# 1. Load farming data
farming_df = pd.read_parquet('Parquet/farming_df.parquet')

# 2. Convert year to datetime for easier shifting
farming_df['year'] = pd.to_datetime(farming_df['year'])

# 3. Shift Year Forward by 1 (Lag 1)
# E.g., 2020 pesticides will be used for 2021 yield
farming_df['year'] = farming_df['year'] + pd.DateOffset(years=1)

# 4. Convert back to integer to match the main dataset
farming_df['year'] = farming_df['year'].dt.year

# 5. Rename columns so we know they are lagged
farming_df = farming_df.rename(columns={
    'pesticides': 'pesticides_lag1',
    'fertilizer': 'fertilizer_lag1'
})
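
Here is a tiny sketch of how the shift lines up with the yield table after merging. All numbers are made up, and it uses plain integer years instead of the datetime round-trip above to keep things short.

```python
import pandas as pd

# Made-up numbers for a single area
inputs = pd.DataFrame({"year": [2019, 2020], "area": ["A", "A"], "pesticides": [5.0, 6.0]})
yields = pd.DataFrame({"year": [2020, 2021], "area": ["A", "A"], "label": [100.0, 110.0]})

# Move each input record forward one year so it lines up with the next harvest
lagged = (
    inputs.assign(year=inputs["year"] + 1)
          .rename(columns={"pesticides": "pesticides_lag1"})
)

merged = yields.merge(lagged, on=["year", "area"], how="left")
print(merged)
```

The 2021 yield row ends up next to the 2020 pesticide value, which is the alignment described above.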

### 6. Merging Everything
Now I'll combine all the features into one big dataframe:
**Yield Lags + Weather + Inputs + Location**

I'll also filter out years before 1982 because the lagged features leave a lot of missing values in the early years (in the sample output below, the weather columns are still empty throughout the 1970s).
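
One way to pick a cutoff year like this is to look at the share of missing feature values per year. The sketch below does that on a toy table; the column names follow the ones used in this notebook, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy feature table (made-up values): early years have gaps, later ones are complete
toy = pd.DataFrame({
    "year": [1980, 1980, 1981, 1981, 1982, 1982],
    "area": ["A", "B", "A", "B", "A", "B"],
    "avg_yield_wheat_3y": [np.nan, np.nan, np.nan, 1.2, 1.1, 1.3],
    "rain_annual": [np.nan, np.nan, 300.0, 310.0, 320.0, 305.0],
})

feature_cols = toy.columns.difference(["year", "area"])

# Fraction of missing feature cells per year (averaged over rows, then over columns)
missing_share = toy[feature_cols].isna().groupby(toy["year"]).mean().mean(axis=1)
print(missing_share)
```

In this toy table, 1980 is fully missing, 1981 is a quarter missing, and 1982 is complete, so 1982 would be the natural first year to keep.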

In [88]:
# Merge Yield Lags + Weather
x_features = features_lag_yield.merge(
    nasa_f, on=['year', 'area'], how='left'
)

# Merge + Farming Data
x_features = x_features.merge(
    farming_df, on=['year', 'area'], how='left'
)

# Merge + Coordinates
x_features = x_features.merge(
    latlong, on=['area'], how='left'
)

# Show all columns to check if it worked
pd.set_option('display.max_columns', None)

# Checking a sample country (China, mainland)
x_features[x_features['area'] == 'China, mainland'].head(20)

Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_2y,avg_yield_maize_corn_3y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_2y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_potatoes_1y,avg_yield_potatoes_2y,avg_yield_potatoes_3y,avg_yield_rice_1y,avg_yield_rice_2y,avg_yield_rice_3y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_2y,avg_yield_sugar_cane_3y,avg_yield_wheat_1y,avg_yield_wheat_2y,avg_yield_wheat_3y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_2y,avg_yield_oil_palm_fruit_3y,avg_yield_barley_1y,avg_yield_barley_2y,avg_yield_barley_3y,avg_yield_soya_beans_1y,avg_yield_soya_beans_2y,avg_yield_soya_beans_3y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_2y,avg_yield_sugar_beet_3y,avg_yield_watermelons_1y,avg_yield_watermelons_2y,avg_yield_watermelons_3y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_2y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_tomatoes_1y,avg_yield_tomatoes_2y,avg_yield_tomatoes_3y,avg_yield_bananas_1y,avg_yield_bananas_2y,avg_yield_bananas_3y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_2y,avg_yield_cassava_fresh_3y,rain_annual,rain_sin,rain_cos,solar_annual,solar_sin,solar_cos,temp_annual,temp_sin,temp_cos,pesticides_lag1,fertilizer_lag1,latitude,longitude
1694,1970,"China, mainland",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,35.0,103.0
1695,1971,"China, mainland",2088.5,,,15562.0,,,10750.0,,,3402.5,,,34745.7,,,1147.5,,,,,,1139.5,,,1091.9,,,10582.8,,,16641.5,,,12375.0,,,24300.0,,,12769.2,,,11428.6,,,,,,,,,,,,,44.389605,35.0,103.0
1696,1972,"China, mainland",2145.6,2117.05,,15702.0,15632.0,,10476.2,10613.1,,3302.6,3352.55,,30842.0,32793.85,,1271.8,1209.65,,,,,1260.7,1200.1,,1106.1,1099.0,,9690.4,10136.6,,16793.1,16717.3,,12500.0,12437.5,,24545.5,24422.75,,12600.0,12684.6,,12000.0,11714.3,,,,,,,,,,,,46.195245,35.0,103.0
1697,1973,"China, mainland",1923.8,2034.7,2052.633333,14755.2,15228.6,15339.733333,10217.4,10346.8,10481.2,3228.8,3265.7,3311.3,34437.4,32639.7,33341.7,1369.5,1320.65,1262.933333,,,,1268.2,1264.45,1222.8,851.4,978.75,1016.466667,8730.8,9210.6,9668.0,15814.8,16303.95,16416.466667,11698.1,12099.05,12191.033333,23187.5,23866.5,24011.0,11400.0,12000.0,12256.4,11875.0,11937.5,11767.866667,,,,,,,,,,,53.330623,35.0,103.0
1698,1974,"China, mainland",2333.6,2128.7,2134.333333,15088.6,14921.9,15181.933333,13500.0,11858.7,11397.866667,3472.7,3350.75,3334.7,33613.3,34025.35,32964.233333,1333.7,1351.6,1325.0,,,,1076.5,1172.35,1201.8,1131.1,991.25,1029.533333,9388.4,9059.6,9269.866667,15900.0,15857.4,16169.3,11796.6,11747.35,11998.233333,23352.9,23270.2,23695.3,12384.6,11892.3,12128.2,11764.7,11819.85,11879.9,,,,,,,,,,,65.806781,35.0,103.0
1699,1975,"China, mainland",2467.7,2400.65,2241.7,15194.8,15141.7,15012.866667,13000.0,13250.0,12239.133333,3492.6,3482.65,3398.033333,33351.7,33482.5,33800.8,1511.6,1422.65,1404.933333,,,,1481.5,1279.0,1275.4,1029.8,1080.45,1004.1,8580.8,8984.6,8900.0,16620.7,16260.35,16111.833333,12315.8,12056.2,11936.833333,24303.0,23827.95,23614.466667,12615.4,12500.0,12133.333333,11764.7,11764.7,11801.466667,,,,,,,,,,,56.629088,35.0,103.0
1700,1976,"China, mainland",2541.6,2504.65,2447.633333,15379.7,15287.25,15221.033333,11571.4,12285.7,12690.466667,3517.8,3505.2,3494.366667,31849.8,32600.75,32938.266667,1639.7,1575.65,1495.0,,,,1760.6,1621.05,1439.533333,1035.5,1032.65,1065.466667,8181.0,8380.9,8716.733333,16600.0,16610.35,16373.566667,12305.1,12310.45,12139.166667,24235.3,24269.15,23963.733333,13750.0,13182.7,12916.666667,11666.7,11715.7,11732.033333,,,,,,,,,,,70.642334,35.0,103.0
1701,1977,"China, mainland",2506.6,2524.1,2505.3,15192.3,15286.0,15255.6,11300.0,11435.7,11957.133333,3477.1,3497.45,3495.833333,30733.7,31291.75,31978.4,1775.4,1707.55,1642.233333,,,,2014.8,1887.7,1752.3,993.4,1014.45,1019.566667,8224.4,8202.7,8328.733333,16600.0,16600.0,16606.9,12305.1,12305.1,12308.666667,24235.3,24235.3,24257.866667,16000.0,14875.0,14121.8,12222.2,11944.45,11884.533333,,,,,,,,,,,66.32003,35.0,103.0
1702,1978,"China, mainland",2515.5,2511.05,2521.233333,15243.9,15218.1,15271.966667,13237.5,12268.75,12036.3,3622.6,3549.85,3539.166667,35029.6,32881.65,32537.7,1465.0,1620.2,1626.7,,,,1975.0,1994.9,1916.8,1061.7,1027.55,1030.2,6981.2,7602.8,7795.533333,16580.6,16590.3,16593.533333,12290.3,12297.7,12300.166667,24222.2,24228.75,24230.933333,14000.0,15000.0,14583.333333,12105.3,12163.75,11998.066667,,,,,,,,,,,94.531309,35.0,103.0
1703,1979,"China, mainland",2803.0,2659.25,2608.366667,15142.0,15192.95,15192.733333,12748.2,12992.85,12428.566667,3978.1,3800.35,3692.6,38499.6,36764.6,34754.3,1844.9,1654.95,1695.1,,,,2229.5,2102.25,2073.1,1059.6,1060.65,1038.233333,8165.6,7573.4,7790.4,15714.3,16147.45,16298.3,11331.4,11810.85,11975.6,23636.4,23929.3,24031.3,17000.0,15500.0,15666.666667,12500.0,12302.65,12275.833333,,,,,,,,,,,112.630924,35.0,103.0


In [89]:
# Filtering for years with good data (1982+)
x_features = x_features[x_features['year'] >= 1982]

# Saving the final file
x_features.to_parquet('Parquet/x_features_v3.parquet')
#x_features.to_csv('Parquet/x_features_v3.csv')

print(f"Final shape: {x_features.shape}")
x_features.head()

Final shape: (6589, 60)


Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_2y,avg_yield_maize_corn_3y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_2y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_potatoes_1y,avg_yield_potatoes_2y,avg_yield_potatoes_3y,avg_yield_rice_1y,avg_yield_rice_2y,avg_yield_rice_3y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_2y,avg_yield_sugar_cane_3y,avg_yield_wheat_1y,avg_yield_wheat_2y,avg_yield_wheat_3y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_2y,avg_yield_oil_palm_fruit_3y,avg_yield_barley_1y,avg_yield_barley_2y,avg_yield_barley_3y,avg_yield_soya_beans_1y,avg_yield_soya_beans_2y,avg_yield_soya_beans_3y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_2y,avg_yield_sugar_beet_3y,avg_yield_watermelons_1y,avg_yield_watermelons_2y,avg_yield_watermelons_3y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_2y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_tomatoes_1y,avg_yield_tomatoes_2y,avg_yield_tomatoes_3y,avg_yield_bananas_1y,avg_yield_bananas_2y,avg_yield_bananas_3y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_2y,avg_yield_cassava_fresh_3y,rain_annual,rain_sin,rain_cos,solar_annual,solar_sin,solar_cos,temp_annual,temp_sin,temp_cos,pesticides_lag1,fertilizer_lag1,latitude,longitude
12,1982,Afghanistan,1669.0,1670.05,1650.1,6892.2,6809.65,6748.933333,15423.7,15142.0,14880.5,2241.4,2207.1,2181.766667,18918.9,18918.9,18378.366667,1235.0,1244.95,1240.533333,,,,1079.1,1064.05,1058.066667,,,,8333.3,15104.15,17361.1,9538.5,9353.35,9187.533333,,,,,,,,,,,,,325.32,154.391016,59.402942,,,,11.699167,-33.899342,-60.110408,,5.778887,34.53,69.17
13,1983,Afghanistan,1665.8,1667.4,1668.633333,6919.2,6905.7,6846.166667,15511.4,15467.55,15265.133333,2199.4,2220.4,2204.533333,19090.9,19004.9,18976.233333,1229.9,1232.45,1239.933333,,,,1073.9,1076.5,1067.333333,,,,9090.9,8712.1,13099.733333,9457.9,9498.2,9388.2,,,,,,,,,,,,,373.54,186.787027,72.87098,,,,10.650833,-42.640609,-60.922614,,6.672946,34.53,69.17
14,1984,Afghanistan,1664.1,1664.95,1666.3,7065.7,6992.45,6959.033333,15764.7,15638.05,15566.6,2258.1,2228.75,2232.966667,19375.0,19232.95,19128.266667,1258.0,1243.95,1240.966667,,,,1099.2,1086.55,1084.066667,,,,20000.0,14545.45,12474.733333,9754.9,9606.4,9583.766667,,,,,,,,,,,,,273.25,211.709895,-21.830163,,,,11.383333,-46.025539,-55.929678,,7.152971,34.53,69.17
15,1985,Afghanistan,1661.2,1662.65,1663.7,7155.1,7110.4,7046.666667,14444.4,15104.55,15240.166667,2241.6,2249.85,2233.033333,19354.8,19364.9,19273.566667,1231.9,1244.95,1239.933333,,,,1085.5,1092.35,1086.2,,,,20000.0,20000.0,16363.633333,9630.0,9692.45,9614.266667,,,,,,,,,,,,,196.1,96.438281,33.467341,19.215833,-13.475498,-51.48091,11.888333,-39.59991,-69.677748,,9.178255,34.53,69.17
16,1986,Afghanistan,1665.2,1663.2,1663.5,7145.9,7150.5,7122.233333,14090.9,14267.65,14766.666667,2248.2,2244.9,2249.3,19333.3,19344.05,19354.366667,1227.7,1229.8,1239.2,,,,1086.0,1085.75,1090.233333,,,,3333.3,11666.65,14444.433333,9556.7,9593.35,9647.2,,,,,,,,,,,,,155.09,67.530475,38.964517,18.741667,-12.861915,-47.249218,12.511667,-31.144849,-60.63187,,9.22402,34.53,69.17
