# Feature Engineering & Data Preparation (Part 3)

## Objective
In this notebook, we build the final dataset used to train our machine learning model. We need to create "features" (variables) that help the model predict crop yield.

We will construct three main types of features:
1.  **Historical Yields (Lag Features):** Using the yield from previous years (1, 3, and 5-year averages) to predict the future.
2.  **Seasonal Weather:** Aggregating monthly weather data into seasonal averages (Winter, Spring, Summer, Fall) and shifting them to align with the crop year.
3.  **Farming Inputs & Location:** Adding fertilizer/pesticide usage and GPS coordinates (Latitude/Longitude).

The final result will be saved as `x_features_v2.parquet`.

In [3]:
import pandas as pd
import numpy as np
from functools import reduce

### 1. Load and Clean Data
With the standard libraries imported above, we load the two datasets we cleaned in Part 1:
* `nasa_df.parquet`: Our weather data.
* `label_yield.parquet`: Our target crop yield data.

We also do a quick cleanup of the crop names to ensure they match perfectly.

In [4]:
# Load datasets
nasa_df = pd.read_parquet('Parquet/nasa_df.parquet')
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

# Clean crop names for consistent column naming
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
label_yield['item'] = label_yield['item'].str.replace(" ", "_").str.lower()

# Generate a list of unique crops for iteration
crop_list = list(label_yield['item'].unique())
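
To make the effect of the two cleaning steps concrete, here is a small sketch on a few illustrative crop names. The input strings are assumed examples, not values read from `label_yield`; the two `str.replace` calls are the same as in the cell above.

```python
import pandas as pd

# Illustrative crop names (assumed examples, not read from label_yield)
items = pd.Series(["Maize (corn)", "Other vegetables, fresh n.e.c.", "Soya beans"])

# Step 1: drop everything except letters, digits and spaces
cleaned = items.str.replace(r"[^0-9a-zA-Z ]", "", regex=True)
# Step 2: spaces to underscores, then lower-case
cleaned = cleaned.str.replace(" ", "_").str.lower()

print(cleaned.tolist())
# ['maize_corn', 'other_vegetables_fresh_nec', 'soya_beans']
```

These cleaned names are what end up inside the lag column names, for example `avg_yield_maize_corn_1y`.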

### 2. Create "Lag" Features (Past Yields)
Agricultural production often follows trends. If a country's yield was high last year, it is likely to be high this year.

We define a function `past_n_year_avg` that calculates the average yield for:
* **Lag 1:** The yield 1 year ago.
* **Lag 3:** The average yield of the last 3 years.
* **Lag 5:** The average yield of the last 5 years.

This gives the model a "memory" of recent performance.

In [5]:
def past_n_year_avg(df, crop_type, n):
    """
    Compute past-n-year average yield strictly for N full years.
    If less than N full past years exist, return NaN.
    """
    d = df[df['item'] == crop_type].copy()
    d['year'] = pd.to_datetime(d['year']).dt.year
    d = d.sort_values(['area', 'year'])

    def compute_avg(g):
        yrs = g['year'].values
        lbl = g['label'].values
        res = []

        for y in yrs:
            # past N years only
            mask = (yrs >= y - n) & (yrs < y)
            vals = lbl[mask]

            # strict requirement: must have exactly N rows
            if len(vals) == n:
                res.append(vals.mean())
            else:
                res.append(np.nan)

        return pd.Series(res, index=g.index)

    d[f'avg_yield_{crop_type}_{n}y'] = (
        d.groupby('area', group_keys=False)
         .apply(compute_avg, include_groups=False)
    )

    return d[['year', 'area', f'avg_yield_{crop_type}_{n}y']]
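
As a quick sanity check of the "strict N past years" idea, the sketch below reproduces it on toy data with `shift` and `rolling`. It is an alternative formulation, not the function above: it counts past rows rather than calendar years, so it only agrees with `past_n_year_avg` when each area has one row per consecutive year. The helper name `lag_avg` and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one area, four consecutive years (values are made up)
toy = pd.DataFrame({
    "area": ["A"] * 4,
    "year": [2000, 2001, 2002, 2003],
    "label": [10.0, 20.0, 30.0, 40.0],
})

def lag_avg(df, n):
    """Mean of the previous n rows per area; NaN until n past rows exist."""
    return (
        df.sort_values(["area", "year"])
          .groupby("area")["label"]
          # shift(1) excludes the current year; rolling(n) needs n non-NaN values
          .transform(lambda s: s.shift(1).rolling(n).mean())
    )

toy["lag1"] = lag_avg(toy, 1)
toy["lag3"] = lag_avg(toy, 3)
print(toy)
# lag1 is NaN, 10, 20, 30; lag3 is NaN, NaN, NaN, 20
```

The first year never has a lag value, and the 3-year average only appears once three full past years exist, which mirrors the strict rule in the docstring.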


#### Generate Lags for All Crops
We run our function for every crop in our list and merge the results into a single dataframe called `features_lag_yield`.

In [6]:
# Iterate through all crops and generate lag features
dfs = []
for crop in crop_list:
    for n in [1, 3, 5]:
        dfs.append(past_n_year_avg(label_yield, crop, n))

# Merge all crop features into a single dataframe
features_lag_yield = reduce(
    lambda left, right: pd.merge(left, right, how='left', on=['year', 'area']),
    dfs
)
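
One property of the merge above is worth keeping in mind: with `how='left'`, only the `(year, area)` pairs present in the first frame of `dfs` survive. The toy sketch below (made-up frames and a hypothetical helper `merge_all`) shows the difference from an outer merge. Whether this matters for the real data depends on whether the first crop covers every area and year, which is not checked here.

```python
from functools import reduce
import pandas as pd

# Toy per-crop frames (made-up values); crop "b" has an area that crop "a" lacks
a = pd.DataFrame({"year": [2000], "area": ["X"], "avg_yield_a_1y": [1.0]})
b = pd.DataFrame({"year": [2000, 2000], "area": ["X", "Y"], "avg_yield_b_1y": [2.0, 3.0]})

def merge_all(frames, how):
    """Chain-merge a list of frames on (year, area) with the given join type."""
    return reduce(lambda l, r: pd.merge(l, r, how=how, on=["year", "area"]), frames)

left = merge_all([a, b], "left")    # keeps only keys from the first frame
outer = merge_all([a, b], "outer")  # keeps keys from every frame

print(len(left), len(outer))
# 1 2
```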

### 3. Weather Feature Engineering
Crops don't care about "January" or "February" specifically; they care about growing seasons (Spring, Summer, Autumn, Winter).

We process the weather data as follows:
1.  **Group by Season:** We combine months into four seasons. In the code, Winter is January, February, and December of the same calendar year, not a contiguous Dec-Feb block.
2.  **Aggregate:** We calculate the **Total Rain** (Sum) and **Average Temperature/Sunlight** (Mean) for each season.
3.  **Lag by 1 Year:** We align the weather from the *previous* year to the *current* crop year. This allows us to predict yields before the current season is even finished.

In [7]:
def prep_seasonal_weather_lag1year(nasa_df, var_list=['rain','solar','temp']):
    """
    Computes seasonal lag-1 weather features:
    - Rain -> SUM
    - Solar, Temp -> AVG
    Strict: require all months for the season/annual, else NaN
    """
    nasa_df['date'] = pd.to_datetime(nasa_df['date'])
    nasa_df['year'] = nasa_df['date'].dt.year
    nasa_df['month'] = nasa_df['date'].dt.month
    
    all_features = None
    
    for var in var_list:
        # pivot
        p = nasa_df.pivot_table(
            index=['area','year'],
            columns='month',
            values=var
        ).reset_index()
        
        # rename months
        month_map = {m: f"{var}_{pd.Timestamp(2000,m,1).strftime('%b').lower()}" for m in range(1,13)}
        p = p.rename(columns=month_map)

        # lag 1 year
        p['year'] = p['year'] + 1

        # ensure all months exist
        months = [f"{var}_{pd.Timestamp(2000,m,1).strftime('%b').lower()}" for m in range(1,13)]
        for col in months:
            if col not in p.columns:
                p[col] = pd.NA

        # define seasons
        winter = [f"{var}_jan", f"{var}_feb", f"{var}_dec"]
        spring = [f"{var}_mar", f"{var}_apr", f"{var}_may"]
        summer = [f"{var}_jun", f"{var}_jul", f"{var}_aug"]
        autumn = [f"{var}_sep", f"{var}_oct", f"{var}_nov"]

        # aggregation functions
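        if var == 'rain':
            # strict sum; min_count keeps incomplete seasons as NaN (a plain sum of an all-NaN row is 0)
            agg_func = lambda df, cols: df[cols].where(df[cols].notna().all(axis=1), pd.NA).sum(axis=1, min_count=len(cols))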
        else:
            # strict avg
            agg_func = lambda df, cols: df[cols].where(df[cols].notna().all(axis=1), pd.NA).mean(axis=1)

        p[f"{'sum' if var=='rain' else 'avg'}_{var}_winter"] = agg_func(p, winter)
        p[f"{'sum' if var=='rain' else 'avg'}_{var}_spring"] = agg_func(p, spring)
        p[f"{'sum' if var=='rain' else 'avg'}_{var}_summer"] = agg_func(p, summer)
        p[f"{'sum' if var=='rain' else 'avg'}_{var}_autumn"] = agg_func(p, autumn)
        p[f"{'sum' if var=='rain' else 'avg'}_{var}_annual"] = agg_func(p, months)

        # keep relevant columns
        cols_keep = ['area','year'] + [
            f"{'sum' if var=='rain' else 'avg'}_{var}_{s}" for s in ['winter','spring','summer','autumn','annual']
        ]
        p = p[cols_keep]

        # merge
        all_features = p if all_features is None else all_features.merge(p, on=['area','year'], how='outer')

    return all_features
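
A detail that is easy to miss when implementing a "strict" seasonal total in pandas: `sum` skips missing values by default and returns 0 for a row that is entirely NaN. The sketch below, on made-up monthly rain values, contrasts the default with `min_count`, then applies the one-year shift. It illustrates the idea only and is not a drop-in replacement for the function above.

```python
import numpy as np
import pandas as pd

# Toy monthly rain (made-up values): A/2001 is missing December, B/2000 has no data
p = pd.DataFrame({
    "area": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "rain_jan": [10.0, 5.0, np.nan],
    "rain_feb": [20.0, 5.0, np.nan],
    "rain_dec": [30.0, np.nan, np.nan],
})
winter = ["rain_jan", "rain_feb", "rain_dec"]

# Default sum skips NaN, so incomplete seasons still produce a number
loose = p[winter].sum(axis=1)
# min_count=len(winter) returns NaN unless every month is present
strict = p[winter].sum(axis=1, min_count=len(winter))

# Lag by one year: weather observed in year t is attached to crop year t + 1
p["year"] = p["year"] + 1
p["sum_rain_winter"] = strict

print(loose.tolist())
# [60.0, 10.0, 0.0]
print(p[["area", "year", "sum_rain_winter"]])
```

With `min_count`, only the complete season (area A, now labelled 2001) keeps its total of 60.0; the other two rows stay NaN instead of silently becoming 10.0 and 0.0.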


In [8]:
# Process weather features
nasa_f = prep_seasonal_weather_lag1year(nasa_df, var_list=['rain','solar','temp'])

# Verify
print(nasa_f.columns.tolist())

['area', 'year', 'sum_rain_winter', 'sum_rain_spring', 'sum_rain_summer', 'sum_rain_autumn', 'sum_rain_annual', 'avg_solar_winter', 'avg_solar_spring', 'avg_solar_summer', 'avg_solar_autumn', 'avg_solar_annual', 'avg_temp_winter', 'avg_temp_spring', 'avg_temp_summer', 'avg_temp_autumn', 'avg_temp_annual']


### 4. Add Location Data (Geospatial)
Geography plays a huge role in agriculture. We load a separate file containing the **Latitude and Longitude** for each country. This gives the model a way to see that neighbors such as "Thailand" and "Vietnam" are close together and might share similar traits.

In [9]:
# Load geospatial data (assuming 'coordinates.csv' exists in the Data folder)
latlong = pd.read_csv('Data/coordinates.csv')

# Clean and standardize formatting
latlong['area'] = latlong['Area'].str.replace(' ', '_')
latlong = latlong[['area', 'latitude', 'longitude']]

# Display sample to verify structure
latlong.head()

Unnamed: 0,area,latitude,longitude
0,Albania,41.33,19.82
1,Algeria,28.03,1.66
2,Angola,-11.2,17.87
3,Argentina,-38.42,-63.62
4,Armenia,40.07,45.04


### 5. Add Farming Inputs (Fertilizers & Pesticides)
We include data on how much fertilizer and pesticide was used.
* **Logic:** We shift this data by 1 year (`Lag 1`).
* **Reason:** Farmers often plan their budget based on the previous year's usage. Using last year's data makes our prediction more practical for early forecasting.

In [10]:
# 1. Load the farming data
farming_df = pd.read_parquet('Parquet/farming_df.parquet')

# 2. Ensure 'year' is in datetime format for accurate date shifting
farming_df['year'] = pd.to_datetime(farming_df['year'])

# 3. Create Lag Features (Shift Year Forward by 1)
# Logic: We use 2020's pesticides for the 2021 yield row.
farming_df['year'] = farming_df['year'] + pd.DateOffset(years=1)

# 4. Convert 'year' back to an integer so it matches the other feature tables
farming_df['year'] = farming_df['year'].dt.year

# 5. Rename columns to indicate they are lagged
farming_df = farming_df.rename(columns={
    'pesticides': 'pesticides_lag1',
    'fertilizer': 'fertilizer_lag1'
})
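
The lag logic can be checked on a toy example. The sketch below assumes `year` is already stored as an integer, in which case the shift is a plain `+ 1` and no datetime round trip is needed. All frames and values are made up for illustration.

```python
import pandas as pd

# Toy inputs with integer years (made-up values)
inputs = pd.DataFrame({
    "area": ["A", "A"],
    "year": [2020, 2021],
    "pesticides": [1.5, 1.7],
})
yields = pd.DataFrame({
    "area": ["A", "A"],
    "year": [2020, 2021],
    "label": [100.0, 110.0],
})

# Shift the key forward so 2020 usage lines up with the 2021 yield row
lagged = (
    inputs.assign(year=inputs["year"] + 1)
          .rename(columns={"pesticides": "pesticides_lag1"})
)

merged = yields.merge(lagged, on=["area", "year"], how="left")
print(merged)
```

The 2021 yield row receives the 2020 pesticide value (1.5), and the 2020 row has no lagged value because 2019 usage is not in the toy data.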

### 6. Final Merge and Save
We combine all our new features into one master dataset:
* **Yield Lags** + **Seasonal Weather** + **Farming Inputs** + **Location**

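We keep rows from 1982 onward and save the final file as `x_features_v2.parquet`. Note that in the Thailand sample below the 5-year yield lag is already populated from 1975, while the lagged weather columns are still empty through 1979, so the cutoff appears to be set by weather coverage rather than by yield history.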

In [11]:
# Merge Yield Lags with Weather Data
x_features = features_lag_yield.merge(
    nasa_f, on=['year', 'area'], how='left'
)

# Merge with Farming Data
# Now both dataframes have 'year' as an integer
x_features = x_features.merge(
    farming_df, on=['year', 'area'], how='left'
)

# Merge with Geospatial Data
x_features = x_features.merge(
    latlong, on=['area'], how='left'
)


# Prevent pandas from hiding columns
pd.set_option('display.max_columns', None)

# Show first 20 rows for Thailand
x_features[x_features['area'] == 'Thailand'].head(20)

Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_3y,avg_yield_maize_corn_5y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_other_vegetables_fresh_nec_5y,avg_yield_potatoes_1y,avg_yield_potatoes_3y,avg_yield_potatoes_5y,avg_yield_rice_1y,avg_yield_rice_3y,avg_yield_rice_5y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_3y,avg_yield_sugar_cane_5y,avg_yield_wheat_1y,avg_yield_wheat_3y,avg_yield_wheat_5y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_3y,avg_yield_oil_palm_fruit_5y,avg_yield_barley_1y,avg_yield_barley_3y,avg_yield_barley_5y,avg_yield_soya_beans_1y,avg_yield_soya_beans_3y,avg_yield_soya_beans_5y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_3y,avg_yield_sugar_beet_5y,avg_yield_watermelons_1y,avg_yield_watermelons_3y,avg_yield_watermelons_5y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_cucumbers_and_gherkins_5y,avg_yield_tomatoes_1y,avg_yield_tomatoes_3y,avg_yield_tomatoes_5y,avg_yield_bananas_1y,avg_yield_bananas_3y,avg_yield_bananas_5y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_3y,avg_yield_cassava_fresh_5y,sum_rain_winter,sum_rain_spring,sum_rain_summer,sum_rain_autumn,sum_rain_annual,avg_solar_winter,avg_solar_spring,avg_solar_summer,avg_solar_autumn,avg_solar_annual,avg_temp_winter,avg_temp_spring,avg_temp_summer,avg_temp_autumn,avg_temp_annual,pesticides_lag1,fertilizer_lag1,latitude,longitude
7309,1970,Thailand,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,15.87,100.99
7310,1971,Thailand,2587.7,,,6181.1,,,8000.0,,,2020.7,,,43149.5,,,,,,5789.5,,,,,,875.0,,,,,,9966.7,,,7000.0,,,3223.2,,,9600.0,,,15317.0,,,,,,,,,,,,,,,,,,,6.588943,15.87,100.99
7311,1972,Thailand,2421.1,,,6171.9,,,8000.0,,,1936.9,,,47752.3,,,,,,5789.5,,,,,,1007.0,,,,,,10347.6,,,7000.0,,,3708.3,,,9600.0,,,14144.3,,,,,,,,,,,,,,,,,,,10.30802,15.87,100.99
7312,1973,Thailand,1414.0,2140.933333,,6210.9,6187.966667,,8000.0,8000.0,,1830.8,1929.466667,,37373.9,42758.566667,,,,,5789.5,5789.5,,,,,940.3,940.766667,,,,,11568.0,10627.433333,,7000.0,7000.0,,3869.6,3600.366667,,9600.0,9600.0,,12127.7,13863.0,,,,,,,,,,,,,,,,,,12.159696,15.87,100.99
7313,1974,Thailand,2227.6,2020.9,,6250.0,6210.933333,,6540.6,7513.533333,,1924.2,1897.3,,52476.8,45867.666667,,,,,5833.3,5804.1,,,,,1001.6,982.966667,,,,,11581.3,11165.633333,,7096.2,7032.066667,,4081.5,3886.466667,,10000.0,9733.333333,,13120.4,13130.8,,,,,,,,,,,,,,,,,,10.789281,15.87,100.99
7314,1975,Thailand,2332.1,1991.233333,2196.5,6289.1,6250.0,6220.6,6532.2,7024.266667,7414.56,1825.4,1860.133333,1907.6,51579.0,47143.233333,46466.3,,,,5833.3,5818.7,5807.02,,,,871.6,937.833333,939.1,,,,10775.0,11308.1,10847.72,7092.6,7062.933333,7037.76,5231.1,4394.066667,4022.74,10400.0,10000.0,9840.0,13184.6,12810.9,13578.8,,,,,,,,,,,,,,,,,13.756906,15.87,100.99
7315,1976,Thailand,2375.2,2311.633333,2154.0,6328.1,6289.066667,6250.0,9944.5,7672.433333,7803.46,1830.8,1860.133333,1869.62,48127.6,50727.8,47461.92,,,,5823.5,5830.033333,5813.82,,,,1027.6,966.933333,969.62,,,,11831.3,11395.866667,11220.64,7107.1,7098.633333,7059.18,7387.4,5566.666667,4855.58,10400.0,10266.666667,10000.0,14933.0,13746.0,13502.0,,,,,,,,,,,,,,,,,12.090537,15.87,100.99
7316,1977,Thailand,2386.2,2364.5,2147.02,6367.2,6328.133333,6289.06,2573.1,6349.933333,6718.08,1845.0,1833.733333,1851.24,53291.6,50999.4,48569.78,,,,10843.0,7499.933333,6824.52,,,,1209.6,1036.266667,1010.14,,,,12126.8,11577.7,11576.48,7206.9,7135.533333,7100.56,4198.5,5605.666667,4953.62,11200.0,10666.666667,10320.0,15065.4,14394.333333,13686.22,,,,,,,,,,,,,,,,,15.635716,15.87,100.99
7317,1978,Thailand,1717.7,2159.7,2207.76,6406.3,6367.2,6328.14,10485.8,7667.8,7215.24,1591.0,1755.6,1803.28,52814.2,51411.133333,51657.84,,,,12207.4,9624.633333,8108.1,,,,736.5,991.233333,969.38,,,,12721.2,12226.433333,11807.12,7200.0,7171.333333,7140.56,4502.7,5362.866667,5080.24,12000.0,11200.0,10800.0,14255.1,14751.166667,14111.7,,,,,,,,,,,,,,,,,17.250999,15.87,100.99
7318,1979,Thailand,2124.0,2075.966667,2187.04,6395.3,6389.6,6357.2,10361.3,7806.733333,7979.38,1955.2,1797.066667,1809.48,33842.9,46649.566667,47931.06,,,,10381.1,11143.833333,9017.66,,,,1093.7,1013.266667,987.8,,,,12373.2,12407.066667,11965.5,7290.3,7232.4,7179.38,4818.0,4506.4,5227.54,11600.0,11600.0,11120.0,15511.5,14944.0,14589.92,,,,,,,,,,,,,,,,,17.873107,15.87,100.99
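
Left merges like the ones above fail silently when keys do not match (for example, if area names are spelled differently across files). A minimal sketch of one way to check this, using the `validate` and `indicator` options of `DataFrame.merge` on made-up frames:

```python
import pandas as pd

# Toy frames (made-up values); area "B" has no coordinates
features = pd.DataFrame({"area": ["A", "A", "B"], "year": [2000, 2001, 2000]})
coords = pd.DataFrame({"area": ["A"], "latitude": [1.0], "longitude": [2.0]})

checked = features.merge(
    coords,
    on="area",
    how="left",
    validate="many_to_one",  # raises MergeError if coords has duplicate areas
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Areas that found no partner in the right-hand table
unmatched = checked.loc[checked["_merge"] == "left_only", "area"].unique().tolist()
print(unmatched)
# ['B']
```

Running a check like this after each merge shows which areas ended up with empty weather, input, or coordinate columns because of a key mismatch rather than genuinely missing data.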


In [12]:

# Filter data to relevant years (1982 onwards)
x_features = x_features[x_features['year'] >= 1982]

# Save to Parquet
x_features.to_parquet('Parquet/x_features_v2.parquet')

# Output shape for verification
print(f"Final X features shape: {x_features.shape}")

x_features.head()



Final X features shape: (6631, 66)


Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_3y,avg_yield_maize_corn_5y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_other_vegetables_fresh_nec_5y,avg_yield_potatoes_1y,avg_yield_potatoes_3y,avg_yield_potatoes_5y,avg_yield_rice_1y,avg_yield_rice_3y,avg_yield_rice_5y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_3y,avg_yield_sugar_cane_5y,avg_yield_wheat_1y,avg_yield_wheat_3y,avg_yield_wheat_5y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_3y,avg_yield_oil_palm_fruit_5y,avg_yield_barley_1y,avg_yield_barley_3y,avg_yield_barley_5y,avg_yield_soya_beans_1y,avg_yield_soya_beans_3y,avg_yield_soya_beans_5y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_3y,avg_yield_sugar_beet_5y,avg_yield_watermelons_1y,avg_yield_watermelons_3y,avg_yield_watermelons_5y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_cucumbers_and_gherkins_5y,avg_yield_tomatoes_1y,avg_yield_tomatoes_3y,avg_yield_tomatoes_5y,avg_yield_bananas_1y,avg_yield_bananas_3y,avg_yield_bananas_5y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_3y,avg_yield_cassava_fresh_5y,sum_rain_winter,sum_rain_spring,sum_rain_summer,sum_rain_autumn,sum_rain_annual,avg_solar_winter,avg_solar_spring,avg_solar_summer,avg_solar_autumn,avg_solar_annual,avg_temp_winter,avg_temp_spring,avg_temp_summer,avg_temp_autumn,avg_temp_annual,pesticides_lag1,fertilizer_lag1,latitude,longitude
12,1982,Afghanistan,1669.0,1650.1,1630.38,6892.2,6748.933333,6489.62,15423.7,14880.5,13780.82,2241.4,2181.766667,2097.64,18918.9,18378.366667,17427.02,1235.0,1240.533333,1210.1,,,,1079.1,1058.066667,1038.06,,,,8333.3,17361.1,17216.66,9538.5,9187.533333,8695.38,,,,,,,,,,,,,150.63,113.69,34.44,26.56,325.32,,,,,,0.88,12.286667,21.83,11.8,11.699167,,5.778887,34.53,69.17
13,1983,Afghanistan,1665.8,1668.633333,1646.88,6919.2,6846.166667,6717.3,15511.4,15265.133333,14777.84,2199.4,2204.533333,2156.56,19090.9,18976.233333,18045.2,1229.9,1239.933333,1229.9,,,,1073.9,1067.333333,1059.3,,,,9090.9,13099.733333,15154.84,9457.9,9388.2,9127.32,,,,,,,,,,,,,139.74,172.94,3.65,57.21,373.54,,,,,,-0.713333,9.976667,21.856667,11.483333,10.650833,,6.672946,34.53,69.17
14,1984,Afghanistan,1664.1,1666.3,1656.04,7065.7,6959.033333,6846.34,15764.7,15566.6,15183.52,2258.1,2232.966667,2200.56,19375.0,19128.266667,18720.2,1258.0,1240.966667,1241.9,,,,1099.2,1084.066667,1069.46,,,,20000.0,12474.733333,16234.84,9754.9,9583.766667,9355.08,,,,,,,,,,,,,60.6,202.73,9.34,0.58,273.25,,,,,,0.953333,9.246667,22.393333,12.94,11.383333,,7.152971,34.53,69.17
15,1985,Afghanistan,1661.2,1663.7,1666.24,7155.1,7046.666667,6951.86,14444.4,15240.166667,15200.9,2241.6,2233.033333,2222.66,19354.8,19273.566667,19131.7,1231.9,1239.933333,1241.94,,,,1085.5,1086.2,1077.34,,,,20000.0,16363.633333,15859.84,9630.0,9614.266667,9509.9,,,,,,,,,,,,,68.62,89.17,17.24,21.07,196.1,11.2,21.623333,26.856667,17.183333,19.215833,-1.406667,13.023333,24.12,11.816667,11.888333,,9.178255,34.53,69.17
16,1986,Afghanistan,1665.2,1663.5,1665.06,7145.9,7122.233333,7035.62,14090.9,14766.666667,15047.02,2248.2,2249.3,2237.74,19333.3,19354.366667,19214.58,1227.7,1239.2,1236.5,,,,1086.0,1090.233333,1084.74,,,,3333.3,14444.433333,12151.5,9556.7,9647.2,9587.6,,,,,,,,,,,,,75.65,65.67,5.44,8.33,155.09,11.366667,20.27,26.086667,17.243333,18.741667,1.843333,12.99,22.946667,12.266667,12.511667,,9.22402,34.53,69.17
