# Feature Engineering & Data Preparation (Part 3)

## Objective
In this notebook, we build the final dataset used to train our machine learning model. We need to create "features" (variables) that help the model predict crop yield.

We will construct three main types of features:
1.  **Historical Yields (Lag Features):** Using the yield from previous years (1, 2, and 3-year averages) to predict the future.
2.  **Seasonal Weather:** Summarizing monthly weather as an annual level plus sine and cosine components that capture the seasonal cycle, then shifting by one year to align with the crop year.
3.  **Farming Inputs & Location:** Adding fertilizer/pesticide usage and GPS coordinates (Latitude/Longitude).

The final result will be saved as `x_features_v3.parquet`.

In [45]:
import pandas as pd
import numpy as np
from functools import reduce

### 1. Load and Clean Data
With the standard libraries imported in the cell above, we load the two datasets we cleaned in Part 1:
* `nasa_df.parquet`: Our weather data.
* `label_yield.parquet`: Our target crop yield data.

We also do a quick cleanup of the crop names to ensure they match perfectly.

In [46]:
# Load datasets
nasa_df = pd.read_parquet('Parquet/nasa_df.parquet')
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

# Clean crop names for consistent column naming
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
label_yield['item'] = label_yield['item'].str.replace(" ", "_").str.lower()

# Generate a list of unique crops for iteration
crop_list = list(label_yield['item'].unique())
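
To make the effect of the two cleanup steps concrete, here is a minimal sketch on a few illustrative crop names. The raw names are assumptions chosen for the example, not values read from `label_yield.parquet`.

```python
import pandas as pd

# Illustrative raw names (assumed for this example)
raw = pd.Series(["Maize (corn)", "Other vegetables, fresh n.e.c.", "Cassava, fresh"])

# Same two steps as the cell above: drop punctuation, then snake_case
cleaned = (
    raw.str.replace(r"[^0-9a-zA-Z ]", "", regex=True)
       .str.replace(" ", "_")
       .str.lower()
)
print(cleaned.tolist())
# ['maize_corn', 'other_vegetables_fresh_nec', 'cassava_fresh']
```

The cleaned names become part of the feature column names later (for example `avg_yield_maize_corn_1y`), which is why punctuation is removed first.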

### 2. Create "Lag" Features (Past Yields)
Agricultural production often follows trends. If an area was productive last year, it is likely to be productive this year.

We define a function `past_n_year_avg` that calculates the average yield for:
* **Lag 1:** The yield 1 year ago.
* **Lag 2:** The average yield of the last 2 years.
* **Lag 3:** The average yield of the last 3 years.

This gives the model a "memory" of recent performance.

In [47]:
def past_n_year_avg(df, crop_type, n):
    """
    Compute past-n-year average yield strictly for N full years.
    If less than N full past years exist, return NaN.
    """
    d = df[df['item'] == crop_type].copy()
    d['year'] = pd.to_datetime(d['year']).dt.year
    d = d.sort_values(['area', 'year'])

    def compute_avg(g):
        yrs = g['year'].values
        lbl = g['label'].values
        res = []

        for y in yrs:
            # past N years only
            mask = (yrs >= y - n) & (yrs < y)
            vals = lbl[mask]

            # strict requirement: must have exactly N rows
            if len(vals) == n:
                res.append(vals.mean())
            else:
                res.append(np.nan)

        return pd.Series(res, index=g.index)

    d[f'avg_yield_{crop_type}_{n}y'] = (
        d.groupby('area', group_keys=False)
         .apply(compute_avg, include_groups=False)
    )

    return d[['year', 'area', f'avg_yield_{crop_type}_{n}y']]
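
As a cross-check of the strict "N full years" rule, the sketch below computes the same kind of lag on a toy series with a missing year. It uses a different technique from the function above (reindex to a gap-free year range, then `shift` and `rolling`), and the helper name `strict_lag_avg` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one area, yearly yields with 2002 missing
toy = pd.DataFrame({
    "area": ["A"] * 5,
    "year": [2000, 2001, 2003, 2004, 2005],
    "label": [10.0, 20.0, 30.0, 40.0, 50.0],
})

def strict_lag_avg(g, n):
    # Reindex to a gap-free year range so missing years become NaN
    s = g.set_index("year")["label"]
    full = s.reindex(range(s.index.min(), s.index.max() + 1))
    # shift(1) excludes the current year; min_periods=n requires n valid years
    lagged = full.shift(1).rolling(n, min_periods=n).mean()
    # Return values only for the years that exist in the input
    return lagged.loc[s.index]

lag2 = strict_lag_avg(toy, 2)
print(lag2)
```

Only 2005 gets a value (35.0, the mean of 2003 and 2004). Every other year is missing at least one of its two preceding years, so the result is NaN, which matches the `len(vals) == n` check in `past_n_year_avg`.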


#### Generate Lags for All Crops
We run our function for every crop in our list and merge the results into a single dataframe called `features_lag_yield`.

In [48]:

# Iterate through all crops and generate lag features
dfs = []
for crop in crop_list:
    for n in [1, 2, 3]:
        dfs.append(past_n_year_avg(label_yield, crop, n))

# Merge all crop features into a single dataframe
features_lag_yield = reduce(
    lambda left, right: pd.merge(left, right, how='left', on=['year', 'area']),
    dfs
)
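
One property of chaining `how='left'` merges is worth keeping in mind: only the `(year, area)` pairs present in the first frame survive. The sketch below illustrates this with two hypothetical per-crop frames. Whether it matters here depends on whether the first crop in `crop_list` covers every area and year, which this notebook does not show.

```python
from functools import reduce
import pandas as pd

# Hypothetical per-crop frames: wheat covers area B, maize does not
maize = pd.DataFrame({"year": [2000], "area": ["A"], "avg_yield_maize_1y": [1.0]})
wheat = pd.DataFrame({"year": [2000, 2000], "area": ["A", "B"],
                      "avg_yield_wheat_1y": [2.0, 3.0]})

def merge_all(frames, how):
    # Same reduce pattern as above, with the join type as a parameter
    return reduce(
        lambda left, right: pd.merge(left, right, how=how, on=["year", "area"]),
        frames,
    )

left = merge_all([maize, wheat], "left")
outer = merge_all([maize, wheat], "outer")
print(len(left), len(outer))  # 1 2
```

With `how='outer'`, area B is kept and its maize column is NaN. With `how='left'`, the row for area B is dropped.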



### 3. Weather Feature Engineering (Sin/Cos Transformation)

Instead of grouping months into arbitrary seasons (Winter, Spring, etc.), we will use **Sine and Cosine transformations** to capture the cyclical nature of the weather.

We process the weather data as follows:
1.  **Pivot:** Organize data so we have 12 separate months for every Year/Area.
2.  **Harmonic Extraction:** For each weather variable (Rain, Solar, Temp), we calculate:
    * **Annual Base:** The total (for rain) or average (for temp/solar) of the year.
    * **Sin Component:** The dot product of the monthly values with a sine wave.
    * **Cos Component:** The dot product of the monthly values with a cosine wave.
3.  **Lag by 1 Year:** We shift this data to use the *previous* year's weather to predict the current crop.
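
To see what the sine and cosine components encode, here is a minimal sketch on a synthetic monthly series. The temperature values are made up for illustration. Together, the two dot products recover the size of the annual swing and the month in which it peaks, while the constant base level cancels out.

```python
import numpy as np

months = np.arange(1, 13)
sin_w = np.sin(2 * np.pi * months / 12)
cos_w = np.cos(2 * np.pi * months / 12)

# Synthetic monthly temperatures (assumed): mean 20, amplitude 5, peak in month 7
base, amp, peak = 20.0, 5.0, 7
temps = base + amp * np.cos(2 * np.pi * (months - peak) / 12)

# Same dot products as the harmonic features described above
sin_c = temps @ sin_w
cos_c = temps @ cos_w

# Amplitude: the squared weights sum to 6 over 12 months, hence the division
amp_hat = np.hypot(sin_c, cos_c) / 6
# Timing: the angle of (cos_c, sin_c) maps back to a month on the 12-month cycle
peak_hat = (np.arctan2(sin_c, cos_c) * 12 / (2 * np.pi)) % 12

print(round(amp_hat, 3), round(peak_hat, 3))
```

For this series the recovered amplitude is 5 and the recovered peak month is 7. The model receives the raw `sin` and `cos` values rather than amplitude and phase, but the information content is the same.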

In [49]:
def prep_harmonic_weather_lag1year(nasa_df, var_list=('rain', 'solar', 'temp')):
    """
    Computes seasonal lag-1 weather features using Sin/Cos transformations (Fourier).
    Instead of averaging seasons, this extracts the cyclic nature:
    - Annual Mean/Sum (The base level)
    - Sin Component (Seasonality A)
    - Cos Component (Seasonality B)
    """
    # Work on a copy so the caller's dataframe is not modified in place
    nasa_df = nasa_df.copy()
    nasa_df['date'] = pd.to_datetime(nasa_df['date'])
    nasa_df['year'] = nasa_df['date'].dt.year
    nasa_df['month'] = nasa_df['date'].dt.month

    # Pre-calculate sin/cos weights for months 1-12
    months = np.arange(1, 13)
    sin_weights = np.sin(2 * np.pi * months / 12)
    cos_weights = np.cos(2 * np.pi * months / 12)

    all_features = None

    for var in var_list:
        # Pivot table (rows=year/area, cols=1..12)
        p = nasa_df.pivot_table(
            index=['area','year'],
            columns='month',
            values=var
        ).reset_index()

        # Lag 1 Year
        p['year'] = p['year'] + 1

        # Ensure columns 1-12 exist (fill missing with NaN)
        for m in range(1, 13):
            if m not in p.columns:
                p[m] = np.nan

        # Extract the 12 month columns as a numpy matrix
        matrix_data = p[range(1, 13)].values

        # --- Calculate Features Vectorized ---

        # 1. Base Feature (Annual Sum for Rain, Mean for others)
        if var == 'rain':
            p[f'{var}_annual'] = np.sum(matrix_data, axis=1)
        else:
            p[f'{var}_annual'] = np.mean(matrix_data, axis=1)
        
        # Enforce strict NaN handling (if any month is missing, annual is NaN)
        mask = np.isnan(matrix_data).any(axis=1)
        p.loc[mask, f'{var}_annual'] = np.nan

        # 2. Sin/Cos Components (Dot Product)
        p[f'{var}_sin'] = matrix_data @ sin_weights
        p[f'{var}_cos'] = matrix_data @ cos_weights

        # Enforce NaN strictness for sin/cos
        p.loc[mask, f'{var}_sin'] = np.nan
        p.loc[mask, f'{var}_cos'] = np.nan

        # Keep only keys and new features
        cols_keep = ['area', 'year', f'{var}_annual', f'{var}_sin', f'{var}_cos']
        p_final = p[cols_keep]

        if all_features is None:
            all_features = p_final
        else:
            all_features = all_features.merge(p_final, on=['area', 'year'], how='outer')

    return all_features


In [50]:
# Process weather features with Sin/Cos
nasa_f = prep_harmonic_weather_lag1year(nasa_df, var_list=['rain','solar','temp'])

# Verify new columns
print(nasa_f.columns.tolist())

['area', 'year', 'rain_annual', 'rain_sin', 'rain_cos', 'solar_annual', 'solar_sin', 'solar_cos', 'temp_annual', 'temp_sin', 'temp_cos']


### 4. Add Location Data (Geospatial)
Geography plays a huge role in agriculture. We load a separate file containing the **Latitude and Longitude** for each country. This helps the model understand that "Thailand" and "Vietnam" are neighbors and might share similar traits.

In [51]:
# Load geospatial data (assumes 'coordinates.csv' exists in the Data folder)
latlong = pd.read_csv('Data/coordinates.csv')

# Clean and standardize formatting
latlong['area'] = latlong['Area'].str.replace(' ', '_')
latlong = latlong[['area', 'latitude', 'longitude']]

# Display sample to verify structure
latlong.head()

Unnamed: 0,area,latitude,longitude
0,Albania,41.33,19.82
1,Algeria,28.03,1.66
2,Angola,-11.2,17.87
3,Argentina,-38.42,-63.62
4,Armenia,40.07,45.04
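
Because the area names in the coordinates file have spaces replaced by underscores, it can be worth checking that every area in the feature table finds a match when the two are joined. The sketch below shows one way to do that with the `indicator` option of `merge`. The frames, the second area name, and the placeholder coordinates are hypothetical.

```python
import pandas as pd

# Hypothetical frames: one area name differs only by its separator
features = pd.DataFrame({"area": ["Thailand", "Costa Rica"], "year": [2000, 2000]})
coords = pd.DataFrame({
    "area": ["Thailand", "Costa_Rica"],
    "latitude": [15.87, 1.0],     # second row is a placeholder value
    "longitude": [100.99, 2.0],   # second row is a placeholder value
})

# indicator=True adds a '_merge' column marking where each row came from
checked = features.merge(coords, on="area", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "area"].unique().tolist()
print(unmatched)  # ['Costa Rica']
```

Any name listed in `unmatched` would end up with NaN latitude and longitude in the merged table.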


### 5. Add Farming Inputs (Fertilizers & Pesticides)
We include data on how much fertilizer and pesticide was used.
* **Logic:** We shift this data by 1 year (`Lag 1`).
* **Reason:** Farmers often plan their budget based on the previous year's usage. Using last year's data makes our prediction more practical for early forecasting.

In [52]:
# 1. Load the farming data
farming_df = pd.read_parquet('Parquet/farming_df.parquet')

# 2. Ensure 'year' is in datetime format for accurate date shifting
farming_df['year'] = pd.to_datetime(farming_df['year'])

# 3. Create Lag Features (Shift Year Forward by 1)
# Logic: We use 2020's pesticides for the 2021 yield row.
farming_df['year'] = farming_df['year'] + pd.DateOffset(years=1)

# 4. Convert 'year' back to an integer so it matches the other feature tables
farming_df['year'] = farming_df['year'].dt.year

# 5. Rename columns to indicate they are lagged
farming_df = farming_df.rename(columns={
    'pesticides': 'pesticides_lag1',
    'fertilizer': 'fertilizer_lag1'
})
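
The alignment produced by the lag can be checked on a toy example. This sketch assumes `year` is already an integer column, in which case the shift is simply `year + 1`. The values are hypothetical.

```python
import pandas as pd

# Toy input data (hypothetical values) with an integer 'year' column
inputs = pd.DataFrame({
    "area": ["A", "A"],
    "year": [2020, 2021],
    "fertilizer": [5.0, 6.0],
})

# Shift the year forward by 1 and mark the column as lagged
lagged = (
    inputs.assign(year=inputs["year"] + 1)
          .rename(columns={"fertilizer": "fertilizer_lag1"})
)

# Yield rows for 2021 and 2022 now pick up the 2020 and 2021 inputs
yields = pd.DataFrame({"area": ["A", "A"], "year": [2021, 2022]})
out = yields.merge(lagged, on=["year", "area"], how="left")
print(out)
```

The 2021 row receives the 2020 fertilizer value (5.0), matching the "2020's pesticides for the 2021 yield row" logic in the cell above.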

### 6. Final Merge and Save
We combine all our new features into one master dataset:
* **Yield Lags** + **Harmonic Weather** + **Farming Inputs** + **Location**

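We then keep only rows from 1982 onward. The longest yield lag needs just three prior years of data, so the cutoff appears to be driven by the other inputs: in the Thailand preview below, the weather columns are still empty throughout the 1970s. The final file is saved as `x_features_v3.parquet`.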

In [53]:
# Merge Yield Lags with Weather Data
x_features = features_lag_yield.merge(
    nasa_f, on=['year', 'area'], how='left'
)

# Merge with Farming Data
# Both dataframes now have 'year' as an integer
x_features = x_features.merge(
    farming_df, on=['year', 'area'], how='left'
)

# Merge with Geospatial Data
x_features = x_features.merge(
    latlong, on=['area'], how='left'
)


# Prevent pandas from hiding columns
pd.set_option('display.max_columns', None)

# Show first 20 rows for Thailand
x_features[x_features['area'] == 'Thailand'].head(20)

Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_2y,avg_yield_maize_corn_3y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_2y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_potatoes_1y,avg_yield_potatoes_2y,avg_yield_potatoes_3y,avg_yield_rice_1y,avg_yield_rice_2y,avg_yield_rice_3y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_2y,avg_yield_sugar_cane_3y,avg_yield_wheat_1y,avg_yield_wheat_2y,avg_yield_wheat_3y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_2y,avg_yield_oil_palm_fruit_3y,avg_yield_barley_1y,avg_yield_barley_2y,avg_yield_barley_3y,avg_yield_soya_beans_1y,avg_yield_soya_beans_2y,avg_yield_soya_beans_3y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_2y,avg_yield_sugar_beet_3y,avg_yield_watermelons_1y,avg_yield_watermelons_2y,avg_yield_watermelons_3y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_2y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_tomatoes_1y,avg_yield_tomatoes_2y,avg_yield_tomatoes_3y,avg_yield_bananas_1y,avg_yield_bananas_2y,avg_yield_bananas_3y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_2y,avg_yield_cassava_fresh_3y,rain_annual,rain_sin,rain_cos,solar_annual,solar_sin,solar_cos,temp_annual,temp_sin,temp_cos,pesticides_lag1,fertilizer_lag1,latitude,longitude
7309,1970,Thailand,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,15.87,100.99
7310,1971,Thailand,2587.7,,,6181.1,,,8000.0,,,2020.7,,,43149.5,,,,,,5789.5,,,,,,875.0,,,,,,9966.7,,,7000.0,,,3223.2,,,9600.0,,,15317.0,,,,,,,,,,,,,6.588943,15.87,100.99
7311,1972,Thailand,2421.1,2504.4,,6171.9,6176.5,,8000.0,8000.0,,1936.9,1978.8,,47752.3,45450.9,,,,,5789.5,5789.5,,,,,1007.0,941.0,,,,,10347.6,10157.15,,7000.0,7000.0,,3708.3,3465.75,,9600.0,9600.0,,14144.3,14730.65,,,,,,,,,,,,10.30802,15.87,100.99
7312,1973,Thailand,1414.0,1917.55,2140.933333,6210.9,6191.4,6187.966667,8000.0,8000.0,8000.0,1830.8,1883.85,1929.466667,37373.9,42563.1,42758.566667,,,,5789.5,5789.5,5789.5,,,,940.3,973.65,940.766667,,,,11568.0,10957.8,10627.433333,7000.0,7000.0,7000.0,3869.6,3788.95,3600.366667,9600.0,9600.0,9600.0,12127.7,13136.0,13863.0,,,,,,,,,,,12.159696,15.87,100.99
7313,1974,Thailand,2227.6,1820.8,2020.9,6250.0,6230.45,6210.933333,6540.6,7270.3,7513.533333,1924.2,1877.5,1897.3,52476.8,44925.35,45867.666667,,,,5833.3,5811.4,5804.1,,,,1001.6,970.95,982.966667,,,,11581.3,11574.65,11165.633333,7096.2,7048.1,7032.066667,4081.5,3975.55,3886.466667,10000.0,9800.0,9733.333333,13120.4,12624.05,13130.8,,,,,,,,,,,10.789281,15.87,100.99
7314,1975,Thailand,2332.1,2279.85,1991.233333,6289.1,6269.55,6250.0,6532.2,6536.4,7024.266667,1825.4,1874.8,1860.133333,51579.0,52027.9,47143.233333,,,,5833.3,5833.3,5818.7,,,,871.6,936.6,937.833333,,,,10775.0,11178.15,11308.1,7092.6,7094.4,7062.933333,5231.1,4656.3,4394.066667,10400.0,10200.0,10000.0,13184.6,13152.5,12810.9,,,,,,,,,,,13.756906,15.87,100.99
7315,1976,Thailand,2375.2,2353.65,2311.633333,6328.1,6308.6,6289.066667,9944.5,8238.35,7672.433333,1830.8,1828.1,1860.133333,48127.6,49853.3,50727.8,,,,5823.5,5828.4,5830.033333,,,,1027.6,949.6,966.933333,,,,11831.3,11303.15,11395.866667,7107.1,7099.85,7098.633333,7387.4,6309.25,5566.666667,10400.0,10400.0,10266.666667,14933.0,14058.8,13746.0,,,,,,,,,,,12.090537,15.87,100.99
7316,1977,Thailand,2386.2,2380.7,2364.5,6367.2,6347.65,6328.133333,2573.1,6258.8,6349.933333,1845.0,1837.9,1833.733333,53291.6,50709.6,50999.4,,,,10843.0,8333.25,7499.933333,,,,1209.6,1118.6,1036.266667,,,,12126.8,11979.05,11577.7,7206.9,7157.0,7135.533333,4198.5,5792.95,5605.666667,11200.0,10800.0,10666.666667,15065.4,14999.2,14394.333333,,,,,,,,,,,15.635716,15.87,100.99
7317,1978,Thailand,1717.7,2051.95,2159.7,6406.3,6386.75,6367.2,10485.8,6529.45,7667.8,1591.0,1718.0,1755.6,52814.2,53052.9,51411.133333,,,,12207.4,11525.2,9624.633333,,,,736.5,973.05,991.233333,,,,12721.2,12424.0,12226.433333,7200.0,7203.45,7171.333333,4502.7,4350.6,5362.866667,12000.0,11600.0,11200.0,14255.1,14660.25,14751.166667,,,,,,,,,,,17.250999,15.87,100.99
7318,1979,Thailand,2124.0,1920.85,2075.966667,6395.3,6400.8,6389.6,10361.3,10423.55,7806.733333,1955.2,1773.1,1797.066667,33842.9,43328.55,46649.566667,,,,10381.1,11294.25,11143.833333,,,,1093.7,915.1,1013.266667,,,,12373.2,12547.2,12407.066667,7290.3,7245.15,7232.4,4818.0,4660.35,4506.4,11600.0,11800.0,11600.0,15511.5,14883.3,14944.0,,,,,,,,,,,17.873107,15.87,100.99


In [54]:

# Filter data to relevant years (1982 onwards)
x_features = x_features[x_features['year'] >= 1982]

# Save to Parquet
x_features.to_parquet('Parquet/x_features_v3.parquet')

# Output shape for verification
print(f"Final X features shape: {x_features.shape}")

x_features.head()



Final X features shape: (6631, 60)
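
Beyond the shape, it can help to see how much of each feature is actually populated, since several columns in the preview are empty. The sketch below computes the share of missing values per column on a toy stand-in for `x_features`. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for x_features (hypothetical values)
toy = pd.DataFrame({
    "year": [1982, 1983, 1984],
    "area": ["A", "A", "A"],
    "avg_yield_rice_1y": [1.0, np.nan, 3.0],
    "rain_annual": [np.nan, np.nan, 300.0],
    "pesticides_lag1": [np.nan, np.nan, np.nan],
})

# Fraction of NaN per feature column, most incomplete first
missing_share = (
    toy.drop(columns=["year", "area"])
       .isna()
       .mean()
       .sort_values(ascending=False)
)
print(missing_share)
```

Running the same expression on the real `x_features` would show which feature groups are sparse enough to need imputation or removal before training.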


Unnamed: 0,year,area,avg_yield_maize_corn_1y,avg_yield_maize_corn_2y,avg_yield_maize_corn_3y,avg_yield_other_vegetables_fresh_nec_1y,avg_yield_other_vegetables_fresh_nec_2y,avg_yield_other_vegetables_fresh_nec_3y,avg_yield_potatoes_1y,avg_yield_potatoes_2y,avg_yield_potatoes_3y,avg_yield_rice_1y,avg_yield_rice_2y,avg_yield_rice_3y,avg_yield_sugar_cane_1y,avg_yield_sugar_cane_2y,avg_yield_sugar_cane_3y,avg_yield_wheat_1y,avg_yield_wheat_2y,avg_yield_wheat_3y,avg_yield_oil_palm_fruit_1y,avg_yield_oil_palm_fruit_2y,avg_yield_oil_palm_fruit_3y,avg_yield_barley_1y,avg_yield_barley_2y,avg_yield_barley_3y,avg_yield_soya_beans_1y,avg_yield_soya_beans_2y,avg_yield_soya_beans_3y,avg_yield_sugar_beet_1y,avg_yield_sugar_beet_2y,avg_yield_sugar_beet_3y,avg_yield_watermelons_1y,avg_yield_watermelons_2y,avg_yield_watermelons_3y,avg_yield_cucumbers_and_gherkins_1y,avg_yield_cucumbers_and_gherkins_2y,avg_yield_cucumbers_and_gherkins_3y,avg_yield_tomatoes_1y,avg_yield_tomatoes_2y,avg_yield_tomatoes_3y,avg_yield_bananas_1y,avg_yield_bananas_2y,avg_yield_bananas_3y,avg_yield_cassava_fresh_1y,avg_yield_cassava_fresh_2y,avg_yield_cassava_fresh_3y,rain_annual,rain_sin,rain_cos,solar_annual,solar_sin,solar_cos,temp_annual,temp_sin,temp_cos,pesticides_lag1,fertilizer_lag1,latitude,longitude
12,1982,Afghanistan,1669.0,1670.05,1650.1,6892.2,6809.65,6748.933333,15423.7,15142.0,14880.5,2241.4,2207.1,2181.766667,18918.9,18918.9,18378.366667,1235.0,1244.95,1240.533333,,,,1079.1,1064.05,1058.066667,,,,8333.3,15104.15,17361.1,9538.5,9353.35,9187.533333,,,,,,,,,,,,,325.32,154.391016,59.402942,,,,11.699167,-33.899342,-60.110408,,5.778887,34.53,69.17
13,1983,Afghanistan,1665.8,1667.4,1668.633333,6919.2,6905.7,6846.166667,15511.4,15467.55,15265.133333,2199.4,2220.4,2204.533333,19090.9,19004.9,18976.233333,1229.9,1232.45,1239.933333,,,,1073.9,1076.5,1067.333333,,,,9090.9,8712.1,13099.733333,9457.9,9498.2,9388.2,,,,,,,,,,,,,373.54,186.787027,72.87098,,,,10.650833,-42.640609,-60.922614,,6.672946,34.53,69.17
14,1984,Afghanistan,1664.1,1664.95,1666.3,7065.7,6992.45,6959.033333,15764.7,15638.05,15566.6,2258.1,2228.75,2232.966667,19375.0,19232.95,19128.266667,1258.0,1243.95,1240.966667,,,,1099.2,1086.55,1084.066667,,,,20000.0,14545.45,12474.733333,9754.9,9606.4,9583.766667,,,,,,,,,,,,,273.25,211.709895,-21.830163,,,,11.383333,-46.025539,-55.929678,,7.152971,34.53,69.17
15,1985,Afghanistan,1661.2,1662.65,1663.7,7155.1,7110.4,7046.666667,14444.4,15104.55,15240.166667,2241.6,2249.85,2233.033333,19354.8,19364.9,19273.566667,1231.9,1244.95,1239.933333,,,,1085.5,1092.35,1086.2,,,,20000.0,20000.0,16363.633333,9630.0,9692.45,9614.266667,,,,,,,,,,,,,196.1,96.438281,33.467341,19.215833,-13.475498,-51.48091,11.888333,-39.59991,-69.677748,,9.178255,34.53,69.17
16,1986,Afghanistan,1665.2,1663.2,1663.5,7145.9,7150.5,7122.233333,14090.9,14267.65,14766.666667,2248.2,2244.9,2249.3,19333.3,19344.05,19354.366667,1227.7,1229.8,1239.2,,,,1086.0,1085.75,1090.233333,,,,3333.3,11666.65,14444.433333,9556.7,9593.35,9647.2,,,,,,,,,,,,,155.09,67.530475,38.964517,18.741667,-12.861915,-47.249218,12.511667,-31.144849,-60.63187,,9.22402,34.53,69.17
