# Feature Engineering and Data Preparation (Part 3)

## Overview
In this notebook, we construct the feature set (independent variables) required for the predictive model. The process involves three main stages:
1.  **Lag Features:** Calculating historical crop yield averages (1, 3, and 6 years) to capture agricultural trends.
2.  **Weather Aggregation (Seasonal):** Processing NASA weather data (Temperature, Solar Radiation, Rainfall) to create **Seasonal** (Winter, Spring, Summer, Autumn) and Annual aggregates lagged by one year.
3.  **Geospatial Integration:** Merging latitude and longitude data to account for spatial variation.

The final output is a consolidated dataset saved as `Parquet/x_features_1.parquet`.

In [1]:
import pandas as pd
import numpy as np
from functools import reduce

### 1. Data Loading and String Cleaning
We begin by loading the weather and crop yield datasets. We then normalize the `item` (crop) column so that crop names can be used safely in column names: special characters are removed, spaces are replaced with underscores, and the result is lowercased.

In [2]:
# Load datasets
nasa_df = pd.read_parquet('Parquet/nasa_df.parquet')
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

# Clean crop names for consistent column naming
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
label_yield['item'] = label_yield['item'].str.replace(" ", "_").str.lower()

# Generate a list of unique crops for iteration
crop_list = list(label_yield['item'].unique())
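
As a quick check of the cleaning rule above, the sketch below applies the same two replacements to a few sample names. The input strings are illustrative assumptions about how raw crop names might look; they are not read from `label_yield`.

```python
import pandas as pd

# Hypothetical raw crop names, used only to illustrate the cleaning rule
raw_items = pd.Series(['Maize (corn)', 'Other vegetables, fresh n.e.c.', 'Potatoes'])

# Step 1: drop every character that is not a letter, digit, or space
cleaned = raw_items.str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
# Step 2: replace spaces with underscores and lowercase the result
cleaned = cleaned.str.replace(' ', '_', regex=False).str.lower()

print(cleaned.tolist())
# ['maize_corn', 'other_vegetables_fresh_nec', 'potatoes']
```

The cleaned names later become part of feature column names such as `avg_yield_maize_corn_1y`.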

### 2. Constructing Lag Features
Historical performance is a strong indicator of future yields. We define functions that compute the average yield over the previous 1, 3, and 6 years (excluding the current year). These features are calculated per crop and per area.

In [3]:
def past_n_year_avg(group, n):
    """
    Calculates the average yield for the past n years for a given group.
    """
    res = []
    for i, row in group.iterrows():
        current_year = row['year']
        # Identify years strictly less than the current year
        past_years = list(range(current_year - n, current_year))
        avg = group.loc[group['year'].isin(past_years), 'label'].mean()
        res.append(avg)
    return pd.Series(res, index=group.index)

def prep_feature_crop_lag_1_3_6_by_type(df, crop_type):
    """
    Filters data by crop type and computes 1, 3, and 6-year lagged averages.
    """
    df = df[df['item'] == crop_type].copy()
    
    # Extract the integer year (assumes 'year' holds datetime-like values)
    df['year'] = pd.to_datetime(df['year']).dt.year
    df = df.sort_values(['area', 'item', 'year'])
    
    # Compute averages per area
    df[f'avg_yield_{crop_type}_1y'] = df.groupby(['area', 'item'], group_keys=False).apply(lambda g: past_n_year_avg(g, 1))
    df[f'avg_yield_{crop_type}_3y'] = df.groupby(['area', 'item'], group_keys=False).apply(lambda g: past_n_year_avg(g, 3))
    df[f'avg_yield_{crop_type}_6y'] = df.groupby(['area', 'item'], group_keys=False).apply(lambda g: past_n_year_avg(g, 6))
    
    # Return only the feature columns
    return df[['year', 'area', f'avg_yield_{crop_type}_1y', f'avg_yield_{crop_type}_3y', f'avg_yield_{crop_type}_6y']]
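
To make the lag semantics concrete, here is a minimal sketch on toy data. `past_n_year_avg_demo` is a hypothetical compact rewrite of the same rule (average of `label` over the `n` calendar years before the current one), and the numbers are invented. The rolling-window version is a swapped-in alternative that only matches when each group has one row per consecutive year.

```python
import numpy as np
import pandas as pd

def past_n_year_avg_demo(group, n):
    # For each row, average 'label' over years in [year - n, year - 1]
    return group['year'].apply(
        lambda y: group.loc[group['year'].between(y - n, y - 1), 'label'].mean()
    )

# Toy series for a single (area, crop) group; values are made up
toy = pd.DataFrame({
    'year': [2000, 2001, 2002, 2003],
    'label': [10.0, 20.0, 30.0, 40.0],
})

lag3 = past_n_year_avg_demo(toy, 3)

# Alternative: shift by one row, then take a rolling mean.
# Equivalent only if the years are consecutive with no gaps.
rolling3 = toy['label'].shift(1).rolling(3, min_periods=1).mean()

print(lag3.tolist())      # [nan, 10.0, 15.0, 20.0]
print(rolling3.tolist())  # [nan, 10.0, 15.0, 20.0]
```

The first year has no history, so its lag feature is `NaN`. The current year's yield never enters its own feature, which avoids leaking the target into the inputs.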

In [4]:
# Iterate through all crops and generate lag features
dfs = []
for crop_type in crop_list:
    dfs.append(prep_feature_crop_lag_1_3_6_by_type(df=label_yield, crop_type=crop_type))

# Merge all crop features into a single dataframe.
# Note: with how='left', only the (year, area) keys of the first crop in crop_list are kept.
features_lag_yield = reduce(
    lambda left, right: pd.merge(left, right, how='left', on=['year', 'area']),
    dfs
)
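
The choice of join type in the `reduce` step affects which rows survive. The sketch below uses two invented crop frames (names and values are hypothetical) to show that a chain of left joins keeps only the keys of the first frame, while an outer join keeps the union.

```python
from functools import reduce

import pandas as pd

# Two toy per-crop feature frames; 'B' appears only in the second one
maize = pd.DataFrame({'year': [2000, 2001], 'area': ['A', 'A'],
                      'avg_yield_maize_1y': [1.0, 2.0]})
wheat = pd.DataFrame({'year': [2000, 2000], 'area': ['A', 'B'],
                      'avg_yield_wheat_1y': [5.0, 6.0]})

def merge_all(frames, how):
    # Fold the list of frames into one, joining on the shared keys
    return reduce(
        lambda left, right: pd.merge(left, right, how=how, on=['year', 'area']),
        frames,
    )

left_merged = merge_all([maize, wheat], how='left')
outer_merged = merge_all([maize, wheat], how='outer')

print(len(left_merged), len(outer_merged))  # 2 3
```

If every crop should contribute its own (year, area) combinations, `how='outer'` is one option; whether that is desirable depends on how the label table is joined later.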


### 3. Weather Data Aggregation (Seasonal)
We transform monthly weather data into **Seasonal** aggregates based on the standard Northern Hemisphere meteorological mapping:
* **Winter:** Dec, Jan, Feb
* **Spring:** Mar, Apr, May
* **Summer:** Jun, Jul, Aug
* **Autumn:** Sep, Oct, Nov

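We also apply a **1-year lag**, meaning the yield in Year `T` is predicted using the weather summaries from Year `T-1`. Note that the same month-to-season mapping is applied to every area, including Southern Hemisphere countries, and that `Winter` combines January, February, and December of the same calendar year rather than a contiguous December-February season.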
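
The mapping and the lag can be sketched on a single toy year. This is an alternative formulation using a month-to-season dictionary and `groupby`, not the pivot-based code used in this notebook; `SEASON_BY_MONTH` and the rainfall values (set equal to the month number) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical lookup from month number to season name
SEASON_BY_MONTH = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Autumn', 10: 'Autumn', 11: 'Autumn',
}

# One toy area with 12 monthly rows for the year 2000
monthly = pd.DataFrame({
    'area': ['X'] * 12,
    'date': pd.date_range('2000-01-01', periods=12, freq='MS'),
    'rain': [float(m) for m in range(1, 13)],
})

monthly['season'] = monthly['date'].dt.month.map(SEASON_BY_MONTH)
# 1-year lag: weather observed in 2000 is attached to the 2001 yield
monthly['year'] = monthly['date'].dt.year + 1

seasonal = (
    monthly.groupby(['area', 'year', 'season'])['rain']
    .sum()
    .unstack('season')
)
print(seasonal)
```

With rainfall equal to the month number, Winter sums to 1 + 2 + 12 = 15 and the row is indexed by 2001, which shows both the same-calendar-year winter and the lag.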

In [5]:
def prep_rain_features_sum_lag1year(nasa_df):
    """
    Aggregates rainfall data into Seasonal sums (Winter, Spring, Summer, Autumn) 
    and shifts the year for predictive lag.
    """
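    # Parse dates so the function does not rely on an earlier cell having done so
    nasa_df['date'] = pd.to_datetime(nasa_df['date'])
    nasa_df['year'] = nasa_df['date'].dt.year
    nasa_df['month'] = nasa_df['date'].dt.month
    
    # Pivot to wide format: one row per (area, year), one column per month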
    features_rain = nasa_df.pivot_table(
        index=['area','year'],
        columns='month',
        values='rain'
    ).reset_index()
    
    # Rename columns to standard abbreviations (e.g., rain_Jan, rain_Feb)
    month_map = {i: f'rain_{pd.Timestamp(1900,i,1).strftime("%b")}' for i in range(1,13)}
    features_rain = features_rain.rename(columns=month_map)
    
    # Create Lag-1: Weather from previous year predicts next year's yield
    features_rain['year'] = features_rain['year'] + 1
    
    # Standardize column names (Ensure all months exist, fill 0 if missing for sum)
    months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    for m in months:
        if f'rain_{m}' not in features_rain.columns:
            features_rain[f'rain_{m}'] = 0
    
    # --- Seasonal Aggregations ---
    # Winter: Dec, Jan, Feb
    features_rain['sum_rain_Winter'] = features_rain[['rain_Jan', 'rain_Feb', 'rain_Dec']].sum(axis=1)
    # Spring: Mar, Apr, May
    features_rain['sum_rain_Spring'] = features_rain[['rain_Mar', 'rain_Apr', 'rain_May']].sum(axis=1)
    # Summer: Jun, Jul, Aug
    features_rain['sum_rain_Summer'] = features_rain[['rain_Jun', 'rain_Jul', 'rain_Aug']].sum(axis=1)
    # Autumn: Sep, Oct, Nov
    features_rain['sum_rain_Autumn'] = features_rain[['rain_Sep', 'rain_Oct', 'rain_Nov']].sum(axis=1)
    # Annual
    features_rain['sum_rain_Annual'] = features_rain[[f'rain_{m}' for m in months]].sum(axis=1)
    
    return features_rain[['area', 'year', 'sum_rain_Winter', 'sum_rain_Spring', 'sum_rain_Summer', 'sum_rain_Autumn', 'sum_rain_Annual']]
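
One detail worth keeping in mind when reading the seasonal sums: row-wise `sum` skips missing values by default, so a season with missing months produces a smaller total rather than `NaN`. The sketch below uses invented values to contrast the default with `min_count`, a pandas option that is not used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy summer months: complete row, one missing month, all months missing
wide = pd.DataFrame({
    'rain_Jun': [10.0, np.nan, np.nan],
    'rain_Jul': [20.0, 20.0, np.nan],
    'rain_Aug': [30.0, 30.0, np.nan],
})
cols = ['rain_Jun', 'rain_Jul', 'rain_Aug']

default_sum = wide[cols].sum(axis=1)              # NaN treated as 0
strict_sum = wide[cols].sum(axis=1, min_count=3)  # NaN unless all 3 months present
seasonal_mean = wide[cols].mean(axis=1)           # mean over available months

print(default_sum.tolist())    # [60.0, 50.0, 0.0]
print(strict_sum.tolist())     # [60.0, nan, nan]
print(seasonal_mean.tolist())  # [20.0, 25.0, nan]
```

If the weather data has gaps, a stricter `min_count` may be preferable so that incomplete seasons are flagged as missing instead of appearing unusually dry.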

In [6]:
def prep_monthly_features_avg_nasa_lag1year(nasa_df, var_list=('rain', 'solar', 'temp')):
    """
    Computes Seasonal averages for multiple weather variables (Rain, Solar, Temp), 
    applying a 1-year lag.
    """
    nasa_df['date'] = pd.to_datetime(nasa_df['date'])
    nasa_df['year'] = nasa_df['date'].dt.year
    nasa_df['month'] = nasa_df['date'].dt.month
    
    all_features = None
    
    for var in var_list:
        # Pivot table for the specific variable
        df_pivot = nasa_df.pivot_table(
            index=['area','year'],
            columns='month',
            values=var
        ).reset_index()
        
        month_map = {i: f'{var}_{pd.Timestamp(1900,i,1).strftime("%b")}' for i in range(1,13)}
        df_pivot = df_pivot.rename(columns=month_map)
        
        # Apply Lag
        df_pivot['year'] = df_pivot['year'] + 1
        
        # Add any missing month columns as NaN so every month column exists
        # (np.nan keeps the column numeric; pd.NA would create an object column)
        months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
        for m in months:
            col = f'{var}_{m}'
            if col not in df_pivot.columns:
                df_pivot[col] = np.nan
        
        # --- Seasonal Averages ---
        # Winter: Dec, Jan, Feb
        df_pivot[f'avg_{var}_Winter'] = df_pivot[[f'{var}_{m}' for m in ['Jan', 'Feb', 'Dec']]].mean(axis=1)
        # Spring: Mar, Apr, May
        df_pivot[f'avg_{var}_Spring'] = df_pivot[[f'{var}_{m}' for m in ['Mar', 'Apr', 'May']]].mean(axis=1)
        # Summer: Jun, Jul, Aug
        df_pivot[f'avg_{var}_Summer'] = df_pivot[[f'{var}_{m}' for m in ['Jun', 'Jul', 'Aug']]].mean(axis=1)
        # Autumn: Sep, Oct, Nov
        df_pivot[f'avg_{var}_Autumn'] = df_pivot[[f'{var}_{m}' for m in ['Sep', 'Oct', 'Nov']]].mean(axis=1)
        # Annual
        df_pivot[f'avg_{var}_Annual'] = df_pivot[[f'{var}_{m}' for m in months]].mean(axis=1)
        
        # Keep only identification and aggregated columns
        cols_to_keep = ['area','year'] + [f'avg_{var}_{season}' for season in ['Winter', 'Spring', 'Summer', 'Autumn', 'Annual']]
        df_pivot = df_pivot[cols_to_keep]
        
        # Merge variables iteratively
        if all_features is None:
            all_features = df_pivot
        else:
            all_features = all_features.merge(df_pivot, on=['area','year'], how='outer')
    
    return all_features

In [7]:
# Build the seasonal weather features (averages for all variables, sums for rainfall)
features_avg_nasa_all_lag1year = prep_monthly_features_avg_nasa_lag1year(nasa_df, var_list=['rain', 'solar', 'temp'])
features_sum_nasa_rain_lag1year = prep_rain_features_sum_lag1year(nasa_df)

# Combine averages and sums
nasa_f = features_avg_nasa_all_lag1year.merge(
    features_sum_nasa_rain_lag1year, on=['year', 'area'], how='inner'
)

# Check columns to confirm seasonality
print("Weather Feature Columns:", nasa_f.columns.tolist())

Weather Feature Columns: ['area', 'year', 'avg_rain_Winter', 'avg_rain_Spring', 'avg_rain_Summer', 'avg_rain_Autumn', 'avg_rain_Annual', 'avg_solar_Winter', 'avg_solar_Spring', 'avg_solar_Summer', 'avg_solar_Autumn', 'avg_solar_Annual', 'avg_temp_Winter', 'avg_temp_Spring', 'avg_temp_Summer', 'avg_temp_Autumn', 'avg_temp_Annual', 'sum_rain_Winter', 'sum_rain_Spring', 'sum_rain_Summer', 'sum_rain_Autumn', 'sum_rain_Annual']


### 4. Integrating Geospatial Data
We incorporate latitude and longitude data to allow the model to learn from spatial relationships. The area names are cleaned to match the primary dataset keys.

In [9]:
# Load geospatial data (assumes 'coordinates_countries_full_209.csv' exists in the Data folder)
latlong = pd.read_csv('Data/coordinates_countries_full_209.csv')

# Clean and standardize formatting
latlong['area'] = latlong['Area'].str.replace(' ', '_')
latlong = latlong[['area', 'latitude', 'longitude']]

# Display sample to verify structure
latlong.head()

| | area | latitude | longitude |
|---|---|---|---|
| 0 | Albania | 41.33 | 19.82 |
| 1 | Algeria | 28.03 | 1.66 |
| 2 | Angola | -11.2 | 17.87 |
| 3 | Argentina | -38.42 | -63.62 |
| 4 | Armenia | 40.07 | 45.04 |
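
Because the join on `area` depends on exact string matches, it can help to list areas that have no coordinates. The sketch below uses pandas' `indicator=True` merge option on toy frames; `Costa_Rica`, `Atlantis`, and the second coordinate pair are invented purely for illustration.

```python
import pandas as pd

# Toy feature table with area names in the underscore format
features = pd.DataFrame({
    'area': ['Albania', 'Costa_Rica', 'Atlantis'],
    'year': [1983, 1983, 1983],
})

# Toy coordinates table with raw names (spaces instead of underscores)
coords = pd.DataFrame({
    'Area': ['Albania', 'Costa Rica'],
    'latitude': [41.33, 10.0],
    'longitude': [19.82, -84.0],
})
coords['area'] = coords['Area'].str.replace(' ', '_', regex=False)

# indicator=True adds a '_merge' column telling where each row came from
check = features.merge(
    coords[['area', 'latitude', 'longitude']],
    on='area', how='left', indicator=True,
)
unmatched = sorted(check.loc[check['_merge'] == 'left_only', 'area'].unique())
print(unmatched)  # ['Atlantis']
```

Any area printed here would end up with `NaN` latitude and longitude in the final feature table.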


### 5. Final Merging and Export
We merge the yield lag features, the seasonal weather features, and the geospatial data into a single DataFrame. We filter for years starting from 1983 to ensure data consistency and save the result.

In [None]:
# Merge Yield Lags with Weather Data
x_features = features_lag_yield.merge(
    nasa_f, on=['year', 'area'], how='left'
)

# Merge with Geospatial Data
x_features = x_features.merge(
    latlong, on=['area'], how='left'
)

# Filter data to relevant years (1983 onwards)
x_features = x_features[x_features['year'] >= 1983]

# Save to Parquet
x_features.to_parquet('Parquet/x_features_1.parquet')

# Output shape for verification
print(f"Final X features shape: {x_features.shape}")

x_features.head()

Final X features shape: (6458, 69)


| | year | area | avg_yield_maize_corn_1y | avg_yield_maize_corn_3y | avg_yield_maize_corn_6y | avg_yield_other_vegetables_fresh_nec_1y | avg_yield_other_vegetables_fresh_nec_3y | avg_yield_other_vegetables_fresh_nec_6y | avg_yield_potatoes_1y | avg_yield_potatoes_3y | ... | avg_temp_Summer | avg_temp_Autumn | avg_temp_Annual | sum_rain_Winter | sum_rain_Spring | sum_rain_Summer | sum_rain_Autumn | sum_rain_Annual | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 1983 | Afghanistan | 1665.8 | 1668.633333 | 1636.283333 | 6919.2 | 6846.166667 | 6561.216667 | 15511.4 | 15265.133333 | ... | 21.856667 | 11.483333 | 10.650833 | 139.74 | 172.94 | 3.65 | 57.21 | 373.54 | 34.53 | 69.17 |
| 14 | 1984 | Afghanistan | 1664.1 | 1666.3 | 1649.75 | 7065.7 | 6959.033333 | 6775.366667 | 15764.7 | 15566.6 | ... | 22.393333 | 12.94 | 11.383333 | 60.6 | 202.73 | 9.34 | 0.58 | 273.25 | 34.53 | 69.17 |
| 15 | 1985 | Afghanistan | 1661.2 | 1663.7 | 1656.9 | 7155.1 | 7046.666667 | 6897.8 | 14444.4 | 15240.166667 | ... | 24.12 | 11.816667 | 11.888333 | 68.62 | 89.17 | 17.24 | 21.07 | 196.1 | 34.53 | 69.17 |
| 16 | 1986 | Afghanistan | 1665.2 | 1663.5 | 1666.066667 | 7145.9 | 7122.233333 | 6984.2 | 14090.9 | 14766.666667 | ... | 22.946667 | 12.266667 | 12.511667 | 75.65 | 65.67 | 5.44 | 8.33 | 155.09 | 34.53 | 69.17 |
| 17 | 1987 | Afghanistan | 1687.5 | 1671.3 | 1668.8 | 7249.5 | 7183.5 | 7071.266667 | 15866.7 | 14800.666667 | ... | 21.813333 | 12.196667 | 11.256667 | 61.78 | 172.26 | 36.63 | 40.53 | 311.2 | 34.53 | 69.17 |
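
Before relying on the exported file, a couple of cheap integrity checks can be useful. The sketch below, on invented toy frames, uses the `validate` argument of `merge` to confirm that the coordinates table has one row per area, and then checks that (year, area) keys stay unique after the year filter. It is a suggested safeguard, not part of the pipeline above.

```python
import pandas as pd

# Toy stand-ins for the merged features and the coordinates table
x = pd.DataFrame({'year': [1982, 1983, 1984],
                  'area': ['A', 'A', 'B'],
                  'feature': [1.0, 2.0, 3.0]})
geo = pd.DataFrame({'area': ['A', 'B'],
                    'latitude': [1.0, 2.0],
                    'longitude': [3.0, 4.0]})

# validate='many_to_one' raises MergeError if 'area' repeats in geo
merged = x.merge(geo, on='area', how='left', validate='many_to_one')
merged = merged[merged['year'] >= 1983]

# Each (year, area) pair should appear exactly once
has_duplicates = merged.duplicated(['year', 'area']).any()

# A coordinates table with a repeated area is rejected
dup_geo = pd.concat([geo, geo.iloc[[0]]], ignore_index=True)
try:
    x.merge(dup_geo, on='area', how='left', validate='many_to_one')
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, has_duplicates, duplicate_detected)
```

Without such a check, a duplicated area in the coordinates file would silently multiply rows in the final dataset.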
