# Feature Engineering and Data Preparation (Part 3)

## Overview
In this notebook, we construct the feature set (independent variables) required for the predictive model. The process involves three main stages:
1.  **Lag Features:** Calculating historical crop yield averages (1, 3, and 6 years) to capture agricultural trends.
2.  **Weather Aggregation:** Processing NASA weather data (Temperature, Solar Radiation, Rainfall) to create seasonal and annual aggregates lagged by one year to prevent data leakage.
3.  **Geospatial Integration:** Merging latitude and longitude data to account for spatial variances.

The final output is a consolidated dataset saved as `Parquet/x_features.parquet`.

In [1]:
import pandas as pd
import numpy as np
from functools import reduce

### 1. Data Loading and String Cleaning
We begin by loading the weather and crop yield datasets. We then normalize the 'item' (crop) column by removing special characters, replacing spaces with underscores, and lowercasing, so that crop names can be used safely in column names.

In [2]:
# Load datasets
nasa_df = pd.read_parquet('Parquet/nasa_df.parquet')
label_yield = pd.read_parquet('Parquet/label_yield.parquet')

# Clean crop names for consistent column naming
label_yield['item'] = label_yield['item'].str.replace(r'[^0-9a-zA-Z ]', '', regex=True)
label_yield['item'] = label_yield['item'].str.replace(" ", "_").str.lower()

# Generate a list of unique crops for iteration
crop_list = list(label_yield['item'].unique())
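
As a quick illustration of the cleaning steps above, the sketch below applies the same three operations to a few made-up crop names. The names are hypothetical and are not taken from `label_yield`.

```python
import pandas as pd

# Hypothetical crop names, used only to illustrate the cleaning steps
items = pd.Series(["Maize (corn)", "Rice, paddy", "Sweet potatoes"])

cleaned = (
    items.str.replace(r"[^0-9a-zA-Z ]", "", regex=True)  # drop special characters
         .str.replace(" ", "_", regex=False)             # spaces -> underscores
         .str.lower()                                    # lowercase
)
print(cleaned.tolist())
```

The cleaned names contain only lowercase letters, digits, and underscores, which keeps the generated `avg_yield_{crop_type}_*` column names predictable.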

### 2. Constructing Lag Features
Historical performance is a strong indicator of future yield. We define functions that compute the average crop yield over the previous 1, 3, and 6 years, strictly excluding the current year. These features are calculated per crop and per area.

In [3]:
def past_n_year_avg(group, n):
    """
    Calculates the average yield for the past n years for a given group.
    """
    res = []
    for i, row in group.iterrows():
        current_year = row['year']
        # Identify years strictly less than the current year
        past_years = list(range(current_year - n, current_year))
        avg = group.loc[group['year'].isin(past_years), 'label'].mean()
        res.append(avg)
    return pd.Series(res, index=group.index)

def prep_feature_crop_lag_1_3_6_by_type(df, crop_type):
    """
    Filters data by crop type and computes 1, 3, and 6-year lagged averages.
    """
    df = df[df['item'] == crop_type].copy()
    
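    # Extract the calendar year as an integer. This assumes 'year' holds dates or
    # date-like strings; plain integer years would be misread as nanosecond timestamps.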
    df['year'] = pd.to_datetime(df['year']).dt.year
    df = df.sort_values(['area', 'item', 'year'])
    
    # Compute averages per area
    df[f'avg_yield_{crop_type}_1y'] = df.groupby(['area', 'item'], group_keys=False).apply(lambda g: past_n_year_avg(g, 1))
    df[f'avg_yield_{crop_type}_3y'] = df.groupby(['area', 'item'], group_keys=False).apply(lambda g: past_n_year_avg(g, 3))
    df[f'avg_yield_{crop_type}_6y'] = df.groupby(['area', 'item'], group_keys=False).apply(lambda g: past_n_year_avg(g, 6))
    
    # Return only the feature columns
    return df[['year', 'area', f'avg_yield_{crop_type}_1y', f'avg_yield_{crop_type}_3y', f'avg_yield_{crop_type}_6y']]
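
The row-by-row loop in `past_n_year_avg` can be slow on large groups. Below is a minimal vectorized sketch of the same idea, under the assumption that each (area, item) group has exactly one row per consecutive year with no gaps. The function name `past_n_year_avg_rolling` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def past_n_year_avg_rolling(df, n, value_col="label"):
    """Mean of the previous n rows per (area, item).

    Assumes one row per consecutive year in each group, so that
    "previous n rows" and "previous n years" coincide.
    """
    df = df.sort_values(["area", "item", "year"])
    return df.groupby(["area", "item"])[value_col].transform(
        # shift(1) excludes the current year; min_periods=1 allows partial windows
        lambda s: s.shift(1).rolling(n, min_periods=1).mean()
    )

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "area": ["A"] * 4 + ["B"] * 4,
    "item": ["wheat"] * 8,
    "year": [2000, 2001, 2002, 2003] * 2,
    "label": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
})
toy["avg_3y"] = past_n_year_avg_rolling(toy, 3)
print(toy)
```

If years can be missing within a group, the two approaches diverge: the loop averages over calendar years, while the rolling window averages over rows.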

In [4]:
# Iterate through all crops and generate lag features
dfs = []
for crop_type in crop_list:
    dfs.append(prep_feature_crop_lag_1_3_6_by_type(df=label_yield, crop_type=crop_type))

# Merge all crop features into a single dataframe
features_lag_yield = reduce(
    lambda left, right: pd.merge(left, right, how='left', on=['year', 'area']),
    dfs
)
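
One property of chaining `how='left'` merges is worth keeping in mind: only the (year, area) keys present in the first frame survive. The sketch below contrasts `left` and `outer` on two tiny, hypothetical frames. It illustrates the merge semantics and does not claim that either choice is right for this dataset.

```python
from functools import reduce
import pandas as pd

# Two hypothetical per-crop feature frames with partly different keys
maize = pd.DataFrame({"year": [2000, 2001], "area": ["A", "A"],
                      "avg_yield_maize_1y": [1.0, 2.0]})
wheat = pd.DataFrame({"year": [2001, 2002], "area": ["A", "A"],
                      "avg_yield_wheat_1y": [3.0, 4.0]})

def merge_all(frames, how):
    """Chain-merge a list of frames on (year, area)."""
    return reduce(
        lambda left, right: pd.merge(left, right, how=how, on=["year", "area"]),
        frames,
    )

left_merged = merge_all([maize, wheat], "left")    # keeps only maize's keys
outer_merged = merge_all([maize, wheat], "outer")  # keeps the union of keys
print(len(left_merged), len(outer_merged))
```

With `left`, the 2002 wheat row is dropped because the maize frame has no 2002 entry. With `outer`, it is kept and the maize column is filled with NaN.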

### 3. Weather Data Aggregation
We transform monthly weather data into seasonal (quarterly) and annual aggregates. Importantly, we shift the year by +1, so that the yield for Year X is paired with weather data from Year X-1. This simulates a real-world forecasting scenario in which the weather of the target year is still unknown.
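
The effect of the year shift can be seen on a toy example. The frames and values below are hypothetical. The point is only that after adding 1 to the weather year, a merge on (area, year) attaches the previous year's weather to each yield row.

```python
import pandas as pd

# Hypothetical annual rainfall and yield for a single area
weather = pd.DataFrame({"area": ["A", "A"], "year": [2000, 2001],
                        "sum_rain": [500.0, 650.0]})
yields = pd.DataFrame({"area": ["A", "A"], "year": [2001, 2002],
                       "label": [3.1, 3.4]})

# Shift the weather year forward: weather observed in 2000 is keyed as 2001
lagged = weather.assign(year=weather["year"] + 1)

merged = yields.merge(lagged, on=["area", "year"], how="left")
print(merged)
```

Here the 2001 yield row receives the rainfall observed in 2000, and the 2002 row receives the rainfall observed in 2001.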

In [5]:
def prep_rain_features_sum_lag1year(nasa_df):
    """
    Aggregates rainfall data into quarterly sums and shifts the year for predictive lag.
    """
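    # Work on a copy so the caller's frame is not mutated, and parse dates
    # here so the function does not depend on being called after the other one
    nasa_df = nasa_df.copy()
    nasa_df['date'] = pd.to_datetime(nasa_df['date'])
    nasa_df['year'] = nasa_df['date'].dt.year
    nasa_df['month'] = nasa_df['date'].dt.month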
    
    # Pivot to wide format: Years x Months
    features_rain = nasa_df.pivot_table(
        index=['area','year'],
        columns='month',
        values='rain'
    ).reset_index()
    
    # Rename columns to standard abbreviations
    month_map = {i: f'rain_{pd.Timestamp(1900,i,1).strftime("%b")}' for i in range(1,13)}
    features_rain = features_rain.rename(columns=month_map)
    
    # Create Lag-1: Weather from previous year predicts next year's yield
    features_rain['year'] = features_rain['year'] + 1
    
    # Standardize column names for processing
    features_rain.columns = ['area', 'year', 'rain_Jan', 'rain_Feb', 'rain_Mar', 'rain_Apr',
           'rain_May', 'rain_Jun', 'rain_Jul', 'rain_Aug', 'rain_Sep', 'rain_Oct',
           'rain_Nov', 'rain_Dec']
    
    # Calculate Quarterly Sums
    features_rain['sum_rain_1_3'] = features_rain[['rain_Jan','rain_Feb','rain_Mar']].sum(axis=1)
    features_rain['sum_rain_3_6'] = features_rain[['rain_Apr','rain_May','rain_Jun']].sum(axis=1)
    features_rain['sum_rain_6_9'] = features_rain[['rain_Jul','rain_Aug','rain_Sep']].sum(axis=1)
    features_rain['sum_rain_10_12'] = features_rain[['rain_Oct','rain_Nov','rain_Dec']].sum(axis=1)
    features_rain['sum_rain_1_12'] = features_rain.iloc[:, 2:14].sum(axis=1)
    
    return features_rain[['area', 'year', 'sum_rain_1_3', 'sum_rain_3_6', 'sum_rain_6_9', 'sum_rain_10_12', 'sum_rain_1_12']]
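
To make the pivot-then-sum step concrete, here is a minimal sketch on one year of made-up monthly rainfall for a single area. The values are hypothetical, and the sketch assumes exactly one record per area and month. Note that `pivot_table` averages by default, so with several records per month each monthly cell would be a mean rather than a total.

```python
import pandas as pd

# One hypothetical year of monthly rainfall: 1.0 in Jan, 2.0 in Feb, ..., 12.0 in Dec
monthly = pd.DataFrame({
    "area": ["A"] * 12,
    "date": pd.date_range("2000-01-01", periods=12, freq="MS"),
    "rain": [float(m) for m in range(1, 13)],
})
monthly["year"] = monthly["date"].dt.year
monthly["month"] = monthly["date"].dt.month

# Wide format: one row per (area, year), one column per month number
wide = monthly.pivot_table(
    index=["area", "year"], columns="month", values="rain"
).reset_index()

q1_sum = wide[[1, 2, 3]].sum(axis=1)                 # Jan + Feb + Mar
annual_sum = wide[list(range(1, 13))].sum(axis=1)    # all twelve months
print(q1_sum.iloc[0], annual_sum.iloc[0])
```

Selecting the month columns by label, as done here, avoids relying on column positions.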

In [6]:
def prep_monthly_features_avg_nasa_lag1year(nasa_df, var_list=('rain', 'solar', 'temp')):
    """
    Computes quarterly averages for multiple weather variables, applying a 1-year lag.
    """
    nasa_df['date'] = pd.to_datetime(nasa_df['date'])
    nasa_df['year'] = nasa_df['date'].dt.year
    nasa_df['month'] = nasa_df['date'].dt.month
    
    all_features = None
    
    for var in var_list:
        # Pivot table for the specific variable
        df_pivot = nasa_df.pivot_table(
            index=['area','year'],
            columns='month',
            values=var
        ).reset_index()
        
        month_map = {i: f'{var}_{pd.Timestamp(1900,i,1).strftime("%b")}' for i in range(1,13)}
        df_pivot = df_pivot.rename(columns=month_map)
        
        # Apply Lag
        df_pivot['year'] = df_pivot['year'] + 1
        
        # Add any missing month as NaN so every monthly column exists
        months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
        for month in months:
            col = f'{var}_{month}'
            if col not in df_pivot.columns:
                df_pivot[col] = np.nan
        
        # Compute Quarterly Averages
        df_pivot[f'avg_{var}_1_3'] = df_pivot[[f'{var}_{m}' for m in ['Jan','Feb','Mar']]].mean(axis=1)
        df_pivot[f'avg_{var}_3_6'] = df_pivot[[f'{var}_{m}' for m in ['Apr','May','Jun']]].mean(axis=1)
        df_pivot[f'avg_{var}_6_9'] = df_pivot[[f'{var}_{m}' for m in ['Jul','Aug','Sep']]].mean(axis=1)
        df_pivot[f'avg_{var}_10_12'] = df_pivot[[f'{var}_{m}' for m in ['Oct','Nov','Dec']]].mean(axis=1)
        # Select month columns by name: a month added above sits at the end of the
        # frame, so positional slicing could pick up the wrong columns
        df_pivot[f'avg_{var}_1_12'] = df_pivot[[f'{var}_{m}' for m in months]].mean(axis=1)
        
        # Merge variables iteratively
        if all_features is None:
            all_features = df_pivot
        else:
            all_features = all_features.merge(df_pivot, on=['area','year'], how='outer')
    
    return all_features

In [7]:
# Process weather features
features_avg_nasa_all_lag1year = prep_monthly_features_avg_nasa_lag1year(nasa_df, var_list=['rain' ,'solar','temp'])
features_sum_nasa_rain_lag1year = prep_rain_features_sum_lag1year(nasa_df)

# Combine averages and sums
nasa_f = features_avg_nasa_all_lag1year.merge(
    features_sum_nasa_rain_lag1year, on=['year', 'area'], how='inner'
)

### 4. Integrating Geospatial Data
We incorporate latitude and longitude data to allow the model to learn from spatial relationships. The area names are cleaned to match the primary dataset keys.

In [9]:
# Load geospatial data (country coordinates CSV in the Data folder)
latlong = pd.read_csv('Data/coordinates_countries_full_209.csv')

# Clean and standardize formatting
latlong['area'] = latlong['Area'].str.replace(' ', '_')
latlong = latlong[['area', 'latitude', 'longitude']]

# Display sample to verify structure
latlong.head()

|   | area | latitude | longitude |
|---|---|---|---|
| 0 | Albania | 41.33 | 19.82 |
| 1 | Algeria | 28.03 | 1.66 |
| 2 | Angola | -11.2 | 17.87 |
| 3 | Argentina | -38.42 | -63.62 |
| 4 | Armenia | 40.07 | 45.04 |
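
Because the coordinates are joined on the cleaned `area` string, any name that differs between the two sources silently produces missing latitude and longitude. A minimal way to surface such mismatches is a left merge with `indicator=True`, sketched below on hypothetical frames. `Cabo_Verde` is an invented example of an unmatched name.

```python
import pandas as pd

# Hypothetical feature keys and coordinate table
features = pd.DataFrame({"area": ["Albania", "Algeria", "Cabo_Verde"]})
coords = pd.DataFrame({
    "area": ["Albania", "Algeria"],
    "latitude": [41.33, 28.03],
    "longitude": [19.82, 1.66],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
check = features.merge(coords, on="area", how="left", indicator=True)
unmatched = sorted(check.loc[check["_merge"] == "left_only", "area"].unique())
print(unmatched)
```

An empty list would mean that every area in the feature table found coordinates.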


### 5. Final Merging and Export
We merge the yield lag features, the weather features, and the geospatial data into a single DataFrame. We filter for years starting from 1983 to ensure data consistency and save the result.

In [10]:
# Merge Yield Lags with Weather Data
x_features = features_lag_yield.merge(
    nasa_f, on=['year', 'area'], how='left'
)

# Merge with Geospatial Data
x_features = x_features.merge(
    latlong, on=['area'], how='left'
)

# Filter data to relevant years (1983 onwards)
x_features = x_features[x_features['year'] >= 1983]

# Save to Parquet
x_features.to_parquet('Parquet/x_features.parquet')

# Output shape for verification
print(f"Final X features shape: {x_features.shape}")

Final X features shape: (6458, 105)
