# Exploratory Data Analysis (Part 2)

## Objective
This notebook explores the cleaned datasets generated in Part 1:
1. **`label_yield.parquet`**: Annual crop yield data.
2. **`nasa_df.parquet`**: Monthly weather data (Rain, Solar Radiation, Temperature).

We aim to understand the data distributions, trends over time, and the relationship between weather variables and crop yields, with a specific focus on **Cereals (Barley, Wheat, Rice, Maize)**.

### 1. Import Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, r2_score

# Set plot style for better readability
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

In [2]:
# Load the processed data
yield_df = pd.read_parquet('Parquet/label_yield.parquet')
nasa_df = pd.read_parquet('Parquet/nasa_df.parquet')

print("Yield Data Shape:", yield_df.shape)
print("Weather Data Shape:", nasa_df.shape)

### 2. Analysis of Crop Yield Data
First, we compare average yields across crops over the three most recent years in the dataset to see how widely yield levels differ from one crop to another.

In [3]:
# 1. Identify the last 3 years in the dataset for a recent snapshot
max_year = pd.to_datetime(yield_df['year']).dt.year.max()
recent_years = [max_year, max_year-1, max_year-2]

# 2. Filter for recent data
recent_df = yield_df[pd.to_datetime(yield_df['year']).dt.year.isin(recent_years)]

# 3. Calculate average yield per crop
avg_yield_by_crop = recent_df.groupby('item')['label'].mean().reset_index()

# 4. Sort from highest to lowest
avg_yield_by_crop = avg_yield_by_crop.sort_values(by='label', ascending=False).head(20)

# 5. Plot
plt.figure(figsize=(12, 10))
sns.barplot(x='label', y='item', data=avg_yield_by_crop, palette='viridis')
plt.title(f'Average Crop Yield (Top 20 Crops, {min(recent_years)}-{max_year})')
plt.xlabel('Average Yield (kg/ha)')
plt.ylabel('Crop')
plt.show()
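
The cell above derives the calendar year with `pd.to_datetime(yield_df['year']).dt.year`. That works when `year` holds dates or date-like strings. If Part 1 stored plain integers such as `2020`, however, `pd.to_datetime` reads them as nanoseconds since the Unix epoch and every row comes out as 1970. The notebook does not state the dtype of `year`, so the following is only a minimal sketch of a defensive helper; `to_year_number` is a hypothetical name, not something defined in Part 1.

```python
import pandas as pd

def to_year_number(col: pd.Series) -> pd.Series:
    """Return the calendar year for a column holding ints, datetimes, or date strings."""
    if pd.api.types.is_integer_dtype(col):
        # Already a year number such as 2020; pd.to_datetime would read it as nanoseconds
        return col
    if pd.api.types.is_datetime64_any_dtype(col):
        return col.dt.year
    # Date-like strings such as "2020-01-01"
    return pd.to_datetime(col.astype(str)).dt.year

# Toy inputs (illustrative values only)
as_int = pd.Series([2019, 2020, 2021])
as_str = pd.Series(["2019-01-01", "2020-01-01", "2021-01-01"])
as_dt = pd.to_datetime(as_str)

print(to_year_number(as_int).tolist())
print(to_year_number(as_str).tolist())
print(to_year_number(as_dt).tolist())

# The pitfall: integer years parsed as nanoseconds since 1970-01-01
print(pd.to_datetime(as_int).dt.year.tolist())
```

Checking `yield_df['year'].dtype` once after loading is usually enough to decide whether such a helper is needed.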

### 3. Focus on Cereals
We will now filter the dataset to focus exclusively on the specified cereals: **Barley, Wheat, Rice, and Maize (Corn)**.

In [4]:
# Define the target cereals using regex for flexibility (e.g. 'Maize (corn)')
target_pattern = r"Barley|Wheat|Rice|Maize"

# Filter data (na=False treats missing item names as non-matches instead of raising)
cereals_df = yield_df[yield_df['item'].str.contains(target_pattern, case=False, na=False)].copy()

print("Cereals Data Shape:", cereals_df.shape)
print("Unique Crops found:", cereals_df['item'].unique())
print("Unique Areas producing Cereals:", cereals_df['area'].nunique())

cereals_df.head()
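
Because `str.contains` matches substrings and `case=False` is set, the pattern `Wheat` would also match an item such as `Buckwheat` if the dataset contains one. The printed list of unique crops shows whether that happens here. Below is a minimal sketch of a stricter filter using word boundaries; the item names are hypothetical and serve only to illustrate the difference.

```python
import pandas as pd

# Hypothetical item names, for illustration only
items = pd.Series(["Wheat", "Buckwheat", "Maize (corn)", "Rice", "Barley", "Oats", None])

loose_pattern = r"Barley|Wheat|Rice|Maize"
# \b requires a word boundary on both sides; (?:...) is a non-capturing group
strict_pattern = r"\b(?:Barley|Wheat|Rice|Maize)\b"

loose = items[items.str.contains(loose_pattern, case=False, na=False)]
strict = items[items.str.contains(strict_pattern, case=False, na=False)]

print("Loose :", loose.tolist())
print("Strict:", strict.tolist())
```

The strict pattern still accepts names with qualifiers, such as `Maize (corn)`, because the parenthesis and space count as word boundaries.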

#### 3.1 Cereal Yield Trends Over Time
We plot, for each cereal and year, the unweighted mean yield across all areas to see how yields have changed over time.

In [5]:
# Extract year number for plotting
cereals_df['year_num'] = pd.to_datetime(cereals_df['year']).dt.year

# Aggregate global average yield per crop per year
global_trends = cereals_df.groupby(['item', 'year_num'])['label'].mean().reset_index()

plt.figure(figsize=(14, 6))
sns.lineplot(x='year_num', y='label', hue='item', data=global_trends, marker='o')
plt.title('Global Yield Trends: Major Cereals')
plt.xlabel('Year')
plt.ylabel('Average Yield (kg/ha)')
plt.legend(title='Crop')
plt.show()
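
A line plot shows the direction of the trend but not its size. One way to put a number on it is the least-squares slope of mean yield against year, computed per crop. The sketch below assumes a frame shaped like `global_trends` (columns `item`, `year_num`, `label`); `yield_slope_per_crop` is a hypothetical helper and the data are synthetic.

```python
import numpy as np
import pandas as pd

def yield_slope_per_crop(trends: pd.DataFrame) -> pd.Series:
    """Least-squares slope of mean yield against year, one value per crop."""
    def slope(group: pd.DataFrame) -> float:
        # Centre the years for numerical stability; the slope is unchanged
        x = group["year_num"] - group["year_num"].mean()
        # polyfit with degree 1 returns [slope, intercept]
        return float(np.polyfit(x, group["label"], 1)[0])

    slopes = {item: slope(g) for item, g in trends.groupby("item")}
    return pd.Series(slopes, name="slope_per_year")

# Synthetic stand-in for global_trends (values are made up)
toy = pd.DataFrame({
    "item": ["Wheat"] * 4 + ["Rice"] * 4,
    "year_num": [2000, 2001, 2002, 2003] * 2,
    "label": [100.0, 110.0, 120.0, 130.0, 200.0, 205.0, 210.0, 215.0],
})
slopes = yield_slope_per_crop(toy)
print(slopes)
```

The slope is expressed in yield units per year, so it is directly comparable across crops only when they share the same unit.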

### 4. Weather Data Analysis
We briefly inspect the weather data distributions.

In [6]:
# Quick distribution check of weather variables
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.histplot(nasa_df['temp'].dropna(), bins=30, kde=True, ax=axes[0], color='orange')
axes[0].set_title('Global Temperature Distribution')

sns.histplot(nasa_df['rain'].dropna(), bins=30, kde=True, ax=axes[1], color='blue')
axes[1].set_title('Global Rainfall Distribution')

sns.histplot(nasa_df['solar'].dropna(), bins=30, kde=True, ax=axes[2], color='red')
axes[2].set_title('Global Solar Radiation Distribution')

plt.tight_layout()
plt.show()
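
Histograms are easy to misread when a variable is heavily skewed or has many missing values. A short numeric summary can complement them. The sketch below assumes the column names `temp`, `rain`, and `solar` used above; `summarize_weather` is a hypothetical helper and the data are made up.

```python
import pandas as pd

def summarize_weather(df: pd.DataFrame, cols=("temp", "rain", "solar")) -> pd.DataFrame:
    """Descriptive statistics plus missing-value count and skewness per column."""
    cols = list(cols)
    summary = df[cols].describe().T           # one row per variable
    summary["missing"] = df[cols].isna().sum()
    summary["skew"] = df[cols].skew()         # > 0 means a long right tail
    return summary

# Made-up values, for illustration only
toy = pd.DataFrame({
    "temp": [10.0, 12.0, 14.0, None],
    "rain": [0.0, 1.0, 2.0, 30.0],
    "solar": [5.0, 5.5, 6.0, 6.5],
})
result = summarize_weather(toy)
print(result)
```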

### 5. Feature Engineering Support

#### 5.1 Autocorrelation (Lag Analysis)
**Hypothesis:** The yield of a specific crop in the current year ($t$) is highly correlated with its yield in the previous year ($t-1$).

*Note: We must group by both `area` and `item` to ensure we don't shift data between different crops.*

In [7]:
# Create a temporary dataframe to calculate lags
lag_analysis = cereals_df.sort_values(['area', 'item', 'year_num']).copy()

# Calculate Previous Year Yield (Lag 1) grouped by Area AND Crop
lag_analysis['yield_lag_1'] = lag_analysis.groupby(['area', 'item'])['label'].shift(1)

# Filter for plot (Year >= 2010) for cleaner visualization
plot_data = lag_analysis[lag_analysis['year_num'] >= 2010]

plt.figure(figsize=(8, 8))
sns.scatterplot(x='yield_lag_1', y='label', hue='item', data=plot_data, alpha=0.3)
plt.plot([0, 15000], [0, 15000], color='red', linestyle='--') # Reference line
plt.title('Autocorrelation: Year(t) vs Year(t-1) by Cereal')
plt.xlabel('Yield Year (t-1)')
plt.ylabel('Yield Year (t)')
plt.legend(title='Crop')
plt.show()

# Calculate correlation score
corr_score = lag_analysis['label'].corr(lag_analysis['yield_lag_1'])
print(f"Overall Correlation between Year(t) and Year(t-1): {corr_score:.4f}")
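
The overall score pools every area and crop, so differences in yield level between crops and regions contribute to it alongside genuine year-to-year persistence. Computing the same correlation separately per crop gives a more conservative view. The following is a minimal sketch on synthetic data; `lag1_corr_by_crop` is a hypothetical helper that assumes the columns `area`, `item`, `year_num`, and `label`.

```python
import pandas as pd

def lag1_corr_by_crop(df: pd.DataFrame) -> pd.Series:
    """Correlation between yield(t) and yield(t-1), computed separately per crop."""
    df = df.sort_values(["area", "item", "year_num"]).copy()
    # Shift within each (area, item) series so values never cross groups
    df["yield_lag_1"] = df.groupby(["area", "item"])["label"].shift(1)
    corrs = {item: g["label"].corr(g["yield_lag_1"]) for item, g in df.groupby("item")}
    return pd.Series(corrs, name="lag1_corr")

# Synthetic data: wheat rises steadily in two areas, rice alternates in one
toy = pd.DataFrame({
    "area": ["A"] * 4 + ["B"] * 4 + ["A"] * 4,
    "item": ["Wheat"] * 8 + ["Rice"] * 4,
    "year_num": [2000, 2001, 2002, 2003] * 3,
    "label": [100.0, 110.0, 120.0, 130.0,
              300.0, 310.0, 320.0, 330.0,
              50.0, 60.0, 50.0, 60.0],
})
corr_by_crop = lag1_corr_by_crop(toy)
print(corr_by_crop)
```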

#### 5.2 Rolling Averages (Trend Analysis)
We check how well the mean yield of the three preceding years ($t-1$, $t-2$, $t-3$) predicts the yield in year $t$, using R² as the score.

In [8]:
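# 3-year lagged moving average (t-1, t-2, t-3), computed within each (area, item) group.
# Chaining .rolling() after the grouped .shift() would roll across group boundaries,
# so both steps run inside transform().
lag_analysis['MA_3_lag'] = (
    lag_analysis.groupby(['area', 'item'])['label']
    .transform(lambda s: s.shift(1).rolling(window=3).mean())
)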

# Drop NA for metrics
valid = lag_analysis.dropna(subset=['MA_3_lag'])

# Compute R2
r2 = r2_score(valid['label'], valid['MA_3_lag'])
print(f"R² (Actual vs MA_3_lag): {r2:.4f}")
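
An R² value is easier to interpret next to a baseline. A natural one is the previous-year yield from Section 5.1, scored on the same rows. The sketch below computes R² by hand (1 minus the ratio of residual to total sum of squares, the same definition `r2_score` uses) so that it runs without scikit-learn. `compare_lag_predictors` is a hypothetical helper and the data are synthetic.

```python
import pandas as pd

def r2(y_true: pd.Series, y_pred: pd.Series) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = float(((y_true - y_pred) ** 2).sum())
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def compare_lag_predictors(df: pd.DataFrame) -> dict:
    """Score lag-1 yield and the lagged 3-year mean on identical rows."""
    df = df.sort_values(["area", "item", "year_num"]).copy()
    grouped = df.groupby(["area", "item"])["label"]
    df["lag_1"] = grouped.shift(1)
    # Shift and roll inside each group so windows never span two series
    df["ma_3_lag"] = grouped.transform(lambda s: s.shift(1).rolling(window=3).mean())
    valid = df.dropna(subset=["lag_1", "ma_3_lag"])
    return {
        "n_rows": len(valid),
        "r2_lag_1": r2(valid["label"], valid["lag_1"]),
        "r2_ma_3_lag": r2(valid["label"], valid["ma_3_lag"]),
    }

# Synthetic data: two areas with steadily rising wheat yields
toy = pd.DataFrame({
    "area": ["A"] * 6 + ["B"] * 6,
    "item": ["Wheat"] * 12,
    "year_num": list(range(2000, 2006)) * 2,
    "label": [100.0, 110.0, 120.0, 130.0, 140.0, 150.0,
              200.0, 210.0, 220.0, 230.0, 240.0, 250.0],
})
scores = compare_lag_predictors(toy)
print(scores)
```

In this toy series yields rise every year, so the moving average trails further behind than the single lag and scores lower. Real data with noisy year-to-year swings may behave differently.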

### 6. Correlation Analysis: Weather vs. Cereal Yields
We aggregate the monthly weather data to annual values per area (total rainfall, mean temperature, mean solar radiation) and merge them with the cereals dataset to check correlations with yield.

In [9]:
# 1. Extract year from weather data
nasa_df['year'] = pd.to_datetime(nasa_df['date']).dt.year

# 2. Aggregate weather by Year and Area
weather_annual = nasa_df.groupby(['area', 'year']).agg({
    'rain': 'sum',   # Annual total of the monthly rain values
    'temp': 'mean',  # Annual mean temperature
    'solar': 'mean'  # Annual mean solar radiation
}).reset_index()

# 3. Merge with Cereals Data
cereals_merge = cereals_df[['area', 'item', 'year_num', 'label']].rename(columns={'year_num': 'year'})
merged_df = pd.merge(cereals_merge, weather_annual, on=['area', 'year'], how='inner')

# 4. Plot Correlation Heatmap
plt.figure(figsize=(8, 6))
corr_matrix = merged_df[['label', 'rain', 'temp', 'solar']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation: Cereal Yields vs Weather Variables')
plt.show()
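
An inner merge silently drops yield rows whose `area` and `year` have no weather match, for example when the two sources spell an area name differently. Whether that happens here depends on how Part 1 harmonised the names, which this notebook does not show. Below is a minimal sketch of a coverage check using a left merge with `indicator=True`; `merge_coverage` is a hypothetical helper and the area names are made up.

```python
import pandas as pd

def merge_coverage(yields: pd.DataFrame, weather: pd.DataFrame) -> dict:
    """Count yield rows with and without a weather match on (area, year)."""
    keys = weather[["area", "year"]].drop_duplicates()
    checked = yields.merge(keys, on=["area", "year"], how="left", indicator=True)
    unmatched = checked["_merge"] == "left_only"
    return {
        "matched": int((checked["_merge"] == "both").sum()),
        "unmatched": int(unmatched.sum()),
        "unmatched_areas": sorted(checked.loc[unmatched, "area"].unique()),
    }

# Made-up frames: "Area B" is spelled "Area-B" in the weather data
toy_yields = pd.DataFrame({
    "area": ["Area A", "Area A", "Area B"],
    "year": [2020, 2021, 2020],
    "label": [3000.0, 3100.0, 2500.0],
})
toy_weather = pd.DataFrame({
    "area": ["Area A", "Area A", "Area-B"],
    "year": [2020, 2021, 2020],
    "rain": [800.0, 750.0, 400.0],
})
coverage = merge_coverage(toy_yields, toy_weather)
print(coverage)
```

Running such a check with `cereals_merge` and `weather_annual` before the inner merge would show how much of the cereals data the correlation heatmap actually covers.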

## Summary
1.  **Cereals Selected:** Wheat, Rice, Maize, and Barley.
2.  **Trends:** Yields for these crops generally trend upwards, but with distinct patterns per crop.
3.  **Predictability:** Lagged features (previous year yield) remain highly predictive for all cereals.