# Exploratory Data Analysis (Part 2)

## Objective
This notebook explores the cleaned data to support the modeling in Part 5.
The goal is to check whether our features (Weather, Previous Yield, Region) have a measurable relationship with the target (Yield).

**Key Questions:**
1. Is last year's yield related to this year's yield? (Validates Baseline Model)
2. Does weather (Rain/Temp) correlate with yield? (Validates Weather Features)
3. Does location matter? (Validates Country/Region Features)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Set nicer styles
sns.set_theme(style="whitegrid")

In [None]:
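# Load the data (assumes these files were written by Part 1)
try:
    df_yield = pd.read_parquet('label_yield.parquet')
    df_weather = pd.read_parquet('nasa_df.parquet')
    print("Data loaded successfully.")
except FileNotFoundError as err:
    # Stop here: every later cell depends on df_yield, so continuing would only
    # produce a confusing NameError further down.
    raise FileNotFoundError("Could not load files. Please ensure Part 1 was run.") from err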

### 1. Prepare Cereal Data
We focus on Cereals (Maize, Wheat, Rice, Barley).

In [None]:
# Filter for Cereals
target_crops = ['Maize', 'Wheat', 'Rice', 'Barley']
df_cereals = df_yield[df_yield['Crop'].isin(target_crops)].copy()

# Weather columns (simplified for EDA)
# No merge with df_weather is performed in this cell. The rest of the notebook
# assumes df_cereals already carries 'Rain_mm' and 'Temp_Celsius' columns;
# the correlation cell in Section 4 skips any of these columns that are missing.

print("Top 5 Rows of Cereal Data:")
display(df_cereals.head())
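
The cell above assumes the weather columns are already present on `df_cereals`. If they are not, a merge along the following lines could attach them. This is only a sketch on toy data: the join keys `Country` and `Year` on the weather side, and the presence of `Rain_mm` and `Temp_Celsius` in the weather table, are assumptions that should be checked against the real `df_weather`.

```python
import pandas as pd

# Toy stand-ins for df_cereals and df_weather (values are made up for illustration).
cereals_demo = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2000, 2001, 2000],
    "Crop": ["Maize", "Maize", "Wheat"],
    "Yield (kg/ha)": [3000.0, 3200.0, 2500.0],
})
weather_demo = pd.DataFrame({
    "Country": ["A", "A"],          # assumed key column
    "Year": [2000, 2001],           # assumed key column
    "Rain_mm": [800.0, 750.0],
    "Temp_Celsius": [21.5, 22.0],
})

# A left merge keeps every yield row. validate="m:1" raises if the weather table
# has duplicate (Country, Year) keys, which would otherwise duplicate yield rows.
merged_demo = cereals_demo.merge(
    weather_demo,
    on=["Country", "Year"],
    how="left",
    validate="m:1",
    indicator=True,
)

# Count yield rows that found no matching weather record.
n_unmatched = int((merged_demo["_merge"] == "left_only").sum())
print(f"Rows without weather data: {n_unmatched} of {len(merged_demo)}")
```

Reporting the unmatched count makes it visible when country names or year ranges differ between the two sources.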

### 2. Distribution of Yields (The Target)
We check whether each crop's yield distribution is roughly symmetric or skewed.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(data=df_cereals, x='Yield (kg/ha)', hue='Crop', element='step', bins=30)
plt.title('Distribution of Crop Yields')
plt.show()

# Text output: summary statistics per crop
print("Yield Statistics by Crop:")
print(df_cereals.groupby('Crop')['Yield (kg/ha)'].describe()[['mean', 'std', 'min', 'max']])
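
The histogram gives a visual impression of skew; a numeric check can back it up. The sketch below uses synthetic data (one deliberately right-skewed crop, one roughly symmetric crop) rather than the real yields, and shows how per-crop skewness and the effect of a log transform could be computed with pandas.

```python
import numpy as np
import pandas as pd

# Synthetic yields for illustration only: not the project's data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Crop": ["Maize"] * 500 + ["Wheat"] * 500,
    "Yield (kg/ha)": np.concatenate([
        rng.lognormal(mean=8.0, sigma=0.6, size=500),  # right-skewed
        rng.normal(loc=3000, scale=300, size=500),     # roughly symmetric
    ]),
})

# Sample skewness per crop: values near 0 suggest symmetry,
# clearly positive values suggest a long right tail.
skew_by_crop = demo.groupby("Crop")["Yield (kg/ha)"].skew()
print("Skewness by crop:")
print(skew_by_crop.round(2))

# Skewness after a log transform (requires strictly positive yields).
log_skew = np.log(demo["Yield (kg/ha)"]).groupby(demo["Crop"]).skew()
print("Skewness of log-yield by crop:")
print(log_skew.round(2))
```

On the real data, the same two lines can be applied to `df_cereals` directly.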

### 3. Autocorrelation: Yield vs. Last Year's Yield
**Why this matters for Part 5:** 
The Part 5 model uses a "Naive Baseline" and lag features. We need to check whether `Yield_Year_T` is strongly correlated with `Yield_Year_T-1`.

In [None]:
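# Create a temporary lag feature.
# shift(1) takes the previous available row within each Country/Crop group,
# which is the previous year only if no years are missing.
df_cereals = df_cereals.sort_values(['Country', 'Crop', 'Year'])
df_cereals['Yield_Last_Year'] = df_cereals.groupby(['Country', 'Crop'])['Yield (kg/ha)'].shift(1)

# Drop the first row of each group, which has no previous value
df_lag = df_cereals.dropna(subset=['Yield_Last_Year'])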

# Text output: correlation score (pooled across all countries and crops,
# so it also reflects stable differences in yield level between groups)
correlation = df_lag['Yield (kg/ha)'].corr(df_lag['Yield_Last_Year'])
print(f"\nCORRELATION SCORE (Current vs Last Year): {correlation:.4f}")
print("Interpretation: A high score (> 0.8) supports using previous history as a feature.\n")

# Plot
plt.figure(figsize=(8, 8))
sns.scatterplot(data=df_lag, x='Yield_Last_Year', y='Yield (kg/ha)', hue='Crop', alpha=0.6)
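# Reference line y = x: points on it have the same yield as the previous year
max_val = max(df_lag['Yield (kg/ha)'].max(), df_lag['Yield_Last_Year'].max())
plt.plot([0, max_val], [0, max_val], 'r--', label='y = x (no change)')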
plt.title(f'Autocorrelation: Yield vs Previous Year (Corr: {correlation:.2f})')
plt.legend()
plt.show()
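
A correlation pooled over all countries and crops can be high simply because some groups always yield more than others. The sketch below, on synthetic data with no true year-to-year persistence, contrasts the pooled correlation with a within-group correlation computed after subtracting each group's mean. The data and the demeaning approach are illustrative assumptions, not a step the notebook currently performs.

```python
import numpy as np
import pandas as pd

# Synthetic panel: each country has its own yield level plus independent
# yearly noise, so there is no real persistence within a country.
rng = np.random.default_rng(1)
rows = []
for country, level in [("A", 2000.0), ("B", 5000.0), ("C", 8000.0)]:
    for year in range(1990, 2020):
        rows.append({
            "Country": country,
            "Crop": "Maize",
            "Year": year,
            "Yield (kg/ha)": level + rng.normal(0, 200),
        })
demo = pd.DataFrame(rows).sort_values(["Country", "Crop", "Year"])

keys = ["Country", "Crop"]
demo["Yield_Last_Year"] = demo.groupby(keys)["Yield (kg/ha)"].shift(1)
demo = demo.dropna(subset=["Yield_Last_Year"])

# Pooled correlation: dominated by level differences between countries.
pooled_corr = demo["Yield (kg/ha)"].corr(demo["Yield_Last_Year"])

# Within-group correlation: remove each group's mean from both columns first.
cur_dm = demo["Yield (kg/ha)"] - demo.groupby(keys)["Yield (kg/ha)"].transform("mean")
lag_dm = demo["Yield_Last_Year"] - demo.groupby(keys)["Yield_Last_Year"].transform("mean")
within_corr = cur_dm.corr(lag_dm)

print(f"Pooled correlation:       {pooled_corr:.3f}")
print(f"Within-group correlation: {within_corr:.3f}")
```

If the within-group value on the real data stays high, the lag feature carries information beyond what the Country and Crop identifiers already provide.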

### 4. Weather Correlation Matrix
**Why this matters for Part 5:** 
We are using Rain and Temp as features. We need to see whether they relate linearly to Yield, or whether the relationship is more complex (which tree-based models such as XGBoost can capture).

In [None]:
# Select numeric columns
cols_to_check = ['Yield (kg/ha)', 'Rain_mm', 'Temp_Celsius', 'Pesticides_tonnes']
# Ensure these columns exist before plotting
available_cols = [c for c in cols_to_check if c in df_cereals.columns]

if len(available_cols) > 1:
    corr_matrix = df_cereals[available_cols].corr()
    
    # Text Output
    print("Correlation Matrix Table:")
    print(corr_matrix)
    
    # Heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title('Feature Correlation Matrix')
    plt.show()
else:
    print("Fewer than two of the expected columns are present; skipping the correlation matrix.")
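
A correlation matrix only captures linear (Pearson) association, so a near-zero entry does not rule out a relationship. The sketch below uses a hypothetical inverted-U response of yield to temperature (synthetic data, made-up coefficients) to show how both Pearson and rank (Spearman) correlation can sit near zero while binned means still reveal a clear pattern.

```python
import numpy as np
import pandas as pd

# Synthetic data: yield peaks at 20 °C and falls off on either side.
rng = np.random.default_rng(2)
temp = rng.uniform(10, 30, size=600)
yield_kg = 5000 - 20 * (temp - 20) ** 2 + rng.normal(0, 200, size=600)
demo = pd.DataFrame({"Temp_Celsius": temp, "Yield (kg/ha)": yield_kg})

# Pearson measures linear association.
pearson = demo["Temp_Celsius"].corr(demo["Yield (kg/ha)"])

# Spearman is the Pearson correlation of the ranks (monotonic association).
spearman = demo["Temp_Celsius"].rank().corr(demo["Yield (kg/ha)"].rank())

# Mean yield per temperature bin exposes the non-monotonic shape.
temp_bins = pd.cut(demo["Temp_Celsius"], bins=[10, 15, 20, 25, 30])
binned_means = demo.groupby(temp_bins, observed=True)["Yield (kg/ha)"].mean()

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
print(binned_means.round(0))
```

Applying the same binning to `Rain_mm` and `Temp_Celsius` on the real data is a quick way to see whether a weak correlation hides a non-linear effect.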

### 5. Geographic Analysis
**Why this matters for Part 5:** 
If yield varies heavily by location, we must include Country/Region as a categorical feature.

In [None]:
# Text output: top 10 countries by average yield
# (the mean is taken over all four crops and all years, so the crop mix affects the ranking)
top_countries = df_cereals.groupby('Country')['Yield (kg/ha)'].mean().sort_values(ascending=False).head(10)
print("Top 10 Countries by Average Yield (kg/ha):")
print(top_countries)

# Filter to the most recent year for the map
recent_year = df_cereals['Year'].max()
df_map = df_cereals[df_cereals['Year'] == recent_year]

if not df_map.empty:
    fig = px.choropleth(
        df_map,
        locations='Country',
        locationmode='country names',
        color='Yield (kg/ha)',
        hover_name='Country',
        title=f'Global Yield Map ({recent_year})',
        color_continuous_scale='YlGn'
    )
    fig.update_layout(geo=dict(showframe=False, showcoastlines=True))
    fig.show()
else:
    print("No data available for map plotting.")
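
To put a number on "yield varies heavily by location", one option is the share of total yield variance explained by country means (eta squared from a one-way decomposition). The sketch below runs on synthetic data with made-up country levels; it is an illustrative assumption about how this could be quantified, not an analysis the notebook already contains.

```python
import numpy as np
import pandas as pd

# Synthetic data: three countries with different yield levels and equal noise.
rng = np.random.default_rng(3)
levels = {"A": 2000.0, "B": 4000.0, "C": 7000.0}
demo = pd.DataFrame({
    "Country": np.repeat(list(levels), 50),
    "Yield (kg/ha)": np.concatenate(
        [lvl + rng.normal(0, 500, size=50) for lvl in levels.values()]
    ),
})

y = demo["Yield (kg/ha)"]
# Each row gets the mean yield of its own country.
group_mean = demo.groupby("Country")["Yield (kg/ha)"].transform("mean")

# Total sum of squares and the part attributable to differences between countries.
ss_total = ((y - y.mean()) ** 2).sum()
ss_between = ((group_mean - y.mean()) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"Share of yield variance explained by country: {eta_squared:.2%}")
```

A value close to 1 would indicate that most variation is between countries, which supports including Country/Region as a categorical feature; computing it per crop avoids mixing crop and location effects.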