# Oilfield Reclamation Site Assessment Using AlphaEarth Foundations

**Objective:** Use Google's AlphaEarth Foundations (AEF) 64-dimensional embeddings to assess reclamation success at oilfield lease sites by comparing them with healthy regional cropland references.

**Methodology:**
1. Upload field boundary (arable land) and lease boundary polygons
2. Extract AAFC Annual Crop Inventory data to identify crop type per year (2017-2023)
3. Build regional "healthy reference" embeddings by sampling pixels of the same crop within about 15 km (the search widens to 30 km, then 50 km, if too few samples are found)
4. Compare lease embeddings vs regional reference and vs background field
5. Track recovery trajectory over time using cosine similarity

**Key Datasets:**
- AlphaEarth Foundations (AEF): `GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL` (64D embeddings, 10m resolution)
- AAFC Annual Crop Inventory: `AAFC/ACI` (30m resolution, 2009-2023)

**Site Location:** 50.30523°, -101.80618° (Saskatchewan, Canada)

## Setup and Authentication

In [None]:
# Import required libraries
import ee
import geemap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import cosine
import ipywidgets as widgets
from IPython.display import display, HTML
import json
import os
from io import BytesIO
import zipfile

In [None]:
# Authenticate and initialize Earth Engine
try:
    ee.Initialize()
    print("✓ Earth Engine initialized successfully")
except Exception:
    print("Authenticating with Earth Engine...")
    ee.Authenticate()
    ee.Initialize()
    print("✓ Earth Engine initialized successfully")

## 1. Upload Boundary Files

Upload your polygon files (KML, GeoJSON, or SHP/ZIP):
- **Field Boundary:** The clean agricultural area (quarter section minus non-arable areas)
- **Lease Boundary:** The disturbed oilfield lease site
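
Whatever the upload format, the cell below ends up reading the geometry of the first feature in a GeoJSON-style feature collection. A minimal sketch of that step using only the standard library; the `first_geometry` helper and the polygon coordinates are hypothetical and only illustrate the expected structure (longitude first, then latitude):

```python
import json

def first_geometry(geojson_text):
    """Return the geometry dict of the first feature in a GeoJSON FeatureCollection."""
    data = json.loads(geojson_text)
    features = data.get("features", [])
    if not features:
        raise ValueError("No features found")
    return features[0]["geometry"]

# Hypothetical lease polygon (GeoJSON uses lon, lat order)
example = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"name": "lease"},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[
                [-101.8067, 50.3048], [-101.8053, 50.3048],
                [-101.8053, 50.3057], [-101.8067, 50.3057],
                [-101.8067, 50.3048],  # ring closes on its first vertex
            ]],
        },
    }],
})

geom = first_geometry(example)
print(geom["type"], len(geom["coordinates"][0]))
```

If a file holds several polygons, only the first one is used, so multi-part boundaries should be merged before upload.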

In [None]:
# Initialize boundary variables
field_boundary = None
lease_boundary = None
site_center = None

# File upload widgets
field_upload = widgets.FileUpload(
    accept='.kml,.geojson,.json,.shp,.zip',
    multiple=False,
    description='Field Boundary',
    button_style='primary'
)

lease_upload = widgets.FileUpload(
    accept='.kml,.geojson,.json,.shp,.zip',
    multiple=False,
    description='Lease Boundary',
    button_style='warning'
)

process_button = widgets.Button(
    description='Process Files',
    button_style='success',
    tooltip='Click to process uploaded files',
    icon='check'
)

upload_status = widgets.HTML(value="<p>Please upload both boundary files, then click 'Process Files'.</p>")

# Function to process uploaded files and convert to ee.Geometry
def process_uploaded_file(upload_widget, name):
    """Process uploaded file and convert to ee.Geometry"""
    if not upload_widget.value:
        return None
    
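    value = upload_widget.value
    if isinstance(value, dict):
        # ipywidgets 7: dict keyed by filename
        filename, uploaded_file = next(iter(value.items()))
    else:
        # ipywidgets 8: tuple of dicts, each with a 'name' entry
        uploaded_file = value[0]
        filename = uploaded_file['name']
    # 'content' may be a memoryview in newer ipywidgets; normalise to bytes
    content = bytes(uploaded_file['content'])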
    
    # Save to temp file
    temp_path = f'/tmp/{filename}'
    with open(temp_path, 'wb') as f:
        f.write(content)
    
    try:
        # Handle different file types
        import geopandas as gpd

        # Match extensions case-insensitively (e.g. "LEASE.KML")
        ext = os.path.splitext(filename)[1].lower()
        if ext == '.kml':
            import fiona
            # KML is not enabled by default in Fiona; allow read-only access
            fiona.drvsupport.supported_drivers['KML'] = 'r'
            gdf = gpd.read_file(temp_path, driver='KML')
        elif ext in ('.geojson', '.json', '.shp'):
            # Note: a lone .shp lacks its sidecar files; prefer a zipped shapefile
            gdf = gpd.read_file(temp_path)
        elif ext == '.zip':
            gdf = gpd.read_file(f'zip://{temp_path}')
        else:
            raise ValueError(f"Unsupported file format: {filename}")
        
        # Ensure WGS84 projection
        if gdf.crs and gdf.crs.to_string() != 'EPSG:4326':
            gdf = gdf.to_crs('EPSG:4326')
        
        # Convert to GeoJSON
        geojson = json.loads(gdf.to_json())
        
        # Get first feature geometry
        if geojson['features']:
            geometry = geojson['features'][0]['geometry']
            ee_geom = ee.Geometry(geometry)
            print(f"✓ {name} loaded successfully")
            return ee_geom
        else:
            raise ValueError(f"No features found in {filename}")
    
    except Exception as e:
        print(f"✗ Error processing {name}: {str(e)}")
        return None
    finally:
        # Cleanup
        if os.path.exists(temp_path):
            os.remove(temp_path)

def on_process_click(b):
    """Handler for process button click"""
    global field_boundary, lease_boundary, site_center
    
    upload_status.value = "<p style='color:blue;'>Processing files...</p>"
    
    # Process uploads
    field_boundary = process_uploaded_file(field_upload, "Field Boundary")
    lease_boundary = process_uploaded_file(lease_upload, "Lease Boundary")
    
    if field_boundary and lease_boundary:
        # Get site centroid for reference
        site_center = field_boundary.centroid().coordinates().getInfo()
        print(f"Site center: {site_center[1]:.5f}°, {site_center[0]:.5f}°")
        upload_status.value = "<p style='color:green;'><b>✓ Both boundaries loaded successfully!</b><br>You can now proceed to the next steps.</p>"
    elif not field_upload.value and not lease_upload.value:
        upload_status.value = "<p style='color:red;'>Please upload both files first.</p>"
    elif not field_upload.value:
        upload_status.value = "<p style='color:red;'>Please upload field boundary file.</p>"
    elif not lease_upload.value:
        upload_status.value = "<p style='color:red;'>Please upload lease boundary file.</p>"
    else:
        upload_status.value = "<p style='color:red;'>Error processing files. Check the messages above.</p>"

process_button.on_click(on_process_click)

display(widgets.VBox([
    widgets.HTML("<h3>Upload Polygon Boundaries</h3>"),
    field_upload,
    lease_upload,
    process_button,
    upload_status
]))

## 2. Extract Crop History (AAFC Annual Crop Inventory)

Identify which crop was grown in the field in each year (2017-2023). The crop assigned to a year is the most common (modal) AAFC class across the field polygon.
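
To make the "most common class" idea concrete, here is a small local sketch of a mode calculation over pixel class codes. It only mirrors what a mode reducer does conceptually; the `dominant_crop` helper and the pixel codes are made up for illustration and do not come from the AAFC data.

```python
from collections import Counter

def dominant_crop(pixel_codes):
    """Return (most common class code, its share of all pixels)."""
    if not pixel_codes:
        raise ValueError("No pixels to summarise")
    counts = Counter(pixel_codes)
    code, n = counts.most_common(1)[0]
    return code, n / len(pixel_codes)

# Hypothetical field: 70 pixels of one class, 20 of another, 10 of a third
pixels = [1] * 70 + [2] * 20 + [3] * 10
code, share = dominant_crop(pixels)
print(f"Dominant class {code} covers {share:.0%} of the field")
```

A low share is a hint that the field was split between crops that year, in which case a single crop label (and the regional reference built from it) should be treated with caution.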

In [None]:
# AAFC crop classification lookup
# Source: https://agriculture.canada.ca/atlas/data_donnees/annualCropInventory/supportdocument_documentdesupport/
# Codes follow the AAFC/ACI 'landcover' legend; verify against the source above
# if a newer inventory release adds classes.
CROP_CLASSES = {
    0: 'Unknown',
    10: 'Cloud',
    20: 'Water',
    30: 'Exposed Land and Barren',
    34: 'Urban and Developed',
    35: 'Greenhouses',
    50: 'Shrubland',
    80: 'Wetland',
    85: 'Peatland',
    110: 'Grassland',
    120: 'Agriculture (undifferentiated)',
    122: 'Pasture and Forages',
    130: 'Too Wet to be Seeded',
    131: 'Fallow',
    132: 'Cereals',
    133: 'Barley',
    134: 'Other Grains',
    135: 'Millet',
    136: 'Oats',
    137: 'Rye',
    138: 'Spelt',
    139: 'Triticale',
    140: 'Wheat',
    141: 'Switchgrass',
    142: 'Sorghum',
    143: 'Quinoa',
    145: 'Winter Wheat',
    146: 'Spring Wheat',
    147: 'Corn',
    148: 'Tobacco',
    149: 'Ginseng',
    150: 'Oilseeds',
    151: 'Borage',
    152: 'Camelina',
    153: 'Canola and Rapeseed',
    154: 'Flaxseed',
    155: 'Mustard',
    156: 'Safflower',
    157: 'Sunflower',
    158: 'Soybeans',
    160: 'Pulses',
    161: 'Other Pulses',
    162: 'Peas',
    163: 'Chickpeas',
    167: 'Beans',
    168: 'Fababeans',
    174: 'Lentils',
    175: 'Vegetables',
    176: 'Tomatoes',
    177: 'Potatoes',
    178: 'Sugarbeets',
    179: 'Other Vegetables',
    180: 'Fruits',
    181: 'Berries',
    182: 'Blueberry',
    183: 'Cranberry',
    185: 'Other Berry',
    188: 'Orchards',
    189: 'Other Fruits',
    190: 'Vineyards',
    191: 'Hops',
    192: 'Sod',
    193: 'Herbs',
    194: 'Nursery',
    195: 'Buckwheat',
    196: 'Canaryseed',
    197: 'Hemp',
    198: 'Vetch',
    199: 'Other Crops',
    200: 'Forest (undifferentiated)',
    210: 'Coniferous',
    220: 'Broadleaf',
    230: 'Mixedwood'
}

def get_crop_history(geometry, years=range(2017, 2024)):
    """Extract crop type for each year from AAFC dataset"""
    aafc = ee.ImageCollection('AAFC/ACI')
    
    crop_history = {}
    
    for year in years:
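        # Get crop inventory for this year (AAFC/ACI holds one image per year)
        crop_img = aafc.filter(ee.Filter.date(f'{year}-01-01', f'{year}-12-31')).first()
        
        # Modal (most common) land-cover class across the whole field polygon.
        # Note: crop_img is a server-side object and is always truthy in Python,
        # so a missing year surfaces as an error or a None result from getInfo().
        sample = crop_img.select('landcover').reduceRegion(
            reducer=ee.Reducer.mode(),
            geometry=geometry,
            scale=30,
            maxPixels=1e8
        )
        
        crop_code = sample.get('landcover').getInfo()
        if crop_code:
            crop_code = int(crop_code)  # use a plain int for lookup and display
            crop_name = CROP_CLASSES.get(crop_code, f'Unknown ({crop_code})')
            crop_history[year] = {'code': crop_code, 'name': crop_name}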
    
    return crop_history

In [None]:
# Extract crop history for the field
if field_boundary:
    print("Extracting crop history from AAFC Annual Crop Inventory...")
    crop_history = get_crop_history(field_boundary)
    
    # Display as table
    crop_df = pd.DataFrame.from_dict(crop_history, orient='index')
    crop_df.index.name = 'Year'
    print("\nCrop History:")
    display(crop_df)
else:
    print("Please upload field boundary first.")

## 3. Extract AlphaEarth Foundations Embeddings

Get 64D embeddings for:
- Lease pixels (disturbed site)
- Background pixels (healthy field, excluding lease)
- Regional reference (same crop within about 15 km, widening up to 50 km if needed)
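
Each zone is summarised by the mean of its per-pixel embeddings. AEF pixel embeddings are unit-length vectors, but their mean generally is not, which is one reason the comparison uses cosine similarity (it ignores vector length). The sketch below illustrates this with random stand-in vectors; real neighbouring pixels are far more alike than random ones, so the mean of real embeddings would stay much closer to unit length.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 50 pixel embeddings (random, NOT real AEF data), each scaled to unit length
pixels = rng.normal(size=(50, 64))
pixels /= np.linalg.norm(pixels, axis=1, keepdims=True)

# Zone summary: element-wise mean over pixels
mean_emb = pixels.mean(axis=0)
mean_norm = np.linalg.norm(mean_emb)

def cos_sim(a, b):
    """Cosine similarity; unaffected by the length of either vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = pixels[0]
# Re-normalising the mean does not change its cosine similarity to a reference
same = np.isclose(cos_sim(mean_emb, ref), cos_sim(mean_emb / mean_norm, ref))
print(f"norm of mean = {mean_norm:.3f}, scale-invariant = {same}")
```

The Euclidean distance defined later in this notebook does depend on vector length, so it is best read alongside cosine similarity rather than on its own.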

In [None]:
# Load AEF dataset
aef_collection = ee.ImageCollection("GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL")

def get_embeddings(geometry, year, scale=10):
    """
    Extract mean 64D embedding for a geometry and year
    
    Args:
        geometry: ee.Geometry
        year: int (2017-2024)
        scale: int (default 10m)
    
    Returns:
        dict with 'embedding' (64D array) and 'pixel_count'
    """
    # AEF is stored as many tiles per year, so keep only the tiles that
    # intersect the geometry and mosaic them into a single image
    aef_year = (aef_collection
                .filter(ee.Filter.date(f'{year}-01-01', f'{year}-12-31'))
                .filterBounds(geometry)
                .mosaic())
    
    # AEF bands are named A00 ... A63
    band_names = [f'A{i:02d}' for i in range(64)]
    
    # Compute mean embedding across the geometry
    stats = aef_year.select(band_names).reduceRegion(
        reducer=ee.Reducer.mean().combine(
            reducer2=ee.Reducer.count(),
            sharedInputs=True
        ),
        geometry=geometry,
        scale=scale,
        maxPixels=1e8
    )
    
    result = stats.getInfo()
    
    # Extract embedding values; dtype=float turns missing (None) values into NaN
    embedding = np.array([result.get(f'{b}_mean') for b in band_names], dtype=float)
    pixel_count = result.get('A00_count') or 0
    
    return {
        'embedding': embedding,
        'pixel_count': pixel_count,
        'year': year
    }

def get_regional_reference(center_point, crop_code, year, radius_km=15, sample_pixels=1000, max_radius_km=50):
    """
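    Build regional reference embedding by sampling pixels of the same crop.
    No health screening is applied: the regional mean is assumed to represent
    typical healthy cropland for that crop and year.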
    
    Args:
        center_point: ee.Geometry.Point
        crop_code: int (AAFC crop classification code)
        year: int
        radius_km: float (initial sampling radius in km)
        sample_pixels: int (number of pixels to sample)
        max_radius_km: float (maximum radius to try if no samples found)
    
    Returns:
        dict with 'embedding' (64D centroid), 'sample_count', and 'actual_radius'
    """
    # AEF bands are named A00 ... A63
    band_names = [f'A{i:02d}' for i in range(64)]
    
    # Try progressively larger radii if needed
    for current_radius in [radius_km, radius_km * 2, max_radius_km]:
        # Create sampling region (circular buffer)
        sampling_region = center_point.buffer(current_radius * 1000)  # Convert km to meters
        
        # Get crop mask for this year
        aafc = ee.ImageCollection('AAFC/ACI')
        crop_img = aafc.filter(ee.Filter.date(f'{year}-01-01', f'{year}-12-31')).first()
        
        # Create mask for target crop only
        crop_mask = crop_img.select('landcover').eq(crop_code)
        
        # Get AEF embeddings for this year; the collection is tiled, so mosaic
        # the tiles that intersect the sampling region
        aef_year = (aef_collection
                    .filter(ee.Filter.date(f'{year}-01-01', f'{year}-12-31'))
                    .filterBounds(sampling_region)
                    .mosaic())
        
        # Mask to only include target crop
        masked_embeddings = aef_year.updateMask(crop_mask)
        
        # Sample pixels
        samples = masked_embeddings.select(band_names).sample(
            region=sampling_region,
            scale=10,
            numPixels=sample_pixels,
            seed=42,
            geometries=False
        )
        
        sample_count = samples.size().getInfo()
        
        # If we found enough samples, compute centroid
        if sample_count >= 10:  # Require at least 10 samples for reliability
            # Compute mean (centroid) embedding
            centroid = samples.reduceColumns(
                reducer=ee.Reducer.mean().repeat(64),
                selectors=band_names
            )
            
            result = centroid.getInfo()
            
            # Handle case where mean might be None or missing
            if result and 'mean' in result and result['mean'] is not None:
                embedding = np.array(result['mean'], dtype=float)
                
                # Verify embedding is valid
                if len(embedding) == 64 and not np.all(np.isnan(embedding)):
                    print(f"    Found {sample_count} samples within {current_radius}km radius")
                    return {
                        'embedding': embedding,
                        'sample_count': sample_count,
                        'year': year,
                        'crop_code': crop_code,
                        'actual_radius': current_radius
                    }
        
        # If we didn't find enough samples, try next radius
        print(f"    Only found {sample_count} samples at {current_radius}km, expanding search...")
    
    # If we exhausted all radii, return NaN embedding
    print(f"    ⚠ WARNING: Could not find sufficient samples for crop code {crop_code} within {max_radius_km}km")
    return {
        'embedding': np.full(64, np.nan),
        'sample_count': 0,
        'year': year,
        'crop_code': crop_code,
        'actual_radius': max_radius_km
    }

print("✓ Embedding extraction functions ready")


## 4. Compute Similarity Metrics

Calculate cosine similarity between lease and references

In [None]:
def cosine_similarity(vec1, vec2):
    """Compute cosine similarity between two vectors (1 = identical, 0 = orthogonal, -1 = opposite)"""
    # Return NaN if either vector contains NaN values
    if np.any(np.isnan(vec1)) or np.any(np.isnan(vec2)):
        return np.nan
    
    # Cosine similarity = 1 - cosine distance
    return 1 - cosine(vec1, vec2)

def euclidean_distance(vec1, vec2):
    """Compute Euclidean distance between two vectors"""
    if np.any(np.isnan(vec1)) or np.any(np.isnan(vec2)):
        return np.nan
    return np.linalg.norm(vec1 - vec2)

print("✓ Similarity metric functions ready")

## 5. Run Complete Analysis

Process all years and compute reclamation assessment metrics

In [None]:
# Run analysis for all available years
if field_boundary and lease_boundary and crop_history:
    print("Running reclamation analysis...\n")
    
    results = []
    site_center_point = field_boundary.centroid()
    
    # Calculate background area (field minus lease)
    background_area = field_boundary.difference(lease_boundary)
    
    for year in sorted(crop_history.keys()):
        crop_info = crop_history[year]
        print(f"\nProcessing {year} - {crop_info['name']}...")
        
        try:
            # Extract embeddings
            print("  - Extracting lease embeddings...")
            lease_emb = get_embeddings(lease_boundary, year)
            
            print("  - Extracting background embeddings...")
            background_emb = get_embeddings(background_area, year)
            
            print("  - Building regional reference...")
            regional_ref = get_regional_reference(
                site_center_point,
                crop_info['code'],
                year,
                radius_km=15
            )
            
            # Compute similarities
            lease_vs_regional = cosine_similarity(lease_emb['embedding'], regional_ref['embedding'])
            background_vs_regional = cosine_similarity(background_emb['embedding'], regional_ref['embedding'])
            lease_vs_background = cosine_similarity(lease_emb['embedding'], background_emb['embedding'])
            
            # Recovery gap (called DiD here): lease similarity minus background similarity.
            # Negative = the lease is less like the regional reference than the background is.
            did_score = lease_vs_regional - background_vs_regional
            
            results.append({
                'year': year,
                'crop': crop_info['name'],
                'crop_code': crop_info['code'],
                'lease_pixels': lease_emb['pixel_count'],
                'background_pixels': background_emb['pixel_count'],
                'regional_samples': regional_ref['sample_count'],
                'lease_vs_regional': lease_vs_regional,
                'background_vs_regional': background_vs_regional,
                'lease_vs_background': lease_vs_background,
                'difference_in_differences': did_score
            })
            
            print(f"  ✓ Lease vs Regional: {lease_vs_regional:.4f}")
            print(f"  ✓ Background vs Regional: {background_vs_regional:.4f}")
            print(f"  ✓ Difference-in-Differences: {did_score:.4f}")
            
        except Exception as e:
            print(f"  ✗ Error: {str(e)}")
            continue
    
    # Create results DataFrame
    results_df = pd.DataFrame(results)
    
    print("\n" + "="*80)
    print("ANALYSIS COMPLETE")
    print("="*80)
    display(results_df)
else:
    print("Please complete all previous steps first.")

## 6. Visualization: Recovery Trajectory

Plot similarity metrics over time to assess reclamation progress

In [None]:
if 'results_df' in locals() and len(results_df) > 0:
    fig, axes = plt.subplots(2, 1, figsize=(12, 10))
    
    # Plot 1: Cosine similarity trends
    ax1 = axes[0]
    ax1.plot(results_df['year'], results_df['lease_vs_regional'], 
             marker='o', linewidth=2, label='Lease vs Regional Reference', color='red')
    ax1.plot(results_df['year'], results_df['background_vs_regional'], 
             marker='s', linewidth=2, label='Background vs Regional Reference', color='green')
    ax1.plot(results_df['year'], results_df['lease_vs_background'], 
             marker='^', linewidth=2, label='Lease vs Background', color='blue', linestyle='--')
    
    ax1.set_xlabel('Year', fontsize=12)
    ax1.set_ylabel('Cosine Similarity', fontsize=12)
    ax1.set_title('Reclamation Site Recovery Trajectory\n(Higher = More Similar to Healthy Reference)', 
                  fontsize=14, fontweight='bold')
    ax1.legend(loc='best', fontsize=10)
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim([0, 1])
    
    # Add crop labels
    for idx, row in results_df.iterrows():
        ax1.text(row['year'], row['lease_vs_regional'] + 0.02, 
                row['crop'][:10], fontsize=8, rotation=45, ha='left')
    
    # Plot 2: Difference-in-Differences (Recovery Gap)
    ax2 = axes[1]
    colors = ['red' if x < 0 else 'green' for x in results_df['difference_in_differences']]
    ax2.bar(results_df['year'], results_df['difference_in_differences'], 
            color=colors, alpha=0.7, edgecolor='black')
    ax2.axhline(y=0, color='black', linestyle='-', linewidth=1)
    
    ax2.set_xlabel('Year', fontsize=12)
    ax2.set_ylabel('Difference-in-Differences Score', fontsize=12)
    ax2.set_title('Recovery Gap: Lease Performance vs Background\n(Near Zero = Equivalent to Background, Negative = Lease Underperforming)', 
                  fontsize=14, fontweight='bold')
    ax2.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print("\nSummary Statistics:")
    print(f"Mean Lease vs Regional Similarity: {results_df['lease_vs_regional'].mean():.4f}")
    print(f"Mean Background vs Regional Similarity: {results_df['background_vs_regional'].mean():.4f}")
    print(f"Mean Recovery Gap: {results_df['difference_in_differences'].mean():.4f}")
    
    if results_df['difference_in_differences'].mean() >= -0.05:
        print("\n✓ ASSESSMENT: Lease appears to be performing similarly to background field.")
        print("  Reclamation may be approaching equivalent land capability.")
    else:
        print("\n⚠ ASSESSMENT: Lease is underperforming compared to background field.")
        print("  Further reclamation work or monitoring may be needed.")
else:
    print("No results to visualize yet.")

## 7. Export Results

Save analysis results for reporting

In [None]:
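# Skip the export when there are no results or the site center was never computed
if 'results_df' in locals() and len(results_df) > 0 and site_center is not None: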
    # Export to CSV
    output_file = 'reclamation_analysis_results.csv'
    results_df.to_csv(output_file, index=False)
    print(f"✓ Results saved to {output_file}")
    
    # Create summary report
    summary = f"""
    RECLAMATION ASSESSMENT SUMMARY
    ==============================
    
    Site Location: {site_center[1]:.5f}°, {site_center[0]:.5f}°
    Analysis Period: {results_df['year'].min()} - {results_df['year'].max()}
    
    Average Metrics:
    - Lease vs Regional Reference: {results_df['lease_vs_regional'].mean():.4f}
    - Background vs Regional Reference: {results_df['background_vs_regional'].mean():.4f}
    - Recovery Gap (DiD): {results_df['difference_in_differences'].mean():.4f}
    
    Trend Analysis:
    - First Year DiD: {results_df.iloc[0]['difference_in_differences']:.4f}
    - Last Year DiD: {results_df.iloc[-1]['difference_in_differences']:.4f}
    - Change: {results_df.iloc[-1]['difference_in_differences'] - results_df.iloc[0]['difference_in_differences']:.4f}
    
    Interpretation:
    The difference-in-differences (DiD) score shows how the lease performs relative
    to the background field when both are compared to regional healthy cropland.
    
    - DiD ≈ 0: Lease performing similar to background (equivalent land capability)
    - DiD < -0.05: Lease underperforming (needs attention)
    - DiD > 0: Lease outperforming background (unexpected but possible)
    """
    
    print(summary)
    
    with open('reclamation_summary.txt', 'w') as f:
        f.write(summary)
    print("\n✓ Summary report saved to reclamation_summary.txt")

## Interpretation Guide

### Cosine Similarity Scores
- **1.0** = Identical embedding vectors (perfect match)
- **0.9-1.0** = Very high similarity (typical for same crop type in good condition)
- **0.7-0.9** = Moderate similarity (some differences but generally similar)
- **<0.7** = Low similarity (significant differences)
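
The endpoints of this scale can be checked on toy vectors. The sketch below uses a plain NumPy `cos_sim` helper (a stand-in for the SciPy-based function defined earlier) and two-dimensional vectors chosen purely for illustration:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])

identical = cos_sim(a, 3 * a)          # same direction, different length
orthogonal = cos_sim(a, [0.0, 1.0])    # perpendicular directions
opposite = cos_sim(a, -a)              # opposite directions

print(identical, orthogonal, opposite)
```

Because only direction matters, a score of 1.0 means the two embeddings point the same way, not that every component is numerically equal.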

### Difference-in-Differences (DiD) Score
This metric answers: **"Given this year's crop and regional conditions, did the lease behave like healthy peers?"**

**DiD = (Lease vs Regional) - (Background vs Regional)**

- **DiD ≈ 0** (±0.05): Lease is performing equivalently to background field → **Reclamation Success**
- **DiD < -0.05**: Lease is underperforming compared to background → **Needs Attention**
- **DiD > 0.05**: Lease is outperforming background (rare, investigate if real or artifact)
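
As a worked example of the thresholds above, the sketch below computes the score from two similarity values and maps it to the three categories. The `did_score` and `classify_did` helpers and the input values are hypothetical; the ±0.05 tolerance is the one used in this guide.

```python
def did_score(lease_vs_regional, background_vs_regional):
    """DiD = (Lease vs Regional) - (Background vs Regional)."""
    return lease_vs_regional - background_vs_regional

def classify_did(did, tolerance=0.05):
    """Map a DiD score to the categories used in this guide."""
    if did < -tolerance:
        return "needs attention"
    if did > tolerance:
        return "outperforming (investigate)"
    return "equivalent"

# Hypothetical year: lease is 0.82 similar to the reference, background is 0.95
did = did_score(0.82, 0.95)
print(f"DiD = {did:.2f} -> {classify_did(did)}")
```

Here the lease sits 0.13 below the background field, well outside the ±0.05 band, so the year would be flagged for attention.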

### Recovery Trajectory
Look for these patterns over time:
- **Improving trend**: DiD increasing toward zero = recovery in progress
- **Stable at zero**: DiD consistently near zero = equivalent land capability achieved
- **Declining trend**: DiD becoming more negative = degradation or poor reclamation
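
One simple way to put a number on the trend is to fit a straight line to DiD against year and read off the slope. This is a sketch under the assumption that a linear fit is adequate for a handful of annual values; the DiD series below is invented for illustration.

```python
import numpy as np

years = np.array([2017, 2018, 2019, 2020, 2021, 2022, 2023])
# Hypothetical DiD scores climbing toward zero
did = np.array([-0.20, -0.17, -0.15, -0.11, -0.08, -0.06, -0.03])

# Fit DiD = slope * (years since first year) + intercept
x = years - years[0]
slope, intercept = np.polyfit(x, did, 1)

print(f"DiD changes by {slope:+.4f} per year")
```

A positive slope with DiD still below zero matches the "improving trend" pattern. With only seven points, and a different crop in most years, the slope is a rough indicator rather than a statistical test.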

### Spatial Resolution Considerations
- **AEF Resolution**: 10m × 10m pixels
- **100m × 100m lease** = ~100 pixels (adequate for statistical analysis)
- **15m access road** = 1-2 pixels wide (may be too small for reliable assessment)

For small features like access roads, consider aggregating multiple years or focusing on larger disturbed areas.
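
The pixel counts quoted above follow from simple arithmetic, sketched here with a hypothetical `approx_pixel_count` helper. It assumes a rectangular footprint aligned with the pixel grid, which real boundaries rarely are, so treat the result as an upper-end estimate.

```python
def approx_pixel_count(width_m, length_m, pixel_size_m=10.0):
    """Approximate number of pixels covering a width x length rectangle."""
    return (width_m / pixel_size_m) * (length_m / pixel_size_m)

lease_pixels = approx_pixel_count(100, 100)   # 100 m x 100 m lease
road_width_px = 15 / 10.0                     # 15 m road width in pixels

print(f"Lease: ~{lease_pixels:.0f} pixels; road: ~{road_width_px:.1f} pixels wide")
```

Comparing this estimate with the `lease_pixels` column in the results table is a quick check that the uploaded boundary covers the area you expect.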