# Exploring Southern GYE Elk GPS Collar Data (2007-2015)

This notebook explores the Southern Greater Yellowstone Ecosystem (GYE) elk GPS collar dataset, whose large sample (94,591 fixes from 288 elk) makes it well suited to model training.

**Dataset Info:**
- **Location:** 22 Wyoming winter supplemental feedgrounds (20 distinct feedground names appear in this file)
- **Coverage:** 288 adult and yearling female elk, 2007-2015
- **Data:** GPS locations during brucellosis risk period (February-July)
- **Use Case:** Large sample size, diverse conditions, statistical robustness
- **Note:** ~200 miles from Area 048 (244-435 km, mean 353 km, as computed in Step 4), so it serves as general training data rather than local data

**Data Format:** UTM coordinates (Easting/Northing) in Zone 12N, converted to lat/lon in Step 2

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from shapely.geometry import Point
import pyproj

# Set up paths
DATA_DIR = Path("../data/raw")
GYE_DIR = DATA_DIR / "elk_southern_gye"

print("=" * 60)
print("SOUTHERN GYE DATASET")
print("=" * 60)
print(f"\nData directory: {GYE_DIR}")
print(f"Directory exists: {GYE_DIR.exists()}")

# Look for data files
if GYE_DIR.exists():
    files = list(GYE_DIR.glob("*.csv"))
    print(f"\nCSV files found: {len(files)}")
    for f in files:
        print(f"  - {f.name}")
else:
    print("\n⚠️  Directory doesn't exist yet!")

SOUTHERN GYE DATASET

Data directory: ../data/raw/elk_southern_gye
Directory exists: True

CSV files found: 1
  - Elk GPS collar data in southern GYE 2007-2015.csv


## Step 1: Load the Data

The dataset uses **UTM coordinates** (Easting/Northing) in Zone 12N. We'll convert these to lat/lon.

In [2]:
# Find and load the CSV file
csv_files = list(GYE_DIR.glob("*.csv"))

if csv_files:
    csv_file = csv_files[0]
    print(f"Loading: {csv_file.name}")
    df = pd.read_csv(csv_file)
    
    print(f"\n✓ Data loaded successfully!")
    print(f"  Shape: {df.shape}")
    print(f"  Columns: {list(df.columns)}")
    print(f"\nFirst few rows:")
    print(df.head())
    
    # Expected columns: AID, Easting, Northing, Date_Time_MST, Feedground
    print(f"\nColumn check:")
    expected_cols = ['AID', 'Easting', 'Northing', 'Date_Time_MST', 'Feedground']
    for col in expected_cols:
        if col in df.columns:
            print(f"  ✓ {col}")
        else:
            print(f"  ✗ {col} (missing)")
else:
    print("⚠️  No CSV files found!")
    df = None

Loading: Elk GPS collar data in southern GYE 2007-2015.csv

✓ Data loaded successfully!
  Shape: (94591, 5)
  Columns: ['AID', 'Easting', 'Northing', 'Date_Time_MST', 'Feedground']

First few rows:
   AID      Easting     Northing    Date_Time_MST    Feedground
0   17  570434.7207  4727340.258   3/21/2008 4:00  Bench_Corral
1   17  570435.9304  4727881.330   3/21/2008 8:00  Bench_Corral
2   17  569890.9708  4727580.732  3/21/2008 12:00  Bench_Corral
3   17  572112.5855  4731880.513   3/22/2008 4:00  Bench_Corral
4   17  571634.1351  4731804.016   3/22/2008 8:00  Bench_Corral

Column check:
  ✓ AID
  ✓ Easting
  ✓ Northing
  ✓ Date_Time_MST
  ✓ Feedground
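
The column check above only reports missing columns. A minimal sketch of a stricter load, assuming the five columns listed above: passing `usecols` makes `pd.read_csv` raise a `ValueError` when a listed column is absent, and `dtype` pins the numeric types up front. `EXPECTED_COLS`, `NUMERIC_DTYPES`, and `load_collar_csv` are hypothetical names, and the inline sample reuses the first three rows printed above in place of the real file.

```python
import io

import pandas as pd

# Hypothetical schema, taken from the columns printed above.
EXPECTED_COLS = ["AID", "Easting", "Northing", "Date_Time_MST", "Feedground"]
NUMERIC_DTYPES = {"AID": "int64", "Easting": "float64", "Northing": "float64"}


def load_collar_csv(source):
    """Read the collar CSV, failing fast if an expected column is absent."""
    # usecols raises ValueError when a listed column is missing from the file.
    return pd.read_csv(source, usecols=EXPECTED_COLS, dtype=NUMERIC_DTYPES)


# Stand-in for the real file: the first three rows shown in the output above.
sample_csv = io.StringIO(
    "AID,Easting,Northing,Date_Time_MST,Feedground\n"
    "17,570434.7207,4727340.258,3/21/2008 4:00,Bench_Corral\n"
    "17,570435.9304,4727881.330,3/21/2008 8:00,Bench_Corral\n"
    "17,569890.9708,4727580.732,3/21/2008 12:00,Bench_Corral\n"
)
sample_df = load_collar_csv(sample_csv)
print(sample_df.dtypes)
```

Failing at load time keeps a schema change from surfacing later as a confusing `KeyError` in the conversion step.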


## Step 2: Convert UTM Coordinates to Lat/Lon

The data uses UTM Zone 12N coordinates. We'll convert to WGS84 (lat/lon) for analysis.

In [3]:
if df is not None:
    print("=" * 60)
    print("CONVERTING UTM TO LAT/LON")
    print("=" * 60)
    
    # UTM Zone 12N (Wyoming)
    utm_zone = 12
    utm_crs = f'EPSG:326{utm_zone:02d}'  # UTM Zone 12N -> EPSG:32612 (zero-padded so single-digit zones also work)
    wgs84_crs = 'EPSG:4326'  # WGS84 lat/lon
    
    print(f"\nUTM Zone: {utm_zone}N")
    print(f"Converting {len(df):,} points...")
    
    # Create GeoDataFrame from UTM coordinates
    gdf_utm = gpd.GeoDataFrame(
        df,
        geometry=gpd.points_from_xy(df['Easting'], df['Northing']),
        crs=utm_crs
    )
    
    # Convert to WGS84 (lat/lon)
    gdf_wgs84 = gdf_utm.to_crs(wgs84_crs)
    
    # Extract lat/lon
    gdf_wgs84['latitude'] = gdf_wgs84.geometry.y
    gdf_wgs84['longitude'] = gdf_wgs84.geometry.x
    
    print(f"\n✓ Conversion complete!")
    print(f"\nSample converted coordinates:")
    print(f"  UTM: Easting={df['Easting'].iloc[0]:.2f}, Northing={df['Northing'].iloc[0]:.2f}")
    print(f"  WGS84: Lat={gdf_wgs84['latitude'].iloc[0]:.4f}°, Lon={gdf_wgs84['longitude'].iloc[0]:.4f}°")
    
    # Verify coordinates are in Wyoming range
    lat_range = (gdf_wgs84['latitude'].min(), gdf_wgs84['latitude'].max())
    lon_range = (gdf_wgs84['longitude'].min(), gdf_wgs84['longitude'].max())
    print(f"\nCoordinate ranges:")
    print(f"  Latitude: {lat_range[0]:.4f}° to {lat_range[1]:.4f}°")
    print(f"  Longitude: {lon_range[0]:.4f}° to {lon_range[1]:.4f}°")
    
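    # Wyoming spans roughly 41° to 45°N and 111.06° to 104.05°W; check both ends of each range
    in_wyoming = (41 <= lat_range[0] and lat_range[1] <= 45
                  and -111.06 <= lon_range[0] and lon_range[1] <= -104.05)
    if in_wyoming:
        print("  ✓ Coordinates are in Wyoming range")
    else:
        print("  ⚠️  Some coordinates fall outside the Wyoming range")

CONVERTING UTM TO LAT/LON

UTM Zone: 12N
Converting 94,591 points...

✓ Conversion complete!

Sample converted coordinates:
  UTM: Easting=570434.72, Northing=4727340.26
  WGS84: Lat=42.6953°, Lon=-110.1401°

Coordinate ranges:
  Latitude: 42.5352° to 44.2879°
  Longitude: -111.0527° to -109.1663°
  ✓ Coordinates are in Wyoming range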
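
As an independent cross-check on the `to_crs` result, the sketch below inverts the UTM projection with the textbook series expansion (Snyder's transverse Mercator formulas) using only the standard library. This is an assumption-laden illustration, not the algorithm pyproj runs internally: it assumes the WGS84 ellipsoid, the northern hemisphere, and Zone 12, and `utm_to_latlon` is a hypothetical helper name.

```python
import math

# WGS84 ellipsoid and UTM constants.
A = 6378137.0              # semi-major axis (m)
F = 1 / 298.257223563      # flattening
E2 = F * (2 - F)           # first eccentricity squared
EP2 = E2 / (1 - E2)        # second eccentricity squared
K0 = 0.9996                # UTM scale factor on the central meridian


def utm_to_latlon(easting, northing, zone=12):
    """Inverse UTM (northern hemisphere) via Snyder's series expansion."""
    x = easting - 500000.0                      # remove the false easting
    m = northing / K0                           # meridional arc length
    mu = m / (A * (1 - E2 / 4 - 3 * E2**2 / 64 - 5 * E2**3 / 256))
    e1 = (1 - math.sqrt(1 - E2)) / (1 + math.sqrt(1 - E2))

    # Footpoint latitude: latitude on the central meridian at this northing.
    phi1 = (mu
            + (3 * e1 / 2 - 27 * e1**3 / 32) * math.sin(2 * mu)
            + (21 * e1**2 / 16 - 55 * e1**4 / 32) * math.sin(4 * mu)
            + (151 * e1**3 / 96) * math.sin(6 * mu)
            + (1097 * e1**4 / 512) * math.sin(8 * mu))

    sin1, cos1, tan1 = math.sin(phi1), math.cos(phi1), math.tan(phi1)
    c1 = EP2 * cos1**2
    t1 = tan1**2
    n1 = A / math.sqrt(1 - E2 * sin1**2)                 # prime vertical radius
    r1 = A * (1 - E2) / (1 - E2 * sin1**2) ** 1.5        # meridian radius
    d = x / (n1 * K0)

    lat = phi1 - (n1 * tan1 / r1) * (
        d**2 / 2
        - (5 + 3 * t1 + 10 * c1 - 4 * c1**2 - 9 * EP2) * d**4 / 24
        + (61 + 90 * t1 + 298 * c1 + 45 * t1**2 - 252 * EP2 - 3 * c1**2) * d**6 / 720
    )
    dlon = (
        d
        - (1 + 2 * t1 + c1) * d**3 / 6
        + (5 - 2 * c1 + 28 * t1 - 3 * c1**2 + 8 * EP2 + 24 * t1**2) * d**5 / 120
    ) / cos1

    central_meridian = (zone - 1) * 6 - 180 + 3          # -111 degrees for zone 12
    return math.degrees(lat), math.degrees(dlon) + central_meridian


# First fix in the dataset (elk 17), as printed in Step 1.
lat, lon = utm_to_latlon(570434.7207, 4727340.258)
print(f"Lat={lat:.4f}, Lon={lon:.4f}")
```

The printed values should agree closely with the sample converted coordinates above; a large disagreement would suggest a wrong zone or hemisphere rather than a rounding difference.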


## Step 3: Inspect Dataset Structure

In [4]:
if 'gdf_wgs84' in locals() and gdf_wgs84 is not None:
    print("=" * 60)
    print("DATASET STRUCTURE")
    print("=" * 60)
    print(f"\nShape: {gdf_wgs84.shape}")
    print(f"Columns: {list(gdf_wgs84.columns)}")
    print(f"\nData types:")
    print(gdf_wgs84.dtypes)
    print(f"\nMissing values:")
    missing = gdf_wgs84.isnull().sum()
    if missing.sum() > 0:
        for col, count in missing[missing > 0].items():
            print(f"  {col}: {count} ({count/len(gdf_wgs84)*100:.1f}%)")
    else:
        print("  ✓ No missing values!")
    
    print(f"\nUnique values:")
    print(f"  Unique elk (AID): {gdf_wgs84['AID'].nunique()}")
    print(f"  Unique feedgrounds: {gdf_wgs84['Feedground'].nunique()}")
    print(f"  Feedgrounds: {sorted(gdf_wgs84['Feedground'].unique())}")

DATASET STRUCTURE

Shape: (94591, 8)
Columns: ['AID', 'Easting', 'Northing', 'Date_Time_MST', 'Feedground', 'geometry', 'latitude', 'longitude']

Data types:
AID                 int64
Easting           float64
Northing          float64
Date_Time_MST      object
Feedground         object
geometry         geometry
latitude          float64
longitude         float64
dtype: object

Missing values:
  ✓ No missing values!

Unique values:
  Unique elk (AID): 288
  Unique feedgrounds: 20
  Feedgrounds: ['Bench_Corral', 'Black_Butte', 'Camp_Creek', 'Dell_Creek', 'Dog_Creek', 'Fall_Creek', 'Finnegan', 'Forest_Park', 'Franz', 'Green_River_Lakes', 'Greys_River', 'Gros_Ventre', 'Horse_Creek', 'Jewett', 'McNeel', 'Muddy_Creek', 'National_Elk_Refuge', 'Scab_Creek', 'Soda_Lake', 'South_Park']


## Step 4: Analyze Spatial Coverage

In [5]:
if 'gdf_wgs84' in locals() and gdf_wgs84 is not None:
    print("=" * 60)
    print("SPATIAL COVERAGE")
    print("=" * 60)
    print(f"\nLatitude: {gdf_wgs84['latitude'].min():.4f}° to {gdf_wgs84['latitude'].max():.4f}°")
    print(f"Longitude: {gdf_wgs84['longitude'].min():.4f}° to {gdf_wgs84['longitude'].max():.4f}°")
    
    # Distance to Area 048
    area_048_lat, area_048_lon = 41.835, -106.425
    
    from math import radians, sin, cos, sqrt, atan2
    
    def haversine_distance(lat1, lon1, lat2, lon2):
        R = 6371  # Earth radius in km
        lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
        dlat = lat2 - lat1
        dlon = lon2 - lon1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * atan2(sqrt(a), sqrt(1-a))
        return R * c
    
    gdf_wgs84['distance_to_area_048_km'] = gdf_wgs84.apply(
        lambda row: haversine_distance(row['latitude'], row['longitude'], area_048_lat, area_048_lon),
        axis=1
    )
    
    print(f"\nProximity to Area 048:")
    print(f"  Min distance: {gdf_wgs84['distance_to_area_048_km'].min():.2f} km")
    print(f"  Max distance: {gdf_wgs84['distance_to_area_048_km'].max():.2f} km")
    print(f"  Avg distance: {gdf_wgs84['distance_to_area_048_km'].mean():.2f} km")
    print(f"  Points within 200km: {(gdf_wgs84['distance_to_area_048_km'] <= 200).sum():,} ({(gdf_wgs84['distance_to_area_048_km'] <= 200).sum() / len(gdf_wgs84) * 100:.1f}%)")
    print(f"\n⚠️  Note: Southern GYE is ~200 miles from Area 048.")
    print(f"   This data is valuable for large sample size training.")

SPATIAL COVERAGE

Latitude: 42.5352° to 44.2879°
Longitude: -111.0527° to -109.1663°

Proximity to Area 048:
  Min distance: 244.09 km
  Max distance: 435.00 km
  Avg distance: 353.23 km
  Points within 200km: 0 (0.0%)

⚠️  Note: Southern GYE is ~200 miles from Area 048.
   This data is valuable for large sample size training.
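
The cell above calls the haversine function once per row through `DataFrame.apply`, which is slow for roughly 95,000 points. A minimal vectorized sketch with NumPy follows, using the same 6371 km Earth radius and the same Area 048 reference point; `haversine_km` is a hypothetical name, and the two sample points are the first two converted fixes from this notebook. It also converts to miles, since the notes mix the two units.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0   # same mean radius as the cell above
KM_PER_MILE = 1.609344


def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in km from each (lat, lon) to one reference point."""
    lat, lon = np.radians(lat), np.radians(lon)
    ref_lat, ref_lon = np.radians(ref_lat), np.radians(ref_lon)
    a = (np.sin((ref_lat - lat) / 2) ** 2
         + np.cos(lat) * np.cos(ref_lat) * np.sin((ref_lon - lon) / 2) ** 2)
    # 2*arcsin(sqrt(a)) is equivalent to the 2*atan2(sqrt(a), sqrt(1-a)) form.
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# First two converted points from this notebook; Area 048 reference as above.
lats = np.array([42.695323, 42.700195])
lons = np.array([-110.140098, -110.140016])
dist_km = haversine_km(lats, lons, 41.835, -106.425)
dist_miles = dist_km / KM_PER_MILE
print(dist_km.round(2), dist_miles.round(1))
```

On the full table this would be called as `haversine_km(gdf_wgs84['latitude'].to_numpy(), gdf_wgs84['longitude'].to_numpy(), area_048_lat, area_048_lon)`, replacing the row-wise `apply`.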


## Step 5: Analyze Temporal Patterns

In [6]:
if 'gdf_wgs84' in locals() and gdf_wgs84 is not None:
    print("=" * 60)
    print("TEMPORAL ANALYSIS")
    print("=" * 60)
    
    # Parse date column (format: M/D/YYYY H:MM)
    try:
        gdf_wgs84['date'] = pd.to_datetime(gdf_wgs84['Date_Time_MST'], format='%m/%d/%Y %H:%M')
        gdf_wgs84['year'] = gdf_wgs84['date'].dt.year
        gdf_wgs84['month'] = gdf_wgs84['date'].dt.month
        gdf_wgs84['day_of_year'] = gdf_wgs84['date'].dt.dayofyear
        
        print(f"\nDate range: {gdf_wgs84['date'].min()} to {gdf_wgs84['date'].max()}")
        print(f"Total days: {(gdf_wgs84['date'].max() - gdf_wgs84['date'].min()).days}")
        
        print(f"\nYear distribution:")
        for year, count in gdf_wgs84['year'].value_counts().sort_index().items():
            print(f"  {int(year)}: {count:,} points ({count/len(gdf_wgs84)*100:.1f}%)")
        
        print(f"\nMonth distribution:")
        for month, count in gdf_wgs84['month'].value_counts().sort_index().items():
            month_name = pd.to_datetime(f"2020-{month}-01").strftime("%B")
            print(f"  {month_name} ({month}): {count:,} points ({count/len(gdf_wgs84)*100:.1f}%)")
        
        # Note: Data is Feb-July (brucellosis risk period)
        print(f"\n📋 Note: This dataset focuses on Feb-July (brucellosis risk period)")
        print(f"   October data may be limited, but still valuable for general patterns.")
        
        # Check October data
        october_points = gdf_wgs84[gdf_wgs84['month'] == 10]
        if len(october_points) > 0:
            print(f"\n🎯 October data: {len(october_points):,} points ({len(october_points)/len(gdf_wgs84)*100:.1f}%)")
        else:
            print(f"\n⚠️  No October data (expected - dataset focuses on Feb-July)")
            
    except Exception as e:
        print(f"⚠️  Could not parse dates: {e}")
        print(f"   Date format: {gdf_wgs84['Date_Time_MST'].iloc[0]}")
        print(f"   Try adjusting the date format string if needed.")

TEMPORAL ANALYSIS

Date range: 2007-03-14 00:00:00 to 2015-07-31 10:00:00
Total days: 3061

Year distribution:
  2007: 2,881 points (3.0%)
  2008: 9,734 points (10.3%)
  2009: 12,680 points (13.4%)
  2010: 13,512 points (14.3%)
  2011: 11,167 points (11.8%)
  2012: 15,135 points (16.0%)
  2013: 18,983 points (20.1%)
  2014: 9,155 points (9.7%)
  2015: 1,344 points (1.4%)

Month distribution:
  February (2): 103 points (0.1%)
  March (3): 3,292 points (3.5%)
  April (4): 19,724 points (20.9%)
  May (5): 24,432 points (25.8%)
  June (6): 22,954 points (24.3%)
  July (7): 24,086 points (25.5%)

📋 Note: This dataset focuses on Feb-July (brucellosis risk period)
   October data may be limited, but still valuable for general patterns.

⚠️  No October data (expected - dataset focuses on Feb-July)
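
The parsed `date` column above is timezone-naive. If these fixes are later joined with weather or other data recorded in UTC, the offset matters. A minimal sketch, assuming (from the column name `Date_Time_MST` only) that the timestamps are fixed-offset Mountain Standard Time at UTC-7 with no daylight saving shift; the sample strings are taken from values shown in this notebook.

```python
from datetime import timedelta, timezone

import pandas as pd

# Assumption: fixed-offset Mountain Standard Time (UTC-7), no daylight saving.
MST = timezone(timedelta(hours=-7), name="MST")

raw = pd.Series(["3/21/2008 4:00", "3/21/2008 8:00", "7/31/2015 10:00"])

# Same format string as the cell above, then attach the offset and convert.
local = pd.to_datetime(raw, format="%m/%d/%Y %H:%M").dt.tz_localize(MST)
utc = local.dt.tz_convert("UTC")
print(utc)
```

A fixed offset is used rather than a regional zone such as `America/Denver`, because a regional zone would apply daylight saving time to the spring and summer fixes, which the MST label suggests is not intended.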


## Step 6: Analyze Elk Individual Patterns

In [7]:
if 'gdf_wgs84' in locals() and gdf_wgs84 is not None:
    print("=" * 60)
    print("ELK INDIVIDUAL ANALYSIS")
    print("=" * 60)
    
    print(f"\nTotal unique elk (AID): {gdf_wgs84['AID'].nunique()}")
    print(f"Total GPS points: {len(gdf_wgs84):,}")
    print(f"Average points per elk: {len(gdf_wgs84) / gdf_wgs84['AID'].nunique():.0f}")
    
    points_per_elk = gdf_wgs84['AID'].value_counts()
    print(f"\nPoints per elk:")
    print(f"  Minimum: {points_per_elk.min():,}")
    print(f"  Maximum: {points_per_elk.max():,}")
    print(f"  Mean: {points_per_elk.mean():.0f}")
    print(f"  Median: {points_per_elk.median():.0f}")
    
    print(f"\nTop 5 elk by point count:")
    for elk_id, count in points_per_elk.head().items():
        print(f"  Elk {elk_id}: {count:,} points")
    
    # Feedground analysis
    print(f"\nFeedground distribution:")
    feedground_counts = gdf_wgs84['Feedground'].value_counts()
    for feedground, count in feedground_counts.head(10).items():
        print(f"  {feedground}: {count:,} points ({count/len(gdf_wgs84)*100:.1f}%)")

ELK INDIVIDUAL ANALYSIS

Total unique elk (AID): 288
Total GPS points: 94,591
Average points per elk: 328

Points per elk:
  Minimum: 31
  Maximum: 954
  Mean: 328
  Median: 324

Top 5 elk by point count:
  Elk 88: 954 points
  Elk 245: 765 points
  Elk 917: 750 points
  Elk 913: 750 points
  Elk 679: 744 points

Feedground distribution:
  National_Elk_Refuge: 12,441 points (13.2%)
  Gros_Ventre: 12,271 points (13.0%)
  Jewett: 9,565 points (10.1%)
  Muddy_Creek: 7,343 points (7.8%)
  Fall_Creek: 6,743 points (7.1%)
  McNeel: 6,142 points (6.5%)
  Soda_Lake: 5,834 points (6.2%)
  Greys_River: 4,696 points (5.0%)
  Forest_Park: 4,173 points (4.4%)
  South_Park: 3,533 points (3.7%)
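
Point counts per elk say nothing about how regularly each collar reported. A minimal sketch of a per-animal fix-interval summary, assuming a parsed `date` column like the one built in Step 5; the five rows are the first fixes for elk 17 from the Step 1 output, and `gap_hours` and `gap_summary` are hypothetical names.

```python
import pandas as pd

# First five fixes for elk 17, copied from the Step 1 output.
fixes = pd.DataFrame({
    "AID": [17, 17, 17, 17, 17],
    "date": pd.to_datetime([
        "2008-03-21 04:00", "2008-03-21 08:00", "2008-03-21 12:00",
        "2008-03-22 04:00", "2008-03-22 08:00",
    ]),
})

fixes = fixes.sort_values(["AID", "date"])

# Hours since the previous fix of the same animal (NaN for each elk's first fix).
fixes["gap_hours"] = fixes.groupby("AID")["date"].diff().dt.total_seconds() / 3600

# Median shows the nominal schedule; max exposes the longest gap per elk.
gap_summary = fixes.groupby("AID")["gap_hours"].agg(["median", "max"])
print(gap_summary)
```

In this sample the gaps are 4, 4, 16, and 4 hours, so the median reflects a 4-hour schedule while the maximum flags the overnight gap. Uneven sampling like this may be worth accounting for when weighting positive examples.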


## Step 7: Prepare Data for PathWild Integration

In [8]:
if 'gdf_wgs84' in locals() and gdf_wgs84 is not None:
    # Create PathWild-ready dataset
    pathwild_data = pd.DataFrame({
        'latitude': gdf_wgs84['latitude'],
        'longitude': gdf_wgs84['longitude'],
        'distance_to_area_048_km': gdf_wgs84['distance_to_area_048_km'],
        'elk_id': gdf_wgs84['AID'],
        'feedground': gdf_wgs84['Feedground']
    })
    
    # Add temporal info if available
    if 'date' in gdf_wgs84.columns:
        pathwild_data['date'] = gdf_wgs84['date']
        pathwild_data['year'] = gdf_wgs84['year']
        pathwild_data['month'] = gdf_wgs84['month']
        pathwild_data['day_of_year'] = gdf_wgs84['day_of_year']
    
    # Add original UTM coordinates (useful for reference)
    pathwild_data['utm_easting'] = gdf_wgs84['Easting']
    pathwild_data['utm_northing'] = gdf_wgs84['Northing']
    
    print("=" * 60)
    print("PATHWILD-READY DATASET")
    print("=" * 60)
    print(f"\nShape: {pathwild_data.shape}")
    print(f"Columns: {list(pathwild_data.columns)}")
    print(f"\nFirst few rows:")
    print(pathwild_data.head())
    
    # Save to CSV
    output_file = Path("../data/processed/southern_gye_points.csv")
    output_file.parent.mkdir(parents=True, exist_ok=True)
    pathwild_data.to_csv(output_file, index=False)
    print(f"\n✓ Saved to {output_file}")

PATHWILD-READY DATASET

Shape: (94591, 11)
Columns: ['latitude', 'longitude', 'distance_to_area_048_km', 'elk_id', 'feedground', 'date', 'year', 'month', 'day_of_year', 'utm_easting', 'utm_northing']

First few rows:
    latitude   longitude  distance_to_area_048_km  elk_id    feedground  \
0  42.695323 -110.140098               320.296093      17  Bench_Corral   
1  42.700195 -110.140016               320.440500      17  Bench_Corral   
2  42.697538 -110.146706               320.883282      17  Bench_Corral   
3  42.736050 -110.119038               319.937610      17  Bench_Corral   
4  42.735406 -110.124892               320.373967      17  Bench_Corral   

                 date  year  month  day_of_year  utm_easting  utm_northing  
0 2008-03-21 04:00:00  2008      3           81  570434.7207   4727340.258  
1 2008-03-21 08:00:00  2008      3           81  570435.9304   4727881.330  
2 2008-03-21 12:00:00  2008      3           81  569890.9708   4727580.732  
3 2008-03-22 04:00:00  2

## Step 8: Summary and Next Steps

In [9]:
if 'gdf_wgs84' in locals() and gdf_wgs84 is not None:
    print("=" * 60)
    print("SOUTHERN GYE DATASET SUMMARY")
    print("=" * 60)
    print(f"\nTotal GPS points: {len(gdf_wgs84):,}")
    print(f"Unique elk: {gdf_wgs84['AID'].nunique()}")
    print(f"Unique feedgrounds: {gdf_wgs84['Feedground'].nunique()}")
    print(f"\nGeographic coverage:")
    print(f"  Latitude: {gdf_wgs84['latitude'].min():.4f}° to {gdf_wgs84['latitude'].max():.4f}°")
    print(f"  Longitude: {gdf_wgs84['longitude'].min():.4f}° to {gdf_wgs84['longitude'].max():.4f}°")
    print(f"\nProximity to Area 048:")
    print(f"  Average distance: {gdf_wgs84['distance_to_area_048_km'].mean():.2f} km")
    
    if 'year' in gdf_wgs84.columns:
        print(f"\nTemporal coverage:")
        print(f"  Years: {sorted(gdf_wgs84['year'].unique())}")
        print(f"  Months: {sorted(gdf_wgs84['month'].unique())}")
    
    print(f"\n📋 Key Insights:")
    print(f"  ✓ LARGE sample size ({len(gdf_wgs84):,} points from {gdf_wgs84['AID'].nunique()} elk)")
    print(f"  ✓ Excellent for statistical robustness")
    print(f"  ✓ Diverse conditions across {gdf_wgs84['Feedground'].nunique()} feedgrounds")
    print(f"  ⚠️  Geographic distance from Area 048 (~200 miles)")
    print(f"  ⚠️  Data focuses on Feb-July (brucellosis period)")
    print(f"  → Best used for large-scale training and generalization")
    
    print(f"\nNext steps:")
    print("  1. Combine with South Bighorn + National Elk Refuge data")
    print("  2. Use for large sample size training")
    print("  3. Integrate with DataContextBuilder to add environmental features")
    print("  4. Create training dataset with positive examples (GPS points)")
    print("  5. Generate negative examples (random points)")
    print("  6. Train XGBoost model with weighted combination of all datasets")

SOUTHERN GYE DATASET SUMMARY

Total GPS points: 94,591
Unique elk: 288
Unique feedgrounds: 20

Geographic coverage:
  Latitude: 42.5352° to 44.2879°
  Longitude: -111.0527° to -109.1663°

Proximity to Area 048:
  Average distance: 353.23 km

Temporal coverage:
  Years: [2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]
  Months: [2, 3, 4, 5, 6, 7]

📋 Key Insights:
  ✓ LARGE sample size (94,591 points from 288 elk)
  ✓ Excellent for statistical robustness
  ✓ Diverse conditions across 20 feedgrounds
  ⚠️  Geographic distance from Area 048 (~200 miles)
  ⚠️  Data focuses on Feb-July (brucellosis period)
  → Best used for large-scale training and generalization

Next steps:
  1. Combine with South Bighorn + National Elk Refuge data
  2. Use for large sample size training
  3. Integrate with DataContextBuilder to add environmental features
  4. Create training dataset with positive examples (GPS points)
  5. Generate negative examples (random points)
  6. Train X