# Exploring National Elk Refuge GPS Collar Data (2006-2015)

This notebook explores the National Elk Refuge GPS collar dataset. It is useful for learning general elk behavior patterns, and its large sample (104,913 GPS fixes) makes it a good source of training data.

**Dataset Info:**
- **Location:** National Elk Refuge, Jackson, Wyoming
- **Coverage:** 17 adult female elk, 2006-2015
- **Data:** GPS locations, timestamps, migration patterns
- **Use Case:** General elk behavior patterns, seasonal timing, long time series
- **Note:** About 250 miles (roughly 405 km on average) from Area 048, so it provides general patterns rather than location-specific ones

**Download:** https://data.usgs.gov/datacatalog/data/USGS:5a9f2782e4b0b1c392e502ea

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from shapely.geometry import Point

# Set up paths
DATA_DIR = Path("../data/raw")
REFUGE_DIR = DATA_DIR / "elk_national_refuge"

print("=" * 60)
print("NATIONAL ELK REFUGE DATASET")
print("=" * 60)
print(f"\nData directory: {REFUGE_DIR}")
print(f"Directory exists: {REFUGE_DIR.exists()}")

# Look for data files
if REFUGE_DIR.exists():
    files = list(REFUGE_DIR.glob("*"))
    print(f"\nFiles found: {len(files)}")
    for f in files[:10]:
        print(f"  - {f.name}")
else:
    print("\n⚠️  Directory doesn't exist yet!")
    print("📥 Download instructions:")
    print("   1. Visit: https://data.usgs.gov/datacatalog/data/USGS:5a9f2782e4b0b1c392e502ea")
    print("   2. Download the dataset (CSV or shapefile format)")
    print("   3. Extract to: data/raw/elk_national_refuge/")

NATIONAL ELK REFUGE DATASET

Data directory: ../data/raw/elk_national_refuge
Directory exists: True

Files found: 1
  - Elk_GPS_collar_data_National_Elk_Refuge_2006-2015.csv


## Step 1: Load the Data

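The dataset may be distributed as a CSV of GPS points or as a shapefile. The cell below loads a shapefile if one is present; otherwise it falls back to the first CSV and auto-detects the latitude and longitude columns.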

In [2]:
# Try to find and load the data file
csv_files = list(REFUGE_DIR.glob("*.csv"))
shp_files = list(REFUGE_DIR.glob("*.shp"))

if shp_files:
    print(f"Loading shapefile: {shp_files[0].name}")
    gdf = gpd.read_file(shp_files[0])
    data_type = "shapefile"
elif csv_files:
    print(f"Loading CSV: {csv_files[0].name}")
    df = pd.read_csv(csv_files[0])
    
    # Auto-detect lat/lon columns
    lat_col = None
    lon_col = None
    for col in df.columns:
        col_lower = col.lower()
        if 'lat' in col_lower and lat_col is None:
            lat_col = col
        if ('lon' in col_lower or 'long' in col_lower) and lon_col is None:
            lon_col = col
    
    if lat_col and lon_col:
        print(f"  Found coordinates: {lat_col}, {lon_col}")
        gdf = gpd.GeoDataFrame(
            df,
            geometry=gpd.points_from_xy(df[lon_col], df[lat_col]),
            crs='EPSG:4326'
        )
        data_type = "csv_points"
    else:
        print(f"  ⚠️  Columns: {list(df.columns)}")
        print("  Please update the notebook to specify lat/lon column names.")
        gdf = None
        data_type = None
else:
    print("⚠️  No data files found!")
    gdf = None
    data_type = None

if gdf is not None:
    print(f"\n✓ Data loaded: {data_type}, Shape: {gdf.shape}, CRS: {gdf.crs}")

Loading CSV: Elk_GPS_collar_data_National_Elk_Refuge_2006-2015.csv
  Found coordinates: Lat, Long

✓ Data loaded: csv_points, Shape: (104913, 12), CRS: EPSG:4326
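
The auto-detection above relies on substring matching, so a column with a name such as `plateau_id` or `colony` would also match `lat` or `lon`. A stricter alternative, sketched below, compares whole column names against a short list of common spellings. The helper name `find_coord_columns` and the candidate lists are assumptions for illustration, not part of the notebook.

```python
# Hypothetical helper: match whole column names instead of substrings.
LAT_NAMES = ("lat", "latitude")
LON_NAMES = ("lon", "long", "lng", "longitude")


def find_coord_columns(columns):
    """Return (lat_col, lon_col); either may be None if no name matches."""
    by_lower = {}
    for col in columns:
        # Keep the first column seen for each lowercase name.
        by_lower.setdefault(col.strip().lower(), col)
    lat_col = next((by_lower[n] for n in LAT_NAMES if n in by_lower), None)
    lon_col = next((by_lower[n] for n in LON_NAMES if n in by_lower), None)
    return lat_col, lon_col


# Column names as printed in the Step 2 output of this notebook.
refuge_columns = ["Elk_ID", "DT", "TZ", "UTM_X", "UTM_Y", "Zone",
                  "Lat", "Long", "month", "year", "t"]
print(find_coord_columns(refuge_columns))
```

Exact matching trades convenience for predictability: an unusual column name is reported as missing instead of being silently picked up by a partial match.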


## Step 2: Inspect Dataset Structure

In [3]:
if gdf is not None:
    print("=" * 60)
    print("DATASET STRUCTURE")
    print("=" * 60)
    print(f"\nShape: {gdf.shape}")
    print(f"Columns: {list(gdf.columns)}")
    print(f"\nFirst few rows:")
    print(gdf.head())
    print(f"\nData types:")
    print(gdf.dtypes)
    print(f"\nMissing values:")
    missing = gdf.isnull().sum()
    if missing.sum() > 0:
        for col, count in missing[missing > 0].items():
            print(f"  {col}: {count} ({count/len(gdf)*100:.1f}%)")
    else:
        print("  ✓ No missing values!")

DATASET STRUCTURE

Shape: (104913, 12)
Columns: ['Elk_ID', 'DT', 'TZ', 'UTM_X', 'UTM_Y', 'Zone', 'Lat', 'Long', 'month', 'year', 't', 'geometry']

First few rows:
   Elk_ID                DT   TZ        UTM_X        UTM_Y  Zone        Lat  \
0     572  2006-03-01 18:01  MTN  550916.6207  4853505.016    12  43.832868   
1     572  2006-03-01 20:00  MTN  550927.0239  4853469.442    12  43.832547   
2     572  2006-03-01 22:00  MTN  550943.4909  4853461.016    12  43.832470   
3     572  2006-03-02 00:00  MTN  550958.4163  4853464.906    12  43.832504   
4     572  2006-03-02 02:01  MTN  551164.4670  4853321.873    12  43.831202   

         Long  month  year             t                     geometry  
0 -110.366700      3  2006  13209.042361   POINT (-110.3667 43.83287)  
1 -110.366574      3  2006  13209.125000  POINT (-110.36657 43.83255)  
2 -110.366370      3  2006  13209.208333  POINT (-110.36637 43.83247)  
3 -110.366184      3  2006  13209.291667   POINT (-110.36618 43.8325)  
4 
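
The `t` column is not described in this notebook. In the rows shown above, its values are consistent with fractional days since 1970-01-01 UTC, with `DT` recorded at a fixed UTC-7 offset (Mountain Standard Time). That reading is an inference from a handful of rows, not documented behavior, and it has not been checked against summer fixes where daylight saving time could apply. A minimal sketch of the check, using hypothetical helper names:

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumption: DT is at a fixed UTC-7 offset (MST), inferred from the sample rows.
ASSUMED_LOCAL_TZ = timezone(timedelta(hours=-7))


def t_to_local_minute(t_days):
    """Convert fractional days since the Unix epoch to local time, rounded to the minute."""
    utc = UNIX_EPOCH + timedelta(days=t_days)
    # Round to the nearest minute, since DT is printed without seconds.
    utc = (utc + timedelta(seconds=30)).replace(second=0, microsecond=0)
    return utc.astimezone(ASSUMED_LOCAL_TZ)


# (DT, t) pairs copied from the first rows printed above.
sample_rows = [
    ("2006-03-01 18:01", 13209.042361),
    ("2006-03-01 20:00", 13209.125000),
    ("2006-03-01 22:00", 13209.208333),
    ("2006-03-02 00:00", 13209.291667),
]
for dt_text, t_days in sample_rows:
    converted = t_to_local_minute(t_days).strftime("%Y-%m-%d %H:%M")
    print(dt_text, converted, converted == dt_text)
```

If the interpretation holds across the full dataset, `t` offers an unambiguous UTC timeline, whereas parsing `DT` with `pd.to_datetime` yields timezone-naive local timestamps.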

## Step 3: Extract Coordinates and Analyze Spatial Coverage

In [4]:
if gdf is not None:
    # Ensure we have lat/lon
    if 'latitude' not in gdf.columns or 'longitude' not in gdf.columns:
        if gdf.geometry is not None:
            gdf_wgs84 = gdf.to_crs('EPSG:4326') if gdf.crs != 'EPSG:4326' else gdf
            gdf_wgs84['latitude'] = gdf_wgs84.geometry.y
            gdf_wgs84['longitude'] = gdf_wgs84.geometry.x
        else:
            gdf_wgs84 = None
    else:
        gdf_wgs84 = gdf.to_crs('EPSG:4326') if gdf.crs != 'EPSG:4326' else gdf
    
    if gdf_wgs84 is not None:
        print("=" * 60)
        print("SPATIAL COVERAGE")
        print("=" * 60)
        print(f"\nLatitude: {gdf_wgs84['latitude'].min():.4f}° to {gdf_wgs84['latitude'].max():.4f}°")
        print(f"Longitude: {gdf_wgs84['longitude'].min():.4f}° to {gdf_wgs84['longitude'].max():.4f}°")
        
        # Distance to Area 048
        area_048_lat, area_048_lon = 41.835, -106.425
        
        from math import radians, sin, cos, sqrt, atan2
        
        def haversine_distance(lat1, lon1, lat2, lon2):
            """Great-circle distance in km between two (lat, lon) points given in degrees."""
            R = 6371  # Mean Earth radius in km
            lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
            dlat = lat2 - lat1
            dlon = lon2 - lon1
            a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
            c = 2 * atan2(sqrt(a), sqrt(1-a))
            return R * c
        
        gdf_wgs84['distance_to_area_048_km'] = gdf_wgs84.apply(
            lambda row: haversine_distance(row['latitude'], row['longitude'], area_048_lat, area_048_lon),
            axis=1
        )
        
        print(f"\nProximity to Area 048:")
        print(f"  Min distance: {gdf_wgs84['distance_to_area_048_km'].min():.2f} km")
        print(f"  Max distance: {gdf_wgs84['distance_to_area_048_km'].max():.2f} km")
        print(f"  Avg distance: {gdf_wgs84['distance_to_area_048_km'].mean():.2f} km")
        print(f"  Points within 200km: {(gdf_wgs84['distance_to_area_048_km'] <= 200).sum()} ({(gdf_wgs84['distance_to_area_048_km'] <= 200).sum() / len(gdf_wgs84) * 100:.1f}%)")
        print(f"\n⚠️  Note: National Elk Refuge is ~250 miles from Area 048.")
        print(f"   This data is valuable for general patterns, not geographic specificity.")

SPATIAL COVERAGE

Latitude: 43.4530° to 44.2879°
Longitude: -110.7760° to -109.9710°

Proximity to Area 048:
  Min distance: 384.85 km
  Max distance: 435.68 km
  Avg distance: 405.06 km
  Points within 200km: 0 (0.0%)

⚠️  Note: National Elk Refuge is ~250 miles from Area 048.
   This data is valuable for general patterns, not geographic specificity.
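
As a cross-check on the figures above, the haversine distance for the first fix in the dataset can be recomputed with the standard library and converted to miles. The sketch reuses the Area 048 reference point from the cell above; `haversine_km` and `km_to_miles` are hypothetical helper names. The average of 405.06 km works out to about 252 miles.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # same mean radius as the notebook cell
KM_PER_MILE = 1.609344    # exact by definition of the international mile


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def km_to_miles(km):
    """Convert kilometres to statute miles."""
    return km / KM_PER_MILE


area_048 = (41.835, -106.425)        # reference point used in the cell above
first_fix = (43.832868, -110.366700)  # first row of the dataset

distance_km = haversine_km(*first_fix, *area_048)
print(f"First fix: {distance_km:.2f} km = {km_to_miles(distance_km):.1f} miles")
print(f"Average:   405.06 km = {km_to_miles(405.06):.1f} miles")
```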


## Step 4: Analyze Temporal Patterns

In [6]:
if gdf_wgs84 is not None:
    # Try to find date column
    date_col = None
    for col in gdf_wgs84.columns:
        if 'date' in col.lower() or 'dt' in col.lower():
            date_col = col
            break
    
    if date_col:
        try:
            gdf_wgs84['date'] = pd.to_datetime(gdf_wgs84[date_col])
            gdf_wgs84['year'] = gdf_wgs84['date'].dt.year
            gdf_wgs84['month'] = gdf_wgs84['date'].dt.month
            
            print("=" * 60)
            print("TEMPORAL ANALYSIS")
            print("=" * 60)
            print(f"\nDate range: {gdf_wgs84['date'].min()} to {gdf_wgs84['date'].max()}")
            
            print(f"\nYear distribution:")
            for year, count in gdf_wgs84['year'].value_counts().sort_index().items():
                print(f"  {int(year)}: {count:,} points ({count/len(gdf_wgs84)*100:.1f}%)")
            
            print(f"\nMonth distribution:")
            for month, count in gdf_wgs84['month'].value_counts().sort_index().items():
                month_name = pd.to_datetime(f"2020-{month}-01").strftime("%B")
                print(f"  {month_name}: {count:,} points ({count/len(gdf_wgs84)*100:.1f}%)")
            
            # October analysis
            october_points = gdf_wgs84[gdf_wgs84['month'] == 10]
            print(f"\n🎯 October data: {len(october_points):,} points ({len(october_points)/len(gdf_wgs84)*100:.1f}%)")
        except Exception as e:
            print(f"⚠️  Could not parse dates: {e}")
    else:
        print("⚠️  No date column found")

TEMPORAL ANALYSIS

Date range: 2006-03-01 18:01:00 to 2015-08-25 06:00:00

Year distribution:
  2006: 2,611 points (2.5%)
  2007: 6,494 points (6.2%)
  2008: 7,717 points (7.4%)
  2009: 12,968 points (12.4%)
  2010: 6,414 points (6.1%)
  2011: 2,513 points (2.4%)
  2012: 7,843 points (7.5%)
  2013: 21,627 points (20.6%)
  2014: 28,879 points (27.5%)
  2015: 7,847 points (7.5%)

Month distribution:
  January: 10,183 points (9.7%)
  February: 9,711 points (9.3%)
  March: 10,267 points (9.8%)
  April: 9,080 points (8.7%)
  May: 5,892 points (5.6%)
  June: 5,333 points (5.1%)
  July: 6,634 points (6.3%)
  August: 7,200 points (6.9%)
  September: 7,491 points (7.1%)
  October: 11,228 points (10.7%)
  November: 11,063 points (10.5%)
  December: 10,831 points (10.3%)

🎯 October data: 11,228 points (10.7%)
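
The first rows printed in Step 2 suggest a fix interval of about two hours. A small sketch of how that interval can be measured from the `DT` strings is shown below, using only the five timestamps printed earlier. On the full dataset the differences would need to be computed per `Elk_ID`, so that gaps between animals are not counted as intervals.

```python
from datetime import datetime

# DT values for elk 572, copied from the Step 2 output.
dt_strings = [
    "2006-03-01 18:01",
    "2006-03-01 20:00",
    "2006-03-01 22:00",
    "2006-03-02 00:00",
    "2006-03-02 02:01",
]

fixes = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in dt_strings]

# Minutes elapsed between each pair of consecutive fixes.
intervals_min = [
    int((later - earlier).total_seconds() // 60)
    for earlier, later in zip(fixes, fixes[1:])
]
print(intervals_min)
```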


## Step 5: Prepare Data for PathWild Integration

In [7]:
if gdf_wgs84 is not None:
    # Create PathWild-ready dataset
    pathwild_data = pd.DataFrame({
        'latitude': gdf_wgs84['latitude'],
        'longitude': gdf_wgs84['longitude'],
        'distance_to_area_048_km': gdf_wgs84['distance_to_area_048_km']
    })
    
    # Add temporal info if available
    if 'date' in gdf_wgs84.columns:
        pathwild_data['date'] = gdf_wgs84['date']
        pathwild_data['year'] = gdf_wgs84['year']
        pathwild_data['month'] = gdf_wgs84['month']
    
    # Add other relevant columns
    for col in gdf_wgs84.columns:
        if col not in pathwild_data.columns and col not in ['geometry', 'latitude', 'longitude']:
            if gdf_wgs84[col].dtype in ['int64', 'float64', 'object']:
                pathwild_data[col] = gdf_wgs84[col]
    
    print("=" * 60)
    print("PATHWILD-READY DATASET")
    print("=" * 60)
    print(f"\nShape: {pathwild_data.shape}")
    print(f"Columns: {list(pathwild_data.columns)}")
    print(f"\nFirst few rows:")
    print(pathwild_data.head())
    
    # Save to CSV
    output_file = Path("../data/processed/national_refuge_points.csv")
    output_file.parent.mkdir(parents=True, exist_ok=True)
    pathwild_data.to_csv(output_file, index=False)
    print(f"\n✓ Saved to {output_file}")

PATHWILD-READY DATASET

Shape: (104913, 15)
Columns: ['latitude', 'longitude', 'distance_to_area_048_km', 'date', 'year', 'month', 'Elk_ID', 'DT', 'TZ', 'UTM_X', 'UTM_Y', 'Zone', 'Lat', 'Long', 't']

First few rows:
    latitude   longitude  distance_to_area_048_km                date  year  \
0  43.832868 -110.366700               390.644714 2006-03-01 18:01:00  2006   
1  43.832547 -110.366574               390.616672 2006-03-01 20:00:00  2006   
2  43.832470 -110.366370               390.598294 2006-03-01 22:00:00  2006   
3  43.832504 -110.366184               390.587899 2006-03-02 00:00:00  2006   
4  43.831202 -110.363635               390.337522 2006-03-02 02:01:00  2006   

   month  Elk_ID                DT   TZ        UTM_X        UTM_Y  Zone  \
0      3     572  2006-03-01 18:01  MTN  550916.6207  4853505.016    12   
1      3     572  2006-03-01 20:00  MTN  550927.0239  4853469.442    12   
2      3     572  2006-03-01 22:00  MTN  550943.4909  4853461.016    12   
3      3 

## Step 6: Summary and Next Steps

In [8]:
if gdf_wgs84 is not None:
    print("=" * 60)
    print("NATIONAL ELK REFUGE DATASET SUMMARY")
    print("=" * 60)
    print(f"\nTotal GPS points: {len(gdf_wgs84):,}")
    print(f"\nGeographic coverage:")
    print(f"  Latitude: {gdf_wgs84['latitude'].min():.4f}° to {gdf_wgs84['latitude'].max():.4f}°")
    print(f"  Longitude: {gdf_wgs84['longitude'].min():.4f}° to {gdf_wgs84['longitude'].max():.4f}°")
    print(f"\nProximity to Area 048:")
    print(f"  Average distance: {gdf_wgs84['distance_to_area_048_km'].mean():.2f} km")
    
    print(f"\n📋 Key Insights:")
    print(f"  ✓ Large sample size for general elk behavior patterns")
    print(f"  ✓ Long time series (2006-2015)")
    print(f"  ✓ Useful for understanding seasonal timing")
    print(f"  ⚠️  Geographic distance from Area 048 (~250 miles)")
    print(f"  → Best used for general patterns, not geographic specificity")
    
    print(f"\nNext steps:")
    print("  1. Combine with South Bighorn data for hybrid training")
    print("  2. Use for general elk behavior patterns")
    print("  3. Integrate with DataContextBuilder to add environmental features")
    print("  4. Create training dataset with positive examples (GPS points)")
    print("  5. Generate negative examples (random points)")
    print("  6. Train XGBoost model with weighted combination of datasets")

NATIONAL ELK REFUGE DATASET SUMMARY

Total GPS points: 104,913

Geographic coverage:
  Latitude: 43.4530° to 44.2879°
  Longitude: -110.7760° to -109.9710°

Proximity to Area 048:
  Average distance: 405.06 km

📋 Key Insights:
  ✓ Large sample size for general elk behavior patterns
  ✓ Long time series (2006-2015)
  ✓ Useful for understanding seasonal timing
  ⚠️  Geographic distance from Area 048 (~250 miles)
  → Best used for general patterns, not geographic specificity

Next steps:
  1. Combine with South Bighorn data for hybrid training
  2. Use for general elk behavior patterns
  3. Integrate with DataContextBuilder to add environmental features
  4. Create training dataset with positive examples (GPS points)
  5. Generate negative examples (random points)
  6. Train XGBoost model with weighted combination of datasets
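
Item 5 in the list above, generating negative examples from random points, could be approached in several ways. One minimal sketch is to sample uniformly inside the latitude/longitude bounding box reported in the spatial coverage output. The function name, the fixed seed, and the choice of a plain bounding box are assumptions for illustration. A real pipeline would likely also need to exclude points close to observed GPS fixes, and uniform sampling in degrees is only approximately uniform in area.

```python
import random

# Bounding box taken from the spatial coverage output of this notebook.
LAT_MIN, LAT_MAX = 43.4530, 44.2879
LON_MIN, LON_MAX = -110.7760, -109.9710


def sample_random_points(n, seed=0):
    """Return n (latitude, longitude) tuples drawn uniformly from the bounding box."""
    rng = random.Random(seed)  # local generator keeps the sampling reproducible
    return [
        (rng.uniform(LAT_MIN, LAT_MAX), rng.uniform(LON_MIN, LON_MAX))
        for _ in range(n)
    ]


negatives = sample_random_points(1000)
print(len(negatives), negatives[0])
```

Using a seeded `random.Random` instance rather than the module-level functions means the same negative set can be regenerated when the training data is rebuilt.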
