# NFL Big Data Bowl 2026 - Data Download

This notebook downloads the competition data directly from Kaggle using the Kaggle API.

**Prerequisites:**
1. You must have accepted the competition rules on Kaggle
2. Your `kaggle.json` API token must be in the `~/.kaggle/` directory
3. Run `pip install -r requirements.txt` first
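
The credential prerequisite can be checked before calling the API. The sketch below is a minimal, hypothetical helper (`find_kaggle_credentials` is not part of the Kaggle package). It assumes the Kaggle client accepts either the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables or a `kaggle.json` file in `KAGGLE_CONFIG_DIR`, falling back to `~/.kaggle/`.

```python
import os
from pathlib import Path


def find_kaggle_credentials(env=None, home=None):
    """Return a description of where Kaggle credentials were found, or None.

    Hypothetical helper: `env` and `home` are injectable for testing and
    default to the real environment and home directory.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Assumption: credentials may be supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment variables"

    # Assumption: otherwise kaggle.json lives in KAGGLE_CONFIG_DIR or ~/.kaggle/.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or home / ".kaggle")
    token = config_dir / "kaggle.json"
    if token.is_file():
        return str(token)
    return None


source = find_kaggle_credentials()
if source is None:
    print("No Kaggle credentials found; download kaggle.json from your account page")
else:
    print(f"Kaggle credentials found via: {source}")
```

This check only confirms that a token is present. It does not confirm that the token is valid or that the competition rules were accepted.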

**Competition URL:** https://www.kaggle.com/competitions/nfl-big-data-bowl-2026-analytics

## Setup and Imports

In [2]:
import os
import zipfile
from pathlib import Path
import pandas as pd

# Kaggle API
from kaggle.api.kaggle_api_extended import KaggleApi

print("✅ All imports successful!")

✅ All imports successful!


## Configure Paths

In [29]:
# Create data directory if it doesn't exist
DATA_DIR = Path('./data')
DATA_DIR.mkdir(exist_ok=True)
COMPETITION_DIR = DATA_DIR / '114239_nfl_competition_files_published_analytics_final'
TRAIN_DIR = COMPETITION_DIR / 'train'

# Competition name
COMPETITION_NAME = 'nfl-big-data-bowl-2026-analytics'

print(f"📁 Data will be saved to: {DATA_DIR.absolute()}")

📁 Data will be saved to: d:\PyScripts\NFL-Big-Data-Bowl-2026-Analytics\data


## Authenticate with Kaggle API

In [6]:
# Initialize Kaggle API
api = KaggleApi()
api.authenticate()

print("✅ Kaggle API authenticated successfully!")
print("\n📋 Note: Make sure you've accepted the competition rules at:")
print(f"   https://www.kaggle.com/competitions/{COMPETITION_NAME}")

✅ Kaggle API authenticated successfully!

📋 Note: Make sure you've accepted the competition rules at:
   https://www.kaggle.com/competitions/nfl-big-data-bowl-2026-analytics


## Check Available Files

Let's see what files are available in the competition:

In [8]:
# List competition files
try:
    files = api.competition_list_files(COMPETITION_NAME)
    print("📦 Available files in competition:\n")
    if hasattr(files, 'files'):
        for file in files.files:
            print(f"   {file}")
    else:
        file_list = files
        for f in file_list:
            print(f"   - {f.name} ({f.size / (1024**2):.1f} MB)")
except Exception as e:
    print(f"❌ Error: {e}")
    print("\n⚠️  Make sure you've:")
    print("   1. Accepted the competition rules")
    print("   2. Set up your kaggle.json credentials correctly")

📦 Available files in competition:

   {"ref": "", "name": "114239_nfl_competition_files_published_analytics_final/train/input_2023_w01.csv", "description": "", "totalBytes": 48950314, "url": "", "creationDate": "2025-09-23T18:36:28.263Z"}
   {"ref": "", "name": "114239_nfl_competition_files_published_analytics_final/train/input_2023_w02.csv", "description": "", "totalBytes": 49485029, "url": "", "creationDate": "2025-09-23T18:36:28.263Z"}
   {"ref": "", "name": "114239_nfl_competition_files_published_analytics_final/train/input_2023_w03.csv", "description": "", "totalBytes": 51062128, "url": "", "creationDate": "2025-09-23T18:36:28.263Z"}
   {"ref": "", "name": "114239_nfl_competition_files_published_analytics_final/train/input_2023_w04.csv", "description": "", "totalBytes": 46685806, "url": "", "creationDate": "2025-09-23T18:36:28.263Z"}
   {"ref": "", "name": "114239_nfl_competition_files_published_analytics_final/train/input_2023_w05.csv", "description": "", "totalBytes": 43574971, 
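
The raw records above are hard to scan. As a rough sketch, the listing could be condensed to file names and sizes. The helper `summarize_files` is hypothetical and assumes each record is available as a plain dict with the `name` and `totalBytes` keys shown in the output; the objects returned by the Kaggle client may expose these fields differently depending on the package version.

```python
def summarize_files(records):
    """Return ([(short_name, size_mb), ...], total_mb) for file records.

    Assumption: each record is a dict with "name" and "totalBytes" keys,
    mirroring the JSON printed by the listing cell.
    """
    rows = []
    for rec in records:
        short_name = rec["name"].rsplit("/", 1)[-1]  # drop the directory prefix
        size_mb = rec["totalBytes"] / (1024 ** 2)    # bytes to MB
        rows.append((short_name, size_mb))
    total_mb = sum(size for _, size in rows)
    return rows, total_mb


# Two records copied from the listing output above
sample = [
    {"name": "114239_nfl_competition_files_published_analytics_final/train/input_2023_w01.csv",
     "totalBytes": 48950314},
    {"name": "114239_nfl_competition_files_published_analytics_final/train/input_2023_w02.csv",
     "totalBytes": 49485029},
]

rows, total_mb = summarize_files(sample)
for name, size_mb in rows:
    print(f"   - {name} ({size_mb:.1f} MB)")
print(f"   Total: {total_mb:.1f} MB")
```

The sizes computed this way use the same bytes-to-MB conversion as the verification cell later in the notebook, so the two listings can be compared directly.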

## Download Competition Data

The next cell downloads all competition files as a single ZIP archive into the data directory:

In [9]:
print("⬇️  Downloading competition files...\n")
print("This may take a few minutes depending on your internet speed.\n")

try:
    # Download all competition files
    api.competition_download_files(
        COMPETITION_NAME,
        path=str(DATA_DIR),
        quiet=False
    )
    print("\n✅ Download complete!")
    
except Exception as e:
    print(f"\n❌ Download failed: {e}")
    print("\nTroubleshooting:")
    print("1. Verify you accepted competition rules")
    print("2. Check your internet connection")
    print("3. Verify kaggle.json is in the correct location")

⬇️  Downloading competition files...

This may take a few minutes depending on your internet speed.

Downloading nfl-big-data-bowl-2026-analytics.zip to data


100%|██████████| 103M/103M [00:00<00:00, 1.51GB/s]



✅ Download complete!
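
Re-running the notebook repeats the download even when the archive is already on disk. A minimal sketch of a guard follows. The helper names are hypothetical, and the archive name `<competition>.zip` is taken from the download log above.

```python
from pathlib import Path


def archive_path(data_dir, competition):
    """Path where the competition archive is expected (name taken from the log)."""
    return Path(data_dir) / f"{competition}.zip"


def needs_download(data_dir, competition):
    """Return True when the archive is missing or empty."""
    archive = archive_path(data_dir, competition)
    return not archive.is_file() or archive.stat().st_size == 0


# Possible usage with the names defined earlier in this notebook:
# if needs_download(DATA_DIR, COMPETITION_NAME):
#     api.competition_download_files(COMPETITION_NAME, path=str(DATA_DIR), quiet=False)
# else:
#     print("Archive already present, skipping download")
```

The guard only checks that a non-empty file exists. A partially downloaded archive would still pass, so an integrity check at extraction time remains useful.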





## Extract ZIP Files

In [10]:
# Find and extract ZIP files
zip_files = list(DATA_DIR.glob('*.zip'))

if zip_files:
    print(f"📦 Found {len(zip_files)} ZIP file(s) to extract\n")
    
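    failed = []
    for zip_path in zip_files:
        print(f"Extracting {zip_path.name}...")
        try:
            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall(DATA_DIR)
            print("  ✅ Extracted successfully")
            
            # Optionally remove ZIP file after extraction
            # zip_path.unlink()
            # print("  🗑️  Removed ZIP file")
            
        except Exception as e:
            failed.append(zip_path.name)
            print(f"  ❌ Error extracting: {e}")
    
    # Only report overall success if every archive extracted cleanly
    if failed:
        print(f"\n⚠️  {len(failed)} archive(s) failed to extract: {', '.join(failed)}")
    else:
        print("\n✅ All files extracted!")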
else:
    print("⚠️  No ZIP files found to extract")

📦 Found 1 ZIP file(s) to extract

Extracting nfl-big-data-bowl-2026-analytics.zip...
  ✅ Extracted successfully

✅ All files extracted!
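
`extractall` rewrites every file each time the cell runs. The sketch below shows one possible alternative: it checks the archive with `ZipFile.testzip()` and extracts only members that are missing or whose size differs. The helper `extract_missing` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_missing(zip_path, dest_dir):
    """Extract members not already present in dest_dir; return their names."""
    dest_dir = Path(dest_dir)
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as zf:
        # testzip() reads every member and returns the first corrupt name, if any
        bad_member = zf.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member: {bad_member}")

        for info in zf.infolist():
            if info.is_dir():
                continue
            target = dest_dir / info.filename
            # Skip files that already exist with the expected uncompressed size
            if target.is_file() and target.stat().st_size == info.file_size:
                continue
            zf.extract(info, dest_dir)
            extracted.append(info.filename)
    return extracted


# Possible usage with the names defined earlier in this notebook:
# new_files = extract_missing(DATA_DIR / f"{COMPETITION_NAME}.zip", DATA_DIR)
# print(f"Extracted {len(new_files)} new file(s)")
```

Comparing sizes is a cheap heuristic, not a content check. A file edited in place to the same length would be skipped.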


## Verify Downloaded Files

In [27]:
all_csv_files = list(DATA_DIR.rglob('*.csv'))

print(f"\n📊 Found {len(all_csv_files)} CSV files total\n")
print("="*80)

# Organize by directory
files_by_dir = {}
for csv_file in sorted(all_csv_files):
    parent_dir = csv_file.parent.name
    if parent_dir not in files_by_dir:
        files_by_dir[parent_dir] = []
    files_by_dir[parent_dir].append(csv_file)
    
# Display files organized by directory
for dir_name, files in files_by_dir.items():
    print(f"\n📂 Directory: {dir_name}/")
    print("-" * 80)
    for csv_file in files:
        size_mb = csv_file.stat().st_size / (1024**2)
        print(f"   ✓ {csv_file.name:<40} ({size_mb:>8.1f} MB)")
    print(f"\n   Subtotal: {len(files)} files")

print("\n" + "="*80)
print(f"✅ Total CSV files found: {len(all_csv_files)}")
print("="*80)


📊 Found 37 CSV files total


📂 Directory: 114239_nfl_competition_files_published_analytics_final/
--------------------------------------------------------------------------------
   ✓ supplementary_data.csv                   (     7.4 MB)

   Subtotal: 1 files

📂 Directory: train/
--------------------------------------------------------------------------------
   ✓ input_2023_w01.csv                       (    46.7 MB)
   ✓ input_2023_w02.csv                       (    47.2 MB)
   ✓ input_2023_w03.csv                       (    48.7 MB)
   ✓ input_2023_w04.csv                       (    44.5 MB)
   ✓ input_2023_w05.csv                       (    41.6 MB)
   ✓ input_2023_w06.csv                       (    44.2 MB)
   ✓ input_2023_w07.csv                       (    38.1 MB)
   ✓ input_2023_w08.csv                       (    45.9 MB)
   ✓ input_2023_w09.csv                       (    41.2 MB)
   ✓ input_2023_w10.csv                       (    42.6 MB)
   ✓ input_2023_w11.csv             
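
Beyond listing what is on disk, it can help to check for gaps. The sketch below assumes the naming pattern seen above (`input_2023_wNN.csv`, `output_2023_wNN.csv`) and assumes the weeks run contiguously from 1 to 18, which is inferred from the file counts reported later in this notebook. The helper `missing_weekly_files` is hypothetical.

```python
from pathlib import Path


def missing_weekly_files(train_dir, season=2023, weeks=range(1, 19)):
    """Return expected weekly file names that are absent from train_dir.

    Assumption: one input and one output file per week, weeks 1 to 18.
    """
    train_dir = Path(train_dir)
    expected = [
        f"{kind}_{season}_w{week:02d}.csv"
        for kind in ("input", "output")
        for week in weeks
    ]
    return [name for name in expected if not (train_dir / name).is_file()]


# Possible usage with the names defined earlier in this notebook:
# missing = missing_weekly_files(TRAIN_DIR)
# print("All weekly files present" if not missing else f"Missing: {missing}")
```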

## Quick Data Preview

Let's peek at the structure of each dataset:

In [30]:
print("\n" + "="*80)
print("📊 SUPPLEMENTARY DATA PREVIEW")
print("="*80)

# Check supplementary data
supp_file = COMPETITION_DIR / 'supplementary_data.csv'
if supp_file.exists():
    try:
        df_supp = pd.read_csv(supp_file, nrows=5)
        print(f"\nFile: supplementary_data.csv")
        print(f"Shape: {df_supp.shape[0]} rows × {df_supp.shape[1]} columns (showing first 5)")
        print(f"\nColumns: {', '.join(df_supp.columns.tolist())}")
        print(f"\nFirst few rows:")
        print(df_supp.head())
    except Exception as e:
        print(f"❌ Error reading supplementary_data.csv: {e}")
else:
    print("⚠️  supplementary_data.csv not found")

print("\n" + "="*80)
print("📊 WEEKLY INPUT FILES PREVIEW")
print("="*80)

# Check input files (week 2 as example)
input_w02 = TRAIN_DIR / 'input_2023_w02.csv'
if input_w02.exists():
    try:
        df_input = pd.read_csv(input_w02, nrows=5)
        print(f"\nFile: input_2023_w02.csv (example week)")
        print(f"Shape: {df_input.shape[0]} rows × {df_input.shape[1]} columns (showing first 5)")
        print(f"\nColumns: {', '.join(df_input.columns.tolist())}")
        print(f"\nFirst few rows:")
        print(df_input.head())
    except Exception as e:
        print(f"❌ Error reading input file: {e}")
else:
    print("⚠️  input_2023_w02.csv not found")

print("\n" + "="*80)
print("📊 WEEKLY OUTPUT FILES PREVIEW")
print("="*80)

# Check output files (week 2 as example)
output_w02 = TRAIN_DIR / 'output_2023_w02.csv'
if output_w02.exists():
    try:
        df_output = pd.read_csv(output_w02, nrows=5)
        print(f"\nFile: output_2023_w02.csv (example week)")
        print(f"Shape: {df_output.shape[0]} rows × {df_output.shape[1]} columns (showing first 5)")
        print(f"\nColumns: {', '.join(df_output.columns.tolist())}")
        print(f"\nFirst few rows:")
        print(df_output.head())
    except Exception as e:
        print(f"❌ Error reading output file: {e}")
else:
    print("⚠️  output_2023_w02.csv not found")


📊 SUPPLEMENTARY DATA PREVIEW

File: supplementary_data.csv
Shape: 5 rows × 41 columns (showing first 5)

Columns: game_id, season, week, game_date, game_time_eastern, home_team_abbr, visitor_team_abbr, play_id, play_description, quarter, game_clock, down, yards_to_go, possession_team, defensive_team, yardline_side, yardline_number, pre_snap_home_score, pre_snap_visitor_score, play_nullified_by_penalty, pass_result, pass_length, offense_formation, receiver_alignment, route_of_targeted_receiver, play_action, dropback_type, dropback_distance, pass_location_type, defenders_in_the_box, team_coverage_man_zone, team_coverage_type, penalty_yards, pre_penalty_yards_gained, yards_gained, expected_points, expected_points_added, pre_snap_home_team_win_probability, pre_snap_visitor_team_win_probability, home_team_win_probability_added, visitor_team_win_probility_added

First few rows:
      game_id  season  week   game_date game_time_eastern home_team_abbr  \
0  2023090700    2023     1  09/07/202

## Load Full Datasets (Memory Check)

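The cell below loads the supplementary data in full and a 10,000-row sample of selected weekly files, then reports how much memory each DataFrame uses: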

In [32]:
def load_and_check_memory(filepath, sample_only=True, nrows=10000):
    """Load a CSV (or its first `nrows` rows) and report the DataFrame's memory usage."""
    
    if not filepath.exists():
        print(f"⚠️  {filepath.name} not found")
        return None
    
    print(f"\nLoading {filepath.name}...")
    
    try:
        if sample_only:
            # Load sample for memory estimation
            df = pd.read_csv(filepath, nrows=nrows)
            print(f"  ℹ️  Loaded sample: {len(df):,} rows (out of potentially more)")
        else:
            # Load full file
            df = pd.read_csv(filepath)
            print(f"  ✓ Loaded: {len(df):,} rows × {len(df.columns)} columns")
        
        memory_mb = df.memory_usage(deep=True).sum() / (1024**2)
        print(f"  💾 Memory: {memory_mb:.1f} MB")
        
        return df
        
    except Exception as e:
        print(f"  ❌ Error loading: {e}")
        return None

print("\n" + "="*80)
print("Loading datasets (samples)...")
print("="*80)

# Load supplementary data
print("\n📊 Supplementary Data:")
supp_data = load_and_check_memory(COMPETITION_DIR / 'supplementary_data.csv', sample_only=False)

# Load a few weekly input files as examples
print("\n📊 Sample Input Files:")
input_w02 = load_and_check_memory(TRAIN_DIR / 'input_2023_w02.csv', sample_only=True)
input_w10 = load_and_check_memory(TRAIN_DIR / 'input_2023_w10.csv', sample_only=True)

# Load a few weekly output files as examples
print("\n📊 Sample Output Files:")
output_w02 = load_and_check_memory(TRAIN_DIR / 'output_2023_w02.csv', sample_only=True)
output_w10 = load_and_check_memory(TRAIN_DIR / 'output_2023_w10.csv', sample_only=True)

print("\n" + "="*80)
print("✅ Sample datasets loaded successfully!")
print("="*80)
print("\nNote: Weekly files loaded as samples (10,000 rows) to save memory.")
print("You can load full files by setting sample_only=False")

# Count all weekly files
input_files = sorted(TRAIN_DIR.glob('input_2023_w*.csv'))
output_files = sorted(TRAIN_DIR.glob('output_2023_w*.csv'))

print("\n" + "="*80)
print("📈 DATASET SUMMARY")
print("="*80)
print(f"\n✓ Supplementary data file: 1")
print(f"✓ Weekly input files: {len(input_files)}")
print(f"✓ Weekly output files: {len(output_files)}")
print(f"\nTotal files: {1 + len(input_files) + len(output_files)}")

print("\n📋 Input files (by week):")
for f in input_files:
    print(f"   • {f.name}")

print("\n📋 Output files (by week):")
for f in output_files:
    print(f"   • {f.name}")


Loading datasets (samples)...

📊 Supplementary Data:

Loading supplementary_data.csv...
  ✓ Loaded: 18,009 rows × 41 columns
  💾 Memory: 22.1 MB

📊 Sample Input Files:

Loading input_2023_w02.csv...
  ℹ️  Loaded sample: 10,000 rows (out of potentially more)
  💾 Memory: 5.0 MB

Loading input_2023_w10.csv...
  ℹ️  Loaded sample: 10,000 rows (out of potentially more)
  💾 Memory: 5.0 MB

📊 Sample Output Files:

Loading output_2023_w02.csv...
  ℹ️  Loaded sample: 10,000 rows (out of potentially more)
  💾 Memory: 0.5 MB

Loading output_2023_w10.csv...
  ℹ️  Loaded sample: 10,000 rows (out of potentially more)
  💾 Memory: 0.5 MB

✅ Sample datasets loaded successfully!

Note: Weekly files loaded as samples (10,000 rows) to save memory.
You can load full files by setting sample_only=False

📈 DATASET SUMMARY

✓ Supplementary data file: 1
✓ Weekly input files: 18
✓ Weekly output files: 18

Total files: 37

📋 Input files (by week):
   • input_2023_w01.csv
   • input_2023_w02.csv
   • input_2023_w
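
The sample figures above cover only 10,000 rows per file. A rough estimate for a full weekly file can be obtained by scaling the sample's memory by the file's row count. The sketch below uses two hypothetical helpers and assumes that no CSV field contains an embedded newline and that rows beyond the sample have a similar footprint.

```python
def count_data_rows(csv_path):
    """Count data rows in a CSV by counting lines, excluding the header.

    Assumption: no quoted field contains an embedded newline.
    """
    with open(csv_path, "rb") as fh:
        line_count = sum(1 for _ in fh)
    return max(line_count - 1, 0)


def estimate_full_memory_mb(sample_mb, sample_rows, total_rows):
    """Linearly scale a sample's memory usage to the full row count."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_mb * total_rows / sample_rows


# Possible usage with the names defined earlier in this notebook:
# total_rows = count_data_rows(TRAIN_DIR / 'input_2023_w02.csv')
# sample_mb = input_w02.memory_usage(deep=True).sum() / (1024**2)
# print(f"Estimated full size: {estimate_full_memory_mb(sample_mb, len(input_w02), total_rows):.1f} MB")
```

The result is only an approximation. String-heavy columns and mixed dtypes can make later rows larger or smaller than the sampled ones.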



## Summary

Data download and setup complete! You now have:

1. ✅ Competition data downloaded from Kaggle
2. ✅ Files extracted to `data/` directory
3. ✅ File inventory verified and sample data loaded

**Next Steps:**
- Continue to `02_exploration.ipynb` to explore the data
- Or start analyzing in `03_analysis.ipynb`

In [33]:
print("\n🎉 Setup Complete!\n")
print("Next notebook: 02_exploration.ipynb")
print("\nHappy analyzing! 🏈")


🎉 Setup Complete!

Next notebook: 02_exploration.ipynb

Happy analyzing! 🏈
