# Group Name & number: DLNK, #26
## Group members:
Natiq Khan (nak9135@nyu.edu)\
David Lopez (dld388@nyu.edu)

Welcome to our project! We analyse and attempt to predict underdog wins in the NFL using machine learning techniques. We use a variety of data sources, feature engineering and prediction models to accomplish this goal.\
We define an `underdog-win` as a game in which the team with the unfavorable spread (i.e., the team the betting odds are against) wins.\
We acknowledge that our model may fail to predict underdog wins more reliably than the betting lines. Regardless, the project is a good exercise in data analysis, machine learning modeling and industry coding practices. The Vegas line also proved useful as a point of reference: if our model makes the same predictions as the Vegas line, that would signify our code is mostly on the right track :)
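
To make the definition concrete, here is a minimal sketch of how an `underdog-win` could be labeled for one team in one game. The function name `is_underdog_win` and its arguments are hypothetical, and the sign convention (a positive spread marks the underdog, from that team's point of view) is an assumption that should be checked against the column actually used in the data.

```python
def is_underdog_win(team_spread: float, team_score: int, opp_score: int) -> bool:
    """Return True if the team was the underdog and won the game outright.

    Assumption: team_spread is quoted from the team's own point of view,
    so a positive value (e.g. +3.5) means the team was the underdog.
    """
    was_underdog = team_spread > 0
    won_outright = team_score > opp_score
    return was_underdog and won_outright


# Hypothetical games, not real results
print(is_underdog_win(+3.5, 24, 20))  # underdog wins outright
print(is_underdog_win(-7.0, 31, 10))  # favorite wins
print(is_underdog_win(+3.5, 17, 20))  # underdog loses
```

Ties and games with a spread of exactly zero are treated as "not an underdog win" in this sketch.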

**DISCLAIMER**: *We are not doing this for gambling purposes, but rather to try and uncover patterns and trends previously undetected. This project is motivated purely by curiosity and has zero monetary incentives.* 

Our project is divided into 4 notebooks, each serving its own unique function:
1. **→Notebook-1: Data retrieval and loading**
2. Notebook-2: Data cleaning and standardizing
3. Notebook-3: Exploratory analysis and visualizations
4. Notebook-4: Machine Learning and predictions

We built each notebook in keeping with the principles of **modularization and testing**, as well as **abstraction**, which made it easier to share our work between members.

# 1.1 Configuring environment
While we had access to data from 1999 to 2025, changes in team names and major rule overhauls over the years make it very difficult to analyse the complete timeline. We therefore restricted our analysis to **2005-2015**, which we consider the longest consistent interval in the available data.

### We begin by downloading the CSV files for the play-by-play datasets from the `nflverse` repository (https://github.com/nflverse/nflverse-data/releases/tag/pbp).

General overview of what the code below does:
1. Import the necessary libraries
2. Download the NFL play-by-play CSV files from the nflverse GitHub release, restricted to 2005–2015
3. Run these steps sequentially, one file at a time

In [1]:
# To create folders and save files on system
import os
# For parsing and selecting specific files
import re
# In order to make web requests
import requests
# For type hints, to make code easier to read and debug
from typing import Optional, Dict, Any

To keep our code modular, we collected the configuration into constants and wrote reusable functions, so new parameters can easily be passed in if our requirements change:

In [2]:
# Configuration
OWNER = "nflverse"
REPO = "nflverse-data"
TAG = "pbp"  # GitHub release tag where PBP data resides
DOWNLOAD_DIR = "pbp_csv"
API_URL = f"https://api.github.com/repos/{OWNER}/{REPO}/releases/tags/{TAG}"

# Only download files whose filename contains a year in this interval:
START_SEASON = 2005
END_SEASON = 2015

# 1.2 Function definitions

In [3]:
# Utility functions:

# Universal timeout for every function that makes a network request.
# This handled almost all of our test cases.
DEFAULT_TIMEOUT = 20  # seconds

def extract_year_from_filename(name: str) -> Optional[int]:
    """
    Extract a 4-digit year from a filename.

    Parameters
    ----------
    name : str
        Filename to inspect.

    Returns
    -------
    int or None
        Four-digit year if found, otherwise None.
    """
    match = re.search(r"(\d{4})", name)
    if not match:
        return None
    return int(match.group(1))


def fetch_release(tag: str) -> Dict[str, Any]:
    """
    Fetch metadata for a GitHub release by tag.

    Parameters
    ----------
    tag : str
        Release tag to query (e.g., 'pbp').

    Returns
    -------
    dict
        JSON metadata describing the release, including assets.

    Raises
    ------
    RuntimeError
        If the GitHub request fails.
    """
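    # Build the URL from the tag argument so the parameter is actually used
    url = f"https://api.github.com/repos/{OWNER}/{REPO}/releases/tags/{tag}"
    try:
        resp = requests.get(url, timeout=DEFAULT_TIMEOUT)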
        resp.raise_for_status()
    except requests.RequestException as e:
        raise RuntimeError(f"Failed to fetch release '{tag}' from GitHub: {e}") from e

    return resp.json()


def download_asset(url: str, out_path: str) -> None:
    """
    Download a file from a URL and save it locally in streamed chunks.

    Parameters
    ----------
    url : str
        Direct browser_download_url from the GitHub release.

    out_path : str
        Output path for the downloaded file.

    Raises
    ------
    RuntimeError
        If the download operation encounters a network or file error.
    """
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)

    try:
        with requests.get(url, stream=True, timeout=DEFAULT_TIMEOUT) as r:
            r.raise_for_status()
            with open(out_path, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)
    except (requests.RequestException, OSError) as e:
        raise RuntimeError(f"Failed to download asset from {url}: {e}") from e
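
One caveat about `extract_year_from_filename`: it returns the first run of four digits it finds, so a name containing a date stamp or a long ID would also yield a "year". The sketch below shows a stricter alternative, under the assumption that only standalone 19xx or 20xx values should count. The name `extract_season` and the sample filenames other than `play_by_play_2005.csv` are hypothetical.

```python
import re
from typing import Optional


def extract_season(name: str) -> Optional[int]:
    """Hypothetical stricter variant: accept only a standalone 19xx/20xx year.

    The lookarounds reject four digits that are part of a longer digit run.
    """
    match = re.search(r"(?<!\d)((?:19|20)\d{2})(?!\d)", name)
    return int(match.group(1)) if match else None


print(extract_season("play_by_play_2005.csv"))      # 2005
print(extract_season("play_by_play_20051231.csv"))  # None (digits belong to a longer run)
print(extract_season("README.md"))                  # None
```

For the filenames in this release the two versions behave the same; the stricter pattern only matters if other naming schemes appear later.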

# 1.3 Function calls
Now that our utility functions are ready, we can make the HTTP requests and retrieve our data files. The code below does the following:
1. Fetch the release metadata, which lists every file (asset) in the release
2. Keep only the files that end in `.csv` and whose year falls in `2005-2015`
3. Create a folder called `pbp_csv` in the working directory
4. Download the selected files from GitHub and save them in that folder
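
In the spirit of the testing principle mentioned earlier, the filtering step can be checked offline before any network call is made. The sketch below wraps the same logic in a hypothetical helper, `select_csv_assets`, and runs it on made-up asset metadata. The entries in `fake_assets` are illustrative only and are not taken from the real release.

```python
import re
from typing import Any, Dict, List


def select_csv_assets(
    assets: List[Dict[str, Any]], start: int, end: int
) -> List[Dict[str, Any]]:
    """Keep assets whose name ends in .csv and contains a year in [start, end]."""
    selected = []
    for asset in assets:
        name = asset.get("name", "")
        if not name.endswith(".csv"):
            continue
        match = re.search(r"(\d{4})", name)
        if match and start <= int(match.group(1)) <= end:
            selected.append(asset)
    return selected


# Made-up metadata for an offline check (not the real release contents)
fake_assets = [
    {"name": "play_by_play_2004.csv"},
    {"name": "play_by_play_2005.csv"},
    {"name": "play_by_play_2005.parquet"},
    {"name": "play_by_play_2015.csv"},
    {"name": "play_by_play_2016.csv"},
]
names = [a["name"] for a in select_csv_assets(fake_assets, 2005, 2015)]
print(names)
```

Only the two in-range `.csv` entries should survive, which confirms that both boundary seasons are included.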

In [4]:
# Fetch Release Metadata

release = fetch_release(TAG)
assets = release.get("assets", [])

# print statements to keep track and help with debugging
print(f"Total assets in release '{TAG}': {len(assets)}")

# Filter Assets by .csv Extension and Year Range

csv_assets = [a for a in assets if a.get("name", "").endswith(".csv")]

filtered_assets = []
for asset in csv_assets:
    name = asset.get("name", "")
    year = extract_year_from_filename(name)
    if year is not None and START_SEASON <= year <= END_SEASON:
        filtered_assets.append(asset)

print(f"CSV assets detected in seasons {START_SEASON}–{END_SEASON}: {len(filtered_assets)}")
for a in filtered_assets:
    print(" -", a["name"])

# Download Selected Assets

os.makedirs(DOWNLOAD_DIR, exist_ok=True)

for asset in filtered_assets:
    name = asset["name"]
    url = asset["browser_download_url"]
    out_path = os.path.join(DOWNLOAD_DIR, name)

    # Print progress for each file to help with tracking and debugging
    print(f"Downloading {name} -> {out_path}")
    download_asset(url, out_path)
    print(f"Finished {name}\n")

print("All requested downloads complete.")

Total assets in release 'pbp': 160
CSV assets detected in seasons 2005–2015: 11
 - play_by_play_2005.csv
 - play_by_play_2006.csv
 - play_by_play_2007.csv
 - play_by_play_2008.csv
 - play_by_play_2009.csv
 - play_by_play_2010.csv
 - play_by_play_2011.csv
 - play_by_play_2012.csv
 - play_by_play_2013.csv
 - play_by_play_2014.csv
 - play_by_play_2015.csv
Downloading play_by_play_2005.csv -> pbp_csv/play_by_play_2005.csv
Finished play_by_play_2005.csv

Downloading play_by_play_2006.csv -> pbp_csv/play_by_play_2006.csv
Finished play_by_play_2006.csv

Downloading play_by_play_2007.csv -> pbp_csv/play_by_play_2007.csv
Finished play_by_play_2007.csv

Downloading play_by_play_2008.csv -> pbp_csv/play_by_play_2008.csv
Finished play_by_play_2008.csv

Downloading play_by_play_2009.csv -> pbp_csv/play_by_play_2009.csv
Finished play_by_play_2009.csv

Downloading play_by_play_2010.csv -> pbp_csv/play_by_play_2010.csv
Finished play_by_play_2010.csv

Downloading play_by_play_2011.csv -> pbp_csv/play_b

This notebook only needs to be run once, to create the `pbp_csv` folder and fill it with the 11 CSV files (seasons 2005 through 2015).\
Once the folder has been created, you may move on to ***Notebook-2***, where the subsequent data cleaning, manipulation and analysis are done.
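
Before moving on, it can help to confirm that every season was actually saved. The following is a minimal sketch with a hypothetical helper, `missing_seasons`. It assumes the files keep the `play_by_play_<year>.csv` naming seen in the download log.

```python
import os
from typing import List


def missing_seasons(download_dir: str, start: int, end: int) -> List[int]:
    """List the seasons whose CSV file is absent from download_dir.

    Assumption: files are named play_by_play_<year>.csv.
    """
    return [
        year
        for year in range(start, end + 1)
        if not os.path.isfile(
            os.path.join(download_dir, f"play_by_play_{year}.csv")
        )
    ]


# An empty list means all seasons from 2005 to 2015 are present
print(missing_seasons("pbp_csv", 2005, 2015))
```

If the list is not empty, rerunning the download cell for the missing seasons should be enough, since existing files are simply overwritten.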