# Importing Necessary Libraries and Setting Up Environment

This cell sets up the environment and imports the required libraries and functions to begin the scraping process:

- **Libraries**:
  - `BeautifulSoup` (from `bs4`): For parsing HTML content.
  - `Path` (from `pathlib`): For working with file paths.
  - `pandas`: For handling tabular data.
  - `requests`: For making HTTP requests.
  - `sys`: For modifying the Python path.
  - `time`: For measuring elapsed runtime and adding delays between requests.

- **Utility Functions**:
  - The utility functions (`make_request`, `extract_boxscore_links`, and `get_boxscore`) are imported from `utils.py` in the `src/` folder. The `src/` folder is resolved relative to the current working directory and added to the Python path using `Path` and `sys.path`, so the notebook should be run from the project root.

- **Setup Variables**:
  - `year`: Specifies the season year for scraping data.
  - `url_year`: The URL for the NFL schedule page for the specified year.
  - `url_box`: A sample URL for a specific game’s boxscore.
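
The path setup described under Utility Functions can be sketched as a small helper. This is only an illustration: `add_to_sys_path` is a hypothetical name that appears neither in the notebook nor in `utils.py`. It assumes the folder is given relative to the current working directory.

```python
import sys
from pathlib import Path


def add_to_sys_path(folder):
    """Append `folder` (resolved to an absolute path) to sys.path once."""
    path = str(Path(folder).resolve())
    # Guard against duplicates when the cell is re-run in the same kernel
    if path not in sys.path:
        sys.path.append(path)
    return path


src_path = add_to_sys_path("src")
```

Re-running the cell then leaves a single entry for `src/` on the path instead of appending a new one each time.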

This cell ensures that the environment is ready for the scraping workflow by organizing imports and setting initial variables.
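
Because `utils.py` is not shown, what `make_request` does internally is unknown. Since `time` is imported for delays between requests, one plausible shape is a retry loop that pauses between attempts. The sketch below is an assumption for illustration only: `fetch_with_retry` and its parameters are hypothetical, and the `fetch` callable stands in for something like `requests.get`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with a growing pause.

    Hypothetical helper: `fetch` is any callable that takes a URL and
    returns a response, e.g. a thin wrapper around requests.get.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # a real helper would catch narrower errors
            last_error = exc
            if attempt < retries:
                # Wait longer after each failure: delay, 2 * delay, ...
                sleep(delay * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

In the notebook this could be called as `fetch_with_retry(lambda u: requests.get(u, timeout=30), url_year)`, where the 30-second timeout is an arbitrary choice.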

In [None]:
from bs4 import BeautifulSoup
from pathlib import Path
import pandas as pd
import requests
import sys
import time

# Add the src/ folder to the Python path
sys.path.append(str(Path().resolve() / "src"))

from utils import make_request, extract_boxscore_links, get_boxscore

year = 1978
url_year = f"https://www.footballdb.com/games/index.html?lg=NFL&yr={year}"
url_box = "https://www.footballdb.com/games/boxscore/new-york-jets-vs-cleveland-browns-1978121006"

# Scraping NFL Boxscore Data from 1978 to 2023

This cell implements the main scraping workflow to gather NFL boxscore data for each game from 1978 to 2023. The process is as follows:

1. **Setup**:
   - A timer (`start_time`) is initialized to track the total runtime.
   - An empty dictionary (`boxscores_dict`) is created to store raw data by year and week.
   - An empty DataFrame (`boxscores_df`) is initialized to hold the final structured data.

2. **Iterating Over Years**:
   - For each year in the range 1978–2023:
     - The URL for the year’s schedule page is constructed (`url_year`).
     - The schedule page is fetched using the `make_request` utility function.
     - The HTML content is parsed with `BeautifulSoup`.

3. **Extracting Weekly Links**:
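   - The `extract_boxscore_links` function returns the boxscore links found on the schedule page, keyed by week number.
   - Weeks are processed in order starting from week 1. At the first week with no links, the week loop stops and the workflow moves on to the next season.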

4. **Processing Each Game**:
   - For each game in a week:
     - The boxscore URL is fetched and parsed.
     - The game’s data is extracted using the `get_boxscore` utility function.
     - A temporary DataFrame is created to hold the game’s data, including additional columns for the season and week.
     - The temporary DataFrame is concatenated onto the main DataFrame (`boxscores_df`) with `pd.concat`.

5. **Tracking Progress**:
   - After processing all weeks and games in a season, the elapsed time is printed for tracking performance.

6. **Final Output**:
   - The `head()` method is called on `boxscores_df` to display the first few rows of the compiled dataset.
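
Steps 3 and 4 can also be sketched without growing a DataFrame inside the loop: collect one dictionary per game in a list and build the table once at the end. The helper below is a hypothetical illustration, not the notebook's code. It assumes, as the loop in this notebook suggests, that `extract_boxscore_links` returns a dictionary mapping week numbers to lists of boxscore URLs and that `get_boxscore` returns a dictionary. The names `collect_season_rows` and `fetch_boxscore` are made up for the example.

```python
def collect_season_rows(links_by_week, fetch_boxscore, season):
    """Return one dict per game, tagged with its season and week.

    links_by_week: {week_number: [boxscore_url, ...]} (assumed shape)
    fetch_boxscore: callable mapping a URL to a dict of boxscore fields
    """
    rows = []
    # Visit only the weeks that actually have links, in order
    for week in sorted(links_by_week):
        for url in links_by_week[week]:
            row = dict(fetch_boxscore(url))  # copy so the source dict is untouched
            row["season"] = season
            row["week"] = week
            rows.append(row)
    return rows
```

Calling `pd.DataFrame(rows)` on the combined list from all seasons would then build the table in one step, which avoids copying the growing DataFrame on every `pd.concat` call.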

This workflow ensures that all relevant boxscore data is collected, structured, and stored in a DataFrame for analysis.

In [None]:
start_time = time.time()
years = range(1978, 2024)  # 1978 through 2023; the end of range() is exclusive
boxscores_dict = {}
boxscores_df = pd.DataFrame()

for year in years:
    boxscores_dict[year] = {}

    # Fetch and parse the schedule page for this season
    url_year = f"https://www.footballdb.com/games/index.html?lg=NFL&yr={year}"
    response_year = make_request(url_year)
    soup_year = BeautifulSoup(response_year.content, 'html.parser')

    links = extract_boxscore_links(soup_year)
    # Weeks 1-17 only; seasons with an 18th week (e.g. 2021 onward)
    # would need range(1, 19)
    weeks = range(1, 18)
    for week in weeks:
        # Stop at the first week that has no links
        if week not in links:
            break
        boxscores_dict[year][week] = {}

        for game_ind, url_game in enumerate(links[week]):
            response_game = make_request(url_game)
            soup_game = BeautifulSoup(response_game.content, 'html.parser')
            boxscore = get_boxscore(soup_game)
            # Keep the raw boxscore, keyed by year, week, and game index
            boxscores_dict[year][week][game_ind] = boxscore

            game_df = pd.DataFrame([boxscore])
            game_df["season"] = year
            game_df["week"] = week

            boxscores_df = pd.concat([boxscores_df, game_df], ignore_index=True)

    # Report the cumulative runtime after each season
    end_time = time.time()
    print("Season", year)
    print(f"elapsed time: {end_time - start_time:.1f}s")

boxscores_df.head()

# Saving the Boxscore Data to a CSV File

This cell saves the compiled NFL boxscore data stored in the `boxscores_df` DataFrame to a CSV file for further analysis or sharing.

- **File Name**: The file is written to `data/nfl box scores 2.csv`, relative to the current working directory.
- **Index Exclusion**: The `index=False` parameter ensures that the DataFrame index is not included in the CSV file, keeping the output clean and focused on the data.

This step finalizes the scraping workflow by exporting the processed data into a convenient and portable format.
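
`DataFrame.to_csv` does not create missing directories, so the write fails if `data/` does not exist. A minimal sketch of a guard, where `prepare_output_path` is a hypothetical helper name rather than part of the notebook:

```python
from pathlib import Path


def prepare_output_path(file_path):
    """Create the parent directory of `file_path` if needed and return a Path."""
    path = Path(file_path)
    # parents=True builds nested folders; exist_ok=True makes re-runs harmless
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

The export could then be written as `boxscores_df.to_csv(prepare_output_path("data/nfl box scores 2.csv"), index=False)`.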

In [None]:
boxscores_df.to_csv("data/nfl box scores 2.csv", index=False)