In [1]:
import nflreadpy as nfl
import polars as pl

In [2]:
raw_nfl_data = nfl.load_schedules(range(1999, 2025))

## GAME DATA
Read in all data relating to NFL games. For each game played since 1999, get the following:

1. **game_id**: unique identifier for the specific game
2. **season**: year of the season the game belongs to. All games of the 2024 season have a season value of 2024, even if they were played in 2025 (e.g., Eagles vs. Chiefs in the Super Bowl belongs to the 2024 season, even though the game was played in 2025).
3. **week**: week number
4. **gameday**: date of game being played in YYYY-MM-DD format
5. **home_team**: abbreviation for home team
6. **away_team**: abbreviation for away team
7. **stadium**: name of stadium the game was played at
8. **roof**: status of the roof. Most stadiums are either "outdoors" or "dome," but for stadiums with retractable roofs it shows the status of the roof for the given game
9. **is_dome**: true/false for whether the game was played in a dome or under a closed roof
10. **is_soldier_field**: true/false for if game was played at Soldier Field

In [3]:
game_data = (
    raw_nfl_data
    .select([
        "game_id",
        "season",
        "week",
        "gameday",
        "home_team",
        "away_team",
        "stadium",
        "roof",
    ])
    .with_columns([
        # Dome / roof closed flag
        pl.col("roof")
        .cast(pl.Utf8)
        .str.to_lowercase()
        .str.contains("dome|closed")
        .alias("is_dome"),

        # Soldier Field flag
        pl.col("stadium")
        .cast(pl.Utf8)
        .str.to_lowercase()
        .str.contains("soldier", literal=False)
        .alias("is_soldier_field"),
    ])
    .sort(["season", "week", "game_id"])
)

In [10]:
# Sanity check: no column in game_data may contain nulls
for c in game_data.columns:
    assert game_data.select(pl.col(c).null_count()).item() == 0

## WEATHER DATA

This is the weather data for each game played since 1999. Each game will have the following: 

1. **game_id**: unique identifier for the specific game
2. **temp**: temperature at the game in degrees Fahrenheit
3. **wind**: wind speed in mph
4. **precip**: true/false value indicating whether the game had any precipitation

This dataset does not have official precipitation levels. For now, precip is a boolean value, set to true when one of the following keywords is found in the game's weather description: rain, snow, sleet, shower. The weather description comes from a separate dataset (the play-by-play data), which is joined in using the game_id.

In [29]:
weather_data = (
    raw_nfl_data
    .select([
        "game_id",
        "temp",
        "wind"
    ])
)

In [41]:
weather_data.shape

(6991, 5)

In [30]:
raw_pbp_data = nfl.load_pbp(range(1999, 2025))

In [31]:
precip_data = (
    raw_pbp_data
    .select(["game_id", "weather"])
    .unique(subset=["game_id"])
    .with_columns(
        pl.col("weather")
        .cast(pl.Utf8)
        .str.to_lowercase()
        # look for the following precipitation keywords
        .str.contains("rain|snow|sleet|shower")
        .alias("precip")
    )
)

In [42]:
precip_data.shape

(6988, 3)

In [39]:
weather_data.shape

(6991, 5)

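We see that we have 6,991 rows in our weather df and only 6,988 rows in our precip df, so 3 games have no weather description in the play-by-play data.

Now we can join the two dataframes on game_id into a single dataframe with temp, wind, and precip. A left join keeps every game from the weather df, leaving precip null where there is no match.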

In [32]:
weather_data = weather_data.join(precip_data, on="game_id", how="left")

In [43]:
weather_data.select(pl.col("temp").null_count())

temp
u32
1975


In [44]:
weather_data.select(pl.col("wind").null_count())

wind
u32
1975


In [40]:
weather_data.select(pl.col("precip").null_count())

precip
u32
561


The joined data has quite a few nulls: temp and wind are each missing for 1,975 games, and precip for 561. There is no way to accurately fill in these nulls with the data available from these datasets, so we drop the affected rows for now.

In [48]:
weather_data = weather_data.drop_nulls()
weather_data.shape

(4612, 5)

## QB DATA

## QB GAME STATS

## DEFENSE STATS