# Data Establishment: QB Transfer Portal Analysis

**Project:** Do QB Transfers Pay Off? An Analysis of NCAA Quarterback Performance Post-Transfer
**Data Period:** 2021-2024 Seasons  

---

## Data Sources

All data is sourced from the **College Football Data API (CFBD)**, an open-source, comprehensive database of college football statistics.

- **API Documentation:** https://apinext.collegefootballdata.com/
- **Access Method:** RESTful API with authentication via API key
- **Data Producer:** CollegeFootballData.com

## Setup and Imports

In [3]:
import sys
from pathlib import Path
import pandas as pd
import numpy as np

# add project root to path to import modules like src.loaders
ROOT = Path().resolve().parent
sys.path.append(str(ROOT))

from src.loaders import (
    load_transfer_portal_multi,
    load_player_season_stats_multi,
    load_sp_plus_multi
)

# configs
START_YEAR = 2021
END_YEAR = 2024

## Data Acquisition

### How to Get the Data

I get the data through my simple API wrapper (`src/api.py`) and loader functions (`src/loaders.py`). These modules:

1. **Authenticate** with the CFBD API using an API key stored in `.env`
2. **Cache data locally** in `data/raw/` to avoid repeated API calls
3. **Load from cache** if data already exists, otherwise fetch from API
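
The cache-or-fetch behavior described above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of `src/loaders.py`: the function name `load_or_fetch`, the JSON cache format, and the `fetch` callable are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_fetch(cache_path: Path, fetch: Callable[[], list[dict]]) -> list[dict]:
    """Return cached records if the cache file exists; otherwise fetch and cache them.

    `fetch` is a hypothetical zero-argument callable that performs the API request.
    """
    if cache_path.exists():
        # Cache hit: read from disk and skip the API call entirely.
        return json.loads(cache_path.read_text(encoding="utf-8"))

    # Cache miss: call the API once, then persist the result for later runs.
    records = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(records), encoding="utf-8")
    return records
```

With this shape, the first call for a given file (for example, a hypothetical `data/raw/transfers_2021.json`) hits the API and later calls read from disk. Deleting the cached file forces a fresh download.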

**To reproduce my analysis:**
1. Get a free API key from https://collegefootballdata.com/key
2. Create a `.env` file in the project root with: `CFBD_API_KEY=your_key_here`
3. Run this notebook - data will be automatically downloaded and cached
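
For readers who want to check that the key from step 2 is being picked up, here is a small standard-library sketch of reading `CFBD_API_KEY`. It is an assumption about how such a lookup could work, not a description of `src/api.py` (which may well use a package such as `python-dotenv` instead); the helper names `read_env_file` and `get_api_key` are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path: Path = Path(".env")) -> str:
    """Prefer an exported CFBD_API_KEY, then fall back to the .env file."""
    key = os.environ.get("CFBD_API_KEY")
    if not key and env_path.exists():
        key = read_env_file(env_path).get("CFBD_API_KEY")
    if not key:
        raise RuntimeError("CFBD_API_KEY not found; add it to .env in the project root")
    return key
```

Failing loudly when the key is missing makes a misplaced `.env` file easier to diagnose than an authentication error returned later by the API.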

### Dataset 1: Transfer Portal Records

In [7]:
# load transfer portal data for all positions
transfers = load_transfer_portal_multi(start_year=START_YEAR, end_year=END_YEAR)

# print some summary statistics to make sure data loaded correctly
print(f"Columns: {list(transfers.columns)}")
print(f"\nData shape: {transfers.shape}")
print(f"Years covered: {sorted(transfers['season'].unique())}")

# filter for QBs with destinations
qb_transfers = transfers[
    (transfers["position"] == "QB") &
    (transfers["destination"].notna())
].copy()

print(f"\nQB transfers with destination: {len(qb_transfers):,}")

Columns: ['season', 'firstName', 'lastName', 'position', 'origin', 'destination', 'transferDate', 'rating', 'stars', 'eligibility', 'year']

Data shape: (9923, 11)
Years covered: [np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024)]

QB transfers with destination: 547


### Dataset 2: Player Season Statistics

In [9]:
# load player statistics
player_stats = load_player_season_stats_multi(start_year=START_YEAR, end_year=END_YEAR)

print(f"Columns: {list(player_stats.columns)}")
print(f"\nData shape: {player_stats.shape}")
print(f"\nStat categories: {sorted(player_stats['category'].unique())}")

# Show QB stats only
qb_stats = player_stats[player_stats["position"] == "QB"]
print(f"\nQB stat records: {len(qb_stats):,}")

Columns: ['season', 'playerId', 'player', 'position', 'team', 'conference', 'category', 'statType', 'stat', 'year']

Data shape: (513550, 10)

Stat categories: ['defensive', 'fumbles', 'interceptions', 'kickReturns', 'kicking', 'passing', 'puntReturns', 'punting', 'receiving', 'rushing']

QB stat records: 44,144
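
The columns above show that the stats arrive in long format: one row per player, `category`, and `statType`. A sketch of reshaping one passing line into a single wide row is given below. The values are the Michael Penix Jr. 2022 passing figures shown later in this notebook; storing `stat` as strings and the choice of index columns are assumptions made for illustration.

```python
import pandas as pd

# Long-format rows: one record per (player, category, statType).
long_stats = pd.DataFrame({
    "season": [2022] * 7,
    "player": ["Michael Penix Jr."] * 7,
    "team": ["Washington"] * 7,
    "category": ["passing"] * 7,
    "statType": ["ATT", "COMPLETIONS", "INT", "PCT", "TD", "YDS", "YPA"],
    "stat": ["554", "362", "8", "0.653", "31", "4641", "8.4"],  # assumed to be strings
})

# Convert stat values to numbers; anything unparseable becomes NaN.
long_stats["stat"] = pd.to_numeric(long_stats["stat"], errors="coerce")

# Pivot so each statType becomes its own column, one row per player-season-team.
wide = (
    long_stats[long_stats["category"] == "passing"]
    .pivot_table(
        index=["season", "player", "team"],
        columns="statType",
        values="stat",
        aggfunc="first",
    )
    .reset_index()
)
wide.columns.name = None  # drop the leftover "statType" axis label

print(wide)
```

Keeping `team` in the index means a quarterback who appears under two teams in one season would produce two rows rather than being silently merged.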


### Dataset 3: Team Quality Ratings (SP+)

In [10]:
# load SP+ team ratings
sp_plus = load_sp_plus_multi(START_YEAR, END_YEAR)

## COLS Table: Column Descriptions

### Transfer Portal Dataset

| Column | Type | Description | Example |
|--------|------|-------------|--------|
| `season` | int | Year the player entered transfer portal | 2022 |
| `firstName` | str | Player's first name | Michael |
| `lastName` | str | Player's last name | Penix |
| `position` | str | Player's position | QB |
| `origin` | str | School player transferred from | Indiana |
| `destination` | str | School player transferred to | Washington |
| `transferDate` | str | Date entered transfer portal | 2021-12-06 |
| `rating` | float | Recruiting rating (if available) | 0.9543 |
| `stars` | int | Recruiting stars (if available) | 4 |
| `eligibility` | str | Eligibility status | Graduate Transfer |

### Player Season Stats Dataset

| Column | Type | Description | Example |
|--------|------|-------------|--------|
| `season` | int | Season year | 2022 |
| `playerId` | int | Unique player identifier | 4259545 |
| `player` | str | Player full name | Michael Penix Jr. |
| `team` | str | Team name | Washington |
| `conference` | str | Conference name | Pac-12 |
| `category` | str | Stat category | passing, rushing, fumbles |
| `statType` | str | Specific stat | ATT, COMPLETIONS, YDS, TD, INT |
| `stat` | str/float | Stat value | 554 (attempts), 4641 (yards) |

### SP+ Team Ratings Dataset  

| Column | Type | Description | Example |
|--------|------|-------------|--------|
| `year` | int | Season year | 2022 |
| `team` | str | Team name | Washington |
| `conference` | str | Conference name | Pac-12 |
| `rating` | float | Overall SP+ rating (0=avg, higher=better) | 19.8 |
| `ranking` | int | National ranking by SP+ | 12 |
| `offense.rating` | float | Offensive efficiency rating | 35.2 |
| `defense.rating` | float | Defensive efficiency rating (lower=better) | -15.4 |
| `offense.ranking` | int | National offensive ranking | 8 |
| `defense.ranking` | int | National defensive ranking | 42 |
| `specialTeams.rating` | float | Special teams rating | 0.8 |

**Note:** SP+ is a tempo- and opponent-adjusted efficiency metric created by Bill Connelly. Higher overall ratings indicate better teams, and a rating of 0 represents an average FBS team.


## Data Quality Checks

### Missing Values

In [14]:
# check for missing values; count nulls once per table and show only affected columns
player_nulls = player_stats.isnull().sum()
print("Player Stats - Missing Values:")
print(player_nulls[player_nulls > 0])

sp_nulls = sp_plus.isnull().sum()
print("SP+ Ratings - Missing Values:")
print(sp_nulls[sp_nulls > 0])

Player Stats - Missing Values:
Series([], dtype: int64)
SP+ Ratings - Missing Values:
conference                    4
ranking                       4
secondOrderWins             532
sos                         532
offense.ranking               4
offense.success             532
offense.explosiveness       532
offense.rushing             532
offense.passing             532
offense.standardDowns       532
offense.passingDowns        532
offense.runRate             532
offense.pace                532
defense.ranking               4
defense.success             532
defense.explosiveness       532
defense.rushing             532
defense.passing             532
defense.standardDowns       532
defense.passingDowns        532
defense.havoc.total         532
defense.havoc.frontSeven    532
defense.havoc.db            532
specialTeams.rating         131
dtype: int64


It is clear that while many columns would carry useful information, a lot of them are completely empty. The overall, offensive, and defensive ratings are the most useful columns with nearly complete data, so these are what I will focus on in my analysis of team quality.
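
One way to make this column selection explicit is to compute the share of missing values per column and keep only the columns below a cutoff. The sketch below uses a toy frame rather than the real SP+ data, and the 5% threshold is an arbitrary assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SP+ table: one fully empty column, one partially empty.
toy = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "rating": [10.0, 5.0, -2.0, 0.5],
    "offense.rating": [30.0, 28.0, 25.0, 27.0],
    "sos": [np.nan] * 4,
    "specialTeams.rating": [0.5, np.nan, 0.1, 0.2],
})

MAX_MISSING_SHARE = 0.05  # hypothetical cutoff, not taken from the analysis

# Fraction of null values in each column (0.0 = complete, 1.0 = empty).
missing_share = toy.isnull().mean()

# Keep only the columns whose missing share is at or below the cutoff.
usable_cols = missing_share[missing_share <= MAX_MISSING_SHARE].index.tolist()

print(missing_share)
print(usable_cols)
```

Working with shares instead of raw counts makes the cutoff comparable across tables with different numbers of rows.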

### Sample Data Inspection

In [15]:
print("Sample QB Transfers:")
display(qb_transfers[["season", "firstName", "lastName", "origin", "destination"]].head(10))

print("\nSample Player Stats (Michael Penix Jr. - 2022):")
penix_stats = player_stats[
    (player_stats["player"] == "Michael Penix Jr.") &
    (player_stats["season"] == 2022) &
    (player_stats["category"] == "passing")
]
display(penix_stats[["player", "team", "statType", "stat"]].head(10))

print("\nSample SP+ Ratings (2022 Top 5):")
sp_2022 = sp_plus[sp_plus["year"] == 2022].sort_values("rating", ascending=False)
display(sp_2022[["team", "conference", "rating", "ranking", "offense.rating", "defense.rating"]].head())

Sample QB Transfers:


| | season | firstName | lastName | origin | destination |
|-----|--------|-----------|----------|--------|-------------|
| 43 | 2021 | Jalen | Hamler | Cal Poly | San José State |
| 62 | 2021 | Ryan | Kelley | Arizona State | Eastern Washington |
| 95 | 2021 | Sam | Noyer | Colorado | Oregon State |
| 111 | 2021 | Theo | Day | Michigan State | Northern Iowa |
| 113 | 2021 | Christian | Gelov | TCU | Purdue |
| 128 | 2021 | John | Bledsoe | Washington State | San Diego |
| 159 | 2021 | Jaylen | Gipson | Texas State | North Alabama |
| 181 | 2021 | Suddin | Sapien | UTSA | Texas-Permian Basin |
| 229 | 2021 | Wilson | Long | TCU | Vanderbilt |
| 251 | 2021 | Allan | Walters | Mississippi State | Arkansas State |



Sample Player Stats (Michael Penix Jr. - 2022):


| | player | team | statType | stat |
|--------|--------|------|----------|------|
| 121523 | Michael Penix Jr. | Washington | ATT | 554.0 |
| 121524 | Michael Penix Jr. | Washington | COMPLETIONS | 362.0 |
| 121525 | Michael Penix Jr. | Washington | INT | 8.0 |
| 121526 | Michael Penix Jr. | Washington | PCT | 0.653 |
| 121527 | Michael Penix Jr. | Washington | TD | 31.0 |
| 121528 | Michael Penix Jr. | Washington | YDS | 4641.0 |
| 121529 | Michael Penix Jr. | Washington | YPA | 8.4 |



Sample SP+ Ratings (2022 Top 5):


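| | team | conference | rating | ranking | offense.rating | defense.rating |
|-----|------|------------|--------|---------|----------------|----------------|
| 131 | Georgia | SEC | 35.3 | 1.0 | 37.2 | 3.7 |
| 132 | Michigan | Big Ten | 32.0 | 2.0 | 38.0 | 7.6 |
| 133 | Ohio State | Big Ten | 31.1 | 3.0 | 45.1 | 15.7 |
| 134 | Alabama | SEC | 30.3 | 4.0 | 42.9 | 14.2 |
| 135 | Tennessee | SEC | 25.2 | 5.0 | 47.3 | 22.6 |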
