
FOOTBALL dataset


This repository contains scripts to analyze and process the FOOTBALL dataset, which is built from transcripts of 1,455 full game broadcasts from the U.S. National Football League (NFL) and National Collegiate Athletic Association (NCAA) recorded between 1960 and 2019. To start, please download the dataset at the following link and extract the zip file into this directory:

The dataset and experiments are described in full in our EMNLP 2019 paper, Investigating Sports Commentator Bias within a Large Corpus of American Football Broadcasts.

Dataset contents

FOOTBALL contains 1,455 games in total (601 NFL, 854 NCAA), whose transcripts amount to 27,144,587 tokens (tokenized with spaCy). Within these transcripts, we identify 545,232 mentions of players labeled with their position and name, of which 267,778 are also tagged for race (white, nonwhite). A total of 23,313 unique football players appear in the dataset, 4,604 of whom we were able to label with race information. In addition to the player mention context dataset, we include the raw transcripts obtained from YouTube, as well as team roster data split by league.

Each entry in our player mention dataset has a label and a mention. The label stores information about the player mentioned, including canonical name, race, reference name (i.e., how the commentators referred to the player), the teams playing in the game, and the year the mention is from. The mention contains the tokens in a window of k tokens on each side of the reference, with the reference itself left out; for example, given the text "this is a guy Jesse James does nothing but work", the corresponding window with k=4 would be: ['this', 'is', 'a', 'guy', 'does', 'nothing', 'but', 'work']. We provide files with multiple window sizes for convenience as football-k.json. If players other than the one in the label field appear in the window, any references to them are replaced with a special <player> token.
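The windowing described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the dataset: the function name `mention_window`, its parameters, whitespace tokenization (the dataset itself is tokenized with spaCy), and the choice to replace each token of another player's reference with its own `<player>` token are all assumptions made for the example.

```python
from typing import AbstractSet, List

PLAYER_TOKEN = "<player>"  # special token named in this README


def mention_window(
    tokens: List[str],
    start: int,
    end: int,
    k: int,
    other_player_indices: AbstractSet[int] = frozenset(),
) -> List[str]:
    """Return up to k tokens on each side of the reference tokens[start:end].

    The reference itself is excluded. Token positions listed in
    other_player_indices (hypothetical input: positions that belong to
    references to other players) are replaced with PLAYER_TOKEN.
    """
    left_lo = max(0, start - k)            # clip at the start of the text
    right_hi = min(len(tokens), end + k)   # clip at the end of the text

    def mask(i: int) -> str:
        return PLAYER_TOKEN if i in other_player_indices else tokens[i]

    left = [mask(i) for i in range(left_lo, start)]
    right = [mask(i) for i in range(end, right_hi)]
    return left + right


# Example from the text above; whitespace splitting stands in for spaCy.
tokens = "this is a guy Jesse James does nothing but work".split()
window = mention_window(tokens, start=4, end=6, k=4)
print(window)
```

Here `start=4` and `end=6` mark the span covering "Jesse James", so the call reproduces the k=4 window shown above. Near the beginning or end of a transcript the window is simply shorter than 2k tokens in this sketch; how the released files handle that case is not stated here.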

Code contents

Please see the analysis subdirectory for scripts and instructions on how to replicate our experiments. An accompanying script provides example code for finding mentions in the tagged transcripts for future experiments (e.g., adjusting the window size).


If you use this dataset or code for your research, please cite:

  Author = {Jack Merullo and Luke Yeh and Abram Handler and Alvin {Grissom II} and Brendan O'Connor and Mohit Iyyer},
  Booktitle = {Empirical Methods in Natural Language Processing},
  Year = "2019",
  Title = {Investigating Sports Commentator Bias within a Large Corpus of American Football Broadcasts}