Closed Captions of News Videos from the Internet Archive

The repository provides scripts for downloading the data, and links to two datasets that were built using the scripts:

Downloading the Data from the Internet Archive

Download closed caption transcripts of nearly 1.3M news shows from the Internet Archive's TV News Archive.

There are three steps to downloading the transcripts:

  1. We start by searching for items in the collection collection:"tvarchive". This gets us a unique identifier for each news show. An identifier is a simple string that combines the channel name, show name, date, and time. The current final list of identifiers (2009--Nov. 2017) is posted here.

  2. Next, we use the identifier to build the URL where the metadata file and the HTML file with the closed captions are posted: the base URL followed by the identifier.

  3. The third script parses the downloaded metadata and closed-caption HTML files and creates a CSV of the caption text along with the metadata.
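The identifier search in step 1 can be sketched as follows. The advancedsearch.php endpoint and parameter names below follow archive.org's public search API; the page size and the function name are illustrative, not the repository's actual code.

```python
# Sketch of step 1: build the URL for one page of search results from the
# Internet Archive's advanced-search endpoint, asking only for the
# "identifier" field of items in the tvarchive collection.
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"

def build_search_url(page: int, rows: int = 100) -> str:
    """Return the URL for one page of tvarchive identifiers (JSON output)."""
    params = {
        "q": 'collection:"tvarchive"',
        "fl[]": "identifier",  # return only the identifier field
        "rows": rows,
        "page": page,
        "output": "json",
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)
```

Fetching each page (for instance with urllib.request.urlopen) and appending the returned identifiers to a CSV, until a page comes back empty, yields the full list of show identifiers.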

For instance, for the identifier CSPAN_20090604_230000, we read the link to the closed-caption HTML file, from which we extract the text. We also store the metadata from the META XML file.
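Step 2 can be sketched as below. The {identifier}_meta.xml naming follows archive.org's item download layout; the field order (channel, date, time, optional show name) is inferred from the example identifier above, so treat it as an assumption.

```python
# Sketch of step 2: split an identifier such as CSPAN_20090604_230000 into
# its parts and build the URL of its metadata (META XML) file.

def parse_identifier(identifier: str) -> dict:
    parts = identifier.split("_")
    return {
        "channel": parts[0],
        "date": parts[1],             # YYYYMMDD
        "time": parts[2],             # HHMMSS
        "show": "_".join(parts[3:]),  # empty when no show name is encoded
    }

def meta_url(identifier: str) -> str:
    # archive.org serves an item's metadata at download/{id}/{id}_meta.xml
    return f"https://archive.org/download/{identifier}/{identifier}_meta.xml"
```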


  1. Get Show Identifiers

  2. Download Metadata and HTML Files

    • Downloads the metadata and HTML files for each identifier.
    • Saves the metadata and HTML files to two separate folders specified via --meta and --html respectively. The default folder names are meta and html.
  3. Parse Metadata and HTML Files

Running the Scripts

  1. Get all TV Archive identifiers from the Internet Archive

    python -o ../data/search.csv
  2. Download metadata and HTML files for all the shows in the sample input file

    python ../data/search-test.csv

    This will create two directories, meta and html, by default in the same folder as the script. We have included the first 25 metadata and first 25 HTML files.

    You can change the folder for metadata using the --meta flag. To change the directory for HTML files, use the --html flag and specify the new directory. For instance,

    python --meta meta-foxnews --html html-foxnews ../data/search-test.csv

    Use the -c/--compress option to store and parse the downloaded files in compressed (gzip) format.

  3. Parse and extract meta fields and text from sample metadata and HTML files.

    python ../data/search-test.csv

    A sample output file.
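The text-extraction part of the parse step can be sketched with only the standard library, as below. The repository's actual parser may differ in how it selects and joins text nodes; this is a minimal illustration of stripping markup from a downloaded closed-caption HTML file.

```python
# Sketch of step 3: recover caption text from a closed-caption HTML file by
# collecting all text nodes and joining them with single spaces.
from html.parser import HTMLParser

class CaptionTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

    def text(self) -> str:
        return " ".join(self.chunks)

def extract_caption_text(html: str) -> str:
    parser = CaptionTextExtractor()
    parser.feed(html)
    return parser.text()
```

The extracted text can then be written to the output CSV alongside the fields read from the META XML file, for example with csv.DictWriter.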


The data are hosted on Google Cloud Storage. It is set up so that the requester pays. To learn more about how to download the files from Google Cloud Storage, click here.
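For a requester-pays bucket, downloads with gsutil look roughly like the following. The bucket and object names here are placeholders, not the datasets' actual paths; replace YOUR_PROJECT_ID with a Google Cloud project you can bill.

```shell
# List and copy objects from a requester-pays bucket; -u names the
# project billed for the download. BUCKET_NAME and FILE are placeholders.
gsutil -u YOUR_PROJECT_ID ls gs://BUCKET_NAME/
gsutil -u YOUR_PROJECT_ID cp gs://BUCKET_NAME/FILE .
```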

There are two separate datasets: one with ~500k news transcripts and another with ~860k.

500k Dataset from 2014

860k Dataset from 2017


We are releasing the scripts under the MIT License.

Suggested Citation

Please credit Internet Archive for the data.

If you want to refer to this particular corpus so that the research is reproducible, you can cite it as: Laohaprapanon, Suriyan, and Gaurav Sood. 2017. TV News Closed Caption Corpus.

