Closed Captions of News Videos from Archive.org
The repository provides scripts for downloading the data, and links to two datasets that were built using the scripts.
Downloading the Data from Archive.org
Download closed caption transcripts of nearly 1.3M news shows from http://archive.org.
There are three steps to downloading the transcripts:
We start by searching https://archive.org/advancedsearch.php with the query collection:"tvarchive". This gets us unique identifiers for each of the news shows. An identifier is a simple string that combines channel_name, show_name, time, and date. The current final list of identifiers (2009 to Nov. 2017) is posted here.
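The search step can be sketched as follows. This is a minimal illustration and not the repository's script: it only builds the query URL for one page of results. The page size, the JSON output format, and the function name build_search_url are assumptions made for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def build_search_url(page=1, rows=1000):
    """Return an advancedsearch URL asking for one page of identifiers.

    The rows/page values and output=json are illustrative choices.
    """
    params = [
        ("q", 'collection:"tvarchive"'),  # the collection named in the text
        ("fl[]", "identifier"),           # only return the identifier field
        ("rows", rows),                   # results per page (assumed value)
        ("page", page),                   # 1-based page number
        ("output", "json"),               # assumed output format
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url(page=1, rows=100)
print(url)
```

Fetching every identifier would mean requesting successive pages until a page comes back empty.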
Next, we use the identifier to build a URL where the metadata file and the HTML file with the closed captions are posted. The general base URL is http://archive.org/download followed by the identifier.
The third script parses the downloaded metadata and HTML closed caption files and creates a CSV that holds the caption text along with the metadata.
For instance, for the identifier CSPAN_20090604_230000, we go to http://archive.org/download/CSPAN_20090604_230000. From http://archive.org/download/CSPAN_20090604_230000/CSPAN_20090604_230000_meta.xml, we read the link http://archive.org/details/CSPAN_20090604_230000, from which we get the text from the HTML file. We also store the metadata from the meta XML file.
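The URL pattern in this example can be written down as a small helper. This is a sketch: the function names are hypothetical, and split_identifier assumes the layout CHANNEL_YYYYMMDD_HHMMSS with an optional trailing show name, which matches the example identifier but may not hold for every show.

```python
BASE_URL = "http://archive.org"


def archive_urls(identifier):
    """Return the download, metadata, and details URLs for one identifier."""
    return {
        "download": f"{BASE_URL}/download/{identifier}",
        "meta": f"{BASE_URL}/download/{identifier}/{identifier}_meta.xml",
        "details": f"{BASE_URL}/details/{identifier}",
    }


def split_identifier(identifier):
    """Split an identifier into its parts (assumed layout, see lead-in).

    Channel names that themselves contain underscores would break this.
    """
    parts = identifier.split("_", 3)
    channel, date, time = parts[:3]
    show = parts[3] if len(parts) > 3 else ""
    return {"channel": channel, "date": date, "time": time, "show": show}


urls = archive_urls("CSPAN_20090604_230000")
print(urls["meta"])
print(split_identifier("CSPAN_20090604_230000"))
```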
Get Show Identifiers
Download Metadata and HTML Files
- Downloads the metadata and HTML files
- Saves the metadata and HTML files to two separate folders specified in --meta and --html respectively. The default folder names are meta and html.
Parse Metadata and HTML Files
Running the Scripts
Get all TV Archive identifiers from archive.org.
python get_news_identifiers.py -o ../data/search.csv
Download metadata and HTML files for all the shows in the sample input file
python scrape_archive_org.py ../data/search-test.csv
You can change the folder for meta by using the --meta flag. To change the directory for html, use the --html flag and specify the new directory. For instance,
python scrape_archive_org.py --meta meta-foxnews --html html-foxnews ../data/search-test.csv
Use the -c/--compress option to store and parse the downloaded files in compressed (GZip) format.
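For readers who want to see how flags like these fit together, here is a hypothetical argparse declaration. It is not the parser used in scrape_archive_org.py; the default folder names and help strings are assumptions.

```python
import argparse


def build_parser():
    """Build an illustrative parser mirroring the flags described above."""
    parser = argparse.ArgumentParser(
        description="Download metadata and HTML files (illustrative sketch)"
    )
    parser.add_argument("input", help="CSV file of show identifiers")
    # Default folder names are assumed for this sketch.
    parser.add_argument("--meta", default="meta", help="folder for metadata files")
    parser.add_argument("--html", default="html", help="folder for HTML files")
    parser.add_argument(
        "-c", "--compress", action="store_true",
        help="store downloaded files in GZip format",
    )
    return parser


args = build_parser().parse_args(
    ["--meta", "meta-foxnews", "--html", "html-foxnews", "../data/search-test.csv"]
)
print(args.meta, args.html, args.compress, args.input)
```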
Parse the downloaded metadata and HTML files for the shows in the sample input file
python parse_archive.py ../data/search-test.csv
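The parsing step can be illustrated roughly as below. This is a sketch under stated assumptions, not the logic of parse_archive.py: it flattens the child elements of a metadata XML document into one CSV row and accepts GZip-compressed input. The sample XML is made up, and extracting caption text from the HTML file is left out because the page layout is not described here.

```python
import csv
import gzip
import io
import xml.etree.ElementTree as ET


def decode_payload(data):
    """Return text from raw bytes, decompressing if they look like GZip."""
    if data[:2] == b"\x1f\x8b":  # GZip magic number
        data = gzip.decompress(data)
    return data.decode("utf-8")


def meta_to_row(xml_text):
    """Flatten the top-level child elements of a metadata document into a dict."""
    root = ET.fromstring(xml_text)
    row = {}
    for child in root:
        text = (child.text or "").strip()
        # Repeated tags are joined with a semicolon (an arbitrary choice).
        row[child.tag] = f"{row[child.tag]};{text}" if child.tag in row else text
    return row


# Made-up sample standing in for a downloaded *_meta.xml file.
SAMPLE = """<metadata>
  <identifier>CSPAN_20090604_230000</identifier>
  <subject>a</subject>
  <subject>b</subject>
</metadata>"""

compressed = gzip.compress(SAMPLE.encode("utf-8"))
row = meta_to_row(decode_payload(compressed))

# Write the single row to an in-memory CSV to show the output shape.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(out.getvalue())
```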
The data are hosted on Google Cloud Storage. The bucket is set up so that the requester pays. To learn more about how to download the files from Google Cloud Storage, click here.
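With requester-pays buckets, the download has to name a billing project. One way to do that is gsutil's -u option. The sketch below only assembles the command. The project ID, bucket, and file name are placeholders, not the actual locations of the datasets.

```python
import shlex


def requester_pays_copy_command(project_id, source, destination):
    """Build a gsutil copy command that bills the given project."""
    return ["gsutil", "-u", project_id, "cp", source, destination]


# All three arguments below are placeholders for illustration.
cmd = requester_pays_copy_command(
    "my-billing-project",
    "gs://example-bucket/example-file.csv.gz",
    ".",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where gsutil is installed and authenticated.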
There are two separate datasets, one with about 500k and another with about 860k news transcripts.
500k Dataset from 2014
860k Dataset from 2017
We are releasing the scripts under the MIT License.
Please credit Internet Archive for the data.
If you want to refer to this particular corpus so that your research is reproducible, you can cite it as:
archive.org TV News Closed Caption Corpus. Laohaprapanon, Suriyan and Gaurav Sood. 2017. https://github.com/notnews/archive_news_cc/