notnews/cnn_transcripts
CNN Transcripts

CNN provides transcripts for its shows at http://edition.cnn.com/TRANSCRIPTS/.

The transcripts are listed for shows starting 1999/10/01. See http://edition.cnn.com/TRANSCRIPTS/1999.10.01.html. However, the transcript links for dates through 1999/12/31 return a 'Page not found' error, so we started scraping the data from 2000/01/01.
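The example URL above suggests that each day's index page is named by date in `YYYY.MM.DD.html` form. A minimal sketch of generating those index URLs for a date range follows. It assumes that pattern holds for every day, and `index_url` and `daily_index_urls` are hypothetical helper names, not functions from this repository's scrapers.

```python
from datetime import date, timedelta

# Base URL taken from the README; the per-day file name pattern is inferred
# from the single example http://edition.cnn.com/TRANSCRIPTS/1999.10.01.html
BASE_URL = "http://edition.cnn.com/TRANSCRIPTS/"


def index_url(day: date) -> str:
    """Build the (assumed) daily index URL for one date."""
    return f"{BASE_URL}{day:%Y.%m.%d}.html"


def daily_index_urls(start: date, end: date):
    """Yield one index URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield index_url(day)
        day += timedelta(days=1)


if __name__ == "__main__":
    # Print the first three index pages of the scraped range.
    for url in daily_index_urls(date(2000, 1, 1), date(2000, 1, 3)):
        print(url)
```

This only builds URLs; fetching the pages and following the per-show links is left to the scrapers.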

CNN went through a few HTML styles for its news transcripts between 2000/01/01 and 2014, so there are two scrapers to parse the different HTML styles.

Data

The parsed data are posted at http://dx.doi.org/10.7910/DVN/ISDPJU. For copyright reasons, access is restricted to research purposes only. The data are split into 7 files:

  • cnn-1.csv. Data from 2000/01/01--2000/04/20. No. of transcripts = 7017
  • cnn-2.csv. Data from 2000/04/21--2001/04/03. No. of transcripts = 21381
  • cnn-3.csv. Data from 2001/04/04--2002/08/06. No. of transcripts = 35269
  • cnn-4.csv. Data from 2002/08/07--2002/09/16. No. of transcripts = 2343
  • cnn-5.csv. Data from 2002/09/17--2012/05/18. No. of transcripts = 101336
  • cnn-6.csv. Data from 2012/05/19--2014/06/17. No. of transcripts = 23536
  • cnn-7.csv. Data from 2014/06/18--2022/02/05. No. of transcripts = 102458

Total number of transcripts: 293,340
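Because the files are split by date, it can help to look up which file covers a given day. The sketch below copies the date ranges and counts from the list above into a small table; `file_for` is a hypothetical helper, not part of the repository, and nothing here reads the CSV files themselves.

```python
from bisect import bisect_right
from datetime import date

# (file name, first date, last date, number of transcripts), copied from the README.
FILES = [
    ("cnn-1.csv", date(2000, 1, 1), date(2000, 4, 20), 7017),
    ("cnn-2.csv", date(2000, 4, 21), date(2001, 4, 3), 21381),
    ("cnn-3.csv", date(2001, 4, 4), date(2002, 8, 6), 35269),
    ("cnn-4.csv", date(2002, 8, 7), date(2002, 9, 16), 2343),
    ("cnn-5.csv", date(2002, 9, 17), date(2012, 5, 18), 101336),
    ("cnn-6.csv", date(2012, 5, 19), date(2014, 6, 17), 23536),
    ("cnn-7.csv", date(2014, 6, 18), date(2022, 2, 5), 102458),
]

# Sorted start dates, used for binary search.
STARTS = [start for _, start, _, _ in FILES]


def file_for(day: date) -> str:
    """Return the name of the file whose date range contains `day`."""
    i = bisect_right(STARTS, day) - 1
    if i < 0 or day > FILES[i][2]:
        raise ValueError(f"{day} is outside the covered range")
    return FILES[i][0]


# Sum of the per-file counts; should agree with the stated total.
total = sum(n for *_, n in FILES)

if __name__ == "__main__":
    print(file_for(date(2001, 9, 11)))
    print(total)
```

The per-file counts add up to the stated total of 293,340 only when all seven files are included.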

Notes

  • 2000-04-21 New format error
  • 2000-04-22 content within … and … tag
  • 2001-04-04 No URL prefix, subheader ==> h4, content next table tag
  • Scripts from 2014

About

CNN Transcripts: 2000--2021
