
PBS Newshour - Data Collection and Analysis

Author: Patrick Stetz (github)


All the data collected can be found on Kaggle (link).


PBS Newshour is an American daily news program currently hosted by Judy Woodruff on weekdays and Hari Sreenivasan on weekends.

Below we can see how the 17,617 clips/transcripts are distributed over time.

Transcripts over time

The number of transcripts and clips available has increased significantly since 2011, which coincides with the year that Jim Lehrer retired as anchor.

Data Collection

Code can be found under scraping/.

Web Scraping

Web scraping is done with Beautiful Soup, a Python library for parsing HTML.
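As a rough illustration of what parsing a page with Beautiful Soup looks like, here is a minimal sketch. The sample markup, the `card__title` class name, and the `extract_links` helper are all hypothetical; the real PBS Newshour pages and the code under scraping/ will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical listing-page markup; the real PBS Newshour HTML will differ.
SAMPLE_HTML = """
<html><body>
  <article><a class="card__title" href="/show/example-one">Example one</a></article>
  <article><a class="card__title" href="/show/example-two">Example two</a></article>
  <a href="/about">About</a>
</body></html>
"""


def extract_links(html, css_class="card__title"):
    """Return the href of every <a> tag carrying the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", class_=css_class)]


links = extract_links(SAMPLE_HTML)
print(links)
```

Filtering on a class keeps navigation links such as `/about` out of the result.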


Significant time can be saved by using another Python library, multiprocessing, to scrape pages in parallel.
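The sketch below shows one way the work could be spread across workers. It is an assumption-laden example, not the repository's code: `fetch_transcript` is a placeholder that does no network access, the paths are made up, and it uses `multiprocessing.pool.ThreadPool`, the thread-backed pool from the same library, because scraping is mostly waiting on I/O. The source does not say whether processes or threads were used.

```python
from multiprocessing.pool import ThreadPool


def fetch_transcript(url):
    """Placeholder for downloading and parsing one page (no network here)."""
    return {"url": url, "n_chars": len(url)}


def scrape_all(urls, workers=4):
    """Apply fetch_transcript to every URL using a pool of workers."""
    with ThreadPool(workers) as pool:
        # map() blocks until all tasks finish and preserves input order.
        return pool.map(fetch_transcript, urls)


# Hypothetical paths standing in for real transcript URLs.
urls = [f"/show/example-{i}" for i in range(8)]
results = scrape_all(urls)
print(len(results), "pages processed")
```

Swapping in `multiprocessing.Pool` gives process-based parallelism with the same `map` call, but then the pool should be created under an `if __name__ == "__main__":` guard.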


Code can be found under preprocessing/.

The preprocessing includes formatting the datetime, speakers, and transcripts. Speakers were processed to remove titles, qualifiers, and typos so that each person maps to exactly one name (a one-to-one relationship between people and names).
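A minimal sketch of this kind of speaker cleanup follows. The title list, the qualifier patterns, the typo table, and the choice to upper-case names are all assumptions made for illustration; the actual rules live under preprocessing/.

```python
import re

# Hypothetical title prefixes; the real list is not given in the source.
TITLES = re.compile(
    r"^(?:sen|rep|gov|dr|mr|mrs|ms|president|secretary)\.?\s+", re.IGNORECASE
)
# Qualifiers: parenthesised notes, or anything after a comma.
QUALIFIER = re.compile(r"\s*\(.*?\)\s*|,.*$")
# Hypothetical typo table mapping misspellings to canonical names.
KNOWN_TYPOS = {"JUDY WOODRUF": "JUDY WOODRUFF"}


def normalize_speaker(raw):
    """Map a raw speaker label to a single canonical upper-case name."""
    name = raw.strip().rstrip(":").upper()
    name = QUALIFIER.sub(" ", name).strip()
    previous = None
    while previous != name:  # strip stacked titles one at a time
        previous = name
        name = TITLES.sub("", name)
    name = re.sub(r"\s+", " ", name)
    return KNOWN_TYPOS.get(name, name)


print(normalize_speaker("Sen. Jane Doe (D-XX):"))
print(normalize_speaker("Judy Woodruf"))
```

After normalization, every variant of a label collapses to the same key, which is what makes per-speaker counts reliable.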

Transcripts needed extra processing as a result of a formatting quirk in PBS Newshour transcripts. Transcripts don't identify a speaker directly. Instead, speakers are marked with bold text, but bold text is also used for emphasis. Fixing this required a very tedious step of manually differentiating bold text used for emphasis from bold text that marks a speaker.
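The source describes this step as manual. A heuristic like the one sketched below could at least shortlist the ambiguous cases: it treats bold text as a speaker only when it opens a paragraph and ends with a colon, and sets everything else aside for review. The markup, the placeholder speaker labels, and the rule itself are assumptions, not the author's method.

```python
from bs4 import BeautifulSoup

# Hypothetical transcript markup with placeholder speaker labels.
SAMPLE_HTML = """
<p><strong>FIRST SPEAKER:</strong> Opening remarks.</p>
<p><strong>SECOND SPEAKER:</strong> That was <strong>not</strong> expected.</p>
"""


def classify_bold(html):
    """Split bold text into likely speakers and items needing manual review."""
    soup = BeautifulSoup(html, "html.parser")
    speakers, needs_review = [], []
    for paragraph in soup.find_all("p"):
        for bold in paragraph.find_all(["strong", "b"]):
            text = bold.get_text(strip=True)
            # Assumed rule: a speaker label is the first node in its paragraph.
            opens_paragraph = bool(paragraph.contents) and paragraph.contents[0] is bold
            if opens_paragraph and text.endswith(":"):
                speakers.append(text.rstrip(":"))
            else:
                needs_review.append(text)
    return speakers, needs_review


speakers, needs_review = classify_bold(SAMPLE_HTML)
print(speakers, needs_review)
```

Anything in `needs_review` would still have to be checked by hand, since emphasis can also open a paragraph or end in a colon.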


Code can be found in the notebook analysis.ipynb.

Data Overview

Sentiment Analysis

I'd like to explore text sentiment in various ways. It is somewhat tricky to gauge positivity in text.

I decided to train a model on positive and negative movie reviews, then use this model to gauge the sentiment of political text.

The model is a Light Gradient Boosting model trained on lemmatized words. The vocabulary is restricted to words that are not too common (appear in fewer than 60% of samples) and not too rare (appear in at least 20 samples). The model generalizes very nicely to test data, with an 84.6% classification rate. Below we can see how well the model separates the test data into positive and negative categories.
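The word-selection rule can be sketched in plain Python as a document-frequency filter. This is an illustration under assumptions: `select_vocabulary` and the toy documents are hypothetical, and the notebook may implement the filter differently (for instance, scikit-learn's `CountVectorizer` exposes comparable `min_df` and `max_df` parameters).

```python
from collections import Counter


def select_vocabulary(docs, min_docs=20, max_doc_frac=0.6):
    """Keep words appearing in at least min_docs documents
    and in fewer than max_doc_frac of all documents.

    docs: list of token lists, assumed to be lemmatized already.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(docs)
    return sorted(
        word
        for word, df in doc_freq.items()
        if df >= min_docs and df / n_docs < max_doc_frac
    )


# Toy corpus; thresholds are lowered so the filter has a visible effect.
toy_docs = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "plot", "movie"],
    ["bad", "plot", "movie"],
    ["dull"],
]
vocab = select_vocabulary(toy_docs, min_docs=2, max_doc_frac=0.6)
print(vocab)
```

Here "movie" is dropped as too common (4 of 5 documents) and "dull" as too rare (1 document), leaving the words most likely to carry a sentiment signal.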

The hope is that this model extends well to PBS transcript data.


Scrapes PBS Newshour for every transcript. Includes some analysis.



