A script to get news headlines from Wikipedia's current events portal from specific start and end dates.

justinslud/analyze-headlines

scrape-wikipedia-current-events

Please see the general info for this project on the Projects page

I split this project into 5 parts. The links are to the scripts and notebooks within this repository.

  1. Collecting the data
  2. Exploring and understanding the data
  3. Machine learning modelling
  4. Building a Flask API
  5. Making an interactive Streamlit app

You can access the Streamlit app here by clicking 'Wikipedia Current Events Analysis' on the sidebar.

Here is a summary, with notes, of how I carried out the project:

1. Collecting the data

The Wikipedia current events portal has changed its HTML structure over the years. Although I did not manage to scrape every headline, I still collected over 50,000 headlines from 1995 to 2017.
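As a rough illustration of scraping between a start and an end date, here is a minimal sketch that uses only the standard library. It is not the repository's script: the URL pattern in `BASE_URL`, the helper names `daily_urls` and `extract_items`, and the idea that each headline sits in a list item are all assumptions, and older pages may need different parsing because the portal's HTML has changed over time.

```python
from datetime import date, timedelta
from html.parser import HTMLParser

# Assumed URL pattern for daily pages, e.g. .../2015_January_31
BASE_URL = "https://en.wikipedia.org/wiki/Portal:Current_events/"
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def daily_urls(start, end):
    """Yield (day, url) for every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day, f"{BASE_URL}{day.year}_{MONTHS[day.month - 1]}_{day.day}"
        day += timedelta(days=1)


class ListItemText(HTMLParser):
    """Collect the text of each outermost <li>; nested items are merged in."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self._depth == 0:
                self._buffer = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace left behind by inline tags.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.items.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_items(html):
    """Return the list-item texts found in an HTML string."""
    parser = ListItemText()
    parser.feed(html)
    parser.close()
    return parser.items
```

Fetching each URL (for example with `urllib.request`) is left out on purpose so the sketch stays runnable offline; `extract_items` only needs the page's HTML as a string.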


2. Exploring and understanding the data

I tried to plot how often a term appears in headlines over the years, but this is not a good measure of how popular a term actually was. Wikipedia's coverage is heavily biased toward specific people and certain types of events. For a real trend plot I would use a dataset from a media website, although those are also biased toward certain people and events.
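To make the term counting concrete, here is a small sketch. The function name `term_by_year`, the `(year, headline)` input format, and the sample headlines are made up for illustration. Reporting the share of headlines alongside the raw count is my own addition here, since the number of scraped headlines differs a lot from year to year.

```python
from collections import Counter


def term_by_year(headlines, term):
    """Map each year to (matching headline count, share of that year's headlines)."""
    totals, hits = Counter(), Counter()
    term = term.lower()
    for year, text in headlines:
        totals[year] += 1
        if term in text.lower():  # simple case-insensitive substring match
            hits[year] += 1
    return {year: (hits[year], hits[year] / totals[year]) for year in sorted(totals)}


# Made-up sample data, not taken from the scraped dataset.
sample = [
    (1995, "Election held in Country A"),
    (1995, "Earthquake strikes region"),
    (2015, "Election results announced"),
    (2015, "Election dispute continues"),
    (2015, "Storm hits coast"),
    (2015, "New treaty signed"),
]
result = term_by_year(sample, "election")
# result == {1995: (1, 0.5), 2015: (2, 0.5)}
```

In this toy data the raw count doubles between the two years while the share stays the same, which is one reason raw counts alone can mislead.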


3. Machine learning modelling

Since this project was more exploratory, I did not have a specific prediction task in mind. I tried to predict which year a headline was from using KMeans clustering and logistic regression, but the heavy class imbalance (only 200 articles from 1995 but 2,000 from 2015) made the task difficult.

Predicting the subject of headlines was a more interesting and successful task. I had to reduce the number of possible categories to 9 by combining subjects. The logistic regression model had an average recall of 0.7, which is fairly accurate considering a headline could be strongly related to 2 or more categories.
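For readers unfamiliar with the metric, here is a sketch of how an average recall can be computed. I am assuming a macro average (the unweighted mean of per-class recall); the text above does not say which averaging was used, and the labels below are invented.

```python
from collections import Counter


def recall_per_class(y_true, y_pred):
    """Recall for each true label: correctly predicted / total with that label."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / support[label] for label in support}


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so small classes count equally."""
    per_class = recall_per_class(y_true, y_pred)
    return sum(per_class.values()) / len(per_class)


# Invented labels for illustration only.
y_true = ["politics"] * 4 + ["sports"] * 2
y_pred = ["politics", "politics", "politics", "sports", "sports", "politics"]
score = macro_recall(y_true, y_pred)
# politics: 3/4, sports: 1/2, so score == 0.625
```

Because each headline gets a single label here, a headline that fits two categories equally well still counts as a miss whenever the model picks the other one.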


4. Building a Flask API

I wanted to build an example API where a user could send headlines and receive my model's predictions and calculated probabilities. The actual prediction is taken care of by scikit-learn, so the challenge here came from structuring the response JSON, handling one or many inputs, and handling errors.
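The three challenges mentioned above can be sketched without any web framework. Everything here is hypothetical: the `"headlines"` key, the response layout, and the helper names `normalize_headlines` and `build_response` are illustrative choices, not the repository's actual API. `predict_proba` stands in for any callable that returns one list of probabilities per headline, in the same order as `labels`.

```python
def normalize_headlines(payload):
    """Accept {"headlines": "text"} or {"headlines": ["a", "b"]}; return a list."""
    if not isinstance(payload, dict) or "headlines" not in payload:
        raise ValueError('expected a JSON object with a "headlines" key')
    value = payload["headlines"]
    if isinstance(value, str):
        value = [value]  # treat a single headline as a list of one
    if not isinstance(value, list) or not value:
        raise ValueError('"headlines" must be a string or a non-empty list')
    if not all(isinstance(h, str) and h.strip() for h in value):
        raise ValueError("every headline must be a non-empty string")
    return value


def build_response(payload, predict_proba, labels):
    """Return (json_serializable_body, http_status_code)."""
    try:
        headlines = normalize_headlines(payload)
    except ValueError as exc:
        return {"error": str(exc)}, 400
    results = []
    for text, probs in zip(headlines, predict_proba(headlines)):
        by_label = {label: round(float(p), 4) for label, p in zip(labels, probs)}
        results.append({
            "headline": text,
            "prediction": max(by_label, key=by_label.get),
            "probabilities": by_label,
        })
    return {"results": results}, 200
```

In a Flask view, the payload would typically come from `request.get_json()`, and the returned body and status code could be passed to `jsonify` and returned together. Keeping validation and response building in plain functions also makes them easy to test without starting a server.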


5. Making an interactive Streamlit app

The trend plot from the exploratory data analysis and the subject prediction both work better with visuals on a website. Streamlit made this really straightforward, and I used Bokeh to make the trend plot.
