Headline Analysis Project
- A view of or attitude toward a situation or event; an opinion. A general feeling or opinion.
- A feeling or emotion.
In this context, sentiment was used to gauge how positive or negative a headline was.
#####This was a Devbootcamp Chicago final team project. Our goal for the 8-day project was to analyze the sentiment of different news agencies' headlines over time.
Interesting questions we were hoping to answer include:
- Are there any general sentimental trends for a news source's headlines over time?
- Is there any correlation between time of the year and sentiment?
- Is there any correlation between the current political environment and different news sources' headline sentiment?
- Can we visually detect positive/negative current events over time?
Besides the unsurprising result that all news trends negative, no scientific conclusions can be drawn from this first iteration of the project.
- Despite collecting a ton of headlines, there is still not enough data. (Wayback Machine data collection only really picks up around mid-2011.)
- No true statistical analysis was completed. (We only had 8 days, weren't data scientists, and chose interactivity over further analysis.)
- Only one sentiment engine was used. Even with sufficient data, more than one scoring engine would be needed to back any conclusions.
##How it works
#####The quick and dirty of how this worked:
- Target a specific news site on the Wayback Machine.
- Open the front page of the news site and save up to 20 headlines for that day.
- Repeat for every day available, going back up to 5ish years. (about 20-30 thousand headlines per news site)
- Feed each headline through Alchemy API's sentiment analysis engine and save the resulting score. (Scores range from -1 (negative) to 1 (positive).)
- Save headline, date, score, and respective news source to our database.
- Plot data using D3.
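The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: `collect_records`, `HeadlineRecord`, and the two injected callables are hypothetical names, and the fetcher and scorer are stand-ins so the example runs without touching the Wayback Machine or Alchemy API.

```ruby
require 'date'

# Cap described in the steps above: up to 20 headlines per day.
MAX_HEADLINES_PER_DAY = 20

# Hypothetical record shape mirroring the saved fields:
# headline, date, score, and news source.
HeadlineRecord = Struct.new(:headline, :date, :score, :source)

# fetch_headlines and score_headline are injected callables (assumptions).
# In the real project they would wrap the scraper and the sentiment engine.
def collect_records(source, dates, fetch_headlines:, score_headline:)
  dates.flat_map do |date|
    headlines = fetch_headlines.call(source, date).first(MAX_HEADLINES_PER_DAY)
    headlines.map do |headline|
      HeadlineRecord.new(headline, date, score_headline.call(headline), source)
    end
  end
end

# Stand-ins for illustration only: 25 fake headlines per day, neutral score.
fake_fetch = ->(_source, date) { (1..25).map { |i| "Headline #{i} for #{date}" } }
fake_score = ->(_headline) { 0.0 }

dates = (Date.new(2012, 1, 1)..Date.new(2012, 1, 3)).to_a
records = collect_records('example-news', dates,
                          fetch_headlines: fake_fetch,
                          score_headline: fake_score)
puts records.size
```

Passing the fetcher and scorer in as arguments keeps the loop testable offline and would make it easier to swap in a second sentiment engine later.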
You can find a slightly more detailed explanation here
- RoR, Postgres/MemCache
- Alchemy API was cool enough to provide us with an API key worth 30,000 requests a day. Their engine and robust API allowed us to score each headline very quickly.
- Wayback Machine was our source of the actual news pages going back in time.
- Wayback Gem turned out to be incredibly useful/important for our project. It made it very easy to collect all the different URLs needed for scraping.
- Nokogiri is a well-known scraping gem that made grabbing what we needed from each page relatively simple.
- D3 is an impressive plotting tool that was a challenge to learn but is the backbone to our visualizations.
- The script is not bad for how simple/short it is, but it could be refined to collect better/more headlines.
- More than one sentiment engine score would be cool/interesting.
- Include a counter to track keywords within headlines. (This might be difficult but would be a very powerful analysis tool)
- More unique/interesting D3 visualizations.
Feel free to fork away!
#####Luiz - Script and visualizations
#####Corey S - Script and visualizations
#####Kelmer - Database seeding, migrating and optimization (memcache)
#####Corey W - Database seeding and front-end framework
Clone the repository
Run `bundle install` if you do not have the gems used in this project.
Set up your database by using the `rake db:create` and `rake db:migrate` commands.
Visit the Alchemy API website and request an API key.
###Wayback Scraper Script
This script scrapes headlines from the given URL. It lives in `lib/wayback_scrapper`, and `script_v1.rb` drives it. The scraped headlines are written to a local CSV file. (Be sure to create a folder for the source first.) CSVs can be found in the headlines directories inside the lib directory.
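As a rough sketch of the CSV step, assuming a two-column `date,headline` layout and one folder per source (neither is confirmed by the project's script), the write could look like this. `write_headlines_csv` and the `headlines.csv` file name are hypothetical, and a temporary directory stands in for the real output location.

```ruby
require 'csv'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: writes [date, headline] pairs to a CSV inside a
# per-source folder, creating that folder first if it does not exist.
def write_headlines_csv(base_dir, source, rows)
  source_dir = File.join(base_dir, source)
  FileUtils.mkdir_p(source_dir)

  path = File.join(source_dir, 'headlines.csv')
  CSV.open(path, 'w') do |csv|
    csv << %w[date headline] # assumed header row
    rows.each { |date, headline| csv << [date, headline] }
  end
  path
end

# A temp directory is used here so the example leaves the repo untouched.
csv_path = write_headlines_csv(Dir.mktmpdir, 'example_news', [
  ['2012-01-01', 'Markets rally, then stall'],
  ['2012-01-02', 'Storm closes "major" roads']
])
puts File.read(csv_path)
```

Using the `CSV` library rather than joining strings by hand matters for headlines, which often contain commas and quotation marks.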
Run the seed file to populate the database from the CSV.
The file also runs each article title through the Alchemy API to update the sentiment_score field in the database.
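The seeding step might look something like the sketch below. It is illustrative only: `seed_rows`, the CSV column names, and every attribute key other than `sentiment_score` are assumptions, the scorer is a stub rather than a real Alchemy API call, and plain hashes stand in for database records.

```ruby
require 'csv'

# Sample input using an assumed date,headline column layout.
SAMPLE_CSV = <<~CSV_DATA
  date,headline
  2012-01-01,Markets rally
  2012-01-02,Storm closes roads
CSV_DATA

# Hypothetical helper: builds one attribute hash per CSV row and fills in
# sentiment_score by passing the title to the given scorer callable.
def seed_rows(csv_text, source, scorer)
  CSV.parse(csv_text, headers: true).map do |row|
    {
      title: row['headline'],
      date: row['date'],
      source: source,
      sentiment_score: scorer.call(row['headline'])
    }
  end
end

# Stub scorer for illustration; the real seed file calls the sentiment API.
stub_scorer = ->(title) { title.include?('Storm') ? -0.5 : 0.25 }

seeded = seed_rows(SAMPLE_CSV, 'example_news', stub_scorer)
seeded.each { |attrs| puts attrs }
```

In a Rails seed file, each hash would presumably be handed to a model's `create` call instead of being printed.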