# Notebook Instructions
<i>You can run the notebook document sequentially (one cell at a time) by pressing <b> shift + enter</b>. While a cell is running, a [*] will display on the left. When it has been run, a number will display indicating the order in which it was run in the notebook [8].</i>

<i>Enter edit mode by pressing <b>`Enter`</b> or using the mouse to click on a cell's editor area. Edit mode is indicated by a green cell border and a prompt showing in the editor area.</i> <BR>
    
This course is based on specific versions of Python packages. You can find the details in <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank" >this manual</a>. The manual also explains how to use this code with other versions of the Python packages. <BR>

## Recent News Headlines

There are various Python APIs such as [Webhose](http://webhose.io/), [NewsAPI](http://newsapi.org/pricing), [GoogleNews](http://pypi.org/project/GoogleNews/) which aggregate news headlines from various media sources.

In this notebook, you will learn how to fetch articles from the Webhose API. This notebook is divided into the following sections:

1. Import libraries
2. Get API key
3. Apply filters
4. Fetch 100 news headlines
5. Fetch remaining headlines

## Import libraries

In [1]:
# For data manipulation
import pandas as pd

# Import webhoseio
import webhoseio

## Get API key

To use the Webhose API, you need an API key, which is sent with every request. To get an API key, create an [account](https://webhose.io/auth/signup), and then go to the [dashboard](https://webhose.io/dashboard) to see your token.

<b>Note:</b> For free users, API requests are limited to 1000 per month. Once the limit is reached, you need to purchase the API access for further use.

In [0]:
# Import the get_webhoseio_key from the sentiment_analysis_quantra module
# The code of this module can be found in the downloads (last section) of this course
# You need to edit sentiment_analysis_quantra.py file and add your webhoseio key manually before you continue
from data_modules.sentiment_analysis_quantra import get_webhoseio_key
api_key = get_webhoseio_key()
webhoseio.config(token=api_key)
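
As an alternative to editing the module file, the key could be read from an environment variable so that it never appears in the notebook. This is only a minimal sketch: the helper `load_api_key` and the variable name `WEBHOSE_API_KEY` are hypothetical and are not part of the course module or of the `webhoseio` package.

```python
import os


def load_api_key(env_var="WEBHOSE_API_KEY"):
    """Return the API key stored in an environment variable.

    Both this helper and the default variable name are illustrative
    assumptions, not part of webhoseio or the course module.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Webhose token")
    return key
```

The returned value could then be passed to `webhoseio.config(token=...)` in the same way as `api_key` above.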

## Apply filters

To fetch news headlines, you can use the `query` function, which takes the endpoint name `filterWebContent` and a dictionary of filters as arguments. The `filterWebContent` endpoint gives access to the news/blogs/forums/reviews API of Webhose.

The filters that we use to get the news articles are:

1. language: The language of the news headline. We are using English as the language. Other supported languages are French, Spanish, Hindi, Chinese etc.

2. site_type: site_type is the type of data. We are using 'news' as the site_type. Other site_types are Blogs and Discussions.

3. site_category: site_category is the category of the news you want, such as financial, sports etc. We are using financial_news.

4. site: This is the site name you wish to fetch the data from. We fetch the data from 'cnn.com'. You can fetch the data from any available site.

5. thread.country: It is the country you wish to seek the news headline data from. In our case, we fetch the US data.

6. ts: The "ts" (timestamp) parameter returns results that were crawled after the given timestamp (Unix timestamp in milliseconds). When it is not specified, the default is the past 3 days. Free users can retrieve data from the past 30 days at most.


There are various other filters that you can use. Read more about the [filters](https://docs.webhose.io/docs/filters-reference) supported by this API.
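
The `ts` filter described above expects a Unix timestamp in milliseconds. The following sketch shows one way such a value could be computed with the standard library. The helper name `ts_days_ago` is hypothetical, and the fixed reference date is only there to make the example reproducible.

```python
from datetime import datetime, timedelta, timezone


def ts_days_ago(days, now=None):
    """Unix timestamp in milliseconds for `days` days before `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return int((now - timedelta(days=days)).timestamp() * 1000)


# Fixed reference time (an arbitrary choice) so the result is reproducible
reference = datetime(2019, 11, 11, tzinfo=timezone.utc)
ts = ts_days_ago(30, now=reference)
```

Assuming the API accepts `ts` as a separate request parameter next to `q`, the value would go into the filters dictionary as `{"q": "...", "ts": ts}`.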

In [0]:
# The token was already configured above with webhoseio.config(token=api_key),
# so the key does not need to be written into the notebook
filters = {"q": "language:english site_type:news site:cnn.com thread.country:US"}
cursors = webhoseio.query("filterWebContent", filters)
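
The `q` string used above is a space-separated list of `field:value` pairs. When the filters change often, it can be convenient to build that string from a dictionary. This is a small sketch; `build_query` is a hypothetical helper, not a function of the `webhoseio` package.

```python
def build_query(filter_values):
    """Join field:value pairs with spaces, the format of the `q` string."""
    return " ".join(f"{field}:{value}"
                    for field, value in filter_values.items())


# The same filters as in the cell above, written as a dictionary
q = build_query({
    "language": "english",
    "site_type": "news",
    "site": "cnn.com",
    "thread.country": "US",
})
filters = {"q": q}
```

The resulting `filters` dictionary has the same shape as the one passed to `webhoseio.query` above.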

In [0]:
cursors.keys()



These are the keys of the `cursors` dictionary:

1. 'posts': The list of articles on the current page
2. 'totalResults': The total number of articles matching the filter
3. 'moreResultsAvailable': The number of articles remaining after the current page
4. 'next': The URL of the next page of results, which carries the filters, API key etc.
5. 'requestsLeft': The number of requests left for the API key
6. 'warnings': Any warnings associated with the request
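
To see how these keys relate to each other, the sketch below summarises one page of results. It runs on a made-up dictionary that only imitates the shape of a response, so the numbers are illustrative and `summarize_page` is a hypothetical helper.

```python
def summarize_page(page):
    """Collect the paging counters of one response in a small dictionary."""
    return {
        "fetched": len(page["posts"]),             # posts on this page
        "total": page["totalResults"],             # posts matching the filter
        "remaining": page["moreResultsAvailable"], # posts on later pages
    }


# Made-up response with the same keys as `cursors` (not real API output)
sample_page = {"posts": [{}, {}], "totalResults": 5, "moreResultsAvailable": 3}
summary = summarize_page(sample_page)
```

On a real response, `summarize_page(cursors)` would report the same three counters.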


## Fetch the news headlines

To get the headlines, we use the `posts` key of the `cursors` dictionary.

In [0]:
# Display the full record of one post
# Index 1 selects the second post in the list
cursors['posts'][1]

{'thread': {'uuid': '4b814f6029da5dfb9d3bfe3622219336b2214ff6',
  'url': 'https://edition.cnn.com/2019/11/08/us/west-virginia-pedestrian-fatally-dragged/index.html',
  'site_full': 'edition.cnn.com',
  'site': 'cnn.com',
  'site_section': 'http://rss.cnn.com/rss/cnn_latest.rss',
  'site_categories': ['media'],
  'section_title': 'CNN.com - RSS Channel',
  'title': 'Woman died after being hit by a truck and then dragged for miles by another vehicle, police say',
  'title_full': 'Woman died after being hit by a truck and then dragged for miles by another vehicle, police say - CNN',
  'published': '2019-11-08T02:00:00.000+02:00',
  'replies_count': 0,
  'participants_count': 1,
  'site_type': 'news',
  'country': 'US',
  'spam_score': 0.0,
  'main_image': 'https://cdn.cnn.com/cnnnext/dam/assets/191108040820-west-virginia-woman-dragged-by-car-super-tease.jpg',
  'performance_score': 0,
  'domain_rank': None,
  'social': {'facebook': {'likes': 34, 'comments': 14, 'shares': 39},
   'gplus': 

In [0]:
cursors['posts'][0].keys()

dict_keys(['thread', 'uuid', 'url', 'ord_in_thread', 'author', 'published', 'title', 'text', 'highlightText', 'highlightTitle', 'language', 'external_links', 'external_images', 'entities', 'rating', 'crawled'])

These are the keys you can use to get specific information about a headline. You can use `text` to get the article text, `title` to get the headline and `published` to get the publication date.

In [0]:
# Fetching the text of the article
cursors['posts'][0]['text']
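
The `published` value is an ISO 8601 string with a UTC offset, such as `2019-11-08T02:00:00.000+02:00` in the output above. If the dates are needed for sorting or for alignment with price data, they can be converted to timezone-aware datetimes. A minimal sketch with the standard library follows; `parse_published` is a hypothetical helper.

```python
from datetime import datetime, timezone


def parse_published(published):
    """Parse the ISO 8601 `published` string and convert it to UTC."""
    return datetime.fromisoformat(published).astimezone(timezone.utc)


# Timestamp string taken from the sample output above
published_utc = parse_published("2019-11-08T02:00:00.000+02:00")
```

Converting everything to UTC avoids mixing offsets when posts from several sources are combined.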



## Getting 100 news headlines

We loop over the posts on the page and store the date, title and text of each one. One page of results contains at most 100 news headlines, so a single query fetches at most 100 headlines.

In [0]:
# Define a function store_articles with parameters df and cursors
def store_articles(df, cursors):

    # Loop over the posts on the page.
    # Get the publishing date, title and text of each article.
    rows = [{'Date': post['published'],
             'Title': post['title'],
             'News_Headlines': post['text']}
            for post in cursors['posts']]

    # Append the rows to df. pd.concat is used because
    # DataFrame.append was removed in pandas 2.0.
    return pd.concat([df, pd.DataFrame(rows, columns=df.columns)],
                     ignore_index=True)


# Create a dataframe with the columns Date, Title and News_Headlines
df = pd.DataFrame(columns=['Date', 'Title', 'News_Headlines'])

df = store_articles(df, cursors)
df.head()

| | Date | Title | News_Headlines |
|---|---|---|---|
| 0 | 2019-11-08T02:00:00.000+02:00 | Australia bushfires: 'Unprecedented' number of... | (CNN) A series of "unprecedented" bushfires ar... |
| 1 | 2019-11-08T02:00:00.000+02:00 | Woman died after being hit by a truck and then... | (CNN) A pedestrian was struck by a truck in Wh... |
| 2 | 2019-11-08T02:00:00.000+02:00 | Tampons will no longer be taxed as luxury item... | Berlin (CNN) What do wine, cigarettes and tamp... |
| 3 | 2019-11-08T02:00:00.000+02:00 | Hong Kong protests could wipe $275 million off... | Hong Kong (CNN Business) Hong Kong's political... |
| 4 | 2019-11-08T02:00:00.000+02:00 | A man returning home from work found a woman's... | (CNN) A gated community in Simi Valley, Califo... |


In [0]:
# Prints the length of the dataset
len(df)

100

## Fetch remaining news headlines

To fetch the next page of results, we use the `get_next` function of `webhoseio`. We run a while loop that stops once `totalResults` is equal to 0.

<b>Note</b>: Fetching will take some time as we are fetching many news headlines. 

In [0]:
while True:
    cursors = webhoseio.get_next()
    df = store_articles(df, cursors)
    if cursors['totalResults'] == 0:
        break
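
Because free accounts have a monthly request limit, it can be useful to cap the number of pages a loop is allowed to fetch. The sketch below is a variation on the loop above, not the course's method. It assumes that `moreResultsAvailable` drops to 0 on the last page, and `fetch_all_posts` and `max_pages` are hypothetical names. The page-fetching function is passed in as an argument, so the example can run offline with fake pages.

```python
def fetch_all_posts(first_page, get_next, max_pages=10):
    """Collect posts page by page until none remain or the cap is hit.

    `get_next` is any zero-argument callable that returns the next page,
    for example webhoseio.get_next. `max_pages` counts the first page too.
    """
    posts = list(first_page['posts'])
    page = first_page
    pages_fetched = 1
    while page.get('moreResultsAvailable', 0) > 0 and pages_fetched < max_pages:
        page = get_next()
        posts.extend(page['posts'])
        pages_fetched += 1
    return posts


# Fake pages standing in for API responses (made-up data)
first = {'posts': [{'title': 'a'}, {'title': 'b'}], 'moreResultsAvailable': 1}
later_pages = iter([{'posts': [{'title': 'c'}], 'moreResultsAvailable': 0}])
all_posts = fetch_all_posts(first, lambda: next(later_pages))
titles = [post['title'] for post in all_posts]
```

With the real API, the call would look like `fetch_all_posts(cursors, webhoseio.get_next, max_pages=5)`, which spends at most four additional requests.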

In [0]:
# Returns the length of the dataset
len(df)

616

In [0]:
# Returns the bottom 5 rows of the dataset
df.tail()

| | Date | Title | News_Headlines |
|---|---|---|---|
| 611 | 2019-11-11T02:00:00.000+02:00 | Bill Moyers: 'Do facts matter anymore? I think... | New York (CNN Business) Bill Moyers, the legen... |
| 612 | 2019-11-11T02:00:00.000+02:00 | Cyclone Bulbul: Over 2 million people evacuate... | (CNN) At least 10 people have been killed and ... |
| 613 | 2019-11-11T02:00:00.000+02:00 | Michigan woman called a friend to say she was ... | (CNN) A Michigan woman called her friend to sa... |
| 614 | 2019-11-11T07:44:00.000+02:00 | People's Choice Awards 2019: Best fashion on t... | People's Choice Awards 2019: Best fashion on t... |
| 615 | 2019-11-11T02:00:00.000+02:00 | Hong Kong protests rage after police shooting:... | 55 min ago Here's what you need to know Police... |


In the upcoming units, you will practice this code in the interactive exercises.<br><br>