## Extracting news from publicly available RSS feeds

This notebook demonstrates how to extract news from publicly available RSS feeds. **For showcasing purposes, the providers and URLs have been removed.** The extracted articles are intended for sentiment analysis.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re 
import html
import os 
import datetime

urls = {'XXX' : "XXX"}

RSS feeds are based on XML. Using BeautifulSoup's XML parser (`features='xml'`, which relies on lxml), we can find the information we want by tag name. If an item is missing any of the required tags (e.g. description), which happens occasionally, we discard that article and increment *skip_count*.
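
The skip logic can be illustrated on a tiny in-memory feed. This is only a sketch: to stay self-contained it uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, and the sample feed, `REQUIRED_TAGS`, and `parse_items` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; the second item has no <description>.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item>
    <title>Markets rally</title>
    <description>Stocks rose on Monday.</description>
    <link>https://example.com/a</link>
    <pubDate>Mon, 06 Jan 2020 08:00:00 GMT</pubDate>
  </item>
  <item>
    <title>No summary here</title>
    <link>https://example.com/b</link>
    <pubDate>Mon, 06 Jan 2020 09:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

REQUIRED_TAGS = ('title', 'description', 'link', 'pubDate')

def parse_items(xml_text):
    """Return (articles, skipped) for an RSS document given as a string."""
    articles, skipped = [], 0
    for item in ET.fromstring(xml_text).iter('item'):
        # findtext returns None when the child tag is absent.
        fields = {tag: item.findtext(tag) for tag in REQUIRED_TAGS}
        if any(value is None for value in fields.values()):
            skipped += 1  # a required tag is missing, so drop the article
            continue
        articles.append(fields)
    return articles, skipped

articles, skipped = parse_items(SAMPLE_RSS)
print(len(articles), skipped)
```

Only the first item survives; the second is counted as skipped because it lacks a description.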

Each source formats its RSS feed differently, so we pre-process the incoming articles with a regular expression (regex) specific to that source.
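
One possible way to organise per-source clean-up is a lookup table of patterns. The sketch below is illustrative only: the source names and the HTML-tag pattern are hypothetical, and only the ellipsis pattern is taken from this notebook.

```python
import re

# Hypothetical mapping from source name to the clean-up pattern for its descriptions.
SOURCE_PATTERNS = {
    'source_a': re.compile(r'[ ]?\.\.\.'),  # truncation marker such as " ..."
    'source_b': re.compile(r'<[^>]+>'),     # leftover HTML tags (assumed example)
}

def clean_description(source, text):
    """Strip the source-specific pattern; unknown sources pass through unchanged."""
    pattern = SOURCE_PATTERNS.get(source)
    return pattern.sub('', text) if pattern else text

print(clean_description('source_a', 'Stocks rallied on Monday ...'))
print(clean_description('source_b', '<p>Oil prices fell.</p>'))
```

Keeping the patterns in one table means that supporting a new source only requires adding an entry, not another `if` branch.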

Once pre-processed, all articles are compiled into a CSV file:
1. If news.csv exists, append to it: keep only the rows of df whose link is not already in the CSV file (stored in 'update').
2. If news.csv does not exist, create it.
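
The two cases above can be sketched as a single helper. This is a minimal illustration, not the notebook's own code: `export_articles` is a hypothetical name, the sample rows are made up, and a temporary directory stands in for the working directory.

```python
import os
import tempfile

import pandas as pd

def export_articles(df, path):
    """Append rows whose link is not yet in the CSV at `path`, or create the file.

    Returns the number of rows written.
    """
    if os.path.exists(path):
        known_links = pd.read_csv(path, usecols=['link'])['link']
        update = df.loc[~df['link'].isin(known_links)]
        if not update.empty:
            update.to_csv(path, mode='a', index=False, header=False)
        return len(update)
    df.to_csv(path, mode='w', index=False)
    return len(df)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'news.csv')
    first = pd.DataFrame({'title': ['A', 'B'],
                          'link': ['https://example.com/a', 'https://example.com/b']})
    second = pd.DataFrame({'title': ['B', 'C'],
                           'link': ['https://example.com/b', 'https://example.com/c']})
    written_first = export_articles(first, path)    # file is created
    written_second = export_articles(second, path)  # only the unseen link is appended
    total_rows = len(pd.read_csv(path))

print(written_first, written_second, total_rows)
```

Re-reading the link column on every call keeps the de-duplication correct even when several sources are exported in the same run.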

First, we attempt to open the 'news.csv' file. If the file does not exist, we will create the file when we export our first set of articles.

In [2]:
new_count, skip_count, create_file = 0, 0, 0
try:
    read = pd.read_csv('news.csv', sep=',')
    print("Reading of CSV file successful.")
    link = read['link']  # Links already stored, used later to filter out duplicates
except FileNotFoundError:
    create_file = 1  # news.csv does not exist yet; it is created on the first export.

Reading of CSV file successful.


In [3]:
for source, url in urls.items():
        
    try:
        resp = requests.get(url)  # Skip any invalid or unreachable URL
    except requests.exceptions.RequestException:
        print('Invalid URL: {}'.format(url))
        continue
        
    soup = BeautifulSoup(resp.content, features= 'xml')
    items = soup.find_all('item')
    news = []
    for item in items:
        try:
            i = {}
            i['title'] = html.unescape(item.title.text)
            i['description'] = html.unescape(item.description.text)
            i['link'] = item.link.text
            i['pubDate'] = pd.to_datetime(item.pubDate.text).tz_convert(None).normalize()
            i['source'] = source
            news.append(i)
        except Exception:  # A missing tag or an unparsable date skips the article
            skip_count += 1
            continue
        
    df = pd.DataFrame(news, columns = ['title', 'description', 'link', 'pubDate', 'source'])
        
    # Pre-process the incoming news with a regex specific to each source,
    # because each source formats its RSS feed differently.
    if source in ['XXX', 'XXA', 'XXB', 'XXC']:
        regex = r'[ ]?\.\.\.'  # Truncation marker: an optional space followed by "..."
        df['description'] = df['description'].str.replace(regex, '', regex=True)
                
    if create_file != 1:
        # Keep only the articles whose link is not already stored
        update = df.loc[~df['link'].isin(link.values)]
        if len(update) != 0:
            new_count += len(update)
            update.to_csv('news.csv', mode='a', index=False, header=False)
            print('Updated on {} for {} with {} new article(s).'.format(datetime.datetime.now(), source, len(update)))
    else:
        df.to_csv('news.csv', mode='w', index=False, sep=',')
        create_file = 0
        link = df['link']  # Later sources are de-duplicated against the links just written
        print('File created: {}'.format(datetime.datetime.now()))

In [4]:
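total = len(pd.read_csv('news.csv'))  # Re-read so the total includes this run and works when the file was just created
print('No. of items skipped: {}'.format(skip_count), '\nNew article(s): {}'.format(new_count), '\nTotal articles: {}'.format(total))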

No. of items skipped: 0 
New article(s): 0 
Total articles: 100
