# Spectral Clustering News Stories

The first step is to gather some data. For each topic we fetch a Google News RSS feed, parse it with BeautifulSoup in a small custom function to collect the article links, and then use goose3 to extract the title and text of each linked article.

The goal of clustering these news articles is to identify topics. Google already gives us a potential tag for each article: the overarching section it is listed under, such as:
  * U.S.
  * World
  * Sports
  * Technology
  * Entertainment
  * etc...

In [41]:
import bs4
import lxml #xml parser
import ssl
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
from goose3 import Goose
from tqdm import tqdm
import pandas as pd

def scape_news_links(xml_news_url):
    '''
    Provide a url which will return an XML file to parse. Returns the links to news stories.
    
    Params
    ------
    xml_news_url: string
        A URL which will return an XML (RSS) document listing news stories.
    '''

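    # NOTE: this disables TLS certificate verification; tolerable for a quick
    # experiment, but not something to carry into production code.
    context = ssl._create_unverified_context()
    # Fetch the raw XML; the with-block closes the connection for us.
    with urlopen(xml_news_url, context=context) as client:
        xml_page = client.read()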

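    # Parse the feed; BeautifulSoup's "xml" parser requires lxml.
    soup_page = soup(xml_page, "xml")

    # Each <item> element is one news story (find_all replaces the deprecated findAll).
    news_list = soup_page.find_all("item")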

    return [news.link.text for news in news_list]

def scrape_news_articles(topics=None):
    # Use None rather than a mutable default; fall back to the main sections.
    if not topics:
        topics = ['WORLD', 'SPORTS', 'TECHNOLOGY', 'BUSINESS', 'ENTERTAINMENT', 'SCIENCE', 'HEALTH']
    stem = "https://news.google.com/news/rss/headlines/section/topic/"
    g = Goose()
    articles = []
    for topic in topics:
        links = scape_news_links(stem + topic)
        for link in tqdm(links):
            try:
                article = g.extract(url=link)
                articles.append({
                    'title': article.title,
                    'tag': topic,
                    'text': article.cleaned_text
                })
            except Exception:
                # Skip articles that fail to download or parse.
                continue
    return articles
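
To see what the link-extraction step does without touching the network, here is a minimal sketch that pulls the `<link>` of every `<item>` out of an RSS string. It uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so it runs without extra dependencies. The function name `extract_item_links` and the sample feed are assumptions made up for illustration, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS snippet standing in for a real feed response.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://example.com/</link>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
"""

def extract_item_links(xml_text):
    """Return the <link> text of every <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:  # skip items that carry no link
            links.append(link.strip())
    return links

print(extract_item_links(SAMPLE_RSS))
```

Note that the channel-level `<link>` is ignored because it does not sit inside an `<item>`, which mirrors how `scape_news_links` only reads links from the items it finds.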

In [49]:
pull_new = False  # set to True to re-scrape instead of loading the cached pickle
if pull_new:
    data = scrape_news_articles()
    df = pd.DataFrame(data)
    df.to_pickle('raw_articles.pickle')
else:
    df = pd.read_pickle('raw_articles.pickle')
df.head()

Unnamed: 0,tag,text,title
0,WORLD,(LONDON) — British Prime Minister Theresa May ...,U.K. Prime Minister Theresa May Promises Depar...
1,WORLD,,"Happy 1st wedding anniversary, Prince Harry an..."
2,WORLD,The United States has reached a deal to lift s...,US reaches deal to lift steel and aluminum tar...
3,WORLD,LONDON/OSLO (Reuters) - Iran’s elite Revolutio...,Exclusive: Insurer says Iran's Guards likely t...
4,WORLD,WARSAW — Anna Misiewicz was just 7 years old w...,‘Tell No One’: Poland Is Pushed to Confront Ab...
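
Row 1 in the output above has an empty `text` field, which suggests that text extraction can come back empty for some pages. Such rows carry nothing to cluster on, so one option is to drop them before building the DataFrame. The sketch below assumes records shaped like the dictionaries built in `scrape_news_articles`; the helper name `drop_empty_articles` and the sample records are hypothetical.

```python
def drop_empty_articles(articles):
    """Keep only articles whose 'text' field has non-whitespace content."""
    return [a for a in articles if (a.get("text") or "").strip()]

# Made-up records in the same shape as those built by scrape_news_articles().
sample = [
    {"title": "Story A", "tag": "WORLD", "text": "Some body text."},
    {"title": "Story B", "tag": "WORLD", "text": ""},
    {"title": "Story C", "tag": "SPORTS", "text": "   "},
]

kept = drop_empty_articles(sample)
print([a["title"] for a in kept])
```

Filtering at this stage keeps the cached pickle free of unusable rows; the same check could instead be applied to `df` after loading if the raw data should be preserved.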


