# AXIOS News Scraper
### You need Google Chrome to run this notebook

#### If you already have Chrome installed:
1. Go to https://sites.google.com/chromium.org/driver/downloads?authuser=0
2. Download the ChromeDriver release that matches the version of your Chrome
3. Double-click the driver to open it
4. Come back here and start running the notebook


In [116]:
# Necessary libraries
from bs4 import BeautifulSoup
from selenium import webdriver   
from requests import get
import time
import re
import csv
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import datetime
import random

# Dictionary for storing the links of news articles.
# Using the links as dictionary keys prevents duplicates.
links_dict = {}
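
As a small illustration of why a dictionary is used here, the sketch below inserts the same link twice and ends up with a single key, with keys kept in insertion order. The URLs and the name `demo_dict` are made up for the example; the real links come from the scraping cell further down.

```python
# Hypothetical example links, not real scraped data.
sample_links = [
    "https://www.axios.com/example-a.html",
    "https://www.axios.com/example-b.html",
    "https://www.axios.com/example-a.html",  # duplicate of the first entry
]

demo_dict = {}
for link in sample_links:
    # Same membership check as the scraping loop uses for links_dict
    if link not in demo_dict:
        demo_dict[link] = 0

unique_links = list(demo_dict.keys())
print(len(unique_links))
```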

In [53]:
# This snippet opens a Chrome window and navigates to the politics news page on the Axios website.
# DON'T CLOSE THE WINDOW!

# Here the start date is set to October 7th, 2021; you can set it to any other date.
# The following cell then scrapes news from this start date backwards.
url = "https://www.axios.com/politics-policy/2021/10/07"
driver = webdriver.Chrome('/Users/wufangzheng/Downloads/chromedriver') #set the webdriver to Chrome driver
driver.get(url)  
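
The start URL can also be built from a date object instead of being typed by hand. This is only a sketch: the helper name `archive_url` is hypothetical, and it assumes every dated page follows the same `politics-policy/YYYY/MM/DD` pattern as the start URL above.

```python
import datetime

# Base path copied from the start URL above.
BASE_URL = "https://www.axios.com/politics-policy"

def archive_url(day):
    """Return the dated archive URL for a datetime.date (hypothetical helper)."""
    return f"{BASE_URL}/{day:%Y/%m/%d}"

start_day = datetime.date(2021, 10, 7)
previous_day = start_day - datetime.timedelta(days=1)
print(archive_url(start_day))
print(archive_url(previous_day))
```

Passing `archive_url(start_day)` to `driver.get` would be equivalent to the hard-coded `url` in this cell.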


## Web-scrape links of news

In [54]:
# Scrape ONLY THE LINKS of news from the Chrome page opened above,
# then jump to the next page, which contains the news of the previous date.

# This cell may fail every once in a while; every time it fails, just re-run it.

# find_element(s)(By.CLASS_NAME, ...) replaces the find_*_by_class_name helpers,
# which were removed in recent Selenium 4 releases
from selenium.webdriver.common.by import By

for i in range(3000):
    try:
        # see if this page contains any news
        resultList = driver.find_elements(By.CLASS_NAME, "title-link")
        
        # if the page does not contain any news, jump to the next page
        if len(resultList)==0:
            print("This page does not contain news, continue")
            print(driver.current_url)
            button = driver.find_element(
                By.CLASS_NAME, "pagination-area"
            ).find_elements(By.CLASS_NAME, "gtm-content-link")
            button = button[-1]
            button.click()
            time.sleep(1)
            continue
        
        # if the page contains at least one news item, store the links in links_dict
        for result in resultList:
            link = result.get_attribute('href')
            if link not in links_dict:
                links_dict[link] = 0

        # jump to the next page, which contains the news of the previous day
        button = driver.find_element(
            By.CLASS_NAME, "pagination-area"
        ).find_elements(By.CLASS_NAME, "gtm-content-link")
        button = button[-1]
        button.click()
        time.sleep(5)
        
    # if any error happens, wait 8 seconds and try again
    # (Exception rather than BaseException, so KeyboardInterrupt still stops the loop)
    except Exception as err:
        print("Something is wrong, keep going after 8 secs...")
        time.sleep(8)
    


This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/21
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/20
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/19
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/18
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/17
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/16
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/15
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/14
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/13
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/12
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/12/11

This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/21
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/20
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/19
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/18
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/17
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/16
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/15
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/14
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/13
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/12
This page does not contain news, continue
https://www.axios.com/politics-policy/2016/09/11

KeyboardInterrupt: 

In [129]:
print(len(links_dict))

21357


## Web-scrape article text from those links

In [124]:
links = list(links_dict.keys())

In [134]:
# Open every link in links, scrape the article from it, and append the data to AXIOS_articles.csv

# The data we scrape: title, text (article body), date
# Each output row has the format: index, date, title, text, label (left/lean left/neutral/lean right/right), link


# This cell may also fail.
# Every time it fails, check the number stored in the variable "real_count" and set the variable "start_point" to that number plus one
# E.g.: print(str(real_count)) outputs 355
#       then set: start_point = 356
#       and re-run this cell

current_time = time.time()
start_point = 19995
count = 19909
real_count = start_point - 1
for link in links[start_point:]:
    count+=1
    real_count+=1

    # If a request is denied (status code other than 200), wait 5 seconds and try again.
    # After 5 denied attempts, skip this link.
    deny_count = 0
    while True:
        page = get(link)
        if page.status_code==200:
            break
        else:
            print("Access denied! Reconnect in 5 secds...")
            deny_count+=1
            time.sleep(5)
            if deny_count>4:
                break
    if deny_count>4:
        print("Skip link " + link )
        count-=1
        continue
    
    # Get the html source code of this article
    soup = BeautifulSoup(page.content, 'html.parser')
    content = soup.find('article')
    
    # If there is no content, skip
    if not content:
        print("This link does not have article tag")
        print(link)
        count-=1
        continue
        
    # Get date
    date = content.find('time')
    if date:
        date = date.get('datetime').split("T")[0].replace("-","/")
    else:
        print("This is not a policy article")
        print(link)
        count-=1
        continue
        
    # Get title
    title = content.find('h1').text
    
    # Get text
    content = content.find(class_='gtm-story-text')
    text = ""
    child_count = 0
    for child in content.children:
        child_count+=1
        if child.name!="p" and child.name!='ul' and child.name!='blockquote' and child.name!='cite':
            continue
        if child.name=="ul":
            for li in child.children:
                text+=li.text+"\n"
        else:
            text+=child.text+"\n"
            
    # If the story body has fewer than three child nodes, print the link just for information
    if child_count<3:
        print("This article only has " + str(child_count) + " paragraphes")
        print(link)
        
    # If there is no content, skip
    if text=="":
        print("This link does not contain article")
        print(link)
        count-=1
        continue
        
    # Append the row to AXIOS_articles.csv
    # (newline="" is what the csv module recommends when opening files for a writer)
    with open("AXIOS_articles.csv", "a", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([str(count), date, title, text, "neutral", link])
    if(count%200==0):
        print(count)
        print("Time Spent: " + str(time.time()-current_time))
        current_time = time.time()

This article only has 2 paragraphes
https://www.axios.com/how-to-watch-the-georgia-6th-special-election-1513303129-aa78c1ad-2a2b-4810-b88d-761f9cf8b9c6.html
This article only has 2 paragraphes
https://www.axios.com/spicer-doesnt-know-if-trump-has-seen-the-senate-health-bill-1513303121-f634909d-c620-43d3-a2b8-46217d157f07.html
This article only has 2 paragraphes
https://www.axios.com/trump-welcomes-ukrainian-president-to-white-house-1513303116-49701c9c-9e44-4793-b967-6f2e8dadc895.html
This article only has 2 paragraphes
https://www.axios.com/fbi-director-nominee-deleted-russia-case-from-his-law-bio-1513303111-0ad7ce06-4f57-47b5-9786-d1ea1e3a000e.html
This article only has 1 paragraphes
https://www.axios.com/tim-cook-to-trump-put-more-heart-in-immigration-debate-1513303104-f5799556-4f78-4c80-aca3-d7b48864a917.html
This link does not have article tag
https://www.axios.com/donald-trumps-2016-financial-disclosures-document-1513303063-0b51a506-6883-4406-a9bc-378846365e30.html
This article on

This article only has 2 paragraphes
https://www.axios.com/trump-advisor-us-has-global-cyberattack-under-control-1513302302-89f97779-9530-401e-89e1-34cc3c9ec534.html
This article only has 2 paragraphes
https://www.axios.com/comey-willing-to-testify-but-only-in-public-report-1513302277-92712fb9-35e9-42f6-8dcc-1af6014bf192.html
This article only has 2 paragraphes
https://www.axios.com/fbi-director-short-list-emerges-1513302275-090beb06-1f51-4093-a774-6c80233dd4cd.html
This article only has 2 paragraphes
https://www.axios.com/deputy-ag-rosenstein-to-brief-full-senate-next-week-1513302265-ccb3368c-5408-4c9d-9439-3e052400deee.html
This article only has 2 paragraphes
https://www.axios.com/white-house-wont-deny-trump-taping-comey-talks-1513302259-97d99474-fd4d-480c-86a6-1b14e5cf2a83.html
This article only has 2 paragraphes
https://www.axios.com/the-not-so-viral-comey-firing-1513302232-1eaf6fba-20c8-4152-895a-bae8894066b5.html
This article only has 2 paragraphes
https://www.axios.com/trump-i-as

Access denied! Reconnect in 5 secds...
Access denied! Reconnect in 5 secds...
Access denied! Reconnect in 5 secds...
Access denied! Reconnect in 5 secds...
Skip link https://www.axios.com/local-state-tax-deductions-on-gary-cohns-list-1513301695-885bece3-4aa0-4484-8061-cfe0ba4d0e64.html
This article only has 2 paragraphes
https://www.axios.com/trump-signs-buy-american-hire-american-order-1513301688-2b26b94e-2c85-40d0-a689-49264a46aa7d.html
This article only has 2 paragraphes
https://www.axios.com/mark-cuban-sees-a-democratic-invasion-of-the-white-house-1513301688-8cdf239f-3302-427a-99e8-6dd3ffedaf12.html
This article only has 2 paragraphes
https://www.axios.com/trumps-new-executive-order-buy-and-hire-american-1513301687-570e3de6-ef0a-4cda-aa86-b698ab165e14.html
This article only has 2 paragraphes
https://www.axios.com/sen-warren-reveals-her-plans-for-political-office-1513301673-ebcf895a-8793-4994-99fa-edab3d1e88b6.html
This article only has 2 paragraphes
https://www.axios.com/all-you-ne

This article only has 2 paragraphes
https://www.axios.com/trump-losing-support-of-republicans-white-voters-men-1513301111-27be4efe-6461-4d1d-9e25-f19dab918c62.html
This article only has 2 paragraphes
https://www.axios.com/the-highlights-from-spicers-wednesday-briefing-1513301107-77cc68b8-7dec-481f-bff6-b10e089f0f3e.html
This article only has 2 paragraphes
https://www.axios.com/manafort-memo-can-greatly-benefit-the-putin-government-1513301097-463027a8-87d4-4703-ad94-f481cddbef72.html
This article only has 2 paragraphes
https://www.axios.com/the-takeaways-from-spicers-tuesday-briefing-1513301082-87dcccfe-8501-46bd-a3fc-ef6cc262faf8.html
This article only has 2 paragraphes
https://www.axios.com/uk-follows-us-bans-electronics-on-some-foreign-flights-1513301079-7971a954-c91a-413f-860d-2209529d6613.html
This article only has 2 paragraphes
https://www.axios.com/the-real-gorsuch-fireworks-are-coming-tomorrow-1513301053-63cf4784-d19a-4fd4-a503-3ca1fe20bc71.html
This article only has 2 paragraph

This article only has 2 paragraphes
https://www.axios.com/trumps-leaks-crackdown-unleashed-a-gusher-1513300736-4c257bcf-ef6e-4a69-a3bb-9b87f890df64.html
This article only has 2 paragraphes
https://www.axios.com/george-w-bush-pokes-fun-at-himself-for-snl-mockery-1513300734-ee951458-b0db-4010-b5d9-7b7ec9195778.html
This article only has 2 paragraphes
https://www.axios.com/just-in-sessions-calls-press-conference-1513388131-07db233a-0d0c-4358-9851-38035ab33e1c.html
This article only has 2 paragraphes
https://www.axios.com/sessions-it-is-false-1513388131-8c6babd8-2c99-4aa3-abf0-104665efa2ec.html
This article only has 2 paragraphes
https://www.axios.com/the-12-things-that-mattered-in-trumps-speech-1513300674-cf587d36-c57d-432b-96a2-b8af977124a5.html
This article only has 2 paragraphes
https://www.axios.com/comparing-trumps-speech-to-his-predecessors-in-one-chart-1513300678-3b70b71d-8486-460e-9426-7e8ba0701859.html
This article only has 2 paragraphes
https://www.axios.com/white-house-floats-a

This article only has 2 paragraphes
https://www.axios.com/trump-fan-peter-thiel-isnt-running-for-governor-spokesman-says-1513300327-361c437f-731d-42d7-a162-afb3733b686f.html
This article only has 2 paragraphes
https://www.axios.com/trump-fans-used-twitter-bots-to-create-a-giant-megaphone-1513300312-0e34a1e2-4f57-4efa-9d8d-7e9c35cfdd5a.html
This article only has 2 paragraphes
https://www.axios.com/the-state-of-trumpland-1513300303-171cc724-20f2-4997-b440-f92ef8fdb6fa.html
This article only has 2 paragraphes
https://www.axios.com/trump-on-putin-you-think-our-countrys-so-innocent-1513300299-100cf36b-f63d-485b-adbf-d477cab82728.html
This article only has 2 paragraphes
https://www.axios.com/trump-ridiculous-halt-on-travel-ban-will-be-overturned-1513300296-52f3eb36-cbd0-41dc-8e71-2fb5d98a08d9.html
This article only has 2 paragraphes
https://www.axios.com/prosecutor-paris-attacker-is-a-egyptian-born-uae-resident-1513300291-c3185269-3855-45ec-b00c-84fa4cd9522e.html
This article only has 2 para

This article only has 2 paragraphes
https://www.axios.com/trumps-3-executive-orders-this-morning-1513300055-cde4e75a-6821-417b-9fcb-6581f3ccda8f.html
This article only has 1 paragraphes
https://www.axios.com/carl-icahn-trump-saved-us-from-socialism-1513300054-3b8abece-eacf-4acb-9a68-385a8703a5f4.html
This article only has 2 paragraphes
https://www.axios.com/philip-roth-on-trump-1513300050-ad3988cd-67d8-44aa-bd35-3f13b609a68c.html
This article only has 2 paragraphes
https://www.axios.com/shinzo-abe-i-trust-trump-1513300046-8cb747f8-fc9b-493b-aa45-af57c8f2fe6e.html
This article only has 2 paragraphes
https://www.axios.com/trumps-bleak-view-1513300028-fd2dc601-e306-4d4f-87c8-79bc555a9b45.html
This article only has 2 paragraphes
https://www.axios.com/the-world-trump-inherits-1513300019-4475989a-93c8-4b82-9944-fb5496fa4c48.html
This article only has 2 paragraphes
https://www.axios.com/introducing-trumps-confirmed-cabinet-mad-dog-and-kelly-1513300024-16ffec4f-1bef-4a7a-87cb-49c96e1b0ed6.html
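
The "Access denied" retry logic in the scraping cell can be pulled into a small function. The following is a sketch rather than the notebook's actual code: `fetch_with_retry`, its parameters, and the `FakeResponse` stand-in are hypothetical names introduced so the example runs without network access.

```python
import time

def fetch_with_retry(fetch, link, max_attempts=5, wait_seconds=5, sleep=time.sleep):
    """Call fetch(link) until it returns status 200 or the attempts run out.

    Returns the response on success, or None when every attempt was denied.
    """
    for attempt in range(1, max_attempts + 1):
        page = fetch(link)
        if page.status_code == 200:
            return page
        print(f"Attempt {attempt} denied for {link}")
        if attempt < max_attempts:
            sleep(wait_seconds)
    return None

class FakeResponse:
    """Stand-in for a requests response; only status_code is needed here."""
    def __init__(self, status_code):
        self.status_code = status_code

# Two denied attempts followed by a success, with sleeping disabled for the demo.
responses = iter([FakeResponse(403), FakeResponse(403), FakeResponse(200)])
page = fetch_with_retry(
    lambda link: next(responses),
    "https://www.axios.com/example.html",
    sleep=lambda seconds: None,
)
print(page.status_code)
```

In the notebook itself the call would presumably look like `page = fetch_with_retry(get, link)`, with `get` imported from `requests`, followed by a skip when the result is `None`.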

In [128]:
real_count

19996
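
Instead of reading `real_count` and editing `start_point` by hand after each failure, the resume position could be derived from the CSV itself. The sketch below assumes the row layout described in the scraping cell (link in the last column); `scraped_links`, the temporary file, and the example URLs are hypothetical.

```python
import csv
import os
import tempfile

def scraped_links(csv_path):
    """Return the set of links already stored in the last column of the CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, "r", newline="") as csvfile:
        return {row[-1] for row in csv.reader(csvfile) if row}

# Demonstration with a temporary file and made-up links.
all_links = [
    "https://www.axios.com/a.html",
    "https://www.axios.com/b.html",
    "https://www.axios.com/c.html",
]
demo_path = os.path.join(tempfile.mkdtemp(), "demo_articles.csv")
with open(demo_path, "w", newline="") as csvfile:
    csv.writer(csvfile).writerow(
        ["1", "2021/10/06", "Title", "Line one\nLine two", "neutral", all_links[0]]
    )

done = scraped_links(demo_path)
remaining = [link for link in all_links if link not in done]
print(remaining)
```

One caveat of this approach: links that were skipped (no article tag, no date, empty text) never reach the CSV, so they would be requested again on every restart.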

In [135]:
count

21260

### Print some random links

In [133]:
links[real_count-2]

'https://www.axios.com/house-dems-pressure-white-house-over-flynn-kushner-clearances-1513303156-1f730f8c-4adc-48f1-afb5-fb3101084df4.html'

In [109]:
links[count-1]

'https://www.axios.com/hard-truths-deep-dive-environment-urban-heat-islands-8d13e15a-0d18-4bf5-a1ff-b4208c9d8374.html'

### Experimental code

In [111]:
for i in range(1):
    count+=1
    #     link = "https://www.axios.com/sarah-sanders-trump-cohen-manafort-daniels-payment-d85ebe25-8670-4102-874e-e9d3f7524021.html"
    #     print(link)
    link = "https://www.axios.com/biden-cabinet-infrastructure-tour-52b2067a-f152-463e-9cd4-c36050f14c7c.html"
    deny_count = 0
    while True:
        page = get(link)
        if page.status_code==200:
            break
        else:
            print("Access denied! Reconnect in 5 secds...")
            deny_count+=1
            time.sleep(5)
            if deny_count>4:
                break
    if deny_count>4:
        print("Skip link " + link )
        count-=1
        continue
    soup = BeautifulSoup(page.content, 'html.parser')
    content = soup.find('article')
    # Use the name `date` instead of `time` so the imported time module is not shadowed
    date = content.find('time')
    if date:
        date = date.get('datetime').split("T")[0].replace("-","/")
    else:
        date = content.find('span', class_="time-rubric").text.split(" - ")[0]
    title = content.find('h1').text
    if not title:
        title = soup.title
        print("cannot find title")
    content = content.find_all(class_='gtm-story-text')
    content = content[-1]

    text = ""
    child_count = 0
    for child in content.children:
        child_count+=1
        if child.name!="p" and child.name!='ul' and child.name!='blockquote' and child.name!='cite':
            continue
        if child.name=="ul":
            for li in child.children:
                text+=li.text+"\n"
        else:
            text+=child.text+"\n"
    if text=="":
        print("This link does not contain article")
        print(link)
        count-=1
        continue
    with open("AXIOS_articles.csv", "a", newline="") as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow([str(count), date, title, text, "neutral", link])
    if(count%200==0):
        print(count)
    print(title)
    print(text)
    print(date)
    print(child_count)

Biden's road warriors
President Biden's Cabinet and senior staff are fanning out to make his case that human infrastructure — as well as hard infrastructure — are needed to grow the economy for the middle class: 
Vice President Harris travels to New Jersey on Friday. 
Energy Secretary Jennifer Granholm talks to Marie Claire and will hold an Instagram Live conversation with young Latino leaders.
Education Secretary Miguel Cardona travels to the Rio Grande Valley.
HUD Secretary Marcia Fudge tours a revitalized community in Michigan. 
Transportation Secretary Pete Buttigieg makes virtual remarks in Chicago. 

2021/10/06
2
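
The date handling above turns the `datetime` attribute of the `<time>` tag into `YYYY/MM/DD` by string splitting. A slightly stricter variant, sketched here, also validates the result, which would flag values in another format (such as the `time-rubric` fallback text). The function name and the sample timestamp are hypothetical; only the date `2021/10/06` comes from the output above.

```python
import datetime

def normalize_date(datetime_attr):
    """Convert an ISO 8601 timestamp string to YYYY/MM/DD."""
    date_part = datetime_attr.split("T")[0]
    # strptime raises ValueError if the date part is not in YYYY-MM-DD form
    parsed = datetime.datetime.strptime(date_part, "%Y-%m-%d")
    return parsed.strftime("%Y/%m/%d")

# Made-up timestamp with the same date as the article printed above.
print(normalize_date("2021-10-06T21:30:00Z"))
```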


In [120]:
with open("AXIOS_articles.csv", "r", newline="") as csvfile:
    reader = csv.reader(csvfile)
    item_count = 0
    for item in reader:
        item_count+=1
    print(item_count)

15435
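
Note that this count is the number of CSV records, not the number of lines in the file: the `text` field contains embedded newlines, so one article usually spans several physical lines. The sketch below shows the difference on a small temporary file; the rows and the file name are made up for the example.

```python
import csv
import os
import tempfile

demo_path = os.path.join(tempfile.mkdtemp(), "demo_articles.csv")
rows = [
    ["1", "2021/10/06", "First title", "Paragraph one\nParagraph two\n",
     "neutral", "https://www.axios.com/a.html"],
    ["2", "2021/10/05", "Second title", "Single paragraph\n",
     "neutral", "https://www.axios.com/b.html"],
]
with open(demo_path, "w", newline="") as csvfile:
    csv.writer(csvfile).writerows(rows)

# csv.reader yields one item per record, even when a quoted field spans lines
with open(demo_path, "r", newline="") as csvfile:
    record_count = sum(1 for _ in csv.reader(csvfile))

# Iterating over the plain file counts physical lines instead
with open(demo_path, "r") as textfile:
    line_count = sum(1 for _ in textfile)

print(record_count, line_count)
```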
