# Selenium:

## Learning Goals:

- Able to install and set up Selenium
- Able to log in to a website
- Able to navigate through pages/pop-ups
- Able to write a scraper that is more "human-like"
- Able to choose the appropriate `find_element(s)_by...` method
- Able to acquire the desired data

---

## First, head over to [this page](https://chromedriver.chromium.org/downloads) and locate the chromedriver that matches your Chrome version.

**How to Find Your Internet Browser Version Number - Google Chrome.**

1) Click on the Menu icon in the upper right corner of the screen. 

2) Click on Help, and then About Google Chrome. 

3) Your Chrome browser version number can be found here.

## Next, download the appropriate driver that matches your version of Chrome

- After you have downloaded the driver, press `command` + `spacebar`
- In the Spotlight search you just opened, type `/usr/local/bin/` and open that folder
- Next, in a separate Finder window (`command` + `n`), navigate to where you downloaded the `chromedriver`
- Finally, move the `chromedriver` from wherever you downloaded it into `/usr/local/bin/`

*Technically, you can install the driver anywhere, but most tutorials I have read say to put it in `/usr/local/bin/`*

...However, after a bit of research, I believe the reason we want to install the `chromedriver` inside of `/usr/local/bin/` is so that you don't have to explicitly state the chromedriver path when you instantiate your driver 😎 

https://www.kenst.com/2015/03/including-the-chromedriver-location-in-macos-system-path/
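
To confirm that the move worked, one option is to ask Python whether it can find the driver on your `PATH`. This is a minimal sketch using only the standard library; the helper name `locate_driver` is made up for illustration.

```python
import shutil


def locate_driver(name="chromedriver"):
    """Return the full path of `name` if it is on the PATH, else None."""
    return shutil.which(name)


path = locate_driver()
if path is None:
    print("chromedriver is not on your PATH yet")
else:
    print("chromedriver found at:", path)
```

If a path is printed, `webdriver.Chrome()` should be able to start without an explicit driver location.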

## Install Selenium if you have not already done so:

In [1]:
# !pip install selenium

# Please complete the above steps before lecture

In [6]:
import re
import os
import time
import random
import requests
import numpy as np
import pandas as pd
from os import system   
from math import floor
from copy import deepcopy
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [7]:
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 200)

---

**Next, I always like to label my driver with a bold title cell**


- I find it helps when we need to re-instantiate our driver and for general organization
- Also, when we copy and paste this code from a notebook to a .py file, we would usually only need one driver

## DRIVER HERE:

In [4]:
# driver = webdriver.Chrome()

import time
from selenium import webdriver

driver = webdriver.Chrome('/Users/meaganrossi/Projects/Scraping_Tools/chromedriver')  # Optional argument, if not specified will search path.
driver.get('http://www.google.com/')
time.sleep(5) # Let the user actually see something!
search_box = driver.find_element_by_name('q')
search_box.send_keys('ChromeDriver')
search_box.submit()
time.sleep(5) # Let the user actually see something!
driver.quit()

---

### Note: Headless Browsers

**Headless Browser**
A headless browser is a web browser without a graphical user interface (GUI). It is controlled programmatically, which makes it well suited to automation, testing, and similar tasks.

**Why use a headless browser?**
Headless browsers have both advantages and disadvantages. They are not much help for browsing the web yourself, but they are excellent for automating tasks and tests.

**Advantages of Headless Browsers**

Some of the advantages are as follows:

- Headless browsers are typically faster than real browsers.
    - Because no browser GUI is started, much of the time a regular browser spends drawing the page on screen is avoided.
- Performance-wise, you can typically see a 2x to 15x speedup when using a headless browser.

*More info on headless browsers here:* https://stackoverflow.com/questions/53083952/difference-of-headless-browsers-for-automation
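
Here is a rough sketch of how headless mode is usually switched on for Chrome. `Options`, `add_argument`, and the `--headless` flag come from Selenium and Chrome. The helper name `make_headless_driver` and the window size are illustrative choices, and the Selenium import is deferred into the function so the cell can be defined even before Selenium is installed.

```python
# Command-line switches passed to Chrome; the window size is an arbitrary example.
HEADLESS_ARGS = ["--headless", "--window-size=1920,1080"]


def make_headless_driver(extra_args=()):
    """Start Chrome without a visible window (needs Selenium and chromedriver)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in list(HEADLESS_ARGS) + list(extra_args):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `driver = make_headless_driver()` would then give you a driver that behaves like the one in this notebook, just without the browser window popping up.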

## Time to scrape!

<img src = "https://media1.tenor.com/images/3fd84ba4b54f8d299f7732e63cdb3c00/tenor.gif?itemid=11903546" />

### Visiting a webpage

In [None]:
# Visit the website of your choice:

driver.get('https://www.espn.com')

#### Methods for finding a single element 

    This will return the FIRST instance of your desired "element"

* find_element_by_id
* find_element_by_name
* find_element_by_xpath  
* find_element_by_link_text
* find_element_by_partial_link_text
* find_element_by_tag_name
* find_element_by_class_name
* find_element_by_css_selector

---

#### Methods for finding multiple elements

    This will return a list of ALL instances of your desired "element"

* find_elements_by_name
* find_elements_by_xpath
* find_elements_by_link_text
* find_elements_by_partial_link_text
* find_elements_by_tag_name
* find_elements_by_class_name
* find_elements_by_css_selector

From the [Selenium Python Docs](https://selenium-python.readthedocs.io/locating-elements.html "Selenium Docs") 
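
One caveat: the `find_element_by_*` helpers listed above belong to the older Selenium 3 style. Newer Selenium 4 releases dropped them in favor of `driver.find_element(By.<STRATEGY>, value)` and `driver.find_elements(...)`. The sketch below maps the old method suffixes to their locator strategy strings (these are the values behind `By.ID`, `By.CSS_SELECTOR`, and so on). The helper `find_legacy` is hypothetical, written only to show the translation.

```python
# Locator strategy strings; these are the values behind By.ID, By.NAME, etc.
LEGACY_TO_STRATEGY = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def find_legacy(driver, legacy_name, value):
    """Translate e.g. 'find_elements_by_css_selector' into the By-style call."""
    many = legacy_name.startswith("find_elements_by_")
    prefix = "find_elements_by_" if many else "find_element_by_"
    if not legacy_name.startswith(prefix):
        raise ValueError("not a find_element(s)_by_* name: " + legacy_name)
    strategy = LEGACY_TO_STRATEGY[legacy_name[len(prefix):]]
    finder = driver.find_elements if many else driver.find_element
    return finder(strategy, value)
```

For example, `find_legacy(driver, 'find_element_by_css_selector', 'h1')` ends up calling `driver.find_element('css selector', 'h1')`, which is what `driver.find_element(By.CSS_SELECTOR, 'h1')` does.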

### Selecting the FIRST instance of an "element"

First, we'll check out `.find_element_by_css_selector()`

In [None]:
driver.find_element_by_css_selector('h1')

In [None]:
driver.find_element_by_tag_name('h1').text

### Selecting ALL instances of your desired "element"

In [3]:
listy = driver.find_elements_by_css_selector('h1')


In [None]:
for x in listy[:15]:
    if len(x.text) > 0:
        print(x.text)

### Selecting a specific element (by class name)

Using `.find_element_by_class_name()` to locate an element:

In [None]:
driver.find_element_by_class_name('contentItem__title--hero').text

---

#### Closing the driver:

If you just close the driver's browser window, the Google Chrome instance will still appear open in your Mac's Dock. Using `driver.quit()`, we can close the Google Chrome instance itself, which also closes the driver's browser window:

In [None]:
driver.quit()

### Logging into websites

We'll use `.find_element_by_id()` for this example:

In [None]:
from private import *

In [None]:
my_url = 'https://www.facebook.com'

In [None]:
driver = webdriver.Chrome()
driver.get(my_url)

In [None]:
username = driver.find_element_by_id("email")
password = driver.find_element_by_id("pass")
submit   = driver.find_element_by_id("loginbutton")
  
username.send_keys(FB_USERNAME)
password.send_keys(PASSWORD)

In [None]:
submit.click()

#### Timing

Sometimes we will need to wait for the page to load. Other times, we may want to have our scraper act more like a human, in terms of "click rate."

Two possible ways to make this happen are by using `time.sleep()` or `WebDriverWait()`

If we just want to mimic the behavior of a human, we can use `time.sleep()`:

In [None]:
# Using a single "wait" time:

time.sleep(2)

In [None]:
# Using a randomized time:

sequence = [x/10 for x in range(8, 14)]
print(sequence)

time.sleep(random.choice(sequence))
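
A possible variation on the cell above: instead of choosing from a fixed list, draw the delay from a continuous range with `random.uniform`. The function name `human_pause` and the `sleep` parameter (handy for testing without actually waiting) are assumptions for this sketch.

```python
import random
import time


def human_pause(low=0.8, high=1.3, sleep=time.sleep):
    """Sleep for a random duration between `low` and `high` seconds."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


waited = human_pause(0.01, 0.02)  # tiny bounds so the demo finishes quickly
print(f"Paused for {waited:.3f} seconds")
```

Dropping `human_pause()` between clicks or page loads keeps the "click rate" from looking perfectly regular.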

If we explicitly want to wait for our page to load, we can use `WebDriverWait()`:

In [None]:
wait = WebDriverWait(driver, 5)

try:
    page_loaded = wait.until(lambda driver: driver.current_url == my_url)
    print('The page loaded correctly')
except TimeoutException:
    print("Loading timeout expired")
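
Conceptually, `WebDriverWait(...).until(...)` keeps calling a condition until it returns something truthy or the timeout runs out. The stand-alone sketch below imitates that polling loop with the standard library only. It is not Selenium's implementation, and `wait_until` is a made-up name.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call `condition()` repeatedly until it is truthy or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Demo: the condition becomes true on the third check.
attempts = []


def ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(ready, timeout=2, poll=0.01)
print("Condition met after", len(attempts), "checks")
```

With Selenium itself, the condition is typically a lambda that receives the driver (as in the cell above) or one of the ready-made `expected_conditions` helpers such as `EC.presence_of_element_located`.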

In [None]:
driver.current_url

In [None]:
driver.quit()

### Ohhhh nooooooo, I can't remember how I named my variables...

And I don't want to open the file elsewhere to check, because that seems inefficient...

We can do something like this:

In [None]:
print(list(locals().keys()))

The previous output is a bit messy... 

If we are writing a .py file specifically to store "private" variables, I recommend using an all caps syntax. The two reasons I like this are:

    1) This mimics the syntax of ENVIRONMENT_VARIABLES

    2) If we name our private.py file variables with all caps, we can see all our private variable names like this:

In [None]:
for key in list(globals().keys()):
    # str.isupper() is True only when every letter in the name is upper case
    if key.isupper():
        print(key)

**NOTE:** The variable "`EC`" is present in the list above because of how we imported the `expected_conditions` module up at the top
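
The same idea can be wrapped in a small reusable function. This is only a sketch: `upper_case_names` is a hypothetical helper, and the dictionary below is a stand-in for what `globals()` might contain after `from private import *` (the values are placeholders, not real credentials).

```python
def upper_case_names(namespace):
    """Return the sorted names in a namespace dict that are written in ALL CAPS."""
    return sorted(name for name in namespace if name.isupper())


# Stand-in for globals() after importing the private variables.
example_namespace = {
    "FB_USERNAME": "me@example.com",
    "PASSWORD": "not-a-real-password",
    "EC": object(),
    "driver": None,
    "my_url": "https://www.facebook.com",
}
print(upper_case_names(example_namespace))
```

In the notebook you would call `upper_case_names(globals())` instead of passing the example dictionary.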

In [None]:
driver.quit()

In [None]:
driver = webdriver.Chrome()
driver.get('https://www.instagram.com/')

time.sleep(3)

In [None]:
#### The old IG landing page used this, but they have recently updated their landing page ####
#### Just leaving for notes ####

# Find the login click button
# ig_login_button = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/div[2]/div[2]/p/a')

# Click the button
# ig_login_button.click()

# time.sleep(3)

#### Wait a second... what is that `xpath` thing?

XPath (XML Path Language) is a query syntax for locating elements in an XML or HTML document by describing a path through its structure. In Selenium, we can use it to find any element on a web page based on the HTML DOM. The basic format of an XPath expression is shown in the screenshot below.

<img src='https://www.guru99.com/images/3-2016/032816_0758_XPathinSele1.png' >

An XPath expression describes the path to an element on the web page. The standard syntax for creating an XPath is:

`Xpath=//tagname[@attribute='value']`

- // == Select matching nodes anywhere in the document, at any depth.
- Tagname == Tag name of the particular node.
- @ == Select attribute.
- Attribute == Attribute name of the node.
- Value == Value of the attribute.
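
To try the `//tagname[@attribute='value']` pattern without opening a browser, here is a small sketch using Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. The HTML snippet is a made-up stand-in for a login form, and ElementTree requires the path to start with `.` (relative to the root), so the expression is written `.//input[...]` rather than `//input[...]`.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a login form (not a real page).
page = ET.fromstring(
    "<html><body><form>"
    "<input name='username' type='text' />"
    "<input name='password' type='password' />"
    "<button type='submit'>Log In</button>"
    "</form></body></html>"
)

# .//tagname[@attribute='value'] : search at every depth below the root
password_box = page.find(".//input[@name='password']")
print(password_box.tag, password_box.attrib)

all_inputs = page.findall(".//input")
print(len(all_inputs), "input elements found")
```

In Selenium, the corresponding expression would be `//input[@name='password']`, passed to `find_element_by_xpath`.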

<img src='https://media1.giphy.com/media/XBpEStoQ5rftPFA8rh/giphy.gif?cid=790b7611dbcd651cd785fb8382888f7b41666d5c8695755b&rid=giphy.gif'>

**We can perform the next operations a few different ways:**

Similar to above, we could use the `xpath`

Or... after inspecting the page's HTML/CSS elements, we can see that the CSS selector `input` matches the form fields, and we can assume that the only two inputs are Username and Password

---

With that knowledge, we can define both variables in one line of code


In [None]:
ig_username, ig_password = driver.find_elements_by_css_selector('input')
# driver.find_elements_by_css_selector('input')

# ig_username = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[2]/div/label/input')
# ig_password = driver.find_element_by_xpath('//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[3]/div/label/input')

In [None]:
ig_username.send_keys(INSTA_USERNAME)
ig_password.send_keys(PASSWORD)

In [None]:
# Here is the complete xpath to the login element:

login_button_xpath = '//*[@id="react-root"]/section/main/article/div[2]/div[1]/div/form/div[4]/button'

ig_submit = driver.find_element_by_xpath(login_button_xpath)

In [None]:
ig_submit.click()

In [None]:
## Sometimes, depending on the HTML layout, we might want to truncate the xpath:

# ig_submit = driver.find_element_by_xpath('//div[4]/button/div')

# Modal buttons and scrolling:

In [None]:
# Whoa! What's that modal?
try:
    modal_button = driver.find_element_by_class_name("HoLwm")
    modal_button.click()
except Exception:
    # No modal appeared (or it could not be clicked), so carry on
    pass

In [None]:
# These websites have modal popups:

driver.get('https://www.nike.com')

# Other options:
# https://www.carbon38.com
# https://www.meundies.com

The following cell is an example of how you can write functions to scroll down the page (for dynamic loading) and for loading more content with "clicks"

In [None]:
# Example: Scroll down (with a test for a modal)

def scroll_down():
    for i in range(1, 10):
        try:
            modal_button = driver.find_element_by_class_name("button2")
            # modal_button.click() also works
            webdriver.ActionChains(driver).move_to_element(modal_button).click(modal_button).perform()
        except Exception:
            # No modal found on this pass; pause briefly and keep scrolling
            time.sleep(.5)

        # Scroll to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)


# Example: Load more content
# Code snippet for context purposes only. We will not run this function:

def get_more():
    for i in range(1, 5):
        try:
            next_b = driver.find_element_by_xpath("//*[contains(text(), 'Load next Politics story')]")
            webdriver.ActionChains(driver).move_to_element(next_b).click(next_b).perform()
            time.sleep(.5)
        except Exception:
            print("Page #" + str(i) + " has failed to load")
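
A possible refinement of the scrolling idea: rather than scrolling a fixed number of times, keep scrolling until the page height stops growing. This is a sketch under assumptions: `scroll_until_stable` is a made-up name, the driver is passed in as an argument, and the `sleep` parameter exists only so the function can be tested without waiting. The two JavaScript snippets are the same kind of `execute_script` calls used above.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=10, sleep=time.sleep):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap guards against pages that load new content forever.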

In [None]:
# Run this cell and watch the page scrollllllll

scroll_down()

In [None]:
driver.quit()

## When to use BeautifulSoup vs. Selenium?

<img src='https://media.giphy.com/media/xTiN0IuPQxRqzxodZm/giphy.gif' width = 400>

<img src='https://media2.giphy.com/media/3o7TKAdOad9Y3eSMZG/giphy.gif?cid=790b761168b43f2be748800602251dce3cad91fcb4c972f9&rid=giphy.gif' width = 400>

<img src = "https://media1.giphy.com/media/8VLgtJqaxIlhu/giphy.gif?cid=790b7611df175494e219b99894f7e717b3ea7bfbf806f9c4&rid=giphy.gif" />

**Just kidding!**

Everything depends on the website and your data goals.

In general:
- If the data only appears after interacting with the page (clicking, scrolling, logging in), then go for Selenium.
- Selenium is also the better fit for more complex, JavaScript-heavy pages.
- If the data is accessible in the HTML structure (more static pages), soup is a more lightweight tool.
- Soup gives you more control over navigating the HTML tree.
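
One quick way to decide is to check whether the data is already present in the raw HTML. The sketch below counts tags with the standard library's `HTMLParser`; the two HTML strings are invented examples, and in practice you would feed in the text of a `requests.get(...)` response instead. Treat it as a heuristic, not a definitive test.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    """Return the number of `<tag>` openings found in `html`."""
    counter = TagCounter()
    counter.feed(html)
    return counter.counts.get(tag, 0)


static_page = "<table><tr><td>Arsenal</td></tr><tr><td>Chelsea</td></tr></table>"
js_page = "<div id='app'></div><script>renderTable()</script>"

print(count_tag(static_page, "tr"))  # rows are already in the HTML
print(count_tag(js_page, "tr"))      # rows would only exist after JavaScript runs
```

If the rows you can see in the browser are missing from the raw HTML, the page is probably building them with JavaScript, and Selenium is the safer choice.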

In [None]:
html = requests.get('https://www.skysports.com/premier-league-table')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

# table = bs.find(lambda tag: tag.name=='table' ) 
# rows = table.findAll(lambda tag: tag.name=='tr')

In [None]:
table_rows = table.find_all('tr')

for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)

In [None]:
html = requests.get('https://www.skysports.com/premier-league-table')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

#If you know there is more than one table, you can edit the code to include the proper index:
# table = bs.find_all('table')[0] 

df = pd.read_html(str(table), index_col='Team')
df = df[0].dropna(axis=0, thresh=4)
df

#### Adjusting the header and index:

- Caveat: this uses pandas, not Selenium or Soup

If there is more than one table, pandas reads the html as a list of tables:

In [None]:
df2 = pd.read_html('https://www.sportsmole.co.uk/football/premier-league/2018-19/')

df2

In [None]:
# Let's check out one of our tables:

df2[0]

As we can see above, the table's formatting is slightly off...

So we can make adjustments like so:

In [None]:
df2 = pd.read_html('https://www.sportsmole.co.uk/football/premier-league/2018-19/',header=0, index_col=1)

df2[0].columns =  ['final_standings', 'P', 'W', 'D', 'L', 'F', 'A', 'GD', 'PTS']

df2[0]

---

### An example where formatting is an issue:

In [None]:
html = requests.get('http://www.nfl.com/stats/team')
nfl_soup = BeautifulSoup(html.content, 'lxml')
table = nfl_soup.table

In [None]:
print(table.prettify())

In [None]:
nfl = pd.read_html('http://www.nfl.com/stats/team')

nfl

In [None]:
# PRO-TIP: if you want to instantiate a new df variable from a previous df or list of dfs, 
# making a copy of the df will save you from a headache

offense = deepcopy(nfl[0])
offense

In [None]:
offense['Total Offense (YPG)'] = offense['Total Offense (YPG).1']
offense.drop(columns=['Total Offense (YPG).1'], inplace=True)
offense.columns = ['TEAM', 'Total_Offense_YPG']
offense.TEAM = offense.TEAM.str[3:]

In [None]:
offense

In [None]:
## Need to refactor this code because the website changed: 

def clean_data(data_list):
    pass
#     cleaned = []
#     data_copy = deepcopy(data_list)
#     for data in data_copy:
#         col_1, col_3 = data.columns[0], data.columns[-1]
#         cell1 = data.iloc[0,1]
#         cell2 = data.iloc[0,2]
#         data.iloc[0,0] = cell1
#         data.iloc[0,1] = cell2
#         data.drop([col_3],axis=1,inplace=True)
#         data.columns = ['team', col_1]
#         data['team'] = data['team'].apply(lambda x: x.split('.\xa0')[1] if '.\xa0' in x else x.split('. ')[1])
#         data.total_offense_ypg = data[col_1].astype(float).astype(int)
#         cleaned.append(data)
#     return cleaned

Don't run these cells until `clean_data()` has been refactored:

In [None]:
tables_list = deepcopy(nfl[:6])

In [None]:
tables_list[2]

In [None]:
clean_tables = clean_data(tables_list)

In [None]:
clean_tables[0]

In [None]:
clean_tables[3]

### The best example of when Selenium is supreme:

When the page's content is rendered by JavaScript

In [None]:
html = requests.get('http://www.tennisabstract.com/cgi-bin/player.cgi?p=RogerFederer')
bs = BeautifulSoup(html.content, 'lxml')
table = bs.table

# table = bs.find(lambda tag: tag.name=='table' ) 
# rows = table.findAll(lambda tag: tag.name=='tr')

In [None]:
# bs

In [None]:
# table

In [1]:
url = "http://www.tennisabstract.com/cgi-bin/player.cgi?p=RogerFederer"
driver = webdriver.Chrome('/Users/meaganrossi/Projects/Incarceration_COVID/chromedriver')


In [None]:
driver.get(url)

In [None]:
table = driver.find_element_by_id("recent-results")

In [None]:
body = table.find_element_by_css_selector('tbody')

In [None]:
# Table rows use the HTML tag 'tr'
rows = body.find_elements_by_css_selector('tr')

In [None]:
len(rows)

In [None]:
rows[0].get_attribute('innerHTML')

In [None]:
row_data = rows[0].find_elements_by_css_selector('td')

In [None]:
for e in row_data: 
    print(e.text)

In [None]:
data_list = []
for r in rows: 
    row_list = []
    row_data = r.find_elements_by_css_selector('td')
    for d in row_data: 
        row_list.append(d.text)
    data_list.append(row_list)

In [None]:
data_list[10]

In [None]:
len(data_list[10])

In [None]:
headers = table.find_element_by_css_selector('thead')
headers.text

In [None]:
columns = headers.text.split(' ')
print(columns)

In [None]:
print('Number of columns:     '+ str(len(columns)))
print()
print('Number of data points: '+ str(len(data_list[0])))
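
When the number of columns and the number of data points disagree, it can help to separate the rows that fit from the rows that do not before building a DataFrame. A minimal sketch, with a hypothetical helper `rows_to_records` and made-up sample values:

```python
def rows_to_records(columns, rows):
    """Pair each row with the column names; set aside rows of the wrong width."""
    records, skipped = [], []
    for row in rows:
        if len(row) == len(columns):
            records.append(dict(zip(columns, row)))
        else:
            skipped.append(row)
    return records, skipped


# Made-up sample values, only to show the shape of the data.
sample_columns = ["Date", "Tournament", "Surface"]
sample_rows = [
    ["01-Jan-2020", "Example Open", "Hard"],
    ["Example Open"],  # e.g. a spacer row with fewer cells
]
records, skipped = rows_to_records(sample_columns, sample_rows)
print(records)
print("Skipped:", skipped)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`.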

In [None]:
columns = ['Date','Tournament','Surface','Rd','Rk','vRk', 
           'Opponent','Score','DR','A%','DF%','1stIn',
           '1st%','2nd%','BPSvd','Time']

In [None]:
print(data_list[0])

In [None]:
federer_h2h = pd.DataFrame(data_list[1:], columns=columns)

In [None]:
federer_h2h.head()

- A slightly different approach:

In [None]:
header = table.find_element_by_css_selector('thead')

header_elements = header.find_elements_by_css_selector('th')

len(header_elements)

In [None]:
headers = []

for x in header_elements: 
    headers.append(x.text)
print(headers)

## Some other neat stuff:

In [None]:
# Let's take a screenshot! 

driver.get('https://www.nytimes.com')

driver.get_screenshot_as_file('ny_times_front_pg.png')

driver.quit()

In [None]:
# The .get_attribute() method is your friend
# Example code (don't run this):

element.get_attribute("attribute name")

attribute_value = WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.ID,
                                                                "id_name_here"))).get_attribute("attribute_name_here")

# Example of a "complete" scraper:

- Including an example of `.get_attribute()`

In [None]:
topic_url_dict = {'15': 'arts', '16':'sports', '24':'sci-tech', '14': 'business',
                  '17': 'international', '13': 'authority'}

topic_codes = list(topic_url_dict.keys())

driver = webdriver.Chrome()

In [None]:
# Before creating a scraper, examine user interface structure:
# Can we scroll through articles? Do we need to get the article links first?

driver.get('http://www.satirewire.com/')

In [None]:
# <-- STEP 1 -->
# Create a function to scrape the links of articles:

def scrape_links_satirewire(topic_codes):
    
    base_url = "http://www.satirewire.com/content1/?cat="
    link_list = []
    
    for code in topic_codes: 
        url = base_url + code 
        topic = topic_url_dict[code]
        driver.get(url)
        time.sleep(1.18)
        last_page = driver.find_element_by_class_name('pages').text
        last_page_value = int(last_page.split(' of ', 1)[1])
        link_objects1 = driver.find_elements_by_class_name('morelink')
        print('Scraping ', len(link_objects1), topic.upper(), ' article links')
        for link in link_objects1:
            if '#' not in link.get_attribute('href'):
                link_list.append((link.get_attribute('href'), topic))
            else:
                pass       
        for x in range(2, (last_page_value + 1)):
            driver.get(url + '&paged=' + str(x))
            time.sleep(1.08)
            link_objects1 = driver.find_elements_by_class_name('morelink')
            print('Scraping ', len(link_objects1), topic.upper(), ' article links')

            for link in link_objects1:
                if '#' not in link.get_attribute('href'):
                    link_list.append((link.get_attribute('href'), topic))
                else:
                    pass                        
    df = pd.DataFrame()
    df['urls'] = [x[0] for x in link_list]
    df['topics'] = [x[1] for x in link_list]
    df = df.drop_duplicates(subset='urls')
    set_satirewire_urls = [(df['urls'][i], df['topics'][i]) for i in list(df.index)]
    
    print('-----------------------------------')
    print('Total of', len(set_satirewire_urls), 'urls to scrape for articles')
    print('-----------------------------------')
    return set_satirewire_urls                  
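
One fragile spot in the function above is `int(last_page.split(' of ', 1)[1])`, which raises an `IndexError` if the pager text does not contain `' of '`. A more forgiving variant could look like the sketch below. The helper name `parse_last_page` and the example text `'Page 1 of 12'` are assumptions about what the pager shows.

```python
import re


def parse_last_page(pager_text, default=1):
    """Extract N from text like 'Page 1 of N'; fall back to `default`."""
    match = re.search(r"\bof\s+(\d+)", pager_text)
    return int(match.group(1)) if match else default


print(parse_last_page("Page 1 of 12"))
print(parse_last_page(""))
```

Falling back to a single page means a topic without a pager still gets its first page of links scraped.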

In [None]:
# <-- STEP 2 -->
# Create a helper function to clean up the article's text:

def clean_up_satirewire(dirty_string):
    
    body_clean1 = re.sub(r"\s+", " ", dirty_string)
    body_squeaky = body_clean1.split('Copyright ©', 1)[0]
    sep1 = '(SatireWire) — '
    sep2 = '(SatireWire.com) — '
    sep3 = '(SatireWire.com) – '
    if sep1 in body_squeaky:
        clean = body_squeaky.split(sep1, 1)[1]
    elif sep2 in body_squeaky:
        clean = body_squeaky.split(sep2, 1)[1]
    elif sep3 in body_squeaky:
        clean = body_squeaky.split(sep3, 1)[1]
    else:
        clean = body_squeaky
    return clean 
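
The three separators differ only in the optional `.com` and the type of dash, so a single regular expression could cover them. This is an alternative sketch, not a drop-in replacement: it splits at the first dateline of any kind (the original checks the separators in a fixed order), and it also accepts `(SatireWire)` followed by an en dash, which the original does not. The sample string is invented.

```python
import re

# Matches "(SatireWire) " or "(SatireWire.com) " followed by an em or en dash.
DATELINE = re.compile(r"\(SatireWire(?:\.com)?\) [\u2014\u2013] ")


def clean_up_with_regex(dirty_string):
    """Collapse whitespace, drop the copyright footer, strip the dateline."""
    text = re.sub(r"\s+", " ", dirty_string)
    text = text.split("Copyright \u00a9", 1)[0]
    parts = DATELINE.split(text, maxsplit=1)
    return parts[1] if len(parts) == 2 else text


sample = "CITY (SatireWire.com) \u2013 Local man\n scrapes   web. Copyright \u00a9 2020"
print(repr(clean_up_with_regex(sample)))
```

As in the original helper, trailing whitespace is left in place; add `.strip()` if you want it removed.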

In [None]:
print('Are these two "dash" characters equal to each other? Answer: ' + str('—' == '–'))
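
To see why the two dashes are not equal, we can look up their Unicode names and code points with the standard library. The characters are written as escape sequences here so there is no doubt about which one is which.

```python
import unicodedata

em_dash = "\u2014"
en_dash = "\u2013"

for char in (em_dash, en_dash):
    print(repr(char), hex(ord(char)), unicodedata.name(char))

print("Equal?", em_dash == en_dash)
```

They look almost identical on screen, which is why `clean_up_satirewire()` checks for both variants.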

In [None]:
# <-- STEP 3 -->
# Create a helper function to scrape individual article content:

def scrape_one_article(url, topic, ind, all_urls, all_dates, all_titles, 
                       all_lengths, all_topics1, body_contents, source_id):
    try:
        driver.get(url)
        time.sleep(1.1)
        body = driver.find_element_by_class_name('entry').text
        length = round(len(body) /5/ 250, 1)
        
        if length >= .5:
            if url not in all_urls:                
                date = driver.find_element_by_class_name('entry-date').text
                date = pd.to_datetime(date).date().strftime('%Y-%m-%d')
                title = driver.find_element_by_tag_name('h2').text
                body_squeaky = clean_up_satirewire(body)
                
                all_urls.append(url)
                body_contents.append(body_squeaky)
                all_dates.append(date)
                all_titles.append(title)
                all_lengths.append(length)
                all_topics1.append(topic)

#           --- ADDING CATEGORIES AT A LATER TIME FOR TOPIC MODELING ---
#                 category_dict = find_categories(content, categories)
#                 all_topics1.append(category_dict[0])
#                 all_topics2.append(category_dict[1])
#                 all_topics3.append(category_dict[2])
#                 all_topics4.append(category_dict[3])
#                 all_topics5.append(category_dict[4])

            else:
                print("Duplicate link not added", ind)
                pass
        else:
            print('Not worthy of scraping article #', ind)
            pass
    except Exception as e:
        print('Nothing to scrape for link #', str(ind) , e)
        pass

In [None]:
# <-- STEP 4 -->
# Scrape each article's content and populate a dataframe

def scrape_satirewire_articles(urls_list):
    ind = 1
    body_contents = []
    all_urls = []
    all_dates = []
    all_titles = []
    all_lengths = []
    all_topics1 = []
    author = 'Author not specified'
    source_id = 'SatireWire'
    
#     all_topics2 = []
#     all_topics3 = []
#     all_topics4 = []
#     all_topics5 = []
    
    for url, topic in urls_list:
        print('Working on #' + str(ind) + ' of '+ str(len(urls_list)) +' links')
        print()
        scrape_one_article(url, topic, ind, all_urls, all_dates, all_titles, 
                           all_lengths, all_topics1, body_contents, source_id)
        ind += 1    

    df = pd.DataFrame()
    df['body_content'] = body_contents
    df['url'] = all_urls
    df['date'] = all_dates
    df['title'] = all_titles
    df['length'] = all_lengths
    df['topic_1'] = all_topics1
    df['author'] = author
    df['source_id'] = source_id
    df['satire_or_not'] = 'satire'
    df['label'] = 1

# ADDING CATEGORIES AT A LATER TIME FOR TOPIC MODELING    
#     df['topic_1'] = all_topics1
#     df['topic_2'] = all_topics2
#     df['topic_3'] = all_topics3
#     df['topic_4'] = all_topics4
#     df['topic_5'] = all_topics5

    df = df.drop_duplicates()
    df.index = range(len(df.index))

    return df

In [None]:
# Complete scraping function

def scrape_satirewire():
    
    start = time.time()
    
    satirewire_urls = scrape_links_satirewire(['16'])    # <--- FOR TESTING/DEMONSTRATION PURPOSES
#     satirewire_urls = scrape_links_satirewire(topic_codes)

    satirewire_df = scrape_satirewire_articles(satirewire_urls)
    print('The satire scraper took ', str(time.time() - start), 'seconds.')  # <---   Can remove before uploading to AWS

    return satirewire_df

In [None]:
satire = scrape_satirewire()

In [None]:
satire.tail()

In [None]:
satire.body_content[30]