# Better Webscraping in Python

In this file, we're going to use:
- AsyncIO: https://docs.python.org/3/library/asyncio.html
- Requests-HTML: https://pypi.org/project/requests-html/
- Pandas: https://pandas.pydata.org/
- BeautifulSoup4: https://pypi.org/project/beautifulsoup4/

The idea here is to use:
- Requests-HTML to get the website (and render JS if we need to)
- AsyncIO to do that asynchronously (and FAST!)
- BeautifulSoup to pull out the elements we want with CSS selectors
- Pandas when we have a table to turn into a dataframe

The part that trips a lot of people up is CSS selectors. Right click on a website, hit Inspect, and take a look. HTML is intimidating at first, but it's not difficult to read. To learn how to use CSS selectors, go here: https://flukeout.github.io/
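To make that concrete, here is a minimal sketch of a few common selector patterns using BeautifulSoup's `select` and `select_one`. The HTML snippet and its class names are made up for illustration; they are not taken from any real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<div id="main">
  <h1 class="title">Reviews</h1>
  <div class="review featured"><p>Loved it.</p></div>
  <div class="review"><p>Not for me.</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag selector: every <p> element on the page.
paragraphs = [p.get_text() for p in soup.select("p")]

# Class selector: every element that has the class "review".
reviews = soup.select(".review")

# Two classes on the same element: chain them with no space in between.
featured = soup.select_one("div.review.featured")

# Id plus descendant selector: the <h1> somewhere inside #main.
title = soup.select_one("#main h1").get_text()

print(paragraphs)
print(len(reviews))
print(featured.get_text(strip=True))
print(title)
```

One thing worth knowing: a chained selector like `div.review.featured` matches no matter what order the classes appear in, while `find("div", {"class": "review featured"})` only matches when the class attribute is exactly that string.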

I'm not going to discuss XPaths here. They tend to be a lot of trouble, and they're not much better than CSS selectors once you know what you're doing.

## Let's scrape!

For this example, I'm going to scrape IMDB movie review text.

In [1]:
# Import our libraries
import asyncio
from bs4 import BeautifulSoup
from requests_html import AsyncHTMLSession


# "async def" already makes this a coroutine, so no decorator is needed.
async def getReviewText(linkStr):
    """
    This function returns the title and text of an imdb review, given the permalink.
    """
    # Create the async session. Think of this as your browser, it reaches out to get a URL.
    session = AsyncHTMLSession()
    # Get the page from the link! Note the await keyword. This tells python that we are waiting on a response.
    page = await session.get(linkStr)
    # Close the session.
    await session.close()
    # Status code 200 means it was successful.
    if page.status_code == 200:
        # Make the BeautifulSoup object from the HTML. Naming the parser avoids a warning.
        soup = BeautifulSoup(page.html.raw_html, "html.parser")
        # Then, use BS4's find function to return just the class we want.
        review = soup.find("div", {"class": "text show-more__control"})
        # If the div is missing we fall through and return None, same as a non-200 response.
        if review is not None:
            # And, just the text!
            return review.text


async def getAllReviews(linksList):
    """
    This driver function creates a bunch of asyncio jobs with the function above.
    This lets us get a hundred or more reviews at once, pretty darn quick.
    """
    # This makes the coroutine objects (nothing runs until they are awaited)
    jobs = list(map(getReviewText, linksList))
    # Jobs is unpacked into gather, which we then await. This lets them all run concurrently!
    results = await asyncio.gather(*jobs)
    # Keep the result if it's not None. Remember that we only return a result if the code is 200.
    results = [result for result in results if result]
    return results

#Let's try it out!
urlbase = "https://www.imdb.com/review/rw1000"
links = [f"{urlbase}{i:03d}" for i in range(50)]

results = await getAllReviews(links)
print(results[0:5])

["When Ellen went to Spitsbergen, she went for adventure. She also expected Lars, her host, to be a rough husky adventurer, but was deceived as Lars turned out to be a silent clumsy tinkerer.At the beginning this movie didn't seem very special to me. But along the way the psychology of Lars and Ellen was getting more and more interesting.Then when the end came the movie took my breath away. Probably because this movie happened for real as well as how realistic it was filmed I could completely imagine myself in Lars' place (contrary to 'Hollywood hits' which are often too cinematic for such empathy).Finally, the movie also shows how it's like to live in Spitsbergen: rough, but not adventurous.", "Joel Schumacher who did an OK job on the third Batman Movie, has simply lost the plot on this one. Poorly cast with the exception of Robin and Ivy, this is far more like the 60's TV series than the comic books, and it just does not work.The Baddies are not menacing, Schwarzenegger rarely perfor

As we can see above, the result is a list of review texts, just like we expected!
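If you want to see why gathering the jobs is so much faster than fetching pages one at a time, here is a small sketch that swaps the network call for `asyncio.sleep`. The names `fake_fetch`, `fetch_all`, and `main` are hypothetical stand-ins, not part of the scraper above.

```python
import asyncio
import time


async def fake_fetch(i, delay=0.2):
    """Stand-in for a network request: waits a bit, then returns a label."""
    await asyncio.sleep(delay)
    return f"page {i}"


async def fetch_all(n):
    """Start n fake requests at once and wait for all of them."""
    jobs = [fake_fetch(i) for i in range(n)]
    # gather returns the results in the same order as the jobs passed in.
    return await asyncio.gather(*jobs)


async def main():
    """Time ten fake requests that each wait 0.2 seconds."""
    start = time.perf_counter()
    pages = await fetch_all(10)
    elapsed = time.perf_counter() - start
    return pages, elapsed
```

In a notebook you can run this with `pages, elapsed = await main()`, and in a plain script with `asyncio.run(main())`. The ten waits overlap, so the whole batch should take roughly as long as one wait instead of ten of them back to back.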

So, what about tables? Let's use pandas to pull the Passing, Rushing, & Receiving stats table from here: https://www.pro-football-reference.com/boxscores/201909050chi.htm

Inspect the page and see what we're dealing with. We have some JavaScript to render, the data sits in a 'table' element, and its id is 'player_offense', which makes our job pretty easy!
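Here is a rough sketch of why an id makes the job easy, using a made-up miniature table. Only the id `player_offense` comes from the real page; the class name, players, and numbers are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up miniature of a stats table. Only the id matches the real page.
html = """
<table class="example_table" id="player_offense">
  <thead><tr><th>Player</th><th>Tm</th><th>Yds</th></tr></thead>
  <tbody>
    <tr><th>Player A</th><td>AAA</td><td>203</td></tr>
    <tr><th>Player B</th><td>BBB</td><td>0</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# An id should be unique on a page, so it is the most reliable hook.
table = soup.find("table", id="player_offense")

# Column names come from the header cells.
header = [cell.get_text() for cell in table.select("thead th")]

# Each body row becomes a list of cell texts (row labels are <th>, values are <td>).
rows = [
    [cell.get_text() for cell in tr.find_all(["th", "td"])]
    for tr in table.select("tbody tr")
]

print(header)
print(rows)
```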

**Pay close attention to the .arender() call (the async version of .render()); it is going to be useful often!**

In [2]:
from io import StringIO

import pandas as pd


# "async def" already makes this a coroutine, so no decorator is needed.
async def getTable(linkStr):
    """
    This function finds a table by its HTML id and parses it with pandas.
    Note that pd.read_html returns a list of dataframes, one per table it finds.
    """
    # Create the async session. Think of this as your browser, it reaches out to get a URL.
    session = AsyncHTMLSession()
    # Get the page from the link! Note the await keyword. This tells python that we are waiting on a response.
    page = await session.get(linkStr)
    # Render the page (This handles the javascript rendering)
    await page.html.arender()
    # Close the session.
    await session.close()
    # Status code 200 means it was successful.
    if page.status_code == 200:
        # Make the BeautifulSoup object from the rendered HTML. Naming the parser avoids a warning.
        soup = BeautifulSoup(page.html.raw_html, "html.parser")
        # Then, use BS4's find function to return just the table we want, by its id.
        table = soup.find("table", id="player_offense")
        if table is not None:
            # And, use pandas to parse it. Newer pandas versions want a file-like object, hence StringIO.
            return pd.read_html(StringIO(str(table)))


In [3]:
url = 'https://www.pro-football-reference.com/boxscores/201909050chi.htm'
print(await asyncio.gather(getTable(url)))

[[          Unnamed: 0_level_0 Unnamed: 1_level_0  Passing                    \
                      Player                 Tm      Cmp      Att      Yds   
0              Aaron Rodgers                GNB       18       30      203   
1                Aaron Jones                GNB        0        0        0   
2   Marquez Valdes-Scantling                GNB        0        0        0   
3              Davante Adams                GNB        0        0        0   
4               Jimmy Graham                GNB        0        0        0   
5               Trevor Davis                GNB        0        0        0   
6              Robert Tonyan                GNB        0        0        0   
7            Jamaal Williams                GNB        0        0        0   
8             Marcedes Lewis                GNB        0        0        0   
9                        NaN                NaN  Passing  Passing  Passing   
10                    Player                 Tm      Cmp      

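As we can see (sorry it's kinda ugly), the output is a plain old pandas dataframe, wrapped in two lists: one from pd.read_html (one dataframe per table) and one from asyncio.gather (one result per job). From there, it's time for data cleaning! But I'll leave that for you all to figure out.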
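As a starting point, here is one possible sketch of that cleanup, assuming the dataframe looks like the output above: two header levels, plus header rows repeated in the middle of the data. The `raw` frame is a made-up miniature (the players, teams, and numbers are invented), and `clean_boxscore` is a hypothetical helper name.

```python
import pandas as pd

# Made-up frame shaped like the output above: a two-level header,
# plus the header rows repeated in the middle of the data.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Player"),
    ("Unnamed: 1_level_0", "Tm"),
    ("Passing", "Cmp"),
    ("Passing", "Yds"),
])
raw = pd.DataFrame(
    [
        ["Player A", "AAA", "18", "203"],
        [None, None, "Passing", "Passing"],
        ["Player", "Tm", "Cmp", "Yds"],
        ["Player B", "BBB", "0", "0"],
    ],
    columns=columns,
)


def clean_boxscore(df):
    """Flatten the header, drop repeated header rows, and make stats numeric."""
    df = df.copy()
    # Keep "Passing_Cmp" style names, but drop the "Unnamed: ..." filler level.
    df.columns = [
        low if top.startswith("Unnamed") else f"{top}_{low}"
        for top, low in df.columns
    ]
    # Repeated header rows have no player name, or the literal word "Player".
    keep = df["Player"].notna() & (df["Player"] != "Player")
    df = df[keep].copy()
    # Assumption: everything after the first two columns is a numeric stat.
    for col in df.columns[2:]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df.reset_index(drop=True)


clean = clean_boxscore(raw)
print(clean)
```

With the real result you would pass in the dataframe itself, not the lists around it. The real table likely has more quirks than this sketch handles, so treat it as a first pass.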