### Webscraping class walk-through, to scrape information from Rock and Roll Marathons
#### This notebook is to scrape the data from ONE PAGE of the Marathons website, to learn Requests, and BeautifulSoup
#### For my personal work, two other notebooks will be created:
- marathons_webscraping, to scrape the data and create .csv files
- marathons_EDA (Exploratory Data Analysis), to analyze the data

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup as BS

- Before scraping, it's a good idea to view the page source of the web page(s) you intend to scrape, to see how they're structured
- The devtools may also be helpful
    - In Chrome right-click and choose `inspect` or just use `F12` to bring up the devtools

In [2]:
#Number of pages of results for each race.
#From the Readme file for the class assignment (instructors got this info by using the Postman app).
#This info is needed later (in personal work) to build the function / for loop
#that goes through every page.

pgs_2016 = 154   
pgs_2017 = 147   
pgs_2018 = 85   
pgs_2019 = 113   
pgs_half_2016 = 898   
pgs_half_2017 = 892   
pgs_half_2018 = 598   
pgs_half_2019 = 690   

In [3]:
urlbase_2019 = 'https://www.runrocknroll.com/Events/Nashville/The-Races/Marathon/2019-Results?gender=&agegroup=&bib=&firstname=&lastname=&page='
urlbase_2018 = 'https://www.runrocknroll.com/Events/Nashville/The-Races/Marathon/2018-Results?gender=&agegroup=&bib=&firstname=&lastname=&page='
urlbase_2017 = 'https://www.runrocknroll.com/Events/Nashville/The-Races/Marathon/2017-Results?gender=&agegroup=&bib=&firstname=&lastname=&page='
urlbase_2016 = 'https://www.runrocknroll.com/Events/Nashville/The-Races/Marathon/2016-Results?gender=&agegroup=&bib=&firstname=&lastname=&page='
urlbase_half_2019 = 'https://www.runrocknroll.com/Events/Nashville/The-Races/Half-Marathon/2019-Results?gender=&agegroup=&bib=&firstname=&lastname=&page='
urlbase_half_2018 = 'https://www.runrocknroll.com/Events/Nashville/The-Races/Half-Marathon/2018-Results?gender=&agegroup=&bib=&firstname=&lastname=&page='
urlbase_half_2017 = 'https://www.runrocknroll.com/Events/Nashville/The-Races/Half-Marathon/2017-Results?gender=&agegroup=&bib=&firstname=&lastname=&page='
urlbase_half_2016 = 'https://www.runrocknroll.com/Events/Nashville/The-Races/Half-Marathon/2016-Results?gender=&agegroup=&bib=&firstname=&lastname=&page='
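The eight base URLs above differ only in the race name and the year, so one option is to generate them from a template. This is a minimal sketch, not part of the class material: `URL_TEMPLATE`, `PAGE_COUNTS`, and `url_base()` are hypothetical names, and the page counts are copied from the earlier cell.

```python
# Hypothetical helper names; the URL pattern and page counts come from the cells above.
URL_TEMPLATE = (
    'https://www.runrocknroll.com/Events/Nashville/The-Races/{race}/{year}-Results'
    '?gender=&agegroup=&bib=&firstname=&lastname=&page='
)

# Page counts keyed by (race, year).
PAGE_COUNTS = {
    ('Marathon', 2016): 154,
    ('Marathon', 2017): 147,
    ('Marathon', 2018): 85,
    ('Marathon', 2019): 113,
    ('Half-Marathon', 2016): 898,
    ('Half-Marathon', 2017): 892,
    ('Half-Marathon', 2018): 598,
    ('Half-Marathon', 2019): 690,
}


def url_base(race, year):
    """Return the base URL (ending in 'page=') for one race and year."""
    return URL_TEMPLATE.format(race=race, year=year)


print(url_base('Half-Marathon', 2018))
```

Keeping the counts in a dictionary means a later loop can look up both the base URL and the number of pages from the same `(race, year)` key.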

### The steps to get a DataFrame from one page of results look like this:
1. Build a URL by combining the base url with a specific page number
2. Use requests.post() to get the results of the post
3. Make a soup from results.text
4. Look at the soup to identify the table you want based on one of its attributes (like class) 
5. Pass the table as a string to pandas read_html() 
6. What does that look like? What is the datatype?
7. Keep working with the data until you have it in a DataFrame
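The steps above can be sketched as two small functions. This is only an outline under assumptions: `fetch_html()`, `html_to_df()`, and `table_index` are hypothetical names, the class string is the one used later in this notebook, and wrapping the table in `StringIO` is a choice made here because recent pandas versions warn about literal HTML strings.

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Class string used later in this notebook to pick out the tables.
TABLE_CLASS = 'table table-responsive table-bordered'


def fetch_html(url):
    """Steps 1-2: POST to a page URL and return the HTML text."""
    response = requests.post(url, timeout=30)
    response.raise_for_status()  # raise an error on a 4xx/5xx status code
    return response.text


def html_to_df(html, table_index=0):
    """Steps 3-7: soup -> matching tables -> one DataFrame."""
    soup = BeautifulSoup(html, 'lxml')
    tables = soup.find_all('table', attrs={'class': TABLE_CLASS})
    # read_html() returns a list of DataFrames, one per <table> it is given.
    return pd.read_html(StringIO(str(tables[table_index])))[0]
```

With these in place, `html_to_df(fetch_html(url))` would cover one page. Splitting the fetch from the parse makes it possible to test the parsing on saved HTML without hitting the website.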

In [18]:
#STEP 1: Build a URL by combining the base url with a specific page number
'''QUESTION: Why is this needed? 
    Could you do base = 'https://www.runrocknroll.com/Events/...[full web address]?
    Why is url = base + str(99) needed?
    '''

base = urlbase_2019
page = 99               #To pull just one page to take a look at 
url = base + str(page)  #This appends the page number (99) to the end of the URL. 
print(url)

https://www.runrocknroll.com/Events/Nashville/The-Races/Marathon/2019-Results?gender=&agegroup=&bib=&firstname=&lastname=&page=99
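On the question in the cell above: each page of results has its own address, and the page number is the last query parameter, which is why the base ends in `page=` and the number is appended. A small sketch using the standard library's `urllib.parse` shows the pieces of the printed URL; nothing here contacts the website.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://www.runrocknroll.com/Events/Nashville/The-Races/Marathon/2019-Results'
       '?gender=&agegroup=&bib=&firstname=&lastname=&page=99')

parts = urlsplit(url)
print(parts.path)    # the results page for one race and year

# keep_blank_values=True keeps the empty search filters (gender=, bib=, ...).
query = parse_qs(parts.query, keep_blank_values=True)
print(query['page'])  # the only parameter that changes from page to page
```

Changing only the value after `page=` is what a later loop over all pages would do.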


### Make a request using the `requests` [library](https://requests.readthedocs.io/en/master/user/quickstart/)
- `requests.get()` uses HTTP GET to get a webpage
- `requests.post()` uses HTTP POST when the webpage is submitting a form
- checking the [`status_code`](https://www.restapitutorial.com/httpstatuscodes.html) on the result lets you know whether your request was successful
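To make the GET/POST difference concrete, `requests` can prepare a request without sending it. The sketch below uses a placeholder `example.com` address (an assumption, not the marathon site) and only inspects where the parameters end up.

```python
import requests

# example.com is a placeholder; prepare() builds the request but sends nothing.
get_req = requests.Request('GET', 'https://example.com/results',
                           params={'page': 99}).prepare()
post_req = requests.Request('POST', 'https://example.com/results',
                            data={'page': 99}).prepare()

print(get_req.method, get_req.url)    # GET: parameters travel in the URL
print(post_req.method, post_req.url)  # POST: the URL stays bare...
print(post_req.body)                  # ...and form fields travel in the body
```

"Submitting a form" means the browser sends the field values to the server, usually with POST. In this notebook the page number is already part of the URL, so `requests.post(url)` sends a POST with no form body.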

In [19]:
#STEP 2: Use requests.post() to get the results of the post

'''QUESTION: I don't understand what "requests.post" is. What's happening here?
What does it mean in the 2nd bullet above "when the webpage is submitting a form"?
Does that just mean the webpage has a form on it that people are filling out?

Why is response.TEXT needed in line 11 below?'''

response = requests.post(url)   #url defined above, it's page 99 (only) of 2019 full marathon.
print(type(response))           #To see what type "response" is. It's a requests Response object.
soup = BS(response.text, 'lxml')   #parses response (page 99 of 2019 full marathon) into a soup
print(type(soup))               #To see what type "soup" is. It's a Beautiful Soup class

<class 'requests.models.Response'>
<class 'bs4.BeautifulSoup'>


In [6]:
# STEP 3 (done above): the soup was made from response.text, which requests.post(url) returned.
#         This tells the soup to find all <table> tags whose class attribute matches the string below.

tables = soup.find_all('table',
                  attrs = {'class': 'table table-responsive table-bordered'})
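How the class match behaves is easier to see on a tiny example. The HTML below is made up for illustration (the real page is larger), and `demo_soup` is a hypothetical name; it uses the built-in `html.parser` so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Made-up HTML with three tables that have different class attributes.
html = '''
<table class="table table-responsive table-bordered"><tr><td>results</td></tr></table>
<table class="table table-bordered"><tr><td>other</td></tr></table>
<table class="plain"><tr><td>layout</td></tr></table>
'''
demo_soup = BeautifulSoup(html, 'html.parser')

# A multi-word class string matches only when the class attribute is exactly that string.
exact = demo_soup.find_all('table', attrs={'class': 'table table-responsive table-bordered'})
# A single class name matches any table that carries that class.
single = demo_soup.find_all('table', class_='table-bordered')
# A CSS selector matches tables that carry all the listed classes, in any order.
css = demo_soup.select('table.table.table-bordered')

print(len(exact), len(single), len(css))
```

The exact-string form used in this notebook is strict: if the site reordered or added a class, it would stop matching, while the single-class and CSS forms would still find the table.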

In [7]:
#STEP 4: Look at the soup to identify the table you want 
#        based on one of its attributes (like class)

#This counts the matching tables. Comparing with the original
#web page, we can deduce that they are the top "search" table, 
#plus the leaderboards for females and males.

len(tables)

3

In [8]:
#STEP 5: Pass the table as a string to pandas read_html()

#We can deduce that the first table is the one we want (position 0)
#read_html() returns a LIST of DATAFRAMES. We're only passing in the table at position [0], 
#  so the list will hold just one df.
#OPTIONS: Could save each df and append to 0, or save several small dfs then append that.
#Doing little dfs helps see what's going on by using print().

results_list = pd.read_html(str(tables[0]))  

In [9]:
#Confirmed that we just got 1 table back. 

len(results_list)

1

In [10]:
df = results_list[0]  #gets the first (only) df from the results_list created in STEP 5.
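Because `read_html()` always returns a list, the "several small dfs" option mentioned earlier comes down to collecting DataFrames and stacking them with `pd.concat()`. A minimal sketch on made-up HTML follows; the `StringIO` wrapper is a choice made here because recent pandas versions emit a deprecation warning for literal HTML strings.

```python
from io import StringIO

import pandas as pd

# Made-up HTML holding two small tables with the same columns.
html = '''
<table><thead><tr><th>Bib</th><th>Name</th></tr></thead>
<tbody><tr><td>1</td><td>A Runner</td></tr></tbody></table>
<table><thead><tr><th>Bib</th><th>Name</th></tr></thead>
<tbody><tr><td>2</td><td>B Runner</td></tr></tbody></table>
'''

frames = pd.read_html(StringIO(html))
print(type(frames), len(frames))  # a list with one DataFrame per <table>

# pd.concat() stacks the small DataFrames; ignore_index renumbers the rows.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The same pattern would apply to one DataFrame per page: append each to a list inside the loop, then call `pd.concat()` once at the end.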

In [20]:
#STEP 6: What does that look like? What is the datatype?
#STEP 7: Keep working with the data until you have it in a DataFrame

df.dtypes

Overall     int64
Bib         int64
Name       object
Time       object
dtype: object

In [11]:
#Yup, looking good. (Zero times are for people who didn't finish in the allotted time)

df.head()  

   Overall    Bib            Name      Time
0    99999  32379   Raquel Flores  00:00:00
1    99999  30292   Kyle Domingos  00:00:00
2    99999  32850    Paul Dillard  00:00:00
3    99999  31415  Nicole Bennett  00:00:00
4    99999  32995      Rudy Novak  00:00:00
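`Time` comes back as `object` (plain strings), so one possible next step is converting it to a timedelta. The sketch below runs on a small made-up sample in the same shape as the scraped table, and treating `00:00:00` as "no recorded finish" is an assumption based on the comment above.

```python
import pandas as pd

# Made-up sample with the same columns as the scraped table.
sample = pd.DataFrame({
    'Overall': [1, 99999],
    'Bib': [101, 32379],
    'Name': ['Test Runner', 'Another Runner'],
    'Time': ['03:05:10', '00:00:00'],
})

# Convert 'HH:MM:SS' strings to timedeltas so times can be sorted and averaged.
sample['Time'] = pd.to_timedelta(sample['Time'])

# Assumption: a zero time means no recorded finish, so mark it as missing (NaT).
sample['Time'] = sample['Time'].mask(sample['Time'] == pd.Timedelta(0))

print(sample.dtypes)
print(sample)
```

Marking zero times as missing keeps them out of averages and rankings without dropping the runners from the data.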


In [None]:
#This is the start of the next part... may not need this code.
pd.read_html(str(tables[2]))