Links for 25 Feb 2020 Matt Lavin workshop:
- [slides](https://docs.google.com/presentation/d/1JnMZbl7434RrzAHluKOT4OyUxCpln3B_7QYoSa01X0M/edit#slide=id.p)
- [repo](https://github.com/mjlavin80/advanced-webscraping-pitt-november-2019)

Notes from the workshop:
- Web scraping is really about understanding how HTML works and how pages are structured.
- Scrapy is another scraping library; it's not very Pythonic, and it's overkill (vis-a-vis bs4) for 90% of what you need to do.
- `time`, from the standard library, is useful when a website will time you out for requesting pages too quickly.
    - `time.sleep(2)` suspends execution for 2 seconds
    - making your code go slow is frustrating, but it is an essential skill
- Pay attention to URL *arguments* (the query string after the `?`), i.e. the specifications the site uses to build the page.
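
The two points above (URL arguments and slowing down) can be sketched together. This is a minimal illustration, not the workshop's code: `BASE_URL`, the `q` and `page` parameters, and the helpers `build_url` and `fetch_politely` are all hypothetical names chosen for the example.

```python
import time
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.com/search"


def build_url(base, **params):
    """Append URL arguments (the query string) to a base URL."""
    return f"{base}?{urlencode(params)}"


def fetch_politely(urls, fetch, delay=2.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause so the site does not time us out
        results.append(fetch(url))
    return results


# Three pages of the same (hypothetical) search, differing only in `page`.
urls = [build_url(BASE_URL, q="ufo", page=n) for n in range(1, 4)]
```

With `requests` imported, something like `fetch_politely(urls, lambda u: requests.get(u).text)` would download each page in turn with a two-second gap.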


In [14]:
import requests, time
from bs4 import BeautifulSoup

In [8]:
r = requests.get('https://api.github.com/user', auth = ('user', 'pass'))
r.status_code

403

In [5]:
r.encoding

'utf-8'

In [6]:
r.text

'{"message":"Not Found","documentation_url":"https://developer.github.com/v3"}'

In [7]:
r.json()

{'message': 'Not Found',
 'documentation_url': 'https://developer.github.com/v3'}
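
The cells above inspect `r.status_code`, `r.text`, and `r.json()` one at a time. A rough sketch of folding those checks into one helper follows. `summarize_response` is a hypothetical name, and a `SimpleNamespace` stands in for the `requests` response so the example runs offline; the status code and body are copied from the outputs above.

```python
import json
from types import SimpleNamespace


def summarize_response(resp):
    """Summarize any object with .status_code and .text attributes."""
    ok = 200 <= resp.status_code < 300
    try:
        payload = json.loads(resp.text)
    except ValueError:
        payload = None  # body was not JSON (e.g. an HTML page)
    message = payload.get("message") if isinstance(payload, dict) else None
    return {"ok": ok, "status": resp.status_code, "message": message}


# Stand-in for the response object (assumption: values taken from the cells above).
fake = SimpleNamespace(
    status_code=403,
    text='{"message":"Not Found","documentation_url":"https://developer.github.com/v3"}',
)
summary = summarize_response(fake)
```

Checking the status before parsing avoids treating an error body as if it were the page you asked for.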

In [16]:
r = requests.get('https://mjlavin80.github.io/pseudonyms/')

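index = r.text
# Name the parser explicitly so results do not depend on which parsers are installed
bs = BeautifulSoup(index, 'html.parser')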
bs.title

<title>Author Pseudonyms</title>

In [17]:
url = 'https://mjlavin80.github.io/advanced-webscraping-pitt-november-2019/oed/index.html'
r = requests.get(url)

In [18]:
r = requests.get('https://memory-alpha.fandom.com/wiki/Starfleet_casualties_(22nd_century)')

early_deaths = r.text
bs = BeautifulSoup(early_deaths, 'html.parser')
bs.find_all("td")

[<td>
 <ul><li><a href="/wiki/Bajoran_Militia_casualties" title="Bajoran Militia casualties">Bajoran Militia</a>
 </li><li><a href="/wiki/Civilian_casualties" title="Civilian casualties">Civilian</a>
 </li><li>Starfleet casualties:
 <ul><li><strong class="selflink">22nd century</strong> (<a class="mw-redirect" href="/wiki/MACO_casualties" title="MACO casualties">MACO</a>)
 </li><li><a href="/wiki/Starfleet_casualties_(23rd_century)" title="Starfleet casualties (23rd century)">23rd century</a>
 </li><li><a href="/wiki/Starfleet_casualties_(24th_century)" title="Starfleet casualties (24th century)">24th century</a>
 </li></ul>
 </li><li><a href="/wiki/Vulcan_casualties" title="Vulcan casualties">Vulcan</a>
 </li><li><i><a href="/wiki/Alternate_reality_casualties" title="Alternate reality casualties">Alternate reality</a></i>
 </li><li><i><a href="/wiki/Mirror_universe_casualties" title="Mirror universe casualties">Mirror universe</a></i>
 </li></ul>
 </td>, <td rowspan="5"> <a href="/wik
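
`find_all("td")` returns whole cells, markup included. A minimal sketch of pulling just the link text and targets out of those cells is below; the `snippet` is a shortened copy of the output above (wrapped in a `table` for the example) so it runs without a network request, and `BASE` is assumed from the URL requested in the cell.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: site root taken from the URL requested above.
BASE = "https://memory-alpha.fandom.com"

# Shortened copy of the first cell in the output, for offline illustration.
snippet = """
<table><tr><td>
<ul><li><a href="/wiki/Bajoran_Militia_casualties" title="Bajoran Militia casualties">Bajoran Militia</a>
</li><li><a href="/wiki/Civilian_casualties" title="Civilian casualties">Civilian</a>
</li></ul>
</td></tr></table>
"""

soup = BeautifulSoup(snippet, "html.parser")

links = []
for cell in soup.find_all("td"):
    # href=True skips anchors that have no href attribute
    for a in cell.find_all("a", href=True):
        # Relative hrefs are resolved against the site root
        links.append((a.get_text(strip=True), urljoin(BASE, a["href"])))
```

Each entry in `links` is a `(text, absolute_url)` pair, which is usually easier to work with than the raw tags.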

In [19]:
r = requests.get('http://github.com/')
r.status_code

200

UFO Database Example

In [34]:
r = requests.get('https://memory-alpha.fandom.com/wiki/Starfleet_casualties_(22nd_century)')
r.status_code

200

In [35]:
ufotext = r.text
bs = BeautifulSoup(ufotext, 'html.parser')

In [36]:
bs

<!DOCTYPE html>
<html class="" dir="ltr" lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, user-scalable=yes" name="viewport"/>
<meta content="MediaWiki 1.19.24" name="generator"/>
<meta content="Memory Alpha,enmemoryalpha,Starfleet casualties (22nd century),Titles/Home,Titles/The Forgotten,Titles/Anomaly,Titles/Similitude,Titles/Rajiin,Titles/Chosen Realm,Titles/Storm Front, Part II,Titles/Daedalus,Titles/Azati Prime,Titles/Zero Hour" name="keywords"/>
<meta content="United Earth/Coalition of Planets/Federation Starfleet personnel in the 22nd century often had to place their lives in danger during the course of their duties, and many made the ultimate sacrifice. The following casualties served aboard the Enterprise while it was in service (2151 – 2161):" name="description"/>
<meta content="summary" name="twitter:card"/>
<meta content="@getfandom" name="twitter:site"/>
<meta content="https://memory-alpha.fandom.com
