# Accessing data from a website
Not all websites offer an API or a download link for their data. Luckily, `pandas` can read HTML tables straight off a page with `read_html`.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

result = requests.get('https://en.wikipedia.org/wiki/List_of_sovereign_states')
pd.read_html(result.content)[0].head(20)

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
1,UN member states and observer states ↓,,,
2,Abkhazia,,,
3,Afghanistan – Islamic Republic of Afghanistan,UN member state,,
4,Albania – Republic of Albania,,,
5,Algeria – People's Democratic Republic of Algeria,,,
6,Andorra – Principality of Andorra,,,Andorra is a co-principality in which the offi...
7,Angola – Republic of Angola,,,
8,Antigua and Barbuda,,,Antigua and Barbuda is a Commonwealth realm[e]...
9,Argentina – Argentine Republic[g],,,Argentina is a federation of 23 provinces and ...
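
`read_html` returns a list with one dataframe per table it finds, which is why the call above indexes `[0]`. The sketch below shows that, plus the `match` argument for picking a table by its text instead of its position. The HTML snippet is invented for illustration, and the example assumes `lxml` (or another parser `pandas` supports) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, made up for illustration.
html = """
<table>
  <tr><th>Menu</th></tr>
  <tr><td>Home</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>Albania</td><td>Tirana</td></tr>
  <tr><td>Andorra</td><td>Andorra la Vella</td></tr>
</table>
"""

# read_html returns a list: one dataframe per <table> found.
all_tables = pd.read_html(StringIO(html))

# match keeps only the tables whose text matches the string or regex.
capitals = pd.read_html(StringIO(html), match="Capital")[0]

print(len(all_tables))
print(capitals)
```

Selecting by `match` tends to survive page redesigns better than a fixed position, since extra tables added above the one you want do not shift the result.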


For more complex parsing, we can use the `BeautifulSoup` library. Let's extract the same table again, this time going through `BeautifulSoup`.

In [2]:
soup = BeautifulSoup(result.content, 'lxml') # Parse the raw HTML bytes with the lxml parser
str(soup)[:500]

'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n<head>\n<meta charset="utf-8"/>\n<title>List of sovereign states - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_sovereign_states","wgTitle":"List of sovereign states","wgCurRevisionId":907138677,"wgRevisionId":907138677,"wgArti'

Find all the `<table>` elements on the page.

In [3]:
tables = soup.find_all('table')
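
`find_all` can also filter on attributes, which helps when a page mixes layout tables with data tables. Here is a minimal sketch on an invented snippet; the class names `navbox` and `wikitable` are assumptions about the markup, so inspect the real page before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration.
html = """
<table class="navbox"><tr><td>Navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Name</th></tr>
  <tr><td>Albania</td></tr>
</table>
"""

# html.parser ships with Python, so this runs without lxml.
demo_soup = BeautifulSoup(html, "html.parser")

every_table = demo_soup.find_all("table")                      # all tables
data_tables = demo_soup.find_all("table", class_="wikitable")  # filter by CSS class
same_tables = demo_soup.select("table.wikitable")              # same filter as a CSS selector

print(len(every_table), len(data_tables), len(same_tables))
```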

Using the `read_html` function of `pandas`, read the first table into a dataframe.

In [4]:
from io import StringIO # recent pandas versions deprecate passing literal HTML strings
pd.read_html(StringIO(str(tables[0])))[0].head(20)

Unnamed: 0,Common and formal names,Membership within the UN System[a],Sovereignty dispute[b],Further information on status and recognition of sovereignty[d]
0,,,,
1,UN member states and observer states ↓,,,
2,Abkhazia,,,
3,Afghanistan – Islamic Republic of Afghanistan,UN member state,,
4,Albania – Republic of Albania,,,
5,Algeria – People's Democratic Republic of Algeria,,,
6,Andorra – Principality of Andorra,,,Andorra is a co-principality in which the offi...
7,Angola – Republic of Angola,,,
8,Antigua and Barbuda,,,Antigua and Barbuda is a Commonwealth realm[e]...
9,Argentina – Argentine Republic[g],,,Argentina is a federation of 23 provinces and ...
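
The names column above still carries Wikipedia footnote markers such as `[g]` and packs the common and formal names into one string. One possible cleanup, sketched on three rows copied from the output; the new column names `common` and `formal` are hypothetical:

```python
import pandas as pd

col = "Common and formal names"
names = pd.DataFrame({col: [
    "Albania – Republic of Albania",
    "Antigua and Barbuda",
    "Argentina – Argentine Republic[g]",
]})

# Drop bracketed footnote markers such as [g].
cleaned = names[col].str.replace(r"\[[a-z]+\]", "", regex=True).str.strip()

# Split "common – formal" on the en dash; rows without one get a missing formal name.
parts = cleaned.str.split(" – ", n=1, expand=True)
names["common"] = parts[0]
names["formal"] = parts[1]

print(names[["common", "formal"]])
```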


As we can see, scraped data isn't always clean, which is why an API, when one exists, is usually nicer than parsing HTML. Still, wrapping these steps in a function makes grabbing a table much more convenient.

In [5]:
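from io import StringIO

def dfFromURL(url, tableNumber=1):
    """Return the tableNumber-th table (1-based) on the page at url as a dataframe."""
    soup = BeautifulSoup(requests.get(url).content, 'lxml') # Parse the raw HTML with lxml
    tables = soup.find_all('table')
    # Check that the table number is within the number of tables on the page
    if not 1 <= tableNumber <= len(tables):
        raise ValueError(f'Page has {len(tables)} table(s); tableNumber={tableNumber} is out of range')
    # Wrap in StringIO: recent pandas versions deprecate passing literal HTML strings
    return pd.read_html(StringIO(str(tables[tableNumber-1])))[0]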
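
The function above trusts that the request succeeds. A slightly more defensive download step could look like the sketch below. The name `fetch_html`, the `session` parameter, and the User-Agent string are all hypothetical, added here for illustration. Some sites reject requests that lack a browser-like User-Agent, so you may need to adjust that header.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    if session is None:
        session = requests.Session()
    # Hypothetical User-Agent; replace it with one that identifies your script.
    headers = {"User-Agent": "table-scraper-example/0.1"}
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

Without `raise_for_status`, an error page would be parsed like any other HTML and the failure would only surface later as a confusing "no tables found" problem.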

Now we can make a pretty simple call to get an HTML table as a dataframe. Let's try it.

In [6]:
prices = dfFromURL('https://finance.yahoo.com/quote/JPM/history?p=JPM')
prices.head()

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume
0,"Jul 22, 2019",112.91,114.45,112.77,114.27,114.27,9061100
1,"Jul 19, 2019",114.89,115.12,113.4,113.54,113.54,10402800
2,"Jul 18, 2019",113.93,115.07,113.55,114.67,114.67,9400700
3,"Jul 17, 2019",114.43,114.94,113.73,113.99,113.99,13120900
4,"Jul 16, 2019",113.48,115.5,112.92,115.12,115.12,16945000


We got some messy data here, with divs and some disclaimers at the bottom. Let's clean it up with a simple `dropna`.

In [7]:
prices = prices.dropna()
prices.head()

Unnamed: 0,Date,Open,High,Low,Close*,Adj Close**,Volume
0,"Jul 22, 2019",112.91,114.45,112.77,114.27,114.27,9061100
1,"Jul 19, 2019",114.89,115.12,113.4,113.54,113.54,10402800
2,"Jul 18, 2019",113.93,115.07,113.55,114.67,114.67,9400700
3,"Jul 17, 2019",114.43,114.94,113.73,113.99,113.99,13120900
4,"Jul 16, 2019",113.48,115.5,112.92,115.12,115.12,16945000
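
`dropna` only removes rows that contain missing values. If a footer row holds text in every column, it will survive, and scraped numbers may still be stored as strings. The sketch below shows one way to handle both cases by coercing types first. The first two rows are taken from the output above; the footer row is a made-up placeholder.

```python
import pandas as pd

# Two real rows from the output above plus a hypothetical text-only footer row.
raw = pd.DataFrame({
    "Date": ["Jul 22, 2019", "Jul 19, 2019", "footer disclaimer text"],
    "Open": ["112.91", "114.89", "footer disclaimer text"],
    "Volume": ["9061100", "10402800", "footer disclaimer text"],
})

clean = raw.copy()
# Values that do not parse become NaT / NaN instead of raising an error.
clean["Date"] = pd.to_datetime(clean["Date"], format="%b %d, %Y", errors="coerce")
for column in ["Open", "Volume"]:
    clean[column] = pd.to_numeric(clean[column], errors="coerce")

# Now dropna removes the footer row; index by date, oldest first.
clean = clean.dropna().set_index("Date").sort_index()
print(clean)
```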


Cool! Now let's grab a table other than the first one by passing `tableNumber=3` for the third table on the page. Let's see what the Cavs' record was for the last few seasons:

In [8]:
dfFromURL('http://www.espn.com/nba/team/_/name/cle/cleveland-cavaliers', 3)

Unnamed: 0,YEAR,W,L,PCT
0,2018-19,19,63,0.232
1,2017-18,50,32,0.61
2,2016-17,51,31,0.622
3,2015-16,57,25,0.695
4,2014-15,53,29,0.646
