# Accessing data from a website
Not all websites make it easy to grab data. Luckily, `pandas` can help.

In [None]:
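from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

result = requests.get('https://en.wikipedia.org/wiki/List_of_sovereign_states')
# read_html returns a list of DataFrames, one per <table>; take the first
pd.read_html(StringIO(result.text))[0].head(20)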
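
`pd.read_html` returns a list of DataFrames, one for each `<table>` it finds, which is why the call above indexes `[0]`. A minimal offline sketch, assuming a made-up HTML snippet (the table contents are illustrative and not taken from the Wikipedia page):

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two tables; the values are made up for illustration.
html = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
<table>
  <tr><th>City</th><th>Visits</th></tr>
  <tr><td>Paris</td><td>3</td></tr>
</table>
"""

# read_html parses every <table> and returns a list of DataFrames.
frames = pd.read_html(StringIO(html))
print(len(frames))
print(frames[0])
```

Wrapping the markup in `StringIO` avoids the deprecation warning that recent pandas versions raise when a literal HTML string is passed directly.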

For more complex parsing, we can use the `BeautifulSoup` library. Let's extract the same table again, this time with `BeautifulSoup`.

In [None]:
soup = BeautifulSoup(result.content, 'lxml')  # parse the raw HTML bytes with the lxml parser
str(soup)[:500]

Find every `<table>` element on the page.

In [None]:
tables = soup.find_all('table')
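
`find_all` returns a list-like `ResultSet` of `Tag` objects, so it helps to check how many tables there are and what distinguishes them before picking one by position. A small sketch on a made-up snippet; the `class` names are assumptions, and `html.parser` is used here only so the example runs without `lxml`:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with two tables.
html = """
<table class="infobox"><tr><td>sidebar</td></tr></table>
<table class="wikitable sortable"><tr><th>State</th></tr><tr><td>France</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")

print(len(tables))
for i, table in enumerate(tables):
    # Multi-valued attributes such as class come back as a list of strings.
    print(i, table.get("class"))

# find_all can also filter by CSS class directly.
wikitables = soup.find_all("table", class_="wikitable")
```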

Using the `read_html` function of `pandas`, read the first table into a dataframe.

In [None]:
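from io import StringIO  # recent pandas versions expect a file-like object, not a raw string
pd.read_html(StringIO(str(tables[0])))[0].head(20)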

As we can see, the data we get back isn't always clean, which is why a proper API, when one exists, is nicer than parsing HTML. Still, wrapping these steps in a function will save us a lot of repetition.

In [None]:
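from io import StringIO

def dfFromURL(url, tableNumber=1):
    """Return the tableNumber-th HTML table (counting from 1) at url as a DataFrame."""
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.content, 'lxml')  # parse the raw HTML bytes
    tables = soup.find_all('table')
    # check that the 1-based table number exists on the page
    if not 1 <= tableNumber <= len(tables):
        raise IndexError(f'tableNumber={tableNumber}, but the page has {len(tables)} table(s)')
    return pd.read_html(StringIO(str(tables[tableNumber - 1])))[0]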
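
Note that `tableNumber` counts from 1, so the function subtracts 1 before indexing. To see that behavior without a network call, here is a sketch of the same logic applied to an HTML string. `df_from_html` is a hypothetical helper, not part of this notebook, and it reports an out-of-range number with an `IndexError`:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def df_from_html(html, table_number=1):
    """Return the table_number-th table (counting from 1) as a DataFrame."""
    tables = BeautifulSoup(html, "lxml").find_all("table")
    if not 1 <= table_number <= len(tables):
        raise IndexError(
            f"table_number={table_number}, but the page has {len(tables)} table(s)"
        )
    return pd.read_html(StringIO(str(tables[table_number - 1])))[0]


# Made-up HTML used only to exercise the helper.
sample = """
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<table><tr><th>b</th></tr><tr><td>2</td></tr></table>
"""
second = df_from_html(sample, 2)
print(second)
```

Separating the parsing from the download also makes the logic easy to check against saved pages.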

Now we can make a pretty simple call to get an HTML table as a dataframe. Let's try it.

In [None]:
prices = dfFromURL('https://finance.yahoo.com/quote/JPM/history?p=JPM')
prices.head()

We got some messy data here, with divs and some disclaimers at the bottom. Let's clean it up with a simple `dropna`.

In [None]:
prices = prices.dropna()
prices.head()
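
By default, `dropna` removes every row that has at least one missing value, so a footer row that leaves most columns empty disappears along with any genuinely incomplete rows. A sketch with a made-up frame (the numbers and the footer text are illustrative, not real quotes):

```python
import numpy as np
import pandas as pd

# Made-up rows: two price rows and one footer row that fills only one column.
raw = pd.DataFrame(
    {
        "Date": ["Jan 02", "Jan 03", "*Footer note"],
        "Open": [100.0, 101.5, np.nan],
        "Close": [101.0, 102.0, np.nan],
    }
)

clean = raw.dropna()  # default how="any": drop rows with any NaN
print(clean)

# A narrower alternative: only require the columns that matter.
clean_subset = raw.dropna(subset=["Open", "Close"])
```

The `subset` form is the safer choice when some columns are legitimately sparse and should not cause a row to be dropped.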

Cool! Now let's pull a table other than the first one from a page. Here we ask for the third table to see the Cavs' record over the last few seasons:

In [None]:
dfFromURL('http://www.espn.com/nba/team/_/name/cle/cleveland-cavaliers', 3)
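
Picking a table by position is brittle, since the number changes whenever the page layout does. `pd.read_html` also accepts a `match` argument that keeps only tables containing text that matches a string or regex. A sketch on made-up HTML (the table contents are assumptions, not ESPN's actual markup):

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; only the second mentions "Season".
html = """
<table><tr><th>Player</th><th>PTS</th></tr><tr><td>A. Example</td><td>20</td></tr></table>
<table><tr><th>Season</th><th>W</th><th>L</th></tr><tr><td>Year 1</td><td>10</td><td>5</td></tr></table>
"""

# match keeps only tables whose text matches the pattern.
records = pd.read_html(StringIO(html), match="Season")
print(len(records))
print(records[0])
```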