# Scraping with Pandas

In [1]:
import pandas as pd

We can use the `read_html` function in Pandas to scrape tabular data from a page: it parses every HTML `<table>` element it finds into a DataFrame.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'

In [3]:
tables = pd.read_html(url)
tables

[             State Abr. State-hood         Capital Capital since Area (mi²)  \
              State Abr. State-hood         Capital Capital since Area (mi²)   
 0          Alabama   AL       1819      Montgomery          1846     159.80   
 1           Alaska   AK       1959          Juneau          1906    2716.70   
 2          Arizona   AZ       1912         Phoenix          1889     517.60   
 3         Arkansas   AR       1836     Little Rock          1821     116.20   
 4       California   CA       1850      Sacramento          1854      97.90   
 5         Colorado   CO       1876          Denver          1867     153.30   
 6      Connecticut   CT       1788        Hartford          1875      17.30   
 7         Delaware   DE       1787           Dover          1777      22.40   
 8          Florida   FL       1845     Tallahassee          1824      95.70   
 9          Georgia   GA       1788         Atlanta          1868     133.50   
 10          Hawaii   HI       1959     

What we get in return is a list of DataFrames, one for each table that Pandas found on the page.

In [4]:
type(tables)

list
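
When a page contains several tables, it can help to inspect them before picking one. The sketch below is only an illustration: it parses a small hypothetical HTML snippet (not the Wikipedia page) and uses the `match` argument of `read_html` to keep only the tables whose text contains a given string. It assumes an HTML parser such as `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, used here instead of a live URL.
html = """
<table>
  <thead><tr><th>State</th><th>Capital</th></tr></thead>
  <tbody>
    <tr><td>Alabama</td><td>Montgomery</td></tr>
    <tr><td>Alaska</td><td>Juneau</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Key</th><th>Value</th></tr></thead>
  <tbody><tr><td>a</td><td>1</td></tr></tbody>
</table>
"""

# Every <table> element becomes one DataFrame in the returned list.
all_tables = pd.read_html(StringIO(html))
for position, table in enumerate(all_tables):
    print(position, table.shape, list(table.columns))

# match keeps only the tables whose text matches the given string or regex.
capital_tables = pd.read_html(StringIO(html), match="Capital")
```

Printing the shape and column names of each entry is usually enough to tell which position in the list holds the table of interest.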

We can pick out any of those DataFrames using normal list indexing.

In [None]:
df = tables[0]
df.columns = ['State', 'Abr.', 'State-hood Rank', 'Capital', 
              'Capital Since', 'Area (sq-mi)', 'Municipal Population', 'Metropolitan', 
              'Metropolitan Population', 'Population Rank', 'Notes']
df.head()
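
The printed output earlier repeats the header row, which may mean that Pandas read the table header as a two-level column index (a `MultiIndex`). As an alternative to typing the names by hand, the following minimal sketch checks for that case and keeps only the first header level. The frame here is a small hypothetical stand-in, not the scraped table.

```python
import pandas as pd

# Hypothetical stand-in for a scraped table: both header levels repeat the same labels.
columns = pd.MultiIndex.from_arrays([
    ["State", "Capital"],
    ["State", "Capital"],
])
sample = pd.DataFrame(
    [["Alabama", "Montgomery"], ["Alaska", "Juneau"]],
    columns=columns,
)

# Flatten only when the header really has more than one level.
if isinstance(sample.columns, pd.MultiIndex):
    sample.columns = sample.columns.get_level_values(0)

print(list(sample.columns))
```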

Clean up the extra rows at the top of the table

In [None]:
df = df.iloc[2:]
df.head()
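
Dropping rows by position assumes the stray rows are always the first two, which can vary with the Pandas version and the page layout. A more defensive sketch, assuming the unwanted rows simply repeat the header labels, filters on content instead. The frame below is a hypothetical stand-in for the scraped table.

```python
import pandas as pd

# Hypothetical sample in which the first two rows repeat the header labels.
sample = pd.DataFrame({
    "State": ["State", "State", "Alabama", "Alaska"],
    "Capital": ["Capital", "Capital", "Montgomery", "Juneau"],
})

# Mark rows whose State cell is just the column label, then keep the rest.
is_header_row = sample["State"] == "State"
cleaned = sample[~is_header_row].reset_index(drop=True)

print(cleaned)
```

If no rows repeat the header, the filter leaves the data unchanged, so it is safe to run either way.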

Set the index to the `State` column

In [None]:
df = df.set_index('State')
df.head()

In [None]:
df.loc['Alabama']
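
With the state names as the index, `.loc` can look up rows by label. The sketch below uses a small hypothetical frame to show three common forms: a whole row, a single cell, and a subset of rows and columns.

```python
import pandas as pd

# Hypothetical sample with the same kind of columns as the scraped table.
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Abr.": ["AL", "AK"],
    "Capital": ["Montgomery", "Juneau"],
})
indexed = sample.set_index("State")

# A single label returns that row as a Series.
row = indexed.loc["Alabama"]

# A row label plus a column label returns one cell.
capital = indexed.loc["Alaska", "Capital"]

# Lists of labels return a smaller DataFrame.
subset = indexed.loc[["Alabama", "Alaska"], ["Capital"]]
```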

## DataFrames as HTML

Pandas also has a `to_html` method that we can use to generate HTML tables from DataFrames.

In [None]:
html_table = df.to_html()
html_table

You may have to strip unwanted newlines to clean up the table.

In [None]:
html_table.replace('\n', '')
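
`to_html` also takes a few arguments that affect the generated markup. The sketch below, on a hypothetical two-row frame, drops the index column with `index=False`, adds CSS class names with `classes`, and then strips the newlines as above. The class names are only an example (they happen to match Bootstrap's table classes).

```python
import pandas as pd

# Hypothetical sample frame standing in for the scraped table.
sample = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Capital": ["Montgomery", "Juneau"],
})

# index=False omits the row labels; classes adds CSS classes to the <table> tag.
html = sample.to_html(index=False, classes="table table-striped")

# Remove newlines so the markup can be embedded as a single line.
compact = html.replace("\n", "")
print(compact)
```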

You can also save the table directly to a file.

In [None]:
df.to_html('table.html')

In [None]:
# macOS users can run this to open the file in a browser;
# on other systems, find the file manually and open it in a browser
!open table.html
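
The `open` command is specific to macOS. As a platform-independent sketch, Python's standard `webbrowser` module can open a local file through a `file://` URI. The helper names `write_table` and `open_in_browser` are hypothetical, and the sample frame and temporary directory are used only to keep the example self-contained.

```python
import tempfile
import webbrowser
from pathlib import Path

import pandas as pd


def write_table(frame: pd.DataFrame, path: Path) -> str:
    """Write the frame as an HTML table and return a file:// URI for it."""
    frame.to_html(path)
    return path.resolve().as_uri()


def open_in_browser(uri: str) -> bool:
    """Ask the default browser to open the URI; False means no browser was found."""
    return webbrowser.open(uri)


# Hypothetical sample frame, written to a temporary directory.
sample = pd.DataFrame({"State": ["Alabama"], "Capital": ["Montgomery"]})
output_path = Path(tempfile.mkdtemp()) / "table.html"
table_uri = write_table(sample, output_path)

# open_in_browser(table_uri)  # uncomment to open the file in the default browser
```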