# Scraping with Pandas

In [1]:
import pandas as pd

We can use the Pandas `read_html` function to scrape tabular data from a page automatically: it parses every HTML `<table>` element it finds.

In [2]:
url = 'https://animalcrossing.fandom.com/wiki/Bugs_(New_Horizons)'

In [5]:
tables = pd.read_html(url)
tables[3]

Unnamed: 0,Name,Image,Price,Location,Time,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,Common butterfly,,160,Flying,4 AM - 7 PM,✓,✓,✓,✓,✓,✓,-,-,✓,✓,✓,✓
1,Yellow butterfly,,160,Flying,4 AM - 7 PM,-,-,✓,✓,✓,✓,-,-,✓,✓,-,-
2,Tiger butterfly,,240,Flying,4 AM - 7 PM,-,-,✓,✓,✓,✓,✓,✓,✓,-,-,-
3,Peacock butterfly,,2500,Flying by Hybrid Flowers,4 AM - 7 PM,-,-,✓,✓,✓,✓,-,-,-,-,-,-
4,Common bluebottle,,300,Flying,4 AM - 7 PM,-,-,-,✓,✓,✓,✓,✓,-,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,Pill bug,,250,Hitting Rocks,11 PM - 4 PM,✓,✓,✓,✓,✓,✓,-,-,✓,✓,✓,✓
76,Centipede,,300,Hitting Rocks,4 PM - 11 PM,✓,✓,✓,✓,✓,✓,-,-,✓,✓,✓,✓
77,Spider,,600,Shaking Trees,7 PM - 8 AM,✓,✓,✓,✓,✓,✓,✓,✓,✓,✓,✓,✓
78,Tarantula,,8000,On the Ground,7 PM - 4 AM,✓,✓,✓,✓,-,-,-,-,-,-,✓,✓


What we get in return is a list of DataFrames, one for each table that Pandas found on the page.

In [6]:
type(tables)

list
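To see the list behaviour without hitting the network, here is a minimal sketch that runs `read_html` on a small inline HTML string. The HTML snippet is made up for illustration (its rows are borrowed from the output above), and it assumes an HTML parser such as `lxml` is installed. The `match` argument keeps only the tables whose text matches the given string, which can be more robust than remembering a position like `tables[3]`.

```python
from io import StringIO

import pandas as pd

# Hypothetical page containing two tables (illustration only).
html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Common butterfly</td><td>160</td></tr>
  <tr><td>Tiger butterfly</td><td>240</td></tr>
</table>
<table>
  <tr><th>Name</th><th>Location</th></tr>
  <tr><td>Spider</td><td>Shaking Trees</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> element it finds.
tables = pd.read_html(StringIO(html))
print(type(tables), len(tables))

# match= keeps only tables containing text that matches the string/regex.
spiders = pd.read_html(StringIO(html), match="Spider")
print(spiders[0])
```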

We can pick out any of those DataFrames using normal list indexing.

In [7]:
df = tables[3]
df.columns = ['Name','Image','Price','Location','Time','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
df.head()

Unnamed: 0,Name,Image,Price,Location,Time,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,Common butterfly,,160,Flying,4 AM - 7 PM,✓,✓,✓,✓,✓,✓,-,-,✓,✓,✓,✓
1,Yellow butterfly,,160,Flying,4 AM - 7 PM,-,-,✓,✓,✓,✓,-,-,✓,✓,-,-
2,Tiger butterfly,,240,Flying,4 AM - 7 PM,-,-,✓,✓,✓,✓,✓,✓,✓,-,-,-
3,Peacock butterfly,,2500,Flying by Hybrid Flowers,4 AM - 7 PM,-,-,✓,✓,✓,✓,-,-,-,-,-,-
4,Common bluebottle,,300,Flying,4 AM - 7 PM,-,-,-,✓,✓,✓,✓,✓,-,-,-,-


Set the index to the `Name` column so that rows can be looked up by bug name.

In [8]:
df.set_index('Name', inplace=True)
df.head()

Name,Image,Price,Location,Time,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
Common butterfly,,160,Flying,4 AM - 7 PM,✓,✓,✓,✓,✓,✓,-,-,✓,✓,✓,✓
Yellow butterfly,,160,Flying,4 AM - 7 PM,-,-,✓,✓,✓,✓,-,-,✓,✓,-,-
Tiger butterfly,,240,Flying,4 AM - 7 PM,-,-,✓,✓,✓,✓,✓,✓,✓,-,-,-
Peacock butterfly,,2500,Flying by Hybrid Flowers,4 AM - 7 PM,-,-,✓,✓,✓,✓,-,-,-,-,-,-
Common bluebottle,,300,Flying,4 AM - 7 PM,-,-,-,✓,✓,✓,✓,✓,-,-,-,-


In [9]:
df.loc['Peacock butterfly']

Image                            NaN
Price                           2500
Location    Flying by Hybrid Flowers
Time                     4 AM - 7 PM
Jan                                -
Feb                                -
Mar                                ✓
Apr                                ✓
May                                ✓
Jun                                ✓
Jul                                -
Aug                                -
Sep                                -
Oct                                -
Nov                                -
Dec                                -
Name: Peacock butterfly, dtype: object
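The index-then-lookup pattern can be tried on a small hand-built frame. This is only a sketch: the `bugs` frame below is typed in by hand from a few rows of the output above, and it uses the non-`inplace` form of `set_index`, which returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# A few rows copied by hand from the scraped table (illustration only).
bugs = pd.DataFrame({
    "Name": ["Common butterfly", "Peacock butterfly", "Tarantula"],
    "Price": [160, 2500, 8000],
    "Location": ["Flying", "Flying by Hybrid Flowers", "On the Ground"],
})

# set_index returns a new frame; `bugs` keeps its default integer index.
indexed = bugs.set_index("Name")

# One label returns that row as a Series.
print(indexed.loc["Peacock butterfly"])

# A label plus a column name returns a single value.
print(indexed.loc["Tarantula", "Price"])

# A list of labels returns several rows at once.
print(indexed.loc[["Common butterfly", "Tarantula"], "Price"])
```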

## DataFrames as HTML

Pandas DataFrames also have a `to_html` method that we can use to generate HTML tables.

In [None]:
html_table = df.to_html()
html_table

You may have to strip unwanted newlines to clean up the table.

In [None]:
html_table.replace('\n', '')
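As a self-contained sketch of the two steps above, the snippet below builds a tiny frame by hand (rows taken from the scraped output, with the empty `Image` column represented as NaN), renders it with `to_html`, and strips the newlines. The `na_rep` argument is optional; it is used here on the assumption that blank cells are preferable to the text `NaN` in the rendered table.

```python
import pandas as pd

# Tiny hand-built frame for illustration; Image is empty, as in the scrape.
demo = pd.DataFrame({
    "Name": ["Pill bug", "Centipede"],
    "Image": [float("nan"), float("nan")],
    "Price": [250, 300],
}).set_index("Name")

# With no path argument, to_html returns the markup as a string.
html_table = demo.to_html(na_rep="")

# The string contains a newline after every tag; remove them for a
# single-line version that is easier to embed elsewhere.
compact = html_table.replace("\n", "")
print(compact)
```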

You can also save the table directly to a file.

In [None]:
df.to_html('table.html')

In [None]:
# macOS users can run this to open the file in a browser;
# otherwise, find the file manually and open it in a browser
!open table.html
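The `open` command is macOS-specific. A portable alternative, sketched below, is Python's standard-library `webbrowser` module. The helper names `save_table` and `open_in_browser` are made up for this example, and the demo frame is typed in by hand from the scraped output.

```python
from pathlib import Path
import webbrowser

import pandas as pd


def save_table(df, path):
    """Write df to an HTML file and return a file:// URI pointing at it."""
    path = Path(path).resolve()
    df.to_html(str(path))
    return path.as_uri()


def open_in_browser(uri):
    """Ask the system's default browser to open the given URI."""
    return webbrowser.open(uri)


# Hand-built frame for illustration only.
demo = pd.DataFrame({"Name": ["Spider", "Tarantula"], "Price": [600, 8000]})
uri = save_table(demo, "table.html")
print(uri)

# Uncomment to actually launch the browser:
# open_in_browser(uri)
```

Building the URI from a resolved absolute path avoids problems with relative paths, since browsers generally expect a full `file://` location.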