# Web scraping example: influenza burden data from the CDC web site

In [1]:
# This is the beautiful soup html parser
# https://pypi.org/project/beautifulsoup4/
from bs4 import BeautifulSoup

# This is an http client to fetch the page
# https://requests.readthedocs.io/en/master/
import requests as req

# Pandas
# https://pandas.pydata.org/docs/
import pandas as pd

# HTML parser needed by pandas.read_html()
# Note: if you pip install this and Jupyter Notebook still insists
# it is missing, try selecting Restart from the Kernel menu here.
import lxml

# Fetch the page
r = req.get( 'https://www.cdc.gov/flu/about/burden/2010-2011.html')

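# Instantiate the parser, naming lxml explicitly so that
# Beautiful Soup does not have to guess which parser to use
soup = BeautifulSoup(r.text, 'lxml')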
 
# New pandas dataframe to hold results of scraping this page
# https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
pdf = pd.DataFrame(index=range(10),columns=range(10))

# Debugging
print(pdf.size)
print(pdf.shape)

# Find all the tables.
# To understand this code, keep "view source" open on the page
# being scraped; that way you can see why this works.
for aTable in soup.find_all('table'):
    table_title = aTable.find('h3')

    # Pick the table we want by title; find() returns None for a
    # table with no <h3>, so skip those instead of crashing on .text.
    # Could also have picked it by position from soup.find_all('table').
    if table_title is not None and 'Estimated rates of influenza disease' in table_title.text:
        print(table_title.text)
        # Newer pandas versions prefer this string wrapped in io.StringIO
        pdf = pd.read_html(str(aTable))[0]

        # Just a data dump of the Pandas dataframe for now.
        print(pdf.head())

    


100
(10, 10)
Estimated rates of influenza disease outcomes, per 100,000 by age group — United States, 2010-2011 influenza season
  Unnamed: 0_level_0 Illness rate                       Medical visit rate  \
           Age group     Estimate              95% Cr I           Estimate   
0            0-4 yrs      13743.2  (11,319.6, 17,432.0)             9207.9   
1           5-17 yrs       8216.6  ( 6,686.1, 10,832.1)             4272.6   
2          18-49 yrs       5468.1   ( 4,537.7, 7,030.2)             2023.2   
3          50-64 yrs       8240.5  ( 6,858.4, 11,046.5)             3543.4   
4            65+ yrs       4521.1   ( 3,951.1, 5,948.4)             2531.8   

                       Hospitalization rate                 Mortality rate  \
              95% Cr I             Estimate        95% Cr I       Estimate   
0  (7,411.9, 11,935.4)                 95.8   (78.9, 121.5)            1.0   
1   (3,423.3, 5,712.9)                 22.5    (18.3, 29.7)            0.3   
2   (1,626.2

OK, we have a dataframe with the scraped table in it.  Notice how short the code was to do this.  Sure, I had to Google every freaking line, but it does work and it is not hard to understand, at least in retrospect.  That's pandas, and to some extent Python: short, cryptic, and very powerful.
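One detail in the table-selection loop is worth isolating: `find('h3')` returns `None` when a table has no `<h3>`, and reading `.text` from `None` raises an error. Here is a minimal sketch of picking a table by title with that case handled. The inline HTML is a made-up stand-in for the CDC page (its real structure may differ), `find_table_by_title` is a hypothetical helper name, and the standard-library `html.parser` is used here only to keep the sketch light.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables, only one has a title.
html = """
<table><tr><td>layout table, no title</td></tr></table>
<table>
  <caption><h3>Estimated rates of influenza disease outcomes</h3></caption>
  <tr><th>Age group</th><th>Estimate</th></tr>
  <tr><td>0-4 yrs</td><td>1.0</td></tr>
</table>
"""


def find_table_by_title(soup, phrase):
    """Return the first <table> whose <h3> text contains phrase, else None."""
    for table in soup.find_all("table"):
        title = table.find("h3")
        # find() returns None when this table has no <h3> at all
        if title is not None and phrase in title.get_text():
            return table
    return None


soup = BeautifulSoup(html, "html.parser")
match = find_table_by_title(soup, "Estimated rates of influenza disease")
missing = find_table_by_title(soup, "No such title")

print(match.find("h3").get_text())
print(missing)
```

Returning `None` when nothing matches lets the caller decide what to do, which is handy when a different season's page is laid out differently.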

(In passing, I note that the flu in the 2010-2011 season was both different from and similar to COVID-19.  A big difference is the total mortality, but also note the large number of young children affected, albeit with a very low death rate, thankfully.  And the death rate among the elderly is really scary, similar to or worse than COVID-19.)

Now we need to pick out just the data we want from the dataframe.

This shows the column index that was automagically created by the pandas read_html() function:

In [22]:
#pdf.index
pdf.columns

MultiIndex(levels=[['Hospitalization rate', 'Illness rate', 'Medical visit rate', 'Mortality rate', 'Unnamed: 0_level_0'], ['95% Cr I', 'Age group', 'Estimate']],
           codes=[[4, 1, 1, 2, 2, 0, 0, 3, 3], [1, 2, 0, 2, 0, 2, 0, 2, 0]])

Cool, but honestly I don't completely understand what I am looking at.  (For instance, what do the codes represent?) Instead, let's pick out some columns of interest using plain vanilla subscripts. First the 0-4 years death rate:

In [49]:
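# Row 0, column 7 by position. The chained form pdf.loc[0][7] relies on
# a positional fallback that newer pandas versions deprecate.
pdf.iloc[0, 7]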

1.0

So ignoring the MultiIndex stuff for now, the dataframe can be treated like a 2-D array, using the numeric subscripts as shown in the output from pdf.head() before.  Looking there, we can see that row 0 has the data for the 0-4 years cohort.  And column 7 has the mortality rate.
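On the earlier question about the codes: each code is a position into the matching `levels` list, so the pair of codes at a given column position spells out that column's two-part label. The sketch below rebuilds the index from the values printed in the `pdf.columns` output and decodes it by hand. The variable names are just for illustration.

```python
import pandas as pd

# Levels and codes copied from the pdf.columns output above
levels = [
    ["Hospitalization rate", "Illness rate", "Medical visit rate",
     "Mortality rate", "Unnamed: 0_level_0"],
    ["95% Cr I", "Age group", "Estimate"],
]
codes = [
    [4, 1, 1, 2, 2, 0, 0, 3, 3],
    [1, 2, 0, 2, 0, 2, 0, 2, 0],
]
cols = pd.MultiIndex(levels=levels, codes=codes)

# Decode by hand: code i in level k stands for levels[k][i]
decoded = [(levels[0][a], levels[1][b]) for a, b in zip(*codes)]
for position, label in enumerate(decoded):
    print(position, label)

# pandas produces the same labels when the index is iterated
same_as_pandas = decoded == list(cols)

# Position of the mortality estimate among the columns
mortality_position = list(cols).index(("Mortality rate", "Estimate"))
print(same_as_pandas, mortality_position)
```

This is why column 7 holds the mortality rate: position 7 decodes to `('Mortality rate', 'Estimate')`. The same tuple can be used as a column label, as in `pdf[('Mortality rate', 'Estimate')]`, if you prefer names to numbers.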

# Here's what you need to do:

1. Select another flu year from the CDC web site by changing the year in the URL in the fetch above.  Pick a year other than the one I am using, and post a claim in the discussion for the year you select.
2. Run the web scraping code in the cells above on that page.
3. Demonstrate code that will pick out the mortality rates for each age group.
4. Put the rates you grab into a Python list or dictionary (you needn't create another dataframe unless you just want to).
5. Post a screenshot of your added code from your Notebook in the discussion forum.
6. Google, work together, ask questions of me or peers as appropriate.
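
As a starting point for steps 1, 3 and 4, here is a rough sketch rather than a finished answer. It assumes other seasons' pages follow the same URL pattern as the 2010-2011 one, which you should check in a browser. `burden_url` and `demo` are made-up names, and `demo` is a small hand-built stand-in that reuses the age groups and illness-rate estimates printed earlier, so it runs without touching the network. Adapting it to the mortality column of your own scraped dataframe is the exercise.

```python
import pandas as pd


def burden_url(season):
    """Build the page URL for a season string such as '2010-2011'.

    Assumption: every season uses the same URL pattern as 2010-2011.
    """
    return f"https://www.cdc.gov/flu/about/burden/{season}.html"


# Stand-in for the scraped frame: two of its columns, with the
# illness-rate estimates copied from the pdf.head() output above.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Age group"),
    ("Illness rate", "Estimate"),
])
demo = pd.DataFrame(
    [["0-4 yrs", 13743.2],
     ["5-17 yrs", 8216.6],
     ["18-49 yrs", 5468.1],
     ["50-64 yrs", 8240.5],
     ["65+ yrs", 4521.1]],
    columns=columns,
)

# Select each column by its two-part label, then pair them up
ages = demo[("Unnamed: 0_level_0", "Age group")]
rates = demo[("Illness rate", "Estimate")]
rate_by_age = dict(zip(ages, rates))

print(burden_url("2010-2011"))
print(rate_by_age)
```

If you would rather use positions than labels, `demo.iloc[:, 1]` selects the same column as `demo[("Illness rate", "Estimate")]` in this stand-in.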