# Michigan Covid Scraper - using BeautifulSoup
Here's an approach using BeautifulSoup that avoids having to find the end of the table by hand.

In [1]:
import requests
import pandas as pd
import bs4

In [2]:
url='https://www.michigan.gov/coronavirus/0,9753,7-406-98163-520743--,00.html'

Get the text of the page using requests.

In [3]:
page_text = requests.get(url).text
# page_text
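
If the request fails, the one-liner above will happily hand an error page to the parser. A slightly more defensive fetch might look like the sketch below. The helper name `fetch_page_text` and the 10-second default timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests


def fetch_page_text(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    `timeout` (in seconds) is an arbitrary default chosen for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this in place, `page_text = fetch_page_text(url)` would be a drop-in replacement for the cell above.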

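Use BeautifulSoup to parse the text into a soup object. Naming the parser explicitly avoids the warning BeautifulSoup emits when it has to guess one.

In [4]:
page_soup = bs4.BeautifulSoup(page_text, 'html.parser')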

Since it looks like there is only one table on the page, we'll just grab the first table.

In [5]:
stats_table = page_soup.table
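
`page_soup.table` is shorthand for `page_soup.find('table')`, so it always returns the first table in the document. If the page ever gains a second table, one option is to pick the table by its caption. The sketch below uses a made-up two-table page (the HTML and the caption text are hypothetical) to show both lookups.

```python
import bs4

# Hypothetical miniature page with two tables, for illustration only.
html = """
<table><caption>First</caption><tr><td>a</td></tr></table>
<table><caption>Second</caption><tr><td>b</td></tr></table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Attribute access returns the first <table> in the document.
first_table = soup.table

# Select a table by caption text instead of by position.
wanted = None
for table in soup.find_all("table"):
    caption = table.find("caption")
    if caption is not None and "Second" in caption.get_text():
        wanted = table
        break
```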

Check that we got what we think we got. Yep.

In [6]:
stats_table.contents

['\n',
 <caption><strong>Overall Confirmed COVID-19 Cases by County</strong></caption>,
 '\n',
 <thead>
 </thead>,
 '\n',
 <tbody>
 <tr>
 <td nowrap="nowrap" style="width: 251px;"><strong>  County</strong></td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"><strong>Cases</strong></td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"><strong>Deaths</strong></td>
 </tr>
 <tr>
 <td nowrap="nowrap" style="width: 251px;">  Allegan</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;">1</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"> </td>
 </tr>
 <tr>
 <td nowrap="nowrap" style="width: 251px;">  Barry</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;">1</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"> </td>
 </tr>
 <tr>
 <td nowrap="nowrap" style="width: 251px;">  Bay</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;">2</td>
 <td nowrap="nowrap" style="width: 88px; text-align: cen

Get all the rows of the table. The result is a list.

In [7]:
stats_table_rows = stats_table.find_all('tr')
stats_table_rows[:5]

[<tr>
 <td nowrap="nowrap" style="width: 251px;"><strong>  County</strong></td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"><strong>Cases</strong></td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"><strong>Deaths</strong></td>
 </tr>,
 <tr>
 <td nowrap="nowrap" style="width: 251px;">  Allegan</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;">1</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"> </td>
 </tr>,
 <tr>
 <td nowrap="nowrap" style="width: 251px;">  Barry</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;">1</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"> </td>
 </tr>,
 <tr>
 <td nowrap="nowrap" style="width: 251px;">  Bay</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;">2</td>
 <td nowrap="nowrap" style="width: 88px; text-align: center;"> </td>
 </tr>,
 <tr>
 <td nowrap="nowrap" style="width: 251px;">  Berrien</td>
 <td nowrap="nowrap" style="width: 88px; te

Now we can just iterate over the list of rows. First create an empty master list, then for each row:

* find all the table data (`td`) elements
* use a list comprehension to build a list of the cell values in the row. Note the use of `.strip()` to clean the county name and `.replace(",","")` to drop thousands separators.
* do the string-to-number conversions, treating blank cells as 0
* as long as the resulting list isn't empty, append it to our master list

In [8]:
stats_data_rows = []
for tr in stats_table_rows:
    tds = tr.find_all('td')
    # Strip whitespace and remove thousands separators from each cell
    row = [td.text.strip().replace(",", "") for td in tds]
    # Skip rows with no <td> cells before indexing into them
    if not row:
        continue
    # Convert Cases and Deaths to int (the first row kept is the header, so skip it)
    if len(stats_data_rows) > 0:
        for i in range(1, 3):
            # Blank cells mean zero
            row[i] = int(row[i]) if row[i] else 0
    stats_data_rows.append(row)

In [9]:
stats_data_rows

[['County', 'Cases', 'Deaths'],
 ['Allegan', 1, 0],
 ['Barry', 1, 0],
 ['Bay', 2, 0],
 ['Berrien', 8, 0],
 ['Calhoun', 4, 0],
 ['Charlevoix', 3, 0],
 ['Chippewa', 1, 0],
 ['Clare', 1, 0],
 ['Clinton', 5, 0],
 ['Detroit City', 563, 8],
 ['Eaton', 3, 0],
 ['Emmet', 2, 0],
 ['Genesee', 34, 0],
 ['Gladwin', 2, 0],
 ['Grand Traverse', 3, 0],
 ['Hillsdale', 1, 0],
 ['Ingham', 15, 0],
 ['Isabella', 2, 0],
 ['Jackson', 6, 0],
 ['Kalamazoo', 3, 0],
 ['Kalkaska', 1, 0],
 ['Kent', 31, 1],
 ['Lapeer', 1, 0],
 ['Leelanau', 1, 0],
 ['Livingston', 13, 0],
 ['Macomb', 225, 3],
 ['Manistee', 1, 0],
 ['Midland', 5, 0],
 ['Monroe', 12, 0],
 ['Montcalm', 1, 0],
 ['Muskegon', 3, 0],
 ['Newaygo', 2, 0],
 ['Oakland', 428, 4],
 ['Otsego', 5, 0],
 ['Ottawa', 15, 0],
 ['Roscommon', 1, 0],
 ['Saginaw', 8, 0],
 ['St. Clair', 8, 0],
 ['Tuscola', 1, 0],
 ['Washtenaw', 50, 3],
 ['Wayne', 310, 5],
 ['Wexford', 1, 0],
 ['Out of State', 6, 0],
 ['Not Reported', 2, 0],
 ['Total', 1791, 24]]
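
The same row-parsing idea can be tried offline on a tiny table. The sketch below is self-contained: the HTML, the county names `Alpha` and `Beta`, their values, and the helper `parse_count` are all invented for illustration. It uses `get_text(strip=True)` rather than `.text.strip()`, which is an alternative spelling and not what the cell above does.

```python
import bs4

# Hypothetical three-row table mimicking the layout of the real page.
html = """
<table><tbody>
<tr><td><strong> County</strong></td><td><strong>Cases</strong></td><td><strong>Deaths</strong></td></tr>
<tr><td> Alpha</td><td>1</td><td> </td></tr>
<tr><td> Beta</td><td>1,310</td><td>5</td></tr>
</tbody></table>
"""


def parse_count(text):
    """Convert a cell such as '1,310' or '' to an int, treating blank as 0."""
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned else 0


table = bs4.BeautifulSoup(html, "html.parser").table
header, *body = table.find_all("tr")

# The first row holds the column names.
columns = [td.get_text(strip=True) for td in header.find_all("td")]

rows = []
for tr in body:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # ignore rows without any <td> cells
        rows.append([cells[0]] + [parse_count(c) for c in cells[1:]])
```

Keeping the conversion in a small function makes the blank-means-zero rule easy to test on its own.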

In [10]:
stats = pd.DataFrame(stats_data_rows[1:], columns = stats_data_rows[0])
stats.head(50)

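| | County | Cases | Deaths |
|---|---|---|---|
| 0 | Allegan | 1 | 0 |
| 1 | Barry | 1 | 0 |
| 2 | Bay | 2 | 0 |
| 3 | Berrien | 8 | 0 |
| 4 | Calhoun | 4 | 0 |
| 5 | Charlevoix | 3 | 0 |
| 6 | Chippewa | 1 | 0 |
| 7 | Clare | 1 | 0 |
| 8 | Clinton | 5 | 0 |
| 9 | Detroit City | 563 | 8 |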


Add an indicator column for the counties in our market. Avoid spaces in column names.

In [11]:
stats['InMarket'] = 0
market_counties = ['Detroit City','Lapeer','Macomb','Monroe','Oakland','Genesee','Wayne','St. Clair']
stats.loc[stats['County'].isin(market_counties),['InMarket']] = 1
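
An equivalent one-step way to build the indicator is to convert the boolean mask from `isin()` directly to integers. The sketch below uses a small stand-in frame named `demo` (a hypothetical name) and a shortened market list so that it runs on its own.

```python
import pandas as pd

# Small stand-in for `stats`, using three counties from the table above.
demo = pd.DataFrame({"County": ["Macomb", "Kent", "Wayne"], "Cases": [225, 31, 310]})
market_counties = ["Macomb", "Wayne"]  # shortened list for the example

# isin() returns a boolean Series; astype(int) maps True/False to 1/0.
demo["InMarket"] = demo["County"].isin(market_counties).astype(int)
```

Either form gives the same column. The `.loc` version is handy when the default value is something other than 0.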

In [12]:
stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   County    45 non-null     object
 1   Cases     45 non-null     int64 
 2   Deaths    45 non-null     int64 
 3   InMarket  45 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 1.5+ KB


The warning you are getting from your version of the following statement is due to the deprecation of indexing a `GroupBy` object with a bare `[col1, col2]`. The indexer brackets need to contain a list of columns: `[[col1, col2]]`.

In [13]:
stats.groupby(['InMarket'])[['Cases','Deaths']].sum()

| InMarket | Cases | Deaths |
|---|---|---|
| 0 | 2001 | 28 |
| 1 | 1581 | 20 |
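
One thing to watch: the scraped table ends with a `Total` row, and that row gets `InMarket` 0, so the out-of-market sums above include the statewide totals (2001 = 210 + 1791 cases, 28 = 4 + 24 deaths). A possible fix is to filter out aggregate rows before grouping. The sketch below uses a small stand-in frame, and the name `summary_rows` and its contents are assumptions. Whether `Out of State` and `Not Reported` should also be dropped depends on what you want to count.

```python
import pandas as pd

# Small stand-in for `stats`; the Total row is the sum of the two county rows.
demo = pd.DataFrame({
    "County": ["Macomb", "Kent", "Total"],
    "Cases": [225, 31, 256],
    "Deaths": [3, 1, 4],
    "InMarket": [1, 0, 0],
})

# Assumption: rows named here are aggregates rather than individual counties.
summary_rows = ["Total"]

by_market = (
    demo[~demo["County"].isin(summary_rows)]  # drop aggregate rows first
    .groupby("InMarket")[["Cases", "Deaths"]]  # note the list inside the brackets
    .sum()
)
```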
