
# Scrape WashingtonDC - County (Ward) Dropoff Locations

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
url = 'https://dcboe.org/Voters/Where-to-Vote/Mail-Ballot-Drop-Sites'

In [3]:
res = requests.get(url)

In [4]:
res.status_code

200

In [5]:
soup = BeautifulSoup(res.text, "html.parser")

# Analyze the format of the text

- The list is short: DC has 8 wards.
- The entire list sits inside the element with class="article dcboeContent".
- The page does not show zip codes with the addresses, but each address links to Google Maps, and the zip code can be scraped from that link's URL.

In [6]:
type(soup.text)

str

In [7]:
soup.find_all('strong')

[<strong>You may drop off your voted mail-in-ballot in <u>ANY Ballot Drop Box at ANY time,</u> before 8:00 pm on Election Day, November 3, 2020.</strong>,
 <strong>WARD 1</strong>,
 <strong><span style="color:#FF0000;">WARD 2</span></strong>,
 <strong><span style="color:#FF0000;">WARD 3</span></strong>,
 <strong><span style="color:#FF0000;">WARD 4</span></strong>,
 <strong><span style="color:#FF0000;">WARD 5</span></strong>,
 <strong><span style="color:#FF0000;">WARD 6</span></strong>,
 <strong><span style="color:#FF0000;">WARD 7</span></strong>,
 <strong><span style="color:#FF0000;">WARD 8</span></strong>]

In [8]:
divs = soup.find_all(class_='article dcboeContent')

In [9]:
len(divs)

1

In [10]:
divs[0].find_all('strong')

[<strong>You may drop off your voted mail-in-ballot in <u>ANY Ballot Drop Box at ANY time,</u> before 8:00 pm on Election Day, November 3, 2020.</strong>,
 <strong>WARD 1</strong>,
 <strong><span style="color:#FF0000;">WARD 2</span></strong>,
 <strong><span style="color:#FF0000;">WARD 3</span></strong>,
 <strong><span style="color:#FF0000;">WARD 4</span></strong>,
 <strong><span style="color:#FF0000;">WARD 5</span></strong>,
 <strong><span style="color:#FF0000;">WARD 6</span></strong>,
 <strong><span style="color:#FF0000;">WARD 7</span></strong>,
 <strong><span style="color:#FF0000;">WARD 8</span></strong>]

In [11]:
len(divs[0].find_all('a'))

55

In [12]:
divs[0].find_all('a')[:3]

[<a href="https://www.google.com/maps/place/2000+14th+St+NW,+Washington,+DC+20009/@38.9175589,-77.0346619,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7e792f6bfa9:0xdc2a31827f13c8eb!8m2!3d38.9175547!4d-77.0324732" target="_blank">2000 14<sup>th</sup> Street, NW</a>,
 <a href="https://www.google.com/maps/place/3160+16th+St+NW,+Washington,+DC+20010/@38.9305317,-77.0394299,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7c82113d40607:0xd9f7166dc091fe8a!8m2!3d38.9305275!4d-77.0372412" target="_blank">3160 16<sup>th</sup> Street, NW</a>,
 <a href="https://www.google.com/maps/place/3100+14th+St+NW,+Washington,+DC+20010/@38.9295597,-77.0354266,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7c821d704d1c3:0x413cdcbe553f9797!8m2!3d38.9295555!4d-77.0332379" target="_blank">3100 14<sup>th</sup> Street, NW</a>]

In [13]:
print(divs[0].text[:500])


Mail-in-Ballot Drop Box Locations
You may drop off your voted mail-in-ballot in ANY Ballot Drop Box at ANY time, before 8:00 pm on Election Day, November 3, 2020.

WARD 1

Frank D. Reeves Center
2000 14th Street, NW
 
Mt. Pleasant Library
3160 16th Street, NW
 
Columbia Heights Shopping Center
3100 14th Street, NW
 
Sun Trust Bank (Park Area)
1800 Columbia Road, NW
 
Banneker Community Center
2500 Georgia Avenue, NW

WARD 2

Georgetown Library
3260 R Street, NW
 
Martin Luther King 


In [14]:
for x in enumerate(divs[0].text.split('WARD')):
    print(x)

(0, '\nMail-in-Ballot Drop Box Locations\nYou may drop off your voted mail-in-ballot in\xa0ANY Ballot Drop Box at ANY time,\xa0before 8:00 pm on Election Day, November 3, 2020.\n\n')
(1, ' 1\n\r\nFrank D. Reeves Center\n2000 14th Street, NW\r\n\xa0\r\nMt. Pleasant Library\n3160 16th Street, NW\r\n\xa0\r\nColumbia Heights Shopping Center\n3100 14th Street, NW\r\n\xa0\r\nSun Trust Bank (Park Area)\n1800 Columbia Road, NW\r\n\xa0\r\nBanneker Community Center\n2500 Georgia Avenue, NW\n\n')
(2, ' 2\n\r\nGeorgetown Library\n3260 R Street, NW\r\n\xa0\r\nMartin Luther King Jr. Library\n901 G Street, NW\r\n\xa0\r\nWest End Library\n2301 L Street, NW\r\n\xa0\r\nStead Recreation Center\n1625 P Street, NW\n\r\nFoggy Bottom/GWU Metro (Available October 8, 2020)\n2301 I Street, NW\n\n')
(3, ' 3\n\r\nGuy Mason Recreation Center\n3600 Calvert Street, NW\n\r\nChevy Chase Library\n5625 Connecticut Avenue, NW\r\n\xa0\r\nTenley-Friendship Library\n4450 Wisconsin Avenue, NW\r\n\xa0\r\nCleveland Park Librar

In [15]:
divs[0].text.split('WARD')[1].splitlines()

[' 1',
 '',
 'Frank D. Reeves Center',
 '2000 14th Street, NW',
 '\xa0',
 'Mt. Pleasant Library',
 '3160 16th Street, NW',
 '\xa0',
 'Columbia Heights Shopping Center',
 '3100 14th Street, NW',
 '\xa0',
 'Sun Trust Bank (Park Area)',
 '1800 Columbia Road, NW',
 '\xa0',
 'Banneker Community Center',
 '2500 Georgia Avenue, NW',
 '']

In [16]:
divs[0].text.split('WARD')[5].splitlines()[2]

'Woodridge Library'

In [17]:
divs[0].text.split('WARD')[8].splitlines()

[' 8',
 '',
 'Anacostia Library',
 '1800 Good Hope Road SE',
 '\xa0',
 'Parklands-Turner Library',
 '1547 Alabama Avenue, SE',
 '\xa0',
 'Bellevue (William O. Lockridge) Library',
 '115 Atlantic Street, SW',
 '\xa0',
 'Seventh District Police Station',
 '2455 Alabama Avenue, SE',
 '\xa0',
 'The ARC',
 '1901 Mississippi Avenue, SE',
 '\xa0',
 'Department of Human Services',
 '2100 Martin Luther King Jr. Avenue, SE',
 '\xa0',
 'Hendley Elementary School',
 '425 Chesapeake Street, SE',
 '\xa0',
 'Patterson Elementary School',
 '4399 South Capitol Terrace, SW',
 '',
 'Fort Stanton Recreation Center (Available October 8, 2020)',
 '1812 Erie Street, SE',
 '\xa0',
 '',
 '',
 '']

In [18]:
for x in range(2, len(divs[0].text.split('WARD')[5].splitlines()), 3):
    print(x)

2
5
8
11
14
17
20
23

In [19]:
divs[0].text.split('WARD')[1].splitlines()[3]

'2000 14th Street, NW'
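
The outputs above show that each entry occupies three lines (name, address, spacer), which is what makes a fixed stride of 3 workable on this page. A minimal sketch of an alternative that does not depend on the spacer lines, assuming only that names and addresses alternate once blank and non-breaking-space lines are dropped. `parse_ward_block` is a hypothetical helper, not part of the notebook:

```python
def parse_ward_block(block):
    """Return (ward_number, [(location, address), ...]) for one 'WARD' chunk.

    Assumes the chunk starts with the ward number and that, once empty and
    non-breaking-space lines are dropped, names and addresses alternate.
    """
    # str.strip() with no arguments also removes '\xa0', so spacer lines become ''
    lines = [ln.strip() for ln in block.splitlines()]
    lines = [ln for ln in lines if ln]
    ward_number, entries = lines[0], lines[1:]
    # Pair every even-indexed line (name) with the line after it (address)
    pairs = list(zip(entries[0::2], entries[1::2]))
    return ward_number, pairs


# Sample text copied from the Ward 1 output above, shortened to two entries
sample = (' 1\n\r\nFrank D. Reeves Center\n2000 14th Street, NW\r\n\xa0\r\n'
          'Mt. Pleasant Library\n3160 16th Street, NW\n\n')
ward, pairs = parse_ward_block(sample)
print(ward, pairs)
```

If an entry were ever missing its address line, the pairing would shift for the rest of that ward, so this sketch trades one assumption (fixed spacing) for another (strict alternation).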

# Scrape county, location, address
**Assign city, state, website and hours**

In [20]:
all_data = []
# Index 0 of the split is the page header; indices 1-8 are the ward sections
ward_sections = divs[0].text.split('WARD')
for ward_num in range(1, 9):
    lines = ward_sections[ward_num].splitlines()
    # The first line is ' N', so character 1 is the ward number
    county = 'Ward {ward}'.format(ward=lines[0][1])
    # Entries repeat every three lines (name, address, spacer), starting at index 2
    for loc in range(2, len(lines), 3):
        if len(lines[loc]) > 0:
            location = lines[loc]
            address = lines[loc + 1]
            city = 'Washington'
            data = {
             'address_1': address,
             'state': 'DC',
             'county': county,
             'Field 23': np.nan,
             'location_type': location,
             'address_2':np.nan,
             'city': city,
             'state_2': np.nan,
             'zip': np.nan,
             'phone': '(202) 727-2525',
             'latitude': np.nan, 
             'longitude': np.nan, 
             'has_droppff_location': 'Yes', 
             'has_phone_number': 'Yes', 
             'county_website_url': 'https://dcboe.org/Voters/Where-to-Vote/Mail-Ballot-Drop-Sites', 
             'validate_url': 'https://dcboe.org/Voters/Where-to-Vote/Mail-Ballot-Drop-Sites', 
             'email': np.nan, 
             'fax': np.nan, 
             'social':np.nan, 
             'inactive':np.nan, 
             'hours':'Any ballot drop box at any time before 8:00 pm on Election Day, Nov 3, 2020.',
             'rules': np.nan,
             'notes': np.nan
            }
            all_data.append(data)

In [21]:
len(all_data)

55
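
Because the loop relies on line positions, a misaligned entry would silently put a location name in the address column. A rough sanity check, assuming every address on the page starts with a street number (true for the addresses shown in the outputs above, not verified for all 55). `check_records` and the sample rows are illustrative only:

```python
import re
from collections import Counter


def check_records(records):
    """Return (rows per ward, rows whose address lacks a leading street number)."""
    per_ward = Counter(r['county'] for r in records)
    suspicious = [r for r in records if not re.match(r'\d+\s', r['address_1'])]
    return per_ward, suspicious


# Two rows taken from the outputs above; the third has name and address
# deliberately swapped to show what a misaligned row looks like
sample_records = [
    {'county': 'Ward 1', 'location_type': 'Frank D. Reeves Center',
     'address_1': '2000 14th Street, NW'},
    {'county': 'Ward 8', 'location_type': 'The ARC',
     'address_1': '1901 Mississippi Avenue, SE'},
    {'county': 'Ward 8', 'location_type': '1812 Erie Street, SE',
     'address_1': 'Fort Stanton Recreation Center'},
]
per_ward, suspicious = check_records(sample_records)
print(per_ward)
print(suspicious)
```

Running the same function on `all_data` would give a per-ward count to compare against the page by eye.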

In [22]:
all_data[-1]

{'address_1': '1812 Erie Street, SE',
 'state': 'DC',
 'county': 'Ward 8',
 'Field 23': nan,
 'location_type': 'Fort Stanton Recreation Center (Available October 8, 2020)',
 'address_2': nan,
 'city': 'Washington',
 'state_2': nan,
 'zip': nan,
 'phone': '(202) 727-2525',
 'latitude': nan,
 'longitude': nan,
 'has_droppff_location': 'Yes',
 'has_phone_number': 'Yes',
 'county_website_url': 'https://dcboe.org/Voters/Where-to-Vote/Mail-Ballot-Drop-Sites',
 'validate_url': 'https://dcboe.org/Voters/Where-to-Vote/Mail-Ballot-Drop-Sites',
 'email': nan,
 'fax': nan,
 'social': nan,
 'inactive': nan,
 'hours': 'Any ballot drop box at any time before 8:00 pm on Election Day, Nov 3, 2020.',
 'rules': nan,
 'notes': nan}

# Create DataFrame to hold the data

In [23]:
all_data_df = pd.DataFrame(all_data)

In [24]:
all_data_df.tail()

Unnamed: 0,address_1,state,county,Field 23,location_type,address_2,city,state_2,zip,phone,...,has_phone_number,county_website_url,validate_url,email,fax,social,inactive,hours,rules,notes
50,"1901 Mississippi Avenue, SE",DC,Ward 8,,The ARC,,Washington,,,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,
51,"2100 Martin Luther King Jr. Avenue, SE",DC,Ward 8,,Department of Human Services,,Washington,,,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,
52,"425 Chesapeake Street, SE",DC,Ward 8,,Hendley Elementary School,,Washington,,,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,
53,"4399 South Capitol Terrace, SW",DC,Ward 8,,Patterson Elementary School,,Washington,,,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,
54,"1812 Erie Street, SE",DC,Ward 8,,Fort Stanton Recreation Center (Available Octo...,,Washington,,,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,


# Scrape zip code from Google Maps link

In [25]:
divs[0].find_all('a')[0]

<a href="https://www.google.com/maps/place/2000+14th+St+NW,+Washington,+DC+20009/@38.9175589,-77.0346619,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7e792f6bfa9:0xdc2a31827f13c8eb!8m2!3d38.9175547!4d-77.0324732" target="_blank">2000 14<sup>th</sup> Street, NW</a>

In [26]:
len(divs[0].find_all('a'))

55

In [27]:
divs[0].find_all('a')[0].text

'2000 14th Street, NW'

In [28]:
len(divs[0].find_all('a')[0])

3

In [29]:
type(divs[0].find_all('a')[0])

bs4.element.Tag

In [30]:
divs[0].find_all('a')[0].attrs

{'href': 'https://www.google.com/maps/place/2000+14th+St+NW,+Washington,+DC+20009/@38.9175589,-77.0346619,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7e792f6bfa9:0xdc2a31827f13c8eb!8m2!3d38.9175547!4d-77.0324732',
 'target': '_blank'}

In [31]:
divs[0].find_all('a')[0]['href']

'https://www.google.com/maps/place/2000+14th+St+NW,+Washington,+DC+20009/@38.9175589,-77.0346619,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7e792f6bfa9:0xdc2a31827f13c8eb!8m2!3d38.9175547!4d-77.0324732'

In [32]:
type(divs[0].find_all('a')[0]['href'])

str

In [33]:
divs[0].find_all('a')[0]['href'].split('+')

['https://www.google.com/maps/place/2000',
 '14th',
 'St',
 'NW,',
 'Washington,',
 'DC',
 '20009/@38.9175589,-77.0346619,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7e792f6bfa9:0xdc2a31827f13c8eb!8m2!3d38.9175547!4d-77.0324732']

In [34]:
divs[0].find_all('a')[27]['href'].split('+')

['https://www.google.com/maps/place/1309',
 '5th',
 'St',
 'NE,',
 'Washington,',
 'DC',
 '20002/@38.908604,-76.9997057,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b810eb488625:0x62d428c6c430d84a!8m2!3d38.9085998!4d-76.997517?hl=en']

In [35]:
all_zips = []
for x in divs[0].find_all('a'):
    address = x.text
    parts = x['href'].split('+')
    # Reset on every link so an unmatched link cannot inherit the previous zip
    zip_code = np.nan
    try:
        # The zip codes on this page start with '20'; check index 6, then 7, then 9
        if parts[6][:2] == '20':
            zip_code = parts[6][:5]
        elif parts[7][:2] == '20':
            zip_code = parts[7][:5]
        elif parts[9][:2] == '20':
            zip_code = parts[9][:5]
    except IndexError:
        # The link has too few '+'-separated parts to hold a zip at these positions
        zip_code = np.nan
    data = {'address': address, 'zip': zip_code}
    all_zips.append(data)
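
The position of the zip within the `+`-separated parts depends on how many words the street address has, which is why the loop checks several indices. A sketch of a position-independent alternative using a regular expression, assuming each link contains `DC+<zip>` as the three URLs shown above do (other links on the page may differ). `zip_from_maps_url` is a hypothetical helper:

```python
import re
from urllib.parse import unquote_plus

# Five digits that follow the state abbreviation 'DC'
ZIP_AFTER_DC = re.compile(r'\bDC\s+(\d{5})\b')


def zip_from_maps_url(href):
    """Return the 5-digit zip that follows 'DC' in the URL, or None if absent."""
    # unquote_plus turns '+' into spaces: '...DC+20009/...' -> '...DC 20009/...'
    match = ZIP_AFTER_DC.search(unquote_plus(href))
    return match.group(1) if match else None


# First link from the page, with the trailing data segment shortened
href = ('https://www.google.com/maps/place/2000+14th+St+NW,+Washington,+DC+20009/'
        '@38.9175589,-77.0346619,17z')
print(zip_from_maps_url(href))   # 20009
print(zip_from_maps_url('https://www.google.com/maps/place/somewhere'))  # None
```

Returning `None` for a link without a zip keeps the missing case explicit, so it can be mapped to `np.nan` when the DataFrame is built.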

In [36]:
all_zips_df = pd.DataFrame(all_zips)

In [37]:
all_zips_df

Unnamed: 0,address,zip
0,"2000 14th Street, NW",20009.0
1,"3160 16th Street, NW",20010.0
2,"3100 14th Street, NW",20010.0
3,"1800 Columbia Road, NW",20009.0
4,"2500 Georgia Avenue, NW",20001.0
5,"3260 R Street, NW",20007.0
6,"901 G Street, NW",20001.0
7,"2301 L Street, NW",20037.0
8,"1625 P Street, NW",20036.0
9,"2301 I Street, NW",20037.0


In [38]:
# Positional assignment: relies on the 55 links being in the same order as the 55 scraped rows
all_data_df['zip'] = all_zips_df['zip']
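
Assigning the column by position works only while the links and the scraped rows stay in the same order. A minimal sketch of matching on the address text instead, using plain dictionaries rather than DataFrames. It assumes the link text equals the scraped address (true for the first entry shown above, not verified for all rows) and that no two locations share an address. `attach_zips` is a hypothetical helper:

```python
def attach_zips(records, zips):
    """Set 'zip' on each record by address text; return addresses with no match."""
    lookup = {z['address']: z['zip'] for z in zips}
    unmatched = []
    for rec in records:
        rec['zip'] = lookup.get(rec['address_1'])
        if rec['zip'] is None:
            unmatched.append(rec['address_1'])
    return unmatched


# The first address and zip come from the outputs above; the second record
# has no entry in the zip list, to show how a miss is reported
records = [{'address_1': '2000 14th Street, NW'},
           {'address_1': '1800 Good Hope Road SE'}]
zips = [{'address': '2000 14th Street, NW', 'zip': '20009'}]
unmatched = attach_zips(records, zips)
print(records)
print(unmatched)
```

Any address returned in `unmatched` points to a row where the link text and the scraped address differ and needs a manual look.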

In [39]:
all_data_df.tail()

Unnamed: 0,address_1,state,county,Field 23,location_type,address_2,city,state_2,zip,phone,...,has_phone_number,county_website_url,validate_url,email,fax,social,inactive,hours,rules,notes
50,"1901 Mississippi Avenue, SE",DC,Ward 8,,The ARC,,Washington,,20020,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,
51,"2100 Martin Luther King Jr. Avenue, SE",DC,Ward 8,,Department of Human Services,,Washington,,20020,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,
52,"425 Chesapeake Street, SE",DC,Ward 8,,Hendley Elementary School,,Washington,,20032,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,
53,"4399 South Capitol Terrace, SW",DC,Ward 8,,Patterson Elementary School,,Washington,,20032,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,
54,"1812 Erie Street, SE",DC,Ward 8,,Fort Stanton Recreation Center (Available Octo...,,Washington,,20020,(202) 727-2525,...,Yes,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,https://dcboe.org/Voters/Where-to-Vote/Mail-Ba...,,,,,Any ballot drop box at any time before 8:00 pm...,,


In [40]:
all_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   address_1             55 non-null     object 
 1   state                 55 non-null     object 
 2   county                55 non-null     object 
 3   Field 23              0 non-null      float64
 4   location_type         55 non-null     object 
 5   address_2             0 non-null      float64
 6   city                  55 non-null     object 
 7   state_2               0 non-null      float64
 8   zip                   54 non-null     object 
 9   phone                 55 non-null     object 
 10  latitude              0 non-null      float64
 11  longitude             0 non-null      float64
 12  has_droppff_location  55 non-null     object 
 13  has_phone_number      55 non-null     object 
 14  county_website_url    55 non-null     object 
 15  validate_url          55 

# Write DataFrame to csv

In [41]:
all_data_df.to_csv('WashingtonDC.csv')