### *Gathering information for Oregon County and the Dignity Memorial*

Bryan Brugal

PIT-DSC 2022

### Import libraries

In [42]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

In [43]:
# Get path to the repo directory
dir_path = "/".join(os.getcwd().split("/")[:-1])
print(dir_path)
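Splitting the working directory on "/" only works where paths use forward slashes, and gluing path pieces together by hand is easy to get wrong. As a hedged alternative, here is a minimal sketch with `pathlib`. It assumes the notebook's folder sits next to a `web-scraping` folder inside the repo, as the export steps later in this notebook suggest. The helper name `output_path` is hypothetical.

```python
from pathlib import Path


def output_path(repo_dir, *parts):
    """Join path components under repo_dir and create the parent folder if needed."""
    path = Path(repo_dir).joinpath(*parts)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Parent of the current working directory, regardless of operating system
repo_dir = Path.cwd().parent

# Hypothetical usage (left commented out so nothing is created on import):
# csv_path = output_path(repo_dir, "web-scraping", "oregon", "the-unclaimed.csv")
```

Creating the parent folder first matters because writing a CSV into a folder that does not exist raises an error.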




The first step is to fetch the data with the requests library. (If you don't have it, run `pip install requests` from the command line.) The burial records are served as a JSON file, so there is no HTML to parse in this step.

In [44]:
# URL of the JSON file that holds the indigent-burial records
url = 'https://projects.oregonlive.com/indigent-burials/indigent.json'
# Request the file; `page` holds the HTTP response
page = requests.get(url)

# print(page.json())

In [45]:
# Read the JSON file into a DataFrame
oregon = pd.read_json(url)
oregon.head()

Unnamed: 0,date,name
0,2000-01-01,"BEIGHTS, Karl Wesleu"
1,2000-01-01,"CATER, Jon Leslie"
2,2000-01-01,"CRAFT, Michael David"
3,2000-01-01,"DIETZ, Wendenlin"
4,2000-01-01,"DILKA, Richard Earl"


In [46]:
oregon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6554 entries, 0 to 6553
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    6554 non-null   datetime64[ns]
 1   name    6554 non-null   object        
dtypes: datetime64[ns](1), object(1)
memory usage: 102.5+ KB



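`pd.read_json` has already parsed the date column as `datetime64[ns]` (see the `info()` output above), so the `pd.to_datetime` call below is a safeguard: it guarantees a consistent datetime type before the dates are reformatted.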


In [47]:
oregon['date'] = pd.to_datetime(oregon['date'])

In [48]:
# Convert the dates to month-day-year strings (missing dates become '')
oregon['date'] = [d.strftime('%m-%d-%Y') if not pd.isnull(d) else '' for d in oregon['date']]
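To make the format codes concrete, here is a minimal sketch of the same conversion using only the standard library. It uses a plain `datetime` instead of a pandas Timestamp and `None` instead of `NaT`; both are stand-ins chosen for illustration. The helper name `to_month_day_year` is hypothetical.

```python
from datetime import datetime


def to_month_day_year(value):
    """Format a datetime as MM-DD-YYYY; return '' when the value is missing."""
    return value.strftime("%m-%d-%Y") if value is not None else ""


# Parse an ISO-style date like the ones in the source data, then reformat it
parsed = datetime.strptime("2000-01-01", "%Y-%m-%d")
formatted = to_month_day_year(parsed)
```

Keep in mind that the result is a string, not a date. Once converted, the column sorts alphabetically (month first) rather than chronologically.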

In [49]:
oregon.head()

Unnamed: 0,date,name
0,01-01-2000,"BEIGHTS, Karl Wesleu"
1,01-01-2000,"CATER, Jon Leslie"
2,01-01-2000,"CRAFT, Michael David"
3,01-01-2000,"DIETZ, Wendenlin"
4,01-01-2000,"DILKA, Richard Earl"


In [50]:
# Output collected data to the "web-scraping" folder
out_path = os.path.join(os.path.dirname(os.getcwd()), "web-scraping", "oregon", "the-unclaimed.csv")
oregon.to_csv(out_path, index=False, encoding='utf-8')



*** 
Beautiful Soup 4, a widely used Python library for web scraping, will be used to parse the HTML. If you don't already have it, run `pip install beautifulsoup4` from the command line.
***

In [51]:
r = requests.get('https://www.dignitymemorial.com/plan-funeral-cremation/veterans/homeless-veterans-program/21-homeless-vet-burials')

* The code above fetches the web page from the URL and stores the result in a "response" object called r. **The text attribute of that response object contains the same HTML that a web browser shows as the page source:**

In [52]:
# print the first 500 characters of the HTML
print(r.text[0:500])




<!DOCTYPE html>
<html lang="en" class="no-js">
<head>
    <meta charset="utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge"><script type="text/javascript">window.NREUM||(NREUM={});NREUM.info = {"beacon":"bam.nr-data.net","errorBeacon":"bam.nr-data.net","licenseKey":"eac7771f5a","applicationID":"1360028078","transactionName":"ZldUMkRSWUUEVBFdWV8dezB1HGRfEVIGW0RUcVkIQkFYWglSFxt/X1ZTHg==","queueTime":0,"applicationTime":677,"agent":"","atts":""}</script><script type="text/


The code below parses the HTML (stored in r.text) into a soup object, which is Beautiful Soup's tree representation of the page. In other words, Beautiful Soup reads the HTML and interprets its structure so that we can search it by tag.


In [53]:
soup = BeautifulSoup(r.text, 'html.parser')
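To make "interpreting the structure" more concrete, here is a rough sketch that collects the text of every `<h2>` element using only the standard library's `html.parser`. This is not how Beautiful Soup is implemented and is not a replacement for it. The class name `HeadingCollector` is hypothetical, and the sample HTML is a trimmed-down imitation of the page's markup.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the stripped text found inside each <h2> element."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self._parts = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._parts = []

    def handle_data(self, data):
        # Text inside nested tags (such as <a>) also arrives here
        if self._in_h2:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append("".join(self._parts).strip())
            self._in_h2 = False


sample_html = """
<h2><a class="basic-link" href="https://www.dignitymemorial.com/obituaries/phoenix-az/will-shegog-8179335">Will Melvin Shegog, U.S. Air Force</a></h2>
<p>March 24, 1960 – Feb. 15, 2019</p>
<h2>
Ronald Baker Thomas, U.S. Army
</h2>
"""

collector = HeadingCollector()
collector.feed(sample_html)
collector.close()
headings = collector.headings
```

The parser walks the document tag by tag, which is the same information Beautiful Soup organizes into a searchable tree.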

### Gathering all the data

We will now begin constructing our dataset by utilizing the patterns we identified in the webpage structure.

In [54]:
results = soup.find_all('h2')
# results

* This code searches the soup object for every `<h2>` tag and returns the matches in a special Beautiful Soup object called a ResultSet.

* results behaves like a Python list, so we can check its length:

In [55]:
len(results)

21

* The article lists 21 veterans, so 21 results is what we expect. (If the number didn't make sense, we would look more closely at the HTML to check whether our assumptions about its patterns were wrong.)

* For a closer look at the first three results, we can slice the object like a list:

In [56]:
results[:3]

[<h2><a class="basic-link" href="https://www.dignitymemorial.com/obituaries/phoenix-az/will-shegog-8179335" rel="noopener noreferrer" target="_blank">Will Melvin Shegog, U.S. Air Force</a></h2>,
 <h2><a class="basic-link" href="https://www.dignitymemorial.com/obituaries/knoxville-tn/ronnie-lundy-8141440" rel="noopener noreferrer" target="_blank">Ronnie Joe Lundy, U.S. Army</a></h2>,
 <h2><a class="basic-link" href="https://www.dignitymemorial.com/obituaries/lexington-sc/joseph-williams-8710549" rel="noopener noreferrer" target="_blank">Joseph Lorenzo Williams, U.S. Army</a></h2>]

We'll also make sure the last result in this object matches the last record in the article.

In [57]:
results[-1]

<h2>
Ronald Baker Thomas, U.S. Army
</h2>

**Pretty good looking!**

* All 21 entries have now been gathered, but to give the dataset some structure we still need to split each record into its three parts (name, organization, and date).

* To keep things simple, we'll start by working with only the first record in the results object, and then adapt the code to use a loop:

In [58]:
# Extracting the name
first_result = results[0]
first_result

<h2><a class="basic-link" href="https://www.dignitymemorial.com/obituaries/phoenix-az/will-shegog-8179335" rel="noopener noreferrer" target="_blank">Will Melvin Shegog, U.S. Air Force</a></h2>

Although first_result looks like a Python string, there are no quotation marks around it. It is actually another special Beautiful Soup object (called a "Tag") with its own methods and attributes.

* Since we want the text between the opening and closing tags, we can access its text attribute, which does return an ordinary Python string:

In [59]:
first_result.text
#     or
# first_result.find('a').text  (only works when the heading contains a link)

'Will Melvin Shegog, U.S. Air Force'

* Let's split this string on the comma and take the first element of the resulting list.

In [60]:
first_result.text.split(',')[0]

'Will Melvin Shegog'

### Create a for loop to put all the strings into a list
* The loop below repeats this procedure for all 21 results.

In [61]:
name_head = []
# for i in soup.find_all('h2'):
for i in results:
    title = i.text.split(',')[0].strip()
    name_head.append(title)
name_head

['Will Melvin Shegog',
 'Ronnie Joe Lundy',
 'Joseph Lorenzo Williams',
 'Danny Rollin Ballantyne',
 'George Charles Babcock',
 'James Miske',
 'Stephen Jerald Spicer',
 'Howard Nicholas Warren',
 'Charles Bradley Fox',
 'James David Ellis',
 'Frank Harmon Wilson',
 'Wesley Russell',
 'Arnold Martin Klechka',
 'James Michael Farrar',
 'Charles Joseph Burnett',
 'Robert Lee Baker',
 'Stephen Sebastian Cunningham',
 'George Shaw',
 'Richard Lindsay Butterfield',
 'Gary Lynn Andrews',
 'Ronald Baker Thomas']

In [62]:
main_title = []

for x in results:
    # Take the second element of the comma-separated list (the organization)
    detail = x.text.split(',')[1].strip()
    main_title.append(detail)
main_title

['U.S. Air Force',
 'U.S. Army',
 'U.S. Army',
 'U.S. Marine Corps',
 'U.S. Army',
 'U.S. Navy',
 'U.S. Army',
 'U.S. Army',
 'U.S. Marine Corps',
 'U.S. Army',
 'U.S. Army',
 'U.S. Army',
 'U.S. Army',
 'U.S. Army',
 'U.S. Army',
 'Jr.',
 'U.S. Army',
 'U.S. Coast Guard',
 'U.S. Marine Corps',
 'U.S. Army',
 'U.S. Army']
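The 'Jr.' entry in this list shows a weakness of taking the second comma-separated element: a heading with a name suffix contains two commas. As a hedged sketch, splitting on the last comma avoids the problem. The heading text "Robert Lee Baker, Jr., U.S. Navy" is an assumption inferred from the output above and from the correction applied later in this notebook. The helper name `split_heading` is hypothetical.

```python
def split_heading(heading):
    """Split 'Name[, suffix], Branch' on the last comma only.

    If the heading has no comma, the name comes back empty and the whole
    text is returned as the branch.
    """
    name, _, branch = heading.strip().rpartition(",")
    return name.strip(), branch.strip()


simple = split_heading("Will Melvin Shegog, U.S. Air Force")
# Assumed heading text (not copied from the page)
with_suffix = split_heading("Robert Lee Baker, Jr., U.S. Navy")
```

With this approach the suffix stays attached to the name, so a design choice remains: keep "Jr." in the name column or strip it in a separate step.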

### Extracting the Date
* The approach is the same as for the name: look for the surrounding tags, but this time filter on a specific attribute of the tag (its class).

In [63]:
events = soup.find_all('div', class_ = 'row row-band one-col')
records = []
for event in events:
    # The second <p> in each block is expected to hold the date range
    event_name = event.find_all('p')[1]
    # print(event_name.text)
    records.append(event_name)

There were 21 results, so we expect 21 records.

In [64]:
len(records)

22

There are 22 records, one more than expected, so we'll look at the HTML more closely to see whether our assumptions about its patterns were accurate.
* Let's do a quick spot check of the first five records:

In [65]:
records[0:5]

[<p>When family can’t be located or there are no resources to pay for a funeral service, a community of volunteer Dignity Memorial® funeral homes, veteran service groups, local medical examiners, coroners and veterans advocates step up to offer a proper and dignified funeral service.</p>,
 <p>March 24, 1960 – Feb. 15, 2019</p>,
 <p>Feb. 4, 1955 – Jan. 18, 2019</p>,
 <p>April 7, 1960 – April 1, 2019</p>,
 <p>Nov. 15, 1944 – April 30, 2019</p>]

In the same way, we'll confirm that the last result in this object corresponds to the last entry in the article.

In [66]:
records[-1]

<p>
Jan. 3, 1946 – Sept. 8, 2019
</p>

The last record matches the last entry in the article, so the extra record is the one at index zero: it is an introductory paragraph rather than a date, and we must remove it.

In [67]:
date_record = records[1:]
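Dropping index zero works for this page, but it depends on the introductory paragraph always coming first. As a hedged alternative, here is a sketch that keeps only the strings that look like a date range. The regular expression is an assumption based on the formats seen in this article, and the names `DATE_RANGE` and `keep_date_ranges` are hypothetical.

```python
import re

# A four-digit year, a hyphen or en dash, then "Month day, year" at the end
DATE_RANGE = re.compile(r"\d{4}\s*[–-]\s*[A-Za-z]+\.?\s+\d{1,2},\s*\d{4}\s*$")


def keep_date_ranges(paragraphs):
    """Return the stripped paragraphs that end with a date range."""
    return [text.strip() for text in paragraphs if DATE_RANGE.search(text.strip())]


sample = [
    # Excerpt of the introductory paragraph
    "When family can’t be located or there are no resources to pay for a funeral service",
    "March 24, 1960 – Feb. 15, 2019",
    "July 28, 1943-July 23, 2018",
    "\nJan. 3, 1946 – Sept. 8, 2019\n",
]
date_ranges = keep_date_ranges(sample)
```

In the notebook this would be applied to the text of each record, for example `[r.text for r in records]`.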

In [68]:
date_record[0:5]

[<p>March 24, 1960 – Feb. 15, 2019</p>,
 <p>Feb. 4, 1955 – Jan. 18, 2019</p>,
 <p>April 7, 1960 – April 1, 2019</p>,
 <p>Nov. 15, 1944 – April 30, 2019</p>,
 <p>Feb. 17, 1940 – June 4, 2019</p>]

In [69]:
date_record[-1]

<p>
Jan. 3, 1946 – Sept. 8, 2019
</p>

It looks reasonable!
* Now we can build a for loop.

In [70]:
date_head = []

for elt in date_record:
    time = elt.text.strip()
    date_head.append(time)
date_head

['March 24, 1960 – Feb. 15, 2019',
 'Feb. 4, 1955 – Jan. 18, 2019',
 'April 7, 1960 – April 1, 2019',
 'Nov. 15, 1944 – April 30, 2019',
 'Feb. 17, 1940 – June 4, 2019',
 'Dec. 21, 1944 – May 26, 2019',
 'April 18, 1947 - June 14, 2019',
 'June 15, 1949 - July 16, 2019',
 'Oct. 26, 1958 – Oct. 7, 2018',
 'July 28, 1943-July 23, 2018',
 'August 22, 1946-August 3, 2018',
 'July 28, 1942 – Sept. 14, 2018',
 'April 15, 1947 – Oct. 10, 2018',
 'August 1, 1947 – Oct. 2, 2018',
 'Nov. 20, 1946 – Oct. 16, 2018',
 'March 15, 1948 – Dec. 10, 2018',
 'Jan. 25, 1950 – Dec. 27, 2018',
 'Jan. 24, 1951 – March 6, 2019',
 'June 14, 1939 – April 4, 2019',
 'August 24, 1951 – August 14, 2019',
 'Jan. 3, 1946 – Sept. 8, 2019']

Now that the loops have filled all three lists with strings, the next step is to construct a data frame from them.

In [71]:
df = pd.DataFrame({
    "name": name_head, 
    "organization": main_title,
    "date": date_head})
df.head()

Unnamed: 0,name,organization,date
0,Will Melvin Shegog,U.S. Air Force,"March 24, 1960 – Feb. 15, 2019"
1,Ronnie Joe Lundy,U.S. Army,"Feb. 4, 1955 – Jan. 18, 2019"
2,Joseph Lorenzo Williams,U.S. Army,"April 7, 1960 – April 1, 2019"
3,Danny Rollin Ballantyne,U.S. Marine Corps,"Nov. 15, 1944 – April 30, 2019"
4,George Charles Babcock,U.S. Army,"Feb. 17, 1940 – June 4, 2019"
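Building a data frame from a dictionary of lists only works when every list has the same length; otherwise pandas raises a `ValueError`. A small guard can report the mismatch more clearly, as in this hedged sketch using plain Python. The helper name `build_rows` is hypothetical.

```python
def build_rows(names, organizations, dates):
    """Combine three equal-length lists into a list of row dictionaries."""
    lengths = (len(names), len(organizations), len(dates))
    if len(set(lengths)) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return [
        {"name": name, "organization": organization, "date": date_range}
        for name, organization, date_range in zip(names, organizations, dates)
    ]


rows = build_rows(
    ["Will Melvin Shegog", "Ronnie Joe Lundy"],
    ["U.S. Air Force", "U.S. Army"],
    ["March 24, 1960 – Feb. 15, 2019", "Feb. 4, 1955 – Jan. 18, 2019"],
)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame`. A check of this kind would also have flagged the 22-versus-21 mismatch found earlier.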


In [72]:
# Split on an en dash or a hyphen, dropping any spaces around it
df[['DOB','DOD']] = df['date'].str.split(r"\s*[–-]\s*", expand=True)
df

Unnamed: 0,name,organization,date,DOB,DOD
0,Will Melvin Shegog,U.S. Air Force,"March 24, 1960 – Feb. 15, 2019","March 24, 1960","Feb. 15, 2019"
1,Ronnie Joe Lundy,U.S. Army,"Feb. 4, 1955 – Jan. 18, 2019","Feb. 4, 1955","Jan. 18, 2019"
2,Joseph Lorenzo Williams,U.S. Army,"April 7, 1960 – April 1, 2019","April 7, 1960","April 1, 2019"
3,Danny Rollin Ballantyne,U.S. Marine Corps,"Nov. 15, 1944 – April 30, 2019","Nov. 15, 1944","April 30, 2019"
4,George Charles Babcock,U.S. Army,"Feb. 17, 1940 – June 4, 2019","Feb. 17, 1940","June 4, 2019"
5,James Miske,U.S. Navy,"Dec. 21, 1944 – May 26, 2019","Dec. 21, 1944","May 26, 2019"
6,Stephen Jerald Spicer,U.S. Army,"April 18, 1947 - June 14, 2019","April 18, 1947","June 14, 2019"
7,Howard Nicholas Warren,U.S. Army,"June 15, 1949 - July 16, 2019","June 15, 1949","July 16, 2019"
8,Charles Bradley Fox,U.S. Marine Corps,"Oct. 26, 1958 – Oct. 7, 2018","Oct. 26, 1958","Oct. 7, 2018"
9,James David Ellis,U.S. Army,"July 28, 1943-July 23, 2018","July 28, 1943","July 23, 2018"
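The pattern used for the split is a character class that matches either an en dash or a plain hyphen, because the article uses both, with and without surrounding spaces. Here is a minimal standard-library sketch of the same idea. The helper name `split_lifespan` is hypothetical.

```python
import re

# Either dash, with any surrounding whitespace swallowed by the separator
SEPARATOR = re.compile(r"\s*[–-]\s*")


def split_lifespan(text):
    """Split 'birth date <dash> death date' into its two halves.

    Raises ValueError if the text contains no dash at all.
    """
    born, died = SEPARATOR.split(text.strip(), maxsplit=1)
    return born, died


en_dash = split_lifespan("March 24, 1960 – Feb. 15, 2019")
spaced_hyphen = split_lifespan("April 18, 1947 - June 14, 2019")
tight_hyphen = split_lifespan("July 28, 1943-July 23, 2018")
```

Including the whitespace in the separator means the two halves need no extra stripping afterwards.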


In [73]:
# 'Jr.' was picked up as the organization because that heading contains an extra comma
df['organization'] = df['organization'].replace('Jr.', 'U.S. Navy')
df.tail(6)


Unnamed: 0,name,organization,date,DOB,DOD
15,Robert Lee Baker,U.S. Navy,"March 15, 1948 – Dec. 10, 2018","March 15, 1948","Dec. 10, 2018"
16,Stephen Sebastian Cunningham,U.S. Army,"Jan. 25, 1950 – Dec. 27, 2018","Jan. 25, 1950","Dec. 27, 2018"
17,George Shaw,U.S. Coast Guard,"Jan. 24, 1951 – March 6, 2019","Jan. 24, 1951","March 6, 2019"
18,Richard Lindsay Butterfield,U.S. Marine Corps,"June 14, 1939 – April 4, 2019","June 14, 1939","April 4, 2019"
19,Gary Lynn Andrews,U.S. Army,"August 24, 1951 – August 14, 2019","August 24, 1951","August 14, 2019"
20,Ronald Baker Thomas,U.S. Army,"Jan. 3, 1946 – Sept. 8, 2019","Jan. 3, 1946","Sept. 8, 2019"


In [74]:
df.drop('date', inplace=True, axis=1)

In [75]:
df.head()

Unnamed: 0,name,organization,DOB,DOD
0,Will Melvin Shegog,U.S. Air Force,"March 24, 1960","Feb. 15, 2019"
1,Ronnie Joe Lundy,U.S. Army,"Feb. 4, 1955","Jan. 18, 2019"
2,Joseph Lorenzo Williams,U.S. Army,"April 7, 1960","April 1, 2019"
3,Danny Rollin Ballantyne,U.S. Marine Corps,"Nov. 15, 1944","April 30, 2019"
4,George Charles Babcock,U.S. Army,"Feb. 17, 1940","June 4, 2019"


In [76]:
# df['DOB'] = pd.to_datetime(df['DOB'])

# Use DataFrame.apply() to convert multiple columns to datetime
df[['DOB','DOD']] = df[['DOB','DOD']].apply(pd.to_datetime)
df['DOB'] = [d.strftime('%m-%d-%Y') if not pd.isnull(d) else '' for d in df['DOB']]
df['DOD'] = [d.strftime('%m-%d-%Y') if not pd.isnull(d) else '' for d in df['DOD']]


In [77]:
df.head()

Unnamed: 0,name,organization,DOB,DOD
0,Will Melvin Shegog,U.S. Air Force,03-24-1960,02-15-2019
1,Ronnie Joe Lundy,U.S. Army,02-04-1955,01-18-2019
2,Joseph Lorenzo Williams,U.S. Army,04-07-1960,04-01-2019
3,Danny Rollin Ballantyne,U.S. Marine Corps,11-15-1944,04-30-2019
4,George Charles Babcock,U.S. Army,02-17-1940,06-04-2019
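The dates in this article mix full month names ("March 24, 1960") with abbreviations ("Feb. 15, 2019", "Sept. 8, 2019"), and automatic format inference may behave differently from one pandas version to another. As a hedged alternative, here is a sketch of an explicit parser built on the standard library. It assumes every date has the shape "Month day, year". The names `parse_month_day_year` and `MONTHS` are hypothetical.

```python
import re
from datetime import date

# Map the first three letters of each month name to its number
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}
DATE_PATTERN = re.compile(r"([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})")


def parse_month_day_year(text):
    """Parse 'Month day, year' with a full or abbreviated month; None if it does not fit."""
    match = DATE_PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    month_name, day, year = match.groups()
    month = MONTHS.get(month_name[:3].lower())
    if month is None:
        return None
    return date(int(year), month, int(day))


formatted = parse_month_day_year("Sept. 8, 2019").strftime("%m-%d-%Y")
```

Returning `None` for text that does not fit plays the same role as the empty-string fallback in the cell above.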


### Export To CSV

When the data frame is complete, the next thing we can do is export it in CSV format.

In [78]:
# Output gathered data to the "web-scraping" folder
out_path = os.path.join(os.path.dirname(os.getcwd()), 'web-scraping', 'dignity-memorial', 'veterans.csv')
df.to_csv(out_path, index=False, encoding='utf-8')