Code for a scraper that collects the items listed under "Support organizations" and writes them to a CSV file.

In [2]:
# load libraries etc.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv

In [4]:
# get contents of page at URL, put into soup 
html = urlopen("http://www.operationwearehere.com/MilitaryServiceDogs.html?fbclid=IwAR0xChISnK87deV3IqVgG8AR8ZJCCH7YBp_Yl8CJih54DW262j-wzuxXH10")
soup = BeautifulSoup(html, 'html5lib')

In [6]:
# get all divs on page
all_divs = soup.find_all('div')

In [21]:
# test run
count = 0
for div in all_divs:
    count += 1
    if div.get_text() == "Support organizations":
        print("Data we want starts at " + str(count))
        # this tells us the first heading - the first item we want to capture 
        print(div.next_sibling.next_sibling.get_text())
        break
print("Total divs on page: " + str( len(all_divs) ) )

Data we want starts at 67
Alpha Bravo Canine
Total divs on page: 1253


## What did we just do?

Everything in the HTML is enclosed in div elements. There are no normal heading or paragraph elements. What I tested above is a way to find the heading, "Support organizations," after which the desired items begin.

Since `div.next_sibling.next_sibling.get_text()` gets the first heading, that's where we need to begin scraping.

We've found a way to determine the starting point. The heading is div 67, so two divs later is div 69: the 69th div is our starting point (index 68 in the zero-based `all_divs` list).
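
A caveat about `next_sibling` is worth sketching here. It worked above because the divs on this page appear to follow one another with no text between them. The markup below is hypothetical (it is not copied from the page) and is parsed with the built-in `html.parser` rather than `html5lib`, just to keep the sketch small. It shows that whitespace between divs counts as a sibling, and that `find_next_sibling('div')` skips over it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same three divs, once packed tightly and once
# with a line break between them.
tight = "<div>Support organizations</div><div></div><div>Alpha Bravo Canine</div>"
spaced = "<div>Support organizations</div>\n<div></div>\n<div>Alpha Bravo Canine</div>"

def heading_after(markup):
    """Return the text found two siblings after the first div, both ways."""
    soup = BeautifulSoup(markup, "html.parser")
    start = soup.find("div")
    # Plain .next_sibling counts text nodes (including whitespace) as siblings.
    by_attribute = start.next_sibling.next_sibling.get_text()
    # find_next_sibling("div") only ever lands on div elements.
    by_search = start.find_next_sibling("div").find_next_sibling("div").get_text()
    return by_attribute, by_search

tight_result = heading_after(tight)
spaced_result = heading_after(spaced)
print(tight_result)
print(spaced_result)
```

If the page's markup ever gained line breaks between divs, the `find_next_sibling` form would be the safer of the two.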

In [30]:
# test 2
count = 0
for div in all_divs:
    count += 1
    if count == 69: 
        print(div.get_text())
        break

# same thing, using list index 
print(all_divs[68].get_text())

Alpha Bravo Canine
Alpha Bravo Canine


Now we have refined that code to go straight to the first heading of the first item, by counting the divs in the list of all divs.
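
The counting above mixes a 1-based counter with 0-based list indexes, which is easy to get wrong. As a small sketch of the same bookkeeping, the example below uses `enumerate(..., start=1)` on a made-up list of strings (`div_texts` is a stand-in for `all_divs`, not real page data).

```python
# Stand-in for all_divs: plain strings instead of parsed div elements.
div_texts = ["Menu", "Support organizations", "", "Alpha Bravo Canine"]

heading_position = None
for position, text in enumerate(div_texts, start=1):
    if text == "Support organizations":
        heading_position = position
        break

# The first item sits two divs after the heading.
first_item_position = heading_position + 2   # 1-based, like `count`
first_item_index = first_item_position - 1   # 0-based, for list indexing
print(first_item_position, first_item_index, div_texts[first_item_index])
```

On the real page the same arithmetic gives position 69 and index 68.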

Now we need to work out a function to capture the data from each div for one item. An item has a name, a location, a URL, and a description. The description may be spread across multiple divs, which presents a challenge.

In [40]:
def getItemTest01(div):
    # first div - name
    print(div.get_text())
    # second div - location
    print(div.next_sibling.get_text())
    # third div - URL
    print(div.next_sibling.next_sibling.find('a').attrs["href"])
    # fourth div - description 
    print(div.next_sibling.next_sibling.next_sibling.get_text())
    # fifth div - test for blank line, True/False 
    div = div.next_sibling.next_sibling.next_sibling.next_sibling
    print(div.get_text() == "")
    # sixth div - test for blank line, True/False
    print(div.next_sibling.get_text() == "")
    # seventh div
    print(div.next_sibling.next_sibling.get_text())

# call that function with the starting div being the 69th div on the page 
getItemTest01(all_divs[68])


Alpha Bravo Canine
Philadelphia, PA
http://alphabravocanine.org/
Alpha Bravo Canine’s mission is to provide trained service dogs to U.S military veterans suffering from Post-Traumatic Stress Disorder (PTSD), Traumatic Brain Injury (TBI), and other combat related disabilities.
True
True
Alpha K-9


## What's happening now?

We have determined that we can get the data for name, location, URL, and description where the description has only one paragraph (actually a div, because everything is divs).

We have also found that we can test successfully for blank lines. Two blank lines signal that the following div is the start of a new item - it will contain the name of a new organization.

We want to modify this function to get all description-divs for an item, regardless of how many there are.
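
Before doing this with real divs, the rule can be sketched on plain strings. The example below is a simplified model, not the scraper itself: `texts` is invented data, every item is assumed to have a name, a location, and a URL, a single blank may separate description paragraphs, and two blanks end an item. The helper name `read_item` is hypothetical.

```python
def read_item(texts, i):
    """Read one item starting at index i; return (item, index of next item)."""
    name, location, url = texts[i], texts[i + 1], texts[i + 2]
    i += 3
    paragraphs = []
    while i < len(texts) and texts[i] != "":
        paragraphs.append(texts[i])
        i += 1
        # A single blank may separate two paragraphs: step over it.
        if i < len(texts) and texts[i] == "":
            i += 1
    # The loop stops on the second blank of a pair (or at the end of the
    # list), so the next item starts one position later.
    item = {"name": name, "location": location, "url": url,
            "description": " ".join(paragraphs)}
    return item, i + 1

# Invented data: one item with two paragraphs, one item with a single paragraph.
texts = [
    "Org A", "Town, ST", "http://a.example/",
    "First paragraph.", "", "Second paragraph.", "", "",
    "Org B", "City, ST", "http://b.example/",
    "Only paragraph.", "", "",
]

items = []
i = 0
while i < len(texts):
    item, i = read_item(texts, i)
    items.append(item)

for item in items:
    print(item["name"], "|", item["description"])
```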

In [59]:
def getItemTest02(div):
    # first div - name
    print(div.get_text())
    # second div - location
    print(div.next_sibling.get_text())
    # third div - URL
    print(div.next_sibling.next_sibling.find('a').attrs["href"])
    # fourth div - description 
    descrip = div.next_sibling.next_sibling.next_sibling.get_text()
    # print(descrip)
    # fifth div - test for blank line, True/False
    # div = div.next_sibling.next_sibling.next_sibling.next_sibling
    # print(div.get_text() == "")
    # THAT WILL ALWAYS BE A BLANK LINE, so skip it 
    # sixth div - test for second blank line, True/False
    # div = div.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling
    # print(div.get_text() == "")
    
    # fifth div, after descrip 
    div = div.next_sibling.next_sibling.next_sibling.next_sibling
    # we need to look at the next div that has text in it
    if div.get_text() == "":
        div = div.next_sibling
    
    while div.get_text() != "":
        descrip += (" " + div.get_text())
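        # NOTE: next_sibling is a div element here, not a string, so this
        # comparison is presumably never True and the else branch always runs
        if div.next_sibling == "":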
            div = div.next_sibling.next_sibling
        else:
            div = div.next_sibling
    print(descrip)
        
    '''
    # if that text is empty (blank line found), then NEXT div is start of a new item
    if div.get_text() == "":
        # next div starts new item - this will be returned
        div = div.next_sibling
    # if that test is NOT empty, we want to add it to the descrip and do the same test again, after
    else:
        descrip += (" " + div.get_text())
        print(descrip)
    '''
    return div

# call that function with the starting div being the 69th div on the page 
next_item = getItemTest02(all_divs[68])

next_item = getItemTest02(all_divs[970])

next_item = getItemTest02(all_divs[854])


Alpha Bravo Canine
Philadelphia, PA
http://alphabravocanine.org/
Alpha Bravo Canine’s mission is to provide trained service dogs to U.S military veterans suffering from Post-Traumatic Stress Disorder (PTSD), Traumatic Brain Injury (TBI), and other combat related disabilities.
*Warrior Service Dogs
Asheville, NC
http://warriorservicedogs.org/
Warrior Service Dogs is a non-profit organization started by three OEF/OIF combat veterans based in Western North Carolina. Our training is led by certified dog trainer and behavior modification specialist Shane Cox, as well as certified dog trainer Chris Stewart. We are currently holding training sessions to any veteran. If you have a dog bring him/her out and we can see what we can do. If you do not own a dog we would love to scout the local shelters to help you find the right dog.
The Battle Buddy Foundation
West Chester, OH
http://www.tbbf.org/
Our mission is to ensure that veterans and their families receive programs and services that will hel

In [117]:
# find start of particular item using name of org 
count = 0
for div in all_divs:
    count += 1
    if div.get_text() == "K9s for Veterans, NFP":
        print("Data we want starts at " + str(count))
        break


Data we want starts at 344


In [112]:
# refining the function further

def getItemTest03(div):
    # first div - name
    print(div.get_text())
    # second div - location - need to check if this is a location div; some items have none
    if div.next_sibling.has_attr('align'):
        print(div.next_sibling.get_text())
        # set value to third div
        div = div.next_sibling.next_sibling
    else:
        print('None')
        # set value to THIS div
        div = div.next_sibling
    # third div - URL - need to check if this has 'a' element
    if div.find('a') is not None:
        print(div.find('a').attrs["href"])
        div = div.next_sibling
    # fourth div - description - need to check if present
    if div.get_text() != "":
        descrip = div.get_text()
        div = div.next_sibling
        if div.get_text() == "":
            div = div.next_sibling
        while div.get_text() != "":
            descrip += (" " + div.get_text())
            div = div.next_sibling
            if div.get_text() == "":
                div = div.next_sibling
        print(descrip)
        return div.next_sibling
    else:
        descrip = "None"
        print(descrip)
        return div.next_sibling.next_sibling


# call that function to test with specific divs on the page 
# x = getItemTest03(all_divs[68])
x = getItemTest03(all_divs[343])
y = getItemTest03(x)
z = getItemTest03(y)

# x = getItemTest03(all_divs[970])
# x = getItemTest03(all_divs[854])

# test some bad ones 
# x = getItemTest03(all_divs[556])
# x = getItemTest03(all_divs[343])


K9s for Veterans, NFP
None
http://k9forveteranwarriors.org/index.html
K9s for Veterans, NFP is dedicated to helping veterans suffering from Post Traumatic Stress Disorder (PTSD) by providing them with service dogs. We’re honored to help the men and women who so bravely served our country with dogs that dramatically improve their quality of life. While there are many service dog organizations, K9s for Veterans is proud to say that we offer some unique benefits to veterans.   EXPEDITED PLACEMENT – The average length of time to find a match and place a properly trained service dog is 6-9 months for K9s for Veterans. Unfortunately, it can take years with some other organizations. FREE FOOD AND BASIC MEDICAL CARE – As you know, many of our veterans struggle financially when returning to civilian life, so adding the expense of a dog can be challenging.  K9s for Veterans, stands by our veterans for the life of their service dog with free food and basic medical care. All of this is made possib

In [114]:
# this loops over every item in the section "Support organizations" 

x = getItemTest03(all_divs[68])
while x.get_text() != "Service dog support organizations, housing advocacy":
    x = getItemTest03(x)

# the next step will be the CSV 


Alpha Bravo Canine
Philadelphia, PA
http://alphabravocanine.org/
Alpha Bravo Canine’s mission is to provide trained service dogs to U.S military veterans suffering from Post-Traumatic Stress Disorder (PTSD), Traumatic Brain Injury (TBI), and other combat related disabilities.
Alpha K-9
Sacramento, CA
http://www.alphak9.org/
Alpha K9 donates time, equipment and training in order to provide PTSD service dogs to veterans, first responders and children in a hope that a PTSD service dog will provide the recipient with an opportunity to live a more normal life and alleviate the stressors that are preventing them from entering society. Alpha K9 has already donated multiple service animals in 2012 to both veterans and kids in the local community and plans to continue to provide PTSD service dogs.
American Humane Association
None
http://www.americanhumane.org/program/military/
Given the increasing number of veterans returning from long deployments—often serving multiple consecutive tours as nev

## Where are things now?

The code now parses everything we need, apparently perfectly, but it only prints to the screen. 

Above, we see all the data, one field per line, cleanly printed to screen. 

The final step is to write the items into rows in a CSV file.
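
One thing to keep in mind is that names and descriptions here contain commas (for example "K9s for Veterans, NFP"). The sketch below writes to an in-memory buffer instead of a file, with a placeholder description, to show that `csv.writer` quotes such fields and that `csv.reader` gets the original values back.

```python
import csv
import io

# One illustrative row; the description is a placeholder, not scraped text.
rows = [
    ["K9s for Veterans, NFP", "None",
     "http://k9forveteranwarriors.org/index.html",
     "Placeholder description, with a comma."],
]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["name", "location", "url", "description"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, commas included.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
print(parsed[1])
```

So no manual escaping of commas should be needed before calling `writerow`.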


In [119]:
# standard Python CSV-writing stuff 
csvfile = open("vets_animals.csv", 'w', newline='', encoding='utf-8')
c = csv.writer(csvfile)
# write the header row for CSV file
c.writerow(['name', 'location', 'url', 'description'])


# same old function, rewritten to write to CSV 
def getCompleteItem(div):
    # first div - name
    name = div.get_text()
    # second div - location - need to check if this is a location div; some items have none
    if div.next_sibling.has_attr('align'):
        location = div.next_sibling.get_text()
        # set value to third div
        div = div.next_sibling.next_sibling
    else:
        location = 'None'
        # set value to THIS div
        div = div.next_sibling
    # third div - URL - need to check if this has 'a' element
    if div.find('a') is not None:
        url = div.find('a').attrs["href"]
        div = div.next_sibling
    else: 
        url = 'None'
    # fourth div - description - need to check if present
    if div.get_text() != "":
        descrip = div.get_text()
        div = div.next_sibling
        if div.get_text() == "":
            div = div.next_sibling
        while div.get_text() != "":
            descrip += (" " + div.get_text())
            div = div.next_sibling
            if div.get_text() == "":
                div = div.next_sibling
        div = div.next_sibling
    else:
        descrip = "None"
        div = div.next_sibling.next_sibling
    
    # write the things
    row = [name, location, url, descrip]
    c.writerow( row )
    
    return div


# the while-loop uses the function
x = getCompleteItem(all_divs[68])
while x.get_text() != "Service dog support organizations, housing advocacy":
    x = getCompleteItem(x)


# close & save CSV file
csvfile.close()

## And we're done!

The previous cell is the only code needed to scrape the page and write the CSV, provided the first three cells (imports, page load, and `all_divs`) have been run first.

Everything in between is just scratch paper.
