# Scraping URLs and using them to open pages

For this demo, we use a page at [The Enough Project](https://enoughproject.org/). The page is not very long, but its content changes over time, so we might want to monitor it to collect new URLs and data.

This is the page: [Take Action](https://enoughproject.org/get-involved/take-action)

In [121]:
# load the Python libraries
from bs4 import BeautifulSoup
import requests

In [122]:
# open the main page and copy all its HTML into a variable named `soup`
url = 'https://enoughproject.org/get-involved/take-action'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
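
The cell above assumes the download succeeds. Here is a minimal sketch of a more defensive version. The function name `fetch_soup` and its `session` parameter are hypothetical (they are not part of this notebook), and `session` is assumed to behave like a `requests.Session`.

```python
from bs4 import BeautifulSoup

def fetch_soup(url, session, timeout=10):
    """Download `url` and return a BeautifulSoup object, or raise on an HTTP error."""
    # a timeout keeps the script from hanging forever on a slow server
    response = session.get(url, timeout=timeout)
    # raise an exception for 4xx and 5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With `requests` imported, a call might look like `soup = fetch_soup(url, requests.Session())`.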

I want to collect the URLs for all the items in the central part of this page. There are seven items.

After I right-click the heading of the first item I want and select **Inspect**, I can see in the Elements pane (of Chrome Developer Tools) that this heading is inside an H6 element.

<img src="screenshots/enough/sudan_h6.png" alt="Screenshot of Elements inspector pane" width=934 style="margin-left:0;">

I know that most web page headings are H1 or H2, and H6 is uncommon. So I check whether these are perhaps the *only* H6 elements on this page.

In [123]:
# data we want is a set of items that start with h6 headings
heads = soup.find_all( 'h6' )

# how many of those on the page?
len(heads)

7

I'm in luck! There are seven items on the page that I want, and it appears I can grab them with just this: their H6 headings.
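
One way to double-check a hunch like this is to count every heading level on the page. The sketch below does that on a small piece of made-up HTML (`sample_html` is an assumption, not the real page). On the real page, you would use `soup` instead of `sample_soup`.

```python
from collections import Counter
from bs4 import BeautifulSoup

# made-up HTML standing in for the real page
sample_html = """
<h1>Take Action</h1>
<div><h6><a href="/item-one">Item one</a></h6><p>Summary</p></div>
<div><h6><a href="/item-two">Item two</a></h6><p>Summary</p></div>
<h2>Footer</h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all accepts a list of tag names and returns every match
heading_tags = ["h1", "h2", "h3", "h4", "h5", "h6"]
heading_counts = Counter(tag.name for tag in sample_soup.find_all(heading_tags))

# show how many times each heading level appears
print(heading_counts)
```

If the count for `h6` equals the number of items you can see on the page, the H6 headings are probably a safe hook.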

In [124]:
# let's check and make sure that's the stuff we want
for head in heads:
    print(head.text)

South Sudan: Support Use of Robust Financial Tools on Actors Highlighted in Sentry Report
Tell UK to Address Connections to Human Rights Violations and Corruption in South Sudan
South Sudan: Promote the Use of Robust Financial Tools and Support Strong Institutions
Support Bipartisan Congo Legislation to Help Dismantle Kleptocratic System
Urge Companies to Be Leaders In Creating a Transparent Cobalt Trade in Congo
Conflict Gold Trade: Urge the US, EU, and United Nations Security Council to Sanction Gold Smuggling Companies and Networks 
Tell 20 of the Largest Companies in the World that You Demand the Supply of Products Made with Conflict-Free Minerals from Congo


Success! Those are the headings on the items I want.

## Get the URLs too

Now, I would also like to get the URL for each of those items.

<img src="screenshots/enough/sudan_a_href.png" alt="Screenshot of Elements inspector pane with A HREF" width=705 style="margin-left:0;">

By further inspecting the HTML for the page, I see that the A element is inside the H6 element. The A element is what creates a **link** in HTML. The target URL is defined in the HREF attribute, inside the A element.
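
To make that structure concrete, here is a small self-contained sketch on made-up HTML. The base URL and the three headings are assumptions for illustration only. It also covers two cases the real page did not show me: a heading with no link, and a relative HREF, which `urljoin` from the standard library turns into a full URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# hypothetical base URL and made-up HTML, for illustration only
base_url = "https://example.org/take-action"
sample_html = """
<h6><a href="/take-action/item-one">Item one </a></h6>
<h6><a href="https://example.org/take-action/item-two">Item two</a></h6>
<h6>Heading without a link</h6>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = []
for head in sample_soup.find_all("h6"):
    a = head.find("a")  # None if the heading contains no A element
    if a is None or not a.get("href"):
        continue
    # strip=True removes leading and trailing whitespace from the text
    title = head.get_text(strip=True)
    # urljoin leaves absolute URLs alone and completes relative ones
    links.append((title, urljoin(base_url, a["href"])))

for title, link in links:
    print(title)
    print(link)
    print()
```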

In [125]:
# let's get the URL for each of those heads too
# HREF is an attribute of the A element - it holds the URL 

for head in heads:
    print(head.text)
    url = head.find('a')
    print(url.attrs['href'])
    print()

South Sudan: Support Use of Robust Financial Tools on Actors Highlighted in Sentry Report
https://enoughproject.org/get-involved/take-action/south-sudan-support-use-robust-financial-tools-actors-highlighted-sentry-report

Tell UK to Address Connections to Human Rights Violations and Corruption in South Sudan
https://enoughproject.org/get-involved/take-action/tell-uk-address-connections-human-rights-violations-corruption-south-sudan

South Sudan: Promote the Use of Robust Financial Tools and Support Strong Institutions
https://enoughproject.org/get-involved/take-action/south-sudan-financial-tools

Support Bipartisan Congo Legislation to Help Dismantle Kleptocratic System
https://enoughproject.org/get-involved/take-action/bipartisan-congo-legislation

Urge Companies to Be Leaders In Creating a Transparent Cobalt Trade in Congo
https://enoughproject.org/get-involved/take-action/companies-leaders-cobalt

Conflict Gold Trade: Urge the US, EU, and United Nations Security Council to Sanction 

## Can we use the URLs to navigate to other pages?

Yes. But first, let's look at what we want to get from those other pages.

## Scrape a URL based on its link text

Now we come to a trickier and cooler goal: some of the linked pages contain a further link to a longer report about the situation. I would like to get the URL for any such report. **Only some of the linked pages have this report link.** Others do not.

On one of the pages that DOES have a report link ([Conflict Gold Trade: Urge the US, EU, and United Nations Security Council to ...](https://enoughproject.org/get-involved/take-action/conflict-gold-trade)), I inspect that report link and see that the HTML gives me no useful element name, class, or ID to help me grab that report URL.

<img src="screenshots/enough/sudan_report_link.png" alt="Screenshot of Elements inspector pane showing A element and text" width=708 style="margin-left:0;">

However, I can use *the unique text in the link* to help me.

In [126]:
# this is a different page, so I need to make `soup` again - call it `soup2` this time
# open the linked page and copy all its HTML into a variable named `soup2`

url = 'https://enoughproject.org/get-involved/take-action/conflict-gold-trade'
html = requests.get(url)
soup2 = BeautifulSoup(html.text, 'html.parser')

In [127]:
# collect all A elements on the page
a_list = soup2.find_all( 'a' )
len(a_list)

112

In [128]:
# now find out if more than one of those has the word "report" in the linked text
for a in a_list:
    if "report" in a.text:
        print(a.text)
        print(a.attrs['href'])

For more information, read The Sentry’s recent report >
https://eno.ug/2P7K14G
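
The check above is case-sensitive, and it assumes every A element has an HREF attribute. A hedged variation, tried here on made-up HTML, ignores case and skips A elements that have no HREF. The helper name `links_with_text` is hypothetical.

```python
from bs4 import BeautifulSoup

# made-up HTML, for illustration only
sample_html = """
<p><a href="/about">About us</a></p>
<p><a href="https://example.org/r1">Read the full Report &gt;</a></p>
<p><a name="top">report anchor without href</a></p>
<p><a href="/donate">Donate</a></p>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

def links_with_text(soup, word):
    """Return (text, href) pairs for links whose text contains `word`, ignoring case."""
    matches = []
    # href=True keeps only A elements that actually have an HREF attribute
    for a in soup.find_all("a", href=True):
        text = a.get_text(strip=True)
        if word.lower() in text.lower():
            matches.append((text, a["href"]))
    return matches

report_links = links_with_text(sample_soup, "report")
print(report_links)
```

Matching on link text is fragile: if the site rewords the link, the match silently stops working, so it is worth checking the results by eye now and then.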


## Expand a shortened URL

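So I can get that URL, but in this case it is a shortened URL. *Ugh.* I want the full, *regular* URL. But I can get that! When `requests` opens the short URL, it follows the redirect, and the final address is stored in the `url` attribute of the response.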

In [129]:
r = requests.get('https://eno.ug/2P7K14G')
r.url

'https://thesentry.org/reports/the-golden-laundromat/'
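
Using `requests.get` downloads the whole report page just to learn its address. A possible lighter variant is sketched below. The function name `expand_url` and its `session` parameter are hypothetical, and `session` is assumed to behave like a `requests.Session`. It sends a HEAD request first, which asks for headers only, and falls back to GET if the server answers with an error.

```python
def expand_url(short_url, session, timeout=10):
    """Follow redirects from `short_url` and return the final URL."""
    # HEAD does not follow redirects by default, so ask for that explicitly
    response = session.head(short_url, allow_redirects=True, timeout=timeout)
    if response.status_code >= 400:
        # some servers reject HEAD requests, so retry with a normal GET
        response = session.get(short_url, allow_redirects=True, timeout=timeout)
    return response.url

# Usage (needs network access):
#   import requests
#   expand_url("https://eno.ug/2P7K14G", requests.Session())
```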

## Continue solving the main problem

So now I have to think about how to get from my first page, https://enoughproject.org/get-involved/take-action, to each of the linked pages.

Then, on each linked page, I will need to check for a URL that leads to a report. There MAY, or may NOT, be one of these.

And finally, if I find such a link, I need to expand the URL if it is a shortened URL.

First, I will make a Python list of dictionaries out of the headings and URLs so I can loop over them.

In [130]:
heads = soup.find_all( 'h6' )

# create a new, empty Python list to hold dictionaries
enough_dictlist = []

# this is code adapted from code used above to just print 
for head in heads:
    new_dict = {}
    new_dict['title'] = head.text
    url = head.find('a')
    new_dict['url'] = url.attrs['href']
    # add it to the list
    enough_dictlist.append(new_dict)

# print the complete list of dictionaries  
print(enough_dictlist)

[{'title': 'South Sudan: Support Use of Robust Financial Tools on Actors Highlighted in Sentry Report', 'url': 'https://enoughproject.org/get-involved/take-action/south-sudan-support-use-robust-financial-tools-actors-highlighted-sentry-report'}, {'title': 'Tell UK to Address Connections to Human Rights Violations and Corruption in South Sudan', 'url': 'https://enoughproject.org/get-involved/take-action/tell-uk-address-connections-human-rights-violations-corruption-south-sudan'}, {'title': 'South Sudan: Promote the Use of Robust Financial Tools and Support Strong Institutions', 'url': 'https://enoughproject.org/get-involved/take-action/south-sudan-financial-tools'}, {'title': 'Support Bipartisan Congo Legislation to Help Dismantle Kleptocratic System', 'url': 'https://enoughproject.org/get-involved/take-action/bipartisan-congo-legislation'}, {'title': 'Urge Companies to Be Leaders In Creating a Transparent Cobalt Trade in Congo', 'url': 'https://enoughproject.org/get-involved/take-actio

In [131]:
# alternatively, we can print values (or keys AND values) from each dictionary in a prettier way

# loop over the list
# (the loop variable is named `item` so it does not overwrite Python's built-in `dict`)
for item in enough_dictlist:
    for v in item.values():
        print(v)
    # next line puts a blank line between dicts
    print()


South Sudan: Support Use of Robust Financial Tools on Actors Highlighted in Sentry Report
https://enoughproject.org/get-involved/take-action/south-sudan-support-use-robust-financial-tools-actors-highlighted-sentry-report

Tell UK to Address Connections to Human Rights Violations and Corruption in South Sudan
https://enoughproject.org/get-involved/take-action/tell-uk-address-connections-human-rights-violations-corruption-south-sudan

South Sudan: Promote the Use of Robust Financial Tools and Support Strong Institutions
https://enoughproject.org/get-involved/take-action/south-sudan-financial-tools

Support Bipartisan Congo Legislation to Help Dismantle Kleptocratic System
https://enoughproject.org/get-involved/take-action/bipartisan-congo-legislation

Urge Companies to Be Leaders In Creating a Transparent Cobalt Trade in Congo
https://enoughproject.org/get-involved/take-action/companies-leaders-cobalt

Conflict Gold Trade: Urge the US, EU, and United Nations Security Council to Sanction 

## Use the list of titles and URLs to go to the linked page

... and get the additional link (to a full report), *if there is one.* To do so, we will loop over the list of dictionaries like we just did, but this time we will use each URL to open a page, where we will look for the full report link.

If there is such a link, we will try to expand it. Then we will add the report URL to the current dictionary.


In [132]:
# function based on previous code, above 
def get_report_link(newsoup):
    # new empty list
    report_urls = []
    # collect all A elements on the page
    a_list = newsoup.find_all( 'a' )
    # now find all of those that have the word "report" in the linked text
    for a in a_list:
        if "report" in a.text:
            # get expanded URL (requests follows the redirect from a shortened URL)
            r = requests.get(a.attrs['href'])
            report_urls.append(r.url)
    return report_urls


In [136]:
# loop over the list of dictionaries
# (the loop variable is named `item` so it does not overwrite Python's built-in `dict`)
for item in enough_dictlist:
    # open one page and copy all its HTML into `newsoup`
    url = item['url']
    html = requests.get(url)
    newsoup = BeautifulSoup(html.text, 'html.parser')
    # run the function
    urls = get_report_link(newsoup)
    # keep the first report URL, if there is at least one
    if len(urls) >= 1:
        item['report_url'] = urls[0]
    # tell me if a page had more than one report link
    if len(urls) > 1:
        print('More than one report for "' + item['title'] + '"')
        print(urls)


More than one report for "South Sudan: Support Use of Robust Financial Tools on Actors Highlighted in Sentry Report"
['https://thesentry.org/reports/taking-south-sudan/', 'https://thesentry.org/reports/taking-south-sudan/']


I am using a bit of "print-statement debugging" here, which is not the greatest, but I'm trying to keep this code at a beginner level.
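
For anyone curious about the alternative, here is a minimal sketch using the `logging` module from the standard library. The helper name `pick_report_url` and the logger name are hypothetical. The sketch also removes duplicate URLs before deciding whether to warn, since the output above lists the same report twice.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("scraper")

def pick_report_url(title, urls):
    """Return the first report URL, or None; warn when several distinct URLs appear."""
    # dict.fromkeys drops duplicates while keeping the original order
    unique_urls = list(dict.fromkeys(urls))
    if not unique_urls:
        logger.info('No report link for "%s"', title)
        return None
    if len(unique_urls) > 1:
        logger.warning('More than one report for "%s": %s', title, unique_urls)
    return unique_urls[0]

chosen = pick_report_url(
    "South Sudan: Support Use of Robust Financial Tools on Actors Highlighted in Sentry Report",
    ["https://thesentry.org/reports/taking-south-sudan/",
     "https://thesentry.org/reports/taking-south-sudan/"],
)
print(chosen)
```

With duplicates removed, this page no longer triggers a warning, because both links lead to the same report.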

In [137]:
# print the values from the dictionaries

for item in enough_dictlist:
    for k, v in item.items():
        print(k + ": " + v)
    # next line puts a blank line between dicts
    print()


title: South Sudan: Support Use of Robust Financial Tools on Actors Highlighted in Sentry Report
url: https://enoughproject.org/get-involved/take-action/south-sudan-support-use-robust-financial-tools-actors-highlighted-sentry-report
report_url: https://thesentry.org/reports/taking-south-sudan/

title: Tell UK to Address Connections to Human Rights Violations and Corruption in South Sudan
url: https://enoughproject.org/get-involved/take-action/tell-uk-address-connections-human-rights-violations-corruption-south-sudan
report_url: https://thesentry.org/reports/taking-south-sudan/

title: South Sudan: Promote the Use of Robust Financial Tools and Support Strong Institutions
url: https://enoughproject.org/get-involved/take-action/south-sudan-financial-tools

title: Support Bipartisan Congo Legislation to Help Dismantle Kleptocratic System
url: https://enoughproject.org/get-involved/take-action/bipartisan-congo-legislation

title: Urge Companies to Be Leaders In Creating a Transparent Cobalt

You see that only three of the seven current items have a report link on their pages. Two of them link to the SAME report. This data set is so small that the result might seem like too much work, but keep in mind that the same techniques can be used for other pages and other websites with far more data.
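
A tally like that can also be computed rather than read off the printout. The sketch below uses a made-up list with the same shape as `enough_dictlist` (the titles and URLs in `sample` are placeholders, not the real data), and the function name `summarize` is hypothetical.

```python
def summarize(dictlist):
    """Return (number of items with a report link, number of distinct reports)."""
    # .get returns None when a dictionary has no 'report_url' key
    with_report = [d for d in dictlist if d.get("report_url")]
    # a set keeps only one copy of each report URL
    distinct = {d["report_url"] for d in with_report}
    return len(with_report), len(distinct)

# placeholder data with the same keys as the real list of dictionaries
sample = [
    {"title": "Item A", "url": "https://example.org/a", "report_url": "https://example.org/report-1"},
    {"title": "Item B", "url": "https://example.org/b", "report_url": "https://example.org/report-1"},
    {"title": "Item C", "url": "https://example.org/c"},
]

n_with_report, n_distinct = summarize(sample)
print(f"{n_with_report} of {len(sample)} items have a report link; {n_distinct} distinct report(s)")
```

On the real data, you would call `summarize(enough_dictlist)` after the loop that adds the report URLs.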