In [1]:
import requests, bs4, re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from urllib.request import urlretrieve
import pandas as pd
import lxml.html

# Specify url
url = 'https://www.census.gov/programs-surveys/popest.html'


# download the page with requests and parse its HTML into a BeautifulSoup object
def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

# create a BeautifulSoup object from the HTML: soup
soup = make_soup(url)

#find all html tags
html_tags = soup.find_all('html')


#Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a') 


#make a list to append all links into    
link_lister = []  

Using the BeautifulSoup module, we can parse the HTML code and work with its individual parts.  
First, requests downloads the website’s HTML.  
Then BeautifulSoup parses that HTML into an object whose tags and attributes we can search.

In [12]:
for tag in html_tags:
    link = tag.get('xmlns')

    #skip tags without the attribute (None) and values that are not http links
    if link is not None and link.startswith('http'):
        link_lister.append(link)

for a in a_tags:
    link = a.get('href')

    if link is not None and link.startswith('http'):
        link_lister.append(link)

#creating a dataframe to organize the data more clearly
df = pd.DataFrame(link_lister)

df.head(10)

Unnamed: 0,0
0,http://www.w3.org/1999/xhtml
1,https://twitter.com/uscensusbureau
2,https://www.census.gov/
3,https://www.census.gov/2020census
4,https://www.census.gov/2020census
5,https://www.census.gov/AmericaCounts
6,https://www.census.gov/AmericaCounts
7,https://www.census.gov/AmericaCounts
8,https://www.census.gov/EconomicCensus
9,https://www.census.gov/EconomicCensus


I skimmed the website’s Page Source and looked for links that lead to other sites.  
Links are stored in the ‘href’ attribute of ‘a’ tags, so I created a loop that appends the href value of each ‘a’ tag to a list.  
soup.find_all(‘a’) was used to locate all the tags that can carry an ‘href’ attribute, like this: 

    <a href="https:…">

Next I created a for loop in which a.get(‘href’) pulls the link from each tag.  
Inside that loop I append each link to a list, link_lister.
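
The filter inside that loop can be checked on its own. The sketch below is only an illustration: sample_hrefs is a hand-written, hypothetical list standing in for the values a.get(‘href’) would return, with None representing an ‘a’ tag that has no href.

```python
# Hypothetical values standing in for a.get('href') on each a tag;
# None represents an a tag that has no href attribute.
sample_hrefs = [
    'https://www.census.gov/',
    '/programs-surveys/popest.html',
    None,
    '#content',
    'https://twitter.com/uscensusbureau',
]

link_lister = []
for link in sample_hrefs:
    # keep only values that exist and already start with http
    if link is not None and link.startswith('http'):
        link_lister.append(link)

print(link_lister)
```

Only the two values that start with http survive; the relative path, the fragment, and the None are skipped, which is why the relative links need a separate pass.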

In [11]:
#convert the relative /programs-surveys links to absolute URLs first and append to list

#/programs-surveys
def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/programs-surveys/"))
    links = [urljoin(url, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links

new_list = get_links(url)

for link in new_list:
    link_lister.append(link)        
    

#/newsroom
def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/newsroom/"))
    links = [urljoin(url, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links
    

new_list = get_links(url)

for link in new_list:
    link_lister.append(link)
    

#/data/
def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/data/"))
    links = [urljoin(url, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links
    
new_list = get_links(url)

for link in new_list:
    link_lister.append(link)

#/library/
def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/library/"))
    links = [urljoin(url, a['href']) for a in a_tags]  # convert relative url to absolute url
    return links    

new_list = get_links(url)

for link in new_list:
    link_lister.append(link)
    
df = pd.DataFrame(link_lister)

df.head(10)

Unnamed: 0,0
0,http://www.w3.org/1999/xhtml
1,https://twitter.com/uscensusbureau
2,https://www.census.gov/
3,https://www.census.gov/2020census
4,https://www.census.gov/2020census
5,https://www.census.gov/AmericaCounts
6,https://www.census.gov/AmericaCounts
7,https://www.census.gov/AmericaCounts
8,https://www.census.gov/EconomicCensus
9,https://www.census.gov/EconomicCensus


An absolute URL has all the information needed to locate a resource: the scheme, the host, and the path.  
A relative URL leaves out the scheme and host and is resolved against the URL of the page it appears on, which acts as the starting point. An absolute link pulled with soup.find_all(‘a’) looks like:

    <a class="data-uscb-header-dropdown-link-item uscb-header-dropdown-link-item uscb-padding-TB-10" href="https://www.census.gov/about/contact-us/staff-finder.html">

A relative link on the same page has an href that begins with a path such as /programs-surveys/ and must be joined to the page URL before it can be used on its own.
In both cases the link is the value of the ‘href’ attribute of the ‘a’ tag, and that is where we locate it.
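
To make the difference concrete, here is a small sketch of how urljoin resolves an href against the page URL. The base is the notebook's own url; the relative path /data/example.html is a made-up example, not a link taken from the page.

```python
from urllib.parse import urljoin

# The page the links were found on (the notebook's url).
base = 'https://www.census.gov/programs-surveys/popest.html'

# A relative href is resolved against the page URL.
# The path below is a made-up example.
relative = urljoin(base, '/data/example.html')

# An href that is already absolute is returned unchanged.
absolute = urljoin(base, 'https://www.census.gov/about/contact-us/staff-finder.html')

print(relative)
print(absolute)
```

Because the relative href starts with a slash, urljoin keeps only the scheme and host of the base and replaces the whole path.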

How the scraper works:  
    The first portion above only scrapes links embedded in a tags whose href attributes begin with http.  A for loop looks at each a tag, reads its href attribute, and appends the link to our link_lister list.  Some a tags have no href, so a.get(‘href’) returns None; the loop skips those items.
    
I looked through the Page Source and identified all of the links that did not have http attached.  In the code above, I created functions that use make_soup with the url to find all the a tags whose href begins with one of four key words: programs-surveys, newsroom, data, and library.  Once a link is identified, urljoin joins it to the page URL, which supplies the missing scheme and host.  This converts the relative URL to an absolute URL.

After each function builds its list of ‘links’, I assign the result to new_list and use a for loop to append each item in new_list to link_lister.  I did this for each of the four key words whose links did not show the http portion of the URL in the Page Source.
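
The four passes follow the same pattern, so they could also be folded into one helper that takes the prefixes as a parameter. The sketch below is one possible way to do that, not the notebook's code: absolute_links and PREFIXES are hypothetical names, the hrefs are made-up values, and the function works on href strings that have already been pulled from the page so that it runs without a network request.

```python
import re
from urllib.parse import urljoin

# The four path prefixes named in the text.
PREFIXES = ('/programs-surveys/', '/newsroom/', '/data/', '/library/')

def absolute_links(hrefs, base_url, prefixes=PREFIXES):
    """Return absolute URLs for the hrefs that start with one of the prefixes."""
    # one pattern that matches any of the prefixes at the start of the href
    pattern = re.compile('^(?:' + '|'.join(re.escape(p) for p in prefixes) + ')')
    return [urljoin(base_url, h) for h in hrefs if h and pattern.match(h)]

# Hypothetical href values, not taken from the real page.
hrefs = ['/data/example.html', '/about/example.html', None, '/newsroom/example.html']
base_url = 'https://www.census.gov/programs-surveys/popest.html'

links = absolute_links(hrefs, base_url)
print(links)
```

One difference from the four separate passes: this version keeps the links in page order instead of grouping them by key word, which does not matter once the list is sorted.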

In [14]:
link_lister.sort()

df = pd.DataFrame(link_lister)

df.head(10)

Unnamed: 0,0
0,http://www.w3.org/1999/xhtml
1,http://www.w3.org/1999/xhtml
2,http://www.w3.org/1999/xhtml
3,http://www.w3.org/1999/xhtml
4,http://www.w3.org/1999/xhtml
5,https://twitter.com/uscensusbureau
6,https://twitter.com/uscensusbureau
7,https://twitter.com/uscensusbureau
8,https://twitter.com/uscensusbureau
9,https://twitter.com/uscensusbureau


In the code below that extracts the unique items from the original link_lister, I’ve created two objects.  
One, duplicate, is an empty set that records every link the for loop has already seen. The other, unique_items, is an empty list that each link is appended to the first time it appears.  

The for loop looks at each item in link_lister and asks whether the item is NOT already in the duplicate set. If it is not, it is an original item, so it is appended to unique_items and added to the set. If it is, it is skipped. Thus, the duplicates are removed and each unique item appears exactly once in unique_items. 
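
The pattern just described can be tried on a small list first. This is a minimal sketch with made-up single-letter values standing in for the links; the last line compares the result with dict.fromkeys, a standard-library alternative that also keeps first-seen order.

```python
# Made-up list with repeats, standing in for link_lister.
link_lister = ['b', 'a', 'b', 'c', 'a']

duplicate = set()      # values seen so far
unique_items = []      # first occurrence of each value, in order

for x in link_lister:
    if x not in duplicate:
        unique_items.append(x)
        duplicate.add(x)

print(unique_items)

# dict keys are unique and keep insertion order (Python 3.7+),
# so this one-liner gives the same result.
assert list(dict.fromkeys(link_lister)) == unique_items
```

The set makes each membership check fast, while the list preserves the order in which the links were first seen.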

In [13]:
#removing duplicates

#create a set that records the links already seen
duplicate = set()
#create a list that the unique items will go into
unique_items = []

for x in link_lister:
    #if the item has NOT been seen yet it is added to unique_items, else it is ignored
    if x not in duplicate:
        unique_items.append(x)
        duplicate.add(x)
        
df = pd.DataFrame(unique_items)

df.head(10)

Unnamed: 0,0
0,http://www.w3.org/1999/xhtml
1,https://twitter.com/uscensusbureau
2,https://www.census.gov/
3,https://www.census.gov/2020census
4,https://www.census.gov/AmericaCounts
5,https://www.census.gov/EconomicCensus
6,https://www.census.gov/about-us
7,https://www.census.gov/about/business-opportun...
8,https://www.census.gov/about/contact-us.html
9,https://www.census.gov/about/contact-us/staff-...


Now that I have a unique_items list, I can write the list to a csv file, assured that all the links are absolute URLs. 

In [6]:
#open csv file to write to; the with block closes the file automatically
with open('desktop/BEM1 TASK 1/Part 1/link_list.csv', 'w') as csv_file:
    for link in unique_items:
        csv_file.write(link + '\n')  #write each link on its own line in the csv file

#confirm the file is closed
print(csv_file.closed)

True


The output further above shows the unique link list, which includes the website that was the subject of the extraction.  Below it is the portion of the code that opens the csv file the list is written to. A for loop goes through each item in unique_items, and each item is written to the csv file on its own line with the .write function. The result is a sorted list of links with all duplicates removed.
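
As a variation on the writing step, the sketch below uses the standard csv module and reads the file back to confirm its contents. It is an illustration under stated assumptions: the two links are stand-ins for unique_items, and the file goes to a temporary folder rather than the notebook's own path.

```python
import csv
import os
import tempfile

# Stand-in values for unique_items.
unique_items = ['https://www.census.gov/', 'https://www.census.gov/2020census']

# Write into a temporary folder so the sketch does not depend on
# the notebook's own path.
path = os.path.join(tempfile.mkdtemp(), 'link_list.csv')

# newline='' lets the csv module control line endings.
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    for link in unique_items:
        writer.writerow([link])  # one link per row

# Read the file back to confirm what was written.
with open(path, newline='') as f:
    rows = [row[0] for row in csv.reader(f)]

print(f.closed)
print(rows)
```

The csv writer quotes any value that contains a comma, which plain string concatenation would not do.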