# Web Crawler

In this notebook, we try out some of the functions offered by `BeautifulSoup` for parsing the HTML of a website. The goal is to crawl a given `URL` and output a site map listing the static assets of each page, without following links to other websites.

In [3]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

**We get the HTML `soup` of the webpage**

In [4]:
root_url = 'https://gocardless.com'
page = requests.get(root_url)
html = page.text
soup = BeautifulSoup(html,'lxml')

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width" name="viewport"/>
  <title>
   The easiest way to collect recurring payments - GoCardless
  </title>
  <meta content="GoCardless is the easy way to collect Direct Debit. Already serving more than 20,000 businesses, perfect for recurring billing and B2B invoicing." name="description"/>
  <link href="https://plus.google.com/+Gocardless" rel="publisher"/>
  <meta content="https://gocardless.com/images/logos/gocardless-square.png" name="og:image"/>
  <meta content="https://gocardless.com/images/logos/gocardless-square.png" name="og:image:secure_url"/>
  <meta content="Y80kah87ghJhwiDqw-5ap234p9wCcGt6kMRxvnamtHU" name="google-site-verification"/>
  <link href="https://gocardless.com/" rel="canonical"/>
  <link href="https://gocardless.com/" hreflang="x-default" rel="alternate"/>
  <link href="https://gocardless.com/en-ie/" hreflang="en-IE" rel="alternate"/>
  <link href="htt

**From the current page we only need the links to other pages and its static assets: images, scripts and CSS stylesheets.**

Getting the links to other **pages**:

In [6]:
soup.find_all('a', href=True)

[<a class="header-logo u-relative u-block u-padding-Vl is-active" data-reactid=".1nge5jv6om6.0.0.0.0.0.0" href="/" id="track-nav-home"><svg aria-label="GoCardless" class="site-logo__image u-fill-invert" data-reactid=".1nge5jv6om6.0.0.0.0.0.0.0" height="16" role="link" viewbox="0 0 157 16" width="157"><title data-reactid=".1nge5jv6om6.0.0.0.0.0.0.0.0">GoCardless</title><g data-reactid=".1nge5jv6om6.0.0.0.0.0.0.0.1" fill="#000"><path d="M8.394 15.983C3.472 15.983.07 12.615.07 8.05V8C.07 3.62 3.574.017 8.36.017c2.85 0 4.56.758 6.217 2.122L12.4 4.715c-1.225-.994-2.313-1.567-4.144-1.567-2.54 0-4.543 2.19-4.543 4.8V8c0 2.83 1.987 4.9 4.802 4.9 1.26 0 2.4-.302 3.282-.925V9.768H8.29v-2.93h6.875v6.703c-1.624 1.332-3.87 2.443-6.77 2.443" data-reactid=".1nge5jv6om6.0.0.0.0.0.0.0.1.0" id="Shape"></path><path d="M25.873 15.983c-4.888 0-8.394-3.554-8.394-7.932V8c0-4.38 3.54-7.983 8.428-7.983 4.887 0 8.394 3.554 8.394 7.932V8c0 4.38-3.54 7.983-8.43 7.983zM30.657 8c0-2.644-1.986-4.85-4.784-4.85s-4.75 
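Since `find_all` returns whole tags, the output above is hard to read. A minimal sketch of pulling out only the `href` values, using a small made-up HTML snippet (`sample_html` and `example.com` are illustrative, not taken from the crawled site) and the built-in `html.parser` so that no network access is needed:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
sample_html = """
<html><body>
  <a href="/about/">About</a>
  <a href="https://example.com/pricing">Pricing</a>
  <a name="no-href">Anchor without href</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that carry an href attribute
hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]
print(hrefs)  # ['/about/', 'https://example.com/pricing']
```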

Getting the **alternate versions** of the page:

In [7]:
soup.find_all('link', href=True, rel='alternate')

[<link href="https://gocardless.com/" hreflang="x-default" rel="alternate"/>,
 <link href="https://gocardless.com/en-ie/" hreflang="en-IE" rel="alternate"/>,
 <link href="https://gocardless.com/en-ca/" hreflang="en-CA" rel="alternate"/>,
 <link href="https://gocardless.com/en-us/" hreflang="en-US" rel="alternate"/>,
 <link href="https://gocardless.com/fr-be/" hreflang="fr-BE" rel="alternate"/>,
 <link href="https://gocardless.com/nl-nl/" hreflang="nl-NL" rel="alternate"/>,
 <link href="https://gocardless.com/de-de/" hreflang="de-DE" rel="alternate"/>,
 <link href="https://gocardless.com/fr-fr/" hreflang="fr-FR" rel="alternate"/>,
 <link href="https://gocardless.com/es-es/" hreflang="es-ES" rel="alternate"/>,
 <link href="https://gocardless.com/en-nz/" hreflang="en-NZ" rel="alternate"/>,
 <link href="https://gocardless.com/nl-be/" hreflang="nl-BE" rel="alternate"/>,
 <link href="https://gocardless.com/en-eu/" hreflang="en-EU" rel="alternate"/>,
 <link href="https://gocardless.com/en-se/

Getting the **style sheets**:

In [8]:
soup.find_all(rel="stylesheet")

[<link href="/bundle/main-83fa229a41c5c6dfb1ef.css" rel="stylesheet"/>]

Getting the **images**:

In [9]:
soup.find_all('img', src=True);
def image(t):
    return (t and re.compile("image").search(t))
soup.find_all('link', href=True, type=image)

[<link href="/images/favicons/favicon-196x196.png" rel="icon" sizes="196x196" type="image/png"/>,
 <link href="/images/favicons/favicon-96x96.png" rel="icon" sizes="96x96" type="image/png"/>,
 <link href="/images/favicons/favicon-32x32.png" rel="icon" sizes="32x32" type="image/png"/>,
 <link href="/images/favicons/favicon-16x16.png" rel="icon" sizes="16x16" type="image/png"/>,
 <link href="/images/favicons/favicon-128.png" rel="icon" sizes="128x128" type="image/png"/>]
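The `image` helper works because `find_all` accepts a function as an attribute filter: the function receives the attribute value, or `None` when the tag lacks that attribute. A small offline sketch of the same idea on made-up markup (`is_image_type` and `sample_html` are illustrative names):

```python
import re
from bs4 import BeautifulSoup

def is_image_type(value):
    # value is None when the tag has no type attribute
    return value is not None and re.search("image", value) is not None

# Made-up markup for illustration only
sample_html = """
<head>
  <link href="/favicon-32x32.png" rel="icon" type="image/png"/>
  <link href="/main.css" rel="stylesheet" type="text/css"/>
  <link href="/feed.xml" rel="alternate"/>
</head>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
icons = [tag["href"]
         for tag in sample_soup.find_all("link", href=True, type=is_image_type)]
print(icons)  # ['/favicon-32x32.png']
```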

Getting the **scripts**:

In [10]:
soup.find_all('script', src=True)

[<script async="" src="//www.googletagmanager.com/gtm.js?id=GTM-PRFKNC"></script>,
 <script src="/bundle/main-83fa229a41c5c6dfb1ef.js"></script>]

We define a class to describe a **page**; `children` holds the URLs of the pages that can be reached from it.

In [11]:
class Page:
    def __init__(self, n):
        self.name = n
        self.children = []
        self.style_sheets = []
        self.scripts = []
        self.images = []
    def get_str(self):
        s = '########   '+self.name+'   ########\n\n'
        s += '||Number of links: ' + str(len(self.children)) +'\n'
        for c in self.children:
            s += '\t - '+ c + '\n'
        s += '\n||Number of CSS stylesheets: ' + str(len(self.style_sheets)) +'\n'
        for f in self.style_sheets:
            s += '\t - '+ f + '\n'
        s += '\n||Number of scripts: ' + str(len(self.scripts)) +'\n'
        for f in self.scripts:
            s += '\t - '+ f + '\n'
        s += '\n||Number of images: ' + str(len(self.images)) +'\n'
        for f in self.images:
            s += '\t - '+ f + '\n'   
        s += '\n\n\n'
        return s
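As a side note, the same container could be written as a dataclass. This is only a sketch of an alternative, not the class used in the rest of the notebook; `PageRecord` is a hypothetical name. `field(default_factory=list)` gives every instance its own lists, which is what the explicit `__init__` above achieves by hand.

```python
from dataclasses import dataclass, field

@dataclass
class PageRecord:  # hypothetical name, not the notebook's Page class
    name: str
    children: list = field(default_factory=list)
    style_sheets: list = field(default_factory=list)
    scripts: list = field(default_factory=list)
    images: list = field(default_factory=list)

home = PageRecord("https://gocardless.com")
home.children.append("https://gocardless.com/about/")
other = PageRecord("https://gocardless.com/about/")

print(home.children)   # ['https://gocardless.com/about/']
print(other.children)  # [] (each instance has its own list)
```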

We store the pages in a dictionary, with the page URL as a key.

In [12]:
pages = {}

We need a function to tell us whether a URL is inside the site or not. If it begins with '/', it is an absolute path from the root URL; otherwise it has to begin with the root URL to be accepted, as we don't crawl other websites. Relative links without a leading '/' are discarded.

In [34]:
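def get_url(url, page_url):
    # page_url is currently unused: relative links such as 'pricing/' are dropped
    # Protocol-relative link ('//host/path'): treated as pointing to another host
    if url.startswith('//'):
        return ''
    # Absolute path
    if url.startswith('/'):
        return root_url + url
    # Link to a full url: keep it only if it stays on this site
    if url == root_url or url.startswith(root_url + '/'):
        return url
    return ''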
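The prefix checks above could also be expressed with the standard library's `urllib.parse`, which resolves relative and protocol-relative links against the page they were found on. The sketch below is an alternative, not the function used in the rest of the notebook; `resolve_internal` and its `root` parameter are hypothetical names.

```python
from urllib.parse import urljoin, urlparse

def resolve_internal(href, page_url, root="https://gocardless.com"):
    """Return the absolute URL if href stays on root's host, else ''."""
    # urljoin handles '/path', 'relative/path' and '//host/path' alike
    absolute = urljoin(page_url, href)
    target = urlparse(absolute)
    if target.scheme not in ("http", "https"):
        return ""  # e.g. mailto: links
    if target.netloc != urlparse(root).netloc:
        return ""  # link leaves the site
    return absolute

page = "https://gocardless.com/features/"
print(resolve_internal("/about/", page))    # https://gocardless.com/about/
print(resolve_internal("pricing/", page))   # https://gocardless.com/features/pricing/
print(resolve_internal("//www.googletagmanager.com/gtm.js", page))  # empty string
```

Comparing hosts rather than string prefixes also rejects look-alike hosts that merely start with the root URL.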

We define a function keeping only **valid URLs** from the list returned by `soup`.

In [35]:
def apply_url(l, page):
    url = get_url(l['href'], page)
    # No special chars
    chars = set(';?@=&$,#')
    if url != '' and not any(c in chars for c in url):
        return url
    return ''
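One consequence of the character check is that a link such as `/about/#team` is dropped entirely instead of being reduced to `/about/`. A possible refinement, shown only as a sketch and not used in the rest of the notebook, is to strip the fragment and test for a query string with `urllib.parse`; `strip_fragment` and `has_query` are hypothetical helper names.

```python
from urllib.parse import urldefrag, urlparse

def strip_fragment(url):
    """Return url without its '#fragment' part."""
    return urldefrag(url).url

def has_query(url):
    """True if url carries a '?query' part."""
    return urlparse(url).query != ""

print(strip_fragment("https://gocardless.com/about/#team"))  # https://gocardless.com/about/
print(has_query("https://gocardless.com/search?q=debit"))    # True
print(has_query("https://gocardless.com/about/"))            # False
```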

In [36]:
def get_urls_page(soup, p):
    l = []
    links = soup.find_all('a', href=True) + soup.find_all('link', href=True, rel='alternate')
    for link in links:
        url = apply_url(link,p)
        if url !=  '':
            l.append(url)
    return list(set(l))

Get the style sheets, scripts and images:

In [37]:
def get_data(soup_list, p):
    # Collect attribute p from every tag, dropping duplicates
    return list({s[p] for s in soup_list or []})

In [38]:
def image(t):
    return (t and re.compile("image").search(t))

We define a function that **crawls a page** and, recursively, its children.

In [39]:
def crawl_page(url):
    page = requests.get(url)
    # A Response is truthy only when its status code is below 400
    if page:
        html_page = page.text
        # Get the html
        soup = BeautifulSoup(html_page,'lxml')
        # Instantiate the new page found
        p = Page(url)
        p.children = get_urls_page(soup, url)
        p.style_sheets = get_data(soup.find_all('link', rel='stylesheet'), 'href')
        p.scripts = get_data(soup.find_all('script', src=True), 'src')
        p.images = get_data(soup.find_all('img', src=True ), 'src')
        p.images = p.images + get_data(soup.find_all('link', href=True, type=image), 'href')
        # Add it to the dictionary
        pages[url] = p
        # Crawl the links found in the page
        for c in p.children:
            if c not in pages:
                crawl_page(c)

In [40]:
crawl_page(root_url)
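Because `crawl_page` calls itself for every new link, a long chain of pages could in principle reach Python's recursion limit. A common alternative is an explicit queue. The sketch below shows only the traversal, on a made-up in-memory link graph (`toy_site`), so that it runs without network access; `crawl_iteratively` and `get_children` are hypothetical names, and in the notebook `get_children` would wrap the request and parsing steps.

```python
from collections import deque

def crawl_iteratively(start_url, get_children):
    """Breadth-first traversal visiting each URL exactly once."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for child in get_children(url):
            if child not in seen:
                seen.add(child)  # mark when queued, so it is never queued twice
                queue.append(child)
    return order

# Made-up link graph standing in for real pages
toy_site = {
    "/": ["/about/", "/features/"],
    "/about/": ["/", "/about/jobs/"],
    "/features/": ["/"],
    "/about/jobs/": ["/about/"],
}

visited = crawl_iteratively("/", lambda url: toy_site.get(url, []))
print(visited)  # ['/', '/about/', '/features/', '/about/jobs/']
```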

We will represent the sitemap in two ways: we print each entry with the data gathered while crawling the website, and we generate an XML file in the sitemap format accepted by Google.

In [178]:
def print_pages(dic):
    for entry in dic:
        # On Python 3, str is already Unicode, so no encode is needed before printing
        print(dic[entry].get_str())

In [179]:
print_pages(pages)

########   https://gocardless.com/about/jobs/inside-account-executive-spain/   ########

||Number of links: 60
	 - https://gocardless.com/about/jobs/head-of-operations/
	 - https://gocardless.com/fr-fr/
	 - https://gocardless.com/about/jobs/european-marketing-manager/
	 - https://gocardless.com/about/jobs/enterprise-account-executive/
	 - https://gocardless.com/about/jobs/customer-support-france/
	 - https://gocardless.com/en-se/about/jobs/inside-account-executive-spain/
	 - https://gocardless.com/education/
	 - https://gocardless.com/en-ca/about/jobs/inside-account-executive-spain/
	 - https://gocardless.com/about/
	 - https://gocardless.com/accountants/
	 - https://gocardless.com/finance/
	 - https://gocardless.com/en-ie/about/jobs/inside-account-executive-spain/
	 - https://gocardless.com/users/sign_in
	 - https://gocardless.com/guides
	 - https://gocardless.com
	 - https://gocardless.com/contact-sales/
	 - https://gocardless.com/features/
	 - https://gocardless.com/en-eu/about/jobs

The **XML format** is described by Google at https://support.google.com/webmasters/answer/183668?hl=en&ref_topic=4581190

In [61]:
def get_google_xml(pages):
    s = '<?xml version="1.0" encoding="UTF-8"?> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">'
    for page in pages:
        s = s+'<url><loc>'+page+'</loc>'
        for image in pages[page].images:
            s = s+'<image:image><image:loc>'+image+'</image:loc></image:image>'
        s = s+'</url>'
    s = s+'</urlset>'
    return s
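String concatenation does not escape characters such as `&`, and the image paths collected by the crawler are often site-relative. A hedged sketch of a variant built with `xml.etree.ElementTree`, which escapes text automatically, and `urljoin`, which makes image paths absolute, follows. `build_sitemap` is a hypothetical name and takes a plain `{page_url: [image_src, ...]}` dictionary rather than the `Page` objects above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

def build_sitemap(page_images):
    """Build sitemap XML from a {page_url: [image_src, ...]} mapping."""
    # Registering prefixes keeps the output free of generated ns0/ns1 names
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, images in page_images.items():
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = page_url
        for src in images:
            image_el = ET.SubElement(url_el, f"{{{IMAGE_NS}}}image")
            # Resolve '/images/x.png' against the page it was found on
            ET.SubElement(image_el, f"{{{IMAGE_NS}}}loc").text = urljoin(page_url, src)
    return ET.tostring(urlset, encoding="unicode")

xml_text = build_sitemap({
    "https://gocardless.com/about/": ["/images/flags/GB-flag-icon@2x.png"],
})
print(xml_text)
```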

In [137]:
get_google_xml(pages)

'<?xml version="1.0" encoding="UTF-8"?> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"><url><loc>https://gocardless.com/about/jobs/inside-account-executive-spain/</loc><image:image><image:loc>/images/flags/CA-flag-icon@2x.png</image:loc></image:image><image:image><image:loc>/images/flags/BE-flag-icon@2x.png</image:loc></image:image><image:image><image:loc>/images/flags/NL-flag-icon@2x.png</image:loc></image:image><image:image><image:loc>/images/flags/NZ-flag-icon@2x.png</image:loc></image:image><image:image><image:loc>/images/flags/DE-flag-icon@2x.png</image:loc></image:image><image:image><image:loc>/images/flags/US-flag-icon@2x.png</image:loc></image:image><image:image><image:loc>/images/flags/IE-flag-icon@2x.png</image:loc></image:image><image:image><image:loc>/images/flags/GB-flag-icon@2x.png</image:loc></image:image><image:image><image:loc>/images