
Be extremely conscientious about how much bandwidth the crawler uses, and make every effort to determine whether there is a way to lighten the load on the target server.
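
One simple way to reduce load is to enforce a minimum pause between requests. The following is a minimal sketch under that assumption; the `Throttle` class, its `min_interval` parameter, and the 0.1-second value are illustrative and not part of the original notebook:

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()

# Usage: call wait() right before each request
throttle = Throttle(min_interval=0.1)
for url in ['/wiki/Kevin_Bacon', '/wiki/Philadelphia']:
    throttle.wait()
    print('would fetch', url)
```

A real crawl would normally use a much longer interval than the one shown here.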

#Traversing a Single Domain
To begin with, write a Python script that retrieves an arbitrary Wikipedia page and produces a list of links on that page:

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [0]:
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')

In [3]:
for link in bs.find_all('a'):
  if 'href' in link.attrs:
    print(link.attrs['href'])

/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/Kevin_Bacon_(disambiguation)
/wiki/File:Kevin_Bacon_SDCC_2014.jpg
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
#cite_note-1
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
http://baconbros.com/
#cite_note-2
#cite_note-actor-3
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
#cite_note-4
/wiki/Hollywood_Walk_of_Fame
#cite_note-5
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
#cite_note-walk-6
#Early_life_and_education
#Acting_career
#Early_work
#1980s
#1990s
#2000s
#2010s
#Advertising_work
#Personal_life
#Six_Degrees_of_Kevi

Not all of these links are useful: Wikipedia is full of sidebar, footer, and header links that appear on every page, along with links to category pages, talk pages, and other pages that are not articles.\\
Links that point to article pages have three things in common:
1. They reside within the `div` with the `id` set to `bodyContent`.
2. The URLs do not contain colons.
3. The URLs begin with */wiki/*.

These can be used as rules to rewrite the code and retrieve only the desired article links with the *regular expression* `^(/wiki/)((?!:).)*$`:

In [0]:
import re

In [5]:
for link in bs.find('div', {'id': 'bodyContent'}).find_all('a',
                                                           href=re.compile('^(/wiki/)((?!:).)*$')
                                                          ):
  if 'href' in link.attrs:
    print(link.attrs['href'])

/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Pennsylvania
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Holly_Near
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Sleepers
/wiki/The_Woodsman_(2004_film)
/wiki/Fox_Broadcasting_Company
/wiki/The_Following
/wiki/HBO
/wiki/Taking_Chance
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/The_Guardian
/wiki/Academy_Award
/wiki/Hollywood_Walk_of_Fame
/wiki/Social_networks
/wiki/Six_Degrees_of_Kevin_Bacon
/wiki/SixDegrees.org
/wiki/Philadelphia
/wiki/Edmund_Bacon_(architect)
/wiki/Julia_R._Masterman_High_School
/wiki/Pennsylvania_Governor%27s_School_for_the_Arts
/wiki/Bucknell_University
/wiki/Glory_Van_Scott
/wiki/Circle_in_the_Square
/wiki/Nancy_Mills
/wiki/Cosmopolitan_(magazine)
/wiki/Fraternities_and_sororities
/wiki/Animal_House
/wiki/Search_for_Tomorrow
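
To see how the colon and */wiki/* rules map onto the pattern, the sketch below applies the same regular expression to a few hrefs copied from the first, unfiltered output. The names `article_pattern` and `sample_hrefs` are illustrative. The `bodyContent` rule is not exercised here, because it depends on the page structure rather than on the href itself:

```python
import re

# Same pattern as above: starts with /wiki/ and contains no colon afterwards
article_pattern = re.compile(r'^(/wiki/)((?!:).)*$')

# Hand-picked hrefs from the unfiltered output (illustrative sample)
sample_hrefs = [
    '/wiki/Kevin_Bacon_(disambiguation)',      # article link: kept
    '/wiki/Wikipedia:Protection_policy#semi',  # contains a colon: dropped
    '/wiki/File:Kevin_Bacon_SDCC_2014.jpg',    # contains a colon: dropped
    '#cite_note-1',                            # does not start with /wiki/: dropped
    'http://baconbros.com/',                   # external link: dropped
]

article_hrefs = [href for href in sample_hrefs if article_pattern.match(href)]
print(article_hrefs)
```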

Apply this code to write more useful functions:
* `getLinks` takes in a Wikipedia article URL of the form `/wiki/<Article_name>` and returns a list of all linked article URLs in the same form
*  A main function that calls `getLinks` with a starting article, chooses a random article link from the returned list, and calls `getLinks` again, until users stop the program or until no article links are found on the new page.

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

In [0]:
random.seed(datetime.datetime.now().timestamp())  #seed with a float; Python 3.11+ rejects datetime objects

def getLinks(articleURL):
  html = urlopen('http://en.wikipedia.org{}'.format(articleURL))
  bs = BeautifulSoup(html, 'html.parser')
  return bs.find('div', {'id':'bodyContent'}).find_all('a',
                                                       href=re.compile('^(/wiki/)((?!:).)*$')
                                                      )


In [13]:
links = getLinks('/wiki/Kevin_Bacon')

while len(links) > 0:
  newArticle = links[random.randint(0, len(links)-1)].attrs['href']
  print(newArticle)
  links = getLinks(newArticle)
  
  #stop at random so that this demo does not run forever
  if random.randint(0,20) < 5:
    break

/wiki/Al_Pacino
/wiki/Ren%C3%A9e_Fleming
/wiki/Janet_Baker
/wiki/Bernard_Haitink
/wiki/Murray_Perahia
/wiki/Barbican_Centre
/wiki/Greenwich_Playhouse
/wiki/Young_Vic
/wiki/Haworth_Tompkins
/wiki/Future_Systems
/wiki/Richard_Rogers


#Crawling an Entire Site
Crawling an entire site, especially a large one, is a memory-intensive process that is
best suited to applications for which a database to store crawling results is readily
available. \\
The general approach to an exhaustive site crawl is to start with a top-level page (such
as the home page), and search for a list of all internal links on that page. Every one of
those links is then crawled, and additional lists of links are found on each one of
them, triggering another round of crawling.

To avoid crawling the same page twice, it is extremely important that all internal links
discovered are formatted consistently and kept in a running set for easy lookups
while the program is running. \\
A *set* is similar to a list, but its elements have no
specific order, and only unique elements are stored. Only links that are "new" should be crawled and searched for additional links.
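
What "formatted consistently" means depends on the site. As one possible sketch, the hypothetical `normalize` helper below resolves each href against the page it was found on and drops the `#fragment`, so that several spellings of the same page collapse into a single set entry:

```python
from urllib.parse import urldefrag, urljoin

def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment (hypothetical helper)."""
    absolute = urljoin(base_url, href)
    url, _fragment = urldefrag(absolute)
    return url

seen = set()
base = 'http://en.wikipedia.org/wiki/Kevin_Bacon'
for href in ['/wiki/Philadelphia',
             '/wiki/Philadelphia#History',
             'http://en.wikipedia.org/wiki/Philadelphia']:
    seen.add(normalize(base, href))

# All three hrefs refer to the same page, so the set holds one entry
print(seen)
```

Without a step like this, links that differ only in their fragment are stored and crawled as separate pages.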

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

In [0]:
pages = set()

def getLinks(pageURL):
  global pages
  html = urlopen('http://en.wikipedia.org{}'.format(pageURL))
  bs = BeautifulSoup(html, 'html.parser')
  
  for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
    if 'href' in link.attrs:
      if link.attrs['href'] not in pages:
        #new page
        newPage = link.attrs['href']
        print(newPage)
        pages.add(newPage)
        getLinks(newPage)

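Notice that article pages do not contain colons, but file-upload pages, talk pages, and the like do contain colons in the URL. Because the pattern above matches every link that begins with `/wiki/`, the crawl below visits those pages as well.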

In [21]:
getLinks('')

/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Wikipedia:Requests_for_page_protection
/wiki/Wikipedia:Requests_for_permissions
/wiki/Wikipedia:Protection_policy#extended
/wiki/Wikipedia:Lists_of_protected_pages
/wiki/Wikipedia:Protection_policy
/wiki/Wikipedia:Perennial_proposals
/wiki/Wikipedia:Reliable_sources/Perennial_sources
/wiki/Wikipedia:Reliable_sources
/wiki/Wikipedia:RS_(disambiguation)
/wiki/Wikipedia:WikiProject_Radio_Stations
/wiki/File:People_icon.svg
/wiki/Special:WhatLinksHere/File:People_icon.svg
/wiki/Help:What_links_here
/wiki/Wikipedia:Project_namespace#How-to_and_information_pages
/wiki/Wikipedia:Protection_policy#move
/wiki/Wikipedia:WPPP
/wiki/Wikipedia:WikiProject
/wiki/Wikipedia:Wikimedia_sister_projects
/wiki/Help:Interwikimedia_links
/wiki/Help:Interlanguage_links
/wiki/List_of_ISO_639-1_codes
/wiki/File:Question_book-new.svg
/wiki/Wikipedia:Protection_policy#full
/wiki/Wikipedia:Party_and_person
/wiki/File:Essay.svg
/wiki/File:Essay.png
/wiki/

KeyboardInterrupt: ignored

Note that Python has a default recursion limit of 1000. Because `getLinks` calls itself for every new link, the program will eventually hit that limit and stop with a `RecursionError`, unless a recursion counter or a similar guard is added to prevent that from happening.
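
One way to sidestep the recursion limit altogether is to replace recursion with an explicit queue. The following is a minimal sketch, not the notebook's method: `crawl`, `get_links`, and `max_pages` are hypothetical names, and `get_links` stands in for any function that returns the links found on a page (here a small in-memory dictionary, so no requests are made):

```python
from collections import deque

def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl that uses an explicit queue instead of recursion."""
    seen = {start}          # every page ever queued
    queue = deque([start])  # pages waiting to be visited
    order = []              # pages in the order they were visited
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "site" standing in for real HTTP requests
fake_site = {
    '/a': ['/b', '/c'],
    '/b': ['/a', '/d'],
    '/c': ['/d'],
    '/d': [],
}
print(crawl('/a', lambda page: fake_site.get(page, [])))
```

The `max_pages` cap bounds the crawl by the number of pages visited rather than by call depth.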

##Collecting Data Across an Entire Site
Before writing a web crawler, the first thing to do is to look at a few pages from the site and determine a pattern. \\
For example, for Wikipedia pages:
* All titles (on all pages, regardless of their status as an article page, an edit history
page, or any other page) appear under `h1` $\rightarrow$ `span` tags, and these are the only `h1` tags on the page.
* All body text lives under the `div#bodyContent` tag. To access just the first paragraph of text, use `div#mw-content-text` $\rightarrow$ `p` and select the first paragraph tag only. This is true for all content pages except file pages.
* Edit links occur only on article pages. If they occur, they will be found in the `li#ca-edit` tag, under `li#ca-edit` $\rightarrow$ `span` $\rightarrow$ `a`.

The following is the modified version of the crawling code:

In [0]:
pages = set()

def getLinks(pageURL):
  global pages
  html = urlopen('http://en.wikipedia.org{}'.format(pageURL))
  bs = BeautifulSoup(html, 'html.parser')
  
  try:
    print(bs.h1.get_text())
    print(bs.find(id='mw-content-text').find_all('p')[0])
    print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
  except (AttributeError, IndexError):  #a tag was not found, or the page has no <p>
    print('This page is missing something! Continuing.')
    
  for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
    if 'href' in link.attrs:
      if link.attrs['href'] not in pages:
        #new page
        newPage = link.attrs['href']
        print('-'*20)
        print(newPage)
        pages.add(newPage)
        getLinks(newPage)

In [30]:
getLinks('/wiki/Kevin_Bacon')

Kevin Bacon
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Protection_policy#semi
Wikipedia:Protection policy
<p class="mw-empty-elt">
</p>
This page is missing something! Continuing.
--------------------
/wiki/Wikipedia:Requests_for_page_protection
Wikipedia:Requests for page protection
<p>This page is for requesting that a page, file or template be <b> fully protected</b>, <b>create protected</b> (<a href="/wiki/Wikipedia:Protection_policy#Creation_protection" title="Wikipedia:Protection policy">salted</a>), <b>extended confirmed protected</b>, <b>semi-protected</b>, added to <b>pending changes</b>, <b>move-protected</b>, <b>template protected</b>, <b>upload protected</b> (file-specific), or <b>unprotected</b>. Please read up on the <a href="/wiki/Wikipedia:Protection_policy" title="Wikipedia:Protection policy">protection policy</a>. Full protection is used to stop edit warring between multiple users or to prevent vandal

KeyboardInterrupt: ignored

###Handling Redirects
Redirects come in two types:
* Server-side redirects, where the URL is changed before the page is loaded
* Client-side redirects, sometimes seen with a “You will be redirected in 10 seconds” type of message, where the page loads before redirecting to the new one

The `urllib` library handles server-side redirects automatically. If the `requests` library is used, make sure the `allow_redirects` flag is set to `True` (its default for GET requests):


```
 r = requests.get('http://github.com', allow_redirects=True)
```
Just be aware that, occasionally, the URL of the page you’re crawling might not be
exactly the URL that you entered the page on.
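
With `urllib`, the URL a response actually came from is available on the response object, so it can be compared with the requested URL. The following is a minimal sketch; the `fetch` helper is a hypothetical name, and a `data:` URL is used only so that the example runs without network access:

```python
from urllib.request import urlopen

def fetch(url):
    """Return (final_url, body); final_url reflects any server-side redirect."""
    with urlopen(url) as response:
        return response.url, response.read()

# A data: URL involves no redirect, so the final URL equals the requested one.
requested = 'data:text/plain,hello'
final_url, body = fetch(requested)
print(final_url == requested, body)
```

When crawling, storing `final_url` rather than the requested URL in the set of visited pages helps avoid visiting a redirect target twice.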


#Crawling Across the Internet
The following code can be used as a template for crawlers that follow links from one site to another:

In [0]:
from urllib.request import urlopen
from urllib.parse import urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

In [0]:
pages = set()
random.seed(datetime.datetime.now().timestamp())  #seed with a float; Python 3.11+ rejects datetime objects

In [0]:
#Retrieve a list of all Internal links found on a page
def getInternalLinks(bs, includeURL):
  includeURL = '{}://{}'.format(urlparse(includeURL).scheme, urlparse(includeURL).netloc)
  internalLinks = []
  
  #Find all links that begin with a "/" or contain the site's own URL
  for link in bs.find_all('a', href=re.compile('^(/|.*'+includeURL+')' )):
    if link.attrs['href'] is not None:
      if link.attrs['href'] not in internalLinks:
        if link.attrs['href'].startswith('/'):
          internalLinks.append(includeURL + link.attrs['href'])
        else:
          internalLinks.append(link.attrs['href'])
          
  return internalLinks

In [0]:
#Retrieve a list of all External links found on a page
def getExternalLinks(bs, excludeURL):
  externalLinks = []
  
  #Find all links that start with "http" or "www" and do not contain the current URL
  for link in bs.find_all('a', href=re.compile('^(http|www)((?!' +excludeURL+ ').)*$' )):
    if link.attrs['href'] is not None:
      if link.attrs['href'] not in externalLinks:
        externalLinks.append(link.attrs['href'])
        
  return externalLinks

In [0]:
def getRandomExternalLink(startingPage):
  html = urlopen(startingPage)
  bs = BeautifulSoup(html, 'html.parser')
  externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
  
  if len(externalLinks) == 0:
    print('No external links, looking around the site for one')
    domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)
    internalLinks = getInternalLinks(bs, domain)
    return getRandomExternalLink(internalLinks[random.randint(0, len(internalLinks)-1) ])
  else:
    return externalLinks[random.randint(0, len(externalLinks)-1 )]

In [0]:
def followExternalOnly(startingSite):
  externalLink = getRandomExternalLink(startingSite)
  print('Random external link is {}'.format(externalLink))
  followExternalOnly(externalLink)

In [49]:
#Try... remember to manually stop the execution
followExternalOnly('http://oreilly.com')

Random external link is https://itunes.apple.com/us/app/safari-to-go/id881697395
Random external link is http://www.oreilly.com/
Random external link is https://www.facebook.com/OReilly/
Random external link is http://l.facebook.com/l.php?u=http%3A%2F%2Fshare.here.com%2Fr%2Fmylocation%2Fe-eyJuYW1lIjoiTydSZWlsbHkgTWVkaWEiLCJhZGRyZXNzIjoiMTAwNSBHcmF2ZW5zdGVpbiBId3kgTiwgU2ViYXN0b3BvbCwgQ2FsaWZvcm5pYSIsImxhdGl0dWRlIjozOC40MTE0NTUxNjI4ODEsImxvbmdpdHVkZSI6LTEyMi44NDA5NDQ3MjYxMiwicHJvdmlkZXJOYW1lIjoiZmFjZWJvb2siLCJwcm92aWRlcklkIjoxNTEzNzUwMDQzMH0%3D%3Flink%3Daddresses%26fb_locale%3Den_US%26ref%3Dfacebook&h=AT0QhYx9Tb5ebZ0OIlMhtu1dR2nvGjRmW2scTv-2v0ZcX5VpJF9-0ezNBYj96UaqdDYUClW_eBUbitcl-KFeGWp6XiFLGofjjeCr4yn0jbblsJmBC8VmAVdbJBUmMViANzemj8bpD9bF7xFp
No external links, looking around the site for one


ValueError: ignored

External links are not always guaranteed to be found on the first page of a website. To
find external links in this case, a method similar to the one used in the previous
crawling example is employed to recursively drill down into a website until it finds an
external link as shown below:
![](https://github.com/lblogan14/web_scraping_with_python/blob/master/img/ch3/crawl_internet.JPG?raw=true)
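
The `ValueError` in the output above would be consistent with `random.randint(0, -1)` being called on an empty list of internal links, although the traceback is not shown, so this is an assumption. The following is a minimal sketch of a guarded drill-down. `find_external_link`, `get_links`, and `max_hops` are hypothetical names, and `get_links(url)` is assumed to return an `(internal_links, external_links)` pair, supplied here by an in-memory dictionary instead of real requests:

```python
import random

def find_external_link(start, get_links, max_hops=10):
    """Follow random internal links until a page with external links is found.

    Returns an external link, or None if none is found within max_hops.
    """
    page = start
    for _ in range(max_hops):
        internal, external = get_links(page)
        if external:
            return random.choice(external)
        if not internal:
            # Dead end: nothing left to follow, so give up instead of
            # picking a random element from an empty list.
            return None
        page = random.choice(internal)
    return None

# In-memory stand-in for a site: each page maps to (internal, external) links
fake_site = {
    'http://a.test/': (['http://a.test/about'], []),
    'http://a.test/about': ([], ['http://b.test/']),
    'http://c.test/': ([], []),
}
print(find_external_link('http://a.test/', fake_site.get))
print(find_external_link('http://c.test/', fake_site.get))
```

The loop with `max_hops` also bounds the search, which the recursive version does not.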

To collect all the external links found anywhere on a site, starting from a given page:

In [0]:
allExtLinks = set()
allIntLinks = set()

def getAllExternalLinks(siteURL):
  html = urlopen(siteURL)
  domain = '{}://{}'.format(urlparse(siteURL).scheme,
                            urlparse(siteURL).netloc)
  bs = BeautifulSoup(html, 'html.parser')
  internalLinks = getInternalLinks(bs, domain)
  externalLinks = getExternalLinks(bs, domain)
  
  for link in externalLinks:
    if link not in allExtLinks:
      allExtLinks.add(link)
      print(link)
      
  for link in internalLinks:
    if link not in allIntLinks:
      allIntLinks.add(link)
      getAllExternalLinks(link)

This function has two loops: the first records and prints each new external link, and the second recurses into each new internal link. The flowchart is shown below:
![](https://github.com/lblogan14/web_scraping_with_python/blob/master/img/ch3/all_external.JPG?raw=true)
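
The split between internal and external links can also be made by comparing hostnames instead of matching strings with a regular expression. The following is a sketch of that alternative, not the notebook's method. `host_key` and `split_links` are hypothetical names, and treating `www.example.com` and `example.com` as the same site while treating other subdomains as external is a simplifying assumption:

```python
from urllib.parse import urljoin, urlparse

def host_key(url):
    """Lowercased hostname without a leading 'www.' (a simplifying assumption)."""
    host = (urlparse(url).hostname or '').lower()
    return host[4:] if host.startswith('www.') else host

def split_links(page_url, hrefs):
    """Return (internal, external) lists of absolute http(s) URLs."""
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, and similar
        same_site = host_key(absolute) == host_key(page_url)
        bucket = internal if same_site else external
        if absolute not in bucket:
            bucket.append(absolute)
    return internal, external

internal, external = split_links('http://oreilly.com', [
    '/about/',
    'https://www.oreilly.com/ideas/',
    'https://conferences.oreilly.com/strata',
    'mailto:info@example.com',
])
print(internal)
print(external)
```

With this rule, a link such as `https://www.oreilly.com/ideas/` counts as internal when crawling `http://oreilly.com`, even though the scheme and the `www.` prefix differ.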

In [0]:
allIntLinks.add('http://oreilly.com')

In [53]:
getAllExternalLinks('http://oreilly.com')

https://www.oreilly.com
https://www.oreilly.com/sign-in.html
https://www.oreilly.com/online-learning/try-now.html
https://www.oreilly.com/online-learning/index.html
https://www.oreilly.com/online-learning/individuals.html
https://www.oreilly.com/online-learning/teams.html
https://www.oreilly.com/online-learning/enterprise.html
https://www.oreilly.com/online-learning/government.html
https://www.oreilly.com/online-learning/academic.html
https://www.oreilly.com/online-learning/features.html
https://www.oreilly.com/online-learning/custom-services.html
https://www.oreilly.com/online-learning/pricing.html
https://www.oreilly.com/conferences/
https://conferences.oreilly.com/artificial-intelligence
https://conferences.oreilly.com/oscon
https://conferences.oreilly.com/software-architecture
https://conferences.oreilly.com/strata
https://conferences.oreilly.com/tensorflow
https://conferences.oreilly.com/velocity
https://www.oreilly.com/ideas/
https://www.oreilly.com/about/approach.html
https://co

KeyboardInterrupt: ignored