# Web crawling

The following content draws heavily on [Web Scraping with Python](https://proquest.safaribooksonline.com/book/programming/python/9781491985564) (2018) by Ryan Mitchell.

- First, we learn how to create a list of URLs from a webpage.
- Second, we learn how to crawl through that list of URLs.

## Web crawling 

The idea behind web crawling is recursion. Instead of scraping one web page, we repeat the same process across many web pages (or websites), reaching each one through its Uniform Resource Locator (URL). Because web crawling substantially increases the workload on the target's server, do this very carefully and, again, stay within *technical*, *legal*, and *ethical* boundaries.
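The recursive idea can be sketched without touching the network. The example below is a minimal illustration, not the method used in the rest of this section: `TOY_WEB` is a made-up dictionary standing in for pages and their outgoing links, and `crawl` is a hypothetical helper that records a visit to a page and then calls itself on every link it has not seen yet, up to an assumed depth limit.

```python
# A toy "web": each key is a page, each value is the list of pages it links to.
# These paths are invented for illustration only.
TOY_WEB = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/D"],
    "/wiki/C": [],
    "/wiki/D": ["/wiki/C"],
}


def crawl(page, visited=None, max_depth=2, depth=0):
    """Visit `page`, then recursively visit its unseen links up to `max_depth`."""
    if visited is None:
        visited = []
    if depth > max_depth or page in visited:
        return visited
    visited.append(page)  # "scrape" the page (here we only record the visit)
    for link in TOY_WEB.get(page, []):
        crawl(link, visited, max_depth, depth + 1)
    return visited


order = crawl("/wiki/A")
print(order)
```

The `visited` list keeps the crawler from looping forever between pages that link to each other, and the depth limit keeps the number of requests bounded.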

### Create a list of URLs

Since we're studying computational methods, let's look at the list of past winners of the Turing Award (the computer science equivalent of the Nobel Prize) and collect the links to the winners' pages.


To begin with, let's find out whether we can gain access to the web page.

In [1]:
from urllib.request import urlopen 
from urllib.error import HTTPError
from urllib.error import URLError

try:
    page = urlopen('https://en.wikipedia.org/wiki/Turing_Award')
except HTTPError as e: # the server answered with an error status (e.g., 404)
    print(e)
except URLError as e: # the server could not be reached at all
    print("The server is broken")
else:
    print("The site is working")

The site is working
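A successful response only tells us the page is reachable, not that the site welcomes crawlers. One common technical check is the site's `robots.txt` file, which the standard library can parse with `urllib.robotparser`. The sketch below is only an illustration: `SAMPLE_ROBOTS_TXT` and the `example.org` URLs are invented for this example and are not Wikipedia's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules for illustration; a real crawler would load the
# file from the target site (e.g., with set_url() followed by read()).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed = parser.can_fetch("*", "https://example.org/wiki/Turing_Award")
blocked = parser.can_fetch("*", "https://example.org/private/page")

print(allowed, blocked)
```

Passing `robots.txt` is a minimum, not a guarantee: a site's terms of use may impose further limits.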


Then, let's scrape the table and find the elements (the URLs of the prize winners) we need.

In [2]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "html.parser")

wiki_table = soup.find('table',{'class':'wikitable'})

Find some patterns we can exploit.

In [3]:
import re

wiki_table.find_all('tr')[1].find_all('td')[0].find('a').get('href') # first winner
wiki_table.find_all('tr')[2].find_all('td')[0].find('a').get('href') # second winner (only this last expression is displayed)

'/wiki/Maurice_Wilkes'

Calculate the total number of rows in the table. Remember that each row, apart from the header row, contains information about a prize winner.

In [4]:
len(wiki_table.find_all('tr')) 

68

In [5]:
links = []

# Row 0 is the header, so start from row 1; each remaining row describes one winner.
for row in wiki_table.find_all('tr')[1:]:
    link = row.find_all('td')[0].find('a').get('href') # link in the first cell
    links.append(link)
    print(link) # to check whether we get the right list

/wiki/Alan_Perlis
/wiki/Maurice_Wilkes
/wiki/Richard_Hamming
/wiki/Marvin_Minsky
/wiki/James_H._Wilkinson
/wiki/John_McCarthy_(computer_scientist)
/wiki/Edsger_W._Dijkstra
/wiki/Charles_Bachman
/wiki/Donald_Knuth
/wiki/Allen_Newell
/wiki/Herbert_A._Simon
/wiki/Michael_O._Rabin
/wiki/Dana_Scott
/wiki/John_Backus
/wiki/Robert_W._Floyd
/wiki/Kenneth_E._Iverson
/wiki/Tony_Hoare
/wiki/Edgar_F._Codd
/wiki/Stephen_A._Cook
/wiki/Ken_Thompson_(computer_programmer)
/wiki/Dennis_M._Ritchie
/wiki/Niklaus_Wirth
/wiki/Richard_M._Karp
/wiki/John_Hopcroft
/wiki/Robert_Tarjan
/wiki/John_Cocke
/wiki/Ivan_Sutherland
/wiki/William_Kahan
/wiki/Fernando_J._Corbat%C3%B3
/wiki/Robin_Milner
/wiki/Butler_W._Lampson
/wiki/Juris_Hartmanis
/wiki/Richard_E._Stearns
/wiki/Edward_Feigenbaum
/wiki/Raj_Reddy
/wiki/Manuel_Blum
/wiki/Amir_Pnueli
/wiki/Douglas_Engelbart
/wiki/Jim_Gray_(computer_scientist)
/wiki/Frederick_P._Brooks
/wiki/Andrew_Chi-Chih_Yao
/wiki/Ole-Johan_Dahl
/wiki/Kristen_Nygaard
/wiki/Ron_Rivest
/wiki/

**Challenge**

But we have one problem: the links are not the full URLs we need. We want "https://en.wikipedia.org/wiki/Alan_Perlis", not "/wiki/Alan_Perlis". How can you fix this?

In [6]:
''.join(["https://en.wikipedia.org",links[0]]) # using join method

'https://en.wikipedia.org/wiki/Alan_Perlis'
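String joining works here because every link starts with `/wiki/`. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves a relative link against the page it came from, and it leaves links that are already absolute untouched. `BASE_URL` and `relative_links` are names chosen for this example.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Turing_Award"  # page the links came from

# Two relative links taken from the output above.
relative_links = ["/wiki/Alan_Perlis", "/wiki/Maurice_Wilkes"]

absolute_links = [urljoin(BASE_URL, link) for link in relative_links]
print(absolute_links)
```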

In [7]:
new_links = []

for link in links:
    new_link = ''.join(["https://en.wikipedia.org", link])
    new_links.append(new_link)
    print(new_link) # to check whether we get the right list

https://en.wikipedia.org/wiki/Alan_Perlis
https://en.wikipedia.org/wiki/Maurice_Wilkes
https://en.wikipedia.org/wiki/Richard_Hamming
https://en.wikipedia.org/wiki/Marvin_Minsky
https://en.wikipedia.org/wiki/James_H._Wilkinson
https://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)
https://en.wikipedia.org/wiki/Edsger_W._Dijkstra
https://en.wikipedia.org/wiki/Charles_Bachman
https://en.wikipedia.org/wiki/Donald_Knuth
https://en.wikipedia.org/wiki/Allen_Newell
https://en.wikipedia.org/wiki/Herbert_A._Simon
https://en.wikipedia.org/wiki/Michael_O._Rabin
https://en.wikipedia.org/wiki/Dana_Scott
https://en.wikipedia.org/wiki/John_Backus
https://en.wikipedia.org/wiki/Robert_W._Floyd
https://en.wikipedia.org/wiki/Kenneth_E._Iverson
https://en.wikipedia.org/wiki/Tony_Hoare
https://en.wikipedia.org/wiki/Edgar_F._Codd
https://en.wikipedia.org/wiki/Stephen_A._Cook
https://en.wikipedia.org/wiki/Ken_Thompson_(computer_programmer)
https://en.wikipedia.org/wiki/Dennis_M._Ritchie
https://en.w

### Crawl through the list

Let's extract the birth information from each prize winner's Wikipedia page.

In [8]:
try:
    test_page = urlopen(new_links[0])
except HTTPError as e:
    print(e)
except URLError as e:
    print("The server is broken")
else:
    print("The site is working")
    
test_soup = BeautifulSoup(test_page, "html.parser")
test_infobox = test_soup.find("table",{"class":"vcard"}) # the infobox table

# Find the "Born" label, then take the text of the element that follows it
test_infobox.find(text = re.compile("Born")).find_next().text

The site is working


'(1922-04-01)April 1, 1922Pittsburgh, Pennsylvania, U.S.'
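The scraped text runs the date and the birthplace together, so some cleaning is needed before analysis. A minimal sketch, assuming the machine-readable date always appears in parentheses as `(YYYY-MM-DD)` as it does in the output above; `parse_birth_date` is a hypothetical helper, not part of the original notebook.

```python
import re
from datetime import date

# Sample taken from the output above.
raw_birth = '(1922-04-01)April 1, 1922Pittsburgh, Pennsylvania, U.S.'


def parse_birth_date(text):
    """Return the ISO date found in parentheses as a `date`, or None if absent."""
    match = re.search(r'\((\d{4})-(\d{2})-(\d{2})\)', text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


birth_date = parse_birth_date(raw_birth)
print(birth_date)  # prints 1922-04-01
```

Returning `None` instead of raising an error lets a later loop skip pages whose infobox does not follow this pattern.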

Now we run the same code over every URL in the list.

In [9]:
birth_list = []

for url in new_links:
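    response = urlopen(url) # one request per winner's page
    page = BeautifulSoup(response, 'html.parser')
    # Same lookup as in the test above: the element that follows the "Born" label
    birth = page.find("table",{"class":"vcard"}).find(text = re.compile("Born")).find_next().text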
    birth_list.append(birth)
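The loop above sends its requests back to back and stops at the first error. One way to stay within the boundaries mentioned earlier is to pause between requests and record failures instead of crashing. The sketch below is an assumption-laden illustration: `crawl_politely` and `fake_fetch` are hypothetical names, the one-second default delay is arbitrary, and the `example.org` URLs are placeholders so the code runs offline.

```python
import time
from urllib.error import HTTPError, URLError


def crawl_politely(urls, fetch, delay_seconds=1.0):
    """Call `fetch(url)` for each URL, pausing between requests.

    Returns (results, failures): results maps each URL to whatever `fetch`
    returned; failures maps each URL that raised an error to that error.
    """
    results, failures = {}, {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause so we do not flood the server
        try:
            results[url] = fetch(url)
        except (HTTPError, URLError) as error:
            failures[url] = error  # record the problem and keep going
    return results, failures


# Stand-in fetch function so the sketch runs offline; in the notebook it would
# wrap urlopen and the BeautifulSoup lookup from the cell above.
def fake_fetch(url):
    if url.endswith("/missing"):
        raise URLError("page could not be reached")
    return "birth info for " + url


results, failures = crawl_politely(
    ["https://example.org/a", "https://example.org/missing"],
    fake_fetch,
    delay_seconds=0.01,
)
print(sorted(results), sorted(failures))
```

Passing the fetch step in as a function keeps the pacing and error handling separate from the page-specific parsing, so the same loop can be reused for other lists of URLs.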

In [10]:
birth_list[0:20]

['(1922-04-01)April 1, 1922Pittsburgh, Pennsylvania, U.S.',
 'John Maurice Vincent Wilkes(1913-06-26)26 June 1913Dudley, Worcestershire, England',
 '(1915-02-11)February 11, 1915Chicago, Illinois, U.S.',
 'Marvin Lee Minsky(1927-08-09)August 9, 1927New York City, New York, U.S.',
 'James Hardy Wilkinson(1919-09-27)27 September 1919Strood, England',
 '(1927-09-04)September 4, 1927Boston, Massachusetts, U.S.',
 '(1930-05-11)11 May 1930Rotterdam, Netherlands',
 'Charles William Bachman III(1924-12-11)December 11, 1924Manhattan, Kansas',
 'Donald Ervin Knuth (1938-01-10) January 10, 1938 (age\xa080)Milwaukee, Wisconsin, U.S.',
 '(1927-03-19)March 19, 1927San Francisco',
 'Herbert Alexander Simon(1916-06-15)June 15, 1916Milwaukee, Wisconsin',
 ' (1931-09-01) September 1, 1931 (age\xa087)Breslau, Germany',
 ' (1932-10-11) October 11, 1932 (age\xa086)Berkeley, California',
 'John Warner Backus(1924-12-03)December 3, 1924Philadelphia, Pennsylvania',
 '(1936-06-08)June 8, 1936New York City, New