# Tutorial Web Scraping 3: Advanced Web Crawling

In this tutorial, we explore more advanced web scraping techniques: crawling across multiple pages, collecting specific data from a site, and following links from one website to another across the internet. We will retrieve the links on a webpage, perform a random walk through Wikipedia articles, and systematically collect internal and external links.


## Learning objectives
By the end of this tutorial, you will be able to:

- Understand how to retrieve data from web pages systematically.
- Traverse through multiple pages of a website.
- Collect specific content such as internal and external links.
- Crawl across multiple websites and explore the internet through web scraping.

---

### Step 1: Retrieving all links from a page
We will start by retrieving all links from a webpage using BeautifulSoup.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup 

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])

**Explanation:**
We open the Wikipedia page for "Kevin Bacon".
We use `find_all('a')` to extract all the anchor (`<a>`) tags, which usually contain links. Because an anchor can lack an `href` attribute, we check for one before printing it.

## Exercise 1:
Modify the code to print only the first 10 links found on the page.

---

### Step 2: Retrieving only article links
Now, let's modify our scraper to retrieve only Wikipedia article links. We will use regular expressions to filter out non-article links.

In [None]:
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import re

html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find('div', {'id':'bodyContent'}).find_all(
    'a', href=re.compile('^(/wiki/)((?!:).)*$')):
    print(link.attrs['href'])

**Explanation:** The pattern passed to `re.compile` matches only hrefs that start with `/wiki/` and contain no colon (`:`). Colons mark non-article pages such as category or file pages. Restricting the search to the `bodyContent` div also skips links in the sidebar, header, and footer.
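
To see what the pattern accepts, it can be tried on a few sample values without fetching a page. This is a minimal sketch: the `hrefs` list is made up for illustration and is not taken from the live article.

```python
import re

# Same pattern as in the scraper above
article_link = re.compile('^(/wiki/)((?!:).)*$')

# Hypothetical href values, chosen only to illustrate the filter
hrefs = [
    '/wiki/Kevin_Bacon',
    '/wiki/Category:American_male_film_actors',
    '/wiki/File:Kevin_Bacon.jpg',
    '#cite_note-1',
    '//wikimediafoundation.org/',
    '/wiki/Footloose_(1984_film)',
]

# Keep only the values the pattern matches
articles = [h for h in hrefs if article_link.search(h)]
print(articles)
# ['/wiki/Kevin_Bacon', '/wiki/Footloose_(1984_film)']
```

Only the two article paths survive. The category page, the file page, the fragment, and the protocol-relative link are all rejected.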

## Exercise 2:

Modify the code to print only the first 5 article links and store them in a list.

---

### Step 3: Random walk on Wikipedia articles
We will now implement a random walk through Wikipedia articles. This means, starting from an article (e.g., Kevin Bacon), we will follow a random link to another Wikipedia article and repeat the process.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import random
import re

# Seed from the current time; strftime('%s') is not available on every platform
random.seed(datetime.datetime.now().timestamp())
def getLinks(articleUrl):
    html = urlopen(f'http://en.wikipedia.org{articleUrl}')
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)

**Explanation:** `random.randint` picks a random index into the list of article links, and the article at that index becomes the next page to visit.
The walk continues from one article to the next and stops only when it reaches a page with no article links.
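
The loop structure can be tried offline before pointing it at Wikipedia. The sketch below is an illustration under stated assumptions: `LINK_GRAPH` and `get_links` are hypothetical stand-ins for real pages and for `getLinks`, and the graph has no cycles so that every walk is guaranteed to reach the dead end.

```python
import random

# Hypothetical link graph standing in for Wikipedia: page -> article links on it.
# It has no cycles, so every walk must eventually reach '/wiki/D'.
LINK_GRAPH = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/C', '/wiki/D'],
    '/wiki/C': ['/wiki/D'],
    '/wiki/D': [],  # dead end: no article links
}

def get_links(article_url):
    # Stand-in for getLinks(): looks links up instead of downloading a page
    return LINK_GRAPH[article_url]

def random_walk(start):
    visited = [start]
    links = get_links(start)
    # Same stopping rule as the crawler: continue while the page has links
    while len(links) > 0:
        new_article = random.choice(links)
        visited.append(new_article)
        links = get_links(new_article)
    return visited

path = random_walk('/wiki/A')
print(path)
```

The route differs from run to run, but it always starts at `/wiki/A` and ends at `/wiki/D`, the only page without outgoing links. Here `random.choice(links)` does the same job as indexing with `random.randint(0, len(links)-1)`.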

## Exercise 3:

Set a limit to only walk through 5 Wikipedia articles and print the final article visited.

---

### Step 4: Recursively crawling an entire site
Now let's implement a recursive crawler that will traverse through all Wikipedia pages it encounters and collect all internal links.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    html = urlopen(f'http://en.wikipedia.org{pageUrl}')
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')

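**Explanation:** The `pages` set records every link already seen, so no page is crawled twice.
The function calls itself for each new link, so it works through every internal Wikipedia page it encounters. Each nested call adds a stack frame, and Python's default recursion limit is 1000, so on a site the size of Wikipedia this version will eventually stop with a `RecursionError`.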
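
A common way around the recursion limit is to keep the pages still to be visited in an explicit queue instead of on the call stack. The following is a minimal sketch of that idea, not a drop-in replacement: `SITE` is a hypothetical in-memory site map used in place of real requests, and `crawl` is an illustrative name.

```python
from collections import deque

# Hypothetical site map used instead of live requests: page -> internal links
SITE = {
    '/wiki/Main_Page': ['/wiki/A', '/wiki/B'],
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/Main_Page'],
    '/wiki/C': [],
}

def crawl(start):
    pages = {start}          # every page seen so far, as in the recursive version
    queue = deque([start])   # pages waiting to be visited
    order = []               # pages in the order they were visited
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in SITE.get(page, []):
            if link not in pages:
                # We have encountered a new page
                pages.add(link)
                queue.append(link)
    return order

order = crawl('/wiki/Main_Page')
print(order)
# ['/wiki/Main_Page', '/wiki/A', '/wiki/B', '/wiki/C']
```

Because the queue is first-in, first-out, pages are visited breadth-first, whereas the recursive version goes depth-first. In a real crawler the lookup in `SITE` would be replaced by fetching and parsing the page.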

## Exercise 4:

Limit the recursion to crawl only 3 pages deep and stop after reaching 20 unique pages.

---

### Step 5: Collecting data from a Wikipedia site
We will modify the crawler to not only traverse pages but also collect useful data from them such as the title, the first paragraph, and the edit link (if available).

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    html = urlopen(f'http://en.wikipedia.org{pageUrl}')
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        # bodyContent holds the article text; its first <p> is normally the opening paragraph
        bodyContent = bs.find('div', {'id':'bodyContent'}).find_all('p')
        if len(bodyContent):
            print(bodyContent[0])
        print(bs.find(id='ca-edit').find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something! Continuing.')
    
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('/wiki/General-purpose_programming_language') 

## Exercise 5:

Modify the function to also collect the last paragraph of the article, if available.

---

### Step 6: Crawling across the Internet
Let’s move beyond Wikipedia and build a crawler that can traverse external sites.

In [None]:
from urllib.request import urlopen
from urllib.parse import urldefrag, urljoin, urlparse
from bs4 import BeautifulSoup
import re
import datetime
import random

# Retrieves a list of all internal links found on a page
def getInternalLinks(bs, url):
    netloc = urlparse(url).netloc
    internalLinks = set()
    for link in bs.find_all('a'):
        href = link.attrs.get('href')
        if not href:
            continue
        # Resolve relative hrefs against the page URL and drop any #fragment
        absolute = urldefrag(urljoin(url, href)).url
        parsed = urlparse(absolute)
        # Skip mailto:, javascript:, and similar links; keep same-host pages only
        if parsed.scheme in ('http', 'https') and parsed.netloc == netloc:
            internalLinks.add(absolute)
    return list(internalLinks)
            
#Retrieves a list of all external links found on a page
def getExternalLinks(bs, url):
    netloc = urlparse(url).netloc
    externalLinks = set()
    for link in bs.find_all('a'):
        if not link.attrs.get('href'):
            continue
        parsed = urlparse(link.attrs['href'])
        if parsed.netloc != '' and parsed.netloc != netloc:
            externalLinks.add(link.attrs['href'])
    return list(externalLinks)

def getRandomExternalLink(startingPage):
    bs = BeautifulSoup(urlopen(startingPage), 'html.parser')
    externalLinks = getExternalLinks(bs, startingPage)
    if not len(externalLinks):
        print('No external links, looking around the site for one')
        internalLinks = getInternalLinks(bs, startingPage)
        return getRandomExternalLink(random.choice(internalLinks))
    else:
        return random.choice(externalLinks)
    
def followExternalOnly(startingSite):
    externalLink = getRandomExternalLink(startingSite)
    print(f'Random external link is: {externalLink}')
    followExternalOnly(externalLink)


followExternalOnly('https://www.oreilly.com/')
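
The internal/external split in `getInternalLinks` and `getExternalLinks` comes down to comparing each link's `netloc` with the `netloc` of the page it was found on. The sketch below shows that comparison in isolation. The helper `classify` and all sample hrefs are hypothetical and exist only for illustration; `urljoin` is used here to resolve relative links before comparing.

```python
from urllib.parse import urljoin, urlparse

def classify(page_url, href):
    """Return 'internal', 'external', or 'other' for an href found on page_url."""
    # Turn relative hrefs into absolute URLs before comparing hosts
    absolute = urljoin(page_url, href)
    parsed = urlparse(absolute)
    if parsed.scheme not in ('http', 'https'):
        return 'other'      # e.g. mailto: or javascript: links
    if parsed.netloc == urlparse(page_url).netloc:
        return 'internal'
    return 'external'

# Hypothetical page and hrefs
page = 'https://www.oreilly.com/products/'
samples = [
    '/about/',
    'pricing.html',
    'https://www.oreilly.com/',
    'https://example.com/page',
    'mailto:help@example.com',
]
for href in samples:
    print(href, '->', classify(page, href))
```

Note that the comparison is on the exact host name, so under this rule `oreilly.com` and `www.oreilly.com` count as different sites.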


## Conclusion

You now have the tools to build more advanced web crawlers that can traverse single domains or even multiple websites. This tutorial also demonstrated how to use regular expressions and Python's random library for web scraping and exploration.

Be mindful of the ethical and legal considerations discussed in earlier tutorials when building and deploying web crawlers.

## Exercise 6:
Build a crawler that logs every external link it finds from the homepage of a website and prints the total count of unique external links discovered.

In [None]:
# Collects a list of all external URLs found on the site
allExtLinks = []
allIntLinks = []


def getAllExternalLinks(url):
    bs = BeautifulSoup(urlopen(url), 'html.parser')
    internalLinks = getInternalLinks(bs, url)
    externalLinks = getExternalLinks(bs, url)
    for link in externalLinks:
        if link not in allExtLinks:
            allExtLinks.append(link)
            print(link)

    for link in internalLinks:
        if link not in allIntLinks:
            allIntLinks.append(link)
            getAllExternalLinks(link)


# Mark the starting URL as seen, using exactly the same URL that is crawled below
allIntLinks.append('https://www.oreilly.com/')
getAllExternalLinks('https://www.oreilly.com/')