# Web Crawler

This is a basic web crawler for Wikipedia articles. Starting from a Wikipedia page chosen by the user, the crawler follows the first link it finds in the article body and repeats this on each new page. It stops when it reaches a target URL (also chosen by the user), returns to a page it has already visited, or exceeds a maximum number of steps. It prints every link it visits along the way.

## Basic Overview of Our Code

In [69]:
# page = a random starting page
# article_chain = []
# while title of page isn't 'Philosophy', we have not discovered a cycle, and we have not taken too many steps:
#     append page to article_chain
#     download the page content
#     find the first link in the content
#     page = that link
#     pause for a couple of seconds
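
The loop above can be tried without any network access. The following is a minimal sketch, assuming a toy link graph in which each page maps to its first link; the `crawl` function, the `toy_links` dictionary, and the reason strings are hypothetical names made up for this illustration and are not part of the crawler below.

```python
def crawl(first_link_of, start, target, max_steps=25):
    """Follow first links from `start` and return (chain, reason for stopping)."""
    chain = [start]
    while True:
        page = chain[-1]
        if page == target:
            return chain, "found"
        if len(chain) > max_steps:
            return chain, "too long"
        if page in chain[:-1]:
            return chain, "cycle"
        next_page = first_link_of.get(page)
        if next_page is None:
            # The page has no outgoing link in the toy graph.
            return chain, "dead end"
        chain.append(next_page)


# Hypothetical toy graph: each key is a page, each value is its first link.
toy_links = {
    "A": "B",
    "B": "C",
    "C": "Philosophy",
    "X": "Y",
    "Y": "X",
}

print(crawl(toy_links, "A", "Philosophy"))  # reaches the target
print(crawl(toy_links, "X", "Philosophy"))  # X and Y link to each other
```

Starting from `"A"` the chain reaches the target, while starting from `"X"` it stops as soon as `"X"` shows up a second time.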

## Code

In [70]:
from bs4 import BeautifulSoup # https://www.crummy.com/software/BeautifulSoup/bs4/doc/
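import urllib.parse  # import the submodule explicitly; plain "import urllib" does not guarantee urllib.parse is loaded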
import time
import requests

def continue_crawl(search_history, target_url, max_steps=25):
    """Return True if the crawl should go on, False if it should stop."""
    if search_history[-1] == target_url:
        # The most recent page is the target.
        print("\nWe've found the target article!")
        print(search_history[-1])
        return False
    elif len(search_history) > max_steps:
        # Safety limit so the crawl cannot run indefinitely.
        print("\nThe search has gone on suspiciously long, aborting search!")
        print(search_history[-1])
        return False
    elif search_history[-1] in search_history[:-1]:
        # The most recent page was visited before, so we are in a cycle.
        print("\nWe've arrived at an article we've already seen, aborting search!")
        print(search_history[-1])
        return False
    else:
        return True
    
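def find_first_link(url):
    """Return the absolute URL of the first link in the article body, or None."""
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    # The article body sits in this nested div on Wikipedia pages.
    content_div = soup.find(id="mw-content-text").find(class_="mw-parser-output")
    article_link = None
    # Only look at top-level paragraphs and links that are their direct
    # children, which skips links nested inside other tags.
    for element in content_div.find_all("p", recursive=False):
        link = element.find("a", recursive=False)
        if link:
            article_link = link.get('href')
            break
    if not article_link:
        return None
    # The href is usually relative (e.g. /wiki/Spain), so make it absolute.
    first_link = urllib.parse.urljoin('https://en.wikipedia.org/', article_link)
    return first_link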
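
To see what `urljoin` does with the kinds of `href` values a page can contain, here is a small standalone sketch. The `is_article_url` filter is a hypothetical helper that the crawler above does not use, and the sample hrefs are made up. Treating any title that contains a colon as a namespace page (`File:`, `Help:`, and so on) is only a rough heuristic, since some real article titles contain colons too.

```python
from urllib.parse import urljoin, urlparse

BASE = "https://en.wikipedia.org/"


def is_article_url(url):
    """Hypothetical filter: True for en.wikipedia.org/wiki/<Title> pages
    whose title does not contain a namespace-style colon."""
    parts = urlparse(url)
    if parts.netloc != "en.wikipedia.org" or not parts.path.startswith("/wiki/"):
        return False
    title = parts.path[len("/wiki/"):]
    # Titles containing ':' are usually namespace pages such as File: or Help:.
    return bool(title) and ":" not in title


# Made-up sample hrefs: an article, a footnote anchor, a file page, an external site.
samples = ["/wiki/Spain", "#cite_note-1", "/wiki/File:Flag.svg", "https://example.org/page"]
for href in samples:
    absolute = urljoin(BASE, href)
    print(absolute, is_article_url(absolute))
```

Relative paths are resolved against the base, a bare fragment is attached to the base itself, and an absolute URL is returned unchanged, so only the first sample passes the filter.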

In [71]:
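article_chain = ["https://en.wikipedia.org/wiki/Special:Random"]
target_url = "https://en.wikipedia.org/wiki/Philosophy"
while continue_crawl(article_chain, target_url):
    print(article_chain[-1])
    first_link = find_first_link(article_chain[-1])
    if not first_link:
        # No usable link on this page, so the crawl cannot continue.
        print("\nWe've arrived at an article with no links, aborting search!")
        break
    article_chain.append(first_link)
    time.sleep(2)  # pause between requests so we don't overload Wikipedia's servers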

https://en.wikipedia.org/wiki/Special:Random
https://en.wikipedia.org/wiki/Murcia
https://en.wikipedia.org/wiki/Spain
https://en.wikipedia.org/wiki/Spanish_language
https://en.wikipedia.org/wiki/Romance_languages
https://en.wikipedia.org/wiki/Vulgar_Latin
https://en.wikipedia.org/wiki/Standard_language
https://en.wikipedia.org/wiki/Variety_(linguistics)
https://en.wikipedia.org/wiki/Sociolinguistics
https://en.wikipedia.org/wiki/Society
https://en.wikipedia.org/wiki/Social_group
https://en.wikipedia.org/wiki/Social_science
https://en.wikipedia.org/wiki/Branches_of_science
https://en.wikipedia.org/wiki/Science
https://en.wikipedia.org/wiki/Latin
https://en.wikipedia.org/wiki/Classical_language
https://en.wikipedia.org/wiki/Language
https://en.wikipedia.org/wiki/Grammar
https://en.wikipedia.org/wiki/Linguistics

We've arrived at an article we've already seen, aborting search!
https://en.wikipedia.org/wiki/Science
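
When a run ends in a cycle like this one, it can be useful to see which pages form the loop. The sketch below assumes the chain is a plain list whose last entry repeats an earlier one; `find_cycle` is a hypothetical helper that is not part of the crawler, and the sample list is the tail of the run above with the URL prefix dropped for brevity.

```python
def find_cycle(chain):
    """Return the looping part of a chain whose last entry repeats an earlier one."""
    last = chain[-1]
    first_seen = chain.index(last)  # position of the earliest occurrence
    if first_seen == len(chain) - 1:
        return []  # the last page is new, so there is no cycle
    # Everything from the first occurrence up to (not including) the repeat.
    return chain[first_seen:-1]


# Tail of the run shown above, titles only.
chain = ["Branches_of_science", "Science", "Latin", "Classical_language",
         "Language", "Grammar", "Linguistics", "Science"]
print(find_cycle(chain))
```

For this run the loop consists of six articles, starting and ending at `Science`.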
