<div class="frontmatter text-center">
<h1>Introduction to Data Science and Programming</h1>
<h2>Exercise 10: Python Crash Course - Web scraping</h2>
<h3>IT University of Copenhagen, Fall 2023</h3>
</div>

* Task 1: An endless Wikipedia knowledge loop
* Task 2: Michael's links

> Note: To import the Python package `bs4` (beautifulsoup4) in the cell below, `bs4` must be installed on your computer. We learned how to do this in Exercise09; see the Exercise09 notebook for detailed instructions.

In [None]:
# import packages we need
import requests
import bs4 # this is beautifulsoup4
import random
random.seed(42) # change to your personalized seed if you want

# Task 1: Wikipedia link to link

In this task, we will write a function that will play a version of the [Wikipedia Game](https://www.thewikipediagame.com) for us - randomly jumping from page to page within Wikipedia. Write a function that:
1. takes the url of a Wikipedia article, and N (number of steps), as input
2. prints out the title of that Wikipedia article
3. scrapes that Wikipedia article for all its `/wiki/` links (i.e. for all links to other Wikipedia articles)
4. randomly chooses one of those `/wiki/` links, and repeats steps 1 & 2 & 3 with it
5. this jumping from page to page repeats N times

The output could look something like this (with `random.seed(29)`, starting at the Wikipedia page [https://en.wikipedia.org/wiki/Fibonacci_sequence](https://en.wikipedia.org/wiki/Fibonacci_sequence)):

<p style="text-align:left;">
    <img src="images/wikigame.png" alt="wiki game" width=500px>
</p>
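
The link-filtering part of steps 3 and 4 can be tried out without any network access. The following is only a minimal sketch under stated assumptions: `sample_hrefs` is a made-up list standing in for the `href` values a scraped article might contain, and `keep_article_links` is a hypothetical helper name. It uses `startswith("/wiki/")`, which is slightly stricter than checking whether the link merely contains `/wiki/`.

```python
import random

# Made-up href values, standing in for what a scraped article might contain.
sample_hrefs = [
    "/wiki/Golden_ratio",
    "/wiki/Help:Contents",        # internal help page, contains ":"
    "#cite_note-1",               # in-page anchor, not an article
    "https://example.org/paper",  # external site
    "/wiki/Pascal%27s_triangle",
]

def keep_article_links(hrefs):
    """Keep hrefs that look like links to other Wikipedia articles."""
    return [h for h in hrefs if h.startswith("/wiki/") and ":" not in h]

article_links = keep_article_links(sample_hrefs)
print(article_links)

# Pick one link at random and turn it into a full URL by concatenation.
random.seed(29)
next_url = "https://en.wikipedia.org" + random.choice(article_links)
print(next_url)
```

Only the two article links survive the filter, so whichever one is chosen, `next_url` is a full Wikipedia URL that `requests` could fetch in the next step.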

In [None]:
# hint 1: to find the TITLE of the wikipedia article, use 
# the following method on your "soup":
# .find(id="firstHeading")
# and then the .text attribute


In [None]:
# hint 2: to find only the wiki article content,
# (excluding e.g. the links on the left part of the page)
# use the following method on your soup:
# .find(id="bodyContent")

In [None]:
# hint 3: to check whether a link is a wikipedia link or not,
# check whether it contains the string "/wiki/"

In [None]:
# hint 4: to create a full link, concatenate the wikipedia link like so:
# "https://en.wikipedia.org" + wikilink

In [None]:
# function to scrape one wikipedia article for all the "internal"
# (wiki) links in the BODY of the article
def scrape_wiki_article(wikiurl):
    '''
    takes a wikipedia url as input and scrapes it;
    looks for other ("internal") wikipedia links in the BODY of the wikipedia article;
    and returns the title of the wikipedia article (str); and a list of all wikipedia links found
    '''
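    # make soup (naming the parser explicitly avoids a bs4 warning
    # and gives the same result on every machine)
    response = requests.get(wikiurl)
    my_text = response.content
    soup = bs4.BeautifulSoup(my_text, "html.parser")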

    # find out the title of the article
    article_title = soup.find(id="firstHeading").text

    # get only the BODY of the article
    article_body = soup.find(id="bodyContent")
    
    # find links in article body
    article_links = article_body.find_all("a")
    
    # extract the links
    article_links = [l.get("href") for l in article_links if l.get("href")]

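    # keep only those that start with "/wiki/" (a stricter version of hint 3,
    # so that prepending "https://en.wikipedia.org" later always gives a valid url)
    article_links = [l for l in article_links if l.startswith("/wiki/")]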

    # exclude those that contain ":" (these are links to wikipedia portals/help/categories...)
    article_links = [l for l in article_links if ":" not in l]

    return article_title, article_links

# function to choose a random internal wiki link, and make it "external" (by adding the full prefix):
def make_random_wikilink(my_list, prepend="https://en.wikipedia.org"):
    '''
    takes a list of internal (wiki) links as input;
    randomly chooses one of them;
    and prepends (by default) "https://en.wikipedia.org" (to convert it
    into a scrapable format understood by requests)
    '''
    return prepend + random.choice(my_list)

In [None]:
# define where to start
my_article = "https://en.wikipedia.org/wiki/Coolio"

# define how many times to jump from page to page
n_jumps = 20

# jump from page to page n_jumps times
for _ in range(n_jumps):
    
    # scrape the current article; get title and its links
    my_title, my_links = scrape_wiki_article(my_article)
    
    # print out the title
    print(my_title)

    # make the new my_article link to scrape at next iteration
    my_article = make_random_wikilink(my_links)

# Task 2: Links to Michael from Michael's links

Find out **how many of the websites linked on Michael's homepage ([http://michael.szell.net](http://michael.szell.net)) link back to his homepage**. For example, the very first link on his website is [https://en.itu.dk](https://en.itu.dk). You need to check whether the website [https://en.itu.dk](https://en.itu.dk) contains the link [http://michael.szell.net](http://michael.szell.net); if it does, the count of linking-back websites increases by 1.

For this task, you need to
1. get the content of http://michael.szell.net using `requests`
2. search the content for all links to external websites using `beautifulsoup4`
3. for each of the links from step 2, 
    * get the content with `requests`
    * search the content for all links with `beautifulsoup4`
    * check whether any of the links contains `michael.szell.net`
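
The counting logic of step 3 can be checked on made-up data before any website is scraped. This is only a sketch under stated assumptions: `scraped_hrefs` is an invented dictionary standing in for the `href` values that scraping each linked page might return, and `links_back` is a hypothetical helper name.

```python
# Made-up data: each key is a page linked from the homepage, each value is
# the list of href values that scraping that page might return.
scraped_hrefs = {
    "https://a.example": ["https://a.example/about", "http://michael.szell.net/"],
    "https://b.example": ["https://b.example/team", "mailto:info@b.example"],
    "https://c.example/people": ["http://michael.szell.net/#research"],
}

target = "michael.szell.net"

def links_back(hrefs, target):
    """Return True if at least one href contains the target string."""
    return any(target in h for h in hrefs)

# Keep the pages that have at least one link containing the target.
linking_back = [page for page, hrefs in scraped_hrefs.items()
                if links_back(hrefs, target)]
print(linking_back)
print(len(linking_back))
```

In the real task, the lists of `href` values come from `requests` and `beautifulsoup4` rather than from a hand-written dictionary; the check itself stays the same.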

> Note: When creating your list of links on Michael's website (step 2), remove `senseable.mit.edu` and `lab.moovel.com` from the list to avoid connection timeout errors.

Extra challenge: "What does this have to do with Google?" >> Read up here: [PageRank algorithm](https://en.wikipedia.org/wiki/PageRank)

In [None]:
# get the data with the "requests" module
response = requests.get("http://michael.szell.net")
# the html code is in the attribute .content
my_text = response.content
# make your "soup" (use bs4 to read the html code;
# naming the parser explicitly avoids a bs4 warning)
soup_michael = bs4.BeautifulSoup(my_text, "html.parser")
# now we have the html code in the "soup" variable,
# this "soup" variable can be easily searched with bs4 functions.
print(soup_michael)

In [None]:
# find all the html elements that contain links on Michael's website
all_links = soup_michael.find_all("a")
all_links

In [None]:
# extract from all the links only the hyperlinks with .get("href")
all_hyperlinks = [l.get("href") for l in all_links if l.get("href")]

In [None]:
# keep only the external links to websites: the ones that start with "http" and don't contain "michael.szell"
all_external_links = [link for link in all_hyperlinks if "michael.szell" not in link and link.startswith("http")]
all_external_links

In [None]:
# remove lab.moovel.com and senseable.mit.edu from the list (to avoid connection timeout)
all_external_links = [l for l in all_external_links if "lab.moovel.com" not in l]
all_external_links = [l for l in all_external_links if "senseable.mit.edu" not in l]
all_external_links


**Now we need to webscrape each of the websites in `all_external_links`; find the links on each of those websites; and find out whether any of them contain the string `michael.szell.net`**

In [None]:
# function that, given a website, scrapes it for its hyperlinks;
# and returns a list of only those hyperlinks that contain a specified string.
# we will call this function on each of the links on Michael's website.
def find_link_on_page(my_page, my_string):
    '''
    takes a website (my_page; url) and a string (my_string) as input;
    webscrapes my_page and checks whether any of its links contain my_string;
    returns the list of links on my_page that contain my_string 
    '''
    # get the contents of my_page and make a "soup"
    my_page_response = requests.get(my_page)
    my_page_text = my_page_response.text
    soup = bs4.BeautifulSoup(my_page_text, "html.parser")

    # search the contents of the page for links
    # store all links on the page in the variable all_links
    all_links = soup.find_all("a")

    # extract only the hyperlinks ("href" in html code)
    links_href = [l.get("href") for l in all_links if l.get("href")]

    # make a list of only those hyperlinks that contain my_string
    links_found = [l for l in links_href if my_string in l]

    return links_found

In [None]:
# initiate a list
websites_linking_to_michael = []

# loop through all the external links
for l in all_external_links:
    # at each step, check whether the linked website contains "michael.szell.net" in its links
    links_found = find_link_on_page(l, "michael.szell.net")
    # if so (i.e. if the returned list is not empty),
    # add the current link to our list:
    if links_found:
        websites_linking_to_michael.append(l)
        # and print it out
        print(l)

In [None]:
# How many links link back to Michael? Count only the unique ones:
unique_links = set(websites_linking_to_michael)
print(len(unique_links))
unique_links
# 6! (or fewer, if you don't count pages/subpages as separate websites)
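
Whether pages and subpages count as separate websites changes the final number. One possible way to count websites rather than pages is to compare only the domain part of each link. The sketch below assumes the standard library's `urllib.parse.urlparse` is acceptable for this, and `example_links` is a made-up list standing in for `websites_linking_to_michael`.

```python
from urllib.parse import urlparse

# Made-up list standing in for websites_linking_to_michael.
example_links = [
    "https://a.example/",
    "https://a.example/people/michael",
    "http://b.example/projects",
    "https://a.example/",
]

# Unique pages: identical urls are counted once.
unique_pages = set(example_links)

# Unique websites: urls on the same domain are counted once.
unique_sites = {urlparse(link).netloc for link in example_links}

print(len(unique_pages))
print(len(unique_sites))
```

Here the four links reduce to three unique pages but only two unique domains, which is the kind of difference the comment above refers to.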