# Mid-Term Exam

Internet access is not permitted during the exam. Work only in this notebook, and save it in PDF format at the end of the exam. Then submit the PDF file to the dedicated Moodle space. Remember to check that the kernel is connected (white circle at top right) and reconnect it if necessary.

## Part 1: Theoretical questions (30 points)
**Instructions:** Answer the following questions based on the concepts covered in the tutorials seen in class. Double-click the cell where you want to write your answer, then press Shift+Enter to validate it.

(5 points) What is the role of the BeautifulSoup library in web scraping, and how does it interact with urlopen?

**Your answer here:**

(5 points) What does parsing mean in web scraping?

**Your answer here:**

(5 points) What is the difference between ``find()`` and ``find_all()`` in BeautifulSoup? Provide an example of each.

**Your answer here:**

(5 points) Explain how regular expressions can be useful in web scraping. Provide a simple regex example used in one of the tutorials.

**Your answer here:**

(5 points) Discuss the ethical considerations of web scraping. When might scraping be considered inappropriate or illegal?

**Your answer here:**

(5 points) In a web scraping project, you want to collect data from multiple linked pages. Describe the strategy you would use to navigate and gather data from all the relevant pages without scraping unnecessary links (e.g., about pages, contact pages, etc.).

**Your answer here:**

## Part 2: Practical questions (70 points)
**Instructions:** Complete the following coding exercises. The code provided here refers only to Tutorials 1 and 2 seen in class.

### Extract Data from a Wikipedia Page (20 points)

Scrape the first paragraph of the "Kim Wilde" French Wikipedia page (/wiki/Kim_Wilde). Then, extract and print all the internal links (links starting with /wiki/) from the page using BeautifulSoup.

To help you, the three code cells below show how to:
- get help on a method or function
- view the rendering of a web page directly in the notebook
- print the raw HTML code of a page in the notebook

In [None]:
from bs4 import BeautifulSoup

help(BeautifulSoup.find_all)

In [None]:
from urllib.request import urlopen
from IPython.display import HTML

# Fetch the HTML content
url = 'https://fr.wikipedia.org/wiki/Kevin_Bacon'
html_content = urlopen(url).read().decode('utf-8')

# Display the HTML content in the notebook
HTML(html_content)


In [None]:
from urllib.request import urlopen

# Fetch the HTML content
url = 'http://www.pythonscraping.com/pages/warandpeace.html'
html_content = urlopen(url).read().decode('utf-8')

# Display the raw HTML code
print(html_content)


In [17]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

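html = urlopen('https://fr.wikipedia.org/wiki/Kim_Wilde')
bs = BeautifulSoup(html, 'html.parser')

# Extract the first non-empty paragraph. find() returns a single Tag, so
# bs.find('p')[6] looked up an attribute named 6 and raised KeyError: 6.
paragraph_texts = (p.get_text().strip() for p in bs.find_all('p'))
first_paragraph = next((text for text in paragraph_texts if text), '')
print('Find here the 1st paragraph:', first_paragraph)

# Extract internal links
internal_links = bs.find_all('a', href=re.compile('^(/wiki/)'))
for link in internal_links:
    print(link.attrs['href'])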

### Modify the Web Crawler (20 points)

Write a crawler that starts from the "Kim Wilde" Wikipedia page and retrieves all article links (/wiki/) up to 1 level deep. Make sure to avoid scraping non-article pages such as "Category" and "Talk" pages by adding the required code lines.
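As a rough illustration of how non-article pages can be filtered out, the sketch below applies the same regular expression as the crawler cell to a handful of hrefs. The hrefs are hypothetical, made up for this example; only the pattern comes from the exercise code.

```python
import re

# Same pattern as in the crawler cell: the href must start with /wiki/
# and must not contain a colon anywhere after that prefix.
ARTICLE_LINK = re.compile(r'^(/wiki/)((?!:).)*$')

# Hypothetical hrefs, invented for illustration only
sample_hrefs = [
    '/wiki/Kim_Wilde',
    '/wiki/Category:English_pop_singers',
    '/wiki/Talk:Kim_Wilde',
    '/w/index.php?title=Kim_Wilde&action=edit',
    'https://example.com/wiki/Kim_Wilde',
]

# Keep only the hrefs that look like article links
article_links = [href for href in sample_hrefs if ARTICLE_LINK.match(href)]
print(article_links)
```

Namespaced pages such as "Category:" and "Talk:" are rejected because of the colon, and links outside /wiki/ are rejected by the anchored prefix.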

In [5]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

def getLinks(articleUrl, pages, depth=2):
    if depth == 0:
        return
    html = urlopen(f'http://en.wikipedia.org{articleUrl}')
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find('div', {'id':'bodyContent'}).find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')):
        if 'href' in link.attrs and link.attrs['href'] not in pages:
            newPage = link.attrs['href']
            print(newPage)
            pages.add(newPage)
            getLinks(newPage, pages, depth-1)

pages = set()
getLinks('/wiki/Kevin_Bacon', pages)


/wiki/Kevin_Bacon_(disambiguation)
/wiki/Philadelphia
/wiki/Kevin_Bacon_filmography
/wiki/Kyra_Sedgwick
/wiki/Sosie_Bacon
/wiki/Edmund_Bacon_(architect)
/wiki/Michael_Bacon_(musician)
/wiki/Leading_man
/wiki/Character_actor
/wiki/Golden_Globe_Award
/wiki/Screen_Actors_Guild_Award
/wiki/Primetime_Emmy_Award
/wiki/National_Lampoon%27s_Animal_House
/wiki/Diner_(1982_film)
/wiki/Footloose_(1984_film)
/wiki/JFK_(film)
/wiki/A_Few_Good_Men
/wiki/Apollo_13_(film)
/wiki/Mystic_River_(film)
/wiki/Frost/Nixon_(film)
/wiki/Friday_the_13th_(1980_film)
/wiki/Tremors_(1990_film)
/wiki/The_River_Wild
/wiki/The_Woodsman_(2004_film)
/wiki/Crazy,_Stupid,_Love
/wiki/Patriots_Day_(film)
/wiki/Losing_Chase
/wiki/Loverboy_(2005_film)
/wiki/Golden_Globe_Award_for_Best_Actor_%E2%80%93_Miniseries_or_Television_Film
/wiki/Screen_Actors_Guild_Award_for_Outstanding_Performance_by_a_Male_Actor_in_a_Miniseries_or_Television_Movie
/wiki/Michael_Strobl
/wiki/HBO
/wiki/Taking_Chance
/wiki/Fox_Broadcasting_Company
/wik
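To make the meaning of the `depth` argument concrete, here is a minimal sketch of the same recursion running on an in-memory link graph instead of live pages. `LINK_GRAPH` and `get_links` are hypothetical stand-ins introduced for this example; no network access is involved.

```python
# Hypothetical link graph standing in for real pages: page -> links found on it
LINK_GRAPH = {
    '/wiki/A': ['/wiki/B', '/wiki/C'],
    '/wiki/B': ['/wiki/D'],
    '/wiki/C': ['/wiki/A'],
    '/wiki/D': [],
}

def get_links(page, pages, depth):
    """Collect links starting from `page`, opening at most `depth` levels of pages."""
    if depth == 0:
        return
    for link in LINK_GRAPH.get(page, []):
        if link not in pages:
            pages.add(link)
            # Each recursive call may open one level fewer
            get_links(link, pages, depth - 1)

one_level = set()
get_links('/wiki/A', one_level, depth=1)
print(sorted(one_level))

two_levels = set()
get_links('/wiki/A', two_levels, depth=2)
print(sorted(two_levels))
```

With `depth=1` only the start page is opened, so only its own links are collected. With `depth=2`, which is the default in the cell above, every linked page is opened as well.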

### Advanced Regular Expression Search (15 points)

Modify the code to find all paragraphs in the "War and Peace" page that contain the exact phrase "Prince". Print only the paragraphs containing the phrase.

In [38]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Open the URL and parse the HTML content
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, 'html.parser')

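# Find all paragraphs containing the exact word "Prince" (case-sensitive).
# string= only tests a tag's .string, which is None when the tag has several
# children, so test each tag's full text instead.
pattern = re.compile(r'\bPrince\b')
paragraphs = [tag for tag in bs.find_all(['p', 'div']) if pattern.search(tag.get_text())]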

# Print only the paragraphs containing the exact phrase
for paragraph in paragraphs:
    print(paragraph.get_text())
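The effect of word boundaries and case sensitivity can be checked without fetching the page. The sketch below uses sample sentences invented for illustration (they are not taken from the "War and Peace" page) and compares a case-sensitive whole-word pattern with a case-insensitive one.

```python
import re

# Sample sentences invented for illustration
samples = [
    'Yes, Prince, I agree.',
    'The prince said nothing.',
    'The Princess smiled.',
]

exact = re.compile(r'\bPrince\b')                    # whole word, case-sensitive
any_case = re.compile(r'\bprince\b', re.IGNORECASE)  # whole word, any case

exact_hits = [bool(exact.search(text)) for text in samples]
any_case_hits = [bool(any_case.search(text)) for text in samples]

for text, hit_exact, hit_any in zip(samples, exact_hits, any_case_hits):
    print(hit_exact, hit_any, text)
```

The `\b` boundaries keep "Princess" from matching in both cases, while the lowercase "prince" only matches when `re.IGNORECASE` is set.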


### Error Handling (15 points)

Write a Python function that attempts to open a web page. If the page doesn't exist or there is a URL error, the function should print an error message. Handle the errors gracefully.

In [20]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getPage(url):
    try:
        html = urlopen(url)
        bs = BeautifulSoup(html, 'html.parser')
        return bs
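    except HTTPError as e:
        # The server answered, but with an error status (e.g. 404 or 500)
        print(f"The server returned an HTTP error: {e.code}")
    except URLError as e:
        # The server could not be reached at all
        print(f"The server could not be found: {e.reason}")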
    return None

bs = getPage('http://www.pythonscraping.com/pages/warandpeace.html')
if bs:
    print(bs.h1.get_text())


War and Peace
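The two failure paths can be exercised without a network connection. In the sketch below, `describe_failure`, `missing_page`, and `unknown_host` are hypothetical helpers introduced for this example: the last two stand in for `urlopen` and simply raise the exceptions that the function above is meant to catch.

```python
from urllib.error import HTTPError, URLError

def describe_failure(open_page, url):
    """Call open_page(url) and return a short message instead of raising."""
    try:
        open_page(url)
    except HTTPError as e:
        # Must be listed before URLError: HTTPError is a subclass of URLError
        return f'HTTP error {e.code} for {url}'
    except URLError as e:
        return f'Could not reach {url}: {e.reason}'
    return 'ok'

# Hypothetical stand-ins for urlopen, so the sketch runs offline
def missing_page(url):
    raise HTTPError(url, 404, 'Not Found', None, None)

def unknown_host(url):
    raise URLError('host not found')

print(issubclass(HTTPError, URLError))
print(describe_failure(missing_page, 'http://example.com/missing'))
print(describe_failure(unknown_host, 'http://no-such-host.invalid/'))
```

Because `HTTPError` derives from `URLError`, swapping the two `except` clauses would send every HTTP error to the more general handler.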
