# Chapter 4: Encoding and Annotation Schemes
## Building a scraper

Programs from the book: [_Python for Natural Language Processing_](https://link.springer.com/book/9783031575488)

__Author__: Pierre Nugues

#### Using `requests`

In [1]:
import requests

url_en = 'https://en.wikipedia.org/wiki/Aristotle'
url_fr = 'https://fr.wikipedia.org/wiki/Aristote'

In [2]:
headers = {
    'User-Agent': 'PNLP/1.0 (pierre.nugues@cs.lth.se)'
}

In [3]:
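response = requests.get(url_en, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
html_doc = response.text
print(html_doc[:2000])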

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Aristotle - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-c

## Parsing HTML and a Wikipedia page

#### We import the modules

In [4]:
import bs4
from urllib.parse import urljoin

#### We load a page and parse it

In [5]:
url_en = 'https://en.wikipedia.org/wiki/Aristotle'
html_doc = requests.get(url_en, headers=headers).text
parse_tree = bs4.BeautifulSoup(html_doc, 'html.parser')

#### We extract elements

In [6]:
parse_tree.title
# <title>Aristotle - Wikipedia</title>

<title>Aristotle - Wikipedia</title>

In [7]:
parse_tree.title.text
# Aristotle - Wikipedia

'Aristotle - Wikipedia'

#### We extract the h1 header

In [9]:
parse_tree.h1.text
# Aristotle

'Aristotle'

#### We extract all the headers h2

In [10]:
headings = parse_tree.find_all('h2')
[heading.text for heading in headings]
# ['Contents', 'Life', 'Theoretical philosophy', 'Natural philosophy', 'Practical philosophy', 'Legacy', 'Depictions in art', 'Eponyms', 'See also', 'Notes', 'References', 'Further reading', 'External links']

['Contents',
 'Life',
 'Theoretical philosophy',
 'Natural philosophy',
 'Practical philosophy',
 'Legacy',
 'Depictions in art',
 'Eponyms',
 'See also',
 'Notes',
 'References',
 'Further reading',
 'External links']
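
To see what `find_all('h2')` does behind the scenes, here is a minimal sketch that collects the `h2` texts with the standard library's `html.parser` instead of BeautifulSoup. The class name `HeadingCollector` and the string `sample_html` are made up for this illustration; the sketch assumes the target elements are not nested inside each other.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every element with a given tag name (hypothetical helper)."""

    def __init__(self, tag='h2'):
        super().__init__()
        self.tag = tag
        self.inside = False   # True while we are between <tag> and </tag>
        self.parts = []       # text fragments of the current element
        self.headings = []    # finished heading texts

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.inside:
            self.headings.append(''.join(self.parts).strip())
            self.inside = False

    def handle_data(self, data):
        # Text inside child elements such as <span> is collected too
        if self.inside:
            self.parts.append(data)


# A small made-up document, not the real Wikipedia page
sample_html = ('<h1>Aristotle</h1><h2>Life</h2><p>Text</p>'
               '<h2><span>Legacy</span></h2>')

collector = HeadingCollector('h2')
collector.feed(sample_html)
collector.close()
headings = collector.headings
print(headings)
```

BeautifulSoup remains the more convenient choice for real pages; the sketch only shows the event-driven idea underneath.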

#### We extract the links

In [11]:
links = parse_tree.find_all('a', href=True)
links[:20]

[<a class="mw-jump-link" href="#bodyContent">Jump to content</a>,
 <a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>,
 <a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>,
 <a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>,
 <a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>,
 <a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>,
 <a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>,
 <a href="/wiki/Help:Contents" title="Guidance on how to use and edit Wikipedia"><span>Help</span></a>,
 <a href="/wiki/Help:Introduction" title="Learn how to edit Wikipedia"><span>Learn to edit</span></a>,
 <a href="/wiki/Wikipedia:Community_portal" title="The

#### The labels

In [12]:
[link.text for link in links][:15]

['Jump to content',
 'Main page',
 'Contents',
 'Current events',
 'Random article',
 'About Wikipedia',
 'Contact us',
 'Help',
 'Learn to edit',
 'Community portal',
 'Recent changes',
 'Upload file',
 'Special pages',
 '\n\n\n\n\n\n',
 '\nSearch\n']
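
Some labels above consist only of newlines or are padded with them. A possible cleanup, sketched here on a few of the strings from the output, normalizes the whitespace and drops the labels that end up empty. The function name `clean_label` is an assumption of this example. With BeautifulSoup, `link.get_text(strip=True)` would strip the leading and trailing whitespace directly.

```python
# A few labels copied from the output above
labels = ['Jump to content', 'Special pages', '\n\n\n\n\n\n', '\nSearch\n']


def clean_label(label):
    """Collapse runs of whitespace and trim both ends (hypothetical helper)."""
    return ' '.join(label.split())


cleaned = [clean_label(label) for label in labels]
# Keep only the labels that still contain text
non_empty = [label for label in cleaned if label]
print(non_empty)
```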

#### The links

In [13]:
[link.get('href') for link in links][:35]

['#bodyContent',
 '/wiki/Main_Page',
 '/wiki/Wikipedia:Contents',
 '/wiki/Portal:Current_events',
 '/wiki/Special:Random',
 '/wiki/Wikipedia:About',
 '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
 '/wiki/Help:Contents',
 '/wiki/Help:Introduction',
 '/wiki/Wikipedia:Community_portal',
 '/wiki/Special:RecentChanges',
 '/wiki/Wikipedia:File_upload_wizard',
 '/wiki/Special:SpecialPages',
 '/wiki/Main_Page',
 '/wiki/Special:Search',
 'https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en',
 '/w/index.php?title=Special:CreateAccount&returnto=Aristotle',
 '/w/index.php?title=Special:UserLogin&returnto=Aristotle',
 'https://donate.wikimedia.org/?wmf_source=donate&wmf_medium=sidebar&wmf_campaign=en.wikipedia.org&uselang=en',
 '/w/index.php?title=Special:CreateAccount&returnto=Aristotle',
 '/w/index.php?title=Special:UserLogin&returnto=Aristotle',
 '/wiki/Help:Introduction',
 '/wiki/Special:MyContributions',
 '/wiki/Special:MyTalk',
 '#',
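
The `href` values come in several forms: fragments, paths relative to the site, protocol-relative addresses, and full URLs. The sketch below sorts them with `urlsplit` from the standard library. The function `classify_href` and its labels are assumptions made for this illustration, not part of the original program.

```python
from urllib.parse import urlsplit


def classify_href(href):
    """Return a coarse label for an href value (labels are illustrative)."""
    parts = urlsplit(href)
    if parts.netloc:
        # A host is present: either a full URL or a protocol-relative one
        return 'absolute' if parts.scheme else 'protocol-relative'
    if parts.path:
        return 'site-relative'
    if parts.fragment or href == '#':
        return 'fragment'
    return 'other'


# Values copied from the output above
sample_hrefs = ['#bodyContent',
                '/wiki/Main_Page',
                '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
                '/w/index.php?title=Special:CreateAccount&returnto=Aristotle',
                '#']
kinds = [classify_href(href) for href in sample_hrefs]
print(kinds)
```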
 

#### The absolute addresses

In [14]:
# Every link was selected with href=True, so link['href'] is always present
out = [urljoin(url_en, link['href']) for link in links]
out[:15]

['https://en.wikipedia.org/wiki/Aristotle#bodyContent',
 'https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Wikipedia:Contents',
 'https://en.wikipedia.org/wiki/Portal:Current_events',
 'https://en.wikipedia.org/wiki/Special:Random',
 'https://en.wikipedia.org/wiki/Wikipedia:About',
 'https://en.wikipedia.org/wiki/Wikipedia:Contact_us',
 'https://en.wikipedia.org/wiki/Help:Contents',
 'https://en.wikipedia.org/wiki/Help:Introduction',
 'https://en.wikipedia.org/wiki/Wikipedia:Community_portal',
 'https://en.wikipedia.org/wiki/Special:RecentChanges',
 'https://en.wikipedia.org/wiki/Wikipedia:File_upload_wizard',
 'https://en.wikipedia.org/wiki/Special:SpecialPages',
 'https://en.wikipedia.org/wiki/Main_Page',
 'https://en.wikipedia.org/wiki/Special:Search']
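
The list above contains duplicates (`Main_Page` appears twice) and an address that differs from the page itself only by its fragment. As a possible follow-up, this sketch resolves each `href` with `urljoin`, removes the fragment with `urldefrag`, and keeps the first occurrence of each address. The function name `absolute_urls` is an assumption of this example, and the donation URL is shortened for readability.

```python
from urllib.parse import urljoin, urldefrag


def absolute_urls(base, hrefs):
    """Resolve hrefs against base, drop fragments, and deduplicate in order."""
    seen = set()
    result = []
    for href in hrefs:
        url, _fragment = urldefrag(urljoin(base, href))
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


base = 'https://en.wikipedia.org/wiki/Aristotle'
sample_hrefs = ['#bodyContent',
                '/wiki/Main_Page',
                '//en.wikipedia.org/wiki/Wikipedia:Contact_us',
                '/wiki/Main_Page',
                'https://donate.wikimedia.org/?wmf_source=donate']
unique_urls = absolute_urls(base, sample_hrefs)
print(unique_urls)
```

Keeping the original order matters when we want to visit the pages in the order they appear in the document, which a plain `set` would not guarantee.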