# Exercise Sheet \#3


## Exercise 1. Quotes: manual scraping

In this exercise, you will compile a dataset of biographies taken from http://quotes.toscrape.com.
Recall that this website displays 10 quotes per page, each with a link to its author's biography. The questions below guide you step by step.

#### Question 1.1: getting URLs of authors' pages

To get a list of URLs pointing to author pages, you will process the quote pages.

To do so, first complete the function `get_links` below, which expects as a parameter:

* `url` the URL of a page from quotes.toscrape.com

and returns:

* `authors` the list of links to author pages contained in the given quotes' page (beware of duplicates!)
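
As a complement, here is a minimal offline sketch of the "beware of duplicates" part, using the `re` module instead of BeautifulSoup so that it runs without network access. The helper name `extract_author_links` and the HTML snippet are hypothetical; the snippet only imitates the shape of a quote page, and the pattern assumes author links have the form `/author/<Name>`.

```python
import re
from urllib.parse import urljoin

# Hypothetical, simplified markup imitating one quote page (not fetched from the site)
SAMPLE_HTML = """
<div class="quote"><small class="author">Albert Einstein</small>
  <a href="/author/Albert-Einstein">(about)</a></div>
<div class="quote"><small class="author">Jane Austen</small>
  <a href="/author/Jane-Austen">(about)</a></div>
<div class="quote"><small class="author">Albert Einstein</small>
  <a href="/author/Albert-Einstein">(about)</a></div>
"""

def extract_author_links(html, base_url):
    """Return absolute author URLs found in html, without duplicates, in page order."""
    hrefs = re.findall(r'href="(/author/[^"]+)"', html)
    # dict.fromkeys keeps the first occurrence of each href and preserves order
    return [urljoin(base_url, href) for href in dict.fromkeys(hrefs)]

links = extract_author_links(SAMPLE_HTML, 'http://quotes.toscrape.com')
print(links)
```

Einstein appears twice in the snippet but only once in the result. A regular expression is fragile on real HTML, so a parser such as BeautifulSoup remains the safer choice for the actual exercise.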

In [14]:
import requests, re
from bs4 import BeautifulSoup

BASE_URL = 'http://quotes.toscrape.com'

def get_links(url):
    # NOTE: this solution merges questions 1.1 and 1.2. It ignores `url`,
    # walks the 10 quote pages itself, and returns a dict {author name: author URL}
    # instead of a list (the dict keys take care of duplicates).
    authors = {}
    ua = {'User-agent': 'Mozilla/5.0'}
    for i in range(1, 11):
        page_url = f'{BASE_URL}/page/{i}/'
        # Get page located at page_url:
        page = requests.get(page_url, headers=ua)
        soup = BeautifulSoup(page.content, 'html.parser')

        # Get all links corresponding to authors:
        for quote in soup.find_all('div', class_='quote'):
            author = quote.find('small', class_='author').text.strip()
            # If an author is not in authors yet, add their link:
            if author not in authors:
                link_href = quote.find('a')['href']
                authors[author] = BASE_URL + link_href

    # Return results
    return authors

#Test:
authors = get_links(BASE_URL)
print(authors)

{'Albert Einstein': 'http://quotes.toscrape.com/author/Albert-Einstein', 'J.K. Rowling': 'http://quotes.toscrape.com/author/J-K-Rowling', 'Jane Austen': 'http://quotes.toscrape.com/author/Jane-Austen', 'Marilyn Monroe': 'http://quotes.toscrape.com/author/Marilyn-Monroe', 'André Gide': 'http://quotes.toscrape.com/author/Andre-Gide', 'Thomas A. Edison': 'http://quotes.toscrape.com/author/Thomas-A-Edison', 'Eleanor Roosevelt': 'http://quotes.toscrape.com/author/Eleanor-Roosevelt', 'Steve Martin': 'http://quotes.toscrape.com/author/Steve-Martin', 'Bob Marley': 'http://quotes.toscrape.com/author/Bob-Marley', 'Dr. Seuss': 'http://quotes.toscrape.com/author/Dr-Seuss', 'Douglas Adams': 'http://quotes.toscrape.com/author/Douglas-Adams', 'Elie Wiesel': 'http://quotes.toscrape.com/author/Elie-Wiesel', 'Friedrich Nietzsche': 'http://quotes.toscrape.com/author/Friedrich-Nietzsche', 'Mark Twain': 'http://quotes.toscrape.com/author/Mark-Twain', 'Allen Saunders': 'http://quotes.toscrape.com/author/All

#### Question 1.2: iterate over pages of quotes (already implemented in 1.1)

In a second step, fill in the `collect` function below, which iteratively collects author links. This function takes as input parameters:
- `url`: the starting URL from which to collect links,
- `authors`: the list of links to be updated,
- `limit`: the number of pages to visit (the default, `None`, means visiting all pages).
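
The recursion with a page budget can be sketched as follows. This is only an illustration under stated assumptions: `collect_sketch`, `fetch_page`, and `FAKE_SITE` are hypothetical names, and `fetch_page(url)` stands in for "download the page, extract its author links and the URL of the next page", returning `(links, next_url)` with `next_url` set to `None` on the last page.

```python
def collect_sketch(url, authors, limit=None, fetch_page=None):
    """Recursively add author links to `authors`, visiting at most `limit` pages."""
    links, next_url = fetch_page(url)
    for link in links:
        if link not in authors:
            authors.append(link)
    # Continue only if there is a next page and the page budget allows it
    if next_url is not None and (limit is None or limit > 1):
        remaining = None if limit is None else limit - 1
        collect_sketch(next_url, authors, remaining, fetch_page)

# Made-up three-page site standing in for real HTTP requests
FAKE_SITE = {
    '/page/1/': (['/author/A', '/author/B'], '/page/2/'),
    '/page/2/': (['/author/B', '/author/C'], '/page/3/'),
    '/page/3/': (['/author/D'], None),
}

first_two = []
collect_sketch('/page/1/', first_two, limit=2, fetch_page=FAKE_SITE.get)
print(first_two)   # ['/author/A', '/author/B', '/author/C']

everything = []
collect_sketch('/page/1/', everything, fetch_page=FAKE_SITE.get)
print(everything)  # ['/author/A', '/author/B', '/author/C', '/author/D']
```

Passing the page fetcher as a parameter keeps the traversal logic testable without touching the network.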

In [2]:
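def collect(url, authors, limit=None):
    # Add links contained in page located at url to the authors being computed.
    # get_links returns a dict {name: link}, so iterate over its values.
    for link in get_links(url).values():
        if link not in authors:
            authors.append(link)
    # Pagination (get the next page URL, recurse while limit allows) is not needed here:
    # get_links above already walks every quote page, so `limit` is currently unused.

# Test (a separate variable keeps the `authors` dict from 1.1 intact)
author_links = []
collect(BASE_URL, author_links, limit=2)
print(author_links)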

[]


#### Question 1.3: get actual biographies

For each of the links computed in the previous question, retrieve the corresponding webpage and extract the biography it contains. To do so, fill in the `get_biography` function below. Its results are gathered into a list of dictionaries of the following form:
```python
bios = [{'name': '...', 'birth_date': '...', 'birth_place': '...', 'bio': '...'}, ...]
```

In [24]:
def get_biography(url):
    # Get page located at URL and parse it
    ua = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=ua)
    soup = BeautifulSoup(page.content, 'html.parser')

    # Get name with BeautifulSoup
    details = soup.find('div', class_='author-details')
    name = details.find('h3', class_='author-title').get_text(strip=True)
    # Get birth date
    birth_date = details.find('span', class_='author-born-date').get_text(strip=True)
    # Get birth place (the page text includes a leading "in ")
    birth_place = details.find('span', class_='author-born-location').get_text(strip=True)
    # Get bio
    bio = soup.find('div', class_='author-description').get_text(strip=True)
    return {'name': name, 'birth_date': birth_date, 'birth_place': birth_place, 'bio': bio}

def get_bios(urls):
    bios = []
    for u in urls:
        bios.append(get_biography(u))
    return bios

#Test
bios=get_bios(list(authors.values())[0:6])
print(bios)

[{'name': 'Albert Einstein', 'birth_date': 'March 14, 1879', 'birth_place': 'in Ulm, Germany', 'bio': 'In 1879, Albert Einstein was born in Ulm, Germany. He completed his Ph.D. at the University of Zurich by 1909. His 1905 paper explaining the photoelectric effect, the basis of electronics, earned him the Nobel Prize in 1921. His first paper on Special Relativity Theory, also published in 1905, changed the world. After the rise of the Nazi party, Einstein made Princeton his permanent home, becoming a U.S. citizen in 1940. Einstein, a pacifist during World War I, stayed a firm proponent of social justice and responsibility. He chaired the Emergency Committee of Atomic Scientists, which organized to alert the public to the dangers of atomic warfare.At a symposium, he advised: "In their struggle for the ethical good, teachers of religion must have the stature to give up the doctrine of a personal God, that is, give up that source of fear and hope which in the past placed such vast power i
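
The output above shows that `birth_place` keeps the leading "in " from the page and that `birth_date` is free text. A possible post-processing step, sketched under the assumption that all dates follow the "March 14, 1879" pattern, is shown below; `normalize_bio` is a hypothetical helper, not part of the exercise.

```python
from datetime import datetime

def normalize_bio(record):
    """Return a copy of record with a cleaned birth place and an ISO birth date."""
    cleaned = dict(record)
    place = record['birth_place'].strip()
    # The site prefixes the location with "in "; drop it if present
    cleaned['birth_place'] = place[3:] if place.startswith('in ') else place
    # Dates look like "March 14, 1879"; %B assumes English month names
    born = datetime.strptime(record['birth_date'], '%B %d, %Y')
    cleaned['birth_date'] = born.date().isoformat()
    return cleaned

einstein = {'name': 'Albert Einstein', 'birth_date': 'March 14, 1879',
            'birth_place': 'in Ulm, Germany', 'bio': '...'}
normalized = normalize_bio(einstein)
print(normalized['birth_place'], normalized['birth_date'])
```

Working on a copy leaves the scraped record untouched, which makes it easy to compare raw and cleaned values.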

#### Question 1.4: save your dataset

Finally, write a `save` function which takes as input a list of biographies as computed above and saves them to disk in JSON format (the filename being an input parameter).
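
One way to check such a function is a save/load round trip. The sketch below is an illustration only: `save_json` and `load_json` are hypothetical names, the record is a placeholder rather than scraped data, and `ensure_ascii=False` is an optional choice that keeps accented names such as "André Gide" readable in the file.

```python
import json
import os
import tempfile

def save_json(filename, dataset):
    """Write dataset to filename as UTF-8 JSON, keeping accented characters readable."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(dataset, f, indent=2, ensure_ascii=False)

def load_json(filename):
    """Read back a dataset written by save_json."""
    with open(filename, encoding='utf-8') as f:
        return json.load(f)

# Placeholder record, not real scraped data
sample = [{'name': 'André Gide', 'birth_date': 'n/a', 'birth_place': 'n/a', 'bio': 'placeholder'}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bios.json')
    save_json(path, sample)
    reloaded = load_json(path)

print(reloaded == sample)
```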

In [25]:
import json

In [28]:
def save(filename, dataset):
    # Show the JSON that will be written (quick visual check)
    json_string = json.dumps(dataset, indent=2)
    print(json_string)
    # Open output file and write data in JSON format
    with open(filename, 'w') as f:
        json.dump(dataset, f, indent=2)

save('bios.json', bios)

[
  {
    "name": "Albert Einstein",
    "birth_date": "March 14, 1879",
    "birth_place": "in Ulm, Germany",
    "bio": "In 1879, Albert Einstein was born in Ulm, Germany. He completed his Ph.D. at the University of Zurich by 1909. His 1905 paper explaining the photoelectric effect, the basis of electronics, earned him the Nobel Prize in 1921. His first paper on Special Relativity Theory, also published in 1905, changed the world. After the rise of the Nazi party, Einstein made Princeton his permanent home, becoming a U.S. citizen in 1940. Einstein, a pacifist during World War I, stayed a firm proponent of social justice and responsibility. He chaired the Emergency Committee of Atomic Scientists, which organized to alert the public to the dangers of atomic warfare.At a symposium, he advised: \"In their struggle for the ethical good, teachers of religion must have the stature to give up the doctrine of a personal God, that is, give up that source of fear and hope which in the past pla

## Exercise 2. Let's use Scrapy now!

Here the goal is to play with Scrapy. Consider the Wikipedia article https://en.wikipedia.org/wiki/List_of_French_artists. We want to extract the names of all artists listed there, together with the links to their Wikipedia pages and the first paragraph of each page.

You will find a file called `Exercise_sheet_3_scrapy.py`. Can you fill in the gaps in this script?


In addition to the Scrapy documentation, I highly recommend looking at the available CSS selectors: https://www.w3schools.com/cssref/css_selectors.php
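
Selectors return `href` strings, and not all of them point to existing artist pages. Below is a small pure-Python heuristic for filtering them, assuming Wikipedia's usual URL shapes (`/wiki/<Title>` for articles, `redlink=1` in the query string for pages that do not exist yet). The function name `is_article_link` and the example hrefs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def is_article_link(href):
    """Heuristic: True for links that look like existing Wikipedia articles."""
    if not href:
        return False
    parsed = urlparse(href)
    if parse_qs(parsed.query).get('redlink') == ['1']:
        return False          # red link: the target page does not exist
    if not parsed.path.startswith('/wiki/'):
        return False          # anchors, external links, edit links, ...
    title = parsed.path[len('/wiki/'):]
    return ':' not in title   # skips namespaces such as Category: or Help:

candidates = [
    '/wiki/Claude_Monet',
    '/w/index.php?title=Some_Artist&action=edit&redlink=1',
    '/wiki/Category:French_artists',
    '#cite_note-1',
]
kept = [href for href in candidates if is_article_link(href)]
print(kept)
```

The colon test is deliberately crude: it also rejects the few article titles that contain a colon, which may or may not be acceptable for this list.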

In [29]:
%pip install scrapy

Defaulting to user installation because normal site-packages is not writeable
Collecting scrapy
  Downloading Scrapy-2.12.0-py2.py3-none-any.whl (311 kB)
Collecting lxml>=4.6.0
  Downloading lxml-5.3.0-cp310-cp310-manylinux_2_28_x86_64.whl (5.0 MB)
Collecting tldextract
  Downloading tldextract-5.1.3-py3-none-any.whl (104 kB)
Collecting PyDispatcher>=2.0.5
  Downloading PyDispatcher-2.0.7-py3-none-any.whl (12 kB)
Collecting cryptography>=37.0.0
  Downloading cryptography-44.0.0-cp39-abi3-manylinux_2_28_x86_64.whl (4.2 MB)

In [38]:
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class WikipediaSpider(scrapy.Spider): # The class must inherit from scrapy.Spider
    # Scrapy expects `name` and `start_urls` as class attributes
    name = "wikipedia"
    start_urls = ["https://en.wikipedia.org/wiki/List_of_French_artists"]

    def parse(self, response):
        # Collect every link found in a list item, then drop the "See also" entries
        list_els = response.css('ul > li > a::attr(href)').getall()
        list_see_also = response.css('h2[id="See_also"] + ul > li > a::attr(href)').getall()
        res_list = list(set(list_els) - set(list_see_also))
        for link in res_list:
            # Check that the link actually exists and is not red:
            # Wikipedia red links carry "redlink=1" in their href
            if 'redlink=1' not in link:
                yield response.follow(link, callback=self.parse_artist)
        
    def parse_artist(self, response):
        url = response.url  # URL of the page
        # Name of the artist: full text of the page title. `::text` directly on the h1
        # returns None when the title is wrapped in a child element, hence string(.).
        name = response.css('h1.firstHeading').xpath('string(.)').get(default='').strip()
        # First paragraph: first non-empty <p> of the page, including text inside links
        paragraph = ''
        for p in response.css('p'):
            text = p.xpath('string(.)').get(default='').strip()
            if text:
                paragraph = text
                break

        yield {'url': url,
               'name': name,
               'paragraph': paragraph}
        
        
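if __name__=='__main__':
    import scrapy.crawler

    process = scrapy.crawler.CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'FEEDS': {
            "artists.json": {"format": "json"},
        },
    })
    process.crawl(WikipediaSpider)
    # start() blocks until the crawl is finished, so no explicit stop() is needed.
    # The Twisted reactor cannot be restarted: in a notebook, re-running this cell
    # without restarting the kernel raises ReactorNotRestartable.
    process.start()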

2025-01-20 19:04:49 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2025-01-20 19:04:49 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.10.0, w3lib 2.2.1, Twisted 24.11.0, Python 3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0], pyOpenSSL 25.0.0 (OpenSSL 3.4.0 22 Oct 2024), cryptography 44.0.0, Platform Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
2025-01-20 19:04:49 [scrapy.addons] INFO: Enabled addons:
[]
2025-01-20 19:04:49 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2025-01-20 19:04:49 [scrapy.extensions.telnet] INFO: Telnet Password: fa1b8e23be628fe3
2025-01-20 19:04:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2025-01-20 19:04:49 [scrapy.crawler] IN

ReactorNotRestartable: 
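
`ReactorNotRestartable` appears because Twisted's reactor can be started only once per Python process, and a notebook kernel is a single long-lived process. One possible workaround, sketched here, is to launch the spider script in a child interpreter each time. `run_in_fresh_process` is a hypothetical helper, and a trivial stand-in command replaces the real script so that the sketch runs anywhere.

```python
import subprocess
import sys

def run_in_fresh_process(args):
    """Run a Python command line in a child interpreter and return its CompletedProcess."""
    return subprocess.run([sys.executable, *args], capture_output=True, text=True)

# For the spider, args would be something like ['Exercise_sheet_3_scrapy.py'];
# a trivial stand-in command is used here instead.
result = run_in_fresh_process(['-c', 'print("crawl finished")'])
print(result.returncode, result.stdout.strip())
```

Each call gets its own reactor, so the cell can be re-run without restarting the kernel. The alternative is simply to restart the kernel before every crawl.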

In [37]:
process.stop()

2025-01-20 19:04:34 [scrapy.core.engine] INFO: Closing spider (shutdown)
2025-01-20 19:04:34 [scrapy.extensions.feedexport] INFO: Stored json feed (0 items) in: artists.json
2025-01-20 19:04:34 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 50.886055,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2025, 1, 20, 16, 4, 34, 810483, tzinfo=datetime.timezone.utc),
 'items_per_minute': None,
 'log_count/DEBUG': 1,
 'log_count/INFO': 11,
 'memusage/max': 134254592,
 'memusage/startup': 134254592,
 'responses_per_minute': None,
 'start_time': datetime.datetime(2025, 1, 20, 16, 3, 43, 924428, tzinfo=datetime.timezone.utc)}
2025-01-20 19:04:34 [scrapy.core.engine] INFO: Spider closed (shutdown)


<DeferredList at 0x7fc008659ea0 current result: [(True, None)]>