<a href="https://colab.research.google.com/github/misupova/BSSDH-2022/blob/main/solutions/WebHarvesting_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# STEP 1 TASK

## Exercise

Find the table with the list of largest cities in the EU and construct a query to import it into a Google Sheet.

## Solution

For that, you first need to create an empty Google Sheet.

Then, find a table with the list of largest cities in the EU, for example: https://en.wikipedia.org/wiki/List_of_cities_in_the_European_Union_by_population_within_city_limits

Finally, construct a formula in Google Sheets to extract that table:
```
=IMPORTHTML("https://en.wikipedia.org/wiki/List_of_cities_in_the_European_Union_by_population_within_city_limits", "table", 1)
```

Note that the string arguments in this formula must be enclosed in straight double quotes; single quotes or typographic ("curly") quotes will cause an error.
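
Since the quoting is easy to get wrong when the formula is typed by hand, here is a minimal Python sketch that assembles the same formula string. The helper name `build_importhtml_formula` is hypothetical and only illustrates the quoting; Google Sheets itself needs nothing but the resulting formula.

```python
def build_importhtml_formula(url: str, query: str = "table", index: int = 1) -> str:
    """Return an IMPORTHTML formula with straight double quotes around string arguments."""
    if query not in ("table", "list"):
        raise ValueError('query must be "table" or "list"')
    # Inside a Sheets string literal, a double quote is escaped by doubling it
    safe_url = url.replace('"', '""')
    return f'=IMPORTHTML("{safe_url}", "{query}", {index})'


formula = build_importhtml_formula(
    "https://en.wikipedia.org/wiki/"
    "List_of_cities_in_the_European_Union_by_population_within_city_limits"
)
print(formula)
```

The printed line can be pasted directly into a cell of the empty sheet.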

# STEP 2 TASKS


## Exercise

Find out how many articles there are in the Spaceflight News API database.


## Solution

In this task I wanted you to explore the documentation of this API.

The documentation is available here: https://api.spaceflightnewsapi.net/v3/documentation

The API call I'm asking for is: https://api.spaceflightnewsapi.net/v3/articles/count
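
The same call can be made from Python. The sketch below uses only the standard library and assumes that the v3 endpoint is still reachable and that it returns a bare JSON number; the function names are hypothetical. The network call is left commented out so that the block runs offline.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://api.spaceflightnewsapi.net/v3"


def count_url(resource: str = "articles") -> str:
    """Build the URL of the count endpoint for a resource."""
    return f"{BASE_URL}/{resource}/count"


def parse_count(body: str) -> int:
    """Parse the response body, assumed here to be a bare JSON number."""
    return int(json.loads(body))


def fetch_article_count(timeout: float = 10.0) -> int:
    """Call the API over the network and return the number of articles."""
    with urlopen(count_url(), timeout=timeout) as resp:
        return parse_count(resp.read().decode("utf-8"))


print(count_url())
# print(fetch_article_count())  # uncomment to query the live API
```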


## Exercise

Construct a query for the GDELT database with the following parameters: search for articles that mention both **fuel** and **nuclear**, published in Latvia, Estonia and Lithuania from Jun-01-2022 to Jun-30-2022. Retrieve the search results using Python and save them to a CSV file.

Hint: you can construct a query to extract data in the CSV format directly.


## Solution



1.   Go to the GDELT API web interface: https://gdelt.github.io/
2.   The task asks for the search terms fuel and nuclear, so enter `fuel AND nuclear` in the *search term* field.
3.   Select *Latvia, Lithuania and Estonia* in the *Source Countries* field.
4.   Select the correct date range. To be able to select a date range, the *Recent* field must be empty.
5.   Select either *csv* in the format field and export the data directly, or *json*, and then convert the response to CSV using Pandas.
6.   The final API call should be: `https://api.gdeltproject.org/api/v2/doc/doc?query=fuel%20AND%20nuclear%20(sourcecountry:LG%20OR%20sourcecountry:LH%20OR%20sourcecountry:EN)&mode=ArtList&maxrecords=75&format=csv&startdatetime=20220601000000&enddatetime=20220630235959`
7.   Then call the API from Python as follows:
```
import requests
```
```
base_url = 'https://api.gdeltproject.org/api/v2/'
endpoint = 'doc/doc'
parameters = {
  'query': 'fuel AND nuclear (sourcecountry:LG OR sourcecountry:LH OR sourcecountry:EN)',
  'mode': 'ArtList',
  'maxrecords': 75,
  'format': 'csv',
  'startdatetime': '20220601000000',
  'enddatetime': '20220630235959'
}
```
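```
url = base_url + endpoint
response = requests.get(url, params=parameters)
response.raise_for_status()
# format=csv returns CSV text, not JSON, so write the body to a file
# ('gdelt_results.csv' is an arbitrary file name)
with open('gdelt_results.csv', 'w', encoding='utf-8') as f:
    f.write(response.text)
```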
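
To see how the query string in step 6 comes about, and how the *json* route from step 5 could work without Pandas, here is a standard-library sketch. The function names are hypothetical, and the shape of the JSON payload (an `articles` list of flat dictionaries) is an assumption that should be checked against a real response.

```python
import csv
import io
from urllib.parse import quote, urlencode

API_URL = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_gdelt_url(params: dict) -> str:
    """Percent-encode the parameters (spaces become %20) and append them to the API URL."""
    return API_URL + "?" + urlencode(params, quote_via=quote)


def articles_to_csv(payload: dict) -> str:
    """Convert a payload with an 'articles' list of dicts (assumed shape) to CSV text."""
    articles = payload.get("articles", [])
    if not articles:
        return ""
    # Collect column names in first-seen order so rows with extra keys still fit
    fieldnames = list(dict.fromkeys(key for row in articles for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(articles)
    return buffer.getvalue()


params = {
    "query": "fuel AND nuclear (sourcecountry:LG OR sourcecountry:LH OR sourcecountry:EN)",
    "mode": "ArtList",
    "maxrecords": 75,
    "format": "csv",
    "startdatetime": "20220601000000",
    "enddatetime": "20220630235959",
}
url = build_gdelt_url(params)
print(url)

# Made-up payload, used only to demonstrate the conversion
sample = {"articles": [{"url": "https://example.com/a", "title": "Sample, with comma"}]}
csv_text = articles_to_csv(sample)
print(csv_text)
```

Unlike the URL in step 6, this version also percent-encodes the parentheses and colons; both forms decode to the same query.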



# STEP 3 TASKS

## Exercise

Study the structure of any article on the https://eng.lsm.lv/ website. Identify the elements of the page that we might want to scrape, along with their tags and classes.

## Possible solution

Example article: https://eng.lsm.lv/article/society/defense/latvia-interested-in-buying-himars-systems.a466648/

Participants should solve this task using their browser's Developer Tools.

Example solution:

*   Title: `h1` tag, `article-title` class
*   Lead section: `h2` tag, `article-lead` class
*   Article text: `div` tag, `article__body` class
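
To illustrate how such tag and class pairs translate into extraction, here is a small sketch using Python's built-in `html.parser`. The HTML snippet is made up and merely mimics the structure listed above; `FieldExtractor` and `extract_fields` are hypothetical names. In the Scrapy exercises the same job is done with CSS selectors.

```python
from html.parser import HTMLParser

# Field name -> (tag, class), following the example solution above
TARGETS = {
    "title": ("h1", "article-title"),
    "lead": ("h2", "article-lead"),
    "body": ("div", "article__body"),
}


class FieldExtractor(HTMLParser):
    """Collect the text inside the first matching element for each target."""

    def __init__(self, targets):
        super().__init__()
        self.targets = targets
        self.fields = {name: "" for name in targets}
        self._active = None  # name of the field currently being captured
        self._depth = 0      # nesting depth of same-named tags inside it

    def handle_starttag(self, tag, attrs):
        if self._active is not None:
            if tag == self.targets[self._active][0]:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name, (want_tag, want_class) in self.targets.items():
            if tag == want_tag and want_class in classes and not self.fields[name]:
                self._active, self._depth = name, 1
                break

    def handle_endtag(self, tag):
        if self._active is not None and tag == self.targets[self._active][0]:
            self._depth -= 1
            if self._depth == 0:
                self._active = None

    def handle_data(self, data):
        if self._active is not None:
            self.fields[self._active] += data


def extract_fields(html: str) -> dict:
    parser = FieldExtractor(TARGETS)
    parser.feed(html)
    # Collapse runs of whitespace left over from the markup
    return {name: " ".join(text.split()) for name, text in parser.fields.items()}


SAMPLE_HTML = """
<h1 class="article-title">Sample headline</h1>
<h2 class="article-lead">Sample lead paragraph.</h2>
<div class="article__body"><p>First paragraph.</p> <p>Second paragraph.</p></div>
"""
fields = extract_fields(SAMPLE_HTML)
print(fields)
```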



## Exercise

In this exercise, we will try to scrape Wikipedia. 

1.   Check the *robots.txt* file on Wikipedia. Is it allowed to scrape Wikipedia? What else is interesting in this file?
2.   Select an article you will scrape (e.g. https://en.wikipedia.org/wiki/Lion%27s_mane_jellyfish)
3.   Initialize a new Scrapy project called wikiSpider. There, create a spider in the file article.py
4.   Extract the title of this article
5.   Try to add more Wikipedia pages to the `start_requests` urls and check whether your scraper works with other Wikipedia pages.

## Possible solution
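
For the first step, the rules in a *robots.txt* file can also be checked programmatically. The sketch below uses `urllib.robotparser` from the standard library on a made-up rule set, so it runs offline; the rules are illustrative and are not a copy of Wikipedia's actual file, which should be read at https://en.wikipedia.org/robots.txt. Projects created with `scrapy startproject` perform a similar check themselves when `ROBOTSTXT_OBEY` is `True`.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Allow: /wiki/
Disallow: /w/
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already available as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = make_parser(SAMPLE_ROBOTS_TXT)
article = "https://en.wikipedia.org/wiki/Lion%27s_mane_jellyfish"
script = "https://en.wikipedia.org/w/index.php?title=Lion%27s_mane_jellyfish"

article_allowed = rules.can_fetch("*", article)
script_allowed = rules.can_fetch("*", script)
print(article_allowed, script_allowed)
```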

In [None]:
# install Scrapy
!pip install Scrapy
!scrapy startproject wikiSpider


In [None]:
import os
# change working directories
os.chdir('/content/wikiSpider/wikiSpider/spiders')

In [None]:
%%writefile article.py
import scrapy

class ArticleSpider(scrapy.Spider):
  name='article'
  start_urls = ['https://en.wikipedia.org/wiki/Lion%27s_mane_jellyfish',
                'https://en.wikipedia.org/wiki/Robyn_E._Kenealy',
                'https://en.wikipedia.org/wiki/Grodzany']

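  def parse(self, response):
    # The space before ::text also matches text nested in child elements of the heading
    title = response.css('h1.firstHeading ::text').get()
    # str() avoids a TypeError if no heading is found and title is None
    print('Title is: ' + str(title))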

Writing article.py


In [None]:
!scrapy runspider article.py

2022-07-26 12:16:15 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: wikiSpider)
2022-07-26 12:16:15 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
2022-07-26 12:16:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'wikiSpider',
 'NEWSPIDER_MODULE': 'wikiSpider.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_LOADER_WARN_ONLY': True,
 'SPIDER_MODULES': ['wikiSpider.spiders']}
2022-07-26 12:16:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-07-26 12:16:15 [scrapy.extensions.telnet] INFO: Telnet Password: 9ac28542a786a8e8
2022-07-26 12:16:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.

## Exercise

In this exercise, you will scrape the Estonian news website: https://news.err.ee/

The task is to create a spider to collect all links for the articles on the main page, then extract their title, publication date and the lead section of the article (in bold). Create a separate Scrapy project for that task. Don't forget to think about Item structure for the extracted data.

Bonus task: also extract the tags below the article body.



## Possible solution

In [None]:
# install Scrapy
!pip install Scrapy
!scrapy startproject errSpider


In [None]:
import os
os.chdir('/content/errSpider/errSpider')

Create the file with the Item structure definition:

In [None]:
%%writefile items.py
import scrapy


class ErrspiderItem(scrapy.Item):
    title = scrapy.Field()
    lead = scrapy.Field()
    date = scrapy.Field()
    tags = scrapy.Field()


Overwriting items.py
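
As an alternative way to think about the Item structure, the same four fields can be written as a plain dataclass. Recent Scrapy versions also accept dataclass objects as items, but the sketch below is independent of Scrapy; the class name `ArticleRecord` and the sample values are hypothetical.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ArticleRecord:
    """One scraped article; every field is optional because a page may lack it."""
    title: Optional[str] = None
    lead: Optional[str] = None
    date: Optional[str] = None
    tags: List[str] = field(default_factory=list)


record = ArticleRecord(title="Sample headline", tags=["sample"])
print(asdict(record))
```

Typed fields with defaults make it explicit which values may be missing, which helps when the bonus task adds a list-valued `tags` field.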


In [None]:
os.chdir('/content/errSpider/errSpider/spiders')

Create the Spider file:

In [None]:
%%writefile errMain.py
import scrapy
from errSpider.items import ErrspiderItem

class ErrMainSpider(scrapy.Spider):
  name = 'errMain'
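  # Note: this example crawls delfi.ee; to scrape https://news.err.ee/ as the
  # exercise asks, change the domain, start URL and CSS selectors accordingly.
  allowed_domains = ["delfi.ee"]

  start_urls = ['https://www.delfi.ee/']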

  def parse(self, response):
    for href in response.css('h5.C-headline-title').css('a::attr("href")'):
      url = response.urljoin(href.get())
      yield scrapy.Request(url, callback = self.parse_article_content)

  def parse_article_content(self, response):
    article = ErrspiderItem()
    article['title'] = response.css('h1.C-article-info__title').get()
    yield article

Overwriting errMain.py
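
The link-collection step in `parse` boils down to two operations: finding the `a` elements inside the headline elements and resolving their `href` values against the page URL, which is what `response.urljoin` does. Here is a standard-library sketch of that logic on a made-up HTML snippet; `LinkCollector`, `article_urls`, the example.com URLs and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values of <a> elements nested inside a given tag and class."""

    def __init__(self, tag: str, css_class: str):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.inside = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag and self.css_class in (attrs.get("class") or "").split():
            self.inside = True
        elif tag == "a" and self.inside and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False


def article_urls(page_url, html, tag="h5", css_class="C-headline-title"):
    """Return absolute article URLs without duplicates, in page order."""
    collector = LinkCollector(tag, css_class)
    collector.feed(html)
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(urljoin(page_url, href) for href in collector.hrefs))


SAMPLE = """
<h5 class="C-headline-title"><a href="/news/1/first">First</a></h5>
<h5 class="C-headline-title"><a href="https://example.com/news/2/second">Second</a></h5>
<h5 class="C-headline-title"><a href="/news/1/first">First again</a></h5>
<p><a href="/about">About</a></p>
"""
urls = article_urls("https://example.com/", SAMPLE)
print(urls)
```

Scrapy filters duplicate requests on its own, so the explicit de-duplication matters mainly when the list of links is saved or inspected directly.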


In [None]:
!scrapy runspider errMain.py -o errMain.csv

2022-07-27 18:23:22 [scrapy.utils.log] INFO: Scrapy 2.6.2 started (bot: errSpider)
2022-07-27 18:23:22 [scrapy.utils.log] INFO: Versions: lxml 4.2.6.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
2022-07-27 18:23:22 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'errSpider',
 'NEWSPIDER_MODULE': 'errSpider.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_LOADER_WARN_ONLY': True,
 'SPIDER_MODULES': ['errSpider.spiders']}
2022-07-27 18:23:22 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-07-27 18:23:22 [scrapy.extensions.telnet] INFO: Telnet Password: 4af5d890f7d884ca
2022-07-27 18:23:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.Teln