### Working with APIs ###
#### CAS Applied Data Science 2024 ####

# Exercises #



---


<font color='violet'>
Hints are written in white, so you do not see them immediately. If you highlight them (or double-click on them), they will appear!
</font>
<font color='white'> I am a hint! :-)</font>


---


## 1. Basic exercises

Note: In these exercises, you will have to work with dictionaries, list comprehensions and loops quite often. If you don't remember/understand them very well, you should first go through the respective concepts once more!

### Exercise 1.1

Import the ``requests`` and ``BeautifulSoup`` libraries.

In [1]:
import requests
from bs4 import BeautifulSoup

Use the Wikipedia API to retrieve the Wikipedia page on tigers (Tiger).

In [5]:
ENDPOINT = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "parse",
    "page": "Tiger",
    "format": "json",
}

r = requests.get(url=ENDPOINT, params=PARAMS)


Parse the JSON response object (so that it is converted to a Python dictionary).

In [6]:
data = r.json()["parse"]

What is the URL you retrieved the data from? Type it into your browser and inspect the structure of the data.

In [None]:
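# The response object stores the URL that was actually requested:
r.url
# Equivalent URL to paste into the browser:
# https://en.wikipedia.org/w/api.php?page=Tiger&format=json&action=parse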
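
To see where this URL comes from: ``requests`` appends the ``params`` dictionary to the endpoint as a query string. The sketch below reproduces that step with ``urlencode`` from the standard library, purely for illustration. The order of the parameters may differ from the URL above, which makes no difference to the API.

```python
from urllib.parse import urlencode

ENDPOINT = "https://en.wikipedia.org/w/api.php"
PARAMS = {"action": "parse", "page": "Tiger", "format": "json"}

# Join the endpoint and the encoded parameters with "?" to get the full URL.
url = f"{ENDPOINT}?{urlencode(PARAMS)}"
print(url)
```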

Now inspect the keys of your dictionary. Note that ``data`` already holds the dictionary stored under the "parse" key of the response.

In [7]:
data.keys()



Now try to retrieve the following:

1. The title of the page
2. All external links
3. All section headings (extra task: only the main section headings, i.e. headings on level 2)
4. The number of images in the article
5. All URLs to Wikipedia articles on tigers in other languages

In [8]:
# 1. The title of the page
data["title"]

'Tiger'

In [15]:
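# 2. All external links
# Note: "iwlinks" holds the interwiki links (links to other Wikimedia projects).
# Links to sites outside Wikimedia are stored under data["externallinks"].
[link["url"] for link in data["iwlinks"]]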

['https://kk.wikipedia.org/wiki/%D0%86%D0%BB%D0%B5-%D0%91%D0%B0%D0%BB%D2%9B%D0%B0%D1%88_(%D1%80%D0%B5%D0%B7%D0%B5%D1%80%D0%B2%D0%B0%D1%82)',
 'https://commons.wikimedia.org/wiki/Panthera_tigris',
 'https://commons.wikimedia.org/wiki/Category:Panthera_tigris',
 'https://species.wikimedia.org/wiki/Panthera_tigris',
 'https://en.wikiquote.org/wiki/Tigers',
 'https://en.wikivoyage.org/wiki/Tigers',
 'https://www.wikidata.org/wiki/Q19939',
 'https://www.wikidata.org/wiki/Q41083521',
 'https://de.wikipedia.org/wiki/Tierstimmenarchiv']

In [21]:
# 3. All section headings
[section["anchor"] for section in data["sections"]]

['Etymology',
 'Taxonomy',
 'Subspecies',
 'Evolution',
 'Hybrids',
 'Characteristics',
 'Coat',
 'Colour_variations',
 'Distribution_and_habitat',
 'Population_density',
 'Behaviour_and_ecology',
 'Social_spacing',
 'Communication',
 'Hunting_and_diet',
 'Competitors',
 'Reproduction_and_life_cycle',
 'Health_and_diseases',
 'Threats',
 'Conservation',
 'Relationship_with_humans',
 'Hunting',
 'Attacks',
 'Captivity',
 'Cultural_significance',
 'See_also',
 'References',
 'Bibliography',
 'External_links']

In [25]:
# 3. Extra task: all main section headings (heading level 2)
# In the table of contents, main sections have "toclevel" 1;
# "toclevel" 2 would return the subsections instead.
[section["anchor"] for section in data["sections"] if section["toclevel"] == 1]

['Etymology',
 'Taxonomy',
 'Characteristics',
 'Distribution_and_habitat',
 'Behaviour_and_ecology',
 'Threats',
 'Conservation',
 'Relationship_with_humans',
 'See_also',
 'References',
 'External_links']

In [30]:
# 4. The number of images in the article
length = len(data["images"])
print(f"Number of images: {length}")


Number of images: 63


In [38]:
# 5. All URLs to Wikipedia articles on tigers in other languages
[{"lang": link["lang"], "url": link["url"]} for link in data["langlinks"]]



[{'lang': 'ace', 'url': 'https://ace.wikipedia.org/wiki/Rimu%C3%ABng'},
 {'lang': 'kbd',
  'url': 'https://kbd.wikipedia.org/wiki/%D0%A5%D1%8C%D1%8D%D1%89%D0%BE%D0%BC%D1%8B%D1%89'},
 {'lang': 'ady',
  'url': 'https://ady.wikipedia.org/wiki/%D0%9A%D1%8A%D1%8D%D0%BF%D0%BB%D1%8A%D0%B0%D0%BD'},
 {'lang': 'af', 'url': 'https://af.wikipedia.org/wiki/Tier'},
 {'lang': 'als', 'url': 'https://als.wikipedia.org/wiki/Tiger'},
 {'lang': 'am',
  'url': 'https://am.wikipedia.org/wiki/%E1%8A%90%E1%89%A5%E1%88%AD'},
 {'lang': 'anp',
  'url': 'https://anp.wikipedia.org/wiki/%E0%A4%AC%E0%A4%BE%E0%A4%98'},
 {'lang': 'ang', 'url': 'https://ang.wikipedia.org/wiki/Tiger'},
 {'lang': 'ar', 'url': 'https://ar.wikipedia.org/wiki/%D8%A8%D8%A8%D8%B1'},
 {'lang': 'an', 'url': 'https://an.wikipedia.org/wiki/Panthera_tigris'},
 {'lang': 'roa-rup', 'url': 'https://roa-rup.wikipedia.org/wiki/Tigru'},
 {'lang': 'ast', 'url': 'https://ast.wikipedia.org/wiki/Panthera_tigris'},
 {'lang': 'gn', 'url': 'https://gn.wikipedi

Repetition (dictionaries and loops): Now build a dictionary of all languages for which a Wikipedia page on tigers exists, together with the titles of these pages. Use the languages as keys and the titles as values (`{"English":"Tiger", ...}`). What are tigers called in Finnish?

In [40]:
[{"lang": link["lang"], "url": link["url"]} for link in data["langlinks"] if link["lang"] == "fi"]

[{'lang': 'fi', 'url': 'https://fi.wikipedia.org/wiki/Tiikeri'}]
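
The cell above only filters for the Finnish entry. A minimal sketch of the dictionary the task asks for could look as follows. It assumes that each entry in ``data["langlinks"]`` also carries a ``"langname"`` key (the English language name) and a ``"*"`` key (the page title); check ``data["langlinks"][0].keys()`` on your own response before relying on this. The three entries are a hand-written stand-in for the real list.

```python
# Hand-written stand-in for data["langlinks"]; the "langname" and "*" keys
# are an assumption about the response layout, so verify them on your data.
langlinks = [
    {"lang": "af", "langname": "Afrikaans", "*": "Tier"},
    {"lang": "an", "langname": "Aragonese", "*": "Panthera tigris"},
    {"lang": "fi", "langname": "Finnish", "*": "Tiikeri"},
]

# Dictionary comprehension: language name as key, page title as value.
titles_by_language = {link["langname"]: link["*"] for link in langlinks}

print(titles_by_language["Finnish"])
```

With the real response, replace ``langlinks`` with ``data["langlinks"]``.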

### Exercise 1.2

Find the HTML text within the structured data on the tiger page and assign it to a variable named ``htmlText``.

Convert the string to a BeautifulSoup object and assign it to a variable called ``tigerSoup``.

Repetition: Extract (1) the first table and (2) the third paragraph of the article.
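
One possible way to approach these three steps is sketched below on a tiny hand-written stand-in for the response. The path ``data["text"]["*"]`` is an assumption about where the HTML sits in the parsed response; inspect ``data.keys()`` to confirm it for your own data.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the parsed response; in the exercise, `data`
# comes from r.json()["parse"]. The ["text"]["*"] path is an assumption.
data = {
    "text": {
        "*": "<table><tr><td>Panthera tigris</td></tr></table>"
             "<p>First.</p><p>Second.</p><p>Third.</p>"
    }
}

htmlText = data["text"]["*"]
tigerSoup = BeautifulSoup(htmlText, "html.parser")

first_table = tigerSoup.find("table")          # first <table> element
third_paragraph = tigerSoup.find_all("p")[2]   # index 2 is the third <p>

print(first_table.get_text())
print(third_paragraph.get_text())
```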

### Exercise 1.3

Retrieve the information about 10 Wikipedia pages that match the word "tiger" and convert the response to a dictionary.

<font color='violet'> Hint: </font><font color='white'> The "srlimit" parameter allows you to specify how many pages you want to retrieve.</font>

Navigate through the dictionary or type the URL into your browser to inspect how the data is structured.

Now print out the titles of the 10 tiger pages.
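
As a rough sketch of this exercise: the search uses ``action=query`` together with ``list=search``, and the results are expected under ``["query"]["search"]``. The response below is a hand-written stand-in shortened to two entries with illustrative titles, so the real output will differ.

```python
# Parameters for a full-text search; pass them to requests.get() as in 1.1.
PARAMS = {
    "action": "query",
    "list": "search",
    "srsearch": "tiger",
    "srlimit": 10,
    "format": "json",
}

# Hand-written stand-in for r.json(), shortened to two results.
data = {"query": {"search": [{"title": "Tiger"}, {"title": "Bengal tiger"}]}}

# Each search result is a dictionary; the page title sits under "title".
titles = [result["title"] for result in data["query"]["search"]]
for title in titles:
    print(title)
```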

### Exercise 1.4

In the web scraping exercises, you wrote a simple scraper that fetched some information from the following animal pages:

In [None]:
animals = ["Cat", "Dog", "Tiger", "Giant_panda"]

You will now try to do the same using the API: Write a loop that fetches all the pages and retrieves (1) the title and (2) the number of images on each page.




Let's do this step by step. First try to fetch the "Cat" page and convert the JSON response you get into a Python dictionary.

Now try to retrieve the title and the number of images. Assign each of them to a variable.

Now write a loop that fetches all the pages and writes the responses into a list. You will have to create an empty list and ``append`` the new elements to it.

Finally, you can bring everything together. Improve your loop so that it parses each page, retrieves the title and the number of images and writes them into a nested list.
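
The steps above could be combined roughly as follows. The function names ``summarise``, ``fetch_parsed`` and ``collect`` are made up for this sketch. The ``fetch`` argument is only there so that the loop can be tried with a fake function instead of real requests.

```python
def summarise(parsed):
    """Return [title, number of images] for one parsed page dictionary."""
    return [parsed["title"], len(parsed["images"])]


def fetch_parsed(page):
    """Fetch one page via the parse action and return the "parse" dictionary."""
    import requests  # imported here so summarise() can be tried offline

    r = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "parse", "page": page, "format": "json"},
    )
    return r.json()["parse"]


def collect(animals, fetch=fetch_parsed):
    """Loop over the page names and build a nested list of results."""
    results = []
    for animal in animals:
        results.append(summarise(fetch(animal)))
    return results
```

Calling ``collect(["Cat", "Dog", "Tiger", "Giant_panda"])`` would then send one request per page and return a list of ``[title, number_of_images]`` pairs.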

## 2. Advanced exercises*

---


<font color='red'>
*Feel free to skip the advanced exercises if you feel overwhelmed or if trying to solve the basic exercises already took you a lot of time!
</font>


---




### Exercise 2.1

Write a function called ``getWiki`` that allows you to enter the name of a Wikipedia page and returns the parsed JSON response as a Python dictionary.
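
A minimal sketch of such a function is shown below. The helper ``parse_params``, the ``timeout`` value and the call to ``raise_for_status()`` are additions made for this sketch, not requirements of the exercise.

```python
ENDPOINT = "https://en.wikipedia.org/w/api.php"


def parse_params(page):
    """Build the query parameters for the parse action (hypothetical helper)."""
    return {"action": "parse", "page": page, "format": "json"}


def getWiki(page):
    """Fetch a Wikipedia page by name and return the JSON response as a dict."""
    import requests  # imported here so parse_params() can be tried offline

    r = requests.get(ENDPOINT, params=parse_params(page), timeout=10)
    r.raise_for_status()  # raise an error for 4xx/5xx responses
    return r.json()
```

For example, ``getWiki("Tiger")["parse"]["title"]`` should then give the page title.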

### Exercise 2.2

You would like to know whether Zürich or Bern is more popular on Wikipedia. For this purpose, you will measure (1) the number of Wikipedia articles within a 1 km radius around the train station and (2) the total number of images in these articles. Try to work with functions instead of copying and pasting code!

You can take the following coordinates:
* Bern: 46.949722, 7.439444
* Zürich: 47.377455, 8.539688


Start by comparing the number of articles:
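
One way to set this up is a geosearch query (``action=query`` with ``list=geosearch``). The sketch below only builds the parameters and reads titles from a hand-written stand-in response; the function names and the placeholder titles are made up. It assumes that ``gsradius`` is given in metres and that ``gslimit`` must be raised above its small default, otherwise both cities may be capped at the same number of results.

```python
def geosearch_params(lat, lon, radius_m=1000, limit=500):
    """Build the parameters for a geosearch around one coordinate."""
    return {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",  # latitude and longitude separated by "|"
        "gsradius": radius_m,
        "gslimit": limit,
        "format": "json",
    }


def article_titles(response):
    """Extract the article titles from a parsed geosearch response."""
    return [page["title"] for page in response["query"]["geosearch"]]


# Hand-written stand-in for r.json(); the titles are placeholders.
sample = {"query": {"geosearch": [{"title": "Article A"}, {"title": "Article B"}]}}

print(geosearch_params(46.949722, 7.439444)["gscoord"])
print(len(article_titles(sample)))
```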

Now try to compare the total number of images in these articles. You will have to write a loop to retrieve each page.
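
The loop could be sketched as below. ``total_images`` is a made-up name, and ``fetch_page`` stands for any function that takes a page title and returns the parsed response dictionary, such as the ``getWiki`` function asked for in Exercise 2.1. A fake function is used here so that the sketch runs without network access.

```python
def total_images(titles, fetch_page):
    """Sum the number of images over all pages in `titles`."""
    total = 0
    for title in titles:
        parsed = fetch_page(title)["parse"]
        total += len(parsed["images"])
    return total


def fake_fetch(title):
    """Stand-in for a real request: every page has two images."""
    return {"parse": {"title": title, "images": ["a.jpg", "b.jpg"]}}


print(total_images(["Article A", "Article B"], fake_fetch))
```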