<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item">
<li><span><a href="#1.-Introduction-to-Web-Scraping" data-toc-modified-id="1.-Introduction-to-Web-Scraping-1">1. Introduction to Web Scraping</a></span><ul class="toc-item">
<li><span><a href="#1.1-Example:-Getting-information-about-the-International-Space-Station-(ISS)-from-http://api.open-notify.org" data-toc-modified-id="1.2-Example:-Getting-information-about-the-International-Space-Station-(ISS)-from-http://api.open-notify.org-1.2">1.1 Example: Getting information about the International Space Station (ISS) from <a href="http://api.open-notify.org" target="_blank">http://api.open-notify.org</a></a></span></li>
<li><span><a href="#1.2-Example:-Getting-information-about-countries-using-Rest-Countries-API" data-toc-modified-id="1.2-Example:-Getting-information-about-countries-using-Rest-Countries-API-1.3">1.2 Example: Getting information about countries using Rest Countries API</a></span></li></ul>
</li></ul></div>

---
# 1. Introduction to Web Scraping
---

Web scraping is the process of extracting data from websites. It can be done manually, or automated with software that downloads the data and stores it in an accessible format.

In this notebook, we explore web data access in Python using the third-party **requests** package (installable with `pip install requests`), which lets us make HTTP requests.

## 1.1 Example: Getting information about the International Space Station (ISS) from http://api.open-notify.org

In [1]:
import requests

In [2]:
# Example: Get info about the ISS
url = "http://api.open-notify.org/iss-now.json"
r = requests.get(url)
print(r.status_code)

200


In [3]:
r.text # the raw response

'{"iss_position": {"longitude": "-120.5067", "latitude": "11.0557"}, "timestamp": 1656576593, "message": "success"}'

The response is in JSON format, which looks similar to a Python dictionary. It can be converted into an actual Python dictionary using the `.json()` method of the response object.

Note: Some websites return responses in formats other than JSON (e.g. HTML or XML).
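
Because a server may answer with an error page or a non-JSON body, it can help to check the response before parsing it. The sketch below is one possible approach, not part of the original notebook: `parse_json_response` is a hypothetical helper name, and it assumes its argument behaves like a `requests.Response` (it relies only on `raise_for_status()`, `headers` and `json()`).

```python
def parse_json_response(response):
    """Return the parsed JSON body of a requests-style response.

    Hypothetical helper for illustration: raises if the status code signals
    an error or if the server did not label the body as JSON.
    """
    # raise_for_status() raises an HTTPError for 4xx/5xx status codes
    response.raise_for_status()

    # The Content-Type header says what format the body is in
    content_type = response.headers.get("Content-Type", "")
    if "json" not in content_type.lower():
        raise ValueError(f"Expected JSON but got Content-Type: {content_type!r}")

    return response.json()


# Usage with the request made earlier (needs network access):
# data = parse_json_response(requests.get(url, timeout=10))
```

Passing a `timeout` to `requests.get`, as in the commented usage line, avoids waiting indefinitely if the server does not respond.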

In [4]:
import json

# Deserialise the raw response text into Python objects (nested lists and dictionaries)
data = json.loads(r.text)
data["timestamp"]

# Write the data to a file so we can access it later; `with` closes the file for us
with open("data.json", "w") as f:
    json.dump(data, f)

In [5]:
data = r.json()
data

{'iss_position': {'longitude': '-120.5067', 'latitude': '11.0557'},
 'timestamp': 1656576593,
 'message': 'success'}

In [6]:
data["iss_position"]["latitude"]

'11.0557'

Once converted into a dictionary, the response can be navigated using standard key indexing.
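
As a small illustration of that indexing, the sketch below pulls the coordinates out of the response shown above and converts them to numbers, since the API returns latitude and longitude as strings. The dictionary is copied from the earlier output so the example runs without a network connection; `position` is just an illustrative variable name.

```python
# Sample response copied from the output above
data = {
    "iss_position": {"longitude": "-120.5067", "latitude": "11.0557"},
    "timestamp": 1656576593,
    "message": "success",
}

# Index into the nested dictionary, then convert the strings to floats
position = data["iss_position"]
latitude = float(position["latitude"])
longitude = float(position["longitude"])

print(f"ISS position: {latitude:.2f}° latitude, {longitude:.2f}° longitude")
```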

In [7]:
from datetime import datetime

iss_now = datetime.fromtimestamp(data["timestamp"])
iss_now

datetime.datetime(2022, 6, 30, 9, 9, 53)

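The timestamp is in [Unix time format](https://en.wikipedia.org/wiki/Unix_time): the number of seconds since 00:00:00 UTC on 1 January 1970. We can convert it into a readable date using Python's standard-library **datetime** module. Note that `datetime.fromtimestamp` without a `tz` argument returns local time, so the value shown above is one hour ahead of the UTC time (08:09:53).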
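
If a result that does not depend on the machine's timezone is preferred, one option is to pass an explicit timezone. This sketch uses the timestamp from the response above together with `timezone.utc` from the standard library; the variable name `iss_now_utc` is only illustrative.

```python
from datetime import datetime, timezone

timestamp = 1656576593  # value from the ISS response shown earlier

# Passing tz makes the result timezone-aware and independent of the local machine
iss_now_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)

print(iss_now_utc.isoformat())  # 2022-06-30T08:09:53+00:00
```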

### Concept Check <a class="tocSkip"></a>

Print out a list of all people who are currently in space on the ISS. Use `http://api.open-notify.org/astros.json`



In [9]:
api_endpoint = "http://api.open-notify.org/astros.json"
r = requests.get(api_endpoint)
print(f"GET request status code for {api_endpoint}: {r.status_code}")
response = r.json()

for person in response["people"]:
    if person["craft"] == "ISS":
        print(person["name"])

GET request status code for http://api.open-notify.org/astros.json: 200
Oleg Artemyev
Denis Matveev
Sergey Korsakov
Kjell Lindgren
Bob Hines
Samantha Cristoforetti
Jessica Watkins


## 1.2 Example: Getting information about countries using Rest Countries API
Rest Countries API: <https://restcountries.com>

In [13]:
# API URL for a particular country
url = "https://restcountries.com/v3.1/name/Japan"

In [14]:
r = requests.get(url) # Get request
r.status_code # Check status code

200

In [15]:
response = r.json()
response # Note: The response is a list of dictionaries

[{'name': {'common': 'Japan',
   'official': 'Japan',
   'nativeName': {'jpn': {'official': '日本', 'common': '日本'}}},
  'tld': ['.jp', '.みんな'],
  'cca2': 'JP',
  'ccn3': '392',
  'cca3': 'JPN',
  'cioc': 'JPN',
  'independent': True,
  'status': 'officially-assigned',
  'unMember': True,
  'currencies': {'JPY': {'name': 'Japanese yen', 'symbol': '¥'}},
  'idd': {'root': '+8', 'suffixes': ['1']},
  'capital': ['Tokyo'],
  'altSpellings': ['JP', 'Nippon', 'Nihon'],
  'region': 'Asia',
  'subregion': 'Eastern Asia',
  'languages': {'jpn': 'Japanese'},
  'translations': {'ara': {'official': 'اليابان', 'common': 'اليابان'},
   'ces': {'official': 'Japonsko', 'common': 'Japonsko'},
   'cym': {'official': 'Japan', 'common': 'Japan'},
   'deu': {'official': 'Japan', 'common': 'Japan'},
   'est': {'official': 'Jaapan', 'common': 'Jaapan'},
   'fin': {'official': 'Japani', 'common': 'Japani'},
   'fra': {'official': 'Japon', 'common': 'Japon'},
   'hrv': {'official': 'Japan', 'common': 'Japan'},


In [18]:
response[0].keys()

dict_keys(['name', 'tld', 'cca2', 'ccn3', 'cca3', 'cioc', 'independent', 'status', 'unMember', 'currencies', 'idd', 'capital', 'altSpellings', 'region', 'subregion', 'languages', 'translations', 'latlng', 'landlocked', 'area', 'demonyms', 'flag', 'maps', 'population', 'gini', 'fifa', 'car', 'timezones', 'continents', 'flags', 'coatOfArms', 'startOfWeek', 'capitalInfo', 'postalCode'])

In [17]:
response[0]["currencies"]["JPY"]["name"]

'Japanese yen'

### Concept Check <a class="tocSkip"></a>

1. Print out a list of all the capital cities in Europe that begin with the letter 'L'.
2. Print out a list of the capital cities of all countries whose names begin with the letter 'L'.

In [20]:
europe_endpoint = "https://restcountries.com/v3.1/region/europe"
r = requests.get(europe_endpoint)
print(f"GET request status code for {europe_endpoint}: {r.status_code}")
response = r.json() # parse the JSON body (here, a list of dictionaries)

GET request status code for https://restcountries.com/v3.1/region/europe: 200


In [25]:
# Object is a list of dictionaries, one for each country
response[2].keys() # Check out keys for each country

dict_keys(['name', 'tld', 'cca2', 'ccn3', 'cca3', 'cioc', 'independent', 'status', 'unMember', 'currencies', 'idd', 'capital', 'altSpellings', 'region', 'subregion', 'languages', 'translations', 'latlng', 'landlocked', 'borders', 'area', 'demonyms', 'flag', 'maps', 'population', 'gini', 'fifa', 'car', 'timezones', 'continents', 'flags', 'coatOfArms', 'startOfWeek', 'capitalInfo', 'postalCode'])

In [32]:
response[0].get("name") # If the key doesn't exist, .get returns None instead of raising a KeyError
# The value here is a dictionary of names

{'common': 'Cyprus',
 'official': 'Republic of Cyprus',
 'nativeName': {'ell': {'official': 'Δημοκρατία της Κύπρος',
   'common': 'Κύπρος'},
  'tur': {'official': 'Kıbrıs Cumhuriyeti', 'common': 'Kıbrıs'}}}

In [33]:
response[0].get("capital") # Returns a list containing the capital's name

['Nicosia']
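
The difference between `[]` indexing and `.get` matters when a key is absent. For instance, the keys listed for Japan earlier do not include `'borders'`, while the European entries do. The sketch below uses a trimmed-down, hand-copied record (not a full API response) to show both behaviours offline.

```python
# A trimmed-down country record using values from the Japan response above
country = {"name": {"common": "Japan"}, "capital": ["Tokyo"], "region": "Asia"}

# .get returns None (or a default we choose) when the key is missing
borders = country.get("borders", [])
print(borders)

# Square-bracket indexing raises KeyError for the same missing key
try:
    country["borders"]
except KeyError as err:
    print(f"Missing key: {err}")
```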

In [35]:
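for country in response:
    capitals = country.get("capital")  # None if the key is missing
    if not capitals:
        continue  # skip entries with no capital listed
    capital_city = capitals[0]
    name = country["name"]["common"]
    if capital_city.startswith("L"):
        print(f"{name}: {capital_city}")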

Portugal: Lisbon
United Kingdom: London
Luxembourg: Luxembourg
Svalbard and Jan Mayen: Longyearbyen
Slovenia: Ljubljana
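
The loop above answers the first question. For the second, one possible approach is to test the country name instead of the capital. The sketch below is an assumption about how this could be done, not the notebook's own solution: `capitals_for_countries_starting_with` is a hypothetical helper, and `sample` is a small hand-written list in the same shape as the API response (not real API output) so that the example runs offline.

```python
def capitals_for_countries_starting_with(countries, letter):
    """Return (country, capital) pairs for countries whose common name starts with `letter`."""
    results = []
    for country in countries:
        name = country["name"]["common"]
        capitals = country.get("capital") or []  # tolerate a missing capital
        if name.startswith(letter):
            for capital_city in capitals:
                results.append((name, capital_city))
    return results


# Hand-written sample in the same shape as the API response
sample = [
    {"name": {"common": "Latvia"}, "capital": ["Riga"]},
    {"name": {"common": "Portugal"}, "capital": ["Lisbon"]},
    {"name": {"common": "Luxembourg"}, "capital": ["Luxembourg"]},
]

for name, capital_city in capitals_for_countries_starting_with(sample, "L"):
    print(f"{name}: {capital_city}")
```

With real data, the parsed list returned by `r.json()` can be passed in place of `sample`.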
