# Webscraping: APIs

## Environment setup

In [1]:
from google.colab import drive, files
import json
drive.mount('/mntDrive') 
path = "/mntDrive/My Drive/Colab Notebooks/"

Mounted at /mntDrive


## Theory behind 

![REST API](https://drive.google.com/uc?id=1jUdf7QX76DtwcDRHgQAC_qBCAdh0VmZP)

**Figure:** REST API - Author: Seobility 

Websites sometimes use a different approach to serve their content: instead of generating and returning a complete page, they send a skeleton of the page with JavaScript code snippets that query the server for content and dynamically populate that skeleton. It is a widespread solution; many sites apply this approach to provide their content. A common architectural style for this is called a **[REST API](https://en.wikipedia.org/wiki/Representational_state_transfer)**. It has three main components:
- Client (the javascript code running in the webbrowser)
- API (the software running on a server)
- Database (the storage solution)

The client communicates with the API (but it has no direct access to the database itself) through different commands:
- GET: to receive data
- POST: to send (and possibly receive) data
- PUT: to add new content or replace existing content
- DELETE: to remove content
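
As a minimal sketch of how these commands look in code, the snippet below builds (but never sends) one request per method using only Python's standard library. The URL `https://api.example.com/flights`, the helper `build_request` and the payloads are hypothetical placeholders, not a real service.

```python
import json
from urllib.request import Request

# Hypothetical endpoint, used only for illustration; nothing is sent over the network.
BASE_URL = "https://api.example.com/flights"

def build_request(method, url=BASE_URL, payload=None):
    """Build an HTTP request object for the given method without sending it."""
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    return Request(url, data=body, headers=headers, method=method)

requests_by_method = {
    "GET": build_request("GET"),
    "POST": build_request("POST", payload={"destination": "Hurghada"}),
    "PUT": build_request("PUT", url=BASE_URL + "/752", payload={"terminal": "2B"}),
    "DELETE": build_request("DELETE", url=BASE_URL + "/752"),
}

for name, req in requests_by_method.items():
    print(name, req.full_url, req.data)
```

Only the method name and the optional body differ between the four requests; the rest of the machinery is the same.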

Throughout the communication the data is sent in a structured format, generally [JSON](https://en.wikipedia.org/wiki/JSON) or [XML](https://en.wikipedia.org/wiki/XML). The client-side code is responsible for transforming the received data and populating the site.
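
To make the JSON part concrete, here is a small sketch of turning JSON text into Python objects with the standard `json` module. The payload is a shortened, hand-written example; its field names and values are borrowed from the sample record shown later in this notebook.

```python
import json

# A shortened payload; field names and values mirror the sample record in this notebook.
raw = '[{"jarat": "MS", "jaratszam": "752", "honnan": "Hurghada", "terminal": "2B"}]'

flights = json.loads(raw)          # JSON array -> Python list of dicts
first = flights[0]
print(type(flights).__name__, type(first).__name__)
print(first["jarat"] + first["jaratszam"], "->", first["honnan"])

# The conversion is reversible: Python objects can be serialised back to JSON text.
round_trip = json.dumps(flights)
```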

The API waits for incoming commands at a so-called [endpoint](https://en.wikipedia.org/wiki/Service-oriented_architecture). Some sites tell you about (expose) their endpoints directly; in this case you are encouraged to use them to gather information. Other sites don't, but that doesn't mean they are not using one. We are going to use this to our advantage.
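
An endpoint is ultimately just a URL. The sketch below takes the endpoint used later in this notebook, with two of its query parameters attached by hand, and splits it into the address the API listens on and the parameters sent along with the request.

```python
from urllib.parse import urlsplit, parse_qs

# The endpoint used later in this notebook, with two of its query parameters attached.
url = "https://www.bud.hu/api/ajaxFlights/?mode=list&dir=0"

parts = urlsplit(url)
endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
query = parse_qs(parts.query)

print(endpoint)  # address the API listens on
print(query)     # parameters sent along with the request
```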

## General algorithm to uncover and exploit REST APIs

__Warning #1:__ Sometimes, the direct usage of APIs is forbidden for commercial purposes. Before you start building a business on it, you might want to read the related terms and conditions of the website. Occasional, non-commercial use is unlikely to result in any action.

__Warning #2:__ Not every website uses a REST API (or the API is restricted in some way). Therefore, this method will __not__ work in every single case. Sometimes, parsing HTML is just not something you can avoid. However, it is surely worth checking, as you may retrieve the whole dataset without having to parse and clean anything.

__Task__: Say you want to scrape the departing flights for a given day from [Budapest Liszt Ferenc Airport](https://www.bud.hu/indulo_jaratok). You need every detail that is accessible.

1. Open the [website](https://www.bud.hu/indulo_jaratok), right-click and choose `Inspect`. On the top bar, instead of browsing the `Elements` tab, switch to `Network`. If nothing is displayed here, refresh the page. This shows the list of network requests that happen under the hood. There are pictures, JavaScript files and a bunch of scary-looking entries that we will avoid, don't worry. You will want to order the requests by `Type`. In most cases, the `xhr` and `document` types are the ones we care about. If you click on one of the `xhr` entries, this is what should pop up.

![micro0](https://drive.google.com/uc?id=1AIb9eMa5dmh-vkaijBlaoh7LL7KXnbEM)

2. The `Headers` tab shows you the input details of the request that was sent out to retrieve this specific content. If you switch to the `Preview` or the `Response` tab, the result of this request will be shown to you. The former gives you a nicer, rendered look, while the latter returns the raw version.

3. Now, the task is to find the entry that returns the pieces of flights data we need. Let's check all the ones with `Type` = `xhr` first and check their `Preview` tabs to find the right one. I think we have a winner here, this looks great: 

 ![micro2](https://drive.google.com/uc?id=1c1oPqg3ClrL68lRyGRZBTJQiT-okhUq9)
 
4. Click on the "play button looking" triangle to expand an entry. Okay, this is very cool, we have it.

5. Next, we need to find a way to replicate it so that we can get the data programmatically. If only there was a way to retrieve the input data for this very request. Oh wait! This is what the `Headers` tab is there for, isn't it? It is!

6. Now, the `Headers` tab contains details in a non-Python format (this is not entirely true, but at this point you are not assumed to have the skills needed to transform it manually).

7. We are going to transform it with a third party service: https://curl.trillworks.com/

8. We need to copy the [curl](https://en.wikipedia.org/wiki/CURL) equivalent of the request by right-clicking it -> Copy -> Copy as cURL. At this point, the curl command is on the clipboard. Go to https://curl.trillworks.com/ and paste it into the curl command box. This will generate the Python code we can use.

![micro3](https://drive.google.com/uc?id=1NHA029QlzMjlERnixRkPjFRGFmbXNeXU)

9. You are all done :) From now on, success only depends on your Python skills.
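
To demystify the conversion in step 8, here is a rough sketch of what such a converter has to do: split the curl command into tokens and pick out the URL and the headers. The curl command below is a hypothetical, shortened example, and `parse_curl` is an illustrative helper that only understands `-H`/`--header`; it is not how the third-party service is actually implemented.

```python
import shlex

# Hypothetical, shortened curl command of the kind the browser copies to the clipboard.
curl_command = (
    "curl 'https://www.bud.hu/api/ajaxFlights/?mode=list&dir=0' "
    "-H 'Accept: */*' "
    "-H 'X-Requested-With: XMLHttpRequest' "
    "--compressed"
)

def parse_curl(command):
    """Extract the URL and headers from a simple curl command (only -H/--header is handled)."""
    tokens = shlex.split(command)[1:]   # drop the leading 'curl'
    url, headers = None, {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            headers[name.strip()] = value.strip()
            i += 2
        elif token.startswith("-"):
            i += 1                      # skip flags this sketch does not understand
        else:
            url = token
            i += 1
    return url, headers

parsed_url, parsed_headers = parse_curl(curl_command)
print(parsed_url)
print(parsed_headers)
```

The generated code in the next cell is the same idea carried further: the cookies, headers and query parameters of the original request end up in plain Python dictionaries and tuples.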

In [2]:
# This is the code snippet curl.trillworks.com generated for me
import requests

cookies = {
    'cookie_bar': 'enabled',
    '_ga': 'GA1.2.270795426.1604223546',
    '_gid': 'GA1.2.1464611313.1604223546',
    'XSRF-TOKEN': 'eyJpdiI6ImFDdE11RUFSZWEwa0QrN3VJRVJhbFE9PSIsInZhbHVlIjoibFhGNENRK3RPeVhRUW5VS3ZGYkhyREJTU29kVEQzMVhIeVQzOWo1dTNscUd2RkQxN0xURUZJcDBRblVCdHRQMUNVbXFDQXBmbXk3ZVdSR1A0SlBkWGc9PSIsIm1hYyI6IjI5YTcyZjJlYzk4YmZmOGZmYTFlNTQxMWQ4ZGVmM2ZjMDVhYjMwOWU4MzhkNjI5MjNjYzAzMTBlNTFhYjA5ZjUifQ%3D%3D',
    'budhu_session': 'eyJpdiI6IlhJTHEraE5jYmJ0Z2lLXC9zeVk1VmRBPT0iLCJ2YWx1ZSI6IjFBT0FyQmhDaGc0UlwvM0Z5NDBYd1pOQzIxNlpHcGRqbGFGQ3NPOXI1NlZlaCtKWHZ0c3Z5UENkb0RxK1N5WkVpcHhBV1JxYUsybFU5aXRjampVU3FJUT09IiwibWFjIjoiNTQ3OWJlZTQzNjY3MzAwZmFlYzJiN2FlNTI4MTA5YjAyOWYxZWQ2ZDdmNmQ5MTkwNWYxMTEwNmM2YTc1Mjc5YSJ9',
}

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.bud.hu/indulo_jaratok',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

params = (
    ('mode', 'list'),
    ('lang', 'hun'),
    ('dir', '0'),
    ('flightdate_custom_from_date', 'today'),
    ('flightdate_custom_from_time', '09:30'),
)

response = requests.get('https://www.bud.hu/api/ajaxFlights/', headers=headers, params=params, cookies=cookies)
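
To see roughly what happens to the `params` tuple, the following sketch encodes the same pairs with the standard library. The resulting query string is what gets appended to the endpoint URL. It illustrates the encoding only; `requests` performs an equivalent step internally when it receives `params`.

```python
from urllib.parse import urlencode

# Same key/value pairs as in the generated snippet.
params = (
    ("mode", "list"),
    ("lang", "hun"),
    ("dir", "0"),
    ("flightdate_custom_from_date", "today"),
    ("flightdate_custom_from_time", "09:30"),
)

query_string = urlencode(params)   # special characters such as ':' are percent-encoded
full_url = "https://www.bud.hu/api/ajaxFlights/?" + query_string
print(full_url)
```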

Always check the status code!

In [3]:
response.status_code

200

Remember, 200 is great: it means success. It is often the case that you do not need to include the cookies in the request. Whether you keep them is up to you.

In [4]:
response = requests.get('https://www.bud.hu/api/ajaxFlights/', 
                        headers=headers, 
                        params=params)  # deleted cookies from here
response.status_code

200
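
Codes other than 200 are worth recognising too. As a small sketch, the helper below (`describe_status`, a name made up for this example) uses the standard library's `HTTPStatus` to translate a numeric code into its official phrase and a rough category.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif 500 <= code < 600:
        outcome = "server error"
    else:
        outcome = "other"
    return f"{code} {status.phrase} ({outcome})"

for code in (200, 403, 404, 500):
    print(describe_status(code))
```

With `requests`, calling `response.raise_for_status()` is a convenient alternative: it raises an exception for 4xx and 5xx responses, so a failed request cannot go unnoticed.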

Now, as the response is JSON, we don't need to parse it with `BeautifulSoup`; we can simply convert it to a Python object. If you are not familiar with the JSON format, just think of it as a Python dictionary or a list of dictionaries.

In [5]:
data = response.json() # interpreting it as JSON
type(data) # result object is a list this time

list

As there is no documentation on the format in which the data arrive, we need to uncover the pattern ourselves. But relax, it is usually not very hard. First, have a look at the first item of the list.

In [6]:
data[0] # First item of the list, a dictionary

{'airline': 'Egyptair',
 'airline_fullname': 'Egyptair',
 'airline_id': 'MS',
 'airline_website': 'http://www.egyptair.com',
 'airport_via_iata': '',
 'airport_via_name': '',
 'custom001': 'webbookingsupport-hu@egyptair.com',
 'custom002': 'https://facebook.com/EGYPTAIR',
 'custom003': 'https://twitter.com/EGYPTAIR',
 'custom004': 'http://www.egyptair.com/English/Pages/BaggageAllow',
 'custom005': '1',
 'direction': 0,
 'honnan': 'Hurghada',
 'jarat': 'MS',
 'jaratszam': '752',
 'masterautoid': 3120720,
 'megjegyzes': 'Várható',
 'megjegyzes_eng': 'Scheduled',
 'sst': '00',
 'suffix': '',
 'szinezes': 0,
 'terminal': '2B',
 'tervezett_datum': '20201128',
 'tervezett_ido': '14:55',
 'uniqueautoid': 'M3120720',
 'varhato_datum': '',
 'varhato_ido': ''}

This will probably be a list of dictionaries, each item containing pieces of information on one specific departing flight. Hurray!!
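
A quick way to confirm that guess is to compare the keys of the records and to see which fields tend to be empty. The sketch below does this on two hand-shortened records (field names and values come from the outputs in this notebook); on the real data you would pass `data` instead of `sample`.

```python
# Two shortened records; field names and values come from the outputs in this notebook.
sample = [
    {"jarat": "MS", "jaratszam": "752", "honnan": "Hurghada", "varhato_ido": ""},
    {"jarat": "EK", "jaratszam": "112", "honnan": "Dubai DXB", "varhato_ido": "15:00"},
]

# Do all records expose the same keys?
key_sets = {frozenset(record) for record in sample}
same_schema = len(key_sets) == 1

# Which fields are empty in at least one record?
sometimes_empty = sorted(
    key for key in sample[0] if any(record.get(key) == "" for record in sample)
)

print(same_schema, sometimes_empty)
```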

Now, let's use pandas to export the dataset to a table-like format!

In [7]:
import pandas as pd

df = pd.DataFrame(data)
df.head()

Unnamed: 0,jarat,jaratszam,honnan,tervezett_datum,tervezett_ido,varhato_datum,varhato_ido,terminal,megjegyzes,megjegyzes_eng,airline,airline_fullname,airline_website,sst,masterautoid,szinezes,direction,uniqueautoid,suffix,airport_via_name,airport_via_iata,airline_id,custom001,custom002,custom003,custom004,custom005
0,MS,752,Hurghada,20201128,14:55,,,2B,Várható,Scheduled,Egyptair,Egyptair,http://www.egyptair.com,00,3120720,0,0,M3120720,,,,MS,webbookingsupport-hu@egyptair.com,https://facebook.com/EGYPTAIR,https://twitter.com/EGYPTAIR,http://www.egyptair.com/English/Pages/BaggageA...,1
1,QF,8112,Dubai DXB,20201128,15:10,20201128.0,15:00,2B,Várható,Expected,Qantas,Qantas,,SL,3120718,0,0,C1056560,,,,QF,,,,,1
2,EK,112,Dubai DXB,20201128,15:10,20201128.0,15:00,2B,Várható,Expected,Emirates,Emirates,http://www.emirates.com/,01,3120718,0,0,M3120718,,,,EK,,https://facebook.com/Emirates,https://twitter.com/emirates,http://www.emirates.com/english/plan_book/esse...,1
3,ET,1554,Frankfurt,20201128,15:15,,,2B,Várható,Scheduled,Ethiopian Airlines,Ethiopian Airlines Enterprise,http://www.flyethiopian.com/,SL,3120722,0,0,C1056568,,,,ET,CustomerRelations@ethiopianairlines.com,https://facebook.com/Ethiopianairlines,https://twitter.com/flyethiopian,http://www.ethiopianairlines.com/en/travel/bag...,1
4,LH,1339,Frankfurt,20201128,15:15,,,2B,Várható,Scheduled,Lufthansa,Lufthansa,http://www.lufthansa.com,01,3120722,0,0,M3120722,,,,LH,,https://facebook.com/lufthansa,https://twitter.com/lufthansa,http://www.lufthansa.com/hu/en/Baggage-overview,1
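
The date and time columns arrive as plain strings (`tervezett_datum` as `YYYYMMDD`, `tervezett_ido` as `HH:MM`). As a sketch of a typical clean-up step, the hypothetical helper `parse_schedule` below combines them into a proper `datetime`; it assumes both inputs are strings and treats empty values as missing.

```python
from datetime import datetime

def parse_schedule(date_str, time_str):
    """Combine 'YYYYMMDD' and 'HH:MM' strings into a datetime; return None if either is empty."""
    if not date_str or not time_str:
        return None
    return datetime.strptime(f"{date_str} {time_str}", "%Y%m%d %H:%M")

scheduled = parse_schedule("20201128", "15:10")   # tervezett_datum, tervezett_ido
expected = parse_schedule("20201128", "15:00")    # varhato_datum, varhato_ido
print(scheduled, expected, scheduled - expected)
```

Once the values are real datetimes, differences such as "expected ten minutes earlier than scheduled" become simple subtractions.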


Now that we have it, quickly write it out to disk!

In [8]:
# `path` already ends with a slash, so the file name is appended directly
df.to_csv(path + 'bud_flights.csv')
df.to_excel(path + 'bud_flights.xlsx')
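
If pandas is not available, a list of dictionaries can also be written out with the standard `csv` module. The sketch below uses two hand-shortened records and writes to an in-memory buffer that stands in for an open file.

```python
import csv
import io

# Shortened records; field names and values mirror the outputs in this notebook.
rows = [
    {"jarat": "MS", "jaratszam": "752", "honnan": "Hurghada"},
    {"jarat": "EK", "jaratszam": "112", "honnan": "Dubai DXB"},
]

buffer = io.StringIO()           # stands in for an open file
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```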

## Lab #1: Let's hack the system!
![hackerman](https://wompampsupport.azureedge.net/fetchimage?siteId=7575&v=2&jpgQuality=100&width=700&url=https%3A%2F%2Fi.kym-cdn.com%2Fentries%2Ficons%2Ffacebook%2F000%2F021%2F807%2Fig9OoyenpxqdCQyABmOQBZDI0duHk2QZZmWg2Hxd4ro.jpg)

 Change the parameters so that:
 
 - Instead of today, it will return flights from the day before (that is, yesterday). 
 - Instead of departing flights, it will return the arrivals.
 - Instead of showing flights after 09:30, it will return all the flights that day.
 
__Warning #3:__ Note that every website has a different API and hence different parameters. What we are doing is specific to [bud.hu](https://www.bud.hu/). When scraping another website, you need to uncover the parameter space and find the possibilities you have.

In [11]:
custom_params = (
    ('mode', 'list'),
    ('lang', 'hun'),
    ('dir', '1'),
    ('flightdate_custom_from_date', 'yesterday'),
    ('flightdate_custom_from_time', '00:00'),
)

response = requests.get('https://www.bud.hu/api/ajaxFlights/', headers=headers, params=custom_params)

In [12]:
response.status_code

200
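
Once the meaning of the parameters is known, they can be wrapped in a small helper so that each variation is a single call. `build_params` is a made-up name for this sketch, and the accepted values are inferred from the requests in this notebook, not from any official documentation.

```python
def build_params(direction="0", date="today", time="09:30", lang="hun"):
    """Return the query parameters for the bud.hu flights endpoint as a tuple of pairs.

    The accepted values are inferred from the requests in this notebook, not from documentation.
    """
    return (
        ("mode", "list"),
        ("lang", lang),
        ("dir", direction),                       # '0' = departures, '1' = arrivals
        ("flightdate_custom_from_date", date),
        ("flightdate_custom_from_time", time),
    )

departures_today = build_params()
arrivals_yesterday = build_params(direction="1", date="yesterday", time="00:00")
print(arrivals_yesterday)
```

The returned tuple can be passed directly as the `params` argument of `requests.get`.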

In [13]:
response.json()

[{'airline': 'LOT',
  'airline_fullname': 'LOT Polish Airlines',
  'airline_id': 'LO',
  'airline_website': 'http://www.lot.com',
  'airport_via_iata': '',
  'airport_via_name': '',
  'custom001': 'ticketing@lot.hu',
  'custom002': 'https://facebook.com/PllLOT',
  'custom003': 'https://twitter.com/PolishAirlines',
  'custom004': 'http://www.lot.com/hu/en/baggage',
  'custom005': '0',
  'direction': 1,
  'honnan': 'Warsaw',
  'jarat': 'LO',
  'jaratszam': '535',
  'masterautoid': 3120612,
  'megjegyzes': 'Leszállt',
  'megjegyzes_eng': 'Landed',
  'sst': '00',
  'suffix': '',
  'szinezes': 0,
  'terminal': '2B',
  'tervezett_datum': '20201127',
  'tervezett_ido': '11:20',
  'uniqueautoid': 'M3120612',
  'varhato_datum': '20201127',
  'varhato_ido': '11:08'},
 {'airline': 'Ryanair',
  'airline_fullname': 'Ryanair',
  'airline_id': 'FR',
  'airline_website': 'http://www.ryanair.com',
  'airport_via_iata': '',
  'airport_via_name': '',
  'custom001': '',
  'custom002': '',
  'custom003': '

## Lab #2:  More flying with Wizz

- Go to the [fare finder](https://wizzair.com/en-gb/flights/fare-finder#/) page of Wizz Air.
- Pick an origin and a destination (make sure you choose a route they actually operate). Budapest/London surely works.
- Get the dates and prices for a given month.
- Save the data as a csv called `wizz_data.csv`. You can save it to multiple files if that is more comfortable.
- Start messing with the input parameters to find out their meanings.

__Extra:__ Functionise it!

In [None]:
# Your solution goes here

## Lab #3: Vote counting

- Go to [this](https://www.valasztas.hu/ogy2018) page, which contains data on the 2018 parliamentary elections in Hungary. Wait for the regional map to load; it takes a few seconds.
- Then scrape all the data for a given sub-region (e.g _Veszprém megye 3. számú OEVK (székhely: Tapolca)_).
- Save it as a CSV or Excel file. The name should be `votes.xxx`.

__Extra:__ Iterate over every single sub-region to collect the data for the whole country. This way you get the whole dataset of the election in just a couple of lines of code. Cool, huh?
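
The iteration itself can be sketched independently of the actual endpoint, which you still have to uncover. In the example below, `collect_all` loops over a list of identifiers and calls a `fetch` function for each one; `fake_fetch`, the identifiers and the vote counts are all made up and merely stand in for a real request.

```python
import time

def collect_all(region_ids, fetch, delay_seconds=1.0):
    """Call fetch(region_id) for every id and gather the returned rows in one list."""
    all_rows = []
    for region_id in region_ids:
        rows = fetch(region_id)              # expected to return a list of dicts
        for row in rows:
            all_rows.append({"region_id": region_id, **row})
        time.sleep(delay_seconds)            # be polite: pause between requests
    return all_rows

# Stand-in for a real request; the ids and vote counts below are made up.
def fake_fetch(region_id):
    return [{"candidate": "A", "votes": 10 * region_id}, {"candidate": "B", "votes": 5}]

results = collect_all([1, 2], fake_fetch, delay_seconds=0)
print(len(results), results[0])
```

Passing the fetching function as an argument keeps the loop testable without network access; the resulting list of dictionaries can be handed to `pd.DataFrame` just like the flights data.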