**Web Scraping und Data Mining in Python**

# URL und WebAPI

Jan Riebling, *Universität Wuppertal*

# URL

## Allgemein

* Uniform Ressource Locator.
* Gibt die Adresse einer spezifischen Ressource in einem Netzwerk an.
* Kann benutzt werden um Anfragen an Server zu stellen.
* Allgemeine Form:  
```
scheme://domain:port/path?query_string#fragment_id
```

## HyperText Transfer Protocol

HTTP und HTTPS spezifizieren den Datenaustausch zwischen Client (meistens der Browser) und einem Server über das TCP/IP System.

Technische Spezifikation: [RFC2616](https://datatracker.ietf.org/doc/html/rfc2616.html).

## `urllib`

Python Standardbibliothek zum Umgang mit URLs. Für diesen Workshop sind insbesondere die Funktionen in `urllib.request` von Bedeutung, da sich hiermit Anfragen an Server stellen lassen.

## Beispiel

`urllib.request.urlopen` nimmt als Argument eine URL, schickt einen GET Request an den Server und gibt dessen Antwort zurück.

In [20]:
from urllib.request import urlopen

url = 'https://www.uni-mainz.de/'

response = urlopen(url)

html = response.read()

In [23]:
html.decode('utf8')

'<!DOCTYPE html>\r\n<html dir="ltr" lang="de">\r\n   <head>\r\n      <meta charset="utf-8">\r\n      <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n      <meta name="viewport" content="width=device-width, initial-scale=1.0">      <meta name="author" content="Johannes Gutenberg-Universität Mainz">\r\n      <meta name="description" content="Die Johannes Gutenberg-Universität Mainz zählt mit rund 33.000 Studierenden aus über 130 Nationen zu den zehn größten Universitäten Deutschlands. Als einzige Volluniversität in Rheinland-Pfalz vereint sie nahezu alle akademischen Disziplinen.">\r\n      <meta name="page_id" content="2">\r\n      <meta name="copyright" content="Johannes Gutenberg-Universität Mainz">\r\n      <meta name="Linktitle" content="">\r\n      <meta name="keywords" content="">\r\n      <meta name="robots" content="index, follow, noarchive">\r\n      <meta name="generator" content="">\r\n      <title>\r\n         Willkommen an der JGU!\r\n      </title>\r\n      <link 

# Application Programming Interface

## API

* Schnittstelle zur Interaktion zwischen Programmen oder Programmen und Servern.
* WebAPIs:
    * Abruf/Veränderung von Ressourcen (GET vs. POST).
    * Client-Server Beziehung.
    * Vermittelt durch HTTP.
    * Daten meist in Form der Webstandards für Dokumente (z.B. JSON oder XML).
* Für viele APIs ist eine vorherige Anmeldung notwendig. Hier sind unbedingt die ToS zu beachten!
* API Verzeichnis: https://www.programmableweb.com/

##  Funktionsweise

* Aufbau einer HTTP Anfrage entsprechend dem URL Schema.
    * Konkrete Anfrage geschieht über `?query_string`.
    * Übersetzung von Programmcode in die spezifische Anfrage
* Änderung des Zustands des Servers oder Abrufen von Daten
* Übersetzung der Daten in Elemente der Programmumgebung


# Beispiel: World Bank Indicators 

* Offen zugängliche Datensätze der Weltbank.
* Dokumentation: [http://data.worldbank.org/developers/api-overview](http://data.worldbank.org/developers/api-overview).
* Python Module:
    * `py-worldbank`.
    * `wbpy`.
    * `pandas_datareader`.

In [28]:
url = '''http://api.worldbank.org/v2/country/all/indicator/AG.LND.ARBL.ZS?date=2000'''
url2 = 'http://api.worldbank.org/v2/country/deu;usa;fra/indicator/SP.POP.TOTL?date=2000:2010&format=json'


data = urlopen(url2).read()

In [30]:
import json

json.loads(data)

[{'page': 1,
  'pages': 1,
  'per_page': 50,
  'total': 33,
  'sourceid': '2',
  'sourcename': 'World Development Indicators',
  'lastupdated': '2022-02-15'},
 [{'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'DE', 'value': 'Germany'},
   'countryiso3code': 'DEU',
   'date': '2010',
   'value': 81776930,
   'unit': '',
   'obs_status': '',
   'decimal': 0},
  {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'DE', 'value': 'Germany'},
   'countryiso3code': 'DEU',
   'date': '2009',
   'value': 81902307,
   'unit': '',
   'obs_status': '',
   'decimal': 0},
  {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'DE', 'value': 'Germany'},
   'countryiso3code': 'DEU',
   'date': '2008',
   'value': 82110097,
   'unit': '',
   'obs_status': '',
   'decimal': 0},
  {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'DE', 'value': 'Germany'},
 

## API Wrappers

Zwar ist es immer möglich eine direkte Anfrage an den Server zu schicken (z.B. mittels Pythons `urllib` Bibliothek), doch in vielen Fällen bietet sich die einfachere Variante an einen bereits bestehenden "Wrapper" zu benutzen. Dabei handelt es sich um Python-Code, der die Interaktion mit der API übernimmt. Um einen solchen Wrapper zu finden ist es meistens ausreichend eine Suchanfrage der Art:

```
<Name der Anwendung> python api
```

zu stellen.

## Beispiel: World Bank Indicators Wrapper

In [31]:
import pandas as pd
from pandas_datareader import wb

## DataFrame of all indicators:
ind_df = wb.get_indicators()

In [37]:
ind_df.describe()

Unnamed: 0,id,name,unit,source,sourceNote,sourceOrganization,topics
count,20127,20127,20127.0,20127,20127.0,20127,20127.0
unique,20124,19983,1.0,57,6186.0,518,121.0
top,GFDD.DM.16,Administrative Data for Labor Admin (ILO),,Education Statistics,,b'',
freq,2,6,20127.0,4269,7653.0,7477,14488.0


## Viewing DataFrames

A representation (view) of the DataFrame can be called by returning the object or through printing. For bigger datasets this view will be shortened in order to prevent memory hangups. The first and last rows can be viewed through `.head()` and `.tail()`.

All parts of the DataFrame are represented as attributes and can be accessed as such:

* `df.columns`
* `df.index`
* `df.values`

In [3]:
ind_df.head()

Unnamed: 0,id,name,unit,source,sourceNote,sourceOrganization,topics
0,1.0.HCount.1.90usd,Poverty Headcount ($1.90 a day),,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of SEDLAC (CEDLAS...,Poverty
1,1.0.HCount.2.5usd,Poverty Headcount ($2.50 a day),,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of SEDLAC (CEDLAS...,Poverty
2,1.0.HCount.Mid10to50,Middle Class ($10-50 a day) Headcount,,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of SEDLAC (CEDLAS...,Poverty
3,1.0.HCount.Ofcl,Official Moderate Poverty Rate-National,,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of data from Nati...,Poverty
4,1.0.HCount.Poor4uds,Poverty Headcount ($4 a day),,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of SEDLAC (CEDLAS...,Poverty


## Durchsuchen 

Pandas `.str.` Notation kann genutzt werden um nach Variablen Namen zu suchen.

In [72]:
ind_df[ind_df['topics'].str.contains('Poverty')]

Unnamed: 0,id,name,unit,source,sourceNote,sourceOrganization,topics
0,1.0.HCount.1.90usd,Poverty Headcount ($1.90 a day),,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of SEDLAC (CEDLAS...,Poverty
1,1.0.HCount.2.5usd,Poverty Headcount ($2.50 a day),,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of SEDLAC (CEDLAS...,Poverty
2,1.0.HCount.Mid10to50,Middle Class ($10-50 a day) Headcount,,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of SEDLAC (CEDLAS...,Poverty
3,1.0.HCount.Ofcl,Official Moderate Poverty Rate-National,,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of data from Nati...,Poverty
4,1.0.HCount.Poor4uds,Poverty Headcount ($4 a day),,LAC Equity Lab,The poverty headcount index measures the propo...,b'LAC Equity Lab tabulations of SEDLAC (CEDLAS...,Poverty
...,...,...,...,...,...,...,...
13235,SI.SPR.PC40.05,"Survey mean consumption or income per capita, ...",,WDI Database Archives,Mean consumption or income per capita (2005 PP...,"b'World Bank, Global Database of Shared Prospe...",Poverty
13236,SI.SPR.PC40.ZG,Annualized average growth rate in per capita r...,,World Development Indicators,The growth rate in the welfare aggregate of th...,"b'World Bank, Global Database of Shared Prospe...",Poverty
13237,SI.SPR.PCAP,"Survey mean consumption or income per capita, ...",,World Development Indicators,Mean consumption or income per capita (2011 PP...,"b'World Bank, Global Database of Shared Prospe...",Poverty
13238,SI.SPR.PCAP.05,"Survey mean consumption or income per capita, ...",,WDI Database Archives,Mean consumption or income per capita (2005 PP...,"b'World Bank, Global Database of Shared Prospe...",Poverty


## Daten herunterladen

In [70]:
## Select some indicators
ind = ['NY.GDP.PCAP.KD', 
       'IT.MOB.COV.ZS', 
       'SP.RUR.TOTL.ZS']

df = wb.download(indicator=list(indic),
                 country=['DEU', 'FRA', 'USA'],
                 start=2001,
                 end=2011)



 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.POV.2DAY
 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.POV.GAP2
 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.POV.NAGP
 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.POV.RUGP
 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.POV.RUHC
 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.POV.URGP
 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.POV.URHC
 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.SPR.PC40.05
 Invalid format 
  The indicator was not found. It may have been deleted or archived.. Indicator: SI.SPR.PCAP.05


In [73]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,EN.POP.SLUM.UR.ZS,SI.DST.02ND.20,SI.DST.03RD.20,SI.DST.04TH.20,SI.DST.05TH.20,SI.DST.10TH.10,SI.DST.50MD,SI.DST.FRST.10,SI.DST.FRST.20,SI.POV.DDAY,...,SI.POV.MDIM.IT,SI.POV.MDIM.MA,SI.POV.MDIM.XQ,SI.POV.NAHC,SI.POV.UMIC,SI.POV.UMIC.GP,SI.SPR.PC40,SI.SPR.PC40.ZG,SI.SPR.PCAP,SI.SPR.PCAP.ZG
country,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Germany,2011,,13.0,17.0,22.4,39.1,24.3,8.7,3.5,8.5,0.0,...,,18.5,,16.1,0.2,0.0,27.89,,51.95,
Germany,2010,,13.1,17.1,22.6,38.8,24.0,9.0,3.4,8.4,0.0,...,,18.6,,15.8,0.2,0.1,,,,
Germany,2009,,13.1,17.0,22.6,39.0,24.1,9.2,3.3,8.3,0.0,...,,,,15.6,0.2,0.1,,,,
Germany,2008,,12.9,16.8,22.3,39.6,24.8,8.7,3.3,8.4,0.2,...,,,,15.5,0.2,0.2,,,,
Germany,2007,,12.9,16.8,22.1,39.9,25.2,8.5,3.4,8.3,0.0,...,,,,15.2,0.2,0.0,,,,
Germany,2006,,12.8,16.7,22.3,39.8,24.9,9.0,3.4,8.3,0.0,...,,,,15.2,0.0,0.0,,,,
Germany,2005,,12.6,16.7,22.1,40.5,25.5,8.7,3.2,8.1,0.0,...,,,,12.5,0.2,0.1,,,,
Germany,2004,,13.1,16.9,22.4,39.1,24.2,8.5,3.4,8.5,0.0,...,,,,12.2,0.2,0.1,,,,
Germany,2003,,13.3,17.1,22.3,38.8,24.0,8.0,3.4,8.6,0.0,...,,,,,0.2,0.2,,,,
Germany,2002,,13.1,17.1,22.6,38.6,23.7,8.0,3.4,8.6,0.0,...,,,,,0.2,0.1,,,,


In [77]:
## Indikatorinformationen zurückspielen.
## .isin() prüft Elemente in einer Liste.

df.columns = list(ind_df[ind_df['id'].isin(df.columns)]['name'])

#  Caveat: MultiIndex

In addition to selection using columns and rows a DataFrame with a MultiIndex provides different levels with which to select from the Index.

In [78]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Population living in slums (% of urban population),Income share held by second 20%,Income share held by third 20%,Income share held by fourth 20%,Income share held by highest 20%,Income share held by highest 10%,Proportion of people living below 50 percent of median income (%),Income share held by lowest 10%,Income share held by lowest 20%,Poverty headcount ratio at $1.90 a day (2011 PPP) (% of population),...,Multidimensional poverty intensity (average share of deprivations experienced by the poor),"Multidimensional poverty headcount ratio, male (% of male population)",Multidimensional poverty index (scale 0-1),Poverty headcount ratio at national poverty lines (% of population),Poverty headcount ratio at $5.50 a day (2011 PPP) (% of population),Poverty gap at $5.50 a day (2011 PPP) (%),"Survey mean consumption or income per capita, bottom 40% of population (2011 PPP $ per day)","Annualized average growth rate in per capita real survey mean consumption or income, bottom 40% of population (%)","Survey mean consumption or income per capita, total population (2011 PPP $ per day)","Annualized average growth rate in per capita real survey mean consumption or income, total population (%)"
country,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
Germany,2011,,13.0,17.0,22.4,39.1,24.3,8.7,3.5,8.5,0.0,...,,18.5,,16.1,0.2,0.0,27.89,,51.95,
Germany,2010,,13.1,17.1,22.6,38.8,24.0,9.0,3.4,8.4,0.0,...,,18.6,,15.8,0.2,0.1,,,,
Germany,2009,,13.1,17.0,22.6,39.0,24.1,9.2,3.3,8.3,0.0,...,,,,15.6,0.2,0.1,,,,
Germany,2008,,12.9,16.8,22.3,39.6,24.8,8.7,3.3,8.4,0.2,...,,,,15.5,0.2,0.2,,,,
Germany,2007,,12.9,16.8,22.1,39.9,25.2,8.5,3.4,8.3,0.0,...,,,,15.2,0.2,0.0,,,,
Germany,2006,,12.8,16.7,22.3,39.8,24.9,9.0,3.4,8.3,0.0,...,,,,15.2,0.0,0.0,,,,
Germany,2005,,12.6,16.7,22.1,40.5,25.5,8.7,3.2,8.1,0.0,...,,,,12.5,0.2,0.1,,,,
Germany,2004,,13.1,16.9,22.4,39.1,24.2,8.5,3.4,8.5,0.0,...,,,,12.2,0.2,0.1,,,,
Germany,2003,,13.3,17.1,22.3,38.8,24.0,8.0,3.4,8.6,0.0,...,,,,,0.2,0.2,,,,
Germany,2002,,13.1,17.1,22.6,38.6,23.7,8.0,3.4,8.6,0.0,...,,,,,0.2,0.1,,,,


In [51]:
## Selecting specific levels via .xs
## Only the year 2011:
df.xs('European Union', level='country')

Unnamed: 0_level_0,NY.GDP.PCAP.KD,IT.MOB.COV.ZS,SP.RUR.TOTL.ZS
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2011,29796.75618,,26.831942
2010,29201.246733,,27.027674
2009,28598.670095,,27.236589
2008,29969.43055,,27.444969
2007,29874.16604,,27.663601
2006,29057.95463,,27.883351
2005,28168.354283,,28.100069
2004,27734.740655,,28.318894
2003,27135.655402,,28.537252
2002,26987.28839,,28.759617


In [51]:
## Most methods take an aditional 
## level argument.

df['GDP_per_capita']

country                         year
Arab World                      2011     5989.899176
                                2010     5917.245724
                                2009     5784.690614
                                2008     5898.885250
                                2007     5713.955066
                                2006     5599.102508
                                2005     5382.993750
                                2004     5208.602873
                                2003     4870.031225
                                2002     4724.645810
                                2001     4795.152767
Caribbean small states          2011     9075.822740
                                2010     9039.288633
                                2009     8970.525520
                                2008     9355.413993
                                2007     9295.101868
                                2006     9069.133411
                                2005     8558.331571
         

## Weitere Informationen

Im Bezug auf normale [Indizes](http://pandas.pydata.org/pandas-docs/stable/indexing.html) und für [MultiIndizes](http://pandas.pydata.org/pandas-docs/stable/advanced.html) finden sich in der Pandas Dokumentation.

# Weitere API Wrapper

## Einfach googeln

"Name der Plattform + Python + API" reicht meistens aus.

Viele APIs erfordern heute eine Registrierung und eine Authentifizierung.