**Web Scraping and Data Mining in Python**

# URL and Web API

Jan Riebling, *Universität Wuppertal*

# URL

## General

* Uniform Resource Locator.
* Specifies the address of a specific resource in a network.
* Can be used to send requests to servers.
* General form:  
```
scheme://domain:port/path?query_string#fragment_id
```
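
As a small illustration of the general form, the standard library can split a URL into these components. The URL below is hypothetical and chosen only because it uses every part of the scheme; it is a sketch, not part of the original workshop material.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL that uses every component of the general form.
url = "https://www.example.org:8080/data/report?year=2010&format=json#summary"

parts = urlsplit(url)

print(parts.scheme)    # scheme
print(parts.hostname)  # domain
print(parts.port)      # port
print(parts.path)      # path
print(parts.query)     # query_string
print(parts.fragment)  # fragment_id

# The query string can be decoded further into a dictionary of lists.
query = parse_qs(parts.query)
print(query)
```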

## HyperText Transfer Protocol

HTTP and HTTPS specify how data is exchanged between a client (usually the browser) and a server on top of TCP/IP.

Technical specification: [RFC2616](https://datatracker.ietf.org/doc/html/rfc2616.html) (since superseded by RFC 9110).

## `urllib`

Python standard library package for working with URLs. For this workshop, the functions in `urllib.request` are particularly relevant because they can be used to send requests to servers.

## Example

`urllib.request.urlopen` takes a URL as its argument, sends a GET request to the server, and returns the server's response.

In [1]:
from urllib.request import urlopen

url = 'https://www.uni-mainz.de/'

response = urlopen(url)

In [2]:
response

<http.client.HTTPResponse at 0x7fa8132226e0>
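
The response object is only a handle; the body still has to be read and decoded. A minimal sketch of one way to do this follows. The helper name `fetch_text`, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Send a GET request and return the decoded response body."""
    # The context manager closes the connection when the block ends.
    with urlopen(url, timeout=timeout) as response:
        # Use the charset announced in the Content-Type header,
        # falling back to UTF-8 (an assumption) if none is given.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

With network access, `fetch_text('https://www.uni-mainz.de/')` would return the HTML of the page requested above as a string.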

# Application Programming Interface

## API

* Interface for interaction between programs, or between programs and servers.
* Web APIs:
    * Retrieving/modifying resources (GET vs. POST).
    * Client-server relationship.
    * Mediated by HTTP.
    * Data usually comes in standard web document formats (e.g. JSON or XML).
* Many APIs require prior registration. Be sure to observe the terms of service (ToS)!
* API directory: https://www.programmableweb.com/
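
The difference between GET and POST can be seen without contacting any server. In the sketch below, `urllib.request.Request` only prepares the requests; the endpoint is hypothetical and nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, used only for illustration.
endpoint = "https://api.example.org/items"

# Without a body, urllib prepares a GET request (retrieve a resource).
get_request = Request(endpoint + "?" + urlencode({"id": 42}))

# With a body, urllib prepares a POST request (create or modify a resource).
body = urlencode({"name": "new item"}).encode("utf-8")
post_request = Request(endpoint, data=body)

print(get_request.get_method())
print(post_request.get_method())
```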

## How it works

* Construct an HTTP request according to the URL scheme.
    * The concrete request is expressed through `?query_string`.
    * Program code is translated into the specific request.
* The server's state is changed or data is retrieved.
* The returned data is translated into elements of the programming environment.
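
The first step, turning program values into a request, might look like the following sketch. It assembles the World Bank URL used in the next example; the function name `build_indicator_url` and its parameters are hypothetical.

```python
from urllib.parse import urlencode


def build_indicator_url(countries, indicator, start, end):
    """Translate Python values into a World Bank API request URL."""
    base = "http://api.worldbank.org/v2"
    path = "/country/{}/indicator/{}".format(";".join(countries), indicator)
    # safe=":" keeps the colon of the date range unescaped.
    query = urlencode({"date": f"{start}:{end}", "format": "json"}, safe=":")
    return f"{base}{path}?{query}"


url = build_indicator_url(["deu", "usa", "fra"], "SP.POP.TOTL", 2000, 2010)
print(url)
```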


# Example: World Bank Indicators

* Openly accessible datasets of the World Bank.
* Documentation: [http://data.worldbank.org/developers/api-overview](http://data.worldbank.org/developers/api-overview).
* Python modules:
    * `py-worldbank`.
    * `wbpy`.
    * `pandas_datareader`.

In [41]:
jsondata = urlopen('http://api.worldbank.org/v2/country/deu;usa;fra/indicator/SP.POP.TOTL?date=2000:2010&format=json').read()

In [46]:
import json

json.loads(jsondata)

[{'page': 1,
  'pages': 1,
  'per_page': 50,
  'total': 33,
  'sourceid': '2',
  'sourcename': 'World Development Indicators',
  'lastupdated': '2022-02-15'},
 [{'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'DE', 'value': 'Germany'},
   'countryiso3code': 'DEU',
   'date': '2010',
   'value': 81776930,
   'unit': '',
   'obs_status': '',
   'decimal': 0},
  {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'DE', 'value': 'Germany'},
   'countryiso3code': 'DEU',
   'date': '2009',
   'value': 81902307,
   'unit': '',
   'obs_status': '',
   'decimal': 0},
  {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'DE', 'value': 'Germany'},
   'countryiso3code': 'DEU',
   'date': '2008',
   'value': 82110097,
   'unit': '',
   'obs_status': '',
   'decimal': 0},
  {'indicator': {'id': 'SP.POP.TOTL', 'value': 'Population, total'},
   'country': {'id': 'DE', 'value': 'Germany'},
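
The last step, translating the returned data into elements of the programming environment, can be sketched offline. The sample string below is a shortened copy of the first two records shown above; the choice of fields to keep is an assumption made for illustration.

```python
import json

# Shortened sample in the shape of the response above (first two records).
sample = '''
[{"page": 1, "pages": 1, "per_page": 50, "total": 33},
 [{"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
   "country": {"id": "DE", "value": "Germany"},
   "countryiso3code": "DEU", "date": "2010", "value": 81776930},
  {"indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
   "country": {"id": "DE", "value": "Germany"},
   "countryiso3code": "DEU", "date": "2009", "value": 81902307}]]
'''

# The response is a two-element list: paging metadata and the records.
metadata, records = json.loads(sample)

# Keep only country, year and value of each record.
rows = [(r["country"]["value"], int(r["date"]), r["value"]) for r in records]
print(rows)
```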
 

## API Wrappers

It is always possible to send a request directly to the server (e.g. using Python's `urllib` library), but in many cases it is easier to use an existing "wrapper". A wrapper is Python code that handles the interaction with the API. To find one, a search query of the following form is usually sufficient:

```
<name of the application> python api
```

## Example: World Bank Indicators Wrapper

In [1]:
import pandas as pd
from pandas_datareader import wb

## DataFrame of all indicators:
ind_df = wb.get_indicators()

In [2]:
ind_df.columns

Index(['id', 'name', 'unit', 'source', 'sourceNote', 'sourceOrganization',
       'topics'],
      dtype='object')

## Viewing DataFrames

A representation (view) of the DataFrame is displayed when the object is the last expression of a cell or when it is printed. For larger datasets this view is shortened so that the output does not bog down the session. The first and last rows can be viewed with `.head()` and `.tail()`.

All parts of the DataFrame are represented as attributes and can be accessed as such:

* `df.columns`
* `df.index`
* `df.values`
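
A minimal sketch of these views and attributes, using a small made-up table instead of the indicator data:

```python
import pandas as pd

# Small made-up table, used only to demonstrate the attributes.
df = pd.DataFrame(
    {"country": ["A", "B", "C", "D"], "value": [1.0, 2.0, 3.0, 4.0]}
)

print(df.head(2))   # first two rows
print(df.tail(2))   # last two rows
print(df.columns)   # column labels
print(df.index)     # row labels
print(df.values)    # underlying data as a NumPy array
```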

In [3]:
ind_df.head()

|  | id | name | unit | source | sourceNote | sourceOrganization | topics |
|---|---|---|---|---|---|---|---|
| 0 | 1.0.HCount.1.90usd | Poverty Headcount ($1.90 a day) |  | LAC Equity Lab | The poverty headcount index measures the propo... | b'LAC Equity Lab tabulations of SEDLAC (CEDLAS... | Poverty |
| 1 | 1.0.HCount.2.5usd | Poverty Headcount ($2.50 a day) |  | LAC Equity Lab | The poverty headcount index measures the propo... | b'LAC Equity Lab tabulations of SEDLAC (CEDLAS... | Poverty |
| 2 | 1.0.HCount.Mid10to50 | Middle Class ($10-50 a day) Headcount |  | LAC Equity Lab | The poverty headcount index measures the propo... | b'LAC Equity Lab tabulations of SEDLAC (CEDLAS... | Poverty |
| 3 | 1.0.HCount.Ofcl | Official Moderate Poverty Rate-National |  | LAC Equity Lab | The poverty headcount index measures the propo... | b'LAC Equity Lab tabulations of data from Nati... | Poverty |
| 4 | 1.0.HCount.Poor4uds | Poverty Headcount ($4 a day) |  | LAC Equity Lab | The poverty headcount index measures the propo... | b'LAC Equity Lab tabulations of SEDLAC (CEDLAS... | Poverty |


## Searching

The pandas `.str` accessor can be used to search for variable names.

In [12]:
ind_df[ind_df['name'].str.contains(r'[Aa]gricult')]

|  | id | name | source | sourceNote | sourceOrganization | topics |
|---|---|---|---|---|---|---|
| 119 | 3.01.04.01.agcen | Agricultural census | Statistical Capacity Indicators | Agricultural censuses collect information on a... | b'World Bank: Microdata library. Original sour... |  |
| 973 | AG.AGR.TRAC.NO | Agricultural machinery, tractors | World Development Indicators | Agricultural machinery refers to the number of... | b'Food and Agriculture Organization, electroni... | Agriculture & Rural Development |
| 998 | AG.LND.AGRI.HA | Agricultural land (hectares) | Africa Development Indicators | Agricultural land refers to the share of land ... | b'Food and Agriculture Organization, electroni... |  |
| 999 | AG.LND.AGRI.K2 | Agricultural land (sq. km) | World Development Indicators | Agricultural land refers to the share of land ... | b'Food and Agriculture Organization, electroni... | Agriculture & Rural Development ; Climate Ch... |
| 1000 | AG.LND.AGRI.ZS | Agricultural land (% of land area) | World Development Indicators | Agricultural land refers to the share of land ... | b'Food and Agriculture Organization, electroni... | Agriculture & Rural Development ; Climate Ch... |
| 1019 | AG.LND.IRIG.AG.ZS | Agricultural irrigated land (% of total agricu... | World Development Indicators | Agricultural irrigated land refers to agricult... | b'Food and Agriculture Organization, electroni... | Agriculture & Rural Development ; Climate Ch... |
| 1021 | AG.LND.IRIG.HA.AG | Agricultural area irrigated (ha) | Africa Development Indicators | Agricultural area irrigated, part of the full ... | b'Food and Agriculture Organization, electroni... |  |
| 1036 | AG.LND.TRAC.ZS | Agricultural machinery, tractors per 100 sq. k... | World Development Indicators | Agricultural machinery refers to the number of... | b'Food and Agriculture Organization, electroni... | Agriculture & Rural Development |
| 1038 | AG.PRD.AGRI.XD | Agriculture production index (1999-2001 = 100) | Africa Development Indicators | The FAO indices of agricultural production sho... | b'Food and Agriculture Organization, electroni... |  |
| 1045 | AG.PRD.GAGRI.XD | Agriculture production index (gross, 1999-2001... | Africa Development Indicators | The FAO indices of agricultural production sho... | b'Food and Agriculture Organization, electroni... |  |
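
As a variation on the search above, `str.contains` also accepts `case` and `na` arguments, which can replace the `[Aa]` character class and guard against missing names. The series below is made up for illustration; whether the real indicator table contains missing names is not known from the output shown.

```python
import pandas as pd

# Made-up indicator names; None stands for a missing name.
names = pd.Series(["Agricultural land", "agriculture index", None, "GDP"])

# case=False replaces the [Aa] character class,
# na=False treats missing names as "no match".
mask = names.str.contains("agricult", case=False, na=False)
print(names[mask])
```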


## Downloading data

In [6]:
## Select some indicators
ind = ['NY.GDP.PCAP.KD', 
       'IT.MOB.COV.ZS', 
       'SP.RUR.TOTL.ZS']

df = wb.download(indicator=ind,
                 country='all',
                 start=2001,
                 end=2011)

In [10]:
df

| country | year | NY.GDP.PCAP.KD | IT.MOB.COV.ZS | SP.RUR.TOTL.ZS |
|---|---|---|---|---|
| Arab World | 2011 | 5989.899176 |  | 43.330359 |
| Arab World | 2010 | 5917.245724 |  | 43.681892 |
| Arab World | 2009 | 5784.690614 |  | 44.114470 |
| Arab World | 2008 | 5898.885250 |  | 44.548919 |
| Arab World | 2007 | 5713.955066 |  | 44.984331 |
| Arab World | 2006 | 5599.102508 |  | 45.274258 |
| Arab World | 2005 | 5382.993750 |  | 45.674088 |
| Arab World | 2004 | 5208.602873 |  | 46.041090 |
| Arab World | 2003 | 4870.031225 |  | 46.346152 |
| Arab World | 2002 | 4724.645810 |  | 46.631232 |


In [15]:
## Look up the metadata of the selected indicators.
## .isin() checks membership in a list.

ind_df[ind_df.id.isin(ind)]

|  | id | name | source | sourceNote | sourceOrganization | topics | unit |
|---|---|---|---|---|---|---|---|
| 6412 | IT.MOB.COV.ZS | Population coverage of mobile cellular telepho... | Africa Development Indicators | Please cite the International Telecommunicatio... | b'International Telecommunication Union, World... |  |  |
| 7880 | NY.GDP.PCAP.KD | GDP per capita (constant 2010 US$) | World Development Indicators | GDP per capita is gross domestic product divid... | b'World Bank national accounts data, and OECD ... | Economy & Growth |  |
| 10498 | SP.RUR.TOTL.ZS | Rural population (% of total population) | World Development Indicators | Rural population refers to people living in ru... | b"World Bank staff estimates based on the Unit... | Agriculture & Rural Development |  |


# Caveat: MultiIndex

In addition to selection by columns and rows, a DataFrame with a MultiIndex provides several index levels from which to select.
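
The outputs in this section show readable column names (`GDP_per_capita`, `Cell_coverage`, `Rural_population`) instead of the indicator codes; the renaming step itself is not shown. One way it might have been done, as an assumption, is `DataFrame.rename`, sketched here on a one-row toy frame:

```python
import pandas as pd

# Toy frame with the indicator codes as column names.
df = pd.DataFrame(
    {"NY.GDP.PCAP.KD": [5989.899176],
     "IT.MOB.COV.ZS": [float("nan")],
     "SP.RUR.TOTL.ZS": [43.330359]}
)

# Map each indicator code to a readable column name.
df = df.rename(columns={
    "NY.GDP.PCAP.KD": "GDP_per_capita",
    "IT.MOB.COV.ZS": "Cell_coverage",
    "SP.RUR.TOTL.ZS": "Rural_population",
})
print(df.columns.tolist())
```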

In [18]:
df

| country | year | GDP_per_capita | Cell_coverage | Rural_population |
|---|---|---|---|---|
| Arab World | 2011 | 5989.899176 |  | 43.330359 |
| Arab World | 2010 | 5917.245724 |  | 43.681892 |
| Arab World | 2009 | 5784.690614 |  | 44.114470 |
| Arab World | 2008 | 5898.885250 |  | 44.548919 |
| Arab World | 2007 | 5713.955066 |  | 44.984331 |
| Arab World | 2006 | 5599.102508 |  | 45.274258 |
| Arab World | 2005 | 5382.993750 |  | 45.674088 |
| Arab World | 2004 | 5208.602873 |  | 46.041090 |
| Arab World | 2003 | 4870.031225 |  | 46.346152 |
| Arab World | 2002 | 4724.645810 |  | 46.631232 |


In [26]:
## Selecting specific levels via .xs
## Only the rows for 'European Union':
df.xs('European Union', level='country')

| year | GDP_per_capita | Cell_coverage | Rural_population |
|---|---|---|---|
| 2011 | 34279.107501 |  | 25.776954 |
| 2010 | 33677.002643 |  | 25.995514 |
| 2009 | 33056.426081 |  | 26.218004 |
| 2008 | 34663.286787 |  | 26.439156 |
| 2007 | 34626.921433 |  | 26.669236 |
| 2006 | 33723.542783 |  | 26.900341 |
| 2005 | 32753.293496 |  | 27.128921 |
| 2004 | 32203.566959 |  | 27.358844 |
| 2003 | 31522.189349 |  | 27.587616 |
| 2002 | 31228.060098 |  | 27.819386 |
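
The same method can select along the other level, for example a single year across all countries. The sketch below uses a toy MultiIndex with values taken from the outputs above; storing the years as strings is an assumption about the downloaded data.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["Arab World", "European Union"], ["2011", "2010"]],
    names=["country", "year"],
)
df = pd.DataFrame(
    {"GDP_per_capita": [5989.899176, 5917.245724, 34279.107501, 33677.002643]},
    index=index,
)

# All countries for the year 2011; the 'year' level is dropped.
year_2011 = df.xs("2011", level="year")
print(year_2011)
```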


In [51]:
## Many methods take an additional level argument (e.g. groupby, sort_index).
## Selecting a column keeps the MultiIndex:

df['GDP_per_capita']

country                         year
Arab World                      2011     5989.899176
                                2010     5917.245724
                                2009     5784.690614
                                2008     5898.885250
                                2007     5713.955066
                                2006     5599.102508
                                2005     5382.993750
                                2004     5208.602873
                                2003     4870.031225
                                2002     4724.645810
                                2001     4795.152767
Caribbean small states          2011     9075.822740
                                2010     9039.288633
                                2009     8970.525520
                                2008     9355.413993
                                2007     9295.101868
                                2006     9069.133411
                                2005     8558.331571
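
An index level can also drive an aggregation. As a sketch, `groupby(level=...)` computes one value per country on a toy series built from four of the values above:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["Arab World", "Caribbean small states"], ["2011", "2010"]],
    names=["country", "year"],
)
gdp = pd.Series(
    [5989.899176, 5917.245724, 9075.822740, 9039.288633],
    index=index,
    name="GDP_per_capita",
)

# Aggregate within each value of the 'country' level.
mean_by_country = gdp.groupby(level="country").mean()
print(mean_by_country)
```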
         

## Further information

Further information on regular [indexes](http://pandas.pydata.org/pandas-docs/stable/indexing.html) and on [MultiIndexes](http://pandas.pydata.org/pandas-docs/stable/advanced.html) can be found in the pandas documentation.

# Further API Wrappers

## Just search the web

"Name of the platform + Python + API" is usually enough.

Many APIs today require registration and authentication.
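
How the credentials are passed differs between providers. A common pattern, shown here only as a sketch, is an `Authorization` header; the endpoint, the key, and the `Bearer` scheme are placeholders, and the request is prepared but not sent.

```python
from urllib.request import Request

# Hypothetical endpoint and placeholder key; real values come from
# the provider after registration.
API_KEY = "YOUR-API-KEY"
url = "https://api.example.org/v1/data?format=json"

# Many APIs expect the key in an Authorization header; others use a
# query parameter. The provider's documentation decides which.
request = Request(url, headers={"Authorization": f"Bearer {API_KEY}"})

print(request.get_header("Authorization"))
```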