# Data management

### Scraping the Internet to Collect Data

## [Malka Guillot](https://malkaguillot.github.io/)

### HEC Liège | [ECON2306]()

## Outline
1. Introduction
2. HTML: scraping and parsing
    - Gathering (unstructured) web data and transforming it into structured data (“web scraping”).
3. Web APIs
    - Accessing data on the web: APIs.

## Introduction

### Motivation

**Publication of crawling papers by year**

<center> 
    
<div class="r-stack"><img src="images/publication_crawling_papers_by_year.png" style="height: 400px;" > </div>
</center>

*Source*: Claussen, Jörg and Peukert, Christian, **[Obtaining Data from the Internet: A Guide to Data Crawling in Management Research](https://ssrn.com/abstract=3403799)** (June 2019). 


#### Examples in Economics
1. Davis, Dingel, Monras and Morales (2016): ["How Segregated Is Urban Consumption"](http://www.jdingel.com/research/DavisDingelMonrasMorales.pdf) 
   - use Yelp data to study racial segregation in consumption (do different races consume different things?)
2. Cavallo and Rigobon (2016): ["The Billion Prices Project: Using Online Prices for Measurement and Research."](https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.30.2.151)
   - collect prices from online retailers to study macroeconomic questions about price changes
   - see also Cavallo (2015) “Scraped Data and Sticky Prices”
3. Halket and Pignatti (2015): 
   - scrape Craigslist to better understand the US rental market
4. Many papers on eBay, some on Alibaba
5. Edelman, B. ["Using Internet Data for Economic Research."](https://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.26.2.189), (JEP 2012): 
   - useful discussion of many issues

#### Coding resources
- [Python's requests & Beautiful Soup libraries](https://blog.hartleybrody.com/web-scraping-cheat-sheet/) (for web scraping & APIs)
- Ryan Mitchell, [Web Scraping with Python](https://learning.oreilly.com/library/view/web-scraping-with/9781491985564/), O'Reilly Media, 2018

#### Example of data

- **Online markets**: 
    - housing, job, goods
    
- **Social media**: 
    - Twitter, Facebook, WeChat, newspaper text
    
- **Historical data** using the Internet Archive

### What is webscraping ?
<center>
<div class="r-stack"><img src="images/screenscraping.png" style="height: 400px;" > </div>
</center>

Source: [SICSS](https://compsocialscience.github.io) 

#### What is Web Scraping?
- Process of **gathering information** from the Internet
    - structured or unstructured info
- Involves **automation** 

#### Challenges of Web Scraping
- **Variety**. Every website is different.
- **Durability**. Websites constantly change.


#### Points to keep in mind
- It may or may not be legal
    - Look at websites’ terms of service and robots.txt files
- Webscraping is tedious and frustrating

#### Getting started: Things to consider before you begin
- **What**  do you want ?
    - Is the website only online for a limited time? 
    - Do you want an original snapshot as a backup? 
    - Is it more convenient to filter your data offline?
    
- **How** do you want to proceed? 
    - What scraping approach (depends on the website)?
    - Which `python package` is needed? 

### Best practices

#### 1. Check whether the data are already available 

- Send an **email** to try to get the data directly
- Check whether somebody has already **faced the same or a similar problem**.
- Does the site or service provide an **API** that you can access directly?
    - An API (application programming interface) lets you get the data you need via a simple computer program!

#### 2. Be gentle
- If possible, you can scrape during off-peak hours 
- Limit the number of parallel / concurrent requests
- Spread the requests across multiple IP's 
- Add delays to successive requests  
- Avoid unnecessary requests

#### 3. Respect `robots.txt`

`robots.txt` is a text file that website administrators create to tell web crawlers and scrapers how to crawl pages on their website. 

$\rightarrow$ It lays out the rules for acceptable behavior: 

- which web pages can and can't be scraped, 
- which user agents are not allowed,
- how fast you can do it, 
- how frequently you can do it.

I'd also recommend you read the terms of service of the website. 
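
These rules can be checked programmatically. The sketch below uses `urllib.robotparser` from the Python standard library on a made-up `robots.txt`: the file content, the bot name `my-research-bot` and the `example.com` URLs are assumptions chosen for illustration, not taken from any real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (a real one lives at https://<site>/robots.txt)
example_robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(example_robots_txt.splitlines())  # parse the rules from memory, no download

# Ask whether our (hypothetical) bot may fetch a given page
allowed_public = parser.can_fetch("my-research-bot", "https://example.com/public/page.html")
allowed_private = parser.can_fetch("my-research-bot", "https://example.com/private/page.html")

# Number of seconds the site asks us to wait between requests (None if not specified)
delay = parser.crawl_delay("my-research-bot")

print(allowed_public, allowed_private, delay)
```

For a real website, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.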

#### 4. Don't follow the same crawling pattern.

Both human users and bots consume data from a web page, but they behave differently:  

- Humans are slow and unpredictable, 
- Bots are fast but predictable. 

$\Rightarrow$ Anti-scraping technologies use this difference to detect and block scrapers.

*Solution*: incorporate some random actions that confuse the anti-scraping technology. 
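
One simple random action is to wait a random amount of time between two requests, which also keeps the load on the server low. The helper below is only a sketch: `crawl_with_random_delays` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real download function such as `requests.get` so that the example runs offline.

```python
import random
import time

def crawl_with_random_delays(urls, fetch, min_delay=1.0, max_delay=3.0):
    """Call fetch(url) for each URL, pausing a random time between two calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Random pause so that requests are not evenly spaced
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

# Stand-in for a real download function (assumption for this sketch)
def fake_fetch(url):
    return "content of " + url

pages = crawl_with_random_delays(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    min_delay=0.01,  # tiny delays for the demo; use seconds in practice
    max_delay=0.05,
)
print(pages)
```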

### Most important Python libraries for data collection
- Standard: 
    - `Requests`
    - `Beautiful Soup`
- More advanced
    - `Scrapy` [documentation](https://docs.scrapy.org/en/latest/)
    - `Selenium` 

$+$ installing the package:

In [None]:
pip install beautifulsoup4

### Load packages

In [2]:
# Import packages + set options
from IPython.display import display
import json

import pandas as pd
pd.options.display.max_columns = None # Display all columns of a dataframe
pd.options.display.max_rows = 700

from pprint import pprint
import re  

### Data communication for the World Wide Web 

- `HTTP protocol` = the way the client (browser) and the web server communicate 
    - no encryption $\rightarrow$ not safe
- `HTTPS protocol` = HTTP over an encrypted connection (S for secure)

$\Rightarrow$ Both work by exchanging `Requests` and `Responses`

### Static vs. dynamic websites 

<center>
<div class="r-stack"><img src="https://about.gitlab.com/images/blogimages/ssg-gitlab-pages-series/part-1-dynamic-x-static-server.png" style="height: 450px;" > </div>
</center>


- **Static Websites**: the server that hosts the site sends back HTML documents that already contain all the data you’ll get to see as a user.

### Request and Response

All interactions between a client and a web server are split into a request and a response:

- `Requests` contain relevant data regarding your request call:
    - base URL 
        - \[ More on this for **API**: the *endpoint*, the *method* used, the *headers*, and so on.\]
- `Responses` contain relevant data returned by the server:
    - the data or content, the status code, and the headers.

#### `get` method

In [None]:
import requests 
url='https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
response = requests.get(url)
response

### `Request`'s attributes

In [None]:
request = response.request
print('request: ',request)
print('-----')
print('url: ',request.url)
print('-----')  
print('path_url: ', request.path_url)
print('-----')
print('Method: ', request.method)
print('-----')
print('Headers: ', request.headers)

### `Response`'s attributes

- `.text` returns the response contents in Unicode format
- `.content` returns the response contents in bytes.

In [None]:
print('Response:', response)
print('-----')
print('Text:', response.text[:50]) # Only the first 50 characters
print('-----')
print('Status_code:', response.status_code)
print('-----')
print('Headers:', response.headers)

### Status Codes

**Important information**: 
the status code tells you whether your request was successful, whether the requested data is missing, or whether credentials are missing.

<center>    
<div class="r-stack"><img src="https://bitmaskers.in/content/images/2023/10/1-1.jpg" style="height: 400px;" > </div>
</center>
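
Status codes are grouped in families by their first digit. As a rough sketch, the standard library's `http.HTTPStatus` can translate a code into its official phrase; the helper name `describe_status` is hypothetical and only meant for illustration.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return the code, its standard phrase and its family, e.g. '404 Not Found (client error)'."""
    phrase = HTTPStatus(status_code).phrase  # raises ValueError for unknown codes
    if 200 <= status_code < 300:
        family = "success"
    elif 300 <= status_code < 400:
        family = "redirection"
    elif 400 <= status_code < 500:
        family = "client error"
    elif 500 <= status_code < 600:
        family = "server error"
    else:
        family = "informational"
    return f"{status_code} {phrase} ({family})"

for code in (200, 401, 404, 500):
    print(describe_status(code))
```

With `requests`, the same check is usually done on `response.status_code`, or by calling `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.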

## Scraping & Parsing in Practice

### STEPS:
1. Inspect Your Data Source
1.  Scrape HTML Content From a Page
1.  Parse HTML Code With Beautiful Soup

### Step 1: Inspect Your Data Source

#### Explore the Website 

**Objective**: understanding its underlying structure

*We will scrape the list of current members of the U.S. Congress*

#### Website example for today

 
<center>
<div class="r-stack"><img src="images/ballotpedia.png" style="height: 400px;" > </div>
</center>

Source: [ballotpedia website](https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress) 


### Understanding URLs
- **Base URL**: https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress
- More complex URL with query parameter https://ballotpedia.org/wiki/index.php?search=jerry
    - query string = `?search=jerry` (a single query parameter: `search=jerry`)
    - can be used to crawl websites if you have a list of queries that you want to loop over (e.g. dates, localities...)
    - query structure:
        - *Start*: `?`
        - *Information*: pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (key=value). 
        - *Separator*: `&` -> if multiple query parameters 
        
Other example of URL: https://opendata.swiss/en/dataset?political_level=commune&q=health. 
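
The query structure can be taken apart and rebuilt with `urllib.parse` from the standard library. The sketch below works on the opendata.swiss URL shown above; the list of search terms is made up to illustrate looping over queries, and nothing is downloaded.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

search_url = "https://opendata.swiss/en/dataset?political_level=commune&q=health"

# Split the URL and decode the key=value pairs of the query string
parts = urlsplit(search_url)
query_params = parse_qs(parts.query)
print(parts.path)
print(query_params)

# Rebuild URLs for a list of (hypothetical) search terms
base = f"{parts.scheme}://{parts.netloc}{parts.path}"
urls_to_crawl = []
for term in ["health", "energy"]:
    query = urlencode({"political_level": "commune", "q": term})
    urls_to_crawl.append(f"{base}?{query}")
print(urls_to_crawl)
```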

<div class="alert alert-info">
<h3> Your turn</h3>
<p> Try to change the search and selection parameters and observe how that affects the  <a href= "https://opendata.swiss/en/dataset?political_level=commune&q=health">URL</a> . 
<p>   Next, try to change the values directly in your URL. See what happens when you paste the following URL into your browser’s address bar:
</div>

**Conclusion**: When you explore URLs, you can get information on how to retrieve data from the website’s server.

#### Inspect the site: Using Developer Tools
We use the `inspect` function (right click) to access the underlying HTML interactively. 

<center>
<div class="r-stack"><img src="images/ballotpedia_inspect.png" style="height: 400px;" > </div>
</center>


#### Developer tools
- Developer tools can help you understand the structure of a website
- I use them in Chrome or Firefox, but they exist in most browsers
- They let you interactively explore the source HTML and the webpage

**HTML is powerful but intricate $\Rightarrow$ `BeautifulSoup` makes it easier to work with** 

###  Step 2: Scrape HTML Content From a Page

In [None]:
import requests
url='https://ballotpedia.org/List_of_current_members_of_the_U.S._Congress'
response = requests.get(url)
html=response.text
html[:500]

The raw HTML looks messy.

The `prettify()` method of `BeautifulSoup` helps:

In [None]:
# Parse raw HTML
from bs4 import BeautifulSoup # package for parsing HTML
soup = BeautifulSoup(html, 'html.parser') # parse html of web page
print(soup.prettify()[:500])

###  Step 3: Parse HTML Code With Beautiful Soup

**Objective**: extract the URLs of the senators' pages from the webpage, to build a list of URLs that will be used to scrape information on each senator

#### Find Elements by ID using `find`

In an HTML web page, every element can have an `id` attribute assigned. 

The `id` can be used to directly access the element. 

**Syntax**: 
- `soup.find(id='the-id-of-what-you-are-looking-for')`
- Output: the first matching element (a `Tag` object), or `None` if nothing matches
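
A minimal sketch of `find` by `id` on a made-up HTML snippet (the HTML, the ids and the variable names are assumptions chosen so that the example runs without downloading anything):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page
example_html = """
<html><body>
  <h1 id="main-title">Members</h1>
  <p id="intro">Two chambers.</p>
</body></html>
"""
example_soup = BeautifulSoup(example_html, "html.parser")

# find() returns the first element with this id
title_tag = example_soup.find(id="main-title")
print(type(title_tag).__name__)  # Tag
print(title_tag.get_text())      # Members

# When no element has the id, find() returns None
missing = example_soup.find(id="does-not-exist")
print(missing)                   # None
```

Because a missing `id` yields `None`, it is worth testing the result before calling methods such as `.prettify()` on it.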

<center>
<div class="r-stack"><img src="images/ballotpedia-senate.png" style="height: 420px;" > </div>
</center>


In [None]:
balance=soup.find(id='Leadership_and_partisan_balance')
print(balance.prettify()[:500])

<div class="alert alert-info">
<h4> Your turn</h4>
<p> Extract the page title "List of current members of the U.S. Congress" using the find method
</div>

In [None]:
soup.find(id="firstHeading").prettify() 

In [None]:
page_title=soup.find(id='firstHeading')
print(page_title.get_text())

#### Getting the table entitled *List of current U.S. Senate members*:

In [None]:
officeholder_table=soup.find(id='officeholder-table')
print(officeholder_table.prettify()[:400])

#### Find Elements by HTML Tag Name (using `find_all`)
When several elements match, use `find_all` instead of `find`. 

Let's rely on the HTML structure to find the rows of the table.

**Syntax**: 
- `soup.find_all('the-tag-name')`
- Output: a list of all matching elements

In [None]:
thead= officeholder_table.find('thead')
thead

In [None]:
rows=officeholder_table.find_all('tr')
len(rows) # consistent: 100 members + header row + 2 delegations from Puerto Rico

#### Find Elements by CSS Class using `select_one()` or  `find_all()`

In an HTML web page, a tag can carry a `class` attribute. 

It is useful if you want to get information from a tag which has no `id` attribute:

`<table class="bptable gray sortable jquery-tablesorter" id="officeholder-table" style="width:auto; border-bottom:1px solid #bcbcbc;">`

Note that this table tag has **4 classes**: 
- bptable, 
- gray, 
- sortable, 
- jquery-tablesorter. 

In this case you can use any of the first three class names to get the `officeholder_table` (the fourth, `jquery-tablesorter`, is presumably added by JavaScript in the browser, so it may be missing from the HTML returned by `requests`):

#### First syntax:  `select_one(tag_name.class_name)` 
You can select multiple elements with `select()`.

In [None]:
officeholder_table = soup.select_one('table.bptable')
print(officeholder_table.prettify()[:300]) 

#### Second syntax: `find(tag_name, {'attribute_name' : 'attribute_value'})`
You can select multiple elements with `find_all()`.

In [None]:
officeholder_table = soup.find('table', {'class' : 'bptable'})
print(officeholder_table.prettify()[:300])
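
Both syntaxes can be compared on a small made-up table (the HTML, the id `demo-table` and the `example.com` links are assumptions for illustration). The sketch also shows `select()` returning several elements at once:

```python
from bs4 import BeautifulSoup

# Made-up table with several classes, mimicking the structure discussed above
example_html = """
<table class="bptable gray sortable" id="demo-table">
  <tr><th>Name</th></tr>
  <tr><td><a href="https://example.com/alice">Alice</a></td></tr>
  <tr><td><a href="https://example.com/bob">Bob</a></td></tr>
</table>
"""
example_soup = BeautifulSoup(example_html, "html.parser")

# Any one of the classes is enough to match the table
by_selector = example_soup.select_one("table.gray")
by_attribute = example_soup.find("table", {"class": "sortable"})
print(by_selector["id"], by_attribute["id"])

# select() with a CSS selector returns all matching elements
links = [a["href"] for a in by_selector.select("td a")]
print(links)
```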

#### Let's try to get the `url` for one example row:
We will use the HTML tags (`td` for table cells, `a` for links).

In [None]:
row=rows[1]
row

In [None]:
tds=row.find_all('td')
tds[:4]

In [None]:
url= tds[1].find_all('a') 
print('--')
print('a list:', url)
print('--')
print('its unique element', url[0])
print('--')
print('url wanted', url[0]['href'] )
print('--')
print('Text content', url[0].get_text())

<div class="alert alert-info">
<h4> Your turn</h4>
<p> Use the code for 1 row in order to build a loop that gives a list of all of the wanted url. 
</div>

In [None]:
list_url=[]
for row in rows[1:]: 
    tds=row.find_all('td')
    url = tds[1].find("a")['href']
    list_url.append(url)
print(len(list_url))
print(list_url[:10])

<div class="alert alert-info">
<h4> Your turn</h4>
<p> Get the table containing the "List of current U.S. House members". 
</div>

In [None]:
officeholder_tables=soup.find_all(id='officeholder-table')
print("There are {} tables".format(len(officeholder_tables)))

house_table=officeholder_tables[1] # the second table
print("------ looking at the content of the second one ------")
print(house_table.prettify()[450:800])

### Your First Scraper

Then, the same logic can be applied to get information from each senator's page (e.g. https://ballotpedia.org/Jerry_Moran). The following code extracts information from a short list of URLs like the ones scraped above. 

#### Strategy

- Loop over a list of URLs (`list_url`)
    - get the soup for the `url` $\rightarrow$ `get_the_soup()`
        - Extract information from the `soup`


In [21]:
#1. Get the soup
def get_the_soup(url):
    '''
    Requests the URL and cooks it into a soup object. 
    ---
    Arguments: url
    Returns: soup
    '''
    
    response = requests.get(url)
    html=response.text
    soup = BeautifulSoup(html, 'html.parser') # parse html of web page
    return soup

In [None]:
url = "https://ballotpedia.org/Alex_Padilla"
soup=get_the_soup(url)
soup.prettify()[:500]

In [23]:
#2. Extract info from the soup
from bs4 import Tag

def extract_soup_info_to_dictionary(soup):
    '''
    Extracts relevant information from a bs4 object and stores it into a dictionary by html header. 
    ---
    Arguments: soup
    Returns: dictionary of text (value, stored in a one-element list) by header (key)
    '''
    
    dic_text_by_header=dict()
    
    # get all the text content between two consecutive headers (h2); the last header is skipped
    for header in soup.find_all('h2')[:-1]:
        texts=[]
        next_node=header
        # walk through the siblings that follow the header
        while True:
            next_node=next_node.next_sibling
            if next_node is None:
                break
            if isinstance(next_node, Tag):
                if next_node.name == "h2":
                    break
                # collect the text of every tag found before the next header
                texts.append(next_node.get_text(strip=True))
        # The result is put in a dictionary as a value for key=corresponding header
        # (one-element list, so that the dictionary converts to a one-row dataframe)
        dic_text_by_header[header.get_text()]=[' '.join(texts)]
    
    return dic_text_by_header

In [None]:
extract_soup_info_to_dictionary(soup)

#### Putting everything together

- A dataframe to put things together: `df_parsed`


- **LOOP** (`url` from `list_url`): 

    - `get_the_soup()`: 
        - `url`  -> `soup`

    - `extract_soup_info_to_dictionary()`: 
        - `soup` -> `dic_text_by_header`
        
    - `pd.concat()`
        - `dic_text_by_header` -> `df_parsed`

#### `pd.concat()`? 
Powerful function to combine data:

In [None]:
df1 = pd.DataFrame([['a', 1], ['b', 2]], columns=['letter', 'number'])
df1

In [None]:
df2 = pd.DataFrame([['c', 3], ['d', 4]], columns=['letter', 'number'])
df2

#### `pd.concat(list-of-dataframes)` 
Vertical or horizontal concatenation

In [None]:
pd.concat([df1, df2], axis=0) # axis = 0 -> along the index, the default

In [None]:
pd.concat([df1, df2], axis=1) # axis = 1 -> along the columns (side by side, aligned on the index)

#### Wrappers: LOOPING over `list_url` using our homemade functions

Writing requests manually is tedious. Wrappers add a layer of abstraction that makes this job easier for us.

In [29]:
# Let's use this short list (but cf. the full list extracted above)
list_url= ['https://ballotpedia.org/Katie_Britt',
 'https://ballotpedia.org/Tommy_Tuberville',
 'https://ballotpedia.org/Lisa_Murkowski'
]

In [None]:
from bs4 import NavigableString, Tag

# the dataframe in which we will put the scraper's output
df_parsed=pd.DataFrame()

for url in list_url:
    print('--------',url, '--------')
    #1. Get the soup
    soup = get_the_soup(url)
    
    #2. Extract info from the soup
    dic_text_by_header=extract_soup_info_to_dictionary(soup)
                
    # put the dictionary into a dataframe
    temp=pd.DataFrame.from_dict(dic_text_by_header)

    # Concatenate the global dataframe with the temporary one (new rows are added at the bottom)
    df_parsed=pd.concat([df_parsed, temp], ignore_index=True)

In [None]:
df_parsed.head()

### Saving the `DataFrame` in a `pickle` format

<div class="alert alert-block alert-warning">
<i class="fa fa-warning"></i>&nbsp;<code>pickle</code> format
    <ul>
        <li> Useful to store <code>python</code> objects 
        </li>
        <li> Well integrated in  <code>pandas</code> (using <code>to_pickle</code> and <code>read_pickle</code>)
        </li>
        <li> When the object is not a pandas Dataframe, use the <code>pickle</code> package
        </li>
    </ul>
</div>
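
For an object that is not a DataFrame, the `pickle` package works as follows. This is a minimal sketch: the dictionary `scraped` and the file name are made up, and a temporary folder is used so that the example leaves no file behind.

```python
import os
import pickle
import tempfile

# A made-up dictionary standing in for scraped results
scraped = {"urls": ["https://example.com/a"], "n_pages": 1}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scraped.pickle")

    # Save: pickle files are binary, hence the "wb" mode
    with open(path, "wb") as f:
        pickle.dump(scraped, f)

    # Load the object back
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == scraped)  # True
```

Only load pickle files that you created yourself or that come from a trusted source, because unpickling can execute arbitrary code.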


#### Managing path
<div class="alert alert-block alert-warning">
<i class="fa fa-warning"></i>&nbsp;<code>os</code> package
    <ul>
        <li> <code>os.getcwd()</code>: returns the current working directory
        </li>
        <li> <code>os.path.dirname()</code>: returns the parent directory of a path
        </li>
        <li> <code>os.path.join()</code>: joins several path components
        </li>
    </ul>
</div>

In [None]:
import os

os.getcwd() #   

In [None]:
parent_path=os.path.dirname(os.getcwd()) 
parent_path

In [34]:
data_path=os.path.join(parent_path, 'data') 
os.makedirs(data_path, exist_ok=True) # create the folder if it does not exist yet

# Saving the data to pickle:
df_parsed.to_pickle(os.path.join(data_path, 'df_senators.pickle'))

# Saving the data to csv:
df_parsed.to_csv(os.path.join(data_path, 'df_senators.csv'))

### Going further

There are also **dynamic websites**: the server does not send back the final HTML. Instead, your browser receives JavaScript code and executes it to build the page, so the data you see is not in the HTML that `requests` retrieves. `Beautiful Soup` cannot execute this JavaScript; you need a tool that runs it like a browser does. 

Solutions: 
- Use `requests-html` 
- Simulate a browser using [selenium](https://selenium-python.readthedocs.io/) 

## Application Programming Interfaces (API)

### What is an API?
<div class="r-stack"><img src="https://raw.githubusercontent.com/malkaguillot/ECON2206-Data-Management/refs/heads/main/slides/images/api/API_3.svg?token=GHSAT0AAAAAACWYMATYOB76VG6756UFYU3UZYM5HYQ" style="height: 360px;" > </div>

Communication layer that allows different systems to talk to each other without having to understand exactly what each other does.

$\Rightarrow$ provides **programmable** access to data.


The (retired) website [Programmable Web](https://www.programmableweb.com/apis/directory) used to list more than 225,353 APIs from sites as diverse as Google, Amazon, YouTube, the New York Times, del.icio.us, LinkedIn, and many others.

<center>
<div class="r-stack"><img src="https://www.researchgate.net/publication/362048443/figure/fig1/AS:11431281099021888@1669172071410/The-increasing-growth-of-Web-API-6.png" style="height: 400px;" > </div>
</center>

Source: [Programmable Web](https://www.programmableweb.com/news/apis-show-faster-growth-rate-2019-previous-years/research/2019/07/17) 


### How Does an API Work?

- Relying on **HTTP messages** :
    - `request` for information or data, 
    - the API returns a `response` with what you requested
- Similar to visiting a website: you specify a URL and information is sent to your machine.

###  Better than webscraping if possible because: 
- APIs are more stable than webpages
- No HTML but already structured data (e.g. in `json`)

We focus on APIs that use the HTTP protocol.

<center><div class="r-stack">
<img src="https://s3.us-west-1.wasabisys.com/idbwmedia.com/images/api/restapi_restapi.svg" style="height: 500px;" > </div>
</center>

### Endpoints and Resources

*An API endpoint is the end of your communication channel to the API. An API can have different endpoints for channels leading to different resources.*

- **base URL**: https://api.carbonintensity.org.uk
    - Other examples: https://api.twitter.com; https://api.github.com
    - On its own, the base URL returns only very basic information about the API, not the real data.
- Extend the URL with an **endpoint**
    - = a part of the URL that specifies what resource you want to fetch
    - check the [documentation](https://carbon-intensity.github.io/api-definitions/#carbon-intensity-api-v2-0-0) to learn more about what endpoints are available
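
In code, an endpoint is simply appended to the base URL. The helper below is a small sketch (`build_url` is a hypothetical name, not part of any library); the endpoints `/intensity` and `/regional` are the ones used later in this section.

```python
def build_url(base_url, endpoint):
    """Join a base URL and an endpoint with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + endpoint.lstrip("/")

base_url = "https://api.carbonintensity.org.uk"

# One full URL per resource we want to fetch
for endpoint in ["/intensity", "/regional"]:
    print(build_url(base_url, endpoint))
```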

### Anatomy of a request
<div class="r-stack"><img src="https://raw.githubusercontent.com/malkaguillot/ECON2206-Data-Management/refs/heads/main/slides/images/api/API_call_anatomy.svg?token=GHSAT0AAAAAACWYMATYD7WH4TQMCZNVL6EKZYM5NTA" style="height: 160px;" > </div>



### Anatomy of a response

Very often API responses are formatted as JSON objects.



In [None]:
api_response = {
    "data":
    [
        {
            "end":"2022-07-01",
            "start":"2022-06-30",
            "tweet_count":3
        },
        {
            "end":"2022-07-02",
            "start":"2022-07-01",
            "tweet_count":2
        }
    ]
}
api_response 

Once parsed, a JSON object works like a Python dictionary:

In [None]:
api_response['data'][0]['end'] 

<div class="alert alert-info">
<h3> Your turn</h3>
Using the previous <code>api_response</code> JSON, extract the number of tweets for the period starting on <code>2022-07-01</code>.
</div>

In [None]:
api_response['data'][1]['tweet_count'] 
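
Indexing by position only works if you know the order of the entries. A slightly more robust sketch, assuming the response arrives as raw text with the same structure as above, parses it with `json.loads` and looks the value up by its start date (the variable names are chosen for illustration):

```python
import json

# Raw text as an API would send it (same content as the example above)
raw_text = (
    '{"data": ['
    '{"end": "2022-07-01", "start": "2022-06-30", "tweet_count": 3}, '
    '{"end": "2022-07-02", "start": "2022-07-01", "tweet_count": 2}'
    ']}'
)

# Text -> Python dictionary
parsed = json.loads(raw_text)

# Index the counts by start date instead of by position
counts_by_start = {entry["start"]: entry["tweet_count"] for entry in parsed["data"]}
print(counts_by_start["2022-07-01"])  # 2
```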

### HTTP Methods
| HTTP Method | Description                  | Requests method   |
|-------------|------------------------------|-------------------|
| POST        | Create a new resource.       | requests.post()   |
| GET         | Read an existing resource.   | requests.get()    |
| PUT         | Update an existing resource. | requests.put()    |
| DELETE      | Delete an existing resource. | requests.delete() |

### Calling Your First API Using Python

Forecasts from the [**Carbon Intensity API**](https://carbonintensity.org.uk/) (includes CO2 emissions related to electricity generation only).

See the API [documentation](https://carbon-intensity.github.io/api-definitions/#carbon-intensity-api-v2-0-0)

<center>
<img src="attachment:image.png">
</center>


In [None]:
import requests
headers = { 
  'Accept': 'application/json'
}
# fetch (or get) data from the URL
requests.get('https://api.carbonintensity.org.uk', params={}, headers = headers) 

In [None]:
response = requests.get('https://api.carbonintensity.org.uk', params={}, headers = headers) 
print(response.text[:500])

### Using the `intensity`  endpoint:

In [None]:
# Get Carbon Intensity data for current half hour
r = requests.get('https://api.carbonintensity.org.uk/intensity', params={}, headers = headers)

# Different outputs (same information):
print("--- text ---")
pprint(r.text)
print("--- Content ---")
pprint(r.content)
print("--- JSON---")
pprint(r.json())

<div class="alert alert-block alert-warning">
<h3><i class="fa fa-warning"></i><code>json</code> </h3>
    <ul>
        <li>Once parsed, <code>json</code> content behaves like a Python dictionary
        </li>
        <li>A great format for structured data
        </li>
    </ul>
</div>

In [None]:
# parsed JSON objects work like any other dictionary in Python
# (named `data` rather than `json` to avoid hiding the `json` module imported earlier)
data=r.json()
data['data']

In [None]:
# get the actual intensity value:
data['data'][0]['intensity']['actual']

<div class="alert alert-info">
<h3> Your turn</h3>
    <ul>
        <li>Get Carbon Intensity factors for each fuel type -> look for the relevant endpoint
        </li>
        <li> Get Carbon Intensity data for current half hour for GB regions
        </li>
    </ul>

</div>

In [None]:
r = requests.get('https://api.carbonintensity.org.uk/intensity/factors', params={}, headers = headers)
pprint(r.json())

In [None]:
r = requests.get('https://api.carbonintensity.org.uk/regional', params={}, headers = headers)
pprint(r.json())

In [None]:
r = requests.get('https://api.carbonintensity.org.uk/intensity/factors', params={}, headers = headers)
pprint(r.json())

In [None]:
# Get Carbon Intensity data for current half hour for GB regions
r = requests.get('https://api.carbonintensity.org.uk/regional', params={}, headers = headers)
#pprint(r.json())

### Query Parameters
- cf. the section on URLs above
- used as filters you can send with your API request to further narrow down the responses.

In [47]:
# In the Carbon Intensity API, the parameters are part of the URL path instead of a query string:
from_="2018-08-25T12:35Z"
to="2018-08-25T13:35Z"
r = requests.get('https://api.carbonintensity.org.uk/regional/intensity/{}/{}'.format(from_, to), params={}, headers = headers)
#pprint(r.json())
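
The two timestamps do not have to be typed by hand. As a sketch, they can be built from `datetime` objects; the format string below is an assumption inferred from the example values above, so check the API documentation before relying on it. No request is sent here.

```python
from datetime import datetime, timedelta, timezone

# Start of the window (UTC) and a one-hour duration
window_start = datetime(2018, 8, 25, 12, 35, tzinfo=timezone.utc)
window_end = window_start + timedelta(hours=1)

# Format inferred from the example values above (assumption)
time_format = "%Y-%m-%dT%H:%MZ"
start_text = window_start.strftime(time_format)
end_text = window_end.strftime(time_format)

example_url = "https://api.carbonintensity.org.uk/regional/intensity/{}/{}".format(start_text, end_text)
print(example_url)
```

Looping over several windows then only requires changing `window_start`.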

### API Limitations
To prevent the collection of huge amounts of individual data, many APIs require you to obtain “credentials”: codes or passwords that identify you and determine which types of data you are allowed to access. 

#### API Credentials
- Different methods/levels of authentication exist
    - API keys
    - OAuth

####  Rate Limits & quotas
- The credentials also define how often we are allowed to make requests for data. 
- Be careful not to exceed the limits set by the API developers. 
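
When a limit is exceeded, many APIs answer with the status code 429 (Too Many Requests). A common reaction is to wait and retry, doubling the waiting time after each failure. The sketch below illustrates the idea offline: `call_with_backoff` is a hypothetical helper, and the sequence `fake_responses` simulates an API that refuses twice before answering.

```python
import time

def call_with_backoff(fetch, max_attempts=4, base_delay=1.0):
    """Call fetch() until it stops returning status 429, waiting longer after each failure."""
    for attempt in range(max_attempts):
        status_code, payload = fetch()
        if status_code != 429:
            return status_code, payload
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            time.sleep(base_delay * 2 ** attempt)
    return status_code, payload  # still rate-limited after all attempts

# Simulated API: rate-limited twice, then successful (assumption for this sketch)
fake_responses = iter([(429, None), (429, None), (200, {"ok": True})])
result = call_with_backoff(lambda: next(fake_responses), base_delay=0.01)
print(result)
```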

#### API Keys

- The most common form of authentication 
- These keys are used to identify you as an API user or customer and to trace your use of the API. 
- API keys are typically sent as a request header or as a query parameter.
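
The two ways of sending a key can be sketched as follows. Everything specific here is hypothetical: the environment variable `MY_API_KEY`, the header name `X-API-Key`, the parameter name `api-key` and the `api.example.com` URL. Each API documents its own names.

```python
import os
from urllib.parse import urlencode

# Read the key from an environment variable rather than writing it in the notebook
# (the fallback value is a placeholder, not a working key)
api_key = os.environ.get("MY_API_KEY", "demo-key-not-real")

# Option 1: the key travels in a request header
example_headers = {"Accept": "application/json", "X-API-Key": api_key}

# Option 2: the key travels as a query parameter
example_url = "https://api.example.com/articles?" + urlencode({"q": "economics", "api-key": api_key})

print(sorted(example_headers))
print(example_url.split("?")[0])
```

With `requests`, the first option corresponds to the `headers=` argument and the second to the `params=` argument of `requests.get`.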


### Examples in research

#### The Telegram API

- *What do you get?* 
  - Messages & media in a given channel, channel members.
- *Usefulness*: 
  - Very useful to get data from fringe groups that are not represented on big social media platforms. Severely underused in computational social science research.
- *Accessibility*: 
  - Clunky authentication process, lacking documentation, Python wrapper not really designed for data collection.
- *Example publication*: 
  - [Organization and evolution of the UK far-right network on Telegram](https://appliednetsci.springeropen.com/articles/10.1007/s41109-022-00513-8) 

#### The New York Times API
- *What do you get?* 
  - Article abstracts and metadata, not full article texts.
- *Usefulness*: 
  - Somewhat limited because full texts are missing, but abstracts can serve as a window into historic high-quality journalistic texts.
- *Accessibility*: 
  - Easy authentication, good documentation, nice Python wrapper, hard to get article texts.
- *Example publication*: 
  - [The rise and fall of rationality in language](https://www.pnas.org/doi/epdf/10.1073/pnas.2107848118) 

#### Other useful APIs:
- Spotify   
  - Metadata about artists, playlists & tracks, song features like "danceability".
- CrossRef
  - Article titles, authors, journals, number of references.
- OpenStreetMap
  - Map data (nodes, ways, relations), alternative to Google Maps.
- The MediaWiki API 
  - Access Wikipedia articles, article discussions & revision histories.

## General remarks
- Start simple
    - Expand your program incrementally.
- Keep it simple. 
    - Do not overengineer the problem.
- Do not repeat yourself. 
    - Code duplication implies bug reuse.
- Limit the number of iterations for test runs. 
    - Use print statements to inspect objects.
- Write tests to verify things work as intended.
- If the web page cannot be navigated easily or relies on hidden JavaScript, look into Selenium.

### More resources
Excellent [tutorial](https://www.dataquest.io/blog/web-scraping-python-using-beautiful-soup/) for a workflow from HTML to pandas DataFrame.

- Web scraping with Selenium [Notebook]()
- Web scraping [code snippets](https://github.com/JanaLasser/SICSS-aachen-graz/blob/main/02_01_APIs/exercise/web_scraping_code_snippets.ipynb).
- API access [code snippets](https://github.com/JanaLasser/SICSS-aachen-graz/blob/main/02_01_APIs/exercise/API_access_code_snippets.ipynb).
- Crowd-sourced [list](https://docs.google.com/spreadsheets/d/1ZEr3okdlb0zctmX0MZKo-gZKPsq5WGn1nJOxPV7al-Q/edit?gid=0#gid=0) of useful APIs.