# Scraping Data from the World Wide Web - Solutions

In this tutorial we scrape data from the World Wide Web. We collect article data from a German news outlet, abstracts from an economics journal, and live data from a public space API.

## Packages

In [1]:
InstallPackages = False
if InstallPackages:
    !pip install pandas
    !pip install requests
    !pip install bs4
    !pip install numpy

In [2]:
import pandas as pd
import numpy as np

In [3]:
import requests
import bs4

## Seed

In [4]:
seed = 42

## Exercise 1 - Examine the front page of BILD newspaper

Examine the front page of the BILD newspaper (www.bild.de) and create a list of all articles found on that page. Each item of the list must contain

* article title,
* main image of the article,
* url of the article.

**Hint:**

* request the content of the `www.bild.de` page and use the `"rel": "bookmark"` properties to identify links pointing to articles,
* request the content of each article to obtain its url, title, teaser and main image,
* you can use the `"og"` properties of the `<meta>` tags within an article to retrieve its title, main image and url.
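
To see what the `og` properties look like before touching the live site, the sketch below extracts them from a small, made-up HTML snippet. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages or network access; `OpenGraphParser` and `sample_html` are hypothetical names and data invented for this illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> values into a dict."""

    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        # keep only Open Graph properties that carry a content value
        if prop.startswith("og:") and "content" in attr:
            self.og[prop[3:]] = attr["content"]


# Hypothetical article HTML, invented for illustration only
sample_html = """
<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:image" content="https://example.com/img.jpg">
<meta property="og:url" content="https://example.com/article.bild.html">
<meta name="description" content="not an og tag">
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(sample_html)

# .get() returns None for a missing property instead of raising
article = {key: parser.og.get(key) for key in ("title", "image", "url")}
print(article)
```

With BeautifulSoup the same lookup is a `find(name='meta', attrs={'property': 'og:title'})` call, as in the solution further down.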

Scrape bild.de

In [5]:
url = 'http://www.bild.de'
page = requests.get(url).text

Create a BeautifulSoup object

In [6]:
page = bs4.BeautifulSoup(page, 'html.parser')

Create a list of article links

In [7]:
article_links = [url + a['href'] for a in page.select('article a[href$="bild.html"]')]

print(len(article_links))

16
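
Building links with `url + a['href']` only works when every `href` is relative. If some links on the page are already absolute, plain concatenation produces invalid URLs, which may be one reason why requests fail later on. A minimal sketch of a more robust alternative, assuming the standard library's `urllib.parse.urljoin`; the example hrefs are made up for illustration:

```python
from urllib.parse import urljoin

base_url = "http://www.bild.de"

# Hypothetical hrefs: one relative, one already absolute
hrefs = [
    "/politik/example-1.bild.html",
    "https://www.bild.de/sport/example-2.bild.html",
]

naive = [base_url + href for href in hrefs]           # breaks on absolute hrefs
joined = [urljoin(base_url, href) for href in hrefs]  # handles both cases

for n, j in zip(naive, joined):
    print(n, "->", j)
```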


Scrape the article data

In [8]:
articles = []

for link in article_links:
    try:
        # scrape the article
        article = requests.get(link).text
        article_bs_tree = bs4.BeautifulSoup(article, 'html.parser')
        
        # select relevant data from the article
        title = article_bs_tree.find(name='meta', attrs={'property': 'og:title'}).get('content')
        image = article_bs_tree.find(name='meta', attrs={'property': 'og:image'}).get('content')
        # use a separate name so the base `url` defined above is not overwritten
        article_url = article_bs_tree.find(name='link', attrs={'rel': 'canonical'}).get('href')

        # store the data in a dict
        article = {
            'title': title,
            'image': image,
            'url': article_url}
        
        # add that dict to the list of articles
        articles.append(article)
        
    except Exception:
        # skip articles that cannot be fetched or that lack the expected tags
        continue




Look at the articles

In [9]:
print(articles[:3])
print(len(articles))

[{'title': 'So schützen Sie Smartphone, Tablet und Co vor der Sonne', 'image': 'https://images.bild.de/62c1ab37c016196b8bf28699/13bac7e78efa41fdb40c05e1e3837113,89157260?w=1280', 'url': 'https://www.bild.de/bild-plus/digital/2022/digital/was-sonne-mit-der-technik-macht-sogar-das-internet-braucht-mal-hitzefrei-80589828.bild.html'}]
1


## Exercise 2 - Scrape the Abstracts from the American Economic Review

Using the DOI data, look up each American Economic Review article on https://www.aeaweb.org and create a list of the abstracts. Each item of the DOI data is either

* an academic article,
* or a non-academic article.

**Hint:**

* request the content of the `https://www.aeaweb.org/articles?id=` page plus the `doi` value to identify each article,
* request the content of each article and obtain the abstract from the `section` element whose `class_` is `'article-information abstract'`.

Read the doi data

In [10]:
df = pd.read_csv('Data/01.1 doi.csv')

See the info

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 328 entries, 0 to 327
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    328 non-null    int64  
 1   year          328 non-null    int64  
 2   journal_id    328 non-null    object 
 3   issue         328 non-null    float64
 4   article_page  328 non-null    int64  
 5   first_author  328 non-null    object 
 6   doi           328 non-null    object 
dtypes: float64(1), int64(3), object(3)
memory usage: 18.1+ KB


Set the base URL of the article pages (`artweb`)

In [12]:
artweb = 'https://www.aeaweb.org/articles?id='

Scrape the abstracts

In [13]:
abstract_list = []
for i in range(0, len(df)):
    
    doi = df.loc[i, "doi"]

    link = artweb + doi
    root = requests.get(link)
    leaf = bs4.BeautifulSoup(root.text,'html.parser')

    try:
        ## get abstracts
        abs_list = leaf.find_all('section',class_ = 'article-information abstract')
        abs_text = abs_list[0].get_text()
        abstract_list.append(abs_text)

    except IndexError:
        print('Not an academic article')
        abstract_list.append("didn't work")

    else:
        print(doi + ' done!')
            

10.1257/aer.100.1.35 done!
10.1257/aer.100.1.70 done!
10.1257/aer.100.1.98 done!
10.1257/aer.100.1.130 done!
10.1257/aer.100.1.164 done!
10.1257/aer.100.1.193 done!
10.1257/aer.100.1.214 done!
10.1257/aer.100.1.247 done!
10.1257/aer.100.1.304 done!
10.1257/aer.100.1.420 done!
10.1257/aer.100.1.504 done!
10.1257/aer.100.1.541 done!
10.1257/aer.100.1.557 done!
10.1257/aer.100.1.590 done!
10.1257/aer.100.3.691 done!
10.1257/aer.100.3.724 done!
10.1257/aer.100.3.763 done!
10.1257/aer.100.3.837 done!
10.1257/aer.100.3.870 done!
10.1257/aer.100.3.958 done!
10.1257/aer.100.3.984 done!
10.1257/aer.100.3.1008 done!
10.1257/aer.100.3.1046 done!
10.1257/aer.100.3.1080 done!
10.1257/aer.100.3.1104 done!
10.1257/aer.100.3.1195 done!
10.1257/aer.100.3.1238 done!
10.1257/aer.100.4.1358 done!
10.1257/aer.100.4.1399 done!
10.1257/aer.100.4.1493 done!
10.1257/aer.100.4.1556 done!
10.1257/aer.100.4.1572 done!
10.1257/aer.100.4.1759 done!
10.1257/aer.100.4.1778 done!
10.1257/aer.100.4.1804 done!
10.1257/a

See the abstract list

In [14]:
abstract_list

["\nAbstract\n\t\t\t\t\tThis paper investigates the role of social learning in the diffusion of a new agricultural\r\ntechnology in Ghana. We use unique data on farmers' communication\r\npatterns to define each individual's information neighborhood. Conditional on\r\nmany potentially confounding variables, we find evidence that farmers adjust\r\ntheir inputs to align with those of their information neighbors who were surprisingly\r\nsuccessful in previous periods. The relationship of these input adjustments\r\nto experience further indicates the presence of social learning. In addition,\r\napplying the same method to input choices for another crop, of known technology,\r\ncorrectly indicates an absence of social learning effects. (JEL D83, O13,\r\nO33, Q16)\t\t\t\t",
 '\nAbstract\n\t\t\t\t\tThis paper examines the frequency, pervasiveness, and determinants of product\r\nswitching by US manufacturing firms. We find that one-half of firms alter\r\ntheir mix of five-digit SIC products eve
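
The raw strings still contain the leading `Abstract` heading and layout whitespace (`\n`, `\t`, `\r\n`). A minimal cleaning sketch follows; `clean_abstract` is a hypothetical helper, not part of the original solution, and the sample string is a shortened stand-in for a scraped abstract.

```python
import re


def clean_abstract(raw: str) -> str:
    """Collapse whitespace runs and drop a leading 'Abstract' heading."""
    text = re.sub(r"\s+", " ", raw).strip()
    if text.startswith("Abstract"):
        text = text[len("Abstract"):].lstrip()
    return text


# Shortened stand-in for one scraped abstract
raw = "\nAbstract\n\t\t\t\t\tThis paper investigates\r\ntechnology in Ghana.\t\t"
print(clean_abstract(raw))
```

Applied to the whole list, this would be `[clean_abstract(a) for a in abstract_list]`; the `"didn't work"` placeholders pass through unchanged.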

## Exercise 3 - Open APIs from Space

Using [Open APIs From Space](http://open-notify.org), output the current latitude and longitude of the International Space Station, the number of its crew members, and their names.

**Hints:**

* Check [API documentation](http://api.open-notify.org).
* Use the `requests` library to query the API.
* The API returns responses in JSON format. To use the retrieved JSON as a Python dictionary, apply `response.json()`, where `response` is a variable holding the result of a `requests.get(<url>)` request.

In [15]:
response = requests.get("http://api.open-notify.org/iss-now.json")
data = response.json()

#print(data)

print('Current ISS position: longitude: ' + 
      data['iss_position']['longitude'] + ', latitude: ' +
      data['iss_position']['latitude'])

Current ISS position: longitude: -115.5642, latitude: 16.8281
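
The coordinates arrive as strings, which is why they can be concatenated directly above. For any numeric use they need converting first. The sketch below works on a hand-written payload shaped like the `iss-now.json` response, so it runs without network access; the coordinates are taken from the output above, while the timestamp value is made up.

```python
import json

# Hand-written sample shaped like the iss-now.json response (timestamp is invented)
sample_body = (
    '{"message": "success", "timestamp": 1656851700, '
    '"iss_position": {"longitude": "-115.5642", "latitude": "16.8281"}}'
)
data = json.loads(sample_body)

# Convert the string coordinates to floats
lon = float(data["iss_position"]["longitude"])
lat = float(data["iss_position"]["latitude"])

# Express the signs as hemispheres
ns = "N" if lat >= 0 else "S"
ew = "E" if lon >= 0 else "W"
formatted = f"{abs(lat):.4f}° {ns}, {abs(lon):.4f}° {ew}"
print(formatted)
```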


In [16]:
response = requests.get("http://api.open-notify.org/astros.json")
data = response.json()

# note: astros.json lists everyone currently in space, across all craft
crew_num = data["number"]
print('Number of crew members on ISS: ' + str(crew_num))

for i in range(crew_num):
    print('Crew member: ' + data['people'][i]['name'])

Number of crew members on ISS: 12
Crew member: Oleg Kononenko
Crew member: Nikolai Chub
Crew member: Tracy Caldwell Dyson
Crew member: Matthew Dominick
Crew member: Michael Barratt
Crew member: Jeanette Epps
Crew member: Alexander Grebenkin
Crew member: Butch Wilmore
Crew member: Sunita Williams
Crew member: Li Guangsu
Crew member: Li Cong
Crew member: Ye Guangfu
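
The `astros.json` endpoint reports everyone currently in space, and each entry carries a `craft` field next to the name. To list only the ISS crew, the people can be grouped by craft. The sketch below uses a hand-written three-person payload (a small sample for illustration, not the full API response) so that it runs offline.

```python
import json
from collections import defaultdict

# Hand-written sample shaped like the astros.json response
sample_body = """
{"message": "success", "number": 3, "people": [
  {"name": "Oleg Kononenko", "craft": "ISS"},
  {"name": "Tracy Caldwell Dyson", "craft": "ISS"},
  {"name": "Ye Guangfu", "craft": "Tiangong"}
]}
"""
data = json.loads(sample_body)

# Group the names by the craft they are on
by_craft = defaultdict(list)
for person in data["people"]:
    by_craft[person["craft"]].append(person["name"])

iss_crew = by_craft["ISS"]
print("Number of crew members on ISS: " + str(len(iss_crew)))
for name in iss_crew:
    print("Crew member: " + name)
```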
