# Data acquisition - Exercises

## Exercise 9.1.

The list below contains a number of URLs. They are the web addresses of texts created for the Project Gutenberg website.

```
urls = [ 'http://www.gutenberg.org/files/580/580-0.txt' ,
'http://www.gutenberg.org/files/1400/1400-0.txt' ,
'http://www.gutenberg.org/files/786/786-0.txt' ,
'http://www.gutenberg.org/files/766/766-0.txt' 
]
```

Write a program in Python which can download all the files that are listed. As filenames, use the same names that are used by Project Gutenberg (e.g. '580-0.txt' or '1400-0.txt'). The basename in a URL can be extracted using the os.path.basename() function.
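
As a quick check of how `os.path.basename()` treats a URL, the minimal sketch below applies it to two of the addresses above without downloading anything. It assumes that the URL has no query string and no trailing slash; otherwise the result would not be a clean filename.

```python
import os

# Two of the URLs from the exercise
urls = ['http://www.gutenberg.org/files/580/580-0.txt',
        'http://www.gutenberg.org/files/1400/1400-0.txt']

# basename() returns everything after the last '/'
filenames = [os.path.basename(u) for u in urls]
print(filenames)
```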



In [None]:
import requests
import os 

urls = [ 'http://www.gutenberg.org/files/580/580-0.txt' ,
'http://www.gutenberg.org/files/1400/1400-0.txt' ,
'http://www.gutenberg.org/files/786/786-0.txt' ,
'http://www.gutenberg.org/files/766/766-0.txt' 
]


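for url in urls:
    response = requests.get(url)
    # Stop on HTTP errors instead of saving an error page to disk
    response.raise_for_status()
    response.encoding = 'utf-8'
    # Save the text under the name used by Project Gutenberg, e.g. '580-0.txt'
    with open( os.path.basename(url) , 'w' , encoding='utf-8' ) as out:
        out.write( response.text )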
    

## Exercise 9.2.

Write Python code which can download the titles and the URLs of Wikipedia articles whose titles contain the word 'Dutch'. Your code needs to display the first 30 results only.    
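
One way to approach this is the `opensearch` action of the Wikipedia API. The sketch below assumes that this action returns a JSON list of four elements: the search term, a list of titles, a list of descriptions and a list of URLs. The sample response is hypothetical and shortened, so that the code runs without a network connection.

```python
import json

# Hypothetical, shortened response in the assumed opensearch layout:
# [ search term , titles , descriptions , URLs ]
sample = '''["Dutch",
 ["Dutch", "Dutch language"],
 ["", ""],
 ["https://en.wikipedia.org/wiki/Dutch",
  "https://en.wikipedia.org/wiki/Dutch_language"]]'''

results = json.loads(sample)

# Pair every title (element 1) with its URL (element 3)
pairs = list(zip(results[1], results[3]))
for title, url in pairs:
    print('{}: {}'.format(title, url))
```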

In [None]:
import requests

baseURL = 'https://en.wikipedia.org/w/api.php'

# Query parameters; requests takes care of URL-encoding them
params = {
    'action': 'opensearch',
    'search': 'Dutch',
    'limit': 30,
    'format': 'json'
}

responseData = requests.get( baseURL , params=params )

# The response is a list: [ search term , titles , taglines , URLs ]
wikiResults = responseData.json()

for title, tagline, link in zip( wikiResults[1] , wikiResults[2] , wikiResults[3] ):
    print( 'Title: ' + title )
    print( 'Tagline: ' + tagline )
    print( 'Url: ' + link + '\n')
    


## Exercise 9.3.

Write an application in Python which can extract all the publications that have been added to a specific ORCID account. Make use of the ORCID API to do this. Information about an individual ORCID account can be obtained by appending the ORCID iD of that account to the following base URL: https://pub.orcid.org/v2.0/. The ORCID API returns data in XML. The list of publications can be found underneath "record/activities-summary/works/group".
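
The elements in the ORCID XML are namespaced, so search paths in `xml.etree.ElementTree` need a prefix-to-namespace mapping. The sketch below illustrates this on a hypothetical, heavily simplified fragment; a real record contains many more elements, and the sample title is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified fragment of a record
sample = '''<record xmlns:a="http://www.orcid.org/ns/activities"
        xmlns:w="http://www.orcid.org/ns/work"
        xmlns:c="http://www.orcid.org/ns/common">
  <a:activities-summary>
    <a:works>
      <a:group>
        <w:work-summary>
          <w:title><c:title>An example title</c:title></w:title>
        </w:work-summary>
      </a:group>
    </a:works>
  </a:activities-summary>
</record>'''

# Prefixes used in the search paths, mapped to their namespace URIs
ns = {'a': 'http://www.orcid.org/ns/activities',
      'w': 'http://www.orcid.org/ns/work',
      'c': 'http://www.orcid.org/ns/common'}

root = ET.fromstring(sample)

# Collect the title of every group underneath activities-summary/works
titles = [g.find('w:work-summary/w:title/c:title', ns).text
          for g in root.findall('a:activities-summary/a:works/a:group', ns)]
print(titles)
```

The prefixes in `ns` are free to choose; only the namespace URIs have to match those in the XML.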

In [None]:
orcid = '0000-0002-8469-6804'


import re
import requests
import xml.etree.ElementTree as ET


ns = {'o': 'http://www.orcid.org/ns/orcid' ,
's' : 'http://www.orcid.org/ns/search' ,
'h': 'http://www.orcid.org/ns/history' ,
'p': 'http://www.orcid.org/ns/person' ,
'pd': 'http://www.orcid.org/ns/personal-details' ,
'a': 'http://www.orcid.org/ns/activities' ,
'e': 'http://www.orcid.org/ns/employment' ,
'c': 'http://www.orcid.org/ns/common' , 
'w': 'http://www.orcid.org/ns/work'}


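try:
    orcidUrl = "https://pub.orcid.org/v2.0/" + orcid
    print( orcidUrl )
    
    response = requests.get( orcidUrl )
    response.raise_for_status()
    root = ET.fromstring(response.text)
    
    creationDate = root.find('h:history/h:submission-date' , ns ).text
    
    print('\nORCID created on:')
    print(creationDate)
    
    print('\nWorks:')
    
    # Every group element describes one publication
    works = root.findall('a:activities-summary/a:works/a:group' , ns )
    for w in works:
        title = w.find('w:work-summary/w:title/c:title' , ns ).text
        print(title)
        # Not every work has an external identifier such as a DOI
        doiEl = w.find('c:external-ids/c:external-id/c:external-id-url' , ns )
        if doiEl is not None:
            doi = doiEl.text
            print(doi)
            
except (requests.RequestException, ET.ParseError, AttributeError) as err:
    # AttributeError covers expected elements that are missing from the record
    print("Data could not be downloaded or parsed: {}".format(err))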


## Exercise 9.4.

The API developed by [OpenStreetMap](https://www.openstreetmap.org/) can be used, among other things, to find the precise geographic coordinates of a specific location. The base URL of this API is https://nominatim.openstreetmap.org/search. As the value of the 'q' parameter, you need to supply a string describing the location whose latitude and longitude you want to find. As values for the 'format' parameter, you can use 'xml' or 'json'. Use this API to find the longitude and the latitude of the addresses in the following list:

```
addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' , 'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']
```
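
The addresses contain spaces, which need to be escaped before they can be part of a URL. The sketch below shows one possible way to build the request URL with `urllib.parse.urlencode()` from the standard library; it only constructs the URL and does not contact the API. The variable names are chosen for illustration.

```python
from urllib.parse import urlencode

base_url = 'https://nominatim.openstreetmap.org/search'
address = 'Witte Singel 27 Leiden'

# urlencode() escapes the spaces in the address (as '+')
query = urlencode({'q': address, 'format': 'json'})
full_url = base_url + '?' + query
print(full_url)
```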

In [None]:

import requests
import xml.etree.ElementTree as ET

addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden' , 'Singel 425 Amsterdam' , 'Drift 27 Utrecht' , 'Broerstraat 4 Groningen']

url = 'https://nominatim.openstreetmap.org/search'

# Nominatim's usage policy asks clients to identify themselves;
# the value below is a placeholder, replace it with your own application name
headers = { 'User-Agent': 'data-acquisition-exercises' }

for a in addresses:
    # requests URL-encodes the spaces in the address
    response = requests.get( url , params={ 'q': a , 'format': 'xml' } , headers=headers )
    root = ET.fromstring( response.text )

    # Only the first match is reported
    place = root.find('place')
    if place is not None:
        print( '{}: {},{}\n'.format( a , place.attrib['lat'] , place.attrib['lon'] ) )



## Exercise 9.5.

Extract the titles of all the movies which are included on the [list of top rated movies](https://www.imdb.com/chart/top?ref_=ft_250) using web scraping. Also extract the URL of the webpage on IMDb describing each of these movies.
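
Links in the `href` attributes of a page are often relative to the site root. A minimal sketch of turning such a link into a full URL with `urllib.parse.urljoin()` is given below; the relative link is an illustrative value, not one scraped from the page.

```python
from urllib.parse import urljoin

page_url = 'https://www.imdb.com/chart/top?ref_=ft_250'

# Illustrative relative link, in the form an href attribute might take
href = '/title/tt0111161/'

# urljoin() resolves the relative link against the URL of the page
movie_url = urljoin(page_url, href)
print(movie_url)
```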

In [None]:
from bs4 import BeautifulSoup
import requests

url = 'https://www.imdb.com/chart/top?ref_=ft_250'

response = requests.get( url )
soup = BeautifulSoup( response.text ,"lxml")

# This code expects every movie title in a table cell with the class 'titleColumn'
movies = soup.find_all('td', {'class': 'titleColumn'})

for m in movies:
    # The direct <a> child holds the title and a relative link to the movie page
    for c in m.find_all("a" , recursive=False):
        movieTitle = c.text
        movieUrl = 'https://www.imdb.com' + c.get('href')
        print( '{}: {}'.format( movieTitle , movieUrl ) )

