< [Regular expressions](https://tdm.universiteitleiden.nl/Python/Regular%20expressions.html) | [Table of contents](https://tdm.universiteitleiden.nl/Python/) | [NLTK](https://tdm.universiteitleiden.nl/Python/NLTK.html) >

# Data acquisition


Data science projects typically start with the acquisition of data. In many cases, such data sets consist of secondary data made available by commercial or non-commercial organisations. 


## Direct downloads

The function 'urlopen()', from the module 'urllib.request', can be used to download files from the web. The code in the cell below downloads a text from Project Gutenberg, using the URL of this file. If successful, 'urlopen()' returns an HTTPResponse object, which represents the file that was downloaded. Its role can be compared to that of a file handle. The contents of the HTTPResponse object can be accessed using the 'read()' method, which returns the contents of the downloaded file as a bytes object. The contents can be processed more easily once they have been decoded into a string, using the 'decode()' method with the UTF-8 encoding. 

When you run the code that is given below, the contents of the file that is specified at the beginning become available as a string, assigned to the variable 'contents'.  

In [None]:
import urllib.request

url = "http://www.gutenberg.org/files/98/98-0.txt"

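# urlopen() returns an HTTPResponse object; the 'with' block closes it automatically
with urllib.request.urlopen(url) as response:
    print( type( response ) )
    # read() returns the raw bytes of the downloaded file
    raw_bytes = response.read()

# Decode the bytes into a string
contents = raw_bytes.decode("utf-8")

print(contents)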
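
The difference between the bytes object and the decoded string can be tried out without a download. The sketch below uses a short, made-up byte sequence (an assumption for illustration, not the actual Gutenberg file) that begins with a UTF-8 byte order mark, which some text files contain. Decoding with 'utf-8' keeps the mark as an invisible first character, whereas the 'utf-8-sig' codec removes it.

```python
# A made-up sample: UTF-8 byte order mark followed by text with a non-ASCII character
raw_bytes = b'\xef\xbb\xbfCaf\xc3\xa9 society'

# Plain UTF-8 decoding keeps the byte order mark as the character U+FEFF
as_utf8 = raw_bytes.decode("utf-8")

# The 'utf-8-sig' codec strips the byte order mark if it is present
as_utf8_sig = raw_bytes.decode("utf-8-sig")

print( type(raw_bytes) , len(raw_bytes) )
print( repr(as_utf8) )
print( repr(as_utf8_sig) )
```

If the first word of a downloaded text does not match the pattern you expect, a leftover byte order mark is one possible cause to check.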


##  Acquiring data via APIs


Organisations which aim to make their data available for reuse often do this through an Application Programming Interface (API). An API, simply put, is a software application which can process online requests for information, typically requests for specific data sets. The communication between the sender and the recipient of such requests needs to take place according to a protocol: the requests need to be formulated according to certain rules. As part of the protocol, the API responds to incoming queries by returning the requested data in the format that was specified. An API, in short, enables organisations to share some of their data with external parties, so that these parties can make use of the data in new applications. For many APIs, you need to create an account before you can send requests. This is the case, for instance, for the Twitter API. There are also APIs which are fully open, however. 

The Wikipedia API, for instance, is fully open. You can send requests to the API without having to provide an access key. This API can be used to find information about Wikipedia pages. More details about the precise format of the requests that you can send to Wikipedia can be found on the [documentation pages](https://www.mediawiki.org/wiki/API). They explain, among other things, that you can use the 'opensearch' function, in which the 'search' parameter needs to be assigned a specific search term. Once a request of this kind has been received, the API returns the Wikipedia pages whose titles match the provided search term. The API call can be sent to the Wikipedia API using roughly the same methods as those used for direct downloads. 
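
The query string of such a request can also be assembled with 'urllib.parse.urlencode()', which escapes the parameter values so that they are safe to use in a URL. The sketch below is a minimal illustration: the helper name 'build_opensearch_url' and the default value for 'limit' are choices made for this example, and no request is actually sent.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'

def build_opensearch_url(search_term, limit=5):
    """Return the URL of an 'opensearch' request for the given search term."""
    params = {
        'action': 'opensearch',   # the API function to call
        'format': 'json',         # the format of the response
        'search': search_term,    # the term to look for in page titles
        'limit': limit            # maximum number of results
    }
    # urlencode() joins the parameters with '&' and escapes unsafe characters
    return API_ENDPOINT + '?' + urlencode(params)

print( build_opensearch_url('den haag') )
```

Building the URL from a dictionary keeps the parameters readable and avoids problems with spaces or accented characters in the search term.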

In the code below, the JSON data that is returned by the Wikipedia API is processed further using the 'loads()' function from the 'json' library.

In [None]:
import urllib.request
import urllib.parse
import json

apiURL = 'https://en.wikipedia.org/w/api.php?action=opensearch&format=json&search='
searchTerm = "amsterdam"

# Percent-encode the search term so that spaces and special characters are safe in a URL
apiCall = apiURL + urllib.parse.quote(searchTerm)

# The User-Agent header identifies the sender of the request
wikiHeader = {'User-Agent':'p.a.f.verhaar@hum.leidenuniv.nl'}

wikiRequest = urllib.request.Request(apiCall, headers=wikiHeader)

# The 'with' block closes the response automatically
with urllib.request.urlopen(wikiRequest) as response:
    responseData = response.read().decode("utf-8")

print(responseData)

# The response is a JSON array: [search term, titles, descriptions, urls]
wikiResults = json.loads(responseData)

for title, tagline, pageUrl in zip(wikiResults[1], wikiResults[2], wikiResults[3]):
    print( 'Title: ' + title )
    print( 'Tagline: ' + tagline )
    print( 'Url: ' + pageUrl + '\n')
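
The layout of an 'opensearch' response can be examined without sending a request. The string below is a shortened, hand-written sample in the same layout (search term, titles, descriptions, URLs); the values are illustrative only, and the descriptions may well be empty strings in a real response.

```python
import json

# Hand-written sample in the layout of an 'opensearch' response (illustrative values)
sampleResponse = '''["amsterdam",
  ["Amsterdam", "Amsterdam Airport Schiphol"],
  ["", ""],
  ["https://en.wikipedia.org/wiki/Amsterdam",
   "https://en.wikipedia.org/wiki/Amsterdam_Airport_Schiphol"]]'''

# json.loads() converts the JSON array into nested Python lists
term, titles, descriptions, urls = json.loads(sampleResponse)

# Combine the parallel lists into one dictionary per result
results = [
    {'title': t, 'description': d, 'url': u}
    for t, d, u in zip(titles, descriptions, urls)
]

for r in results:
    print( r['title'] + ' -> ' + r['url'] )
```

Storing each result as a dictionary makes the later steps easier to read than indexing into three separate lists.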



The API developed by [Open Street Map](https://www.openstreetmap.nl/), named Nominatim, is also open. The 'search' function of this API needs to be used in combination with a textual description of an address. The API can return, among other things, the geographic coordinates of the address that is mentioned. Such data can be very useful in GIS applications. 

In this case, the data are returned in the XML format. The data in this format are processed using the 'xml.etree.ElementTree' module. 

In [None]:

import urllib.request
import urllib.parse
import xml.etree.ElementTree as ET

addresses = ['Grote Looiersstraat 17 Maastricht' , 'Witte Singel 27 Leiden']

for a in addresses:
    print(a)
    # Percent-encode the address so that spaces become %20
    url = 'https://nominatim.openstreetmap.org/search?q=' + urllib.parse.quote(a) + '&format=xml'

    # If requests are refused, try sending an identifying User-Agent header
    # via urllib.request.Request, as in the Wikipedia example above
    with urllib.request.urlopen( url ) as fp:
        result = fp.read().decode("utf-8")

    root = ET.fromstring(result)
    # find() returns the first 'place' element, or None if there are no results
    place = root.find('place')

    if place is not None:
        lat = place.attrib['lat']
        lon = place.attrib['lon']
        print( '{},{}\n'.format( lat , lon ) )
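
The way in which 'xml.etree.ElementTree' reads the coordinates can be tried out on a small XML string. The sample below is hand-written in the general layout of such a response, with 'place' elements that carry 'lat' and 'lon' attributes; the coordinates and names are made up for illustration and are not real API output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the layout of a Nominatim XML response (illustrative values)
sampleXml = '''<searchresults querystring="Witte Singel 27 Leiden">
  <place lat="52.1570" lon="4.4850" display_name="Witte Singel 27, Leiden, Nederland"/>
  <place lat="52.1571" lon="4.4851" display_name="Witte Singel, Leiden, Nederland"/>
</searchresults>'''

root = ET.fromstring(sampleXml)

# Collect the coordinates of every 'place' element as floating point numbers
coordinates = [
    ( float(place.attrib['lat']) , float(place.attrib['lon']) )
    for place in root.findall('place')
]

print( coordinates[0] )
```

Converting the attribute values to floats is useful when the coordinates are passed on to a mapping or GIS library, since the attributes themselves are plain strings.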



## Web scraping



When a website does not offer access to its structured data via a well-defined API, it may be an option to acquire the data that can be viewed on the site by making use of web scraping. This is a process in which a computer program processes the contents of a given webpage and extracts the data values that are needed. The function of a web scraping application can be compared to that of a web crawler or a bot. The aim of such an application is generally to copy information from a web page and to store it in a local database. 

One of the libraries that you can use in Python for scraping online resources is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). 

To scrape webpages, you first need to download them, using the code that was explained above. Once you have obtained the contents of a webpage, in the form of an HTML document, you can process its contents effectively by transforming it into a BeautifulSoup object. This object has a ‘find_all()’ method, which takes the name of a specific HTML tag and returns all occurrences of that tag. To extract all the links on a webpage, for instance, you need to extract all the ‘a’ tags which have the ‘href’ attribute. 
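
The behaviour of ‘find_all()’ can be tried out on a small HTML string before any page is downloaded. The sketch below assumes that the 'bs4' package is installed; the HTML fragment and its 'example.org' addresses are made up for illustration, and the standard library parser "html.parser" is used so that no additional parser is needed.

```python
from bs4 import BeautifulSoup

# A small hand-written HTML fragment (illustrative only)
html = '''<html><body>
<p>Some <a href="https://example.org/book1">The Republic</a> text.</p>
<p><a name="top">An anchor without a link</a></p>
<p><a href="/book2">The Problems of <i>Philosophy</i></a></p>
</body></html>'''

# "html.parser" is the parser from the standard library
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the 'a' tags that have an 'href' attribute
links = soup.find_all("a", href=True)

for link in links:
    print( link.get_text() , '->' , link["href"] )
```

The second ‘a’ tag is skipped because it has no ‘href’ attribute. The ‘get_text()’ method is used here because it also returns the text of links that contain nested tags, such as the third one.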

The code below scrapes data from a page on Project Gutenberg. It extracts the titles and the URLs of the books which are listed on [Project Gutenberg’s Philosophy Bookshelf](https://www.gutenberg.org/wiki/Philosophy_(Bookshelf)).

In [None]:
from bs4 import BeautifulSoup
import urllib.request
import re

## Project Gutenberg Bookshelves can be found at 
## https://www.gutenberg.org/wiki/Category:Bookshelf

url = 'https://www.gutenberg.org/wiki/Philosophy_(Bookshelf)'

# Download the page; the 'with' block closes the response automatically
with urllib.request.urlopen(url) as response:
    contents = response.read().decode("utf-8")

# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(contents,"lxml")

# Select only the 'a' tags that have an 'href' attribute
links = soup.find_all("a", href=True)

for link in links:
    # get_text() also works for links that contain nested tags
    linktext = link.get_text(strip=True)
    href = link.get("href")

    if re.search('gutenberg' , href , re.IGNORECASE):
        print(f"{linktext}: {href}")
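
The ‘href’ attribute of a link often holds a relative address, which does not contain the name of the site. A minimal sketch of how such addresses could be turned into full URLs with 'urllib.parse.urljoin()' is given below; the list of ‘href’ values is made up for illustration and does not come from the actual page.

```python
from urllib.parse import urljoin

# The address of the page on which the links were found
pageUrl = 'https://www.gutenberg.org/wiki/Philosophy_(Bookshelf)'

# Made-up 'href' values: one absolute, one relative to the site root,
# and one relative to the current page
hrefs = ['https://www.gutenberg.org/ebooks/100', '/ebooks/200', 'Ethics_(Bookshelf)']

# urljoin() leaves absolute URLs unchanged and resolves relative ones against pageUrl
fullUrls = [ urljoin(pageUrl, h) for h in hrefs ]

for u in fullUrls:
    print(u)
```

Resolving the addresses before filtering means that a test on the site name also matches links that were written as relative addresses in the HTML.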

