< [Regular expressions](Regular%20expressions.html) | [Table of contents](index.html) | [NLTK](NLTK.html) >

# Data acquisition

Data science projects typically start with the acquisition of data. In many cases, such data sets consist of secondary data made available by commercial or non-commercial organisations. This part of the tutorial explains how you can obtain such online data sets.

## Direct downloads

If the resources that you are interested in are available directly via the web (i.e. via HTTP or HTTPS), you can download these files by making use of the `requests` library. As is the case for all libraries, the `requests` library needs to be imported before you can use it. 

In [None]:
import requests

To download data from a certain web address, you can make a GET request. In Python, such a request can be sent using the `get()` method in `requests`, as demonstrated below. Evidently, it is important that you are online when you run the code below.

In [None]:
response = requests.get( 'http://www.universiteitleiden.nl')

This method results in a so-called Response object. It is an object which represents information about the downloaded web resource. In the example above, the result of the method is assigned to a variable named `response`.

Once this Response object has been created successfully, you can request various pieces of information about the resource that was downloaded. The attribute `status_code`, for instance, holds the status code that was returned by the server. The status code 200 indicates that the download was successful and the infamous status code 404 indicates that the file was not found.

If the status code is indeed 200, the `text` attribute of the Response object contains the full contents of the downloaded web page. The `encoding` attribute determines which character encoding is used to turn the downloaded bytes into this text. If you know the encoding of the resource, it is advisable to specify it explicitly.

When you run the code that is given below, the contents of the webpage that is specified at the beginning (or more specifically, the HTML code that was created to build the webpage) become available as a string in `response.text`, which the code prints.

In [None]:
import requests

response = requests.get('http://www.universiteitleiden.nl')
print( response.status_code )

if response.status_code == 200:
    response.encoding = 'utf-8' 
    print (response.text)
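
The role of the `encoding` attribute can be illustrated without a network connection. A server sends bytes, and the same bytes produce different strings depending on the encoding that is used to decode them. The sketch below uses a made-up sample word rather than a real download, and decodes it with Python's built-in `decode()` method.

```python
# Hypothetical sample: the bytes a server might send for the word 'café',
# encoded as UTF-8.
raw_bytes = 'café'.encode('utf-8')

# Decoding with the encoding that was used to create the bytes restores the text.
decoded_correctly = raw_bytes.decode('utf-8')

# Decoding with a different encoding yields garbled characters instead.
decoded_wrongly = raw_bytes.decode('latin-1')

print(decoded_correctly)
print(decoded_wrongly)
```

Setting `response.encoding = 'utf-8'` before reading `response.text` asks `requests` to decode the downloaded bytes as UTF-8.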


Using the `requests` library, you can download any type of file from the web, as long as it is retrievable via HTTP or HTTPS. The code below, for instance, downloads a specific text from the Project Gutenberg website.

In [None]:
url = "http://www.gutenberg.org/files/98/98-0.txt"

response = requests.get(url)

if response:
    response.encoding = 'utf-8' 
    print (response.text) 


Note that the `if` statement in the code above does not explicitly test whether the status code is 200. A Response object, which is created when you use the `get()` method from `requests`, evaluates to `True` when the status code is lower than 400, and to `False` when the server reported an error such as 404.
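
The truth value of a Response object can be explored offline. The sketch below builds empty Response objects by hand and sets their status codes directly, which is an assumption made purely for illustration; in normal use, these objects are created by `get()`. In `requests`, the truth value follows the `ok` property, which is `True` for any status code below 400. The function name `is_successful` is chosen for this example only.

```python
import requests

def is_successful(status_code):
    """Return the truth value of a Response object with the given status code."""
    # Responses normally come from requests.get(); this one is built by hand
    # so that the example runs without a network connection.
    response = requests.Response()
    response.status_code = status_code
    return bool(response)

for code in (200, 301, 404, 500):
    print(code, is_successful(code))
```

When a failed download should stop the program, the `raise_for_status()` method of the Response object can be used instead: it raises an exception for 4xx and 5xx status codes.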

The `requests` library can also be used to retrieve data that can be accessed via an API.

## Acquiring data via APIs

Organisations which aim to make their data available for reuse often do this through an Application Programming Interface (API). An API, simply put, is a software application which can process online requests for information. It enables organisations to share some of the data that they have in a structured format, so that external parties can make use of these data in new applications.

The communication between the sender and the recipient of such requests needs to take place according to a specific protocol; the requests need to be formulated according to certain rules.

For many APIs, you need to create an access key before you can send requests. This is the case, for instance, for the Twitter API. There are also many APIs which are fully open, however, such as the Wikipedia API. You can send requests to this API without having to provide an access key.

To find information about Wikipedia pages via the Wikipedia API, for instance, you need to send a number of parameters to the endpoint of the API, which is available at [https://en.wikipedia.org/w/api.php](https://en.wikipedia.org/w/api.php). You need to provide values for the following parameters:

```
action = opensearch
search = [search term]
limit = [number]
format = [json or xml]
```

The search term, to be provided after ‘search’, is the word that must occur in the title of the Wikipedia page. If the ‘limit’ parameter is omitted, the API will return 10 results by default.

The following API call returns 20 Wikipedia pages whose titles contain the word ‘Leiden’. The call returns the requested data in the JSON format.

[https://en.wikipedia.org/w/api.php?action=opensearch&search=Leiden&limit=20&namespace=0&format=json](https://en.wikipedia.org/w/api.php?action=opensearch&search=Leiden&limit=20&namespace=0&format=json)
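
Such a query string does not have to be typed by hand. As a minimal sketch, the parameters listed above can be stored in a dictionary and encoded with `urlencode()` from Python's standard library; the variable names `endpoint`, `params` and `api_call` are chosen for this example only. The `get()` function in `requests` also accepts such a dictionary directly via its `params` argument.

```python
from urllib.parse import urlencode

endpoint = 'https://en.wikipedia.org/w/api.php'

# The parameters of the API call, as name-value pairs
params = {
    'action': 'opensearch',
    'search': 'Leiden',
    'limit': 20,
    'namespace': 0,
    'format': 'json',
}

# urlencode() joins the pairs with '&' and escapes special characters
api_call = endpoint + '?' + urlencode(params)
print(api_call)
```

An advantage of this approach is that search terms containing spaces or other special characters are escaped automatically.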

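The `json()` method of the Response object parses the JSON data. For the call above, the result is a list of four items: the search term, a list of titles, a list of descriptions and a list of URLs.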

In [None]:
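import requests

## Endpoint of the Wikipedia API, with the 'opensearch' action already selected
baseUrl = 'https://en.wikipedia.org/w/api.php?action=opensearch'

search = 'leiden'
limit = 20
## Named dataFormat to avoid shadowing Python's built-in format() function
dataFormat = 'json'

apiCall = '{}&search={}&limit={}&format={}'.format( baseUrl , search , limit , dataFormat )
print(apiCall)

response = requests.get( apiCall )

## Only parse the result when the request succeeded
if response:
    ## The parsed result is a list: [ search term , titles , descriptions , urls ]
    wikiResults = response.json()

    for i in range( 0 , len(wikiResults[1]) ):
        print( 'Title: ' + wikiResults[1][i] )
        print( 'Tagline: ' + wikiResults[2][i] )
        print( 'Url: ' + wikiResults[3][i] + '\n')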
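
The nested indices in the loop above reflect the structure of the parsed result: a list holding the search term, a list of titles, a list of descriptions and a list of URLs. The sketch below makes this structure explicit without contacting the API. The JSON string is a shortened, made-up sample in the same shape, not actual output of the API, and `sample_json` and `pages` are names chosen for this example.

```python
import json

# Made-up sample in the shape of an opensearch result:
# [ search term , titles , descriptions , urls ]
sample_json = '''
["leiden",
 ["Leiden", "Leiden University"],
 ["", ""],
 ["https://en.wikipedia.org/wiki/Leiden",
  "https://en.wikipedia.org/wiki/Leiden_University"]]
'''

# json.loads() does for a string what response.json() does for a response
search_term, titles, descriptions, urls = json.loads(sample_json)

# zip() walks through the three lists in parallel
pages = [
    {'title': title, 'description': description, 'url': url}
    for title, description, url in zip(titles, descriptions, urls)
]

for page in pages:
    print(page['title'], '->', page['url'])
```

Storing each page as a dictionary with named keys tends to be easier to read than addressing the nested lists by position.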


## Web scraping

When a website does not offer access to its structured data via a well-defined API, it may be an option to acquire the data that can be viewed on a site by making use of web scraping. This is a process in which a computer program processes the contents of a given webpage and extracts the data values that are needed. The function of a web scraping application can be compared to that of a web crawler or a bot. The aim of such an application is generally to copy information from a web page and to store it in a local database.

One of the libraries that you can use in Python for scraping online resources is `Beautiful Soup`.

To scrape webpages, you first need to download them. This can be done using the code that was explained above. Once you have obtained the contents of a webpage, in the form of an HTML document, you can process these contents effectively by transforming them into a BeautifulSoup object. This object has a `find_all()` method, which you can use to find all occurrences of a specific HTML tag. To extract all the links on a webpage, for example, you need to extract all the `a` tags which have the `href` attribute.
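
As a minimal illustration of `find_all()`, the sketch below parses a small, made-up HTML fragment instead of a downloaded page. It assumes that the `bs4` package is installed, and it uses Python's built-in `html.parser` so that no additional parser is required. Passing `href=True` restricts the results to `a` tags that have an `href` attribute.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment, standing in for the contents of a downloaded page
html = '''
<html><body>
  <a href="https://www.gutenberg.org/ebooks/1">First book</a>
  <a name="top">An anchor without a link</a>
  <a href="/ebooks/2">Second book</a>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Collect the text and the address of every <a> tag that has an href attribute
links = [(tag.get_text(), tag['href']) for tag in soup.find_all('a', href=True)]

for text, address in links:
    print(f"{text}: {address}")
```

The second `a` tag in the fragment is skipped, because it has no `href` attribute.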

The code below scrapes data from a page on Project Gutenberg. It extracts the titles and the URLs of all the books which are listed on Project Gutenberg’s Philosophy Bookshelf.

In [None]:
from bs4 import BeautifulSoup
import requests
import re

## Project Gutenberg Bookshelves can be found at 
## https://www.gutenberg.org/wiki/Category:Bookshelf
url = 'https://www.gutenberg.org/wiki/Philosophy_(Bookshelf)'
response = requests.get( url )

soup = BeautifulSoup( response.text ,"lxml")

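links = soup.find_all("a")

for link in links:
    linktext = link.string
    ## Use a separate name, so that the page address in url is not overwritten
    href = link.get("href")

    ## Keep only the links whose address mentions 'gutenberg'
    if re.search('gutenberg' , str(href) , re.IGNORECASE):

        ## Remove any slashes at the start of the address
        href = re.sub( r'^[/]*' , '' , href )
        print(f"{linktext}: {href}")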

Note that the code above also makes use of the `re` library, to examine the form of the URLs and to get rid of unwanted characters.

Once you have created a list of URLs using the method outlined above, you can also download all the texts that were found, using the `get()` method from the `requests` library.
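
A minimal sketch of this last step is given below. The function name `download_all`, the file naming scheme and the `fetch` parameter are assumptions made for this example; `fetch` defaults to `requests.get` and exists only so that the function can be tried out without a network connection.

```python
from pathlib import Path

import requests

def download_all(urls, target_dir, fetch=requests.get):
    """Download every URL in urls and save each text as a numbered file."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    saved = []
    for number, url in enumerate(urls, start=1):
        response = fetch(url)

        # Skip resources that could not be downloaded successfully
        if not response:
            continue

        response.encoding = 'utf-8'
        path = target / f"text_{number}.txt"
        path.write_text(response.text, encoding='utf-8')
        saved.append(path)

    # The paths of the files that were actually written
    return saved
```

Calling `download_all(urls, 'texts')`, for example, would save each successfully downloaded text as a numbered file in a folder named `texts`.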
