# 15. Downloading files


Projects in the field of *Text Mining* typically start with the acquisition of texts. Such data sets may consist of secondary data made available on the web by commercial or non-commercial organisations. Machine-readable texts can be downloaded from a wide range of sources:

* [Project Gutenberg](https://www.gutenberg.org/)
* [Distant Reading E-COST](https://github.com/distantreading/distantreading.github.io)
* [DBNL](https://dbnl.nl/)
* [Text Creation Partnership](https://github.com/textcreationpartnership/Texts)
* [WikiData](https://www.wikidata.org/)
* [Folger Shakespeare Digital Library](https://shakespeare.folger.edu/download/)
* [Oxford Text Archive](https://ota.bodleian.ox.ac.uk/repository/xmlui/)
* [Open Subtitles](https://www.opensubtitles.com/en)
* [TextGrid Repository](https://textgridrep.org/)
* [Internet Archive](https://archive.org/details/opensource)
* [Open Library](https://openlibrary.org/)
* [Delpher](https://www.delpher.nl/)
* [Archive of NRC (Dutch Newspaper)](https://www.nrc.nl/index/archief/)
* [Digital Library of the Netherlands](https://www.dbnl.org/)


The manual acquisition of texts may be tedious if the data collection consists of many files. In such situations, you can choose to write a program to carry out the downloads. One of the libraries you can work with for this purpose is [`requests`](https://requests.readthedocs.io/). 

As is the case for all libraries, the `requests` library needs to be imported before you can use it. 

In [1]:
import requests

The `requests` library can be used to make requests according to the [Hypertext Transfer Protocol (HTTP)](https://en.wikipedia.org/wiki/HTTP), which was developed to enable the exchange of information between computers. The computer that can provide information is typically referred to as a *server*, and the computer that requests information from this server is referred to as a *client*. In the HTTP protocol, the GET method is used to request data from a specified server. 

In Python, such a GET request can be sent to a server using the `get()` method in `requests`, as demonstrated below. Evidently, it is important that you are online when you run this code.

In [2]:
response = requests.get( 'https://www.universiteitleiden.nl')

This method returns a so-called `Response` object. It is an object which represents information about the downloaded web resource. In the example above, the result of the method is assigned to a variable named `response`.

Once this `Response` object has been created successfully, you can use various pieces of information about the resource that was requested.
The property `status_code`, for instance, indicates the HTTP status code that was returned by the server.
The status code 200 indicates that the request was successful. The infamous status code 404 indicates that the file was not found.
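As a small illustration of what these numbers mean, the sketch below looks up the standard phrases for a few status codes using `HTTPStatus` from Python's standard library. It does not contact any server. The selection of codes is arbitrary and chosen only for this example.

```python
from http import HTTPStatus

# Look up the standard phrase that belongs to each status code
phrases = {}
for code in (200, 301, 404, 500):
    phrases[code] = HTTPStatus(code).phrase
    print(code, phrases[code])
```

Codes in the 200 range signal success, codes in the 400 range signal a problem with the request, and codes in the 500 range signal a problem on the server.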

If the status code is indeed 200, the contents of the resource are accessible in the response's `content` property. This property holds the contents as bytes, however. When we download a webpage, we typically want to work with the data as text. To obtain this text, we can work with the `text` property of the `Response` object. It contains the full contents of the downloaded resource as a string.

Note that `requests` may not always understand a file's [character encoding](https://www.w3.org/International/questions/qa-what-is-encoding) automatically. You can set the correct character encoding explicitly using the `encoding` property.
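The difference between bytes and text, and the effect of choosing the wrong encoding, can be illustrated without any network access. The sketch below is a minimal example in plain Python. The string is made up for the illustration and stands in for the bytes a server might send.

```python
# Bytes as they might arrive from a server (UTF-8 encoded)
raw = 'Café Zoë'.encode('utf-8')

# Decoding with the correct encoding restores the original text
as_utf8 = raw.decode('utf-8')

# Decoding the same bytes with the wrong encoding garbles the accented letters
as_latin1 = raw.decode('latin-1')

print(len(raw), 'bytes')
print(as_utf8)
print(as_latin1)
```

Setting `response.encoding` tells `requests` which encoding to apply when it turns `content` into `text`.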

When you run the code given below, the contents of the webpage specified in the `get()` method (or, more precisely, the HTML code that was created to build the webpage) become available as a string, assigned to the variable named `contents`.

In [3]:

contents = ""
response = requests.get('https://www.universiteitleiden.nl')
print( response.status_code )

if response.status_code == 200:
    response.encoding = 'utf-8'
    contents = response.text
    print (contents)


200
<!DOCTYPE html>
<html lang="nl" data-version="1.185.00" >
<head>
















<!-- standard page html head -->

    <title>Home - Universiteit Leiden</title>
        <meta name="google-site-verification" content="o8KYuFAiSZi6QWW1wxqKFvT1WQwN-BxruU42si9YjXw"/>
        <meta name="google-site-verification" content="hRUxrqIARMinLW2dRXrPpmtLtymnOTsg0Pl3WjHWQ4w"/>

        <link rel="canonical" href="https://www.universiteitleiden.nl/"/>
<!-- icons -->
    <link rel="shortcut icon" href="/design-1.0/assets/icons/favicon.ico"/>
    <link rel="icon" type="image/png" sizes="32x32" href="/design-1.0/assets/icons/icon-32px.png"/>
    <link rel="icon" type="image/png" sizes="96x96" href="/design-1.0/assets/icons/icon-96px.png"/>
    <link rel="icon" type="image/png" sizes="195x195" href="/design-1.0/assets/icons/icon-195px.png"/>

    <link rel="apple-touch-icon" href="/design-1.0/assets/icons/icon-120px.png"/> <!-- iPhone retina -->
    <link rel="apple-touch-icon" sizes="180x180"
       

Using the `requests` library, you can basically download any type of file from the web, as long as it is retrievable via HTTP(s). 

[Project Gutenberg](http://gutenberg.org), which was mentioned above, is an online repository containing tens of thousands of machine-readable texts in a variety of formats. For Text Mining projects, the plain TXT format, with characters encoded according to UTF-8, is usually the most convenient format.

If you know the URL of a specific TXT file on *Project Gutenberg*, you can retrieve the contents of the online file using `requests`. To save the downloaded text to disk, you can make use of the `open()` function in 'write' mode.

The code below downloads the file with the URL https://www.gutenberg.org/files/4300/4300-0.txt. This file contains the full text of James Joyce's novel *Ulysses*. The full text of the novel is first assigned to a string named `full_text` and then saved to disk using the `open()` function.

In [5]:
text_url = 'https://www.gutenberg.org/files/4300/4300-0.txt'
title = 'Ulysses'

response = requests.get(text_url)

if response:
    response.encoding = 'utf-8' 
    full_text = response.text 
    out = open( f"{title}.txt" , 'w' , encoding = 'utf-8')
    out.write( full_text.strip() )
    out.close()


Note that the `if` keyword in the code above does not explicitly test whether the status code is 200. A `Response` object, which is created when you use the `get()` method from `requests`, evaluates to `True` whenever the status code is below 400, which includes the status code 200, and to `False` when the server reports a client or server error such as 404 or 500.
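The `open()`, `write()` and `close()` steps used in the download code can also be written with a `with` statement, which closes the file automatically. The sketch below shows this pattern under a few assumptions: the text and title are placeholders rather than a real download, and a temporary folder is used so that no files are left behind.

```python
import os
import tempfile

full_text = '  Example text standing in for a downloaded novel.\n'
title = 'Example'

# A temporary folder keeps this sketch from leaving files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, f'{title}.txt')

    # The file is closed automatically at the end of the with-block,
    # even if write() raises an error
    with open(path, 'w', encoding='utf-8') as out:
        out.write(full_text.strip())

    # Read the file back to check what was saved
    with open(path, encoding='utf-8') as saved:
        saved_text = saved.read()

print(saved_text)
```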



### Exercise 10.1.

The list below contains a number of URLs. They are the web addresses of texts created for the [Project Gutenberg](https://www.gutenberg.org) website.

```
urls = [ 'https://www.gutenberg.org/files/580/580-0.txt' ,
'https://www.gutenberg.org/files/1400/1400-0.txt' ,
'https://www.gutenberg.org/files/786/786-0.txt' ,
'https://www.gutenberg.org/files/766/766-0.txt' 
]
```

Write a program in Python that downloads all the files in this list and stores them in the current directory.

As filenames, use the same names that are used by Project Gutenberg (e.g. '580-0.txt' or '1400-0.txt').

The basename in a URL can be extracted using the [`os.path.basename()`](https://docs.python.org/3/library/os.path.html#os.path.basename) function.
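As a brief illustration of this function, the sketch below applies `os.path.basename()` to the URL of the *Ulysses* file used earlier. The second URL is hypothetical and only serves to show that a query string needs to be removed first, here with `urlparse()` from the standard library.

```python
import os.path
from urllib.parse import urlparse

url = 'https://www.gutenberg.org/files/4300/4300-0.txt'

# Everything after the last slash is returned
filename = os.path.basename(url)
print(filename)

# Hypothetical URL with a query string: take the path part of the URL first
url_with_query = 'https://example.org/texts/novel.txt?download=1'
filename_from_query = os.path.basename(urlparse(url_with_query).path)
print(filename_from_query)
```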


In [None]:
import requests
import os.path

# Recreate the given list using copy and paste
urls = [  
]

# We use a for-loop to take the same steps for each item in the list:
for url in urls:
    # 1. Download the file contents
    
    # 1a. Force the textual contents to be interpreted as UTF-8 encoded, because the website does not send the text encoding
    
    # 2. Use basename() to get a suitable filename
    
    # 3. Open the file in write mode and write the downloaded file contents to the file
    
    # 4. Close the file
    
    

When you download text files from Project Gutenberg, it is important to bear in mind that the files all contain a 'boilerplate', before and after the actual full text. These headers and footers contain some legal texts, and often some information about the digitisation process. 

The Gutenberg header and footer obviously need to be removed from the file before you start to analyse the text. 

The function `remove_pg_boilerplate()`, defined below, removes the boilerplate, based on the strings that are used at the end of the header ('START OF THE PROJECT GUTENBERG EBOOK') and at the beginning of the footer ('END OF THE PROJECT GUTENBERG EBOOK'). The function selects all the text in between these two strings.

In [None]:
import re 

def remove_pg_boilerplate(complete_file):
    
    lines = re.split( r'\n' , complete_file )
    read_mode = 0 
    full_text = ''
    
    for line in lines:
        # Stop reading at the footer marker, before the marker line itself is added
        if re.search( r'\*{3,}\s+END\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK' ,  str(line) , re.IGNORECASE ):
            read_mode = 0

        if read_mode == 1:
            full_text += line + '\n'
            
        # Start reading on the line that follows the header marker
        if re.search( r'\*{3,}\s+START\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK' ,  str(line) , re.IGNORECASE ):
            read_mode = 1
            
    full_text = full_text.strip()
    # Drop a leading 'Produced by ...' credit line, if there is one
    if re.search( r'^Produced by' , full_text , re.IGNORECASE ) and '\n' in full_text:
        full_text = full_text[ full_text.index('\n') : ].strip()
    return full_text

In the code below, the result of the `remove_pg_boilerplate()` function is assigned to a variable named `cleaned_text`. 

In [None]:
cleaned_text = remove_pg_boilerplate(full_text) 
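To see the marker-based selection at work without downloading anything, the sketch below applies the same idea to a made-up miniature file. It is an alternative formulation using a single regular expression, not the function defined above. The name `extract_body` and the sample text are assumptions made for this illustration.

```python
import re

# A made-up miniature file with a header, a body and a footer
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Chapter 1\n"
    "It was a dark and stormy night.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Legal notice\n"
)

# Capture everything between the line with the START marker and the END marker
pattern = re.compile(
    r'\*{3,}\s+START\s+OF\s+TH(?:E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK[^\n]*\n'
    r'(.*?)'
    r'\*{3,}\s+END\s+OF\s+TH(?:E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK',
    re.IGNORECASE | re.DOTALL,
)

def extract_body(text):
    """Return the text between the markers, or the whole text if none are found."""
    match = pattern.search(text)
    return match.group(1).strip() if match else text.strip()

body = extract_body(sample)
print(body)
```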

The code below defines a dictionary containing both the URLs and the titles of a number of books available at Project Gutenberg. The files that are listed in this dictionary can all be downloaded using the steps that have been discussed. 

In [None]:
import requests
import re

gutenberg_files = {
    'http://www.gutenberg.org/files/158/158-0.txt':'Emma',
    'http://www.gutenberg.org/files/161/161-0.txt':'Sense and Sensibility',
    'http://www.gutenberg.org/files/1342/1342-0.txt':'Pride and Prejudice'
}

for url in gutenberg_files:
    print("Downloading " + gutenberg_files[url] + " ...")
    response = requests.get(url)
    title = re.sub( r'\s+' , '_' ,  gutenberg_files[url] )

    if response:
        response.encoding = 'utf-8'
        full_text = remove_pg_boilerplate(response.text)           
        out = open( f"{title}.txt" , 'w' , encoding = 'utf-8')
        out.write( full_text.strip() )
        out.close()

print('\nDone!')    
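The step that turns a title into a filename can be made slightly more robust. The sketch below is one possible approach, not part of the code above: the helper name `title_to_filename` is hypothetical, and the choice to drop punctuation is an assumption about what makes a convenient filename.

```python
import re

def title_to_filename(title, extension='.txt'):
    """Turn a book title into a filename without spaces or punctuation."""
    # Remove characters other than letters, digits, underscores, spaces and hyphens
    safe = re.sub(r'[^\w\s-]', '', title)
    # Replace each run of whitespace with a single underscore
    safe = re.sub(r'\s+', '_', safe.strip())
    return safe + extension

for title in ['Emma', 'Sense and Sensibility', 'Pride and Prejudice']:
    print(title_to_filename(title))
```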

## Extract text from PDF files

The notebooks written for this tutorial assume that texts are stored as TXT files with the characters encoded as UTF-8. If you have downloaded texts in other formats, you need to convert these texts into TXT first. 

There are a number of packages that you can use to convert PDF files into TXT. One of these is `PyPDF2`.

This library is not part of the Anaconda distribution of Python, so it needs to be installed before you can use it. 

In [None]:
# Remove the hashes in the two lines that follow!

#import sys
#!pip install PyPDF2

Using `PyPDF2`, you can create a reader object for a PDF file. The page objects provided by this reader have a method for extracting text (named `extractText()` in older versions of `PyPDF2` and `extract_text()` in more recent ones), which converts a page of the PDF into plain text. 

Success cannot be guaranteed, however. If a page in the PDF contains multiple columns, some of the lines may get mixed up. If the file uses a character encoding system other than UTF-8, there can also be some issues with the text of the file. The text that is created by the script will probably need to be edited manually, unfortunately. 

In [None]:
import PyPDF2  
import re

pdf_url = 'https://scholarlypublications.universiteitleiden.nl/access/item%3A2729408/view'

file = 'downloaded.pdf'

response = requests.get(pdf_url)
if response:
    out = open( file , 'wb')
    out.write(response.content)
    out.close()


print( f'Reading {file} ...')
    
pdf_obj = open( file , 'rb')

filename = file[  : file.rindex('.') ] + '.txt'  

out = open( filename , 'w' , encoding = 'utf-8' )
# PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
pdfReader = PyPDF2.PdfReader(pdf_obj)  
num_pages = len(pdfReader.pages)

print( f'The PDF file has {num_pages} pages.\n' )  

for i in range(0,num_pages):
    page_obj = pdfReader.pages[i]  

    # extract_text() replaces the older extractText()
    txt = page_obj.extract_text()
    txt = re.sub('\n\n' , '\n' , txt)

    out.write(txt)  


pdf_obj.close()
out.close()
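Part of the manual clean-up of extracted text can be automated. The sketch below is a minimal example of such a step, independent of `PyPDF2`. The function name `tidy_extracted_text` and the sample string are assumptions made for this illustration. Note that replacing `'\n\n'` by `'\n'` only halves a run of line breaks, whereas the pattern `\n{2,}` collapses a whole run at once.

```python
import re

def tidy_extracted_text(txt):
    """Remove trailing spaces and collapse runs of line breaks."""
    # Remove spaces and tabs at the end of each line
    txt = re.sub(r'[ \t]+\n', '\n', txt)
    # Collapse any run of two or more line breaks into a single one
    txt = re.sub(r'\n{2,}', '\n', txt)
    return txt.strip()

sample = 'Line one  \n\n\n\nLine two\n'
tidied = tidy_extracted_text(sample)
print(tidied)
```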


## Exercise 15.1

1. Download all the text files that are listed in the following dictionary. 

```
gutenberg_files = {
    'https://www.gutenberg.org/files/98/98-0.txt' :
        'A Tale of Two Cities',
    'https://www.gutenberg.org/files/580/580-0.txt':
        'The Pickwick Papers'
}
```

Save these files in the folder named 'Corpus'. Make sure that the Gutenberg boilerplates are removed from the texts. 
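Saving into a folder requires that the folder exists and that the folder name is part of the path. The sketch below shows one way to handle both, using `os.makedirs()` and `os.path.join()`. The helper name `path_in_folder` is hypothetical, and a temporary directory stands in for your working directory so that the sketch leaves no files behind.

```python
import os
import tempfile

def path_in_folder(folder, filename):
    """Create `folder` if it does not exist yet and return the path of `filename` inside it."""
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, filename)

with tempfile.TemporaryDirectory() as working_dir:
    corpus = os.path.join(working_dir, 'Corpus')
    path = path_in_folder(corpus, 'example.txt')

    # Write a placeholder text to the file inside the folder
    with open(path, 'w', encoding='utf-8') as out:
        out.write('Example text')

    saved_files = os.listdir(corpus)

print(saved_files)
```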

In [None]:
import os

gutenberg_files = {
    'https://www.gutenberg.org/files/98/98-0.txt' :
        'A Tale of Two Cities',
    'https://www.gutenberg.org/files/580/580-0.txt':
        'The Pickwick Papers'
   }

