# Introduction

## Acquiring the texts for your corpus

Projects in the field of Text and Data Mining (TDM) typically start with the preparation of a corpus. Machine-readable texts can be downloaded from a wide range of sources:

* [Project Gutenberg](https://www.gutenberg.org/)
* [Distant Reading E-COST](https://github.com/distantreading/distantreading.github.io)
* [DBNL](https://dbnl.nl/)
* [Text Creation Partnership](https://github.com/textcreationpartnership/Texts)
* [WikiData](https://www.wikidata.org/)
* [Folger Shakespeare Digital Library](https://shakespeare.folger.edu/download/)
* [Oxford Text Archive](https://ota.bodleian.ox.ac.uk/repository/xmlui/)
* [OpenSubtitles](https://www.opensubtitles.com/)
* [TextGrid Repository](https://textgridrep.org/)
* [Internet Archive](https://archive.org/details/opensource)
* [Open Library](https://openlibrary.org/)
* [Delpher](https://www.delpher.nl/)
* [Archive of NRC (Dutch Newspaper)](https://www.nrc.nl/index/archief/)
* [Digital Library of the Netherlands](https://www.dbnl.org/)


## Downloading a text from Project Gutenberg

[Project Gutenberg](http://gutenberg.org), for example, is an online repository containing tens of thousands of machine-readable texts in a variety of formats.

For TDM projects, plain TXT files, with characters encoded according to UTF-8, are usually the most convenient format. If you know the URL of a specific TXT file on *Project Gutenberg*, you can download this text using the `requests` library.

The code below downloads the file with the URL [https://www.gutenberg.org/files/4300/4300-0.txt](https://www.gutenberg.org/files/4300/4300-0.txt). This file contains the full text of James Joyce's novel *Ulysses*.

In [None]:
import requests

text_url = 'https://www.gutenberg.org/files/4300/4300-0.txt'


response = requests.get( text_url )

if response:
    response.encoding = 'utf-8'
    full_text = response.text

If you have managed to run this code successfully, the string named `full_text` should contain the full contents of the file that was downloaded. We can easily calculate the total number of characters in the text file by making use of `len()`.

In [None]:
print(len(full_text))
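
Setting `response.encoding` to `'utf-8'` before reading `response.text` matters because the same bytes produce different characters under different encodings. The sketch below runs offline on a made-up sample string (not on the downloaded novel) and also illustrates that `len()` counts characters rather than bytes. All variable names are illustrative.

```python
# Hypothetical sample string; any text with a non-ASCII character will do
sample = 'café'

# Encoding turns the string into bytes, as they would travel over the network
raw_bytes = sample.encode('utf-8')

# Decoding with the right encoding restores the original text
decoded_ok = raw_bytes.decode('utf-8')

# Decoding with the wrong encoding garbles the accented character
decoded_wrong = raw_bytes.decode('latin-1')

print(decoded_ok)       # café
print(decoded_wrong)    # cafÃ©
print(len(sample))      # 4 characters
print(len(raw_bytes))   # 5 bytes, because 'é' takes two bytes in UTF-8
```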

When you download text files from Project Gutenberg, it is important to bear in mind that the files all contain boilerplate text before and after the actual full text. These headers and footers contain some legal texts, and often some information about the digitisation process.

The Gutenberg header and footer need to be removed from the file before you start to analyse the text.

The function `remove_pg_boilerplate()`, defined below, removes the boilerplate, based on the strings that are used at the end of the header ('START OF THE PROJECT GUTENBERG EBOOK') and at the beginning of the footer ('END OF THE PROJECT GUTENBERG EBOOK'). The function selects all the text in between these two strings.

In [None]:
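import re

# Marker lines that close the header and open the footer
START_MARKER = r'\*{3,}\s+START\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK'
END_MARKER = r'\*{3,}\s+END\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK'

def remove_pg_boilerplate(complete_file):

    lines = complete_file.split('\n')
    read_mode = False
    kept_lines = []

    for line in lines:
        # Stop at the footer marker first, so the marker line itself is not kept
        if re.search( END_MARKER , line , re.IGNORECASE ):
            read_mode = False

        if read_mode:
            kept_lines.append(line)

        # Start keeping lines from the line after the header marker
        if re.search( START_MARKER , line , re.IGNORECASE ):
            read_mode = True

    full_text = '\n'.join(kept_lines).strip()

    # Drop a leading 'Produced by ...' credit line, if there is one
    if re.search( r'^Produced by' , full_text , re.IGNORECASE ) and '\n' in full_text:
        full_text = full_text[ full_text.index('\n') : ].strip()
    return full_text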
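
To see the idea at work without downloading anything, the sketch below applies a different, simpler technique to a shortened, made-up stand-in for a Gutenberg file: it locates the two marker lines with a regular expression each and slices out everything in between. The function name `slice_between_markers` and the sample text are hypothetical, and this is an alternative illustration, not a replacement for `remove_pg_boilerplate()`.

```python
import re

# Hypothetical, shortened stand-in for a Project Gutenberg file
sample = (
    "The Project Gutenberg eBook of Example\n"
    "Some licence text.\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "Chapter 1\n"
    "It was a dark and stormy night.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "More licence text.\n"
)

# With re.MULTILINE, ^ and $ match at the start and end of every line
FLAGS = re.IGNORECASE | re.MULTILINE
START_LINE = re.compile(r'^\*{3,}\s*START OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$', FLAGS)
END_LINE = re.compile(r'^\*{3,}\s*END OF TH(?:E|IS) PROJECT GUTENBERG EBOOK.*$', FLAGS)

def slice_between_markers(complete_file):
    start = START_LINE.search(complete_file)
    end = END_LINE.search(complete_file)
    # If either marker is missing, return the text unchanged (minus outer whitespace)
    if start is None or end is None:
        return complete_file.strip()
    # Keep everything after the START line and before the END line
    return complete_file[start.end():end.start()].strip()

body = slice_between_markers(sample)
print(body)
```

Falling back to the unchanged text when a marker is missing is a design choice made for this sketch; depending on the project, raising an error may be preferable.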

In the code below, the result of the `remove_pg_boilerplate()` function is assigned to a variable named `cleaned_text`.

In [None]:
cleaned_text = remove_pg_boilerplate(full_text) 

The text can be saved on your computer using the `open()` function in write ('`w`') mode.

In [None]:
with open( 'ulysses.txt' , 'w' , encoding = 'utf-8') as fh:
    fh.write( cleaned_text )
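
A quick way to check that a file was written correctly is to read it back with the same encoding and compare the result with the original string. The sketch below does this in a temporary folder so that no existing file is touched; the sample string and the file name `sample.txt` are made up for the illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the cleaned text of a novel
sample_text = 'A short sample text with an accent: café'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'sample.txt'

    # Write the text using UTF-8 ...
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write(sample_text)

    # ... and read it back using the same encoding
    with open(path, encoding='utf-8') as fh:
        read_back = fh.read()

print(read_back == sample_text)
```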

The code below defines a dictionary containing both the URLs and the titles of a number of books available at Project Gutenberg. The files that are listed in this dictionary can all be downloaded using the steps that have been discussed. 

In [None]:
import requests
import re

gutenberg_files = {
    'http://www.gutenberg.org/files/158/158-0.txt':'Emma',
    'http://www.gutenberg.org/files/161/161-0.txt':'Sense and Sensibility',
    'http://www.gutenberg.org/files/1342/1342-0.txt':'Pride and Prejudice'
}

for url in gutenberg_files:
    print("Downloading " + gutenberg_files[url] + " ...")
    response = requests.get(url)
    # Build a file name from the title, e.g. 'Sense_and_Sensibility.txt'
    title = re.sub( r'\s+' , '_' ,  gutenberg_files[url] ) + '.txt'

    if response:
        response.encoding = 'utf-8'
        full_text = remove_pg_boilerplate(response.text)
        # The with-statement closes the file automatically
        with open( title , 'w' , encoding = 'utf-8') as out:
            out.write( full_text.strip() )

print('\nDone!')    
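
The loop above derives a file name from each title by replacing whitespace with underscores. The sketch below isolates that step so it can be tried without downloading anything. The helper name `title_to_filename` and the choice to append a `.txt` extension are assumptions made for this illustration.

```python
import re

def title_to_filename(title):
    """Replace each run of whitespace with one underscore and add '.txt'."""
    return re.sub(r'\s+', '_', title.strip()) + '.txt'

for title in ['Emma', 'Sense and Sensibility', 'Pride and Prejudice']:
    print(title_to_filename(title))
```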

## Extract text from PDF files

The notebooks written for this tutorial assume that texts are stored as TXT files with the characters encoded as UTF-8. If you have downloaded texts in other formats, you need to convert these texts into TXT first.

There are a number of packages that you can use to convert PDF files into TXT. One of these is `PyPDF2`.

This library is not part of the Anaconda distribution of Python, so it needs to be installed before you can use it. 

In [None]:
import sys
!pip install PyPDF2

Using `PyPDF2`, you can create a reader object for a PDF file. Each page of the file is then available as a page object with a text-extraction method (`extract_text()` in current releases, `extractText()` in older ones), which converts the page into plain text.

Success cannot be guaranteed, however. If a page in the PDF contains multiple columns, some of the lines may get mixed up. If the file uses a character encoding system other than UTF-8, there can also be some issues with the text of the file. The text that is created by the script will probably need to be edited manually, unfortunately.

In [None]:
import re
import requests
import PyPDF2

pdf_url = 'https://scholarlypublications.universiteitleiden.nl/access/item%3A2729408/view'
file = 'downloaded.pdf'

# Download the PDF and save it locally
response = requests.get(pdf_url)
if response:
    with open( file , 'wb' ) as out:
        out.write(response.content)

print( f'Reading {file} ...')

# Name the output file after the PDF, replacing the extension with .txt
filename = file[ : file.rindex('.') ] + '.txt'

with open( file , 'rb' ) as pdf_obj, open( filename , 'w' , encoding = 'utf-8' ) as out:

    # PdfReader, .pages and extract_text() are the names used in PyPDF2 2.x and 3.x;
    # older releases used PdfFileReader, numPages, getPage() and extractText()
    pdf_reader = PyPDF2.PdfReader(pdf_obj)

    print( f'The PDF file has {len(pdf_reader.pages)} pages.\n' )

    for page_obj in pdf_reader.pages:
        txt = page_obj.extract_text()
        txt = re.sub( r'\n\n' , '\n' , txt )
        out.write(txt)
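
Part of the manual clean-up mentioned above can sometimes be automated. The sketch below shows two common steps on a made-up sample string: re-joining words that were hyphenated across a line break, and collapsing runs of blank lines. The function name `tidy_extracted_text` is hypothetical, and whether these rules suit a given PDF is an assumption that needs to be checked against the actual output.

```python
import re

def tidy_extracted_text(text):
    # Re-join words that were hyphenated across a line break
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Collapse any run of two or more newlines into a single newline
    text = re.sub(r'\n{2,}', '\n', text)
    return text

# Hypothetical fragment of text as it might come out of a PDF
sample = 'Text min-\ning starts\n\n\n\nwith a corpus.'
print(tidy_extracted_text(sample))
```

Note that the first rule also joins genuine compounds that happen to break at the hyphen (such as 'well-' followed by 'known' on the next line), so the result still needs to be inspected.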

# Exercises

## Exercise 1.1

1. Download all the text files that are listed in the following dictionary. 

```
gutenberg_files = {
    'https://www.gutenberg.org/files/98/98-0.txt':
        'A Tale of Two Cities',
    'https://www.gutenberg.org/files/580/580-0.txt':
        'The Pickwick Papers'
}
```

Save these files in a folder named 'Texts'. In Python, you can make new folders using the `os` module, as follows:

```
os.mkdir('Texts')
```
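
As a hint, `os.mkdir()` raises an error when the folder already exists. The sketch below uses `os.makedirs()` with `exist_ok=True` instead, and builds the path of a file inside the folder with `os.path.join()`. The file name used here is only an example.

```python
import os

folder = 'Texts'

# Create the folder; exist_ok=True avoids an error if it is already there
os.makedirs(folder, exist_ok=True)

# Build a path inside the folder in a way that works on every operating system
path = os.path.join(folder, 'A_Tale_of_Two_Cities.txt')
print(path)
```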








In [None]:
import os

gutenberg_files = {
    'https://www.gutenberg.org/files/98/98-0.txt' :
        'A Tale of Two Cities',
    'https://www.gutenberg.org/files/580/580-0.txt':
        'The Pickwick Papers'
   }

if not os.path.exists('Texts'): 
    os.mkdir('Texts')

## Exercise 1.2

Texts can potentially also be acquired through web scraping.

The webpage below offers access to the complete works of H.P. Lovecraft.

http://www.hplovecraft.com/writings/texts/

Can you write code in Python to download all the texts that are listed on this page?
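
As a starting point, the sketch below shows one possible way to collect the links from a piece of HTML using only the standard library. It runs on a made-up HTML fragment: the class name `LinkCollector`, the sample markup and the file names in it are all hypothetical, and the structure of the real page has to be inspected before this approach can be applied to it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the target of every <a href="..."> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # Turn relative links into absolute URLs
                self.links.append(urljoin(self.base_url, href))

base_url = 'http://www.hplovecraft.com/writings/texts/'

# Hypothetical fragment; the markup of the real page may look different
sample_html = '''
<ul>
  <li><a href="example/first_story.html">First story</a></li>
  <li><a href="example/second_story.html">Second story</a></li>
</ul>
'''

collector = LinkCollector(base_url)
collector.feed(sample_html)

for link in collector.links:
    print(link)
```

Each collected URL could then be downloaded with `requests.get()`, in the same way as the Project Gutenberg files earlier in this notebook.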