# Introduction

## File organisation 

This is the first in a series of notebooks discussing a number of basic analyses in the field of text and data mining. These notebooks explain to you how you can perform these analyses, step by step.

As a first and vital step, you need to create a working directory in which you can store all the texts and all the code for your project. You are free to name this directory in whichever way you like. Once you have created such a folder on your computer, you need to make sure that the various notebooks are stored in the root of your working directory.

* Open the [GitHub repository that was made for this course](https://github.com/peterverhaar/dtdp2020).
* On GitHub, click on "Clone" > "Download ZIP" and save the zipped folder in the root of your working directory.
* Unzip the downloaded folder. Your working directory should now contain the notebooks that you need, as well as a subdirectory named "Corpus".
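
If you want to verify that the steps above produced the expected layout, a minimal sketch such as the one below may help. The function name `check_working_directory` is made up for this illustration; the only assumptions are that the notebooks have the `.ipynb` extension and that the subdirectory is named "Corpus", as described above.

```python
from pathlib import Path

def check_working_directory(root="."):
    """Return a list of problems found in the layout of the working directory."""
    root = Path(root)
    problems = []
    # The sample texts are expected in a subdirectory named 'Corpus'
    if not (root / "Corpus").is_dir():
        problems.append("Missing 'Corpus' subdirectory")
    # The notebooks are expected in the root of the working directory
    if not any(root.glob("*.ipynb")):
        problems.append("No notebooks (*.ipynb) found in the root")
    return problems

# Check the directory from which this code is run
for problem in check_working_directory():
    print(problem)
```

If nothing is printed, both the "Corpus" folder and at least one notebook were found.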



## Acquiring the texts for your corpus

The "Corpus" folder you obtained when you downloaded the GitHub repository for this course contains a number of sample texts but, in your own text and data mining project, you will want to work with your own texts. As a next step, you can delete the sample text files and replace them with your own texts.

You can download digitised or born-digital text files from a variety of sources:

* [Project Gutenberg](https://www.gutenberg.org/)
* [Distant Reading E-COST](https://github.com/distantreading/distantreading.github.io)
* [DBNL](https://dbnl.nl/)
* [Text Creation Partnership](https://github.com/textcreationpartnership/Texts)
* [WikiData](https://www.wikidata.org/)
* [Folger Shakespeare Digital Library](https://shakespeare.folger.edu/download/)

Note that the code that was developed for this course assumes that all the texts are saved as plain machine-readable TXT files, with all characters encoded according to the UTF-8 scheme.
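
To find out whether your own files meet this requirement, you can try to decode them as UTF-8. The code below is a minimal sketch of such a check. The function name `is_utf8` is made up for this illustration, and the code assumes that your texts are stored in the "Corpus" folder with a `.txt` extension.

```python
from pathlib import Path

def is_utf8(path):
    """Return True if the file at `path` can be decoded as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

corpus = Path("Corpus")
if corpus.is_dir():
    # Report every text file that is not valid UTF-8
    for path in sorted(corpus.glob("*.txt")):
        if not is_utf8(path):
            print(path.name + " is not encoded in UTF-8")
```

Files that are reported here need to be converted to UTF-8, for instance by re-saving them in a text editor, before you analyse them.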

When you download text files from Project Gutenberg, it is important to bear in mind that the files all contain a 'boilerplate' which includes the full user licence. This boilerplate needs to be removed from the file before you start to analyse it.

To make the process of data acquisition transparent and reproducible, it can be useful to work with code which performs the downloads. Web resources can be downloaded using Python's `requests` library. The code below first defines a dictionary which stores both the URLs and the titles of a number of books available at Project Gutenberg, and then downloads all of these texts. While the files are being acquired, the boilerplate is also removed from each text. The code makes use of a regular expression for this purpose.

In [1]:
import requests
import re

gutenberg_files = {
    'http://www.gutenberg.org/files/158/158-0.txt':'Emma',
    'http://www.gutenberg.org/files/161/161-0.txt':'Sense and Sensibility',
    'http://www.gutenberg.org/files/1342/1342-0.txt':'Pride and Prejudice'
}


for url in gutenberg_files:
    print("Downloading " + gutenberg_files[url] + " ...")
    response = requests.get(url)
    title = re.sub( r'\s+' , '_' ,  gutenberg_files[url] )

    if response:
        response.encoding = 'utf-8'
        lines = re.split( r'\n' , response.text )
        read_mode = 0 
        full_text = ''
        
        for line in lines:
            # Check for the end marker first, so that the marker line itself is not copied
            if re.search( r'\*{3,}\s+END\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK' ,  line , re.IGNORECASE ):
                read_mode = 0

            if read_mode == 1:
                full_text += line + '\n'

            # Start copying from the line after the start marker
            if re.search( r'\*{3,}\s+START\s+OF\s+TH(E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK' ,  line , re.IGNORECASE ):
                read_mode = 1
        full_text = full_text.strip()
        if re.search( r'^Produced by' , full_text , re.IGNORECASE ):
            full_text = full_text[ full_text.index('\n') : len(full_text) ]

            
        # Save the cleaned text as a plain-text file, e.g. 'Emma.txt'
        with open( title + '.txt' , 'w' , encoding = 'utf-8') as out:
            out.write( full_text.strip() )

print('\nDone!')    

Downloading Emma ...
Downloading Sense and Sensibility ...
Downloading Pride and Prejudice ...

Done!
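
The removal of the boilerplate can also be isolated in a function, which makes it possible to try out the logic on a short string without downloading anything. The code below is a sketch along those lines. It reuses the two regular expressions from the cell above, but the function name `strip_boilerplate` and the sample string are made up for this illustration. Note that the function tests for the end marker before it copies a line, so that the marker itself does not end up in the result.

```python
import re

# The same marker patterns as in the download code
START = re.compile(r'\*{3,}\s+START\s+OF\s+TH(?:E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK', re.IGNORECASE)
END = re.compile(r'\*{3,}\s+END\s+OF\s+TH(?:E|IS)\s+PROJECT\s+GUTENBERG\s+EBOOK', re.IGNORECASE)

def strip_boilerplate(text):
    """Return only the lines between the START and END markers."""
    kept = []
    inside = False
    for line in text.split('\n'):
        if END.search(line):
            inside = False
        elif START.search(line):
            inside = True
        elif inside:
            kept.append(line)
    return '\n'.join(kept).strip()

# A made-up miniature file that follows the Project Gutenberg layout
sample = (
    "Licence text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EMMA ***\n"
    "\n"
    "Chapter 1\n"
    "Emma Woodhouse\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EMMA ***\n"
    "More licence text\n"
)
print(strip_boilerplate(sample))
```

For the sample string, only the two lines between the markers are printed.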


If the number of files that you want to download is so large that it becomes inefficient to create a dictionary manually, you may also choose to make use of [a CSV file listing all the titles which can currently be found on Project Gutenberg](https://raw.githubusercontent.com/peterverhaar/introduction_to_dh/main/gutenberg_metadata.csv). At the time of writing, this CSV file describes about 60,000 texts. The code below creates a Python dictionary on the basis of this large CSV file.

In the code below, this full file is filtered, however. As you can see in the code, the dictionary that is created includes files only when the `author` variable contains the term 'Dickens'.

The large CSV file also includes the subject headings that have been assigned at Project Gutenberg. Using this *subject* variable, you can also choose to download, for instance, all the texts on Project Gutenberg in the Gothic genre.   

In [None]:
import pandas as pd
import re

gutenberg_files = dict()

github = 'https://raw.githubusercontent.com/peterverhaar/introduction_to_dh/main/'


md = pd.read_csv( github + 'gutenberg_metadata.csv')

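for index,row in md.iterrows():

    # Keep only the texts whose author field mentions Dickens.
    # The URL is used as the key, as in the download code above.
    if re.search( r'Dickens' , str( row['author'] ) , re.IGNORECASE ):
        gutenberg_files[row['url']] = row['title']

    # To select on subject heading instead, replace the test above with:
    # if re.search( r'Gothic' , str( row['subject'] ) , re.IGNORECASE ):
    #     gutenberg_files[row['url']] = row['title']


for url in gutenberg_files:
    print( gutenberg_files[ url ] , url )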
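
The same selection can be written as a small reusable function. The sketch below uses Python's built-in `csv` module rather than pandas, and it works on a few made-up sample rows so that you can run it without downloading the large file. Only the column names (`title`, `author`, `subject` and `url`) are taken from the code above; the function name `select_texts`, the rows and the example.org URLs are hypothetical. The resulting dictionaries map URLs to titles, like the dictionary in the download code.

```python
import csv
import io
import re

# Made-up sample rows; only the column names follow the metadata file
sample_csv = """title,author,subject,url
Oliver Twist,"Dickens, Charles",Orphans -- Fiction,https://example.org/oliver.txt
The Castle of Otranto,"Walpole, Horace",Gothic fiction,https://example.org/otranto.txt
Bleak House,"Dickens, Charles",Legal stories,https://example.org/bleak.txt
"""

def select_texts(rows, column, pattern):
    """Map URL to title for every row in which `column` matches `pattern`."""
    selected = {}
    for row in rows:
        # Empty or missing cells are treated as empty strings
        if re.search(pattern, row.get(column) or '', re.IGNORECASE):
            selected[row['url']] = row['title']
    return selected

rows = list(csv.DictReader(io.StringIO(sample_csv)))
dickens = select_texts(rows, 'author', r'Dickens')
gothic = select_texts(rows, 'subject', r'Gothic')
print(dickens)
print(gothic)
```

Because the column and the search term are parameters, the same function covers both the selection on author and the selection on subject.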

If you want to download multiple texts from the [Text Creation Partnership](https://textcreationpartnership.org/), you can probably reuse large parts of the code that is given above. The project has similarly made available [a CSV file listing all of its texts](https://raw.githubusercontent.com/textcreationpartnership/Texts/master/TCP.csv). 


## Metadata

The techniques associated with Text and Data Mining can be used, among other purposes, to examine the basic syntactic and lexical properties of texts. You can collect data, for instance, about the average number of words per sentence, or about the total number of adjectives. Once you have data about such aspects, you can use such metrics to explore whether the clusters that can be created using formal similarities coincide, in some way or another, with other categorisations, such as those based on genre, historical period, text type or thematic concerns. 
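
To give a first impression of what such a metric looks like in code, the sketch below calculates the average number of words per sentence. It is a deliberately naive illustration: it assumes that sentences end in a full stop, an exclamation mark or a question mark, so abbreviations such as "Mr." will be counted as sentence boundaries. The function name `average_sentence_length` is made up for this example.

```python
import re

def average_sentence_length(text):
    """Return the mean number of words per sentence in `text`."""
    # Split on sentence-final punctuation and drop empty fragments
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    if not sentences:
        return 0.0
    # Count the word-like tokens in each sentence
    word_counts = [len(re.findall(r'\w+', s)) for s in sentences]
    return sum(word_counts) / len(sentences)

sample = "Emma was happy. She had very little to distress her!"
print(average_sentence_length(sample))
```

The sample string contains one sentence of three words and one of seven words.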

To be able to explore such correlations, it is necessary, obviously, to have explicit data about the categories that you want to examine. Before you start analysing the texts in your project, it is useful to create a separate metadata file, in the CSV format, in which you capture all the categories that you want to study. 

The code that is given in the cell below can be used to create a basic template for the CSV metadata file. As you can see, it creates a file named "metadata.csv" on your computer, and it makes a header with two columns: 'title' and 'class'. 


Next, using the `listdir` method from the `os` module, the code lists all the files in your corpus (i.e. all the TXT files saved in your **Corpus** directory). The program adds these to the metadata file, as values of the 'title' column. Note that the code also removes the '.txt' extension, in an attempt to make the strings look like an actual title. 

In [None]:
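import os
import csv

corpus_dir = 'Corpus'

# newline='' lets the csv module handle the line endings itself
with open( 'metadata.csv' , 'w' , encoding= 'utf-8' , newline='' ) as md:
    writer = csv.writer( md , lineterminator='\n' )
    writer.writerow( [ 'title' , 'class' ] )

    for file in sorted( os.listdir( corpus_dir ) ):
        # Only include the plain-text files
        if file.endswith( '.txt' ):
            # Drop the extension, so that the name looks like a title
            title = file[ : -len('.txt') ]
            # The 'class' value is left empty; titles containing commas are quoted
            writer.writerow( [ title , '' ] )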


Importantly, when you only run the code above, the metadata CSV file will still be incomplete. You still need to add the appropriate values for the 'class' column. This is something which you will need to do manually, unless you can develop a method for extracting the data about your categories automatically. If you want to work with more than one categorical variable, you need to edit the header, and you also need to supply values for this additional column.

The values that you assign at this stage will be used in the other notebooks in this course.
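
Once you have filled in the 'class' column, you can read the file back to check your work. The code below is a minimal sketch which groups the titles by the value in the 'class' column. The function name `titles_by_class` and the label 'unclassified' for rows without a value are assumptions made for this illustration.

```python
import csv
from collections import defaultdict

def titles_by_class(path='metadata.csv'):
    """Return a dictionary which maps each class to a list of titles."""
    groups = defaultdict(list)
    with open(path, encoding='utf-8', newline='') as md:
        for row in csv.DictReader(md):
            # Rows without a class value are collected under one label
            label = row['class'] or 'unclassified'
            groups[label].append(row['title'])
    return dict(groups)
```

Calling `titles_by_class()` in your working directory returns the groups, and any titles listed under 'unclassified' still need a value.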

## tdm module

Many of the analyses that are discussed in the notebooks in this section are based on core data processing or data cleaning operations, such as word tokenisation (i.e. the division of a full text into its individual words) or the calculation of word frequencies. As it is inconvenient and inefficient to repeat the full code of such methods each time they are needed, the code that you can use for these basic operations has been saved collectively in a module named '**tdm**'. Concretely, this module is simply a Python file containing all of these methods. You should make sure that this module is saved in the same folder as the notebooks in this course. If you downloaded the zipped repository from GitHub, this module is probably in the right directory already.

To learn more about the logic implemented in this module, you can open the `tdm` module in a code editor and study the code it contains. If you feel the need to modify the code for some reason, you are free to do this on your own computer.
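
To give a rough idea of what operations such as tokenisation and the calculation of word frequencies involve, the sketch below implements simple versions of both. Please note that this is not the code of the `tdm` module: the function names `tokenise` and `word_frequencies` are made up for this illustration, and the module itself may well work differently.

```python
import re
from collections import Counter

def tokenise(text):
    """Divide a text into lower-cased words (a deliberately simple approach)."""
    return re.findall(r'\w+', text.lower())

def word_frequencies(text):
    """Count how often each word occurs in the text."""
    return Counter(tokenise(text))

freq = word_frequencies("The cat saw the other cat.")
# Show the two most frequent words and their counts
print(freq.most_common(2))
```

A `Counter` behaves like a dictionary, so `freq['cat']` gives the number of times the word 'cat' occurs.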

When you want to make use of these methods, you firstly need to import the `tdm` module, as follows: 

In [None]:
import tdm

Next, open the notebook named [Vocabulary.ipynb](Vocabulary.ipynb) to learn how you can systematically analyse the vocabulary of one of the texts in your corpus. 