# Corpora

This notebook demonstrates corpus loading and preprocessing without the aid of NLTK's corpus readers. We will use [ECHOE](https://github.com/ECHOEProject/echoe) as an example and, for the purposes of this demo, process its distributed plaintext corpus rather than work directly from the XML.

In [1]:
import glob
import os
# The following import requires the `gitpython` package
# *and* a working, configured Git installation.
# Alternatively, download the ECHOE repository into `echoe/` manually,
# comment out the following line, and skip the first cell of code below:
from git import Repo

First we'll determine whether we have cloned the ECHOE repository before (in which case we only want to update it to the latest state) or not (in which case we'll want to clone it from remote).

Don't worry about this part unless you're already curious how to wield Git within Python.

In [2]:
remote = 'https://github.com/ECHOEProject/echoe.git'
local = 'echoe'
# Only clone if the target folder doesn't already exist:
if not os.path.exists(local):
    repo = Repo.clone_from(remote, local)
# Else, just update the working copy from remote:
else:
    repo = Repo(local)
    assert isinstance(repo, Repo)
    repo.remotes.origin.pull()
assert not repo.bare


ECHOE line divisions are meaningful (though not entirely objective): each line contains either a manuscript rubric or a syntactical unit (roughly, a sentence or clause), or it is empty, separating two "chunks" of text or a rubric from the text body. Depending on your needs, it may make sense to use the file object's `.readlines()` method to read in the lines separately. But for today, let's say our needs are the following:

- Each ECHOE document (i.e. homily or saint's life) should be its own data container;
- Each data container should be a single list of tokens;
- We should still be able to identify which container is which ECHOE document, so the overarching data container should be a dictionary with keys like `049B.41`;
- We want to get rid of the ECHOE references found at the start of each line of body text (e.g. "56.33.11:");
- We want a normalized, case folded text, but ECHOE's plaintext corpus as distributed is already in the desired state.

In this case it still makes sense to use `.readlines()`, just so we can filter out the segment references. So let's get started:

In [3]:
echoe = dict()
# Compile a list of files to be read in,
# and run the rest of the cell on a per-file basis:
for file in sorted(glob.glob('echoe/plaintext/*.txt')):
    # Keep the base file name as an identifier:
    identifier = os.path.basename(file).replace('.txt', '')
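    # Read in the document one line at a time
    # (the encoding is assumed to be UTF-8; the `with` block closes the file):
    with open(file, encoding='utf-8') as f:
        segments = f.readlines()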
    # Initialize an empty list of tokens:
    tokens = []
    # Run the next few lines on a per-line basis:
    for segment in segments:
        # If the line contains a reference separator, discard the reference
        # (split only once, so any later ': ' in the text is preserved):
        if ': ' in segment:
            segment = segment.split(': ', 1)[1]
        # Tokenize the rest of the line and add the output to the list of tokens:
        tokens.extend(segment.rstrip(' \n').split())
    # Finally, store the list of tokens in the dictionary, using the identifier as a key:
    echoe[identifier] = tokens
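
The check above treats any `': '` as the end of a reference. A stricter alternative is to strip only a reference that sits at the very start of a line. The following is a minimal sketch, assuming that references always consist of dot-separated digits followed by a colon (as in "56.33.11:"). The helper name `strip_reference` and the sample lines are invented for illustration and are not taken from ECHOE.

```python
import re

# Assumed reference shape: dot-separated digits, a colon, optional whitespace.
REFERENCE = re.compile(r'^\d+(?:\.\d+)*:\s*')

def strip_reference(line):
    """Remove a leading segment reference, if present, plus trailing whitespace."""
    return REFERENCE.sub('', line).rstrip()

# Invented sample lines: a rubric, an empty separator line, and two body lines,
# the second of which contains a colon inside the text itself.
sample_lines = [
    'to eallum folce\n',
    '\n',
    '56.33.11: leofan men ælcne þara ic bidde\n',
    '56.33.12: he cwæð: understande his agene þearfe\n',
]

tokens = []
for line in sample_lines:
    tokens.extend(strip_reference(line).split())
print(tokens)
```

Because the pattern is anchored with `^`, a colon later in the line is left alone, and lines without a reference pass through unchanged.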

Having read in our tokens, we can now access them as follows (limiting the output to the first 20 tokens for convenience):

In [4]:
echoe['049B.11'][:20]

['to',
 'eallum',
 'folce',
 'leofan',
 'men',
 'ælcne',
 'þara',
 'ic',
 'bidde',
 'þe',
 'godes',
 'ege',
 'hæbbe',
 'þæt',
 'he',
 'understande',
 'his',
 'agene',
 'þearfe',
 'gelæste']

Judging by this sample, ECHOE's distributed plaintext corpus is already normalized and case-folded. When working with a corpus that is not, you can define a normalization function. The most efficient point to apply it is immediately after reading the file or its lines (i.e. after `.readlines()` above), because whole files or lines make for far fewer strings to process than individual tokens. But now that we're here, we can normalize on a per-token basis:

In [5]:
# Create a dict with characters we want replaced:
substitutions = {
    'ę': 'æ',
    'ƿ': 'w',
    'ẏ': 'y',
    'v': 'u',
    'j': 'i'
}

# Write a function carrying out the desired operations:
def normalize(token):
    # Lowercase once, up front:
    token = token.lower()
    # Carry out replacements:
    for old, new in substitutions.items():
        token = token.replace(old, new)
    return token

# Let's use list comprehension to create a new list for each document:
echoe_normalized = dict()
for k,v in echoe.items():
    new_doc = [normalize(token) for token in v]
    echoe_normalized[k] = new_doc

In [6]:
echoe_normalized['049B.11'][:20]



['to',
 'eallum',
 'folce',
 'leofan',
 'men',
 'ælcne',
 'þara',
 'ic',
 'bidde',
 'þe',
 'godes',
 'ege',
 'hæbbe',
 'þæt',
 'he',
 'understande',
 'his',
 'agene',
 'þearfe',
 'gelæste']

In our case the output looks the same, since the corpus was already normalized. If you want to test that the function works, replace `.lower()` above with `.upper()`, or add replacements for common vowels to the `substitutions` dictionary.
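
Another way to check the function is to feed it a few tokens that actually need normalizing. The sketch below is self-contained, so it repeats the substitution table and a version of the function. The test tokens are invented for this purpose and do not come from ECHOE.

```python
# Same substitution table as in the cell above:
substitutions = {'ę': 'æ', 'ƿ': 'w', 'ẏ': 'y', 'v': 'u', 'j': 'i'}

def normalize(token):
    # Lowercase first, then apply each replacement in turn:
    token = token.lower()
    for old, new in substitutions.items():
        token = token.replace(old, new)
    return token

# Invented tokens, chosen to trigger case folding and several substitutions;
# the last one needs no change at all:
test_tokens = ['Ƿvlfstan', 'Jvdas', 'cwęð', 'eallum']
normalized = [normalize(token) for token in test_tokens]
print(normalized)
```

If the function works as intended, none of the characters listed as keys in `substitutions` should survive in the output, and no uppercase letters either.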

Should we now want to run any routines on the corpus as a whole, without distinguishing between the documents, we can collect all our tokens into a single list:

In [7]:
all_tokens = []
for tokens in echoe.values():
    all_tokens.extend(tokens)

print(f"ECHOE contains {len(all_tokens)} total tokens.")

ECHOE contains 546959 total tokens.
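
One example of such a corpus-wide routine is a simple frequency count. The following is a minimal sketch using `collections.Counter` from the standard library. To keep it self-contained it runs on `toy_corpus`, a small invented stand-in for the `echoe` dictionary, whose keys and token lists are hypothetical.

```python
from collections import Counter
from itertools import chain

# Invented stand-in for the `echoe` dictionary; keys and tokens are hypothetical:
toy_corpus = {
    'doc_a': ['leofan', 'men', 'ic', 'bidde', 'þæt', 'he', 'understande'],
    'doc_b': ['to', 'eallum', 'folce', 'leofan', 'men', 'þæt', 'he', 'hæbbe'],
}

# Flatten all documents into one stream of tokens and count each token:
frequencies = Counter(chain.from_iterable(toy_corpus.values()))

# Show the three most frequent tokens together with their counts:
print(frequencies.most_common(3))
print(f"{sum(frequencies.values())} tokens, {len(frequencies)} distinct")
```

With the real data, `toy_corpus.values()` would be replaced by `echoe.values()`, or the counter could be built directly from `all_tokens`.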


But some of the most interesting approaches now available to us involve the comparison of documents: for length, lexical diversity, etc. To this end, you may want to read up on [Pandas](https://realpython.com/pandas-dataframe/) first, to learn about the functional equivalent of spreadsheets, and [Matplotlib](https://realpython.com/python-matplotlib-guide/) next, to learn how to visualize the patterns you find.
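
Before reaching for those libraries, a first comparison can be made in plain Python. The sketch below computes each document's length and its type-token ratio (distinct tokens divided by total tokens), used here as one simple and length-sensitive measure of lexical diversity. The choice of measure is an assumption made for illustration, and `toy_corpus` is an invented stand-in for the `echoe` dictionary.

```python
# Invented stand-in for the `echoe` dictionary; keys and tokens are hypothetical:
toy_corpus = {
    'doc_a': ['he', 'understande', 'his', 'agene', 'þearfe'],
    'doc_b': ['ic', 'bidde', 'þe', 'ic', 'bidde', 'þe', 'godes', 'ege'],
}

def type_token_ratio(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty document."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# One row of figures per document, keyed by document identifier:
stats = {
    identifier: {
        'tokens': len(tokens),
        'types': len(set(tokens)),
        'ttr': type_token_ratio(tokens),
    }
    for identifier, tokens in toy_corpus.items()
}

# Print the documents from most to least lexically diverse:
for identifier, row in sorted(stats.items(), key=lambda item: item[1]['ttr'], reverse=True):
    print(f"{identifier}: {row['tokens']} tokens, {row['types']} types, TTR {row['ttr']:.3f}")
```

A dictionary of dictionaries like `stats` is also a convenient starting point for Pandas: `pandas.DataFrame.from_dict(stats, orient='index')` should turn it into a table with one row per document. Because the type-token ratio tends to fall as documents get longer, it is best compared across documents of similar length.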