# Using Gazetteers to Extract Sets of Keywords from Free-Flowing Texts #

For [The Programming Historian](https://programminghistorian.org/en/lessons/extracting-keywords) by Adam Crymble. Modified to Python 3 and Jupyter Notebooks for Historical Informatics at University of Southern Denmark.

## Lesson Goals ##

If you have a copy of a text in electronic format stored on your computer, it is relatively easy to keyword search for a single term. Often you can do this by using the built-in search features in your favourite text editor. However, scholars increasingly need to find instances of many terms within a text or texts. For example, a scholar may want to use a gazetteer to extract all mentions of English placenames within a collection of texts so that those places can later be plotted on a map. Alternatively, they may want to extract all male given names, all pronouns, stop words, or any other set of words. Using those same built-in search features to achieve this more complex goal is time-consuming and clunky. This lesson will teach you how to use Python to extract a set of keywords very quickly and systematically from a set of texts.

## Content ##

The present tutorial will show users how to extract all mentions of English and Welsh county names from a series of 6,692 mini-biographies of individuals who began their studies at the University of Oxford during the reign of James I of England (1603-1625). These records were transcribed by British History Online, from the printed version of Alumni Oxonienses, 1500-1714. These biographies contain information about each graduate, which includes the date of their studies and the college(s) they attended. Often entries contain additional information when known, including dates of birth and death, the name or occupation of their father, where they originated, and what they went on to do in later life. The biographies are a rich resource, providing reasonably comparable data about a large number of similar individuals (rich men who went to Oxford). The 6,692 entries have been pre-processed by the author and saved to a CSV file with one entry per row.

In this tutorial, the dataset involves geographical keywords. Once extracted, these placenames could be geo-referenced to their place on the globe and then mapped using digital mapping. This might make it possible to discern which colleges attracted students from what parts of the country, or to determine if these patterns changed over time. For a practical tutorial on taking this next step, see the lesson by Fred Gibbs mentioned at the end of this lesson. Readers may also be interested in georeferencing in QGIS 2.0, also available from the Programming Historian.

This approach is not limited to geographical keywords, however. It could also be used to extract given names, prepositions, food words, or any other set of terms defined by the user. This process could therefore be useful for someone seeking to isolate individual entries containing any of these keywords, or for someone looking to calculate the frequency of their keywords within a corpus of texts. This tutorial provides pathways into textual or geospatial analyses, rather than research answers in its own right.

## Data set: Mini-biographies from University of Oxford (1603-25) ##

We use the pandas library to inspect our data. pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:
   - Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
   - Ordered and unordered (not necessarily fixed-frequency) time series data
   - Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
   - Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

In [None]:
import os
import pandas as pd

#data_path = os.path.join("..","data","Alumni_Oxonienses_Jas1.csv")
data_path = os.path.join("..","data","Alumni_Oxonienses_Jas1_utf8.csv")

#data = pd.read_csv(data_path,encoding="latin-1")
data = pd.read_csv(data_path)

# column (variable) names
print(data.columns)

# first 10 rows
print(data.head(10))

# last 10 rows
print(data.tail(10))

# first entry for variable "Details"
print(data["Details"].iloc[0])

## Build your gazetteer ##

In order to extract the relevant place names, we first have to decide what they are. We need a list of places, often called a *gazetteer*. Many of the place names mentioned in the records are shortforms, such as ‘Wilts’ instead of ‘Wiltshire’, or ‘Salop’ instead of ‘Shropshire’. Getting all of these variations may be tricky. For a start, let’s build a basic gazetteer of English counties.

Create a text file called `gazetteer.txt` and, using the entries on the Wikipedia page listed above, add each county on its own line. It should look something like this:

```
Bedfordshire
Berkshire
Buckinghamshire
Cambridgeshire
Cheshire
Cornwall
Cumberland
Derbyshire
Devon
Dorset
Durham
Essex
Gloucestershire
Hampshire
Herefordshire
Hertfordshire
Huntingdonshire
Kent
Lancashire
Leicestershire
Lincolnshire
Middlesex
Norfolk
Northamptonshire
Northumberland
Nottinghamshire
Oxfordshire
Rutland
Shropshire
Somerset
Staffordshire
Suffolk
Surrey
Sussex
Warwickshire
Westmorland
Wiltshire
Worcestershire
Yorkshire
```
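
The gazetteer above contains full county names only, so shortforms such as ‘Wilts’ or ‘Salop’ would not match. One possible way to handle them, sketched here under assumptions, is a small lookup table that maps each shortform to its full name before matching. The names `SHORTFORMS` and `normalise_place` are hypothetical, and only the two shortforms mentioned in this lesson are included; a real table would need to be compiled from the records themselves.

```python
# Hypothetical lookup table: shortform -> full county name (all lower case).
# Only the two examples named in the lesson are listed; extend as needed.
SHORTFORMS = {
    "wilts": "wiltshire",
    "salop": "shropshire",
}


def normalise_place(token, shortforms=SHORTFORMS):
    """Return the full county name for a known shortform, else the token unchanged."""
    token = token.lower()
    return shortforms.get(token, token)


print(normalise_place("Wilts"))
print(normalise_place("Kent"))
```

Tokens that are not in the table pass through unchanged, so the function can be applied to every token before it is checked against the gazetteer.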

## Import and export `.txt` files ##

### Create texts.txt file ###

Here we create a flat file from one column in the spreadsheet, but you could also have copied and pasted data from other sources.

In [None]:
texts = data["Details"].tolist()

file_path = os.path.join("..","data","texts.txt")
with open(file_path, 'w') as f:
    for text in texts:
        f.write("{}\n".format(text))  # \n starts a new line, i.e., one line per entry

### Load `gazetteer.txt` ###

Extract the keywords from `gazetteer.txt` and store them in a `list` object.

In [None]:
#Import the keywords
file_path = os.path.join("..","resources","gazetteer.txt")

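with open(file_path,"r") as fname:
    # one keyword per line; skip blank lines so that "" never ends up in the gazetteer
    keywords = [line.strip() for line in fname.read().lower().splitlines() if line.strip()]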

print("The gazetteer contains {} locations:".format(len(keywords)))

print(keywords)

print("The gazetteer is stored in a {} object".format(type(keywords)))


### Load `texts.txt` ###

In [None]:
#Import the texts you want to search
file_path = os.path.join("..","data","texts.txt")
with open(file_path,"r") as fname:
    texts = fname.read().lower().split("\n")

for text in texts[:10]:
    print(text)

## Preprocessing ##

When matching strings, you have to make sure the punctuation doesn't get in the way. Technically, `London.` is a different string than `London` or `;London` because of the added punctuation. These three strings which all mean the same thing to us as human readers will be viewed by the computer as distinct entities. To solve that problem, the easiest thing to do is just to remove all of the punctuation. You can do this with regular expressions, and Doug Knox and Laura Turner O’Hara have provided great introductions at Programming Historian for doing so.
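
As a rough illustration of the regular-expression route, the sketch below removes every character that is not a letter, digit, underscore, or whitespace. The function name `strip_punctuation` is made up for this example, and the pattern is only one possible choice: it also deletes apostrophes and hyphens, which may or may not be what you want for your texts.

```python
import re


def strip_punctuation(token):
    """Delete every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", token)


# The three variants discussed above collapse to the same string.
variants = ["London.", "London", ";London"]
cleaned = [strip_punctuation(v) for v in variants]
print(cleaned)  # ['London', 'London', 'London']
```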

### Tokenize and remove unwanted punctuation ###
To keep things simple, this program just replaces the most common punctuation marks with an empty string, effectively deleting them.

In [None]:
for text in texts:
    #for each text:
    tokens = text.split()
    for token in tokens:
        #remove punctuation that will interfere with matching
        token = token.replace(",", "")
        token = token.replace(".", "")
        token = token.replace(";", "")        

## String matching ##

As the words from our text are already in a list called `tokens`, and all of our keywords are in a list called `keywords`, all we have to do now is check our texts for the keywords.

First, we need somewhere to store details of any matches we have. Immediately after the `for text in texts:` line, at one level of indentation, add the following two lines of code:

```
matches = 0
stored_matches = list()
```

Indentation is important in Python. The above two lines should be indented one level deeper than the `for` line above them. That means the code runs every time the loop runs: it is part of the loop.

The `stored_matches` variable is a blank list, where we can store our matching keywords. The `matches` variable is known as a `flag`, which we will use in the next step when we start printing the output.

To do the actual matching, add the following lines of code to the bottom of your program, again minding the indentation (2 levels from the left margin), making sure you save:

```
        #if a keyword match is found, store the result.
        if token in keywords:
            if token in stored_matches:
                continue
            else:
                stored_matches.append(token)
            matches += 1
    print(matches)
```

In [None]:
for text in texts:
    matches = 0
    stored_matches = list()
    #for each text:
    tokens = text.split()
    
    for token in tokens:
        #remove punctuation that will interfere with matching
        token = token.replace(",", "")
        token = token.replace(".", "")
        token = token.replace(";", "")
        
        if token in keywords:
            
            if token in stored_matches:
                continue
                
            else:
                stored_matches.append(token)
            
            matches += 1
            
    print(matches)
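
The same matching logic can also be wrapped in a function, which makes it easier to try out on a single string. The version below is a sketch rather than a drop-in replacement: the name `find_keywords` and the sample sentence are invented for illustration, it converts the gazetteer to a `set` (membership tests on a set are faster than on a list), and it uses `str.strip` to trim punctuation from the ends of each token instead of replacing it everywhere.

```python
def find_keywords(text, keywords):
    """Return the gazetteer entries found in `text`, in order of first appearance."""
    keyword_set = set(keywords)  # set membership is faster than list membership
    found = []
    for token in text.lower().split():
        token = token.strip(",.;")  # trim punctuation at either end of the token
        if token in keyword_set and token not in found:
            found.append(token)
    return found


# Made-up sample sentence, not taken from the dataset.
sample = "of Kent, gent. Born in Essex; later of Kent."
print(find_keywords(sample, ["kent", "essex", "surrey"]))  # ['kent', 'essex']
```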

## Output results ##

If you have got to this stage, then your Python script is already finding the matching keywords from your gazetteer. All we need to do now is print them out to the command output pane in a format that’s easy to work with.

Add the following lines to your program, minding the indentation as always:

```
    #if there is a stored result, print it out
    if matches == 0:
        print("")
    else:
        match_string = ""
        for match in stored_matches:
            match_string = match_string + match + "\t"
        
        print(match_string)
```


In [None]:
for text in texts:
    matches = 0
    stored_matches = list()
    #for each text:
    tokens = text.split()
    
    for token in tokens:
        #remove punctuation that will interfere with matching
        token = token.replace(",", "")
        token = token.replace(".", "")
        token = token.replace(";", "")
        
        if token in keywords:
            
            if token in stored_matches:
                continue
                
            else:
                stored_matches.append(token)
            
            matches += 1
    if matches == 0:
        print("")
    else:
        match_string = ""
        for match in stored_matches:
            match_string = match_string + match + "\t"
            
        print(match_string)
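
Building the output string in a loop leaves a tab after the last match. If that matters for later processing, `str.join` is a common alternative, since it places the separator only between items. The short comparison below uses made-up values for `stored_matches`.

```python
stored_matches = ["kent", "essex"]  # made-up example values

# Loop-and-concatenate, as in the cell above: leaves a trailing tab.
looped = ""
for match in stored_matches:
    looped = looped + match + "\t"

# str.join inserts the separator only between items.
joined = "\t".join(stored_matches)

print(repr(looped))  # 'kent\tessex\t'
print(repr(joined))  # 'kent\tessex'
```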


## Output matches to file ##
Finalize the script by appending each match string to a `.txt` file:

```
    f = open('output.txt', 'a')
    f.write(match_string)
    f.close()
```

In [None]:
file_path = os.path.join("..","data","output.txt")

for text in texts:
    matches = 0
    stored_matches = list()
    #for each text:
    tokens = text.split()
    
    for token in tokens:
        #remove punctuation that will interfere with matching
        token = token.replace(",", "")
        token = token.replace(".", "")
        token = token.replace(";", "")
        
        if token in keywords:
            
            if token in stored_matches:
                continue
                
            else:
                stored_matches.append(token)
            
            matches += 1
    if matches == 0:
        print("")
    else:
        match_string = ""
        for match in stored_matches:
            match_string = match_string + match + "\t"
            
        print(match_string)
        
        with open(file_path,"a") as fname:
            fname.write(match_string + "\n")  # one line of matches per text


Note the `a` instead of the `r` we used earlier. This appends the text to the file `output.txt`, which is saved in the directory given by `file_path`. Take care: running the program several times keeps appending the outputs to this file, creating a very long file. We can solve this by collecting each `match_string` in a list object `output` and then writing the whole list to `output.txt` in one step, opening the file in `w` mode so that each run overwrites the previous one.

In [None]:
file_path = os.path.join("..","data","output.txt")
output = list()

for text in texts:
    matches = 0
    stored_matches = list()
    #for each text:
    tokens = text.split()
    
    for token in tokens:
        #remove punctuation that will interfere with matching
        token = token.replace(",", "")
        token = token.replace(".", "")
        token = token.replace(";", "")
        
        if token in keywords:
            
            if token in stored_matches:
                continue
                
            else:
                stored_matches.append(token)
            
            matches += 1
    if matches == 0:
        continue
    else:
        match_string = ""
        for match in stored_matches:
            match_string = match_string + match + "\t"  # tab keeps multiple matches apart
            
        output.append(match_string)

print(output)

with open(file_path, 'w') as fname:
    for match_string in output:
        fname.write("{}\n".format(match_string))
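
As noted earlier, the same approach can be used to calculate how often each keyword occurs in a corpus. A minimal sketch using `collections.Counter` from the standard library is shown below. The function name `count_keywords` and the two sample texts are invented for illustration; unlike the loop above, it counts every occurrence rather than one match per text.

```python
from collections import Counter


def count_keywords(texts, keywords):
    """Count how many times each keyword occurs across all texts."""
    keyword_set = set(keywords)
    counts = Counter()
    for text in texts:
        for token in text.lower().split():
            # same punctuation removal as in the cells above
            token = token.replace(",", "").replace(".", "").replace(";", "")
            if token in keyword_set:
                counts[token] += 1
    return counts


# Made-up sample texts, not taken from the dataset.
sample_texts = ["of Kent, gent.", "born in Essex; died in Kent."]
counts = count_keywords(sample_texts, ["kent", "essex"])
print(counts.most_common())  # [('kent', 2), ('essex', 1)]
```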