# Papal text analysis
A scratchbook on the work done with the text scraped from the Vatican website.

## Named entity recognition (NER)
The text has been tagged with the Stanford NER tagger, using the 3-class model `english.all.3class.distsim.crf.ser`, which recognizes LOCATION, PERSON and ORGANIZATION. This model has been trained on the CoNLL 2003, MUC 6 and MUC 7 data sets (and some additional data, as stated in [https://nlp.stanford.edu/software/CRF-NER.shtml]).

Stanford NER tagger citation/reference:
```
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
```
The Vatican text is tagged by a short Python script that uses the `nltk.tag.StanfordNERTagger` wrapper around the Stanford tagger (which is written in Java). The script processes each file in these steps:

* Read the text and split it into a sequence of words.
* Ask the NER tagger to tag the sequence of words.
* Extract the entities tagged as LOCATION, PERSON or ORGANIZATION. Merge consecutive entities of the same class into a *phrase*.
* Count the number of occurrences of each word (or phrase).
* Write the result to file, one line for each combination of word and class, with the word count together with fields specifying the pope, type of document, year and name of the document.
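
The merge and count steps above can be sketched as follows. This is a minimal illustration, not the actual script: the function names are hypothetical, and it assumes the tagger returns a list of `(token, tag)` tuples with `O` for untagged tokens, which is the shape `StanfordNERTagger.tag` produces.

```python
from collections import Counter
from itertools import groupby

ENTITY_CLASSES = {"LOCATION", "PERSON", "ORGANIZATION"}


def merge_entities(tagged_tokens):
    """Merge runs of consecutive tokens with the same entity class into phrases."""
    phrases = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label in ENTITY_CLASSES:
            phrase = " ".join(token for token, _ in group)
            phrases.append((phrase, label))
    return phrases


def count_entities(tagged_tokens):
    """Count how often each (phrase, class) combination occurs."""
    return Counter(merge_entities(tagged_tokens))


def tag_file(path, tagger):
    """Tag one file; `tagger` is assumed to expose .tag(list_of_tokens)."""
    with open(path, encoding="utf-8") as handle:
        tokens = handle.read().split()
    return count_entities(tagger.tag(tokens))


# Hand-written tagger output, used here instead of a real tagger run.
tagged = [
    ("The", "O"), ("Virgin", "PERSON"), ("Mary", "PERSON"), ("visited", "O"),
    ("Elizabeth", "PERSON"), ("in", "O"), ("Judea", "LOCATION"), (".", "O"),
    ("Elizabeth", "PERSON"),
]
counts = count_entities(tagged)
```

Note that merging on class alone also joins two distinct entities of the same class when they are directly adjacent in the text.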

### How to train a new model

The model used here, `english.all.3class.distsim.crf.ser`, has some shortcomings when it comes to recognizing religious entities. To improve the result you can either train a new model or assign classes to the unrecognized entities by hand.

Instructions on how to train a new model can be found at [http://nlp.stanford.edu/software/crf-faq.shtml#b]. There are some constraints, the most important being that it is not possible to extend an existing model. The data used to create Stanford's official models is not available either. A new model must hence be created from scratch, but some of the tagged datasets found at [http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html] can be used as a starting point.

1. Select an existing tagged dataset (if one exists).
2. Find resources that enable automatic class assignment (gazetteers for names, locations and organizations).
3. Create additional tagged datasets that include the additional proper nouns you wish to train on:
    1. Select and create a big enough corpus that includes the targeted proper nouns (assuming named entity recognition).
    2. Split the corpus into smaller documents.
    3. Tokenize the corpus using the PTBTokenizer:
> java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer corpus.txt > corpus.tok
    4. Annotate the corpus, automatically using gazetteers as well as manually.

The tagged datasets must be transformed into a tab-separated file in the format that Stanford NER accepts. As stated in the docs, the parser is not very forgiving: the word and class must be separated by a single tab, and there cannot be any extra spaces.
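
A minimal sketch of producing such a file from Python is shown below. The function name is hypothetical, and writing a blank line between sentences is an assumption about how the data should be segmented; adjust it to match your corpus. Tokens outside any entity are given the background class `O`.

```python
def format_training_lines(sentences):
    """Render tagged sentences as 'token<TAB>class' lines.

    `sentences` is a list of sentences, each a list of (token, class) pairs.
    Pairs containing whitespace are rejected, since the parser expects
    exactly one tab and no extra spaces on each line.
    """
    lines = []
    for sentence in sentences:
        for token, label in sentence:
            if not token or not label or any(ch.isspace() for ch in token + label):
                raise ValueError(f"invalid token/class pair: {token!r}, {label!r}")
            lines.append(f"{token}\t{label}")
        lines.append("")  # assumption: blank line as sentence separator
    return "\n".join(lines) + "\n"


example = [[("John", "PERSON"), ("Paul", "PERSON"), ("visited", "O"), ("Poland", "LOCATION")]]
tsv_text = format_training_lines(example)

# Writing the result to disk (the file name is a placeholder):
# with open("corpus.tsv", "w", encoding="utf-8", newline="\n") as handle:
#     handle.write(tsv_text)
```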

4. Create a *properties file* that specifies the options to be used in training. (The properties can also be given as command-line options.) A sample properties file can be found at [https://nlp.stanford.edu/software/crf-faq.shtml#b]:

> ```
> trainFile = your-trained-corpus-file.tsv
> serializeTo = your-ner-model.ser.gz
> map = word=0,answer=1
> maxLeft=1
> useClassFeature=true
> useWord=true
> useNGrams=true
> noMidNGrams=true
> maxNGramLeng=6
> usePrev=true
> useNext=true
> useDisjunctive=true
> useSequences=true
> usePrevSequences=true
> useTypeSeqs=true
> useTypeSeqs2=true
> useTypeySequences=true
> wordShape=chris2useLC
> ```

5. The final step is to train the new model:

> java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop corpus.prop
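
If the training is driven from Python, the properties file and the command line can be generated as sketched below. The helper names are hypothetical and only a few of the options are shown; the Java class name and the `-prop` option are the ones used in the command above. The sketch builds the command without running it, since training requires Java and `stanford-ner.jar` to be available.

```python
import subprocess


def render_properties(options):
    """Render a dict of options as 'key=value' lines for a properties file."""
    lines = []
    for key, value in options.items():
        # Java properties expect lowercase 'true'/'false' for booleans.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


def build_train_command(prop_path, jar_path="stanford-ner.jar"):
    """Build the argument list for the training command shown above."""
    return ["java", "-cp", jar_path,
            "edu.stanford.nlp.ie.crf.CRFClassifier", "-prop", prop_path]


def train(prop_path):
    """Run the training; raises CalledProcessError if Java exits with an error."""
    subprocess.run(build_train_command(prop_path), check=True)


properties_text = render_properties({
    "trainFile": "corpus.tsv",          # placeholder file name
    "serializeTo": "ner-model.ser.gz",  # placeholder file name
    "map": "word=0,answer=1",
    "useWord": True,
    "maxNGramLeng": 6,
})
command = build_train_command("corpus.prop")
```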

## Compute co-occurrence of NER entities

This section computes word-word co-occurrences for the Vatican text material. Two words co-occur if they appear in the same document. The co-occurrence value is computed as the product of the number of times each of the two words occurs in that document.
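
As a small worked example of this definition, the sketch below computes the per-document co-occurrence values in plain Python. The function name and the input layout (a dict of entity counts per document) are assumptions made for illustration; the notebook cells in this section do the same computation with pandas.

```python
from itertools import combinations


def cooccurrence(doc_counts):
    """Return {(document, entity_a, entity_b): value} for every entity pair.

    `doc_counts` maps a document name to a dict of {entity: count}.
    Each unordered pair is reported once, with entity_a < entity_b, and its
    value is the product of the two entity counts in that document.
    """
    result = {}
    for document, counts in doc_counts.items():
        for entity_a, entity_b in combinations(sorted(counts), 2):
            result[(document, entity_a, entity_b)] = counts[entity_a] * counts[entity_b]
    return result


doc_counts = {"doc-1": {"Christ": 1, "Church": 2, "John XXIII": 2}}
values = cooccurrence(doc_counts)
# "Church" occurs twice and "John XXIII" twice, so their value is 2 * 2 = 4.
```
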
### Import the Excel file
Load the NER data computed by `NER_runner.py` into a pandas DataFrame. Note that `pandas` (and its dependencies) and the `xlrd` package must be installed for this to work (recent pandas versions read `.xlsx` files with `openpyxl` instead).

```python
import pandas as pd

df = pd.read_excel('../data/NER-john-paul-ii/john-paul-ii.xlsx', sheet_name='Sheet1')
```

List the loaded data, then some columns for the first few records. Note that `Count` is the number of occurrences of the entity in the document (i.e. a word-document co-occurrence).

```python
df
```

| | Document | Year | Genre | Pope | Entity | Classifier | Count |
|---|---|---|---|---|---|---|---|
| 0 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Church | ORGANIZATION | 2 |
| 1 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Mother of God | PERSON | 1 |
| 2 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Elizabeth | PERSON | 1 |
| 3 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Dear Brothers | ORGANIZATION | 1 |
| 4 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Jesus Christ | PERSON | 1 |
| 5 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Christ | PERSON | 1 |
| 6 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Vatican Basilica | LOCATION | 1 |
| 7 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Pope | PERSON | 1 |
| 8 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Jesus | PERSON | 1 |
| 9 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | 1978 | angelus | john-paul-ii | Virgin Mary | PERSON | 1 |


```python
df[['Document', 'Entity', 'Classifier', 'Count']].head()
```

| | Document | Entity | Classifier | Count |
|---|---|---|---|---|
| 0 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | Church | ORGANIZATION | 2 |
| 1 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | Mother of God | PERSON | 1 |
| 2 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | Elizabeth | PERSON | 1 |
| 3 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | Dear Brothers | ORGANIZATION | 1 |
| 4 | john-paul-ii_en_angelus_1978_hf-jp-ii-ang-1978... | Jesus Christ | PERSON | 1 |


What we want is to compute a *word-word* co-occurrence matrix within each document. To do this we first join all the records from the same document with each other. This produces a Cartesian product of all possible entity pairs in the same document, from which the condition `Entity_x < Entity_y` keeps each unordered pair once and drops pairs of an entity with itself. The result is grouped by document and entity pair (n.b. this step is actually not necessary, since the data is already grouped by document and entity).

```python
# Self-join on Document; 'Entity_x < Entity_y' keeps each unordered pair once.
df2 = (pd.merge(df, df, on='Document')
         .query('Entity_x < Entity_y')
         .groupby(['Document', 'Entity_x', 'Classifier_x', 'Entity_y', 'Classifier_y', 'Year_x', 'Genre_x'])
         [['Count_x', 'Count_y']].sum())
```

Let's compute the co-occurrence value as the product of the number of times each entity occurs in the document, and add the value as a new column in the dataframe. (The word counts are removed so that they are not written to file.)

```python
df2['Count'] = df2['Count_x'] * df2['Count_y']
df2.pop('Count_x')
df2.pop('Count_y')

df2.head(10)
```

| Document | Entity_x | Classifier_x | Entity_y | Classifier_y | Year_x | Genre_x | Count |
|---|---|---|---|---|---|---|---|
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | Church | ORGANIZATION | 1978 | angelus | 2 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | Dear Brothers | ORGANIZATION | 1978 | angelus | 1 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | Elizabeth | PERSON | 1978 | angelus | 1 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | Jesus | PERSON | 1978 | angelus | 1 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | Jesus Christ | PERSON | 1978 | angelus | 1 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | John Paul | PERSON | 1978 | angelus | 1 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | John XXIII | PERSON | 1978 | angelus | 2 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | Mother of God | PERSON | 1978 | angelus | 1 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | Pope | PERSON | 1978 | angelus | 1 |
| john-paul-ii_en_angelus_1978_hf-jp-ii-ang-19781029 | Christ | PERSON | Vatican Basilica | LOCATION | 1978 | angelus | 1 |


Write the result to a text file:

```python
df2.to_csv('../data/NER-john-paul-ii/john-paul-ii-co-occurrence.csv', sep=';')
```