# Example of PDF Processing 
# Convert information in CAMEO PDF to ontology classes

## Ethnicities

This notebook uses xpdf's pdftotext tool to create a machine-readable version of the list of ethnicities in the document, CAMEO.Manual.1.1b3.pdf (retrieved from https://www.gdeltproject.org/data/documentation/CAMEO.Manual.1.1b3.pdf). The list is found in Table 5.1, starting on page 121, of the document. The command to extract the table is:
```
pdftotext -f 121 -l 137 -table CAMEO.Manual.1.1b3.pdf
```

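pdftotext can be downloaded from https://www.xpdfreader.com/download.html. When no output file is named, it writes the text to a file with the PDF's base name and a .txt extension (here, CAMEO.Manual.1.1b3.txt). The code below expects the extracted text to be saved as CAMEO.ethnicity.txt.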
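
If the extraction is run from Python rather than from a shell, the same command can be assembled as an argument list and passed to `subprocess.run`. This is only a sketch: the helper name `pdftotext_command` is hypothetical, pdftotext is assumed to be on the PATH, and the output name CAMEO.ethnicity.txt is taken from the processing code later in this notebook.

```python
import subprocess

def pdftotext_command(pdf_path, txt_path, first_page, last_page):
    """Build the pdftotext argument list shown above, with an explicit output file."""
    return ['pdftotext', '-f', str(first_page), '-l', str(last_page),
            '-table', pdf_path, txt_path]

command = pdftotext_command('CAMEO.Manual.1.1b3.pdf', 'CAMEO.ethnicity.txt', 121, 137)
print(' '.join(command))
# To actually run the extraction (requires pdftotext to be installed):
# subprocess.run(command, check=True)
```

Naming the output file explicitly avoids a separate rename step before the processing code runs.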

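An excerpt of the pdftotext output is shown below (`. . .` marks omitted rows). When processing it to convert the information to RDF, we must skip or account for the blank lines, page headers with page numbers, 'continued on next page' text, table headers, and country codes that wrap onto a second line.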

```
CHAPTER 5.     CAMEO ETHNIC CODING SCHEME                                        114

                      Table 5.1:     CAMEO Ethnic Group Codes

Ethnic Group Name                    Code  Selected Countries

Abkhaz (Abkhazians)                  abk   GEO, DEU, RUS, SYR, TUR, UKR

Aboriginal-Australians (Aborigines)  abr   AUS

Acehnese (Achinese)                  ace   IDN, MYS

. . .

Basque                             baq   ARG, CHL, CRI, CUB, BOL, BRA, ESP,

                                         FRA, MEX, URY, USA, VEN

                                                        continued on next page
<NP>CHAPTER 5.    CAMEO ETHNIC   CODING  SCHEME                                    116

Ethnic Group Name                    Code  Selected Countries

Baster                               bst   NAM

Batak                                btk   IDN


. . .
```
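
The filtering rules just described can be tried on a few lines from the excerpt before running the full conversion. This is a minimal sketch that mirrors the checks used in the processing code below; the helper name `is_data_row` is hypothetical.

```python
import re

def is_data_row(line: str) -> bool:
    """Return True if the line looks like the first line of a table entry."""
    if not line.strip():
        return False    # blank line
    if 'continued on ' in line or 'Ethnic Group ' in line:
        return False    # footer text, table title or table header
    if re.match(r'\s', line):
        return False    # indented: wrapped country codes or centered text
    if re.search(r'\d', line):
        return False    # page header, which contains chapter and page numbers
    return True

sample = [
    'CHAPTER 5.     CAMEO ETHNIC CODING SCHEME                                        114',
    '',
    '                      Table 5.1:     CAMEO Ethnic Group Codes',
    'Ethnic Group Name                    Code  Selected Countries',
    'Abkhaz (Abkhazians)                  abk   GEO, DEU, RUS, SYR, TUR, UKR',
    'Basque                             baq   ARG, CHL, CRI, CUB, BOL, BRA, ESP,',
    '                                         FRA, MEX, URY, USA, VEN',
    '                                                        continued on next page',
]
rows = [line for line in sample if is_data_row(line)]
print(rows)
```

Only the 'Abkhaz' and 'Basque' lines pass the filter. The rule that rejects any line containing a digit relies on ethnic group names and codes never containing digits, which holds for the rows shown in the excerpt.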

Note that the two entries for Yao, 'Yao (Africa)' and 'Yao (Asia) (Dao)', were changed to 'African Yao (Yao)' and 'Asian Yao (Dao)', because the parsing code below handles only one set of parentheses per line. 'Yao (Asia) (Dao)' was the only occurrence of that problematic pattern, so it was corrected by hand rather than in code. In the final ttl, the label 'Yao' was added to the :AsianYao class.
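
Entries of this kind can be located before parsing by counting opening parentheses. A minimal sketch, in which `has_multiple_parentheses` is a hypothetical helper and the sample strings are the names quoted above rather than full table rows:

```python
def has_multiple_parentheses(text: str) -> bool:
    """Return True if the text contains more than one parenthesized group."""
    return text.count('(') > 1

samples = ['Yao (Africa)', 'Yao (Asia) (Dao)', 'Asian Yao (Dao)']
flagged = [s for s in samples if has_multiple_parentheses(s)]
print(flagged)
```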

In [1]:
import re

# Process CAMEO.ethnicity.txt 
# Since the file is small, it is just loaded directly into memory instead of being processed one line at a time
with open('CAMEO.ethnicity.txt', 'r') as cam_file:
    contents = cam_file.read().split('\n')

with open('ethnicity.ttl', 'w') as ethnic_out:
    # Write prefixes
    ethnic_out.write('@prefix : <urn:ontoinsights:ontology:dna:> .\n')
    ethnic_out.write('@prefix dna: <urn:ontoinsights:ontology:dna:> .\n')
    ethnic_out.write('@prefix owl: <http://www.w3.org/2002/07/owl#> .\n')
    ethnic_out.write('@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n\n')
    for line in contents:
        # Ignore blank line
        if len(line) == 0:
            continue
        # Ignore lines with page numbers, "continued on ..." text, lines that are continuations of country codes, and table headers
        if 'continued on ' in line or 'Ethnic Group ' in line or re.match(r'\s', line) or re.search(r'\d', line):
            continue
        
        # Check if there is more than one name (if there is an alternate name in parentheses)
        names = list()
        if '(' in line:
            name_name_code = line.split('(')
            names.append(name_name_code[0].strip())
            name_code = name_name_code[1].split(')')
            names.append(name_code[0])
            code = name_code[1].split()[0]
        else:
            # Account for a space in the ethnic name (e.g., 'East Indian')
            # Processing assumes that names are capitalized but codes are not
            name_code = line.split()
            name = ''
            for item in name_code:
                if item[0].isupper():
                    name += f' {item}'
                else:
                    code = item
                    # Found code, so we are done
                    break
                    
            names.append(name.strip())
        if len(names) == 1:
            ethnic_out.write(f':CameoEthnicity_{code} a owl:Class;\n  rdfs:subClassOf :Ethnicity ;\n  rdfs:label "{names[0]}" .\n\n')
        else:
            ethnic_out.write(f':CameoEthnicity_{code} a owl:Class;\n  rdfs:subClassOf :Ethnicity ;\n  rdfs:label "{names[0]}", "{names[1]}" .\n\n')

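To check the parsing on a single row, the name/code split and the generated Turtle can be exercised in isolation. The sketch below is an alternative to the cell above, not the notebook's method: it assumes that the `-table` layout separates columns by two or more spaces (as in the excerpt), and the function names `parse_ethnicity_row` and `to_turtle` are hypothetical.

```python
import re

def parse_ethnicity_row(line: str):
    """Split a table row into (names, code), assuming columns are 2+ spaces apart."""
    columns = re.split(r'\s{2,}', line.strip())
    name_part, code = columns[0], columns[1]
    # An alternate name, if present, follows the main name in parentheses
    match = re.fullmatch(r'(.+?)\s*\((.+)\)', name_part)
    names = [match.group(1), match.group(2)] if match else [name_part]
    return names, code

def to_turtle(names, code) -> str:
    """Render one class definition in the same shape as the cell above."""
    labels = ', '.join(f'"{name}"' for name in names)
    return (f':CameoEthnicity_{code} a owl:Class ;\n'
            f'  rdfs:subClassOf :Ethnicity ;\n'
            f'  rdfs:label {labels} .\n')

names, code = parse_ethnicity_row(
    'Abkhaz (Abkhazians)                  abk   GEO, DEU, RUS, SYR, TUR, UKR')
print(to_turtle(names, code))
```

Splitting on runs of spaces keeps multi-word names such as 'East Indian' intact without relying on capitalization, but it fails if a name and its code are separated by a single space.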

## Religions

The 'Religions' table (Table 8.1) of the same document, CAMEO.Manual.1.1b3.pdf, is processed next. The command to extract the table is:
```
pdftotext -f 161 -l 174 -table CAMEO.Manual.1.1b3.pdf
```

An excerpt of the pdftotext output is shown below. As with the ethnicities table, we must skip or account for the blank lines, page numbers, 'Continued ...' text, table headers, etc. The code below expects the cleaned-up text in CAMEO.religion.txt.

```
                                      Table    8.1:    Directory     of  all  Religious  Codes  (v.1.0)

Heirarchical    Code  Religion and Comments

REL                   Unspecified Religious

ATH                   Agnostic/Atheist

ATH010                Freethought

BAH                   Bahai Faith              inc. all non-schismatic Bahai

...
```

Note that cleanup of the text file was done as part of pre-processing to remove double quotation marks, to change 'a.k.a.' to commas, to make indentation consistent, to remove names (with numeric codes) that have no distinguishing details (such as 'REL - Unspecified Religious' and 'JEW001 - (any) ecumenical Jewish organization'), and to make the use of parentheses, slashes, 'and'/'or' and similar conventions consistent.
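
The notebook does not show how this cleanup was carried out. The two purely mechanical steps (dropping double quotation marks and turning 'a.k.a.' into a comma) could be scripted along the following lines. This is a sketch only: `clean_line` is a hypothetical helper, and the sample line is made up for illustration and is not taken from the CAMEO table.

```python
import re

def clean_line(line: str) -> str:
    """Remove double quotes and replace 'a.k.a.' with a comma separator."""
    line = line.replace('"', '')
    # Swallow the spaces around 'a.k.a.' so the result reads 'Name, Other Name'
    line = re.sub(r'\s*\ba\.k\.a\.\s*', ', ', line)
    return line.rstrip()

# Made-up sample line (not from the CAMEO manual)
print(clean_line('XXX   Example Faith a.k.a. "Sample Faith"'))
```

Leading spaces are left untouched because the processing code below uses indentation to determine the class hierarchy.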

In [2]:
import re

# Process CAMEO.religion.txt
# Since the file is small, it is just loaded directly into memory instead of being processed one line at a time
with open('CAMEO.religion.txt', 'r') as cam_file:
    contents = cam_file.read().split('\n')

with open('religion.ttl', 'w') as rel_out:
    # Write prefixes
    rel_out.write('@prefix : <urn:ontoinsights:ontology:dna:> .\n')
    rel_out.write('@prefix dna: <urn:ontoinsights:ontology:dna:> .\n')
    rel_out.write('@prefix owl: <http://www.w3.org/2002/07/owl#> .\n')
    rel_out.write('@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n\n')
    
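    # Most recently seen codes, used to find the superclass of an indented row:
    #   [0] 3- or 6-letter prefix of the last unindented code, [1] last unindented code (other than a bare 3-letter code),
    #   [2] last code at the first indent level, [3] last code at the second indent level
    last_codes = ['', '', '', '']
    # Unindented codes carry an implicit level of their own: a 6-letter code, or a code with numbers, is a subclass of its 3- or 6-letter prefix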
    for line in contents:
        # Ignore blank line
        if len(line) == 0:
            continue
        # Ignore lines with certain text (from header, footer or table header) or only spaces and numbers (a page number)
        if 'Table ' in line or 'CHAPTER 8' in line or 'Continued ' in line or 'Religion and Comments' in line or re.match("^[0-9 ]+$", line):
            continue
            
        # Get the CAMEO code, labels, ... for the religion 
        code = line.strip().split()[0]
        number_spaces = len(line) - len(line.lstrip(' '))
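        # Position of the first digit within the code (None if the code is all letters);
        # searching the code rather than the whole line avoids being misled by digits in the names
        first_digit = re.search(r'\d', code)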
        # Get names - which start after the code
        name_string = line.split(code, 1)[1].strip()
        # Split names by commas
        orig_names = name_string.split(',')
        names = list()
        # Process all the names, taking out additional spaces and capitalizing
        for orig in orig_names:
            names.append(' '.join(orig.split()).title())
        
        # Get details of subclassing
        if not number_spaces:
            # No indent, so the superclass is implied by the code itself (its first 3 or 6 letters)
            if not first_digit:
                if len(code) == 3:
                    last_codes[0] = code[:3]
                    rel_out.write(f':CameoReligion_{code} a owl:Class ;\n  rdfs:subClassOf :ReligiousActivities ;\n')
                else:
                    last_codes[0] = code[:6]
                    last_codes[1] = code
                    rel_out.write(f':CameoReligion_{code} a owl:Class ;\n  rdfs:subClassOf :CameoReligion_{code[:3]} ;\n')
            else:
                if first_digit.start() == 3:
                    last_codes[0] = code[:3]
                    last_codes[1] = code
                    rel_out.write(f':CameoReligion_{code} a owl:Class ;\n  rdfs:subClassOf :CameoReligion_{code[:3]} ;\n')
                else:
                    last_codes[0] = code[:6]
                    last_codes[1] = code
                    rel_out.write(f':CameoReligion_{code} a owl:Class ;\n  rdfs:subClassOf :CameoReligion_{code[:6]} ;\n')
        else:
            # Indented - Either 1-3 levels (Don't need to save the code of the 3rd level, since hierarchy does not go deeper)
            if number_spaces <= 5:   # First level
                last_codes[2] = code
                rel_out.write(f':CameoReligion_{code} a owl:Class ;\n  rdfs:subClassOf :CameoReligion_{last_codes[1]} ;\n')
            elif number_spaces <= 10:  # Second level
                last_codes[3] = code
                rel_out.write(f':CameoReligion_{code} a owl:Class ;\n  rdfs:subClassOf :CameoReligion_{last_codes[2]} ;\n')
            else:    # Third level
                rel_out.write(f':CameoReligion_{code} a owl:Class ;\n  rdfs:subClassOf :CameoReligion_{last_codes[3]} ;\n')
        # Write out names
        label_string = '", "'.join(names)
        rel_out.write(f'  rdfs:label "{label_string}" .\n\n')

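For unindented rows, the superclass is derived from the code alone, and that rule can be checked separately from the file handling. The sketch below restates the rule from the cell above as a function. The name `superclass_of` is hypothetical; 'ATH', 'ATH010' and 'JEW001' appear earlier in this notebook, while 'AAABBB' and 'AAABBB010' are made-up codes used only to exercise the 6-letter cases.

```python
import re

def superclass_of(code: str) -> str:
    """Return the superclass local name for an unindented CAMEO religion code."""
    digit = re.search(r'\d', code)
    if digit is None:
        # All letters: a 3-letter code is top level, a longer one extends its 3-letter prefix
        return 'ReligiousActivities' if len(code) == 3 else f'CameoReligion_{code[:3]}'
    # Numeric suffix: the superclass is the letter prefix, either 3 or 6 characters long
    prefix_length = 3 if digit.start() == 3 else 6
    return f'CameoReligion_{code[:prefix_length]}'

for example in ['ATH', 'ATH010', 'JEW001', 'AAABBB', 'AAABBB010']:
    print(example, '->', superclass_of(example))
```

Indented rows are not covered here, since their superclass depends on the preceding rows (tracked in `last_codes`) rather than on the code itself.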