# Coding actors

Actors are counted using named entity recognition (NER). Only two entity categories are coded: PERSON and ORGANIZATION.

The process has three steps:
1. Collecting the unique actors and writing them to a separate data file
2. Coding them
3. Counting them automatically

#### Category selection
Only the PERSON and ORGANIZATION categories are selected for coding. The CoreNLP NER separates these two well enough; where it does confuse them, the entity often plausibly fits either category. The MISC category catches whatever remains that is clearly not a date, number, ordinal or location, so any actors missed by the PERSON and ORGANIZATION tags (false negatives) should end up there. MISC is fairly large, however, so for the initial coding the PERSON and ORGANIZATION tags should suffice.

#### Counting
Counting requires some care, because actor names can overlap and a naive substring count would count the same mention more than once. Consider the following names, which overlap in several ways. Firstly, all of them contain 'Council', so every mention of a longer name would also be counted as 'Council' (double counting). Secondly, 'Council of Ministers' is contained in 'Council of Ministers Committee of Permanent Representatives', so a single mention of the latter would be counted three times.

```
Council
Council Presidency
Council of 16
Council of Europe
Council of General Affairs
Council of Ministers
Council of Ministers Committee of Permanent Representatives
Council of the European Union
Council of the Union
```
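
To make the problem concrete, the following minimal sketch counts three of the names above with plain substring counting. The example sentence is made up for illustration; it mentions exactly one actor, yet the naive total is three.

```python
# Illustrative sentence (not from the data set); it mentions a single actor.
sentence = 'The Council of Ministers Committee of Permanent Representatives met.'

names = [
    'Council',
    'Council of Ministers',
    'Council of Ministers Committee of Permanent Representatives',
]

# Naive approach: count every name as a plain substring.
naive_counts = {name: sentence.count(name) for name in names}
total = sum(naive_counts.values())

print(naive_counts)
print(total)  # 3, although only one actor is mentioned
```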

Double counting is avoided with `re.findall()`, which returns all non-overlapping matches. The regular expression is built from alternatives joined with `|` (e.g. `'Council Presidency|Council'`). At each position the regex engine tries the alternatives from left to right and **takes the first one that matches**, not the longest. Therefore, to match each mention once under its full name, the entities have to be sorted by length, longest first, before building the pattern. To clarify, consider the following:

```Python
import re

sentence = 'This sentence is about the Council Presidency'
print(re.findall('Council Presidency|Council', sentence))
# ['Council Presidency']
print(re.findall('Council|Council Presidency', sentence))
# ['Council']
```

The former is correct, hence the sorting. The pattern is precompiled in order to save computation time.
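
The sorting and precompilation described above could be sketched as follows. The function names `build_actor_pattern` and `count_actors` and the example text are illustrative assumptions, not part of this notebook; `re.escape` is added as a precaution against names that contain regex metacharacters.

```python
import re
from collections import Counter


def build_actor_pattern(actors):
    """Compile one alternation pattern with the longest names first."""
    # Longest first, so 'Council of Ministers' is tried before 'Council'.
    ordered = sorted(set(actors), key=len, reverse=True)
    # Escape each name so characters such as '.' or '(' are matched literally.
    return re.compile('|'.join(re.escape(name) for name in ordered))


def count_actors(text, pattern):
    """Count each non-overlapping match once, under its full name."""
    return Counter(pattern.findall(text))


actors = [
    'Council',
    'Council Presidency',
    'Council of Ministers',
    'Council of Ministers Committee of Permanent Representatives',
]
pattern = build_actor_pattern(actors)  # compiled once, reused for every text

text = ('The Council met the Council of Ministers. The Council Presidency '
        'briefed the Council of Ministers Committee of Permanent '
        'Representatives, and the Council agreed.')
counts = count_actors(text, pattern)
print(counts)
```

Note that this sketch matches plain substrings, so 'Council' would also match inside a word such as 'Councillor'. Wrapping the alternation in `\b` word boundaries is one possible refinement.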

In [1]:
```Python
import json
import re

import pandas as pd
from stanfordcorenlp import StanfordCoreNLP

import codingtools
from ner_methods import *  # provides getUniqueEntities, used below

corenlp_ram = '2g'  # memory passed to CoreNLP
settings_file = 'D:/thesis/settings.json'
```

In [2]:
```Python
# Preparation

# Read settings (the context manager closes the file afterwards)
with open(settings_file) as f:
    settings = json.load(f)["settings"]

# Read data
df = pd.read_csv(settings['data_csv'])
```
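
The settings file itself is not shown. Judging only from the keys this notebook reads (`data_csv`, `corenlp_dir`, `person_csv`, `org_csv`, all nested under `"settings"`), its structure could look roughly like the sketch below. All paths are hypothetical placeholders, and the real file may contain further keys.

```python
import json

# Hypothetical example; only the key names are taken from the notebook.
example_settings = {
    "settings": {
        "data_csv": "data/documents.csv",           # input texts, with a TEXT column
        "corenlp_dir": "tools/stanford-corenlp",    # CoreNLP installation folder
        "person_csv": "output/persons.csv",         # uncoded PERSON entities
        "org_csv": "output/organizations.csv",      # uncoded ORGANIZATION entities
    }
}

# Serialise to the JSON text that would be stored in settings.json ...
serialized = json.dumps(example_settings, indent=4)
print(serialized)

# ... and read it back the same way the notebook does.
settings = json.loads(serialized)["settings"]
```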

In [3]:
```Python
# Unique named entities
uniqueEntities = set()
errorList = []

with StanfordCoreNLP(settings['corenlp_dir'], memory=corenlp_ram) as nlp:
    for index, text in enumerate(df['TEXT']):
        # Some texts raise errors, e.g. when different alphabets are used,
        # so record their positions instead of stopping.
        try:
            uniqueEntities.update(getUniqueEntities(text, nlp))
        except Exception:
            errorList.append(str(index))

print("Errors occurred on indices:", ", ".join(errorList))

# Sort entities into categories
entity_map = {}

for entity, entity_type in uniqueEntities:
    # Create the set for a category the first time it is seen, then add the entity
    entity_map.setdefault(entity_type, set()).add(entity)

print("Entities were found in the following categories:", ", ".join(entity_map.keys()))

# Save the uncoded actors, sorted alphabetically, one name per line
with open(settings['person_csv'], 'w') as f:
    for person in sorted(entity_map['PERSON']):
        f.write(person + '\n')

with open(settings['org_csv'], 'w') as f:
    for org in sorted(entity_map['ORGANIZATION']):
        f.write(org + '\n')
```

```
Errors occurred on indices: 0, 31, 32, 44, 45, 87
Entities were found in the following categories: MISC, PERSON, PERCENT, DATE, ORGANIZATION, NUMBER, LOCATION, DURATION, TIME, ORDINAL, MONEY, SET
```
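
`getUniqueEntities` comes from the `ner_methods` module, which is not shown here. Purely as an illustration, and assuming it merges consecutive tokens that share a NER tag into one entity and returns `(entity, entity_type)` tuples (the shape the loop above unpacks), a helper of that kind could look like the sketch below. The name `unique_entities_from_tags` is hypothetical. It takes a list of `(token, tag)` pairs, which is the format the `ner()` method of the `stanfordcorenlp` wrapper is understood to return, so the sketch runs without a CoreNLP installation.

```python
from itertools import groupby


def unique_entities_from_tags(tagged_tokens):
    """Merge runs of identically tagged tokens into (entity, tag) tuples."""
    entities = set()
    # groupby yields one group per run of consecutive tokens with the same tag.
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == 'O':  # 'O' marks tokens outside any named entity
            continue
        entity = ' '.join(token for token, _ in group)
        entities.add((entity, tag))
    return entities


# Hand-written example tags (not actual CoreNLP output).
tagged = [
    ('The', 'O'), ('Council', 'ORGANIZATION'), ('of', 'ORGANIZATION'),
    ('Ministers', 'ORGANIZATION'), ('met', 'O'), ('Angela', 'PERSON'),
    ('Merkel', 'PERSON'), ('in', 'O'), ('Brussels', 'LOCATION'), ('.', 'O'),
]
entities = unique_entities_from_tags(tagged)
print(sorted(entities))
```

One limitation of this approach is that two different entities of the same type that directly follow each other are merged into a single name.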


In [11]:
```Python
# Coding actors
import codingtools

categories = ['EU political actor', 'National political actor', 'Other nationality, political', 'Non-political']
to_code = entity_map['PERSON']

coding = codingtools.codingTool(to_code, categories)
```

In [14]:
```Python
coding.code()
```

A Jupyter Widget
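
The third step, counting, is not shown in this notebook. A minimal sketch of how coded actors could be counted per category is given below. The `coded` dictionary (actor name mapped to category) is a hypothetical stand-in, since the output format of `codingtools` is not shown, and the example names and text are illustrative only.

```python
import re
from collections import Counter

# Hypothetical coding result: actor name -> assigned category.
coded = {
    'Angela Merkel': 'National political actor',
    'Jean-Claude Juncker': 'EU political actor',
    'Donald Tusk': 'EU political actor',
}

# Longest names first, escaped, and compiled once for reuse across texts.
names = sorted(coded, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(name) for name in names))


def count_categories(text):
    """Count the mentions in a text, grouped by the category of the actor."""
    return Counter(coded[name] for name in pattern.findall(text))


text = ('Jean-Claude Juncker and Donald Tusk met Angela Merkel. '
        'Donald Tusk spoke first.')
category_counts = count_categories(text)
print(category_counts)
```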