# Coding actors

Actors are counted using named-entity recognition (NER). Only two entity categories are used: PERSON and ORGANIZATION.

The process has three steps:
1. Summarising the unique actors and listing these in a separate data file
2. Coding them (see separate notebook `Coding actors.ipynb`)
3. Counting them automatically

#### Category selection
Only the PERSON and ORGANIZATION categories are selected for coding. The CoreNLP NER separates these two well enough; when an entity is misclassified, it usually lands in the other of these two categories, so it is still captured. The MISC category catches whatever is left that is obviously not a date, number, ordinal or location, so any false negatives should end up there. That category, however, is fairly large. Therefore, for initial coding the PERSON and ORGANIZATION tags should suffice.
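As a minimal sketch of this selection step, the snippet below groups `(entity, label)` pairs by label and keeps only the two selected categories. It assumes the pairs have the same shape as the tuples unpacked from `getUniqueEntities()` later in this notebook, and it uses the label `ORG` as it appears in the notebook output. The entity values themselves are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (entity, label) pairs; the values are invented for illustration.
unique_entities = {
    ('Council of Ministers', 'ORG'),
    ('European Commission', 'ORG'),
    ('Jean-Claude Juncker', 'PERSON'),
    ('Brussels', 'GPE'),
    ('2017', 'DATE'),
}

# Group the entity strings by their NER label.
entity_map = defaultdict(set)
for entity, entity_type in unique_entities:
    entity_map[entity_type].add(entity)

# Keep only the categories selected for coding, sorted alphabetically.
selected = {label: sorted(entity_map[label]) for label in ('PERSON', 'ORG')}
print(selected)
# {'PERSON': ['Jean-Claude Juncker'], 'ORG': ['Council of Ministers', 'European Commission']}
```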

#### Counting
Counting is rather tedious in this case, because names of actors might overlap and therefore lead to double counting. Consider the following names. These strings overlap in several ways. First, all of them contain 'Council', leading to double counting. Second, 'Council of Ministers' is contained in 'Council of Ministers Committee of Permanent Representatives', leading to triple counting.

```
Council
Council Presidency
Council of 16
Council of Europe
Council of General Affairs
Council of Ministers
Council of Ministers Committee of Permanent Representatives
Council of the European Union
Council of the Union
```

This is overcome using `re.findall()`, which finds all non-overlapping matches. The regular expression is built from alternatives joined with `|` (e.g. `'Council Presidency|Council'`). At each position the regex engine tries the alternatives from left to right and **takes only the first one that matches**. Therefore, to match the longest name and ignore the shorter names it contains, the entities have to be sorted by length, longest first, before building the pattern. To clarify, consider the following:

```Python
import re

sentence = 'This sentence is about the Council Presidency'
print(re.findall('Council Presidency|Council',sentence))
# ['Council Presidency']
print(re.findall('Council|Council Presidency',sentence))
# ['Council']
```

The former is correct, hence the sorting. The pattern is precompiled in order to save computation time.
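The sketch below shows how such a pattern could be built from a list of names, and how the result compares with naive substring counting. `build_pattern` is a hypothetical helper, not part of this notebook. The use of `re.escape()` is an added precaution, assumed here so that names containing characters such as `.` or `(` are matched literally.

```python
import re

# A few of the overlapping actor names from the list above.
names = [
    'Council',
    'Council of Ministers',
    'Council of Ministers Committee of Permanent Representatives',
    'Council Presidency',
]

def build_pattern(names):
    """Hypothetical helper: compile one alternation with the longest names first."""
    ordered = sorted(names, key=len, reverse=True)
    # Escape every name so it is treated as literal text (an assumption of this sketch).
    return re.compile('|'.join(re.escape(name) for name in ordered))

text = ('The Council of Ministers Committee of Permanent Representatives met, '
        'and the Council Presidency reported to the Council.')

# Naive substring counting counts 'Council' inside every longer name as well.
naive_total = sum(text.count(name) for name in names)
print(naive_total)
# 6

# The length-sorted alternation yields one match per mention.
matches = build_pattern(names).findall(text)
print(matches)
# ['Council of Ministers Committee of Permanent Representatives', 'Council Presidency', 'Council']
```

The sketch does not add word boundaries, so a name would still match inside a longer word; whether that matters depends on the actor names being coded.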

In [1]:
import pandas as pd
import json
from ner_methods import *
import spacy
import re
import codingtools

corenlp_ram = '2g'
settings_file = 'D:/thesis/settings.json'

In [9]:
#Preparation

#Read settings (the context manager closes the file after reading)
with open(settings_file) as f:
    settings = json.load(f)["settings"]

#Read data
df = pd.read_csv(settings['data_csv'])

In [10]:
#Unique named entities
uniqueEntities = set()

errorList = []

nlp = spacy.load('en_core_web_sm')
for index, text in enumerate(df['TEXT']):
    try: #Sometimes texts give errors, e.g. when different alphabets are used. Therefore list those.
        uniqueEntities.update(getUniqueEntities(text,nlp))
    except Exception: #Catch processing errors only, not KeyboardInterrupt or SystemExit
        errorList.append(str(index))

print("Errors occurred on indices:",", ".join(errorList))

##Sort entities into categories
entity_map = {}

for (entity,entity_type) in uniqueEntities:
    if entity_type not in entity_map: #Set element to the correct data format to then add elements
        entity_map[entity_type] = set()
        
    entity_map[entity_type].add(entity)
    
print("Entities were found in the following categories:",", ".join(entity_map.keys()))

##Save output of uncoded actors to CSV, sorted alphabetically
with open(settings['person_csv'],'w+',encoding = 'utf-8') as f:
    for person in sorted(entity_map['PERSON']):
        f.write(person+'\n')

with open(settings['org_csv'],'w+',encoding = 'utf-8') as f:
    for org in sorted(entity_map['ORG']):
        f.write(org+'\n')

Errors occurred on indices: 
Entities were found in the following categories: DATE, CARDINAL, ORG, GPE, NORP, PERSON, LOC, EVENT, PERCENT, MONEY, WORK_OF_ART, FAC, LAW, PRODUCT, ORDINAL, TIME, QUANTITY, LANGUAGE


In [73]:
# Counting actors

# Assuming coding is done using the extension from codingtools, this initiates a copy of that instance of codingTool
with open(settings['codes'], 'r', encoding = 'utf-8') as f:
    coding = codingtools.from_json(f.read())
codes = coding.coded
categories = coding.categories

#As mentioned above, keys should be sorted by length for correct matching of overlapping cases
keys = sorted(codes.keys(), key = len, reverse = True)
#Escape each key so that characters such as '.' or '(' in actor names are matched literally
keys_pattern = re.compile('|'.join(re.escape(key) for key in keys))

#Collect the unique actors mentioned in each text
df['ACTORS'] = df['TEXT'].map(lambda x: set(keys_pattern.findall(x)))

#Count the unique actors per category for each text
for category in categories:
    df['count_'+category] = df['ACTORS'].map(lambda x: sum(codes[element] == category for element in x))

In [72]:
#Output to csv
df.to_csv(settings['data_csv'])

In [14]:
entity_map['ORG']

{' Brussels: Council of the European Union',
 ' The European Commission',
 '"Partnership for Europe',
 '0 Department for Exiting the European Union Policy on Businesses',
 '1 Dyson Department',
 '1 Foods Department for Exiting the European Union Policy on Foods',
 '2B EN Council',
 '3 Krall Department for Exiting the European Union',
 '4 Department for Exiting the European Union',
 '4 Department for Exiting the European Union Policy on Businesses',
 '6-month EU Council',
 '6/15 PV\\1138471EN.docx EN Results',
 '7 Group Department for Exiting the European Union Policy on Life Sciences',
 '8 Creative Industries Department for Exiting Federation the European Union Policy on Creative',
 '8 Group Department',
 'AB Discussion',
 'AB Foods Discussion on Department for Exiting the European Union Policy on Foods',
 'ACER Administrative Board',
 'ACP',
 'ADD 1 Compilation of Comments of the Draft Conduct of the Call for Contribution',
 'ADD 1 Draft Framework',
 'ADD 1 UD',
 'ADDED',
 'AFCO',
 'A

In [19]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(df['TEXT'][2])

In [10]:
type(nlp) == spacy.lang.en.English

True