# Coding actors

Actors are identified using named entity recognition (NER) and then counted. Only two entity categories are used: PERSON and ORGANIZATION.

The process has three steps:
1. Summarising the unique actors and listing these in a separate data file
2. Coding them (see the separate notebook `Coding actors.ipynb`)
3. Counting them automatically

#### Category selection
Only PERSON and ORGANIZATION are selected for coding. The CoreNLP NER separates these two well enough; when it does confuse them, the entity usually fits plausibly in either category. The MISC category catches whatever is left that is obviously not a date, number, ordinal or location, so any false negatives should end up there. That category is fairly large, however, so for initial coding the PERSON and ORGANIZATION tags should suffice.

#### Counting
Counting is rather tricky in this case, because names of actors might overlap and therefore lead to double counting. Consider the following names. They overlap in several ways. First, all of them contain 'Council', leading to double counting. Second, 'Council of Ministers' is itself contained in 'Council of Ministers Committee of Permanent Representatives', leading to triple counting.

```
Council
Council Presidency
Council of 16
Council of Europe
Council of General Affairs
Council of Ministers
Council of Ministers Committee of Permanent Representatives
Council of the European Union
Council of the Union
```
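
To make the double counting concrete, here is a minimal sketch that counts a few of the names above with plain substring search. The example sentence is made up for illustration and is not taken from the data.

```python
# Illustrative only: a naive substring count over three of the names above.
names = ['Council', 'Council of Ministers',
         'Council of Ministers Committee of Permanent Representatives']
sentence = ('The Council of Ministers Committee of Permanent Representatives '
            'met on Monday')

# str.count() searches for each name independently, so every name that is
# contained in the longest one is counted as well.
naive_counts = {name: sentence.count(name) for name in names}
print(naive_counts)
print(sum(naive_counts.values()))  # 3, although only one actor is mentioned
```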

This is overcome using `re.findall()`, which returns all non-overlapping matches. The regular expression is built from alternatives joined with `|` (e.g. `'Council Presidency|Council'`). At each position the regex engine tries the alternatives from left to right and **takes the first one that matches**, even if a later alternative would match a longer string. Therefore, to match the most specific name and ignore the substrings it contains, the entities have to be sorted by length, longest first, before building the pattern. To clarify, consider the following:

```Python
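import re

sentence = 'This sentence is about the Council Presidency'
# Longest alternative first: the full name is matched
print(re.findall('Council Presidency|Council', sentence))
# ['Council Presidency']
# Shortest alternative first: only the substring is matched
print(re.findall('Council|Council Presidency', sentence))
# ['Council']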
```

The former is correct, hence the sorting. The pattern is precompiled in order to save computation time.
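
As a hedged sketch of how such a pattern could be built from a list of names, the snippet below sorts a few of the names above by length, escapes them and compiles the result once. The example sentence is invented, and the use of `re.escape()` is an addition here rather than something the notebook does.

```python
import re

# A few of the overlapping names listed above, deliberately in arbitrary order.
names = ['Council', 'Council of Ministers', 'Council Presidency',
         'Council of Ministers Committee of Permanent Representatives']

# Longest first, so the most specific alternative is tried before the
# shorter names it contains.
ordered = sorted(names, key=len, reverse=True)

# re.escape() guards against names containing regex metacharacters (e.g. '.').
pattern = re.compile('|'.join(re.escape(name) for name in ordered))

sentence = ('The Council met the Council of Ministers Committee of Permanent '
            'Representatives and the Council Presidency')
matches = pattern.findall(sentence)
print(matches)
# ['Council',
#  'Council of Ministers Committee of Permanent Representatives',
#  'Council Presidency']
```

Note that this pattern does not enforce word boundaries, so 'Council' would also match inside a longer word such as 'Councillor'. Wrapping the alternatives as `r'\b(?:...)\b'` is one possible way to avoid that, if it turns out to matter for the data.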

In [6]:
import pandas as pd
import json
from ner_methods import *
import spacy
import re
import codingtools

spacy_model = 'nl_core_news_sm'
settings_file = 'D:/thesis/settings - nl.json'

In [3]:
#Preparation

#Read settings
with open(settings_file) as f: #Close the file handle once the settings are read
    settings = json.load(f)["settings"]

#Read data
df = pd.read_csv(settings['data_csv'])

In [7]:
#Unique named entities
uniqueEntities = set()

errorList = []

nlp = spacy.load(spacy_model)
for index, text in enumerate(df['TEXT']):
    try: #Sometimes texts give errors, e.g. when different alphabets are used. Therefore list those.
        uniqueEntities.update(getUniqueEntities(text,nlp))
    except Exception: #A bare except would also swallow KeyboardInterrupt
        errorList.append(str(index))

print("Errors occurred on indices:",", ".join(errorList))

##Sort entities into categories
entity_map = {}

for (entity,entity_type) in uniqueEntities:
    if entity_type not in entity_map: #Set element to the correct data format to then add elements
        entity_map[entity_type] = set()
        
    entity_map[entity_type].add(entity)
    
print("Entities were found in the following categories:",", ".join(entity_map.keys()))

##Save output of uncoded actors to CSV, sorted alphabetically
with open(settings['person_csv'],'w+',encoding = 'utf-8') as f:
    for person in sorted(entity_map['PER']):
        f.write(person+'\n')

with open(settings['org_csv'],'w+',encoding = 'utf-8') as f:
    for org in sorted(entity_map['ORG']):
        f.write(org+'\n')

Errors occurred on indices: 
Entities were found in the following categories: MISC, ORG, LOC, PER


KeyError: 'PERSON'
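
The helper `getUniqueEntities` comes from `ner_methods`, which is not shown in this notebook. Judging from how its result is unpacked above, it presumably returns `(entity, entity_type)` pairs. The following is only a guess at its behaviour: the function name `get_unique_entities_sketch` and the stand-in pipeline `fake_nlp` are hypothetical, and the stand-in exists solely so that the snippet runs without a spaCy model installed.

```python
from types import SimpleNamespace


def get_unique_entities_sketch(text, nlp):
    """Return the set of (entity text, entity label) pairs found in `text`.

    Hypothetical stand-in for `getUniqueEntities`. `nlp` is any callable that
    returns an object with an `.ents` iterable whose items expose `.text` and
    `.label_`, as a spaCy pipeline does.
    """
    doc = nlp(text)
    return {(ent.text, ent.label_) for ent in doc.ents}


def fake_nlp(text):
    """Stand-in pipeline with hard-coded entities, for illustration only."""
    ents = [SimpleNamespace(text='Theresa May', label_='PER'),
            SimpleNamespace(text='Europese Unie', label_='ORG'),
            SimpleNamespace(text='Theresa May', label_='PER')]
    return SimpleNamespace(ents=ents)


entities = get_unique_entities_sketch('Theresa May sprak met de Europese Unie', fake_nlp)
print(sorted(entities))
# [('Europese Unie', 'ORG'), ('Theresa May', 'PER')]
```

With a real pipeline, `fake_nlp` would be replaced by the object returned from `spacy.load(spacy_model)`.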

In [4]:
# Counting actors

# Assuming coding is done using the extension from codingtools, this initiates a copy of that instance of codingTool
with open(settings['codes'], 'r', encoding = 'utf-8') as f:
    coding = codingtools.from_json(f.read())
codes = coding.coded
categories = coding.categories

#As mentioned above, keys should be sorted by length for correct matching of overlapping cases
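keys = sorted(codes.keys(), key = len, reverse = True)
#Escape each key so that regex metacharacters in names (e.g. '.') are matched literally
keys_pattern = re.compile('|'.join(re.escape(key) for key in keys))

#Unique actors per text: repeated mentions of the same actor count once
df['ACTORS'] = df['TEXT'].map(lambda x: set(keys_pattern.findall(x)))

for category in categories:
    df['count_'+category] = df['ACTORS'].map(lambda x: sum(codes[element] == category for element in x))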

FileNotFoundError: [Errno 2] No such file or directory: 'D:/thesis/nl/actors/coded.json'
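
Because the coded file is not available here, the counting step can be illustrated without pandas on a single made-up text. The `codes` mapping and its category names below are hypothetical; the real mapping comes from the coded JSON file.

```python
import re
from collections import Counter

# Hypothetical coding: actor name -> category.
codes = {'Council': 'institution',
         'Council Presidency': 'institution',
         'Theresa May': 'politician'}

# Same construction as above: longest names first, escaped, compiled once.
keys = sorted(codes, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(key) for key in keys))

text = 'Theresa May addressed the Council Presidency. Theresa May left early.'

# A set keeps each actor once per text, mirroring the ACTORS column.
actors = set(pattern.findall(text))
unique_counts = Counter(codes[actor] for actor in actors)

# Counting every match instead would give the number of mentions.
mention_counts = Counter(codes[match] for match in pattern.findall(text))

print(dict(unique_counts))
print(dict(mention_counts))
```

Here the unique count gives one politician and one institution, whereas the mention count gives two politicians. Which of the two is appropriate depends on whether the analysis is about the presence of actors or about their prominence in a text.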

In [72]:
#Output to csv
#index=False avoids writing the index as an extra 'Unnamed: 0' column on every save
df.to_csv(settings['data_csv'], index=False)

In [12]:
len(entity_map['ORG'])

1161

In [13]:
doc = nlp(df['TEXT'][2])

In [43]:
import numpy as np
#Rows where MEDIUM is missing (NaN)
df[df['MEDIUM'].isna()]

Unnamed: 0.1,Unnamed: 0,BYLINE,DATE,DATELINE,HEADLINE,HIGHLIGHT,LANGUAGE,LENGTH,LOAD-DATE,MEDIUM,SECTION,TEXT,MONTH,YEAR,DAY,DATE_dt
0,0,BERT LANTING,28 december 2017,,Europa is terug,,,560,27 December 2017,,Opinie en Debat; Blz. 21,Commentaar Komende thema's: De geschiedenis ne...,12,2017,28,2017-12-28
1,1,,28 december 2017,,rechtsstaat Europese Unie dient pal te staan ...,,,569,"December 27, 2017",,Opinie; Blz. 18,Commentaar Het oog op de toekomst van de Europ...,12,2017,28,2017-12-28
2,2,,28 december 2017,,rechtsstaat Europese Unie dient pal te staan ...,,,569,"December 28, 2017",,Opinie; Blz. 18,Commentaar Het oog op de toekomst van de Europ...,12,2017,28,2017-12-28
3,3,Ruurd Ubels nd.nl/economie beeld anp / Remko d...,14 november 2017,,'Ik ken geen Brits bedrijf dat uit Europese Un...,,,530,13 November 2017,,"Blz. 2,3",HIGHLIGHT: Cees Oudshoorn heeft Theresa May ve...,11,2017,14,2017-11-14
4,4,Van een onzer verslaggevers,17 oktober 2017,,Scheurende tuinmuur blijkt niet onderheid,,,968,14 November 2017,,--Maak een keuze--,HIGHLIGHT: Funderingspalen die wel op de teken...,10,2017,17,2017-10-17
5,5,Van een onzer verslaggevers,17 oktober 2017,,Scheurende tuinmuur blijkt niet onderheid,,,968,14 November 2017,,--Maak een keuze--,HIGHLIGHT: Funderingspalen die wel op de teken...,10,2017,17,2017-10-17
6,6,Gabor Landman · directeur van de Stichting Eur...,5 oktober 2017,,Oekraïne heeft niks met Europese waarden,,,-1,4 October 2017,,,"SECTION: Blz. 12,13 LENGTH: 399 woorden HIGHLI...",10,2017,5,2017-10-05
7,7,,15 september 2017,,Nuttige voorzet voor noodzakelijk debat,,,372,"September 14, 2017",,IN HET NIEUWS; Blz. 2,Het is tijd de Europese samenwerking te verste...,9,2017,15,2017-09-15
8,8,,16 augustus 2017,,Kort,,,297,15 August 2017,,Economie; Blz. 16,Kort luchtvaart brexit websupermarkt Air Berli...,8,2017,16,2017-08-16
9,9,,16 augustus 2017,,Kort,,,297,15 August 2017,,Economie; Blz. 16,Kort luchtvaart brexit websupermarkt Air Berli...,8,2017,16,2017-08-16


TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''