# Named Entity Recognition

### spaCy
spaCy was used for NER during the UROP. It is tested further here to check whether it is a robust solution.

In [1]:
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

A big part of the UROP involved analysing data/Steele_Dossier.txt. It contains 17 reports compiled by Christopher Steele, a former British intelligence officer, alleging secret cooperation between the Trump campaign and Russia in the 2016 US presidential election.

In [2]:
# load Steele Dossier text (the context manager closes the file automatically)
with open('/Users/anishkrishnavallapuram/Desktop/FYP-19-20/data/Steele_dossier.txt', 'r') as f:
    dossier = f.read()

In [3]:
# sample report
# all reports are separated in this manner in the .txt
texts = dossier.split('-----------------------------------------------------------------------------------')
print(texts[0])

COMPANY INTELLIGENCE REPORT 2016/080 
[ 6.20.2016 ]


US PRESIDENTIAL ELECTION: REPUBLICAN CANDIDATE DONALD TRUMP'S ACTIVITIES IN RUSSIA AND COMPROMISING RELATIONSHIP WITH THE KREMLIN 


Summary 

- Russian regime has been cultivating, supporting and assisting TRUMP for at least 5 years. Aim, endorsed by PUTIN, has been to encourage splits and divisions in western alliance 

- So far TRUMP has declined various sweetener real estate business deals offered him in Russia in order to further the Kremlin's cultivation of him. However he and his inner circle have accepted a regular flow of intelligence from the Kremlin, including on his Democratic and other political rivals Former top Russian intelligence officer claims FSB has compromised TRUMP through his activities in Moscow sufficiently to be able to blackmail him. According to several knowledgeable sources, his conduct in Moscow has included perverted sexual acts which have been arranged/monitored by the FSB 

- A dossier of compromisin

### Sample spaCy NER tagging

In [4]:
doc = nlp(texts[0]) # performs the NER tagging
displacy.render(doc, jupyter=True, style='ent') # displacy is a visualisation tool from spaCy; reuse doc instead of parsing again

### Shortfalls

There are still a few problems. `Aim` is tagged as a `PERSON` in the first point of the `Summary` section.
The tagger also fails to recognise the full noun phrase of an entity: `Russian intelligence officer` is tagged only as `Russian`.

## Robust Risk Minimisation

This is a simple linear classification model based on the paper at https://www.aclweb.org/anthology/W03-0434. It relies on simple yet extensive feature engineering and obtains surprisingly high precision and recall scores.
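
In CoNLL-2003, precision and recall are computed at the entity level: a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. A minimal sketch of that calculation is given here; the function name `precision_recall_f1` and the example spans are hypothetical and are not taken from the paper or from this notebook.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores; each entity is a (start, end, label) tuple."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # exact span and label match
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans, for illustration only
gold = [(0, 1, 'ORG'), (2, 3, 'MISC'), (6, 7, 'MISC')]
predicted = [(0, 1, 'ORG'), (2, 3, 'PER'), (6, 7, 'MISC')]
p, r, f = precision_recall_f1(gold, predicted)
print(p, r, f)
```

The second prediction has the right span but the wrong type, so it counts against both precision and recall.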

In [5]:
# conll-2003 training dataset
with open('/Users/anishkrishnavallapuram/Desktop/FYP-19-20/data/conll-2003/eng.train.txt') as f:
    train = f.read()

In [6]:
# sentences: blank lines separate sentences, and the first column of each line is the token
sents = []
for block in train.split('\n\n'):
    lines = block.split('\n')
    sents.append(' '.join([line.split(' ')[0] for line in lines]))
sents

['-DOCSTART-',
 'EU rejects German call to boycott British lamb .',
 'Peter Blackburn',
 'BRUSSELS 1996-08-22',
 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .',
 "Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .",
 '" We do n\'t support any such recommendation because we do n\'t see any grounds for it , " the Commission \'s chief spokesman Nikolaus van der Pas told a news briefing .',
 'He said further scientific study was required and if it was found that action was needed it should be taken by the European Union .',
 'He said a proposal last month by EU Farm Commissioner Franz Fischler to ban sheep brains , spleens and spinal cords from the human and animal food chains was a highly

In [7]:
# Developing 9 of the 10 features suggested in the paper
import pandas as pd

In [8]:
displacy.render(nlp(sents[4]), jupyter=True, style='ent')

In [9]:
# Split conll-2003 into documents on the -DOCSTART- marker
docs = '\n'.join(sents).split('-DOCSTART-')[1:]
displacy.render(nlp(docs[2]), jupyter=True, style='ent')

In [10]:
print(sents[1])

EU rejects German call to boycott British lamb .


In [11]:
len(docs)

946
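
The cells above keep only the first column of each line. Each token line in the shared-task file also carries a POS tag, a chunk tag and the NER tag, which the feature engineering needs. The following is a minimal sketch of a parser that keeps all four columns, assuming the usual space-separated `token POS chunk NER` layout. The function `parse_conll` and the inline sample are illustrative and are not part of the original notebook.

```python
def parse_conll(raw):
    """Return a list of sentences; each sentence is a list of
    (token, pos, chunk, ner) tuples."""
    sentences = []
    for block in raw.strip().split('\n\n'):
        rows = [tuple(line.split()) for line in block.split('\n') if line.strip()]
        # drop the document boundary marker lines
        rows = [row for row in rows if row[0] != '-DOCSTART-']
        if rows:
            sentences.append(rows)
    return sentences


# Small inline sample in the assumed layout (illustrative, not read from disk)
sample = (
    "-DOCSTART- -X- O O\n"
    "\n"
    "EU NNP I-NP I-ORG\n"
    "rejects VBZ I-VP O\n"
    "German JJ I-NP I-MISC\n"
    "call NN I-NP O\n"
    ". . O O\n"
    "\n"
    "Peter NNP I-NP I-PER\n"
    "Blackburn NNP I-NP I-PER\n"
)
parsed = parse_conll(sample)
print(parsed[0][0])
```

Keeping the columns together per token makes it straightforward to fill `posTag` and `chunkInfo` later without re-reading the file.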

#### Preprocessing conll-2003
1. Tokens themselves, in a window of ±2.
2. The previous two predicted tags.
3. The conjunction of the previous tag and the current token.
4. Initial capitalization of tokens in a window of ±2.
5. More elaborate word type information: initial capitalization, all capitalization, all digits, or digits containing punctuation.
6. Token prefix (length three and four), and token suffix (length from one to four).
7. POS tag information provided in the shared task.
8. Chunking information provided in the shared task: we use a bag-of-words representation of the chunk at the current token.
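
The purely token-based features in the list (items 1, 4, 5 and 6) can be sketched in plain Python as follows. This is one possible reading of the list, not the paper's implementation: the padding symbol `<PAD>`, the word-type labels, and the helper names `word_type` and `token_features` are all assumptions made for illustration. Tag, POS and chunk features are left out because they depend on the other columns and on the model's predictions.

```python
PAD = '<PAD>'  # assumed padding symbol for positions outside the sentence


def word_type(token):
    """Coarse word type (item 5); the label names are illustrative."""
    if token.isdigit():
        return 'allDigits'
    if any(c.isdigit() for c in token) and all(c.isdigit() or not c.isalnum() for c in token):
        return 'digitsWithPunct'
    if token.isalpha() and token.isupper():
        return 'allCaps'
    if token[:1].isupper():
        return 'initCap'
    return 'other'


def token_features(tokens, i, window=2):
    """Features for tokens[i] that need only the token sequence."""
    padded = [PAD] * window + list(tokens) + [PAD] * window
    context = padded[i:i + 2 * window + 1]  # tokens i-window .. i+window
    token = tokens[i]
    return {
        'token': token,
        'window': context,
        'initCap': [t[:1].isupper() for t in context],
        'wordType': word_type(token),
        'prefix3': token[:3],
        'prefix4': token[:4],
        'suffix1': token[-1:],
        'suffix2': token[-2:],
        'suffix3': token[-3:],
        'suffix4': token[-4:],
    }


tokens = 'EU rejects German call to boycott British lamb .'.split()
features = token_features(tokens, 2)
print(features)
```

Tokens shorter than a prefix or suffix length simply yield the whole token here; whether such short affixes should be dropped instead is a design choice the list does not settle.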

In [20]:
data = pd.DataFrame(columns=['token', 'window', 'prevTags', 'initCap', 'wordType', 'prefix3', 'prefix4', 'suffix1', 'suffix2', 'suffix3', 'suffix4', 'posTag', 'chunkInfo'])