# Named Entity Extraction

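`nltk` (the Natural Language Toolkit) is a natural language processing library for Python. Named entity extraction relies on pretrained models and word lists to tokenize, tag, and classify words. Here we download the resources NLTK needs for English text: the `punkt` sentence tokenizer, the `averaged_perceptron_tagger` part-of-speech tagger, the `maxent_ne_chunker` named entity chunker, and the `words` corpus.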

In [1]:
import nltk
import pandas as pd

# Download the tokenizer, tagger, chunker, and word list used below
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Error loading punkt: <urlopen error [Errno -2] Name or
[nltk_data]     service not known>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [Errno -2] Name or service not known>
[nltk_data] Error loading maxent_ne_chunker: <urlopen error [Errno -2]
[nltk_data]     Name or service not known>
[nltk_data] Error loading words: <urlopen error [Errno -2] Name or
[nltk_data]     service not known>


In [2]:
trolls = pd.read_csv("data/trolls.csv")
X, y = (trolls["Comment"], trolls["Insult"])

samples = X.values.tolist()

Below is the main code. The only new parts are the tagging and the chunking of words.

Tagging assigns each token a part-of-speech tag. Tags are written as short codes, such as `NNP` for a proper noun or `VB` for a verb.

Once the words are tagged, the chunker groups them into named entities. Note that adjacent tokens with the same tag are not necessarily merged. That depends on the chunker, whose behavior comes from the separately downloaded `maxent_ne_chunker` model rather than from the tagger.
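To make the grouping step concrete, here is a minimal sketch of a toy chunker that merges runs of adjacent proper nouns. It is only an illustration: `chunk_proper_nouns` is a hypothetical helper, the `(word, tag)` pairs are written by hand rather than produced by `nltk.pos_tag`, and NLTK's real chunker uses a trained model instead of this single rule.

```python
from itertools import groupby


def chunk_proper_nouns(tagged_tokens):
    """Merge each run of adjacent NNP tokens into one candidate entity string."""
    chunks = []
    # groupby splits the sequence into consecutive runs that share the same key
    for is_proper, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_proper:
            chunks.append(" ".join(word for word, _tag in group))
    return chunks


# Hand-written (word, tag) pairs; the tags are assumed, not real tagger output
tagged = [
    ("Teresa", "NNP"), ("May", "NNP"), ("wants", "VBZ"), ("to", "TO"),
    ("deport", "VB"), ("him", "PRP"), ("to", "TO"), ("Jordan", "NNP"),
]
print(chunk_proper_nouns(tagged))  # ['Teresa May', 'Jordan']
```

A rule this simple merges every adjacent pair of proper nouns, which is exactly the behavior a trained chunker can avoid when the context suggests two separate entities.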

In [3]:
def extract_NE(sample, debug=False):
    # Split sentences
    sentences = nltk.sent_tokenize(sample)
    # Split words
    tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
    # Tag words
    tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
    # Combine tags
    chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=False)
    
    if debug:
        print("input:\n", sample, "\n")
        print("sentences:\n", sentences, "\n")
        print("tokens:\n", tokenized_sentences, "\n")
        print("tagged:\n", tagged_sentences, "\n")
    
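    def extract_entity_names(t):
        """Recursively collect the text of every named-entity subtree of t."""
        entity_names = []
        # Leaves are (word, tag) tuples without label(); only Tree nodes have one
        if hasattr(t, 'label') and t.label():
            # "NE" is the label used when binary=True; the others appear when binary=False
            if t.label() in ["NE", "ORGANIZATION", "PERSON", "LOCATION"]:
                entity_names.append(' '.join([child[0] for child in t]))
            else:
                for child in t:
                    entity_names.extend(extract_entity_names(child))
        return entity_names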

    entity_names = []
    debug_chunks = []
    for tree in chunked_sentences:
        debug_chunks.append(tree)
        entity_names.extend(extract_entity_names(tree))

    if debug:
        print("chunked:\n", debug_chunks, "\n")
        print(set(entity_names))
    return set(entity_names)

In [4]:
extract_NE(samples[182], True)

input:
 "Quote from Teresa May "That is why we have been trying to deport him to Jordan, his home country." TRYING - Since when does a Sovereign have problems TRYING to deport diseased minded scum like him. Stuff the EU laws and if the UK is \\'fined\\' by the EUSSR for doing this then don\\'t pay it ... and then stop all financial payments to this corrupt experiment. It doesn\\'t work, its broke so listen to the PEOPLE Camoron and have an in or out referendum .. NOW!" 

sentences:
 ['"Quote from Teresa May "That is why we have been trying to deport him to Jordan, his home country."', 'TRYING - Since when does a Sovereign have problems TRYING to deport diseased minded scum like him.', "Stuff the EU laws and if the UK is \\\\'fined\\\\' by the EUSSR for doing this then don\\\\'t pay it ... and then stop all financial payments to this corrupt experiment.", "It doesn\\\\'t work, its broke so listen to the PEOPLE Camoron and have an in or out referendum ..", 'NOW!"'] 

tokens:
 [['``', 'Qu

{'EUSSR', 'PEOPLE Camoron', 'Teresa May', 'UK'}

## Tasks

- Think about how to use the extracted entities to classify the comments.
- Try to create a classifier. Does it work better? Worse? Why?
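As one possible starting point for the first task, the sketch below turns per-comment entity sets into binary feature rows that a classifier could consume. The helper `entity_features` and the example sets are hypothetical and written by hand; in practice the sets would come from calling `extract_NE` on each comment, and other encodings (counts, entity types) may work better.

```python
def entity_features(entity_sets):
    """Return (vocabulary, rows); each row marks which entities a comment contains."""
    # Sorted union of all entities gives a stable column order
    vocabulary = sorted(set().union(*entity_sets))
    rows = [
        [1 if name in entities else 0 for name in vocabulary]
        for entities in entity_sets
    ]
    return vocabulary, rows


# Made-up entity sets standing in for extract_NE output on three comments
entity_sets = [{"EUSSR", "Teresa May", "UK"}, set(), {"UK"}]
vocabulary, rows = entity_features(entity_sets)
print(vocabulary)  # ['EUSSR', 'Teresa May', 'UK']
print(rows)        # [[1, 1, 1], [0, 0, 0], [0, 0, 1]]
```

Each row lines up with one comment, so the rows can be paired with the `Insult` labels in `y` when training.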