### Using NLTK for Named Entity Recognition

In [1]:
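# Import NLTK and read the text of the novel
import nltk

# Use a context manager so the file is closed once it has been read
with open('Emma_Austen.txt', 'r') as f:
    text = f.read()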

#### Preprocessing
- First, we split the text into sentences using the sentence segmenter "nltk.sent_tokenize".
- Each sentence is then subdivided into words using the word tokenizer "nltk.word_tokenize".
- Finally, each tokenized sentence is tagged with part-of-speech tags using "nltk.pos_tag", which will prove very helpful in the next step, named entity detection.


In [2]:
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]

#### Chunking
- With the raw text preprocessed, we next search each sentence for mentions of potential entities. The basic technique we will use for entity detection is chunking. Chunking groups adjacent elements of the sequence, without differentiating further between the resulting groups. Examples are noun phrase chunking and verb group chunking.
- We will begin by considering the task of noun phrase chunking, or NP-chunking, where we search for chunks corresponding to individual noun phrases. We will use part-of-speech tags to help with NP-chunking.
    - In order to create an NP-chunker, we will first define a chunk grammar, consisting of rules that indicate how sentences should be chunked. 
    - In this case, we will define a simple grammar with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). 
    - Using this grammar, we create a chunk parser.
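
The grammar-based NP-chunker described above can be sketched as follows. This is a minimal illustration, not part of the pipeline in the cells below: the tagged sentence is written by hand (so no tagger model is needed), and the names `sentence`, `grammar`, `chunk_parser`, and `noun_phrases` are chosen here for the example.

```python
import nltk

# A hand-tagged example sentence: a list of (word, POS tag) pairs.
sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# Single-rule chunk grammar: optional determiner, any number of
# adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Build a chunk parser from the grammar and apply it to the tagged sentence.
chunk_parser = nltk.RegexpParser(grammar)
chunked = chunk_parser.parse(sentence)

# Collect the words of every NP subtree found by the parser.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in chunked.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

The parser returns an `nltk.Tree` whose `NP` subtrees are the chunks; tokens that match no rule stay as plain `(word, tag)` leaves under the root.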

In [3]:
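# Use NLTK's currently recommended named entity chunker to chunk the given list of tagged sentences,
# each consisting of a list of tagged tokens.
# With binary=True, every named entity is labeled simply 'NE' instead of a type such as PERSON or ORGANIZATION.
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)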

In [4]:
# define a function to extract entity names in a chunked sentence
def extract_entity_names(t):
    entity_names = []

    if hasattr(t, 'label'):  # only Tree nodes have a label() method; leaf (word, tag) tuples do not
        if t.label() == 'NE': # if the label is "NE", add this node to entity_names
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names
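
As an alternative to the recursive function above, the same extraction can be sketched with `Tree.subtrees` and a filter. This is a sketch under assumptions: the tree is built by hand to mimic the output of the chunker with `binary=True`, the sentence is made up for illustration, and the names `tree` and `names` are introduced here. It relies on `NE` chunks not being nested inside one another.

```python
from nltk.tree import Tree

# Hand-built chunk tree for a hypothetical sentence; NE subtrees mark entities,
# all other tokens are plain (word, tag) leaves under the root S node.
tree = Tree("S", [
    Tree("NE", [("Emma", "NNP"), ("Woodhouse", "NNP")]),
    ("visited", "VBD"),
    Tree("NE", [("Miss", "NNP"), ("Taylor", "NNP")]),
    (".", "."),
])

# Visit only the subtrees labeled NE and join the words of their leaves.
names = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda s: s.label() == "NE")
]
print(names)
```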

In [5]:
# create a list to store the entity names
entity_names = []

# iterate the chunked sentences to extract entity names
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))

# Print all entity names
print(entity_names)

# Print unique entity names
print(set(entity_names))

['Emma Woodhouse', 'Miss Taylor', 'Mr. Woodhouse', 'Emma', 'Miss Taylor', 'Miss Taylor']
{'Miss Taylor', 'Emma Woodhouse', 'Mr. Woodhouse', 'Emma'}
