### Codio Activity 18.2: Named Entities

**Expected Time = 45 minutes**

**Total Points = 30**

This activity focuses on extracting named entities from text using the `nltk` library. You will read in the full text of Newton's *Principia* and identify the entities labeled as places (`GPE`).

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [20]:
import nltk
from nltk import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/odeanmaye/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/odeanmaye/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/odeanmaye/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/odeanmaye/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /Users/odeanmaye/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.
[nltk_data] Downloading package words to /Users/odeanmaye/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading p

True

[Back to top](#-Index)

### Problem 1

#### Opening a `.txt` file.

**5 Points**

Use the `open` function to open the text file containing Isaac Newton's *Principia* at the filepath given below. Read the file with the `readlines()` method and assign the resulting list of lines to the variable `principia` below.

In [4]:
filepath = '../data/Philosophiae_Naturalis_Principia_Mathematica.txt'

In [8]:
### GRADED
with open(filepath) as f:
    principia = f.readlines()

### ANSWER CHECK
print(type(principia))

<class 'list'>
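
One detail of `readlines()` that is easy to miss is that every element keeps its trailing newline character. The sketch below shows this on a small temporary file rather than the *Principia* text; the file contents and the names `sample_path`, `sample_lines`, and `stripped` are made up for illustration.

```python
import os
import tempfile

# Write a tiny two-line file so the example does not depend on the Principia text.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, encoding='utf-8') as tmp:
    tmp.write('Philosophiae Naturalis\nPrincipia Mathematica\n')
    sample_path = tmp.name  # hypothetical path, used only for this sketch

# readlines() returns one string per line of the file.
with open(sample_path, encoding='utf-8') as f:
    sample_lines = f.readlines()

os.remove(sample_path)  # clean up the temporary file

print(sample_lines)   # ['Philosophiae Naturalis\n', 'Principia Mathematica\n']

# Each element keeps its trailing newline; strip it if it is not wanted.
stripped = [line.rstrip('\n') for line in sample_lines]
print(stripped)       # ['Philosophiae Naturalis', 'Principia Mathematica']
```

The trailing newlines are harmless for the later problems, since they are treated as ordinary whitespace once the lines are joined and tokenized.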


[Back to top](#-Index)

### Problem 2

#### Tokenizing the text. 

**5 Points**

Using the `principia` variable from the previous problem, combine `' '.join()` and the `word_tokenize` function to create a list of tokens named `tokens` below.

In [10]:
### GRADED

# YOUR CODE HERE
text = ' '.join(principia)
tokens = word_tokenize(text)

### ANSWER CHECK
print(type(tokens))
print(tokens[:5])

<class 'list'>
['Philosophiae', 'Naturalis', 'Principia', 'Mathematica', 'Isaacus']


[Back to top](#-Index)

### Problem 3

#### Part of Speech Tags 

**5 Points**

Use the `nltk.pos_tag` function with `tokens` as its argument to create the part-of-speech-tagged version of the *Principia* text. Assign the tagged text to the variable `words_pos` below.

In [16]:
### GRADED
words_pos = nltk.pos_tag(tokens)

### ANSWER CHECK
print(type(words_pos))
print(words_pos[:5])

<class 'list'>
[('Philosophiae', 'NNP'), ('Naturalis', 'NNP'), ('Principia', 'NNP'), ('Mathematica', 'NNP'), ('Isaacus', 'NNP')]


[Back to top](#-Index)

### Problem 4

#### Named Entities

**5 Points**

Use the `nltk.ne_chunk` function with `words_pos` as its argument to build a tree of chunks, where each chunk is either a single tagged word or a named entity. Then use a `for` loop to iterate over the chunks.

Inside the `for` loop, use the `hasattr` function to check whether the current chunk has a `label` attribute. If the condition is satisfied, build a tuple of the form (entity text, entity type).

Append these tuples to the list `named_entities` below.

In [22]:
### GRADED
named_entities = []

chunks = nltk.ne_chunk(words_pos)

for chunk in chunks:
    if hasattr(chunk, 'label'):
        entity = ' '.join(c[0] for c in chunk)
        entity_type = chunk.label()
        named_entities.append((entity, entity_type))

### ANSWER CHECK
print(type(named_entities))
print(named_entities[:5])

<class 'list'>
[('Philosophiae', 'GSP'), ('Naturalis Principia Mathematica Isaacus Newtonus', 'PERSON'), ('Wikisource', 'GPE'), ('INDEX Tituli', 'ORGANIZATION'), ('Auctoris', 'GPE')]
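
To see why the `hasattr(chunk, 'label')` check works, it can help to walk a small tree by hand. The sketch below builds a hypothetical stand-in for the output of `nltk.ne_chunk` with `nltk.tree.Tree`; the sentence, tags, and entity labels in `sample_chunks` are invented for illustration and are not taken from the *Principia* text.

```python
from nltk.tree import Tree

# Hand-built stand-in for the result of nltk.ne_chunk (contents are made up).
sample_chunks = Tree('S', [
    ('by', 'IN'),                                           # plain (word, tag) tuple
    Tree('PERSON', [('Isaac', 'NNP'), ('Newton', 'NNP')]),  # named entity subtree
    ('in', 'IN'),
    Tree('GPE', [('London', 'NNP')]),
])

sample_entities = []
for chunk in sample_chunks:
    # Plain tuples have no `label` attribute; Tree subtrees do.
    if hasattr(chunk, 'label'):
        words = ' '.join(word for word, tag in chunk)
        sample_entities.append((words, chunk.label()))

print(sample_entities)  # [('Isaac Newton', 'PERSON'), ('London', 'GPE')]
```

Words outside any entity arrive as ordinary `(word, tag)` tuples and are skipped, while multi-word entities are joined into a single string.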


[Back to top](#-Index)

### Problem 5

#### Keeping Only Places

**5 Points**

Filter the `named_entities` list to keep only the entities labeled `GPE`, and assign the lowercased text of these entities to a list named `places` below.

In [24]:
### GRADED

# YOUR CODE HERE
places = [entity[0].lower() for entity in named_entities if entity[1] == 'GPE']

### ANSWER CHECK
print(type(places))
print(places[:5])

<class 'list'>
['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus']
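
The same filter can be written with tuple unpacking, which some readers find easier to follow than indexing with `[0]` and `[1]`. This is a minimal sketch on the five entities printed in the Problem 4 answer check; `sample_entities` and `sample_places` are names introduced here for illustration.

```python
# First five entries of `named_entities`, copied from the Problem 4 answer check.
sample_entities = [
    ('Philosophiae', 'GSP'),
    ('Naturalis Principia Mathematica Isaacus Newtonus', 'PERSON'),
    ('Wikisource', 'GPE'),
    ('INDEX Tituli', 'ORGANIZATION'),
    ('Auctoris', 'GPE'),
]

# Unpack each tuple into (name, label) and keep only the GPE entries, lowercased.
sample_places = [name.lower() for name, label in sample_entities if label == 'GPE']
print(sample_places)  # ['wikisource', 'auctoris']
```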


[Back to top](#-Index)

### Problem 6

#### Removing stopwords

**5 Points**

Remove all stopwords from the list `places`. Assign the remaining words as a list to `no_stops` below.

In [26]:
from nltk.corpus import stopwords

In [28]:
### GRADED

# YOUR CODE HERE
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
no_stops = [word for word in places if word not in stop_words]

### ANSWER CHECK
print(type(no_stops))
print(no_stops)

<class 'list'>
['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus', 'superficiebus', 'mediis', 'fluida']
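
The filtering step can be tried without downloading the stopword corpus. In the sketch below, `sample_stop_words` is a small hypothetical stand-in for `stopwords.words('english')`, and `sample_places` reuses a few words from the answer check above plus one made-up entry (`'the'`) so that something is actually removed.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
sample_stop_words = {'the', 'of', 'and', 'in'}

# A few words from the `places` answer check, plus a made-up stopword ('the').
sample_places = ['wikisource', 'the', 'auctoris', 'umbilico', 'orbibus', 'orbibus']

# Keep every word that is not in the stopword set; order and duplicates are preserved.
sample_no_stops = [word for word in sample_places if word not in sample_stop_words]
print(sample_no_stops)  # ['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus']
```

Storing the stopwords in a `set` keeps each membership test fast, which matters more as the list of candidate words grows.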
