### Codio Activity 18.2: Named Entities

**Expected Time = 45 minutes**

**Total Points = 30**

This activity focuses on extracting named entities from text using the `nltk` library. You will read in the full text of Newton's *Principia* and identify the entities labeled as places.

#### Index

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)
- [Problem 6](#-Problem-6)

In [1]:
import nltk
from nltk import word_tokenize

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("maxent_ne_chunker")
nltk.download("words")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/codespace/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

[Back to top](#-Index)

### Problem 1

#### Opening a `.txt` file.

**5 Points**

Use the `open` function to open the text file containing Isaac Newton's *Principia* at the filepath given below. Use the `readlines()` method to read the text as a list of lines, and assign that list to the variable `principia` below.

In [2]:
filepath = "data/Philosophiae_Naturalis_Principia_Mathematica.txt"

In [3]:
### GRADED
with open(filepath) as f:
    principia = f.readlines()


### ANSWER CHECK
print(type(principia))
print(principia)

<class 'list'>
['Philosophiae Naturalis Principia Mathematica\n', '\n', '\n', '\t\t\t\t\tIsaacus Newtonus\n', '\n', '\n', '\n', '\n', '\n', '1687\n', '\n', '\n', '\n', '\n', '\n', 'Exported from Wikisource on April 3, 2022\n', '\n', '\n', '\n', '\n', '\n', 'INDEX\n', '\n', '\n', 'Tituli pagina\n', '\n', 'Auctoris præfatio ad lectorem\n', '\n', 'Viri Præstantissimi\n', '\n', 'Definitiones\n', '\n', 'Axiomata, sive Leges Motus\n', '\n', '\n', '\n', '\n', '\n', 'DE MOTU CORPORUM LIBER PRIMUS\n', '\n', '\n', 'SECT. I. DE Methodo rationum primarum & ultima rum.\n', '\n', 'SECT. II. De inventione Virium centripetarum.\n', '\n', 'SECT. III. De motu corporum in Conicis sectionibus eccentri cis.\n', '\n', 'SECT. IV. De inventione Orbium Elliptieorum, Parabolieorum & Hyperbolieorum ex Umbilico dato.\n', '\n', 'SECT. V. De inventione Orbium ubi Umbilicus neuter datur.\n', '\n', 'SECT. VI. De inventione Motuum in Orbibus datis.\n', '\n', 'SECT. VII. De corporum Ascensu & Descensu rectilineo.\n', '
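
As the output above shows, `readlines()` keeps the trailing `\n` on every line and returns blank lines as `'\n'`. The sketch below illustrates this on a small temporary file; the file name `sample.txt`, its contents, and the `encoding="utf-8"` argument are assumptions for illustration, not part of the graded solution.

```python
import tempfile
from pathlib import Path

# Made-up sample text standing in for the first lines of the real file.
sample_text = "Philosophiae Naturalis Principia Mathematica\n\n1687\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"  # hypothetical file name
    sample_path.write_text(sample_text, encoding="utf-8")

    # readlines() returns one string per line, newline characters included.
    with open(sample_path, encoding="utf-8") as f:
        lines = f.readlines()

# Optional cleanup: strip whitespace and drop lines that are empty afterwards.
stripped = [line.strip() for line in lines if line.strip()]

print(lines)
print(stripped)
```

Passing an explicit encoding can help when a text contains characters such as `æ`, provided the file really is stored as UTF-8.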

[Back to top](#-Index)

### Problem 2

#### Tokenizing the text. 

**5 Points**

Using the `principia` variable from Problem 1, join the lines into a single string with `' '.join()` and pass the result to `word_tokenize` to create a list of tokens named `tokens` below.

In [4]:
### GRADED
tokens = word_tokenize(" ".join(principia))

### ANSWER CHECK
print(type(tokens))
print(tokens[:5])

<class 'list'>
['Philosophiae', 'Naturalis', 'Principia', 'Mathematica', 'Isaacus']
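
To see what the join step produces, here is a minimal sketch on the first few lines from the Problem 1 output. It uses `str.split()` as a rough stand-in for `word_tokenize`, which is an assumption made so the example runs without any downloaded data; `word_tokenize` additionally separates punctuation from words.

```python
# First lines of `principia`, copied from the Problem 1 output.
sample_lines = [
    "Philosophiae Naturalis Principia Mathematica\n",
    "\n",
    "\n",
    "\t\t\t\t\tIsaacus Newtonus\n",
]

# Joining with a space yields one long string; the newlines and tabs remain in it.
joined = " ".join(sample_lines)

# Whitespace splitting (stand-in for word_tokenize) discards the
# newlines, tabs, and blank lines, leaving only the words.
rough_tokens = joined.split()

print(repr(joined))
print(rough_tokens)
```

The blank lines and tab characters in the source disappear once the text is split on whitespace, which is consistent with the first five tokens shown above.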


[Back to top](#-Index)

### Problem 3

#### Part of Speech Tags 

**5 Points**

Use the `pos_tag` function to tag each token in `tokens` with its part of speech. Assign the tagged text to the variable `words_pos` below.

In [5]:
### GRADED
words_pos = nltk.pos_tag(tokens)

### ANSWER CHECK
print(type(words_pos))
print(words_pos[:5])

<class 'list'>
[('Philosophiae', 'NNP'), ('Naturalis', 'NNP'), ('Principia', 'NNP'), ('Mathematica', 'NNP'), ('Isaacus', 'NNP')]
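
Each element of `words_pos` is a `(word, tag)` tuple, so the tags can be summarized with ordinary Python tools. The sketch below counts tags for the five tuples shown in the output above; the variable names are illustrative only. The default `nltk` tagger is an English model, which is probably why these capitalized Latin words all come back as `NNP` (proper noun).

```python
from collections import Counter

# First five tagged tokens, copied from the output above.
sample_tagged = [
    ("Philosophiae", "NNP"),
    ("Naturalis", "NNP"),
    ("Principia", "NNP"),
    ("Mathematica", "NNP"),
    ("Isaacus", "NNP"),
]

# Unpack each (word, tag) pair and count how often each tag occurs.
tag_counts = Counter(tag for _, tag in sample_tagged)

print(tag_counts.most_common())
```

Running the same count over the full `words_pos` list would give a quick picture of how the tagger treats the whole text.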


[Back to top](#-Index)

### Problem 4

#### Named Entities

**5 Points**

Use `nltk.ne_chunk` on the tagged words in `words_pos` to create a list of tuples of the form `(word, entity type)` for every chunk that has a named entity label. Assign these tuples to the list `named_entities` below.

In [6]:
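named_entities = []
# ne_chunk returns a tree: named entities are subtrees with a label,
# while all other tokens remain plain (word, tag) tuples.
for chunk in nltk.ne_chunk(words_pos):
    if hasattr(chunk, "label"):
        # Join the words of a multi-token entity into one string.
        named_entities.append((" ".join(c[0] for c in chunk.leaves()), chunk.label()))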

In [7]:
named_entities

[('Philosophiae', 'GSP'),
 ('Naturalis Principia Mathematica Isaacus Newtonus', 'PERSON'),
 ('Wikisource', 'GPE'),
 ('INDEX Tituli', 'ORGANIZATION'),
 ('Auctoris', 'GPE'),
 ('Viri Præstantissimi Definitiones Axiomata', 'PERSON'),
 ('Leges Motus DE', 'PERSON'),
 ('MOTU', 'ORGANIZATION'),
 ('CORPORUM', 'ORGANIZATION'),
 ('DE Methodo', 'ORGANIZATION'),
 ('De', 'PERSON'),
 ('Virium', 'PERSON'),
 ('De', 'PERSON'),
 ('Conicis', 'GSP'),
 ('De', 'PERSON'),
 ('Orbium Elliptieorum', 'PERSON'),
 ('Parabolieorum', 'ORGANIZATION'),
 ('Umbilico', 'GPE'),
 ('Orbium', 'PERSON'),
 ('De', 'PERSON'),
 ('Orbibus', 'GPE'),
 ('De', 'PERSON'),
 ('Ascensu', 'PERSON'),
 ('De', 'PERSON'),
 ('Viribus', 'PERSON'),
 ('De Motu', 'PERSON'),
 ('Orbibus', 'GPE'),
 ('Motu Apsidum', 'PERSON'),
 ('De Motu', 'PERSON'),
 ('Superficiebus', 'GPE'),
 ('Funependulorum Motu', 'PERSON'),
 ('De Motu', 'PERSON'),
 ('Viribus', 'PERSON'),
 ('De', 'PERSON'),
 ('Sphærieorum Viribus', 'PERSON'),
 ('De', 'PERSON'),
 ('Sphærieorum Viribu
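
The loop in Problem 4 relies on the shape of the tree that `nltk.ne_chunk` returns. The sketch below builds a small tree by hand with `nltk.Tree` to show that shape without running the chunker; the sentence and its tags are made up for illustration and do not come from the *Principia*.

```python
from nltk import Tree

# Hand-built stand-in for the output of nltk.ne_chunk (made-up sentence).
chunked = Tree(
    "S",
    [
        Tree("PERSON", [("Isaac", "NNP"), ("Newton", "NNP")]),
        ("lived", "VBD"),
        ("in", "IN"),
        Tree("GPE", [("London", "NNP")]),
        (".", "."),
    ],
)

entities = []
for node in chunked:
    # Named entities are Tree objects; ordinary tokens are (word, tag) tuples.
    if isinstance(node, Tree):
        text = " ".join(word for word, _ in node.leaves())
        entities.append((text, node.label()))

print(entities)
```

Here `isinstance(node, Tree)` plays the same role as the `hasattr(..., "label")` check in the solution. Results such as `('De', 'PERSON')` in the output above suggest that a chunker trained on English text is unreliable on Latin, so the labels should be read with caution.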

[Back to top](#-Index)

### Problem 5

#### Keeping Only Places

**5 Points**

Filter the `named_entities` list to keep only the entities labeled `GPE` (geopolitical entity). Lowercase the text of each of these entities and assign the resulting list to `places` below.

In [8]:
### GRADED
places = [entity[0].lower() for entity in named_entities if entity[1] == "GPE"]

### ANSWER CHECK
print(type(places))
print(places[:5])

<class 'list'>
['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus']
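
The same filter can be wrapped in a small helper so that other labels are easy to inspect. This is only a sketch: the function name `entities_with_label` is hypothetical, and the sample data is the first five entries of the Problem 4 output.

```python
# First five (text, label) pairs, copied from the Problem 4 output.
sample_entities = [
    ("Philosophiae", "GSP"),
    ("Naturalis Principia Mathematica Isaacus Newtonus", "PERSON"),
    ("Wikisource", "GPE"),
    ("INDEX Tituli", "ORGANIZATION"),
    ("Auctoris", "GPE"),
]


def entities_with_label(entities, label):
    """Return the lowercased text of every entity carrying the given label."""
    return [text.lower() for text, entity_label in entities if entity_label == label]


sample_places = entities_with_label(sample_entities, "GPE")
sample_people = entities_with_label(sample_entities, "PERSON")

print(sample_places)
print(sample_people)
```

Unpacking each tuple as `text, entity_label` does the same job as indexing with `entity[0]` and `entity[1]`, and may be easier to read.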


[Back to top](#-Index)

### Problem 6

#### Removing stopwords

**5 Points**

Remove all English stopwords from the list `places`. Assign the remaining words as a list to `no_stops` below.

In [9]:
from nltk.corpus import stopwords

In [10]:
### GRADED
english_stops = set(stopwords.words("english"))  # build the set once for fast lookups
no_stops = [w for w in places if w not in english_stops]


### ANSWER CHECK
print(type(no_stops))
print(no_stops)

<class 'list'>
['wikisource', 'auctoris', 'umbilico', 'orbibus', 'orbibus', 'superficiebus', 'mediis', 'fluida']
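
The output above still contains `'orbibus'` twice, because filtering stopwords does not remove duplicates. The sketch below shows the filtering pattern together with an optional, order-preserving deduplication step. The stopword set and the input list are small made-up stand-ins, used so that the example runs without the downloaded `stopwords` corpus.

```python
# Hypothetical stand-in for stopwords.words("english"); the real list is much longer.
stand_in_stops = {"the", "of", "in", "and"}

# Made-up input for illustration only.
sample_places = ["wikisource", "the", "orbibus", "of", "orbibus"]

# Membership tests against a set are fast, so the set is built once up front.
filtered = [w for w in sample_places if w not in stand_in_stops]

# dict.fromkeys keeps the first occurrence of each word and preserves order.
unique_places = list(dict.fromkeys(filtered))

print(filtered)
print(unique_places)
```

Deduplication is not required by the graded answer; it is shown only as a possible follow-up when a list of distinct places is wanted.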
