# Text Processing in Python: First Look

With Python, you can **extract information** from text without having to read it. In this demo, you'll see some of the text processing tasks you can do with Python's **Natural Language Toolkit** ([NLTK](https://www.nltk.org/)). 

You'll see examples of:
- Preparing text for analysis
- Extracting information from text

## Tokenizing

Text is unstructured data. By tokenizing, you can split it into predictable units.

You can tokenize by word:

In [8]:
from nltk.tokenize import word_tokenize

example = "The tall tree fell over during the storm."
word_tokenize(example)

['The', 'tall', 'tree', 'fell', 'over', 'during', 'the', 'storm', '.']

You can tokenize by sentence:

In [3]:
from nltk.tokenize import sent_tokenize

example = "The tall tree fell over during the storm. I'm going to climb on it."
sent_tokenize(example)

['The tall tree fell over during the storm.', "I'm going to climb on it."]
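
Once text is split into tokens, ordinary Python tools can work with it. As a minimal sketch, the snippet below counts words using the token list copied from the `word_tokenize` output above. The variable names and the choice to lowercase and drop punctuation are assumptions made for illustration, not part of NLTK:

```python
from collections import Counter

# Tokens copied from the word_tokenize output shown above.
tokens = ['The', 'tall', 'tree', 'fell', 'over', 'during', 'the', 'storm', '.']

# Lowercase so "The" and "the" count as the same word,
# and keep only alphabetic tokens so the final "." is skipped.
word_counts = Counter(token.lower() for token in tokens if token.isalpha())

print(word_counts.most_common(1))  # [('the', 2)]
```

Counting works here only because the tokenizer already separated `storm` from the period that follows it.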

## Tagging Parts of Speech

The **parts of speech** are the roles that words play when you use them together in sentences:
- Noun
- Verb
- Adjective
- Adverb
- Pronoun
- Preposition
- Conjunction
- Interjection

When you label each token with its part of speech, it's easier to see how the tokens relate to each other and to spot patterns:

In [4]:
from nltk.tag import pos_tag

example = "The tall tree fell over during the storm."
words_in_example = word_tokenize(example)
pos_tag(words_in_example)

[('The', 'DT'),
 ('tall', 'JJ'),
 ('tree', 'NN'),
 ('fell', 'VBD'),
 ('over', 'IN'),
 ('during', 'IN'),
 ('the', 'DT'),
 ('storm', 'NN'),
 ('.', '.')]

The tag codes come from the Penn Treebank tag set. You can check the docs, or call `nltk.help.upenn_tagset()`, to see what they mean:
- **DT:** determiner
- **JJ:** adjective or numeral, ordinal
- **NN:** noun
- **VBD:** verb, past tense
- **IN:** preposition or conjunction, subordinating 
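
To make "seeing patterns" concrete, here is a small sketch that works on the tagged tokens copied from the `pos_tag` output above. The helper `adjective_noun_pairs` is a hypothetical name invented for this example, and it matches only the exact tags `JJ` and `NN`, so other noun or adjective tags would be missed:

```python
# Tagged tokens copied from the pos_tag output shown above.
tagged = [
    ('The', 'DT'), ('tall', 'JJ'), ('tree', 'NN'), ('fell', 'VBD'),
    ('over', 'IN'), ('during', 'IN'), ('the', 'DT'), ('storm', 'NN'),
    ('.', '.'),
]


def adjective_noun_pairs(tagged_tokens):
    """Return (adjective, noun) pairs where a JJ token is directly followed by an NN token."""
    pairs = []
    # Pair each token with the token that comes right after it.
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "JJ" and next_tag == "NN":
            pairs.append((word, next_word))
    return pairs


# Filtering by tag pulls out every noun in the sentence.
nouns = [word for word, tag in tagged if tag == "NN"]

print(adjective_noun_pairs(tagged))  # [('tall', 'tree')]
print(nouns)                         # ['tree', 'storm']
```

Without the tags, `tall` and `tree` are just two strings. With them, you can ask which words describe which things.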

## Named Entity Recognition

**Named entities** are noun phrases that refer to specific places, people, or organizations. You can use named entity recognition (NER) to see what your text is about.

Create a string to find named entities in:

In [5]:
text = """
Bloedel Conservatory is a domed lush paradise located in Queen Elizabeth Park 
atop the City of Vancouver’s highest point. More than 100 exotic birds, and 
500 exotic plants and flowers thrive within its temperature-controlled environment."""

Create a function to extract named entities:

In [6]:
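from nltk.chunk import ne_chunk

def extract_ne(text):
    words = word_tokenize(text)         # split the text into word tokens
    tags = pos_tag(words)               # label each token with a part of speech
    tree = ne_chunk(tags, binary=True)  # group entity tokens into "NE" subtrees
    return set(
        " ".join(i[0] for i in t)       # rejoin the words of each entity
        for t in tree
        if hasattr(t, "label") and t.label() == "NE"
    )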

This function gathers all the named entities, with no repeats. It tokenizes by word, applies part-of-speech tags to those words, and then extracts named entities based on those tags. Because it passes `binary=True`, every entity is labeled simply as `NE`, so you won't see what type of named entity each one is (such as location, person, or organization).

In [7]:
from nltk.chunk import ne_chunk

extract_ne(text)

{'Bloedel Conservatory', 'Queen Elizabeth Park', 'Vancouver'}
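
If you do want the entity types, one option is to call `ne_chunk` without `binary=True`, so each entity subtree carries a label such as `PERSON`, `ORGANIZATION`, or `GPE` instead of `NE`. The helper below is a hypothetical sketch, not part of NLTK. It assumes you pass it the tree returned by `ne_chunk(tags)`:

```python
def typed_entities(tree):
    """Collect (entity text, entity type) pairs from a chunk tree.

    Assumes `tree` came from ne_chunk(tags) called without binary=True,
    so each entity subtree is labeled with its type rather than "NE".
    """
    return {
        (" ".join(word for word, tag in subtree), subtree.label())
        for subtree in tree
        if hasattr(subtree, "label")  # plain (word, tag) tuples have no label
    }
```

With the names from the cells above, a call could look like `typed_entities(ne_chunk(pos_tag(word_tokenize(text))))`. The labels you get depend on the chunker model, so no output is shown here.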

## Next Steps

If you want to learn more, you can check out:
- [nltk.org](https://www.nltk.org/)
- [Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit](https://www.nltk.org/book/)
- [Natural Language Processing With Python's NLTK Package](https://realpython.com/nltk-nlp-python/)
- [spaCy](https://spacy.io/)