# Basics of Analyzing Text with NLTK

# 1. Tokenization

We start with a piece of text that we want to analyze. This could be a single tweet, a pandas Series where each entry is a tweet, and so on.

In [3]:
text = "Cyprus, officially the Republic of Cyprus, is an island country in the Eastern Mediterranean and the third largest and third most populous island in the Mediterranean. Cyprus is located south of Turkey, west of Syria and Lebanon, northwest of Israel, north of Egypt, and southeast of Greece. Cyprus is a major tourist destination in the Mediterranean. With an advanced, high-income economy and a very high Human Development Index, the Republic of Cyprus has been a member of the Commonwealth since 1961 and was a founding member of the Non-Aligned Movement until it joined the European Union on 1 May 2004. On 1 January 2008, the Republic of Cyprus joined the eurozone."
text

'Cyprus, officially the Republic of Cyprus, is an island country in the Eastern Mediterranean and the third largest and third most populous island in the Mediterranean. Cyprus is located south of Turkey, west of Syria and Lebanon, northwest of Israel, north of Egypt, and southeast of Greece. Cyprus is a major tourist destination in the Mediterranean. With an advanced, high-income economy and a very high Human Development Index, the Republic of Cyprus has been a member of the Commonwealth since 1961 and was a founding member of the Non-Aligned Movement until it joined the European Union on 1 May 2004. On 1 January 2008, the Republic of Cyprus joined the eurozone.'

## Sentence Tokenization

In [2]:
from nltk import sent_tokenize, word_tokenize
sentences = sent_tokenize(text.lower())
sentences

['cyprus, officially the republic of cyprus, is an island country in the eastern mediterranean and the third largest and third most populous island in the mediterranean.',
 'cyprus is located south of turkey, west of syria and lebanon, northwest of israel, north of egypt, and southeast of greece.',
 'cyprus is a major tourist destination in the mediterranean.',
 'with an advanced, high-income economy and a very high human development index, the republic of cyprus has been a member of the commonwealth since 1961 and was a founding member of the non-aligned movement until it joined the european union on 1 may 2004. on 1 january 2008, the republic of cyprus joined the eurozone.']
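The last entry in the output above still holds two sentences (`... 2004. on 1 january 2008 ...`). One plausible cause is that lowercasing the text first removes the capital letter that sentence splitters tend to rely on as a boundary cue. The sketch below illustrates the idea with a deliberately naive regex splitter. `naive_sent_split` and `sample` are made up for this illustration, and the regex is not how NLTK's tokenizer works internally.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? when the next word starts with a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

# Shortened, hand-written sample based on the end of the Cyprus text.
sample = "It joined the European Union on 1 May 2004. On 1 January 2008, it joined the eurozone."

original_case = naive_sent_split(sample)
lower_case = naive_sent_split(sample.lower())

print(len(original_case))  # 2
print(len(lower_case))     # 1
```

If that is the cause here, calling `sent_tokenize(text)` on the original text and lowercasing each sentence afterwards would be one way around it.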

<br>

## Exercise 1: Write a Loop that prints out each `sentence` in the `sent_tokenization` of `text`

#### Example
-----
**Sentence:**

"Cyprus, officially the Republic of Cyprus, is an island country in the Eastern Mediterranean and the third largest and third most populous island in the Mediterranean. Cyprus is located south of Turkey, west of Syria and Lebanon, northwest of Israel, north of Egypt, and southeast of Greece."

**Sent_tokens:**

`cyprus, officially the republic of cyprus, is an island country in the eastern mediterranean and the third largest and third most populous island in the mediterranean.`

`-------`

`cyprus is located south of turkey, west of syria and lebanon, northwest of israel, north of egypt, and southeast of greece.`

<br>

In [4]:
# Solution
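One possible solution is sketched below. It assumes `sentences` is the list produced by `sent_tokenize` earlier. Two entries are copied in by hand so the sketch runs on its own, and the `separator` name is only illustrative.

```python
# Stand-in for the `sentences` list built with sent_tokenize above.
sentences = [
    "cyprus is located south of turkey, west of syria and lebanon, "
    "northwest of israel, north of egypt, and southeast of greece.",
    "cyprus is a major tourist destination in the mediterranean.",
]

separator = "-------"

# Print each sentence followed by a separator line.
for sentence in sentences:
    print(sentence)
    print(separator)
```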

<br>

## Word Tokenization

In [None]:
tokens = word_tokenize(sentences[2])
tokens

<br>
<br>
<hr>
<br>
<br>

# 2. POS Tagging

`pos_tag` expects a list of `tokens` for a single sentence. If the text has been split into several `sentences`, tag each tokenized sentence separately.

In [None]:
from nltk import pos_tag

In [None]:
tags = pos_tag(tokens)
tags

To access documentation for tags, for example for `NN`:

In [None]:
import nltk.help
nltk.help.upenn_tagset('NN')

<br>

## Exercise 2: Write a Loop that prints out each `token` in `tags` AND the `documentation` for its matching tag.

#### Example
-----

`cyprus:
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...`


`is:
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...`

In [None]:
for token, tag in tags:
    print("-----------------------------")
    print(token)
    print("-----------------------------")
    nltk.help.upenn_tagset(tag)
    print()
    print()
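The loop above prints the tag documentation once per token, so a tag that occurs several times is documented several times. A possible variation, sketched below, groups the tokens by tag first so that each distinct tag only needs one lookup. The `tags` list here is a hand-written stand-in (only the `NN` and `VBZ` entries for `cyprus` and `is` come from the example above), and the real `pos_tag` output may differ.

```python
from collections import defaultdict

# Hand-written stand-in for `tags`; actual tagger output may differ.
tags = [("cyprus", "NN"), ("is", "VBZ"), ("a", "DT"), ("major", "JJ"),
        ("tourist", "NN"), ("destination", "NN"), (".", ".")]

# Map each tag to the tokens that received it, in order of appearance.
tokens_by_tag = defaultdict(list)
for token, tag in tags:
    tokens_by_tag[tag].append(token)

for tag, tagged_tokens in tokens_by_tag.items():
    print(f"{tag}: {', '.join(tagged_tokens)}")
    # nltk.help.upenn_tagset(tag) could be called here, once per distinct tag.
```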

<br>
<br>
<hr>
<br>
<br>

# 3. Chunking and Entity Recognition

Chunking groups POS-tagged tokens into labeled spans. With named-entity chunking, these are the entity types we want to mine:

---

* GPE: Geo-Political Entity
* GSP: Geo-Socio-Political group
* FACILITY
* LOCATION
* ORGANIZATION
* PERSON

## Entity Recognition for a single Sentence

In [None]:
from nltk import ne_chunk

### We have a single Sentence

In [None]:
sentence = "Mark and John are working at Google."

### Tokenize every word in the sentence

In [None]:
words = word_tokenize(sentence)
print(words)

### Attach a part-of-speech tag to each word in the sentence

In [None]:
pos_words = pos_tag(words)
pos_words

### Find the entities in the entire sentence

In [None]:
entities_in_sent = ne_chunk(pos_words)
print(entities_in_sent)
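The printed tree mixes plain `(word, tag)` pairs with labeled subtrees. To pull out just the entities, one approach is to keep only the children that have a `label` method. The sketch below uses a small hypothetical `FakeChunk` class in place of `nltk.Tree` so it runs without the chunker models. The hand-built `chunked` structure only shows the expected shape, and the labels the real chunker assigns may differ.

```python
class FakeChunk(list):
    """Hypothetical stand-in for nltk.Tree: (word, tag) leaves plus a label."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


# Hand-built example of the shape ne_chunk is expected to return.
chunked = [
    FakeChunk("PERSON", [("Mark", "NNP")]),
    ("and", "CC"),
    FakeChunk("PERSON", [("John", "NNP")]),
    ("are", "VBP"), ("working", "VBG"), ("at", "IN"),
    FakeChunk("ORGANIZATION", [("Google", "NNP")]),
    (".", "."),
]


def extract_entities(tree):
    """Return (label, text) pairs for every labeled chunk in the tree."""
    entities = []
    for node in tree:
        if hasattr(node, "label"):  # plain (word, tag) tuples have no label
            text = " ".join(word for word, _tag in node)
            entities.append((node.label(), text))
    return entities


entities = extract_entities(chunked)
print(entities)
```

The same `hasattr(node, "label")` check should work on the tree returned by `ne_chunk`, since its entity subtrees also expose `label()`.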

<br>

## Entity Recognition for a Text with `n` sentences.

### We have a piece of text with many sentences

In [None]:
text

## Exercise 3

### 1) Sentence Tokenization
a. `sent_tokenize` `text` into a `list` of several `sentences`

b. save to variable named `sentences`

In [None]:
sentences = sent_tokenize(text)
# sentences[:1]

### 2) Word Tokenization

a. `word_tokenize` each `sentence` separately. This should yield a `list` of `lists`.

b. Save to variable `tokenized_words_in_sentences`

In [None]:
tokenized_words_in_sentences = [word_tokenize(sent) for sent in sentences]
# tokenized_words_in_sentences[:1]

### 3) POS_Tags

a. Attach a `pos_tag` to every `word` in each `tokenized_sentence` in `tokenized_words_in_sentences`

b. `pos_tag()` expects a tokenized sentence and tags each word in it.

c. Save to variable named `pos_sentences`

In [None]:
pos_sentences = [pos_tag(tokenized_sent) for tokenized_sent in tokenized_words_in_sentences]
# pos_sentences[:1]

### 4) Entity Extraction

a. Find the entities for each sentence in `pos_sentences`.

b. Use `ne_chunk_sents` instead of `ne_chunk`, since we are now passing a `list` of POS-tagged sentences and not a single POS-tagged sentence.

c. `ne_chunk_sents` returns a `generator`, so wrap the output in `list()` when saving it to a variable.

d. Save to variable `chunked_sentences`

In [None]:
from nltk import ne_chunk_sents

In [None]:
chunked_sentences = list(ne_chunk_sents(pos_sentences))
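The reason for wrapping the result in `list()` is that a generator can only be consumed once. The sketch below shows this with a hypothetical stand-in generator, `chunk_sents_stub`, rather than the real chunker. It only mimics the "one result per sentence, produced lazily" behavior.

```python
def chunk_sents_stub(sentences):
    """Hypothetical stand-in: lazily yields one result per sentence."""
    for sent in sentences:
        yield [word.upper() for word in sent]


lazy = chunk_sents_stub([["mark", "works"], ["john", "rests"]])

first_pass = list(lazy)   # consumes the generator
second_pass = list(lazy)  # nothing left to consume

print(len(first_pass), len(second_pass))  # 2 0
```

Storing the list once means later cells can loop over the chunked sentences as often as needed.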

In [None]:
for chunk in chunked_sentences:
    print(chunk)
    print()

<br>

## Visualizing the entities present in a piece of text 

In [None]:
from collections import defaultdict

In [None]:
ner_categories = defaultdict(int)

# populate ner_categories
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            

ner_categories

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt

# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories[label] for label in labels]

# Create the pie chart
fig = plt.figure()
ax = fig.gca()
ax.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()
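The percentage that `autopct` prints on each slice is that label's share of the total count. A minimal sketch of the arithmetic follows. The `ner_counts` values are hypothetical and only stand in for whatever `ner_categories` contains after running the cells above.

```python
# Hypothetical counts standing in for ner_categories.
ner_counts = {"GPE": 6, "ORGANIZATION": 3, "PERSON": 1}

total = sum(ner_counts.values())

# Share of each label as a percentage of all recognized entities.
shares = {label: 100 * count / total for label, count in ner_counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```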

<br>