### Natural language processing (NLP) is a field of computer science focused on building applications and services that can understand human language. Practical examples of NLP include speech recognition, speech translation, understanding complete sentences, recognizing synonyms of matching words, and writing grammatically correct sentences and paragraphs.

#### In this tutorial we will first explore NLTK (the Natural Language Toolkit), one of several NLP libraries. NLTK is a leading platform for building Python programs that work with human language data. It can be used for classification, tokenization, stemming, tagging, parsing, and much more; we will cover only a few of these features. Before doing anything else, we must import the necessary libraries.

In [1]:
from elasticsearch import Elasticsearch
import pandas as pd 
import numpy as np
import nltk
from datetime import *
from dateutil.relativedelta import *
import tokenize
from nltk import word_tokenize
from nltk.corpus import stopwords
from nameparser import HumanName
from bs4 import BeautifulSoup
from nltk.corpus import names
import sklearn
import string
from nltk.tokenize import SpaceTokenizer
from nltk.text import Text
import spacy
import boto3

#### We can now access our data by placing the desired CSV file in an S3 bucket on AWS and downloading it from that bucket. We then read the file into a pandas DataFrame and select its 'transcript' column.

In [2]:
bucket = 'data-science-tutorials'
key = 'nltk_practice.csv'

s3 = boto3.resource('s3')

s3.Bucket(bucket).download_file(key,key)

df = pd.read_csv('./nltk_practice.csv')
transcript = df['transcript']

#### To extract anything from a corpus, we must first process the transcripts. The function ie_preprocess loops through each transcript, splits it into sentences, and tokenizes each sentence into a list of words. It then applies the NLTK function 'pos_tag' to each tokenized sentence, which assigns a part of speech to each word. Stop words (such as I, am, we, the) are kept at this stage.

In [3]:
# Assign a part-of-speech tag to each word
def ie_preprocess(document):
    pos_words = []

    for transcript in document:
        text = ' '.join(transcript.split())  # collapse runs of whitespace
        sentences = nltk.sent_tokenize(text)  # split the text into sentences
        sentences = [nltk.word_tokenize(sent) for sent in sentences]  # split each sentence into words
        for sent in sentences:
            pos_words.append(nltk.pos_tag(sent))  # tag each word with its part of speech
    return pos_words
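
#### ie_preprocess keeps stop words in its output. If a later step needs them removed, the sketch below shows one way to filter an already tokenized sentence. The helper name remove_stop_words and the small stop-word set are illustrative assumptions, not part of the original notebook; in practice the set would more likely come from nltk.corpus.stopwords.words('english').

```python
# A small hand-picked stop-word set, for illustration only (an assumption).
STOP_WORDS = {"i", "am", "we", "the", "a", "to", "for", "you", "if"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["Thank", "you", "for", "calling", "REI", "headquarters", "."]
print(remove_stop_words(tokens))
```

#### Filtering after tokenization, rather than before, leaves the sentence boundaries found by sent_tokenize untouched.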

#### We will now call ie_preprocess() on the transcripts, which outputs each word paired with its part of speech. 'NNP' denotes a proper noun. [Here](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b) is a reference for each part-of-speech tag.

In [4]:
processed_transcript = ie_preprocess(transcript)
print('\n\nWORD,Parts of Speech:')
print(processed_transcript, "\n")




WORD,Parts of Speech:
[[('Thank', 'NNP'), ('you', 'PRP'), ('for', 'IN'), ('calling', 'VBG'), ('REI', 'NNP'), ('headquarters', 'NN'), ('.', '.')], [('If', 'IN'), ('you', 'PRP'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'), ('speak', 'VB'), ('with', 'IN'), ('a', 'DT'), ('customer', 'NN'), ('service', 'NN'), ('representative', 'NN'), ('.', '.')], [('Please', 'NNP'), ('press', 'VB'), ('one', 'CD'), ('if', 'IN'), ('you', 'PRP'), ('would', 'MD'), ('like', 'VB'), ('to', 'TO'), ('say', 'VB'), ('hello', 'NN'), ('.', '.')], [('Sorry', 'NNP'), ('I', 'PRP'), ('called', 'VBD'), ('up', 'RP'), ('.', '.')], [('I', 'PRP'), ('may', 'MD'), ('direct', 'VB'), ('your', 'PRP$'), ('call', 'NN'), ('.', '.')], [('Okay', 'NNP'), ('hold', 'NN'), ('on', 'IN'), ('please', 'NN'), ('.', '.')], [('You', 'PRP'), ('.', '.')], [('Hello', 'NNP'), ('.', '.')], [('Please', 'NNP'), ('leave', 'VB'), ('a', 'DT'), ('message', 'NN'), ('for', 'IN'), ('Nolan', 'NNP'), ('Cross', 'NNP'), ('.', '.')], [('Hey', 'NNP'), ('Mack', 'N

#### The function below is similar to ie_preprocess, except that instead of assigning a part of speech to each word, it assigns a category such as 'PERSON', 'GPE' (geo-political entity), or 'ORGANIZATION' to each named entity it finds.

#### The basic technique used for entity detection is chunking, which segments and labels multi-token sequences. In the figure, the smaller boxes show the word-level tokenization and part-of-speech tagging, while the larger boxes show higher-level chunking. Each of these larger boxes is called a chunk. Like tokenization, which omits whitespace, chunking usually selects a subset of the tokens. Also like tokenization, the pieces produced by a chunker do not overlap in the source text.

![alt text](data/chunking.png "Chunking")
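
#### To make the idea of chunking concrete, here is a toy sketch that groups consecutive 'NNP' tokens of a tagged sentence into chunks and skips everything else. It is only an illustration: the function name chunk_proper_nouns is made up for this example, and nltk.ne_chunk relies on a trained model rather than a simple rule like this one. The tagged sentence is copied from the output above.

```python
from itertools import groupby

def chunk_proper_nouns(tagged_sentence):
    """Group runs of consecutive 'NNP' tokens into chunks; other tokens are skipped."""
    chunks = []
    for is_nnp, group in groupby(tagged_sentence, key=lambda pair: pair[1] == "NNP"):
        if is_nnp:
            chunks.append(" ".join(word for word, _ in group))
    return chunks

tagged = [("Please", "NNP"), ("leave", "VB"), ("a", "DT"), ("message", "NN"),
          ("for", "IN"), ("Nolan", "NNP"), ("Cross", "NNP"), (".", ".")]
print(chunk_proper_nouns(tagged))
```

#### The chunks select a subset of the tokens and do not overlap. The sentence-initial 'Please' is grouped as well, because the tagger labeled it 'NNP'.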


In [5]:
# Assign a category (e.g. GPE, PERSON, or ORGANIZATION) to each named entity
def extract_named_entries(document):
    named_entities = []
    sentences = ie_preprocess(document)
    for tagged_sentence in sentences:
        ne_chunked_sent = nltk.ne_chunk(tagged_sentence)
        for subtree in ne_chunked_sent:
            if hasattr(subtree, 'label'):  # only chunk subtrees carry a label; plain tokens are tuples
                entity_name = ' '.join(c[0] for c in subtree.leaves())  # join the words of the entity
                entity_type = subtree.label()  # get NE category
                named_entities.append((entity_name, entity_type))
    return named_entities

#### Here we call extract_named_entries(), which outputs each named entity together with its category.

In [6]:
better_names = extract_named_entries(transcript)
print('\n\nWORD, Category:')
print(better_names, "\n")



WORD, Category:
[('REI', 'ORGANIZATION'), ('Please', 'GPE'), ('Sorry', 'PERSON'), ('Okay', 'GPE'), ('Hello', 'GPE'), ('Please', 'GPE'), ('Nolan Cross', 'PERSON'), ('Hey', 'PERSON'), ('Cross', 'GPE'), ('George', 'PERSON'), ('Mike Lang', 'PERSON'), ('Ray Slater', 'PERSON'), ('Kinda', 'GSP'), ('Weiss', 'GPE'), ('Bye', 'GPE'), ('Okay', 'GPE'), ('Brooke Weiss', 'PERSON'), ('Fine', 'GPE'), ('Okay', 'GPE'), ('Okay', 'GPE'), ('ASAP', 'ORGANIZATION'), ('Mel', 'PERSON'), ('Mel', 'ORGANIZATION'), ('Fargo', 'PERSON'), ('Bye', 'GPE'), ('Hello', 'GPE'), ('Tech', 'PERSON'), ('Tech', 'PERSON'), ('Okay', 'GPE'), ('Okay', 'GPE'), ('Okay', 'GPE'), ('Yeah', 'GPE'), ('Alright', 'GPE'), ('Bye', 'GPE'), ('Karen Lancaster', 'PERSON'), ('Security Care Network', 'ORGANIZATION'), ('Please', 'GPE'), ('Devon', 'PERSON'), ('Tech', 'PERSON'), ('Emergency', 'GPE'), ('Again', 'GPE'), ('Devon', 'ORGANIZATION'), ('Yancy', 'PERSON'), ('Kelly', 'PERSON'), ('Hello', 'GPE'), ('Please', 'GPE'), ('Taylor Neely', 'PERSON'), 

#### The function extract_names creates a list of names by looking for each chunk that has the label 'PERSON' and appending its words to a list.

In [7]:
# Extract the entities labeled 'PERSON':
# loop through the already tagged sentences,
# loop through the chunks that ne_chunk finds in each sentence,
# and whenever a 'PERSON' label is found, append the joined words to a list
def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            if isinstance(chunk, nltk.tree.Tree) and chunk.label() == 'PERSON':
                names.append(' '.join(c[0] for c in chunk))
    return names

#### We can call extract_names to receive a list of all the names found in the transcripts. As you can see, not all of these outputs are human names (for example 'Hey', 'Google Edward', and 'Sorry'), so there is still much room for improvement.

In [8]:
names = extract_names(transcript)
print('\n\nNAMES:')
print(names, "\n")



NAMES:
['Sorry', 'Nolan Cross', 'Hey', 'George', 'Mike Lang', 'Ray Slater', 'Brooke Weiss', 'Mel', 'Fargo', 'Tech', 'Tech', 'Karen Lancaster', 'Devon', 'Tech', 'Yancy', 'Kelly', 'Taylor Neely', 'Hey', 'Taylor Neely', 'Barry University', 'Perry University', 'Kim', 'Heather Corel', 'Heather', 'Daniel Gory', 'Tech', 'Paul', 'Hey', 'Catherine', 'Tech', 'Google Edward', 'John', 'Gabriel', 'Jay Walsh', 'Brett', 'Rob', 'John', 'John', 'Paul', 'Paul Mable', 'John', 'Rob', 'John', 'Alright', 'John', 'David', 'Hey', 'Okay', 'John Mall', 'Edward', 'Okay', 'Edwards', 'Cook', 'Bob', 'Abel', 'David', 'Kelly', 'Tech', 'John', 'Alison', 'Hey', 'Penn State University', 'Connie', 'Sorry', 'Heather', 'Hey', 'Daniel Gory', 'Tech', 'Hey', 'Ethan Allen', 'Center', 'Ethan Alan', 'Avery Stacy Patino', 'Happy', 'Matt', 'Stacy Dave Christina', 'Hello', 'Gary Thompson', 'Laura', 'Katie', 'Katie Grace', 'Catherine', 'Katie', 'Kevin', 'Edward Facebook', 'Sorry', 'Marcella', 'Happy', 'Matt', 'Nadia Austin', 'Aust
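
#### One simple way to reduce this noise is to post-filter the list against a blocklist of greetings and filler words. The sketch below is a minimal illustration under that assumption; the blocklist contents and the helper name drop_non_names are hypothetical, and a blocklist will never catch every false positive.

```python
# Hypothetical blocklist of greetings and filler words (an assumption for this sketch).
NON_NAMES = {"sorry", "hey", "okay", "alright", "hello", "happy", "tech"}

def drop_non_names(names, non_names=NON_NAMES):
    """Keep only the entries in which no word appears in the blocklist."""
    return [name for name in names
            if not any(word.lower() in non_names for word in name.split())]

sample = ['Sorry', 'Nolan Cross', 'Hey', 'George', 'Mike Lang', 'Tech']
print(drop_non_names(sample))
```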

#### Another thing we can do with the NLTK library is search for a particular word. The concordance method of nltk.Text shows every occurrence of a given word, together with some surrounding context.

In [9]:
# Show every occurrence of 'you' in context, one transcript at a time
for indiv_transcript in transcript:
    tokens = nltk.word_tokenize(indiv_transcript)
    text = nltk.Text(tokens)
    text.concordance('you') # default text.concordance output

Displaying 4 of 4 matches:
 you for calling REI headquarters . If you
 you would like to speak with a customer s
 representative . Please press one if you would like to say hello . Sorry I cal
ect your call . Okay hold on please . You . Hello . Please leave a message for 
Displaying 18 of 18 matches:
kay hold on one second . Okay . Thank you . Hi I 'm Brooke Weiss later . Fine f
gon na Carmen I 'm sorry to hear have you received any payment since May since 
 not in my bank statement I will tell you ASAP . I 'm sorry man . That would be
rry man . That would be great . Thank you very much . Mel Mel now . If you did 
hank you very much . Mel Mel now . If you did n't get it maybe I did n't hey I 
go 's bill pay . So sometimes when it you know it does n't necessarily look lik
 look like a check . I do n't know if you 've ever seen these things that come 
 ever seen these things that come but you know it 's printed out by the bank an
now it 's printed out by the bank and you got somehow tear
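
#### To show roughly what a concordance does, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function name simple_concordance is made up, and the context is measured in tokens here, whereas the output above is cut to a fixed character width. Like the output above, the match ignores case.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of word with up to `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = "Thank you for calling REI headquarters . If you would like to speak".split()
for line in simple_concordance(tokens, "you"):
    print(line)
```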

#### NLTK is not the only NLP library. Another very useful one is spaCy, which features models for tagging, parsing, and entity recognition. The two libraries are similar, but there are different reasons to use each one; this [link](https://medium.com/@akankshamalhotra24/introduction-to-libraries-of-nlp-in-python-nltk-vs-spacy-42d7b2f128f2) gives an informative overview of the differences. For the rest of the tutorial, we will explore the spaCy library by extracting parts of speech, entity categories, and other grammatical structures.

#### spaCy: part-of-speech tagging
#### Prints each word, its base form (lemma), its coarse part of speech, its fine-grained part-of-speech tag, its syntactic dependency label, and its shape.

In [10]:
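# Introduction to spaCy: part-of-speech tagging
# 'en' is a spaCy 2.x shortcut link; spaCy 3+ requires the model's package name, e.g. spacy.load('en_core_web_sm')
nlp = spacy.load('en')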

doc = nlp(transcript[0])
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
        token.shape_)

Thank thank VERB VBP ROOT Xxxxx
you -PRON- PRON PRP dobj xxx
for for ADP IN prep xxx
calling call VERB VBG pcomp xxxx
REI rei PROPN NNP compound XXX
headquarters headquarters NOUN NN dobj xxxx
. . PUNCT . punct .
If if ADP IN mark Xx
you -PRON- PRON PRP nsubj xxx
would would VERB MD aux xxxx
like like VERB VB ROOT xxxx
to to PART TO aux xx
speak speak VERB VB xcomp xxxx
with with ADP IN prep xxxx
a a DET DT det x
customer customer NOUN NN compound xxxx
service service NOUN NN compound xxxx
representative representative NOUN NN pobj xxxx
. . PUNCT . punct .
Please please INTJ UH intj Xxxxx
press press VERB VB ROOT xxxx
one one NUM CD dobj xxx
if if ADP IN mark xx
you -PRON- PRON PRP nsubj xxx
would would VERB MD aux xxxx
like like VERB VB advcl xxxx
to to PART TO aux xx
say say VERB VB xcomp xxx
hello hello INTJ UH intj xxxx
. . PUNCT . punct .
Sorry sorry INTJ UH intj Xxxxx
I -PRON- PRON PRP nsubj X
called call VERB VBD ROOT xxxx
up up PART RP prt xx
. . PUNCT . punct .
I -PRON- PRON P
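
#### The last column, the shape, can look cryptic. The sketch below approximates the idea in plain Python: uppercase letters become 'X', lowercase letters 'x', digits 'd', other characters are kept, and runs of the same shape character are capped at four. The function name word_shape is an assumption for this example, and it is meant as an illustration rather than spaCy's exact code.

```python
def word_shape(text, max_run=4):
    """Map a word to its shape, capping runs of one shape character at max_run."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and other symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)

for word in ["Thank", "REI", "headquarters", "If", "."]:
    print(word, word_shape(word))
```

#### Capping the runs explains why both 'headquarters' and 'representative' print as 'xxxx' in the output above.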

#### spaCy: noun chunks

In [11]:
#noun chunks
#prints "base noun phrases" - flat phrases that have a noun as their head. 

doc = nlp(transcript[0])
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
        chunk.root.head.text)

you you dobj Thank
REI headquarters headquarters dobj calling
you you nsubj like
a customer service representative representative pobj with
you you nsubj like
I I nsubj called
I I nsubj direct
your call call dobj direct
You You ROOT You
a message message dobj leave
Nolan Cross Cross pobj for
Mack Mack nsubj calling
Cross Cross pobj from


#### spaCy: named entity recognition, which assigns labels such as PERSON, CARDINAL (numbers), DATE, TIME, and ORG (organization)
#### Prints the entity text, its starting and ending character offsets, and its label.

In [12]:
doc = nlp(transcript[0])
for ent in doc.ents: 
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

REI 22 25 ORG
one 120 123 CARDINAL
Nolan Cross 259 270 ORG
Mack 276 280 PERSON
Cross 294 299 ORG
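
#### The start and end values are character offsets into the original text, with the end offset exclusive, so an ordinary slice recovers the entity. The sketch below uses that to mark entities inline. The text is assumed to be the opening sentence of transcript[0], reconstructed from the tokens shown earlier, and the helper name annotate is hypothetical.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) span of text in [entity|LABEL] markers."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])              # text before the entity
        out.append(f"[{text[start:end]}|{label}]")  # the entity itself
        cursor = end
    out.append(text[cursor:])                       # text after the last entity
    return "".join(out)

text = "Thank you for calling REI headquarters."
spans = [(22, 25, "ORG")]  # offsets copied from the first row of the output above

print(text[22:25])
print(annotate(text, spans))
```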


In [13]:
for indiv_transcript in transcript:
    doc = nlp(indiv_transcript)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            print(ent.text)

Mack
George Mike Lang
Kelly
Sally Ray Slater
Kinda
Bye
Brooke Weiss
Mel Mel
Bye
Bye
Karen Lancaster
Karen
Devon
Kelly
Taylor
Shelly
Taylor
Perry University
Kim
Heather Corel
Daniel Gory
Paul
Reception
Jack
Catherine
Bye-bye
Google Edward
John Walsh
Jay Walsh
John Jeff
Gimme
John
Paul
Paul Mable
John
John
John
Bye
Bye
David
Kelly
My Edwards
John Mall
Edward
Bye
Cook
Bob
Edward
Bye
Abel
Bye
David Gonzales
Kelly
Bye
Bye
Walkers
John
Prima
Alison
Connie
Heather Torres
Daniel Gory
Bye
Ethan Allen
Ethan Allen
Ethan Alan Headquarters
Ethan
morning Stone
Avery Stacy Patino
Ethan
Matt
Stacy Dave Christina
Gary Thompson
Laura
Katie
Katie Grace
Sherry
Catherine
Bye
Katie
Kevin
Edward Facebook
Irvin
Marcella
Matt
Nadia Austin
Nadia Hoss
Mike
Bye-bye
Mike
Daniel Gory
Rick
Mike
Bye
Chris
Chris
David
Ims
Bye
Violet Bailey
Jeremy Thomas


#### While this list of names is not perfect, comparing it with the list produced by NLTK suggests that this one is more accurate.
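
#### The comparison here is made by eye. One way to make it more concrete is to compare the two lists as sets. The sketch below does this for the first seven entries of each output; the function name compare_name_lists is an assumption, and such a small excerpt says nothing about overall accuracy.

```python
# First seven entries of each output shown earlier in this tutorial.
nltk_names = ['Sorry', 'Nolan Cross', 'Hey', 'George', 'Mike Lang',
              'Ray Slater', 'Brooke Weiss']
spacy_names = ['Mack', 'George Mike Lang', 'Kelly', 'Sally Ray Slater',
               'Kinda', 'Bye', 'Brooke Weiss']

def compare_name_lists(a, b):
    """Return (names in both, names only in a, names only in b) as sorted lists."""
    set_a, set_b = set(a), set(b)
    return sorted(set_a & set_b), sorted(set_a - set_b), sorted(set_b - set_a)

both, only_nltk, only_spacy = compare_name_lists(nltk_names, spacy_names)
print("Both:", both)
print("Only NLTK:", only_nltk)
print("Only spaCy:", only_spacy)
```

#### Exact string matching treats 'Mike Lang' and 'George Mike Lang' as different names, so the two libraries also disagree on where a name begins and ends.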

#### Overall, I prefer using spaCy over NLTK. It is faster and has a more accurate part-of-speech tagger. spaCy also comes with integrated word vectors and a fast dependency parser. Everything is predefined, so you don't need to spend time writing new POS taggers, as you might have to with NLTK.