<h2> ======================================================</h2>
 <h1>MA477 - Theory and Applications of Data Science</h1> 
  <h1>Lesson 15: Tokenization, Speech Tagging, Chunking</h1> 
 
 <h4>Dr. Valmir Bucaj</h4>
 United States Military Academy, West Point 
AY20-2
<h2>======================================================</h2>

<h2>Lecture Outline</h2>

<ul>
    <li>Tokenization</li>
    <li> Normalizing</li>
    <li>Tagging Part of Speech</li>
    <li>Chunking</li>
    <li>Stop Words</li>
</ul>

In [1]:
import nltk

<h2>Tokenization</h2>

Tokenization is the process of breaking a text down into smaller units such as words or sentences. Raw text typically does not come already split into words or sentences, so it is up to us to do so. NLTK has built-in functions that make this easy.

Below we'll use some text describing the coronavirus as our example:

In [14]:
text="""What are Coronaviruses? Coronaviruses (CoV) are a large family of viruses that cause illness 
ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS-CoV) and 
Severe Acute Respiratory Syndrome (SARS-CoV). A novel coronavirus (nCoV) is a new strain that has not 
been previously identified in humans! 
Coronaviruses are zoonotic; meaning they are transmitted between animals and people.  

Detailed investigations found that SARS-CoV was transmitted from civet cats to humans and MERS-CoV from 
dromedary camels to humans. Several known coronaviruses are circulating in animals that have not yet infected 
humans. 

Common signs of infection include respiratory symptoms, fever, cough, shortness of breath and breathing
difficulties. In more severe cases, infection can cause pneumonia, severe acute respiratory syndrome, kidney 
failure and even death. 
"""

First we'll tokenize the text into words. Roughly speaking, the text is split at whitespace, and punctuation marks become separate tokens.

In [15]:
word_tokens=nltk.word_tokenize(text)

In [16]:
word_tokens[:15]

['What',
 'are',
 'Coronaviruses',
 '?',
 'Coronaviruses',
 '(',
 'CoV',
 ')',
 'are',
 'a',
 'large',
 'family',
 'of',
 'viruses',
 'that']
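
As a rough illustration of the splitting rule described above, the sketch below uses a single regular expression to separate runs of word characters from individual punctuation marks. It is only an approximation written for this note: `simple_word_tokenize` is a hypothetical helper, and `nltk.word_tokenize` applies more elaborate rules that this pattern does not try to reproduce.

```python
import re

# Hypothetical helper, not part of NLTK: a token is either a run of word
# characters (optionally joined by hyphens) or one non-space punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens using one regex."""
    return TOKEN_PATTERN.findall(text)

sample = "What are Coronaviruses? Coronaviruses (CoV) are a large family"
print(simple_word_tokenize(sample))
print(simple_word_tokenize("(MERS-CoV)"))
```

On this sample the sketch separates `?`, `(` and `)` from the neighbouring words and keeps `MERS-CoV` as one token, which is the same behaviour visible in the NLTK output above.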

We can also tokenize by sentences. That is, we can break the text at punctuation marks that indicate the end of a sentence (e.g., a period, question mark, or exclamation mark).

In [17]:
sents_tokens=nltk.sent_tokenize(text)

In [18]:
sents_tokens

['What are Coronaviruses?',
 'Coronaviruses (CoV) are a large family of viruses that cause illness \nranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS-CoV) and \nSevere Acute Respiratory Syndrome (SARS-CoV).',
 'A novel coronavirus (nCoV) is a new strain that has not \nbeen previously identified in humans!',
 'Coronaviruses are zoonotic; meaning they are transmitted between animals and people.',
 'Detailed investigations found that SARS-CoV was transmitted from civet cats to humans and MERS-CoV from \ndromedary camels to humans.',
 'Several known coronaviruses are circulating in animals that have not yet infected \nhumans.',
 'Common signs of infection include respiratory symptoms, fever, cough, shortness of breath and breathing\ndifficulties.',
 'In more severe cases, infection can cause pneumonia, severe acute respiratory syndrome, kidney \nfailure and even death.']
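
To see why sentence splitting is harder than it looks, here is a deliberately naive splitter that breaks after every `.`, `!` or `?` followed by whitespace. The function name `naive_sent_split` is made up for this sketch. `nltk.sent_tokenize` relies on the pretrained Punkt model rather than on a single rule like this one.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

ok = naive_sent_split("What are Coronaviruses? They are zoonotic. Stay safe!")
bad = naive_sent_split("Dr. Smith studies SARS-CoV. He works in a lab.")

print(ok)   # three sentences, as intended
print(bad)  # 'Dr.' is wrongly split off as its own "sentence"
```

The second call shows the typical failure: an abbreviation ending in a period is mistaken for the end of a sentence.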

If we want to tokenize each of the sentences into words, one way to do so is with a list comprehension:

In [20]:
tokens=[nltk.word_tokenize(sent) for sent in sents_tokens]

In [22]:
tokens[:3]

[['What', 'are', 'Coronaviruses', '?'],
 ['Coronaviruses',
  '(',
  'CoV',
  ')',
  'are',
  'a',
  'large',
  'family',
  'of',
  'viruses',
  'that',
  'cause',
  'illness',
  'ranging',
  'from',
  'the',
  'common',
  'cold',
  'to',
  'more',
  'severe',
  'diseases',
  'such',
  'as',
  'Middle',
  'East',
  'Respiratory',
  'Syndrome',
  '(',
  'MERS-CoV',
  ')',
  'and',
  'Severe',
  'Acute',
  'Respiratory',
  'Syndrome',
  '(',
  'SARS-CoV',
  ')',
  '.'],
 ['A',
  'novel',
  'coronavirus',
  '(',
  'nCoV',
  ')',
  'is',
  'a',
  'new',
  'strain',
  'that',
  'has',
  'not',
  'been',
  'previously',
  'identified',
  'in',
  'humans',
  '!']]

<h2>Text Normalization</h2>

When performing text analysis, we often want to look only at the words and get rid of punctuation and other non-word tokens. Extracting only the words from a text is one step of what is typically known as text normalization.

In [28]:
covid_1=tokens[1]

In [29]:
covid_1

['Coronaviruses',
 '(',
 'CoV',
 ')',
 'are',
 'a',
 'large',
 'family',
 'of',
 'viruses',
 'that',
 'cause',
 'illness',
 'ranging',
 'from',
 'the',
 'common',
 'cold',
 'to',
 'more',
 'severe',
 'diseases',
 'such',
 'as',
 'Middle',
 'East',
 'Respiratory',
 'Syndrome',
 '(',
 'MERS-CoV',
 ')',
 'and',
 'Severe',
 'Acute',
 'Respiratory',
 'Syndrome',
 '(',
 'SARS-CoV',
 ')',
 '.']

In [30]:
for word in covid_1:
    if word.isalpha():
        print(word)

Coronaviruses
CoV
are
a
large
family
of
viruses
that
cause
illness
ranging
from
the
common
cold
to
more
severe
diseases
such
as
Middle
East
Respiratory
Syndrome
and
Severe
Acute
Respiratory
Syndrome
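
Note that `MERS-CoV` and `SARS-CoV` are missing from the output above. `str.isalpha()` returns `True` only when every character is a letter, so the hyphen causes these tokens to be dropped along with the punctuation. The sketch below contrasts that strict test with a looser one; `has_letter` is a hypothetical helper introduced here, not something used elsewhere in the lesson.

```python
tokens_sample = ['Syndrome', '(', 'MERS-CoV', ')', 'and', 'SARS-CoV', '.']

# Strict filter: every character of the token must be a letter.
strict = [tok for tok in tokens_sample if tok.isalpha()]

def has_letter(token):
    """Hypothetical looser test: keep tokens with at least one letter."""
    return any(ch.isalpha() for ch in token)

loose = [tok for tok in tokens_sample if has_letter(tok)]

print(strict)  # hyphenated names are gone
print(loose)   # hyphenated names survive, pure punctuation does not
```

Which filter is appropriate depends on whether hyphenated terms carry meaning for the analysis at hand.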


Let's normalize the entire `text`. You can use either a `for` loop or a list comprehension:

In [36]:
text_norm=[[word for word in item if word.isalpha()] for item in tokens]

In [41]:
text_norm[1]

['Coronaviruses',
 'CoV',
 'are',
 'a',
 'large',
 'family',
 'of',
 'viruses',
 'that',
 'cause',
 'illness',
 'ranging',
 'from',
 'the',
 'common',
 'cold',
 'to',
 'more',
 'severe',
 'diseases',
 'such',
 'as',
 'Middle',
 'East',
 'Respiratory',
 'Syndrome',
 'and',
 'Severe',
 'Acute',
 'Respiratory',
 'Syndrome']
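
For comparison, the same filtering can be written with explicit `for` loops. The sketch below runs both versions on a small stand-in list (`tokens_demo` is made up here so the snippet runs on its own) and checks that they agree.

```python
# Small stand-in for the nested `tokens` list built earlier.
tokens_demo = [['What', 'are', 'Coronaviruses', '?'],
               ['A', 'novel', 'coronavirus', '(', 'nCoV', ')', '!']]

# Version 1: explicit nested loops.
norm_loop = []
for sentence in tokens_demo:
    kept = []
    for word in sentence:
        if word.isalpha():
            kept.append(word)
    norm_loop.append(kept)

# Version 2: the equivalent nested list comprehension.
norm_comp = [[word for word in sentence if word.isalpha()]
             for sentence in tokens_demo]

print(norm_loop == norm_comp)
```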

When doing text analysis, such as computing a frequency distribution, we often don't want to distinguish between Data, data, DATA, dAta, etc. In other words, we want them to count as the same token rather than as different tokens. To achieve this, we can convert the entire text to either lower or upper case.

For example, let's convert every token in our text to lower case:

In [44]:
tokens_lower=[[word.lower() for word in item if word.isalpha()] for item in tokens]

In [45]:
tokens_lower

[['what', 'are', 'coronaviruses'],
 ['coronaviruses',
  'cov',
  'are',
  'a',
  'large',
  'family',
  'of',
  'viruses',
  'that',
  'cause',
  'illness',
  'ranging',
  'from',
  'the',
  'common',
  'cold',
  'to',
  'more',
  'severe',
  'diseases',
  'such',
  'as',
  'middle',
  'east',
  'respiratory',
  'syndrome',
  'and',
  'severe',
  'acute',
  'respiratory',
  'syndrome'],
 ['a',
  'novel',
  'coronavirus',
  'ncov',
  'is',
  'a',
  'new',
  'strain',
  'that',
  'has',
  'not',
  'been',
  'previously',
  'identified',
  'in',
  'humans'],
 ['coronaviruses',
  'are',
  'zoonotic',
  'meaning',
  'they',
  'are',
  'transmitted',
  'between',
  'animals',
  'and',
  'people'],
 ['detailed',
  'investigations',
  'found',
  'that',
  'was',
  'transmitted',
  'from',
  'civet',
  'cats',
  'to',
  'humans',
  'and',
  'from',
  'dromedary',
  'camels',
  'to',
  'humans'],
 ['several',
  'known',
  'coronaviruses',
  'are',
  'circulating',
  'in',
  'animals',
  'that',


Sometimes we don't want to distinguish between words such as dog and dogs, woman and women, or lie, lying and liar. In other words, we may only care whether the words share the same stem. In that case, we can first apply a normalization technique that maps these different forms to a common form.

This can be done with what are known as stemmers. As we will shortly see, they are imperfect and do not always yield great results, so we have to be cautious when using them.

Let's begin by creating a list of words:

In [46]:
list1=['cat','cats','dog','dogs','doggies','woman','women','lie','liar','lying','week','weekly','break','breaking']

In [47]:
porter=nltk.PorterStemmer()

In [51]:
list1_porter=[porter.stem(word) for word in list1]

In [52]:
list1_porter

['cat',
 'cat',
 'dog',
 'dog',
 'doggi',
 'woman',
 'women',
 'lie',
 'liar',
 'lie',
 'week',
 'weekli',
 'break',
 'break']

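So, the Porter stemmer did fairly well, but it missed some cases: `woman` and `women` were not merged, and `doggies` and `weekly` became the non-words `doggi` and `weekli`. Let's see if another stemmer can do better: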

In [50]:
lan=nltk.LancasterStemmer()

In [53]:
list1_lan=[lan.stem(word) for word in list1]

In [57]:
list1_lan

['cat',
 'cat',
 'dog',
 'dog',
 'doggy',
 'wom',
 'wom',
 'lie',
 'liar',
 'lying',
 'week',
 'week',
 'break',
 'break']

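The Lancaster stemmer merges `woman` and `women` (as `wom`) and reduces `weekly` to `week`, but it leaves `lying` untouched. In an attempt to combine the strengths of both stemmers, we can try applying them sequentially, one after the other: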

In [58]:
[lan.stem(word) for word in list1_porter]

['cat',
 'cat',
 'dog',
 'dog',
 'dogg',
 'wom',
 'wom',
 'lie',
 'liar',
 'lie',
 'week',
 'weekl',
 'break',
 'break']

In [59]:
[porter.stem(word) for word in list1_lan]

['cat',
 'cat',
 'dog',
 'dog',
 'doggi',
 'wom',
 'wom',
 'lie',
 'liar',
 'lie',
 'week',
 'week',
 'break',
 'break']
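
To get a feel for why rule-based stemmers behave this way, here is a toy suffix-stripping stemmer. It is not the Porter or Lancaster algorithm, which use far larger and more careful rule sets. The rules and the `toy_stem` function are invented for this illustration.

```python
# Invented rules for illustration: (suffix, replacement), tried in order.
RULES = [("ies", "y"), ("ing", ""), ("ly", ""), ("s", "")]

def toy_stem(word, min_stem=3):
    """Apply the first matching rule that leaves at least min_stem letters."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem:
                return stem + replacement
    return word

list1 = ['cat', 'cats', 'dog', 'dogs', 'doggies', 'woman', 'women', 'lie',
         'liar', 'lying', 'week', 'weekly', 'break', 'breaking']
toy_result = [toy_stem(word) for word in list1]

print(toy_result)
print(toy_stem('news'))  # over-stemming: 'news' is reduced to 'new'
```

Even this tiny rule set handles regular plurals and `-ing` forms, yet it cannot relate `woman` to `women` and it wrongly strips the final `s` from `news`. Blind suffix rules produce both kinds of error, which is why stemmer output should be checked before it is trusted.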

<h2>Tagging Parts of Speech</h2>

When performing text analysis, and especially tasks such as text summarization, knowing the parts of speech can be very important. Being able to correctly identify the parts of speech and tag them accordingly may be a crucial step in building a sophisticated solution to a problem involving a large amount of text. For example, we may want to extract only the nouns from a text, or count how many adjectives there are per sentence. Tagging the text with the appropriate part-of-speech tags allows us to do all of this and much more.

In [129]:
text2=text_norm[3]

In [130]:
text2

['Coronaviruses',
 'are',
 'zoonotic',
 'meaning',
 'they',
 'are',
 'transmitted',
 'between',
 'animals',
 'and',
 'people']

In [131]:
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\valmir.bucaj\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\valmir.bucaj\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [132]:
text2_tag=nltk.pos_tag(text2)

In [133]:
text2_tag

[('Coronaviruses', 'NNS'),
 ('are', 'VBP'),
 ('zoonotic', 'JJ'),
 ('meaning', 'NN'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('transmitted', 'VBN'),
 ('between', 'IN'),
 ('animals', 'NNS'),
 ('and', 'CC'),
 ('people', 'NNS')]
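
Once a sentence is tagged, tasks such as extracting the nouns or counting the adjectives reduce to filtering the `(word, tag)` pairs. The sketch below copies the tagged output above into a literal list so that it runs without NLTK. It assumes the Penn Treebank convention that noun tags start with `NN` and adjective tags with `JJ`.

```python
from collections import Counter

# Copied from the tagger output above.
tagged = [('Coronaviruses', 'NNS'), ('are', 'VBP'), ('zoonotic', 'JJ'),
          ('meaning', 'NN'), ('they', 'PRP'), ('are', 'VBP'),
          ('transmitted', 'VBN'), ('between', 'IN'), ('animals', 'NNS'),
          ('and', 'CC'), ('people', 'NNS')]

# Keep only the words whose tag marks some kind of noun.
nouns = [word for word, tag in tagged if tag.startswith('NN')]

# Count the adjectives in the sentence.
adjective_count = sum(1 for _, tag in tagged if tag.startswith('JJ'))

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(adjective_count)
print(tag_counts.most_common(2))
```

Note that `meaning` is tagged `NN` here, so it ends up among the nouns. Tagger mistakes carry over into anything computed from the tags.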

To see what each of these abbreviations means, we can print NLTK's documentation for the Penn Treebank tagset:

In [80]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

<font color='red' size=4>Exercise</font>

Find the most common nouns in the text above.

In [153]:
print(text)

What are Coronaviruses? Coronaviruses (CoV) are a large family of viruses that cause illness 
ranging from the common cold to more severe diseases such as Middle East Respiratory Syndrome (MERS-CoV) and 
Severe Acute Respiratory Syndrome (SARS-CoV). A novel coronavirus (nCoV) is a new strain that has not 
been previously identified in humans! 
Coronaviruses are zoonotic; meaning they are transmitted between animals and people.  

Detailed investigations found that SARS-CoV was transmitted from civet cats to humans and MERS-CoV from 
dromedary camels to humans. Several known coronaviruses are circulating in animals that have not yet infected 
humans. 

Common signs of infection include respiratory symptoms, fever, cough, shortness of breath and breathing
difficulties. In more severe cases, infection can cause pneumonia, severe acute respiratory syndrome, kidney 
failure and even death. 



In [154]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\valmir.bucaj\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\universal_tagset.zip.


True

In [159]:
text_tag=[word for word in nltk.pos_tag(nltk.word_tokenize(text), tagset='universal') if word[0].isalpha() and
         word[1]=='NOUN']

In [160]:
text_tag

[('Coronaviruses', 'NOUN'),
 ('Coronaviruses', 'NOUN'),
 ('CoV', 'NOUN'),
 ('family', 'NOUN'),
 ('viruses', 'NOUN'),
 ('cold', 'NOUN'),
 ('diseases', 'NOUN'),
 ('Middle', 'NOUN'),
 ('East', 'NOUN'),
 ('Respiratory', 'NOUN'),
 ('Syndrome', 'NOUN'),
 ('Severe', 'NOUN'),
 ('Acute', 'NOUN'),
 ('Respiratory', 'NOUN'),
 ('Syndrome', 'NOUN'),
 ('coronavirus', 'NOUN'),
 ('strain', 'NOUN'),
 ('humans', 'NOUN'),
 ('Coronaviruses', 'NOUN'),
 ('animals', 'NOUN'),
 ('people', 'NOUN'),
 ('investigations', 'NOUN'),
 ('civet', 'NOUN'),
 ('cats', 'NOUN'),
 ('humans', 'NOUN'),
 ('camels', 'NOUN'),
 ('humans', 'NOUN'),
 ('coronaviruses', 'NOUN'),
 ('animals', 'NOUN'),
 ('humans', 'NOUN'),
 ('Common', 'NOUN'),
 ('signs', 'NOUN'),
 ('infection', 'NOUN'),
 ('respiratory', 'NOUN'),
 ('symptoms', 'NOUN'),
 ('cough', 'NOUN'),
 ('shortness', 'NOUN'),
 ('breath', 'NOUN'),
 ('breathing', 'NOUN'),
 ('difficulties', 'NOUN'),
 ('cases', 'NOUN'),
 ('infection', 'NOUN'),
 ('pneumonia', 'NOUN'),
 ('respiratory', 'NOUN'

In [161]:
fd=nltk.FreqDist([word[0] for word in text_tag])

In [162]:
fd.most_common(3)

[('humans', 4), ('Coronaviruses', 3), ('Respiratory', 2)]
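
Notice that `Coronaviruses` is counted 3 times even though the lower-case `coronaviruses` also appears in the noun list: the two spellings are treated as different tokens. The sketch below uses `collections.Counter` on a small hand-picked sample of those nouns (not the full list) to show how lower-casing before counting merges them.

```python
from collections import Counter

# Hand-picked sample of nouns from the output above, not the full list.
nouns_sample = ['Coronaviruses', 'Coronaviruses', 'Coronaviruses',
                'coronaviruses', 'humans', 'humans', 'humans', 'humans',
                'Respiratory', 'Respiratory', 'respiratory', 'respiratory']

raw_counts = Counter(nouns_sample)
folded_counts = Counter(word.lower() for word in nouns_sample)

print(raw_counts['Coronaviruses'], raw_counts['coronaviruses'])
print(folded_counts['coronaviruses'])
```

Whether to fold case before counting is a modelling choice. For a "most common nouns" question it usually gives the more meaningful answer.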

In [134]:
cfd=nltk.ConditionalFreqDist((len(word[0]),word[0]) for word in text_tag)

In [151]:
[cfd[i].most_common(1) for i in range(3,8)]

[[('CoV', 1)],
 [('cold', 1)],
 [('Acute', 1)],
 [('humans', 4)],
 [('animals', 2)]]

In [149]:
cfd[5].most_common()

[('Acute', 1),
 ('civet', 1),
 ('signs', 1),
 ('cough', 1),
 ('cases', 1),
 ('death', 1)]
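
A `ConditionalFreqDist` is built from `(condition, sample)` pairs and keeps one frequency distribution per condition. Here the condition is the word length. As a rough mental model only, the same bookkeeping can be sketched with a `defaultdict` of `Counter` objects. The word list below is a made-up sample.

```python
from collections import Counter, defaultdict

# Made-up sample of words; the condition will be each word's length.
words = ['CoV', 'cold', 'Acute', 'civet', 'humans', 'humans',
         'animals', 'animals']

by_length = defaultdict(Counter)
for word in words:
    by_length[len(word)][word] += 1  # one Counter per word length

print(sorted(by_length))             # the conditions that were seen
print(by_length[6].most_common(1))   # most frequent 6-letter word
print(by_length[5])                  # all 5-letter words with counts
```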

<font color='red' size=5>Exercise 1</font>

As we know, the same word can often be used as different parts of speech. 

  Find all the parts of speech used for the words <b>well, like</b> and <b>out</b>, regardless of case (upper or lower).

<font size=5 color='red'>Exercise 2</font>

In the text Alice:

<ul>
    <li>Find all the cases where there was a choice between two nouns. For example, <b> water</b> or <b>food</b></li>
    <li> Find all the cases where there is a noun followed by the word <b>and</b> and another noun. For example, <b>apple</b> and <b>sword</b></li>
  </ul>
  

<h2>Chunking</h2>

Words often come in groups, for example <b>New Orleans, Coffee Shop, Star Wars, Coffee Table, TV Stand</b>, etc. We don't want to treat their parts as unrelated tokens; we would rather keep them together so as not to lose meaning. We can do so via chunking, by specifying the kinds of word patterns we'd like to chunk together.

In [350]:
text="""We are taking a Data Science course at the Military Academy in the state of New York. We brought five
coffee tables via two separate jet planes from New Orleans.We ran into cbs news station"""

Let's tokenize and tag the text to see what tags our words of interest have:

In [351]:
text_tag=nltk.pos_tag(nltk.word_tokenize(text))

In [352]:
text_tag

[('We', 'PRP'),
 ('are', 'VBP'),
 ('taking', 'VBG'),
 ('a', 'DT'),
 ('Data', 'NNP'),
 ('Science', 'NNP'),
 ('course', 'NN'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('Military', 'NNP'),
 ('Academy', 'NNP'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('state', 'NN'),
 ('of', 'IN'),
 ('New', 'NNP'),
 ('York', 'NNP'),
 ('.', '.'),
 ('We', 'PRP'),
 ('brought', 'VBD'),
 ('five', 'CD'),
 ('coffee', 'NN'),
 ('tables', 'NNS'),
 ('via', 'IN'),
 ('two', 'CD'),
 ('separate', 'JJ'),
 ('jet', 'NN'),
 ('planes', 'NNS'),
 ('from', 'IN'),
 ('New', 'NNP'),
 ('Orleans.We', 'NNP'),
 ('ran', 'VBD'),
 ('into', 'IN'),
 ('cbs', 'JJ'),
 ('news', 'NN'),
 ('station', 'NN')]

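What we are looking for is a run of consecutive nouns with nothing in between separating them. In the grammar below, `+` means "one or more", so a single noun also forms a chunk, and each rule matches runs of only one tag: `NN`, `NNP` or `NNPS`.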

In [364]:
sequence="""
    chunk:
    {<NN>+}
    {<NNP>+}
    {<NNPS>+}
"""

In [365]:
chunk=nltk.RegexpParser(sequence)

In [368]:
results=chunk.parse(text_tag)

In [370]:
print(results)

(S
  We/PRP
  are/VBP
  taking/VBG
  a/DT
  (chunk Data/NNP Science/NNP)
  (chunk course/NN)
  at/IN
  the/DT
  (chunk Military/NNP Academy/NNP)
  in/IN
  the/DT
  (chunk state/NN)
  of/IN
  (chunk New/NNP York/NNP)
  ./.
  We/PRP
  brought/VBD
  five/CD
  (chunk coffee/NN)
  tables/NNS
  via/IN
  two/CD
  separate/JJ
  (chunk jet/NN)
  planes/NNS
  from/IN
  (chunk New/NNP Orleans.We/NNP)
  ran/VBD
  into/IN
  cbs/JJ
  (chunk news/NN station/NN))
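
To make the effect of this particular grammar concrete, the sketch below mimics it in plain Python by grouping consecutive tokens that share the same tag and keeping the groups tagged `NN`, `NNP` or `NNPS`. It stands in for these three rules only and is not a replacement for `RegexpParser`. The helper `chunk_same_tag_runs` and the abridged `tagged_sample` are introduced here for illustration.

```python
from itertools import groupby

CHUNK_TAGS = {'NN', 'NNP', 'NNPS'}

def chunk_same_tag_runs(tagged):
    """Group consecutive tokens sharing one of CHUNK_TAGS into word lists."""
    chunks = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag in CHUNK_TAGS:
            chunks.append([word for word, _ in group])
    return chunks

# Abridged selection of the tagged tokens shown earlier.
tagged_sample = [('a', 'DT'), ('Data', 'NNP'), ('Science', 'NNP'),
                 ('course', 'NN'), ('five', 'CD'), ('coffee', 'NN'),
                 ('tables', 'NNS'), ('into', 'IN'), ('news', 'NN'),
                 ('station', 'NN')]

chunks = chunk_same_tag_runs(tagged_sample)
print(chunks)
```

This also explains two details of the parser output: `coffee tables` and `jet planes` are split because `tables` and `planes` are tagged `NNS`, which none of the three rules cover, and `Data Science` is kept apart from `course` because their tags differ.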


<h2>Stop Words</h2>

When trying to extract only the meaningful words from a text, we want to get rid of so-called stop words: common, non-descriptive words such as `the`, `a`, `an`, `is`, etc.

In [377]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\valmir.bucaj\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [378]:
stop_words=nltk.corpus.stopwords.words('english')

In [379]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each
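
Since the stop words in the list are all lower case, tokens should be lower-cased before they are compared against it. A minimal sketch of the filtering step follows. It uses a tiny hand-written stop set as a stand-in for the NLTK list, and `remove_stop_words` is a hypothetical helper.

```python
from collections import Counter

# Tiny hand-written stand-in for the NLTK stop-word list (assumption).
STOP_WORDS_DEMO = {'the', 'a', 'an', 'is', 'are', 'of', 'that', 'and',
                   'to', 'in'}

def remove_stop_words(words, stop_words=STOP_WORDS_DEMO):
    """Lower-case each word and keep it only if it is not a stop word."""
    return [w.lower() for w in words if w.lower() not in stop_words]

words = ['Coronaviruses', 'are', 'a', 'large', 'family', 'of', 'viruses',
         'that', 'cause', 'illness']
content = remove_stop_words(words)

print(content)
print(Counter(content).most_common(3))
```

Storing the stop words in a `set` keeps each membership test fast, which matters when filtering an entire book.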

<font color='red' size=5>Exercise</font>

Extract the 10 most common words in the Alice book that are descriptive of the text itself. 