<a href="https://colab.research.google.com/github/levitannin/Madlib-Workshop/blob/main/Natural_Language_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

Welcome, you made it!  This is a supplementary notebook for the [Madlib Workshop](https://github.com/levitannin/Madlib-Workshop/blob/main/MadLib_Workshop.ipynb)!  This notebook works just fine as a standalone too.  Feel free to use this as a baseline for future endeavours!

In this notebook we focus on the basics of Natural Language Processing and the **Natural Language Toolkit** (NLTK) library.  The best place to learn about this tool is the NLTK documentation! [Click Me to Learn!](https://www.nltk.org/)

Remember, we are working with Python 3 at this point so avoid looking into Python 2 versions of this library where you can.

Let's get started!

In [None]:
!pip install nltk

In [None]:
import nltk
# You may need this to download all of the nltk dependencies or extras!
nltk.download("book")

In [None]:
"""
Let's start with Tokenizing!

Tokenizing can be broken down into:
  Word Tokenizing == Break text down into words
  Sentence Tokenizing == Break text down into sentences

Other terms:
  Corpus == a body of text (plural: corpora)
  Lexicon == dictionary; words and their meanings
"""

from nltk.tokenize import sent_tokenize, word_tokenize

example_text = "Hello Mr. Stranger, how are you doing today?  The weather is great and python is awesome.  Let's meet later for tea.  We can discuss why the sky is pinkish blue, and how that tells us not to eat cardboard."

print(sent_tokenize(example_text))
print(word_tokenize(example_text))
# NOTE: word_tokenize treats each punctuation mark as its own 'word' by default.

# Below will print each word on its own line instead of inside a list.

for i in word_tokenize(example_text):
  print(i)
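
If the punctuation tokens get in the way, one option is a tokenizer that only keeps runs of letters and digits. The sketch below uses NLTK's `RegexpTokenizer` with the pattern `\w+`; the variable names are only for illustration. It needs no downloaded data, but it also splits contractions such as "Let's" into two pieces, so treat it as a trade-off rather than a drop-in replacement for `word_tokenize`.

```python
from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters (letters, digits, underscore).
# Punctuation is never matched, so it never becomes a token.
word_only_tokenizer = RegexpTokenizer(r"\w+")

sample = "Hello Mr. Stranger, how are you doing today?"
word_only_tokens = word_only_tokenizer.tokenize(sample)

print(word_only_tokens)
```

Notice that "Mr." comes back as "Mr": the period is dropped along with the other punctuation.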

In [None]:
"""
Okay so now we understand tokenization, let's build on that!

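Now that we have an example of text, let's figure out what the 'stop words' are.
Stop words are very common words ('the', 'is', 'and') that add little meaning,
so we usually filter them out.  We are expecting these words to be from the
English language for now.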
"""
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
print(stop_words)

words = word_tokenize(example_text)
# NLTK's stop words are all lowercase, so compare lowercased tokens ('The' == 'the').
filtered_sentence = [w for w in words if w.lower() not in stop_words]

print(filtered_sentence)
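
The filtered list usually still holds punctuation tokens. As a rough sketch, the filter can be wrapped in a small helper that ignores case and also drops tokens with no letters or digits. The helper name `remove_stop_words` and the tiny demo lists are made up for this example; in the notebook you would pass in `word_tokenize(example_text)` and the NLTK `stop_words` set instead.

```python
def remove_stop_words(tokens, stop_words):
    """Keep tokens that are not stop words (ignoring case) and are not pure punctuation."""
    return [
        t for t in tokens
        if t.lower() not in stop_words          # case-insensitive stop word check
        and any(ch.isalnum() for ch in t)       # drop tokens like "." or "``"
    ]

# Hand-written stand-ins (assumptions) so this runs without any NLTK data.
demo_tokens = ["The", "weather", "is", "great", "and", "python", "is", "awesome", "."]
demo_stop_words = {"the", "is", "and"}

demo_filtered = remove_stop_words(demo_tokens, demo_stop_words)
print(demo_filtered)
```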

In [None]:
"""
We are on a roll!  Now let's dig into stemming and what this means for us.
What is a stem?  Well, think of the endings of words like 'ing', 'ed', etc.
We want the root (or stem, get it?) of a word instead of the extra fluff (leaves?)

To get those stems we need to do a little processing.  Why?
  If two words would be 'the same' but are in different tenses, this can cause
  clutter.  To avoid that, we use stemming.
"""

from nltk.stem import PorterStemmer

ps = PorterStemmer()

for w in words:
  print(ps.stem(w))
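
To see why stemming reduces clutter, it helps to feed the stemmer several forms of one word and group them by the stem that comes back. This is a small illustration using the same `PorterStemmer`; the word list and variable names are chosen just for the example.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
related_words = ["connect", "connected", "connecting", "connection", "connections"]

# Map each stem to the original words that reduce to it.
words_by_stem = defaultdict(list)
for word in related_words:
    words_by_stem[stemmer.stem(word)].append(word)

for stem, originals in words_by_stem.items():
    print(stem, "<-", originals)
```

All five forms land under a single stem, so later counting or matching treats them as one word.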

In [None]:
"""
Now we are getting into the fun stuff.  If you're following along with the
  Madlib Workshop, this is the part we'll really be using to organize the text.

At this stage we are focused on identifying parts of speech (PoS)
  and on the corpora (bodies of work) that come with nltk.
"""

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import gutenberg, state_union
from nltk import pos_tag

#   Speech Tagging -- Identifying the different parts of speech.
#-----------------------------------------------------------------------------
#   Identify what items the imported corpus may have.

# Gutenberg is an excellent resource, highly recommend checking out the website.
print("\nText available from Gutenberg: \n")
print(gutenberg.fileids())
print("\nText available from State of the Union: \n")
print(state_union.fileids())

#   Useful tools we can use on the imported text.
#-----------------------------------------------------------------------------
#   The following will create a table to identify:
#       Average word length
#       Average sentence length
#       Average number of times each vocab word appears
#   For each text in the Gutenberg corpus
print("Ave Word Len \t Ave Sent Len \t Vocab Occurrence \t Title ")
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    
    print(round(num_chars / num_words), "\t\t", round(num_words / num_sents),
          "\t\t", round(num_words / num_vocab), "\t\t", fileid)

# Now let's just focus on one of the texts.
paradise_sent = gutenberg.sents("milton-paradise.txt")
print(len(paradise_sent))

#   Print a sentence in the corpus after breaking into sents(ences)
print(paradise_sent[1313])

#   Identify the longest sentence in the chosen corpus.
longest_sent = max(len(s) for s in paradise_sent)
#   Build a list so the sentence(s) are printed, not a generator object.
print([s for s in paradise_sent if len(s) == longest_sent])

#   Can find other text sources at: https://www.nltk.org/book/ch02.html
#   Raw gives the content of the file without any linguistic processing.
train_text = state_union.raw("1953-Eisenhower.txt")
sample_text = state_union.raw("1959-Eisenhower.txt")
#   Train a Punkt sentence tokenizer on our own text, then split the sample with it.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
token = custom_sent_tokenizer.tokenize(sample_text)

'''
POS tag list:
    CC      coordinating conjunction
    CD      cardinal number
    DT      determiner
    EX      existential there (like: "there is" ... think of it like "there exists")
    FW      foreign word
    IN      preposition/subordinating conjunction
    JJ      adjective 'big'
    JJR     adjective, comparative 'bigger'
    JJS     adjective, superlative 'biggest'
    LS      list marker 1)
    MD      modal could, will
    NN      noun, singular 'desk'
    NNS     noun plural 'desks'
    NNP     proper noun, singular 'Harrison'
    NNPS    proper noun, plural 'Americans'
    PDT     predeterminer 'all the kids'
    POS     possessive ending parent's
    PRP     personal pronoun I, he, she
    PRP$    possessive pronoun my, his, hers
    RB      adverb very, silently,
    RBR     adverb, comparative better
    RBS     adverb, superlative best
    RP      particle give up
    TO      to go 'to' the store.
    UH      interjection errrrrrrrm
    VB      verb, base form take
    VBD     verb, past tense took
    VBG     verb, gerund/present participle taking
    VBN     verb, past participle taken
    VBP     verb, sing. present, non-3rd person take
    VBZ     verb, 3rd person sing. present takes
    WDT     wh-determiner which
    WP      wh-pronoun who, what
    WP$     possessive wh-pronoun whose
    WRB     wh-adverb where, when
'''
try:
    #   Slice token here to start partway through the text, e.g. token[5:]
    for i in token:
        words = word_tokenize(i)
        tag = pos_tag(words)
        print(tag)
        
except Exception as e:
    print(str(e))
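
`pos_tag` returns a list of `(word, tag)` tuples, which is handy to regroup by tag when you want, say, every noun in a sentence. Below is a minimal sketch of that regrouping. The helper name `group_by_tag` is made up for this example, and the tagged sentence is written by hand to mimic the shape of `pos_tag` output, so the snippet runs without the tagger data.

```python
from collections import defaultdict

def group_by_tag(tagged_words):
    """Collect words under their part-of-speech tag, keeping sentence order."""
    grouped = defaultdict(list)
    for word, tag in tagged_words:
        grouped[tag].append(word)
    return dict(grouped)

# Hand-tagged stand-in (an assumption) for pos_tag(word_tokenize(...)) output.
hand_tagged = [
    ("The", "DT"), ("weather", "NN"), ("is", "VBZ"), ("great", "JJ"),
    ("and", "CC"), ("python", "NN"), ("is", "VBZ"), ("awesome", "JJ"),
]

words_by_tag = group_by_tag(hand_tagged)
print(words_by_tag["NN"])   # the nouns
print(words_by_tag["JJ"])   # the adjectives
```

In the notebook, `group_by_tag(tag)` inside the loop above would give the same kind of dictionary for each real sentence.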

In [None]:
"""
Now that we understand PoS, let's go into chunking.
This will let us identify or group words based on parts of speech.
"""

from nltk import ne_chunk, RegexpParser

try:
    #   Slice token here to start partway through the text, e.g. token[5:]
    for i in token:
        words = word_tokenize(i)
        tag = pos_tag(words)
        
        #   To find all versions of a POS use regular expressions to identify it.
        #   Example here is adverbs: <RB.?> matches RB, RBR and RBS
        #   . == any character other than new line
        #   ? == 0 or 1
        #   * == 0 or MORE
        #   + == 1 or MORE
        #   | == or
        chunkGram = r""" Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?} """
        chunkParser = RegexpParser(chunkGram)
        chunked = chunkParser.parse(tag)
        #   Will generate a pop-up box with the chunks drawn out in a chart.
        #     Or would in an IDE rather than Google Colab, which doesn't have
        #       access to your computer's display.
        chunked.draw()
        
except Exception as e:
    print(str(e))
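
Since `draw()` needs a display, a text-only way to inspect the result can be useful in Colab. The sketch below applies the same chunk grammar to one hand-tagged sentence (an assumption standing in for real `pos_tag` output), prints the tree as text, and pulls out the words of each `Chunk` subtree.

```python
from nltk import RegexpParser

# Hand-tagged stand-in for pos_tag(word_tokenize(sentence)).
tagged_sentence = [
    ("President", "NNP"), ("Eisenhower", "NNP"),
    ("spoke", "VBD"), ("quietly", "RB"), ("today", "NN"),
]

chunk_parser = RegexpParser(r"Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}")
tree = chunk_parser.parse(tagged_sentence)

# Printing the tree gives a bracketed text version instead of a pop-up window.
print(tree)

# Collect just the words inside each subtree labelled "Chunk".
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(chunks)
```

Only "President Eisenhower" is chunked here, because the grammar requires at least one proper noun (`<NNP>+`) in every chunk.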

In [None]:
try:
    #   token[6:] skips the first six sentences; change the slice to start elsewhere.
    for i in token[6:]:
        words = word_tokenize(i)
        tag = pos_tag(words)
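        #   Chinking: chunk everything with {<.*>+}, then remove (chink) verbs,
        #     prepositions and determiners using the inverted braces }...{
        chunkGram = r""" Chunk: {<.*>+} 
                                }<VB.?|IN|DT>+{"""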
        chunkParser = RegexpParser(chunkGram)
        chunked = chunkParser.parse(tag)
        #   Will generate a pop-up box with the chunks drawn out in a chart.
        #     Or would in an IDE rather than Google Colab, which doesn't have
        #       access to your computer's display.
        chunked.draw()
        
except Exception as e:
    print(str(e))
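
The grammar in this last cell does the reverse of chunking, often called chinking: it first chunks everything, then cuts chosen tags back out. Here is a small text-only sketch of the same grammar on a hand-tagged sentence (again an assumption standing in for `pos_tag` output), so the effect is visible without `draw()`.

```python
from nltk import RegexpParser

# Hand-tagged stand-in for pos_tag(word_tokenize(sentence)).
tagged_sentence = [("The", "DT"), ("weather", "NN"), ("is", "VBZ"), ("great", "JJ")]

# First rule chunks every token; second rule chinks verbs, prepositions, determiners.
chink_grammar = r"""
Chunk: {<.*>+}
       }<VB.?|IN|DT>+{
"""
chink_parser = RegexpParser(chink_grammar)
chinked_tree = chink_parser.parse(tagged_sentence)

print(chinked_tree)

# Words that survived inside a "Chunk" after chinking.
remaining_chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in chinked_tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(remaining_chunks)
```

Removing "The" and "is" splits the single big chunk into two smaller ones, one for "weather" and one for "great".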

This is the end!  You did it :)

If you're looking for more resources, check out these other NLP-focused Colab notebooks created by others:


*   https://colab.research.google.com/github/gal-a/blog/blob/master/docs/notebooks/nlp/nltk_preprocess.ipynb
*   https://colab.research.google.com/github/mhuckvale/pals0039/blob/master/Tutorial_NLTK.ipynb#scrollTo=02RtRYj_p0Xb 

