# Natural Language Processing - sentdex

## Tokenizing words and sentences
**Tokenizing:**
* word tokenizers
* sentence tokenizers 

**Lexicon and corpora:**
* corpus (plural: corpora) - a body of text, e.g. medical journals, presidential speeches, the English language
* lexicon - words and their meanings
    * e.g. investor speak vs. regular English speak

In [1]:
import nltk

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
example_text = 'Hello Mr. Smith, how are you doing today? The weather is great and python is awesome. The sky is pinkish-blue. You should not eat cardboard.'
print(sent_tokenize(example_text))
print(word_tokenize(example_text))

['Hello Mr. Smith, how are you doing today?', 'The weather is great and python is awesome.', 'The sky is pinkish-blue.', 'You should not eat cardboard.']
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'and', 'python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', 'not', 'eat', 'cardboard', '.']
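
To see why a dedicated sentence tokenizer helps, the sketch below splits the same text with a naive regular expression instead. It uses only the standard-library `re` module. The name `naive_sentences` and the splitting pattern are assumptions made for this illustration; this is not how `sent_tokenize` works internally.

```python
import re

example_text = ('Hello Mr. Smith, how are you doing today? The weather is great '
                'and python is awesome. The sky is pinkish-blue. '
                'You should not eat cardboard.')

# Naive rule (an assumption for this sketch): a sentence ends at
# '.', '?' or '!' followed by whitespace.
naive_sentences = re.split(r'(?<=[.?!])\s+', example_text)

for sentence in naive_sentences:
    print(sentence)
```

The naive rule breaks after `Mr.`, so it yields five pieces rather than the four sentences that `sent_tokenize` returned above.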


## Stop words

In [4]:
from nltk.corpus import stopwords

In [5]:
example_sentence = 'This is an example showing off stop word filtration.'
stop_words = set(stopwords.words('english'))
#print(stop_words)

In [6]:
words = word_tokenize(example_sentence)
filtered_sentence = []
for w in words:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']


A common one-liner for this block of code:

In [7]:
filtered_sentence = [w for w in words if w not in stop_words]
print(filtered_sentence)

['This', 'example', 'showing', 'stop', 'word', 'filtration', '.']
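
Note that 'This' survives the filter above because the NLTK stop word list is all lowercase. One possible variation, sketched here, lowercases each token before the lookup. To keep the snippet self-contained, `stop_words` is a small hand-picked set standing in for `set(stopwords.words('english'))`, and the token list is written out by hand rather than produced by `word_tokenize`.

```python
# Hand-picked stand-in for set(stopwords.words('english')) (assumption:
# only the stop words relevant to this sentence are listed).
stop_words = {'this', 'is', 'an', 'off'}

# Tokens written out by hand in the shape word_tokenize would produce.
words = ['This', 'is', 'an', 'example', 'showing', 'off',
         'stop', 'word', 'filtration', '.']

# Compare the lowercased token so capitalised stop words are removed too.
filtered_sentence = [w for w in words if w.lower() not in stop_words]
print(filtered_sentence)
```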


## Stemming
Stemming reduces different variations of a word that carry the same meaning to a common stem:
* I was taking a ride in the car.
* I was riding in the car. 

In [8]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

In [9]:
example_words = ['python', 'pythoner', 'pythoning', 'pythoned', 'pythonly']
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli
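
To illustrate the idea of mapping variants to one stem, here is a deliberately crude suffix stripper. It is not the Porter algorithm: `crude_stem`, its suffix list and its minimum-length rule are all assumptions made for this sketch.

```python
def crude_stem(word, suffixes=('ing', 'ed', 'er', 'ly')):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

example_words = ['python', 'pythoner', 'pythoning', 'pythoned', 'pythonly']
stems = [crude_stem(w) for w in example_words]
print(stems)
```

Unlike `PorterStemmer`, which turned 'pythonly' into 'pythonli' above, this toy version simply drops the 'ly'. A real stemmer applies many more rules in a fixed order.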


In [10]:
new_text = 'It is very important to be pythonly while pythoning with python. All pythoners have pythoned poorly at least once.'
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w), end=' ')

it is veri import to be pythonli while python with python . all python have python poorli at least onc . 

## Parts of Speech (POS) Tagging 
Labelling every word with its part of speech.

A list of POS tags is available at: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

In [11]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [12]:
train_text = state_union.raw('2005-GWBush.txt')
sample_text = state_union.raw('2006-GWBush.txt')

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [13]:
def process_content():
    try:
        for i in tokenized:
            words = word_tokenize(i)
            tagged = nltk.pos_tag(words)

            print(tagged)

    except Exception as e:
        print(str(e))

In [14]:
#process_content()
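
`nltk.pos_tag` returns a list of `(word, tag)` tuples, so the result can be processed with ordinary Python. The sketch below counts tags and pulls out the nouns. The `tagged` list is hand-written for illustration and is not actual tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns.
tagged = [('The', 'DT'), ('weather', 'NN'), ('is', 'VBZ'),
          ('great', 'JJ'), ('and', 'CC'), ('python', 'NN'),
          ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.')]

# How often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)

# Every noun tag (NN, NNS, NNP, NNPS) starts with 'NN'.
nouns = [word for word, tag in tagged if tag.startswith('NN')]

print(tag_counts)
print(nouns)
```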

## Chunking 
The next step toward figuring out the meaning of a sentence is chunking: grouping words into meaningful phrases using part of speech tags and regular expressions.

### Regular Expressions

https://www.tutorialspoint.com/python/python_reg_expressions.htm


A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings. 

In [15]:
def process_content():
    try:
        for i in tokenized:
            words = word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r'''Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}'''
            
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)

            print(chunked)

    except Exception as e:
        print(str(e))

### Understanding this regular expression
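`r'''Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}'''`

The period (`.`) matches any character except a newline, and the question mark (`?`) means 0 or 1 of the preceding item. Together, `.?` lets `<RB.?>` and `<VB.?>` match the base tag plus at most one extra character (e.g. RB, RBR, RBS). That is enough because no part of speech tag is longer than 3 characters, apart from the possessive pronoun tags (PRP$, WP$).

The asterisk (`*`) means 0 or more, so `<RB.?>*` matches any number of adverbs of any form, and `<VB.?>*` any number of verbs.

The plus (`+`) means one or more, so `<NNP>+` requires at least one proper noun.

The final `<NN>?` allows an optional singular noun.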
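
The quantifiers described above can be tried out with Python's standard `re` module. This is only an analogy under stated assumptions: the tag sequence is written as a plain string of `<TAG>` items, and the pattern is a hand translation of the chunk grammar, not what `RegexpParser` does internally.

```python
import re

# '.?' allows at most one extra character after the base tag.
tags = ['VB', 'VBD', 'VBG', 'VBZ', 'NN', 'RB', 'RBR']
verb_tags = [t for t in tags if re.fullmatch(r'VB.?', t)]
print(verb_tags)

# Hand translation of {<RB.?>*<VB.?>*<NNP>+<NN>?} into a plain regex.
chunk_pattern = re.compile(r'(<RB.?>)*(<VB.?>)*(<NNP>)+(<NN>)?')

# Hypothetical tag sequence for one sentence, written as a string.
tag_string = '<DT><NNP><NNP><NN><VBD>'
match = chunk_pattern.search(tag_string)
print(match.group(0))
```

Here the match skips the leading determiner, takes both proper nouns (`+`), and then picks up the optional trailing `<NN>`.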

In [16]:
#process_content()

## Chinking 
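A chink is something we wish to remove from a chunk (you chink from a chunk). In the grammar below, `{<.*>+}` first chunks everything, and `}<VB.?|IN|DT|TO>+{` then chinks out verbs, prepositions, determiners and 'to'.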

In [17]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = word_tokenize(i)
            tagged = nltk.pos_tag(words)

            chunkGram = r'''Chunk: {<.*>+} 
                                    }<VB.?|IN|DT|TO>+{'''

            
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)

            print(chunked)

    except Exception as e:
        print(str(e))

In [18]:
#process_content()
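
The effect of chinking can be imitated in plain Python: treat the whole sentence as one chunk, then cut it wherever a chink tag appears. The sketch below is an approximation of the idea using `itertools.groupby`, not NLTK's implementation, and the `tagged` list is hand-written rather than real tagger output.

```python
import re
from itertools import groupby

# Same chink tags as in the grammar: verbs, prepositions, determiners, 'to'.
CHINK = re.compile(r'VB.?|IN|DT|TO')

# Hand-written (word, tag) pairs for illustration.
tagged = [('The', 'DT'), ('president', 'NN'), ('spoke', 'VBD'),
          ('to', 'TO'), ('Congress', 'NNP'), ('yesterday', 'NN')]

def is_chink(pair):
    """True if the token's tag is one we want removed from the chunk."""
    return bool(CHINK.fullmatch(pair[1]))

# Consecutive non-chink tokens form the chunks that remain.
chunks = [[word for word, _ in group]
          for chinked, group in groupby(tagged, key=is_chink)
          if not chinked]
print(chunks)
```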

## Named Entity Recognition
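NLTK's named entity recognition is not especially accurate, so you might need other processes in place to catch the entities it misses.

Using binary=*True* means that we don't care what type the named entity is: every entity is simply labelled NE instead of a specific type such as PERSON or ORGANIZATION.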

In [25]:
def process_content():
    try:
        for i in tokenized[5:]:
            words = word_tokenize(i)
            tagged = nltk.pos_tag(words)

            named_ent = nltk.ne_chunk(tagged)
            #named_ent = nltk.ne_chunk(tagged, binary=True)

            print(named_ent)
    
    except Exception as e:
        print(str(e))

In [27]:
#process_content()

## Lemmatizing
This is similar to stemming, but the end result is a real word. 

The default POS for lemmatizing is noun, so if the word is not a noun you have to pass its POS through the `pos` argument.

In [28]:
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()

In [31]:
print(lem.lemmatize('cats'))
print(lem.lemmatize('cacti'))
print(lem.lemmatize('geese'))
print(lem.lemmatize('rocks'))
print(lem.lemmatize('python'))

cat
cactus
goose
rock
python


In [34]:
print(lem.lemmatize('better'))
print(lem.lemmatize('better', pos='a'))
print(lem.lemmatize('best', pos='a'))

better
good
best
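
Since `nltk.pos_tag` produces Penn Treebank tags while `lemmatize` expects a WordNet POS letter (`'n'`, `'v'`, `'a'` or `'r'`), a small mapping helper is a common companion. `penn_to_wordnet` below is a hypothetical helper written for these notes, not part of NLTK. It falls back to noun to mirror the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith('J'):   # JJ, JJR, JJS -> adjective
        return 'a'
    if tag.startswith('V'):   # VB, VBD, VBG, ... -> verb
        return 'v'
    if tag.startswith('R'):   # RB, RBR, RBS -> adverb
        return 'r'
    return 'n'                # everything else is treated as a noun

for tag in ['JJR', 'VBD', 'RB', 'NNP']:
    print(tag, '->', penn_to_wordnet(tag))
```

With it, a tagged token could be lemmatized as `lem.lemmatize(word, pos=penn_to_wordnet(tag))`.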
