
# Natural Language Processing with NLTK

[Jian Tao](https://coehpc.engr.tamu.edu/people/jian-tao/), Texas A&M University

Sept 17, 2021

Converted from **Intro to natural language processing with Python**, a notebook by [Juan Cruz Martinez](https://livecodestream.dev/authors/bajcmartinez/).

## Setting up the Environment

In [1]:
import nltk

## Tokenization

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
from nltk.tokenize import word_tokenize
Text = "Good morning, How you doing? Are you coming tonight?"
Tokenized = word_tokenize(Text)
print(Tokenized)

['Good', 'morning', ',', 'How', 'you', 'doing', '?', 'Are', 'you', 'coming', 'tonight', '?']


In [4]:
from nltk.tokenize import sent_tokenize
Text = "Good morning, How you doing? Are you coming tonight?"
Tokenized = sent_tokenize(Text)
print(Tokenized)

['Good morning, How you doing?', 'Are you coming tonight?']
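To make the token boundaries above concrete, here is a rough standard-library sketch that reproduces the `word_tokenize` result for this particular sentence. The regular expression and the helper name `simple_tokenize` are assumptions made for illustration; NLTK's tokenizer handles contractions, abbreviations, and other cases that this pattern does not.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a whole word; [^\w\s] matches one character that is
    # neither a word character nor whitespace (e.g. "," or "?").
    return re.findall(r"\w+|[^\w\s]", text)


text = "Good morning, How you doing? Are you coming tonight?"
tokens = simple_tokenize(text)
print(tokens)
```

For real text, `word_tokenize` remains the better choice; the sketch only shows that punctuation becomes its own token.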


## Stop words

In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
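from nltk.corpus import stopwords
# Use a separate name so the imported `stopwords` corpus is not shadowed.
stop_words = stopwords.words("english")
Text = ["Good", "morning", "How", "you", "doing", "Are", "you", "coming", "tonight"]
# The comparison is case-sensitive, so capitalized "How" and "Are" are kept.
for word in Text:
    if word not in stop_words:
        print(word)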

Good
morning
How
Are
coming
tonight
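The output above still contains `How` and `Are` because the NLTK English stop word list is all lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows; to stay self-contained it uses a small hand-picked subset of the NLTK list (an assumption for illustration) rather than loading the corpus.

```python
# A few entries that appear in NLTK's English stop word list.
stop_words = {"you", "how", "are", "doing"}

words = ["Good", "morning", "How", "you", "doing", "Are", "you", "coming", "tonight"]

# Lowercase each word only for the membership test, so the original
# casing of the words that survive is preserved.
kept = [w for w in words if w.lower() not in stop_words]
print(kept)
```

With the full list, the same idea applies: replace the hand-picked set with `set(stopwords.words("english"))`.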


In [7]:
from nltk.corpus import stopwords
stop_words = stopwords.words("english")
','.join(stop_words)

"i,me,my,myself,we,our,ours,ourselves,you,you're,you've,you'll,you'd,your,yours,yourself,yourselves,he,him,his,himself,she,she's,her,hers,herself,it,it's,its,itself,they,them,their,theirs,themselves,what,which,who,whom,this,that,that'll,these,those,am,is,are,was,were,be,been,being,have,has,had,having,do,does,did,doing,a,an,the,and,but,if,or,because,as,until,while,of,at,by,for,with,about,against,between,into,through,during,before,after,above,below,to,from,up,down,in,out,on,off,over,under,again,further,then,once,here,there,when,where,why,how,all,any,both,each,few,more,most,other,some,such,no,nor,not,only,own,same,so,than,too,very,s,t,can,will,just,don,don't,should,should've,now,d,ll,m,o,re,ve,y,ain,aren,aren't,couldn,couldn't,didn,didn't,doesn,doesn't,hadn,hadn't,hasn,hasn't,haven,haven't,isn,isn't,ma,mightn,mightn't,mustn,mustn't,needn,needn't,shan,shan't,shouldn,shouldn't,wasn,wasn't,weren,weren't,won,won't,wouldn,wouldn't"

## Stemming Words

In [8]:
help(nltk.stem)

Help on package nltk.stem in nltk:

NAME
    nltk.stem - NLTK Stemmers

DESCRIPTION
    Interfaces used to remove morphological affixes from words, leaving
    only the word stem.  Stemming algorithms aim to remove those affixes
    required for eg. grammatical role, tense, derivational morphology
    leaving only the stem of the word.  This is a difficult problem due to
    irregular words (eg. common verbs in English), complicated
    morphological rules, and part-of-speech and sense ambiguities
    (eg. ``ceil-`` is not the stem of ``ceiling``).
    
    StemmerI defines a standard interface for stemmers.

PACKAGE CONTENTS
    api
    arlstem
    arlstem2
    cistem
    isri
    lancaster
    porter
    regexp
    rslp
    snowball
    util
    wordnet

FILE
    /usr/local/lib/python3.7/dist-packages/nltk/stem/__init__.py




In [9]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["Loving", "Chocolate", "Retrieved"]
# Porter stems are lowercased and need not be dictionary words.
for word in words:
    print(ps.stem(word))

love
chocol
retriev
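The `nltk.stem` help text above warns that `ceil-` is not the stem of `ceiling`. A toy suffix stripper makes that pitfall visible. This is explicitly not the Porter algorithm; the function name `naive_stem`, the suffix list, and the minimum-length rule are all assumptions chosen for illustration.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    word = word.lower()
    for suffix in suffixes:
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["Loving", "Retrieved", "ceiling"]:
    print(w, "->", naive_stem(w))
```

The naive rule turns `ceiling` into `ceil`, which is exactly the kind of error that real stemmers need extra rules to avoid.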


## Counting Words

In [10]:
import nltk
words = ["men", "teacher", "men", "woman"]
# Lowercase name avoids confusion with the `nltk.FreqDist` class itself.
freq_dist = nltk.FreqDist(words)
for word, count in freq_dist.items():
    print(word, "---", count)

men --- 2
teacher --- 1
woman --- 1
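`nltk.FreqDist` builds on Python's `collections.Counter`, so the same counts can be sketched with the standard library alone. The example below uses `Counter` directly and is meant as an illustration of the counting step, not as a replacement for the extra methods `FreqDist` provides.

```python
from collections import Counter

words = ["men", "teacher", "men", "woman"]
counts = Counter(words)

# most_common(n) returns the n highest counts as (item, count) pairs.
for word, count in counts.most_common(2):
    print(word, "---", count)
```

Sorting by frequency with `most_common` is often more useful than iterating in insertion order once the vocabulary grows.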


## Word groups

In [11]:
words = "Learning python was such an amazing experience for me"
# Store the tokens under their own name so the `word_tokenize` function
# imported earlier is not overwritten.
tokens = nltk.word_tokenize(words)
print(list(nltk.bigrams(tokens)))

[('Learning', 'python'), ('python', 'was'), ('was', 'such'), ('such', 'an'), ('an', 'amazing'), ('amazing', 'experience'), ('experience', 'for'), ('for', 'me')]


In [12]:
tokens = nltk.word_tokenize(words)
print(list(nltk.trigrams(tokens)))

[('Learning', 'python', 'was'), ('python', 'was', 'such'), ('was', 'such', 'an'), ('such', 'an', 'amazing'), ('an', 'amazing', 'experience'), ('amazing', 'experience', 'for'), ('experience', 'for', 'me')]


In [13]:
tokens = nltk.word_tokenize(words)
print(list(nltk.ngrams(tokens, 4)))

[('Learning', 'python', 'was', 'such'), ('python', 'was', 'such', 'an'), ('was', 'such', 'an', 'amazing'), ('such', 'an', 'amazing', 'experience'), ('an', 'amazing', 'experience', 'for'), ('amazing', 'experience', 'for', 'me')]
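The bigram, trigram, and 4-gram outputs above are all sliding windows over the token list. A minimal pure-Python sketch of that idea is shown below. The helper name `simple_ngrams` is hypothetical, and `str.split` stands in for `nltk.word_tokenize`, which is only safe here because the sentence has no punctuation.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # Build n shifted views of the list; zip stops at the shortest one,
    # which yields len(tokens) - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "Learning python was such an amazing experience for me".split()
bigrams = simple_ngrams(tokens, 2)
fourgrams = simple_ngrams(tokens, 4)
print(bigrams[0], len(bigrams))
print(fourgrams[-1], len(fourgrams))
```

A sentence of 9 tokens gives 8 bigrams and 6 four-grams, matching the lengths of the NLTK outputs above.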


## Lemmatization

In [14]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [15]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [16]:
from nltk.stem import WordNetLemmatizer
Lem = WordNetLemmatizer()
print(Lem.lemmatize("believes"))
print(Lem.lemmatize("retrieved"))

belief
retrieved


In [17]:
from nltk.stem import WordNetLemmatizer
Lem = WordNetLemmatizer()
print(Lem.lemmatize("believes", pos="v"))
print(Lem.lemmatize("retrieved", pos="v"))

believe
retrieve


## POS Taggers

In [18]:
nltk.download('averaged_perceptron_tagger')
words = "we work here"
tokens = nltk.word_tokenize(words)
print(nltk.pos_tag(tokens))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('we', 'PRP'), ('work', 'VBP'), ('here', 'RB')]


## Named Entity Recognition

In [19]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [20]:
# With this lowercase input, the printed tree below is flat:
# no named entities are chunked.
Text = "tom is in london"
Tokenize = nltk.word_tokenize(Text)
POS_tags = nltk.pos_tag(Tokenize)
NameEn = nltk.ne_chunk(POS_tags)
print(NameEn)

(S tom/NN is/VBZ in/IN london/NN)


## Sentiment Analysis

In [21]:
!pip3 install textblob

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [22]:
from textblob import TextBlob
text = "today is sunny"
blob = TextBlob(text)
# polarity lies in [-1.0, 1.0]; subjectivity lies in [0.0, 1.0].
print(blob.sentiment)

Sentiment(polarity=0.0, subjectivity=0.0)


## Spelling Correction

In [23]:
from textblob import TextBlob
Text = "Smalle businesses neede relief"
spelling_mistakes = TextBlob(Text)
print(spelling_mistakes.correct())

Small business need relief
