### NLTK with Python: summary from Real Python

In [1]:
# Tokenizing

You can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful.

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
mystr= 'my name is vivek. my village name is amauni'
sent_tokenize(mystr)

['my name is vivek.', 'my village name is amauni']

> The sentence tokenizer has split the string at the period (full stop).

In [4]:
word_tokenize(mystr)
# note that the period '.' is treated as a separate token

['my', 'name', 'is', 'vivek', '.', 'my', 'village', 'name', 'is', 'amauni']

In [5]:
mystr="it's vivekanand."
word_tokenize(mystr)

['it', "'s", 'vivekanand', '.']

Both "'s" and '.' have been split off as separate tokens. This happened because NLTK knows that 'it' and "'s" (a contraction of “is”) are two distinct words.
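`word_tokenize` usually needs NLTK's sentence-tokenizer data to be downloaded first. If that is not an option, NLTK also ships a rule-based `TreebankWordTokenizer` that works from regular expressions alone. A minimal sketch, assuming the input is already a single sentence (this class does not split sentences):

```python
from nltk.tokenize import TreebankWordTokenizer

# Rule-based word tokenizer; it uses regular expressions, not downloaded models.
tokenizer = TreebankWordTokenizer()

tokens = tokenizer.tokenize("it's vivekanand.")
print(tokens)
```

The contraction and the final period come out as their own tokens, as in the `word_tokenize` output above.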

In [6]:
# stop words

Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [7]:
# nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

mystr= 'my name is vivek. my village name is amauni'
w = word_tokenize(mystr)
sw = set(stopwords.words("english"))

# create a new list to store the filtered words
flist = []
for i in w:
    if i not in sw:
        flist.append(i)
print(flist)

['name', 'vivek', '.', 'village', 'name', 'amauni']


In [8]:
# or, as a list comprehension; casefold() makes the check case-insensitive
flist = [i for i in w if i.casefold() not in sw]
flist

['name', 'vivek', '.', 'village', 'name', 'amauni']
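The filtered list still contains the '.' token, because punctuation is not in the stop word list. A small sketch of a reusable filter that also drops punctuation-only tokens. The function name `remove_stop_words` and the tiny stop list are assumptions made for illustration; in practice you would pass `set(stopwords.words("english"))`:

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are neither stop words nor pure punctuation."""
    stop_words = {w.casefold() for w in stop_words}
    return [
        tok for tok in tokens
        # keep a token only if it is not a stop word and has a letter or digit
        if tok.casefold() not in stop_words and any(ch.isalnum() for ch in tok)
    ]


# Hypothetical mini stop list, used here so the example needs no download.
demo_stop_words = {"my", "is"}
tokens = ["My", "name", "is", "vivek", ".", "my", "village", "name", "is", "amauni"]

filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

Comparing with `casefold()` means 'My' and 'my' are treated the same way.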

In [9]:
# create a list of the stemmed versions of the words
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
x = [stemmer.stem(word=i) for i in flist]
x

['name', 'vivek', '.', 'villag', 'name', 'amauni']

In [10]:
a='Before you can stem the words in that string, you need to separate all the words in it'
words = word_tokenize(a)
words

['Before',
 'you',
 'can',
 'stem',
 'the',
 'words',
 'in',
 'that',
 'string',
 ',',
 'you',
 'need',
 'to',
 'separate',
 'all',
 'the',
 'words',
 'in',
 'it']
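The tokens above are ready to be stemmed. A short sketch of that step with `PorterStemmer`, using the token list copied from the output so that the block runs on its own:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Tokens copied from the word_tokenize output above.
words = ["Before", "you", "can", "stem", "the", "words", "in", "that", "string", ",",
         "you", "need", "to", "separate", "all", "the", "words", "in", "it"]

# Stem every token; the Porter stemmer also lowercases its input by default.
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)
```

Some stems, such as the one for 'separate', are fragments rather than complete English words.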

Understemming and overstemming are two ways stemming can go wrong:

    Understemming:
    happens when two related words should be reduced to the same stem but aren’t. This is a false negative.
    
    Overstemming:
    happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.
    
The Porter stemming algorithm dates from 1979, so it’s a little on the older side. The Snowball stemmer, which is also called Porter2, is an improvement on the original.
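Both failure modes can be seen with a few hand-picked words. This is only a rough illustration: the example words are chosen for the demo, and `SnowballStemmer("english")` is used as the Porter2 implementation.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Understemming: two forms of the same word get different stems.
print(porter.stem("discovery"), porter.stem("discovered"))

# Overstemming: words with different meanings collapse to one stem.
overstemmed = {porter.stem(w) for w in ["universe", "university", "universal"]}
print(overstemmed)

# The same word through Porter and through Snowball (Porter2).
print(porter.stem("generously"), snowball.stem("generously"))
```

Neither stemmer needs an `nltk.download()` call, so the comparison is cheap to try on your own vocabulary.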

In [11]:
# Tagging Parts of Speech

POS tagging is the task of labeling the words in your text according to their part of speech.

In English, there are eight parts of speech:

    Part of speech  Role                                                                         Examples
    Noun            Is a person, place, or thing                                                 mountain, bagel, Poland
    Pronoun         Replaces a noun                                                              you, she, we
    Adjective       Gives information about what a noun is like                                  efficient, windy, colorful
    Verb            Is an action or a state of being                                             learn, is, go
    Adverb          Gives information about a verb, an adjective, or another adverb              efficiently, always, very
    Preposition     Gives information about how a noun or pronoun is connected to another word   from, about, at
    Conjunction     Connects two other words or phrases                                          so, because, and
    Interjection    Is an exclamation                                                            yay, ow, wow

Some sources also include the category articles (like “a” or “the”) in the list of parts of speech, but other sources consider them to be adjectives. NLTK uses the word determiner to refer to articles.

In [12]:
# Now create some text to tag. You can use this Carl Sagan quote:

In [13]:
sagan_quote = """If you wish to make an apple pie from scratch, you must first invent the universe."""

In [14]:
words = word_tokenize(sagan_quote)

In [15]:
import nltk
nltk.pos_tag(words)

[('If', 'IN'),
 ('you', 'PRP'),
 ('wish', 'VBP'),
 ('to', 'TO'),
 ('make', 'VB'),
 ('an', 'DT'),
 ('apple', 'NN'),
 ('pie', 'NN'),
 ('from', 'IN'),
 ('scratch', 'NN'),
 (',', ','),
 ('you', 'PRP'),
 ('must', 'MD'),
 ('first', 'VB'),
 ('invent', 'VB'),
 ('the', 'DT'),
 ('universe', 'NN'),
 ('.', '.')]

Each word in the quote is now in its own tuple, paired with a tag that represents its part of speech.

In [16]:
# Here’s how to get a list of tags and their meanings:
# nltk.help.upenn_tagset()

    Tag    Meaning
    JJ     Adjectives
    NN     Nouns
    RB     Adverbs
    PRP    Pronouns
    VB     Verbs
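Penn Treebank tags come in families that share a prefix: NN, NNS and NNP are all nouns, and VB, VBD and VBZ are all verbs. A small sketch that folds detailed tags into the five coarse groups from the table; `coarse_category` and `COARSE_TAGS` are hypothetical names, and the tagged words are copied from the `pos_tag` output above:

```python
from collections import defaultdict

# Tag prefixes from the table above, mapped to a coarse category name.
COARSE_TAGS = {"JJ": "adjective", "NN": "noun", "RB": "adverb",
               "PRP": "pronoun", "VB": "verb"}


def coarse_category(tag):
    """Return the coarse category for a Penn Treebank tag, or 'other'."""
    for prefix, name in COARSE_TAGS.items():
        if tag.startswith(prefix):
            return name
    return "other"


# A few (word, tag) pairs copied from the tagged Carl Sagan quote.
tagged = [("you", "PRP"), ("wish", "VBP"), ("make", "VB"), ("an", "DT"),
          ("apple", "NN"), ("pie", "NN"), ("universe", "NN")]

by_category = defaultdict(list)
for word, tag in tagged:
    by_category[coarse_category(tag)].append(word)

print(dict(by_category))
```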

In [17]:
# jabberwocky_excerpt

In [18]:
jabberwocky_excerpt = """Twas brillig, and the slithy toves did gyre and gimble in the wabe:
all mimsy were the borogoves, and the mome raths outgrabe."""

In [19]:
words = word_tokenize(jabberwocky_excerpt)

In [20]:
nltk.pos_tag(words)

[('Twas', 'NNP'),
 ('brillig', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('slithy', 'JJ'),
 ('toves', 'NNS'),
 ('did', 'VBD'),
 ('gyre', 'NN'),
 ('and', 'CC'),
 ('gimble', 'JJ'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('wabe', 'NN'),
 (':', ':'),
 ('all', 'DT'),
 ('mimsy', 'NNS'),
 ('were', 'VBD'),
 ('the', 'DT'),
 ('borogoves', 'NNS'),
 (',', ','),
 ('and', 'CC'),
 ('the', 'DT'),
 ('mome', 'JJ'),
 ('raths', 'NNS'),
 ('outgrabe', 'RB'),
 ('.', '.')]

In [21]:
# help(nltk.pos_tag)
from nltk import pos_tag
pos_tag(word_tokenize("My name is Vivekanand"))

[('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ('Vivekanand', 'NNP')]

### Lemmatizing

Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.

A lemma is a word that represents a whole group of words, and that group of words is called a lexeme.

In [22]:
import nltk
# nltk.download('wordnet')

In [23]:
from nltk.stem import WordNetLemmatizer
lemma = WordNetLemmatizer()
# lemma.lemmatize("scarves")  # >> 'scarf'

In [24]:
mystr = "The friends of DeSoto love scarves."
words = word_tokenize(mystr)
words

['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']

In [25]:
# lemmatized= [lemma.lemmatize(i) for i in words]

In [26]:
# lemma.lemmatize("worst")  # >>'worst'

You got the result 'worst' because lemma.lemmatize() assumed that "worst" was a noun. You can make it clear that you want "worst" to be an adjective:

In [27]:
# lemma.lemmatize("worst", pos="a")  # >> 'bad'; pos defaults to 'n' (noun)
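Since the lemmatizer depends on the `pos` argument, one common pattern is to derive it from the Penn Treebank tag that `pos_tag` returns. A minimal sketch of that mapping; `penn_to_wordnet_pos` is a hypothetical helper, not part of NLTK:

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to the one-letter POS that lemmatize() accepts."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


# Intended use, once the WordNet data is available:
#   lemma.lemmatize(word, pos=penn_to_wordnet_pos(tag))
for tag in ["JJS", "VBD", "RB", "NNS", "DT"]:
    print(tag, "->", penn_to_wordnet_pos(tag))
```

Tags outside these four families fall back to 'n', which matches what the lemmatizer does when `pos` is omitted.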

#### Chunking

While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.

A phrase is a word or group of words that works as a single unit to perform a grammatical function. Noun phrases are built around a noun.

Here are some examples:

    “A planet”
    “A tilting planet”
    “A swiftly tilting planet”
    
- Chunking makes use of POS tags to group words.
- Chunks don’t overlap, so one instance of a word can be in only one chunk at a time.
- Before you can chunk, you need POS-tagged text, so start by creating a string to tag.

In [28]:
# Here is a quote from The Lord of the Rings
x = "It's a dangerous business, Frodo, going out your door."

In [29]:
# 1.tokenize
tokenized = word_tokenize(x)
tokenized

['It',
 "'s",
 'a',
 'dangerous',
 'business',
 ',',
 'Frodo',
 ',',
 'going',
 'out',
 'your',
 'door',
 '.']

In [30]:
# 2. pos tag
# nltk.download("averaged_perceptron_tagger")
pos_tags = nltk.pos_tag(tokenized)
pos_tags

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]

 In order to chunk, you first need to define a chunk grammar.
 
 A chunk grammar is a combination of rules on how sentences should be chunked. It often uses regular expressions, or regexes.

In [31]:
# Create a chunk grammar with one regular expression rule:

grammar = "NP: {<DT>?<JJ>*<NN>}"

NP stands for noun phrase.

According to the rule you created, your chunks:

    Start with an optional (?) determiner (<DT>)
    Can have any number (*) of adjectives (<JJ>)
    End with a noun (<NN>)

In [32]:
# Create a chunk parser with this grammar:
chunk_parser = nltk.RegexpParser(grammar)

In [33]:
# Now apply this to your pos_tags
tree = chunk_parser.parse(pos_tags)

In [35]:
# Here’s how you can see a visual representation of this tree:

# tree.draw()

You got two noun phrases:

    'a dangerous business' has a determiner, an adjective, and a noun.
    'door' has just a noun.
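If drawing the tree is not convenient, the noun phrases can also be read off as plain strings. A sketch that rebuilds the parse from the tagged tokens shown earlier (copied in so the block runs on its own) and collects every subtree labeled NP:

```python
import nltk

# (word, tag) pairs copied from the pos_tag output above.
pos_tags = [("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
            ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
            ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
            (".", ".")]

chunk_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = chunk_parser.parse(pos_tags)

# Walk the tree, keep only NP subtrees, and join their words.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

'Frodo' is not chunked because its tag is NNP, and the rule only accepts NN as the final tag.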

In [None]:
# after chunking it’s time to look at chinking.

Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is used to exclude a pattern.

We already have pos_tags from the chunking example:

In [36]:
pos_tags

[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]

The next step is to create a grammar to determine what you want to include and exclude in your chunks.

In [57]:
grammar = """Chunk: {<.*>+}
}<JJ>{"""

The first rule of your grammar is {<.*>+}. Its curly braces face inward ({}), which marks the patterns to include in your chunks. In this case, you want to include everything: <.*>+.

The second rule of your grammar is }<JJ>{. Its curly braces face outward (}{), which marks the patterns to exclude from your chunks, here adjectives: <JJ>.

In [58]:
# create a parser with this grammar
parser=nltk.RegexpParser(grammar)

In [61]:
import warnings
warnings.filterwarnings('ignore')

In [62]:
tree = parser.parse(pos_tags)
tree

Tree('S', [Tree('Chunk', [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT')]), ('dangerous', 'JJ'), Tree('Chunk', [('business', 'NN'), (',', ','), ('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'), ('.', '.')])])
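Rendering a tree as an image inside a notebook can depend on external tools in some NLTK setups, whereas `print(tree)` only produces text. A sketch of the chinking example that prints the tree and lists the words of each chunk, with the tagged tokens copied in so the block runs on its own:

```python
import nltk

# (word, tag) pairs copied from the pos_tags output above.
pos_tags = [("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
            ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
            ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
            (".", ".")]

grammar = """
Chunk: {<.*>+}
       }<JJ>{"""

parser = nltk.RegexpParser(grammar)
tree = parser.parse(pos_tags)

# Plain-text rendering of the tree.
print(tree)

# Words of every subtree labeled "Chunk", in order.
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree.subtrees(filter=lambda t: t.label() == "Chunk")
]
print(chunks)
```

The adjective 'dangerous' sits between the two chunks: the chink rule removed it and split the original single chunk in two.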

### Using Named Entity Recognition (NER)

Named entities are noun phrases that refer to specific locations, people, organizations, and so on.

In [63]:
# Here’s the list of named entity types from the NLTK book:


    NE type          Examples
    ORGANIZATION     Georgia-Pacific Corp., WHO
    PERSON           Eddy Bonte, President Obama
    LOCATION         Murray River, Mount Everest
    DATE             June, 2008-06-29
    TIME             two fifty a m, 1:30 p.m.
    MONEY            175 million Canadian dollars, GBP 10.40
    PERCENT          Twenty pct, 18.75 %
    FACILITY         Washington Monument, Stonehenge
    GPE              South East Asia, Midlothian
    
You can use nltk.ne_chunk() to recognize named entities. Use the parameter binary=True if you just want to know what the named entities are but not what kind of named entity they are:

In [65]:
nltk.download("maxent_ne_chunker")
nltk.download("words")
tree = nltk.ne_chunk(pos_tags)
# tree.draw()


In [None]:
# tree = nltk.ne_chunk(pos_tags, binary=True)

#### Now create a function to extract named entities:

In [None]:
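def extract_ne(quote, language="english"):
    # language is passed through to word_tokenize
    words = word_tokenize(quote, language=language)
    tags = nltk.pos_tag(words)
    tree = nltk.ne_chunk(tags, binary=True)
    # keep only the subtrees labeled "NE" and join their words into one string
    return set(" ".join(i[0] for i in t) for t in tree if hasattr(t, "label") and t.label() == "NE")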

With this function, you gather all named entities, with no repeats, because the result is a set.
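The extraction step can be tried without the tagger or chunker models by running it on a hand-built tree. In this sketch the tree is made up for illustration (it is not real `ne_chunk` output) and `entities_from_tree` is a hypothetical helper that mirrors the set expression above:

```python
from nltk.tree import Tree


def entities_from_tree(tree, label="NE"):
    """Collect the text of every top-level subtree that carries `label`."""
    return {
        " ".join(word for word, tag in subtree)
        for subtree in tree
        # plain (word, tag) tuples have no label(), so they are skipped
        if hasattr(subtree, "label") and subtree.label() == label
    }


# Hand-built stand-in for the result of nltk.ne_chunk(tags, binary=True).
sample_tree = Tree("S", [
    Tree("NE", [("Frodo", "NNP")]),
    ("left", "VBD"),
    Tree("NE", [("Bag", "NNP"), ("End", "NNP")]),
    (".", "."),
    Tree("NE", [("Frodo", "NNP")]),
    ("returned", "VBD"),
    (".", "."),
])

entities = entities_from_tree(sample_tree)
print(entities)
```

'Frodo' appears twice in the tree but only once in the result, which is the "no repeats" behavior of the set.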

#### Getting Text to Analyze

In [68]:
# Using a Concordance

When you use a concordance, you can see each time a word is used, along with its immediate context.
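To make the idea concrete, here is a plain-Python sketch of a keyword-in-context lookup. It is not NLTK's implementation; `keyword_in_context` and its `window` parameter are hypothetical names chosen for this example:

```python
def keyword_in_context(tokens, keyword, window=3):
    """Return (left, match, right) for every occurrence of keyword in tokens."""
    target = keyword.casefold()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.casefold() == target:
            # up to `window` tokens on each side of the match
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


tokens = "seeking honest man i am 41 and a good man is hard to find".split()
hits = keyword_in_context(tokens, "man", window=2)
for left, match, right in hits:
    print(f"{left:>15} [{match}] {right}")
```

Each hit keeps the matched word together with a few tokens on either side, which is the "immediate context" a concordance shows.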

In [72]:
text="""to hearing from you all . able young man seeks , sexy older women . phone for\nble relationship . 
genuine attractive man 40 y . o ., no ties , secure , 5 ft .\nship , and quality times . 
vietnamese man single , never married , financially\nip . well dressed emotionally healthy man 37 like to meet 
full figured woman fo\n nth subs like to be mistress of your man like to be treated well . bold dte no\neeks 
lady in similar position married man 50 , attrac . fit , seeks lady 40 - 5\neks nice girl 25 - 30 serious rship . 
man 46 attractive fit , assertive , and k\n 40 - 50 sought by aussie mid 40s b / man f / ship r / ship love to 
meet widowe\ndiscreet times . sth e subs . married man 42yo 6ft , fit , seeks lady for discr\nwoman , seeks professional , 
employed man , with interests in theatre , dining\n tall and of large build seeks a good man . 
i am a nonsmoker , social drinker ,\nlead to relationship . 
seeking honest man i am 41 y . o ., 5 ft . 4 , med . bui\n quiet times . seeks 35 - 45 , honest man with good 
soh & similar interests , f\n genuine , caring , honest and normal man for fship , poss rship . s / s , s /"""

In [73]:
# concordance() is a method of nltk.Text, not str, so wrap the tokens in a Text first
nltk.Text(text.split()).concordance("man")