# Contents:

- Natural Language Toolkit (NLTK)
- Text Preprocessing
    - Noise Removal
        - Language stopwords 
        - URLs or links
        - Social media entities 
        - Punctuations 
    - Lexicon Normalization
        - Stemming
        - Lemmatization
- Part of Speech Tagging
- Phrase Detection
    - chunking
    - chinking 
- Some useful functions 

# Natural Language Toolkit (NLTK) module with Python

The NLTK module is a large toolkit that supports the entire Natural Language Processing (NLP) workflow. It is an open-source library for Python. NLTK helps with everything from splitting paragraphs into sentences and sentences into words, to recognizing the part of speech of those words, highlighting the main subjects, and even helping your machine understand what the text is about.

#### Advantages of NLTK
- Has support for most NLP tasks
- Provides access to numerous text corpora


Let us explain some terms:
- Tokenization – process of converting a text into tokens
- Tokens – words or sentences or entities present in the text
- Lexicon – Words and their meanings. Example: English dictionary. Consider, however, that various fields will have different lexicons.
- Corpus – Body of text, singular. Corpora is the plural of this. Example: A collection of medical journals.
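To make these terms concrete, here is a minimal sketch of tokenization that uses only Python's built-in `re` module. The two patterns are simplifying assumptions chosen for illustration, not what NLTK does internally; NLTK's tokenizers handle abbreviations and contractions far more carefully.

```python
import re

text = "We cannot see the enemy army. Nothing else to report."

# Sentence tokens: split after ., ! or ? when followed by whitespace (naive rule)
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokens: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", sentences[0])

print(sentences)
print(words)
```

Here the corpus is the single string `text`, and each element of `sentences` or `words` is a token.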

# Install NLTK

In [2]:
!pip install nltk



In [3]:
import nltk  
#nltk.download()

# Text Preprocessing

Since text is the most unstructured form of all available data, it contains various types of noise and is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It comprises two main steps:

- <b>Noise Removal</b>
    - Language stopwords (commonly used words of a language – is, am, the, of, in etc) 
    - URLs or links
    - Social media entities (mentions, hashtags)
    - Punctuations 
    - Industry specific words 
<BR> <BR>    
- <b>Lexicon Normalization</b>
     - Stemming
     - Lemmatization

Let's look at an example of how to split text into tokens with the NLTK module.

In [4]:
from nltk.tokenize import sent_tokenize, word_tokenize

#example_sent = "Paragraphs can contain many different kinds of information. A paragraph could contain a series of brief examples or a single long illustration of a general point. It might describe a place, character, or process; narrate a series of events; compare or contrast two or more things; classify items into categories; or describe causes and effects. Regardless of the kind of information they contain, all paragraphs share certain characteristics. One of the most important of these is a topic sentence."
example_sent ="We are attacking on their left flank but are losing many men. We cannot see the enemy army. Nothing else to report. We are ready to attack but are waiting for your orders."
#example_sent = "Here are some very simple basic sentences. They won't be very interesting, I'm afraid. The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."

sen_tokens = sent_tokenize(example_sent)
print(sen_tokens) #sentence tokenization #each sentence is a token
word_tokens = word_tokenize(example_sent)
print(word_tokens) #word tokenization #each word is a token

['We are attacking on their left flank but are losing many men.', 'We cannot see the enemy army.', 'Nothing else to report.', 'We are ready to attack but are waiting for your orders.']
['We', 'are', 'attacking', 'on', 'their', 'left', 'flank', 'but', 'are', 'losing', 'many', 'men', '.', 'We', 'can', 'not', 'see', 'the', 'enemy', 'army', '.', 'Nothing', 'else', 'to', 'report', '.', 'We', 'are', 'ready', 'to', 'attack', 'but', 'are', 'waiting', 'for', 'your', 'orders', '.']


In [5]:
raw_docs = ["Here are some very simple basic sentences.",
"They won't be very interesting, I'm afraid.",
"The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."]

tokenized_docs = [word_tokenize(doc) for doc in raw_docs] # tokenize every document in the corpus with a list comprehension

# Noise Removal

Any piece of text that is not relevant to the context of the data and the end output can be considered noise.

This step deals with removal of all types of noisy entities present in the text.

A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate the text object by tokens (or by words), eliminating those tokens which are present in the noise dictionary.
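The approach described above can be sketched in a few lines of plain Python. The noise list and the function name `remove_noise` are hypothetical and exist only for this illustration; a real noise dictionary would be built from stopwords and domain-specific terms.

```python
# Hypothetical noise dictionary, made up for this sketch
noise_words = {"is", "a", "this", "the", "rt"}

def remove_noise(text, noise=noise_words):
    """Drop every whitespace-separated token that appears in the noise set."""
    tokens = text.split()
    kept = [t for t in tokens if t.lower() not in noise]
    return " ".join(kept)

cleaned = remove_noise("RT this is a sample text")
print(cleaned)
```

Using a `set` for the noise dictionary keeps each membership test cheap, which matters once the text grows to thousands of tokens.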

In [6]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) #assign language that we want to use for stop words

In [7]:
# remove the stopwords from the word-tokenized text
words_filtered = [w for w in word_tokens if w not in stop_words]
print(words_filtered)
len(words_filtered)

['We', 'attacking', 'left', 'flank', 'losing', 'many', 'men', '.', 'We', 'see', 'enemy', 'army', '.', 'Nothing', 'else', 'report', '.', 'We', 'ready', 'attack', 'waiting', 'orders', '.']


23

In [8]:
# keep alphabetic tokens only (drops punctuation) and lowercase them
words_filtered2 = [w.lower() for w in words_filtered if w.isalpha()]
print(words_filtered2)
len(words_filtered2)

['we', 'attacking', 'left', 'flank', 'losing', 'many', 'men', 'we', 'see', 'enemy', 'army', 'nothing', 'else', 'report', 'we', 'ready', 'attack', 'waiting', 'orders']


19

In [10]:
# remove URLs or links: this needs regular expressions, which are introduced next


### Regular Expressions

A regular expression is a sequence of characters mainly used to find and replace patterns in a string or file. Regular expressions are supported by most programming languages, including Python, Perl, R, Java and many others.

Regular expressions (<a href="https://docs.python.org/3/library/re.html" target="_blank" rel="noopener nofollow">Regular expressions in Python 3</a>) use two types of characters:

- <font color=green><b>a) Meta characters</b></font>: these characters have a special meaning. Here’s a complete list of them:
<font color=red><b>. ^ $ * + ? { } [ ] | ( ) \ </b></font>
    - <b>Character matches:</b>
    
        <font color=red><b>.</b></font> :       Matches with any single character except newline
        
        <font color=red><b>^</b></font> :       Matches the start of the string
        
      <font color=red><b>$</b></font> :       Matches the end of the string
      
      <font color=red><b>[ ]</b></font> :    Matches any single character listed inside the square brackets
      
      <b>[a-z]</b> :    Matches any one character in the range a, b, ..., z
      
      <b>[^abc]</b> : Matches any character that is not a, b or c
      
      <b>a<font color=red>|</font>b</b> :     Matches either a or b, where a and b are patterns
      
      <font color=red><b>( )</b></font> :     Groups part of a regular expression and captures the matched text
      
      <font color=red><b>\</b></font> :       Escapes a meta character or introduces a special sequence such as \d
    <br><br>
    - <b> Characters symbols</b>
    
        <font color=red><b> \b </b></font>: Matches word boundary
        
        <font color=red><b> \d </b></font>: Any digit, equivalent to [0-9]
        
        <font color=red><b> \D </b></font>: Any non-digit, equivalent to [^0-9]
        
        <font color=red><b> \s </b></font>: Any whitespace, equivalent to [ \t\n\r\f\v]
        
        <font color=red><b> \S </b></font>: Any non-whitespace, equivalent to [^ \t\n\r\f\v]
        
        <font color=red><b> \w </b></font>: Alphanumeric character or underscore, equivalent to [a-zA-Z0-9_]
        
        <font color=red><b> \W </b></font>: Any other character, equivalent to [^a-zA-Z0-9_]
<br><br>
    - <b> Repetitions:</b>

        <font color=red><b>*</b></font> :       0 or more occurrences
        
        <font color=red><b>+</b></font> :       1 or more occurrences 
          
        <font color=red><b>?</b></font> :       0 or 1 occurrence
        
        <font color=red><b>{</b></font><b>n</b><font color=red><b>}</b></font> :       Exactly n repetitions, n>=0
        
        <font color=red><b>{</b></font><b>n,</b><font color=red><b>}</b></font> :       At least n repetitions
        
        <font color=red><b>{</b></font><b>,n</b><font color=red><b>}</b></font> :       At most n repetitions
    

- <font color=green><b>b) Literals </b></font>(like a,b,1,2…)
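As a quick illustration of how meta characters, character symbols and repetitions combine, here is a small sketch. The sample string is made up for this example.

```python
import re

sample = "Order 66 shipped on 2024-01-05 to box_7, ref A1b2."

# \d+ : one or more digits in a row
digits = re.findall(r"\d+", sample)

# \b[A-Za-z]+\b : whole words made of letters only
letter_words = re.findall(r"\b[A-Za-z]+\b", sample)

# {n} repetitions and ( ) groups: capture year, month and day
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", sample)

print(digits)
print(letter_words)
print(date.groups())
```

Note that `box_7` and `A1b2` are absent from `letter_words`: digits and the underscore count as word characters, so there is no word boundary right after their letters.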

In Python, the “re” module provides support for regular expressions, so you need to import it before you can use them.

In [10]:
import re  # module for regular expressions

The most common uses of regular expressions are:
- <font color=blue>Searching a string </font>(search and match)
- <font color=blue>Finding all matches in a string </font>(findall)
- <font color=blue>Breaking a string into substrings</font> (split)
- <font color=blue>Replacing part of a string </font>(sub)

The ‘re’ package provides the following commonly used functions for querying an input string:

- <font color=green><b>re.match(pattern, string)</b></font>: 
This function finds a match only if it occurs at the start of the string.

- <font color=green><b>re.search(pattern, string)</b></font>: 
It is similar to match(), but it does not restrict the match to the beginning of the string.
search() can find a pattern at any position in the string, but it returns only the first occurrence.

- <font color=green><b>re.findall(pattern, string)</b></font>: 
This function returns a list of all non-overlapping matches of the pattern. The matches may occur anywhere in the string.

- <font color=green><b>re.split(pattern, string, maxsplit=0)</b></font>: 
This function splits the string at each occurrence of the given pattern; a non-zero maxsplit limits the number of splits.

- <font color=green><b>re.sub(pattern, repl, string)</b></font>:
It searches for the pattern and replaces each match with the replacement string repl. If the pattern is not found, the string is returned unchanged.

In [11]:
result = re.match(r'AV', 'AV Analytics Vidhya AV') # match 'AV' at the start of the string
result

<_sre.SRE_Match object; span=(0, 2), match='AV'>

In [13]:
#match() #find string 
value="voorheesville"
m = re.match(r"voo", value)
m

<_sre.SRE_Match object; span=(0, 3), match='voo'>

In [14]:
#match()
value="voorheesville"
m = re.match(r"vo*", value) # * matches zero or more of the preceding character ('o')
m

<_sre.SRE_Match object; span=(0, 3), match='voo'>

In [15]:
#search() # can find string in any position in a sentence
result = re.search(r'Analytics', 'AV Analytics Vidhya AV')
result

<_sre.SRE_Match object; span=(3, 12), match='Analytics'>

In [16]:
#findall()
result = re.findall(r'AV', 'AV Analytics Vidhya AV')
result

['AV', 'AV']

In [17]:
#findall()
result = re.findall(r'(An.*)', 'AV Analytics Vidhya AV')
result

['Analytics Vidhya AV']

## Exercise:

- Find all words that start with 'd' or 'p' in the text below.

   text="The local part of an email address has no significance for intermediate mail relay systems other than the final mailbox host. Email senders and intermediate relay systems must not assume it to be case-insensitive, since the final mailbox host may or may not treat it as such. A single mailbox may receive mail for multiple email addresses, if configured by the administrator. Conversely, a single email address may be the alias to a distribution list to many mailboxes"

In [29]:
# your code here
text="The local part of an email address has no significance for intermediate mail relay systems other than the final mailbox host. Email senders and intermediate relay systems must not assume it to be case-insensitive, since the final mailbox host may or may not treat it as such. A single mailbox may receive mail for multiple email addresses, if configured by the administrator. Conversely, a single email address may be the alias to a distribution list to many mailboxes"
result = re.findall(r'\b[dp]\w*', text)  # \b anchors the match at the start of a word
result


['part', 'distribution']

In [30]:

#split()
result=re.split(r'y','Analytics')
result

['Anal', 'tics']

In [31]:
#split()
result=re.split(r'y','AV Analytics Vidhya AV')
result

['AV Anal', 'tics Vidh', 'a AV']

In [32]:
#split()
result=re.split(r'y','AV Analytics Vidhya AV',maxsplit=1) # split at most once
result

['AV Anal', 'tics Vidhya AV']

In [40]:
#split()
# Split on runs of one or more non-digit characters in the following text.

value = "one 1 two 2 three 3"

result=re.split(r'\D+', value)  # the leading '' appears because the string starts with non-digits
result

['', '1', '2', '3']

In [41]:
#findall()
# find all words that contain 'E' in the tweet below

tweet2 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'
result = re.findall(r'\w*[E]\w+', tweet2)
result

['Ethics', 'Ethical']

In [43]:

text1 = tweet2.split()  # split the tweet on whitespace

text2 = [w for w in text1 if re.findall(r'\w*[E]\w+', w)]  # keep the tokens that contain a match
text2

['"Ethics', 'Ethical']

In [35]:
#sub()
result=re.sub(r'India','the World','AV is largest Analytics community of India')
result

'AV is largest Analytics community of the World'

#### Two useful methods

- <font color=green><b>join()</b></font>: The method returns a string in which the string elements of sequence have been joined by str separator.

- <font color=green><b>strip()</b></font>: The method returns a copy of the string in which all chars have been stripped from the beginning and the end of the string (default whitespace characters).

In [25]:
# join() 
tweet2 = 'AV is largest Analytics community of India'
text1 = tweet2.split(' ')
print(text1)
text3= ' '.join(text1)
text3

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


'AV is largest Analytics community of India'

In [26]:
# strip()  
text8 ='   A quick brown  fox  jumped over the lazy dog.  \n '
text9=text8.strip() 

print(text9)

print(re.split(' ',text9))    

A quick brown  fox  jumped over the lazy dog.
['A', 'quick', 'brown', '', 'fox', '', 'jumped', 'over', 'the', 'lazy', 'dog.']


In [27]:
# remove URLs or links

tweet = 'This is a tweet with a url:http://t.co/0DlGChTBIx and there is no any url'
tweet_clean = re.sub(r"http\S+", "", tweet)
tweet_clean

'This is a tweet with a url: and there is no any url'
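The same `re.sub` idea extends to the other noise types listed at the start of this notebook, namely social media entities and punctuation. The following is a rough sketch under simple assumptions: the function name `clean_tweet` and its patterns are illustrative, and URLs are only recognized when they start with `http://`, `https://` or `www.`.

```python
import re

def clean_tweet(text):
    """Strip URLs, mentions, hashtags and punctuation from a tweet."""
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)  # URLs
    text = re.sub(r"[@#]\w+", "", text)                 # mentions and hashtags
    text = re.sub(r"[^\w\s]", "", text)                 # remaining punctuation
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover spaces

cleaned_tweet = clean_tweet("@nltk Text analysis is awesome! #regex http://t.co/0DlGChTBIx")
print(cleaned_tweet)
```

The order of the substitutions matters: URLs are removed first so that the punctuation step does not break them into fragments that would survive the cleaning.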

In [28]:
# What does this code do?
items = ["123", "4cat", "dog5", "6mouse", "mouse"]  # renamed from `list` to avoid shadowing the built-in
for w in items:

    m = re.match(r"^\d", w)
    if m:
        print("START:", w)

    m = re.match(r".*\d$", w)
    if m:
        print("  END:", w) 
    print("\n")

START: 123
  END: 123


START: 4cat


  END: dog5


START: 6mouse






## Exercise:

Write code to extract:
- 1) hashtags from the following tweet.

    tweet1 = "@nltk Text analysis is awesome! #regex #pandas #python"
<br><br>    
    
- 2) callouts from given tweet.

    tweet2 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations"'
<br><br>      
- 3) the words which start with 'vi' without using ^ from the text below.

    text = ' visit123 "Ethics are21 view right into21 the via ideals and objectives of the United Nations" \ #UNSG @21 NY23 Society for134  a14 Ethical43 view23 vital'
 <br><br>     
- 4) all words except those including 'dog' from given text.

    text = '100cat 223cat 534dog 400cat 500car 345dog 847bar'
<br><br>
- 5) email addresses from the following text.

    text='John.Smith@example.com Ethics are built right into the ideals and objectives of the United Nations" \ #UNSG @21 NY Society for Ethical Culture local-part@domain.org'


In [44]:
# your code here - Solution 1
tweet1 = "@nltk Text analysis is awesome! #regex #pandas #python"
result1 = re.findall(r'\w*[#]\w+', tweet1)
result1

['#regex', '#pandas', '#python']

In [46]:
# your code here - Solution 2
tweet2 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations"'
result2 = re.findall(r'\w*[@]\w+', tweet2)
result2

['@UN', '@UN_Women']

In [31]:
# your code here - Solution 3

In [32]:
# your code here - Solution 4

In [33]:
# your code here - Solution 5
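One possible set of answers to exercises 3 to 5 is sketched below. These are not the only valid patterns; the variable names are made up for this sketch, and the email pattern in particular is a deliberately simple assumption that does not cover every valid address.

```python
import re

# 3) words starting with 'vi': \b marks a word boundary, so ^ is not needed
text3 = r' visit123 "Ethics are21 view right into21 the via ideals and objectives of the United Nations" \ #UNSG @21 NY23 Society for134  a14 Ethical43 view23 vital'
vi_words = re.findall(r'\bvi\w*', text3)

# 4) every word except those containing 'dog'
text4 = '100cat 223cat 534dog 400cat 500car 345dog 847bar'
no_dog = [w for w in text4.split() if 'dog' not in w]

# 5) email addresses: local part, '@', then a dotted domain name
text5 = r'John.Smith@example.com Ethics are built right into the ideals and objectives of the United Nations" \ #UNSG @21 NY Society for Ethical Culture local-part@domain.org'
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text5)

print(vi_words)
print(no_dog)
print(emails)
```

The texts are written as raw strings (`r'...'`) so that the stray backslash they contain is kept literally.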

# Lexicon Normalization

Another type of textual noise comes from the multiple representations a single word can take.

For example, “play”, “player”, “played”, “plays” and “playing” are all variations of the word “play”. Although their meanings differ, they are contextually similar. This step converts all the variants of a word into their normalized form (also known as the lemma).

The most common lexicon normalization practices are:

- <b> Stemming</b>:
    - a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s” etc) from a word
    
    
- <b>Lemmatization</b>: 
    - an organized & step by step procedure of obtaining the root form of the word 
    - it makes use of: 
        - vocabulary (dictionary importance of words) 
        - morphological analysis (word structure and grammar relations)
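To see why stemming is called rule-based, here is a toy suffix stripper. It is a deliberately naive sketch and not the Porter algorithm; the function name `naive_stem`, the suffix list and the `min_stem` threshold are assumptions made for this illustration.

```python
SUFFIXES = ("ing", "ly", "es", "s")  # checked in this order

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["playing", "plays", "quickly", "boxes", "leaves", "is"]]
print(stems)
```

The result for "leaves" is "leav", which is not a dictionary word. That is the gap lemmatization fills: it consults a vocabulary and the word's morphology instead of blindly cutting suffixes.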

In [47]:
#Stemming
from nltk.stem import PorterStemmer
#from nltk.tokenize import sent_tokenize, word_tokenize

new_text="the little yellow dog barked at the cat. thinking,plays,cats, leaves"
words = word_tokenize(new_text)

ps = PorterStemmer()
for w in words:
    print(ps.stem(w)) # print the stem of each token

the
littl
yellow
dog
bark
at
the
cat
.
think
,
play
,
cat
,
leav


In [48]:
#Stemming
example_words = ["python","pythoner","pythoning","pythoned","pythonly", 'eventuellement', 'pythonic']
for w in example_words:
    print(ps.stem(w))

python
python
python
python
pythonli
eventuel
python


In [49]:
#Lemmatization
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words =("cats","geese","rocks","lives","leaves","worse","running")
for w in words:
    print(lemmatizer.lemmatize(w)) # default pos is noun ('n'), so "worse" and "running" stay unchanged

cat
goose
rock
life
leaf
worse
running


In [51]:
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("worse", pos="a"))
print(lemmatizer.lemmatize("running"))
print(lemmatizer.lemmatize("running",'v'))

good
best
bad
running
run


# Part of speech tagging

Apart from grammar relations, every word in a sentence is also associated with a part of speech (pos) tag (noun, verb, adjective, adverb etc.). The pos tags define the usage and function of a word in the sentence. Recall from your high school grammar that parts of speech are word classes such as nouns, verbs, adjectives and adverbs.

Here is a <b>list of pos-tags</b>, what they mean, and some examples:(<font color=blue>Tag</font>, Word class, <font color=green>Example</font>)

<font color=blue>CC</font>	coordinating conjunction
<font color=blue>CD</font>	cardinal number
<font color=blue>DT</font>	determiner
<font color=blue>EX	</font>existential there (like: <font color=green>"there is"</font> ... think of it like <font color=green>"there exists"</font>)
<font color=blue>FW	</font>foreign word
<font color=blue>IN	</font>preposition/subordinating conjunction
<font color=blue>JJ	</font>adjective	<font color=green>'big'</font>
<font color=blue>JJR	</font>adjective, comparative	<font color=green>'bigger'</font>
<font color=blue>JJS	</font>adjective, superlative	<font color=green>'biggest'</font>
<font color=blue>LS	</font>list marker	<font color=green>1)</font>
<font color=blue>MD	</font>modal	<font color=green>could, will</font>
<font color=blue>NN	</font>noun, singular <font color=green>'desk'</font>
<font color=blue>NNS	</font>noun plural	<font color=green>'desks'</font>
<font color=blue>NNP	</font>proper noun, singular	<font color=green>'Harrison'</font>
<font color=blue>NNPS	</font>proper noun, plural	<font color=green>'Americans'</font>
<font color=blue>PDT	</font>predeterminer	<font color=green>'all the kids'</font>
<font color=blue>POS	</font>possessive ending	parent<font color=green>'s</font>
<font color=blue>PRP	</font>personal pronoun	<font color=green>I, he, she</font>
<font color=blue>PRP$	</font>possessive pronoun	<font color=green>my, his, hers</font>
<font color=blue> RB	</font>adverb	<font color=green>very, silently</font>
<font color=blue> RBR</font>	adverb, comparative	<font color=green>better</font>
<font color=blue>RBS</font>	adverb, superlative	<font color=green>best</font>
<font color=blue>RP	</font>particle<font color=green>	give up</font>
<font color=blue>TO	</font>to	go <font color=green>'to' </font>the store.
<font color=blue>UH	</font>interjection	<font color=green>errrrrrrrm</font>
<font color=blue>VB	</font>verb, base form	<font color=green>take</font>
<font color=blue>VBD	</font>verb, past tense	<font color=green>took</font>
<font color=blue>VBG</font>	verb, gerund/present participle	<font color=green>taking</font>
<font color=blue>VBN	</font>verb, past participle	<font color=green>taken</font>
<font color=blue>VBP	</font>verb, present tense, non-3rd person singular	<font color=green>take</font>
<font color=blue>VBZ	</font>verb, 3rd person sing. present	<font color=green>takes</font>
<font color=blue>WDT	</font>wh-determiner	<font color=green>which</font>
<font color=blue>WP	</font>wh-pronoun	<font color=green>who, what</font>
<font color=blue>WP$	</font>possessive wh-pronoun	<font color=green>whose</font>
<font color=blue>WRB	</font>wh-adverb	<font color=green>where, when</font>

In [40]:
#Part of speech tagging

sample_text="the little yellow dog barked at the cat"
words = word_tokenize(sample_text)

tagged = nltk.pos_tag(words)  # tag the whole sentence once; no loop is needed
print(tagged)


[('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'), ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]
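These tags can also feed the lemmatizer's `pos` argument, which expects WordNet-style letters ('n', 'v', 'a', 'r') rather than Penn Treebank tags. The helper below is a common convention sketched for illustration; the name `penn_to_wordnet` is hypothetical, and the tagged list is copied from the output above so that the block runs without NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet pos letter, defaulting to noun."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # nouns and everything else

tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_tags)
```

With NLTK loaded, each pair could then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.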


#  Phrase Detection

Phrase detection means finding phrases in a corpus. One way to detect phrases is to apply 'chunking' and 'chinking'.

<b>chunking</b>: 
   - Group words into hopefully meaningful chunks. 
   - One of the main goals of chunking is to group into what are known as "noun phrases."
   - These are phrases of one or more words that contain:
        - a noun 
        - maybe some descriptive words
        - maybe a verb 
        - maybe something like an adverb 
        
   - The rules that make up a chunk grammar use <b>tag patterns</b> to describe sequences of tagged words. 
       -  Tag pattern is a sequence of part-of-speech tags delimited using angle brackets

 In order to chunk, we combine  <b>the part of speech tags </b> with <b> regular expressions </b>.

<b>chinking</b>:
   - the process of removing a sequence of tokens from a chunk
   - denote the chink, after the chunk, with }{ 
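To show what combining part of speech tags with regular expressions means, the sketch below applies an ordinary regular expression to a string of tags using only the `re` module. It is a simplified illustration of the idea, not NLTK's actual implementation, and the tagged sentence is hard-coded.

```python
import re

tagged = [('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
          ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

# Write the tag sequence as one string: "<DT><JJ><JJ><NN><VBD><IN><DT><NN>"
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Tag pattern: optional determiner, any number of adjectives, then a noun
pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for m in pattern.finditer(tag_string):
    start = tag_string[:m.start()].count("<")  # index of the first word in the chunk
    length = m.group().count("<")              # number of words in the chunk
    chunks.append([word for word, _ in tagged[start:start + length]])

print(chunks)
```

Each match over the tag string is mapped back to the words it covers, which yields the two noun phrases of the sentence.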

In [41]:
#chunking
import nltk
sample_text="the little yellow dog barked at the cat"
words = word_tokenize(sample_text)

tagged = nltk.pos_tag(words)  # tag the whole sentence once; no loop is needed
chunkGram = "Chunk: {<DT>?<JJ>*<NN>}"
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)

(S
  (Chunk the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (Chunk the/DT cat/NN))


In [52]:
#chunking
sample_text="Rapunzel let down her long golden hair"
words = word_tokenize(sample_text)

tagged = nltk.pos_tag(words)  # tag the whole sentence once; no loop is needed
    
chunkGram = r"""
  NP: {<DT>?<JJ>*<NN>} 
      {<NNP>}"""
#chunk can use more than one rule   
  
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
chunked.draw()

(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  her/PRP$
  (NP long/JJ golden/JJ hair/NN))


In [None]:
#chinking
sample_text="the little yellow dog barked at the cat"
words = word_tokenize(sample_text)

tagged = nltk.pos_tag(words)  # tag the whole sentence once; no loop is needed
chunkGram = r"""Chunk: {<DT>?<JJ>*<NN>}
                        }<DT>{ """
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
chunked.draw()

In [None]:
#chinking
sample_text="the little yellow dog barked at the cat"
words = word_tokenize(sample_text)

tagged = nltk.pos_tag(words)  # tag the whole sentence once; no loop is needed
chunkGram = r"""Chunk: {<.*>+}  
                        }<VB.?|IN|DT|TO>+{ """
#every word is accepted into chunk
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
print(chunked)
chunked.draw()

# Some useful functions 

### Finding specific words
- Long words: words that are more than 3 letters long
- Capitalized words: words that start with a capital letter A-Z
- Words that end / start with specific letter

In [None]:
text1 ="We are attacking on their left flank but are losing many men. We cannot see the enemy army. Nothing else to report. We are ready to attack but are waiting for your orders."
text2 = word_tokenize(text1)

In [None]:
#long words
text3 = [w for w in text2 if len(w)>3]
text3

In [None]:
#Capitalized words
[w for w in text2 if w.istitle()]

In [None]:
#Words that end with e
[w for w in text2 if w.endswith('e')]

In [None]:
#Words that start with t
[w for w in text2 if w.startswith('t')]

### Finding unique words: Using set()

In [None]:
#Finding unique words
text3="To be or not to be"
text4=text3.split(' ')
print(len(text4))
print(set(text4))

In [None]:
print(set([w.lower() for w in text4])) # lowercase everything before finding unique words,
len(set([w.lower() for w in text4]))   # because set() treats upper- and lower-case spellings as different items

### Frequency of words

In [None]:
from nltk.probability import FreqDist # find freq distribution

text="The local part of an email address has no significance for intermediate mail relay systems other than the final mailbox host. Email senders and intermediate relay systems must not assume it to be case-insensitive, since the final mailbox host may or may not treat it as such. A single mailbox may receive mail for multiple email addresses, if configured by the administrator. Conversely, a single email address may be the alias to a distribution list to many mailboxes"

In [None]:
text_tokens=word_tokenize(text)

In [None]:
text_tokens1=[w.lower() for w in text_tokens]

dist = FreqDist(text_tokens1)
len(dist)
#dist

In [None]:
dist['for'] # how many times 'for' appears in the text
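`FreqDist` is built on top of Python's `collections.Counter`, so the same counting idea can be sketched with the standard library alone. The sample sentence and the simple regex tokenizer below are assumptions made for this illustration.

```python
import re
from collections import Counter

sample = "To be or not to be, that is the question. To sleep, perchance to dream."

# Lowercase first so that "To" and "to" are counted together
tokens = re.findall(r"[a-z]+", sample.lower())
counts = Counter(tokens)

print(counts["to"])            # lookup works like dist['for'] above
print(counts.most_common(2))   # the two most frequent tokens

# Words that occur at least twice
repeated = [w for w, c in counts.items() if c >= 2]
print(repeated)
```

The final list comprehension follows the same filter-by-count pattern that the exercise below asks for.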

# Exercise

- Find the words that have a length of at least 3 and occur at least 3 times in the previous text.

In [None]:
# your code here
vocab = dist.keys()  # the distinct (lowercased) tokens counted above
findwords = [w for w in vocab if len(w) >= 3 and dist[w] >= 3]
findwords