<img src="../images/cads-logo.png" style="height: 100px;" align=left>  <img src="../images/NLP.jpeg" style="height: 200px;" align=right width="300">
# Natural Language Processing

# Contents:

- Natural Language Toolkit (NLTK) 
- Install NLTK
- Text Preprocessing
    - Tokenization
    - Noise Removal
        - Language Stop Words 
        - Filter Out Punctuation 
        - URLs or links
            - Regular Expression
            - The most common uses of regular expressions
            - Exercise A
            - Two useful methods
            - Exercise B
        - Social media entities 
            - Exercise C
    - Lexicon Normalization
        - Stemming
        - Lemmatization
- Part of Speech Tagging
- Named entity recognition 
- Some useful functions
    - Finding specific words
    - Finding unique words
    - Frequency of words
    - Exercise E

# Natural Language Toolkit (NLTK) module with Python

NLTK is a large open-source Python toolkit that supports the whole Natural Language Processing (NLP) workflow. It helps with everything from splitting paragraphs into sentences and sentences into words, to recognizing the part of speech of each word, highlighting the main subjects, and ultimately helping your machine understand what the text is about.

#### Advantages of NLTK
- Has support for most NLP tasks
- Provides access to numerous text corpora


Let us explain some terms:
- Tokens – words or sentences or entities present in the text
- Tokenization – process of converting a text into tokens
- Lexicon – Words and their meanings. Example: English dictionary. Consider, however, that various fields will have different lexicons.
- Corpus – Body of text, singular. Corpora is the plural of this. Example: A collection of medical journals.

# Install NLTK

In [1]:
# in anaconda prompt
#  conda install -c anaconda nltk 


# or 
!pip install nltk



In [6]:
import nltk  
nltk.download()  # opens the interactive NLTK downloader; select the data packages you need

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [7]:
nltk.__file__

'D:\\Users\\Faiz\\Anaconda3\\lib\\site-packages\\nltk\\__init__.py'

# Text Preprocessing

Text is the most unstructured form of available data: it contains various types of noise and is not readily analyzable without preprocessing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It comprises two steps:

- <b>Noise Removal</b>
    - Language stopwords (commonly used words of a language: is, am, the, of, in, etc.) 
    - Punctuation
    - URLs or links
    - Social media entities (mentions, hashtags)
<BR> <BR>    
- <b>Lexicon Normalization</b>
     - Stemming
     - Lemmatization

## Tokenization
Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and others can be considered as tokens.

Let's look at an example of how to split text into tokens with the NLTK module.

In [8]:
from nltk.tokenize import sent_tokenize, word_tokenize

#example_sent = "Paragraphs can contain many different kinds of information. A paragraph could contain a series of brief examples or a single long illustration of a general point. It might describe a place, character, or process; narrate a series of events; compare or contrast two or more things; classify items into categories; or describe causes and effects. Regardless of the kind of information they contain, all paragraphs share certain characteristics. One of the most important of these is a topic sentence."
example_sent ="We are attacking on their left flank but are losing many men. we cannot see the enemy army. Nothing else to report. We are ready to attack but are waiting for your orders."
#example_sent = "Here are some very simple basic sentences. They won't be very interesting, I'm afraid. The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."

sen_tokens = sent_tokenize(example_sent)
print(sen_tokens)
word_tokens = word_tokenize(example_sent)
print(word_tokens)
len(word_tokens)

['We are attacking on their left flank but are losing many men.', 'we cannot see the enemy army.', 'Nothing else to report.', 'We are ready to attack but are waiting for your orders.']
['We', 'are', 'attacking', 'on', 'their', 'left', 'flank', 'but', 'are', 'losing', 'many', 'men', '.', 'we', 'can', 'not', 'see', 'the', 'enemy', 'army', '.', 'Nothing', 'else', 'to', 'report', '.', 'We', 'are', 'ready', 'to', 'attack', 'but', 'are', 'waiting', 'for', 'your', 'orders', '.']


38

In [9]:
raw_docs = ["Here are some very simple basic sentences.",
"They won't be very interesting, I'm afraid.",
"The point of these examples is to _learn how basic text cleaning works_ on *very simple* data."]

tokenized_docs = [word_tokenize(doc) for doc in raw_docs]
print(tokenized_docs)

[['Here', 'are', 'some', 'very', 'simple', 'basic', 'sentences', '.'], ['They', 'wo', "n't", 'be', 'very', 'interesting', ',', 'I', "'m", 'afraid', '.'], ['The', 'point', 'of', 'these', 'examples', 'is', 'to', '_learn', 'how', 'basic', 'text', 'cleaning', 'works_', 'on', '*very', 'simple*', 'data', '.']]
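
For comparison, a very rough tokenizer can be sketched with a single regular expression from the standard library. This is only an illustration: the function name `simple_tokenize` is hypothetical, and the pattern does not reproduce the rules of `word_tokenize` (which, as shown above, splits "won't" into "wo" and "n't").

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A rough approximation only; NLTK's word_tokenize applies many more rules.
    """
    # \w+ grabs runs of letters, digits and underscores;
    # [^\w\s] grabs one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("They won't be very interesting, I'm afraid.")
print(tokens)
```

Note how the apostrophes become separate tokens here, which is one reason a purpose-built tokenizer is usually preferable for real text.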


## Noise Removal

Any piece of text that is not relevant to the context of the data and the end output can be regarded as noise.

This step removes all types of noisy entities present in the text.

A general approach to noise removal is to prepare a dictionary of noisy entities and iterate over the text token by token (or word by word), eliminating the tokens that appear in the noise dictionary.
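
The dictionary-based approach can be sketched in plain Python. The noise list and the helper name `remove_noise` below are made up for illustration; a real project would build its own list for its own data.

```python
# Hypothetical noise list for illustration only.
noise_words = {"is", "a", "this", "...", "rt"}

def remove_noise(text, noise):
    """Drop every whitespace-separated token found in the noise set."""
    tokens = text.split()
    # Compare in lowercase so that 'RT' and 'rt' are treated alike
    kept = [t for t in tokens if t.lower() not in noise]
    return " ".join(kept)

cleaned = remove_noise("RT this is a sample text ...", noise_words)
print(cleaned)  # sample text
```

Using a `set` for the noise entries keeps each membership check fast even when the list grows large.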

### Language Stop Words
Stop words usually refer to the most common words in a language. For instance, the English language contains stop words like ‘a’, ‘an’, ‘are’, ‘as’, ‘at’, ‘be’, ‘by’, ‘for’, ‘from’, ‘has’, ‘he’, ‘is’, ‘in’, ‘it’, ‘its’, ‘of’, ‘on’, ‘that’, ‘the’, ‘to’, ‘was’, ‘were’, ‘will’, ‘with’, etc. 

These words carry little meaning on their own and are usually removed from texts. There is no universal list of stop words in NLP research; however, the NLTK module ships with a list of stop words.

Now, we will learn how to remove stop words with the NLTK module.

In [None]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words

In [None]:
text = 'NLTK is a leading platform for building Python programs to work with human language data.'
text_tokens = word_tokenize(text)
new_words = [w for w in text_tokens if w not in stop_words]
new_words

In [None]:
# remove the stop words from the word-tokenized text
# (NLTK's stop word list is lowercase, so compare in lowercase to catch 'We')
words_filtered = [w for w in word_tokens if w.lower() not in stop_words]
print(words_filtered)

### Filter Out Punctuation
We can filter out all words that we are not interested in, such as all punctuation. There’s punctuation like commas, apostrophes, quotes, question marks, and more. `string.punctuation` provides a great list of punctuation characters.

import string
print(string.punctuation)

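Python strings have the method `isalpha()` that can be used for this. The isalpha() method returns True if the string is non-empty and all of its characters are alphabetic letters (upper or lower case), so tokens that contain punctuation or digits are rejected.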
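
A quick sketch of how `isalpha()` behaves on a few sample strings (the samples are chosen only for illustration):

```python
samples = ["hello", "Hello", "can't", "abc123", "", "café"]

# Map each sample to the result of isalpha()
results = {s: s.isalpha() for s in samples}
for s, ok in results.items():
    print(repr(s), ok)
```

Note that a word containing an apostrophe, such as "can't", is rejected as a whole, and that the empty string also returns False.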


In [None]:
# remove all punctuation 
text = 'This &is [an] example? {of} string. with?. punctuation!!!!'

text_tokens = word_tokenize(text)
print(text_tokens)
[w for w in text_tokens if w.isalpha()] 
# The method isalpha() checks whether the string consists of alphabetic characters only.

In [None]:
# remove all punctuation 
words_filtered2 = [w.lower() for w in words_filtered if w.isalpha()] # we use w.lower() to convert text to lowercase
print(words_filtered2)
len(words_filtered2)

### URLs or links
During preprocessing we want to keep only the meaningful words in each text, so we prefer to remove URLs and links. To remove them, we use regular expressions.

### Regular Expression

A regular expression is a sequence of characters mainly used to find and replace patterns in a string or file. Regular expressions are supported by most programming languages, including Python, Perl, R and Java.

Regular expressions (<a href="https://docs.python.org/3/library/re.html" target="_blank" rel="noopener nofollow">Regular expressions in Python 3</a>) use two types of characters:

- <font color=green><b>a) Meta characters</b></font>: these characters have a special meaning. Here’s a complete list of them:
<font color=red><b>. ^ $ * + ? { } [ ] | ( ) \ </b></font>
    - <b>Character matches:</b>
    
        <font color=red><b>.</b></font> :       Matches any single character except a newline
        
        <font color=red><b>^</b></font> :       Matches the start of the string
        
       <font color=red><b>$</b></font> :       Matches the end of the string
      
       <font color=red><b>[ ]</b></font> :    Matches any single character in a square bracket
      
      <b>[a-z]</b> :    Matches one of the range of character a,b,...,z
      
      <b>[^abc]</b> : Matches a character that is not in a,b or c
      
      <b>a<font color=red>|</font>b</b> :     Matches either a or b, where a and b are strings
      
      <font color=red><b>( )</b></font> :     Groups part of a regular expression and captures the matched text
      
      <font color=red><b>\\</b></font> :       Escapes a metacharacter or introduces a special sequence such as \d
    <br><br>
    - <b> Characters symbols</b>
    
        <font color=red><b> \b </b></font>: Matches word boundary
        
        <font color=red><b> \d </b></font>: Any digit, equivalent to [0-9]
        
        <font color=red><b> \D </b></font>: Any non-digit, equivalent to [^0-9]
        
        <font color=red><b> \s </b></font>: Any whitespace, equivalent to [ \t\n\r\f\v]
        
        <font color=red><b> \S </b></font>: Any non-whitespace, equivalent to [^ \t\n\r\f\v]
        
        <font color=red><b> \w </b></font>: Alphanumeric character or underscore, equivalent to [a-zA-Z0-9_]
        
        <font color=red><b> \W </b></font>: Non-alphanumeric character, equivalent to [^a-zA-Z0-9_]
<br><br>
    - <b> Repetitions:</b>

        <font color=red><b>*</b></font> :       0 or more occurrences
        
        <font color=red><b>+</b></font> :       1 or more occurrences 
          
        <font color=red><b>?</b></font> :       0 or 1 occurrence
        
        <font color=red><b>{</b></font><b>n</b><font color=red><b>}</b></font> :       Exactly n repetitions, n>=0
        
        <font color=red><b>{</b></font><b>n,</b><font color=red><b>}</b></font> :       At least n repetitions
        
        <font color=red><b>{</b></font><b>,n</b><font color=red><b>}</b></font> :       At most n repetitions
    

- <font color=green><b>b) Literals </b></font>(like a,b,1,2…)
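
To make the metacharacters listed above more concrete, here is a small sketch that applies several of them to one sentence. The sentence and variable names are invented for illustration.

```python
import re

text = "Order 66 shipped on 2024-01-15 to cat, cot and cut."

digits = re.findall(r"\d+", text)                  # + : runs of one or more digits
c_t_words = re.findall(r"\bc.t\b", text)           # . : 'c', any one character, 't'
ca_or_co = re.findall(r"\bc[ao]t\b", text)         # [ ] : only 'a' or 'o' in the middle
starts_with = bool(re.search(r"^Order", text))     # ^ : anchored at the start
ends_with = bool(re.search(r"cut\.$", text))       # \. is a literal dot, $ anchors the end
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)     # {n} : exact repetition counts

print(digits, c_t_words, ca_or_co, starts_with, ends_with, dates)
```

Writing the patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regular expression engine sees them.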

In Python, the “re” module provides support for regular expressions, so you need to import it before you can use them.

In [None]:
import re

### The most common uses of regular expressions are:
- <font color=blue>Searching a string </font>(search and match)
- <font color=blue>Finding all matches in a string </font>(findall)
- <font color=blue>Breaking a string into substrings</font> (split)
- <font color=blue>Replacing part of a string </font>(sub)

The ‘re’ package provides the following commonly used methods for querying an input string:

- <font color=green><b>re.match(pattern, string)</b></font>: 
This method finds a match only if it occurs at the start of the string.

- <font color=green><b>re.search(pattern, string)</b></font>: 
It is similar to match(), but it does not restrict the match to the beginning of the string.
search() can find a pattern at any position in the string, but it returns only the first occurrence.

- <font color=green><b>re.findall(pattern, string)</b></font>: 
This method returns a list of all non-overlapping matches of the pattern. It is not restricted to the start or end of the string.

- <font color=green><b>re.split(pattern, string, maxsplit=0)</b></font>: 
This method splits the string at each occurrence of the given pattern. With the default maxsplit=0, there is no limit on the number of splits.

- <font color=green><b>re.sub(pattern, replace, string)</b></font>:
It searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged.
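
One detail worth seeing in code: `match()` and `search()` return a match object (or `None`), not the matched text itself. The sketch below shows how the text and its position might be read from that object; the variable names are illustrative.

```python
import re

text = "AV Analytics Vidhya AV"

m = re.match(r"AV", text)          # anchored at the start of the string
print(m.group(), m.span())         # matched text and (start, end) positions

s = re.search(r"Vidhya", text)     # first occurrence anywhere in the string
print(s.group(), s.start())

miss = re.match(r"Vidhya", text)   # 'Vidhya' is not at the start, so no match
print(miss)                        # None

# Guard against None before calling .group()
found = s.group() if s else ""
```

Calling `.group()` on `None` raises an `AttributeError`, so checking the result first is a common habit.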

In [None]:
#match()
result = re.match(r'AV', 'AV Analytics Vidhya AV')
result

In [None]:
#match()
value="vrheesville"
m = re.match(r"vo?", value)
m

In [None]:
#match()
value="vrheesville"
m = re.match(r"vo+", value)
m

In [None]:
#search()
result = re.search(r'AV', 'AV Analytics Vidhya AV')
result

In [None]:
#findall()
result = re.findall(r'AV', 'AV Analytics Vidhya AV')
result

In [None]:
#findall()
result = re.findall(r'(An.*)', 'AV Analytics Vidhya AV')
result

## <font color=green> Exercise A</font>

- a) Find all words which start with 'd' or 'p' in the following text.
- b) Find all words which contain the letter 'd' or 'p' in the following text.

   text = "The local part of an email address has no significance for intermediate mail relay systems other than the final mailbox host. Email senders and intermediate relay systems must not assume it to be case-insensitive, since the final mailbox host may or may not treat it as such. A single mailbox may receive mail for multiple email addresses, if configured by the administrator. Conversely, a single email address may be the alias to a distribution list to many mailboxes"

In [None]:
# a)
text="The local part of an email address has no significance for intermediate mail relay systems other than the final mailbox host. Email senders and intermediate relay systems must not assume it to be case-insensitive, since the final mailbox host may or may not treat it as such. A single mailbox may receive mail for multiple email addresses, if configured by the administrator. Conversely, a single email address may be the alias to a distribution list to many mailboxes"
# Your code here

In [None]:
# b)
text="The local part of an email address has no significance for intermediate mail relay systems other than the final mailbox host. Email senders and intermediate relay systems must not assume it to be case-insensitive, since the final mailbox host may or may not treat it as such. A single mailbox may receive mail for multiple email addresses, if configured by the administrator. Conversely, a single email address may be the alias to a distribution list to many mailboxes"
# Your code here

In [None]:
#split()
result=re.split(r'i','Analytics')
result

In [None]:
#split()
result=re.split(r'a','AV Analytics Vidhya AV')
result

In [None]:
#split()
result=re.split(r'a','AV Analytics Vidhya AV',maxsplit=1)
result

In [None]:
#split()
# Separate on one or more non-digit characters in following text.

value = "one 1 two 2 three 3"
result = re.split('...', value)
result

In [None]:
#findall()
# find all words that contain 'E' in the tweet below 

tweet2 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \
#UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'

result = re.findall(r'...', tweet2)
result

In [None]:
text1=re.split(' ',tweet2)
text2=[w for w in text1 if re.findall(r'\w*[E]\w*', w)]
text2

In [None]:
text1 = word_tokenize(tweet2)
text2=[w for w in text1 if re.findall(r'\w*[E]\w*', w)]
text2

In [None]:
#sub()
result=re.sub(r'India','the World','AV is largest Analytics community of India')
result

#### Two useful methods

- <font color=green><b>join()</b></font>: Returns a string in which the elements of a sequence are joined by the separator string the method is called on.

- <font color=green><b>strip()</b></font>: Returns a copy of the string with the given characters removed from the beginning and the end of the string (whitespace characters by default). 

In [None]:
# split() breaks the string into a list; join() in the next cell puts it back together
tweet2 = 'AV is largest Analytics community of India'
text1 = tweet2.split(' ')
print(text1)

In [None]:
text2 = ' '.join(text1)
text2

In [None]:
# strip()  
text8 ='    quick brown  fox  jumped over the lazy dog.  \n '
text9 = text8.strip(' ') 
print(text9)    
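
The argument passed to `strip()` matters. As a small sketch (variable names are illustrative), compare stripping only spaces with stripping all whitespace, and with collapsing repeated inner spaces via `split()` and `join()`:

```python
raw = "    quick brown  fox  jumped over the lazy dog.  \n "

only_spaces = raw.strip(" ")        # removes spaces only; stops at the newline
all_ws = raw.strip()                # removes spaces, tabs and newlines at both ends
collapsed = " ".join(raw.split())   # also collapses repeated inner whitespace

# repr() makes the remaining whitespace visible
print(repr(only_spaces))
print(repr(all_ws))
print(repr(collapsed))
```

With `strip(" ")` the trailing newline survives, because a newline is not a space; calling `strip()` without arguments is usually what is wanted when cleaning text.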

In [None]:
# remove URLs or links

tweet = 'This is a tweet with a url:http://t.co/0DlGChTBIx and there is no any url'
tweet_clean = re.sub(r"http\S+", "", tweet)
tweet_clean
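
The pattern `http\S+` only catches links that start with "http". A slightly broader sketch is shown below; the pattern and the helper name `remove_urls` are assumptions made for illustration, not a complete URL grammar.

```python
import re

# Assumed pattern for illustration: http:// or https:// links, and links starting with www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Delete URL-like substrings, then tidy up leftover repeated whitespace."""
    no_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", no_urls).strip()

tweet = "Docs at https://www.nltk.org and www.python.org are useful"
print(remove_urls(tweet))
```

Bare short links without a scheme or "www." prefix (for example `bit.ly/2guVelr`) are still not matched by this sketch and would need an extra rule.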

In [None]:
# What does this code do?
# ('items' avoids shadowing the built-in name 'list'; raw strings keep '\d' intact)
items = ["123", "4cat", "dog5", "6mouse", "mouse"]

for w in items:
    m = re.match(r"^\d", w)
    if m:
        print("START:", w)

    m = re.match(r".*\d$", w)
    if m:
        print("END:", w) 
    print("\n")

## <font color=green> Exercise B</font>
Write a piece of code to extract:      
1) the words which start with 'vi', without using ^, from the text below.

    text = ' visit123 "Ethics are21 view right into21 the via ideals and objectives of the United Nations" \ #UNSG @21 NY23 Society for134  a14 Ethical43 view23 vital'
     
2) all words except those including 'dog' from the given text.

       text = '100cat 223cat 534dog 400cat 500car 345dog 847bar'

3) all words except numbers from the text below.

       text = 'Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls.'

4) the email addresses from the following text.

    text='John.Smith@example.com Ethics are built right into the ideals and objectives of the United Nations" \ #UNSG @21 NY Society for Ethical Culture local-part@domain.org'


In [None]:
# 1)
# Your code here 

In [None]:
# 2)
# Your code here 

In [None]:
# 3)
# Your code here 

In [None]:
# 4)
# Your code here 

### Social media entities
Over the years, the amount of short text available has increased significantly as social networks such as LinkedIn, Facebook and Twitter have risen in popularity. Such data stores often contain information that is relevant and potentially valuable. When analysing social media posts, we face a lot of noise in the data, particularly URLs, hashtags, Twitter IDs and @user mentions. We can also use regular expressions to remove this type of noise from the text.
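
As a minimal sketch of this kind of cleaning, the helper below (a hypothetical name, with an invented sample tweet) removes @mentions entirely and keeps the word of each hashtag while dropping the `#` sign. Whether to keep or drop the hashtag word is a design choice that depends on the analysis.

```python
import re

def strip_entities(tweet):
    """Remove @mentions and the '#' sign of hashtags (keeping the tag word)."""
    no_mentions = re.sub(r"@\w+", "", tweet)           # drop '@' plus the user name
    no_hash = re.sub(r"#(\w+)", r"\1", no_mentions)    # keep only the captured tag word
    return " ".join(no_hash.split())                   # tidy up leftover whitespace

clean = strip_entities("@alice loving the new #NLP course with @bob_99 #python")
print(clean)
```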


### <font color=green> Exercise C</font>

Write a piece of code to extract:
- 1) hashtags from the following tweets.

    - a) tweet1 = "@nltk Text analysis is awesome! #regex #pandas #python"
    - b) tweet2 = ["Here are some very #simple basic #sentences.",
"They won't be #very interesting, I'm #afraid.",
"The #point of these #examples is to _learn how basic text #cleaning works_ on *very simple* data."]

   
    
- 2) callouts (@mentions) from the given tweet.

    tweet2 = '@UN @UN_Women "Ethics are built right into the ideals and objectives of the United Nations" \ #UNSG @ NY Society for Ethical Culture bit.ly/2guVelr'


In [None]:
# 1a)
# Your code here 

In [None]:
# 1b)
# Your code here 

In [None]:
# 2)
# Your code here 

# Lexicon Normalization

Another type of textual noise is the multiple representations that a single word can take.

For example, “play”, “player”, “played”, “plays” and “playing” are all variations of the word “play”. Although their meanings differ, they are contextually similar. This step converts all the variants of a word into their normalized form (also known as the lemma).

The most common lexicon normalization practices are:

- <b> Stemming</b>:<br>
Stemming is the process of reducing words to their word stem, base or root form (for example, “fishing”, “fished” and “fisher” all reduce to the stem “fish”). A widely used algorithm is the Porter stemming algorithm, which removes common morphological and inflectional endings (“ing”, “ly”, “es”, “s”, etc.) from words.
    
    
- <b>Lemmatization</b>: <br>
The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. **As opposed to stemming, lemmatization does not simply chop off inflections**. Instead it uses lexical knowledge bases to get the correct base forms of words.
    - It makes use of: 
        - vocabulary (the dictionary forms of words) 
        - morphological analysis (word structure and grammar relations)
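
The difference between the two ideas can be illustrated with a deliberately simplified toy in plain Python. Neither function below reproduces the Porter algorithm or WordNet: the suffix list and the tiny lookup table are invented purely to contrast "chopping off endings" with "looking words up in a lexicon".

```python
# Toy illustration only: hypothetical suffix list and a tiny stand-in lexicon.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_TABLE = {"geese": "goose", "mice": "mouse", "leaves": "leaf"}

def toy_stem(word):
    """Chop off the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        # Require a few characters to remain so very short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the lexicon; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ["playing", "played", "geese", "leaves"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The toy stemmer turns "leaves" into the non-word "leav" and cannot handle "geese" at all, while the lookup returns "leaf" and "goose"; on the other hand, the lookup only knows the words in its table.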

In [None]:
#Stemming
from nltk.stem import PorterStemmer

new_text="the little yellow dog barked at the cat. thinking, plays, cats, leaves"
# new_text = "There are several types of stemming algorithms."

words = word_tokenize(new_text)

ps = PorterStemmer()
for w in words:
    print(ps.stem(w))

In [None]:
#Stemming
example_words = ["python","pythoner","pythoning","pythoned","pythonly", 'eventuellement', 'darfed']
for w in example_words:
    print(ps.stem(w))

In [None]:
#Lemmatization
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words =("cats","geese","rocks","lives","leaves","worse","runing")
# words = ["been", "had", "done", "languages", "cities", "mice"]

for w in words:
    print(lemmatizer.lemmatize(w))

In [None]:
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("worse", pos="a"))
print(lemmatizer.lemmatize("runing"))
print(lemmatizer.lemmatize("runing",'v'))

# Part of speech tagging

Apart from grammatical relations, every word in a sentence is also associated with a part of speech (POS) tag (noun, verb, adjective, adverb, etc.). The POS tag defines the usage and function of a word in the sentence. Recall from your high school grammar that parts of speech are word classes such as nouns, verbs, adjectives and adverbs.

Here is a <b>list of pos-tags</b>, what they mean, and some examples: (<font color=blue>Tag</font>, Word class, <font color=green>Example</font>)

<font color=blue>CC</font>	coordinating conjunction <br>
<font color=blue>CD</font>	cardinal number<br>
<font color=blue>DT</font>	determiner<br>
<font color=blue>EX	</font>existential there (like: <font color=green>"there is"</font> ... think of it like <font color=green>"there exists"</font>)<br>
<font color=blue>FW	</font>foreign word<br>
<font color=blue>IN	</font>preposition/subordinating conjunction<br>
<font color=blue>JJ	</font>adjective	<font color=green>'big'</font><br>
<font color=blue>JJR	</font>adjective, comparative	<font color=green>'bigger'</font><br>
<font color=blue>JJS	</font>adjective, superlative	<font color=green>'biggest'</font><br>
<font color=blue>LS	</font>list marker	<font color=green>1)</font><br>
<font color=blue>MD	</font>modal	<font color=green>could, will</font><br>
<font color=blue>NN	</font>noun, singular <font color=green>'desk'</font><br>
<font color=blue>NNS	</font>noun plural	<font color=green>'desks'</font><br>
<font color=blue>NNP	</font>proper noun, singular	<font color=green>'Harrison'</font><br>
<font color=blue>NNPS	</font>proper noun, plural	<font color=green>'Americans'</font><br>
<font color=blue>PDT	</font>predeterminer	<font color=green>'all the kids'</font><br>
<font color=blue>POS	</font>possessive ending	parent<font color=green>'s</font><br>
<font color=blue>PRP	</font>personal pronoun	<font color=green>I, he, she</font><br>
<font color=blue>PRP\$	</font>possessive pronoun	<font color=green>my, his, hers</font><br>
<font color=blue> RB	</font>adverb	<font color=green>very, silently</font><br>
<font color=blue> RBR</font>	adverb, comparative	<font color=green>better</font><br>
<font color=blue>RBS</font>	adverb, superlative	<font color=green>best</font><br>
<font color=blue>RP	</font>particle<font color=green>	give up</font><br>
<font color=blue>TO	</font>to	go <font color=green>'to' </font>the store.<br>
<font color=blue>UH	</font>interjection	<font color=green>errrrrrrrm</font><br>
<font color=blue>VB	</font>verb, base form	<font color=green>take</font><br>
<font color=blue>VBD	</font>verb, past tense	<font color=green>took</font><br>
<font color=blue>VBG</font>	verb, gerund/present participle	<font color=green>taking</font><br>
<font color=blue>VBN	</font>verb, past participle	<font color=green>taken</font><br>
<font color=blue>VBP	</font>verb, non-3rd person singular present	<font color=green>take</font><br>
<font color=blue>VBZ	</font>verb, 3rd person sing. present	<font color=green>takes</font><br>
<font color=blue>WDT	</font>wh-determiner	<font color=green>which</font><br>
<font color=blue>WP	</font>wh-pronoun	<font color=green>who, what</font><br>
<font color=blue>WP\$	</font>possessive wh-pronoun	<font color=green>whose</font><br>
<font color=blue> WRB	</font>wh-adverb	<font color=green>where, when</font>

In [None]:
#Part of speech tagging
import nltk
sample_text = "the little yellow dog barked at the cat"
# sample_text = "Parts of speech examples: an article, to write, interesting, easily, and, of"

words = word_tokenize(sample_text)

tagged = nltk.pos_tag(words)
print(tagged)

# Named entity recognition
Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).

- Named entities include:
     - Person names - <font color=green> Eddy Bonte, President Obama</font>
     - Location names - <font color=green>Murray River, Mount Everest</font>
     - Facility names -<font color=green> Washington Monument, Stonehenge</font>
     - Date - <font color=green>June, 2008-06-29</font>
     - Time - <font color=green>two fifty a m, 1:30 p.m.</font>
     - Percent - <font color=green>twenty pct, 18.75 %</font>
     - Money - <font color=green>175 million Canadian Dollars, GBP 10.40</font>
     - GPE (Geo-Political Entity) - <font color=green>South East Asia, Midlothian</font>
     - Organization -<font color=green> Georgia-Pacific Corp., WHO</font>

In [None]:
from nltk import word_tokenize, pos_tag, ne_chunk

text = "Bill works for Apple so he went to Boston for a conference."

pos_tag_words = pos_tag(word_tokenize(text))
print(ne_chunk(pos_tag_words))

In [None]:
sample_text="Sara likes to visit different parts of malaysia. She visits Cameron hilands, Frozen hills, Port dickson and Melaka until now."
# sample_text=" I'm going to Germany this Monday."
# sample_text="Mark and John are working at Google."
# sample_text="WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
# sample_text=" When, after the 2010 election, Wilkie, Rob, Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply."
words = word_tokenize(sample_text)

tagged = nltk.pos_tag(words)
namedEnt=nltk.ne_chunk(tagged)
print(namedEnt)

In [None]:
namedEnt.draw()

# Some useful functions 

### Finding specific words
- Long words: words that are more than 3 letters long
- Capitalized words: words that start with a capital letter [A-Z]
- Words that end / start with a specific letter

In [None]:
text1 = "We are attacking on their left flank but are losing many men. We cannot see the enemy army. Nothing else to report. We are ready to attack but are waiting for your orders."
text2 = word_tokenize(text1)

In [None]:
# Long words
text3 = [w for w in text2 if len(w)>3]
text3

In [None]:
# Capitalized words
[w for w in text2 if w.istitle()]

In [None]:
# Words that end with e
[w for w in text2 if w.endswith('e')]

In [None]:
# Words that start with t
[w for w in text2 if w.startswith('t')]

### Finding unique words: Using set()

In [None]:
#Finding unique words
text3 = "To be or not to be"
text4 = text3.split(' ')
print(len(text4))
print(set(text4))

In [None]:
print(set([w.lower() for w in text4]))
len(set([w.lower() for w in text4]))

### Frequency of words

In [None]:
from nltk.probability import FreqDist


In [None]:
text ="The local part of an email address has no significance for intermediate mail relay systems other than the final mailbox host. Email senders and intermediate relay systems must not assume it to be case-insensitive, since the final mailbox host may or may not treat it as such. A single mailbox may receive mail for multiple email addresses, if configured by the administrator. Conversely, a single email address may be the alias to a distribution list to many mailboxes"
text_tokens = word_tokenize(text)


In [None]:
text_tokens1 = [w.lower() for w in text_tokens]

dist = FreqDist(text_tokens1)
print(len(dist))  # number of distinct tokens
dist

In [None]:
vocab = dist.keys()
vocab

In [None]:
dist['the']
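
For readers without NLTK at hand, the same kind of counting can be sketched with `collections.Counter` from the standard library, which offers a similar lookup-by-word interface. The short sample text is chosen only for illustration.

```python
from collections import Counter

tokens = "to be or not to be that is the question".split()
counts = Counter(tokens)

print(counts["be"])            # how often one word occurs
print(counts.most_common(2))   # the two most frequent words with their counts
print(counts["missing"])       # words that never occur count as 0
print(len(counts))             # number of distinct words
```

Looking up a word that never occurs returns 0 rather than raising an error, which makes frequency-based filters easy to write.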

### <font color=green> Exercise E</font>

- Find the words that have a length of at least 3 and occur at least 3 times in the previous text.

In [None]:
# Your code here