# Part of Speech (POS) Tagging

In this notebook, we learn about part-of-speech (POS) tagging using the NLTK library and the Brown corpus. We then look for contextually similar words in the corpus. This is not a full-fledged project but a demonstration of the POS tagging concept on a corpus.

<a id='import_lib'></a>
## 1. Import Libraries

Let us start by importing the required library and downloading the Brown corpus and the tagger model through NLTK.

In [1]:
# Importing the Natural Language Toolkit (NLTK)
import nltk

# Downloading the brown corpus
nltk.download('brown') 

# Downloading the tagger
nltk.download('averaged_perceptron_tagger') 

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\nilim\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\nilim\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Q1. Consider the following three sentences. Tokenize and identify part of speech for each word in each sentence

In [2]:
# Let's take a sentence and tokenize it using the function word_tokenize()
text = nltk.word_tokenize("I came to Bangaluru to meet some awesome people.") 

In [3]:
# After tokenizing the sentence we will perform POS tagging on the sentence 
nltk.pos_tag(text) 

[('I', 'PRP'),
 ('came', 'VBD'),
 ('to', 'TO'),
 ('Bangaluru', 'NNP'),
 ('to', 'TO'),
 ('meet', 'VB'),
 ('some', 'DT'),
 ('awesome', 'JJ'),
 ('people', 'NNS'),
 ('.', '.')]

In [11]:
# Identifying the part-of-speech of word 'jump' in a sentence
nltk.pos_tag(nltk.word_tokenize("Do you want to jump")) 

[('Do', 'VBP'), ('you', 'PRP'), ('want', 'VB'), ('to', 'TO'), ('jump', 'VB')]

In [12]:
# Using 'jump' as a noun this time, to check whether the tagger identifies it correctly
nltk.pos_tag(nltk.word_tokenize("That was a nice jump")) 

[('That', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('nice', 'JJ'), ('jump', 'NN')]
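
The tags in the outputs above are Penn Treebank tags. As a reading aid, the sketch below maps the tags seen so far to short descriptions and prints how 'jump' was tagged in the two sentences. The `PENN_TAGS` dictionary and the `describe` helper are illustrative names, not part of NLTK, and the tagged lists are copied from the outputs above rather than recomputed.

```python
# Short descriptions for the Penn Treebank tags that appear in this notebook.
# PENN_TAGS and describe() are hypothetical helpers written for illustration.
PENN_TAGS = {
    "PRP": "personal pronoun",
    "VBD": "verb, past tense",
    "VBP": "verb, non-3rd person singular present",
    "VB": "verb, base form",
    "TO": "'to'",
    "NNP": "proper noun, singular",
    "NNS": "noun, plural",
    "NN": "noun, singular or mass",
    "DT": "determiner",
    "JJ": "adjective",
    ".": "sentence-final punctuation",
}


def describe(tagged):
    """Attach a description to every (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag")) for word, tag in tagged]


# Tagged output copied from the two 'jump' cells above
as_verb = [('Do', 'VBP'), ('you', 'PRP'), ('want', 'VB'), ('to', 'TO'), ('jump', 'VB')]
as_noun = [('That', 'DT'), ('was', 'VBD'), ('a', 'DT'), ('nice', 'JJ'), ('jump', 'NN')]

for word, tag, meaning in describe(as_verb) + describe(as_noun):
    if word == 'jump':
        print(word, tag, meaning)
```

If the NLTK 'tagsets' data package is installed, `nltk.help.upenn_tagset('NN')` prints the full definition of a tag.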

## Text Similarity

In [4]:
# Converting all the words in the Brown corpus to lower case and wrapping them in an nltk.Text object
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())

## Q.2 Consider the 'brown' corpus (use the nltk library to download the corpus). Find words similar to the word 'kid' in the corpus

In [5]:
# Let us now find words similar to 'kid'
# The 'num' parameter of similar() controls how many similar words are printed (default: 20)
text.similar('kid') 

# We would expect nouns that refer to people, such as 'man' and 'children'
# The output also contains other words, such as 'time' and 'house'

man other time house trial way place doctor world car city state one
past children case family government south situation


<table align='left'>
    <tr>
        <td width='8%'>
            <img src='infer.png'>
        </td>
        <td>
            <div align='left' style='font-size:120%'>
                <font color='#21618C'>
                    <b> To find similar words, the function looks at the context of each occurrence of 'kid', that is, the word immediately before it and the word immediately after it (together with 'kid' as the middle word, these form a trigram). It then finds other words in the corpus that occur in the same contexts and prints the ones that share the most contexts with 'kid'. The results are not restricted to nouns. <br></b>
                </font>
            </div>
        </td>
    </tr>
</table>
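
The idea of matching words by shared context can be sketched in plain Python. This is a simplified illustration on a toy token list, not NLTK's actual implementation: the function names `build_context_index` and `similar_words`, the boundary markers, and the toy data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_context_index(tokens):
    """Map each word to the set of (left, right) neighbour pairs it occurs between."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else "*START*"
        right = tokens[i + 1] if i < len(tokens) - 1 else "*END*"
        contexts[word].add((left, right))
    return contexts


def similar_words(tokens, target, num=20):
    """Rank other words by how many distinct contexts they share with target."""
    contexts = build_context_index(tokens)
    target_contexts = contexts.get(target, set())
    shared = Counter()
    for word, word_contexts in contexts.items():
        if word == target:
            continue
        overlap = len(word_contexts & target_contexts)
        if overlap:
            shared[word] = overlap
    return [word for word, _ in shared.most_common(num)]


# Toy corpus: 'kid' and 'man' appear between the same neighbours twice, 'dog' once
toy_tokens = ("the kid ran home . the man ran home . the kid sat down . "
              "the man sat down . the dog ran away .").split()

print(similar_words(toy_tokens, "kid", num=5))
```

On this toy data 'man' ranks above 'dog' because it shares two contexts with 'kid' ('the _ ran' and 'the _ sat') while 'dog' shares only one.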

In [6]:
# To print only 5 similar words, we pass num=5
text.similar('kid', num=5) 

# Only 5 similar words are printed for 'kid' here, compared with the default of 20 in the previous output

man other time house trial


In [7]:
# Finding words similar to 'run', which is a verb
text.similar('run') 

get be do in see work go have take make put and find time look day say
use come show


In [8]:
# Finding words similar to the preposition 'on'
text.similar('on') 
# Add concordance function and demonstrate it

in of to and for at with from by as that is into but when was over
about all through


In [9]:
# 'cricket' has two meanings: the game and the insect
text.similar('cricket') 

# None of the words in the output relate to either the game or the insect
# This is probably because 'cricket' occurs rarely in the Brown corpus, so there are too few contexts to find good matches

and available second wonderful glass moment window formidable wicked
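
One way to check the explanation above is to count how often a word occurs and in how many distinct contexts. The sketch below does this on a toy token list. The `context_count` helper and the toy data are assumptions for illustration, and no Brown corpus figures are claimed here.

```python
def context_count(tokens, word):
    """Return (number of occurrences, number of distinct (left, right) contexts)."""
    occurrences = 0
    contexts = set()
    for i, tok in enumerate(tokens):
        if tok == word:
            occurrences += 1
            left = tokens[i - 1] if i > 0 else "*START*"
            right = tokens[i + 1] if i < len(tokens) - 1 else "*END*"
            contexts.add((left, right))
    return occurrences, len(contexts)


# Toy data: 'kid' is seen in two contexts, 'cricket' in only one
toy_tokens = "the kid ran . the kid sat . a cricket chirped .".split()

print("kid:", context_count(toy_tokens, "kid"))
print("cricket:", context_count(toy_tokens, "cricket"))
```

A word with very few contexts gives the similarity search little to match on. On the notebook's `text` object, `text.count('cricket')` should report the number of occurrences in the lowercased Brown corpus.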


## Q.3 Consider the 'gutenberg' corpus (use the nltk library to download the corpus). Print all those sentences in which the word 'kid' occurs
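
The question asks for whole sentences, whereas a concordance shows fixed-width windows of text around each match. As a minimal sketch of the sentence-level version, the helper below filters a list of tokenized sentences for an exact, case-insensitive match. The `sentences_containing` name and the toy sentences are assumptions for illustration. With NLTK, the input could instead come from `nltk.corpus.gutenberg.sents()` once the 'gutenberg' package is downloaded.

```python
def sentences_containing(sentences, word):
    """Keep the tokenized sentences that contain word as a whole token (case-insensitive)."""
    target = word.lower()
    return [sent for sent in sentences if any(tok.lower() == target for tok in sent)]


# Toy sentences standing in for a real corpus
toy_sentences = [
    ["The", "kid", "ran", "home", "."],
    ["She", "was", "kidding", "."],
    ["A", "Kid", "of", "the", "goats", "."],
]

for sent in sentences_containing(toy_sentences, "kid"):
    print(" ".join(sent))
```

Matching on whole tokens means 'kidding' is not counted as an occurrence of 'kid'.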

In [25]:
# Let us now look at the concordance function in nltk
import nltk

# Download the gutenberg corpus from the downloader dialog box
nltk.download()

# Using the concordance function
# concordance() shows every occurrence of the word 'kid' together with its surrounding context
# Note: 'text' is still the Brown-corpus Text object created in cell In [4], so the matches below come from the Brown corpus
# The function has 3 parameters:
# word: the target word for which we want the concordance, width: width of each line (in characters), and lines: number of lines to display (default=25)
text.concordance('kid', width = 40, lines = 30)

# The output shows the first 30 of the 61 matches for 'kid', one match per line

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Displaying 30 of 61 matches:
rly days when the kid wanted to pop his
ote . `` see that kid '' ? ? he said , 
eal low , and the kid walks over to me 
. `` when i was a kid '' , maris told a
 cockpit . `` the kid had a automatic ,
get the minutes . kid ory , the trombon
probably has more kid appeal than any o
n brown soap on a kid bandage overnight
t of the world `` kid '' . lou gehrig w
y , badly dressed kid whose arms were t
hing ( taking the kid downtown is too m
base . a new york kid , a refugee from 
ated the pedersen kid too , dying in ou
bout the pedersen kid . he'd not care a
ky to a slit of a kid and maybe lose on
 and the pedersen kid was in the kitche
g of the pedersen kid mother-naked in a
ound the pedersen kid by the crib '' . 
it's the pedersen kid . the kid '' . ``
edersen kid . the kid '' . `` nothing t
pa . the pedersen kid . the kid '' -- `
edersen kid . the kid '' -- `` i goddam
s had run 