**Getting Started with Natural Language Processing.**

Let's get started with a sample project. 

Start by importing the nltk library. To make sure every package is available, run nltk.download(), choose d (Download) at the prompt, and enter all as the identifier.

In [0]:
!pip install nltk

#@title
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all
    Downloading collection u'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /root/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Unzipping corpora/biocreative_ppi.zip.
       | Downloading package brown to /root/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package brown_tei to /root/nltk_data...
       |   Unzipping corpora/brown_tei.zip.
       | Downloading package cess_cat to /root/nltk_data...
       |   Unzipping corpora/cess_cat.zip.
       | Downloading packag

True

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> l
Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [ ] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [

True

Let's start with two short sentences of text.

**Q1: Start by tokenizing the text into sentences using the package.**

In [0]:
#@title
text="My name is Adam. My name is not Adam. "
from nltk.tokenize import word_tokenize, sent_tokenize
sents=sent_tokenize(text)
print(sents)

['My name is Adam.', 'My name is not Adam.']


For comparison, the same cell run on the text "My name is Adam. I like to code." prints:

['My name is Adam.', 'I like to code.']
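To make the idea of sentence tokenizing concrete, here is a minimal sketch that uses only the standard library. It is not how `sent_tokenize` works internally; `naive_sent_split` is a hypothetical helper that just splits after `.`, `!` or `?` when whitespace follows.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = "My name is Adam. My name is not Adam. "
print(naive_sent_split(text))
# ['My name is Adam.', 'My name is not Adam.']
```

A rule this simple breaks on abbreviations: "Dr. Smith is here." would be split after "Dr.", which is one reason to prefer the tokenizer from the package for real text.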


**Q2: Tokenize each sentence into words and print the result.**

In [0]:
#@title
words=[word_tokenize(sent) for sent in sents]
print(words)

[['My', 'name', 'is', 'Adam', '.'], ['My', 'name', 'is', 'not', 'Adam', '.']]


With "My name is Adam. I like to code." as the input text, the output is:

[['My', 'name', 'is', 'Adam', '.'], ['I', 'like', 'to', 'code', '.']]


**Q3: Now it's time to remove stopwords and punctuation. Start by importing the stopword list and the punctuation characters.**

In [0]:
#@title
from nltk.corpus import stopwords 
from string import punctuation
customStopWords=set(stopwords.words('english')+list(punctuation))

**Q4: Now print out the tokenized words without punctuation and stopwords.**

In [0]:
#@title
wordsWOStopwords=[word for word in word_tokenize(text) if word not in customStopWords]
print(wordsWOStopwords)

['My', 'name', 'Adam', 'My', 'name', 'Adam']


The same filter applied to "My name is Adam. I like to code." gives:

['My', 'name', 'Adam', 'I', 'like', 'code']
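Note that 'My' survives in the output above: the entries in the English stopword list are lowercase, and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. `SAMPLE_STOPWORDS` is a hypothetical hand-picked set used only for illustration (the real list is much longer), and the tokens are copied from the Q2 output.

```python
from string import punctuation

# Hypothetical mini stopword list for illustration; NLTK's English list is much longer.
SAMPLE_STOPWORDS = {"my", "is", "not", "i", "to"}

def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Drop punctuation and stopwords, comparing each token in lowercase."""
    blocked = set(stopwords) | set(punctuation)
    return [tok for tok in tokens if tok.lower() not in blocked]

tokens = ["My", "name", "is", "Adam", ".", "My", "name", "is", "not", "Adam", "."]
print(remove_stopwords(tokens))
# ['name', 'Adam', 'name', 'Adam']
```

In the notebook itself, the equivalent change would be to test `word.lower() not in customStopWords` inside the list comprehension.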


**Q5: Now it's time to use the bigram collocation tools and print the n-gram counts.**

In [0]:
#@title
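from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()  # scoring functions such as PMI; not needed for raw counts
finder = BigramCollocationFinder.from_words(wordsWOStopwords)
sorted(finder.ngram_fd.items())  # (bigram, count) pairs in alphabetical order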

[(('Adam', 'My'), 1), (('My', 'name'), 2), (('name', 'Adam'), 2)]

With the text "My name is Adam. I like to code." the same cell returns:

[(('Adam', 'I'), 1),
 (('I', 'like'), 1),
 (('My', 'name'), 1),
 (('like', 'code'), 1),
 (('name', 'Adam'), 1)]
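The counts in `finder.ngram_fd` are simply the number of times each pair of adjacent tokens occurs. A minimal standard-library sketch of the same counting, using the filtered tokens from Q4 (`bigram_counts` is a hypothetical helper, not part of NLTK):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))

tokens = ["My", "name", "Adam", "My", "name", "Adam"]
print(sorted(bigram_counts(tokens).items()))
# [(('Adam', 'My'), 1), (('My', 'name'), 2), (('name', 'Adam'), 2)]
```

Because stopwords were removed first, pairs such as ('name', 'Adam') join words that were not adjacent in the original sentences.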

**Q6: Now take another line of text and print the stemmed words using LancasterStemmer.**

In [0]:
#@title
text2 = "It is close. We are closer. "
from nltk.stem.lancaster import LancasterStemmer
st=LancasterStemmer()
stemmedWords=[st.stem(word) for word in word_tokenize(text2)]
print(stemmedWords)

['it', 'is', 'clos', '.', 'we', 'ar', 'clos', '.']


**Q7: Tokenize a sentence and tag each word with its part of speech using pos_tag.**

In [0]:
#@title
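# The output below corresponds to this sentence rather than to text2.
nltk.pos_tag(word_tokenize("Adam likes to play football and rugby."))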

[('Adam', 'NNP'),
 ('likes', 'VBZ'),
 ('to', 'TO'),
 ('play', 'VB'),
 ('football', 'NN'),
 ('and', 'CC'),
 ('rugby', 'NN'),
 ('.', '.')]

**Q8: Now import wordnet and print out the synsets.**

In [0]:
#@title
from nltk.corpus import wordnet as wn
for ss in wn.synsets('bass'):
    print(ss, ss.definition())

Synset('bass.n.01') the lowest part of the musical range
Synset('bass.n.02') the lowest part in polyphonic music
Synset('bass.n.03') an adult male singer with the lowest voice
Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae
Synset('freshwater_bass.n.01') any of various North American freshwater fish with lean flesh (especially of the genus Micropterus)
Synset('bass.n.06') the lowest adult male singing voice
Synset('bass.n.07') the member with the lowest range of a family of musical instruments
Synset('bass.n.08') nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes
Synset('bass.s.01') having or denoting a low vocal or instrumental range


**Q9: It's time to play around with lesk now. Import it and tokenize a new sentence to find the synset for 'bass'.**

*Suggested: Sing in a lower tone, along with the bass*

In [0]:
#@title
from nltk.wsd import lesk
sense1 = lesk(word_tokenize("Sing in a lower tone, along with the bass"),'bass')
print(sense1, sense1.definition())

Synset('bass.n.07') the member with the lowest range of a family of musical instruments
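One way to see why this sense wins: the simplified Lesk approach scores each sense by how many words its definition shares with the context. The sketch below is a rough standard-library approximation, not NLTK's implementation. `simple_lesk` is a hypothetical helper, `DEFINITIONS` holds only a subset of the definitions printed in Q8, the context is split on whitespace instead of using `word_tokenize`, and ties are not handled in any principled way.

```python
# Definitions copied from the WordNet output in Q8 (a subset of the 'bass' senses).
DEFINITIONS = {
    "bass.n.01": "the lowest part of the musical range",
    "bass.n.03": "an adult male singer with the lowest voice",
    "sea_bass.n.01": "the lean flesh of a saltwater fish of the family Serranidae",
    "bass.n.07": "the member with the lowest range of a family of musical instruments",
}

def simple_lesk(context_tokens, definitions):
    """Return the sense whose definition overlaps most with the context, plus all scores."""
    context = set(context_tokens)
    scores = {
        name: len(context & set(gloss.split()))
        for name, gloss in definitions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

context = "Sing in a lower tone, along with the bass".split()
best, scores = simple_lesk(context, DEFINITIONS)
print(best, scores[best])
# bass.n.07 3
```

The three shared words here are 'the', 'with' and 'a', all function words, so the choice says little about meaning. That is worth keeping in mind when reading lesk results on short sentences.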


**Q10: Try it again with a different sentence.**

In [0]:
#@title
sense2 = lesk(word_tokenize("This sea bass was really hard to catch"),'bass')
print(sense2, sense2.definition())

Synset('sea_bass.n.01') the lean flesh of a saltwater fish of the family Serranidae


**I like to code. The code should be clean. What is the code of conduct?**





