# Basics of NLP with nltk


### List of Topics
- [Tokenization](#1)
- [Stop Words](#2)
- [Frequency Distribution](#3)
- [Lemmatization](#4)

In [1]:
input = '''Guilt is only useful when it causes us to change our behaviour and to make amends. 
Feelings of shame are not just personal. We can also pick up feelings of shame from our nation, 
family, or our even just being in the vicinity of an event.'''

print(input)

Guilt is only useful when it causes us to change our behaviour and to make amends. 
Feelings of shame are not just personal. We can also pick up feelings of shame from our nation, 
family, or our even just being in the vicinity of an event.


In [2]:
%config Completer.use_jedi = False

## Word and Sentence Tokenization<a id='1'></a>

In [3]:
# The tokenizers rely on the 'punkt' models; run nltk.download('punkt') once beforehand
from nltk import tokenize
word = tokenize.word_tokenize(input)
sentence = tokenize.sent_tokenize(input)

print('Word Tokenizer Output:\n{}\n\nSentence Tokenizer Output:\n{}'.format(word,sentence))

Word Tokenizer Output:
['Guilt', 'is', 'only', 'useful', 'when', 'it', 'causes', 'us', 'to', 'change', 'our', 'behaviour', 'and', 'to', 'make', 'amends', '.', 'Feelings', 'of', 'shame', 'are', 'not', 'just', 'personal', '.', 'We', 'can', 'also', 'pick', 'up', 'feelings', 'of', 'shame', 'from', 'our', 'nation', ',', 'family', ',', 'or', 'our', 'even', 'just', 'being', 'in', 'the', 'vicinity', 'of', 'an', 'event', '.']

Sentence Tokenizer Output:
['Guilt is only useful when it causes us to change our behaviour and to make amends.', 'Feelings of shame are not just personal.', 'We can also pick up feelings of shame from our nation, \nfamily, or our even just being in the vicinity of an event.']
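To make the two outputs above less of a black box, here is a rough standard-library approximation of word and sentence splitting. The names `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helpers, not part of nltk, and the regular expressions are only a sketch: unlike the punkt-based tokenizers, they would mis-split abbreviations such as "Dr. Smith" and handle contractions differently.

```python
import re

def simple_word_tokenize(text):
    # \w+ grabs runs of letters/digits/underscores;
    # [^\w\s] grabs each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

sample = "Feelings of shame are not just personal. We can also pick up feelings of shame."
print(simple_word_tokenize(sample))
print(simple_sent_tokenize(sample))
```

For plain prose like the sample text this gives similar tokens, but for real work the nltk tokenizers are the safer choice.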


## Removing Stop Words <a id='2'></a>

In [4]:
# Fetch the English stop-word list (needs nltk.download('stopwords') once)
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words[:30])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself']


In [5]:
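# Drop stop words and single-character tokens (which covers the punctuation here).
# The comparison is case-sensitive, so capitalised words such as 'We' are kept.
filter_sentence = []
for w in word:
    if w not in stop_words and len(w) > 1:
        filter_sentence.append(w)
print(filter_sentence)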

['Guilt', 'useful', 'causes', 'us', 'change', 'behaviour', 'make', 'amends', 'Feelings', 'shame', 'personal', 'We', 'also', 'pick', 'feelings', 'shame', 'nation', 'family', 'even', 'vicinity', 'event']
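Note that 'We' survives in the output above because the stop-word list is lowercase and the check is case-sensitive. A possible variant, sketched below, lowercases each token before the lookup and uses `str.isalpha()` to drop punctuation. The set `demo_stop_words` and the helper `remove_stop_words` are hypothetical and exist only for illustration; in the notebook you would pass `stopwords.words('english')` instead.

```python
# Hypothetical, deliberately tiny stop-word set for illustration only.
demo_stop_words = {"we", "can", "up", "of", "from", "our", "is", "the"}

def remove_stop_words(tokens, stop_words):
    stop_set = set(stop_words)  # set membership is faster than list membership
    # Keep purely alphabetic tokens whose lowercase form is not a stop word.
    return [t for t in tokens if t.isalpha() and t.lower() not in stop_set]

tokens = ["We", "can", "also", "pick", "up", "feelings", "of", "shame", ","]
filtered = remove_stop_words(tokens, demo_stop_words)
print(filtered)
```

One trade-off: `isalpha()` also discards tokens containing digits, hyphens or apostrophes, which may or may not be what you want.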


## Frequency Distribution<a id='3'></a>

In [6]:
from nltk.probability import FreqDist
import pandas as pd

In [7]:
# Count how often each remaining token occurs (case-sensitive)
fdist = dict(FreqDist(filter_sentence))

In [8]:
data = pd.DataFrame(fdist.values(), index=fdist.keys(), columns=['Count'])
data

Unnamed: 0,Count
Guilt,1
useful,1
causes,1
us,1
change,1
behaviour,1
make,1
amends,1
Feelings,1
shame,2
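The counting step can also be pictured with the standard library alone. The sketch below uses `collections.Counter` (which, as far as I know, is the class `FreqDist` builds on) and lowercases tokens first, so that 'Feelings' and 'feelings' are counted together. The short `tokens` list is an assumed excerpt of the filtered words, not the full notebook output.

```python
from collections import Counter

# Assumed excerpt of the filtered tokens from the previous step.
tokens = ['Guilt', 'useful', 'Feelings', 'shame', 'feelings', 'shame']

# Lowercase before counting so case variants share one entry.
counts = Counter(t.lower() for t in tokens)

# most_common() lists (token, count) pairs from most to least frequent.
for token, count in counts.most_common():
    print(token, count)
```

Whether to merge case variants depends on the task; the notebook's table keeps them separate.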


## Lemmatization<a id='4'></a>

In [9]:
# The lemmatizer needs the WordNet corpus; run nltk.download('wordnet') once beforehand
from nltk.stem import WordNetLemmatizer
lemmat = WordNetLemmatizer()

In [10]:
# lemmatize() treats every token as a noun unless a pos argument is given
for token in filter_sentence:
    print(token, ' - ', lemmat.lemmatize(token))

Guilt  -  Guilt
useful  -  useful
causes  -  cause
us  -  u
change  -  change
behaviour  -  behaviour
make  -  make
amends  -  amends
Feelings  -  Feelings
shame  -  shame
personal  -  personal
We  -  We
also  -  also
pick  -  pick
feelings  -  feeling
shame  -  shame
nation  -  nation
family  -  family
even  -  even
vicinity  -  vicinity
event  -  event
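Some results above look odd at first, such as 'us' becoming 'u' while 'Feelings' is left alone. One plausible way to picture this is a noun lemmatizer that strips common suffixes and keeps the shortest form found in a dictionary, with a case-sensitive lookup. The sketch below is a toy model under that assumption, not NLTK's actual code: `NOUN_RULES`, `DEMO_NOUNS` and `toy_lemmatize` are all hypothetical, and the mini-dictionary stands in for the WordNet noun index.

```python
# Simplified noun suffix rules as (suffix, replacement) pairs; an assumption
# loosely modelled on WordNet-style morphology.
NOUN_RULES = [("ies", "y"), ("ses", "s"), ("s", "")]

# Hypothetical mini-dictionary standing in for the WordNet noun index.
DEMO_NOUNS = {"cause", "feeling", "amends", "us", "u", "family"}

def toy_lemmatize(word, vocab=DEMO_NOUNS, rules=NOUN_RULES):
    # The word itself is a candidate if the dictionary knows it.
    candidates = [word] if word in vocab else []
    for suffix, repl in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + repl
            if stem in vocab:
                candidates.append(stem)
    # Prefer the shortest known form; otherwise return the input unchanged.
    return min(candidates, key=len) if candidates else word

for w in ["causes", "us", "amends", "feelings", "Feelings"]:
    print(w, ' - ', toy_lemmatize(w))
```

Under this model, 'us' loses its final 's' because 'u' happens to be a dictionary entry, and 'Feelings' is untouched because the lookup does not match capitalised forms. Lowercasing tokens, and passing an appropriate `pos` value to `lemmatize()`, would likely give cleaner results.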
