<a href="https://colab.research.google.com/github/khamoh/NaturalLangProcessing/blob/master/TextProcessing_using_NLTK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NLTK is a widely used Python library for text processing. This notebook covers:


1. Tokenisation - splitting a string into a collection of words or sentences
2. Morphological analysis - converting a word to its root form
3. Part of speech tagging 
4. Named entity recognition 
5. Spelling correction 

In [34]:
import nltk
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("tagsets")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

Tokenization 

In [0]:
data = """  My name is XYZ, I live in Solihull. My home is in near shopping center. sometime I
go out for shopping, you can contact my at test@test.com and ask for Mr.XYZ
""" 

In [14]:
data.split('.')

['  My name is XYZ, I live in Solihull',
 ' My home is in near shopping center',
 ' sometime I\ngo out for shopping, you can contact my at test@test',
 'com and ask for Mr',
 'XYZ\n']
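Splitting on every period breaks the e-mail address and `Mr.XYZ` apart, because a period does not always end a sentence. As a rough sketch (a heuristic for illustration, not the algorithm NLTK uses), splitting only where sentence-ending punctuation is followed by whitespace already avoids both mistakes on this sample. The helper name `naive_sentences` is made up for this example.

```python
import re

def naive_sentences(text):
    """Split text after '.', '!' or '?' only when whitespace follows."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sample = (
    "My name is XYZ, I live in Solihull. My home is in near shopping center. "
    "sometime I go out for shopping, you can contact my at test@test.com "
    "and ask for Mr.XYZ"
)

sentences = naive_sentences(sample)
for s in sentences:
    print(s)
```

This heuristic still fails on abbreviations followed by a space, such as `Mr. Smith`, which is the kind of case a trained sentence tokenizer is meant to handle.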

Sentence Tokenization 

In [23]:
nltk.sent_tokenize(data)

['  My name is XYZ, I live in Solihull.',
 'My home is in near shopping center.',
 'sometime I\ngo out for shopping, you can contact my at test@test.com and ask for Mr.XYZ']

In [20]:
x = nltk.sent_tokenize(data)
print(x[2])

sometime I
go out for shopping, you can contact my at test@test.com and ask for Mr.XYZ


Word Tokenization 

In [16]:
nltk.word_tokenize(data)

['My',
 'name',
 'is',
 'XYZ',
 ',',
 'I',
 'live',
 'in',
 'Solihull',
 '.',
 'My',
 'home',
 'is',
 'in',
 'near',
 'shopping',
 'center',
 '.',
 'sometime',
 'I',
 'go',
 'out',
 'for',
 'shopping',
 ',',
 'you',
 'can',
 'contact',
 'my',
 'at',
 'test',
 '@',
 'test.com',
 'and',
 'ask',
 'for',
 'Mr.XYZ']

# Morphological analysis
Converting a word to its root form, for example:
- children to child
- wives to wife
- knives to knife


Two methods:
*   Stemming - faster, less accurate, works at the spelling level (strips suffixes by rule)
*   Lemmatization - slower, more accurate, works at the meaning level (looks the word up in a vocabulary)
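To make the spelling-level versus meaning-level distinction concrete, here is a toy sketch. It is not the Porter algorithm and not how WordNet works internally: `SUFFIXES`, `LEMMA_TABLE`, `toy_stem` and `toy_lemmatize` are all hypothetical names and data invented for this illustration.

```python
# Toy illustration only: real stemmers apply many more rules, and real
# lemmatizers consult a full lexical database rather than a tiny dict.

SUFFIXES = ("es", "s")  # hypothetical, deliberately tiny rule set

# Hypothetical mini-dictionary mapping irregular forms to their lemma.
LEMMA_TABLE = {"children": "child", "wives": "wife", "knives": "knife"}

def toy_stem(word):
    """Strip the first matching suffix; knows nothing about meaning."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up first; fall back to suffix stripping."""
    return LEMMA_TABLE.get(word, toy_stem(word))

for w in ["cars", "boxes", "knives", "children"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
```

The rule-based function handles regular plurals but cannot map an irregular form such as "children" to "child", whereas the lookup-based function can, at the cost of needing a vocabulary.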



**Stemming**

In [24]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("cars")

'car'

In [25]:
ps = PorterStemmer()
ps.stem("boxes")

'box'

In [27]:
ps = PorterStemmer()
ps.stem("knives")

'knive'

In [28]:
ps = PorterStemmer()
ps.stem("children")

'children'

**Lemmatization**

In [29]:
from nltk.stem import WordNetLemmatizer
wd  = WordNetLemmatizer()
wd.lemmatize("cars")

'car'

In [30]:
wd  = WordNetLemmatizer()
wd.lemmatize("children")

'child'

In [31]:
wd  = WordNetLemmatizer()
wd.lemmatize("wives")

'wife'

**Part-of-Speech (POS) Tagging**

In [35]:
nltk.pos_tag(nltk.word_tokenize("There was an eagle in the sky and it was looking for food"))

[('There', 'EX'),
 ('was', 'VBD'),
 ('an', 'DT'),
 ('eagle', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('sky', 'NN'),
 ('and', 'CC'),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('looking', 'VBG'),
 ('for', 'IN'),
 ('food', 'NN')]

In [38]:
nltk.help.upenn_tagset("VBD")


VBD: verb, past tense
    dipped pleaded swiped regummed soaked tidied convened halted registered
    cushioned exacted snubbed strode aimed adopted belied figgered
    speculated wore appreciated contemplated ...


In [39]:
nltk.help.upenn_tagset("NN")

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
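Because Penn Treebank tags share prefixes (noun tags start with `NN`, verb tags with `VB`), tagged output can be filtered with a simple prefix check. The sketch below reuses the tagged pairs printed by `pos_tag` above as a literal list, so it runs without NLTK; `words_with_tag_prefix` is a hypothetical helper name.

```python
# Tagged tokens copied from the pos_tag output shown earlier.
tagged = [
    ("There", "EX"), ("was", "VBD"), ("an", "DT"), ("eagle", "NN"),
    ("in", "IN"), ("the", "DT"), ("sky", "NN"), ("and", "CC"),
    ("it", "PRP"), ("was", "VBD"), ("looking", "VBG"), ("for", "IN"),
    ("food", "NN"),
]

def words_with_tag_prefix(pairs, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged, "NN")
verbs = words_with_tag_prefix(tagged, "VB")
print(nouns)
print(verbs)
```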


## **NER - Named Entity Recognition**

In [0]:
import spacy 

In [42]:
nlp = spacy.load("en_core_web_sm")  # the "en" shortcut is not available in recent spaCy releases
doc = nlp("The big grey dog ate all of the chocolate, but fortunately he wasn't sick!")
doc.text.split()

['The',
 'big',
 'grey',
 'dog',
 'ate',
 'all',
 'of',
 'the',
 'chocolate,',
 'but',
 'fortunately',
 'he',
 "wasn't",
 'sick!']

In [0]:
nlp = spacy.load("en_core_web_sm")

In [0]:
data = nlp("Microsoft developed a solution for corona pandemic and we will all work towards finding it in UK on date 01-01-2021, Bill Gates will help us")

In [50]:
from spacy import displacy
displacy.render(data, style="ent", jupyter=True)

# Organisations can use NER to locate personal data (such as names) in text so that GDPR-sensitive details can be hidden
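As a minimal sketch of hiding personal data once entities are known, the function below replaces selected character spans with their label. The spans here are written by hand for illustration and are not model output; with spaCy they could plausibly be built from `doc.ents` using each entity's `start_char`, `end_char` and `label_`. `mask_entities` is a hypothetical helper name.

```python
def mask_entities(text, spans, labels_to_hide):
    """Replace each (start, end, label) span whose label is hidden with [LABEL]."""
    # Work from the end of the string so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        if label in labels_to_hide:
            text = text[:start] + f"[{label}]" + text[end:]
    return text

text = "Bill Gates will help us in UK"

# Hand-written spans for illustration; these are NOT model output.
name_start = text.index("Bill Gates")
place_start = text.index("UK")
spans = [
    (name_start, name_start + len("Bill Gates"), "PERSON"),
    (place_start, place_start + len("UK"), "GPE"),
]

masked = mask_entities(text, spans, labels_to_hide={"PERSON"})
print(masked)
```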

**Spelling Correction**

In [52]:
# The higher the distance between the words, the lower the similarity

nltk.jaccard_distance(set("orange"), set('orenge'))

0.16666666666666666

In [53]:

nltk.jaccard_distance(set("orange"), set('random'))

0.5
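The two values above can be checked by hand. Jaccard distance is one minus the size of the intersection divided by the size of the union; the sketch below computes it directly on character sets. `jaccard_distance_sketch` is a name made up here, and the guard for two empty sets is an added assumption of this sketch.

```python
def jaccard_distance_sketch(a, b):
    """Return 1 - |a & b| / |a | b| for two sets."""
    union = a | b
    if not union:
        # Assumption of this sketch: two empty sets count as identical.
        return 0.0
    return (len(union) - len(a & b)) / len(union)

# "orange" vs "orenge": 5 shared letters out of 6 distinct letters.
d_close = jaccard_distance_sketch(set("orange"), set("orenge"))
# "orange" vs "random": 4 shared letters out of 8 distinct letters.
d_far = jaccard_distance_sketch(set("orange"), set("random"))
print(d_close, d_far)
```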

In [0]:
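dictionary = ['Mango', 'Orange', 'Icecream']

def correct(word):
  """Return the dictionary word with the smallest Jaccard distance to `word`."""
  score = 1
  ans = ""
  for w in dictionary:
    dist = nltk.jaccard_distance(set(w), set(word))
    if dist < score:  # keep the closest match seen so far
      ans = w
      score = dist
  return ans  # return only after every candidate has been checked

In [56]:
correct('Mangi')

'Mango'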
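Comparing sets of single characters ignores letter order, so anagrams look identical. One common variation, sketched here as an assumption rather than as the notebook's method, is to compare sets of character bigrams and to lower-case both words first. All function names below are hypothetical.

```python
def char_bigrams(word):
    """Set of adjacent character pairs, lower-cased so case does not matter."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}

def jaccard(a, b):
    """Jaccard distance between two sets (1.0 if both are empty)."""
    union = a | b
    return 1.0 if not union else 1 - len(a & b) / len(union)

def correct_bigram(word, vocabulary):
    """Return the vocabulary entry with the smallest bigram Jaccard distance."""
    target = char_bigrams(word)
    return min(vocabulary, key=lambda w: jaccard(char_bigrams(w), target))

vocabulary = ["Mango", "Orange", "Icecream"]
best = correct_bigram("Mangi", vocabulary)
print(best)

# Character sets cannot tell anagrams apart; bigram sets can.
set_dist = jaccard(set("listen"), set("silent"))
bigram_dist = jaccard(char_bigrams("listen"), char_bigrams("silent"))
print(set_dist, bigram_dist)
```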
