# Natural Language Processing - freeCodeCamp
Natural language processing (NLP) sits at the overlap of computer science, artificial intelligence and human language. It aims to automate everyday language tasks such as:
* making appointments
* buying things
* spell checking
* generating responses 
* social media monitoring

Current applications include:

* keyword search / information extraction 
* advertisement matching 
* sentiment analysis
* speech recognition 
* chatbots
* machine translation (e.g. Google Translate)

## Components of NLP
* natural language **understanding** 
    * mapping input to useful representation
    * analysing different aspects of language 
* natural language **generation**
    * text and sentence planning 
    * text realisation - mapping the sentence plan into sentence structure

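Natural language **understanding** is the harder of the two because language is full of ambiguity:
* lexical (a word has more than one meaning)
    * e.g. 'she is looking for a *match*' - a partner, a game, or something to light a fire?
* syntactic (a sentence can be parsed in more than one way)
    * e.g. 'the chicken is ready to eat' - is the chicken about to eat, or about to be eaten?
* referential (it is unclear what a pronoun refers to)
    * e.g. 'the boy told his father something. *He* was very upset.' - is *He* the boy or the father?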

In [6]:
# python -m pip install nltk
import os
import nltk
import nltk.corpus

In [7]:
#nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [8]:
from nltk.corpus import brown
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [9]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [10]:
hamlet = nltk.corpus.gutenberg.words('shakespeare-hamlet.txt')
hamlet

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', ...]

In [11]:
for word in hamlet[:500]:
    print(word, end=' ')

[ The Tragedie of Hamlet by William Shakespeare 1599 ] Actus Primus . Scoena Prima . Enter Barnardo and Francisco two Centinels . Barnardo . Who ' s there ? Fran . Nay answer me : Stand & vnfold your selfe Bar . Long liue the King Fran . Barnardo ? Bar . He Fran . You come most carefully vpon your houre Bar . ' Tis now strook twelue , get thee to bed Francisco Fran . For this releefe much thankes : ' Tis bitter cold , And I am sicke at heart Barn . Haue you had quiet Guard ? Fran . Not a Mouse stirring Barn . Well , goodnight . If you do meet Horatio and Marcellus , the Riuals of my Watch , bid them make hast . Enter Horatio and Marcellus . Fran . I thinke I heare them . Stand : who ' s there ? Hor . Friends to this ground Mar . And Leige - men to the Dane Fran . Giue you good night Mar . O farwel honest Soldier , who hath relieu ' d you ? Fra . Barnardo ha ' s my place : giue you goodnight . Exit Fran . Mar . Holla Barnardo Bar . Say , what is Horatio there ? Hor . A peece of him Bar 

## Tokenisation

In [12]:
nlp = 'Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment. NLP drives computer programs that translate text from one language to another, respond to spoken commands, and summarize large volumes of text rapidly—even in real time. There’s a good chance you’ve interacted with NLP in the form of voice-operated GPS systems, digital assistants, speech-to-text dictation software, customer service chatbots, and other consumer conveniences. But NLP also plays a growing role in enterprise solutions that help streamline business operations, increase employee productivity, and simplify mission-critical business processes.'

The word tokenizer splits the string into tokens; punctuation marks become tokens of their own.

In [13]:
from nltk.tokenize import word_tokenize 

In [14]:
nlp_tokens = word_tokenize(nlp)
nlp_tokens[:5]

['Natural', 'language', 'processing', '(', 'NLP']

In [15]:
len(nlp_tokens)

191
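
To make the idea of tokenisation concrete, here is a minimal sketch of a tokenizer built on a single regular expression. It is not how `word_tokenize` works internally (NLTK's tokenizer also handles contractions, abbreviations and more); the function name `simple_tokenize` and the pattern are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches any one character that is neither a word
    # character nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Natural language processing (NLP) is fun."
tokens = simple_tokenize(sample)
print(tokens)
```

As with `word_tokenize`, the brackets and the full stop come back as separate tokens rather than being attached to the neighbouring word.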

The frequency counter counts how many times each token occurs. Calling `lower()` on each token first makes the count case-insensitive.

In [16]:
from nltk.probability import FreqDist
fdist = FreqDist()

In [17]:
for word in nlp_tokens:
    fdist[word.lower()] += 1 
fdist

FreqDist({',': 13, 'the': 7, 'and': 7, 'to': 6, 'of': 6, '.': 6, 'nlp': 5, 'in': 5, 'text': 4, '’': 4, ...})

In [18]:
fdist['nlp']

5

In [19]:
len(fdist)

119

In [20]:
fdist_top10 = fdist.most_common(10)
fdist_top10

[(',', 13),
 ('the', 7),
 ('and', 7),
 ('to', 6),
 ('of', 6),
 ('.', 6),
 ('nlp', 5),
 ('in', 5),
 ('text', 4),
 ('’', 4)]
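
`FreqDist` is a subclass of Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The short token list below is made up for illustration and is not taken from the text above.

```python
from collections import Counter

# Hypothetical token list, standing in for the output of a tokenizer.
tokens = ["NLP", "drives", "programs", ",", "and", "nlp", "helps", ",", "too", "."]

# Lower-case every token before counting so 'NLP' and 'nlp' share one entry.
counts = Counter(token.lower() for token in tokens)

print(counts["nlp"])          # occurrences of one token, ignoring case
print(len(counts))            # number of distinct tokens
print(counts.most_common(2))  # the two most frequent tokens
```

Note that punctuation is counted like any other token, which is why `','` tops the frequency list for the longer text.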

`blankline_tokenize` splits the text wherever a blank line occurs, so the number of pieces it returns is the number of paragraphs. The sample text contains no blank lines, so it comes back as a single paragraph.

In [21]:
from nltk.tokenize import blankline_tokenize
nlp_blank = blankline_tokenize(nlp)
len(nlp_blank)

1

In [22]:
nlp_blank[0]

'Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment. NLP drives computer programs that translate text from one language to another, respond to spoken commands, and summarize large volumes of text rapidly—even in real time. There’s a good chance you’ve interacted with NLP in the form of voice-operated GPS systems, digital assistants, speech-to-text dictation software, customer service chatbots, and other consumer conveniences. But NLP also plays
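
The effect is easier to see on text that does contain blank lines. The sketch below splits on blank lines with a plain regular expression; the pattern and the sample text are assumptions chosen to imitate the behaviour described above, not NLTK's own code.

```python
import re

# Hypothetical three-paragraph text; the second gap contains stray spaces.
text = ("First paragraph, sentence one. Sentence two.\n\n"
        "Second paragraph.\n   \n"
        "Third paragraph.")

# A paragraph break is two newlines, possibly with whitespace around them.
paragraphs = [p for p in re.split(r"\s*\n\s*\n\s*", text) if p]

print(len(paragraphs))
for paragraph in paragraphs:
    print(repr(paragraph))
```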

### N-grams
An n-gram is a sequence of n consecutive tokens: a bigram has two tokens, a trigram three, and so on.

In [23]:
from nltk.util import bigrams, trigrams, ngrams

In [24]:
string = 'The best and most beautiful thingd in the world cannot be seen or even touched, they must be felt with the heart.'
quote_tokens = nltk.word_tokenize(string)
quote_tokens

['The',
 'best',
 'and',
 'most',
 'beautiful',
 'thingd',
 'in',
 'the',
 'world',
 'can',
 'not',
 'be',
 'seen',
 'or',
 'even',
 'touched',
 ',',
 'they',
 'must',
 'be',
 'felt',
 'with',
 'the',
 'heart',
 '.']

In [25]:
quote_bigrams = list(nltk.bigrams(quote_tokens))
quote_bigrams

[('The', 'best'),
 ('best', 'and'),
 ('and', 'most'),
 ('most', 'beautiful'),
 ('beautiful', 'thingd'),
 ('thingd', 'in'),
 ('in', 'the'),
 ('the', 'world'),
 ('world', 'can'),
 ('can', 'not'),
 ('not', 'be'),
 ('be', 'seen'),
 ('seen', 'or'),
 ('or', 'even'),
 ('even', 'touched'),
 ('touched', ','),
 (',', 'they'),
 ('they', 'must'),
 ('must', 'be'),
 ('be', 'felt'),
 ('felt', 'with'),
 ('with', 'the'),
 ('the', 'heart'),
 ('heart', '.')]

In [26]:
quote_ngrams = list(nltk.ngrams(quote_tokens, 5))
quote_ngrams

[('The', 'best', 'and', 'most', 'beautiful'),
 ('best', 'and', 'most', 'beautiful', 'thingd'),
 ('and', 'most', 'beautiful', 'thingd', 'in'),
 ('most', 'beautiful', 'thingd', 'in', 'the'),
 ('beautiful', 'thingd', 'in', 'the', 'world'),
 ('thingd', 'in', 'the', 'world', 'can'),
 ('in', 'the', 'world', 'can', 'not'),
 ('the', 'world', 'can', 'not', 'be'),
 ('world', 'can', 'not', 'be', 'seen'),
 ('can', 'not', 'be', 'seen', 'or'),
 ('not', 'be', 'seen', 'or', 'even'),
 ('be', 'seen', 'or', 'even', 'touched'),
 ('seen', 'or', 'even', 'touched', ','),
 ('or', 'even', 'touched', ',', 'they'),
 ('even', 'touched', ',', 'they', 'must'),
 ('touched', ',', 'they', 'must', 'be'),
 (',', 'they', 'must', 'be', 'felt'),
 ('they', 'must', 'be', 'felt', 'with'),
 ('must', 'be', 'felt', 'with', 'the'),
 ('be', 'felt', 'with', 'the', 'heart'),
 ('felt', 'with', 'the', 'heart', '.')]
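
An n-gram generator is just a sliding window over the token list. The sketch below shows the idea in plain Python; `make_ngrams` is a hypothetical helper written for illustration, not part of NLTK.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # Slide a window of width n across the list. A list of length L
    # yields L - n + 1 windows (none if the list is shorter than n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["they", "must", "be", "felt", "with", "the", "heart"]
print(make_ngrams(tokens, 2)[:3])
print(len(make_ngrams(tokens, 3)))
```

The window count explains the output sizes above: the quote has 25 tokens, giving 24 bigrams and 21 5-grams.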

## Stemming
Stemming normalises a word to its base or root form, usually by stripping affixes.

In [27]:
from nltk.stem import PorterStemmer
pst = PorterStemmer()

In [28]:
pst.stem('having')

'have'

In [29]:
words_to_stem = ['give', 'giving', 'given', 'gave']
for word in words_to_stem:
    print(word, ':', pst.stem(word))

give : give
giving : give
given : given
gave : gave


In [30]:
from nltk.stem import LancasterStemmer
lst = LancasterStemmer()
for word in words_to_stem:
    print(word, ':', lst.stem(word))

give : giv
giving : giv
given : giv
gave : gav


In [31]:
from nltk.stem import SnowballStemmer
sbst = SnowballStemmer('english')

In [32]:
for word in words_to_stem:
    print(word, ':', sbst.stem(word))

give : give
giving : give
given : given
gave : gave


Stemming does not always work well, because it simply cuts characters off the end (and sometimes the beginning) of a word, which can leave a non-word such as 'giv'. Lemmatization instead uses morphological analysis of the word to return a proper dictionary word (the lemma), so it must have access to a detailed dictionary.
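
The difference can be sketched with two toy functions: one that blindly strips suffixes and one that looks words up in a dictionary. Both `toy_stem` and the hand-written `TOY_LEMMAS` table are hypothetical and far cruder than the real Porter, Lancaster or WordNet implementations.

```python
def toy_stem(word, suffixes=("ing", "en", "es", "s", "e")):
    """Crude stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma dictionary; a real lemmatizer uses WordNet.
TOY_LEMMAS = {"giving": "give", "given": "give", "gave": "give"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return TOY_LEMMAS.get(word, word)

for word in ["give", "giving", "given", "gave"]:
    print(word, ":", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix stripper produces 'giv' and 'gav', which are not words, while the dictionary lookup can map the irregular form 'gave' to 'give' because it does not rely on spelling alone.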

In [33]:
from nltk.stem import wordnet
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()

In [34]:
word_lem.lemmatize('corpora')

'corpus'

In [35]:
for word in words_to_stem:
    print(word, ':', word_lem.lemmatize(word))

give : give
giving : giving
given : given
gave : gave


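The lemmatizer has left these words unchanged because we did not tell it which part of speech (POS) to use. `lemmatize` defaults to `pos='n'`, so every word was treated as a noun. Passing `pos='v'`, as in `word_lem.lemmatize('gave', pos='v')`, asks it to treat the word as a verb instead.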

## Stopwords
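Stop words are very common words (such as 'the', 'is' and 'and') that help a sentence read grammatically but carry little information of their own, so they are often filtered out before further processing.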

In [36]:
from nltk.corpus import stopwords

In [5]:
print(stopwords.words('english')[:5])
print(len(stopwords.words('english')))

['i', 'me', 'my', 'myself', 'we']
179


In [38]:
import re
# Punctuation characters and digits to strip; '[' is escaped so it is read literally
punctuation = re.compile(r'[-.?!,:;()\[0-9]')

In [39]:
post_punctuation = []
for token in nlp_tokens:
    # Strip the unwanted characters and keep the token only if something is left
    word = punctuation.sub('', token)
    if len(word) > 0:
        post_punctuation.append(word)

In [41]:
post_punctuation[:10]

['Natural',
 'language',
 'processing',
 'NLP',
 'refers',
 'to',
 'the',
 'branch',
 'of',
 'computer']
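
Filtering out the stop words themselves follows the same pattern. The sketch below uses a tiny hand-written stop list as an assumption so that it runs without any downloads; with NLTK, `set(stopwords.words('english'))` could be used in its place.

```python
# A small illustrative stop list; NLTK's English list is much longer.
STOP_WORDS = {"the", "of", "to", "and", "in", "a", "is"}

tokens = ["Natural", "language", "processing", "refers", "to",
          "the", "branch", "of", "computer", "science"]

# Compare in lower case so 'The' and 'the' are treated alike.
content_tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
print(content_tokens)
```

Using a `set` rather than a list keeps each membership test fast, which matters once the token list grows.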

## Parts of Speech (POS)
The grammatical type of a word (verb, noun, adjective, ...) indicates how the word functions, both in meaning and grammatically, within the sentence. The same word can take different types: e.g. in 'Google something', *google* is used as a verb, although it is more often a noun. POS tags are used to categorise words in this way.

In [42]:
sent = 'Timothy is a natural when it comes to drawing'
sent_tokens = word_tokenize(sent)

In [43]:
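# Each token is tagged on its own here, so the tagger sees no sentence context;
# nltk.pos_tag(sent_tokens) would tag the whole sentence in one call.
for token in sent_tokens:
    print(nltk.pos_tag([token]))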

[('Timothy', 'NN')]
[('is', 'VBZ')]
[('a', 'DT')]
[('natural', 'JJ')]
[('when', 'WRB')]
[('it', 'PRP')]
[('comes', 'VBZ')]
[('to', 'TO')]
[('drawing', 'VBG')]


In [44]:
sent2 = 'John is eating a delicious cake'
sent2_tokens = word_tokenize(sent2)
for token in sent2_tokens:
    print(nltk.pos_tag([token]))

[('John', 'NNP')]
[('is', 'VBZ')]
[('eating', 'VBG')]
[('a', 'DT')]
[('delicious', 'JJ')]
[('cake', 'NN')]


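**Named Entity Recognition (NER)**

NER finds the spans of text that name real-world entities and labels each one with a category such as: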
* movie
* monetary value
* organisation
* location
* quantities
* person

'*Google's* CEO *Sundar Pichai* introduced the new Pixel at *Minnesota* *Roi Centre Event*'

In [45]:
from nltk import ne_chunk

In [46]:
ne_sent = 'The US President stays in the White House'

In [47]:
ne_tokens = word_tokenize(ne_sent)
ne_tags = nltk.pos_tag(ne_tokens)

In [48]:
ne_ner = ne_chunk(ne_tags)
print(ne_ner)

(S
  The/DT
  (ORGANIZATION US/NNP)
  President/NNP
  stays/VBZ
  in/IN
  the/DT
  (FACILITY White/NNP House/NNP))


## Syntax
Syntax is the set of rules and principles that govern sentence structure. We can represent this structure as a syntax tree. Rendering syntax trees as images in NLTK can require *Ghostscript* to be installed: https://www.ghostscript.com/

## Chunking
Chunking picks up individual tokens and groups them into larger units, such as noun phrases. In that sense it works in the opposite direction to tokenization.

E.g. '**We** *caught* <u>the pink panther</u>' can be split into 3 chunks.

In [49]:
new = 'The big cat ate the little mouse who was after fresh cheese'
new_tokens = nltk.pos_tag(word_tokenize(new))
new_tokens

[('The', 'DT'),
 ('big', 'JJ'),
 ('cat', 'NN'),
 ('ate', 'VBD'),
 ('the', 'DT'),
 ('little', 'JJ'),
 ('mouse', 'NN'),
 ('who', 'WP'),
 ('was', 'VBD'),
 ('after', 'IN'),
 ('fresh', 'JJ'),
 ('cheese', 'NN')]

In [50]:
# A noun phrase (NP) is an optional determiner, any number of adjectives, then a noun
grammar_np = r'NP: {<DT>?<JJ>*<NN>}'

In [51]:
chunk_parser = nltk.RegexpParser(grammar_np)

In [53]:
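chunk_results = chunk_parser.parse(new_tokens)
# Displaying the tree in a notebook may try to draw it, which needs Ghostscript;
# print(chunk_results) shows a plain-text bracketed tree instead.
chunk_results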

The Ghostscript executable isn't found.
See http://web.mit.edu/ghostscript/www/Install.htm
If you're using a Mac, you can try installing
https://docs.brew.sh/Installation then `brew install ghostscript`


LookupError: 

Tree('S', [Tree('NP', [('The', 'DT'), ('big', 'JJ'), ('cat', 'NN')]), ('ate', 'VBD'), Tree('NP', [('the', 'DT'), ('little', 'JJ'), ('mouse', 'NN')]), ('who', 'WP'), ('was', 'VBD'), ('after', 'IN'), Tree('NP', [('fresh', 'JJ'), ('cheese', 'NN')])])
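
The chunk grammar is a regular expression over POS tags rather than over characters. The sketch below illustrates that idea in plain Python by writing the tags out as one string and matching the same `<DT>?<JJ>*<NN>` pattern against it. `chunk_noun_phrases` is a hypothetical helper and only an approximation of what `RegexpParser` does.

```python
import re

# Hand-tagged tokens, copied from the start of the tagged sentence above.
tagged = [("The", "DT"), ("big", "JJ"), ("cat", "NN"), ("ate", "VBD"),
          ("the", "DT"), ("little", "JJ"), ("mouse", "NN")]

def chunk_noun_phrases(tagged_tokens):
    """Group tokens whose tags match <DT>?<JJ>*<NN> into noun-phrase chunks."""
    # Encode the tag sequence as one string, e.g. '<DT><JJ><NN><VBD>...'.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    chunks = []
    for match in re.finditer(r"(?:<DT>)?(?:<JJ>)*<NN>", tag_string):
        # Count the tags before and inside the match to recover token indices.
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged_tokens[start:end]])
    return chunks

print(chunk_noun_phrases(tagged))
```

The verb 'ate' matches no part of the pattern, so it stays outside every chunk, just as it sits directly under `S` in the tree above.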