N-GRAMS

An n-gram is "a contiguous sequence of N items from a given sample of text or speech". Here an item can be a character, a word, or a sentence, and N can be any positive integer. When N is 2, we call the sequence a bigram. Similarly, a sequence of 3 items is called a trigram, and so on.

1. https://stackabuse.com/python-for-nlp-developing-an-automatic-text-filler-using-n-grams/

2. https://kavita-ganesan.com/what-are-n-grams/#.XkJHaeHhU5k



3. For example, take the sentence “The cow jumps over the moon”. If N=2 (bigrams), the n-grams would be:

    the cow
    cow jumps
    jumps over
    over the
    the moon
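
The bigram list above can be reproduced without any library. The following is a minimal sketch in plain Python; the helper name `simple_ngrams` is made up for illustration and is not part of NLTK. It works on any sequence, so the same function yields word n-grams from a token list and character n-grams from a string.

```python
def simple_ngrams(items, n):
    """Return every contiguous run of n items as a tuple (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    items = list(items)
    # Zip n shifted copies of the sequence; zip stops at the shortest copy,
    # so exactly len(items) - n + 1 tuples are produced.
    return list(zip(*(items[i:] for i in range(n))))


words = "the cow jumps over the moon".split()
bigrams = simple_ngrams(words, 2)
print(bigrams)

# A string is a sequence of characters, so this gives character trigrams.
char_trigrams = simple_ngrams("moon", 3)
print(char_trigrams)
```

If N is larger than the number of items, the function returns an empty list.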


In [0]:
import re
from nltk.util import ngrams

s = "Natural-language processing (NLP) is an area of computer science " \
    "and artificial intelligence concerned with the interactions " \
    "between computers and human (natural) languages."

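s = s.lower()
# Optional: replace every character that is not a letter, digit, or whitespace with a space.
# re (regular expressions): https://www.tutorialspoint.com/python/python_reg_expressions.htm
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
# Splitting on a single space leaves empty strings where spaces were doubled, so drop them.
tokens = [token for token in s.split(" ") if token != ""]
# Build every 5-gram (sequence of 5 consecutive tokens).
output = list(ngrams(tokens, 5))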


In [0]:
print('s',s)
print(s.split(" "))
print('tokens',tokens)

s natural language processing  nlp  is an area of computer science and artificial intelligence concerned with the interactions between computers and human  natural  languages 
['natural', 'language', 'processing', '', 'nlp', '', 'is', 'an', 'area', 'of', 'computer', 'science', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '', 'natural', '', 'languages', '']
tokens ['natural', 'language', 'processing', 'nlp', 'is', 'an', 'area', 'of', 'computer', 'science', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages']


In [0]:
output

[('natural', 'language', 'processing', 'nlp', 'is'),
 ('language', 'processing', 'nlp', 'is', 'an'),
 ('processing', 'nlp', 'is', 'an', 'area'),
 ('nlp', 'is', 'an', 'area', 'of'),
 ('is', 'an', 'area', 'of', 'computer'),
 ('an', 'area', 'of', 'computer', 'science'),
 ('area', 'of', 'computer', 'science', 'and'),
 ('of', 'computer', 'science', 'and', 'artificial'),
 ('computer', 'science', 'and', 'artificial', 'intelligence'),
 ('science', 'and', 'artificial', 'intelligence', 'concerned'),
 ('and', 'artificial', 'intelligence', 'concerned', 'with'),
 ('artificial', 'intelligence', 'concerned', 'with', 'the'),
 ('intelligence', 'concerned', 'with', 'the', 'interactions'),
 ('concerned', 'with', 'the', 'interactions', 'between'),
 ('with', 'the', 'interactions', 'between', 'computers'),
 ('the', 'interactions', 'between', 'computers', 'and'),
 ('interactions', 'between', 'computers', 'and', 'human'),
 ('between', 'computers', 'and', 'human', 'natural'),
 ('computers', 'and', 'human', 'natural', 'languages')]

# Task
1. Try implementing N-grams on an article fetched from the web (https://www.whitehouse.gov/briefings-statements/). Also try it out with spaCy.
2. Find out various applications of N-grams.

# Named Entity Recognition (NER)

Named entity recognition is probably the first step towards information extraction. It seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.
1. Named entity: a real-world object, such as a person, location, organization, or product, that can be denoted with a proper name.

In [0]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

In [0]:
ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

In [0]:
def preprocess(sent):
    # Split the sentence into word tokens.
    sent = nltk.word_tokenize(sent)
    print('Word Tokenize:',sent)
    # Attach a part-of-speech tag to each token.
    sent = nltk.pos_tag(sent)
    return sent

In [0]:
import nltk
# Uncomment on the first run to download the tokenizer and POS tagger data.
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
sent = preprocess(ex)
sent

Word Tokenize: ['European', 'authorities', 'fined', 'Google', 'a', 'record', '$', '5.1', 'billion', 'on', 'Wednesday', 'for', 'abusing', 'its', 'power', 'in', 'the', 'mobile', 'phone', 'market', 'and', 'ordered', 'the', 'company', 'to', 'alter', 'its', 'practices']


[('European', 'JJ'),
 ('authorities', 'NNS'),
 ('fined', 'VBD'),
 ('Google', 'NNP'),
 ('a', 'DT'),
 ('record', 'NN'),
 ('$', '$'),
 ('5.1', 'CD'),
 ('billion', 'CD'),
 ('on', 'IN'),
 ('Wednesday', 'NNP'),
 ('for', 'IN'),
 ('abusing', 'VBG'),
 ('its', 'PRP$'),
 ('power', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mobile', 'JJ'),
 ('phone', 'NN'),
 ('market', 'NN'),
 ('and', 'CC'),
 ('ordered', 'VBD'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('to', 'TO'),
 ('alter', 'VB'),
 ('its', 'PRP$'),
 ('practices', 'NNS')]

Next, we implement noun phrase chunking to identify named entities, using a regular expression made up of rules that indicate how sentences should be chunked.

#### CHUNKING:
Chunking is the process of extracting phrases from unstructured text. Instead of using simple tokens, which may not represent the actual meaning of the text, it is advisable to treat a phrase such as “South Africa” as a single unit rather than as the separate words ‘South’ and ‘Africa’.
1. Chunk: a group of bits of information.
2. https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb



Chunking works on top of POS tagging: it takes POS tags as input and produces chunks as output. As with POS tags, there is a standard set of chunk tags, such as noun phrase (NP), verb phrase (VP), etc. Chunking is very important when you want to extract information such as locations and person names from text. In NLP this is called named entity extraction.

### Noun Phrase Chunking

Our chunk pattern consists of one rule: a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.

In [0]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

In [0]:
cp = nltk.RegexpParser(pattern)  # build a chunk parser from the grammar
cs = cp.parse(sent)              # chunk the POS-tagged sentence
# The output can be read as a tree or hierarchy, with S (sentence) as the first level.
print(cs)

(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)


In [0]:
from nltk.chunk import conlltags2tree, tree2conlltags  # CoNLL, the Conference on Natural Language Learning, is SIGNLL's yearly meeting.
from pprint import pprint
iob_tagged = tree2conlltags(cs)  # Convert a tree to the CoNLL IOB tag format.
pprint(iob_tagged)  # IOB stands for inside, outside, beginning.

# The B- prefix before a tag indicates that the token begins a chunk, and an I- prefix indicates that the token is inside a chunk.
# An O tag indicates that a token belongs to no chunk.

[('European', 'JJ', 'O'),
 ('authorities', 'NNS', 'O'),
 ('fined', 'VBD', 'O'),
 ('Google', 'NNP', 'O'),
 ('a', 'DT', 'B-NP'),
 ('record', 'NN', 'I-NP'),
 ('$', '$', 'O'),
 ('5.1', 'CD', 'O'),
 ('billion', 'CD', 'O'),
 ('on', 'IN', 'O'),
 ('Wednesday', 'NNP', 'O'),
 ('for', 'IN', 'O'),
 ('abusing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('power', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('mobile', 'JJ', 'I-NP'),
 ('phone', 'NN', 'I-NP'),
 ('market', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('company', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('alter', 'VB', 'O'),
 ('its', 'PRP$', 'O'),
 ('practices', 'NNS', 'O')]
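
To see how the B-, I- and O tags encode chunk boundaries, the triples can be folded back into phrases. The sketch below is plain Python and does not use NLTK; `iob_to_chunks` is a hypothetical helper name, and the sample data is a slice copied from the output above.

```python
def iob_to_chunks(iob_triples):
    """Group (word, pos, iob) triples into (label, phrase) pairs (hypothetical helper)."""
    chunks = []
    current_label, current_words = None, []
    for word, _pos, iob in iob_triples:
        if iob == "O":
            prefix, label = "O", None
        else:
            # "B-NP" -> prefix "B", label "NP"
            prefix, _, label = iob.partition("-")
        # A B- tag always opens a chunk; an I- tag with a different label does too.
        starts_new = prefix == "B" or (prefix == "I" and label != current_label)
        if prefix == "O" or starts_new:
            # Close the chunk that was being collected, if any.
            if current_words:
                chunks.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
        if prefix in ("B", "I"):
            current_label = label
            current_words.append(word)
    # Close a chunk that runs to the end of the sentence.
    if current_words:
        chunks.append((current_label, " ".join(current_words)))
    return chunks


sample = [
    ("its", "PRP$", "O"),
    ("power", "NN", "B-NP"),
    ("in", "IN", "O"),
    ("the", "DT", "B-NP"),
    ("mobile", "JJ", "I-NP"),
    ("phone", "NN", "I-NP"),
    ("market", "NN", "B-NP"),
    ("and", "CC", "O"),
]
chunks = iob_to_chunks(sample)
print(chunks)
```

Note that "market" carries its own B-NP tag, so it becomes a separate chunk rather than joining "the mobile phone". This follows from the pattern, which ends a noun phrase at the first NN.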


With the function nltk.ne_chunk(), we can recognize named entities using a classifier. The classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

In [0]:
from nltk import ne_chunk
# nltk.download('maxent_ne_chunker')
# nltk.download('words')
ne_tree = ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)  # GPE stands for geo-political entity

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)
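
In the tree above, each recognized entity is a labelled subtree, while ordinary tokens are plain (word, tag) pairs. A minimal sketch of pulling the entities out is shown below. It assumes NLTK is installed; the tree is built by hand from the first few nodes of the output above so that no model download is needed, and `extract_entities` is a hypothetical helper name.

```python
from nltk.tree import Tree

# Hand-built stand-in for the start of the ne_chunk output shown above.
ne_tree_sample = Tree("S", [
    Tree("GPE", [("European", "JJ")]),
    ("authorities", "NNS"),
    ("fined", "VBD"),
    Tree("PERSON", [("Google", "NNP")]),
    ("a", "DT"),
    ("record", "NN"),
])


def extract_entities(tree):
    """Return (entity text, entity label) pairs from a chunked sentence tree."""
    entities = []
    for node in tree:
        # Entities are Tree nodes; untagged tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


entities = extract_entities(ne_tree_sample)
print(entities)
```

The result also makes the classifier's mistakes easy to spot: in the output above, Google is labelled PERSON rather than ORGANIZATION.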


## Task
1. Implement NER on an article from a White House press briefing.


# String Matching

In computer science, fuzzy string matching is the technique of finding strings that match a pattern approximately (rather than exactly). In other words, fuzzy string matching is a type of search that will find matches even when users misspell words or enter only partial words for the search. It is also known as approximate string matching.

1. https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49
2. A typical application is a spell checker and typo corrector. For example, when a user types “Missisaga” into Google, a list of hits is returned along with “Showing results for mississauga”. That is, the search query returns results even if the user input contains additional or missing characters, or other types of spelling errors.
3. Fuzzywuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences in a simple-to-use package.
4. Levenshtein distance: a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
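
The definition in item 4 can be turned into a short dynamic-programming routine. The following is a sketch in plain Python, not Fuzzywuzzy's own implementation; the function names and the list of candidate city names are made up for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    # Keep the shorter string in b so each row is as short as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def closest_match(query, candidates):
    """Return the candidate with the smallest case-insensitive edit distance."""
    return min(candidates, key=lambda c: levenshtein(query.lower(), c.lower()))


cities = ["Mississauga", "Missouri", "Minneapolis"]  # hypothetical candidate list
distance = levenshtein("missisaga", "mississauga")
best = closest_match("Missisaga", cities)
print(distance, best)
```

Here "missisaga" is two insertions away from "mississauga" (one "s" and one "u"), which is why it is the closest of the three candidates.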