The process of assigning a syntactic label (a part-of-speech tag) to each token in a sentence
based on its definition and its context. For example:
Input: “Santiago likes playing football”
Output:
“Santiago” => NOUN
“likes” => VERB
“playing” => VERB
“football” => NOUN

In [None]:
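import nltk
from nltk import word_tokenize

# Requires the NLTK tokenizer and tagger data (fetch with nltk.download if missing)
tokens = word_tokenize("And now for something completely different. This is Canada")
nltk.pos_tag(tokens)  # list of (token, tag) pairs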

'''Chunking'''
Grouping tokens into phrases, most commonly noun phrases. For example:
Input: “South Africa is a country”
Output: “South Africa”

In [None]:
'''Chunking'''
import nltk
from nltk import word_tokenize

text = "South Africa is a country"
text = nltk.word_tokenize(text)
text

In [None]:
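try:
    # Tag each token, then group the tags with a regular-expression grammar
    tagged = nltk.pos_tag(text)
    # Chunk: optional nouns and verbs, one or more proper nouns, then an optional noun
    chunkGram = r"""Chunk: {<NN.?>*<VB.?>*<NNP>+<NN>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    chunked.draw()  # opens a GUI window, so it needs a display
except Exception as e:
    print(str(e))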

'''Collocations are combinations of words that occur together more
often than would be expected by chance. Lexical association measures
are formulas that quantify the degree of association between the
components of a collocation by computing an association score
(metric) for each candidate. For example:
“to make the bed” => [to make][the bed]
“to do homework” => [to do][homework]
“to take a risk” => [to take][a risk]
“to give someone advice” => [to give someone][advice]'''
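One widely used lexical association measure is pointwise mutual information (PMI), which compares how often two words occur next to each other with how often they would if they were independent. The sketch below computes it by hand for adjacent word pairs. The function name `bigram_pmi` and the toy token list are assumptions made for illustration, and the probabilities are simple relative frequencies, so scores from a library implementation may differ in detail.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI score} for every adjacent word pair in tokens."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = len(tokens) - 1
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / n_pairs              # relative frequency of the pair
        p_w1 = word_counts[w1] / n_words      # relative frequency of each word
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy input (an assumption): the example sentence, lowercased, without punctuation
tokens = "i do not like green eggs and ham i do not like them sam i am".split()
pmi_scores = bigram_pmi(tokens)

# Show the pairs with the strongest association first
for pair, score in sorted(pmi_scores.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(score, 3))
```

Pairs made of rare words that always appear together score highest, which is why PMI is usually combined with a minimum frequency filter on real corpora.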

In [None]:
'''Collocation extraction'''
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download('genesis')
bigram_measures = BigramAssocMeasures()

# Finder over a real corpus, kept under its own name so it is not overwritten below
genesis_finder = BigramCollocationFinder.from_words(nltk.corpus.genesis.words('english-web.txt'))

# Finder over a short example sentence
text = "I do not like green eggs and ham, I do not like them Sam I am!"
tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)
# Score every bigram by its relative frequency in the token stream
scored = finder.score_ngrams(bigram_measures.raw_freq)
sorted(bigram for bigram, score in scored)

*** Named entity recognition ***
Identification of entities of predefined types (such as people,
organizations, and locations) in a sentence. For example:
Input: “When Michael Jordan was at the peak of his powers as an
NBA superstar, his Chicago Bulls team were mowing down the
competition, winning six National Basketball Association titles”.
Output:
“Chicago Bulls”
“Michael Jordan”
“National Basketball Association”
Refer: https://www.geeksforgeeks.org/python-named-entity-recognition-ner-using-spacy/

Dependency Parsing: Building a syntactic tree of a sentence. A constituency (phrase-structure)
tree divides the text into sub-phrases: non-terminals in the tree are types of
phrases and terminals are the words in the sentence. A dependency tree instead
links each word directly to the word it depends on. For the simple
sentence "John sees Bill", a dependency analysis is:

              sees
                |
        +-------+-------+
Subject |               | Object
        |               |
      John            Bill
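The analysis above can also be written down as a list of (head, relation, dependent) arcs. The sketch below hard-codes the arcs for "John sees Bill" instead of running a parser; the `Arc` tuple and the helpers `children_of` and `find_root` are illustrative names, not part of any library.

```python
from collections import namedtuple

# One labelled edge of the tree: head word -> dependent word
Arc = namedtuple("Arc", ["head", "relation", "dependent"])

# Hand-written arcs for "John sees Bill" (an assumption, not parser output)
arcs = [
    Arc(head="sees", relation="Subject", dependent="John"),
    Arc(head="sees", relation="Object", dependent="Bill"),
]

def children_of(head, arcs):
    """Return the (relation, dependent) pairs attached to a head word."""
    return [(a.relation, a.dependent) for a in arcs if a.head == head]

def find_root(arcs):
    """Return the single head that never appears as a dependent, else None."""
    dependents = {a.dependent for a in arcs}
    roots = {a.head for a in arcs if a.head not in dependents}
    return roots.pop() if len(roots) == 1 else None

root = find_root(arcs)
for relation, word in children_of(root, arcs):
    print(f"{root} --{relation}--> {word}")
```

Storing arcs as flat tuples keeps the example short; a real parser would also record token positions so that repeated words stay distinct.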

Semantic Role Labelling: SRL assigns labels to the words or phrases of a sentence that indicate their semantic role
(such as agent, predicate, theme, or location). For example:
Input: “The police officer detained the suspect at the scene of the crime”
Output:
“The police officer” => AGENT
“detained” => PREDICATE
“the suspect” => THEME
“at the scene of the crime” => LOCATION


Bag of Words: A method used in natural language processing to represent a document by the words it contains and how
often each occurs, where the order of words (grammar) is not important.
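As a minimal sketch of the idea, the block below turns each document into a vector of word counts over a shared vocabulary. The helper names `tokenize` and `bag_of_words` and the two sample sentences are assumptions made for illustration; real pipelines usually rely on a library vectorizer instead.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Shared, alphabetically sorted vocabulary across all documents
    vocabulary = sorted(set().union(*counts)) if counts else []
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors

# Two sample sentences (assumptions) that differ only in word order
docs = ["The dog bit the man", "The man bit the dog"]
vocabulary, vectors = bag_of_words(docs)

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(vector, doc)
```

Both sentences map to the same vector even though their meanings differ, which is exactly the word-order information a bag of words discards.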

In [None]:
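import spacy

# Named entity recognition with spaCy. The model name below is an assumption;
# install it with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

piano_class_text = ('Great Piano Academy is situated'
                    ' in Mayfair or the City of London and has'
                    ' world-class piano instructors.')
piano_class_doc = nlp(piano_class_text)
for ent in piano_class_doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_, spacy.explain(ent.label_))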

In [None]:
'''Word Frequency'''
#nltk.download('gutenberg')
#nltk.download('inaugural')
#nltk.download('webtext')
#nltk.download('nps_chat')
#nltk.download('treebank')
from nltk.book import *
fdist1 = FreqDist(text1)
fdist1

Collocations and Bigrams
A collocation is a sequence of words that occur together unusually often. Thus “red wine” is a collocation, whereas “the
wine” is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses;
for example, “maroon wine” sounds distinctly odd.

In [None]:
from nltk.util import bigrams
list(bigrams(['more', 'is', 'said', 'than', 'done']))

In [None]:
'''Ngrams'''
import re
from nltk.util import ngrams
s = "Natural-language processing (NLP) is an area of computer science " \
"and artificial intelligence concerned with the interactions " \
"between computers and human (natural) languages."
s = s.lower()
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
tokens = s.split()  # split on any whitespace; empty strings are dropped
output = list(ngrams(tokens, 3))  # change 3 to the desired n-gram size
output