--------------------------------------------------------------------------------------------------------------------

#### Corpora, Tokens, and Types


1. `Corpus :` A corpus usually contains raw text (in ASCII or UTF-8) and any metadata associated with the text.

2.  `Tokens :` Tokens are contiguous units of text. In English they roughly correspond to words and numeric sequences separated by white-space characters or punctuation.

3.  `Instance :` In machine learning parlance, the text along with its metadata is called an instance or data point. 

![Alt text](images/nlpp_0201.png)
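To make these three terms concrete, here is a minimal sketch of a corpus as a plain Python list of instances. The field names (`text`, `metadata`) and the sample values are assumptions chosen for illustration, not a standard format.

```python
# A toy corpus: each instance pairs raw text with its metadata.
# The field names and sample values are illustrative assumptions.
corpus = [
    {"text": "Mary slapped the green witch.", "metadata": {"source": "example", "lang": "en"}},
    {"text": "Sruthi, don't slap the witch", "metadata": {"source": "example", "lang": "en"}},
]

for instance in corpus:
    # Naive whitespace split, only to show where tokens come from.
    tokens = instance["text"].split()
    print(len(tokens), tokens)
```

The whitespace split leaves punctuation glued to words (`witch.`), which is one reason a proper tokenizer is usually needed.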


--------------------------------------------------------------------------------------------------------------------


#### Tokenisation

The process of breaking a text down into tokens is called `tokenization`.

--------------------------------------------------------------------------------------------------------------------

In [11]:
# make sure to install spaCy and download the model first

'''
Use the following commands to install spaCy in a conda environment:

1. conda install -c conda-forge spacy
2. python -m spacy download en_core_web_sm

'''
import spacy

nlp = spacy.load('en_core_web_sm')

text = "Sruthi, don't slap the witch"

print([str(token) for token in nlp(text)])

['Sruthi', ',', 'do', "n't", 'slap', 'the', 'witch']


We can also use the NLTK library to perform tokenisation, as in the example below.

In [9]:

from nltk.tokenize import TweetTokenizer

tweet = "Hello Mister how do you do?"

tokeniser = TweetTokenizer()

print(f" Tokens : {tokeniser.tokenize(tweet)}")


 Tokens : ['Hello', 'Mister', 'how', 'do', 'you', 'do', '?']
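For comparison, the "split on white space or punctuation" idea can be sketched with a single regular expression. This is a deliberately naive tokenizer (the function name `simple_tokenize` is made up for this note), not what spaCy or NLTK do internally.

```python
import re

def simple_tokenize(text):
    # \w+ grabs runs of letters, digits and underscores;
    # [^\w\s] grabs each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Sruthi, don't slap the witch"))
# ['Sruthi', ',', 'don', "'", 't', 'slap', 'the', 'witch']
```

Note how `don't` is cut into three pieces here, whereas spaCy produced `do` and `n't`. Library tokenizers carry language-specific rules that a bare regex lacks.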


--------------------------------------------------------------------------------------------------------------------

#### Notes

1. `Types` are unique tokens present in a corpus. 

2. The set of all types in a corpus is its `vocabulary` or `lexicon`. 

3. Words can be distinguished as `content words` and `stopwords`. 

4. Stopwords such as articles and prepositions serve a mostly grammatical purpose, like filler that holds the content words together.
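The notes above can be sketched in a few lines of plain Python. The stopword set below is a tiny hand-picked assumption for illustration, not a standard list.

```python
tokens = "real madrid is the greatest football club in the history of football".split()

# Types are the unique tokens; together they form the vocabulary (lexicon).
vocabulary = sorted(set(tokens))

# A tiny hand-picked stopword set (an assumption, not a standard list).
stopwords = {"is", "the", "in", "of"}
content_words = [t for t in tokens if t not in stopwords]

print(f"{len(tokens)} tokens, {len(vocabulary)} types")
print(content_words)
```

Here `the` and `football` each occur twice, so 12 tokens collapse into 10 types.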


- Understanding the linguistics of a language and applying that knowledge to solve NLP problems is called `feature engineering`. 

--------------------------------------------------------------------------------------------------------------------

#### Unigrams, Bigrams, Trigrams, …, N-grams

- N-grams are fixed-length (n) consecutive token sequences occurring in the text. A `bigram` has two tokens, a `unigram` one.


In [8]:
def n_grams(text, n):
    '''
    Takes a list of tokens (or a string) and a number n, and returns a list of n-grams
    '''
    return [text[i:i+n] for i in range(len(text) - n+1)]

## Let's start with a sentence
sentence  = "Real madrid is the greatest football club in the history of football"

## Now let's use the spaCy model to collect the tokens in this sentence
## Make sure you have loaded the spaCy model: nlp = spacy.load('en_core_web_sm')
tokens = [str(token) for token in nlp(sentence)]

print(f" Tokens in the sentence : {tokens}\n")

## Now that we have collected the tokens, we can extract n-grams from the token list

print(f" n-grams of length 2 (bigrams): {n_grams(tokens, 2)}")


 Tokens in the sentence : ['Real', 'madrid', 'is', 'the', 'greatest', 'football', 'club', 'in', 'the', 'history', 'of', 'football']

 n-grams of length 2 (bigrams): [['Real', 'madrid'], ['madrid', 'is'], ['is', 'the'], ['the', 'greatest'], ['greatest', 'football'], ['football', 'club'], ['club', 'in'], ['in', 'the'], ['the', 'history'], ['history', 'of'], ['of', 'football']]
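The same sliding-window idea works for any n. The sketch below repeats the function so it runs on its own without spaCy, using a hand-written token list as input.

```python
def n_grams(tokens, n):
    # Slide a window of width n over the sequence;
    # there are len(tokens) - n + 1 such windows.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = ["Mary", "slapped", "the", "green", "witch"]

for n in (1, 2, 3):
    print(n, n_grams(tokens, n))

# Asking for more tokens than the sequence has yields no n-grams.
print(n_grams(tokens, 6))
```

With 5 tokens this gives 5 unigrams, 4 bigrams and 3 trigrams, and an empty list once n exceeds the number of tokens.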


--------------------------------------------------------------------------------------------------------------------

#### Lemmas and Stems

1. `Lemmas` are root forms of words. Consider the verb fly. It can be inflected into many different words (`flew, flies, flown, flying,` and so on), and fly is the lemma for all of these seemingly different words.


--------------------------------------------------------------------------------------------------------------------



In [19]:
import spacy 

nlp = spacy.load('en_core_web_sm')

## Let's start with a sentence
sentence  = "Real madrid is the greatest football club in the history of football"

## We can also create a simple lemmatizer function which takes a sentence and prints the lemma of each token

def show_lemma(sentence):
    '''
    Takes a sentence and prints each token with its lemma
    '''
    doc = nlp(sentence)

    for token in doc:
        print('{} ---> {}'.format(token, token.lemma_))


show_lemma("he was running late")

he ---> he
was ---> be
running ---> run
late ---> late


--------------------------------------------------------------------------------------------------------------------

- `spaCy`, for example, extracts lemmas with predefined lookup tables and rules, which for English draw on `WordNet`, but `lemmatization` can also be framed as a machine learning problem requiring an understanding of the `morphology of the language`.
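Dictionary-based lemmatization can be pictured as a table lookup. The sketch below uses a hand-written table (`LEMMA_TABLE` and `lookup_lemma` are hypothetical names, and the table is far smaller than any real resource); it is not how spaCy is implemented.

```python
# A hand-written lookup table (hypothetical, for illustration only).
LEMMA_TABLE = {
    "was": "be",
    "is": "be",
    "running": "run",
    "flew": "fly",
    "flies": "fly",
    "flown": "fly",
}

def lookup_lemma(token):
    # Fall back to the lowercased token when the table has no entry.
    word = token.lower()
    return LEMMA_TABLE.get(word, word)

print([lookup_lemma(t) for t in "he was running late".split()])
# ['he', 'be', 'run', 'late']
```

A pure lookup returns unseen inflections unchanged, which is why a learned model of morphology is an appealing alternative.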


### NOTE : We are not going to talk about `stemming` here 

--------------------------------------------------------------------------------------------------------------------



#### POS Tagging

In [20]:
import spacy 

nlp = spacy.load('en_core_web_sm')

doc = nlp(sentence)

for token in doc:
    print('{} - {}'.format(token, token.pos_))

Real - ADJ
madrid - PROPN
is - AUX
the - DET
greatest - ADJ
football - NOUN
club - NOUN
in - ADP
the - DET
history - NOUN
of - ADP
football - NOUN


--------------------------------------------------------------------------------------------------------------------

Often, we need to label a `span of text`; that is, a contiguous `multitoken` boundary. For example, consider the sentence, “Mary slapped the green witch.” We might want to identify the `noun phrases (NP)` and `verb phrases (VP)` in it, as shown here:

[NP Mary] [VP slapped] [NP the green witch].


This is called `chunking` or `shallow parsing`. 


`Shallow parsing` aims to derive higher-order units composed of the grammatical atoms, like nouns, verbs, adjectives, and so on. It is possible to write regular expressions over the part-of-speech tags to approximate shallow parsing if you do not have data to train models for shallow parsing. 
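A rough sketch of the regular-expression approach follows. The part-of-speech tags are hand-assigned here (an assumption; in practice they would come from a tagger), and the noun-phrase pattern is one simple choice among many.

```python
import re

# Hand-assigned universal POS tags for the example sentence (an assumption;
# in practice they would come from a tagger).
tagged = [
    ("Mary", "PROPN"), ("slapped", "VERB"), ("the", "DET"),
    ("green", "ADJ"), ("witch", "NOUN"), (".", "PUNCT"),
]

# Optional determiner, any number of adjectives, then one or more nouns or proper nouns.
NP_PATTERN = re.compile(r"\b(?:DET )?(?:ADJ )*(?:(?:NOUN|PROPN) )+")

def regex_noun_chunks(tagged_tokens):
    # Write each tag followed by a space, so counting spaces converts
    # character offsets in the tag string back into token positions.
    tag_string = "".join(tag + " " for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        start = tag_string[:match.start()].count(" ")
        end = tag_string[:match.end()].count(" ")
        chunks.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return chunks

print(regex_noun_chunks(tagged))
# ['Mary', 'the green witch']
```

The pattern only covers the simplest noun phrases; anything with conjunctions, possessives or nested phrases would need a richer pattern or a trained chunker.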

![Alt text](images/nlpp_0203.png)


Parse trees indicate how different grammatical units in a sentence are related hierarchically. The parse tree in this figure shows what’s called a constituent parse.

--------------------------------------------------------------------------------------------------------------------


In [23]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc  = nlp("Mary slapped the green witch.")

for chunk in doc.noun_chunks:
    print('{} - {}'.format(chunk, chunk.label_))

Mary - NP
the green witch - NP


--------------------------------------------------------------------------------------------------------------------

#### Word Senses and Semantics

Words have meanings, and often more than one. 

The different meanings of a word are called its `senses`. 

`WordNet`, a long-running lexical resource project from Princeton University, aims to catalog the senses of all (well, most) words in the English language, along with other lexical relationships. For example, consider a word like “plane.” The figure below shows the different senses in which this word could be used.

![Alt text](images/nlpp_0205.png)
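The senses of a word can also be listed programmatically through NLTK's WordNet interface. This is a sketch that assumes NLTK is installed and the WordNet data has been downloaded (`nltk.download('wordnet')`); the helper name `describe_senses` is made up for this note, and the code falls back to an empty list when the resource is missing.

```python
def describe_senses(word, lookup):
    # `lookup` is any callable returning synset-like objects
    # that expose .name() and .definition(), such as wn.synsets.
    return [(synset.name(), synset.definition()) for synset in lookup(word)]

try:
    from nltk.corpus import wordnet as wn  # needs nltk plus nltk.download('wordnet')
    senses = describe_senses("plane", wn.synsets)
except (ImportError, LookupError):
    senses = []  # NLTK or the WordNet data is not available

for name, gloss in senses:
    print(f"{name}: {gloss}")
```

Each printed line corresponds to one sense of the word, identified by a synset name and followed by its dictionary-style gloss.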


