<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#NLP-v/s-Computational-Linguistics" data-toc-modified-id="NLP-v/s-Computational-Linguistics-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>NLP v/s Computational Linguistics</a></span></li><li><span><a href="#Corpora,-Tokens-&amp;-Types" data-toc-modified-id="Corpora,-Tokens-&amp;-Types-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Corpora, Tokens &amp; Types</a></span></li><li><span><a href="#Unigrams,-Bigrams,-Trigrams,-...,-N-grams" data-toc-modified-id="Unigrams,-Bigrams,-Trigrams,-...,-N-grams-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Unigrams, Bigrams, Trigrams, ..., N-grams</a></span></li><li><span><a href="#Lemmas-&amp;-Stems" data-toc-modified-id="Lemmas-&amp;-Stems-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Lemmas &amp; Stems</a></span></li><li><span><a href="#Categorizing-Words:-POS-Tagging" data-toc-modified-id="Categorizing-Words:-POS-Tagging-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Categorizing Words: POS Tagging</a></span></li><li><span><a href="#Categorizing-Spans:-Chunking-&amp;-Named-Entity-Recognition" data-toc-modified-id="Categorizing-Spans:-Chunking-&amp;-Named-Entity-Recognition-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Categorizing Spans: Chunking &amp; Named Entity Recognition</a></span></li><li><span><a href="#Structure-of-Sentences" data-toc-modified-id="Structure-of-Sentences-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Structure of Sentences</a></span></li></ul></div>

## NLP v/s Computational Linguistics

_NLP_ aims to develop methods for solving practical problems involving language, such as information extraction, automatic speech recognition, machine translation, sentiment analysis, question answering, and summarization. _Computational Linguistics (CL)_, on the other hand, employs computational methods to understand the properties of human language.

## Corpora, Tokens & Types

- _Corpus_ (plural _corpora_): A text dataset used by NLP methods.
- _Tokens_: Contiguous units of grouped characters, such as words and punctuation marks.
- _Types_: The unique tokens present in a corpus.
- _Instance or Data Point_: A text along with its metadata.
- _Dataset_: A collection of instances; another name for a corpus.
- _Tokenization_: The process of breaking a text down into tokens.
- _Vocabulary/Lexicon_: The set of all types in a corpus.
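
The difference between tokens and types is easiest to see by counting both. The following is a minimal sketch that assumes naive whitespace tokenization and a made-up sentence; neither comes from the notebook's own cells.

```python
# Hypothetical example sentence, tokenized naively on whitespace.
text = "the green witch slapped the other green witch"

tokens = text.split()        # every occurrence counts as a token
types = set(tokens)          # the unique tokens are the types
vocabulary = sorted(types)   # the vocabulary/lexicon of this tiny corpus

print(len(tokens))   # 8
print(len(types))    # 5
print(vocabulary)
```

The repeated words add tokens but no new types, so the vocabulary is usually much smaller than the token count.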

<img src="../images/figure_2_1.png" />

In [4]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
# Tokenizing text using spaCy
import spacy
nlp = spacy.load('en_core_web_sm')
text = "May, don't slap the green witch"
print(
    [
        str(token)
        for token in nlp(text.lower())
    ]
)

['may', ',', 'do', "n't", 'slap', 'the', 'green', 'witch']


In [8]:
# Tokenizing text using NLTK's TweetTokenizer
from nltk.tokenize import TweetTokenizer
tweet = "Snow White and the Seven Degrees #MakeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer()
print(
    tokenizer.tokenize(tweet.lower())
)

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']


## Unigrams, Bigrams, Trigrams, ..., N-grams

_N-grams_ are fixed-length (n) sequences of consecutive tokens occurring in the text. A unigram has one token, a bigram two, and a trigram three.

In [9]:
def n_grams(text, n):
    """Return every run of n consecutive items in text (a list of tokens or a string)."""
    return [
        text[i:i+n] for i in range(len(text)-n+1)
    ]

cleaned = [
    'mary', ',', "n't", 'slap', 'green', 'witch', '.'
]
print(n_grams(cleaned, 3))

[['mary', ',', "n't"], [',', "n't", 'slap'], ["n't", 'slap', 'green'], ['slap', 'green', 'witch'], ['green', 'witch', '.']]
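
N-grams are often counted rather than just listed. The sketch below is a variant of the sliding-window function above that returns tuples, so the n-grams are hashable and can go into a `Counter`. The token list is a made-up example, and the tuple-returning variant is an assumption made for illustration.

```python
from collections import Counter

def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token list with one repeated bigram.
tokens = "the green witch and the green frog".split()

bigrams = n_grams(tokens, 2)
counts = Counter(bigrams)
print(counts[("the", "green")])   # 2

# A window longer than the sequence yields no n-grams.
print(n_grams(tokens, 8))         # []
```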


## Lemmas & Stems

- _Lemmas_ are the root (dictionary) forms of words. _Lemmatization_ is the process of reducing tokens to their lemmas, which often reduces the dimensionality of the vector representation.
- _Stemming_ uses handcrafted rules to strip the endings of words and reduce them to a common form called a _stem_.
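
To make the contrast with lemmatization concrete, here is a toy stemmer built from a few handcrafted suffix rules. It is only a sketch: the rule list and the `min_stem_len` parameter are assumptions chosen for illustration, and it is not the Porter or Snowball algorithm.

```python
# Toy suffix rules, tried in order; NOT the Porter or Snowball rule set.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["slapped", "slapping", "witches", "geese", "was"]:
    print(word, "->", toy_stem(word))
```

Both `slapped` and `slapping` collapse to `slapp`, which is not a real word, while irregular forms such as `geese` and `was` pass through unchanged. A lemmatizer, by contrast, relies on vocabulary knowledge rather than suffix rules alone.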

In [10]:
doc = nlp("he was running late")
for token in doc:
    print(f"{token} -----> {token.lemma_}")

he -----> he
was -----> be
running -----> run
late -----> late


## Categorizing Words: POS Tagging

- _Part-of-speech (POS) tagging_ labels each token with its grammatical category, such as noun, verb, or adjective.

In [15]:
doc = nlp("Mary slapped the green witch")
for token in doc:
    print(f"{token} ----> {token.pos_}")

Mary ----> PROPN
slapped ----> VERB
the ----> DET
green ----> ADJ
witch ----> NOUN


## Categorizing Spans: Chunking & Named Entity Recognition

- _Chunking_ (shallow parsing) aims to derive higher-order units, such as noun phrases, composed of grammatical atoms like nouns, verbs, and adjectives.
- A _Named Entity_ is a string mention of a real-world concept, such as a person, location, organization, or drug name.

In [16]:
doc = nlp("Mary slapped the green witch")
for chunk in doc.noun_chunks:
    print(f"{chunk} ----> {chunk.label_}")

Mary ----> NP
the green witch ----> NP
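
The cell above covers chunking only. For named entities, the following toy sketch labels token spans by looking them up in a small dictionary (a gazetteer). The dictionary, the labels, and the `find_entities` helper are all hypothetical, and real NER systems typically use trained statistical models rather than plain lookup.

```python
# Hypothetical gazetteer mapping lowercased token sequences to entity labels.
GAZETTEER = {
    ("new", "york"): "LOCATION",
    ("mary",): "PERSON",
    ("acme", "corp"): "ORGANIZATION",
}

def find_entities(tokens):
    """Return (start, end, label) spans, preferring the longest match."""
    max_len = max(len(key) for key in GAZETTEER)
    lowered = [t.lower() for t in tokens]
    spans, i = [], 0
    while i < len(lowered):
        # Try the longest candidate span first, then shorter ones.
        for size in range(min(max_len, len(lowered) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + size]))
            if label:
                spans.append((i, i + size, label))
                i += size
                break
        else:
            i += 1  # no entity starts at this token
    return spans

tokens = "Mary flew to New York for Acme Corp".split()
for start, end, label in find_entities(tokens):
    print(" ".join(tokens[start:end]), "---->", label)
```

Like a noun chunk, each entity is a span of tokens with a label. The difference is that the label names a real-world category instead of a phrase type.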


## Structure of Sentences

- _Parsing_ identifies the relationships between the phrasal units found by shallow parsing.

<img src="../images/figure_2_2.png" />
<img src="../images/figure_2_3.png" />
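
One way to picture the result of parsing is as a tree in which the noun chunks found earlier (`Mary`, `the green witch`) are nested under larger phrases. The sketch below stores such a tree as nested tuples. The tree is hand-written for illustration, not the output of a parser, and the `leaves` and `bracketed` helpers are hypothetical.

```python
# Hand-written constituent tree (an assumption, not parser output):
# each node is (label, child, child, ...) and each leaf is a plain string.
tree = ("S",
        ("NP", "Mary"),
        ("VP", ("V", "slapped"),
               ("NP", "the", "green", "witch")))

def leaves(node):
    """Collect the words of the sentence from left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def bracketed(node):
    """Render the tree in bracketed notation."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [bracketed(c) for c in node[1:]]) + ")"

print(leaves(tree))
print(bracketed(tree))
```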