# **Key Components of an NLP Pipeline:**

*Sample Input: "I watched Amy playing badminton at the country club yesterday."*

**Text Normalization:** Transforming text into a consistent, standard format, such as converting to lowercase, removing punctuation, or expanding abbreviations.

*Sample Output: ["i watched amy playing badminton at the country club yesterday"]*
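As a rough illustration of the normalization step, the sketch below lowercases the sample sentence and strips ASCII punctuation using only the standard library. The helper name `normalize` is hypothetical, and real pipelines often normalize more (or less) than this, for example by expanding abbreviations.

```python
import string


def normalize(text: str) -> str:
    """Lowercase the text and remove ASCII punctuation (illustrative only)."""
    lowered = text.lower()
    # A translation table with a deletion set drops every ASCII punctuation character
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Collapse any whitespace runs left behind after removing punctuation
    return " ".join(no_punct.split())


sample = "I watched Amy playing badminton at the country club yesterday."
normalized = normalize(sample)
print([normalized])
```

Note that stripping punctuation this way also removes apostrophes and currency symbols, which may or may not be desirable depending on the downstream task.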

**Tokenization:** Segmenting text into words, punctuation marks, etc.

*Sample Output: ["I", "watched", "Amy", "playing", "badminton", "at", "the", "country", "club", "yesterday", "."]*
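A minimal way to reproduce this sample output is a regular expression that treats each run of word characters as one token and each punctuation mark as its own token. This is only a sketch (the function name `simple_tokenize` is made up for illustration); library tokenizers such as the ones used later in this notebook apply many more rules.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    # \w+      matches a run of letters, digits, or underscores (a "word")
    # [^\w\s]  matches a single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample = "I watched Amy playing badminton at the country club yesterday."
tokens = simple_tokenize(sample)
print(tokens)
```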

**Part-of-speech (POS) Tagging:** Assigning word types to tokens, like verb or noun.

*Sample Output: [PRON, VERB, PROPN, VERB, NOUN, ADP, DET, NOUN, NOUN, NOUN, PUNCT]*

**Dependency Parsing:** Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

*Sample Output: "watched" - main verb, "I" - subject, "Amy" - object, etc.*

**Named Entity Recognition (NER):** Labelling named “real-world” objects, like persons, companies or locations.

*Sample Output: "Amy" - PERSON, "country club" - LOCATION, "yesterday" - DATE.*

**Lemmatization:** Assigning the base forms of words.

*Sample Output: ["I", "watch", "Amy", "play", "badminton", "at", "the", "country", "club", "yesterday", "."]*

## **SpaCy**

Reference: https://spacy.io/usage/spacy-101

In [None]:
!pip show spacy

Name: spacy
Version: 3.7.6
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: catalogue, cymem, jinja2, langcodes, murmurhash, numpy, packaging, preshed, pydantic, requests, setuptools, spacy-legacy, spacy-loggers, srsly, thinc, tqdm, typer, wasabi, weasel
Required-by: en-core-web-sm, fastai


# **Linguistic Annotations**

Reference: https://spacy.io/usage/spacy-101#annotations


SpaCy offers a range of linguistic annotations to help us understand a text's grammatical structure, including word types (such as parts of speech) and the relationships between words (dependency labels).

In [None]:
# Import the spaCy package
import spacy

# Load the pre-trained pipeline "en_core_web_sm", a small English model
# that can predict parts of speech, dependencies, named entities, etc.
# spacy.load returns a 'Language' object, usually referred to as 'nlp'
# This 'nlp' object can be used to process text
nlp = spacy.load("en_core_web_sm")

# Process a string with the loaded pipeline
# The input text is passed to the 'nlp' object
# This results in a 'Doc' object which contains the processed text with annotations
doc = nlp("Jake lives on a beach in Italy and buys a popsicle for $10.")

# Loop through each token in the 'Doc' object
# A token is typically a word, punctuation mark, or symbol
for token in doc:
    # For each token, print the following:
    print("Token:", token.text)        # Print the original text of the token
    print("Part of Speech (POS):", token.pos_)  # Print the part of speech tag (e.g., noun, verb, etc.)
    print("Dependency Label:", token.dep_)  # Print the dependency label (e.g., object, subject, punctuation, etc.)

Token: Jake
Part of Speech (POS): PROPN
Dependency Label: nsubj
Token: lives
Part of Speech (POS): VERB
Dependency Label: ROOT
Token: on
Part of Speech (POS): ADP
Dependency Label: prep
Token: a
Part of Speech (POS): DET
Dependency Label: det
Token: beach
Part of Speech (POS): NOUN
Dependency Label: pobj
Token: in
Part of Speech (POS): ADP
Dependency Label: prep
Token: Italy
Part of Speech (POS): PROPN
Dependency Label: pobj
Token: and
Part of Speech (POS): CCONJ
Dependency Label: cc
Token: buys
Part of Speech (POS): VERB
Dependency Label: conj
Token: a
Part of Speech (POS): DET
Dependency Label: det
Token: popsicle
Part of Speech (POS): NOUN
Dependency Label: dobj
Token: for
Part of Speech (POS): ADP
Dependency Label: prep
Token: $
Part of Speech (POS): SYM
Dependency Label: nmod
Token: 10
Part of Speech (POS): NUM
Dependency Label: pobj
Token: .
Part of Speech (POS): PUNCT
Dependency Label: punct


*Note: SpaCy preserves the original text, including character positions (offsets), spaces, and formatting, even after splitting it into tokens, allowing you to reconstruct the text exactly as it was before processing.*
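To make the idea in this note concrete, here is a toy, standard-library sketch of non-destructive tokenization: each token is stored with its character offset and the whitespace that follows it, so the original string can be rebuilt exactly. This is not spaCy's implementation; in spaCy the corresponding information is exposed through attributes such as `token.idx`, `token.whitespace_`, and `token.text_with_ws`. The function name `tokenize_with_offsets` is hypothetical.

```python
import re


def tokenize_with_offsets(text: str) -> list[tuple[str, int, str]]:
    """Return (token_text, start_offset, trailing_whitespace) triples.

    Assumes the text does not start with whitespace.
    """
    matches = list(re.finditer(r"\w+|[^\w\s]", text))
    records = []
    for i, match in enumerate(matches):
        # Whitespace between this token and the next one (or the end of the text)
        next_start = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        records.append((match.group(), match.start(), text[match.end():next_start]))
    return records


text = "Jake lives on a beach in Italy and buys a popsicle for $10."
records = tokenize_with_offsets(text)

# Joining every token with its trailing whitespace restores the input exactly
rebuilt = "".join(token + whitespace for token, _, whitespace in records)
print(rebuilt == text)
```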

# **Tokenization**

Tokenization is the process of breaking down text into smaller meaningful components, such as words, punctuation marks, and other characters, known as tokens.

**Key Concepts:**

1. The input text is split into tokens based on whitespace characters (similar to text.split(' ') in Python).
2. SpaCy has a set of predefined rules called tokenizer exceptions. These rules determine how certain substrings should be tokenized.

Example: "don't" should be split into two tokens - "do" and "n't"; "U.K." should remain a single token.

3. SpaCy processes the text from left to right. For each whitespace-separated substring, it applies tokenizer exceptions and checks for prefixes (e.g., the opening quotation mark in "Hey"), suffixes (e.g., the exclamation mark in "Stop!"), and infixes (e.g., the hyphen in "self-help"). If a rule matches, the tokenizer splits the substring accordingly and continues processing the resulting pieces.
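The three steps above can be sketched as a toy tokenizer in plain Python. This is a simplified illustration, not spaCy's actual algorithm or rule set: the `EXCEPTIONS`, `PREFIXES`, `SUFFIXES`, and `INFIXES` tables are tiny hand-picked assumptions, whereas spaCy ships much larger language-specific rules.

```python
import re

# Tiny, hand-picked rule tables (assumptions for illustration only)
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = ('"', "(", "$")
SUFFIXES = ('"', ")", ".", ",", "!", "?")
INFIXES = ("-",)
INFIX_RE = re.compile("(" + "|".join(re.escape(i) for i in INFIXES) + ")")


def tokenize_chunk(chunk: str) -> list[str]:
    """Tokenize one whitespace-delimited substring."""
    prefixes, suffixes = [], []
    # Peel off prefixes and suffixes until an exception matches or nothing is left to peel
    while chunk and chunk not in EXCEPTIONS:
        if chunk.startswith(PREFIXES):
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif chunk.endswith(SUFFIXES):
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        else:
            break
    if chunk in EXCEPTIONS:
        middle = list(EXCEPTIONS[chunk])
    else:
        # Split on infixes, keeping the infix itself as a token
        middle = [piece for piece in INFIX_RE.split(chunk) if piece]
    return prefixes + middle + suffixes


def tokenize(text: str) -> list[str]:
    tokens = []
    for chunk in text.split():  # Step 1: split on whitespace
        tokens.extend(tokenize_chunk(chunk))  # Steps 2 and 3: exceptions and affixes
    return tokens


toy_tokens = tokenize('"Hey, don\'t skip the U.K. self-help aisle!"')
print(toy_tokens)
```

Checking exceptions before and after each peeling step is what keeps "U.K." whole even though it ends with a character that would otherwise be split off as a suffix.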


In [None]:
# Import the spaCy package
import spacy

# Load the pre-trained pipeline "en_core_web_sm"
# Consists of tokenization rules specific to the English language
nlp = spacy.load("en_core_web_sm")

# The input text is tokenized
doc = nlp("Jake lives on a beach in Italy and buys a popsicle for $10.")

# Loop through each token in the 'Doc' object
# The Doc object contains individual tokens that represent meaningful components such as words, punctuation marks, and other characters
for token in doc:
    # Print each token
    print(token.text)

Jake
lives
on
a
beach
in
Italy
and
buys
a
popsicle
for
$
10
.


# **Part-of-speech Tags and Dependencies**

SpaCy provides pre-trained pipelines consisting of statistical models trained on large datasets. These models predict the most likely POS tags, dependency labels, and other attributes for each token.

**Part-of-Speech Tagging:**
Classifies each word in a sentence as a noun, verb, adjective, etc. SpaCy provides both coarse-grained universal POS (UPOS) tags (`token.pos_`) and more detailed, fine-grained tags (`token.tag_`).

**Syntactic Dependencies:**
Describe the relationships between tokens in a sentence, identifying which token is the subject, object, or modifier of another token.
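As a small illustration of how these labels can be used, the sketch below reuses the (token, POS, dependency) triples printed in the Linguistic Annotations output earlier in this notebook and groups the tokens by dependency label. The triples are copied by hand so the example runs without a model; with spaCy loaded, the same values would come from `token.text`, `token.pos_`, and `token.dep_`.

```python
from collections import defaultdict

# (token, POS, dependency label) triples copied from the earlier cell output
annotations = [
    ("Jake", "PROPN", "nsubj"), ("lives", "VERB", "ROOT"), ("on", "ADP", "prep"),
    ("a", "DET", "det"), ("beach", "NOUN", "pobj"), ("in", "ADP", "prep"),
    ("Italy", "PROPN", "pobj"), ("and", "CCONJ", "cc"), ("buys", "VERB", "conj"),
    ("a", "DET", "det"), ("popsicle", "NOUN", "dobj"), ("for", "ADP", "prep"),
    ("$", "SYM", "nmod"), ("10", "NUM", "pobj"), (".", "PUNCT", "punct"),
]

# Group token texts by their dependency label
by_dep = defaultdict(list)
for text, pos, dep in annotations:
    by_dep[dep].append(text)

print("Root verb:", by_dep["ROOT"])
print("Subject:", by_dep["nsubj"])
print("Direct object:", by_dep["dobj"])

# POS tags can be filtered the same way, e.g. all verbs in the sentence
verbs = [text for text, pos, dep in annotations if pos == "VERB"]
print("Verbs:", verbs)
```

The labels alone do not say which token each word attaches to; that information lives in the parse tree itself, which the displaCy visualization in this section draws as arcs.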

In [None]:
# Import the spaCy package
import spacy

# Load the pre-trained pipeline "en_core_web_sm"
# Consists of models for POS tagging, dependency parsing, etc
nlp = spacy.load("en_core_web_sm")

# The input text is parsed and tagged
doc = nlp("Jake lives on a beach in Italy and buys a popsicle for $10.")

# Loop through each token in the 'Doc' object and print various attributes of each token:
    # token.text: Original word text
    # token.lemma_: Base form of the word (lemma)
    # token.pos_: Universal part-of-speech tag
    # token.tag_: Detailed part-of-speech tag
    # token.dep_: Syntactic dependency label
    # token.shape_: Word shape – capitalization, punctuation, digits, etc
    # token.is_alpha: Whether the token consists of alphabetic characters
    # token.is_stop: Whether the token is a stop word (common words that are often filtered out)
for token in doc:
    print(f"Token: {token.text:12} | Lemma: {token.lemma_:12} | POS: {token.pos_:6} | "
          f"Tag: {token.tag_:6} | Dep: {token.dep_:8} | Shape: {token.shape_:8} | "
          f"Alpha: {str(token.is_alpha):5} | Stop: {str(token.is_stop):5}")

Token: Jake         | Lemma: Jake         | POS: PROPN  | Tag: NNP    | Dep: nsubj    | Shape: Xxxx     | Alpha: True  | Stop: False
Token: lives        | Lemma: live         | POS: VERB   | Tag: VBZ    | Dep: ROOT     | Shape: xxxx     | Alpha: True  | Stop: False
Token: on           | Lemma: on           | POS: ADP    | Tag: IN     | Dep: prep     | Shape: xx       | Alpha: True  | Stop: True 
Token: a            | Lemma: a            | POS: DET    | Tag: DT     | Dep: det      | Shape: x        | Alpha: True  | Stop: True 
Token: beach        | Lemma: beach        | POS: NOUN   | Tag: NN     | Dep: pobj     | Shape: xxxx     | Alpha: True  | Stop: False
Token: in           | Lemma: in           | POS: ADP    | Tag: IN     | Dep: prep     | Shape: xx       | Alpha: True  | Stop: True 
Token: Italy        | Lemma: Italy        | POS: PROPN  | Tag: NNP    | Dep: pobj     | Shape: Xxxxx    | Alpha: True  | Stop: False
Token: and          | Lemma: and          | POS: CCONJ  | Tag: CC    

In [None]:
# Visualize dependencies

from spacy import displacy

displacy.render(doc, style="dep")

*Note: SpaCy provides a spacy.explain() function to explain the meaning of tags and labels.*

In [None]:
# Use the spacy.explain function to understand the label
spacy.explain("NNP")

'noun, proper singular'

# **Named Entities**

In [None]:
# Import the spaCy package
import spacy

# Load the pre-trained pipeline "en_core_web_sm"
# Consists of models to recognize named entities
nlp = spacy.load("en_core_web_sm")

# Process the input text using the NLP pipeline
# The input text is analyzed to identify named entities
doc = nlp("Jake lives on a beach in Italy and buys a popsicle for $10.")

# 'doc.ents' provides a sequence of 'Span' objects, each representing a named entity
# Loop through each named entity in the 'Doc' object and print the details of each named entity:
    # ent.text: Original entity text
    # ent.start_char: Character offset in the original text where the entity starts
    # ent.end_char: Character offset in the original text just after the entity ends (exclusive)
    # ent.label_: The entity's label, indicating the type of entity (e.g., ORG, GPE, MONEY)
for ent in doc.ents:
    print(f"Text: {ent.text} | Start: {ent.start_char} | End: {ent.end_char} | Label: {ent.label_}")

Text: Italy | Start: 25 | End: 30 | Label: GPE
Text: 10 | Start: 56 | End: 58 | Label: MONEY
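The start and end values printed above are character offsets into the original string, with the end offset being exclusive, so they can be used directly as Python slice bounds. The sketch below copies the offsets from the output by hand to show this without needing a model.

```python
text = "Jake lives on a beach in Italy and buys a popsicle for $10."

# (start_char, end_char, label) triples copied from the output above
entities = [(25, 30, "GPE"), (56, 58, "MONEY")]

# Slicing with [start:end] recovers exactly the entity text
spans = [text[start:end] for start, end, _ in entities]
for span, (_, _, label) in zip(spans, entities):
    print(f"{span!r} -> {label}")
```

In this particular output the MONEY span covers only "10"; the "$" sign sits at offset 55, just before the span.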


In [None]:
# Use the spacy.explain function to understand the label
spacy.explain("GPE")

'Countries, cities, states'

In [None]:
# Visualize named entities

from spacy import displacy

displacy.render(doc, style="ent")

# **Regular Expressions**

Reference: https://spacy.io/usage/rule-based-matching

Identify the month and day of each episode from the given text.

Input: https://www.thewrap.com/the-rookie-season-6-release-date-time-episodes-schedule/

**Regular expression** = r"([A-Za-z]+\.?\s\d{1,2})"

*   [A-Za-z]+ - One or more uppercase or lowercase letters (Month)
*   \\.? - An optional period (for abbreviated months such as "Feb.")
*   \s - A single whitespace character (Space between month and day)
*   \d{1,2} - One or two digits (Day)
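To see how the pieces of the pattern behave, the sketch below uses named groups to capture the month and day separately, with the optional `\.?` used in the code cell that follows. The sample string and the `MONTHS` set are made up for illustration. It also shows a limitation worth keeping in mind: the pattern accepts any word followed by a number, so a month whitelist may be needed on noisier text.

```python
import re

# Same structure as the notebook's pattern, with named groups for month and day
pattern = re.compile(r"(?P<month>[A-Za-z]+)\.?\s(?P<day>\d{1,2})")

# Hypothetical sample text, not taken from the article
sample = "Ep.1 airs Feb 20, Ep.3 airs March 5, and Season 6 has ten episodes."

matches = list(pattern.finditer(sample))
raw = [m.group(0) for m in matches]
print(raw)  # includes the false positive "Season 6"

# Optional tightening: keep only matches whose first group is a month name or 3-letter abbreviation
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}
MONTHS |= {name[:3] for name in MONTHS}

dates = [(m["month"], m["day"]) for m in matches if m["month"] in MONTHS]
print(dates)
```

On the release schedule used in this notebook the plain pattern is sufficient, since no other word in that text is followed by a space and a number.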

In [None]:
import re

text = """
Release Schedule:

ABC has revealed episode details for all episodes, which you can find below.

S.6 Ep.1: “Strike Back” – Feb 20
“In the aftermath of the assaults in the explosive season five finale, the team must now try to understand why they were targeted and if there is a bigger plan in place. Meanwhile, Nolan must survive his last shift before his wedding to Bailey.”

S.5 Ep.2: “The Hammer” – Feb 27 (100th Episode)
“The team comes together to celebrate John and Bailey’s wedding; meanwhile, Celina discovers a discrepancy in her case, leading to a new discovery. Elsewhere, Lucy and Tim’s relationship is put to the test.”

S.6 Ep.3: “Trouble in Paradise” – March 5
“Nolan and Bailey’s honeymoon is more of a nightmare than dream when it turns into an active crime scene. Meanwhile, Tim and Celina partner up and must uncover the identity of a John Doe.”

S.6 Ep.4: “Training Day” – March 26
“It’s Officer Aaron Thorsen’s first day back since the assault, and he’s tasked with a series of high-stress cases to determine whether he’s ready to work. Elsewhere, the team investigates a homicide case with a potential tie to the pentagram killer.”

S.6 Ep.5: “The Vow – April 2
“When a toddler is found at the scene of a crime, John and Bailey must decide whether to let the child go to a shelter for the night or care for her themselves. Meanwhile, when someone from his past returns, Tim disappears and leaves Lucy in the dark.”

S.6 Ep.6: “Secrets and Lies” – April 9
“Following their time as foster parents, Bailey has decided she wants to have a baby and forces John to reconsider their decision to not have children. Meanwhile, John and Celina discover a prison escapee whom they fear is out for revenge and race to find her before it is too late.”

S.6 Ep.7: “Crushed” – April 30
“When two teenagers go missing, it is up to the entire team to find the girls and uncover the truth about their disappearance. Meanwhile, Lopez and Harper are on a different kind of investigation – the search for the perfect nanny.”

S.6 Ep.8: “Punch Card” – May 7
“After a mafia-related mass casualty, the team is tasked to keep the peace at the hospital. Lucy and Celina work together to investigate the suspects behind the attack. Meanwhile, Tim and Aaron embark on a metro ops mission.”

S.6 Ep.9: “The Squeeze” – May 14
“Officer Nolan and Celina take on a special case; meanwhile, Monica enlists help to identify her attackers. Elsewhere, Lopez and Harper discover a connection to the trail of crimes.”

S.6 Ep.10: “Escape Plan” – May 21
“Sgt. Grey helps the team prepare for their biggest mission yet. Meanwhile, Aaron, Lopez, Celina, Tim and Smitty discover a surprising connection in their case.”
"""

# Regular expression to extract the dates
expression = r"([A-Za-z]+\.?\s\d{1,2})"

# Find all matches in the text
dates = re.findall(expression, text)

# Print the extracted dates
for i, date in enumerate(dates, 1):
    print(f"Date for Episode {i}: {date}")


Date for Episode 1: Feb 20
Date for Episode 2: Feb 27
Date for Episode 3: March 5
Date for Episode 4: March 26
Date for Episode 5: April 2
Date for Episode 6: April 9
Date for Episode 7: April 30
Date for Episode 8: May 7
Date for Episode 9: May 14
Date for Episode 10: May 21


## **NLTK**

Reference: https://www.nltk.org/

# **Tokenization**

In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download 'punkt' - a pre-trained model provided by nltk that helps in tokenizing text.
nltk.download('punkt')

# The input text is tokenized
text = "Jake lives on a beach in Italy and buys a popsicle for $10."

# It is split into a list of tokens
tokens = word_tokenize(text)

# Print each token
for token in tokens:
    print(token)

Jake
lives
on
a
beach
in
Italy
and
buys
a
popsicle
for
$
10
.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Lemmatization and Parts-of-speech Tags**

In [None]:
# Import NLTK and the helpers used in this cell
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet, stopwords
from nltk.tag import pos_tag

# Download the required NLTK data resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

# Converts NLTK POS tags to WordNet POS tags to achieve effective lemmatization
# WordNet recognizes adjectives, verbs, nouns, and adverbs
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'): # adjectives
        return wordnet.ADJ
    elif treebank_tag.startswith('V'): # verbs
        return wordnet.VERB
    elif treebank_tag.startswith('N'): # nouns
        return wordnet.NOUN
    elif treebank_tag.startswith('R'): # adverbs
        return wordnet.ADV
    else:
        return None

# Lemmatize function using WordNet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Input text
text = "Jake lives on a beach in Italy and buys a popsicle for $10."

# Tokenize the sentence
tokens = word_tokenize(text)

# POS tagging
tagged_tokens = pos_tag(tokens)

# Define stop words
stop_words = set(stopwords.words('english'))

# Loop through each token and print various attributes of each token
# The loop variable is named 'tag' so that it does not shadow the imported pos_tag function
for token, tag in tagged_tokens:
    # Lemmatize with the WordNet POS when one is available; otherwise keep the token as-is
    wn_pos = get_wordnet_pos(tag)
    lemma = lemmatizer.lemmatize(token, wn_pos) if wn_pos else token
    # Check whether the token consists only of alphabetic characters
    is_alpha = token.isalpha()
    # Check for stop words
    is_stop = token.lower() in stop_words
    # Build a simple word shape: X = uppercase, x = lowercase, d = digit, other characters kept as-is
    shape = ''.join(['X' if char.isupper() else 'x' if char.islower() else 'd' if char.isdigit() else char for char in token])

    print(f"Token: {token:12} | Lemma: {lemma:12} | POS: {tag:6} | Shape: {shape:8} | Alpha: {str(is_alpha):5} | Stop: {str(is_stop):5}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Token: Jake         | Lemma: Jake         | POS: NNP    | Shape: Xxxx     | Alpha: True  | Stop: False
Token: lives        | Lemma: live         | POS: VBZ    | Shape: xxxxx    | Alpha: True  | Stop: False
Token: on           | Lemma: on           | POS: IN     | Shape: xx       | Alpha: True  | Stop: True 
Token: a            | Lemma: a            | POS: DT     | Shape: x        | Alpha: True  | Stop: True 
Token: beach        | Lemma: beach        | POS: NN     | Shape: xxxxx    | Alpha: True  | Stop: False
Token: in           | Lemma: in           | POS: IN     | Shape: xx       | Alpha: True  | Stop: True 
Token: Italy        | Lemma: Italy        | POS: NNP    | Shape: Xxxxx    | Alpha: True  | Stop: False
Token: and          | Lemma: and          | POS: CC     | Shape: xxx      | Alpha: True  | Stop: True 
Token: buys         | Lemma: buy          | POS: VBZ    | Shape: xxxx     | Alpha: True  | Stop: False
Token: a            | Lemma: a            | POS: DT     | Shape: x       
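The hand-written shape above prints `xxxxx` for "lives" and "beach", whereas the spaCy output earlier in this notebook shows `xxxx` for the same words. The spaCy values are consistent with capping each run of same-class characters at four. The sketch below approximates that behaviour in plain Python; the function name `capped_shape` and the `max_run` parameter are hypothetical, and this is an approximation inferred from the outputs rather than spaCy's own code.

```python
def capped_shape(token: str, max_run: int = 4) -> str:
    """Word shape with runs of the same character class truncated to max_run."""
    shape = []
    last, run = "", 0
    for char in token:
        if char.isalpha():
            cls = "X" if char.isupper() else "x"
        elif char.isdigit():
            cls = "d"
        else:
            cls = char  # punctuation and symbols are kept as-is
        # Count how long the current run of identical shape characters is
        run = run + 1 if cls == last else 1
        last = cls
        if run <= max_run:
            shape.append(cls)
    return "".join(shape)


for word in ["Jake", "lives", "Italy", "popsicle", "$", "10"]:
    print(f"{word:10} -> {capped_shape(word)}")
```

Capping the run length keeps the number of distinct shapes small, which can be useful when shapes are used as features.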

# **Entity Recognition**

In [None]:
# Import NLTK and the helpers used in this cell
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download the required NLTK data resources
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')

text = "Jake lives on a beach in Italy and buys a popsicle for $10."

tokens = word_tokenize(text)

tagged_tokens = pos_tag(tokens)

# Implement Named Entity Recognition on the input text
ner_tree = ne_chunk(tagged_tokens)

# Loop through each named entity and print the label of each entity:
for chunk in ner_tree:
    if hasattr(chunk, 'label'):  # This checks if the chunk is a named entity
        entity = " ".join([token for token, pos in chunk])  # Concatenate tokens in the named entity
        label = chunk.label()  # Get the entity label (e.g., PERSON, GPE)
        print(f"Text: {entity} | Label: {label}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Text: Jake | Label: GPE
Text: Italy | Label: GPE
