# **Key Components of an NLP Pipeline:**

***This notebook demonstrates how to implement key components of an NLP pipeline using spaCy and NLTK.***

*Sample Input: "I watched Amy playing badminton at the country club yesterday."*

**Text Normalization:** Transforming text into a consistent, standard format, such as converting to lowercase, removing punctuation, or expanding abbreviations.

*Sample Output: ["i watched amy playing badminton at the country club yesterday"]*
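The normalization step is not demonstrated elsewhere in this notebook, so here is a minimal sketch using only the Python standard library. The helper name `normalize` is hypothetical, and the choice of steps (lowercasing, stripping ASCII punctuation, collapsing whitespace) is an assumption; real pipelines often add others, such as expanding abbreviations.

```python
import string

def normalize(text: str) -> str:
    """Lowercase the text, drop ASCII punctuation, and collapse whitespace."""
    lowered = text.lower()
    # str.maketrans with a third argument maps those characters to None (deletes them)
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # split() + join() collapses any run of whitespace into a single space
    return " ".join(no_punct.split())

sample = "I watched Amy playing badminton at the country club yesterday."
normalized = normalize(sample)
print([normalized])
```

Note that removing all punctuation is lossy (for example, "don't" becomes "dont"), so this step is usually tuned to the task.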

**Tokenization:** Segmenting text into words, punctuation marks, etc.

*Sample Output: ["I", "watched", "Amy", "playing", "badminton", "at", "the", "country", "club", "yesterday", "."]*

**Part-of-speech (POS) Tagging:** Assigning word types to tokens, like verb or noun.

*Sample Output: [PRON, VERB, PROPN, VERB, NOUN, ADP, DET, NOUN, NOUN, NOUN, PUNCT]*

**Dependency Parsing:** Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.

*Sample Output: "watched" - main verb, "I" - subject, "Amy" - object, etc.*

**Named Entity Recognition (NER):** Labelling named “real-world” objects, like persons, companies or locations.

*Sample Output: "Amy" - PERSON, "country club" - LOCATION, "yesterday" - DATE.*

**Lemmatization:** Assigning the base forms of words.

*Sample Output: ["I", "watch", "Amy", "play", "badminton", "at", "the", "country", "club", "yesterday", "."]*

## **SpaCy**

Reference: https://spacy.io/usage/spacy-101

In [None]:
!pip show spacy

Name: spacy
Version: 3.7.6
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: catalogue, cymem, jinja2, langcodes, murmurhash, numpy, packaging, preshed, pydantic, requests, setuptools, spacy-legacy, spacy-loggers, srsly, thinc, tqdm, typer, wasabi, weasel
Required-by: en-core-web-sm, fastai


# **Linguistic Annotations**

Reference: https://spacy.io/usage/spacy-101#annotations


SpaCy offers a range of linguistic annotations to help us understand a text's grammatical structure, including word types (such as parts of speech) and the relationships between words (dependency labels).

In [2]:
import spacy

# Load small pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("Jake lives on a beach in Italy and buys a popsicle for $10.")

# Print token info
for token in doc:
    print("Token:", token.text)
    print("POS:", token.pos_)
    print("Dep:", token.dep_)

Token: Jake
POS: ADJ
Dep: nsubj
Token: lives
POS: VERB
Dep: ROOT
Token: on
POS: ADP
Dep: prep
Token: a
POS: DET
Dep: det
Token: beach
POS: NOUN
Dep: pobj
Token: in
POS: ADP
Dep: prep
Token: Italy
POS: PROPN
Dep: pobj
Token: and
POS: CCONJ
Dep: cc
Token: buys
POS: VERB
Dep: conj
Token: a
POS: DET
Dep: det
Token: popsicle
POS: NOUN
Dep: dobj
Token: for
POS: ADP
Dep: prep
Token: $
POS: SYM
Dep: nmod
Token: 10
POS: NUM
Dep: pobj
Token: .
POS: PUNCT
Dep: punct


*Note: SpaCy preserves the original text, including character positions (offsets), spaces, and formatting, even after splitting it into tokens, allowing you to reconstruct the text exactly as it was before processing.*
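As a small illustration of the note above, the sketch below rebuilds the input string from its tokens. It assumes spaCy v3 and uses `spacy.blank("en")`, which creates a tokenizer-only pipeline, so no trained model is needed for this particular check.

```python
import spacy

# Blank English pipeline: tokenizer only, no tagger/parser/NER
nlp = spacy.blank("en")

text = "Jake lives on a beach in Italy and buys a popsicle for $10."
doc = nlp(text)

# token.idx is the character offset of the token in the original text;
# token.whitespace_ is the trailing whitespace (if any) that followed it
for token in doc:
    print(token.idx, repr(token.text), repr(token.whitespace_))

# text_with_ws is the token text plus its trailing whitespace,
# so concatenating it for every token reproduces the input
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```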

# **Tokenization**

Tokenization is the process of breaking down text into smaller meaningful components, such as words, punctuation marks, and other characters, known as tokens.

- The input text is split into tokens based on whitespace characters (similar to text.split(' ') in Python).
- SpaCy has a set of predefined rules called tokenizer exceptions. These rules determine how certain substrings should be tokenized.

Example: "don't" should be split into two tokens - "do" and "n't"; "U.K." should remain a single token.

- SpaCy processes text from left to right, applying tokenizer exceptions and checking for prefixes (e.g., the quotation mark in "Hey"), suffixes (e.g., the period in "Dr."), and infixes (e.g., the hyphen in "self-help"). If a rule matches, the tokenizer splits the token accordingly and continues processing the resulting substrings.
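The steps above can be sketched in plain Python. This is a toy illustration of the general idea only, not spaCy's implementation: the rule sets `EXCEPTIONS`, `PREFIXES`, and `SUFFIXES` and the function `toy_tokenize` are hypothetical and heavily simplified, and infix handling is left out.

```python
import re

# Hypothetical, heavily simplified rules (spaCy's real rules are far larger)
EXCEPTIONS = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIXES = re.compile(r'^["(\[$]')        # one leading quote, bracket, or dollar sign
SUFFIXES = re.compile(r'["\)\].,!?]$')    # one trailing quote, bracket, or punctuation mark

def toy_tokenize(text):
    tokens = []
    # Step 1: split on whitespace
    for chunk in text.split():
        suffixes = []  # suffixes are peeled right-to-left, so collect and reverse later
        while chunk:
            # Step 2: tokenizer exceptions take priority
            if chunk in EXCEPTIONS:
                tokens.extend(EXCEPTIONS[chunk])
                break
            # Step 3: peel off one prefix, then re-check the remainder
            match = PREFIXES.match(chunk)
            if match:
                tokens.append(match.group())
                chunk = chunk[match.end():]
                continue
            # Step 4: peel off one suffix, then re-check the remainder
            match = SUFFIXES.search(chunk)
            if match:
                suffixes.append(match.group())
                chunk = chunk[:match.start()]
                continue
            # Nothing left to split: the remainder is a token
            tokens.append(chunk)
            break
        tokens.extend(reversed(suffixes))
    return tokens

print(toy_tokenize("Jake lives on a beach in Italy and buys a popsicle for $10."))
print(toy_tokenize('She said "don\'t" in the U.K. today.'))
```

For inspecting which rule produced each token in spaCy itself, `nlp.tokenizer.explain(text)` returns the matched rule alongside each token.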


In [3]:
import spacy

# Load small pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Tokenize text
doc = nlp("Jake lives on a beach in Italy and buys a popsicle for $10.")

# Print tokens
for token in doc:
    print(token.text)

Jake
lives
on
a
beach
in
Italy
and
buys
a
popsicle
for
$
10
.


# **Part-of-speech Tags and Dependencies**

SpaCy provides pre-trained pipelines consisting of statistical models trained on large datasets. These models predict the most likely POS tags, dependency labels, and other attributes for each token.

**Part-of-Speech Tagging:**
Classifies each word in a sentence as a noun, verb, adjective, etc. SpaCy exposes both coarse-grained UPOS (universal part-of-speech) tags via token.pos_ and more detailed, fine-grained tags via token.tag_.

**Syntactic Dependencies:**
Describe the relationships between tokens in a sentence, identifying which token is the subject, object, or modifier of another token.
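To make the head/child relationship concrete, the sketch below builds a small `Doc` by hand and walks its tree. It assumes spaCy v3, where the `heads` argument takes absolute token indices. The dependency labels are copied from the output shown earlier for the first five tokens; the head indices are an assumption written by hand for illustration, not a model prediction.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Jake", "lives", "on", "a", "beach"]
# Absolute index of each token's head (assumed for illustration);
# the root token points to itself
heads = [1, 1, 1, 4, 2]
deps = ["nsubj", "ROOT", "prep", "det", "pobj"]

doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

for token in doc:
    # token.head is the syntactic parent; token.children are its direct dependents
    children = [child.text for child in token.children]
    print(f"{token.text:6} --{token.dep_:5}--> {token.head.text:6} | children: {children}")

# token.subtree yields the token and everything below it, in document order
print([t.text for t in doc[2].subtree])
```

With a trained pipeline, the same attributes (`token.head`, `token.children`, `token.subtree`) are filled in by the parser instead of by hand.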

In [4]:
# Import the spaCy package
import spacy

# Load the small pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Parse and tag the text
doc = nlp("Jake lives on a beach in Italy and buys a popsicle for $10.")

# Print various attributes of each token:
    # token.text: Original word text
    # token.lemma_: Base form of the word (lemma)
    # token.pos_: Universal part-of-speech tag
    # token.tag_: Detailed part-of-speech tag
    # token.dep_: Syntactic dependency label
    # token.shape_: Word shape – capitalization, punctuation, digits, etc
    # token.is_alpha: Whether the token consists of alphabetic characters
    # token.is_stop: Whether the token is a stop word (common words that are often filtered out)
for token in doc:
    print(f"Token: {token.text:12} | Lemma: {token.lemma_:12} | POS: {token.pos_:6} | "
          f"Tag: {token.tag_:6} | Dep: {token.dep_:8} | Shape: {token.shape_:8} | "
          f"Alpha: {str(token.is_alpha):5} | Stop: {str(token.is_stop):5}")

Token: Jake         | Lemma: jake         | POS: ADJ    | Tag: JJ     | Dep: nsubj    | Shape: Xxxx     | Alpha: True  | Stop: False
Token: lives        | Lemma: live         | POS: VERB   | Tag: VBZ    | Dep: ROOT     | Shape: xxxx     | Alpha: True  | Stop: False
Token: on           | Lemma: on           | POS: ADP    | Tag: IN     | Dep: prep     | Shape: xx       | Alpha: True  | Stop: True 
Token: a            | Lemma: a            | POS: DET    | Tag: DT     | Dep: det      | Shape: x        | Alpha: True  | Stop: True 
Token: beach        | Lemma: beach        | POS: NOUN   | Tag: NN     | Dep: pobj     | Shape: xxxx     | Alpha: True  | Stop: False
Token: in           | Lemma: in           | POS: ADP    | Tag: IN     | Dep: prep     | Shape: xx       | Alpha: True  | Stop: True 
Token: Italy        | Lemma: Italy        | POS: PROPN  | Tag: NNP    | Dep: pobj     | Shape: Xxxxx    | Alpha: True  | Stop: False
Token: and          | Lemma: and          | POS: CCONJ  | Tag: CC    

In [None]:
# Visualize dependencies

from spacy import displacy

displacy.render(doc, style="dep")

*Note: SpaCy provides a spacy.explain() function to explain the meaning of tags and labels.*

In [None]:
# Use the spacy.explain function to understand the label
spacy.explain("NNP")

'noun, proper singular'
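The same function can be applied to several labels at once, which is handy when reading a table of tags. A short sketch, where the list of labels is just a sample taken from the outputs in this notebook:

```python
import spacy

# Mix of coarse POS tags, fine-grained tags, dependency labels, and entity labels
labels = ["PROPN", "VBZ", "nsubj", "pobj", "MONEY"]

explanations = {label: spacy.explain(label) for label in labels}
for label, meaning in explanations.items():
    print(f"{label:6} -> {meaning}")
```

`spacy.explain()` looks the label up in spaCy's built-in glossary and returns `None` when the label is not found there.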

# **Named Entities**

In [5]:
import spacy

# Load small pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("Jake lives on a beach in Italy and buys a popsicle for $10.")

# Print named entities
for ent in doc.ents:
    print(f"Text: {ent.text} | Start: {ent.start_char} | End: {ent.end_char} | Label: {ent.label_}")


Text: Jake | Start: 0 | End: 4 | Label: NORP
Text: Italy | Start: 25 | End: 30 | Label: GPE
Text: 10 | Start: 56 | End: 58 | Label: MONEY
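The `Start` and `End` values are character offsets into the original string, with the end offset exclusive. A quick check in plain Python, reusing the values printed above (the `entities` list is simply that output typed in by hand):

```python
text = "Jake lives on a beach in Italy and buys a popsicle for $10."

# (entity text, start_char, end_char, label) as printed in the output above
entities = [
    ("Jake", 0, 4, "NORP"),
    ("Italy", 25, 30, "GPE"),
    ("10", 56, 58, "MONEY"),
]

for ent_text, start, end, label in entities:
    # Slicing the original text with the offsets recovers the entity text
    print(f"{label:6} text[{start}:{end}] = {text[start:end]!r}")

slices = [text[start:end] for _, start, end, _ in entities]
```

This makes it easy to highlight or extract entities from the source text without re-tokenizing it.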


In [6]:
# Use the spacy.explain function to understand the label
spacy.explain("GPE")

'Countries, cities, states'

In [7]:
# Visualize named entities

from spacy import displacy

displacy.render(doc, style="ent")

## **NLTK**

Reference: https://www.nltk.org/

# **Tokenization**

In [9]:
import nltk
from nltk.tokenize import word_tokenize

# Download 'punkt' - a pre-trained model provided by nltk
nltk.download('punkt')
nltk.download('punkt_tab')

# Tokenize the input text
text = "Jake lives on a beach in Italy and buys a popsicle for $10."
tokens = word_tokenize(text)

# Print each token
for token in tokens:
    print(token)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.

Jake
lives
on
a
beach
in
Italy
and
buys
a
popsicle
for
$
10
.


# **Lemmatization and Part-of-speech Tags**

In [11]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet, stopwords
from nltk.tag import pos_tag

# Download the required NLTK data resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger_eng')

# Converts NLTK POS tags to WordNet POS tags
# WordNet recognizes adjectives, verbs, nouns, and adverbs
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'): # adjectives
        return wordnet.ADJ
    elif treebank_tag.startswith('V'): # verbs
        return wordnet.VERB
    elif treebank_tag.startswith('N'): # nouns
        return wordnet.NOUN
    elif treebank_tag.startswith('R'): # adverbs
        return wordnet.ADV
    else:
        return None

# Lemmatize function using WordNet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

text = "Jake lives on a beach in Italy and buys a popsicle for $10."

# Tokenize the text
tokens = word_tokenize(text)

# POS tagging
tagged_tokens = pos_tag(tokens)

# Define stop words
stop_words = set(stopwords.words('english'))

# Loop through the tagged tokens and print several attributes of each one
# (the loop variable is named `tag` so it does not shadow the imported pos_tag function)
for token, tag in tagged_tokens:
    # Lemmatize with the mapped WordNet POS; keep the token unchanged if there is no mapping
    wn_pos = get_wordnet_pos(tag)
    lemma = lemmatizer.lemmatize(token, wn_pos) if wn_pos else token
    # Check whether the token consists only of alphabetic characters
    is_alpha = token.isalpha()
    # Check for stop words
    is_stop = token.lower() in stop_words
    # Build a word shape: X = uppercase, x = lowercase, d = digit, other characters kept as-is
    shape = ''.join(['X' if char.isupper() else 'x' if char.islower() else 'd' if char.isdigit() else char for char in token])

    print(f"Token: {token:12} | Lemma: {lemma:12} | POS: {tag:6} | Shape: {shape:8} | Alpha: {str(is_alpha):5} | Stop: {str(is_stop):5}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


Token: Jake         | Lemma: Jake         | POS: NNP    | Shape: Xxxx     | Alpha: True  | Stop: False
Token: lives        | Lemma: live         | POS: VBZ    | Shape: xxxxx    | Alpha: True  | Stop: False
Token: on           | Lemma: on           | POS: IN     | Shape: xx       | Alpha: True  | Stop: True 
Token: a            | Lemma: a            | POS: DT     | Shape: x        | Alpha: True  | Stop: True 
Token: beach        | Lemma: beach        | POS: NN     | Shape: xxxxx    | Alpha: True  | Stop: False
Token: in           | Lemma: in           | POS: IN     | Shape: xx       | Alpha: True  | Stop: True 
Token: Italy        | Lemma: Italy        | POS: NNP    | Shape: Xxxxx    | Alpha: True  | Stop: False
Token: and          | Lemma: and          | POS: CC     | Shape: xxx      | Alpha: True  | Stop: True 
Token: buys         | Lemma: buy          | POS: VBZ    | Shape: xxxx     | Alpha: True  | Stop: False
Token: a            | Lemma: a            | POS: DT     | Shape: x       

# **Entity Recognition**

In [13]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Download the required NLTK data resources
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker_tab')

text = "Jake lives on a beach in Italy and buys a popsicle for $10."

tokens = word_tokenize(text)

tagged_tokens = pos_tag(tokens)

# Process the text
ner_tree = ne_chunk(tagged_tokens)

# Print the entities
for chunk in ner_tree:
    if hasattr(chunk, 'label'):  # This checks if the chunk is a named entity
        entity = " ".join([token for token, pos in chunk])  # Concatenate tokens in the named entity
        label = chunk.label()  # Get the entity label (e.g., PERSON, GPE)
        print(f"Text: {entity} | Label: {label}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


Text: Jake | Label: GPE
Text: Italy | Label: GPE
