# NLP YouTube Workshop (Session 1)

Welcome to the first session of our NLP Workshop Notebook! In this introductory module, we'll lay the groundwork for understanding and applying Natural Language Processing techniques using Python's NLTK and spaCy packages. 🤓


## 1. NLP Toolkit
1. **[NLTK (Natural Language Toolkit)](https://en.wikipedia.org/wiki/Natural_Language_Toolkit)** is a comprehensive open-source Python library designed for *educational* and *research* purposes in natural language processing. It provides robust tools for tasks like tokenization, stemming, lemmatization, parsing, and more, backed by extensive corpora and lexical resources. See the [NLTK website](https://www.nltk.org/).

2. **[spaCy](https://en.wikipedia.org/wiki/SpaCy)** is an *industrial-strength* NLP library in Python tailored for real-world applications, emphasizing speed and accuracy in tasks such as tokenization, named entity recognition, and dependency parsing. Its modern API and efficient design make it ideal for processing large-scale text data. See the [spaCy documentation](https://spacy.io/).

In [None]:
# --- NLTK Setup ---
!pip install nltk
import nltk
# Download essential NLTK datasets: tokenizers, stopwords, WordNet, etc.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('gutenberg')
nltk.download('averaged_perceptron_tagger_eng')

# --- spaCy Setup ---
!pip install spacy
import spacy

# Download and load the small English model for spaCy
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


### **WordNet:**
WordNet is a lexical database of English, accessible through NLTK, that groups words into synonym sets (synsets) and links them through semantic relations such as hypernymy. Alternatives like BabelNet or ConceptNet extend these capabilities to multilingual and broader semantic relationships.

### **Gutenberg:**
The Gutenberg corpus in NLTK is a curated collection of classic literary texts sourced from Project Gutenberg, offering diverse works for exploring historical language patterns and stylistic nuances; alternatives like the Brown Corpus or Reuters Corpus provide additional perspectives for varied text analysis.

### **Punkt:**
Punkt is an unsupervised sentence tokenizer in NLTK. It learns from raw text which periods belong to abbreviations and which end a sentence, then uses those patterns to segment text into sentences. spaCy's built-in sentence segmentation offers a statistical alternative.

### **en_core_web_sm:**
The spaCy model en_core_web_sm is a lightweight yet efficient English model for tasks such as tokenization, POS tagging, dependency parsing, and NER; for enhanced accuracy and richer features, consider its larger counterparts—en_core_web_md, en_core_web_lg, or the transformer-based en_core_web_trf.

In [None]:
# Example texts used throughout this notebook
short_text = "Natural Language Processing is fun and educational."
long_text = 'Dr. Smith, traveled to Washington, D.C. on Jan. 5th for a cutting-edge NLP conference. During his keynote, he explained that advancements in tokenization techniques—particularly those implemented in NLTK and spaCy (e.g., handling abbreviations like "Dr." and "e.g." seamlessly)—are transforming text analysis.'
text_emails = "Contact us at admin.support_34@example.com or sales-dep@company.org for inquiries."


## 2. Corpora and Lexical Resources
Corpora and lexical resources are essential for understanding language structure and semantics. Corpora are large collections of texts, and lexical databases are structured records of word relationships; together they support tasks like language modeling and semantic analysis.



In [None]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
# Retrieve synsets for the word 'computer'
synsets = wn.synsets('computer')
print("WordNet Synsets for 'computer':", synsets)


[Synset('computer.n.01'), Synset('calculator.n.01')]


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
# List available files in the Gutenberg corpus
print("Gutenberg Files:", gutenberg.fileids())


Gutenberg Files: ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


### spaCy Lexical Resources
While spaCy does not include separate corpora like NLTK, its language models (e.g., en_core_web_sm) incorporate rich lexical data, including vocabulary, part-of-speech tags, and dependency structures.


In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

short_text = "Natural Language Processing is fun and educational."
doc = nlp(short_text)
# Print token, lemma, POS tag, and dependency relation for each token
for token in doc:
    print(f"Token: {token.text}, Lemma: {token.lemma_}, POS: {token.pos_}, Dep: {token.dep_}")


Token: Natural, Lemma: Natural, POS: PROPN, Dep: compound
Token: Language, Lemma: Language, POS: PROPN, Dep: compound
Token: Processing, Lemma: processing, POS: NOUN, Dep: nsubj
Token: is, Lemma: be, POS: AUX, Dep: ROOT
Token: fun, Lemma: fun, POS: ADJ, Dep: acomp
Token: and, Lemma: and, POS: CCONJ, Dep: cc
Token: educational, Lemma: educational, POS: ADJ, Dep: conj
Token: ., Lemma: ., POS: PUNCT, Dep: punct


## 3. Tokenization Techniques
Tokenization is the process of splitting text into smaller units such as words or sentences, which simplifies subsequent text analysis and processing.
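
To see why dedicated tokenizers are worth using, here is a minimal pure-Python sketch of naive tokenization with the standard `re` module. The sample sentence and both patterns are illustrative assumptions, not what NLTK or spaCy do internally. It shows that a simple "split after a period" rule breaks on abbreviations such as "Dr." and "D.C.".

```python
import re

# Illustrative sample text with several abbreviations (an assumption for this sketch)
text = "Dr. Smith traveled to Washington, D.C. on Jan. 5th. He gave a keynote."

# Word level: runs of word characters, or any single non-space symbol
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence level (naive): split on whitespace that follows ., ! or ?
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens[:6])
print(len(naive_sentences), naive_sentences)
```

The text contains two sentences, but the naive rule produces five fragments because every abbreviation period is treated as a sentence end. Handling such cases is what tokenizers like Punkt are designed for.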



In [None]:
import nltk
nltk.download('punkt_tab')
# NLTK Tokenization
# Word Tokenization with NLTK
long_text = 'Dr. Smith, traveled to Washington, D.C. on Jan. 5th for a cutting-edge NLP conference! During his keynote, he explained that advancements in tokenization techniques—particularly those implemented in NLTK and spaCy (e.g., handling abbreviations like "Dr." and "e.g." seamlessly)—are transforming text analysis.'

words = nltk.word_tokenize(long_text)
print("Word Tokens:", words)

# Sentence Tokenization with NLTK:
sentences = nltk.sent_tokenize(long_text)
print("Sentence Tokens:", sentences)
print(len(sentences))


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Word Tokens: ['Dr.', 'Smith', ',', 'traveled', 'to', 'Washington', ',', 'D.C.', 'on', 'Jan.', '5th', 'for', 'a', 'cutting-edge', 'NLP', 'conference', '!', 'During', 'his', 'keynote', ',', 'he', 'explained', 'that', 'advancements', 'in', 'tokenization', 'techniques—particularly', 'those', 'implemented', 'in', 'NLTK', 'and', 'spaCy', '(', 'e.g.', ',', 'handling', 'abbreviations', 'like', '``', 'Dr.', "''", 'and', '``', 'e.g', '.', "''", 'seamlessly', ')', '—are', 'transforming', 'text', 'analysis', '.']
Sentence Tokens: ['Dr. Smith, traveled to Washington, D.C. on Jan. 5th for a cutting-edge NLP conference!', 'During his keynote, he explained that advancements in tokenization techniques—particularly those implemented in NLTK and spaCy (e.g., handling abbreviations like "Dr." and "e.g."', 'seamlessly)—are transforming text analysis.']
3


In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

text = 'SpaCy is a powerful NLP library. It supports many useful features'
doc = nlp(text)

# Sentence tokenization with spaCy (uncomment to try):
# for sent in doc.sents:
#     print(sent.text)

# Word tokenization with spaCy
for token in doc:
    print(token.text)


SpaCy
is
a
powerful
NLP
library
.
It
supports
many
useful
features


## 4. Regex for Pattern Matching

Regular expressions (regex) are a powerful way to extract or filter specific text patterns. With NLTK, you can use RegexpTokenizer to tokenize text based on custom regex patterns, while spaCy's Matcher matches token-level patterns and can apply a regex to token attributes such as the token text.



In [None]:
# NLTK Regex Example
from nltk.tokenize import RegexpTokenizer
text = "Hello world! This is an example: email@example.com, phone: 123-456-7890."

# Define a tokenizer that captures alphanumeric words
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
print("NLTK Regex Tokens:", tokens)

NLTK Regex Tokens: ['Hello', 'world', 'This', 'is', 'an', 'example', 'email', 'example', 'com', 'phone', '123', '456', '7890']


In [None]:
# spaCy Regex Example
from spacy.matcher import Matcher

# Load the small English model
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Define a regex-based pattern: match tokens that start with a capital letter
pattern = [{"TEXT": {"REGEX": "^[A-Z][a-z]+"}}]
matcher.add("CAPITAL_PATTERN", [pattern])

text = "Hello world! This is an Example sentence."
doc = nlp(text)

# Apply the matcher and print matched tokens
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print("Matched token:", span.text)


Matched token: Hello
Matched token: This
Matched token: Example


In [None]:
import re
text_emails = "Contact us at admin.support_34@example.com or sales-dep@company.org for inquiries."

# Example 1: Email Detection
email_pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"  # Regex for email matching

emails = re.findall(email_pattern, text_emails)
print("Detected Emails:", emails)


Detected Emails: ['admin.support_34@example.com', 'sales-dep@company.org']
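
The same approach extends to other structured patterns. As a small sketch, the snippet below pulls the phone number out of the sentence used in the NLTK regex example, using named groups from the standard `re` module. The group names (`area`, `prefix`, `line`) and the assumed `NNN-NNN-NNNN` format are illustrative choices, not a general phone-number validator.

```python
import re

text = "Hello world! This is an example: email@example.com, phone: 123-456-7890."

# Assumed format: three digits, three digits, four digits, separated by hyphens
phone_pattern = re.compile(r"\b(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})\b")

match = phone_pattern.search(text)
if match:
    print("Phone:", match.group(0))
    print("Parts:", match.groupdict())
```

Named groups make the extracted parts self-describing, which helps when the match feeds into later processing steps.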


## 5. Stopwords Filtering

Stopwords filtering involves removing commonly used words (e.g., "the", "is", "at") that often add little semantic value to text analysis. This process helps focus on more meaningful terms. NLTK offers a comprehensive list of stopwords, whereas spaCy incorporates an internal attribute for each token to determine if it is a stopword.
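
Conceptually, stopword filtering is a set-membership test on each token. The sketch below illustrates the idea in plain Python; the tiny `STOP_WORDS` set and the `remove_stopwords` helper are hypothetical and far smaller than the lists shipped with NLTK or spaCy.

```python
import re

# Hypothetical mini stopword list for illustration only
STOP_WORDS = {"this", "is", "a", "in", "for", "the", "at", "only"}

def remove_stopwords(text):
    """Lowercase the text, keep alphabetic tokens, and drop those in STOP_WORDS."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

filtered = remove_stopwords("This is a simple example in the park.")
print(filtered)
```

Storing the stopwords in a `set` keeps each lookup cheap, which matters once the text grows to thousands of tokens.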

In [None]:
# NLTK Stopwords Filtering
import nltk
from nltk.corpus import stopwords

text = "This is a simple example demonstrating stopword removal in natural language processing.! this is for research purposes only!"
words = nltk.word_tokenize(text)
# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)


Filtered Words: ['simple', 'example', 'demonstrating', 'stopword', 'removal', 'natural', 'language', 'processing', '.']


In [None]:
# spaCy Stopwords Filtering
# (also removes punctuation)
sample_text = 'This is a simple example demonstrating stopword removal in natural language processing.! this is for research purposes only!'
doc = nlp(sample_text)
filtered_tokens = [
    token.text for token in doc
    if not token.is_stop and not token.is_punct
]
print("Filtered Tokens:", filtered_tokens)


Filtered Tokens: ['simple', 'example', 'demonstrating', 'stopword', 'removal', 'natural', 'language', 'processing', 'research', 'purposes']


## 6. Stemming Methods

Stemming reduces words to their root forms by removing affixes, thus simplifying text for analysis and retrieval tasks. NLTK offers robust stemming algorithms like Porter, Lancaster, and Snowball. While spaCy focuses on lemmatization for morphological normalization, you can integrate NLTK's stemmers with spaCy's tokenization if stemming is required.
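
To make the idea of affix removal concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter, Lancaster, or Snowball algorithm; the `SUFFIX_RULES` table and the `naive_stem` helper are hypothetical and only show the general mechanism.

```python
# Hypothetical suffix rules, tried in order; real stemmers use many more rules and conditions
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word

words = ["cats", "ponies", "jumping", "walked", "caresses", "is"]
naive_stems = [naive_stem(w) for w in words]
print(naive_stems)
```

Note that "ponies" becomes "poni", which is not a dictionary word. Stems only need to be consistent across related word forms, not valid words, and that is the main difference from lemmatization.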

In [None]:
# NLTK Stemming Example
import nltk
nltk.download('punkt_tab')
from nltk.stem import PorterStemmer, LancasterStemmer
# from nltk.stem import SnowballStemmer  # uncomment to try Snowball as well

# Initialize the stemmers
ps = PorterStemmer()
ln = LancasterStemmer()
# sn = SnowballStemmer(language="english")
text = "caring careful  cares  cats  boxes wolves"
words = nltk.word_tokenize(text)
stems = [ps.stem(word) for word in words]
ln_stems = [ln.stem(word) for word in words]
# sn_stems = [sn.stem(word) for word in words]
print("NLTK PorterStemmer Words:", stems)
# print("NLTK Snowball Words:", sn_stems)
print("NLTK Lancaster Words:", ln_stems)

NLTK PorterStemmer Words: ['care', 'care', 'care', 'cat', 'box', 'wolv']
NLTK Lancaster Words: ['car', 'car', 'car', 'cat', 'box', 'wolv']


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
# Integrating NLTK Stemming with spaCy
from nltk.stem import PorterStemmer

# Initialize the Porter stemmer (the spaCy pipeline `nlp` was loaded in an earlier cell)
ps = PorterStemmer()

doc = nlp("running runner easily run")
stems = [ps.stem(token.text) for token in doc]
print("spaCy Integrated Stemmed Tokens:", stems)


## 7. Lemmatization Strategies

Lemmatization converts words into their base or dictionary forms by analyzing their context and morphology, resulting in more meaningful representations than simple stemming. NLTK uses the WordNetLemmatizer with the WordNet database, while spaCy provides built-in lemmatization integrated within its processing pipeline.
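
The role of part of speech is easier to see in a toy model. The sketch below is loosely inspired by dictionary-based lemmatizers: it checks a table of irregular forms, then tries suffix rules and accepts a candidate only if it is a known word. Every table here (`LEXICON`, `EXCEPTIONS`, `RULES`) and the `toy_lemmatize` helper are hypothetical, tiny, and for illustration only.

```python
# Toy data, all hypothetical: known lemmas, irregular forms, and suffix rules per part of speech
LEXICON = {"n": {"bat", "foot", "hanging"}, "v": {"be", "hang", "bat"}}
EXCEPTIONS = {"n": {"feet": "foot"}, "v": {"are": "be"}}
RULES = {"n": [("s", "")], "v": [("ing", ""), ("s", "")]}

def toy_lemmatize(word, pos="n"):
    """Return a lemma for word under the given POS, or the word unchanged."""
    word = word.lower()
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    if word in LEXICON[pos]:
        return word
    for suffix, replacement in RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON[pos]:  # accept only known words
                return candidate
    return word

results = {w: (toy_lemmatize(w, "n"), toy_lemmatize(w, "v"))
           for w in ["bats", "feet", "are", "hanging"]}
for word, (as_noun, as_verb) in results.items():
    print(f"{word}: noun -> {as_noun}, verb -> {as_verb}")
```

The same surface form can map to different lemmas depending on the part of speech ("are" stays "are" as a noun but becomes "be" as a verb). NLTK's `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is passed, so verbs are left unchanged when it is called without one.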

In [None]:
# NLTK Lemmatization Example
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
text = "The striped bats are hanging on their feet for best"
words = nltk.word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("NLTK Lemmatized Words:", lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...


NLTK Lemmatized Words: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best']


In [None]:
import spacy

# Load the model in this cell so it runs independently of earlier cells
nlp = spacy.load("en_core_web_sm")

text = "The striped bats are hanging on their feet for best"
doc = nlp(text)
lemmatized_tokens = [token.lemma_ for token in doc]
print("spaCy Lemmatized Tokens:", lemmatized_tokens)

## 8. Parsing and Chunking

Parsing involves analyzing a sentence's grammatical structure, while chunking groups tokens into higher-level units like noun phrases. NLTK employs rule-based parsers for chunking, and spaCy utilizes statistical models to identify syntactic dependencies and extract phrase chunks.
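
Rule-based chunking can be thought of as running a regular expression over the sequence of POS tags rather than over characters. The sketch below illustrates that idea in plain Python for the noun-phrase rule "optional determiner, any adjectives, then a noun". The tagged tokens are written out by hand (matching the tags in the NLTK example in this section), and the offset bookkeeping is an assumption of this sketch, not NLTK's actual implementation.

```python
import re

# Hand-written (word, tag) pairs; in practice these come from a POS tagger
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

# Encode the tag sequence as one string, e.g. "<DT><JJ><NN>..."
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Record the character offset where each token's tag begins
starts = []
offset = 0
for _, tag in tagged:
    starts.append(offset)
    offset += len(tag) + 2  # +2 for the angle brackets

# NP rule: optional determiner, any number of adjectives, one singular noun
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for m in np_pattern.finditer(tag_string):
    first = starts.index(m.start())
    last = starts.index(m.end()) if m.end() in starts else len(tagged)
    chunks.append(" ".join(word for word, _ in tagged[first:last]))

print(chunks)
```

Because the rule ends at the first `<NN>`, "brown" (tagged as a noun here) closes the first chunk and "fox" forms a chunk of its own. Chunk quality depends directly on the tags the chunker is given.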

In [None]:
# NLTK Parsing and Chunking Example
import nltk
nltk.download('averaged_perceptron_tagger_eng')

# Tokenize and tag parts of speech
text = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(text)
tagged_tokens = nltk.pos_tag(tokens)

# Define a simple chunk grammar for noun phrases (NP):
# an optional determiner, any number of adjectives, then a singular noun
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Create a RegexpParser object and parse the tagged tokens
cp = nltk.RegexpParser(grammar)
parsed_tree = cp.parse(tagged_tokens)

# Collapse the tree's multi-line string form onto a single line
print("NLTK Parsed Tree (Single Line):", " ".join(str(parsed_tree).split()))

# Uncomment the line below to print the tree as ASCII art in the console
# parsed_tree.pretty_print()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


NLTK Parsed Tree (Single Line): (S (NP The/DT quick/JJ brown/NN) (NP fox/NN) jumps/VBZ over/IN (NP the/DT lazy/JJ dog/NN) ./.)


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


In [None]:
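# Extract and print noun chunks using spaCy's built-in functionality
# (uses the `nlp` pipeline loaded earlier)
doc = nlp("The striped bats are hanging on their feet for best")
print("spaCy Noun Chunks:")
for chunk in doc.noun_chunks:
    print(chunk.text)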


spaCy Noun Chunks:
The striped bats
their feet


## 9. Hyponyms and Hypernyms Exploration

Hyponyms and hypernyms are semantic relationships that organize words hierarchically: hyponyms denote more specific terms (e.g., "poodle" for "dog"), while hypernyms represent more general categories (e.g., "canine" for "dog"). NLTK's WordNet is a powerful resource for exploring these relationships. spaCy does not extract hypernyms or hyponyms natively, so the practical approach is to use spaCy for tokenization and tagging and then look the tokens up in NLTK's WordNet.
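
Before querying WordNet, it can help to picture the hierarchy as a simple parent map. The sketch below uses a hypothetical mini-taxonomy (`HYPERNYM_OF`) in which each word has exactly one parent. WordNet itself is richer: a synset can have several hypernyms, and relations hold between synsets rather than bare words.

```python
# Hypothetical mini-taxonomy: each word maps to its direct hypernym (parent)
HYPERNYM_OF = {
    "poodle": "dog",
    "corgi": "dog",
    "dog": "canine",
    "wolf": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

def hypernym_path(word):
    """Follow parent links from word up to the root of the toy taxonomy."""
    path = [word]
    while path[-1] in HYPERNYM_OF:
        path.append(HYPERNYM_OF[path[-1]])
    return path

def hyponyms(word):
    """Direct children of word: every entry whose parent is word."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == word)

print(hypernym_path("poodle"))
print(hyponyms("dog"))
```

Walking upward gives increasingly general terms, and collecting children gives the more specific ones, which is the same pair of operations that `hypernyms()` and `hyponyms()` expose on WordNet synsets.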

In [None]:
# NLTK WordNet Example

from nltk.corpus import wordnet as wn

# Choose a synset for the word 'dog'
dog_synset = wn.synset('dog.n.01')

# Retrieve hypernyms (more general terms)
hypernyms = dog_synset.hypernyms()
print("Hypernyms of 'dog':", hypernyms)

# Retrieve hyponyms (more specific terms)
hyponyms = dog_synset.hyponyms()
print("Hyponyms of 'dog':", hyponyms)


Hypernyms of 'dog': [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
Hyponyms of 'dog': [Synset('basenji.n.01'), Synset('great_pyrenees.n.01'), Synset('working_dog.n.01'), Synset('poodle.n.01'), Synset('toy_dog.n.01'), Synset('mexican_hairless.n.01'), Synset('puppy.n.01'), Synset('hunting_dog.n.01'), Synset('newfoundland.n.01'), Synset('corgi.n.01'), Synset('griffon.n.02'), Synset('dalmatian.n.02'), Synset('leonberg.n.01'), Synset('cur.n.01'), Synset('pooch.n.01'), Synset('pug.n.01'), Synset('lapdog.n.01'), Synset('spitz.n.01')]


In [None]:
# Integrating spaCy with NLTK for Lexical Exploration

import spacy
from nltk.corpus import wordnet as wn

# Load spaCy's small English model
nlp = spacy.load("en_core_web_sm")
doc = nlp("The dog barked at the mailman.")

# Find the token 'dog' and explore its lexical relationships using NLTK's WordNet
for token in doc:
    if token.text.lower() == 'dog':
        synsets = wn.synsets(token.text, pos=wn.NOUN)
        if synsets:
            synset = synsets[0]
            hypernyms = synset.hypernyms()
            hyponyms = synset.hyponyms()
            print(f"Token: {token.text}")
            print("Hypernyms:", hypernyms)
            print("Hyponyms:", hyponyms)


Token: dog
Hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
Hyponyms: [Synset('basenji.n.01'), Synset('great_pyrenees.n.01'), Synset('working_dog.n.01'), Synset('poodle.n.01'), Synset('toy_dog.n.01'), Synset('mexican_hairless.n.01'), Synset('puppy.n.01'), Synset('hunting_dog.n.01'), Synset('newfoundland.n.01'), Synset('corgi.n.01'), Synset('griffon.n.02'), Synset('dalmatian.n.02'), Synset('leonberg.n.01'), Synset('cur.n.01'), Synset('pooch.n.01'), Synset('pug.n.01'), Synset('lapdog.n.01'), Synset('spitz.n.01')]


## 10. Named Entity Recognition (NER)



In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Apple Inc. announced a new partnership with OpenAI at the annual Oscar Award event in California."

doc = nlp(text)

print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")


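# Label meanings:
# ORG: organization (companies, agencies, institutions)
# GPE: geopolitical entity (countries, cities, states)
# WORK_OF_ART: titles of books, songs, and similar works
# Small models can mislabel entities, so always inspect the output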


Named Entities:
Apple Inc. -> ORG
OpenAI -> GPE
Oscar Award -> WORK_OF_ART
California -> GPE


In [None]:
import nltk
import string
from nltk.corpus import gutenberg, stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer


# Select a text file from Gutenberg (e.g., 'shakespeare-hamlet.txt')
file_id = "shakespeare-hamlet.txt"
raw_text = gutenberg.raw(file_id)

# Step 1: Text Cleaning (Removing Gutenberg Header/Footer)
def clean_text(text):
    lines = text.split("\n")  # split the raw text into lines
    start_idx, end_idx = 0, len(lines)

    # Drop Project Gutenberg boilerplate if its START/END marker lines are present;
    # when no marker is found, the full text is kept
    for i, line in enumerate(lines):
        if "START OF THIS PROJECT GUTENBERG" in line:
            start_idx = i + 1
        if "END OF THIS PROJECT GUTENBERG" in line:
            end_idx = i
            break

    cleaned_lines = lines[start_idx:end_idx]
    cleaned_text = " ".join(cleaned_lines)
    return cleaned_text

text = clean_text(raw_text)

# Step 2: Lowercase
text = text.lower()

# Step 3: Tokenization
tokens = word_tokenize(text)

# Step 4: Remove Punctuation & Stopwords
stop_words = set(stopwords.words("english"))
tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Step 5: Stemming & Lemmatization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmed_tokens = [stemmer.stem(word) for word in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

# Step 6: Convert back to text
stemmed_text = " ".join(stemmed_tokens)
lemmatized_text = " ".join(lemmatized_tokens)

# Output Results
print("Original Text (First 500 characters):\n", text[:500])
print("\nStemmed Text (First 500 characters):\n", stemmed_text[:500])
print("\nLemmatized Text (First 500 characters):\n", lemmatized_text[:500])


Original Text (First 500 characters):
 [the tragedie of hamlet by william shakespeare 1599]   actus primus. scoena prima.  enter barnardo and francisco two centinels.    barnardo. who's there?   fran. nay answer me: stand & vnfold your selfe     bar. long liue the king     fran. barnardo?   bar. he     fran. you come most carefully vpon your houre     bar. 'tis now strook twelue, get thee to bed francisco     fran. for this releefe much thankes: 'tis bitter cold, and i am sicke at heart     barn. haue you had quiet guard?   fran. not

Stemmed Text (First 500 characters):
 tragedi hamlet william shakespear 1599 actu primu scoena prima enter barnardo francisco two centinel barnardo fran nay answer stand vnfold self bar long liue king fran barnardo bar fran come care vpon hour bar strook twelu get thee bed francisco fran releef much thank bitter cold sick heart barn haue quiet guard fran mous stir barn well goodnight meet horatio marcellu riual watch bid make hast enter horatio marcellu f