# **Practice Assignment: NLP with NLTK & spaCy**

* This assignment is part of the NLP Workshop on YouTube, which is free and open to the public.
* **Lecturer: Reza Shokrzad.**
* [Access the first class session](https://youtube.com/live/lDCoqQSc4ZE?feature=share)
* [Class schedule and sessions](https://docs.google.com/spreadsheets/d/1SP3NJ9H7yp8sgof-zp_t4oxmdxjMdEgoL_mmCDvdUm4/edit?gid=0#gid=0)

Welcome to this **Fill-in-the-Blanks NLP Assignment!** 🎯 This exercise will help you solidify your understanding of **NLTK** and **spaCy** by filling in the missing parts of the code. Follow the instructions carefully, and make sure to test your solutions!


## **1. Working with Corpora & Lexical Resources**
**Task:** Load and analyze texts from different corpora.
- Use NLTK’s **Gutenberg** corpus to load the text of *Moby Dick*.
- Tokenize it into words.
- Count the top 10 most frequent words (excluding stopwords).

In [None]:
import nltk
nltk.download('punkt_tab')
nltk.download('gutenberg')
nltk.download('stopwords')
from nltk.corpus import gutenberg
from nltk.probability import FreqDist
from nltk.corpus import stopwords
# First, list the file IDs available in the Gutenberg corpus
fileids = nltk.corpus.gutenberg.fileids()

fileid = "melville-moby_dick.txt"
# Load text
text = gutenberg.raw(fileid)  # FILL THIS

# Tokenize words
words = nltk.word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))  # build the set once instead of once per token
filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]  # FILL THIS

# Compute frequency distribution
fdist = FreqDist(filtered_words)
# Print top 10 words
print(fdist.most_common(10))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


[('whale', 1095), ('one', 913), ('like', 580), ('upon', 565), ('ahab', 511), ('man', 498), ('ship', 469), ('old', 443), ('ye', 438), ('would', 436)]
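
The counting step can also be tried on a toy string without downloading any corpus. The sketch below is a minimal stand-in, not the NLTK pipeline: it assumes `collections.Counter` in place of `FreqDist`, a simple regex in place of `word_tokenize`, and a tiny hand-written stopword set (`TOY_STOPWORDS`, a hypothetical name) in place of NLTK's English list.

```python
import re
from collections import Counter

# Hypothetical mini stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

sample = "The whale and the ship: the whale is in the sea, and the whale is lost."

# Lowercase first, then keep runs of letters only (drops punctuation and digits).
tokens = re.findall(r"[a-z]+", sample.lower())

# Membership tests against a set take constant time on average.
filtered = [tok for tok in tokens if tok not in TOY_STOPWORDS]

counts = Counter(filtered)
print(counts.most_common(1))  # [('whale', 3)]
```

`Counter.most_common(n)` plays the same role here as `FreqDist.most_common(n)` in the cell above.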


## **2. Tokenization Techniques**
**Task:** Tokenize a given text using both **NLTK** and **spaCy**.

In [None]:
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')
nlp = spacy.load('en_core_web_sm')

text = "SpaCy is fast! However, NLTK provides flexibility in tokenization."

# NLTK Tokenization
nltk_word_tokens = word_tokenize(text)  # FILL THIS
nltk_sent_tokens = sent_tokenize(text)  # FILL THIS

# spaCy Tokenization
doc = nlp(text)
spacy_tokens = [token.text for token in doc]

print("NLTK Word Tokens:", nltk_word_tokens)
print("NLTK Sentence Tokens:", nltk_sent_tokens)
print("spaCy Tokens:", spacy_tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


NLTK Word Tokens: ['SpaCy', 'is', 'fast', '!', 'However', ',', 'NLTK', 'provides', 'flexibility', 'in', 'tokenization', '.']
NLTK Sentence Tokens: ['SpaCy is fast!', 'However, NLTK provides flexibility in tokenization.']
spaCy Tokens: ['SpaCy', 'is', 'fast', '!', 'However', ',', 'NLTK', 'provides', 'flexibility', 'in', 'tokenization', '.']
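
For this sentence both libraries simply split punctuation away from words. As a rough illustration of that idea, here is a minimal sketch using only the standard `re` module. It is an approximation under simple assumptions, not what NLTK or spaCy do internally: contractions, abbreviations, and URLs would be handled differently by the real tokenizers.

```python
import re

text = "SpaCy is fast! However, NLTK provides flexibility in tokenization."

# Word tokens: a run of word characters, or any single non-space symbol.
regex_word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence tokens: split on whitespace that follows '.', '!' or '?'.
regex_sent_tokens = re.split(r"(?<=[.!?])\s+", text)

print("Regex Word Tokens:", regex_word_tokens)
print("Regex Sentence Tokens:", regex_sent_tokens)
```

On this input the regex version gives the same tokens as the outputs above; the differences only show up on harder text such as "Dr. Smith" or "don't".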


## **3. Regex Pattern Matching for Phone Number Detection**
**Task:** Write a pattern using regex to find the phone number in the text.

In [None]:
import re

# Phone number extraction
text_phones = "Call me at +1-202-555-0173 or reach our office at (415) 123-4567."
# Non-capturing groups (?:...) make re.findall return the whole match instead of group tuples
phone_pattern = r"(?:\+?\d{1,2}[-\s]?)?(?:\(?\d{3}\)?)[-\s]?\d{3}[-\s]?\d{4}"

phones = re.findall(phone_pattern, text_phones)
print("Detected Phone Numbers:", phones)


Detected Phone Numbers: ['+1-202-555-0173', '(415) 123-4567']
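
When a pattern contains capturing groups, `re.findall` returns the group contents as tuples rather than the full matched text. One way around this, sketched below, is to iterate with `re.finditer` and read `match.group(0)`, which is always the whole match. The variable name `full_numbers` is only illustrative.

```python
import re

text_phones = "Call me at +1-202-555-0173 or reach our office at (415) 123-4567."

# A phone pattern written with two capturing groups.
phone_pattern = r"(\+?\d{1,2}[-\s]?)?(\(?\d{3}\)?)[-\s]?\d{3}[-\s]?\d{4}"

# group(0) is the entire match, regardless of any inner groups.
full_numbers = [m.group(0) for m in re.finditer(phone_pattern, text_phones)]
print("Detected Phone Numbers:", full_numbers)
```

The same effect can be had by rewriting each `(...)` as a non-capturing `(?:...)` and keeping `re.findall`.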


## 4. **Stopwords Filtering using NLTK**
**Task:** Analyze movie reviews where stopwords are removed to focus on meaningful words.

In [None]:

# nltk.download("stopwords")
# nltk.download("punkt")

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# 🎬 Sample Movie Review
review = """The movie was absolutely amazing! The cinematography was stunning, and the characters were incredibly well-developed.
However, the storyline felt a bit predictable at times, and some scenes were unnecessarily long. Overall, a great experience!"""

# Tokenize words
words = word_tokenize(review)
# Remove stopwords
stop_words = set(stopwords.words("english"))  # build the set once instead of once per token
filtered_words = [word for word in words if word.lower() not in stop_words and word.isalpha()]

# Output results
print("Original Words:", words)
print("\nFiltered (No Stopwords):", filtered_words)


## 5. **Stemming Methods using NLTK**
**Task:** Analyze legal and scientific terms to observe how different stemming algorithms behave.

In [None]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer

# ⚖️ Sample Legal & Scientific Terms
words = ["arguing", "justification", "liable", "obligations", "classification", "microbiology", "evolutionary", "running", "happiness"]

# Initialize Stemmer Objects
porter = PorterStemmer()
lancaster = LancasterStemmer()

# Apply Stemming
porter_stems = [porter.stem(word) for word in words]
lancaster_stems = [lancaster.stem(word) for word in words]

# Output Results
print("Original Words:", words)
print("\nPorter Stemmer Results:", porter_stems)
print("\nLancaster Stemmer Results:", lancaster_stems)


Original Words: ['arguing', 'justification', 'liable', 'obligations', 'classification', 'microbiology', 'evolutionary', 'running', 'happiness']

Porter Stemmer Results: ['argu', 'justif', 'liabl', 'oblig', 'classif', 'microbiolog', 'evolutionari', 'run', 'happi']

Lancaster Stemmer Results: ['argu', 'just', 'liabl', 'oblig', 'class', 'microbiolog', 'evolv', 'run', 'happy']
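
To see where the two algorithms diverge, the printed lists can be compared pair by pair. The sketch below copies the stems from the output above rather than recomputing them, so it runs without NLTK; `disagreements` is a hypothetical helper name.

```python
words = ["arguing", "justification", "liable", "obligations", "classification",
         "microbiology", "evolutionary", "running", "happiness"]
porter_stems = ["argu", "justif", "liabl", "oblig", "classif",
                "microbiolog", "evolutionari", "run", "happi"]
lancaster_stems = ["argu", "just", "liabl", "oblig", "class",
                   "microbiolog", "evolv", "run", "happy"]

# Keep only the words where the two stemmers produced different stems.
disagreements = [
    (word, por, lan)
    for word, por, lan in zip(words, porter_stems, lancaster_stems)
    if por != lan
]

for word, por, lan in disagreements:
    print(f"{word:15} Porter={por:13} Lancaster={lan}")
```

In every row that differs, the Lancaster stem is shorter or more heavily rewritten, which matches its reputation as the more aggressive of the two.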


## 6. **Lemmatization Strategies using NLTK & spaCy**

### NLTK’s WordNetLemmatizer
**Task:** Lemmatize a political news headline to show how lemmatization helps retain the correct part of speech (POS) while normalizing words.

In [None]:
import nltk
nltk.download("wordnet")
nltk.download("punkt_tab")
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# 📰 Sample News Headline
headline = "The senators debated the increasing regulations affecting technology companies."

# Tokenize words
words = word_tokenize(headline)

# Initialize Lemmatizer
lemmatizer = WordNetLemmatizer()

# Apply Lemmatization (default without POS tagging)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print("Original Words:", words)
print("\nLemmatized Words:", lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Original Words: ['The', 'senators', 'debated', 'the', 'increasing', 'regulations', 'affecting', 'technology', 'companies', '.']

Lemmatized Words: ['The', 'senator', 'debated', 'the', 'increasing', 'regulation', 'affecting', 'technology', 'company', '.']
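
In the output above only the nouns change, because `lemmatize()` treats every word as a noun unless told otherwise. A common workaround is to map Penn Treebank tags (as produced by `nltk.pos_tag`) to WordNet's one-letter POS codes. The helper below is a minimal sketch of that mapping; `penn_to_wordnet` is a hypothetical name, and the tagged pairs are hand-written for illustration rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter: 'n', 'v', 'a' or 'r'."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # fall back to noun, the lemmatizer's own default


# Hand-written (word, tag) pairs; a real tagger may assign different tags.
tagged = [("senators", "NNS"), ("debated", "VBD"), ("affecting", "VBG")]
wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(wordnet_pos)  # [('senators', 'n'), ('debated', 'v'), ('affecting', 'v')]
```

With NLTK and the WordNet data available, each pair could then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, which lets verbs such as "debated" be reduced to "debate".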


### spaCy’s Built-in Lemmatizer

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

# Process the same headline
doc = nlp(headline)

# Apply Lemmatization (token.lemma_ holds the lemma; token.text is the unchanged token)
spacy_lemmatized = [token.lemma_ for token in doc]

print("\nspaCy Lemmatized Words:", spacy_lemmatized)


## 7. **Parsing & Chunking using NLTK**

**Task:** Analyze legal contracts and job descriptions where parsing and chunking help extract meaningful phrases like noun phrases (NPs) or verb phrases (VPs).

In [None]:
# 📜 Task: Extracting Key Phrases from Legal & Job Documents
import nltk

# nltk.download("punkt")
nltk.download("averaged_perceptron_tagger_eng")

# 📜 Sample Legal Contract Text
contract_text = "The tenant shall pay the monthly rent before the 5th of each month."

# Tokenize & POS Tagging
words = nltk.word_tokenize(contract_text)
pos_tags = nltk.pos_tag(words)

# Define a Chunking Grammar for Noun Phrases (NP)
grammar = r"NP: {<DT>?<JJ>*<NN>+}"

# Apply Chunking
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(pos_tags)

# Display Results
print("Chunked Tree:")
tree.pretty_print()


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


Chunked Tree:
                                               S                                                                     
    ___________________________________________|__________________________________________________________            
   |       |        |       |      |      |    |          NP                      NP                      NP         
   |       |        |       |      |      |    |     _____|______         ________|_________         _____|_____      
shall/MD pay/VB before/IN the/DT 5th/CD of/IN ./. The/DT     tenant/NN the/DT monthly/JJ rent/NN each/DT     month/NN
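
The tree is easier to use downstream once the NP chunks are pulled out as plain strings. The sketch below assumes NLTK is installed and reuses the same grammar; the `(word, tag)` pairs are copied from the tree above in sentence order, so no tokenizer or tagger data is needed. `noun_phrases` is an illustrative name.

```python
import nltk

# (word, tag) pairs copied from the chunked tree, in sentence order.
pos_tags = [
    ("The", "DT"), ("tenant", "NN"), ("shall", "MD"), ("pay", "VB"),
    ("the", "DT"), ("monthly", "JJ"), ("rent", "NN"), ("before", "IN"),
    ("the", "DT"), ("5th", "CD"), ("of", "IN"), ("each", "DT"),
    ("month", "NN"), (".", "."),
]

chunk_parser = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>+}")
tree = chunk_parser.parse(pos_tags)

# Join the words under every NP subtree into one phrase per chunk.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['The tenant', 'the monthly rent', 'each month']
```

Note that "the 5th" is not captured, because the grammar requires at least one `NN` and "5th" is tagged `CD`.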



## 8. **Exploring Hyponyms & Hypernyms using WordNet (NLTK)**

**Task:** Explore hyponyms (more specific terms) and hypernyms (more general terms) with WordNet. These hierarchical relationships between words are essential in scientific and business domains.

In [None]:
# 🔍 Task: Explore Word Relationships in Science & Business
from nltk.corpus import wordnet

# 🦁 Find Hypernyms & Hyponyms for "lion"
word = "lion"
synset = wordnet.synsets(word)[0]  # Selecting the first synset

# Hypernyms (More General Category)
hypernyms = synset.hypernyms()
print(f"Hypernyms (More General Concept) of '{word}':")
print([hypernym.name().split('.')[0] for hypernym in hypernyms])

# Hyponyms (More Specific Types)
hyponyms = synset.hyponyms()
print(f"\nHyponyms (More Specific Types) of '{word}':")
print([hyponym.name().split('.')[0] for hyponym in hyponyms])


Hypernyms (More General Concept) of 'lion':
['big_cat']

Hyponyms (More Specific Types) of 'lion':
['lionet', 'lioness', 'lion_cub']
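
Hypernym and hyponym links form a tree-like hierarchy, so they can be followed more than one step. The sketch below is a toy stand-in, not a WordNet query: `HYPERNYM_OF` is a hand-written dictionary holding only the links shown in the output above, and both function names are hypothetical.

```python
# Child -> parent links taken from the printed output (toy data, not WordNet).
HYPERNYM_OF = {
    "lionet": "lion",
    "lioness": "lion",
    "lion_cub": "lion",
    "lion": "big_cat",
}


def hypernym_chain(term):
    """Follow parent links upward until no more general term is known."""
    chain = []
    while term in HYPERNYM_OF:
        term = HYPERNYM_OF[term]
        chain.append(term)
    return chain


def hyponyms_of(term):
    """Return every direct child of a term, sorted alphabetically."""
    return sorted(child for child, parent in HYPERNYM_OF.items() if parent == term)


print(hypernym_chain("lioness"))  # ['lion', 'big_cat']
print(hyponyms_of("lion"))        # ['lion_cub', 'lioness', 'lionet']
```

With NLTK, the upward walk is available directly through `synset.hypernym_paths()`, which returns the full paths from the root of the hierarchy down to the synset.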


## **9. Named Entity Recognition (NER) with spaCy**
**Task:** Extract named entities from a complex sentence.

In [None]:
nlp = spacy.load("en_core_web_sm")
text = "In 1969, Neil Armstrong became the first person to walk on the Moon during the Apollo 11 mission."

doc = nlp(text)  # FILL THIS

print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")

Named Entities:
1969 -> DATE
Neil Armstrong -> PERSON
first -> ORDINAL
Moon -> PERSON
Apollo -> ORG
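
Grouping the entities by label makes questionable predictions easier to spot, such as "Moon" being tagged `PERSON` here. The sketch below copies the `(text, label)` pairs from the output above so it runs without spaCy; with a real `doc`, the same list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the printed output.
entities = [
    ("1969", "DATE"),
    ("Neil Armstrong", "PERSON"),
    ("first", "ORDINAL"),
    ("Moon", "PERSON"),
    ("Apollo", "ORG"),
]

# Collect entity texts under their label.
by_label = defaultdict(list)
for ent_text, ent_label in entities:
    by_label[ent_label].append(ent_text)

for ent_label, texts in by_label.items():
    print(f"{ent_label}: {texts}")
```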
