# 📘 Introduction to Natural Language Processing (NLP) for Beginners

Welcome to your first 2-hour session on NLP! 🚀

**Natural Language Processing (NLP)** is a fascinating field of AI that teaches computers how to understand and work with human language. Think of it as the bridge between how we talk and how machines process information. From Siri and Alexa to Google Translate, NLP is everywhere!

In this notebook, we will transform raw text into structured data that a machine learning model can understand. 

### 🎯 Learning Objectives for Today:

By the end of this 2-hour session, you will be able to:
1.  **Understand Text Structure**: Tag parts of speech (POS) and recognize named entities (NER).
2.  **Group Words**: Learn how to group words into meaningful phrases (Chunking).
3.  **Normalize Text**: Reduce words to their root forms (Lemmatization).
4.  **Explore Word Meanings**: Use WordNet to find word relationships.
5.  **Convert Text to Numbers**: Create numerical features from text using the Bag-of-Words model.
6.  **Clean Up Features**: Understand basic feature selection by removing common words (stopwords).
7.  **Measure Similarity**: Calculate how similar two documents are using their numerical representations.

--- 
### **Setup: Installing Necessary Libraries**
First, let's make sure we have the Python libraries we need. Run the cell below to install `nltk` and `scikit-learn`.

In [2]:
# This command installs the libraries. The '!' lets us run terminal commands in Jupyter.
!pip install nltk scikit-learn

# Now, let's import them and download some required data packages from NLTK
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet')
nltk.download('stopwords')

print("✅ Setup Complete! You're ready to start.")



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...


✅ Setup Complete! You're ready to start.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
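# Newer NLTK releases load the NE chunker from this '_tab' package.
# Download it if ne_chunk raises a LookupError.
nltk.download('maxent_ne_chunker_tab')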

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping chunkers\maxent_ne_chunker_tab.zip.


True

## Topic 1: Part-of-Speech (POS) Tagging

📄 **Explanation**

**Part-of-Speech (POS) Tagging** is like labeling words in a sentence with their grammatical type. Is the word a noun, a verb, an adjective, or something else? 

This is a crucial first step because the meaning of a word can change based on its POS. For example, "book" can be a noun ("I read a **book**") or a verb ("I need to **book** a flight"). POS tagging helps the computer understand this context.

Common tags include:
- `NN`: Noun (e.g., *cat, building*)
- `VB`: Verb, base form (e.g., *run, eat*)
- `JJ`: Adjective (e.g., *old, beautiful*)
- `DT`: Determiner (e.g., *the, a*)

In [3]:
# 💻 Example: Let's tag a sentence!
from nltk import word_tokenize, pos_tag

# Our example sentence
sentence = "The old building housed Apple Inc. in California."

# First, we break the sentence into individual words (tokens)
tokens = word_tokenize(sentence)

# Then, we apply POS tagging to the tokens
pos_tags = pos_tag(tokens)

print(pos_tags)

[('The', 'DT'), ('old', 'JJ'), ('building', 'NN'), ('housed', 'VBD'), ('Apple', 'NNP'), ('Inc.', 'NNP'), ('in', 'IN'), ('California', 'NNP'), ('.', '.')]


| Tag | Description |
|---|---|
| CC | Coordinating conjunction |
| CD | Cardinal number |
| DT | Determiner |
| EX | Existential there |
| FW | Foreign word |
| IN | Preposition or subordinating conjunction |
| JJ | Adjective |
| JJR | Adjective, comparative |
| JJS | Adjective, superlative |
| LS | List item marker |
| MD | Modal |
| NN | Noun, singular or mass |
| NNS | Noun, plural |
| NNP | Proper noun, singular |
| NNPS | Proper noun, plural |
| PDT | Predeterminer |
| POS | Possessive ending |
| PRP | Personal pronoun |
| PRP$ | Possessive pronoun |
| RB | Adverb |
| RBR | Adverb, comparative |
| RBS | Adverb, superlative |
| RP | Particle |
| SYM | Symbol |
| TO | to |
| UH | Interjection |
| VB | Verb, base form |
| VBD | Verb, past tense |
| VBG | Verb, gerund or present participle |
| VBN | Verb, past participle |
| VBP | Verb, non-3rd person singular present |
| VBZ | Verb, 3rd person singular present |
| WDT | Wh-determiner |
| WP | Wh-pronoun |
| WP$ | Possessive wh-pronoun |
| WRB | Wh-adverb |
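
Many of the tags in this table share a two-letter prefix (`NN`, `NNS`, `NNP`, and `NNPS` are all nouns, and every `VB…` tag is a verb). The sketch below uses that pattern to group the tagged words from our example sentence into coarse categories. The `COARSE` mapping and the `coarse_category` helper are made up for this illustration and are not part of NLTK.

```python
from collections import defaultdict

# Tagged output copied from the example sentence above.
pos_tags = [('The', 'DT'), ('old', 'JJ'), ('building', 'NN'), ('housed', 'VBD'),
            ('Apple', 'NNP'), ('Inc.', 'NNP'), ('in', 'IN'), ('California', 'NNP'),
            ('.', '.')]

# Hypothetical mapping from a tag prefix to a coarse, readable category.
COARSE = {'NN': 'noun', 'VB': 'verb', 'JJ': 'adjective', 'RB': 'adverb'}

def coarse_category(tag):
    """Return the coarse category for a Penn Treebank tag, or 'other'."""
    return COARSE.get(tag[:2], 'other')

# Collect the words that fall under each coarse category.
words_by_category = defaultdict(list)
for word, tag in pos_tags:
    words_by_category[coarse_category(tag)].append(word)

for category, words in words_by_category.items():
    print(f"{category}: {words}")
```

Grouping by prefix is a rough shortcut: it is handy for questions like "give me all the nouns", but it throws away detail such as tense or plural form.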

### 🎯 Practice Task 1

Now it's your turn! In the code cell below, create a new sentence and use the same process to find its POS tags. Try a sentence like: `"A fast car runs smoothly."`

In [4]:
# Your sentence here
my_sentence = "A fast car runs smoothly."

# Tokenize the sentence
my_tokens = word_tokenize(my_sentence)

# Get the POS tags
my_tags = pos_tag(my_tokens)

# Print your results!
print(my_tags)

[('A', 'DT'), ('fast', 'JJ'), ('car', 'NN'), ('runs', 'VBZ'), ('smoothly', 'RB'), ('.', '.')]


---

## Topic 2: Named Entity Recognition (NER) Tagging

📄 **Explanation**

**Named Entity Recognition (NER)** is the process of finding and classifying 'named entities' in text. These are real-world objects, such as:
- **Persons**: `Steve Jobs`
- **Organizations**: `Apple Inc.`
- **Locations**: `California`
- **Dates**: `October 5, 2011`

NER systems often use a tagging scheme called **IOB**: 
- **B-** (Beginning): Marks the beginning of an entity.
- **I-** (Inside): Marks a word that is inside an entity, but not the first word.
- **O** (Outside): Marks a word that is not part of any entity.

For example, in `Apple Inc.`, `Apple` would be `B-ORG` and `Inc.` would be `I-ORG`.
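
To make the IOB scheme concrete, here is a minimal sketch that turns hand-labelled entity spans into one tag per token. The `spans_to_iob` helper and the span positions are assumptions written for this illustration; they are not produced by a model or provided by NLTK.

```python
def spans_to_iob(tokens, spans):
    """Turn (start, end, label) token spans into one IOB tag per token.

    `end` is exclusive, as in Python slicing.
    """
    tags = ['O'] * len(tokens)           # everything starts as Outside
    for start, end, label in spans:
        tags[start] = f'B-{label}'       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f'I-{label}'       # remaining tokens of the entity
    return tags

tokens = ['The', 'old', 'building', 'housed', 'Apple', 'Inc.', 'in', 'California', '.']

# Hand-labelled spans (an assumption for illustration, not model output).
spans = [(4, 6, 'ORG'), (7, 8, 'LOC')]

for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(f"{token:12} {tag}")
```

The `B-` prefix matters when two entities of the same type sit next to each other: without it, there would be no way to tell where one ends and the next begins.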

In [9]:
# 💻 Example: Let's find the named entities!
from nltk import ne_chunk

# We'll tokenize and POS-tag a new sentence, just like in the previous step
sentence = "Jeff Bezos, the founder of Amazon, visited the main headquarters in Seattle on Monday."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

# Now, we apply Named Entity Chunking
ner_tree = ne_chunk(pos_tags)

print(ner_tree)

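# 💡 Fun Fact: The output is a 'tree' structure. Each named entity is grouped under a label
# such as PERSON or GPE (Geo-Political Entity, like a city or country).
# The tagger is not perfect: here it splits 'Jeff Bezos' in two and labels 'Bezos' and 'Amazon'
# as GPE instead of PERSON and ORGANIZATION.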

(S
  (PERSON Jeff/NNP)
  (GPE Bezos/NNP)
  ,/,
  the/DT
  founder/NN
  of/IN
  (GPE Amazon/NNP)
  ,/,
  visited/VBD
  the/DT
  main/JJ
  headquarters/NN
  in/IN
  (GPE Seattle/NNP)
  on/IN
  Monday/NNP
  ./.)

**Note:** The table below lists the larger label set used by libraries such as spaCy (the OntoNotes scheme). NLTK's `ne_chunk` uses a smaller set with longer names, for example `PERSON`, `ORGANIZATION`, `GPE`, `LOCATION`, and `FACILITY`.


| Tag | Description | Example |
|---|---|---|
| PERSON | People, including fictional characters. | "Steve Jobs", "Marie Curie" |
| NORP | Nationalities or religious or political groups. | "American", "Christian" |
| FAC | Facilities: buildings, airports, highways, bridges, etc. | "Eiffel Tower", "JFK Airport" |
| ORG | Organizations: companies, agencies, institutions, etc. | "Google", "United Nations" |
| GPE | Geopolitical Entity: countries, cities, states. | "USA", "Paris", "California" |
| LOC | Non-GPE locations, mountain ranges, bodies of water. | "Sahara Desert", "Nile River" |
| PRODUCT | Objects, vehicles, foods, etc. (not services). | "iPhone", "Ford Mustang" |
| EVENT | Named hurricanes, battles, wars, sports events, etc. | "Hurricane Katrina", "Super Bowl" |
| WORK_OF_ART | Titles of books, songs, etc. | "The Mona Lisa", "Bohemian Rhapsody" |
| LAW | Named documents made into laws. | "General Data Protection Regulation" |
| LANGUAGE | Any named language. | "English", "Spanish" |
| DATE | Absolute or relative dates or periods. | "2024-10-26", "yesterday" |
| TIME | Time units smaller than a day. | "four o'clock", "10:30 a.m." |
| PERCENT | Percentage, including "%". | "20%", "fifty percent" |
| MONEY | Monetary values, including unit. | "25 dollars", "€100" |
| QUANTITY | Measurements, as of weight or distance. | "25 miles", "10 kg" |
| ORDINAL | "first", "second", etc. | "first", "10th" |
| CARDINAL | Numerals that do not fall under another type. | "one", "2", "three" |

### 🎯 Practice Task 2

Identify the named entities in the following sentence by writing them down and labeling them (e.g., PERSON, ORG, GPE).

**Sentence**: `"Dr. Jonas Salk discovered the polio vaccine in Pittsburgh, Pennsylvania."`

*(Bonus: Try running this sentence through the code above to see what NLTK finds!)*

In [10]:
# Your code for the bonus task here
ner_sentence = "Dr. Jonas Salk discovered the polio vaccine in Pittsburgh, Pennsylvania."

# Your turn! Tokenize, POS tag, and then run ne_chunk
ner_tokens = word_tokenize(ner_sentence)
ner_tags = pos_tag(ner_tokens)
ner_result = ne_chunk(ner_tags)

print(ner_result)

(S
  Dr./NNP
  (PERSON Jonas/NNP Salk/NNP)
  discovered/VBD
  the/DT
  polio/NN
  vaccine/NN
  in/IN
  (GPE Pittsburgh/NNP)
  ,/,
  (GPE Pennsylvania/NNP)
  ./.)


## Topic 3: Chunking (or Shallow Parsing)

📄 **Explanation**

**Chunking** is a process of grouping related words into phrases, or "chunks." It's like finding the basic building blocks of a sentence. Unlike a full grammatical parse, it's a simpler, 'shallow' way to see structure.

The most common type is **Noun Phrase (NP) Chunking**, where we group words to find noun phrases.

For example, in the sentence `"The big red ball bounced"`, the noun phrase is `[NP The big red ball]`.

We can define a grammar rule to find these chunks. A common rule for an NP is: "Find an optional Determiner (like 'The'), followed by any number of Adjectives, and then one or more Nouns."
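
A chunking rule like this is essentially a regular expression over POS tags instead of characters. The sketch below shows the idea using only Python's built-in `re` module; it is not how NLTK implements chunking internally. The `find_noun_phrases` helper is a name chosen for this illustration, and the pattern allows one or more nouns so that multi-word names stay together.

```python
import re

# Tagged sentence from Topic 1.
pos_tags = [('The', 'DT'), ('old', 'JJ'), ('building', 'NN'), ('housed', 'VBD'),
            ('Apple', 'NNP'), ('Inc.', 'NNP'), ('in', 'IN'), ('California', 'NNP'),
            ('.', '.')]

# Optional determiner, any number of adjectives, then one or more nouns
# (NN, NNS, NNP, ...), written as a regex over "<TAG>" markers.
NP_PATTERN = re.compile(r'(?:<DT>)?(?:<JJ>)*(?:<NN[^>]*>)+')

def find_noun_phrases(tagged):
    """Return the words of every tag sequence that matches NP_PATTERN."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>...".
    tag_string = ''.join(f'<{tag}>' for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # Counting '<' before the match gives the index of its first token.
        first = tag_string[:match.start()].count('<')
        length = match.group().count('<')
        chunks.append(' '.join(word for word, _ in tagged[first:first + length]))
    return chunks

print(find_noun_phrases(pos_tags))
# ['The old building', 'Apple Inc.', 'California']
```

In practice you would let a library handle this matching, but seeing the rule as a plain regex makes the `?`, `*`, and `+` symbols easier to read.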

In [14]:
# 💻 Example: Let's find Noun Phrases!
from nltk import RegexpParser

# We'll use our tagged sentence again
sentence = "The old building housed Apple Inc. in California."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)

In [15]:
pos_tags

[('The', 'DT'),
 ('old', 'JJ'),
 ('building', 'NN'),
 ('housed', 'VBD'),
 ('Apple', 'NNP'),
 ('Inc.', 'NNP'),
 ('in', 'IN'),
 ('California', 'NNP'),
 ('.', '.')]

In [16]:


# Define our Noun Phrase (NP) chunking grammar rule
# <DT>? = optional Determiner
# <JJ>* = zero or more Adjectives
# <NN.*>+ = one or more Nouns of any type (NN, NNP, etc.)
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"

In [17]:
grammar

'NP: {<DT>?<JJ>*<NN.*>+}'

In [18]:


# Create a parser with our grammar
chunk_parser = RegexpParser(grammar)

# Parse the sentence to find chunks
chunk_tree = chunk_parser.parse(pos_tags)

print(chunk_tree)

# 💡 Notice how it correctly grouped (NP The old building), (NP Apple Inc.), and (NP California)!

(S
  (NP The/DT old/JJ building/NN)
  housed/VBD
  (NP Apple/NNP Inc./NNP)
  in/IN
  (NP California/NNP)
  ./.)


| Chunk Tag | Description | Example Sentence | Chunked Example |
|---|---|---|---|
| NP | Noun Phrase | The quick brown fox | [NP The quick brown fox] |
| VP | Verb Phrase | is jumping | [VP is jumping] |
| PP | Prepositional Phrase | on the lazy dog | [PP on the lazy dog] |
| ADJP | Adjective Phrase | very quick | [ADJP very quick] |
| ADVP | Adverb Phrase | almost too quickly | [ADVP almost too quickly] |
| SBAR | Subordinate Clause | because he is happy | [SBAR because he is happy] |
| PRT | Particle | looked up the info | [PRT up] |
| CONJP | Conjunction Phrase | as well as | [CONJP as well as] |
| INTJ | Interjection | Wow! | [INTJ Wow!] |
| LST | List Marker | a) b) c) | [LST a)] |

### 🎯 Practice Task 3

Try to create a simple grammar rule to chunk **verb phrases (VP)** that consist of a verb (`VB` or `VBD`) and an adverb (`RB`).

**Sentence**: `The cat ran quickly.`
**Target Chunk**: `[VP ran quickly]`

Fill in the grammar rule in the code below.

In [21]:
practice_sentence = "The cat ran quickly."
practice_tokens = word_tokenize(practice_sentence)
practice_tags = pos_tag(practice_tokens)

# Define your grammar rule for a Verb Phrase (VP) here
# A verb can be VBD (past tense) or VB (base form). Let's use <VB.*> to catch both.
# An adverb is RB.
vp_grammar = "VP: {<VB.*><RB>}"  # Your rule here!

vp_parser = RegexpParser(vp_grammar)
vp_tree = vp_parser.parse(practice_tags)

print(vp_tree)

(S The/DT cat/NN (VP ran/VBD quickly/RB) ./.)


## Topic 4: Lemmatization

📄 **Explanation**

**Lemmatization** is the process of reducing a word to its root or dictionary form, which is called the **lemma**.

Unlike its simpler cousin, **Stemming** (which just chops off ends of words), lemmatization is smarter. It considers the word's Part-of-Speech to find the correct dictionary form.

- The lemma of `running` (verb) is `run`.
- The lemma of `ran` (verb) is `run`.
- The lemma of `better` (adjective) is `good`.

This helps group different forms of a word into a single concept, which is very useful for analysis.
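
The difference between chopping off endings and looking up a dictionary form can be seen in a toy sketch. Both helpers below are deliberately simplistic and written only for this illustration: `naive_stem` is not a real stemming algorithm, and the `LEMMAS` table is a tiny hand-written stand-in for a real dictionary.

```python
def naive_stem(word):
    """Toy stemmer: strip the first matching suffix, with no dictionary check."""
    for suffix in ('ing', 'er', 's'):
        # Only strip when enough of the word would be left over.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Tiny hand-written lemma table keyed by (word, part of speech).
LEMMAS = {
    ('running', 'v'): 'run',
    ('ran', 'v'): 'run',
    ('better', 'a'): 'good',
}

def toy_lemmatize(word, pos):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

for word, pos in [('running', 'v'), ('ran', 'v'), ('better', 'a')]:
    print(f"{word}: stem={naive_stem(word)!r}, lemma={toy_lemmatize(word, pos)!r}")
# running: stem='runn', lemma='run'
# ran: stem='ran', lemma='run'
# better: stem='bett', lemma='good'
```

Suffix stripping cannot connect `ran` to `run` or `better` to `good`, because those links live in a dictionary, not in the spelling.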

In [22]:
# 💻 Example: Finding the lemma
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Let's lemmatize some words. The 'pos' argument tells it the part of speech.
# 'v' for verb, 'a' for adjective, 'n' for noun.
print(f"running (verb) -> {lemmatizer.lemmatize('running', pos='v')}")
print(f"ran (verb) -> {lemmatizer.lemmatize('ran', pos='v')}")
print(f"better (adjective) -> {lemmatizer.lemmatize('better', pos='a')}")
print(f"buildings (noun) -> {lemmatizer.lemmatize('buildings', pos='n')}")

# 💡 What happens if you don't specify the POS? The lemmatizer assumes a noun, so 'running' stays unchanged.
print(f"\nrunning (default pos) -> {lemmatizer.lemmatize('running')}")

running (verb) -> run
ran (verb) -> run
better (adjective) -> good
buildings (noun) -> building

running (default pos) -> running


### 🎯 Practice Task 4

Create a list of words: `['studies', 'studying', 'feet', 'leaves']`.

Loop through the list and print the lemma for each word. Remember to use the correct `pos` tag (`'v'` for verbs, `'n'` for nouns) to get the right answer!

In [23]:
lemmatizer = WordNetLemmatizer()

# Pair each word with its POS: 'studies' and 'studying' are verbs, 'feet' and 'leaves' are nouns.
my_words = [('studies', 'v'), ('studying', 'v'), ('feet', 'n'), ('leaves', 'n')]
pos_names = {'v': 'verb', 'n': 'noun'}

# Loop through the list and print the lemma for each word
for word, pos in my_words:
    print(f"{word} ({pos_names[pos]}) -> {lemmatizer.lemmatize(word, pos=pos)}")

studies (verb) -> study
studying (verb) -> study
feet (noun) -> foot
leaves (noun) -> leaf


## Topic 5: WordNet

📄 **Explanation**

**WordNet** is a huge digital dictionary for English. But instead of just definitions, it groups words into sets of synonyms called **synsets** and shows how they are related to each other.

It's like a giant web of word meanings! Some key relationships are:

- **Hypernyms**: The 'is-a' relationship going up. A `car` *is a* `vehicle`. `vehicle` is the hypernym.
- **Hyponyms**: The relationship going down. `car`, `truck`, and `bus` are hyponyms of `vehicle`.

WordNet helps computers understand the meaning and context behind words, not just the words themselves.
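
The "going up" and "going down" relationships can be pictured as links in a small tree. The sketch below uses a hand-written toy taxonomy, not real WordNet data; the `HYPERNYM` dictionary and both helper functions are assumptions made for this illustration.

```python
# Hand-written toy taxonomy (an assumption for illustration, not real WordNet data).
# Each word points to its hypernym, the more general "is-a" parent.
HYPERNYM = {
    'car': 'vehicle',
    'truck': 'vehicle',
    'bus': 'vehicle',
    'vehicle': 'object',
}

def hypernym_chain(word):
    """Follow 'is-a' links upward until no parent is left."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def hyponyms(word):
    """Collect the direct children of a word (the links going down)."""
    return sorted(child for child, parent in HYPERNYM.items() if parent == word)

print(hypernym_chain('car'))   # ['car', 'vehicle', 'object']
print(hyponyms('vehicle'))     # ['bus', 'car', 'truck']
```

The real WordNet is far larger and links synsets (specific meanings) rather than plain strings, so one word can sit in several places in the web.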

In [24]:
# 💻 Example: Exploring WordNet
from nltk.corpus import wordnet

# Let's find the synsets (different meanings) for the word 'car'
syns = wordnet.synsets('car')
print("Synsets for 'car':", syns)

# Let's look at the first meaning
car_syn = syns[0]
print(f"\nFirst synset: {car_syn.name()}")
print(f"Definition: {car_syn.definition()}")

# Now let's find its hypernym (what is it a kind of?)
hypernyms = car_syn.hypernyms()
print(f"Hypernyms: {hypernyms}")

Synsets for 'car': [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]

First synset: car.n.01
Definition: a motor vehicle with four wheels; usually propelled by an internal combustion engine
Hypernyms: [Synset('motor_vehicle.n.01')]


### 🎯 Practice Task 5

Using the code example above as a guide, find the **hyponyms** (more specific examples) of the first synset for `vehicle`.

In [25]:
# Find the first synset for 'vehicle'
vehicle_syn = wordnet.synsets('vehicle')[0]

# Get the hyponyms of that synset
vehicle_hyponyms = vehicle_syn.hyponyms()

# Print the results
print(f"The first synset for vehicle is: {vehicle_syn.name()}")
print(f"Its hyponyms are: {vehicle_hyponyms}")

# 🧪 Try changing the word to 'dog' or 'cat' and see what you find!

The first synset for vehicle is: vehicle.n.01
Its hyponyms are: [Synset('wheeled_vehicle.n.01'), Synset('rocket.n.01'), Synset('craft.n.02'), Synset('bumper_car.n.01'), Synset('steamroller.n.02'), Synset('skibob.n.01'), Synset('sled.n.01'), Synset('military_vehicle.n.01')]


## Topic 6: Bag-of-Words (BoW) Model

📄 **Explanation**

Machine learning models don't understand text; they understand numbers. The **Bag-of-Words (BoW)** model is a simple way to turn text into numerical vectors.

Here's how it works:
1.  **Create a Vocabulary**: Collect all unique words from your entire set of documents.
2.  **Count Words**: For each document, count how many times each word from the vocabulary appears.

The result is a vector (a list of numbers) for each document, where each number represents the count of a specific word.

💡 **Key Idea**: This model is called a "bag" of words because it ignores grammar and word order, only caring about word counts.
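
The two steps above can be sketched with nothing but the Python standard library. This is a minimal illustration of the idea, not a replacement for a library vectorizer; the `tokenize` helper and its rule (lowercase, keep runs of two or more word characters) are choices made for this sketch.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r'\w\w+', text.lower())

documents = ["NLP is fun and NLP is great.", "NLP is great for text analysis."]

# Step 1: the vocabulary is every unique word, sorted for a stable column order.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})

# Step 2: one count per vocabulary word, for each document.
vectors = []
for doc in documents:
    counts = Counter(tokenize(doc))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
# ['analysis', 'and', 'for', 'fun', 'great', 'is', 'nlp', 'text']
for vector in vectors:
    print(vector)
# [0, 1, 0, 1, 1, 2, 2, 0]
# [1, 0, 1, 0, 1, 1, 1, 1]
```

Each position in a vector lines up with one vocabulary word, so the `2` in the first vector's `nlp` column records that "NLP" appears twice in the first document.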

In [26]:
# 💻 Example: Creating BoW vectors
from sklearn.feature_extraction.text import CountVectorizer

In [28]:
# Our documents
d1 = "NLP is fun and NLP is great."
d2 = "NLP is great for text analysis."
corpus = [d1, d2]
corpus

['NLP is fun and NLP is great.', 'NLP is great for text analysis.']

In [30]:
# Create a CountVectorizer object
vectorizer = CountVectorizer()
vectorizer

In [32]:
# Learn the vocabulary and create the BoW vectors
X = vectorizer.fit_transform(corpus)
X

<2x8 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [33]:
# Print the vocabulary (the features)
print("Vocabulary:", vectorizer.get_feature_names_out())

Vocabulary: ['analysis' 'and' 'for' 'fun' 'great' 'is' 'nlp' 'text']


In [34]:
# Print the BoW vectors as a dense array
print("\nBoW Vectors:")
print(X.toarray())


BoW Vectors:
[[0 1 0 1 1 2 2 0]
 [1 0 1 0 1 1 1 1]]


### 🎯 Practice Task 6

You have two new sentences:
- `doc_A = "The cat sat on the mat."`
- `doc_B = "The dog sat on the log."`

Use `CountVectorizer` to create BoW vectors for these two documents and print the vocabulary and the resulting array.

In [35]:
doc_A = "The cat sat on the mat."
doc_B = "The dog sat on the log."
new_corpus = [doc_A, doc_B]

# Create a new vectorizer
my_vectorizer = CountVectorizer()

# Fit and transform the new corpus
my_X = my_vectorizer.fit_transform(new_corpus)

# Print the results
print("New Vocabulary:", my_vectorizer.get_feature_names_out())
print("\nNew BoW Vectors:")
print(my_X.toarray())

New Vocabulary: ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']

New BoW Vectors:
[[1 0 0 1 1 1 2]
 [0 1 1 0 1 1 2]]


## Topic 7: Feature Selection (Removing Stop Words)

📄 **Explanation**

Often, our text contains very common words that don't add much meaning, like `the`, `is`, `a`, `in`. These are called **stop words**.

**Feature Selection** is the process of choosing the most relevant features (words) for our model. A simple and effective form of feature selection is removing stop words.

By removing them, we reduce the size of our vocabulary and help our model focus on the words that carry more importance.
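
To see the vocabulary shrink, here is a small sketch that builds a vocabulary with and without stop words. The `STOP_WORDS` set is a tiny hand-picked list assumed for this illustration (real stop word lists are much longer), and the `vocabulary` helper is a hypothetical name.

```python
import re

# Tiny hand-picked stop list (an assumption for illustration; real lists are much longer).
STOP_WORDS = {'the', 'is', 'a', 'in', 'on'}

documents = ["The cat sat on the mat.", "The dog sat on the log."]

def vocabulary(docs, stop_words=frozenset()):
    """Return the sorted unique lowercase words, skipping any stop words."""
    words = {w for doc in docs for w in re.findall(r'[a-z]+', doc.lower())}
    return sorted(words - set(stop_words))

full = vocabulary(documents)
reduced = vocabulary(documents, STOP_WORDS)

print(len(full), full)        # 7 ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(len(reduced), reduced)  # 5 ['cat', 'dog', 'log', 'mat', 'sat']
```

The words that remain are the ones that actually distinguish the two sentences, which is the point of this kind of feature selection.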

In [36]:
# 💻 Example: Removing stop words
from nltk.corpus import stopwords

sentence = "NLP is great for text analysis."
tokens = word_tokenize(sentence.lower()) # Convert to lowercase to match stop words

# Get the list of English stop words
stop_words = set(stopwords.words('english'))

filtered_tokens = []
for word in tokens:
    if word not in stop_words:
        filtered_tokens.append(word)

print("Original Tokens:", tokens)
print("Filtered Tokens (no stop words):", filtered_tokens)

Original Tokens: ['nlp', 'is', 'great', 'for', 'text', 'analysis', '.']
Filtered Tokens (no stop words): ['nlp', 'great', 'text', 'analysis', '.']


### 🎯 Practice Task 7

Remove the stop words from the following sentence: `"This is a simple example to show the process."`

In [37]:
stop_words = set(stopwords.words('english'))
practice_sentence = "This is a simple example to show the process."

# Your code here: tokenize, and then filter out stop words
practice_tokens = word_tokenize(practice_sentence.lower())

filtered_list = [word for word in practice_tokens if word not in stop_words and word.isalpha()]

print(filtered_list)

['simple', 'example', 'show', 'process']


## Topic 8: Document Similarity

📄 **Explanation**

Now that we can turn text into numerical vectors (using BoW), we can measure how similar two documents are! **Document Similarity** is a score that tells us how close two documents are in terms of their content.

A popular method is **Cosine Similarity**. It measures the cosine of the angle between two vectors, so it compares the direction they point in rather than their length.
- A score of **1** means the documents are identical (or at least point in the same direction).
- A score of **0** means the documents have no words in common.
- A score in between (like **0.6**) means they are moderately similar.

This is very powerful for tasks like finding related articles, plagiarism detection, and recommendation systems.
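
Under the hood, cosine similarity is the dot product of the two vectors divided by the product of their lengths. The sketch below works this out by hand for the two BoW vectors from Topic 6. The function name `cosine_similarity_manual` is chosen for this illustration, and returning `0.0` for an all-zero vector is a convention assumed here.

```python
import math

def cosine_similarity_manual(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for empty documents.
    return dot / (norm_a * norm_b)

# BoW count vectors from the Topic 6 example.
v1 = [0, 1, 0, 1, 1, 2, 2, 0]
v2 = [1, 0, 1, 0, 1, 1, 1, 1]

# dot = 5, |v1| = sqrt(11), |v2| = sqrt(6), so the score is 5 / sqrt(66).
print(f"{cosine_similarity_manual(v1, v2):.3f}")  # 0.615
```

Because the score depends only on direction, a document repeated twice gets the same similarity to other documents as the original.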

In [38]:
# 💻 Example: Calculating Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity

# Let's reuse our first BoW example
d1 = "NLP is fun and NLP is great."
d2 = "NLP is great for text analysis."
corpus = [d1, d2]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
vectors = X.toarray()

# The vectors are:
vec1 = vectors[0].reshape(1, -1) # reshape for the function
vec2 = vectors[1].reshape(1, -1)
print("Vector for D1:", vec1)
print("Vector for D2:", vec2)

# Calculate cosine similarity
similarity_score = cosine_similarity(vec1, vec2)

print(f"\nCosine Similarity: {similarity_score[0][0]:.3f}")

Vector for D1: [[0 1 0 1 1 2 2 0]]
Vector for D2: [[1 0 1 0 1 1 1 1]]

Cosine Similarity: 0.615


### 🎯 Practice Task 8

Based on the result above (our calculated score was around `0.615`), would you say the two documents are very similar, somewhat similar, or not similar at all? 

Write your answer as a comment in the code cell below.

In [39]:
# Your answer here:
# A score of 0.615 is greater than 0.5, which indicates the documents are somewhat similar.
# They share important keywords like 'NLP' and 'great', but also have unique words, so they are not identical.

# 🎉 Final Revision Assignment 🎉

Congratulations on completing the core concepts! Now it's time to put everything together. This assignment is designed for you to practice at home and solidify your understanding.

You are given two new customer reviews:

**Review 1**: `"The new phone has an amazing camera and a great screen."`
**Review 2**: `"The camera is good, but the new screen is much better."`

Complete the following tasks in the code cells below.

In [None]:
# Let's define our reviews
review1 = "The new phone has an amazing camera and a great screen."
review2 = "The camera is good, but the new screen is much better."

**Task 1:** Perform POS tagging on `review1`.

In [None]:
# Your code for Task 1
tokens1 = word_tokenize(review1)
tags1 = pos_tag(tokens1)
print("POS Tags for Review 1:", tags1)

**Task 2:** Perform Named Entity Recognition (NER) on this sentence: `"Apple announced the new iPhone in California."` Is it finding the entities correctly?

In [None]:
# Your code for Task 2
ner_review = "Apple announced the new iPhone in California."
ner_tokens = word_tokenize(ner_review)
ner_tags = pos_tag(ner_tokens)
ner_tree = ne_chunk(ner_tags)
print(ner_tree)

**Task 3:** Lemmatize the following words: `['amazing', 'better', 'has']`. Remember to use the correct POS tag (`a` for adjective, `v` for verb).

In [None]:
# Your code for Task 3
lemmatizer = WordNetLemmatizer()
print(f"amazing (adj) -> {lemmatizer.lemmatize('amazing', pos='a')}")
print(f"better (adj) -> {lemmatizer.lemmatize('better', pos='a')}")
print(f"has (verb) -> {lemmatizer.lemmatize('has', pos='v')}")

**Task 4:** Create Bag-of-Words vectors for `review1` and `review2`. Before you do, remove the stop words from both reviews!

In [None]:
# Your code for Task 4
stop_words = set(stopwords.words('english'))

# Filter review 1
tokens1 = word_tokenize(review1.lower())
filtered1 = [word for word in tokens1 if word not in stop_words and word.isalpha()]
filtered_review1 = " ".join(filtered1)

# Filter review 2
tokens2 = word_tokenize(review2.lower())
filtered2 = [word for word in tokens2 if word not in stop_words and word.isalpha()]
filtered_review2 = " ".join(filtered2)

print("Filtered Review 1:", filtered_review1)
print("Filtered Review 2:", filtered_review2)

# Now create BoW vectors
final_corpus = [filtered_review1, filtered_review2]
final_vectorizer = CountVectorizer()
final_X = final_vectorizer.fit_transform(final_corpus)

print("\nFinal Vocabulary:", final_vectorizer.get_feature_names_out())
print("Final Vectors:\n", final_X.toarray())

**Task 5:** Calculate the Cosine Similarity between the two vectors you created in Task 4. Are the reviews similar?

In [None]:
# Your code for Task 5
similarity = cosine_similarity(final_X[0], final_X[1])
print(f"The cosine similarity between the reviews is: {similarity[0][0]:.3f}")

**Task 6 (Conceptual):** In your own words, why is it a good idea to lemmatize words *before* creating Bag-of-Words vectors? Write your answer as a comment.

In [None]:
# Your answer for Task 6
# It's a good idea because it groups different forms of a word into a single concept.
# For example, 'run', 'running', and 'ran' would all become 'run'.
# This means our vocabulary will be smaller, and the counts for the word 'run' will be more accurate,
# which helps the model understand that all these words refer to the same idea.

### ✅ Well done! You've completed the introduction to NLP. Keep experimenting and building on these foundational concepts!