# 🌟 **NLP ASSIGNMENT**
### 📝 **NLP Pipeline**
#### 👩‍🎓 *Submitted by:* **Aruhi Choudhary**
---


## 🔹 **Part 1: NLP Pipeline**  
In this section, we perform:  
1️⃣ Tokenization  
2️⃣ Stopword Removal  
3️⃣ Stemming  
4️⃣ POS Tagging  
5️⃣ Lemmatization  

Each step prints its output so the effect of that stage is visible.


### 📌 STEP 1: Install and Import Libraries

*This cell installs NLTK, downloads the data packages the pipeline needs, and imports the functions used in the later steps.*

In [13]:
!pip install -q nltk

import nltk

# Required downloads
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# POS tagger models (newer NLTK releases load the `_eng` package)
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

# Imports
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


### 📌 STEP 2: Paragraph Input

*This cell stores the sample paragraph that all later steps process.*

In [14]:
paragraph = """
Natural Language Processing (NLP) is a branch of artificial intelligence
that helps computers understand, interpret and generate human language.
It is widely used in chatbots, sentiment analysis, translation and many real-world applications.
"""

print("Original Paragraph:\n")
print(paragraph)


Original Paragraph:


Natural Language Processing (NLP) is a branch of artificial intelligence 
that helps computers understand, interpret and generate human language. 
It is widely used in chatbots, sentiment analysis, translation and many real-world applications.



### 📌 STEP 3: Sentence Tokenization

*This converts the paragraph into separate sentences.*

In [15]:
sent_tokens = sent_tokenize(paragraph)

print("\n--- Sentence Tokenization ---")
print(sent_tokens)



--- Sentence Tokenization ---
['\nNatural Language Processing (NLP) is a branch of artificial intelligence \nthat helps computers understand, interpret and generate human language.', 'It is widely used in chatbots, sentiment analysis, translation and many real-world applications.']
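
The first sentence in the output above still carries the leading `\n` and the line break from the triple-quoted string. Below is a minimal sketch of one way to clean this up, assuming it is acceptable to collapse every run of whitespace into a single space before tokenizing. The helper name `normalize_whitespace` is hypothetical and not part of NLTK.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and ignores
    # leading/trailing whitespace, so joining with " " yields one clean line.
    return " ".join(text.split())


paragraph = """
Natural Language Processing (NLP) is a branch of artificial intelligence
that helps computers understand, interpret and generate human language.
It is widely used in chatbots, sentiment analysis, translation and many real-world applications.
"""

clean_paragraph = normalize_whitespace(paragraph)
print(clean_paragraph)
```

Passing `clean_paragraph` to `sent_tokenize` instead of `paragraph` should then give sentences without embedded newlines.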


### 📌 STEP 4: Word Tokenization

*This splits the paragraph into individual tokens: words plus separate punctuation marks such as `(`, `,` and `.`.*

In [16]:
word_tokens = word_tokenize(paragraph)

print("\n--- Word Tokenization ---")
print(word_tokens)


--- Word Tokenization ---
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'branch', 'of', 'artificial', 'intelligence', 'that', 'helps', 'computers', 'understand', ',', 'interpret', 'and', 'generate', 'human', 'language', '.', 'It', 'is', 'widely', 'used', 'in', 'chatbots', ',', 'sentiment', 'analysis', ',', 'translation', 'and', 'many', 'real-world', 'applications', '.']
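
To make the splitting behaviour above more concrete, here is a rough regex-based approximation. It is only a sketch: `simple_word_tokenize` and `TOKEN_PATTERN` are hypothetical names, and NLTK's `word_tokenize` applies a much richer rule set (contractions, abbreviations, quotes) that this pattern does not try to reproduce.

```python
import re
from typing import List

# Illustrative pattern: a word (optionally with internal hyphens),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def simple_word_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation marks."""
    return TOKEN_PATTERN.findall(text)


sample = ("It is widely used in chatbots, sentiment analysis, "
          "translation and many real-world applications.")
tokens = simple_word_tokenize(sample)
print(tokens)
```

On this sample sentence the sketch keeps `real-world` as one token and emits each comma and the final period as separate tokens, which matches what the output above shows for the same sentence.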


### 📌 STEP 5: Stopword Removal

*This removes common English stopwords such as “the”, “is”, and “and”. The `w.isalnum()` check additionally discards every token that is not purely alphanumeric, such as punctuation.*

In [17]:
stop_words = set(stopwords.words("english"))

filtered_words = [
    w for w in word_tokens
    if w.lower() not in stop_words and w.isalnum()
]

print("\n--- After Stopword Removal ---")
print(filtered_words)


--- After Stopword Removal ---
['Natural', 'Language', 'Processing', 'NLP', 'branch', 'artificial', 'intelligence', 'helps', 'computers', 'understand', 'interpret', 'generate', 'human', 'language', 'widely', 'used', 'chatbots', 'sentiment', 'analysis', 'translation', 'many', 'applications']
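
Comparing this output with the Step 4 tokens shows that `real-world` has disappeared: `str.isalnum()` returns `False` for it because of the hyphen. The sketch below contrasts the filter used above with a more lenient alternative. The small `demo_stop_words` set and the shortened token list are assumptions made for illustration; the notebook itself uses NLTK's full English stopword list.

```python
# Small hand-picked stopword set for illustration only.
demo_stop_words = {"is", "in", "and", "it"}

tokens = ["It", "is", "widely", "used", "in", "chatbots", ",", "translation",
          "and", "many", "real-world", "applications", "."]

# Same condition as the cell above: isalnum() rejects "real-world".
strict = [
    w for w in tokens
    if w.lower() not in demo_stop_words and w.isalnum()
]

# Alternative: keep a token if it contains at least one letter or digit,
# so pure punctuation is dropped but hyphenated words survive.
lenient = [
    w for w in tokens
    if w.lower() not in demo_stop_words and any(ch.isalnum() for ch in w)
]

print(strict)   # no 'real-world'
print(lenient)  # includes 'real-world'
```

Which variant is preferable depends on whether hyphenated terms matter for the downstream task.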


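### 📌 STEP 6: Stemming

*This reduces each word to its stem using the Porter stemmer. Stems are produced by stripping suffixes, so results such as `natur` and `languag` need not be dictionary words.*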

In [18]:
ps = PorterStemmer()

stemmed_words = [ps.stem(w) for w in filtered_words]

print("\n--- After Stemming ---")
print(stemmed_words)


--- After Stemming ---
['natur', 'languag', 'process', 'nlp', 'branch', 'artifici', 'intellig', 'help', 'comput', 'understand', 'interpret', 'gener', 'human', 'languag', 'wide', 'use', 'chatbot', 'sentiment', 'analysi', 'translat', 'mani', 'applic']
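
The output above also shows that the stems come back in lowercase and that different surface forms can collapse into one stem. As a small sketch, the lists printed in Steps 5 and 6 can be paired up to find stems shared by more than one input word. The variable names `stem_groups` and `merged` are chosen here for illustration.

```python
from collections import defaultdict

# Word and stem lists copied from the outputs of Steps 5 and 6.
filtered_words = ['Natural', 'Language', 'Processing', 'NLP', 'branch',
                  'artificial', 'intelligence', 'helps', 'computers',
                  'understand', 'interpret', 'generate', 'human', 'language',
                  'widely', 'used', 'chatbots', 'sentiment', 'analysis',
                  'translation', 'many', 'applications']
stemmed_words = ['natur', 'languag', 'process', 'nlp', 'branch', 'artifici',
                 'intellig', 'help', 'comput', 'understand', 'interpret',
                 'gener', 'human', 'languag', 'wide', 'use', 'chatbot',
                 'sentiment', 'analysi', 'translat', 'mani', 'applic']

# Map each stem to the distinct surface forms that produced it.
stem_groups = defaultdict(list)
for word, stem in zip(filtered_words, stemmed_words):
    if word not in stem_groups[stem]:
        stem_groups[stem].append(word)

# Keep only stems shared by more than one surface form.
merged = {stem: words for stem, words in stem_groups.items() if len(words) > 1}
print(merged)  # {'languag': ['Language', 'language']}
```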


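### 📌 STEP 7: POS (Part-of-Speech) Tagging

*This assigns a part-of-speech tag (noun, verb, adjective, etc.) to each word. The tagger runs on the filtered list, so it sees the words without their original sentence context, and some tags are questionable (for example, `interpret` is tagged `JJ` and `generate` is tagged `NN`).*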

In [19]:
pos_tokens = pos_tag(filtered_words)

print("\n--- POS Tags ---")
print(pos_tokens)


--- POS Tags ---
[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('NLP', 'NNP'), ('branch', 'NN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('helps', 'VBZ'), ('computers', 'NNS'), ('understand', 'VBP'), ('interpret', 'JJ'), ('generate', 'NN'), ('human', 'JJ'), ('language', 'NN'), ('widely', 'RB'), ('used', 'VBN'), ('chatbots', 'NNS'), ('sentiment', 'JJ'), ('analysis', 'NN'), ('translation', 'NN'), ('many', 'JJ'), ('applications', 'NNS')]
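
The tags in this output fall into four broad classes that can be told apart by their first letter (`J` adjective, `N` noun, `V` verb, `R` adverb), which is also the distinction the lemmatization step relies on. A short sketch that tallies the classes, using the tagged pairs copied from the output above (`coarse_counts` is a name introduced here):

```python
from collections import Counter

# (word, tag) pairs copied from the POS tagging output above.
pos_tokens = [('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'),
              ('NLP', 'NNP'), ('branch', 'NN'), ('artificial', 'JJ'),
              ('intelligence', 'NN'), ('helps', 'VBZ'), ('computers', 'NNS'),
              ('understand', 'VBP'), ('interpret', 'JJ'), ('generate', 'NN'),
              ('human', 'JJ'), ('language', 'NN'), ('widely', 'RB'),
              ('used', 'VBN'), ('chatbots', 'NNS'), ('sentiment', 'JJ'),
              ('analysis', 'NN'), ('translation', 'NN'), ('many', 'JJ'),
              ('applications', 'NNS')]

# Group tags by their first letter to get a coarse word-class count.
coarse_counts = Counter(tag[0] for _, tag in pos_tokens)
print(coarse_counts)  # Counter({'N': 12, 'J': 6, 'V': 3, 'R': 1})
```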


### 📌 STEP 8: Lemmatization

*This converts each word to its dictionary form (lemma) with the WordNet lemmatizer, using the POS tags from Step 7 to choose the word class.*

In [20]:
lemmatizer = WordNetLemmatizer()

# Function to map POS tags to WordNet format
def get_pos(tag):
    if tag.startswith("J"):
        return wordnet.ADJ      # Adjective
    elif tag.startswith("V"):
        return wordnet.VERB     # Verb
    elif tag.startswith("N"):
        return wordnet.NOUN     # Noun
    elif tag.startswith("R"):
        return wordnet.ADV      # Adverb
    return wordnet.NOUN         # Default for any other tag

# Apply POS-aware Lemmatization
lemmatized_words = [
    lemmatizer.lemmatize(word, get_pos(tag))
    for word, tag in pos_tokens
]

print("\n--- After Lemmatization ---")
print(lemmatized_words)



--- After Lemmatization ---
['Natural', 'Language', 'Processing', 'NLP', 'branch', 'artificial', 'intelligence', 'help', 'computer', 'understand', 'interpret', 'generate', 'human', 'language', 'widely', 'use', 'chatbots', 'sentiment', 'analysis', 'translation', 'many', 'application']
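
To see what lemmatization actually changed, the sketch below lines up the outputs of Steps 5, 6 and 8 and keeps only the words whose lemma differs from the input. The lists are copied from the outputs above; `changed` is a name introduced for this example.

```python
# Lists copied from the outputs of Steps 5, 6 and 8.
filtered_words = ['Natural', 'Language', 'Processing', 'NLP', 'branch',
                  'artificial', 'intelligence', 'helps', 'computers',
                  'understand', 'interpret', 'generate', 'human', 'language',
                  'widely', 'used', 'chatbots', 'sentiment', 'analysis',
                  'translation', 'many', 'applications']
stemmed_words = ['natur', 'languag', 'process', 'nlp', 'branch', 'artifici',
                 'intellig', 'help', 'comput', 'understand', 'interpret',
                 'gener', 'human', 'languag', 'wide', 'use', 'chatbot',
                 'sentiment', 'analysi', 'translat', 'mani', 'applic']
lemmatized_words = ['Natural', 'Language', 'Processing', 'NLP', 'branch',
                    'artificial', 'intelligence', 'help', 'computer',
                    'understand', 'interpret', 'generate', 'human', 'language',
                    'widely', 'use', 'chatbots', 'sentiment', 'analysis',
                    'translation', 'many', 'application']

# Keep only the rows where lemmatization changed the word.
changed = [
    (word, stem, lemma)
    for word, stem, lemma in zip(filtered_words, stemmed_words, lemmatized_words)
    if lemma != word
]

for word, stem, lemma in changed:
    print(f"{word:<14} stem={stem:<8} lemma={lemma}")
```

Only four words change, and each lemma is a real word, whereas the corresponding stems include truncated forms such as `comput` and `applic`. Unlike the stemmer, the lemmatizer also leaves capitalization untouched. `chatbots` stays as it is, which would be consistent with the word not having a WordNet entry.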


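## 🎉 Final Summary  
| Step | Output Example | Description |
|------|----------------|-------------|
| Tokenization | ['Natural', 'Language', ...] | Split the text into word and punctuation tokens |
| Stopword Removal | ['Natural', 'Language', ...] | Removed stopwords and non-alphanumeric tokens |
| Stemming | ['natur', 'languag', ...] | Reduced words to lowercase stems |
| POS Tagging | [('Natural', 'JJ'), ('Language', 'NNP'), ...] | Assigned a part-of-speech tag to each word |
| Lemmatization | [..., 'help', 'computer', ...] | Reduced words to their dictionary form |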

➡️ This completes Part 1 of the assignment.