#  Skill Share NLP Assignment
**Student Name:** Gorijala lalith sai charan  
**Date:** 2025-06-01

---

This notebook works through core Natural Language Processing (NLP) tasks: tokenization, stemming and lemmatization, stopword removal, POS tagging, Named Entity Recognition, and one-hot encoding, using the Python libraries NLTK, spaCy, and scikit-learn.


##  Setup and Import Required Libraries

In [39]:
#gorijala lalith sai charan 
# 1-06-2026 
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree
from sklearn.preprocessing import OneHotEncoder
import spacy

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\JC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\JC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\JC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\JC\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\JC\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\JC\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


##  Load Dataset

In [40]:
# Load the dataset (make sure the file path is correct)
df = pd.read_csv("NLP_Assignment_Sentences.csv")
sentences = df['Sentence'].dropna().tolist()
df.head()

Unnamed: 0,SentenceID,Sentence
0,1,I love programming in Python.
1,2,Natural Language Processing is fascinating.
2,3,Spacy and NLTK are popular NLP libraries.
3,4,Machine learning enables predictive analysis.
4,5,Data preprocessing is a crucial step in NLP.


##  Task 1: Tokenization
Tokenization is the process of breaking text into smaller units (tokens) such as words and punctuation marks. In this task, we use `nltk.word_tokenize()` to tokenize each sentence in the dataset.

In [41]:
from nltk.tokenize import word_tokenize

# Assuming sentences is a list of strings
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

# Display tokenized output
for i, sent in enumerate(tokenized_sentences):
    print(f"Sentence {i+1}: {sent}")


Sentence 1: ['I', 'love', 'programming', 'in', 'Python', '.']
Sentence 2: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']
Sentence 3: ['Spacy', 'and', 'NLTK', 'are', 'popular', 'NLP', 'libraries', '.']
Sentence 4: ['Machine', 'learning', 'enables', 'predictive', 'analysis', '.']
Sentence 5: ['Data', 'preprocessing', 'is', 'a', 'crucial', 'step', 'in', 'NLP', '.']


##  Task 1: Tokenization

**Objective:**  
Break each sentence into individual words (tokens).

**Why it matters:**  
Tokenization is typically the first step in an NLP pipeline. It lets us process text token by token, enabling downstream tasks such as frequency analysis and vectorization.

**Method:**  
We use `nltk.word_tokenize()` to split each sentence into a list of words.
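
To make the idea concrete without any external library, here is a minimal sketch of a regex-based tokenizer. The name `simple_tokenize` and its pattern are assumptions made for illustration; this is a simplified stand-in, not the rule set that `nltk.word_tokenize()` applies.

```python
import re

# Simplified pattern: a run of word characters, or a single character that is
# neither a word character nor whitespace (i.e. punctuation).
# Illustrative only; nltk.word_tokenize() uses a more elaborate rule set.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("I love programming in Python.")
print(tokens)
```

For simple sentences like those in this dataset the sketch yields the same tokens as the NLTK output above. It will differ on harder cases such as contractions, which the regex splits at the apostrophe.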


## Task 2: Text Preprocessing (Stemming and Lemmatization)
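In this step, we apply stemming and lemmatization:
- **Stemming**: Strips suffixes with heuristic rules to reach a root form, which need not be a real word (e.g., 'running' → 'run', 'natural' → 'natur'). `PorterStemmer` also lowercases its output.
- **Lemmatization**: Maps a word to its dictionary form (e.g., 'better' → 'good' when treated as an adjective). `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is passed, which is why words such as 'programming' come back unchanged in the output below.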

In [42]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Use list comprehensions instead of .apply()
stemmed_sentences = [[stemmer.stem(word) for word in word_tokenize(sent)] for sent in sentences]
lemmatized_sentences = [[lemmatizer.lemmatize(word) for word in word_tokenize(sent)] for sent in sentences]

# Compare first 2 examples
for i in range(2):
    print(f"\nSentence {i+1}: {sentences[i]}")
    print("Stemmed:", stemmed_sentences[i])
    print("Lemmatized:", lemmatized_sentences[i])



Sentence 1: I love programming in Python.
Stemmed: ['i', 'love', 'program', 'in', 'python', '.']
Lemmatized: ['I', 'love', 'programming', 'in', 'Python', '.']

Sentence 2: Natural Language Processing is fascinating.
Stemmed: ['natur', 'languag', 'process', 'is', 'fascin', '.']
Lemmatized: ['Natural', 'Language', 'Processing', 'is', 'fascinating', '.']


## Task 2: Text Preprocessing (Stemming and Lemmatization)

**Objective:**  
Convert words to their base/root form.

- **Stemming**: Uses rules to strip suffixes (e.g., 'running' → 'run').
- **Lemmatization**: Uses a vocabulary and morphological analysis to return the dictionary form (e.g., 'better' → 'good', given the adjective part of speech).

**Why it matters:**  
Reduces vocabulary size and helps models treat inflected forms of a word as the same token.
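
The difference between the two approaches can be sketched with toy versions of each. Everything here (`SUFFIXES`, `LEMMA_TABLE`, `toy_stem`, `toy_lemmatize`) is hypothetical and written for illustration: `PorterStemmer` applies a much larger multi-step rule set, and `WordNetLemmatizer` consults WordNet rather than a three-entry table.

```python
# Hypothetical toy rules for illustration only.
SUFFIXES = ("ing", "ed", "s")
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse"}


def toy_stem(word):
    """Rule-based: strip the first matching suffix, no dictionary involved."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo a doubled final consonant: "runn" -> "run"
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def toy_lemmatize(word):
    """Lookup-based: return the dictionary form if known, else the word."""
    return LEMMA_TABLE.get(word.lower(), word)


for word in ("running", "programming", "libraries", "better"):
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix rules turn 'libraries' into the non-word 'librarie' and leave 'better' untouched, while the lookup maps 'better' to 'good'. That is the trade-off in miniature: stemming is cheap but crude, lemmatization is more accurate but needs a vocabulary.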


## Task 3: Stopwords Removal
Stopwords are common words (like 'is', 'the', 'in') that usually do not add significant meaning. We will remove them using NLTK's stopword list.

In [43]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# tokenized_sentences is a list of lists (of tokens)
cleaned_sentences = [
    [word for word in sent if word.lower() not in stop_words]
    for sent in tokenized_sentences
]

# Display cleaned sentences
for i, sent in enumerate(cleaned_sentences):
    print(f"Sentence {i+1}:", sent)


Sentence 1: ['love', 'programming', 'Python', '.']
Sentence 2: ['Natural', 'Language', 'Processing', 'fascinating', '.']
Sentence 3: ['Spacy', 'NLTK', 'popular', 'NLP', 'libraries', '.']
Sentence 4: ['Machine', 'learning', 'enables', 'predictive', 'analysis', '.']
Sentence 5: ['Data', 'preprocessing', 'crucial', 'step', 'NLP', '.']


## Task 3: Stopwords Removal

**Objective:**  
Remove common words (like "is", "and", "the") that carry little meaning on their own.

**Why it matters:**  
Reduces noise and shrinks the vocabulary, which can help models focus on content words. Note that punctuation tokens such as '.' are not stopwords, so they remain in the output above.

**Tool used:**  
`nltk.corpus.stopwords`
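
Since punctuation survives stopword filtering, a common follow-up is to drop it in the same pass. The sketch below assumes a small hand-picked stopword set (`STOP_WORDS`) and a hypothetical helper `remove_stopwords`; NLTK's English list is considerably longer.

```python
import string

# Small hand-picked stopword set for illustration only.
STOP_WORDS = {"i", "in", "is", "a", "and", "are", "the"}


def remove_stopwords(tokens, drop_punctuation=True):
    """Keep tokens that are neither stopwords nor (optionally) punctuation."""
    kept = []
    for token in tokens:
        # Compare in lowercase so that "The" and "the" are treated alike
        if token.lower() in STOP_WORDS:
            continue
        # A token made up entirely of punctuation characters, e.g. "."
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept


tokens = ["Data", "preprocessing", "is", "a", "crucial", "step", "in", "NLP", "."]
print(remove_stopwords(tokens))
```

Passing `drop_punctuation=False` reproduces the behaviour of the notebook cell, where the trailing '.' is kept.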


## Task 4: Part-of-Speech (POS) Tagging
Each word is tagged with its part of speech using Penn Treebank tags, such as noun (NN), verb (VB), and adjective (JJ).

In [44]:
from nltk import pos_tag

# Assuming tokenized_sentences is a list of token lists
pos_tagged = [pos_tag(sent) for sent in tokenized_sentences]

# Display POS tagged results
for i, tags in enumerate(pos_tagged):
    print(f"\nSentence {i+1}:")
    print(tags)



Sentence 1:
[('I', 'PRP'), ('love', 'VBP'), ('programming', 'VBG'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]

Sentence 2:
[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('fascinating', 'VBG'), ('.', '.')]

Sentence 3:
[('Spacy', 'NN'), ('and', 'CC'), ('NLTK', 'NNP'), ('are', 'VBP'), ('popular', 'JJ'), ('NLP', 'NNP'), ('libraries', 'NNS'), ('.', '.')]

Sentence 4:
[('Machine', 'NN'), ('learning', 'VBG'), ('enables', 'NNS'), ('predictive', 'JJ'), ('analysis', 'NN'), ('.', '.')]

Sentence 5:
[('Data', 'NNP'), ('preprocessing', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('crucial', 'JJ'), ('step', 'NN'), ('in', 'IN'), ('NLP', 'NNP'), ('.', '.')]


## Task 4: Part-of-Speech (POS) Tagging

**Objective:**  
Label each word with its grammatical role, such as noun, verb, adjective, etc.

**Why it matters:**  
POS tagging helps in syntactic analysis, named entity recognition, and information extraction.

**Tool used:**  
`nltk.pos_tag()`
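
Once tokens are tagged, the `(word, tag)` pairs are easy to filter and count. The following is a minimal sketch that reuses the Sentence 5 output above as literal data; the helper name `words_with_tag_prefix` is an assumption made for illustration.

```python
from collections import Counter

# (word, tag) pairs copied from the Sentence 5 output above.
tagged = [
    ("Data", "NNP"), ("preprocessing", "NN"), ("is", "VBZ"), ("a", "DT"),
    ("crucial", "JJ"), ("step", "NN"), ("in", "IN"), ("NLP", "NNP"), (".", "."),
]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefix)]


# All noun tags (NN, NNS, NNP, NNPS) share the "NN" prefix
nouns = words_with_tag_prefix(tagged, "NN")
tag_counts = Counter(tag for _, tag in tagged)

print(nouns)
print(tag_counts.most_common(2))
```

Filtering by prefix works because the Penn Treebank tags for a word class share their leading letters, for example "VB" for all verb forms.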


##  Task 5: Named Entity Recognition (NER)
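Named entities are real-world objects such as people, organizations, and locations. The cell below extracts them with **spaCy**'s `en_core_web_sm` model; **NLTK**'s `ne_chunk()` is an alternative that is imported in the setup cell but not run here.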

In [45]:
import spacy
nlp = spacy.load("en_core_web_sm")

for i, sentence in enumerate(sentences):
    doc = nlp(sentence)
    print(f"\nSentence {i+1}: {sentence}")
    for ent in doc.ents:
        print(f" - {ent.text} ({ent.label_})")


Sentence 1: I love programming in Python.
 - Python (GPE)

Sentence 2: Natural Language Processing is fascinating.
 - Natural Language Processing (ORG)

Sentence 3: Spacy and NLTK are popular NLP libraries.
 - Spacy (PERSON)
 - NLTK (ORG)
 - NLP (ORG)

Sentence 4: Machine learning enables predictive analysis.

Sentence 5: Data preprocessing is a crucial step in NLP.
 - NLP (ORG)


##  Task 5: Named Entity Recognition (NER)

**Objective:**  
Identify named entities like people, places, dates, and organizations in text.

**Why it matters:**  
NER is useful in question answering, knowledge graphs, and summarization.

**Tools used:**  
- `spacy` model (`en_core_web_sm`)  
- `nltk.ne_chunk()` (imported in the setup cell as an alternative, but not run in this notebook)
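
The output above shows that the small model can mislabel entities (for example, 'Python' as GPE and 'Spacy' as PERSON), so it helps to review the results grouped by label. Below is a minimal sketch that uses the Sentence 3 output as literal data; `group_by_label` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output for Sentence 3 above.
entities = [("Spacy", "PERSON"), ("NLTK", "ORG"), ("NLP", "ORG")]


def group_by_label(pairs):
    """Collect entity texts under their label, preserving input order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


grouped = group_by_label(entities)
for label, texts in grouped.items():
    print(f"{label}: {texts}")
```

With a live spaCy `doc`, the same pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`, using the attributes already shown in the cell above.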


## Task 6: One-Hot Encoding
Convert words to binary vectors. Each unique word gets a vector whose length equals the vocabulary size, with a single 1 at that word's index and 0 everywhere else.

In [46]:
from sklearn.preprocessing import OneHotEncoder
from nltk.tokenize import word_tokenize

# Step 1: Get the first 3 sentences
first_3 = sentences[:3]

# Step 2: Tokenize and create vocabulary
tokens = [word.lower() for s in first_3 for word in word_tokenize(s)]
vocab = sorted(set(tokens))

# Step 3: Initialize encoder without 'sparse' or 'sparse_output' (the parameter
# was renamed across scikit-learn versions); densify later with .toarray()
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit([[word] for word in vocab])

# Step 4: Transform and display one-hot vectors
for i, sentence in enumerate(first_3):
    print(f"\nSentence {i+1}: {sentence}")
    for word in (w.lower() for w in word_tokenize(sentence)):  # tokenize first, then lowercase, as in Step 2
        if word in vocab:
            vector = encoder.transform([[word]]).toarray()  # Convert sparse matrix to dense
            print(f"{word}: {vector}")
        else:
            print(f"{word}: Unknown word")


Sentence 1: I love programming in Python.
i: [[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
love: [[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
programming: [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]
in: [[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
python: [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
.: [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Sentence 2: Natural Language Processing is fascinating.
natural: [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]
language: [[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
processing: [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
is: [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
fascinating: [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
.: [[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Sentence 3: Spacy and NLTK are popular NLP libraries.
spacy: [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
[remaining output truncated]

##  Task 6: One-Hot Encoding

**Objective:**  
Convert words into binary vectors where each unique word has its own vector.

**Why it matters:**  
Machine learning models require numerical input. One-hot encoding is a basic way to represent categorical data like words.

**Tool used:**  
`sklearn.preprocessing.OneHotEncoder`
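
To show what the encoder does under the hood, here is a minimal from-scratch sketch in plain Python. The helpers `build_vocab` and `one_hot` are assumptions made for illustration and are not scikit-learn's implementation. Unknown words map to an all-zero vector, which mirrors the `handle_unknown='ignore'` setting used above.

```python
def build_vocab(tokens):
    """Map each unique token to a column index, in sorted order."""
    return {token: index for index, token in enumerate(sorted(set(tokens)))}


def one_hot(token, vocab):
    """Return a list with a single 1 at the token's index (all 0 if unknown)."""
    vector = [0] * len(vocab)
    index = vocab.get(token)
    if index is not None:
        vector[index] = 1
    return vector


# Lowercased tokens of Sentence 1
tokens = ["i", "love", "programming", "in", "python", "."]
vocab = build_vocab(tokens)

print(vocab)
print(one_hot("love", vocab))     # a single 1 at the index of "love"
print(one_hot("unknown", vocab))  # all zeros
```

The vector length grows with the vocabulary (6 here, 18 in the notebook cell), which is why one-hot encoding is usually practical only for small vocabularies.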
