# Skill Share NLP Assignment
### Title: Foundations of NLP – From Tokenization to Encoding

**Name:** Gorijala Lalith Sai Charan

##  Objective
This assignment introduces essential NLP preprocessing techniques using Python libraries like NLTK and Scikit-learn. It includes hands-on tasks with explanations to build foundational skills for language-based AI systems.

##  Part A – Basic Concepts

###  1. Tokenization
**What is tokenization?**

Think of tokenization like breaking down a large piece of text, like an article or a paragraph, into its fundamental building blocks. Instead of looking at the whole block of text at once, we chop it up into smaller, manageable pieces.

These smaller pieces are called "tokens." Depending on what you're trying to do, these tokens could be:

- **Words:** This is the most common type of tokenization. You'd take a sentence and separate each word. For example, "The quick brown fox" becomes ["The", "quick", "brown", "fox"].
- **Sentences:** You could also break a paragraph into individual sentences.
- **Sub-word units:** In some advanced cases, tokens might be parts of words or even individual characters.

**Why do we do this?**

Imagine you're trying to understand what a text is about or analyze its meaning. Looking at it as one long string of characters is hard. By breaking it into tokens, we can:

- **Count words:** See how often certain words appear.
- **Analyze grammar:** Look at the relationships between words.
- **Prepare for analysis:** Most NLP algorithms need text to be in this tokenized format before they can process it.
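
The word-counting use can be sketched with the standard library alone. The regular expression and the sample sentence below are assumptions made for illustration; this is a simplified stand-in, not how NLTK's `word_tokenize` works internally.

```python
import re
from collections import Counter

def simple_word_tokens(text):
    """Lowercase the text and return word-like runs, keeping inner hyphens and apostrophes."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())

# Illustrative sample text (not taken from the assignment).
text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = simple_word_tokens(text)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(2))  # [('the', 3), ('dog', 2)]
```

Once the text is a list of tokens, counting is a one-liner; the same list can feed later steps such as stopword filtering.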

**Why is it important in NLP?**

It allows NLP algorithms to work at the level of meaningful units, enabling tasks like parsing, classification, and translation.

- **It creates the building blocks:** Before you can do anything complex with text, you need to define what the basic units are. Tokenization identifies these units (like words) so that all later steps can operate on them.
- **It enables other processes:** Once you have the individual words or sentences separated, you can then do things like count how often words appear, figure out the grammar, or identify names. All these tasks rely on having distinct tokens to work with.
- **It turns unstructured text into something usable:** Raw text is just a flow of characters. Tokenization gives it structure by breaking it down into a list of specific items, which is the format most NLP tools and models need.

In [2]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Skill Share is offering amazing NLP courses. Students love to learn with hands-on practice."
print("Sentence Tokenization:", sent_tokenize(text))
print("Word Tokenization:", word_tokenize(text))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...


Sentence Tokenization: ['Skill Share is offering amazing NLP courses.', 'Students love to learn with hands-on practice.']
Word Tokenization: ['Skill', 'Share', 'is', 'offering', 'amazing', 'NLP', 'courses', '.', 'Students', 'love', 'to', 'learn', 'with', 'hands-on', 'practice', '.']


[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


**Explanation:**
This code demonstrates sentence and word tokenization using NLTK. `sent_tokenize` splits the text into sentences, and `word_tokenize` breaks it into words and punctuation.

**Observation:**
The paragraph was successfully split into two sentences. Each sentence was further divided into tokens such as `"Skill"`, `"Share"`, `"is"`, and `"offering"`. The tokenizer handles punctuation correctly, making the text ready for further analysis.

Running this code will show how the provided text is split into sentences and then into individual words.

###  2. Stemming
Root vs. Stem: The important distinction here is that a "root" is a recognized linguistic unit, the core of a word that carries its primary meaning. A "stem," as produced by a stemming algorithm, is often a truncated version of a word that might not be a valid word in the dictionary. It's generated by applying a set of rules (heuristics) to chop off endings, which can be a less precise process than finding a true root.

Why Stemming Can Affect Meaning: Since stemming algorithms rely on rules to remove suffixes, they don't always consider the context or grammatical function of a word. This can lead to different words with distinct meanings being reduced to the same stem, or a word being stemmed incorrectly, thus altering or losing its original semantic meaning. For example, "universal" and "university" might both be stemmed to "univers," which could cause confusion if the distinction between the words is important for the NLP task.
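
The merging effect described above can be reproduced with a toy suffix stripper. The rule list and the `min_stem_len` threshold are invented for illustration; this is not the Porter algorithm, only a sketch of the same rule-based idea.

```python
# Toy suffix rules, checked in order. Illustrative only, NOT the Porter rule set.
TOY_SUFFIXES = ("ity", "ing", "al", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["universal", "university", "playing", "played", "plays"]:
    print(w, "->", toy_stem(w))
# universal -> univers
# university -> univers
# playing -> play
# played -> play
# plays -> play
```

Two words with different meanings end up with the same non-word stem, because the rules only look at the ending of each word and never at its meaning.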

In [3]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["playing", "played", "plays", "playful"]
stems = [stemmer.stem(word) for word in words]
print("Stemmed Words:", stems)

Stemmed Words: ['play', 'play', 'play', 'play']


**Explanation:**
This code uses the `PorterStemmer` to apply stemming, reducing each word to its stem (which, as noted above, is not necessarily a linguistic root).

**Observation:**
Words like `"playing"`, `"played"`, and `"plays"` were reduced to `"play"`. `"Playful"` also became `"play"`, which shows that stemming may remove meaningful suffixes, affecting word semantics.

Running this code shows how different forms of "play" are reduced to the common stem "play". "Playful" is reduced to "play" as well, so the adjective becomes indistinguishable from the verb forms, which demonstrates the heuristic nature of stemming.

### 3. Lemmatization
What is Lemmatization? Lemmatization goes beyond simply chopping off word endings. It uses a dictionary and analyzes the word's context (often by considering its part of speech) to find its dictionary base form, which is called the lemma. This means the output of lemmatization is typically a valid word.

When is Lemmatization More Appropriate? Lemmatization is better when you need the actual dictionary form of a word and when preserving the word's meaning is crucial. Because lemmatization considers the word's context and uses a lexicon, it's less likely to produce non-words or merge words with different meanings into the same form, which can happen with stemming. This makes it more suitable for tasks where semantic accuracy is important, such as in question answering systems or sentiment analysis.

In [5]:
import nltk
# Download the wordnet resource which is needed for lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["playing", "played", "plays", "playful"]
lemmas = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("Lemmatized Words:", lemmas)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Lemmatized Words: ['play', 'play', 'play', 'playful']


**Explanation:**
This code applies lemmatization using `WordNetLemmatizer`, converting words to their dictionary base form considering their POS.

**Observation:**
The words `"playing"`, `"played"`, and `"plays"` were correctly lemmatized to `"play"`. `"Playful"` remained unchanged, preserving its distinct meaning and demonstrating lemmatization’s semantic sensitivity.

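Notice that in this example, the `pos="v"` argument is used. This is important because the lemma of a word can depend on whether it's used as a noun, verb, adjective, etc. `WordNetLemmatizer.lemmatize` treats a word as a noun when no `pos` is given, so verb forms such as "playing" would typically come back unchanged without it. Taking the part of speech into account highlights lemmatization's linguistic awareness compared to stemming.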
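
The dependence of the lemma on the part of speech can be sketched as a dictionary lookup. `TOY_LEXICON` and `toy_lemmatize` are hypothetical names with hand-written entries; a real lemmatizer such as `WordNetLemmatizer` consults a far larger lexical resource.

```python
# Hypothetical mini-lexicon mapping (surface form, POS) to a lemma.
TOY_LEXICON = {
    ("playing", "v"): "play",
    ("played", "v"): "play",
    ("plays", "v"): "play",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}

def toy_lemmatize(word, pos):
    """Return the lexicon lemma for (word, pos); fall back to the word itself."""
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("meeting", "v"))  # meet
print(toy_lemmatize("meeting", "n"))  # meeting
print(toy_lemmatize("playful", "v"))  # playful (no entry, so it is left unchanged)
```

The same surface form maps to different lemmas depending on the tag, and a word with no matching entry is returned as-is rather than truncated.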

###  4. Stopwords Removal
What are stopwords? These are common words like "the," "is," "and," etc., that appear frequently in text but often don't carry much unique meaning on their own. They are often filtered out to reduce noise and focus on the more important words for analysis.

When should we keep or remove them?

- **Removal:** Stopwords are typically removed in tasks where the frequency of meaningful words is important, such as topic modeling (finding the main themes in a document) or text classification (categorizing documents). Removing stopwords helps algorithms focus on the content words that are more indicative of the topic or category.
- **Retention:** However, in tasks where the grammatical structure and flow of language are important, such as text summarization or machine translation, keeping stopwords is crucial. They provide the glue that holds sentences together and helps convey the overall meaning and coherence of the text.
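
One way removal can discard meaning is through negation. The sketch below uses a small hypothetical stopword set (`TOY_STOPWORDS`) that happens to contain "not"; the sentences are made up for illustration.

```python
# Hypothetical stopword set for illustration; real lists are much longer.
TOY_STOPWORDS = {"the", "is", "a", "this", "not"}

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens whose lowercase form is not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

positive = "This course is good".split()
negative = "This course is not good".split()

print(remove_stopwords(positive, TOY_STOPWORDS))  # ['course', 'good']
print(remove_stopwords(negative, TOY_STOPWORDS))  # ['course', 'good']
```

After filtering, the two sentences are indistinguishable even though they say opposite things, which is why it is worth checking what a stopword list contains before applying it to a meaning-sensitive task.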

In [7]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is an example showing off stop words filtration."
stop_words = set(stopwords.words('english'))

filtered = [word for word in word_tokenize(text) if word.lower() not in stop_words]
print("Filtered Text:", filtered)

[nltk_data] Downloading package stopwords to /root/nltk_data...


Filtered Text: ['example', 'showing', 'stop', 'words', 'filtration', '.']


[nltk_data]   Unzipping corpora/stopwords.zip.


**Explanation:**
This code removes English stopwords using NLTK, keeping the content words. Punctuation tokens such as `"."` are not in the stopword list, so they remain in the output.

**Observation:**
Stopwords like `"this"`, `"is"`, and `"an"` were successfully filtered out. Words such as `"example"`, `"stop"`, and `"filtration"` were retained, which improves relevance in tasks like classification or topic extraction.

This code snippet tokenizes the text and then filters out the words that are present in NLTK's list of English stopwords, resulting in a list of the remaining words.

##  Part B – Intermediate Concepts

###  5. Parts of Speech (POS) Tagging
What is POS tagging? POS tagging assigns grammatical categories to words, like noun, verb, adjective, adverb, etc. It tells you what role each word plays in a sentence.

Importance: Knowing the part of speech for each word provides valuable syntactic information. This information is crucial for:

- **Parsing:** Understanding the grammatical relationships between words in a sentence.
- **Understanding Context:** The part of speech can help clarify the meaning of ambiguous words (e.g., "run" as a verb versus "run" as a noun).
- **Downstream Tasks:** Many NLP tasks, such as named entity recognition, sentiment analysis, and machine translation, benefit from knowing the grammatical structure provided by POS tags. For example, in NER, knowing that a word is a proper noun is a strong indicator that it might be part of a named entity.

In [9]:
import nltk
# Download the averaged_perceptron_tagger resource which is needed for POS tagging
nltk.download('averaged_perceptron_tagger')
# Download the specific 'eng' sub-resource if the previous download didn't include it
nltk.download('averaged_perceptron_tagger_eng')
from nltk.tokenize import word_tokenize

text = "Skill Share empowers students with practical NLP skills."
tokens = word_tokenize(text)
tags = nltk.pos_tag(tokens)
print("POS Tags:", tags)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...


POS Tags: [('Skill', 'NNP'), ('Share', 'NNP'), ('empowers', 'VBZ'), ('students', 'NNS'), ('with', 'IN'), ('practical', 'JJ'), ('NLP', 'NNP'), ('skills', 'NNS'), ('.', '.')]


[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


**Explanation:**
The code uses `pos_tag` from NLTK to perform Part-of-Speech tagging, labeling each word's grammatical role.

**Observation:**
Words were correctly tagged, such as `"Skill"` as NNP and `"empowers"` as VBZ. This output provides essential syntactic structure that can support parsing and word sense disambiguation.

Running this code will show each token from the sentence paired with its assigned part of speech tag (e.g., ('Skill', 'NNP'), ('empowers', 'VBZ')). This output clearly illustrates how POS tagging provides structural information about the text.
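
The earlier point that proper-noun tags hint at named entities can be sketched on the tagged output shown above. The tag list is hard-coded so the snippet runs without NLTK, and `proper_noun_chunks` is a hypothetical helper, a rough heuristic rather than a real NER method.

```python
from itertools import groupby

# Tagged tokens copied from the POS output above.
tags = [('Skill', 'NNP'), ('Share', 'NNP'), ('empowers', 'VBZ'),
        ('students', 'NNS'), ('with', 'IN'), ('practical', 'JJ'),
        ('NLP', 'NNP'), ('skills', 'NNS'), ('.', '.')]

def proper_noun_chunks(tagged):
    """Join each run of consecutive proper-noun tokens (NNP/NNPS) into one candidate."""
    chunks = []
    for is_proper, group in groupby(tagged, key=lambda pair: pair[1].startswith("NNP")):
        if is_proper:
            chunks.append(" ".join(token for token, _ in group))
    return chunks

print(proper_noun_chunks(tags))  # ['Skill Share', 'NLP']
```

Grouping adjacent `NNP` tokens recovers "Skill Share" as a single candidate, which shows how POS tags can feed a later entity-extraction step.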

###  6. Named Entity Recognition (NER)
What is NER? NER identifies and classifies entities like names of people, places, and organizations in text. It's about recognizing and labeling these specific types of information within the text.

Applications: NER is a widely used technique with many practical applications, including:

- Extracting key information from documents like resumes, financial reports, and news articles.
- Improving search results by identifying named entities in queries.
- Powering question answering systems by recognizing entities in questions and finding relevant information in text.
- Helping in data anonymization by identifying and potentially masking personal information.

In [10]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Steve Jobs founded Apple in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Steve Jobs PERSON
Apple ORG
California GPE


**Explanation:**
This code uses spaCy to identify named entities in a sentence, assigning semantic labels like PERSON or ORG.

**Observation:**
`"Steve Jobs"` was identified as a PERSON, `"Apple"` as an ORG, and `"California"` as a GPE. This demonstrates NER’s ability to extract structured insights from unstructured text—useful in resumes, news, and finance.

This code processes the sentence and identifies "Steve Jobs" as a PERSON, "Apple" as an ORG (Organization), and "California" as a GPE (Geo-Political Entity), demonstrating how NER extracts and labels named entities.
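
The data anonymization application listed earlier can be sketched from these results. The entity pairs are hard-coded from the output above so no spaCy model is needed, and `mask_entities` is a hypothetical helper that uses plain string replacement; a real pipeline would more likely work from each entity's character offsets.

```python
sentence = "Steve Jobs founded Apple in California."

# (text, label) pairs copied from the NER output above.
entities = [("Steve Jobs", "PERSON"), ("Apple", "ORG"), ("California", "GPE")]

def mask_entities(text, entities, labels_to_mask):
    """Replace every entity whose label is in labels_to_mask with a [LABEL] placeholder."""
    masked = text
    for ent_text, label in entities:
        if label in labels_to_mask:
            masked = masked.replace(ent_text, f"[{label}]")
    return masked

print(mask_entities(sentence, entities, {"PERSON"}))
# [PERSON] founded Apple in California.
print(mask_entities(sentence, entities, {"PERSON", "GPE"}))
# [PERSON] founded Apple in [GPE].
```

Choosing which labels to mask keeps the sentence readable while hiding only the sensitive entity types.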

##  Part C – Text Encoding

###  7. One Hot Encoding
How OneHotEncoding works: One-hot encoding takes categorical data (like "Gender" with values "Male", "Female", "Other") and converts it into a numerical format that machine learning models can understand. It creates new binary columns, one for each unique category. For a given data point, the column corresponding to its category will have a value of 1, and all other category columns will have a value of 0. This creates a unique "binary vector" representation for each category.

Use Cases: One-hot encoding is commonly used for categorical features in various machine learning tasks. Encoding gender, geographical regions, product types, etc., are all typical applications where you need to represent distinct categories numerically without implying any order or magnitude between them.

Handling unknown labels: The `handle_unknown='ignore'` parameter is a crucial setting in scikit-learn's `OneHotEncoder`. It prevents errors when the encoder encounters a category in new data that it didn't see during its training (fit) phase. Instead of raising an error, it will output a vector of all zeros for that unknown category. This is very useful in real-world scenarios where your training data might not contain all possible categories that appear in your test or production data.
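
The encoding and the all-zeros behaviour for unseen categories can be sketched in plain Python. `fit_categories` and `one_hot` are hypothetical helpers that mimic the behaviour described above; they are not scikit-learn's implementation.

```python
def fit_categories(values):
    """Collect the distinct categories seen during fitting, in sorted order."""
    return sorted(set(values))

def one_hot(value, categories):
    """Return a binary vector with a 1 in the matching column, or all zeros if unseen."""
    return [1 if value == category else 0 for category in categories]

categories = fit_categories(["Male", "Female", "Female", "Male", "Other"])

print(categories)                     # ['Female', 'Male', 'Other']
print(one_hot("Male", categories))    # [0, 1, 0]
print(one_hot("Unknown", categories)) # [0, 0, 0]
```

The value "Unknown" was never seen during fitting, so it maps to a vector of zeros instead of raising an error, which mirrors the `handle_unknown='ignore'` behaviour.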

In [12]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Other']})
# scikit-learn 1.2 renamed the `sparse` parameter to `sparse_output`; older versions expect `sparse=False`
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

encoded = encoder.fit_transform(df[['Gender']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)

   Gender_Female  Gender_Male  Gender_Other
0            0.0          1.0           0.0
1            1.0          0.0           0.0
2            1.0          0.0           0.0
3            0.0          1.0           0.0
4            0.0          0.0           1.0


**Explanation:**
The code applies `OneHotEncoder` to a gender column in a DataFrame, transforming categories into binary vectors.

**Observation:**
Each gender label was encoded into its own column. With the columns ordered `Gender_Female`, `Gender_Male`, `Gender_Other`, `"Male"` becomes `[0, 1, 0]`. The `handle_unknown='ignore'` parameter ensures robust handling of unseen categories during inference.

This code snippet shows how the 'Gender' column with categorical values is transformed into separate binary columns ('Gender_Female', 'Gender_Male', 'Gender_Other') where each row has a '1' in the column corresponding to its original category. Setting `sparse_output=False` ensures the output is a dense NumPy array, which is often easier to work with than a sparse matrix, especially for smaller datasets.

##  Bonus: Real-World Reflection

**Task Selected:** Named Entity Recognition (NER)

Imagine you've just learned how to identify different types of things in a picture – like finding apples, bananas, and oranges. You're pretty good at it when the fruits are clearly shown on a table. That's like using NER on a simple, clear sentence, as the notebook showed.

Now, imagine you have to find those same fruits in a messy fruit bowl, with leaves, shadows, and other things in the way. Some might be partially hidden, some might be bruised, and some might look a bit different than what you're used to. That's like trying to use NER on a real-world resume.

Resumes are often messy:

- **Different styles:** Everyone formats their resume differently.
- **Abbreviations:** People use shorthand you might not recognize.
- **Typos:** Mistakes happen!

So, even though you know how to find names, degrees, and companies (like finding the fruits), doing it perfectly on every single resume is hard because they are so varied and unstructured.

To get really good at finding those specific things in resumes (like finding all the fruits no matter how they look in the bowl), you need to:

- **Practice with the real stuff:** Instead of just practicing with clear examples, you need to train your "fruit-finding" skills specifically on lots of different messy fruit bowls (lots of different resumes).
- **Learn new "fruit" types:** Maybe you need to find specific types of apples or oranges that you didn't learn about initially. Similarly, in resumes, you might need to identify specific degrees or certifications that aren't the standard "organization" or "person."

So, the key takeaway is: while the basic idea of finding entities is straightforward, applying it to messy, real-world text requires dedicated effort and training on the specific kind of text you're working with to get truly accurate results.
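
As a rough sketch of adding a new entity type, a hand-written pattern can pick out degree names that a general-purpose model has no label for. The pattern, the `DEGREE` idea, and the sample line are all assumptions for illustration; a production system would more likely train or fine-tune a model on annotated resumes.

```python
import re

# Hypothetical pattern covering a handful of degree abbreviations.
DEGREE_PATTERN = re.compile(
    r"\b(?:B\.?Tech|M\.?Tech|B\.?Sc|M\.?Sc|MBA|Ph\.?D)\b",
    re.IGNORECASE,
)

def find_degrees(text):
    """Return every substring that matches one of the known degree abbreviations."""
    return [match.group(0) for match in DEGREE_PATTERN.finditer(text)]

# Made-up resume line with inconsistent casing.
resume_line = "Completed B.Tech in Computer Science, then an mba from a hypothetical university."
print(find_degrees(resume_line))  # ['B.Tech', 'mba']
```

The pattern tolerates a missing dot and different casing, but it only knows the abbreviations written into it, which is exactly the coverage problem that training on real resumes is meant to address.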

## ✅ Conclusion

This notebook walked through getting text ready for a computer to understand.

It showed you how to:

- **Break down text:** Like taking a big story and separating it into individual words and sentences (Tokenization).
- **Clean up words:** Making words simpler so the computer sees "playing," "played," and "plays" as basically the same idea (Stemming and Lemmatization).
- **Filter out noise:** Getting rid of common words that don't add much unique meaning, like "the" or "is" (Stopwords Removal).
- **Identify word types:** Figuring out if a word is a person, place, or organization (NER).
- **Give words roles:** Knowing if a word is a noun, a verb, an adjective, etc. (POS Tagging).
- **Translate words into numbers:** Turning the text into a format that computers can actually use in calculations and models (Feature Encoding like One-Hot Encoding).

All these steps are important and necessary if you want to build any kind of smart system that can work with language, like chatbots, translation tools, or systems that understand what people are saying online. They are the basic tools you need in your toolbox to get started with language AI.