# Introduction to Natural Language Processing (NLP)

**Natural Language Processing (NLP)** is a field of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages.

Common applications of NLP include:
*   **Chatbots & Personal Assistants:** (e.g., Siri, Alexa) Understanding voice commands.
*   **Search Engines:** (e.g., Google) Understanding the intent behind queries.
*   **Spam Detection:** Filtering emails based on their content.
*   **Sentiment Analysis:** Monitoring social media to gauge public opinion.
*   **Machine Translation:** (e.g., Google Translate) Converting text from one language to another.
*   **Text Summarization:** Automatically shortening long documents while keeping key information.


## 1. The NLP Pipeline

Most NLP tasks follow a standard pipeline:

1.  **Data Collection:** Gathering text data (tweets, articles, reviews).
2.  **Preprocessing:** Cleaning the text to make it easier for machines to understand.
    *   *Tokenization:* Splitting text into smaller units (tokens), usually words or sentences.
    *   *Lowercasing:* Converting "Hello" to "hello".
    *   *Stopword Removal:* Removing common words like "the", "is", "at".
    *   *Stemming/Lemmatization:* Reducing words to their root form (e.g., "running" -> "run").
3.  **Text Representation:** Converting text into numbers (vectors) so algorithms can process them.
4.  **Model Training:** Training a machine learning model on the numbers.
5.  **Prediction:** Using the model on new text.


### Pipeline Example: Preprocessing

In [2]:
# A very simple example of preprocessing
corpus = [
    "Hello! How are you doing today?",
    "I am learning Natural Language Processing.",
    "NLP is fascinating!"
]

cleaned_corpus = []

print("Original Corpus:")
for s in corpus:
    print(s)

Original Corpus:
Hello! How are you doing today?
I am learning Natural Language Processing.
NLP is fascinating!


In [3]:
# Cleaning the text
for sentence in corpus:
    # 1. Lowercase
    sentence = sentence.lower()
    # 2. Remove punctuation (simple way)
    sentence = "".join([char for char in sentence if char.isalnum() or char.isspace()])
    cleaned_corpus.append(sentence)

print("\nCleaned Corpus:")
for s in cleaned_corpus:
    print(s)


Cleaned Corpus:
hello how are you doing today
i am learning natural language processing
nlp is fascinating
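
The pipeline also lists stopword removal, which the cells above do not cover. A minimal sketch of that step follows, assuming whitespace tokenization and a deliberately tiny, hypothetical stopword set (`STOPWORDS` is made up for illustration; real lists such as the one shipped with NLTK are much longer):

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"i", "am", "is", "are", "you", "how", "the", "at"}

# Same sentences as the cleaned corpus printed above.
cleaned_corpus = [
    "hello how are you doing today",
    "i am learning natural language processing",
    "nlp is fascinating",
]

def remove_stopwords(sentence):
    """Split on whitespace and drop any token found in STOPWORDS."""
    return [token for token in sentence.split() if token not in STOPWORDS]

filtered = [remove_stopwords(s) for s in cleaned_corpus]
for tokens in filtered:
    print(tokens)
```

Whether to remove stopwords depends on the task: words such as "not" carry meaning for sentiment analysis, so the list is usually tuned rather than applied blindly.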


## 2. Tokenization

**Tokenization** is the process of breaking down text into smaller units called **tokens**.
*   **Sentence Tokenization:** Splitting text into sentences.
*   **Word Tokenization:** Splitting sentences into words.


In [4]:
# Install NLTK if not already present
!python3 -m pip install nltk

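import nltk
import ssl

# Workaround for SSL certificate errors that some Python installations hit
# when downloading NLTK data. Caution: this disables HTTPS certificate
# verification for the rest of the session.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('punkt')      # Punkt sentence tokenizer models
nltk.download('punkt_tab')  # Punkt data in the format used by newer NLTK releases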

from nltk.tokenize import sent_tokenize, word_tokenize, WordPunctTokenizer


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mohitchaudhary/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/mohitchaudhary/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Sentence Tokenization

In [5]:
text = "Hello world. NLP is great! Let's learn it."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentences:")
print(sentences)

Sentences:
['Hello world.', 'NLP is great!', "Let's learn it."]


### Word Tokenization

In [6]:
# Word Tokenization
words = word_tokenize(sentences[0])
print("Words from first sentence:")
print(words)

Words from first sentence:
['Hello', 'world', '.']


### Specialized Tokenizer (Punctuation)

In [7]:
# WordPunctTokenizer splits on every run of punctuation, so "Let's" becomes 'Let', "'", 's'
punctuation_tokenizer = WordPunctTokenizer()
print("WordPunctTokenizer Output:")
print(punctuation_tokenizer.tokenize(text))

WordPunctTokenizer Output:
['Hello', 'world', '.', 'NLP', 'is', 'great', '!', 'Let', "'", 's', 'learn', 'it', '.']
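
The punctuation splitting seen above can be approximated with a single regular expression. The sketch below uses only the standard library; the pattern is my assumption about how this kind of tokenizer works (alternating runs of word characters and runs of punctuation), not a copy of NLTK's source, and `regex_tokenize` is a hypothetical helper name.

```python
import re

text = "Hello world. NLP is great! Let's learn it."

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def regex_tokenize(s):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(s)

tokens = regex_tokenize(text)
print(tokens)
```

On this sentence the result matches the `WordPunctTokenizer` output shown above, including the split of "Let's" into three tokens.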


## 3. Stemming vs. Lemmatization

Both techniques reduce words to their base form, but they work differently:

*   **Stemming:** Removes suffixes (e.g., "eating" -> "eat"). It's crude and fast but might result in non-words (e.g., "argued" -> "argu").
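*   **Lemmatization:** Uses a dictionary (here, WordNet) together with the word's part of speech to return a valid base form, the lemma (e.g., "ate" -> "eat" as a verb, or "better" -> "good" when treated as an adjective). It is slower than stemming, and the result depends on the part of speech you supply.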
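
To make the contrast concrete, here is a toy sketch of both ideas in plain Python. It is not the Porter algorithm or WordNet: `toy_stem` strips a few hard-coded suffixes, and `LEMMA_TABLE` is a tiny hypothetical lookup table standing in for a real dictionary.

```python
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary; a real lemmatizer consults a full lexicon.
LEMMA_TABLE = {"eating": "eat", "ate": "eat", "better": "good"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if unknown."""
    return LEMMA_TABLE.get(word, word)

for w in ["eating", "argued", "ate", "better"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}")
```

The rule-based stemmer turns "argued" into the non-word "argu" and cannot relate "ate" to "eat", while the lookup handles irregular forms but only for words it knows.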


In [8]:
nltk.download('wordnet')
from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer

words = ["eating", "writing", "programming", "ate", "better"]

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mohitchaudhary/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Stemming Example

In [9]:
# Stemming
porter = PorterStemmer()
lancaster = LancasterStemmer()

print("--- Porter Stemmer ---")
for w in words:
    print(f"{w} -> {porter.stem(w)}")

--- Porter Stemmer ---
eating -> eat
writing -> write
programming -> program
ate -> ate
better -> better


### Lemmatization Example

In [10]:
# Lemmatization
print("--- WordNet Lemmatizer ---")
lemmatizer = WordNetLemmatizer()
for w in words:
    print(f"{w} -> {lemmatizer.lemmatize(w, pos='v')}") # pos='v' treats words as verbs

--- WordNet Lemmatizer ---
eating -> eat
writing -> write
programming -> program
ate -> eat
better -> better


## 4. Text Representation

Machines can't understand text directly; they need numbers.

*   **One-Hot Encoding:** Represents each word as a vector as long as the vocabulary, with a 1 at that word's index and 0 everywhere else. (Problem: the vectors are huge and sparse, and they say nothing about how words relate to each other.)
*   **Bag of Words (BoW):** Counts word frequencies. Ignores order.
*   **TF-IDF (Term Frequency-Inverse Document Frequency):** Weighs down common words (like "the") and highlights unique, important words.


In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny corpus
corpus = ["I love NLP", "NLP is fun", "I love coding"]

### Manual One-Hot / Vocabulary Concept

In [12]:
# Manual One-Hot / BoW concept
vocab = sorted(list(set(" ".join(corpus).split())))
print("Vocabulary:", vocab)
# Note: with this 6-word vocabulary, 'I' sits at index 0, so its one-hot vector is [1, 0, 0, 0, 0, 0]

Vocabulary: ['I', 'NLP', 'coding', 'fun', 'is', 'love']
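
Building on that vocabulary, the sketch below shows one-hot and bag-of-words vectors computed by hand. It assumes whitespace tokenization and case-sensitive matching, as in the cell above; `one_hot` and `bag_of_words` are hypothetical helper names.

```python
corpus = ["I love NLP", "NLP is fun", "I love coding"]

# Same vocabulary as above: unique whitespace-separated tokens, sorted.
vocab = sorted(set(" ".join(corpus).split()))
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(sentence):
    """Vector of per-word counts for one sentence; word order is lost."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        vec[index[word]] += 1
    return vec

print(one_hot("I"))                # [1, 0, 0, 0, 0, 0]
print(bag_of_words("I love NLP"))  # [1, 1, 0, 0, 0, 1]
```

A bag-of-words vector is simply the sum of the one-hot vectors of the words in the sentence.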


### Bag of Words (BoW) with Scikit-Learn

In [13]:
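# Bag of Words with Scikit-Learn
# Note: by default CountVectorizer lowercases the text and ignores
# single-character tokens, which is why 'I' is missing from the features below.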
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)

print("Bag of Words Matrix:")
print(X_bow.toarray())
print("Features (Words):", vectorizer.get_feature_names_out())

Bag of Words Matrix:
[[0 0 0 1 1]
 [0 1 1 0 1]
 [1 0 0 1 0]]
Features (Words): ['coding' 'fun' 'is' 'love' 'nlp']


### TF-IDF

In [14]:
# TF-IDF
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print("TF-IDF Matrix:")
print(X_tfidf.toarray())

TF-IDF Matrix:
[[0.         0.         0.         0.70710678 0.70710678]
 [0.         0.62276601 0.62276601 0.         0.4736296 ]
 [0.79596054 0.         0.         0.60534851 0.        ]]
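
The numbers above can be reproduced by hand. The sketch below assumes scikit-learn's default settings as I understand them: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, raw counts as term frequency, and L2 normalization of each row. The `tokenize` helper is a hypothetical stand-in for the vectorizer's tokenizer (lowercase, drop single-character tokens).

```python
import math
from collections import Counter

corpus = ["I love NLP", "NLP is fun", "I love coding"]

def tokenize(doc):
    """Lowercase and keep tokens of 2+ characters (so 'I' is dropped)."""
    return [t for t in doc.lower().split() if len(t) >= 2]

docs = [tokenize(d) for d in corpus]
vocab = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {t: sum(t in doc for doc in docs) for t in vocab}
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Term count times IDF, then scale the row to unit L2 length."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in matrix:
    print([round(v, 4) for v in row])
```

In the second row, "fun" and "is" appear in only one document, so they receive a higher weight than "nlp", which appears in two.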


## 5. Parts of Speech (POS) Tagging

POS tagging assigns a grammatical label (Noun, Verb, Adjective, etc.) to each token. This helps in understanding the sentence structure and distinguishing meaning (e.g., "book" as a noun vs. "book" as a verb).


In [15]:
nltk.download('averaged_perceptron_tagger_eng')

sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)

pos_tags = nltk.pos_tag(tokens)
print("POS Tags:")
print(pos_tags)
# Penn Treebank tags: DT = determiner, JJ = adjective, NN = noun,
# VBZ = verb (3rd person singular present), IN = preposition

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/mohitchaudhary/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


POS Tags:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
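
Once tokens are tagged, a common next step is to filter them by tag. The sketch below works on the tagged output printed above (copied in as a literal so it runs without NLTK); `words_with_tag_prefix` is a hypothetical helper name.

```python
# Tagged tokens copied from the output above.
pos_tags = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), (".", "."),
]

def words_with_tag_prefix(tagged, prefix):
    """Return words whose tag starts with prefix (e.g. 'NN' also matches 'NNS')."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(pos_tags, "NN")
verbs = words_with_tag_prefix(pos_tags, "VB")
adjectives = words_with_tag_prefix(pos_tags, "JJ")

print("Nouns:", nouns)
print("Verbs:", verbs)
print("Adjectives:", adjectives)
```

Note that "brown" lands among the nouns because the tagger labeled it `NN` here, even though it acts as an adjective in this sentence. Statistical taggers make such mistakes, so downstream filters inherit them.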


## 6. Named Entity Recognition (NER)

**NER** identifies real-world objects in text and classifies them into categories like **PERSON**, **ORG** (organization), **GPE** (geopolitical entity, such as a country, city, or state), **DATE**, etc.

**Use Case:** Extracting company names from news articles or locations from tweets.


In [16]:
!python3 -m pip install spacy
!python3 -m spacy download en_core_web_sm


In [17]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "Apple looking at buying U.K. startup for $1 billion on Monday."
doc = nlp(text)

print("Entities found:")
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")


Entities found:
Apple -> ORG
U.K. -> GPE
$1 billion -> MONEY
Monday -> DATE
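
For the use case mentioned earlier (pulling company names out of text), the entities can be grouped by label. The sketch below uses the (text, label) pairs printed above, copied in as a literal so it runs without spaCy; with a real `doc` you would build the same list from `doc.ents`.

```python
from collections import defaultdict

# Entity pairs copied from the output above.
entities = [
    ("Apple", "ORG"),
    ("U.K.", "GPE"),
    ("$1 billion", "MONEY"),
    ("Monday", "DATE"),
]

# Group entity texts under their label.
by_label = defaultdict(list)
for ent_text, ent_label in entities:
    by_label[ent_label].append(ent_text)

organizations = by_label["ORG"]
print("Organizations:", organizations)
print("Places:", by_label["GPE"])
```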


## 7. Simple Sentiment Analysis

**Sentiment Analysis** determines the emotional tone behind a text (Positive, Negative, Neutral).


### Creating the Dataset

In [18]:
from sklearn.linear_model import LogisticRegression

# 1. Tiny Dataset
data = [
    ("I love this movie", 1), # 1 = Positive
    ("This is the best", 1),
    ("Amazing experience", 1),
    ("I hate this", 0), # 0 = Negative
    ("This is terrible", 0),
    ("Worst movie ever", 0)
]
texts, labels = zip(*data)

print(f"Data samples: {len(data)}")

Data samples: 6


### Vectorization & Training

In [19]:
# 2. Text Representation
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# 3. Train Model
model = LogisticRegression()
model.fit(X, labels)
print("Model trained successfully.")

Model trained successfully.


### Prediction

In [20]:
# 4. Predict on new sentences
new_sentences = ["I love NLP", "This is bad"]
X_new = vectorizer.transform(new_sentences)
predictions = model.predict(X_new)

print("Predictions:")
for text, pred in zip(new_sentences, predictions):
    label = "Positive" if pred == 1 else "Negative"
    print(f"'{text}' -> {label}")

Predictions:
'I love NLP' -> Positive
'This is bad' -> Negative
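
Under the hood, logistic regression turns the feature vector into a label by taking a weighted sum and passing it through a sigmoid. The sketch below illustrates that step only. The weights and feature values are hypothetical, chosen for illustration, and are NOT the ones learned by the model above.

```python
import math

# Hypothetical per-word weights (positive = pushes toward "Positive").
weights = {"love": 1.2, "best": 0.9, "hate": -1.3, "terrible": -1.1}
bias = 0.0

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features):
    """Probability of the positive class for a {word: feature value} dict."""
    z = bias + sum(weights.get(word, 0.0) * value for word, value in features.items())
    return sigmoid(z)

# Hypothetical TF-IDF-like feature values for "I love NLP".
p = predict_proba({"love": 0.7, "nlp": 0.7})
label = "Positive" if p >= 0.5 else "Negative"
print(f"P(positive) = {p:.3f} -> {label}")
```

Words the model has no weight for contribute nothing to the sum, so a sentence made only of unseen words gets a probability determined by the bias alone.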


## Summary

In this notebook, we learned:
*   **Pipeline:** Preprocessing -> Representation -> Modeling.
*   **Tokenization:** Breaking text into words/sentences.
*   **Stemming/Lemmatization:** Normalizing words.
*   **Representation:** BoW vs TF-IDF.
*   **POS & NER:** Understanding grammar and entities.
*   **Applications:** Sentiment Analysis, Translation, Summarization.

**Further Study:**
*   **Word Embeddings:** (Word2Vec, GloVe) capturing meaning better than TF-IDF.
*   **Deep Learning:** RNNs, LSTMs for sequence modeling.
*   **Transformers:** (BERT, GPT) The state-of-the-art models powering modern NLP.
