# Key Areas in NLP
**Text Preprocessing** – cleaning raw text (removing stop words, stemming, lemmatization, tokenization).

**Text Representation** – converting words into numbers (Bag of Words, TF-IDF, Word2Vec, embeddings).

**Text Classification** – spam detection, sentiment analysis, topic labeling.

**Named Entity Recognition (NER)** – identifying names, locations, organizations in text.

**Machine Translation** – Google Translate, DeepL.

**Text Summarization** – generating summaries from large documents.

**Question Answering / Chatbots** – answering user questions in natural language (e.g., virtual assistants).

**Speech Processing** – speech-to-text, voice assistants (Alexa, Siri).

# Text Preprocessing
- Raw text contains noise (punctuation, stop words, inconsistent casing, numbers).
- Preprocessing often improves accuracy in NLP tasks (classification, sentiment analysis, etc.).

**Steps in Preprocessing**

  1. Tokenization

  2. Lowercasing

  3. Stop word removal

  4. Removing punctuation/numbers

  5. Stemming and lemmatization
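
The five steps can be chained into one small function. The sketch below uses only the standard library so that each step stays visible. The tiny `STOP_WORDS` set and the `naive_stem` suffix rule are illustrative assumptions, not the NLTK stop word list or the Porter stemmer used later in these notes.

```python
import re

# Illustrative stop-word set (an assumption; real lists are much longer)
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and", "to"}

def naive_stem(word):
    """Very rough stemmer: strip a few common suffixes (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.split()                                 # 1. tokenization (whitespace)
    tokens = [t.lower() for t in tokens]                  # 2. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. stop word removal
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]   # 4. remove punctuation/numbers
    tokens = [t for t in tokens if t]                     #    drop tokens that became empty
    return [naive_stem(t) for t in tokens]                # 5. stemming (naive)

print(preprocess("The 3 cats are playing in the garden!"))
```

The order follows the list above. In practice, stripping punctuation before the stop word filter also catches tokens such as "the," that still carry a trailing comma.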

# Overall Benefits of Preprocessing
**Removes noise** (unnecessary words, punctuation).

**Reduces complexity** (smaller vocabulary → faster & more efficient models).

**Improves accuracy** (focuses model on meaningful patterns).

**Standardizes text** (so different forms of the same word are treated equally).


**Tokenization using NLTK and spaCy**: splitting text into smaller units (sentences or words).

**NLTK** (Natural Language Toolkit): simple, rule-based tokenizer.

**spaCy**: more advanced; handles punctuation, named entities, and linguistic features better.

In [5]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

text ="Natural Language Processing is amazing. It helps computers understand human language!"

#sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization: ",sentences)

#Word Tokenization
words = word_tokenize(text)
print("Word Tokenization: ",words)





Sentence Tokenization:  ['Natural Language Processing is amazing.', 'It helps computers understand human language!']
Word Tokenization:  ['Natural', 'Language', 'Processing', 'is', 'amazing', '.', 'It', 'helps', 'computers', 'understand', 'human', 'language', '!']


In [3]:
import spacy

#load english tokenizer
nlp = spacy.load("en_core_web_sm")

text ="Natural Language Processing is amazing. It helps computers understand human language!"

doc = nlp(text)

#sentence Tokenization
sentences = [sent.text for sent in doc.sents]
print("Sentence Tokenization: ",sentences)

#Word Tokenization
words = [token.text for token in doc]
print("Word Tokenization: ",words)

Sentence Tokenization:  ['Natural Language Processing is amazing.', 'It helps computers understand human language!']
Word Tokenization:  ['Natural', 'Language', 'Processing', 'is', 'amazing', '.', 'It', 'helps', 'computers', 'understand', 'human', 'language', '!']


In [6]:
# Using NLTK, find the unique words in a sentence and the repeated words with their number of occurrences
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

nltk.download('punkt')

text = "Natural Language Processing is amazing. Natural language helps computers understand language."

# Tokenization
words = word_tokenize(text.lower())

# Unique words
unique_words = set(words)
print("Unique words:", unique_words)

# Repeated words with their frequency (number of occurrences)
word_counts = Counter(words)
duplicates = {word: count for word, count in word_counts.items() if count > 1}
print("Repeated words with counts:", duplicates)


Unique words: {'amazing', 'language', 'helps', 'understand', 'processing', '.', 'is', 'natural', 'computers'}
Repeated words with counts: {'natural': 2, 'language': 3, '.': 2}




In [8]:
# Another way of counting repeated words, using spaCy
import spacy
from collections import Counter

# Load English NLP model
nlp = spacy.load("en_core_web_sm")

# Sample paragraph
text = """Artificial Intelligence is transforming the world. 
AI helps in healthcare, finance, education, and more. 
The future with AI looks exciting!"""

# Process the text
doc = nlp(text)

# Sentence Tokenization
sentences = [sent.text.strip() for sent in doc.sents]
print("Sentences:", sentences)

# Word Tokenization (ignoring punctuation & spaces)
words = [token.text.lower() for token in doc if not token.is_punct and not token.is_space]
print("Words:", words)

# Count sentences & words
print("\nTotal Sentences:", len(sentences))
print("Total Words:", len(words))

# Unique words with frequency
word_freq = Counter(words)
print("\nUnique Words and Frequency:")
for word, freq in word_freq.items():
    print(f"{word}: {freq}")

Sentences: ['Artificial Intelligence is transforming the world.', 'AI helps in healthcare, finance, education, and more.', 'The future with AI looks exciting!']
Words: ['artificial', 'intelligence', 'is', 'transforming', 'the', 'world', 'ai', 'helps', 'in', 'healthcare', 'finance', 'education', 'and', 'more', 'the', 'future', 'with', 'ai', 'looks', 'exciting']

Total Sentences: 3
Total Words: 20

Unique Words and Frequency:
artificial: 1
intelligence: 1
is: 1
transforming: 1
the: 2
world: 1
ai: 2
helps: 1
in: 1
healthcare: 1
finance: 1
education: 1
and: 1
more: 1
future: 1
with: 1
looks: 1
exciting: 1


# Stop words/Cut words
"Stop words" are common words in a language (like the, is, in, at, of, and) that usually don’t carry important meaning in text analysis.

In Natural Language Processing (NLP), stop words are often removed before processing text because:

- They occur very frequently.
- They add little value for tasks like text classification, search engines, or keyword extraction.

**Examples of English Stop Words**

**Articles**: a, an, the

**Pronouns**: I, you, he, she, it, we, they

**Prepositions**: on, in, at, under, over

**Conjunctions**: and, or, but, because

**Auxiliary Verbs**: is, was, were, has, have, do

In [12]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

words = word_tokenize("This is an example sentence showing off the stop words filtration")
filtered_words = [w for w in words if w.lower() not in stop_words]

# Count how many of the tokens are stop words
stop_words_count = sum(1 for w in words if w.lower() in stop_words)
print("Total words: ",len(words))
print("stop words count: ",stop_words_count)

print("Original: ",words)
print("Without stop words: ",filtered_words)



Total words:  11
stop words count:  5
Original:  ['This', 'is', 'an', 'example', 'sentence', 'showing', 'off', 'the', 'stop', 'words', 'filtration']
Without stop words:  ['example', 'sentence', 'showing', 'stop', 'words', 'filtration']




# Removing punctuations/numbers

In [13]:
# Regular expressions match text against specific patterns.
# e.g. keep only letters and whitespace by removing digits and symbols with re.sub()
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords")  # download before loading the stop word list
stop_words = set(stopwords.words("english"))

text = "Hello!! NLP123 is great :)"
cleaned = re.sub(r'[^a-zA-Z\s]', '', text)  # remove everything except letters and whitespace
print("cleaned: ",cleaned)
words = word_tokenize(cleaned)
# List comprehension; named `filtered` so it does not shadow the imported `stopwords` module
filtered = [w for w in words if w.lower() not in stop_words]
print("After stopwords: ",filtered)

cleaned:  Hello NLP is great 
After stopwords:  ['Hello', 'NLP', 'great']




# Stemming and Lemmatization

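**Stemming**: Cuts words down to a base/root form by stripping suffixes (fast but rough; the result may not be a real word).

- "studies" -> "studi", "playing" -> "play".

**Lemmatization**: Reduces words to their dictionary form (lemma) using vocabulary and part of speech (slower but more accurate).

- "studies" -> "study", "playing" -> "play" (when tagged as a verb).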

In [16]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download("wordnet")
nltk.download("omw-1.4")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies","studying","better","running"]

print("Stemming: ")
print([stemmer.stem(w) for w in words])

# Without a pos argument, lemmatize() treats every word as a noun
print("Lemmatization: ")
print([lemmatizer.lemmatize(w) for w in words])



Stemming: 
['studi', 'studi', 'better', 'run']
Lemmatization: 
['study', 'studying', 'better', 'running']


In [18]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Downloads
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger_eng")

lemmatizer = WordNetLemmatizer()

# Convert POS tags to WordNet format
def get_wordnet_pos(tag):
    if tag.startswith("J"):
        return wordnet.ADJ
    elif tag.startswith("V"):
        return wordnet.VERB
    elif tag.startswith("N"):
        return wordnet.NOUN
    elif tag.startswith("R"):
        return wordnet.ADV
    else:
        return wordnet.NOUN

words = ["studies", "studying", "better", "running"]

# POS tagging
pos_tags = nltk.pos_tag(words, lang="eng")

# Lemmatization with POS
lemmatized = [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in pos_tags]

print("POS Tags:", pos_tags)
print("Lemmatized:", lemmatized)




POS Tags: [('studies', 'NNS'), ('studying', 'VBG'), ('better', 'RBR'), ('running', 'VBG')]
Lemmatized: ['study', 'study', 'well', 'run']


# Text Representation

**Why Text Representation?**

  - Models operate on numbers, not raw text, so words must be converted into numeric vectors.

  - Representation helps models learn patterns like similarity, sentiment, topics, etc.

Example:

Raw text → "I love NLP"

After BoW → [1,1,1,0,...]

After embeddings → [0.21, -0.34, 0.78, ...]
  

**Bag of Words (BoW)**

  - Represents text as a bag (collection) of words, ignoring grammar and word order.
  - Each sentence becomes a vector that counts how many times each vocabulary word appears.

Example sentences:

  1. "I like NLP"
  2. "I like AI"

Vocabulary = [I, like, NLP, AI]

Vectors:

  - s1 -> [1, 1, 1, 0]
  - s2 -> [1, 1, 0, 1]
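
The vectors above can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization and a vocabulary ordered by first appearance; the helper names `build_vocabulary` and `bag_of_words` are made up for illustration.

```python
def build_vocabulary(sentences):
    """Collect unique words in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.split()
    return [words.count(term) for term in vocabulary]

sentences = ["I like NLP", "I like AI"]
vocabulary = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocabulary) for s in sentences]

print("Vocabulary:", vocabulary)
for s, v in zip(sentences, vectors):
    print(s, "->", v)
```

scikit-learn's `CountVectorizer` does the same job at scale, but with its default settings it lowercases the text, drops single-character tokens such as "I", and sorts the vocabulary alphabetically, so its output for these two sentences would look different.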




# NLP Mini Project — Spam vs Ham Classification

**Overview:** Build a simple end-to-end NLP pipeline:
1. Load dataset (or use sample data),
2. Clean text,
3. Convert text to Bag of Words count features (CountVectorizer),
4. Train a classifier (Multinomial Naive Bayes),
5. Evaluate and predict on new messages.

**Goal:** Have a complete, resume-friendly project that can be pushed to GitHub.


# Step 1: Import Libraries

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# Step 2: Load Dataset

In [23]:
# Load SMS Spam Collection dataset
df = pd.read_csv("smsspamcollection", sep="\t", names=["label", "message"])

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (5572, 2)


|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


# Step 3: Preprocessing

In [24]:
# Encode labels: ham -> 0, spam -> 1
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

# Basic text cleaning (convert to lowercase)
df['clean_message'] = df['message'].str.lower()

# Check class distribution
print(df['label_num'].value_counts())


label_num
0    4825
1     747
Name: count, dtype: int64


# Step 4: Train-Test Split

In [25]:
X = df['clean_message']
y = df['label_num']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train size:", X_train.shape[0], " Test size:", X_test.shape[0])
print("Train class distribution:\n", y_train.value_counts(normalize=True))
print("Test class distribution:\n", y_test.value_counts(normalize=True))


Train size: 4457  Test size: 1115
Train class distribution:
 label_num
0    0.865829
1    0.134171
Name: proportion, dtype: float64
Test class distribution:
 label_num
0    0.866368
1    0.133632
Name: proportion, dtype: float64


# Step 5: Text Vectorization

In [26]:
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print("Vocabulary size:", len(vectorizer.vocabulary_))


Vocabulary size: 7668
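
Step 5 calls `fit_transform` on the training messages but only `transform` on the test messages. The sketch below shows the idea on toy data with plain Python: the vocabulary is learned from the training set alone, and any word that appears only in the test set is skipped. The helper names and the toy messages are assumptions for illustration; this is not how `CountVectorizer` is implemented internally.

```python
def fit_vocabulary(docs):
    """Map each word seen in the training documents to a column index."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def transform(docs, vocab):
    """Turn documents into count vectors; words outside the vocabulary are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.lower().split():
            if word in vocab:
                row[vocab[word]] += 1
        rows.append(row)
    return rows

train_docs = ["free prize now", "see you now"]
test_docs = ["free lunch now"]          # "lunch" never appears in the training data

vocab = fit_vocabulary(train_docs)      # learned from the training data only
print(vocab)
print(transform(test_docs, vocab))
```

Fitting on the training set only keeps information from the test set out of the features, which keeps the evaluation honest.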


# Step 6: Train Model (Naive Bayes)

In [27]:
model = MultinomialNB()
model.fit(X_train_vec, y_train)

print("Model trained successfully!")


Model trained successfully!
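
To see what a multinomial Naive Bayes classifier computes, here is a from-scratch sketch on four invented messages. It assumes whitespace tokenization and add-one (Laplace) smoothing with `alpha=1.0`; the functions `fit` and `predict` are illustrative and are not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy training data (invented for illustration): 1 = spam, 0 = ham
train = [
    ("win free prize now", 1),
    ("free entry win cash", 1),
    ("are we meeting today", 0),
    ("see you at lunch today", 0),
]

def fit(data, alpha=1.0):
    """Estimate log priors and smoothed per-class word log likelihoods."""
    class_docs = Counter(label for _, label in data)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in data:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(data)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood, vocab

def predict(text, log_prior, log_likelihood, vocab):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.split():
            if w in vocab:              # words never seen in training are ignored
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

log_prior, log_likelihood, vocab = fit(train)
for msg in ["win a free prize", "are we meeting at lunch"]:
    label = predict(msg, log_prior, log_likelihood, vocab)
    print(msg, "->", "Spam" if label == 1 else "Ham")
```

Working with log probabilities turns the product of many small word probabilities into a sum, which avoids numerical underflow on longer messages.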


# Step 7: Predictions

In [28]:
y_pred = model.predict(X_test_vec)


# Step 8: Evaluation

In [29]:
print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("\n📊 Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\n📄 Classification Report:\n", classification_report(y_test, y_pred))


✅ Accuracy: 0.9874439461883409

📊 Confusion Matrix:
 [[964   2]
 [ 12 137]]

📄 Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      0.99       966
           1       0.99      0.92      0.95       149

    accuracy                           0.99      1115
   macro avg       0.99      0.96      0.97      1115
weighted avg       0.99      0.99      0.99      1115
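
The spam row of the classification report can be derived by hand from the confusion matrix. The sketch below plugs in the four counts printed above, assuming scikit-learn's layout of rows as actual classes and columns as predicted classes.

```python
# Confusion matrix reported above: rows = actual class, columns = predicted class
tn, fp = 964, 2     # actual ham:  predicted ham, predicted spam
fn, tp = 12, 137    # actual spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)      # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)         # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
```

A recall of 0.92 means 12 of the 149 spam messages in the test set were missed. Because about 87% of the messages are ham, a model that always predicts ham would already reach roughly 0.87 accuracy, so precision and recall for the spam class are the more informative numbers here.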



# Step 9: Test on New Messages

In [30]:
sample_msgs = [
    "Congratulations! You won a free ticket to Bahamas. Claim now.",
    "Hey, are we still meeting today?",
    "Win a free iPhone now! Click here.",
    "Can we talk later today?"
]

sample_vec = vectorizer.transform(sample_msgs)
predictions = model.predict(sample_vec)

for msg, pred in zip(sample_msgs, predictions):
    print(f"Message: {msg} -> {'Spam' if pred==1 else 'Ham'}")


Message: Congratulations! You won a free ticket to Bahamas. Claim now. -> Spam
Message: Hey, are we still meeting today? -> Ham
Message: Win a free iPhone now! Click here. -> Spam
Message: Can we talk later today? -> Ham


# Step 10: Save Model & Vectorizer (Optional)

In [31]:
import joblib

joblib.dump(model, "spam_classifier_model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")

print("Model and vectorizer saved!")


Model and vectorizer saved!
