<a href="https://colab.research.google.com/github/pankajit/DS-AI-ML/blob/master/NLP_with_Python_Step_by_Step.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP with Python — Step by Step (Hands‑on)

Welcome! This notebook teaches core Natural Language Processing (NLP) workflows **with practical Python code**.  
We start simple (tokenization, n‑grams, TF‑IDF) and build up to **text classification**, **topic modeling**, **similarity search**, and (optional) **NER**.

> Everything uses tiny in-notebook datasets so you can run it anywhere. Optional cells let you install libraries or try bigger models (spaCy / Transformers).

## What you'll learn
1. Text cleaning & tokenization (regex, stopwords)
2. Features: **Bag of Words** & **TF‑IDF**
3. **n‑grams** and why they help
4. Build a **text classifier** (Logistic Regression) with scikit‑learn
5. Evaluate: accuracy, confusion matrix, top features
6. **Topic modeling** (LDA) for unsupervised themes
7. **Cosine similarity** for search-like matching
8. *(Optional)* **spaCy NER** and *(Optional)* **Transformers sentiment**

## Quick Setup (optional)
Run this block if you're in a fresh environment (e.g., Colab) to install dependencies.

In [1]:
# OPTIONAL: install common NLP libs
# The '!' lines are IPython/Colab shell escapes; outside a notebook, run those commands in a terminal.
# You can skip this if you already have these installed.
try:
    import sklearn, nltk, gensim, spacy, matplotlib
except ImportError:
    print("Installing packages...")
    # Comment out any you don't want
    !pip -q install scikit-learn nltk gensim spacy matplotlib seaborn
    # Light model for spaCy NER (optional)
    !python -m spacy download en_core_web_sm -q

import nltk
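# Download lightweight NLTK resources (safe to re-run)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # word_tokenize looks for this resource in newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)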

print("Setup complete.")


## 1) Tiny Labeled Dataset (Sentiment)
We'll create a small toy dataset with **positive** and **negative** sentiments.

In [3]:
toy_data = [
    ("I love this phone, the camera is amazing and battery lasts all day", "pos"),
    ("Absolutely fantastic build quality and performance", "pos"),
    ("Best purchase I've made this year, super happy!", "pos"),
    ("The service was quick and friendly, highly recommend", "pos"),
    ("Great value for money, works like a charm", "pos"),
    ("Terrible experience, the product broke in two days", "neg"),
    ("Worst customer support, very rude and unhelpful", "neg"),
    ("I hate the design and the screen is awful", "neg"),
    ("Total waste of money, not worth it", "neg"),
    ("It arrived damaged and the return process is painful", "neg"),
]

texts = [t for t, _ in toy_data]
labels = [y for _, y in toy_data]
len(texts), texts[:2], labels[:2]

(10,
 ['I love this phone, the camera is amazing and battery lasts all day',
  'Absolutely fantastic build quality and performance'],
 ['pos', 'pos'])

## 2) Basic Preprocessing
We will:
- lowercase text
- remove punctuation
- tokenize
- remove stopwords
- *(optional)* lemmatize

In [9]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_tokenize(text, do_lemma=False):
    # lowercase
    text = text.lower()
    # remove punctuation (keep letters + spaces)
    text = re.sub(r"[^a-z\s]", " ", text)
    # tokenize
    tokens = word_tokenize(text)
    # remove stopwords and short tokens
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
    if do_lemma:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

# Quick check (must sit outside the function, otherwise it is unreachable after `return`)
print(clean_tokenize(texts[0], do_lemma=True))
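
To make the cleaning steps concrete without NLTK, here is a minimal standard-library sketch of the same idea. The name `simple_clean_tokenize` and the tiny `MINI_STOPWORDS` set are assumptions for illustration only; NLTK's English stopword list is much longer, and `word_tokenize` is smarter than a whitespace split.

```python
import re

# Hypothetical mini stopword list for illustration; NLTK's English list is much longer.
MINI_STOPWORDS = {"i", "this", "the", "is", "and", "all", "a", "of", "it"}

def simple_clean_tokenize(text, stopwords=MINI_STOPWORDS, min_len=3):
    """Lowercase, strip non-letters, split on whitespace, drop stopwords and short tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep letters and whitespace only
    tokens = text.split()                   # whitespace split stands in for word_tokenize
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

example = "I love this phone, the camera is amazing and battery lasts all day"
tokens = simple_clean_tokenize(example)
print(tokens)
```

With this stopword set, the sentence reduces to its content words (`love`, `phone`, `camera`, and so on), which is the same kind of output the vectorizers below consume.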

## 3) Bag of Words (BoW) & TF‑IDF
We'll vectorize texts using **CountVectorizer** (BoW) and **TfidfVectorizer**.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

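# Pass our cleaning function as the analyzer so both vectorizers reuse the same preprocessing
# (a callable analyzer replaces the built-in preprocessing, tokenization, and n-gram steps)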
cv = CountVectorizer(analyzer=clean_tokenize)
X_bow = cv.fit_transform(texts)
print("BoW shape:", X_bow.shape)
print("Sample features:", list(cv.vocabulary_.keys())[:20])

tfidf = TfidfVectorizer(analyzer=clean_tokenize)
X_tfidf = tfidf.fit_transform(texts)
print("TF-IDF shape:", X_tfidf.shape)
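
To see what the TF-IDF weights mean, the sketch below computes them by hand with the standard library. It assumes scikit-learn's documented defaults (smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization); the three token lists are hypothetical and are not taken from `texts`.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents
docs = [
    ["great", "camera", "great", "battery"],
    ["terrible", "battery"],
    ["great", "value"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF, as documented for smooth_idf=True
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(doc):
    counts = Counter(doc)                                  # bag-of-words counts
    weights = {t: c * idf(t) for t, c in counts.items()}   # tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}       # L2-normalize

vec0 = tfidf_vector(docs[0])
for term, weight in sorted(vec0.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that appear in fewer documents get a larger IDF, while repeated terms within a document get a larger count; the final weight combines both.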

> If `word_tokenize` raises `LookupError: Resource punkt_tab not found`, run `nltk.download('punkt_tab')` and re-run the cell. Newer NLTK releases load the tokenizer from `tokenizers/punkt_tab/english/`.


## 4) n‑grams
Let’s add bigrams (adjacent word pairs) alongside unigrams. Single words lose word order; bigrams keep short phrases such as "build quality" together.

In [None]:
# A callable `analyzer` bypasses n-gram extraction, so pass clean_tokenize as the `tokenizer` instead
tfidf_ngrams = TfidfVectorizer(tokenizer=clean_tokenize, token_pattern=None, ngram_range=(1, 2), min_df=1)
X_tfidf_ngrams = tfidf_ngrams.fit_transform(texts)
print("TF-IDF with unigrams+bigrams:", X_tfidf_ngrams.shape)
# Peek at the first 30 features (get_feature_names_out returns them in alphabetical order)
features = tfidf_ngrams.get_feature_names_out()
print("Example features:", features[:30])
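
As a rough illustration of what an n-gram range of (1, 2) produces, here is a standard-library sketch. The helper `ngrams` and the token list are hypothetical; the vectorizer builds its n-grams internally.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (joined with spaces) for n in [n_min, n_max]."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

tokens = ["works", "like", "charm"]   # hypothetical cleaned tokens
grams = ngrams(tokens)
print(grams)
```

Each bigram becomes its own feature column, so the vocabulary grows quickly as `n_max` increases.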

## 5) Text Classification (Logistic Regression)
We'll train a simple classifier on the toy dataset using **TF‑IDF (1-2 grams)**.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

X_train, X_test, y_train, y_test = train_test_split(X_tfidf_ngrams, labels, test_size=0.3, random_state=42, stratify=labels)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, pred))
print("\nReport:\n", classification_report(y_test, pred))
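
For reading the confusion matrix, it can help to build one by hand. The sketch below assumes the layout scikit-learn uses (rows are true labels, columns are predicted labels, labels in sorted order). The label lists are hypothetical and are not results from this notebook.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]

# Hypothetical labels, not output from the classifier above
y_true_demo = ["neg", "neg", "pos", "pos", "pos"]
y_pred_demo = ["neg", "pos", "pos", "pos", "neg"]
labels_sorted = ["neg", "pos"]

matrix = confusion_counts(y_true_demo, y_pred_demo, labels_sorted)
# Accuracy is the diagonal (correct predictions) divided by the total
accuracy = sum(matrix[i][i] for i in range(len(labels_sorted))) / len(y_true_demo)

print(matrix)
print(accuracy)
```

Keep in mind that a 30% test split of ten examples leaves only three test rows, so every metric on the toy set moves in large steps.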

## 6) Inspect Important Features
Which words/phrases push the model toward **positive** vs **negative**?

In [None]:
import numpy as np

feature_names = tfidf_ngrams.get_feature_names_out()
coefs = clf.coef_[0]  # Binary classifier -> single row
top_pos_idx = np.argsort(coefs)[-10:][::-1]
top_neg_idx = np.argsort(coefs)[:10]

print("Top POSITIVE indicators:")
for i in top_pos_idx:
    print(f"{feature_names[i]:25s}  {coefs[i]: .3f}")

print("\nTop NEGATIVE indicators:")
for i in top_neg_idx:
    print(f"{feature_names[i]:25s}  {coefs[i]: .3f}")
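
The sign of a coefficient matters because a binary logistic regression turns a weighted sum into a probability with the sigmoid function. With the labels `"neg"` and `"pos"`, `clf.coef_[0]` refers to the second class in sorted order (`"pos"`), so positive weights push toward positive sentiment. A minimal sketch with hypothetical weights and feature values, not the fitted ones:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights; the fitted values live in clf.coef_[0] and clf.intercept_
weights = {"love": 0.8, "amazing": 0.6, "terrible": -0.9, "waste": -0.7}
bias = 0.0

def prob_positive(features):
    """P(pos) = sigmoid(bias + sum of weight * feature value)."""
    z = bias + sum(weights.get(term, 0.0) * value for term, value in features.items())
    return sigmoid(z)

p_pos = prob_positive({"love": 0.7, "amazing": 0.7})
p_neg = prob_positive({"terrible": 0.7, "waste": 0.7})
print(round(p_pos, 3), round(p_neg, 3))
```

Features the model has never seen contribute nothing, so an input with no known terms lands at `sigmoid(bias)`.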

## 7) Topic Modeling (LDA)
Unsupervised discovery of themes with **Latent Dirichlet Allocation**.
We'll use a small corpus of product reviews and tech sentences.

In [None]:
extra_corpus = [
    "The laptop performance is great for programming and data analysis",
    "Battery life could be better but the keyboard is comfortable",
    "I love the new camera features and image stabilization",
    "Customer service resolved my issue quickly and professionally",
    "Hate the lag and random crashes after the latest update",
    "Docker and Kubernetes help scale microservices in production",
    "Neural networks excel at image and text classification",
    "Cloud storage redundancy prevents accidental data loss",
    "The display is crisp, but speakers are too quiet",
    "Refund process was smooth and the agent was polite"
]

from sklearn.decomposition import LatentDirichletAllocation as LDA

lda_vec = CountVectorizer(analyzer=clean_tokenize, min_df=1)
X_lda = lda_vec.fit_transform(extra_corpus)
lda = LDA(n_components=2, random_state=42, learning_method="batch")
lda.fit(X_lda)

words = np.array(lda_vec.get_feature_names_out())
for topic_idx, comp in enumerate(lda.components_):
    top_idx = np.argsort(comp)[-10:][::-1]
    print(f"\nTopic {topic_idx}:")
    print(", ".join(words[top_idx]))
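
The rows of `lda.components_` are unnormalized pseudo-counts; dividing each row by its sum turns it into a word distribution for that topic. The sketch below shows that normalization with the standard library. The vocabulary and numbers are hypothetical, chosen only to mirror the array's layout.

```python
# Hypothetical pseudo-counts with the same layout as lda.components_ (topics x words)
vocab = ["battery", "camera", "cloud", "docker"]
components = [
    [4.0, 5.0, 0.5, 0.5],   # topic 0
    [0.5, 0.5, 3.0, 6.0],   # topic 1
]

def normalize_rows(rows):
    """Divide every row by its sum so each row becomes a probability distribution."""
    return [[v / sum(row) for v in row] for row in rows]

topic_word = normalize_rows(components)
top_words = []
for k, dist in enumerate(topic_word):
    ranked = sorted(zip(vocab, dist), key=lambda pair: pair[1], reverse=True)
    top_words.append([w for w, _ in ranked[:2]])
    print(f"Topic {k}:", [(w, round(p, 2)) for w, p in ranked[:2]])
```

Ranking by raw pseudo-counts, as the cell above does with `np.argsort`, gives the same word order; normalizing only makes the values comparable across topics.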

## 8) Text Similarity (Cosine)
Build a simple search: given a **query**, retrieve the most similar sentences.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Use TF-IDF on the extra_corpus (tokenizer=, not analyzer=, so ngram_range takes effect)
tfidf_sim = TfidfVectorizer(tokenizer=clean_tokenize, token_pattern=None, ngram_range=(1, 2))
X_sim = tfidf_sim.fit_transform(extra_corpus)

def search(query, top_k=3):
    q = tfidf_sim.transform([query])
    scores = cosine_similarity(q, X_sim).flatten()
    best_idx = np.argsort(scores)[-top_k:][::-1]
    return [(extra_corpus[i], float(scores[i])) for i in best_idx]

for q in ["camera stabilization", "customer support", "cloud production", "neural networks"]:
    print(f"\nQuery: {q}")
    for sent, sc in search(q):
        print(f"  -> ({sc:.3f}) {sent}")
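
Cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a small standard-library sketch on sparse `{term: weight}` dictionaries; the vectors use hypothetical unit weights rather than real TF-IDF values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an all-zero vector has no direction; treat it as "no match"
    return dot / (norm_a * norm_b)

query = {"camera": 1.0, "stabilization": 1.0}
doc_a = {"love": 1.0, "camera": 1.0, "stabilization": 1.0, "image": 1.0}
doc_b = {"cloud": 1.0, "storage": 1.0}

score_a = cosine(query, doc_a)
score_b = cosine(query, doc_b)
print(round(score_a, 3), round(score_b, 3))
```

A query made only of words outside the fitted vocabulary becomes an all-zero vector and scores 0 against every sentence, so it is worth filtering out zero-score results before showing them.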

## 9) (Optional) Named Entity Recognition with spaCy
Try extracting **people, orgs, locations** automatically.  
This cell uses the small English model; if it's not available, the install step above adds it.

In [None]:
try:
    import spacy
    nlp = spacy.load("en_core_web_sm")
    sample = "Apple is opening a new office in Bengaluru and hiring 500 engineers in 2025."
    doc = nlp(sample)
    print([(ent.text, ent.label_) for ent in doc.ents])
except Exception as e:
    print("spaCy or model not installed. Run the setup cell above if you want to try NER.")
    print("Error:", e)

## 10) (Optional) Transformers (Hugging Face)
Quick demo using a pre-trained sentiment pipeline.  
> This downloads a pre-trained model (roughly a few hundred MB) on first run; skip if you're offline.

In [None]:
try:
    from transformers import pipeline
    clf_pipe = pipeline('sentiment-analysis')
    print(clf_pipe("I absolutely love this phone!"))
    print(clf_pipe("This is the worst update ever."))
except Exception as e:
    print("Transformers not available (or no internet). You can install with:")
    print("!pip install transformers torch --quiet")
    print("Error:", e)

## 11) Mini‑Project: Sentiment Classifier Function
A small utility you can re-use. It trains on our toy set and predicts on new text.

In [None]:
from sklearn.pipeline import Pipeline

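# Named sentiment_model so it does not shadow the `pipeline` function imported from transformers above
sentiment_model = Pipeline([
    # tokenizer= (not analyzer=) so that ngram_range is applied
    ("tfidf", TfidfVectorizer(tokenizer=clean_tokenize, token_pattern=None, ngram_range=(1, 2))),
    ("lr", LogisticRegression(max_iter=1000))
])

sentiment_model.fit(texts, labels)

def predict_sentiment(s: str):
    """Return (predicted label, probability of that label) for one string."""
    proba = sentiment_model.predict_proba([s])[0]
    best = proba.argmax()
    return sentiment_model.classes_[best], proba[best]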

tests = [
    "I am delighted with the new features and the speed",
    "Horrible bug, app keeps crashing and support ignores me",
]
for t in tests:
    label, conf = predict_sentiment(t)
    print(f"{t} -> {label} ({conf:.2f})")

## Next Steps
- Replace the toy dataset with your real data (CSV), wrap pipelines in functions
- Try cross-validation (e.g., `StratifiedKFold`)
- Clean text more (URLs, emojis), add **char-level n‑grams** for misspellings
- Use **GridSearchCV** to tune hyperparameters
- Try **fastText**, **GloVe**, or **Transformers** for stronger accuracy
- Move to **spaCy** for production pipelines (tokenization, NER, POS)
- Explore **LangChain / RAG** for QA over your documents
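
As a small illustration of why char-level n-grams help with misspellings, the sketch below compares character trigrams of a word and a typo. The helper names are hypothetical; padding each word with spaces loosely mirrors what scikit-learn's `analyzer="char_wb"` option does at word boundaries.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word, padded with one space on each side."""
    padded = f" {word} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Overlap between two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

correct = char_ngrams("battery")
typo = char_ngrams("batery")
other = char_ngrams("camera")

print(jaccard(correct, typo))    # misspelling still shares most trigrams
print(jaccard(correct, other))   # unrelated word shares none
```

At the word level, "batery" and "battery" are simply two different features; at the character level they overlap heavily, so a model can still connect them.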