
---

# 📚🤖 NLP Interview Preparation Cheatsheet

---

## 1️⃣ NLP Core Pipeline (↔︎ your `1_NLP_Core_Pipeline`) 🛠️

| 🔑 Step            | 🎯 Goal                  | 🐍 Key Imports / Functions                                                                         | 🧠 Mini-Notes / Examples                                          |
| ------------------ | ------------------------ | -------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| Data Collection 🌐 | Ingest text              | `requests`, `bs4.BeautifulSoup`, `pandas.read_*`, `datasets.load_dataset`                          | APIs, scraping, CSV/JSON/Parquet; HF Datasets for benchmarks      |
| Preprocessing 🧹   | Clean/normalize          | `re.sub`, `unicodedata.normalize`, `html.unescape`, `langdetect`                                   | Lowercasing (task-dependent), unicode NFKC, de-HTML, lang filter  |
| Tokenization ✂️    | Split to tokens/subwords | `nltk.word_tokenize`, `spacy.load("en_core_web_sm")`, `transformers.AutoTokenizer.from_pretrained` | Rule/word vs BPE/WordPiece/SentencePiece; `is_split_into_words`   |
| Representation 🧩  | Text → vectors           | `CountVectorizer`, `TfidfVectorizer`, `gensim.models.Word2Vec`, `SentenceTransformer`              | BoW/TF-IDF; static (word2vec/GloVe/FastText); sentence embeddings |
| Modeling 🤖        | Learn mapping            | `sklearn` (NB/SVM/LogReg), `torch`, `tensorflow`, `transformers`                                   | Classical baselines first; DL/Transformers for SOTA               |
| Evaluation 📊      | Score models             | `sklearn.metrics`, `sacrebleu.corpus_bleu`, `rouge_score`                                          | Use task-appropriate metrics (see §10)                            |
| Deployment 🚀      | Serve safely             | `fastapi`, `uvicorn`, `bentoml`, `onnxruntime`, `torchserve`                                       | Schemas via `pydantic`; health/probe; version your models         |
| Monitoring 🔍      | Track real-world         | `mlflow`, `wandb`, `evidently`, `prometheus_client`                                                | Data/label drift, Hit@k, latency, cost, guardrails                |

---

## 2️⃣ Preprocessing & Cleaning Methods (deep dive) 🧽

| 🧹 Method             | 🐍 Snippet / Import                                    | 📝 When / Why                                  |
| --------------------- | ------------------------------------------------------ | ---------------------------------------------- |
| Case/Whitespace       | `text.lower()`, `" ".join(text.split())`               | Normalize only if case isn’t signal            |
| Unicode Normalization | `unicodedata.normalize("NFKC", text)`                  | Fold compatibility forms (fullwidth, ligatures) |
| Punct/Digits Removal  | `re.sub(r"[^\w\s]", "", t)`, `re.sub(r"\d+", "", t)`   | Destroys dates, IDs, codes; apply selectively  |
| Stopwords             | `nltk.corpus.stopwords` / `spacy.lang.en.stop_words`   | Often helpful in BoW; less so for transformers |
| Lemma/Stemming        | `nlp(text)` + `token.lemma_` (spaCy), `nltk.stem.PorterStemmer` | Lemmas are real words; stems are faster but cruder |
| Contractions          | `contractions.fix(text)`                               | English normalization                          |
| Language Detection    | `langdetect.detect(text)`                              | Filter multilingual corpora                    |
| PII Redaction         | `presidio_analyzer`                                    | Compliance, anonymization                      |
| Sentence Split        | `doc.sents` (spaCy), `nltk.sent_tokenize`              | Upstream for summarization/MT                  |
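
Several of the steps above chain naturally into one helper. The sketch below uses only the standard library; the function name `clean_text`, the step order, and the crude tag-stripping regex are illustrative assumptions, not a fixed recipe.

```python
import html
import re
import unicodedata


def clean_text(text: str, lowercase: bool = False) -> str:
    """Decode HTML entities, apply NFKC, strip tags, and collapse whitespace."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # fullwidth/compat forms -> standard forms
    text = re.sub(r"<[^>]+>", " ", text)        # crude tag removal (assumption: no real HTML parsing needed)
    text = " ".join(text.split())               # collapse runs of whitespace
    return text.lower() if lowercase else text  # lowercase only when case is not signal


print(clean_text("<p>Ｃafé &amp;  bar</p>"))  # Café & bar
```

For messy real-world HTML, a parser such as `bs4.BeautifulSoup` is usually a safer choice than the regex used here.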

---

## 3️⃣ Text Representation Techniques (↔︎ `3_Feature_Representation`) 🧩

| 🧩 Technique            | 🐍 Import / API                                    | ⚙️ Notes / Params                |
| ----------------------- | -------------------------------------------------- | -------------------------------- |
| Bag of Words            | `CountVectorizer(ngram_range=(1,2), min_df=2)`     | Sparse, fast baseline            |
| TF-IDF                  | `TfidfVectorizer(max_features=50_000)`             | Downweights common terms         |
| Hashing Trick           | `HashingVectorizer(n_features=2**20)`              | Memory-fixed, no vocab           |
| Static Embeds           | `gensim.models.Word2Vec`, `gensim.models.FastText` | OOV handling (FastText subwords) |
| Contextual Token Embeds | `AutoModel.from_pretrained("bert-base-uncased")`   | Use last-hidden states           |
| Sentence Embeds         | `sentence_transformers` (e.g., `all-MiniLM-L6-v2`) | Semantic search, clustering      |
| Dim. Reduction          | `TruncatedSVD`, `PCA`, `UMAP`                      | SVD for sparse TF-IDF            |
| Pooling                 | mean/max/CLS token                                 | For token→sentence aggregation   |
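
Mean pooling is easy to get wrong when batches are padded. A minimal NumPy sketch, assuming token embeddings of shape `(batch, seq_len, hidden)` and a 0/1 attention mask as produced by typical tokenizers; the function name `mean_pool` is hypothetical.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid divide-by-zero
    return summed / counts


emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # last position is padding
mask = np.array([[1, 1, 0]])
print(mean_pool(emb, mask))  # [[2. 3.]]
```

A plain `emb.mean(axis=1)` would let the padding vector dominate the result, which is why the mask matters.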

---

## 4️⃣ Programming Tools (↔︎ `3_Programming_Tools`) ⚙️

| 🧰 Category     | 📦 Library / Import                               | 💡 Tip                      |
| --------------- | ------------------------------------------------- | --------------------------- |
| Essentials      | `re`, `json`, `itertools`, `functools`, `pathlib` | Build tiny utilities fast   |
| NLP Classics    | `nltk`, `spacy`, `gensim`                         | spaCy for fast pipelines    |
| HF Stack        | `transformers`, `datasets`, `tokenizers`          | Unified models + datasets   |
| Training Utils  | `accelerate`, `optuna`, `wandb`                   | Multi-GPU + HPO + tracking  |
| Serving         | `fastapi`, `bentoml`                              | Typed endpoints w/ Pydantic |
| Data Versioning | `dvc`, `git-lfs`                                  | Reproducible corpora        |
| Safety          | `presidio`, moderation classifiers                | PII, toxicity checks        |

---

## 5️⃣ Classical ML in NLP (↔︎ `4_Machine_Learning`) 🧮

| 🧮 Model            | 🐍 Import                                             | ✅ Good For           | ⚠️ Note             |
| ------------------- | ----------------------------------------------------- | -------------------- | ------------------- |
| Multinomial NB      | `from sklearn.naive_bayes import MultinomialNB`       | BoW/TF-IDF text cls  | Strong baseline     |
| Logistic Regression | `from sklearn.linear_model import LogisticRegression` | Linearly separable cls | Raise `max_iter` if it fails to converge |
| Linear SVM          | `from sklearn.svm import LinearSVC`                   | High-dim sparse      | Robust margin       |
| SGD (hinge/log)     | `from sklearn.linear_model import SGDClassifier`      | Large streaming      | Partial fit         |
| CRF                 | `sklearn-crfsuite`                                    | NER/POS              | Needs hand features |
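
A TF-IDF + Logistic Regression baseline fits in a few lines with a scikit-learn pipeline. This is a sketch on toy data invented for illustration; hyperparameters are placeholders, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = [
    "great movie, loved it",
    "wonderful acting and a great plot",
    "terrible film, hated it",
    "awful plot and terrible acting",
]
labels = ["pos", "pos", "neg", "neg"]

# Vectorizer and classifier live in one object, so the same
# preprocessing is applied at fit time and at predict time.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)

preds = baseline.predict(["loved the wonderful plot", "hated the awful film"])
print(list(preds))
```

Swapping `LogisticRegression` for `LinearSVC` or `MultinomialNB` in the same pipeline gives the other baselines from the table with no further changes.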

---

## 6️⃣ Deep Learning for NLP (↔︎ `5_DeepLearning`) 🔥

| 🧠 Family  | 🐍 Import / Layer                  | 📌 Use-case     | 💡 Notes                         |
| ---------- | ---------------------------------- | --------------- | -------------------------------- |
| RNN        | `nn.RNN`, `keras.layers.SimpleRNN` | Basic seq model | Rarely used now                  |
| LSTM       | `nn.LSTM`, `keras.layers.LSTM`     | Seq tagging/gen | Long deps                        |
| GRU        | `nn.GRU`                           | Lighter LSTM    | Competitive                      |
| CNN-Text   | `keras.layers.Conv1D`              | Classification  | N-gram features                  |
| Attention  | `nn.MultiheadAttention`            | Focus tokens    | Building block of Transformers   |
| BiLSTM-CRF | `torchcrf.CRF`                     | NER             | Still strong classical-DL hybrid |

---

## 7️⃣ Transformers (↔︎ `6_Transformers`) ⚡

| 🏗️ Component / Topic          | 🐍 Import / Tool                                  | 🔧 Key Params / Tips                  |
| ------------------------------ | ------------------------------------------------- | ------------------------------------- |
| Tokenizer                      | `AutoTokenizer`                                   | `padding`, `truncation`, `max_length` |
| Encoder (BERT/RoBERTa/DeBERTa) | `AutoModel`, `AutoModelForSequenceClassification` | Use `[CLS]` pooling or mean-pool      |
| Seq2Seq (T5/BART/Marian)       | `AutoModelForSeq2SeqLM`                           | `num_beams`, `length_penalty`         |
| Decoder-only (GPT-style)       | `AutoModelForCausalLM`                            | `temperature`, `top_k`, `top_p`       |
| Efficient FT                   | `peft` (LoRA/QLoRA), `bitsandbytes`               | 8/4-bit + adapters = cheap FT         |
| Trainer API                    | `transformers.Trainer`                            | `learning_rate`, `weight_decay`, `warmup_ratio` (in `TrainingArguments`) |
| Inference                      | `pipeline`, `vllm`                                | Batch, static shapes, FP16/BF16       |
| Long Context                   | RoPE/ALiBi scaling                                | Segment packing, sliding window       |

---

## 8️⃣ LLMs & GenAI (↔︎ `7_LLMs_GenAI`) 🧠

| 🧩 Technique      | 🐍 Library                      | 📝 What to Know                         |
| ----------------- | ------------------------------- | --------------------------------------- |
| Pretraining       | `transformers`, `datasets`      | MLM/CLM objectives, tokenizer choice    |
| SFT (Instruction) | `trl.SFTTrainer`                | Supervised fine-tuning on instructions  |
| Preference Tuning | `trl` (DPO/RLHF)                | Align outputs to preferences            |
| PEFT              | `peft`                          | LoRA, adapters, prompt tuning           |
| Decoding          | `generate()`                    | Greedy/beam/top-k/top-p/typical         |
| Safety            | moderation, regex/struct schema | Refuse lists, output schemas (Pydantic) |
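
Top-p (nucleus) decoding is a common whiteboard question. The sketch below implements the filtering step in plain Python on a toy distribution; `top_p_filter` and `sample` are hypothetical helper names, and real libraries apply the same idea to logits over the full vocabulary.

```python
import random


def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose mass reaches p, renormalized."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= p:  # nucleus is complete once it covers mass p
            break
    total = sum(prob for _, prob in kept)
    return {token_id: prob / total for token_id, prob in kept}


def sample(probs, p=0.9, seed=None):
    """Draw one token id from the renormalized nucleus."""
    rng = random.Random(seed)
    nucleus = top_p_filter(probs, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]


nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.75)
print(sorted(nucleus))  # [0, 1]
```

Greedy decoding is the limiting case where only the single most probable token is kept; top-k keeps a fixed count instead of a fixed probability mass.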

---

## 9️⃣ Core NLP Tasks (↔︎ `8_Core_Tasks`) 🧱

| 🎯 Task                              | 🐍 Common APIs                                                                                | 📏 Typical Metrics           | 💡 Notes                         |
| ------------------------------------ | --------------------------------------------------------------------------------------------- | ---------------------------- | -------------------------------- |
| Classification (sentiment/topic) 🏷️ | `AutoModelForSequenceClassification`, `pipeline("text-classification")`, `LogisticRegression` | Acc/Prec/Rec/F1              | Start with LogReg+TF-IDF         |
| Sequence Labeling (NER/POS) 🧷       | `AutoModelForTokenClassification`, `spacy`                                                    | F1 (entity-level), `seqeval` | Align labels with subwords       |
| QA (extractive/abstractive) ❓        | `pipeline("question-answering")`, `AutoModelForSeq2SeqLM`                                     | EM/F1, ROUGE                 | Context window limits            |
| Summarization 📚                     | `pipeline("summarization")`                                                                   | ROUGE/BERTScore              | Length control: `min_new_tokens` |
| MT 🌍                                | `MarianMTModel`, `pipeline("translation")`                                                    | BLEU/chrF (via `sacrebleu`)  | Domain adaptation helps          |
| Paraphrase/STS 🔁                    | `sentence_transformers`                                                                       | Pearson/Spearman             | Cosine sim on sentence embeds    |
| Information Extraction 📑            | `doc.ents` (spaCy NER), `spacy.matcher.Matcher`, `re`                                         | Precision/Recall/F1          | Rule+ML hybrids practical        |
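
The "align labels with subwords" note for sequence labeling can be sketched without loading a model. The helper below assumes a `word_ids` list shaped like the output of a Hugging Face fast tokenizer's `word_ids()` (`None` for special tokens, otherwise the source word index); the function name and the label ids are invented for the example.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else.

    ignore_index=-100 matches the default ignored target of PyTorch's
    cross-entropy loss, so masked positions do not contribute to training.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:              # special tokens such as [CLS], [SEP], padding
            aligned.append(ignore_index)
        elif word_id != previous:        # first subword of a new word
            aligned.append(word_labels[word_id])
        else:                            # continuation subword
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hypothetical tokenization: [CLS] Wash ##ington is nice [SEP]
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [1, 0, 0]  # e.g. 1 = B-LOC, 0 = O (label ids invented for the example)
print(align_labels(word_ids, word_labels))  # [-100, 1, -100, 0, 0, -100]
```

Labeling only the first subword is one common convention; another is to copy the label to every subword of the word.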

---

## 🔟 Evaluation Metrics (↔︎ `5_Evaluation`) 📊

| 📏 Metric           | 🐍 Call                                    | 🧠 Use              |
| ------------------- | ------------------------------------------ | ------------------- |
| Accuracy            | `accuracy_score(y, yhat)`                  | Balanced/easy tasks |
| Precision/Recall/F1 | `precision_recall_fscore_support`          | Imbalanced classes  |
| AUC-ROC/PR          | `roc_auc_score`, `average_precision_score` | Prob classifiers    |
| Confusion Matrix    | `confusion_matrix`                         | Error analysis      |
| BLEU / sacreBLEU    | `sacrebleu.corpus_bleu`                    | MT n-gram overlap   |
| ROUGE               | `rouge_score`                              | Summaries           |
| BERTScore           | `bert_score.score`                         | Semantic similarity |
| Perplexity          | `exp(loss)` (loss = mean token cross-entropy) | Language models  |
| Ranking (IR/RAG)    | Hit@k, MRR, nDCG (`trec_eval`/custom)      | Retrieval quality   |
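
The ranking metrics and perplexity are short enough to write by hand in an interview. A plain-Python sketch, assuming each query comes with a ranked list of document ids and a set of relevant ids; the document ids below are made up.

```python
import math


def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant id appears in the top-k results, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


queries = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant hit at rank 2
    (["d2", "d5", "d9"], {"d9"}),  # first relevant hit at rank 3
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(round(mrr, 4))               # 0.4167
print(hit_at_k(*queries[1], k=2))  # 0.0
```

MRR is the mean of the per-query reciprocal ranks, so a system that always puts the answer at rank 2 scores 0.5.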

---

## 1️⃣1️⃣ Multilingual & Multimodal (↔︎ `9_Multimodal`) 🌐🎧🖼️

| 🌍/🖼️ Domain         | 🐍 Library                     | 🧠 Notes                                  |
| --------------------- | ------------------------------ | ----------------------------------------- |
| Multilingual Encoders | `AutoModel` (XLM-R, mBERT)     | Shared subwords; consider script coverage |
| Translation           | `MarianMT`, `NLLB` via HF      | Use `sacrebleu` for eval                  |
| Speech → Text (ASR)   | `openai/whisper`, `torchaudio` | 16kHz mono, VAD helps                     |
| Text → Speech         | `TTS` (Coqui), `pyttsx3`       | Prosody, multilingual voices              |
| Vision-Language       | `CLIP`, `BLIP`, `Llava`        | Cross-modal retrieval/QA                  |

---

## 1️⃣2️⃣ Agentic AI & RAG (↔︎ `10_Agentic_AI`) 🕵️‍♂️🧭

| 🧠 Component     | 🐍 Tooling                                   | 📝 Key Ideas                                  |
| ---------------- | -------------------------------------------- | --------------------------------------------- |
| Retrieval        | `faiss`, `chromadb`, `sentence_transformers` | Chunking, overlap, hybrid search (BM25+dense) |
| Generators       | `AutoModelForSeq2SeqLM` / CausalLM           | Grounding + citations                         |
| Orchestration    | `langchain`, `langgraph`                     | Tools, planning (ReAct, Plan&Execute)         |
| Evaluation (RAG) | `ragas`                                      | Faithfulness, answer relevance                |
| Caching          | `redis` / in-proc                            | Cut cost/latency; key by prompt hash, expire via TTL |
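
Chunking with overlap and prompt-hash cache keys are both small utilities. The sketch below is one possible shape, assuming whitespace-delimited words as the chunking unit (token-based chunking is also common); `chunk_words`, `cache_key`, and the `model_name` parameter are hypothetical.

```python
import hashlib


def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):  # this window already reaches the end
            break
    return chunks


def cache_key(prompt, model_name):
    """Stable cache key from the model name and the exact prompt text."""
    return hashlib.sha256(f"{model_name}\n{prompt}".encode("utf-8")).hexdigest()


print(chunk_words("a b c d e f g", chunk_size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g']
```

Including the model name in the key avoids serving a cached answer from a different model for the same prompt.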

---

## 1️⃣3️⃣ Scalability, Prod Readiness & Trends (↔︎ `11_Scalability_Trends`) 🏗️

| 🏭 Topic              | 🔧 Stack                                         | 💡 Practical Tip              |
| --------------------- | ------------------------------------------------ | ----------------------------- |
| Packaging             | `poetry`, `uv`, `Docker`                         | Pin models/tokenizers         |
| Serving at Scale      | `vllm`, TGI, `ray`                               | Throughput via paged KV cache |
| Observability         | `mlflow`, `wandb`, `prometheus`, `opentelemetry` | Trace→span LLM calls          |
| Cost Controls         | batching, quant (8/4-bit), early-exit            | Log tokens & latency          |
| Safety & Ethics       | policy checks, `presidio`, curated red teams     | Pre-deploy red teaming        |
| Data/Model Versioning | `dvc`, model cards                               | Always log config + seed      |

---

## 1️⃣4️⃣ Python Imports & Built-ins (with tiny examples) 🐍

| 🔧 Item                 | ✨ Example                         | 🧠 Use               |
| ----------------------- | --------------------------------- | -------------------- |
| `re.sub`                | `re.sub(r"\s+", " ", t)`          | Whitespace collapse  |
| `unicodedata.normalize` | `normalize("NFKC", t)`            | Unicode tidy         |
| `collections.Counter`   | `Counter(tokens).most_common(10)` | Top-k words          |
| `itertools.islice`      | `list(islice(iterable, 100))`     | Take first N         |
| `functools.lru_cache`   | `@lru_cache`                      | Cache heavy funcs    |
| `pathlib.Path.glob`     | `Path("data").glob("*.txt")`      | File discovery       |
| `json.loads/dumps`      | `json.loads(s)`                   | Data interchange     |
| `textwrap.shorten`      | `shorten(t, width=120)`           | Console summaries    |
| `argparse`              | CLI flags                         | Reproducible scripts |
| `typing` + `pydantic`   | Typed DTOs                        | Safer I/O schemas    |
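
Several of these built-ins combine into a typical word-frequency utility. A small sketch with hypothetical helper names (`normalize`, `top_words`) and made-up input lines:

```python
import re
from collections import Counter
from functools import lru_cache
from itertools import islice


@lru_cache(maxsize=1024)
def normalize(token: str) -> str:
    """Lowercase and strip non-word characters; cached because tokens repeat."""
    return re.sub(r"[^\w]", "", token.lower())


def top_words(lines, n=3, max_lines=1000):
    """Count normalized tokens over at most max_lines lines of an iterable."""
    counts = Counter()
    for line in islice(lines, max_lines):  # works on generators and open files too
        counts.update(t for t in map(normalize, line.split()) if t)
    return counts.most_common(n)


print(top_words(["The cat sat.", "The cat ran!", "The dog."], n=2))
# [('the', 3), ('cat', 2)]
```

Because `islice` consumes lazily, the same function can be pointed at an open file without reading it fully into memory.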

---

## 1️⃣5️⃣ Quick Hugging Face Patterns 🧪

| 🎯 Task       | 🐍 One-liners                                                                  | 💡 Notes                                                 |
| ------------- | ------------------------------------------------------------------------------ | -------------------------------------------------------- |
| Tokenizer     | `tok = AutoTokenizer.from_pretrained("bert-base-uncased")`                     | `padding=True` pads to longest in batch; `"max_length"` for fixed shapes |
| Text Cls      | `pipe = pipeline("text-classification", model=...)`                            | For production, use `AutoModelForSequenceClassification` |
| Summarization | `pipeline("summarization")(text, max_new_tokens=128)`                          | Control length                                           |
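| Generation    | `model.generate(**tok(text, return_tensors="pt"), do_sample=True, top_p=0.9, temperature=0.7)` | Sampling params only apply with `do_sample=True` |
| Seq2Seq FT    | `Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)` | Pass keywords, not positionals; log metrics; early stopping |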

---

## 1️⃣6️⃣ Interview Tip Cards 💼

| 🧠 Topic         | 💡 What to Say in 20s                                                |
| ---------------- | -------------------------------------------------------------------- |
| Baselines First  | “I start with TF-IDF + LogReg; then move to Transformers if needed.” |
| Data Matters     | “I invest in cleaning, label quality, and ablations before tuning.”  |
| Metrics Fit Task | “For seq2seq I use ROUGE/BERTScore; retrieval uses MRR/nDCG.”        |
| Efficient FT     | “PEFT (LoRA/QLoRA) + 4-bit quant drastically reduces cost.”          |
| Guardrails       | “Typed schemas (Pydantic), PII redaction, moderation, and logging.”  |
| Observability    | “I track latency, token counts, drift (Evidently), and cost.”        |

---
