
---

# üìöü§ñ NLP Interview Preparation Cheatsheet

---

## 1️⃣ NLP Core Pipeline (↔ your `1_NLP_Core_Pipeline`) 🛠️

| 🔑 Step            | 🎯 Goal                  | 🐍 Key Imports / Functions                                                                         | 🧠 Mini-Notes / Examples                                          |
| ------------------ | ------------------------ | -------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| Data Collection 🌐 | Ingest text              | `requests`, `bs4.BeautifulSoup`, `pandas.read_*`, `datasets.load_dataset`                          | APIs, scraping, CSV/JSON/Parquet; HF Datasets for benchmarks      |
| Preprocessing 🧹   | Clean/normalize          | `re.sub`, `unicodedata.normalize`, `html.unescape`, `langdetect`                                   | Lowercasing (task-dependent), unicode NFKC, de-HTML, lang filter  |
| Tokenization ✂️    | Split to tokens/subwords | `nltk.word_tokenize`, `spacy.load("en_core_web_sm")`, `transformers.AutoTokenizer.from_pretrained` | Rule/word vs BPE/WordPiece/SentencePiece; `is_split_into_words`   |
| Representation 🧩  | Text → vectors           | `CountVectorizer`, `TfidfVectorizer`, `gensim.models.Word2Vec`, `SentenceTransformer`              | BoW/TF-IDF; static (word2vec/GloVe/FastText); sentence embeddings |
| Modeling 🤖        | Learn mapping            | `sklearn` (NB/SVM/LogReg), `torch`, `tensorflow`, `transformers`                                   | Classical baselines first; DL/Transformers for SOTA               |
| Evaluation 📊      | Score models             | `sklearn.metrics`, `sacrebleu.corpus_bleu`, `rouge_score`                                          | Use task-appropriate metrics (see §10)                            |
| Deployment 🚀      | Serve safely             | `fastapi`, `uvicorn`, `bentoml`, `onnxruntime`, `torchserve`                                       | Schemas via `pydantic`; health/probe; version your models         |
| Monitoring 🔍      | Track real-world         | `mlflow`, `wandb`, `evidently`, `prometheus_client`                                                | Data/label drift, Hit@k, latency, cost, guardrails                |

---

## 2️⃣ Preprocessing & Cleaning Methods (deep dive) 🧽

| 🧹 Method             | 🐍 Snippet / Import                                    | 📝 When / Why                                  |
| --------------------- | ------------------------------------------------------ | ---------------------------------------------- |
| Case/Whitespace       | `text.lower()`, `" ".join(text.split())`               | Normalize only if case isn’t signal            |
| Unicode Normalization | `unicodedata.normalize("NFKC", text)`                  | Merge look-alikes; canonicalize                |
| Punct/Digits Removal  | `re.sub(r"[^\w\s]", "", t)`, `re.sub(r"\d+", "", t)`   | Be careful with dates and codes                |
| Stopwords             | `nltk.corpus.stopwords` / `spacy.lang.en.stop_words`   | Often helpful in BoW; less so for transformers |
| Lemma/Stemming        | `spacy` (`token.lemma_`), `nltk.stem.PorterStemmer`    | Lemma preferred for readability                |
| Contractions          | `contractions.fix(text)`                               | English normalization                          |
| Language Detection    | `langdetect.detect(text)`                              | Filter multilingual corpora                    |
| PII Redaction         | `presidio_analyzer`                                    | Compliance, anonymization                      |
| Sentence Split        | `nlp.pipe` + `doc.sents`, `nltk.sent_tokenize`         | Upstream for summarization/MT                  |
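
Several rows above chain naturally into one cleaning function. The following is a minimal stdlib-only sketch, not a production cleaner: `clean_text` and the regex tag stripper are illustrative names and choices (a real pipeline would likely use an HTML parser), and lowercasing is off by default because case can carry signal.

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")  # naive tag stripper, NOT a real HTML parser
_WS_RE = re.compile(r"\s+")


def clean_text(text: str, lowercase: bool = False) -> str:
    """Unescape entities, strip tags, NFKC-normalize, collapse whitespace."""
    text = html.unescape(text)                  # "&amp;" -> "&", "&nbsp;" -> U+00A0
    text = _TAG_RE.sub(" ", text)               # drop markup such as <p>...</p>
    text = unicodedata.normalize("NFKC", text)  # fullwidth letters -> ASCII, U+00A0 -> space
    text = _WS_RE.sub(" ", text).strip()        # collapse runs of whitespace
    return text.lower() if lowercase else text


# Fullwidth "ABC" is written with escapes so the example survives any file encoding.
raw = "<p>Caf&eacute;&nbsp; \uff21\uff22\uff23   costs&nbsp;$5</p>"
cleaned = clean_text(raw)
```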

---

## 3️⃣ Text Representation Techniques (↔ `3_Feature_Representation`) 🧩

| 🧩 Technique            | 🐍 Import / API                                    | ⚙️ Notes / Params                |
| ----------------------- | -------------------------------------------------- | -------------------------------- |
| Bag of Words            | `CountVectorizer(ngram_range=(1,2), min_df=2)`     | Sparse, fast baseline            |
| TF-IDF                  | `TfidfVectorizer(max_features=50_000)`             | Downweights common terms         |
| Hashing Trick           | `HashingVectorizer(n_features=2**20)`              | Memory-fixed, no vocab           |
| Static Embeds           | `gensim.models.Word2Vec`, `gensim.models.FastText` | OOV handling (FastText subwords) |
| Contextual Token Embeds | `AutoModel.from_pretrained("bert-base-uncased")`   | Use `last_hidden_state`          |
| Sentence Embeds         | `sentence_transformers` (e.g., `all-MiniLM-L6-v2`) | Semantic search, clustering      |
| Dim. Reduction          | `TruncatedSVD`, `PCA`, `UMAP`                      | SVD for sparse TF-IDF            |
| Pooling                 | mean/max/CLS token                                 | For token→sentence aggregation   |
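
The pooling row is easy to get wrong when padding is present, so here is a small pure-Python sketch of masked mean and max pooling. It uses plain lists instead of tensors to stay dependency-free; `mean_pool` and `max_pool` are illustrative names, and the `attention_mask` convention (1 = real token, 0 = padding) is assumed to match what Hugging Face tokenizers return.

```python
from typing import List


def mean_pool(token_vecs: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vecs, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


def max_pool(token_vecs: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Take the per-dimension maximum over the unmasked token vectors."""
    kept = [vec for vec, keep in zip(token_vecs, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [max(column) for column in zip(*kept)]


# Two real tokens plus one padding vector that must not leak into the result.
token_vecs = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
sentence_mean = mean_pool(token_vecs, attention_mask)  # [2.0, 3.0]
sentence_max = max_pool(token_vecs, attention_mask)    # [3.0, 4.0]
```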

---

## 4️⃣ Programming Tools (↔ `3_Programming_Tools`) ⚙️

| 🧰 Category     | 📦 Library / Import                               | 💡 Tip                      |
| --------------- | ------------------------------------------------- | --------------------------- |
| Essentials      | `re`, `json`, `itertools`, `functools`, `pathlib` | Build tiny utilities fast   |
| NLP Classics    | `nltk`, `spacy`, `gensim`                         | spaCy for fast pipelines    |
| HF Stack        | `transformers`, `datasets`, `tokenizers`          | Unified models + datasets   |
| Training Utils  | `accelerate`, `optuna`, `wandb`                   | Multi-GPU + HPO + tracking  |
| Serving         | `fastapi`, `bentoml`                              | Typed endpoints w/ Pydantic |
| Data Versioning | `dvc`, `git-lfs`                                  | Reproducible corpora        |
| Safety          | `presidio`, moderation classifiers                | PII, toxicity checks        |

---

## 5️⃣ Classical ML in NLP (↔ `4_Machine_Learning`) 🧮

| 🧮 Model            | 🐍 Import                                             | ✅ Good For          | ⚠️ Note             |
| ------------------- | ----------------------------------------------------- | -------------------- | ------------------- |
| Multinomial NB      | `from sklearn.naive_bayes import MultinomialNB`       | BoW/TF-IDF text cls  | Strong baseline     |
| Logistic Regression | `from sklearn.linear_model import LogisticRegression` | Linearly separable   | Raise `max_iter`    |
| Linear SVM          | `from sklearn.svm import LinearSVC`                   | High-dim sparse      | Robust margin       |
| SGD (hinge/log)     | `from sklearn.linear_model import SGDClassifier`      | Large/streaming data | Use `partial_fit`   |
| CRF                 | `sklearn-crfsuite`                                    | NER/POS              | Needs hand features |
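
Interviewers often ask what Multinomial NB actually computes. The sketch below is a from-scratch illustration with add-alpha (Laplace) smoothing, not scikit-learn's implementation; `TinyMultinomialNB` and the toy documents are made up for the example, and unseen words are simply skipped at prediction time.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists with add-alpha smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)     # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(n_label / n_docs)  # log prior
            for w in tokens:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][w] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


docs = [["great", "movie"], ["loved", "it"], ["terrible", "movie"], ["hated", "it"]]
labels = ["pos", "pos", "neg", "neg"]
clf = TinyMultinomialNB().fit(docs, labels)
```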

---

## 6️⃣ Deep Learning for NLP (↔ `5_DeepLearning`) 🔥

| 🧠 Family  | 🐍 Import / Layer                  | 📌 Use-case     | 💡 Notes                         |
| ---------- | ---------------------------------- | --------------- | -------------------------------- |
| RNN        | `nn.RNN`, `keras.layers.SimpleRNN` | Basic seq model | Rarely used now                  |
| LSTM       | `nn.LSTM`, `keras.layers.LSTM`     | Seq tagging/gen | Captures long deps               |
| GRU        | `nn.GRU`                           | Lighter LSTM    | Competitive                      |
| CNN-Text   | `keras.layers.Conv1D`              | Classification  | N-gram features                  |
| Attention  | `nn.MultiheadAttention`            | Focus tokens    | Predates Transformers (RNN+attn) |
| BiLSTM-CRF | `torchcrf.CRF`                     | NER             | Still strong classical-DL hybrid |
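
For the attention row, it helps to be able to write the core computation from memory. This is a single-head scaled dot-product attention sketch in pure Python, i.e. softmax(QKᵀ/√d_k)·V on nested lists. It deliberately omits the learned projections, multiple heads, and masking that a layer such as `nn.MultiheadAttention` adds; the function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Return (outputs, weights) for single-head scaled dot-product attention."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, values))
                        for i in range(len(values[0]))])
    return outputs, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention([[1.0, 0.0]], keys, values)  # the query matches the first key best
```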

---

## 7️⃣ Transformers (↔ `6_Transformers`) ⚡

| 🏗️ Component / Topic           | 🐍 Import / Tool                                  | 🔧 Key Params / Tips                            |
| ------------------------------ | ------------------------------------------------- | ----------------------------------------------- |
| Tokenizer                      | `AutoTokenizer`                                   | `padding`, `truncation`, `max_length`           |
| Encoder (BERT/RoBERTa/DeBERTa) | `AutoModel`, `AutoModelForSequenceClassification` | Use `CLS` pooling or mean-pool                  |
| Seq2Seq (T5/BART/Marian)       | `AutoModelForSeq2SeqLM`                           | `num_beams`, `length_penalty`                   |
| Decoder-only (GPT-style)       | `AutoModelForCausalLM`                            | `temperature`, `top_k`, `top_p`                 |
| Efficient FT                   | `peft` (LoRA/QLoRA), `bitsandbytes`               | 8/4-bit + adapters = cheap FT                   |
| Trainer API                    | `Trainer`, `TrainingArguments`                    | `learning_rate`, `weight_decay`, `warmup_ratio` |
| Inference                      | `pipeline`, `vllm`                                | Batch, static shapes, FP16/BF16                 |
| Long Context                   | RoPE/ALiBi scaling                                | Segment packing, sliding window                 |
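
The "8/4-bit + adapters = cheap FT" claim is easier to defend with numbers. The following back-of-the-envelope sketch assumes a single weight matrix of shape `d_out × d_in` adapted with LoRA factors `B (d_out × r)` and `A (r × d_in)`, and counts weight storage only (no activations, optimizer state, or quantization overhead). The 4096 dimensions and 7-billion parameter count are illustrative, not tied to any specific model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Return (full_params, lora_params) for one adapted weight matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A is (rank x d_in), B is (d_out x rank)
    return full, lora


def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Rough weight-only footprint in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
trainable_fraction = lora / full                 # 65,536 / 16,777,216 = 0.390625%

fp16_gb = weight_memory_gb(7_000_000_000, 16)    # 14.0 GB for a hypothetical 7B model
int4_gb = weight_memory_gb(7_000_000_000, 4)     # 3.5 GB at 4 bits per weight
```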

---

## 8️⃣ LLMs & GenAI (↔ `7_LLMs_GenAI`) 🧠

| 🧩 Technique      | 🐍 Library                      | 📝 What to Know                         |
| ----------------- | ------------------------------- | --------------------------------------- |
| Pretraining       | `transformers`, `datasets`      | MLM/CLM objectives, tokenizer choice    |
| SFT (Instruction) | `trl.SFTTrainer`                | Supervised fine-tuning on instructions  |
| Preference Tuning | `trl` (DPO/RLHF)                | Align outputs to preferences            |
| PEFT              | `peft`                          | LoRA, adapters, prompt tuning           |
| Decoding          | `generate()`                    | Greedy/beam/top-k/top-p/typical         |
| Safety            | moderation, regex/struct schema | Refuse lists, output schemas (Pydantic) |
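
To make the decoding row concrete, here is a small pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It illustrates the idea only: real `generate()` implementations work on tensors and may differ in details such as tie handling. All function names here are illustrative.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest high-probability set with cumulative mass >= top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=1.0, top_p=0.9, rng=None):
    """Draw one token id from the temperature-scaled, top-p-filtered distribution."""
    rng = rng or random.Random()
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.9)  # keeps token ids 0, 1, 2
```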

---

## 9️⃣ Core NLP Tasks (↔ `8_Core_Tasks`) 🧱

| 🎯 Task                              | 🐍 Common APIs                                                                                | 📏 Typical Metrics           | 💡 Notes                         |
| ------------------------------------ | --------------------------------------------------------------------------------------------- | ---------------------------- | -------------------------------- |
| Classification (sentiment/topic) 🏷️  | `AutoModelForSequenceClassification`, `pipeline("text-classification")`, `LogisticRegression` | Acc/Prec/Rec/F1              | Start with LogReg+TF-IDF         |
| Sequence Labeling (NER/POS) 🧷       | `AutoModelForTokenClassification`, `spacy`                                                    | F1 (entity-level), `seqeval` | Align labels with subwords       |
| QA (extractive/abstractive) ❓       | `pipeline("question-answering")`, `AutoModelForSeq2SeqLM`                                     | EM/F1, ROUGE                 | Context window limits            |
| Summarization 📚                     | `pipeline("summarization")`                                                                   | ROUGE/BERTScore              | Length control: `min_new_tokens` |
| MT 🌍                                | `MarianMTModel`, `pipeline("translation")`                                                    | BLEU/chrF (via `sacrebleu`)  | Domain adaptation helps          |
| Paraphrase/STS 🔁                    | `sentence_transformers`                                                                       | Pearson/Spearman             | Cosine sim on sentence embeds    |
| Information Extraction 📑            | `spacy` NER (`doc.ents`), patterns, `re`                                                      | Precision/Recall/F1          | Rule + ML hybrids are practical  |
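
"Align labels with subwords" is a frequent follow-up question for NER. The sketch below assumes you already have a `word_ids` list of the kind Hugging Face fast tokenizers expose (one entry per subword, `None` for special tokens); the example `word_ids` are hand-written rather than produced by a real tokenizer, and `align_labels` is an illustrative helper. The value `-100` is used because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def align_labels(
    word_labels: List[int],
    word_ids: List[Optional[int]],
    label_all_subwords: bool = False,
) -> List[int]:
    """Expand word-level labels to subword level.

    Special tokens (word id None) get IGNORE_INDEX. Continuation subwords get
    IGNORE_INDEX too, unless label_all_subwords is True.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:                  # [CLS], [SEP], padding, ...
            aligned.append(IGNORE_INDEX)
        elif wid != previous:            # first subword of a word
            aligned.append(word_labels[wid])
        else:                            # continuation subword
            aligned.append(word_labels[wid] if label_all_subwords else IGNORE_INDEX)
        previous = wid
    return aligned


# Hypothetical split: "Hugging" -> 2 subwords, "Face" -> 1, "rocks" -> 1.
word_labels = [1, 2, 0]
word_ids = [None, 0, 0, 1, 2, None]
first_only = align_labels(word_labels, word_ids)
all_subwords = align_labels(word_labels, word_ids, label_all_subwords=True)
```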

---

## 🔟 Evaluation Metrics (↔ `5_Evaluation`) 📊

| 📏 Metric           | 🐍 Call                                    | 🧠 Use              |
| ------------------- | ------------------------------------------ | ------------------- |
| Accuracy            | `accuracy_score(y, yhat)`                  | Balanced/easy tasks |
| Precision/Recall/F1 | `precision_recall_fscore_support`          | Imbalanced classes  |
| AUC-ROC/PR          | `roc_auc_score`, `average_precision_score` | Prob classifiers    |
| Confusion Matrix    | `confusion_matrix`                         | Error analysis      |
| BLEU / sacreBLEU    | `sacrebleu.corpus_bleu`                    | MT n-gram overlap   |
| ROUGE               | `rouge_score`                              | Summaries           |
| BERTScore           | `bert_score.score`                         | Semantic similarity |
| Perplexity          | `exp(loss)`                                | Language models     |
| Ranking (IR/RAG)    | Hit@k, MRR, nDCG (`recleval`/custom)       | Retrieval quality   |
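
The perplexity and ranking rows have no one-line library call, so a hand-rolled version is worth knowing. This is a minimal sketch with illustrative function names and made-up document ids; it assumes each query comes as a ranked list of ids plus a set of relevant ids, and that perplexity is computed from per-token negative log-likelihoods in nats.

```python
import math


def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant id appears in the top k results, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(token_nlls) / len(token_nlls))


runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant hit at rank 2 -> 0.5
    (["d2", "d9"], {"d2"}),        # rank 1 -> 1.0
    (["d4", "d5"], {"d8"}),        # never retrieved -> 0.0
]
mrr = mean_reciprocal_rank(runs)             # (0.5 + 1.0 + 0.0) / 3 = 0.5
uniform_ppl = perplexity([math.log(4)] * 3)  # uniform over 4 tokens -> about 4
```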

---

## 1️⃣1️⃣ Multilingual & Multimodal (↔ `9_Multimodal`) 🌐🎧🖼️

| üåç/üñºÔ∏è Domain         | üêç Library                     | üß† Notes                                  |
| --------------------- | ------------------------------ | ----------------------------------------- |
| Multilingual Encoders | `AutoModel` (XLM-R, mBERT)     | Shared subwords; consider script coverage |
| Translation           | `MarianMT`, `NLLB` via HF      | Use `sacrebleu` for eval                  |
| Speech ‚Üí Text (ASR)   | `openai/whisper`, `torchaudio` | 16kHz mono, VAD helps                     |
| Text ‚Üí Speech         | `TTS` (Coqui), `pyttsx3`       | Prosody, multilingual voices              |
| Vision-Language       | `CLIP`, `BLIP`, `Llava`        | Cross-modal retrieval/QA                  |

---

## 1️⃣2️⃣ Agentic AI & RAG (↔ `10_Agentic_AI`) 🕵️‍♂️🧭

| 🧠 Component     | 🐍 Tooling                                   | 📝 Key Ideas                                  |
| ---------------- | -------------------------------------------- | --------------------------------------------- |
| Retrieval        | `faiss`, `chromadb`, `sentence_transformers` | Chunking, overlap, hybrid search (BM25+dense) |
| Generators       | `AutoModelForSeq2SeqLM` / CausalLM           | Grounding + citations                         |
| Orchestration    | `langchain`, `langgraph`                     | Tools, planning (ReAct, Plan&Execute)         |
| Evaluation (RAG) | `ragas`                                      | Faithfulness, answer relevance                |
| Caching          | `redis` / in-proc                            | Cut cost/latency; key by prompt hash, add TTL |
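
Chunking with overlap, from the retrieval row, is simple enough to sketch directly. The version below works on a token list and uses whitespace splitting as a stand-in for a real tokenizer; `chunk_tokens` and the sizes are illustrative assumptions, and production systems often chunk on sentence or section boundaries instead.

```python
def chunk_tokens(tokens, chunk_size=200, overlap=50):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks


text = "retrieval augmented generation grounds answers in retrieved passages from a corpus"
tokens = text.split()  # whitespace split as a stand-in for a real tokenizer
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
```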

---

## 1️⃣3️⃣ Scalability, Prod Readiness & Trends (↔ `11_Scalability_Trends`) 🏗️

| 🏭 Topic              | 🔧 Stack                                         | 💡 Practical Tip              |
| --------------------- | ------------------------------------------------ | ----------------------------- |
| Packaging             | `poetry`, `uv`, `Docker`                         | Pin models/tokenizers         |
| Serving at Scale      | `vllm`, TGI, `ray`                               | Throughput via paged KV cache |
| Observability         | `mlflow`, `wandb`, `prometheus`, `opentelemetry` | Trace→span LLM calls          |
| Cost Controls         | batching, quant (8/4-bit), early-exit            | Log tokens & latency          |
| Safety & Ethics       | policy checks, `presidio`, curated red teams     | Pre-deploy red teaming        |
| Data/Model Versioning | `dvc`, model cards                               | Always log config + seed      |

---

## 1️⃣4️⃣ Python Imports & Built-ins (with tiny examples) 🐍

| 🔧 Item                 | ✨ Example                        | 🧠 Use               |
| ----------------------- | --------------------------------- | -------------------- |
| `re.sub`                | `re.sub(r"\s+", " ", t)`          | Whitespace collapse  |
| `unicodedata.normalize` | `normalize("NFKC", t)`            | Unicode tidy         |
| `collections.Counter`   | `Counter(tokens).most_common(10)` | Top-k words          |
| `itertools.islice`      | `list(islice(iterable, 100))`     | Take first N         |
| `functools.lru_cache`   | `@lru_cache`                      | Cache heavy funcs    |
| `pathlib.Path.glob`     | `Path("data").glob("*.txt")`      | File discovery       |
| `json.loads/dumps`      | `json.loads(s)`                   | Data interchange     |
| `textwrap.shorten`      | `shorten(t, width=120)`           | Console summaries    |
| `argparse`              | CLI flags                         | Reproducible scripts |
| `typing` + `pydantic`   | Typed DTOs                        | Safer I/O schemas    |
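
Several of the built-ins above combine well in one small utility. The sketch below counts the most frequent normalized words in the first few lines of an iterable; `normalize` and `top_words` are illustrative names, and the cache size is an arbitrary choice.

```python
import re
from collections import Counter
from functools import lru_cache
from itertools import islice


@lru_cache(maxsize=1024)
def normalize(token: str) -> str:
    """Lowercase a token and strip non-word characters (cached per distinct token)."""
    return re.sub(r"[^\w]", "", token.lower())


def top_words(lines, k=3, max_lines=100):
    """Return the k most common normalized words in the first max_lines lines."""
    counts = Counter()
    for line in islice(lines, max_lines):  # works for lists, files, generators
        counts.update(t for t in map(normalize, line.split()) if t)
    return counts.most_common(k)


lines = ["The cat sat.", "The cat ran!", "The dog sat."]
most_common = top_words(lines, k=1)
```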

---

## 1️⃣5️⃣ Quick Hugging Face Patterns 🧪

| 🎯 Task       | 🐍 One-liners                                                                                  | 💡 Notes                                                 |
| ------------- | ---------------------------------------------------------------------------------------------- | -------------------------------------------------------- |
| Tokenizer     | `tok = AutoTokenizer.from_pretrained("bert-base-uncased")`                                     | `padding="max_length"` for batching                      |
| Text Cls      | `pipe = pipeline("text-classification", model=...)`                                            | For production, use `AutoModelForSequenceClassification` |
| Summarization | `pipeline("summarization")(text, max_new_tokens=128)`                                          | Control length                                           |
| Generation    | `model.generate(**tok(text, return_tensors="pt"), do_sample=True, top_p=0.9, temperature=0.7)` | Sampling configs need `do_sample=True`                   |
| Seq2Seq FT    | `Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset)`      | Pass by keyword; log metrics; early stopping             |

---

## 1️⃣6️⃣ Interview Tip Cards 💼

| 🧠 Topic         | 💡 What to Say in 20s                                                |
| ---------------- | -------------------------------------------------------------------- |
| Baselines First  | “I start with TF-IDF + LogReg; then move to Transformers if needed.” |
| Data Matters     | “I invest in cleaning, label quality, and ablations before tuning.”  |
| Metrics Fit Task | “For seq2seq I use ROUGE/BERTScore; retrieval uses MRR/nDCG.”        |
| Efficient FT     | “PEFT (LoRA/QLoRA) + 4-bit quant drastically reduces cost.”          |
| Guardrails       | “Typed schemas (Pydantic), PII redaction, moderation, and logging.”  |
| Observability    | “I track latency, token counts, drift (Evidently), and cost.”        |

---
