# Natural Language Processing (NLP) Concepts

---

## 1. Text Pre-processing

**Definition:**  
Text pre-processing cleans and transforms raw text into a form that a machine learning model can consume. Removing noise such as inconsistent casing, stray punctuation, and inflectional variants of the same word typically improves model performance.

**Key Steps:**
1. **Tokenization:** Splitting text into smaller units called tokens (words, subwords, or characters).  
   - Example: "I love AI" → ["I", "love", "AI"]  

2. **Stemming:** Reducing words to a crude root form by stripping suffixes with heuristic rules; the result need not be a dictionary word.  
   - Example: "running", "runs" → "run"; "studies" → "studi" (Porter stemmer)  

3. **Lemmatization:** Reducing words to their base or dictionary form (lemma) using a vocabulary and the word's part of speech.  
   - Example: "better" → "good" (as an adjective), "running" → "run" (as a verb)  
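
The three steps can be sketched in plain Python. This is a minimal illustration, not how production libraries work: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical stand-ins for a real stemmer's rule set and a real lemmatizer's lexicon.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Hypothetical suffix list; real stemmers such as Porter use ordered rule sets.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Hypothetical lemma table; real lemmatizers consult a lexicon and the part of speech.
LEMMAS = {"better": "good", "running": "run", "are": "be"}


def naive_lemmatize(token):
    """Look the token up in the lemma table; unknown tokens pass through."""
    return LEMMAS.get(token.lower(), token)


print(tokenize("I love AI"))
print([naive_stem(t) for t in ["running", "jumped", "cats"]])
print([naive_lemmatize(t) for t in ["better", "running", "cats"]])
```

The naive stemmer turns "running" into "runn" because it has no rule for doubled consonants, which is the kind of case real stemming rules handle. The lemmatizer leaves "cats" unchanged because the word is missing from its tiny table.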

**Applications:**  
- Preparing text for classification, sentiment analysis, and machine translation.

---

## 2. Word Embeddings

**Definition:**  
Word embeddings are dense vector representations of words that capture semantic meaning and relationships between words in a continuous vector space.
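
Closeness in the vector space is usually measured with cosine similarity. The sketch below assumes hand-picked 3-dimensional vectors in a hypothetical `embeddings` table; trained embeddings typically have hundreds of dimensions and are learned from data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, chosen by hand for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print("cat vs dog:", cosine_similarity(embeddings["cat"], embeddings["dog"]))
print("cat vs car:", cosine_similarity(embeddings["cat"], embeddings["car"]))
```

With these toy values, "cat" scores higher against "dog" than against "car", mirroring the idea that related words sit close together.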

**Popular Methods:**
1. **Word2Vec:**  
   - Predicts a word based on its context (CBOW) or predicts context from a word (Skip-gram).  
   - Words that occur in similar contexts receive nearby vectors, so vector similarity reflects semantic similarity.  

2. **GloVe (Global Vectors):**  
   - Learns embeddings from corpus-wide word co-occurrence counts.  
   - Captures global corpus statistics rather than only local context windows.  
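
Both methods start from the same raw material: which words appear near which. The sketch below builds the (center, context) pairs that Skip-gram trains on and the raw pair counts that a co-occurrence method would start from. The function names, the toy `corpus`, and the use of unweighted counts are simplifying assumptions; neither model's actual training procedure is shown.

```python
from collections import Counter


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cooccurrence_counts(corpus, window=2):
    """Count how often each (word, neighbor) pair occurs across all sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(skipgram_pairs(sentence, window))
    return counts


# Hypothetical two-sentence corpus, already tokenized.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]

pairs = skipgram_pairs(corpus[0], window=1)
counts = cooccurrence_counts(corpus, window=1)
print(pairs)
print(counts[("the", "cat")], counts[("cat", "ran")])
```

CBOW uses the same windows with the roles reversed: the context words are the input and the center word is the prediction target.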

**Applications:**  
- Text classification, sentiment analysis, machine translation, recommendation systems.

---

## 3. Transformers (BERT, GPT)

**Definition:**  
Transformers are deep learning models that use self-attention mechanisms to process sequences in parallel and capture long-range dependencies in text.
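
The core of self-attention fits in a few lines. The sketch below is deliberately stripped down: it assumes queries, keys, and values are all the raw input vectors, whereas a real Transformer layer applies learned projection matrices, multiple heads, and positional information. The input `x` is a hypothetical toy sequence.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    output, weights = [], []
    for q in x:
        # Score this position against every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Each output vector is a weighted average of all input vectors.
        output.append([sum(w_j * v[i] for w_j, v in zip(w, x)) for i in range(d)])
    return output, weights


# Hypothetical sequence of three 2-dimensional token vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(x)
for row in attn:
    print([round(w, 3) for w in row])
```

Every position computes its weights over the whole sequence independently of the others, which is what allows parallel processing and lets distant tokens influence each other directly.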

**Key Models:**
- **BERT (Bidirectional Encoder Representations from Transformers):**  
  - Encoder-only model pre-trained with masked language modeling, so each token's representation draws on context to both its left and its right.  
  - Used for tasks like question answering, named entity recognition, and text classification.

- **GPT (Generative Pre-trained Transformer):**  
  - Decoder-only autoregressive model: it predicts each token from the tokens to its left.  
  - Generates coherent, context-aware text one token at a time.
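
The difference between the two families shows up in the attention mask. The sketch below uses hypothetical helper names and ignores padding: a BERT-style encoder lets every position attend to every other position, while a GPT-style decoder applies a causal mask so a position sees only itself and what came before.

```python
def bidirectional_mask(n):
    """Every position may attend to every position (encoder style)."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Position i may attend only to positions j <= i (decoder style)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_tokens(tokens, mask, position):
    """List the tokens that the given position is allowed to attend to."""
    return [t for t, allowed in zip(tokens, mask[position]) if allowed]


tokens = ["the", "cat", "sat", "down"]
print(visible_tokens(tokens, bidirectional_mask(4), 1))
print(visible_tokens(tokens, causal_mask(4), 1))
```

For the token "cat", the bidirectional mask exposes all four tokens, while the causal mask exposes only "the" and "cat".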

**Applications:**  
- NLP tasks: summarization, translation, chatbots, code generation.

---

## 4. Sentiment Analysis & Text Classification

**Definition:**  
- **Sentiment Analysis:** Determines the sentiment (positive, negative, neutral) expressed in text.  
- **Text Classification:** Assigns predefined categories or labels to text based on content.

**Techniques:**
1. **Rule-based methods:** Scoring text against hand-built dictionaries or sentiment lexicons.  
2. **Machine learning methods:** Naive Bayes, SVM, Random Forest.  
3. **Deep learning methods:** LSTM, GRU, Transformers (BERT, GPT).  
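
The rule-based approach is the easiest to sketch. The example below assumes a hypothetical six-word `LEXICON` and a simple rule that a negator flips the polarity of the token immediately after it; published lexicons are far larger and their rules more careful.

```python
import re

# Hypothetical mini-lexicon: word -> polarity score.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}
# Hypothetical negator list.
NEGATORS = {"not", "never", "no"}


def lexicon_sentiment(text):
    """Sum lexicon scores, flipping the sign of the token right after a negator."""
    score = 0
    negate = False
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        score += -value if negate else value
        negate = False  # negation applies to one token only
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sentence in ["I love this great phone", "This is not good", "It arrived on Tuesday"]:
    print(sentence, "->", lexicon_sentiment(sentence))
```

Such rules are transparent and need no training data, but they miss sarcasm, domain-specific vocabulary, and negation that spans more than one word, which is where the learned methods in the list tend to do better.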

**Applications:**  
- Customer feedback analysis  
- Social media monitoring  
- Spam detection  
- Product review classification
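
Spam detection is a common first exercise for the Naive Bayes method named under Techniques. The sketch below is a from-scratch multinomial Naive Bayes with add-one smoothing; the class name `NaiveBayes` and the four training messages are hypothetical, and a real system would train on a large labeled corpus, usually through an established library.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in doc.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set.
train_docs = [
    "win free money now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("claim your free money"))
print(model.predict("team meeting tomorrow"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and add-one smoothing keeps a word seen in only one class from driving the other class's probability to zero.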


In [3]:
# -------------------------------
# Text Pre-processing (Colab-safe)
# -------------------------------

# Use a Hugging Face tokenizer instead of NLTK's tokenizers
from transformers import AutoTokenizer

text = "The cats are running faster than the dogs."

# Load the WordPiece tokenizer that ships with bert-base-uncased (it lowercases input)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize text (words missing from the vocabulary are split into "##"-prefixed pieces)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Stemming and lemmatization with NLTK; the WordNet data is fetched if not already present
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems = [stemmer.stem(token) for token in tokens]
# lemmatize() treats every token as a noun (pos="n") by default, so the verb
# "running" is left unchanged; pass pos="v" to lemmatize verbs.
lemmas_list = [lemmatizer.lemmatize(token) for token in tokens]

print("Stemmed:", stems)
print("Lemmatized:", lemmas_list)


Tokens: ['the', 'cats', 'are', 'running', 'faster', 'than', 'the', 'dogs', '.']
Stemmed: ['the', 'cat', 'are', 'run', 'faster', 'than', 'the', 'dog', '.']
Lemmatized: ['the', 'cat', 'are', 'running', 'faster', 'than', 'the', 'dog', '.']
