# Text Classification

## An Overview
Text classification is the process of assigning predefined categories to text data based on its content. It is a fundamental task in natural language processing (NLP) with applications in spam detection, sentiment analysis, topic labeling, and more.

### Steps in Text Classification

- **Data Collection:** Gather labeled data for training and evaluation.
- **Text Preprocessing:**
    - Tokenization: Splitting text into words or subwords.
    - Lowercasing: Converting all text to lowercase to reduce variability.
    - Stopword Removal: Removing common words such as "and" and "the", which may not carry significant meaning.
    - Stemming/Lemmatization: Reducing words to their root form (e.g., "running" → "run").
    - Handling Special Characters: Removing or encoding punctuation and emojis.
- **Feature Extraction:**
    - Bag of Words (BoW): Represents text as a collection of word frequencies.
    - TF-IDF: Weights each term by how often it appears in a document, discounted by how common it is across all documents.
    - Word Embeddings: Converts text into dense vectors using models like Word2Vec, GloVe, or FastText.
    - Sentence Embeddings: Captures the meaning of entire sentences using models like BERT or Sentence Transformers.
- **Model Training:** Train a classification model using the extracted features.
- **Evaluation:** Use metrics like accuracy, F1-score, and ROC-AUC to assess performance.
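
To make the evaluation step concrete, here is a minimal sketch that computes accuracy, precision, recall, and F1 for one class from two label lists. The function name `binary_metrics` and the toy labels are illustrative assumptions, not part of any library.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (hypothetical): 2 true positives, 1 false positive, 1 false negative.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data, accuracy alone can look good while recall on the rare class is poor, which is why F1 is usually reported alongside it.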

### Types of Models Used in Text Classification

**Traditional Machine Learning Models**

These models are used with features like BoW, TF-IDF, or manually engineered features.

- **Naive Bayes:**
    - Assumes features are conditionally independent given the class.
    - Simple and effective for tasks like spam detection.

- **Logistic Regression:**
    - Predicts probabilities for each category.
    - Works well with large and sparse feature spaces.

- **Support Vector Machines (SVMs):**
    - Finds the hyperplane that best separates classes.
    - Effective for high-dimensional text data.

- **Decision Trees/Random Forests:**
    - Uses hierarchical splits to classify text.
    - Random Forests aggregate decisions from multiple trees for robustness.
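
As an illustration of the first model in this list, the following is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. The helper names (`train_naive_bayes`, `predict`) and the toy training data are assumptions made for this example; in practice a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class; docs are token lists."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(tokens, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical).
docs = [
    ["win", "free", "prize", "now"],
    ["free", "cash", "offer"],
    ["meeting", "at", "noon"],
    ["lunch", "meeting", "tomorrow"],
]
labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(docs, labels)
print(predict(["free", "prize"], *model))        # spam
print(predict(["meeting", "tomorrow"], *model))  # ham
```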
-----

**Deep Learning Models**

These models automatically learn feature representations from raw text or embeddings.

- **Recurrent Neural Networks (RNNs):**
    - Suited to sequential data; they process text one token at a time.
    - Variants include LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
- **Convolutional Neural Networks (CNNs):**
    - Extract local patterns in text (e.g., n-grams).
    - Often combined with embeddings for short text classification.
- **Transformer-Based Models:**
    - Utilize self-attention mechanisms to capture global and contextual relationships.
    - Popular models include:
        - BERT (Bidirectional Encoder Representations from Transformers): Pretrained and fine-tuned for specific tasks.
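        - GPT (Generative Pre-trained Transformer): A decoder-only model built for text generation; it can also classify text through fine-tuning or prompting.
        - RoBERTa, DistilBERT, ALBERT: BERT variants with a more robust training recipe (RoBERTa), a smaller distilled model (DistilBERT), and parameter sharing to reduce model size (ALBERT).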
- **Sequence-to-Sequence Models:**
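    - Encoder-decoder architectures (e.g., T5, BART) that treat classification as text generation: the model outputs the label as text.
    - Can be useful for hierarchical classification or multi-label tasks, where the output is a sequence of labels.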
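
The self-attention mechanism mentioned for transformer-based models can be sketched in a few lines. This is a deliberately simplified illustration: it uses the token vectors directly as queries, keys, and values, whereas real transformers apply learned projection matrices and multiple heads. The function names and the toy 2-dimensional vectors are assumptions for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors.

    Simplification: learned projections for queries, keys, and values
    are omitted here.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of every token vector.
        output = [
            sum(w * vec[j] for w, vec in zip(weights, vectors))
            for j in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-dimensional embeddings for a three-token sentence.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(token_vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` shows how much one token attends to every token in the sentence, which is how the model mixes in context from the whole input rather than only from neighbors.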

---- 

**Hybrid Models**

- **ML + Deep Learning:**
    - Combine traditional ML models with embeddings as features (e.g., train SVM using Word2Vec).
- **Ensemble Methods:**
    - Combine predictions from multiple models for improved performance.
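
One simple way to combine predictions is majority voting. The sketch below assumes three already-trained models whose predictions are given as plain label lists; the variable names and labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label per sample."""
    combined = []
    for sample_preds in zip(*predictions_per_model):
        label, _ = Counter(sample_preds).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions from three models on the same four texts.
nb_preds = ["spam", "ham", "spam", "ham"]
lr_preds = ["spam", "spam", "spam", "ham"]
svm_preds = ["ham", "ham", "spam", "ham"]
print(majority_vote([nb_preds, lr_preds, svm_preds]))  # ['spam', 'ham', 'spam', 'ham']
```

With an odd number of models and two classes there are no ties; in other settings a tie-breaking rule (or averaging predicted probabilities instead of votes) would be needed.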


### Key Challenges in Text Classification
- **Data Scarcity:** Insufficient labeled examples.
- **Class Imbalance:** Uneven distribution of categories.
- **Ambiguity:** Words with multiple meanings.
- **Domain Adaptation:** Models trained on one domain may not generalize well to another.
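
A common way to soften class imbalance is to weight each class inversely to its frequency during training. The sketch below uses the heuristic `n_samples / (n_classes * class_count)`; the function name and the toy label distribution are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy label distribution (hypothetical): 8 "ham" examples and 2 "spam" examples.
labels = ["ham"] * 8 + ["spam"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'ham': 0.625, 'spam': 2.5}
```

Errors on the rare class then count more in the training loss, which pushes the model away from always predicting the majority class.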

## Steps in Detail

### Data Collection
- **Objective:** Gather a dataset that contains text samples and their corresponding labels (categories).
- **Sources of Data:**
    - *Public Datasets:* IMDB reviews (sentiment analysis), 20 Newsgroups (news classification), SMS spam collection (spam detection).
    - *Scraping/Web APIs:* Extract text from news articles, product reviews, or social media.
    - *Internal Data:* Emails, customer feedback, chatbot logs, or support tickets.

- **Example:**

    📌 Source: Customer Support Dataset

    📌 Description: 3 million tweets from customers interacting with company support teams.

    📌 Data Sample:

    Query | Category
    ---|:---:
    "My internet is not working for the last 3 hours. Please help!" | Technical Issue
    "How can I update my payment method?" | Billing

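Once collected, labeled data is usually loaded into parallel lists of texts and labels and split into training and test sets. The sketch below assumes a hypothetical CSV export with `query` and `category` columns; the first two rows come from the sample above and the last two are made up for illustration.

```python
import csv
import io
import random

# Hypothetical CSV export; rows 3 and 4 are invented for this example.
raw_csv = '''query,category
"My internet is not working for the last 3 hours. Please help!",Technical Issue
"How can I update my payment method?",Billing
"I was charged twice this month.",Billing
"The app crashes when I open settings.",Technical Issue
'''

rows = list(csv.DictReader(io.StringIO(raw_csv)))
texts = [row["query"] for row in rows]
labels = [row["category"] for row in rows]

# Shuffle with a fixed seed so the split is reproducible, then hold out 25%.
indices = list(range(len(rows)))
random.Random(42).shuffle(indices)
cut = int(len(indices) * 0.75)
train_idx, test_idx = indices[:cut], indices[cut:]

train = [(texts[i], labels[i]) for i in train_idx]
test = [(texts[i], labels[i]) for i in test_idx]
print(len(train), len(test))  # 3 1
```

For real datasets, a stratified split (keeping the class proportions equal in both sets) is usually preferable to a purely random one.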
### Text Preprocessing

Text preprocessing is a crucial step in Natural Language Processing (NLP) that ensures text data is clean and structured before being used in a model. This step involves transforming raw text into a format that can be effectively processed by machine learning algorithms.

- **Tokenization**
    - Definition:
    Tokenization is the process of breaking a text string into smaller units (tokens), which can be words, subwords, or even characters.

    - Types of Tokenization:

        a) Word Tokenization
        - Splits text into words based on spaces or punctuation.
        - Example:
            - Input: `"Text classification is important!"`
            - Output: `["Text", "classification", "is", "important", "!"]`
        - Methods:
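            - Python Libraries: `nltk.word_tokenize()`, spaCy (`nlp.tokenizer`)
            - Regular Expressions: `re.findall(r"\w+|[^\w\s]", text)` separates punctuation into its own tokens; `re.split(r"\s+", text)` splits on whitespace only and leaves punctuation attached to words.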

        b) Subword Tokenization

        - Breaks words into smaller meaningful units, useful for handling unseen words.
        - Example:
            - `"unhappiness"` → `["un", "happiness"]` (illustrative; the actual split depends on the vocabulary the tokenizer learned)
        - Methods:
            - Byte Pair Encoding (BPE)
            - Unigram Language Model
            - WordPiece (used in BERT)

        c) Character Tokenization

        - Splits text into individual characters.
        - Example: `"hello"` → `["h", "e", "l", "l", "o"]`
        - Used in: Speech recognition, OCR tasks
- **Lowercasing**
    - Definition:
    Converting all text to lowercase to ensure uniformity and avoid case sensitivity issues.

    - Methods:
        - Python: `text.lower()`
        - Python (more aggressive, intended for caseless matching across languages): `text.casefold()`
    - Why It’s Important:
        - Reduces vocabulary size ("Apple" and "apple" are treated as the same word).
        - Cased models (e.g., `bert-base-cased`) use case information, so their input should not be lowercased; uncased variants lowercase the input in their own tokenizer.

- **Stopword Removal**
    - Definition:
    Stopwords are common words like "the", "is", "and", "to", which appear frequently but do not contribute much meaning.

    - Methods:

        - NLTK:
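        ```python
        import nltk
        from nltk.corpus import stopwords
        nltk.download("stopwords", quiet=True)  # one-time download of the word lists
        tokens = ["this", "is", "a", "simple", "example"]  # sample input
        stop_words = set(stopwords.words("english"))
        words = [word for word in tokens if word not in stop_words]
        ```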
        - spaCy:
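        ```python
        import spacy
        nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
        text = "This is a simple example"  # sample input
        tokens = [token.text for token in nlp(text) if not token.is_stop]
        ```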
    - Considerations:
        - Removing stopwords can sometimes remove useful words in sentiment analysis (e.g., “not” in "I do not like it").
        - Some tasks may benefit from keeping stopwords.

- **Stemming**
    - Definition: Reducing a word to a root form by stripping affixes (mostly suffixes) with heuristic rules.
    - Methods:
        - Porter Stemmer (NLTK):
        ```python
        from nltk.stem import PorterStemmer
        ps = PorterStemmer()
        print(ps.stem("running"))  # Output: run
        ```
        - Snowball Stemmer: More aggressive than Porter.
        - Lancaster Stemmer: Even more aggressive.
        
    - Example:

        Word | Stemmed Form (Porter)
        ---|:---:
        running | run
        studies | studi
        happily | happili

    - Pros: Computationally efficient.
    - Cons: May not produce real words.

- **Lemmatization**
    - Definition: Reduces words to their base dictionary form using linguistic knowledge.
    - Methods:
        - NLTK WordNet Lemmatizer:
        ```python
        import nltk
        from nltk.stem import WordNetLemmatizer
        nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data
        lemmatizer = WordNetLemmatizer()
        print(lemmatizer.lemmatize("running", pos="v"))  # Output: run
        ```
        - spaCy Lemmatizer:
        ```python
        import spacy
        nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
        doc = nlp("running jumped studies")
        print([token.lemma_ for token in doc])  # Typically: ['run', 'jump', 'study'] (depends on the tagger)
        ```
    - Example:

        Word | Lemmatized Form
        ---|:---:
        running | run
        studies | study
        mice | mouse

    - Pros: Produces real words.
    - Cons: Slower than stemming.
- **Handling Special Characters & Punctuation**
    - Definition:
    Removing or replacing symbols, emojis, and punctuation to standardize text.

    - Methods:
        - Removing Punctuation:
        ```python
        import string
        text = "Hello!!! How are you?"
        text = text.translate(str.maketrans("", "", string.punctuation))
        print(text)  # Output: Hello How are you
        ```
        - Replacing Emojis (Emoji Dictionary Approach):
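        ```python
        import emoji
        # The printed name depends on the emoji package version and its `language` argument;
        # recent releases typically print ":smiling_face_with_heart-eyes:" by default.
        print(emoji.demojize("I love Python! 😍"))
        ```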
        - Removing HTML Tags:
        ```python
        from bs4 import BeautifulSoup
        clean_text = BeautifulSoup("<p>Hello</p>", "html.parser").get_text()
        print(clean_text)  # Output: Hello
        ```
- **Text Normalization**
    - Definition:
    Converting different forms of words into a single standard form.

    - Methods:
        - Expanding contractions: "you're" → "you are"
        ```python
        from contractions import fix
        print(fix("You're going to school."))  # Output: You are going to school.
        ```
        - Converting numbers to words: "100" → "one hundred"
        ```python
        import inflect
        p = inflect.engine()
        print(p.number_to_words(100))  # Output: one hundred
        ```
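
The preprocessing steps above are usually chained into a single function. Here is a minimal standard-library sketch that lowercases, tokenizes with a regular expression, and removes stopwords. The tiny `STOP_WORDS` set and the `preprocess` name are assumptions for illustration; real stopword lists from NLTK or spaCy are much longer.

```python
import re

# Tiny illustrative stopword list; "not" is deliberately kept out of it.
STOP_WORDS = {"the", "is", "a", "an", "and", "to", "for", "my"}


def preprocess(text, remove_stopwords=True):
    """Lowercase, tokenize into words, and optionally drop stopwords."""
    text = text.lower()
    # \w+ keeps runs of letters and digits; punctuation and emojis are discarded.
    tokens = re.findall(r"\w+", text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


print(preprocess("My internet is not working for the last 3 hours. Please help!"))
# ['internet', 'not', 'working', 'last', '3', 'hours', 'please', 'help']
```

Keeping "not" in the output matters for tasks such as sentiment analysis, where negation changes the label.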
### Feature Extraction
- Objective: Convert raw text into numerical representations that models can understand.

    a) Bag of Words (BoW)
    - Definition: Represents text as a count of words in a document.
    - Example:
        ```text
        Text 1: "I love machine learning"
        Text 2: "Machine learning is great"
        ```
        BoW Representation (after lowercasing, one count vector per document):
        ```text
        Vocabulary: ["i", "love", "machine", "learning", "is", "great"]
        Text 1:     [1, 1, 1, 1, 0, 0]
        Text 2:     [0, 0, 1, 1, 1, 1]
        ```
    b) TF-IDF (Term Frequency-Inverse Document Frequency)
    - Definition: Assigns importance to words by considering how often they appear in a document vs. across all documents.
    - Formula:
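        $$
        \text{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of words in } d}
        $$
        $$
        \text{IDF}(t) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing } t} + 1\right)
        $$
        $$
        \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
        $$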

    c) Word Embeddings
    - Definition: Transforms words into dense vector representations capturing their meaning.
    - Techniques:
        - Word2Vec (CBOW & Skip-gram)
        - GloVe
        - FastText
        
    d) Sentence Embeddings
    - Definition: Captures meaning at the sentence level instead of word level.
    - Example Models:
        - BERT (Bidirectional Encoder Representations from Transformers)
        - Sentence Transformers
    - Example Use Case:
        "The bank is by the river" vs. "I deposited money in the bank" (understanding context better).

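The Bag of Words and TF-IDF representations described above can be computed by hand on the two example sentences. This is a minimal sketch that follows the TF and IDF formulas given in this section (libraries typically use slightly different smoothing); the `tf_idf` helper name is an assumption, and the vocabulary columns are in alphabetical order.

```python
import math
from collections import Counter

docs = ["I love machine learning", "Machine learning is great"]
tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Bag of Words: one count vector per document, one column per vocabulary word.
bow = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]


def tf_idf(tokens, all_docs):
    """TF-IDF with TF = count / len(doc) and IDF = log(N / df + 1)."""
    counts = Counter(tokens)
    n_docs = len(all_docs)
    scores = {}
    for word, count in counts.items():
        df = sum(1 for d in all_docs if word in d)  # documents containing the word
        scores[word] = (count / len(tokens)) * math.log(n_docs / df + 1)
    return scores


scores = tf_idf(tokenized[0], tokenized)
print(vocab)  # ['great', 'i', 'is', 'learning', 'love', 'machine']
print(bow)    # [[0, 1, 0, 1, 1, 1], [1, 0, 1, 1, 0, 1]]
# Words unique to Text 1 ("i", "love") score higher than shared words.
print(scores)
```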
---

### Statistical Language Models (SLMs)

- Definition:
    A language model assigns probabilities to sequences of words. It helps predict the next word in a sequence based on previous words.

- Common Statistical Language Models:
    - Unigram Model

        - Assumes each word is independent of previous words.

        - Probability of a sentence:
            $$
            P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i)
            $$
            Example: "I love NLP"
            $$
            P(\text{I}) \times P(\text{love}) \times P(\text{NLP})
            $$

        - Limitation: Ignores word dependencies.
    - Bigram Model
        - Considers dependencies between adjacent words.
        - Probability:
            $$
            P(w_1, w_2, \dots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
            $$
            Example: "I love NLP"
            $$
            P(\text{I}) \times P(\text{love} \mid \text{I}) \times P(\text{NLP} \mid \text{love})
            $$

        - Captures local word relationships.
    - Higher-order N-gram Models (Trigram, 4-gram, etc.)
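        - Conditions each word on the previous N-1 words (for the first few words, the context is cut off at the start of the sentence):
            $$
            P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \dots, w_{i-1})
            $$
        - Captures more context but requires much more training data, because most long n-grams are rare.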
- Smoothing Techniques:

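    To handle words and n-grams that never appear in the training data:

    - Laplace (Add-One) Smoothing: Adds one to every n-gram count before normalizing, so unseen n-grams receive a small non-zero probability.
    - Backoff & Interpolation: Backoff falls back to a lower-order model when the higher-order n-gram has no counts; interpolation mixes the estimates from all orders using weights.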
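
To tie the bigram model and smoothing together, here is a minimal sketch of a bigram language model with add-k smoothing (Laplace smoothing when `k = 1`). The toy corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions made for this example.

```python
from collections import Counter

# Toy corpus (hypothetical), already tokenized and lowercased.
corpus = [["i", "love", "nlp"], ["i", "love", "python"], ["nlp", "is", "fun"]]

context_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence + ["</s>"]  # sentence boundary markers
    context_counts.update(tokens[:-1])      # every token that has a successor
    bigram_counts.update(zip(tokens, tokens[1:]))

# Tokens that can be predicted: all words plus the end marker.
vocab = {word for sentence in corpus for word in sentence} | {"</s>"}


def bigram_prob(prev, word, k=1.0):
    """P(word | prev) with add-k smoothing (Laplace smoothing when k = 1)."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocab))


def sentence_prob(sentence):
    """Multiply the smoothed bigram probabilities along the sentence."""
    tokens = ["<s>"] + sentence + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(bigram_prob("i", "love"))  # seen bigram
print(bigram_prob("i", "fun"))   # unseen bigram, still non-zero
print(sentence_prob(["i", "love", "nlp"]))
```

Without smoothing, a single unseen bigram would make the probability of the whole sentence zero; adding `k` to every count keeps all sentences comparable.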