<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Data Pre-processing & RNNs for Language Modeling</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Text Classification, Transfer Learning, and Model Assessment with Keras)</span></div>

# Table of Contents

1. [Text Classification Fundamentals](#section-1)
2. [Transitioning from Binary to Multi-class Classification](#section-2)
3. [Transfer Learning for Language Models](#section-3)
4. [Multi-class Classification Models with Keras](#section-4)
5. [Assessing Model Performance](#section-5)
6. [Conclusion](#section-6)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Text Classification Fundamentals</span><br>

Text classification is a fundamental task in Natural Language Processing (NLP). It involves assigning predefined categories to text documents.

### Applications of Text Classification
Text classification is ubiquitous in industry and research. Common applications include:
*   **Automatic news classification**: Categorizing articles into topics like Sports, Finance, or Politics.
*   **Document classification for businesses**: Sorting invoices, contracts, or internal memos.
*   **Queue segmentation for customer support**: Routing support tickets to the correct department (e.g., Technical Support vs. Billing) based on the text content.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Transitioning from Binary to Multi-class Classification</span><br>

When moving from binary classification (2 classes, e.g., Positive/Negative) to multi-class classification (3+ classes), the target encoding, the output layer, and the loss function all need to change.

### Key Changes
| Component | Binary Classification | Multi-class Classification |
| :--- | :--- | :--- |
| **Output Variable ($y$)** | Scalar (0 or 1) | One-hot encoded vector |
| **Output Units** | 1 Unit | `num_classes` Units |
| **Activation Function** | Sigmoid | Softmax |
| **Loss Function** | Binary Cross-entropy | Categorical Cross-entropy |

### 1. Shape of the Output Variable ($y$)
In multi-class settings, classes are often represented using **One-hot encoding**.
*   If `num_classes = 3`, a label is a vector of length 3.
*   Example: class 1 (counting from 0) is encoded as `[0, 1, 0]`.
*   The shape of $y$ becomes $(N, \text{num\_classes})$.

### 2. Output Layer Architecture
The number of units in the final dense layer must match the number of classes.

**Original Code (Conceptual Keras):**


In [None]:
# Example: num_classes = 3
# y[0] = [0, 1, 0]
# y.shape = (N, num_classes)

# Output layer
# model.add(Dense(num_classes))



### 3. Activation Function: Softmax
Instead of `sigmoid`, we use `softmax`. Softmax turns the layer's raw scores into a probability distribution over all classes: every output is non-negative and the outputs sum to 1.

**Original Code:**


In [None]:
# Output layer
# model.add(Dense(num_classes, activation="softmax"))
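
To make the softmax step concrete, here is a minimal pure-Python sketch of the function itself. The `logits` values are hypothetical raw scores for a 3-class problem, and the max-subtraction is a common numerical-stability trick rather than something the notebook prescribes.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp().
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores produced by the last Dense layer for one sample
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(probs.index(max(probs)))       # index of the predicted class: 0
```

The largest score gets the largest probability, and the three values can be read directly as class probabilities.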



### 4. Loss Function: Categorical Cross-entropy
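We switch the loss function to `categorical_crossentropy`, which compares the softmax output with the one-hot target. (For integer class labels, Keras also offers `sparse_categorical_crossentropy`.)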

**Original Code:**


In [None]:
# Compile the model
# model.compile(loss='categorical_crossentropy')
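
As a rough illustration of what this loss measures, the sketch below computes categorical cross-entropy by hand for two samples. The prediction values and the `eps` guard against `log(0)` are assumptions chosen for the example, not values from the notebook.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    losses = []
    for target, probs in zip(y_true, y_pred):
        # Only the true class contributes, because all other target entries are 0.
        loss = -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
        losses.append(loss)
    return sum(losses) / len(losses)

# Two samples with true classes 1 and 0 (one-hot encoded)
y_true = [[0, 1, 0], [1, 0, 0]]

# Hypothetical model outputs: one confident and correct, one nearly uniform
confident = [[0.05, 0.90, 0.05], [0.80, 0.10, 0.10]]
uncertain = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]

loss_confident = categorical_crossentropy(y_true, confident)
loss_uncertain = categorical_crossentropy(y_true, uncertain)
print(round(loss_confident, 3), round(loss_uncertain, 3))
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability shrinks.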



### Practical Example: Preparing Categories
We can use Pandas to convert string labels into numerical codes, and Keras to convert those codes into one-hot vectors.

**Original Code (Pandas Series):**


In [None]:
import pandas as pd

y = ["sports", "economy", "data_science", "sports", "finance"]
# Transform to pandas series object
y_series = pd.Series(y, dtype="category")
# Print the category codes
print(y_series.cat.codes)



**Enhanced Code (Runnable Pre-processing):**


In [None]:
import pandas as pd
import numpy as np
from tensorflow.keras.utils import to_categorical

# 1. Convert text labels to numeric codes
y_labels = ["sports", "economy", "data_science", "sports", "finance"]
y_series = pd.Series(y_labels, dtype="category")

print("Category Codes:")
print(y_series.cat.codes)

# 2. Convert numeric codes to One-Hot Encoding
# Categories are sorted alphabetically, so the codes here are [3, 1, 0, 3, 2]
y_numeric = y_series.cat.codes.to_numpy()

# Change to categorical
y_prep = to_categorical(y_numeric)

print("\nOne-Hot Encoded Output:")
print(y_prep)



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> In a multi-class geometric space (e.g., $\mathbb{R}^3$), one-hot encoding ensures that all classes are equidistant from each other (Distance = $\sqrt{2}$), preventing the model from assuming an ordinal relationship (e.g., that class 2 is "greater" than class 1). </div>
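
The equidistance claim in the tip can be checked numerically. This small sketch compares pairwise Euclidean distances for one-hot vectors and for plain integer codes; the class names are placeholders.

```python
import math
from itertools import combinations

# Placeholder class names with two alternative encodings
one_hot = {"class_0": [1, 0, 0], "class_1": [0, 1, 0], "class_2": [0, 0, 1]}
integer_codes = {"class_0": [0], "class_1": [1], "class_2": [2]}

def pairwise_distances(encoding):
    """Euclidean distance between every pair of encoded classes."""
    return {
        (a, b): math.dist(encoding[a], encoding[b])
        for a, b in combinations(encoding, 2)
    }

one_hot_d = pairwise_distances(one_hot)
int_d = pairwise_distances(integer_codes)

print(one_hot_d)  # every pair is sqrt(2) apart
print(int_d)      # class_0 and class_2 are twice as far apart as neighbours
```

With integer codes, class 0 looks "closer" to class 1 than to class 2, which is exactly the artificial ordering that one-hot encoding avoids.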

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Transfer Learning for Language Models</span><br>

Transfer learning involves taking a model trained on a massive dataset and applying its knowledge to a new, related task.

### The Idea Behind Transfer Learning
*   **Better Initialization**: Start with weights that are better than random.
*   **Big Datasets**: Leverage models trained on Wikipedia, Google News, etc.
*   **Open Source**: Utilize pre-trained models available in the data science community.

### Available Architectures

| Architecture | Description | Input/Output Example |
| :--- | :--- | :--- |
| **Word2Vec** | Maps words to vectors. Two types: **CBOW** (Predict target from context) and **Skip-gram** (Predict context from target). | *Context*: "I really loved this movie" <br> **CBOW**: $X=$[I, really, this, movie], $y=$ loved |
| **FastText** | Uses words and **n-grams of characters**. Better for rare words or typos. | $X=$ [I, rea, eal, all, lly, really...], $y=$ loved |
| **ELMo** | **Embeddings from Language Models**. Uses deep bidirectional language models (biLM). Embeddings change based on context. | $X=$ [I, really, loved, this], $y=$ movie |
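
The CBOW and Skip-gram rows in the table differ only in which side is the input. The sketch below builds both kinds of training pairs for the example sentence; the window size of 2 and the helper names `cbow_pairs` and `skipgram_pairs` are assumptions made for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs, the CBOW view of a sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs, the Skip-gram view."""
    return [
        (target, ctx)
        for context, target in cbow_pairs(tokens, window)
        for ctx in context
    ]

tokens = "I really loved this movie".split()

cbow = cbow_pairs(tokens)
skipgram = skipgram_pairs(tokens)

print(cbow[2])  # (['I', 'really', 'this', 'movie'], 'loved')
print([pair for pair in skipgram if pair[0] == "loved"])
```

CBOW produces one example per word (many context words in, one target out), while Skip-gram expands the same window into several single-word predictions.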

### Implementation: Word2Vec with Gensim
Word2Vec and FastText are available in the `gensim` package.

**Original Code (Word2Vec):**


In [None]:
from gensim.models import word2vec

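# Note: In the PDF, variables like tokenized_corpus, embedding_dim, and
# neighbor_words_num are assumed to exist.
# This call uses the Gensim 3.x argument names; Gensim 4.x renamed `size` to
# `vector_size` and `iter` to `epochs`.
# w2v_model = word2vec.Word2Vec(tokenized_corpus, size=embedding_dim,
#                               window=neighbor_words_num, iter=100)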

# Get top 3 similar words to "captain"
# w2v_model.wv.most_similar(["captain"], topn=3)



**Enhanced Code (Runnable with Gensim 4.x):**


In [None]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# 1. Create a dummy corpus
corpus = [
    "I really loved this movie",
    "The captain wore sweatpants",
    "Kirk is the captain of the ship",
    "Larry is a friend of the captain"
]
tokenized_corpus = [simple_preprocess(sentence) for sentence in corpus]

# 2. Train Word2Vec Model
# Note: 'size' is renamed to 'vector_size' and 'iter' to 'epochs' in newer Gensim
w2v_model = Word2Vec(sentences=tokenized_corpus, 
                     vector_size=100, 
                     window=5, 
                     min_count=1, 
                     epochs=100)

# 3. Get similar words (if 'captain' exists in vocab)
if 'captain' in w2v_model.wv:
    print("Similar to 'captain':")
    print(w2v_model.wv.most_similar(["captain"], topn=3))
else:
    print("'captain' not in vocabulary due to small corpus size.")
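
One common way to carry pre-trained vectors into a downstream classifier is to build an embedding matrix whose row `i` holds the vector of the word with index `i`. The sketch below shows only that bookkeeping step; `pretrained` and `word_index` are hypothetical toy values standing in for a trained model's vectors and a tokenizer's vocabulary.

```python
# Hypothetical 3-dimensional vectors standing in for a pre-trained model's output
pretrained = {
    "captain": [0.2, -0.1, 0.7],
    "ship": [0.3, 0.0, 0.6],
}

# Hypothetical word -> integer index mapping (index 0 is reserved for padding)
word_index = {"captain": 1, "ship": 2, "sweatpants": 3}
embedding_dim = 3

# Start from zeros so that padding and unknown words get a neutral vector
embedding_matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]

for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:  # words missing from the pre-trained model stay at zero
        embedding_matrix[idx] = list(vector)

for row in embedding_matrix:
    print(row)
```

A matrix like this is typically used as the initial weights of the embedding layer, which is the "better initialization" idea described at the start of this section.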



### Implementation: FastText with Gensim

**Original Code (FastText):**


In [None]:
from gensim.models import fasttext

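# Instantiate the model (Gensim 3.x argument names; in Gensim 4.x `size` becomes
# `vector_size`, and `build_vocab`/`train` take `corpus_iterable` instead of `sentences`)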
# ft_model = fasttext.FastText(size=embedding_dim, window=neighbor_words_num)

# Build vocabulary
# ft_model.build_vocab(sentences=tokenized_corpus)

# Train the model
# ft_model.train(sentences=tokenized_corpus,
#                total_examples=len(tokenized_corpus),
#                epochs=100)
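
To illustrate the character n-grams that FastText works with, here is a small pure-Python sketch. The helper `char_ngrams` is hypothetical and uses a single n-gram length of 3 to match the table in this section; the FastText method itself wraps each word in `<` and `>` boundary markers and uses a range of n-gram lengths.

```python
def char_ngrams(word, n=3, boundaries=False):
    """Return the character n-grams of a word, in order of appearance."""
    if boundaries:
        word = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("really"))                   # ['rea', 'eal', 'all', 'lly']
print(char_ngrams("really", boundaries=True))  # ['<re', 'rea', 'eal', 'all', 'lly', 'ly>']

# A misspelled word still shares n-grams with the correct spelling
shared = set(char_ngrams("realy")) & set(char_ngrams("really"))
print(sorted(shared))                          # ['eal', 'rea']
```

Because a typo or rare word still shares most of its n-grams with known words, its vector can be composed from subword vectors the model has already learned.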



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Multi-class Classification Models with Keras</span><br>

We will now build a complete pipeline for multi-class classification using the **20 Newsgroups** dataset.

### Model Architecture Comparison

**1. Binary Classification (Review)**


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(10000, 128))
model.add(LSTM(128, dropout=0.2))
model.add(Dense(1, activation='sigmoid')) # Single unit, sigmoid
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])



**2. Multi-class Classification**


In [None]:
num_classes = 20 # The 20 Newsgroups dataset has 20 classes

model = Sequential()
model.add(Embedding(10000, 128))
model.add(LSTM(128, dropout=0.2))
# Output layer has `num_classes` units and uses `softmax`
model.add(Dense(num_classes, activation="softmax"))
# Compile with categorical crossentropy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])



### The Dataset: 20 Newsgroups
We use scikit-learn's `fetch_20newsgroups` to download the data.

**Attributes:**
*   `news_train.DESCR`: Documentation.
*   `news_train.data`: The raw text data.
*   `news_train.filenames`: Paths on disk.
*   `news_train.target`: Numerical index of classes.
*   `news_train.target_names`: Human-readable names.

**Original Code (Loading Data):**


In [None]:
from sklearn.datasets import fetch_20newsgroups

# Download train and test sets
news_train = fetch_20newsgroups(subset='train')
news_test = fetch_20newsgroups(subset='test')



### Pre-processing Pipeline
We must tokenize the text, pad the sequences to a fixed length, and one-hot encode the targets.

**Original Code (Preprocessing):**


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Create and fit the tokenizer
# num_words=10000 keeps token indices below the Embedding layer's input_dim (10000)
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(news_train.data)

# Create the (X, Y) variables
X_train = tokenizer.texts_to_sequences(news_train.data)
X_train = pad_sequences(X_train, maxlen=400)
Y_train = to_categorical(news_train.target)
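
To show what padding to a fixed length does to individual documents, the sketch below mimics the default behaviour of `pad_sequences` (zeros added at the front, long sequences cut from the front) for a single sequence. The helper `pad_or_truncate` and the toy token ids are illustrative only; `maxlen=5` is used instead of 400 to keep the output readable.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Mimic 'pre' padding and 'pre' truncation for one token-id sequence."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(sequence)) + sequence  # left-pad with zeros

short_doc = [5, 8, 2]
long_doc = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short_doc, maxlen=5))  # [0, 0, 5, 8, 2]
print(pad_or_truncate(long_doc, maxlen=5))   # [3, 4, 5, 6, 7]
```

After this step every document has the same length, so the whole dataset fits into one rectangular array of shape `(N, maxlen)`.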



### Training the Model
**Original Code (Training):**


In [None]:
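# Train the model
model.fit(X_train, Y_train, batch_size=64, epochs=100)

# Preprocess the test set with the tokenizer fitted on the training data
X_test = tokenizer.texts_to_sequences(news_test.data)
X_test = pad_sequences(X_test, maxlen=400)
Y_test = to_categorical(news_test.target)

# Evaluate on test data
model.evaluate(X_test, Y_test)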



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Assessing Model Performance</span><br>

Accuracy alone is often insufficient, especially with imbalanced datasets.

### Why Accuracy is Not Enough
If you have a 20-class task and achieve 80% accuracy:
*   Can it classify *all* classes correctly?
*   Is the accuracy the same for each class?
*   Is the model simply favoring the majority class?

### The Confusion Matrix
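A confusion matrix tabulates the true classes (rows) against the predicted classes (columns). Diagonal entries count correct predictions; every off-diagonal entry is a specific kind of error.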

| True Class \ Predicted | sci.space | alt.atheism | soc.religion.christian |
| :--- | :---: | :---: | :---: |
| **sci.space** | **76** | 2 | 0 |
| **alt.atheism** | 7 | **1** | 2 |
| **soc.religion.christian** | 9 | 0 | **3** |
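
Reading this matrix by hand shows why accuracy alone hides the problem. The sketch below uses only the numbers from the table; the "majority baseline" (always predicting the most frequent class) is added here as a point of comparison.

```python
labels = ["sci.space", "alt.atheism", "soc.religion.christian"]

# Rows are true classes, columns are predicted classes (values from the table above)
cm = [
    [76, 2, 0],
    [7, 1, 2],
    [9, 0, 3],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Baseline: always predict the most frequent true class
majority_baseline = max(sum(row) for row in cm) / total

# Fraction of each true class that the model gets right
per_class = {name: cm[i][i] / sum(cm[i]) for i, name in enumerate(labels)}

print(f"accuracy: {accuracy:.2f}")                    # accuracy: 0.80
print(f"majority baseline: {majority_baseline:.2f}")  # majority baseline: 0.78
for name, value in per_class.items():
    print(f"{name}: {value:.2f}")
```

The model reaches 80% accuracy, yet always answering `sci.space` would already give 78%, and only 1 of the 10 `alt.atheism` documents is classified correctly.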

### Performance Metrics Formulas

**1. Precision**
$$ \text{Precision}_{\text{class}} = \frac{\text{Correct}_{\text{class}}}{\text{Predicted}_{\text{class}}} $$

*   Example (sci.space): the denominator is the column total, i.e., every sample predicted as sci.space: $\frac{76}{76 + 7 + 9} = \frac{76}{92} \approx 0.83$

**2. Recall**
$$ \text{Recall}_{\text{class}} = \frac{\text{Correct}_{\text{class}}}{N_{\text{class}}} $$

*   Example (sci.space): the denominator is the row total, i.e., every sample that truly belongs to sci.space: $\frac{76}{76 + 2 + 0} = \frac{76}{78} \approx 0.97$

**3. F1-Score**
$$ \text{F1}_{\text{class}} = 2 \times \frac{\text{Precision}_{\text{class}} \times \text{Recall}_{\text{class}}}{\text{Precision}_{\text{class}} + \text{Recall}_{\text{class}}} $$

*   Example (sci.space): $2 \times \frac{0.83 \times 0.97}{0.83 + 0.97} \approx 0.89$
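
The three formulas can be applied to every class directly from the confusion matrix: precision divides by the column total, recall by the row total. This is a minimal pure-Python sketch; the helper name `per_class_metrics` is hypothetical, and classes that are never predicted get a score of 0 by convention here.

```python
labels = ["sci.space", "alt.atheism", "soc.religion.christian"]

# Rows are true classes, columns are predicted classes
cm = [
    [76, 2, 0],
    [7, 1, 2],
    [9, 0, 3],
]

def per_class_metrics(cm):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(cm)
    results = []
    for k in range(n):
        correct = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column total
        actual = sum(cm[k])                          # row total
        precision = correct / predicted if predicted else 0.0
        recall = correct / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results

metrics = per_class_metrics(cm)
for name, (p, r, f1) in zip(labels, metrics):
    print(f"{name:<24} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

For `sci.space` this reproduces the worked examples (0.83, 0.97, 0.89), while the two smaller classes score far lower on all three metrics.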

### Implementation with Sklearn
We can calculate these metrics using `sklearn.metrics`.

**Original Code (Confusion Matrix):**


In [None]:
from sklearn.metrics import confusion_matrix

# Build the confusion matrix
# confusion_matrix(y_true, y_pred)
# Output:
# array([[76, 2, 0],
#        [ 7, 1, 2],
#        [ 9, 0, 3]], dtype=int64)



**Enhanced Code (Full Metrics Calculation):**
This code rebuilds label lists that reproduce the confusion matrix above, so the metrics match the numbers in the PDF presentation.



In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score, classification_report

# Recreating the data from the PDF Confusion Matrix
# Matrix structure:
# [[76, 2, 0],  <- True: sci.space
#  [ 7, 1, 2],  <- True: alt.atheism
#  [ 9, 0, 3]]  <- True: soc.religion.christian

# We construct y_true and y_pred to match this matrix
y_true = []
y_pred = []

# Class 0 (sci.space): 76 correct, 2 pred as 1, 0 pred as 2
y_true.extend([0]*78)
y_pred.extend([0]*76 + [1]*2 + [2]*0)

# Class 1 (alt.atheism): 7 pred as 0, 1 correct, 2 pred as 2
y_true.extend([1]*10)
y_pred.extend([0]*7 + [1]*1 + [2]*2)

# Class 2 (soc.religion.christian): 9 pred as 0, 0 pred as 1, 3 correct
y_true.extend([2]*12)
y_pred.extend([0]*9 + [1]*0 + [2]*3)

# 1. Accuracy
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")

# 2. Precision, Recall, F1 (average=None gives per-class scores)
print("\nPrecision:", precision_score(y_true, y_pred, average=None))
print("Recall:   ", recall_score(y_true, y_pred, average=None))
print("F1 Score: ", f1_score(y_true, y_pred, average=None))

# 3. Classification Report
lab_names = ['sci.space', 'alt.atheism', 'soc.religion.christian']
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=lab_names))



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> The <code>classification_report</code> function is a powerful tool that calculates precision, recall, f1-score, and support (number of samples) for every class in a single function call. </div>

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Conclusion</span><br>

In this notebook, we walked through multi-class text classification with Recurrent Neural Networks (RNNs) in Keras: preparing the targets, reusing pre-trained embeddings, and assessing performance per class.

**Key Takeaways:**
1.  **Multi-class Classification**: Requires an output layer with `num_classes` units and a `softmax` activation, the `categorical_crossentropy` loss, and one-hot encoded targets.
2.  **Transfer Learning**: Pre-trained embeddings such as Word2Vec, FastText, or ELMo give the model a better starting point than random initialization because they already encode semantic relationships learned from large corpora.
3.  **Performance Assessment**: Accuracy alone can be misleading, especially with imbalanced classes. A **Confusion Matrix** together with **Precision**, **Recall**, and **F1-Score** shows how well the model performs on each category.

**Next Steps:**
*   Experiment with different pre-trained embeddings (e.g., GloVe or BERT).
*   Apply these techniques to your own text datasets.
*   Tune hyperparameters (dropout rates, LSTM units) to improve the F1-scores of underperforming classes.
