# Chapter 9 — Natural Language Processing with TensorFlow: Sentiment Analysis

This chapter explores Natural Language Processing (NLP) with TensorFlow, using word embeddings and LSTM networks to classify the sentiment of text.

## 9.1 Text Exploration and Processing

**NLP Pipeline Steps**:
- Text cleaning and normalization
- Tokenization and vocabulary building
- Sequence length analysis
- Text vectorization
- Train/validation/test splitting

**Key Concepts**:
- Vocabulary size and coverage
- Sequence padding and truncation
- Out-of-vocabulary (OOV) handling
- Text preprocessing techniques
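
Vocabulary coverage and sequence length analysis can be made concrete with a few lines of plain Python. The sketch below is illustrative only: the toy corpus and the helper names `coverage` and `length_at_percentile` are assumptions, not part of the chapter's code. It measures what fraction of token occurrences the `k` most frequent words cover, and which sequence length is long enough for a given fraction of the texts.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); assumed to be already cleaned and lower-cased.
corpus = [
    "great movie great acting",
    "terrible movie",
    "great plot but terrible acting overall",
]
tokenized = [text.split() for text in corpus]

# Count every token occurrence across the corpus.
counts = Counter(token for tokens in tokenized for token in tokens)
total_tokens = sum(counts.values())

def coverage(k):
    """Fraction of all token occurrences covered by the k most frequent words."""
    return sum(count for _, count in counts.most_common(k)) / total_tokens

def length_at_percentile(sorted_lengths, fraction):
    """Smallest length that is >= the given fraction of the sequences."""
    index = max(0, math.ceil(fraction * len(sorted_lengths)) - 1)
    return sorted_lengths[index]

lengths = sorted(len(tokens) for tokens in tokenized)

print("Coverage with top 4 words:", coverage(4))
print("Length covering 95% of texts:", length_at_percentile(lengths, 0.95))
```

A coverage curve like this is one way to choose `max_vocab_size`, and the percentile length is a reasonable starting point for the padding/truncation length.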

In [1]:
# Text Preprocessing and Analysis
import tensorflow as tf
import numpy as np
import re
from collections import Counter

class TextPreprocessor:
    """Comprehensive text preprocessing for NLP tasks"""
    
    def __init__(self):
        self.vocab = {}
        self.vocab_size = 0
        self.max_sequence_length = 0
    
    def clean_text(self, text):
        """Clean and normalize text"""
        text = text.lower()
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
        return text
    
    def tokenize_text(self, text):
        """Tokenize text into words"""
        return text.split()
    
    def build_vocabulary(self, texts, max_vocab_size=10000):
        """Build vocabulary from text corpus"""
        word_counts = Counter()
        
        for text in texts:
            cleaned_text = self.clean_text(text)
            tokens = self.tokenize_text(cleaned_text)
            word_counts.update(tokens)
        
        most_common = word_counts.most_common(max_vocab_size - 2)
        
        self.vocab = {
            '<PAD>': 0,
            '<OOV>': 1
        }
        
        for word, _ in most_common:
            self.vocab[word] = len(self.vocab)
        
        self.vocab_size = len(self.vocab)
        return self.vocab

# Test the preprocessor
preprocessor = TextPreprocessor()
sample_text = "This movie was absolutely fantastic! Great acting and plot."
cleaned_text = preprocessor.clean_text(sample_text)
tokens = preprocessor.tokenize_text(cleaned_text)

print("Text preprocessing pipeline created")
print("Sample text:", sample_text)
print("Cleaned text:", cleaned_text)
print("Tokens:", tokens)

Text preprocessing pipeline created
Sample text: This movie was absolutely fantastic! Great acting and plot.
Cleaned text: this movie was absolutely fantastic great acting and plot
Tokens: ['this', 'movie', 'was', 'absolutely', 'fantastic', 'great', 'acting', 'and', 'plot']
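
Once a vocabulary exists, tokens still have to be mapped to integer ids and brought to a fixed length. The following is a minimal sketch of OOV handling and post-padding/truncation, assuming the same `<PAD>` = 0 and `<OOV>` = 1 convention used by `build_vocabulary`; the small `vocab` dictionary and the helper names `encode` and `pad_or_truncate` are made up for illustration.

```python
PAD_ID, OOV_ID = 0, 1

def encode(tokens, vocab):
    """Map tokens to ids; unknown words fall back to the OOV id."""
    return [vocab.get(token, OOV_ID) for token in tokens]

def pad_or_truncate(ids, max_len):
    """Cut sequences longer than max_len and right-pad shorter ones with PAD_ID."""
    return ids[:max_len] + [PAD_ID] * max(0, max_len - len(ids))

# Hypothetical vocabulary; in practice this comes from build_vocabulary().
vocab = {"<PAD>": 0, "<OOV>": 1, "this": 2, "movie": 3, "was": 4, "great": 5}

ids = encode("this movie was surprisingly great".split(), vocab)
padded = pad_or_truncate(ids, 8)
truncated = pad_or_truncate(ids, 3)

print("Ids:      ", ids)        # "surprisingly" is not in vocab, so it maps to 1
print("Padded:   ", padded)
print("Truncated:", truncated)
```

Padding and truncating at the end of the sequence matches what `TextVectorization` does when `output_sequence_length` is set.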


## 9.2 Text Vectorization and Data Preparation

**Vectorization Methods**:
- Integer encoding with vocabulary
- One-hot encoding
- Word embeddings (dense vectors)
- TF-IDF representations
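
To contrast two of the sparse representations listed above, here is a small pure-Python sketch of one-hot encoding and TF-IDF. It uses the textbook weighting `tf * log(N / df)`; libraries typically apply smoothed variants, so the exact numbers will differ from theirs. The toy documents and the function names are assumptions for illustration.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical).
docs = [
    ["great", "movie"],
    ["terrible", "movie"],
    ["great", "great", "acting"],
]
vocab = sorted({token for doc in docs for token in doc})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(token):
    """Vector with a single 1 at the token's vocabulary position."""
    vec = [0] * len(vocab)
    vec[index[token]] = 1
    return vec

def tf_idf(doc):
    """Unsmoothed TF-IDF: term frequency times log(N / document frequency)."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(doc).items():
        tf = count / len(doc)
        df = sum(1 for d in docs if word in d)
        vec[index[word]] = tf * math.log(len(docs) / df)
    return vec

print("Vocabulary:", vocab)
print("one_hot('movie'):", one_hot("movie"))
print("tf_idf(docs[1]):", tf_idf(docs[1]))
```

In the second document, "terrible" appears in only one document and therefore receives a higher weight than "movie", which appears in two.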

**Data Pipeline Features**:
- Automatic vocabulary building
- Sequence padding and truncation
- Batch processing
- Prefetching for performance
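
Batching and prefetching are usually handled with `tf.data`. The sketch below shows one possible arrangement, assuming a tiny in-memory dataset; the texts, labels, batch size, and sequence length are placeholders chosen for illustration.

```python
import tensorflow as tf

# Hypothetical labelled examples (1 = positive, 0 = negative).
texts = [
    "this movie was fantastic",
    "terrible acting and plot",
    "great film",
    "i hated it",
]
labels = [1, 0, 1, 0]

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_mode="int",
    output_sequence_length=8,
)
vectorizer.adapt(texts)

dataset = (
    tf.data.Dataset.from_tensor_slices((texts, labels))
    .shuffle(buffer_size=len(texts), seed=42)
    .batch(2)
    # Vectorize whole batches of strings inside the input pipeline.
    .map(lambda text, label: (vectorizer(text), label),
         num_parallel_calls=tf.data.AUTOTUNE)
    # Overlap preprocessing of the next batch with model execution.
    .prefetch(tf.data.AUTOTUNE)
)

batch_shapes = [tuple(features.shape) for features, _ in dataset]
print("Batch shapes:", batch_shapes)
```

Vectorizing inside the pipeline keeps the model itself free of string handling; placing the `TextVectorization` layer inside the model instead is an equally valid design when raw strings should be accepted at inference time.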

In [2]:
# Text Vectorization and Data Pipeline
class TextVectorizationPipeline:
    """End-to-end text vectorization pipeline"""
    
    def __init__(self, max_vocab_size=10000, max_sequence_length=50):
        self.max_vocab_size = max_vocab_size
        self.max_sequence_length = max_sequence_length
        self.vectorizer = None
    
    def build_vectorizer(self, texts):
        """Build text vectorization layer"""
        self.vectorizer = tf.keras.layers.TextVectorization(
            max_tokens=self.max_vocab_size,
            output_mode='int',
            output_sequence_length=self.max_sequence_length
        )
        self.vectorizer.adapt(texts)
        return self.vectorizer
    
    def vectorize_text(self, texts):
        """Vectorize text using trained vectorizer"""
        return self.vectorizer(texts)
    
    def get_vocabulary(self):
        """Get vocabulary from vectorizer"""
        return self.vectorizer.get_vocabulary()

# Test the vectorization pipeline
vectorization_pipeline = TextVectorizationPipeline()

sample_texts = [
    "This movie was absolutely fantastic",
    "I hated this film, terrible acting"
]

vectorizer = vectorization_pipeline.build_vectorizer(sample_texts)
vectorized_texts = vectorization_pipeline.vectorize_text(sample_texts)

print("Text vectorization pipeline created")
print("Vocabulary size:", len(vectorization_pipeline.get_vocabulary()))
print("Max sequence length:", vectorization_pipeline.max_sequence_length)
print("Vectorized shape:", vectorized_texts.shape)

Text vectorization pipeline created
Vocabulary size: 12
Max sequence length: 50
Vectorized shape: (2, 50)


## 9.3 LSTM Networks for Text Classification

**LSTM Architecture**:
- Long Short-Term Memory networks
- Handles sequential data with long-range dependencies
- Maintains internal state (memory)
- Mitigates (but does not fully eliminate) the vanishing gradient problem

**Key Components**:
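- Input gate: Controls how much new information is written to the cell state
- Forget gate: Controls which parts of the cell state are kept or discarded
- Output gate: Controls how much of the cell state is exposed as the hidden state
- Cell state: Long-term memory carried across time steps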
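
The interplay of the gates is easier to follow in code. Below is a minimal NumPy sketch of a single LSTM time step, not the Keras implementation: the weight layout (one kernel `W`, one recurrent kernel `U`, one bias `b`, each holding the four gates side by side) and the gate ordering are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input_dim, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[-1]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[..., :units])                # input gate
    f = sigmoid(z[..., units:2 * units])       # forget gate
    g = np.tanh(z[..., 2 * units:3 * units])   # candidate cell values
    o = sigmoid(z[..., 3 * units:])            # output gate
    c_t = f * c_prev + i * g                   # update long-term memory
    h_t = o * np.tanh(c_t)                     # expose part of it as output
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, units = 3, 2
W = rng.normal(size=(input_dim, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

# Run five time steps over a random input sequence.
h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print("Final hidden state:", h)
print("Final cell state:  ", c)
```

Because the cell state is updated additively (`f * c_prev + i * g`), gradients can flow across many time steps as long as the forget gate stays close to 1, which is the mechanism behind the improved handling of long-range dependencies.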

In [3]:
# LSTM Model for Sentiment Analysis
def create_lstm_sentiment_model(vocab_size, embedding_dim=128, lstm_units=64, max_sequence_length=50):
    """Create LSTM model for sentiment analysis"""
    
    model = tf.keras.Sequential([
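        # Explicit Input layer fixes the sequence length and builds the model,
        # so count_params() and output_shape are available immediately.
        tf.keras.layers.Input(shape=(max_sequence_length,)),
        tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dim,
            mask_zero=True  # token id 0 is padding and is masked downstream
        ),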
        tf.keras.layers.LSTM(
            lstm_units,
            return_sequences=True,
            dropout=0.2,
            recurrent_dropout=0.2
        ),
        tf.keras.layers.LSTM(
            lstm_units // 2,
            dropout=0.2,
            recurrent_dropout=0.2
        ),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    
    return model

# Create and test the model
vocab_size = 10000
lstm_model = create_lstm_sentiment_model(vocab_size)

print("LSTM sentiment analysis model created")
print("Model parameters:", lstm_model.count_params())
print("Output shape:", lstm_model.output_shape)

LSTM sentiment analysis model created
Model parameters: 1343425
Output shape: (None, 1)
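
The parameter count can be checked by hand. The sketch below applies the standard formulas for embedding, LSTM, and dense layers to the layer sizes used in `create_lstm_sentiment_model` with its default arguments; the helper function names are introduced here for illustration only.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-sized vector per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units

total = (
    embedding_params(10000, 128)   # Embedding
    + lstm_params(128, 64)         # first LSTM
    + lstm_params(64, 32)          # second LSTM
    + dense_params(32, 32)         # Dense(32)
    + dense_params(32, 16)         # Dense(16)
    + dense_params(16, 1)          # Dense(1)
)
print("Total parameters:", total)  # Dropout layers add no parameters
```

The embedding table alone accounts for 1,280,000 of the weights, so vocabulary size and embedding dimension dominate the model size here.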


## 9.4 Word Embeddings for Semantic Understanding

**Word Embeddings Benefits**:
- Capture semantic relationships
- Dense vector representations
- Similar words have similar vectors
- Transfer learning from large corpora

**Embedding Types**:
- Learned embeddings (from scratch)
- Pretrained embeddings (Word2Vec, GloVe)
- Contextual embeddings (BERT, ELMo)
- Domain-specific embeddings
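
Using pretrained embeddings usually means building a matrix whose rows line up with the model's vocabulary ids. The sketch below shows the idea with NumPy. The three-dimensional vectors are invented for illustration (real Word2Vec or GloVe vectors typically have 50 to 300 dimensions), and the `pretrained` and `vocab` dictionaries are hypothetical stand-ins for a loaded vector file and a fitted vocabulary.

```python
import numpy as np

# Hypothetical pretrained vectors (made-up values).
pretrained = {
    "good": np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad": np.array([-0.7, 0.1, 0.2]),
}
# Hypothetical vocabulary: word -> integer id.
vocab = {"<PAD>": 0, "<OOV>": 1, "good": 2, "great": 3, "bad": 4, "plot": 5}
embedding_dim = 3

# Row i holds the vector for the word with id i; words without a
# pretrained vector (and the special tokens) stay at zero.
embedding_matrix = np.zeros((len(vocab), embedding_dim))
for word, idx in vocab.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_good_great = cosine(embedding_matrix[vocab["good"]], embedding_matrix[vocab["great"]])
sim_good_bad = cosine(embedding_matrix[vocab["good"]], embedding_matrix[vocab["bad"]])
print("good vs great:", round(sim_good_great, 3))
print("good vs bad:  ", round(sim_good_bad, 3))
```

A matrix like this can be used to initialize an `Embedding` layer, for example through `embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix)`, optionally with `trainable=False` to keep the pretrained vectors frozen.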

In [4]:
# Advanced LSTM with Word Embeddings
def create_advanced_lstm_model(vocab_size, embedding_dim=128, lstm_units=64, max_sequence_length=50):
    """Create advanced LSTM model with embedding options"""
    
    inputs = tf.keras.layers.Input(shape=(max_sequence_length,))
    
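    # The sequence length is already fixed by the Input layer above,
    # so the Embedding layer needs no input_length argument.
    embedding_layer = tf.keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        mask_zero=True  # token id 0 is padding and is masked downstream
    )(inputs)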
    
    lstm_output = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(
            lstm_units,
            return_sequences=True,
            dropout=0.2,
            recurrent_dropout=0.2
        )
    )(embedding_layer)
    
    lstm_output = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(
            lstm_units // 2,
            dropout=0.2,
            recurrent_dropout=0.2
        )
    )(lstm_output)
    
    dense_output = tf.keras.layers.Dense(64, activation='relu')(lstm_output)
    dense_output = tf.keras.layers.Dropout(0.3)(dense_output)
    
    dense_output = tf.keras.layers.Dense(32, activation='relu')(dense_output)
    dense_output = tf.keras.layers.Dropout(0.3)(dense_output)
    
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dense_output)
    
    model = tf.keras.Model(inputs=inputs, outputs=outputs)
    
    return model

# Create advanced model
advanced_model = create_advanced_lstm_model(vocab_size)

print("Advanced LSTM model with word embeddings created")
print("Model parameters:", advanced_model.count_params())
# layers[0] is the InputLayer and layers[1] the Embedding layer;
# its weight matrix has shape (vocab_size, embedding_dim).
print("Embedding matrix shape:", advanced_model.layers[1].get_weights()[0].shape)

Advanced LSTM model with word embeddings created
Model parameters: 1426305
Embedding matrix shape: (10000, 128)


## 9.5 Model Training and Evaluation

**Training Configuration**:
- Binary cross-entropy loss for sentiment
- Adam optimizer with learning rate scheduling
- Early stopping and model checkpointing
- Class weight balancing for imbalanced data
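
Class weights for imbalanced data can be derived directly from the label counts. The sketch below uses the common "balanced" heuristic `n_samples / (n_classes * class_count)`; the label list and the function name `balanced_class_weights` are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced labels: six positive reviews, two negative.
labels = [1, 1, 1, 1, 1, 1, 0, 0]
weights = balanced_class_weights(labels)
print("Class weights:", weights)
```

The resulting dictionary can be passed to `model.fit(..., class_weight=weights)` so that errors on the rarer class contribute more to the loss.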

**Evaluation Metrics**:
- Accuracy and F1-score
- Precision and recall
- ROC curve and AUC
- Confusion matrix
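
The metrics above all derive from the four cells of the confusion matrix. The following pure-Python sketch computes them for a handful of made-up predictions; the probabilities, the 0.5 decision threshold, and the helper names are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical ground truth and sigmoid outputs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [int(p >= 0.5) for p in probs]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / len(y_true)

print("Confusion matrix [[tn, fp], [fn, tp]]:", [[tn, fp], [fn, tp]])
print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall, "F1:", f1)
```

Moving the threshold away from 0.5 trades precision against recall, which is the trade-off the ROC curve visualizes.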

In [5]:
# Model Compilation and Training Setup
class SentimentTrainingConfig:
    """Configuration for sentiment analysis training"""
    
    def __init__(self, learning_rate=1e-3):
        self.learning_rate = learning_rate
    
    def compile_model(self, model):
        """Compile model with appropriate settings"""
        
        optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
        
        # Explicit names keep metric keys stable; Keras otherwise appends
        # suffixes such as precision_1 when a metric is instantiated again.
        self.metrics = [
            tf.keras.metrics.BinaryAccuracy(name='accuracy'),
            tf.keras.metrics.Precision(name='precision'),
            tf.keras.metrics.Recall(name='recall')
        ]
        
        model.compile(
            optimizer=optimizer,
            loss='binary_crossentropy',
            metrics=self.metrics
        )
        
        return model

# Configure and compile model
training_config = SentimentTrainingConfig()
compiled_model = training_config.compile_model(advanced_model)

print("Sentiment analysis model compiled successfully")
print("Loss:", compiled_model.loss)
print("Optimizer:", type(compiled_model.optimizer).__name__)
# model.metrics may not list the compiled metrics by name before the model
# has been trained or evaluated, so read the names from the config instead.
print("Metrics:", [metric.name for metric in training_config.metrics])

Sentiment analysis model compiled successfully
Loss: binary_crossentropy
Optimizer: Adam
Metrics: ['accuracy', 'precision', 'recall']


In [6]:
# Training Callbacks
def create_training_callbacks():
    """Create training callbacks for sentiment analysis"""
    
    callbacks = [
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss',
            patience=5,
            restore_best_weights=True
        ),
        tf.keras.callbacks.ModelCheckpoint(
            'best_sentiment_model.h5',
            monitor='val_accuracy',
            save_best_only=True
        ),
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=3
        ),
        tf.keras.callbacks.TensorBoard(
            log_dir='./sentiment_logs'
        )
    ]
    
    return callbacks

# Create callbacks
training_callbacks = create_training_callbacks()

print("Training callbacks created successfully")
print("Available callbacks: EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard")

Training callbacks created successfully
Available callbacks: EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
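
Callbacks take effect only once they are passed to `model.fit`. The sketch below shows that wiring on synthetic data. Everything in it is a stand-in: the random integer sequences, the labelling rule, the deliberately tiny model, and the short patience values are assumptions chosen so the example runs in seconds, not recommended settings.

```python
import numpy as np
import tensorflow as tf

# Synthetic "token id" sequences; the label depends on the first token only.
rng = np.random.default_rng(0)
x = rng.integers(1, 100, size=(64, 10))
y = (x[:, 0] > 50).astype("float32")

# Deliberately small stand-in model.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Embedding(input_dim=100, output_dim=8),
    tf.keras.layers.LSTM(8),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Stop when validation loss stops improving and roll back to the best weights.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=2, restore_best_weights=True
    ),
    # Halve the learning rate when validation loss plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=1
    ),
]

history = model.fit(
    x, y,
    validation_split=0.25,   # needed so that val_loss exists for the callbacks
    epochs=3,
    batch_size=16,
    callbacks=callbacks,
    verbose=0,
)
print("Epochs run:", len(history.history["loss"]))
print("Tracked keys:", sorted(history.history.keys()))
```

Callbacks that monitor `val_loss` or `val_accuracy` require validation data, supplied either through `validation_split` or `validation_data`.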


## Chapter 9 Summary

### Key Concepts Covered:
1. **Text Preprocessing**: Cleaning, tokenization, and vocabulary building
2. **Text Vectorization**: Converting text to numerical representations
3. **LSTM Networks**: Sequential modeling for text classification
4. **Word Embeddings**: Semantic vector representations
5. **Sentiment Analysis**: Binary classification of text sentiment

### Technical Achievements:
- **Advanced Text Processing**: Built comprehensive NLP preprocessing pipeline
- **LSTM Architecture**: Implemented bidirectional LSTM for sequence modeling
- **Word Embeddings**: Utilized dense vector representations for semantic understanding
- **Complete Pipeline**: Assembled the building blocks of an end-to-end sentiment analysis system, from preprocessing to a compiled model with training callbacks

### Practical Applications:
- Customer review analysis
- Social media sentiment monitoring
- Product feedback classification
- Market sentiment analysis
- Brand reputation management

**This chapter lays the groundwork for sentiment analysis with TensorFlow: text is cleaned and vectorized, represented with word embeddings, and classified with LSTM networks that are compiled and configured for training.**