# 📝 Text Expander Pro - AI-Enhanced Edition

**Expand sentences into paragraphs using multiple AI techniques.**

This notebook offers **4 levels** of text generation:

| Level | Method | Description |
|-------|--------|-------------|
| 🟢 Basic | Markov Chain | Statistical word transitions |
| 🟡 Enhanced | Word2Vec + Markov | Semantic word relationships |
| 🟠 Advanced | LSTM Neural Network | Custom trained on your document |
| 🔴 Pro | Fine-tuned GPT-2 | Pretrained transformer language model (DistilGPT-2 by default), fine-tuned on your document |

---

## ⚡ Quick Start
1. **Enable GPU**: Runtime → Change runtime type → T4 GPU (recommended for LSTM/GPT-2)
2. Run cells in order
3. Upload your document
4. Choose your method and expand!

---

## Step 1: Install Dependencies

This installs the required libraries for all AI methods.

In [None]:
# Install required packages
!pip install -q gensim  # Word2Vec embeddings
!pip install -q sentence-transformers  # Better similarity
!pip install -q transformers accelerate  # GPT-2
!pip install -q torch  # PyTorch for neural networks

print("✅ All packages installed!")

✅ All packages installed!


In [None]:
# Import all libraries
import re
import random
import math
import numpy as np
from collections import defaultdict, Counter
from pathlib import Path
from google.colab import files
import textwrap
import warnings
warnings.filterwarnings('ignore')

# Gensim for Word2Vec
from gensim.models import Word2Vec

# Sentence Transformers for better similarity
from sentence_transformers import SentenceTransformer

# PyTorch for LSTM
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Transformers for GPT-2
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM, AutoTokenizer

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Libraries imported!")
print(f"🖥️  Device: {device}")
if torch.cuda.is_available():
    print(f"🎮 GPU: {torch.cuda.get_device_name(0)}")

✅ Libraries imported!
🖥️  Device: cuda
🎮 GPU: Tesla T4


## Step 2: Define All Model Classes

This cell contains all the AI models and text processing logic.

In [None]:
# =============================================================================
# DOCUMENT PROCESSOR
# =============================================================================

class DocumentProcessor:
    """Process and clean markdown documents"""

    def __init__(self, text: str = None):
        self.raw_text = text if text else ""
        self.sentences = []
        self.words = []
        self.paragraphs = []
        self.tokenized_sentences = []  # For Word2Vec training

    def clean_markdown(self, text: str) -> str:
        """Remove markdown syntax"""
        text = re.sub(r'^#{1,6}\s+', '', text, flags=re.MULTILINE)
        text = re.sub(r'\*{1,3}(.*?)\*{1,3}', r'\1', text)
        text = re.sub(r'_{1,3}(.*?)_{1,3}', r'\1', text)
        # Remove images before links; otherwise the link pattern consumes the image's [alt](url) part
        text = re.sub(r'!\[([^\]]*)\]\([^\)]+\)', '', text)
        text = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', text)
        text = re.sub(r'```[\s\S]*?```', '', text)
        text = re.sub(r'`([^`]+)`', r'\1', text)
        text = re.sub(r'^[-*_]{3,}\s*$', '', text, flags=re.MULTILINE)
        text = re.sub(r'^>\s+', '', text, flags=re.MULTILINE)
        text = re.sub(r'^[\s]*[-*+]\s+', '', text, flags=re.MULTILINE)
        text = re.sub(r'^[\s]*\d+\.\s+', '', text, flags=re.MULTILINE)
        return text

    def extract_sentences(self, text: str) -> list:
        """Extract sentences from text"""
        # Handle abbreviations
        abbrevs = ['Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'St', 'Jr', 'Sr']
        for abbr in abbrevs:
            # \b keeps the pattern from matching inside longer words
            text = re.sub(rf'\b{abbr}\.', abbr, text)

        sentences = re.split(r'(?<=[.!?])\s+', text)
        sentences = [s.strip() for s in sentences if s.strip() and len(s.strip()) > 15]
        return sentences

    def extract_words(self, text: str) -> list:
        """Extract words from text"""
        words = re.findall(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b", text.lower())
        return words

    def extract_paragraphs(self, text: str) -> list:
        """Extract paragraphs from text"""
        paragraphs = re.split(r'\n\s*\n', text)
        paragraphs = [p.strip() for p in paragraphs if p.strip() and len(p.strip()) > 50]
        return paragraphs

    def tokenize_for_training(self, sentences: list) -> list:
        """Tokenize sentences for Word2Vec training"""
        tokenized = []
        for sent in sentences:
            words = re.findall(r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b", sent.lower())
            if len(words) > 2:
                tokenized.append(words)
        return tokenized

    def process(self) -> dict:
        """Process the complete document"""
        cleaned = self.clean_markdown(self.raw_text)

        self.sentences = self.extract_sentences(cleaned)
        self.words = self.extract_words(cleaned)
        self.paragraphs = self.extract_paragraphs(cleaned)
        self.tokenized_sentences = self.tokenize_for_training(self.sentences)

        return {
            'sentences': self.sentences,
            'words': self.words,
            'paragraphs': self.paragraphs,
            'tokenized': self.tokenized_sentences,
            'word_count': len(self.words),
            'sentence_count': len(self.sentences),
            'unique_words': len(set(self.words)),
            'cleaned_text': cleaned
        }

print("✅ DocumentProcessor defined!")

✅ DocumentProcessor defined!
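
The sentence extraction step can be tried in isolation. The sketch below reuses the same split pattern and length filter as `extract_sentences`; the helper name `split_sentences` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def split_sentences(text: str, min_chars: int = 15) -> list:
    """Split after '.', '!' or '?' followed by whitespace, then drop short fragments."""
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [p.strip() for p in parts if len(p.strip()) > min_chars]

sample = (
    "Markov chains model word transitions. Short one. "
    "They need surprisingly little data to train!"
)
sentences = split_sentences(sample)
for s in sentences:
    print(s)
```

Note that the length filter silently discards fragments of 15 characters or fewer, so very short sentences never reach the models.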


In [None]:
# =============================================================================
# 🟢 BASIC: MARKOV CHAIN
# =============================================================================

class MarkovChain:
    """Basic Markov Chain for text generation"""

    def __init__(self, order: int = 2):
        self.order = order
        self.chain = defaultdict(list)
        self.starters = []

    def train(self, sentences: list):
        """Train the model from sentences"""
        for sentence in sentences:
            words = sentence.split()
            if len(words) < self.order + 1:
                continue

            starter = tuple(words[:self.order])
            self.starters.append(starter)

            for i in range(len(words) - self.order):
                key = tuple(words[i:i + self.order])
                next_word = words[i + self.order]
                self.chain[key].append(next_word)

    def generate(self, seed_words: list = None, max_words: int = 50) -> str:
        """Generate text"""
        if not self.starters:
            return ""

        if seed_words and len(seed_words) >= self.order:
            current = self._find_matching_key(seed_words)
        else:
            current = random.choice(self.starters)

        if not current:
            current = random.choice(self.starters)

        result = list(current)

        for _ in range(max_words - self.order):
            if current not in self.chain:
                current = self._find_similar_key(current)
                if not current:
                    break

            next_words = self.chain.get(current, [])
            if not next_words:
                break

            next_word = random.choice(next_words)
            result.append(next_word)
            current = tuple(result[-self.order:])

            if next_word.endswith(('.', '!', '?')):
                break

        return ' '.join(result)

    def _find_matching_key(self, words: list) -> tuple:
        words_lower = [w.lower() for w in words]
        for i in range(len(words_lower) - self.order + 1):
            key = tuple(words_lower[i:i + self.order])
            if key in self.chain:
                return key
        for key in self.chain.keys():
            key_lower = tuple(w.lower() for w in key)
            if any(w in key_lower for w in words_lower):
                return key
        return random.choice(self.starters) if self.starters else None

    def _find_similar_key(self, current: tuple) -> tuple:
        current_lower = tuple(w.lower() for w in current)
        for key in self.chain.keys():
            key_lower = tuple(w.lower() for w in key)
            if any(w in current_lower for w in key_lower):
                return key
        return random.choice(self.starters) if self.starters else None

print("✅ MarkovChain defined!")

✅ MarkovChain defined!
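
To make the transition table concrete, here is a minimal sketch of how an order-2 chain is built, following the same loop as `MarkovChain.train`. The function name `build_chain` and the two-sentence corpus are assumptions made for illustration.

```python
from collections import defaultdict

def build_chain(sentences, order=2):
    """Map each tuple of `order` consecutive words to the words that followed it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - order):
            key = tuple(words[i:i + order])
            chain[key].append(words[i + order])
    return chain

corpus = ["the cat sat on the mat.", "the cat ran to the door."]
chain = build_chain(corpus)

# Both sentences start with "the cat", so that key has two possible successors
print(chain[("the", "cat")])
```

Because successors are stored in a list with repeats, `random.choice` over that list samples each next word in proportion to how often it followed the key.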


In [None]:
# =============================================================================
# 🟡 ENHANCED: WORD2VEC + MARKOV
# =============================================================================

class Word2VecMarkov:
    """Enhanced Markov Chain using Word2Vec for semantic word selection"""

    def __init__(self, order: int = 2, vector_size: int = 100, window: int = 5):
        self.order = order
        self.vector_size = vector_size
        self.window = window
        self.chain = defaultdict(list)
        self.starters = []
        self.word2vec = None
        self.vocab = set()

    def train(self, sentences: list, tokenized_sentences: list):
        """Train both Markov Chain and Word2Vec"""
        # Train Markov Chain
        for sentence in sentences:
            words = sentence.split()
            if len(words) < self.order + 1:
                continue
            starter = tuple(words[:self.order])
            self.starters.append(starter)
            for i in range(len(words) - self.order):
                key = tuple(words[i:i + self.order])
                next_word = words[i + self.order]
                self.chain[key].append(next_word)

        # Train Word2Vec
        print("   Training Word2Vec model...")
        self.word2vec = Word2Vec(
            sentences=tokenized_sentences,
            vector_size=self.vector_size,
            window=self.window,
            min_count=1,
            workers=4,
            epochs=50
        )
        self.vocab = set(self.word2vec.wv.key_to_index.keys())
        print(f"   Word2Vec vocabulary: {len(self.vocab)} words")

    def _get_best_next_word(self, candidates: list, context: list) -> str:
        """Select the best next word using Word2Vec similarity"""
        if not candidates or not self.word2vec:
            return random.choice(candidates) if candidates else ""

        # Get context words that are in vocabulary
        context_words = [w.lower() for w in context if w.lower() in self.vocab]

        if not context_words:
            return random.choice(candidates)

        # Score each candidate based on similarity to context
        scored = []
        for candidate in candidates:
            cand_lower = candidate.lower().rstrip('.,!?"\'')
            if cand_lower in self.vocab:
                # Calculate average similarity to context words
                similarities = []
                for ctx_word in context_words[-5:]:  # Use last 5 context words
                    try:
                        sim = self.word2vec.wv.similarity(cand_lower, ctx_word)
                        similarities.append(sim)
                    except KeyError:  # word missing from the Word2Vec vocabulary
                        pass

                if similarities:
                    avg_sim = sum(similarities) / len(similarities)
                    # Add some randomness to avoid repetition
                    score = avg_sim + random.uniform(-0.1, 0.1)
                    scored.append((candidate, score))
                else:
                    scored.append((candidate, random.uniform(-0.5, 0.5)))
            else:
                scored.append((candidate, random.uniform(-0.5, 0.5)))

        # Sort by score and pick from top candidates with some randomness
        scored.sort(key=lambda x: x[1], reverse=True)
        top_n = min(3, len(scored))
        return random.choice([s[0] for s in scored[:top_n]])

    def generate(self, seed_words: list = None, max_words: int = 50) -> str:
        """Generate text using Word2Vec-enhanced selection"""
        if not self.starters:
            return ""

        if seed_words and len(seed_words) >= self.order:
            current = self._find_matching_key(seed_words)
        else:
            current = random.choice(self.starters)

        if not current:
            current = random.choice(self.starters)

        result = list(current)

        for _ in range(max_words - self.order):
            if current not in self.chain:
                current = self._find_similar_key(current)
                if not current:
                    break

            candidates = self.chain.get(current, [])
            if not candidates:
                break

            # Use Word2Vec to select best next word
            next_word = self._get_best_next_word(candidates, result)
            result.append(next_word)
            current = tuple(result[-self.order:])

            if next_word.endswith(('.', '!', '?')):
                break

        return ' '.join(result)

    def _find_matching_key(self, words: list) -> tuple:
        words_lower = [w.lower() for w in words]
        for i in range(len(words_lower) - self.order + 1):
            key = tuple(words_lower[i:i + self.order])
            if key in self.chain:
                return key
        for key in self.chain.keys():
            key_lower = tuple(w.lower() for w in key)
            if any(w in key_lower for w in words_lower):
                return key
        return None

    def _find_similar_key(self, current: tuple) -> tuple:
        """Find similar key using Word2Vec"""
        current_lower = [w.lower() for w in current]

        best_key = None
        best_score = -1

        for key in self.chain.keys():
            key_lower = [w.lower() for w in key]

            # Calculate similarity between keys
            score = 0
            count = 0
            for w1 in current_lower:
                for w2 in key_lower:
                    if w1 in self.vocab and w2 in self.vocab:
                        try:
                            score += self.word2vec.wv.similarity(w1, w2)
                            count += 1
                        except KeyError:  # word missing from the Word2Vec vocabulary
                            pass

            if count > 0:
                avg_score = score / count
                if avg_score > best_score:
                    best_score = avg_score
                    best_key = key

        return best_key if best_key else (random.choice(self.starters) if self.starters else None)

print("✅ Word2VecMarkov defined!")

✅ Word2VecMarkov defined!
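
The candidate scoring in `_get_best_next_word` can be sketched without training a model. The example below substitutes a hypothetical table of 2-D vectors (`TOY_VECTORS`) for the Word2Vec embeddings and a plain `cosine` helper for `wv.similarity`; both are assumptions for illustration only.

```python
import math
import random

# Hypothetical 2-D vectors; a trained Word2Vec model would supply real embeddings.
TOY_VECTORS = {
    "river": (0.9, 0.1),
    "water": (0.8, 0.2),
    "flows": (0.7, 0.3),
    "bank": (0.6, 0.4),
    "stock": (0.1, 0.9),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(candidates, context, jitter=0.1, rng=None):
    """Order candidates by mean similarity to the context, plus optional random jitter."""
    rng = rng or random.Random()
    known_context = [w for w in context if w in TOY_VECTORS]
    scored = []
    for cand in candidates:
        if cand in TOY_VECTORS and known_context:
            sims = [cosine(TOY_VECTORS[cand], TOY_VECTORS[w]) for w in known_context]
            score = sum(sims) / len(sims) + rng.uniform(-jitter, jitter)
        else:
            # Unknown words get a purely random score, mirroring the class above
            score = rng.uniform(-0.5, 0.5)
        scored.append((cand, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in scored]

# jitter=0.0 makes the ranking deterministic for this demonstration
ranked = rank_candidates(["stock", "water", "flows"], ["river", "bank"], jitter=0.0)
print(ranked)
```

With the jitter left at its default, the order of closely scored candidates can change between runs, which is what keeps the generator from repeating itself; the class then picks at random among the top three.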


In [None]:
# =============================================================================
# 🟠 ADVANCED: LSTM LANGUAGE MODEL
# =============================================================================

class CharLSTM(nn.Module):
    """Character-level LSTM for text generation"""

    def __init__(self, vocab_size, embed_size=128, hidden_size=256, num_layers=2):
        super(CharLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True, dropout=0.2)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden):
        embed = self.embedding(x)
        output, hidden = self.lstm(embed, hidden)
        output = self.fc(output)
        return output, hidden

    def init_hidden(self, batch_size, device):
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(device)
        return (h0, c0)


class LSTMTextGenerator:
    """LSTM-based text generator trained on your document"""

    def __init__(self, embed_size=128, hidden_size=256, num_layers=2):
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.model = None
        self.char_to_idx = {}
        self.idx_to_char = {}
        self.vocab_size = 0
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.trained = False

    def train(self, text: str, epochs: int = 20, seq_length: int = 100, batch_size: int = 64, lr: float = 0.002):
        """Train the LSTM model on the text"""
        print(f"   Preparing data...")

        # Build vocabulary
        chars = sorted(list(set(text)))
        self.char_to_idx = {ch: i for i, ch in enumerate(chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(chars)}
        self.vocab_size = len(chars)

        print(f"   Vocabulary size: {self.vocab_size} characters")

        # Create sequences
        encoded = [self.char_to_idx[ch] for ch in text]
        sequences = []
        targets = []

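        for i in range(0, len(encoded) - seq_length, seq_length // 2):
            sequences.append(encoded[i:i + seq_length])
            targets.append(encoded[i + 1:i + seq_length + 1])

        # Without this guard an empty dataset causes a division by zero when averaging the loss
        if not sequences:
            raise ValueError(f"Text is too short: need more than {seq_length} characters.")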

        X = torch.tensor(sequences, dtype=torch.long)
        y = torch.tensor(targets, dtype=torch.long)

        dataset = torch.utils.data.TensorDataset(X, y)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        # Initialize model
        self.model = CharLSTM(
            self.vocab_size,
            self.embed_size,
            self.hidden_size,
            self.num_layers
        ).to(self.device)

        criterion = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)

        # Training loop
        print(f"   Training LSTM model...")
        self.model.train()

        for epoch in range(epochs):
            total_loss = 0
            for batch_x, batch_y in dataloader:
                batch_x = batch_x.to(self.device)
                batch_y = batch_y.to(self.device)

                hidden = self.model.init_hidden(batch_x.size(0), self.device)

                optimizer.zero_grad()
                output, hidden = self.model(batch_x, hidden)

                loss = criterion(output.view(-1, self.vocab_size), batch_y.view(-1))
                loss.backward()

                torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
                optimizer.step()

                total_loss += loss.item()

            if (epoch + 1) % 5 == 0:
                avg_loss = total_loss / len(dataloader)
                print(f"   Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")

        self.trained = True
        print(f"   ✓ LSTM training complete!")

    def generate(self, seed_text: str, length: int = 200, temperature: float = 0.8) -> str:
        """Generate text from seed"""
        if not self.trained or not self.model:
            return "Model not trained yet."

        self.model.eval()

        # Encode seed text
        seed_encoded = [self.char_to_idx.get(ch, 0) for ch in seed_text[-100:]]
        input_seq = torch.tensor([seed_encoded], dtype=torch.long).to(self.device)

        hidden = self.model.init_hidden(1, self.device)

        # Generate
        generated = seed_text

        with torch.no_grad():
            for _ in range(length):
                output, hidden = self.model(input_seq, hidden)

                # Apply temperature
                probs = torch.softmax(output[0, -1] / temperature, dim=0)

                # Sample from distribution
                idx = torch.multinomial(probs, 1).item()

                char = self.idx_to_char[idx]
                generated += char

                # Update input
                input_seq = torch.tensor([[idx]], dtype=torch.long).to(self.device)

                # Stop at sentence end
                if char in '.!?' and len(generated) > len(seed_text) + 50:
                    break

        return generated

print("✅ LSTMTextGenerator defined!")

✅ LSTMTextGenerator defined!
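
The `temperature` argument of `generate` divides the logits before the softmax. A small pure-Python sketch shows the effect; the function name `softmax_with_temperature` and the example logits are assumptions, and the notebook itself does this step with `torch.softmax`.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 0.8, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the highest-scoring character and give more conservative text; higher temperatures flatten the distribution and give more varied but less coherent output.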


In [None]:
# =============================================================================
# 🔴 PRO: FINE-TUNED GPT-2
# =============================================================================

class GPT2TextGenerator:
    """Fine-tuned GPT-2 for text generation"""

    def __init__(self, model_name: str = 'distilgpt2'):
        self.model_name = model_name
        self.model = None
        self.tokenizer = None
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.trained = False

    def train(self, text: str, epochs: int = 3, batch_size: int = 4):
        """Fine-tune GPT-2 on the text"""
        print(f"   Loading {self.model_name}...")

        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForCausalLM.from_pretrained(self.model_name)

        # Add padding token
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model.config.pad_token_id = self.tokenizer.eos_token_id

        # Move to device
        self.model.to(self.device)

        # Prepare training data
        print(f"   Preparing training data...")

        # Split text into chunks
        max_length = 512
        encodings = self.tokenizer(
            text,
            truncation=True,
            max_length=max_length,
            return_overflowing_tokens=True,
            return_tensors='pt',
            padding=True
        )

        # Create dataset
        class TextDataset(Dataset):
            def __init__(self, encodings):
                self.input_ids = encodings['input_ids']
                self.attention_mask = encodings['attention_mask']

            def __len__(self):
                return len(self.input_ids)

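            def __getitem__(self, idx):
                # Mask padded positions with -100 so the loss ignores them
                labels = self.input_ids[idx].clone()
                labels[self.attention_mask[idx] == 0] = -100
                return {
                    'input_ids': self.input_ids[idx],
                    'attention_mask': self.attention_mask[idx],
                    'labels': labels
                }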

        dataset = TextDataset(encodings)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        # Training
        print(f"   Fine-tuning GPT-2...")
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=5e-5)

        self.model.train()
        for epoch in range(epochs):
            total_loss = 0
            for batch in dataloader:
                batch = {k: v.to(self.device) for k, v in batch.items()}

                outputs = self.model(**batch)
                loss = outputs.loss

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()

            avg_loss = total_loss / len(dataloader)
            print(f"   Epoch {epoch + 1}/{epochs}, Loss: {avg_loss:.4f}")

        self.trained = True
        print(f"   ✓ GPT-2 fine-tuning complete!")

    def generate(self, seed_text: str, max_length: int = 150, temperature: float = 0.9,
                 top_k: int = 50, top_p: float = 0.95) -> str:
        """Generate text from seed"""
        if not self.trained or not self.model:
            return "Model not trained yet."

        self.model.eval()

        input_ids = self.tokenizer.encode(seed_text, return_tensors='pt').to(self.device)

        with torch.no_grad():
            output = self.model.generate(
                input_ids,
                max_length=max_length,
                temperature=temperature,
                top_k=top_k,
                top_p=top_p,
                do_sample=True,
                num_return_sequences=1,
                pad_token_id=self.tokenizer.eos_token_id,
                no_repeat_ngram_size=3
            )

        generated = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return generated

print("✅ GPT2TextGenerator defined!")

✅ GPT2TextGenerator defined!
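
The `top_k` and `top_p` arguments passed to `model.generate` restrict which tokens can be sampled. The sketch below is a simplified pure-Python illustration of the idea, operating on an already normalized probability list; the function name `top_k_top_p_filter` is hypothetical, and the library's own implementation works on logits and may differ in ordering and renormalization details.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return {token_index: renormalized probability} after top-k, then top-p filtering."""
    # Keep only the top_k most probable tokens
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]

    # Of those, keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    total = sum(p for _, p in kept)
    return {index: p / total for index, p in kept}

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.85)
print(filtered)
```

Here `top_k=4` drops the least likely token, and `top_p=0.85` then keeps only the three tokens needed to cover 85% of the probability mass.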


In [None]:
# =============================================================================
# SIMILARITY FINDER (Enhanced with Sentence Transformers)
# =============================================================================

class EnhancedSimilarityFinder:
    """Find similar sentences using Sentence Transformers"""

    def __init__(self, sentences: list, paragraphs: list, use_transformers: bool = True):
        self.sentences = sentences
        self.paragraphs = paragraphs
        self.use_transformers = use_transformers
        self.sentence_embeddings = None
        self.paragraph_embeddings = None
        self.model = None

        if use_transformers:
            print("   Loading Sentence Transformer model...")
            self.model = SentenceTransformer('all-MiniLM-L6-v2')
            print("   Encoding sentences...")
            self.sentence_embeddings = self.model.encode(sentences, show_progress_bar=False)
            print("   Encoding paragraphs...")
            self.paragraph_embeddings = self.model.encode(paragraphs, show_progress_bar=False)
            print(f"   ✓ Encoded {len(sentences)} sentences and {len(paragraphs)} paragraphs")

    def find_similar_sentences(self, query: str, top_n: int = 5) -> list:
        """Find sentences most similar to query"""
        if self.use_transformers and self.model:
            query_embedding = self.model.encode([query])[0]

            similarities = []
            for i, sent_emb in enumerate(self.sentence_embeddings):
                sim = np.dot(query_embedding, sent_emb) / (
                    np.linalg.norm(query_embedding) * np.linalg.norm(sent_emb)
                )
                similarities.append((self.sentences[i], float(sim)))

            similarities.sort(key=lambda x: x[1], reverse=True)
            return similarities[:top_n]
        else:
            # Fallback to simple matching
            query_words = set(query.lower().split())
            similarities = []
            for sent in self.sentences:
                sent_words = set(sent.lower().split())
                overlap = len(query_words & sent_words) / max(len(query_words), 1)
                similarities.append((sent, overlap))
            similarities.sort(key=lambda x: x[1], reverse=True)
            return similarities[:top_n]

    def find_similar_paragraphs(self, query: str, top_n: int = 3) -> list:
        """Find paragraphs most similar to query"""
        if self.use_transformers and self.model:
            query_embedding = self.model.encode([query])[0]

            similarities = []
            for i, para_emb in enumerate(self.paragraph_embeddings):
                sim = np.dot(query_embedding, para_emb) / (
                    np.linalg.norm(query_embedding) * np.linalg.norm(para_emb)
                )
                similarities.append((self.paragraphs[i], float(sim)))

            similarities.sort(key=lambda x: x[1], reverse=True)
            return similarities[:top_n]
        else:
            query_words = set(query.lower().split())
            similarities = []
            for para in self.paragraphs:
                para_words = set(para.lower().split())
                overlap = len(query_words & para_words) / max(len(query_words), 1)
                similarities.append((para, overlap))
            similarities.sort(key=lambda x: x[1], reverse=True)
            return similarities[:top_n]

print("✅ EnhancedSimilarityFinder defined!")

✅ EnhancedSimilarityFinder defined!
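
When `use_transformers` is `False`, the finder falls back to word overlap. That fallback is simple enough to run on its own; the helper names `overlap_score` and `top_matches` and the sample sentences below are illustrative assumptions.

```python
def overlap_score(query, sentence):
    """Fraction of the query's distinct words that also appear in the sentence."""
    query_words = set(query.lower().split())
    sentence_words = set(sentence.lower().split())
    return len(query_words & sentence_words) / max(len(query_words), 1)

def top_matches(query, sentences, top_n=2):
    """Return the top_n (sentence, score) pairs, highest score first."""
    scored = [(s, overlap_score(query, s)) for s in sentences]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

sentences = [
    "Neural networks learn patterns from data",
    "The weather was pleasant yesterday",
    "Markov chains learn word transitions from text",
]
results = top_matches("models learn from data", sentences)
for sentence, score in results:
    print(f"{score:.2f}  {sentence}")
```

The score is normalized by the query length only, so it is not symmetric, and it matches exact tokens rather than meaning. The Sentence Transformer path exists to cover that gap.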


In [None]:
# =============================================================================
# MAIN TEXT EXPANDER PRO CLASS
# =============================================================================

class TextExpanderPro:
    """Main class combining all AI methods for text expansion"""

    def __init__(self, text: str):
        self.text = text
        self.data = None
        self.processor = None

        # Models
        self.markov = None
        self.word2vec_markov = None
        self.lstm = None
        self.gpt2 = None
        self.similarity = None

        # Flags
        self.basic_ready = False
        self.enhanced_ready = False
        self.lstm_ready = False
        self.gpt2_ready = False

    def initialize_basic(self):
        """Initialize basic Markov Chain (fast)"""
        print("\n" + "="*60)
        print("🟢 Initializing BASIC (Markov Chain)")
        print("="*60)

        # Process document
        print("\n📖 Processing document...")
        self.processor = DocumentProcessor(text=self.text)
        self.data = self.processor.process()

        print(f"   ✓ {self.data['sentence_count']} sentences")
        print(f"   ✓ {self.data['word_count']} words")
        print(f"   ✓ {self.data['unique_words']} unique words")

        # Train Markov
        print("\n🔗 Training Markov Chain...")
        self.markov = MarkovChain(order=2)
        self.markov.train(self.data['sentences'])
        print(f"   ✓ {len(self.markov.chain)} transitions learned")

        # Initialize similarity
        print("\n🔍 Building similarity index...")
        self.similarity = EnhancedSimilarityFinder(
            self.data['sentences'],
            self.data['paragraphs'],
            use_transformers=True
        )

        self.basic_ready = True
        print("\n✅ Basic mode ready!")

    def initialize_enhanced(self):
        """Initialize Word2Vec enhanced Markov (medium)"""
        if not self.basic_ready:
            self.initialize_basic()

        print("\n" + "="*60)
        print("🟡 Initializing ENHANCED (Word2Vec + Markov)")
        print("="*60)

        self.word2vec_markov = Word2VecMarkov(order=2)
        self.word2vec_markov.train(self.data['sentences'], self.data['tokenized'])

        self.enhanced_ready = True
        print("\n✅ Enhanced mode ready!")

    def initialize_lstm(self, epochs: int = 20):
        """Initialize and train LSTM model (slower)"""
        if not self.basic_ready:
            self.initialize_basic()

        print("\n" + "="*60)
        print("🟠 Initializing ADVANCED (LSTM Neural Network)")
        print("="*60)

        self.lstm = LSTMTextGenerator()
        self.lstm.train(self.data['cleaned_text'], epochs=epochs)

        self.lstm_ready = True
        print("\n✅ LSTM mode ready!")

    def initialize_gpt2(self, epochs: int = 3):
        """Initialize and fine-tune GPT-2 (slowest, best quality)"""
        if not self.basic_ready:
            self.initialize_basic()

        print("\n" + "="*60)
        print("🔴 Initializing PRO (Fine-tuned GPT-2)")
        print("="*60)

        self.gpt2 = GPT2TextGenerator()
        self.gpt2.train(self.data['cleaned_text'], epochs=epochs)

        self.gpt2_ready = True
        print("\n✅ GPT-2 mode ready!")

    def expand(self, input_sentence: str, method: str = 'enhanced',
               num_sentences: int = 4, temperature: float = 0.8) -> str:
        """
        Expand input sentence into a paragraph.

        Methods:
        - 'basic': Simple Markov Chain
        - 'enhanced': Word2Vec + Markov (recommended)
        - 'lstm': LSTM neural network
        - 'gpt2': Fine-tuned GPT-2 (best quality)
        - 'hybrid': Combines multiple methods
        """
        if method == 'basic':
            return self._expand_basic(input_sentence, num_sentences)
        elif method == 'enhanced':
            return self._expand_enhanced(input_sentence, num_sentences)
        elif method == 'lstm':
            return self._expand_lstm(input_sentence, temperature)
        elif method == 'gpt2':
            return self._expand_gpt2(input_sentence, temperature)
        elif method == 'hybrid':
            return self._expand_hybrid(input_sentence, num_sentences, temperature)
        else:
            return f"Unknown method: {method}"

    def _expand_basic(self, input_sentence: str, num_sentences: int) -> str:
        """Expand using basic Markov Chain"""
        if not self.basic_ready:
            return "Basic mode not initialized. Run initialize_basic() first."

        result = [input_sentence]
        used = {input_sentence.lower()}

        for _ in range(num_sentences - 1):
            seed = result[-1].split()[-3:]
            generated = self.markov.generate(seed, max_words=35)
            if generated and generated.lower() not in used:
                result.append(generated)
                used.add(generated.lower())

        return ' '.join(result)

    def _expand_enhanced(self, input_sentence: str, num_sentences: int) -> str:
        """Expand using Word2Vec enhanced Markov"""
        if not self.enhanced_ready:
            return "Enhanced mode not initialized. Run initialize_enhanced() first."

        result = [input_sentence]
        used = {input_sentence.lower()}

        # Get context from similar sentences
        similar = self.similarity.find_similar_sentences(input_sentence, 3)
        context_words = []
        for sent, _ in similar:
            context_words.extend(sent.split()[:10])

        attempts = 0
        while len(result) < num_sentences and attempts < num_sentences * 5:
            attempts += 1

            if attempts % 2 == 0 and context_words:
                seed = random.sample(context_words, min(3, len(context_words)))
            else:
                seed = result[-1].split()[-3:]

            generated = self.word2vec_markov.generate(seed, max_words=35)

            if generated and len(generated.split()) > 4:
                gen_lower = generated.lower()
                is_dup = any(self._similarity_ratio(gen_lower, u) > 0.6 for u in used)
                if not is_dup:
                    result.append(generated)
                    used.add(gen_lower)

        return ' '.join(result)

    def _expand_lstm(self, input_sentence: str, temperature: float) -> str:
        """Expand using LSTM"""
        if not self.lstm_ready:
            return "LSTM mode not initialized. Run initialize_lstm() first."

        return self.lstm.generate(input_sentence, length=300, temperature=temperature)

    def _expand_gpt2(self, input_sentence: str, temperature: float) -> str:
        """Expand using GPT-2"""
        if not self.gpt2_ready:
            return "GPT-2 mode not initialized. Run initialize_gpt2() first."

        return self.gpt2.generate(input_sentence, max_length=200, temperature=temperature)

    def _expand_hybrid(self, input_sentence: str, num_sentences: int, temperature: float) -> str:
        """Expand using a combination of the initialized methods"""
        result = [input_sentence]
        # Split after sentence-ending punctuation so each sentence keeps its period
        sentence_end = r'(?<=[.!?])\s+'

        # Use enhanced if available (at most two sentences, never exceeding num_sentences)
        if self.enhanced_ready:
            enhanced_result = self._expand_enhanced(input_sentence, num_sentences)
            enhanced_sents = re.split(sentence_end, enhanced_result)
            result.extend(enhanced_sents[1:min(3, num_sentences)])

        # Add from GPT-2 if available
        if self.gpt2_ready and len(result) < num_sentences:
            gpt2_result = self._expand_gpt2(result[-1], temperature)
            # Extract new sentences; the first is skipped on the assumption that it echoes the prompt
            gpt2_sents = re.split(sentence_end, gpt2_result)
            for sent in gpt2_sents[1:]:
                if len(result) >= num_sentences:
                    break
                if sent and len(sent) > 20:
                    result.append(sent.strip())

        # Fill remaining with similarity
        if len(result) < num_sentences and self.similarity:
            similar = self.similarity.find_similar_sentences(input_sentence, num_sentences)
            for sent, _ in similar:
                if len(result) >= num_sentences:
                    break
                if sent not in result:
                    result.append(sent)

        return ' '.join(result)

    def _similarity_ratio(self, s1: str, s2: str) -> float:
        """Word overlap coefficient: shared words / size of the smaller word set"""
        w1 = set(s1.split())
        w2 = set(s2.split())
        if not w1 or not w2:
            return 0.0
        return len(w1 & w2) / min(len(w1), len(w2))

    def get_status(self):
        """Show status of all models"""
        print("\n" + "="*50)
        print("📊 Model Status")
        print("="*50)
        print(f"🟢 Basic (Markov):     {'✅ Ready' if self.basic_ready else '❌ Not initialized'}")
        print(f"🟡 Enhanced (Word2Vec): {'✅ Ready' if self.enhanced_ready else '❌ Not initialized'}")
        print(f"🟠 Advanced (LSTM):     {'✅ Ready' if self.lstm_ready else '❌ Not initialized'}")
        print(f"🔴 Pro (GPT-2):         {'✅ Ready' if self.gpt2_ready else '❌ Not initialized'}")
        print("="*50)

print("\n" + "="*60)
print("✅ All classes defined successfully!")
print("="*60)


✅ All classes defined successfully!
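
The duplicate check in `_expand_enhanced` relies on `_similarity_ratio`, which divides the number of shared words by the size of the smaller word set. A minimal standalone sketch of that calculation follows; the function name `overlap_ratio` and the sample sentences are illustrative only and are not part of the class.

```python
def overlap_ratio(s1: str, s2: str) -> float:
    """Shared words divided by the size of the smaller word set."""
    w1, w2 = set(s1.split()), set(s2.split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / min(len(w1), len(w2))

a = "the castle stood silent"
b = "the castle stood silent in the moonlight"
c = "she opened the letter"

print(overlap_ratio(a, b))  # 1.0: every word of `a` also appears in `b`
print(overlap_ratio(a, c))  # 0.25: only "the" is shared
```

Because the denominator is the smaller set, a short sentence that is fully contained in a longer one scores 1.0 and would be rejected by the `> 0.6` threshold.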


## Step 3: Upload Your Document

Upload your novel or document as a UTF-8 encoded Markdown (`.md`) file.

In [None]:
# Upload your markdown file
print("📤 Please upload your markdown (.md) document:")
uploaded = files.upload()

# Get the uploaded file content (if several files are uploaded, the last one is used)
document_text = ""
filename = ""

for fn, content in uploaded.items():
    filename = fn
    document_text = content.decode('utf-8')
    print(f"\n✅ File '{fn}' uploaded!")
    print(f"   Size: {len(content):,} bytes")
    print(f"   Characters: {len(document_text):,}")
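
The upload cell calls `content.decode('utf-8')`, which raises `UnicodeDecodeError` for files saved in another encoding. One possible way to make that step more tolerant is sketched below. The helper name `decode_document` and the choice of `cp1252` as the fallback encoding are assumptions for illustration, not part of the notebook.

```python
def decode_document(content: bytes) -> str:
    """Decode uploaded bytes, preferring UTF-8 and falling back to cp1252."""
    # "utf-8-sig" reads plain UTF-8 and also strips a leading byte-order mark
    for encoding in ("utf-8-sig", "cp1252"):
        try:
            return content.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD
    return content.decode("utf-8", errors="replace")

sample = "café".encode("cp1252")  # bytes that are not valid UTF-8
print(decode_document(sample))
```

In the upload loop, `document_text = decode_document(content)` could then stand in for the direct `decode` call.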

## Step 4: Initialize Text Expander

Choose which models to initialize based on your needs.

In [None]:
# Create the Text Expander instance
expander = TextExpanderPro(text=document_text)

In [None]:
# 🟢 Initialize Basic Mode (Fast - ~10 seconds)
# Always run this first!
expander.initialize_basic()

In [None]:
# 🟡 Initialize Enhanced Mode (Medium - ~30 seconds)
# Adds Word2Vec for better word selection
expander.initialize_enhanced()

In [None]:
# 🟠 Initialize LSTM Mode (Slower - ~3-5 minutes)
# Trains a neural network on your document
# Adjust epochs: more epochs usually improve quality but lengthen training
expander.initialize_lstm(epochs=20)

In [None]:
# 🔴 Initialize GPT-2 Mode (Slowest - ~10-15 minutes)
# Fine-tunes a pre-trained language model
# Best quality results!
expander.initialize_gpt2(epochs=3)

In [None]:
# Check which models are ready
expander.get_status()


📊 Model Status
🟢 Basic (Markov):     ✅ Ready
🟡 Enhanced (Word2Vec): ❌ Not initialized
🟠 Advanced (LSTM):     ✅ Ready
🔴 Pro (GPT-2):         ❌ Not initialized


## Step 5: Expand Sentences! 🚀

Now you can expand sentences using different methods.

In [None]:
#@title 🖊️ Text Expander Interface { run: "auto", display-mode: "form" }

input_sentence = "name"  #@param {type:"string"}
method = "gpt2"  #@param ["basic", "enhanced", "lstm", "gpt2", "hybrid"]
num_sentences = 6  #@param {type:"slider", min:2, max:8, step:1}
temperature = 0.7  #@param {type:"slider", min:0.5, max:1.5, step:0.1}

print(f"📝 Input: {input_sentence}")
print(f"⚙️  Method: {method}")
print(f"🌡️  Temperature: {temperature}")
print("\n" + "="*70)

result = expander.expand(
    input_sentence,
    method=method,
    num_sentences=num_sentences,
    temperature=temperature
)

print("\n📄 OUTPUT:")
print("-"*70)
print(textwrap.fill(result, width=70))
print("-"*70)

## 🔄 Quick Functions

Use these for fast text expansion.

In [None]:
def expand(sentence, method='enhanced', n=4, temp=0.8):
    """Quick expansion function"""
    result = expander.expand(sentence, method=method, num_sentences=n, temperature=temp)
    print(f"\n📝 Input: {sentence}")
    print(f"⚙️  Method: {method}")
    print("\n" + "-"*70)
    print(textwrap.fill(result, width=70))
    print("-"*70)
    return result

def compare_methods(sentence):
    """Compare all available methods"""
    print(f"\n📝 Input: {sentence}")
    print("\n" + "="*70)

    methods = []
    if expander.basic_ready:
        methods.append(('basic', '🟢 Basic'))
    if expander.enhanced_ready:
        methods.append(('enhanced', '🟡 Enhanced'))
    if expander.lstm_ready:
        methods.append(('lstm', '🟠 LSTM'))
    if expander.gpt2_ready:
        methods.append(('gpt2', '🔴 GPT-2'))

    for method, label in methods:
        print(f"\n{label}:")
        print("-"*70)
        result = expander.expand(sentence, method=method, num_sentences=3)
        print(textwrap.fill(result, width=70))

    print("\n" + "="*70)

print("✅ Quick functions defined!")
print("\nUsage:")
print('  expand("Your sentence here")')
print('  expand("Your sentence", method="gpt2", temp=0.9)')
print('  compare_methods("Your sentence here")')

In [None]:
# Try it out!
expand("The ancient castle stood silent in the moonlight.")

In [None]:
# Compare all methods
compare_methods("She opened the mysterious letter with trembling hands.")

---

## 📚 Method Comparison

| Method | Quality | Speed | Best For |
|--------|---------|-------|----------|
| 🟢 Basic | ⭐⭐ | ⚡⚡⚡ | Quick testing |
| 🟡 Enhanced | ⭐⭐⭐ | ⚡⚡ | Daily use |
| 🟠 LSTM | ⭐⭐⭐⭐ | ⚡ | Creative generation |
| 🔴 GPT-2 | ⭐⭐⭐⭐⭐ | ⚡ | Best quality |
| 🔵 Hybrid | ⭐⭐⭐⭐ | ⚡ | Balanced output |
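
The Quality column above implies a preference order among the four single-model methods. As a rough sketch, a helper could pick the highest-rated method that has actually been initialized. The name `best_available_method` and the dictionary of flags are hypothetical; hybrid is left out because it draws on the other models.

```python
# Preference order taken from the Quality column of the table above
QUALITY_ORDER = ["gpt2", "lstm", "enhanced", "basic"]

def best_available_method(ready: dict) -> str:
    """Return the highest-quality method whose model is initialized."""
    for method in QUALITY_ORDER:
        if ready.get(method, False):
            return method
    raise RuntimeError("No model initialized; run initialize_basic() first.")

# Example flags for a session where only Basic and LSTM are ready
status = {"basic": True, "enhanced": False, "lstm": True, "gpt2": False}
print(best_available_method(status))  # lstm
```

In the notebook, the flags could be read from `expander.basic_ready`, `expander.enhanced_ready`, `expander.lstm_ready`, and `expander.gpt2_ready`.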

### Tips:
- **Temperature**: Lower (0.5-0.7) = more focused, Higher (0.9-1.2) = more creative
- **Start with Enhanced** - It provides good quality with reasonable speed
- **Use GPT-2** for final/production quality output
- **Longer documents = better results** - More training data helps all methods
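
To illustrate the temperature tip, the sketch below shows the usual way temperature is applied when sampling: scores are divided by the temperature before a softmax, so low values sharpen the distribution and high values flatten it. Whether `LSTMTextGenerator` and `GPT2TextGenerator` do exactly this is an assumption, and the scores and function names here are made up for the example.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng=random):
    """Draw one index according to the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate words
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lowest temperature the top-scoring candidate takes most of the probability mass; at the highest, the three candidates end up much closer together.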

---