# Markov Chain Text Generator

This notebook demonstrates how to build a simple text generator using Markov chains. A Markov chain is a stochastic model in which the next state depends only on the current state. Here, the state is the last few words of the text, and the next word is drawn from the words that followed those words in the training text, so the output mimics the style of the source.
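
To make the definition concrete, here is a minimal sketch of a first-order chain built from a toy sentence. The sentence and the name `transitions` are illustrative assumptions and are not part of the notebook's model.

```python
from collections import defaultdict

# Toy corpus (illustrative only)
tokens = "the cat sat on the mat and the cat slept".split()

# Map each word to the list of words that followed it
transitions = defaultdict(list)
for current_word, next_word in zip(tokens, tokens[1:]):
    transitions[current_word].append(next_word)

for word, followers in transitions.items():
    print(f"{word!r} -> {followers}")
```

Generating text then amounts to repeatedly looking up the current word and picking one of its recorded followers at random.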

## Step 1: Import Required Libraries

In [1]:
import random
from collections import defaultdict
import re

## Step 2: Create the Markov Chain Class

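The `MarkovChain` class records, for every sequence of `order` consecutive tokens in the training text, which tokens followed it. It then generates new text by repeatedly sampling from those recorded followers.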

In [4]:
class MarkovChain:
    def __init__(self, order=2):
        """
        Initialize the Markov Chain text generator
        
        Parameters:
        order (int): The number of words to use as context for prediction (default: 2)
        """
        self.order = order
        self.chain = defaultdict(list)
        self.start_words = []
        
    def train(self, text):
        """
        Train the Markov chain model on the given text
        
        Parameters:
        text (str): The training text
        """
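        # Lowercase and tokenize: words and the marks . , ! ? ; become separate tokens
        # (an apostrophe splits a word, e.g. "it's" -> "it", "s")
        words = re.findall(r'\b\w+\b|[.,!?;]', text.lower())
        
        # Store the first word of each sentence (lowercased) as a potential starting word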
        sentences = re.split(r'[.!?]+', text)
        for sentence in sentences:
            sentence = sentence.strip()
            if sentence:
                first_words = re.findall(r'\b\w+\b', sentence)
                if first_words:
                    self.start_words.append(first_words[0].lower())
        
        # Build the Markov chain
        for i in range(len(words) - self.order):
            # Create a tuple of 'order' words as the state
            state = tuple(words[i:i + self.order])
            next_word = words[i + self.order]
            self.chain[state].append(next_word)
    
    def generate(self, length=50, seed=None):
        """
        Generate text using the trained Markov chain
        
        Parameters:
        length (int): Approximate number of tokens (words and punctuation marks) to generate (default: 50)
        seed (str): Optional seed text; its last `order` words become the starting state
        
        Returns:
        str: Generated text
        """
        if not self.chain:
            return "Error: Model not trained yet!"
        
        # Choose starting state
        if seed:
            words = re.findall(r'\b\w+\b', seed.lower())
            if len(words) >= self.order:
                current = tuple(words[-self.order:])
            else:
                current = random.choice(list(self.chain.keys()))
        else:
            # Prefer a state whose first word opened a sentence in the training text
            if self.start_words:
                start = random.choice(self.start_words)
                possible_states = [state for state in self.chain.keys() if state[0] == start]
                if possible_states:
                    current = random.choice(possible_states)
                else:
                    current = random.choice(list(self.chain.keys()))
            else:
                current = random.choice(list(self.chain.keys()))
        
        result = list(current)
        
        # Generate text
        for _ in range(length - self.order):
            if current in self.chain and self.chain[current]:
                next_word = random.choice(self.chain[current])
                result.append(next_word)
                # Update current state by shifting window
                current = tuple(result[-self.order:])
            else:
                # Dead end (no recorded successor): restart from a random state.
                # This appends `order` tokens at once, so the output can exceed `length`.
                current = random.choice(list(self.chain.keys()))
                result.extend(current)
        
        # Format output
        text = ' '.join(result)
        # Clean up spacing around punctuation
        text = re.sub(r'\s+([.,!?;])', r'\1', text)
        # Capitalize first letter
        text = text[0].upper() + text[1:] if text else text
        
        return text
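
One detail of `train` is worth spelling out: successors are appended to a list without deduplication, so `random.choice` in `generate` picks each next word in proportion to how often it followed the state. The sketch below shows the probabilities implied by such a list. The successor list is a made-up example, and `collections.Counter` is used only for illustration; the class itself does not use it.

```python
from collections import Counter

# Hypothetical successor list for one state, as train() would store it
successors = ["cat", "mat", "cat", "dog"]

# Empirical transition probabilities implied by the duplicates
counts = Counter(successors)
total = sum(counts.values())
probabilities = {word: count / total for word, count in counts.items()}

print(probabilities)
```

Storing duplicates keeps the code short at the cost of memory; a table of counts would encode the same distribution more compactly.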

## Step 3: Prepare Training Data

Let's use some sample text to train our model. You can replace this with any text you like!

In [5]:
# Sample training text - feel free to replace with your own!
training_text = """
Artificial intelligence is transforming the world in remarkable ways. Machine learning algorithms 
can now recognize patterns in data that humans might miss. Deep learning models have achieved 
breakthroughs in computer vision, natural language processing, and game playing. Neural networks 
are inspired by the structure of the human brain, with layers of interconnected nodes processing 
information. The future of AI holds tremendous potential for solving complex problems. However, 
it also raises important questions about ethics, privacy, and the role of humans in an 
increasingly automated world. Researchers continue to push the boundaries of what machines can do, 
while also working to ensure AI systems are safe, fair, and beneficial for all. The development 
of artificial intelligence requires collaboration across disciplines, combining computer science, 
mathematics, neuroscience, and philosophy. As AI systems become more sophisticated, they will 
likely play an even greater role in our daily lives, from healthcare to education to entertainment. 
The key is to harness this powerful technology responsibly and ensure it serves humanity's best interests.
"""

print("Training text loaded!")
print(f"Number of words: {len(training_text.split())}")

Training text loaded!
Number of words: 166
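
Note that `training_text.split()` counts whitespace-separated words, whereas `train` tokenizes with a regular expression that emits punctuation marks as separate tokens. A quick sketch of the difference, using a made-up sentence as an illustrative assumption:

```python
import re

sample = "Hello, world. It's fine."

# Whitespace splitting keeps punctuation attached to words
whitespace_words = sample.split()

# Same pattern that MarkovChain.train applies to the lowercased text
model_tokens = re.findall(r'\b\w+\b|[.,!?;]', sample.lower())

print(len(whitespace_words), whitespace_words)
print(len(model_tokens), model_tokens)
```

The model therefore sees more tokens than the word count suggests, and contractions and possessives are split at the apostrophe.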


## Step 4: Train the Model

Create and train an order-2 Markov chain model on the sample text. Other orders are compared in Step 7.

In [6]:
# Create a Markov chain with order 2 (considers 2 words of context)
markov = MarkovChain(order=2)

# Train the model
markov.train(training_text)

print("Model trained successfully!")
print(f"Number of states in chain: {len(markov.chain)}")

Model trained successfully!
Number of states in chain: 183


## Step 5: Generate Text

Now let's generate some text! Try running this cell multiple times to see different outputs.

In [7]:
# Generate text
generated_text = markov.generate(length=60)
print("Generated Text:")
print("-" * 50)
print(generated_text)

Generated Text:
--------------------------------------------------
However, it also raises important questions about ethics, privacy, and game playing. neural networks are inspired by the structure of the human brain, with layers of interconnected nodes processing information. the development of artificial intelligence is transforming the world in remarkable ways. machine learning algorithms can now recognize patterns in data that humans
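
If you want the same output on every run, for example while debugging, you can seed Python's random number generator before calling `generate`. This is separate from the `seed` parameter of `generate`, which is seed text. A minimal sketch of the idea, using a made-up successor list rather than the trained model:

```python
import random

# Hypothetical successor list (illustrative only)
successors = ["cat", "mat", "dog"]

random.seed(42)
first_run = [random.choice(successors) for _ in range(5)]

random.seed(42)  # resetting the seed replays the same sequence of choices
second_run = [random.choice(successors) for _ in range(5)]

print(first_run == second_run)  # True
```

In the notebook, calling `random.seed(...)` immediately before `markov.generate(...)` should have the same effect, since the class uses the module-level `random.choice`.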


## Step 6: Generate with Seed Text

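You can also pass seed text. Only its last `order` words are used, as the starting state; if the seed has fewer than `order` words, generation starts from a random state instead. A seed whose last words never appeared together in the training text is kept in the output, but the generator then immediately continues from a random state.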

In [8]:
# Generate text with a seed phrase
seed_phrase = "artificial intelligence"
seeded_text = markov.generate(length=50, seed=seed_phrase)
print(f"Generated Text (starting with '{seed_phrase}'):")
print("-" * 50)
print(seeded_text)

Generated Text (starting with 'artificial intelligence'):
--------------------------------------------------
Artificial intelligence requires collaboration across disciplines, combining computer science, mathematics, neuroscience, and the role of humans in an increasingly automated world. researchers continue to push the boundaries of what machines can do, while also working to ensure ai systems become more sophisticated, they
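
The way seed text is turned into a starting state can be sketched in isolation. `seed_to_state` below is a hypothetical helper that mirrors the seed handling in `generate`; it is not a method of the class.

```python
import re

def seed_to_state(seed, order=2):
    """Hypothetical helper mirroring how generate() turns seed text into a state."""
    words = re.findall(r'\b\w+\b', seed.lower())
    if len(words) >= order:
        # Only the last `order` words matter
        return tuple(words[-order:])
    # Too short: generate() falls back to a random state in this case
    return None

print(seed_to_state("The future of artificial intelligence"))  # ('artificial', 'intelligence')
print(seed_to_state("AI"))  # None
```

A longer seed therefore does not give the model more context; everything before the last `order` words is ignored.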


## Step 7: Compare Different Orders

Let's train chains of order 1, 2, and 3 on the same text and compare how coherent the generated text is.

In [9]:
# Compare different orders
print("Comparing Markov Chains with different orders:\n")

for order in [1, 2, 3]:
    print(f"\n{'='*60}")
    print(f"ORDER {order} (looks at {order} word{'s' if order > 1 else ''} of context)")
    print('='*60)
    
    mc = MarkovChain(order=order)
    mc.train(training_text)
    text = mc.generate(length=40)
    print(text)

print("\n" + "="*60)
print("Note: Higher order = more coherent but less creative")
print("      Lower order = more random but potentially creative")

Comparing Markov Chains with different orders:


ORDER 1 (looks at 1 word of context)
However, while also working to harness this powerful technology responsibly and game playing. the role in our daily lives, and the role in computer science, with layers of interconnected nodes processing, fair, and beneficial

ORDER 2 (looks at 2 words of context)
However, it also raises important questions about ethics, privacy, and game playing. neural networks are inspired by the structure of the human brain, with layers of interconnected nodes processing information. the future of ai

ORDER 3 (looks at 3 words of context)
Researchers continue to push the boundaries of what machines can do, while also working to ensure ai systems are safe, fair, and beneficial for all. the development of artificial intelligence requires collaboration across disciplines, combining

Note: Higher order = more coherent but less creative
      Lower order = more random but potentially creative
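
One way to quantify the trade-off in the note above is to measure how many states leave the generator no choice, that is, states that were only ever followed by one distinct token. The sketch below uses a toy text and two hypothetical helpers, `build_chain` (a standalone copy of the loop in `train`) and `single_successor_fraction`; neither is part of the notebook's class.

```python
import re
from collections import defaultdict

def build_chain(text, order=2):
    """Hypothetical standalone version of the training loop in MarkovChain.train."""
    words = re.findall(r'\b\w+\b|[.,!?;]', text.lower())
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def single_successor_fraction(chain):
    """Fraction of states whose recorded successors are all the same token."""
    forced = sum(1 for successors in chain.values() if len(set(successors)) == 1)
    return forced / len(chain)

# Toy text (illustrative only)
toy_text = "The cat sat. The cat ran. The dog sat."

fractions = {}
for order in [1, 2, 3]:
    chain = build_chain(toy_text, order)
    fractions[order] = single_successor_fraction(chain)
    print(f"order {order}: {len(chain)} states, {fractions[order]:.0%} with a single successor")
```

On this toy text the fraction rises with the order and reaches 100% at order 3, at which point the generator can only replay the training text. With a small corpus, higher orders tend toward the same behavior.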


## Experiment: Use Your Own Text!

Replace the training text below with your own content (books, articles, poems, etc.) and see what the model generates!

In [10]:
# Try your own text here!
your_text = """
Replace this text with your own training data. The more text you provide,
the better the results will be. You can paste entire articles, stories,
or any text corpus you want the generator to mimic.
"""

# Train and generate
custom_markov = MarkovChain(order=2)
custom_markov.train(your_text)
custom_output = custom_markov.generate(length=50)

print("Your Generated Text:")
print("-" * 50)
print(custom_output)

Your Generated Text:
--------------------------------------------------
You want the generator to mimic. more text you provide, the better the results will be. you can paste entire articles, stories, or any text corpus you want the generator to mimic., stories, or any text corpus you want the generator to mimic.
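
The odd `mimic., stories,` in the output above is a dead end at work: the last `order` tokens of the training text (`mimic`, `.`) have no recorded successor, so `generate` restarts from a random state and appends it. A sketch for detecting this situation, using a hypothetical helper `find_dead_end` that is not part of the class:

```python
import re

def find_dead_end(text, order=2):
    """Return the final state of the text if it never occurs earlier, else None."""
    words = re.findall(r'\b\w+\b|[.,!?;]', text.lower())
    if len(words) < order:
        return None
    final_state = tuple(words[-order:])
    # States that train() would record; each of these has at least one successor
    recorded_states = {tuple(words[i:i + order]) for i in range(len(words) - order)}
    return None if final_state in recorded_states else final_state

print(find_dead_end("You can paste stories. You can mimic."))  # ('mimic', '.')
print(find_dead_end("a b a b"))  # None
```

Short training texts hit this case often, because the generator reaches the end of the text after only a few steps. Providing more text, or lowering the order, makes dead ends rarer.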
