# OriGen Basic Usage Examples

This notebook demonstrates how to use OriGen to generate plasmid replicon sequences. We'll cover:
1. Loading a pre-trained model
2. Generating new replicon sequences
3. Analyzing sequence similarity

First, let's import the required packages:

In [None]:
import torch
import os 

from transformers import AutoModelForCausalLM, AutoTokenizer
from origen.generate import generate_sequences
from origen.alignment import compute_sequence_similarity
from origen.random_mutants import generate_matched_random_mutant
from origen.tokenizers import extract_oriv

os.environ["TOKENIZERS_PARALLELISM"] = "false"

## Loading a Pre-trained Model

OriGen models are available on the HuggingFace Hub. Here's how to load a pre-trained model:

In [None]:
# Load tokenizer and model from HuggingFace
model_name = "jirvine/origen-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

## Generating New Sequences

You can generate sequences in four ways:
1. Unprompted generation
2. Host-specific generation
3. Rep-specific generation
4. Partial sequence generation

### Unprompted Generation

First, let's generate some sequences without any conditioning:

In [None]:
# Generate sequences with no conditioning
torch.manual_seed(1)
sequences = generate_sequences(
    model=model,
    tokenizer=tokenizer,
    num_sequences=5,
    max_length=1000,
    temperature=1.0,
    top_p=0.95,
    device=device
)

print(f"Generated {len(sequences)} sequences")
print(f"\nExample sequence:\n{sequences[0]}")

### Host-specific Generation

To generate sequences for a specific bacterial host, include the species name in square brackets as a prompt:

In [None]:
# Generate E. coli specific sequences
torch.manual_seed(1)
host_replicon_sequences = generate_sequences(
    model=model,
    tokenizer=tokenizer,
    num_sequences=5,
    prompt="[Escherichia coli]",  # Note the square brackets
    max_length=1000,
    temperature=1.0,
    top_k=4,
    device=device
)

print(f"Generated {len(host_replicon_sequences)} E. coli sequences")
print(f"\nExample sequence:\n{host_replicon_sequences[0]}")
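
The generation calls above pass `temperature`, `top_p`, and `top_k`. As a rough illustration of what these sampling parameters usually mean in language-model decoding, here is a minimal sketch on a toy next-token distribution. The function names are hypothetical and this is not OriGen's code; whether `generate_sequences` applies the parameters in exactly this way is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


# Toy distribution over five hypothetical tokens (indices 0..4)
toy_probs = [0.5, 0.3, 0.15, 0.04, 0.01]
print("top_k=2  keeps tokens:", sorted(top_k_filter(toy_probs, k=2)))
print("top_p=0.9 keeps tokens:", sorted(top_p_filter(toy_probs, p=0.9)))
```

With `top_k=4` and a nucleotide vocabulary, the cut-off may have little effect once the model is emitting DNA, while it constrains the protein portion of a sequence more strongly. Treat that as a guess about the tokenizer rather than a documented property.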

### Rep-specific Generation

To generate origins compatible with a specific Rep protein, include both the host species and Rep protein sequence in the prompt:

In [None]:
# Example Rep3 protein sequence from the paper
rep3_protein_sequence = "MTSNPLIAYKSNALVEASYKLTLQEQRFLLLCISRLNSGTDVASPELQKTMTITAAEYFDSFPDMGRKNAEVQLQEAIDRLWDRSIILKDDEKREEFRWIQYRAQYARGEGKAQITFSDAVMPYLTQLKGQFTRVVIKNISNLSRSYSIRIYEILQQFRSTGERIIALDDFKSSLMLDGKYKDFKTLNRDLIKPCVDELNKKSDLAVTVETIKKGRTVVALHFRFKEDKQIKMTI"
prompt = f"[Escherichia coli]{rep3_protein_sequence}"

# Generate full sequences compatible with this Rep
torch.manual_seed(1)
rep3_replicon_sequences = generate_sequences(
    model=model,
    tokenizer=tokenizer,
    num_sequences=5,
    prompt=prompt,
    max_length=1000,
    temperature=1.0,
    top_k=4,
    device=device
)
rep3_oriv_sequences = [extract_oriv(seq) for seq in rep3_replicon_sequences]

print(f"Generated {len(rep3_oriv_sequences)} oriv sequences for Rep3")
print(f"\nExample sequence:\n{rep3_oriv_sequences[0]}")

### Partial Sequence Generation

You can also generate completions from a partial sequence. This is useful for exploring variations of known origins:

In [None]:
# Generate variations of a ColE1-type origin
cole1_seed_sequence = "aggatcttcttgagatccca"

torch.manual_seed(1)
cole1_replicon_sequences = generate_sequences(
    model=model,
    tokenizer=tokenizer,
    num_sequences=5,
    prompt=f"[Escherichia coli]{cole1_seed_sequence}",
    max_length=1000,
    temperature=1.0,
    top_k=4,
    device=device
)
cole1_oriv_sequences = [extract_oriv(seq) for seq in cole1_replicon_sequences]

print(f"Generated {len(cole1_oriv_sequences)} ColE1 orivs")
print(f"\nExample oriv:\n{cole1_oriv_sequences[0]}")
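
The prompted examples above use three prompt shapes: host only, host plus Rep protein, and host plus a DNA seed. A small helper can keep the bracket convention consistent. `build_prompt` is a hypothetical function for illustration, not part of the `origen` package, and it only covers the prompt shapes shown in this notebook.

```python
from typing import Optional


def build_prompt(
    host: str,
    rep_protein: Optional[str] = None,
    dna_seed: Optional[str] = None,
) -> str:
    """Assemble a prompt string in the '[Host species]<sequence>' form used above.

    Hypothetical helper: it does no biological validation and leaves the case
    of the sequences untouched.
    """
    host = host.strip()
    if not host or "[" in host or "]" in host:
        raise ValueError("host must be a non-empty species name without brackets")
    if rep_protein and dna_seed:
        # This notebook never combines the two, so the sketch refuses to guess.
        raise ValueError("pass either rep_protein or dna_seed, not both")
    return f"[{host}]{rep_protein or dna_seed or ''}"


print(build_prompt("Escherichia coli"))
print(build_prompt("Escherichia coli", dna_seed="aggatcttcttgagatccca"))
```

The resulting string can be passed as the `prompt` argument in the calls above.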

## Analyzing Sequence Similarity

To analyze how novel your generated sequences are, you can compare them to known wild-type sequences using global alignment:

In [None]:
# Example wild-type ColE1 sequence
wild_type = "AGGATCTTCTTGAGATCCCATTTGGATCGTCGTAATCTCTTGCTCTGTAAACGAAAAAACCGCCTTGGCGGGCGGTTTTTTCGAAGGTTCGAGGAGTTGGCGCTCTTTGAACCGAGGTAACTGGCTTGGAGGAGCGCAGTAACCAAATTCGTTCTTTCAGTTTAGCCTTAACTGGCACATAACTTCAAGACTAACTCCTCTAAATCAGTTACCAGTGGCTGCTGCCAGTGGCGCTTTTGCATGCCTTTCCGGGTTGGACTCAAGATGACAGTTACCGGATAAGGCGCAGCAGTCGGACTGAACGGGGGGTTCGTGCATACAGTCCAGCTTGGAGCGAACTGCCTACCCGGAACTGAGTGTCAGGCGTGGAATGAGACAAACGCGGCCATAACAGCGGAATGACACCGGTAAACCGAATGGCAGGAACAGGAGAGCGCACGAGGGAGCCATCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGGTTCGCCACCACTGATTTGAGCGTCAAATTCTGTGATGCTTGTCAGGGGGGCGGAGCCTATGGAAAAACGGCCGCTGGGCGGCCTCCTCTTTTCCGCCTCCCTTGCTCGCTCGGTTTTCTCGAGCTTTTATAAGAACGGTCTTGCCGCTCGCCGCAGCCGAACGACCGGAGCGTAGCGACTGAGTGAGCGAGGAAGCGGAAAAGAGACTGGTTTGACACTGAGCACTGACGCTCTGAGGCCTCTT"

In [None]:
# Keep only sequences within 20 bases of the wild-type length; the similarity
# metric is less meaningful when the lengths are mismatched
wt_seq_len = len(wild_type)
cole1_oriv_sequences_filtered = [seq for seq in cole1_oriv_sequences if wt_seq_len - 20 <= len(seq) <= wt_seq_len + 20]
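
As background for the length filter above, the following sketch shows why a global-alignment identity drops when two sequences differ in length: every unmatched base becomes a gap column. It is a plain Needleman-Wunsch implementation with assumed scoring (match +1, mismatch -1, gap -1). `global_alignment_identity` is a hypothetical name, and OriGen's `compute_sequence_similarity` may use different scoring or a different definition of similarity.

```python
def global_alignment_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Return matches / alignment columns for one optimal global alignment."""
    n, m = len(a), len(b)

    # Fill the dynamic-programming score matrix
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner, counting matches and columns
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return matches / columns if columns else 1.0


print(global_alignment_identity("ACGT", "ACGA"))      # one substitution
print(global_alignment_identity("ACGTACGT", "ACGT"))  # length mismatch adds gaps
```

This quadratic-memory version is fine for short examples. For sequences near 1 kb it still runs, but a dedicated alignment library would be faster.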

In [None]:
# Calculate similarity to wild-type for the length-filtered sequences
for i, seq in enumerate(cole1_oriv_sequences_filtered):
    similarity = compute_sequence_similarity(seq.upper(), wild_type)
    print(f"Generation {i+1} similarity to wild-type: {100 * similarity:.1f}%")

# Create matched random mutants as controls
print()
for i, seq in enumerate(cole1_oriv_sequences_filtered):
    random_mutant = generate_matched_random_mutant(seq.upper(), wild_type)
    similarity = compute_sequence_similarity(random_mutant, wild_type)
    print(f"Random mutant {i+1} similarity to wild-type: {100 * similarity:.1f}%")
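
To give a feel for what a matched random control is, here is a minimal sketch that mutates a wild-type sequence at randomly chosen positions until it reaches a target identity. This notebook does not show how `generate_matched_random_mutant` works internally, so the sketch is only one plausible reading: it assumes substitutions only, and it measures identity position by position rather than by alignment. `random_substitution_mutant` is a hypothetical name.

```python
import random


def random_substitution_mutant(wild_type, target_identity, rng, alphabet="ACGT"):
    """Substitute random positions so the unchanged fraction is about target_identity.

    Assumption: substitutions only (no insertions or deletions), so the mutant
    has the same length as the wild-type sequence.
    """
    if not 0.0 <= target_identity <= 1.0:
        raise ValueError("target_identity must be between 0 and 1")
    n_mutations = round(len(wild_type) * (1.0 - target_identity))
    positions = rng.sample(range(len(wild_type)), n_mutations)
    mutant = list(wild_type)
    for pos in positions:
        # Always pick a base that differs from the original one
        choices = [base for base in alphabet if base != mutant[pos]]
        mutant[pos] = rng.choice(choices)
    return "".join(mutant)


rng = random.Random(1)  # seeded for reproducibility
toy_wild_type = "ACGT" * 25  # 100-base toy sequence, not a real origin
mutant = random_substitution_mutant(toy_wild_type, target_identity=0.9, rng=rng)
n_changed = sum(x != y for x, y in zip(toy_wild_type, mutant))
print(f"{n_changed} of {len(toy_wild_type)} positions changed")
```

If generated sequences score higher in functional assays than random mutants at the same distance from wild-type, that suggests the model's changes are not arbitrary. The similarity numbers alone cannot show this.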