# Understanding difflib.SequenceMatcher

This notebook demonstrates how `difflib.SequenceMatcher` works, using the abstracts from `old_file.tex` and `new_file.tex`.

In [27]:
import difflib

## 1. Load the abstracts from both files

In [28]:
old_abstract = r"""\def\mytitle{Navigating Digital Innovation in Asset-Intensive Industries: A Process Model Informed by Design Science}
\def\myabstract{Companies in asset-intensive industries, such as aviation and railways, face unique digital transformation challenges due to the misalignment between the rapid evolution of digital technologies and decades-long asset lifecycles. Existing innovation frameworks are inadequate for managing this complexity, which in turn creates tensions between innovation requirements and operational reliability demands. This paper therefore investigates how asset-intensive companies can systematically integrate digital innovations, while fully complying with regulatory constraints and safety requirements. We employ a design science approach in a study of Nederlandse Spoorwegen (NS), the Dutch national railway operator, focusing specifically on the implementation of AI-driven CCTV systems within the operations of NS. Drawing on a literature review and participant-observer as well as interview data, we develop six design propositions that address the key digital innovation challenges of asset-intensive companies in the area of market readiness assessment, modular architecture, regulatory compliance, temporal coordination, ecosystem governance, and organizational capability development. Using these design propositions, we develop the Iterative Development \& Adoption Model (IDAM) that operationalizes market maturity assessment through market readiness levels to guide make-or-buy transitions across four iterative phases: ideate, assess, realise, and review. This model includes a Development Reference Architecture for emerging technologies and an Integration Reference Architecture for more mature technologies, enabling concurrent sourcing strategies based on technological maturity. IDAM provides actionable guidance for decisions about technology adoption in asset-intensive contexts, thereby offering a systematic approach to innovation management in industries with very long asset lifecycles and huge regulatory constraints.}
\def\mykeywords{design science; engineering design; technology adoption; digital transformation; product development; asset lifecycle; modular architecture; market readiness; operational reliability}

% asdasdasdasdasdadasdlkas


1234
what is a girl to do
hello world
"""

new_abstract = r"""\def\mytitle{Navigating Digital Transformation in Asset-Intensive Companies: A Process Model Informed by Design Science}
\def\myabstract{Companies in \textbf{asset-intensive industries}, such as aviation and railways, face unique digital transformation challenges due to the misalignment between the rapid evolution of digital technologies and decades-long asset lifecycles. Existing innovation frameworks are inadequate for managing this complexity, which in turn creates tensions between innovation requirements and operational reliability demands. This paper therefore investigates how asset-intensive companies can systematically integrate digital technologies, while fully complying with regulatory constraints and safety requirements. We employ a design science approach in a study of Nederlandse Spoorwegen (NS), the Dutch national railway operator, focusing specifically on the implementation of AI-driven CCTV systems within the operations of NS. Drawing on a literature review and participant-observer as well as interview data, we develop six design propositions that address the key digital transformation challenges of asset-intensive companies in the area of market readiness assessment, modular architecture, regulatory compliance, temporal coordination, ecosystem governance, and organizational capability development. Using these design propositions, we develop the Iterative Development \& Adoption Model (IDAM) that operationalizes market maturity assessment through market readiness levels to guide make-or-buy transitions across four iterative phases: ideate, assess, realise, and review. This model includes a Development Reference Architecture for emerging technologies and an Integration Reference Architecture for more mature technologies, enabling concurrent sourcing strategies based on technological maturity. IDAM provides actionable guidance for decisions about technology adoption in asset-intensive contexts, thereby offering a systematic approach to innovation management in industries with very long asset lifecycles and huge regulatory constraints.}
\def\mykeywords{design science; engineering design; technology adoption; digital transformation; product development; asset lifecycle; modular architecture; market readiness; operational reliability}

% asdasdasdasdasdadasdlkas


1234

hello world
what is a girl to do
"""

print(f"Old abstract length: {len(old_abstract)} characters")
print(f"New abstract length: {len(new_abstract)} characters")

Old abstract length: 2336 characters
New abstract length: 2354 characters


## 2. Create a SequenceMatcher object

`SequenceMatcher` compares two sequences (strings, lists, etc.) and finds matching blocks.

**Key parameters:**
- `isjunk`: A function that flags 'junk' elements to ignore when matching (we pass `None`, so no element is declared junk by the caller)
- `a`: The first sequence (old text)
- `b`: The second sequence (new text)
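
The constructor also takes an `autojunk` keyword (default `True`). When the second sequence has 200 or more elements, any element that occurs more often than about 1% of its length is treated as "popular" and is not used to seed matches, even when `isjunk` is `None`. The sketch below shows this through the `bpopular` attribute; the sample strings are made up for illustration and are not taken from the abstracts.

```python
import difflib

# Hypothetical sample: 300 characters in which 'a' and 'b' each occur 150 times
b = "ab" * 150
a = "ab" * 10

with_heuristic = difflib.SequenceMatcher(None, a, b)  # autojunk=True by default
without_heuristic = difflib.SequenceMatcher(None, a, b, autojunk=False)

# Elements of b that the heuristic set aside as "popular"
print(sorted(with_heuristic.bpopular))
print(sorted(without_heuristic.bpopular))
```

Since both abstracts are well over 200 characters, character-level comparisons in this notebook run with the heuristic active unless `autojunk=False` is passed.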

In [29]:
# Create a SequenceMatcher to compare the two abstracts
matcher = difflib.SequenceMatcher(None, old_abstract, new_abstract)

print("SequenceMatcher created successfully!")
print(f"Type: {type(matcher)}")

SequenceMatcher created successfully!
Type: <class 'difflib.SequenceMatcher'>


## 3. Get the similarity ratio

The `ratio()` method returns a measure of similarity as a float in [0, 1].
- 1.0 means the sequences are identical
- 0.0 means they have nothing in common

**Formula:** `2.0 * M / T`
- M = number of matched elements (characters, when comparing strings)
- T = total number of elements in both sequences
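
As a small check of the formula, the sketch below recomputes the ratio by hand from the matching blocks for two short strings (chosen for illustration, not taken from the abstracts).

```python
import difflib

a, b = "abcd", "bcde"
m = difflib.SequenceMatcher(None, a, b)

# M: total size of all matching blocks (the final sentinel block has size 0)
M = sum(block.size for block in m.get_matching_blocks())
# T: combined length of both sequences
T = len(a) + len(b)

manual_ratio = 2.0 * M / T
print(M, T, manual_ratio)  # 3 8 0.75
print(m.ratio())           # 0.75
```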

In [30]:
ratio = matcher.ratio()
print(f"Similarity ratio: {ratio:.4f}")
print(f"Similarity percentage: {ratio * 100:.2f}%")

Similarity ratio: 0.9723
Similarity percentage: 97.23%


## 4. Get matching blocks

`get_matching_blocks()` returns a list of triples `(i, j, n)` where:
- `i`: start index in sequence a (old text)
- `j`: start index in sequence b (new text)  
- `n`: length of the matching block

This means: `a[i:i+n] == b[j:j+n]`
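
The invariant can be checked directly on a tiny input. This sketch uses two short illustrative strings and also shows the zero-length sentinel block that always ends the list.

```python
import difflib

a, b = "abxcd", "abcd"
blocks = difflib.SequenceMatcher(None, a, b).get_matching_blocks()

for i, j, n in blocks:
    # Every block, including the zero-length sentinel, satisfies the invariant
    assert a[i:i + n] == b[j:j + n]
    print(i, j, n, repr(a[i:i + n]))

# The last block is always (len(a), len(b), 0)
assert tuple(blocks[-1]) == (len(a), len(b), 0)
```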

In [31]:
matching_blocks = matcher.get_matching_blocks()

print(f"Number of matching blocks: {len(matching_blocks)}\n")
print("First 10 matching blocks:")
for i, (a_idx, b_idx, size) in enumerate(matching_blocks[:10]):
    if size > 0:  # Skip the final dummy block
        matching_text = old_abstract[a_idx:a_idx+size]
        print(f"Block {i}: pos_old={a_idx}, pos_new={b_idx}, length={size}")
        print(f"  Text: '{matching_text[:50]}...'" if len(matching_text) > 50 else f"  Text: '{matching_text}'")
        print()

Number of matching blocks: 12

First 10 matching blocks:
Block 0: pos_old=0, pos_new=0, length=32
  Text: '\def\mytitle{Navigating Digital '

Block 1: pos_old=37, pos_new=41, length=25
  Text: 'ation in Asset-Intensive '

Block 2: pos_old=69, pos_new=72, length=78
  Text: 'ies: A Process Model Informed by Design Science}
\...'

Block 3: pos_old=147, pos_new=158, length=26
  Text: 'asset-intensive industries'

Block 4: pos_old=173, pos_new=185, length=467
  Text: ', such as aviation and railways, face unique digit...'

Block 5: pos_old=650, pos_new=663, length=440
  Text: 's, while fully complying with regulatory constrain...'

Block 6: pos_old=1095, pos_new=1112, length=1208
  Text: 'ation challenges of asset-intensive companies in t...'

Block 7: pos_old=2303, pos_new=2327, length=1
  Text: 'w'

Block 8: pos_old=2323, pos_new=2332, length=1
  Text: '
'

Block 9: pos_old=2330, pos_new=2333, length=1
  Text: 'w'



## 5. Get opcodes (operations)

`get_opcodes()` returns a list of 5-tuples describing how to turn sequence a into sequence b.

Each tuple is: `(tag, i1, i2, j1, j2)` where:
- `tag`: operation type ('equal', 'replace', 'delete', 'insert')
- `i1:i2`: slice of sequence a
- `j1:j2`: slice of sequence b

**Operations:**
- `'equal'`: `a[i1:i2] == b[j1:j2]` (no change)
- `'replace'`: `a[i1:i2]` should be replaced by `b[j1:j2]`
- `'delete'`: `a[i1:i2]` should be deleted (j1 == j2)
- `'insert'`: `b[j1:j2]` should be inserted at a[i1:i1] (i1 == i2)
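
One way to convince yourself that the opcodes fully describe the edit is to replay them. The helper below, `apply_opcodes`, is a hypothetical name introduced for this sketch; it rebuilds `b` from `a` using only the slices each opcode refers to.

```python
import difflib

def apply_opcodes(a, b, opcodes):
    """Rebuild b from a by walking the opcodes (hypothetical helper)."""
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'equal':
            out.append(a[i1:i2])   # copy unchanged text from a
        elif tag in ('replace', 'insert'):
            out.append(b[j1:j2])   # take the new text from b
        # 'delete' contributes nothing to the result
    return ''.join(out)

a, b = "qabxcd", "abycdf"
ops = difflib.SequenceMatcher(None, a, b).get_opcodes()
rebuilt = apply_opcodes(a, b, ops)
print(rebuilt == b)
```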

In [32]:
opcodes = matcher.get_opcodes()

print(f"Number of operations: {len(opcodes)}\n")
print("All operations:")
for tag, i1, i2, j1, j2 in opcodes:
    old_text = old_abstract[i1:i2]
    new_text = new_abstract[j1:j2]
    
    if tag == 'equal':
        print(f"{tag:8s} a[{i1:4d}:{i2:4d}] == b[{j1:4d}:{j2:4d}] (length: {i2-i1})")
        if i2 - i1 < 100:
            print(f"         '{old_text}'")
    elif tag == 'replace':
        print(f"{tag:8s} a[{i1:4d}:{i2:4d}] -> b[{j1:4d}:{j2:4d}]")
        print(f"         OLD: '{old_text}'")
        print(f"         NEW: '{new_text}'")
    elif tag == 'delete':
        print(f"{tag:8s} a[{i1:4d}:{i2:4d}]")
        print(f"         DEL: '{old_text}'")
    elif tag == 'insert':
        print(f"{tag:8s} b[{j1:4d}:{j2:4d}]")
        print(f"         INS: '{new_text}'")
    print()

Number of operations: 21

All operations:
equal    a[   0:  32] == b[   0:  32] (length: 32)
         '\def\mytitle{Navigating Digital '

replace  a[  32:  37] -> b[  32:  41]
         OLD: 'Innov'
         NEW: 'Transform'

equal    a[  37:  62] == b[  41:  66] (length: 25)
         'ation in Asset-Intensive '

replace  a[  62:  69] -> b[  66:  72]
         OLD: 'Industr'
         NEW: 'Compan'

equal    a[  69: 147] == b[  72: 150] (length: 78)
         'ies: A Process Model Informed by Design Science}
\def\myabstract{Companies in '

insert   b[ 150: 158]
         INS: '\textbf{'

equal    a[ 147: 173] == b[ 158: 184] (length: 26)
         'asset-intensive industries'

insert   b[ 184: 185]
         INS: '}'

equal    a[ 173: 640] == b[ 185: 652] (length: 467)

replace  a[ 640: 650] -> b[ 652: 663]
         OLD: 'innovation'
         NEW: 'technologie'

equal    a[ 650:1090] == b[ 663:1103] (length: 440)

replace  a[1090:1095] -> b[1103:1112]
         OLD: 'innov'
         NEW: 'tran

In [33]:
print("All operations with LaTeX markup:")
print("=" * 80)

for tag, i1, i2, j1, j2 in opcodes:
    old_text = old_abstract[i1:i2]
    new_text = new_abstract[j1:j2]
    
    if tag == 'equal':
        print(old_text, end='')
    elif tag == 'replace':
        print(f"\\old{{{old_text}}}\\new{{{new_text}}}", end='')
    elif tag == 'delete':
        print(f"\\old{{{old_text}}}", end='')
    elif tag == 'insert':
        print(f"\\new{{{new_text}}}", end='')

print()  # Final newline


All operations with LaTeX markup:
\def\mytitle{Navigating Digital \old{Innov}\new{Transform}ation in Asset-Intensive \old{Industr}\new{Compan}ies: A Process Model Informed by Design Science}
\def\myabstract{Companies in \new{\textbf{}asset-intensive industries\new{}}, such as aviation and railways, face unique digital transformation challenges due to the misalignment between the rapid evolution of digital technologies and decades-long asset lifecycles. Existing innovation frameworks are inadequate for managing this complexity, which in turn creates tensions between innovation requirements and operational reliability demands. This paper therefore investigates how asset-intensive companies can systematically integrate digital \old{innovation}\new{technologie}s, while fully complying with regulatory constraints and safety requirements. We employ a design science approach in a study of Nederlandse Spoorwegen (NS), the Dutch national railway operator, focusing specifically on the implementa

In [34]:

def tokenize_latex(text):
    r"""Split LaTeX into tokens that rejoin to the original text.
    
    Token types:
    - LaTeX commands without their arguments: \textbf, \def, \section*
    - Single-character commands: \\, \&, \{
    - Single braces/brackets: { } [ ]
    - Words (alphanumeric sequences)
    - Whitespace (preserved)
    - Comments (% to end of line)
    - Punctuation and special characters
    """
    import re
    # Order matters! Try to match longer patterns first
    pattern = r'''
        \\[a-zA-Z]+\*?                    # LaTeX command (e.g., \textbf, \section*)
        |\\[^a-zA-Z]                       # Single-char commands (e.g., \\, \&, \{)
        |[\{\}\[\]]                        # Braces and brackets (separate tokens)
        |\w+                               # Words (letters, digits, underscore)
        |[ \t]+                            # Horizontal whitespace (keep together)
        |\n                                # Newlines (separate token)
        |%[^\n]*\n?                        # Comments (% to end of line or end of text)
        |[^\w\s\\{}\[\]%]+                 # Punctuation/special chars
        '''
    return re.findall(pattern, text, re.VERBOSE)

# Test the tokenizer
test_cases = [
    r"in \textbf{asset-intensive industries}, such",
    r"\def\mytitle{Some Title}" + "\n",
    r"text with \& special chars",
    r"math: $\alpha + \beta$",
    r"% This is a comment line" + "\n" + "Next line.",
]

print("Testing tokenizer:")
print("=" * 80)
for test in test_cases:
    tokens = tokenize_latex(test)
    print(f"\nInput:  {test}")
    print(f"Tokens: {tokens}")
    print(f"Rejoined: {''.join(tokens)}")
    print(f"Match: {test == ''.join(tokens)}")



Testing tokenizer:

Input:  in \textbf{asset-intensive industries}, such
Tokens: ['in', ' ', '\\textbf', '{', 'asset', '-', 'intensive', ' ', 'industries', '}', ',', ' ', 'such']
Rejoined: in \textbf{asset-intensive industries}, such
Match: True

Input:  \def\mytitle{Some Title}

Tokens: ['\\def', '\\mytitle', '{', 'Some', ' ', 'Title', '}', '\n']
Rejoined: \def\mytitle{Some Title}

Match: True

Input:  text with \& special chars
Tokens: ['text', ' ', 'with', ' ', '\\&', ' ', 'special', ' ', 'chars']
Rejoined: text with \& special chars
Match: True

Input:  math: $\alpha + \beta$
Tokens: ['math', ':', ' ', '$', '\\alpha', ' ', '+', ' ', '\\beta', '$']
Rejoined: math: $\alpha + \beta$
Match: True

Input:  % This is a comment line
Next line.
Tokens: ['% This is a comment line\n', 'Next', ' ', 'line', '.']
Rejoined: % This is a comment line
Next line.
Match: True




In [35]:

print("\n" + "=" * 80)
print("Comparing abstracts with improved tokenizer:")
print("=" * 80)

old_tokens = tokenize_latex(old_abstract)
new_tokens = tokenize_latex(new_abstract)
matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens)
opcodes = matcher.get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    old_text = ''.join(old_tokens[i1:i2])
    new_text = ''.join(new_tokens[j1:j2])

    if tag == 'equal':
        print(old_text, end='')
    elif tag == 'replace':
        print(f"\\old{{{old_text}}}\\new{{{new_text}}}", end='')
    elif tag == 'delete':
        print(f"\\old{{{old_text}}}", end='')
    elif tag == 'insert':
        print(f"\\new{{{new_text}}}", end='')

print()  # Final newline


Comparing abstracts with improved tokenizer:
\def\mytitle{Navigating Digital \old{Innovation}\new{Transformation} in Asset-Intensive \old{Industries}\new{Companies}: A Process Model Informed by Design Science}
\def\myabstract{Companies in \new{\textbf{}asset-intensive industries\new{}}, such as aviation and railways, face unique digital transformation challenges due to the misalignment between the rapid evolution of digital technologies and decades-long asset lifecycles. Existing innovation frameworks are inadequate for managing this complexity, which in turn creates tensions between innovation requirements and operational reliability demands. This paper therefore investigates how asset-intensive companies can systematically integrate digital \old{innovations}\new{technologies}, while fully complying with regulatory constraints and safety requirements. We employ a design science approach in a study of Nederlandse Spoorwegen (NS), the Dutch national railway operator, focusing specifica

In [36]:
text1 = r"This is \textbf{bold} text with \alpha." + "\n" + r"Companies in \textbf{asset-intensive} industries"
text2 = r"This is \textit{italic} text with \beta." + "\n" + r"Companies in asset-intensive industries"

tokens1 = tokenize_latex(text1)
tokens2 = tokenize_latex(text2)

print(tokens1)
matcher = difflib.SequenceMatcher(None, tokens1, tokens2)
opcodes = matcher.get_opcodes()
end_tokens = []
for tag, i1, i2, j1, j2 in opcodes:
    # Use distinct names so the token lists from the earlier cell are not overwritten
    old_seg = tokens1[i1:i2]
    new_seg = tokens2[j1:j2]

    if tag == 'equal':
        end_tokens += old_seg
    elif tag == 'replace':
        end_tokens += ['\\old', '{'] + old_seg + ['}', '\\new', '{'] + new_seg + ['}']
    elif tag == 'delete':
        end_tokens += ['\\old', '{'] + old_seg + ['}']
    elif tag == 'insert':
        end_tokens += ['\\new', '{'] + new_seg + ['}']

print("".join(end_tokens))  # Rejoin the tokens into the marked-up text

['This', ' ', 'is', ' ', '\\textbf', '{', 'bold', '}', ' ', 'text', ' ', 'with', ' ', '\\alpha', '.', '\n', 'Companies', ' ', 'in', ' ', '\\textbf', '{', 'asset', '-', 'intensive', '}', ' ', 'industries']
This is \old{\textbf}\new{\textit}{\old{bold}\new{italic}} text with \old{\alpha}\new{\beta}.
Companies in \old{\textbf{}asset-intensive\old{}} industries


In [37]:
def merge_opcodes(opcodes, old_text, new_text, min_equal_length=10):
    """
    Merge opcodes to avoid fragmented changes.
    If 'equal' segments between changes are very short, treat them as part of the change.
    """
    if not opcodes:
        return opcodes
    
    merged = []
    i = 0
    
    while i < len(opcodes):
        tag, i1, i2, j1, j2 = opcodes[i]
        
        # Start accumulating if it's a change operation
        if tag in ('replace', 'delete', 'insert'):
            # Look ahead to see if we should merge with next operations
            while i + 1 < len(opcodes):
                next_tag, next_i1, next_i2, next_j1, next_j2 = opcodes[i + 1]
                
                # If next is 'equal' and short, check if there's another change after
                if next_tag == 'equal' and (next_i2 - next_i1) < min_equal_length:
                    # Look at the operation after the short equal segment
                    if i + 2 < len(opcodes):
                        after_tag, _, _, _, _ = opcodes[i + 2]
                        if after_tag in ('replace', 'delete', 'insert'):
                            # Merge: extend current operation through the equal segment
                            i2 = next_i2
                            j2 = next_j2
                            i += 1  # Skip the short equal
                            
                            # Now merge the following change operation too
                            i += 1
                            tag2, i1_2, i2_2, j1_2, j2_2 = opcodes[i]
                            i2 = i2_2
                            j2 = j2_2
                            tag = 'replace'  # Combined operation becomes 'replace'
                            continue
                
                break
            
            merged.append((tag, i1, i2, j1, j2))
        else:
            # Keep 'equal' as-is
            merged.append((tag, i1, i2, j1, j2))
        
        i += 1
    
    return merged

print("Using merged opcodes (min_equal_length=10):")
print("=" * 80)

# `opcodes` currently holds the token-level result from the previous cell, so
# recompute character-level opcodes that index into the abstract strings
char_opcodes = difflib.SequenceMatcher(None, old_abstract, new_abstract).get_opcodes()
merged_opcodes = merge_opcodes(char_opcodes, old_abstract, new_abstract, min_equal_length=10)

for tag, i1, i2, j1, j2 in merged_opcodes:
    old_text = old_abstract[i1:i2]
    new_text = new_abstract[j1:j2]
    
    if tag == 'equal':
        print(old_text, end='')
    elif tag == 'replace':
        print(f"\\old{{{old_text}}}\\new{{{new_text}}}", end='')
    elif tag == 'delete':
        print(f"\\old{{{old_text}}}", end='')
    elif tag == 'insert':
        print(f"\\new{{{new_text}}}", end='')

print()  # Final newline

Using merged opcodes (min_equal_length=10):


## Alternative approach: Token-level comparison

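Instead of comparing character by character, we can tokenize the text (words + LaTeX commands) and compare at that level. Each word or LaTeX command is then replaced as a whole, which avoids fragments such as `\old{Innov}\new{Transform}ation`. An inserted `\textbf{` and its closing `}` are still reported as two separate insertions.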

In [38]:
# Tokenize the full abstracts here so this cell does not depend on
# variables left over from earlier cells
old_tokens = tokenize_latex(old_abstract)
new_tokens = tokenize_latex(new_abstract)
token_matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens)

print("Token-level comparison with LaTeX markup:")
print("=" * 80)

for tag, i1, i2, j1, j2 in token_matcher.get_opcodes():
    old_segment = ''.join(old_tokens[i1:i2])
    new_segment = ''.join(new_tokens[j1:j2])
    
    if tag == 'equal':
        print(old_segment, end='')
    elif tag == 'replace':
        print(f"\\old{{{old_segment}}}\\new{{{new_segment}}}", end='')
    elif tag == 'delete':
        print(f"\\old{{{old_segment}}}", end='')
    elif tag == 'insert':
        print(f"\\new{{{new_segment}}}", end='')

print()  # Final newline

Token-level comparison with LaTeX markup:

## 6. Practical example with shorter strings

Let's look at a simpler example to understand the mechanics better.

In [None]:
# Simple example
old_text = "The quick brown fox jumps"
new_text = "The quick red fox leaps"

simple_matcher = difflib.SequenceMatcher(None, old_text, new_text)

print("=" * 60)
print("SIMPLE EXAMPLE")
print("=" * 60)
print(f"Old: '{old_text}'")
print(f"New: '{new_text}'")
print(f"\nSimilarity: {simple_matcher.ratio():.2%}\n")

print("Matching blocks:")
for i, j, n in simple_matcher.get_matching_blocks():
    if n > 0:
        print(f"  '{old_text[i:i+n]}' at old[{i}:{i+n}] == new[{j}:{j+n}]")

print("\nOperations to transform old -> new:")
for tag, i1, i2, j1, j2 in simple_matcher.get_opcodes():
    if tag == 'equal':
        print(f"  KEEP:    '{old_text[i1:i2]}'")
    elif tag == 'replace':
        print(f"  REPLACE: '{old_text[i1:i2]}' -> '{new_text[j1:j2]}'")
    elif tag == 'delete':
        print(f"  DELETE:  '{old_text[i1:i2]}'")
    elif tag == 'insert':
        print(f"  INSERT:  '{new_text[j1:j2]}'")

SIMPLE EXAMPLE
Old: 'The quick brown fox jumps'
New: 'The quick red fox leaps'

Similarity: 75.00%

Matching blocks:
  'The quick ' at old[0:10] == new[0:10]
  'r' at old[11:12] == new[10:11]
  ' fox ' at old[15:20] == new[13:18]
  'ps' at old[23:25] == new[21:23]

Operations to transform old -> new:
  KEEP:    'The quick '
  DELETE:  'b'
  KEEP:    'r'
  REPLACE: 'own' -> 'ed'
  KEEP:    ' fox '
  REPLACE: 'jum' -> 'lea'
  KEEP:    'ps'


## 7. Using SequenceMatcher for word-level comparison

Instead of comparing character by character, we can compare word by word.
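
Note that `str.split()` discards all whitespace, so joining the words with a single space cannot reproduce line breaks. If the diff has to be rejoined into the original text, one option is to keep the whitespace runs as tokens of their own. The sketch below uses a hypothetical helper, `split_keep_whitespace`, and two short made-up strings.

```python
import difflib
import re

def split_keep_whitespace(text):
    """Split into alternating word and whitespace tokens (hypothetical helper)."""
    return re.findall(r'\S+|\s+', text)

old = "digital innovation challenges\nof companies"
new = "digital transformation challenges\nof companies"

old_toks = split_keep_whitespace(old)
new_toks = split_keep_whitespace(new)

# Joining the tokens reproduces the input exactly, line break included
assert ''.join(old_toks) == old

ops = difflib.SequenceMatcher(None, old_toks, new_toks).get_opcodes()
changes = [(tag, old_toks[i1:i2], new_toks[j1:j2])
           for tag, i1, i2, j1, j2 in ops if tag != 'equal']
print(changes)
```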

In [None]:
# Split abstracts into words
old_words = old_abstract.split()
new_words = new_abstract.split()

word_matcher = difflib.SequenceMatcher(None, old_words, new_words)

print(f"Old abstract: {len(old_words)} words")
print(f"New abstract: {len(new_words)} words")
print(f"Word-level similarity: {word_matcher.ratio():.2%}\n")

print("Word-level differences:")
for tag, i1, i2, j1, j2 in word_matcher.get_opcodes():
    if tag != 'equal':
        old_segment = ' '.join(old_words[i1:i2])
        new_segment = ' '.join(new_words[j1:j2])
        
        print(f"\n{tag.upper()}:")
        print(f"  Position: words {i1}-{i2} -> words {j1}-{j2}")
        if tag == 'replace':
            print(f"  Old: '{old_segment}'")
            print(f"  New: '{new_segment}'")
        elif tag == 'delete':
            print(f"  Deleted: '{old_segment}'")
        elif tag == 'insert':
            print(f"  Inserted: '{new_segment}'")

Old abstract: 284 words
New abstract: 284 words
Word-level similarity: 97.18%

Word-level differences:

REPLACE:
  Position: words 2-3 -> words 2-3
  Old: 'Innovation'
  New: 'Transformation'

REPLACE:
  Position: words 5-6 -> words 5-6
  Old: 'Industries:'
  New: 'Companies:'

REPLACE:
  Position: words 15-17 -> words 15-17
  Old: 'asset-intensive industries,'
  New: '\textbf{asset-intensive industries},'

REPLACE:
  Position: words 74-75 -> words 74-75
  Old: 'innovations,'
  New: 'technologies,'

REPLACE:
  Position: words 138-139 -> words 138-139
  Old: 'innovation'
  New: 'transformation'

INSERT:
  Position: words 276-276 -> words 276-278
  Inserted: 'hello world'

DELETE:
  Position: words 282-284 -> words 284-284
  Deleted: 'hello world'


## 8. Key takeaways about SequenceMatcher

1. **Initialization**: `SequenceMatcher(isjunk, a, b)` creates a matcher for two sequences

2. **Similarity ratio**: `ratio()` returns a float in [0, 1] indicating how similar the sequences are

3. **Matching blocks**: `get_matching_blocks()` returns `(i, j, n)` tuples showing where sequences match
   - `a[i:i+n] == b[j:j+n]`

4. **Operations**: `get_opcodes()` returns `(tag, i1, i2, j1, j2)` tuples describing transformations
   - 'equal': sequences match
   - 'replace': substitute a[i1:i2] with b[j1:j2]
   - 'delete': remove a[i1:i2]
   - 'insert': add b[j1:j2]

5. **Flexibility**: Works with any sequences (strings, lists, etc.) and at any granularity (chars, words, lines)

6. **Use cases**: 
   - Text diff tools
   - Finding changes between documents
   - Plagiarism detection
   - Version control systems
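
For the line-level diff use case, `difflib` also provides `unified_diff`, which produces patch-style output from two lists of lines. The sketch below feeds it a few lines modeled on the trailing lines of the two abstracts; the variable names `old_tail` and `new_tail` are introduced here for illustration.

```python
import difflib

# Hypothetical inputs modeled on the last lines of the two abstracts
old_tail = ["1234", "what is a girl to do", "hello world"]
new_tail = ["1234", "", "hello world", "what is a girl to do"]

# lineterm="" because the input lines carry no trailing newlines
diff = list(difflib.unified_diff(old_tail, new_tail,
                                 fromfile="old", tofile="new", lineterm=""))
print("\n".join(diff))
```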

In [None]:
old_lines = old_abstract.splitlines()
new_lines = new_abstract.splitlines()

simple_matcher = difflib.SequenceMatcher(None, old_lines, new_lines)

for tag, i1, i2, j1, j2 in simple_matcher.get_opcodes():
    if tag == 'equal':
        print(f"  KEEP:    '{old_lines[i1:i2]}'")
    elif tag == 'replace':
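        # Do a finer token-level comparison for each replaced line
        for line_idx in range(i1, i2):
            old_line = old_lines[line_idx]
            # Pair with the new line at the same offset; if the new block is
            # shorter, compare against an empty line so the old line shows as deleted.
            # Surplus new lines (when the new block is longer) are not printed here.
            new_line_idx = j1 + (line_idx - i1)
            new_line = new_lines[new_line_idx] if new_line_idx < j2 else ""
            
            old_line_tokens = tokenize_latex(old_line)
            new_line_tokens = tokenize_latex(new_line)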
            
            line_matcher = difflib.SequenceMatcher(None, old_line_tokens, new_line_tokens)
            
            # print(f"    Line {line_idx} -> {new_line_idx}: ", end='')
            print(f"  REPLACED: ", end="")
            for token_tag, ti1, ti2, tj1, tj2 in line_matcher.get_opcodes():
                old_segment = ''.join(old_line_tokens[ti1:ti2])
                new_segment = ''.join(new_line_tokens[tj1:tj2])
                
                if token_tag == 'equal':
                    print(old_segment, end='')
                elif token_tag == 'replace':
                    print(f"\\old{{{old_segment}}}\\new{{{new_segment}}}", end='')
                elif token_tag == 'delete':
                    print(f"\\old{{{old_segment}}}", end='')
                elif token_tag == 'insert':
                    print(f"\\new{{{new_segment}}}", end='')
            print()  # newline after each line
        print()  # newline after token-level output
        # print(f"  REPLACE: '{old_lines[i1:i2]}'\n -> '{new_lines[j1:j2]}'")
    elif tag == 'delete':
        print(f"  DELETE:  '{old_lines[i1:i2]}'")
    elif tag == 'insert':
        print(f"  INSERT:  '{new_lines[j1:j2]}'")

  REPLACED: \def\mytitle{Navigating Digital \old{Innovation}\new{Transformation} in Asset-Intensive \old{Industries}\new{Companies}: A Process Model Informed by Design Science}
  REPLACED: \def\myabstract{Companies in \new{\textbf{}asset-intensive industries\new{}}, such as aviation and railways, face unique digital transformation challenges due to the misalignment between the rapid evolution of digital technologies and decades-long asset lifecycles. Existing innovation frameworks are inadequate for managing this complexity, which in turn creates tensions between innovation requirements and operational reliability demands. This paper therefore investigates how asset-intensive companies can systematically integrate digital \old{innovations}\new{technologies}, while fully complying with regulatory constraints and safety requirements. We employ a design science approach in a study of Nederlandse Spoorwegen (NS), the Dutch national railway operator, focusing specifically on the implementat

## 9. Matcher for LaTeX markup changes

For your specific requirement, we need a token-level comparison that:
1. Treats LaTeX commands as single tokens (e.g., `\textbf` stays together)
2. Emits braces as separate tokens, so text inside newly added markup can still match the old text
3. Shows the OLD text without markup and NEW text with markup
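
Because the tokenizer emits `{` and `}` as separate tokens, a segment wrapped in `\old{...}` or `\new{...}` can contain an unmatched brace, for example an inserted `\textbf{` whose closing `}` arrives in a later opcode. A minimal sketch of a check for this follows; `braces_balanced` is a hypothetical helper, not part of `difflib`, and it only counts brace tokens.

```python
def braces_balanced(tokens):
    """Return True if '{' and '}' tokens pair up and never close before opening.

    Hypothetical helper. Escaped braces such as '\\{' are separate tokens
    in the tokenizer's output and are ignored here.
    """
    depth = 0
    for tok in tokens:
        if tok == '{':
            depth += 1
        elif tok == '}':
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

opening_only = braces_balanced(['\\textbf', '{'])
closing_only = braces_balanced(['}'])
complete = braces_balanced(['\\textbf', '{', 'asset', '}'])
print(opening_only, closing_only, complete)  # False False True
```

A segment that fails this check is a candidate for merging with its neighbours before being wrapped in markup.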

In [None]:
# Test with your specific example
old_example = r"\def\myabstract{Companies in asset-intensive industries, such as aviation}"
new_example = r"\def\myabstract{Companies in \textbf{asset-intensive industries}, such as aviation}"

print("Testing with your example:")
print(f"OLD: {old_example}")
print(f"NEW: {new_example}")
print()

# Tokenize
old_tokens = tokenize_latex(old_example)
new_tokens = tokenize_latex(new_example)

print(f"Old tokens: {old_tokens}")
print(f"New tokens: {new_tokens}")
print()

# Compare
matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens)

print("Opcodes:")
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:8s} old[{i1}:{i2}] = {old_tokens[i1:i2]}")
    print(f"         new[{j1}:{j2}] = {new_tokens[j1:j2]}")
    print()

# Generate output
output_tokens = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    old_segment = old_tokens[i1:i2]
    new_segment = new_tokens[j1:j2]
    
    if tag == 'equal':
        output_tokens.extend(old_segment)
    elif tag == 'replace':
        output_tokens.extend(['\\old{'] + old_segment + ['}\\new{'] + new_segment + ['}'])
    elif tag == 'delete':
        output_tokens.extend(['\\old{'] + old_segment + ['}'])
    elif tag == 'insert':
        output_tokens.extend(['\\new{'] + new_segment + ['}'])

result = ''.join(output_tokens)
print("Result:")
print(result)

Testing with your example:
OLD: \def\myabstract{Companies in asset-intensive industries, such as aviation}
NEW: \def\myabstract{Companies in \textbf{asset-intensive industries}, such as aviation}

Old tokens: ['\\def', '\\myabstract', '{', 'Companies', ' ', 'in', ' ', 'asset', '-', 'intensive', ' ', 'industries', ',', ' ', 'such', ' ', 'as', ' ', 'aviation', '}']
New tokens: ['\\def', '\\myabstract', '{', 'Companies', ' ', 'in', ' ', '\\textbf', '{', 'asset', '-', 'intensive', ' ', 'industries', '}', ',', ' ', 'such', ' ', 'as', ' ', 'aviation', '}']

Opcodes:
equal    old[0:7] = ['\\def', '\\myabstract', '{', 'Companies', ' ', 'in', ' ']
         new[0:7] = ['\\def', '\\myabstract', '{', 'Companies', ' ', 'in', ' ']

insert   old[7:7] = []
         new[7:9] = ['\\textbf', '{']

equal    old[7:12] = ['asset', '-', 'intensive', ' ', 'industries']
         new[9:14] = ['asset', '-', 'intensive', ' ', 'industries']

insert   old[12:12] = []
         new[14:15] = ['}']

equal    old[12:20]

In [None]:
old_example = r"""\section{A new header}
\label{sec:newheader}
\def\mytitle{Navigating Digital Innovation in Asset-Intensive Industries: A Process Model Informed by Design Science}
\def\myabstract{Companies in asset-intensive industries, such as aviation and railways, face unique digital transformation challenges due to the misalignment between the rapid evolution of digital technologies and decades-long asset lifecycles. Existing innovation frameworks are inadequate for managing this complexity, which in turn creates tensions between innovation requirements and operational reliability demands. This paper therefore investigates how asset-intensive companies can systematically integrate digital innovations, while fully complying with regulatory constraints and safety requirements. We employ a design science approach in a study of Nederlandse Spoorwegen (NS), the Dutch national railway operator, focusing specifically on the implementation of AI-driven CCTV systems within the operations of NS. Drawing on a literature review and participant-observer as well as interview data, we develop six design propositions that address the key digital innovation challenges of asset-intensive companies in the area of market readiness assessment, modular architecture, regulatory compliance, temporal coordination, ecosystem governance, and organizational capability development. Using these design propositions, we develop the Iterative Development \& Adoption Model (IDAM) that operationalizes market maturity assessment through market readiness levels to guide make-or-buy transitions across four iterative phases: ideate, assess, realise, and review. This model includes a Development Reference Architecture for emerging technologies and an Integration Reference Architecture for more mature technologies, enabling concurrent sourcing strategies based on technological maturity. IDAM provides actionable guidance for decisions about technology adoption in asset-intensive contexts, thereby offering a systematic approach to innovation management in industries with very long asset lifecycles and huge regulatory constraints.}
\def\mykeywords{design science; engineering design; technology adoption; digital transformation; product development; asset lifecycle; modular architecture; market readiness; operational reliability}

% asdasdasdasdasdadasdlkas


1234
what is a girl to do
hello world
"""

new_example = r"""\section{A new header too}
\label{sec:newheader2}
\def\mytitle{Navigating Digital Transformation in Asset-Intensive Companies: A Process Model Informed by Design Science}
\def\myabstract{Companies in \textbf{asset-intensive industries}, such as aviation and railways, face unique digital transformation challenges due to the misalignment between the rapid evolution of digital technologies and decades-long asset lifecycles. Existing innovation frameworks are inadequate for managing this complexity, which in turn creates tensions between innovation requirements and operational reliability demands. This paper therefore investigates how asset-intensive companies can systematically integrate digital technologies, while fully complying with regulatory constraints and safety requirements. We employ a design science approach in a study of Nederlandse Spoorwegen (NS), the Dutch national railway operator, focusing specifically on the implementation of AI-driven CCTV systems within the operations of NS. Drawing on a literature review and participant-observer as well as interview data, we develop six design propositions that address the key digital transformation challenges of asset-intensive companies in the area of market readiness assessment, modular architecture, regulatory compliance, temporal coordination, ecosystem governance, and organizational capability development. Using these design propositions, we develop the Iterative Development \& Adoption Model (IDAM) that operationalizes market maturity assessment through market readiness levels to guide make-or-buy transitions across four iterative phases: ideate, assess, realise, and review. This model includes a Development Reference Architecture for emerging technologies and an Integration Reference Architecture for more mature technologies, enabling concurrent sourcing strategies based on technological maturity. IDAM provides actionable guidance for decisions about technology adoption in asset-intensive contexts, thereby offering a systematic approach to innovation management in industries with very long asset lifecycles and huge regulatory constraints.}
\def\mykeywords{design science; engineering design; technology adoption; digital transformation; product development; asset lifecycle; modular architecture; market readiness; operational reliability}

% asdasdasdasdasdadas


1234

hello world
what is a girl to do
"""

old_example = r"""\section{A new header}
\label{sec:newheader}
\def\mytitle{Navigating Digital Innovation in Asset-Intensive Industries: A Process Model Informed by Design Science}
\def\myabstract{Companies in asset-intensive industries, such as aviation and railways.}
% another comment
Hallo allemaal
"""

new_example = r"""\section{A new header too}
\label{sec:newheader2}
\def\mytitle{Navigating Digital Transformation in Asset-Intensive Companies: A Process Model Informed by Design Science}
\def\myabstract{Companies in \textbf{asset-intensive industries}, such as aviation and railways.}
% some comment
Hallo allemaal
"""


def group_latex_commands(tokens):
    """
    Group selected LaTeX commands with their brace argument into a single token.
    E.g., ['\\textbf', '{', 'text', '}'] becomes ['\\textbf{text}']
    
    This lets the matcher see '\\textbf{asset-intensive industries}' as one unit
    rather than as separate tokens, so the change is reported as a REPLACEMENT
    of plain text with formatted text.
    
    Only the commands listed in FORMATTING_COMMANDS are grouped:
    - Text formatting: textbf, textit, texttt, textsc, emph, underline, etc.
    - Font commands: bf, it, tt, sc, rm, sf, etc.
    - Size commands: tiny, small, large, Large, LARGE, huge, Huge, etc.
    - Color commands: textcolor, color, colorbox
    - Labels, references, and citations: label, ref, cite, citep, citet
    
    Only the first brace group after a command is absorbed; a command that is
    not directly followed by '{' stays a separate token.
    """
    # Commands that should be grouped with their arguments
    FORMATTING_COMMANDS = {
        'textbf', 'textit', 'texttt', 'textsc', 'textrm', 'textsf',
        'emph', 'underline', 'textsl', 'textmd', 'textup',
        'bf', 'it', 'tt', 'sc', 'rm', 'sf', 'sl', 'md', 'up',
        'tiny', 'scriptsize', 'footnotesize', 'small', 'normalsize',
        'large', 'Large', 'LARGE', 'huge', 'Huge',
        'textcolor', 'color', 'colorbox',
        # The comma after 'ref' matters: without it Python concatenates the
        # adjacent literals into 'refcite' and neither command is grouped.
        'label', 'ref',
        'cite', 'citep', 'citet'
    }
    
    result = []
    i = 0
    
    while i < len(tokens):
        token = tokens[i]
        
        # Check if this is a formatting command followed by braces
        if token.startswith('\\') and len(token) > 1:
            cmd_name = token[1:]  # Remove the backslash
            
            # Only group if it's a formatting command
            if cmd_name in FORMATTING_COMMANDS and i + 1 < len(tokens) and tokens[i + 1] == '{':
                # Find matching closing brace
                brace_count = 0
                group = [token]
                j = i + 1
                
                while j < len(tokens):
                    group.append(tokens[j])
                    if tokens[j] == '{':
                        brace_count += 1
                    elif tokens[j] == '}':
                        brace_count -= 1
                        if brace_count == 0:
                            # Found complete command with arguments
                            result.append(''.join(group))
                            i = j + 1
                            break
                    j += 1
                else:
                    # No matching brace found, just add the command
                    result.append(token)
                    i += 1
            else:
                # Not a formatting command, or no braces - add as-is
                result.append(token)
                i += 1
        else:
            # Not a command, add as-is
            result.append(token)
            i += 1
    
    return result


# Test the grouping function
old_tokens_raw = tokenize_latex(old_example)
new_tokens_raw = tokenize_latex(new_example)

old_tokens_grouped = group_latex_commands(old_tokens_raw)
new_tokens_grouped = group_latex_commands(new_tokens_raw)

print("Grouped tokens:")
print(f"Old: {old_tokens_grouped}")
print(f"New: {new_tokens_grouped}")
print()

# Now compare with grouped tokens
matcher = difflib.SequenceMatcher(None, old_tokens_grouped, new_tokens_grouped)

print("Opcodes with grouped tokens:")
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:8s} old[{i1}:{i2}] = {old_tokens_grouped[i1:i2]}")
    print(f"         new[{j1}:{j2}] = {new_tokens_grouped[j1:j2]}")
    print()



Grouped tokens:
Old: ['\\section', '{', 'A', ' ', 'new', ' ', 'header', '}', '\n', '\\label{sec:newheader}', '\n', '\\def', '\\mytitle', '{', 'Navigating', ' ', 'Digital', ' ', 'Innovation', ' ', 'in', ' ', 'Asset', '-', 'Intensive', ' ', 'Industries', ':', ' ', 'A', ' ', 'Process', ' ', 'Model', ' ', 'Informed', ' ', 'by', ' ', 'Design', ' ', 'Science', '}', '\n', '\\def', '\\myabstract', '{', 'Companies', ' ', 'in', ' ', 'asset', '-', 'intensive', ' ', 'industries', ',', ' ', 'such', ' ', 'as', ' ', 'aviation', ' ', 'and', ' ', 'railways', '.', '}', '\n', '% another comment', '\n', 'Hallo', ' ', 'allemaal', '\n']
New: ['\\section', '{', 'A', ' ', 'new', ' ', 'header', ' ', 'too', '}', '\n', '\\label{sec:newheader2}', '\n', '\\def', '\\mytitle', '{', 'Navigating', ' ', 'Digital', ' ', 'Transformation', ' ', 'in', ' ', 'Asset', '-', 'Intensive', ' ', 'Companies', ':', ' ', 'A', ' ', 'Process', ' ', 'Model', ' ', 'Informed', ' ', 'by', ' ', 'Design', ' ', 'Science', '}', '\n', '\\def', 
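
The effect of grouping on the opcodes can be checked in isolation. The following is a minimal sketch with hand-written token lists; they are an assumption for illustration, and the notebook's `tokenize_latex` may split the text differently. The names `demo_old`, `demo_new_split`, `demo_new_grouped`, and `opcode_tags` are hypothetical helpers, not part of the notebook.

```python
import difflib

# Hand-written tokens (assumed for illustration, not produced by tokenize_latex)
demo_old = ['in', ' ', 'asset', '-', 'intensive', ' ', 'industries', ',']

# New text with \textbf left as separate tokens
demo_new_split = ['in', ' ', '\\textbf', '{', 'asset', '-', 'intensive',
                  ' ', 'industries', '}', ',']

# New text with \textbf{...} merged into one token
demo_new_grouped = ['in', ' ', '\\textbf{asset-intensive industries}', ',']


def opcode_tags(a, b):
    """Return only the tag of each opcode that SequenceMatcher reports."""
    matcher = difflib.SequenceMatcher(None, a, b)
    return [tag for tag, i1, i2, j1, j2 in matcher.get_opcodes()]


print("split:  ", opcode_tags(demo_old, demo_new_split))
print("grouped:", opcode_tags(demo_old, demo_new_grouped))
```

With the split tokens the matcher reports two separate inserts (one for `\textbf{`, one for `}`), so wrapping each insert in `\new{...}` would leave unbalanced braces. With the grouped token it reports a single replace, which is what the markup step needs.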

In [None]:
def is_specific(segment):
    """Return True if the segment starts with a comment, \\label, or \\ref token.

    The output loop copies such segments without \\old/\\new markup.
    """
    if len(segment) == 0:
        return False
    # str.startswith accepts a tuple of prefixes
    return segment[0].startswith(("%", "\\label", "\\ref"))

# Generate output
output_tokens = []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    old_segment = old_tokens_grouped[i1:i2]
    new_segment = new_tokens_grouped[j1:j2]
    
    if tag == 'equal':
        output_tokens.extend(old_segment)
    elif is_specific(old_segment):
        # Comments, labels, and refs get no markup: a replaced one is taken
        # from the new version, a deleted one is dropped. 'insert' opcodes
        # have an empty old_segment, so they never reach this branch.
        if tag == 'replace':
            output_tokens.extend(new_segment)
    elif tag == 'delete':
        output_tokens.append('\\old{' + ''.join(old_segment) + '}')
    elif tag == 'replace':
        # Keep a leading space outside \new{...} so word spacing survives
        if new_segment[0] == " ":
            output_tokens.append('\\old{' + ''.join(old_segment) + '} \\new{' + ''.join(new_segment[1:]) + '}')
        else:
            output_tokens.append('\\old{' + ''.join(old_segment) + '}\\new{' + ''.join(new_segment) + '}')
    elif tag == 'insert':
        # Same leading-space handling as for 'replace'
        if new_segment[0] == " ":
            output_tokens.append(' \\new{' + ''.join(new_segment[1:]) + '}')
        else:
            output_tokens.append('\\new{' + ''.join(new_segment) + '}')

result = ''.join(output_tokens)
print("Result with selective grouping:")
print(result)

Result with selective grouping:
\section{A new header \new{too}}
\label{sec:newheader2}
\def\mytitle{Navigating Digital \old{Innovation}\new{Transformation} in Asset-Intensive \old{Industries}\new{Companies}: A Process Model Informed by Design Science}
\def\myabstract{Companies in \old{asset-intensive industries}\new{\textbf{asset-intensive industries}}, such as aviation and railways.}
Hallo allemaal
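
The `\section{A new header \new{too}}` line in the result shows the leading-space handling for inserts. The sketch below reproduces that one case on hand-written tokens, which are an assumption and not the output of `tokenize_latex`. The helper `wrap_new` and the `demo_*` names are hypothetical, and only the `equal` and `insert` opcodes are handled.

```python
import difflib


def wrap_new(segment):
    """Wrap inserted tokens in \\new{...}, keeping one leading space outside."""
    if segment and segment[0] == " ":
        return " \\new{" + "".join(segment[1:]) + "}"
    return "\\new{" + "".join(segment) + "}"


# Hand-written tokens (assumed for illustration)
demo_old_tokens = ['A', ' ', 'new', ' ', 'header', '}']
demo_new_tokens = ['A', ' ', 'new', ' ', 'header', ' ', 'too', '}']

demo_matcher = difflib.SequenceMatcher(None, demo_old_tokens, demo_new_tokens)

pieces = []
for tag, i1, i2, j1, j2 in demo_matcher.get_opcodes():
    if tag == 'equal':
        pieces.append(''.join(demo_old_tokens[i1:i2]))
    elif tag == 'insert':
        pieces.append(wrap_new(demo_new_tokens[j1:j2]))

demo_marked = ''.join(pieces)
print(demo_marked)  # A new header \new{too}}
```

The matcher reports `[' ', 'too']` as a single insert, so moving the space in front of `\new{` keeps the word spacing intact in the marked-up text.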

