# difflib – Compare sequences

URL: https://pymotw.com/2/difflib/

URL: https://docs.python.org/3/library/difflib.html

Purpose: Compare sequences, especially lines of text.

The difflib module contains tools for computing and working with differences between sequences. It is especially useful for comparing text, and includes functions that produce reports using several common difference formats.

In [None]:
text1 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
eu lacus accumsan arcu fermentum euismod. Donec pulvinar porttitor
tellus. Aliquam venenatis. Donec facilisis pharetra tortor.  In nec
mauris eget magna consequat convallis. Nam sed sem vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Suspendisse eu lectus. In nunc. Duis vulputate tristique
enim. Donec quis lectus a justo imperdiet tempus."""
text1_lines = text1.splitlines()

text2 = """Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Integer
eu lacus accumsan arcu fermentum euismod. Donec pulvinar, porttitor
tellus. Aliquam venenatis. Donec facilisis pharetra tortor. In nec
mauris eget magna consequat convallis. Nam cras vitae mi vitae odio
pellentesque interdum. Sed consequat viverra nisl. Suspendisse arcu
metus, blandit quis, rhoncus ac, pharetra eget, velit. Mauris
urna. Morbi nonummy molestie orci. Praesent nisi elit, fringilla ac,
suscipit non, tristique vel, mauris. Curabitur vel lorem id nisl porta
adipiscing. Duis vulputate tristique enim. Donec quis lectus a justo
imperdiet tempus. Suspendisse eu lectus. In nunc. """
text2_lines = text2.splitlines()

# print("text1= \n{}\n".format(text1_lines))
# print("text2= \n{}\n".format(text2_lines))

## Comparing Bodies of Text

The default output produced by the Differ class is similar to that of the diff command line tool. It includes the original input values from both lists, including common values, and markup data to indicate what changes were made.

- Lines prefixed with - indicate that they were in the first sequence, but not the second.
- Lines prefixed with + were in the second sequence, but not the first.
- If a line has an incremental difference between versions, an extra line prefixed with ? is used to highlight where the change occurs within the line.
- If a line has not changed, it is printed with two leading spaces so that it lines up with the other lines that may have differences.
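
The prefixes listed above can be seen on a much smaller input. This is a minimal sketch; the lists `old` and `new` are made-up sample data, not part of the text used in the rest of this notebook.

```python
import difflib

# Hypothetical sample data, chosen so that every prefix shows up at least once.
old = ["alpha", "the quick brown fox", "gamma"]
new = ["alpha", "the quick brown cat", "gamma", "delta"]

diff_lines = list(difflib.Differ().compare(old, new))
for line in diff_lines:
    # The '?' guide lines carry their own trailing newline, so strip it.
    print(line.rstrip("\n"))

# The first two characters of every line are the markup described above.
prefixes = {line[:2] for line in diff_lines}
print(sorted(prefixes))
```

The changed line appears twice (once with `- `, once with `+ `), each followed by a `? ` guide line, because the two versions are similar enough for Differ to treat them as one edited line.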

In [None]:
import difflib
# from difflib.data import *

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print('\n'.join(diff))

The ndiff() function produces essentially the same output.

In [None]:
import difflib
# from difflib_data import *

diff = difflib.ndiff(text1_lines, text2_lines)
print('\n'.join(list(diff)))
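
Because an ndiff() delta keeps every line from both inputs, either input can be recovered from it with difflib.restore(). The following is a small sketch on made-up lists (`old` and `new` are illustrative names only).

```python
import difflib

# Hypothetical sample data.
old = ["one", "two", "three"]
new = ["one", "2", "three", "four"]

delta = list(difflib.ndiff(old, new))

# restore(delta, 1) yields the lines of the first sequence,
# restore(delta, 2) yields the lines of the second.
recovered_old = list(difflib.restore(delta, 1))
recovered_new = list(difflib.restore(delta, 2))

print(recovered_old == old)
print(recovered_new == new)
```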

## Other Output Formats

While the Differ class shows all of the input lines, a unified diff only includes modified lines and a bit of context.

In [None]:
import difflib
# from difflib_data import *

diff = difflib.unified_diff(text1_lines, text2_lines, lineterm='')
print('\n'.join(list(diff)))
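
The amount of context and the file labels in the header are controlled by the `n`, `fromfile`, and `tofile` arguments of unified_diff(). A minimal sketch, using made-up data and the hypothetical labels `old.txt` and `new.txt`:

```python
import difflib

# Hypothetical sample data: ten lines, with one line changed.
old = ["line %d" % i for i in range(1, 11)]
new = list(old)
new[4] = "line five"

unified = list(difflib.unified_diff(
    old, new,
    fromfile="old.txt",   # label only, no file is read
    tofile="new.txt",     # label only, no file is read
    n=1,                  # one line of context on each side of a change
    lineterm="",          # input lines have no trailing newline
))
print("\n".join(unified))
```

With `n=1` the single hunk covers lines 4 through 6 of both inputs, so its header reads `@@ -4,3 +4,3 @@`.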

Using context_diff() produces similarly readable output.

In [None]:
import difflib
# from difflib_data import *

diff = difflib.context_diff(text1_lines, text2_lines, lineterm='')
print('\n'.join(list(diff)))
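
In the context format, changed lines are marked with `!` and shown once for each input, instead of as a `-`/`+` pair. A short sketch on made-up data, again with the hypothetical labels `old.txt` and `new.txt`:

```python
import difflib

# Hypothetical sample data: ten lines, with one line changed.
old = ["line %d" % i for i in range(1, 11)]
new = list(old)
new[4] = "line five"

context = list(difflib.context_diff(
    old, new,
    fromfile="old.txt",   # label only
    tofile="new.txt",     # label only
    n=1,                  # one line of context around each change
    lineterm="",
))
print("\n".join(context))
```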

## HTML Output

HtmlDiff produces HTML output with the same information as Differ.

In [None]:
import difflib

d = difflib.HtmlDiff()
print(d.make_table(text1_lines, text2_lines))
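
make_table() returns only the table markup, while make_file() returns a complete HTML document. The sketch below builds both as strings from made-up data; the column labels `old.txt` and `new.txt` are hypothetical, and nothing is written to disk.

```python
import difflib

# Hypothetical sample data.
old = ["alpha", "the quick brown fox", "gamma"]
new = ["alpha", "the quick brown cat", "gamma", "delta"]

html_diff = difflib.HtmlDiff(wrapcolumn=40)  # wrap long lines at 40 columns

# Just the <table> element, for embedding in an existing page.
table_html = html_diff.make_table(old, new)

# A full standalone page; context=True limits output to changed regions,
# with numlines lines of context around each change.
page_html = html_diff.make_file(
    old, new,
    fromdesc="old.txt",
    todesc="new.txt",
    context=True,
    numlines=1,
)

print(len(table_html), len(page_html))
```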

## Junk Data

Differ, ndiff(), and HtmlDiff accept linejunk and charjunk arguments to indicate which lines, and which characters within a line, should be treated as junk; SequenceMatcher takes a single isjunk function for the same purpose. unified_diff() and context_diff() do not take junk arguments. These parameters can be used to skip over markup or whitespace changes in two versions of a file, for example.

In [None]:
# This example is taken from the source for difflib.py.

from difflib import SequenceMatcher

A = " abcd"
B = "abcd abcd"

print('A = %r' % A)
print('B = %r' % B)

print('\nWithout junk detection:')

s = SequenceMatcher(None, A, B)
i, j, k = s.find_longest_match(0, 5, 0, 9)
print('  i = %d' % i)
print('  j = %d' % j)
print('  k = %d' % k)
print('  A[i:i+k] = %r' % A[i:i+k])
print('  B[j:j+k] = %r' % B[j:j+k])

print('\nTreat spaces as junk:')

s = SequenceMatcher(lambda x: x==" ", A, B)
i, j, k = s.find_longest_match(0, 5, 0, 9)
print('  i = %d' % i)
print('  j = %d' % j)
print('  k = %d' % k)
print('  A[i:i+k] = %r' % A[i:i+k])
print('  B[j:j+k] = %r' % B[j:j+k])
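
difflib also ships two ready-made junk predicates, IS_LINE_JUNK and IS_CHARACTER_JUNK, which can be passed to ndiff(). The sketch below uses made-up input lines. Note that junk elements are not dropped from the output; they only influence which matches the algorithm prefers, so the delta can still be restored to the original input.

```python
import difflib

# IS_LINE_JUNK: blank lines, or lines containing only a '#'.
line_junk_results = [
    difflib.IS_LINE_JUNK("\n"),
    difflib.IS_LINE_JUNK("  #  \n"),
    difflib.IS_LINE_JUNK("hello\n"),
]
# IS_CHARACTER_JUNK: a space or a tab.
char_junk_results = [
    difflib.IS_CHARACTER_JUNK(" "),
    difflib.IS_CHARACTER_JUNK("\t"),
    difflib.IS_CHARACTER_JUNK("x"),
]
print(line_junk_results)
print(char_junk_results)

# Hypothetical sample data with a blank line that moves.
old = ["a = 1\n", "\n", "b = 2\n"]
new = ["a = 1\n", "b = 3\n", "\n"]

delta = list(difflib.ndiff(
    old, new,
    linejunk=difflib.IS_LINE_JUNK,
    charjunk=difflib.IS_CHARACTER_JUNK,
))
print("".join(delta), end="")

# Junk lines are still present in the delta.
restored_old = list(difflib.restore(delta, 1))
```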

## Comparing Arbitrary Types

The SequenceMatcher class compares two sequences of any types, as long as the values are hashable. It uses an algorithm to identify the longest contiguous matching blocks from the sequences, eliminating “junk” values that do not contribute to the real data.

This example compares two lists of integers and uses get_opcodes() to derive the instructions for converting the original list into the newer version. The modifications are applied in reverse order so that the list indexes remain accurate after items are added and removed.

In [None]:
import difflib
# from difflib_data import *

s1 = [ 1, 2, 3, 5, 6, 4 ]
s2 = [ 2, 3, 5, 4, 6, 1 ]

print('Initial data:')
print('s1 =', s1)
print('s2 =', s2)
print('s1 == s2:', s1==s2)
print()

matcher = difflib.SequenceMatcher(None, s1, s2)
for tag, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):

    if tag == 'delete':
        print('Remove %s from positions [%d:%d]' % (s1[i1:i2], i1, i2))
        del s1[i1:i2]

    elif tag == 'equal':
        print('The sections [%d:%d] of s1 and [%d:%d] of s2 are the same' % \
            (i1, i2, j1, j2))

    elif tag == 'insert':
        print('Insert %s from [%d:%d] of s2 into s1 at %d' % \
            (s2[j1:j2], j1, j2, i1))
        s1[i1:i2] = s2[j1:j2]

    elif tag == 'replace':
        print('Replace %s from [%d:%d] of s1 with %s from [%d:%d] of s2' % (
            s1[i1:i2], i1, i2, s2[j1:j2], j1, j2))
        s1[i1:i2] = s2[j1:j2]

    print('s1 =', s1)
    print('s2 =', s2)
    print()

print('s1 == s2:', s1==s2)
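
An alternative to editing the list in place in reverse order is to walk the opcodes forward and build a new list, which avoids the index bookkeeping altogether. This is a sketch only; `apply_opcodes` is a hypothetical helper name, not part of difflib.

```python
import difflib

def apply_opcodes(a, b):
    """Rebuild b from a by following the opcodes (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(None, a, b)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged run: copy it from the original sequence.
            result.extend(a[i1:i2])
        elif tag in ("insert", "replace"):
            # New material comes from the second sequence.
            result.extend(b[j1:j2])
        # 'delete' contributes nothing to the result.
    return result

s1 = [1, 2, 3, 5, 6, 4]
s2 = [2, 3, 5, 4, 6, 1]

rebuilt = apply_opcodes(s1, s2)
print("rebuilt =", rebuilt)
print("rebuilt == s2:", rebuilt == s2)
```

Because the inputs are never modified, the opcode indexes stay valid throughout the loop.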

In [10]:
# testing to understand 
import difflib

seqa = "ATTATAT"
seqb = "CGCGTAC"
seqc = "ATTGCGC"
seqd = "GCATTAT"

s = difflib.SequenceMatcher(None, seqa, seqc)

print("{}".format(s.find_longest_match(0, len(seqa), 0, len(seqc))))

Match(a=0, b=0, size=3)
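
find_longest_match() is the building block for the rest of SequenceMatcher: the same search is repeated on the pieces to the left and right of each match. The sketch below imitates that idea for illustration. `matching_blocks` is a hypothetical helper, not the library's implementation, and it skips details such as merging adjacent blocks.

```python
import difflib

def matching_blocks(a, b):
    """Collect (i, j, size) matches by recursive longest-match search."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)

    def recurse(alo, ahi, blo, bhi):
        i, j, k = matcher.find_longest_match(alo, ahi, blo, bhi)
        if k == 0:
            return []  # nothing in common within this window
        left = recurse(alo, i, blo, j)             # pieces before the match
        right = recurse(i + k, ahi, j + k, bhi)    # pieces after the match
        return left + [(i, j, k)] + right

    return recurse(0, len(a), 0, len(b))

seqa = "ATTATAT"
seqd = "GCATTAT"

blocks = matching_blocks(seqa, seqd)
print(blocks)

# get_matching_blocks() ends with a zero-length sentinel, dropped here.
builtin = difflib.SequenceMatcher(None, seqa, seqd).get_matching_blocks()[:-1]
print(builtin)
```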


In [2]:
help(difflib.SequenceMatcher)

Help on class SequenceMatcher in module difflib:

class SequenceMatcher(builtins.object)
 |  SequenceMatcher is a flexible class for comparing pairs of sequences of
 |  any type, so long as the sequence elements are hashable.  The basic
 |  algorithm predates, and is a little fancier than, an algorithm
 |  published in the late 1980's by Ratcliff and Obershelp under the
 |  hyperbolic name "gestalt pattern matching".  The basic idea is to find
 |  the longest contiguous matching subsequence that contains no "junk"
 |  elements (R-O doesn't address junk).  The same idea is then applied
 |  recursively to the pieces of the sequences to the left and to the right
 |  of the matching subsequence.  This does not yield minimal edit
 |  sequences, but does tend to yield matches that "look right" to people.
 |  
 |  SequenceMatcher tries to compute a "human-friendly diff" between two
 |  sequences.  Unlike e.g. UNIX(tm) diff, the fundamental notion is the
 |  longest *contiguous* & junk-free ma