For a number written in Roman numerals to be considered valid, there are basic rules which must be followed. Even though the rules allow some numbers to be expressed in more than one way, there is always a "best" way of writing a particular number.
For example, it would appear that there are at least six ways of writing the number sixteen:
<p class="margin_left monospace">IIIIIIIIIIIIIIII<br>
VIIIIIIIIIII<br>
VVIIIIII<br>
XIIIIII<br>
VVVI<br>
XVI</p>
However, according to the rules only <span class="monospace">XIIIIII</span> and <span class="monospace">XVI</span> are valid, and the last example is considered to be the most efficient, as it uses the fewest numerals.
The 11K text file, <a href="PE089_roman.txt">roman.txt</a> (right click and 'Save Link/Target As...'), contains one thousand numbers written in valid, but not necessarily minimal, Roman numerals; see <a href="about=roman_numerals">About... Roman Numerals</a> for the definitive rules for this problem.
Find the number of characters saved by writing each of these in their minimal form.
<p class="smaller">Note: You can assume that all the Roman numerals in the file contain no more than four consecutive identical units.</p>
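
One way to make the idea of a "best" form concrete is to convert a numeral to an integer and then write that integer back out greedily, largest denomination first. The sketch below is not part of the original notebook; the names `VALUES`, `DENOMINATIONS`, `roman_to_int` and `int_to_roman` are hypothetical, and it assumes the input is a valid numeral in the sense of the problem, with thousands written as repeated `M`.

```python
# Values of the seven numerals.
VALUES = {'I': 1, 'V': 5, 'X': 10, 'L': 50, 'C': 100, 'D': 500, 'M': 1000}

# Denominations in descending order, including the six subtractive pairs.
DENOMINATIONS = [
    (1000, 'M'), (900, 'CM'), (500, 'D'), (400, 'CD'),
    (100, 'C'), (90, 'XC'), (50, 'L'), (40, 'XL'),
    (10, 'X'), (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
]


def roman_to_int(numeral):
    """Return the integer value of a valid Roman numeral string."""
    total = 0
    # Pair each numeral with the one that follows it (a space after the last).
    for current, following in zip(numeral, numeral[1:] + ' '):
        value = VALUES[current]
        # A smaller numeral directly before a larger one is subtracted (IX = 9).
        if value < VALUES.get(following, 0):
            total -= value
        else:
            total += value
    return total


def int_to_roman(number):
    """Return the minimal Roman numeral for a positive integer."""
    parts = []
    for value, symbol in DENOMINATIONS:
        # Use as many copies of each denomination as fit, largest first.
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return ''.join(parts)


print(roman_to_int('XIIIIII'))                # value of a valid, non-minimal form
print(int_to_roman(roman_to_int('XIIIIII')))  # its minimal form
```

For the sixteen example, both valid spellings map to the same integer, and writing that integer back gives `XVI`.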

In [60]:
import re

In [61]:
with open('PE089_roman.txt', 'r') as my_file:
    roman = my_file.read().split()

In [62]:
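# Exploratory check: match the leading run of M's in the first numeral.
# The result `a` is not used in the cells that follow.
pattern = r'M+'

a = re.match(pattern=pattern, string=roman[0])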

In [63]:
# Non-minimal runs and their minimal replacements. Order matters: each
# five-letter form (e.g. DCCCC) is handled before its four-letter suffix
# (CCCC), otherwise DCCCC would become DCD instead of CM.
dcccc = re.compile(r'DC{4}')  # DCCCC (900) -> CM
cccc = re.compile(r'C{4}')    # CCCC  (400) -> CD
lxxxx = re.compile(r'LX{4}')  # LXXXX (90)  -> XC
xxxx = re.compile(r'X{4}')    # XXXX  (40)  -> XL
viiii = re.compile(r'VI{4}')  # VIIII (9)   -> IX
iiii = re.compile(r'I{4}')    # IIII  (4)   -> IV
final = []   # minimal form of each numeral, in file order
charac = 0   # total number of characters saved
for st in roman:
    minimal = dcccc.sub('CM', st)
    minimal = cccc.sub('CD', minimal)
    minimal = lxxxx.sub('XC', minimal)
    minimal = xxxx.sub('XL', minimal)
    minimal = viiii.sub('IX', minimal)
    minimal = iiii.sub('IV', minimal)
    final.append(minimal)
    charac += len(st) - len(minimal)
print(charac)

743
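
The substitution chain above can also be packaged as a small function, which makes it easy to check individual numerals. This is a minimal sketch rather than the notebook's own code: `SUBSTITUTIONS`, `minimal_form` and `characters_saved` are hypothetical names, and it relies on the problem's guarantee that the input is valid with no more than four consecutive identical units.

```python
import re

# Ordered (pattern, replacement) pairs. Each five-letter form comes before
# its four-letter suffix so that, for example, DCCCC becomes CM rather than DCD.
SUBSTITUTIONS = [
    (re.compile(r'DC{4}'), 'CM'),
    (re.compile(r'C{4}'), 'CD'),
    (re.compile(r'LX{4}'), 'XC'),
    (re.compile(r'X{4}'), 'XL'),
    (re.compile(r'VI{4}'), 'IX'),
    (re.compile(r'I{4}'), 'IV'),
]


def minimal_form(numeral):
    """Apply the substitutions in order to a valid Roman numeral."""
    for pattern, replacement in SUBSTITUTIONS:
        numeral = pattern.sub(replacement, numeral)
    return numeral


def characters_saved(numerals):
    """Total length difference between the originals and their minimal forms."""
    return sum(len(n) - len(minimal_form(n)) for n in numerals)


# Two numerals taken from the notebook's own output listing.
samples = ['MMMDLXVIIII', 'MMCCCCXXXXV']
for sample in samples:
    print(sample, '->', minimal_form(sample))
print(characters_saved(samples))
```

Keeping the pairs in one ordered list puts the ordering constraint in a single visible place instead of spreading it across six separate calls.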


In [64]:
for orig, changed in zip(roman, final):
    if orig != changed:
        print(f'The original {orig} was corrected to {changed}')

The original MMMDLXVIIII was corrected to MMMDLXIX
The original MMCCCLXXXXIX was corrected to MMCCCXCIX
The original MDCCCXXIIII was corrected to MDCCCXXIV
The original MMMMDCCCCI was corrected to MMMMCMI
The original MCCLXXVIIII was corrected to MCCLXXIX
The original MMMMCCXXXXI was corrected to MMMMCCXLI
The original MMMDCCCCXXXIV was corrected to MMMCMXXXIV
The original CDXVIIII was corrected to CDXIX
The original MMMMCCCLXXXXVI was corrected to MMMMCCCXCVI
The original MMMDCCCVIIII was corrected to MMMDCCCIX
The original DCCLXXXIIII was corrected to DCCLXXXIV
The original MDCCCCXXXII was corrected to MCMXXXII
The original MMMMCMLXXXXVIII was corrected to MMMMCMXCVIII
The original MMDCCCLXXXIIII was corrected to MMDCCCLXXXIV
The original MMCCCCXXXXV was corrected to MMCDXLV
The original MMMMDLXXXVIIII was corrected to MMMMDLXXXIX
The original MMDCCCCLXXVI was corrected to MMCMLXXVI
The original MCCCCLXX was corrected to MCDLXX
The original MMCDLVIIII was corrected to MMCDLIX
The ori