http://codekata.com/kata/kata08-conflicting-objectives/

For this kata, we’re going to write a program to solve a simple problem, and we’re going to write it with three different sub-objectives. Our program is going to process the dictionary we used in previous kata, this time looking for all six-letter words that are composed of two concatenated smaller words. For example:


~~~
  al + bums => albums
  bar + ely => barely
  be + foul => befoul
  con + vex => convex
  here + by => hereby
  jig + saw => jigsaw
  tail + or => tailor
  we + aver => weaver
~~~
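
As a minimal sketch of the task, assuming a tiny in-memory word list stands in for the real dictionary (the `words` list and the `find_joined` name are hypothetical), each six-letter word can be split at every one of its five interior positions and both halves checked against the set of known words:

```python
def find_joined(words, size=6):
    """Return {long_word: [(left, right), ...]} for words of length `size`."""
    known = set(words)
    result = {}
    for word in known:
        if len(word) != size:
            continue
        # Try every interior split point and keep the splits whose
        # two halves are both known words.
        pairs = [(word[:i], word[i:]) for i in range(1, size)
                 if word[:i] in known and word[i:] in known]
        if pairs:
            result[word] = pairs
    return result

# Hypothetical sample data, not the kata's dictionary.
words = ["jig", "saw", "jigsaw", "con", "vex", "convex", "banana"]
joined = find_joined(words)
for word in sorted(joined):
    for left, right in joined[word]:
        print(f"{left} + {right} => {word}")
```

With a real dictionary the same word usually has several valid splits, which is why the result maps each word to a list of pairs.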

Write the program three times.

    The first time, make the program as readable as you can make it.
    The second time, optimize the program to run as fast as you can make it.
    The third time, write as extendible a program as you can.

Now look back at the three programs and think about how each of the three sub-objectives interacts with the others. For example, does making the program as fast as possible make it more or less readable? Does it make it easier to extend? Does making the program readable make it slower or faster, flexible or rigid? And does making it extendible make it more or less readable, slower or faster? Are any of these correlations stronger than others? What does this mean in terms of optimizations you may perform on the code you write?

In [None]:
!! head -13 ../data/wordlist.txt

In [None]:
!! egrep '^bar.?.?.?$' ../data/wordlist.txt | sort

<h1>Solution 1</h1>
Readability

In [None]:
import codecs

class JoinedWords():
    """
    Reads a dictionary file and finds the six letter words made of two shorter words.
    e.g. 
        jig + saw => jigsaw
    """
    def __init__(self, dictionary_file, encoding='utf-8'):
        self.shortWords = set()   # words shorter than six letters
        self.longWordsDict = {}   # six-letter word -> list of [left, right] shortWords pairs that form it
        
        with codecs.open(dictionary_file, 'r', encoding) as f:
            for word in f.read().split():
                if len(word) < 6:
                    self.shortWords.add(word)
                elif len(word) == 6:
                    self.longWordsDict.setdefault(word,list())
                    
        for keyWord in self.longWordsDict.keys():
            for i in range(1,6,1):
                leftWord = keyWord[:i]   # leftmost i chars
                rightWord = keyWord[i:]  # remaining 6 - i chars
                if leftWord in self.shortWords and rightWord in self.shortWords:
                    self.longWordsDict[keyWord].append([leftWord, rightWord])
                    
    def getJoinedWords(self, longWord):
        """
        Returns the list of shortWords pairs that combine to make longWord
        """
        return self.longWordsDict.get(longWord, [])

                

<h1>Unit Tests</h1>

In [None]:
from unittest import *

class JoinedWordsTests(TestCase):
    
    @classmethod
    def setUpClass(cls):
        # Build the word index once and share it across all tests in the class
        cls.jw = JoinedWords('../data/wordlist.txt', 'iso-8859-1')
        
    def setUp(self):
        pass
        
        
    def test_joinedWords_bulk1(self):
        # Check expected results
        self.testWords = {
              'albums' :[['al', 'bums'], ['alb', 'ums'], ['album', 's']], # Example incomplete
              'barely' :[['ba','rely'],],                                 # Example wrong
              'befoul' :[['be','foul'],],
              'convex' :[['con','vex'],],
              'hereby' :[['here','by'],],
              'jigsaw' :[['jig', 'saw'], ['jigs', 'aw']],                 # Example incomplete
              'tailor' :[['tai', 'lor'], ['tail', 'or']],                 # Example incomplete
              'weaver' :[['we', 'aver'], ['weave', 'r']],                 # Example incomplete
        }
        for tk in self.testWords.keys():
            joinedWords = self.jw.getJoinedWords(tk)
            self.assertEqual(self.testWords[tk], joinedWords)
            

# Load the tests from the TestCase class itself rather than from an instance
suite = TestLoader().loadTestsFromTestCase(JoinedWordsTests)
TextTestRunner().run(suite)

<h1>Solution 2</h1>
Performance

Baseline timing (33 constructions with wordSize=10):
~~~
%time [JoinedWords('../data/wordlist.txt', 'iso-8859-1', 10) for x in range(33)]

CPU times: user 40.7 s, sys: 16.9 s, total: 57.6 s
Wall time: 1min 3s
~~~
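
`%time` is an IPython magic, so the baseline above only runs inside a notebook. A rough equivalent for a plain script, assuming the standard-library `time.perf_counter` is an acceptable clock and that `build` is any zero-argument callable (the `time_runs` helper and the stand-in workload are hypothetical), might look like:

```python
import time

def time_runs(build, repeats=33):
    """Call build() `repeats` times and return the total wall-clock seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        build()
    return time.perf_counter() - start

# Stand-in workload; in the notebook this would be something like
# lambda: JoinedWords('../data/wordlist.txt', 'iso-8859-1', 10)
elapsed = time_runs(lambda: sorted(range(1000)), repeats=33)
print(f"Wall time: {elapsed:.4f} s")
```

This reports wall-clock time only; it does not break the figure down into user and sys CPU time the way `%time` does.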


In [10]:
import codecs

# Best so far
class JoinedWords():
    def __init__(self, dictionary_file, encoding='utf-8', wordSize=6):
        self.shortWords = set()  # words shorter than wordSize
        self.longWordsDict = {}  # wordSize-letter word -> list of [left, right] shortWords pairs that form it
        
        with codecs.open(dictionary_file, 'r', encoding) as f:
            for word in f.read().split():
                if len(word) < wordSize:
                    self.shortWords.add(word)
                elif len(word) == wordSize:
                    self.longWordsDict.setdefault(word,list())
                    
        neg = -1*wordSize
        for keyWord in self.longWordsDict:            
            for i in range(1,wordSize,1):
                # Inline leftWord and rightWord interim assignments
                if keyWord[neg+i:] in self.shortWords \
                  and keyWord[:i] in self.shortWords: # rightWord will be longer/less probable so short-circuit on that
                    self.longWordsDict[keyWord].append([keyWord[:i], keyWord[neg+i:]])
        
        # Hmmm - list comprehensions haven't helped performance ( or legibility )
        #for keyWord in self.longWordsDict: 
        #    self.longWordsDict[keyWord] = [ [keyWord[:i], keyWord[neg+i:]] \
        #                                    for i in range(1,wordSize,1)   \
        #                                    if keyWord[neg+i:] in self.shortWords and keyWord[:i] in self.shortWords]
                    
    def getJoinedWords(self, longWord):
        """
        Returns the list of shortWords pairs that combine to make longWord
        """
        return self.longWordsDict.get(longWord, [])
    

In [None]:
# Worse :(
# CPU times: user 1min 10s, sys: 20.3 s, total: 1min 30s
# Wall time: 1min 36s
class JoinedWords():
    """
    Split dicts.
    Does making many small sets improve search time?
    Empirically - No.
    """
    def __init__(self, dictionary_file, encoding='utf-8', wordSize=6):
        self.shortWords = {}  # a dict of sets of words, keyed by upper-cased first letter
        self.longWordsDict = {}  # wordSize-letter word -> list of [left, right] shortWords pairs that form it
        
        with codecs.open(dictionary_file, 'r', encoding) as f:
            for word in f.read().split():
                if len(word) < wordSize:
                    self.shortWords.setdefault(word[0].upper(),set())
                    self.shortWords[word[0].upper()].add(word)
                elif len(word) == wordSize:
                    self.longWordsDict.setdefault(word,list())
                    
        neg = -1*wordSize
        for keyWord in self.longWordsDict:            
            for i in range(1,wordSize,1):
                # Inline leftWord and rightWord interim assignments
                if keyWord[neg+i:] in self.shortWords.get(keyWord[neg+i:][0].upper(),set()) \
                  and keyWord[:i] in self.shortWords.get(keyWord[:i][0].upper(),set()): # rightWord will be longer/less probable so short-circuit on that
                    self.longWordsDict[keyWord].append([keyWord[:i], keyWord[neg+i:]])
                            
    def getJoinedWords(self, longWord):
        """
        Returns the list of shortWords pairs that combine to make longWord
        """
        return self.longWordsDict.get(longWord, [])
    

In [None]:
%time [JoinedWords('../data/wordlist.txt', 'iso-8859-1', 10) for x in range(33)]

In [2]:
%reload_ext line_profiler

In [5]:
%lprun -f jw.__init__ jw.__init__('../data/wordlist.txt', 'iso-8859-1', 10)

<h3>&#37;lprun output</h3>
~~~
Timer unit: 1e-06 s

Total time: 3.33134 s
File: <ipython-input-1-eccc2ac6af9d>
Function: __init__ at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                               def __init__(self, dictionary_file, encoding='utf-8', wordSize=6):
     5         1        40836  40836.0      1.2          self.shortWords = set()  # a set of words
     6         1        49784  49784.0      1.5          self.longWordsDict = {}  # a dict of words list of the pairs of shortWords that make up the dict key word. 
     7                                                   
     8         1          143    143.0      0.0          with codecs.open(dictionary_file, 'r', encoding) as f:
     9    338883       474724      1.4     14.3              for word in f.read().split():
    10    338882       438456      1.3     13.2                  if len(word) < wordSize:
    11    195220       337116      1.7     10.1                      self.shortWords.add(word)
    12    143662       190462      1.3      5.7                  elif len(word) == wordSize:
    13     43229       206047      4.8      6.2                      self.longWordsDict.setdefault(word,list())
    14                                                               
    15         1            6      6.0      0.0          neg = -1*wordSize
    16     43230        54559      1.3      1.6          for keyWord in self.longWordsDict:            
    17    432290       558961      1.3     16.8              for i in range(1,wordSize,1):
    18                                                           # Inline leftWord and rightWord interim assignments
    19    389061       858321      2.2     25.8                  if keyWord[neg+i:] in self.shortWords                   and keyWord[:i] in self.shortWords: # rightWord will be longer/less probable so short-circuit on that
    20     36838       121929      3.3      3.7                      self.longWordsDict[keyWord].append([keyWord[:i], keyWord[neg+i:]])
~~~

<h1>Solution 3</h1>
Extensibility

In [8]:
import codecs

class JoinedWords():
    """
    Reads a dictionary file and finds the n letter words made of two shorter words.
    e.g. 
        jig + saw => jigsaw
    """
    def __init__(self, dictionary_file, encoding='utf-8', wordSize=6):
        self.shortWords = set()   # words shorter than wordSize
        self.longWordsDict = {}   # wordSize-letter word -> list of [left, right] shortWords pairs that form it
        
        with codecs.open(dictionary_file, 'r', encoding) as f:
            for word in f.read().split():
                if len(word) < wordSize:
                    self.shortWords.add(word)
                elif len(word) == wordSize:
                    self.longWordsDict.setdefault(word,list())
                    
        for keyWord in self.longWordsDict.keys():
            for i in range(1,wordSize,1):
                leftWord = keyWord[:i]   # leftmost i chars
                rightWord = keyWord[i:]  # remaining wordSize - i chars
                if leftWord in self.shortWords and rightWord in self.shortWords:
                    self.longWordsDict[keyWord].append([leftWord, rightWord])
                    
    def getJoinedWords(self, longWord):
        """
        Returns the list of shortWords pairs that combine to make longWord
        """
        return self.longWordsDict.get(longWord, [])
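
The class above generalizes the word length but still assumes exactly two parts. One direction the extensibility goal could be pushed further, sketched here under the assumption that splitting into more than two words is a plausible future requirement (the `split_into_parts` function, its `parts` parameter, and the sample `known` set are hypothetical and not part of the notebook's class):

```python
def split_into_parts(word, known, parts=2):
    """Return every way to split `word` into exactly `parts` known words."""
    if parts == 1:
        return [[word]] if word in known else []
    splits = []
    # Leave at least one character for each of the remaining parts.
    for i in range(1, len(word) - parts + 2):
        left = word[:i]
        if left in known:
            for rest in split_into_parts(word[i:], known, parts - 1):
                splits.append([left] + rest)
    return splits

# Hypothetical sample data, not the kata's dictionary.
known = {"jig", "saw", "sa", "w"}
print(split_into_parts("jigsaw", known, parts=2))
print(split_into_parts("jigsaw", known, parts=3))
```

With `parts=2` this performs the same left/right membership check as the loop in `__init__`, so the recursive form costs a little readability in exchange for the extra flexibility.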

                

In [9]:
from unittest import *

class JoinedWords3Tests(TestCase):
    
    @classmethod
    def setUpClass(cls):
        # Build the word index once and share it across all tests in the class
        cls.jw = JoinedWords('../data/wordlist.txt', 'iso-8859-1', wordSize=10)
        
    def setUp(self):
        pass
        
        
    def test_joinedWords_bulk1(self):
        # Check expected results
        self.testWords = {
              'demonesses': [['demo', 'nesses'], ['demon', 'esses'], ['demoness', 'es']],
              'deliberate': [['de', 'liberate'], ['deli', 'berate']],
              'threadless': [['thread', 'less']],
              "longshot's": [['long', "shot's"]],
        }
        for tk in self.testWords.keys():
            joinedWords = self.jw.getJoinedWords(tk)
            self.assertEqual(self.testWords[tk], joinedWords)
            

# Load the tests from the TestCase class itself rather than from an instance
suite = TestLoader().loadTestsFromTestCase(JoinedWords3Tests)
TextTestRunner().run(suite)

.
----------------------------------------------------------------------
Ran 1 test in 1.194s

OK


<unittest.runner.TextTestResult run=1 errors=0 failures=0>

In [4]:
jw = JoinedWords('../data/wordlist.txt', 'iso-8859-1', 10)

In [None]:
jw.longWordsDict

In [None]:
jw.shortWords