# Problem Set 5: Text Splitting
Kim Merchant

We have a string of text from which all the spacing has been lost, like `asknotwhatyourcountrycandoforyou`.

We want to split the text back into a sequence of valid words, like:

`['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']`.

In [1]:
# This code gets us a set of the 1000 most common English words.
# We'll consider a word to be valid if it's in this set.
from requests import get
url = "https://myslu.stlawu.edu/~ltorrey/algorithms/common_words.txt"
english = set(get(url).text.split())

### 1) Greedy attempt

Fill in the function below to approach this problem in a greedy way: take the first valid word you can find at the end of the text, and move on from there. If you can't find any valid words at the end of the text, just return an empty list.

In [2]:
def greedy_split(text):
    ourList = []
    i = len(text) # text[:i] is the part of the text still left to split
    while i > 0:
        found = False
        for j in range(1, i+1): # j is the number of letters we're assessing, counted back from i
            if text[i-j:i] in english: # if the last j letters before i form a word
                ourList.append(text[i-j:i]) # add the word to the list
                i -= j # move i back so no letters are used twice
                found = True
                break
        if not found: # no valid word ends at i, so stop
            break
    ourList.reverse()
    return ourList

In [3]:
# Testing
print(greedy_split("asknotwhatyourcountrycandoforyou"))

['and', 'of', 'or', 'you']
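
To see why the greedy result above drops most of the sentence, here is a small trace of the same idea. It is only a sketch: `TOY_WORDS` is a hypothetical seven-word stand-in for `english`, and `greedy_trace` is an illustrative helper rather than part of the assignment. It always takes the shortest valid word at the end of what is left and records each step.

```python
# Hypothetical stand-in for the `english` set, so the example runs offline.
TOY_WORDS = {"can", "do", "for", "you", "or", "of", "and"}

def greedy_trace(text, words):
    """Greedy split from the end; returns the steps taken and the unsplit leftover."""
    steps = []
    end = len(text)  # text[:end] is the part still left to split
    while end > 0:
        for length in range(1, end + 1):
            candidate = text[end - length:end]
            if candidate in words:
                # commit to the first (shortest) valid word and never reconsider it
                steps.append((text[:end], candidate))
                end -= length
                break
        else:
            break  # no valid word ends here, so the greedy approach is stuck
    return steps, text[:end]

steps, leftover = greedy_trace("candoforyou", TOY_WORDS)
for remaining, word in steps:
    print(repr(remaining), "-> takes", repr(word))
print("unsplit leftover:", repr(leftover))
```

With this toy set the trace takes `you`, `or`, `of`, and `and`, then stalls on the leftover `c`. Committing to `or` instead of `for` is what makes `can do for you` unreachable.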


### 2) Combinatorial algorithm

Fill in the function below to approach this problem combinatorially: consider *every* valid word you can find at the end of the text, and try moving on from there. Go with the option that gets you the largest number of valid words.

In [4]:
def combinatorial_split(text):
    # store what will ultimately be the solution
    best = []
    
    # base case, no more characters to assess
    if len(text) == 0:
        return []
    
    else:
        # try each group of 1 or more letters from the end to see if there's a word
        for i in range(1, len(text)+1): 
            temp = text[len(text)-i:]
            
            if temp in english:
                # if there is a word, see how many can be made from the text before it
                tempList = combinatorial_split(text[:len(text)-i]) + [temp]
                
                # if this one word leads to a longer total list, replace the best list
                if len(tempList) > len(best):
                    best = tempList
    
    # after trying each of the possible word combinations, return the longest resultant list
    return best

In [5]:
# Testing
print(combinatorial_split("asknotwhatyourcountrycandoforyou"))
print(combinatorial_split("usetheaskofwhatyourcando"))

['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']
['set', 'he', 'ask', 'of', 'what', 'your', 'can', 'do']
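
The recursive approach solves the same prefixes over and over, which is what motivates the dynamic programming version. The sketch below makes that visible under some assumptions: `TOY_WORDS` is a hypothetical two-word set chosen to force many overlapping splits, `plain_split` is a copy of the recursive idea with a call counter added, and `memo_split` is the same recursion cached with `functools.lru_cache`.

```python
from functools import lru_cache

# Hypothetical word set: every run of a's can be split in many different ways.
TOY_WORDS = {"a", "aa"}

def plain_split(text, counter):
    """Recursive split that also counts how many calls are made."""
    counter[0] += 1
    best = []
    for i in range(1, len(text) + 1):
        suffix = text[len(text) - i:]
        if suffix in TOY_WORDS:
            candidate = plain_split(text[:len(text) - i], counter) + [suffix]
            if len(candidate) > len(best):
                best = candidate
    return best

@lru_cache(maxsize=None)
def memo_split(text):
    """Same recursion, but each distinct prefix is solved only once."""
    best = ()  # tuples, so cached results cannot be mutated by callers
    for i in range(1, len(text) + 1):
        suffix = text[len(text) - i:]
        if suffix in TOY_WORDS:
            candidate = memo_split(text[:len(text) - i]) + (suffix,)
            if len(candidate) > len(best):
                best = candidate
    return best

text = "a" * 20
counter = [0]
plain = plain_split(text, counter)
memo = memo_split(text)
print("plain recursive calls:", counter[0])
print("distinct subproblems: ", memo_split.cache_info().misses)
```

Both versions return the same split, but the plain recursion makes thousands of calls for a 20-letter input while the cached one only ever sees 21 distinct prefixes (lengths 0 through 20).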


### 3) Dynamic programming

Fill in the function below to approach this problem with dynamic programming.

Let `sub[i]` be the best word list for `text[:i]`.

In [6]:
def dynamic_split(text):
    sub = {}
    sub[0] = []
    
    # Subproblems: sub[i] will be the best combination of words for text[:i]
    for i in range(1, len(text)+1):
        cur = text[:i]
        best = []

        # try each group of 1 or more letters from the end to see if there's a word
        for n in range(1, len(cur)+1): 
            temp = cur[len(cur)-n:]

            if temp in english:
                # if there is a word, look up the best list for the text before it
                tempList = sub[i-n] + [temp]

                # if this one word leads to a longer total list, replace the best list
                if len(tempList) > len(best):
                    best = tempList

        # after trying each possible last word, store the longest resulting list as the solution for text[:i]
        sub[i] = best
    
    # The overall solution is the best subproblem solution
    return max(sub.values(), key=len)

In [7]:
# Testing
print(dynamic_split("asknotwhatyourcountrycandoforyou"))

['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']
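
It can help to look at the whole table rather than just the final answer. The following is a minimal sketch of the same bottom-up idea, assuming a hypothetical six-word set `TOY_WORDS` in place of `english`. The helper name `split_table` is made up for this illustration.

```python
# Hypothetical stand-in for the `english` set, so the example runs offline.
TOY_WORDS = {"can", "do", "for", "you", "or", "of"}

def split_table(text, words):
    """Build the table where table[i] is the best word list for text[:i]."""
    table = {0: []}
    for i in range(1, len(text) + 1):
        best = []
        for n in range(1, i + 1):
            suffix = text[i - n:i]  # the last n letters of text[:i]
            if suffix in words:
                # reuse the already-solved subproblem for the text before the suffix
                candidate = table[i - n] + [suffix]
                if len(candidate) > len(best):
                    best = candidate
        table[i] = best
    return table

table = split_table("candoforyou", TOY_WORDS)
for i, words_so_far in table.items():
    print(i, repr("candoforyou"[:i]), words_so_far)
```

Prefixes that do not end in a valid word, such as `candofo`, get an empty list. At `candofor` the table prefers `can do for` (three words) over `of or` (two words), which is the choice the greedy approach could not make.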


### 4) Final algorithm

Now make that last improvement to your dynamic programming algorithm.

Instead of storing full subproblem solutions, store just a few pieces of essential information.
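
One way to read "a few pieces of essential information" is a fixed-size record per index: a word count, the last word, and the index where that word starts. The sketch below assumes that layout and uses a hand-built, hypothetical table for the text `candoforyou` to show that the full list can still be recovered by following the stored indices backwards. The helper name `walk_back` is made up for this illustration.

```python
# Hypothetical table for "candoforyou": index -> (word_count, last_word, previous_index).
# Only the indices on the winning chain are shown here.
sub = {
    0: (0, None, None),
    3: (1, "can", 0),
    5: (2, "do", 3),
    8: (3, "for", 5),
    11: (4, "you", 8),
}

def walk_back(sub, idx):
    """Rebuild the word list by following previous_index links from idx."""
    words = []
    while sub[idx][1] is not None:  # an entry with no word marks the start of the chain
        count, word, previous_index = sub[idx]
        words.append(word)
        idx = previous_index
    words.reverse()  # words were collected last-to-first
    return words

print(walk_back(sub, 11))
```

Each entry has constant size, so filling in a cell no longer copies a growing list the way `sub[i-n] + [temp]` does. The word list is built once at the end.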

In [8]:
def best_split(text):
    sub = dict()
    sub[0] = (0, None, None)
    
    # Subproblems: sub[i] will be the length of the list so far, the current word, 
    # and the index to go back to for the previous word
    for i in range(1, len(text)+1):
        cur = text[:i]
        best = (0, None, None)

        # try each group of 1 or more letters from the end to see if there's a word
        for n in range(1, len(cur)+1):
            temp = cur[len(cur)-n:]

            if temp in english:
                # if there is a word, look up the best word count for the text before it
                tempList = (1+sub[i-n][0], temp, i-n)

                # if this one word leads to a longer total list, replace the best list tuple
                if tempList[0] > best[0]:
                    best = tempList

        # after trying each of the possible word combinations, set the dictionary value at the key i to the most promising word
        sub[i] = best
        
    # Find the index whose subproblem has the largest word count
    idx = max(sub, key=lambda k: sub[k][0])

    # Now we can reconstruct the overall solution by working backwards from there.
    lis = []
    while sub[idx][1] is not None: # an entry with no word marks the start of the chain
        # record the word that ends at idx
        lis.append(sub[idx][1])
        
        # follow the stored index back to the previous word
        idx = sub[idx][2]
            
    lis.reverse()
    return lis

In [9]:
# Testing
print(best_split("asknotwhatyourcountrycandoforyou"))

['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']
