## Processing Strings

Recall that the Python **string** method **split** breaks a string into words separated by *white space*.

In [1]:
"abcd efgh".split()

['abcd', 'efgh']

In [2]:
"ijkl,   mnop".split()

['ijkl,', 'mnop']
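As a quick check of what "white space" covers here, the small sketch below (the sample strings are made up for illustration) suggests that `split()` called with no arguments treats tabs and newlines like spaces and ignores leading and trailing white space.

```python
# With no arguments, split() treats any run of spaces, tabs and newlines
# as a single separator and drops leading/trailing white space.
sample = "  ab\tcd\n\nef  "   # made-up sample string
parts = sample.split()
print(parts)

# A string containing only white space yields an empty list.
blank_parts = " \t\n".split()
print(blank_parts)
```

This is the behaviour our own function should try to reproduce.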

Let's try to write such a function ourselves now.

In [3]:
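''' Space holder for Code '''

def ws(c):
    # True if c is a white-space character (space, tab or newline)
    return c == ' ' or c == '\t' or c == '\n'

def mysplit(s):
    wordlist = []
    currword = ''
    for i in s:
        if ws(i):
            # White space ends the current word, if there is one
            if currword != '':
                wordlist.append(currword)
                currword = ''
        else:
            currword = currword + i
    # Emit the last word when the string does not end in white space
    if currword != '':
        wordlist.append(currword)
    return wordlist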


In [4]:
mysplit("abcd      efgh ")

['abcd', 'efgh']

Sir's code was confusing, so I wrote my own version.

In [None]:
def ownsplit(s):
    finallist=[]
    temp=""
    for each in s:
        if each!=" " and each!= "\t" and each!= "\n":
            temp+=each
        else:
            finallist.append(temp)
            temp=""
    finallist.append(temp)
    newfinal=[]
    for each2 in finallist:
        if(len(each2)!=0):
            newfinal.append(each2)
    return newfinal

In [None]:
y=ownsplit("abcd      efgh")
y

The right approach to this problem is to think in terms of a *state machine* and implement that.
Think of the program as a machine that is in one of two states: **Inside**, meaning that the scanner is currently in the middle of reading a word, and **Outside**, meaning that it is not.

The machine reads a letter and, based on its current state and this letter, updates its state correctly (correct w.r.t. the intended meaning) and occasionally produces some output. In our example there are 4 basic cases:

  1) In state **Inside** and
  
      (i) the next letter is a whitespace.
      Action: Change state to **Outside** and output the current word (so you have to track that).
      
      (ii) the next letter is a non-whitespace.
      Action: Stay in the same state.
      
  2) In state **Outside** and 
  
     (i) the next letter is a whitespace.
     Action: Stay in the same state
     
     (ii) the next letter is a non-whitespace.
     Action: Change state to **Inside**. Also start recording the current word.
    


In [5]:
st = " This is a long and boring sentence. \nThis is another, and   so on.  k"
Outside = 0
Inside = 1

def ws(c):
    if c in [' ','\t','\n']:
        return True
    return False
  
def words(s):
    State = 0 # 0 means outside, 1 means inside 
    ans = []
    curr = ""
    for i in s:
        if State == Inside:
            if ws(i):
                State = Outside
                ans.append(curr)
            else:
                curr = curr+i
        elif State == Outside:
            if not ws(i):
                State = Inside
                curr = i 
    return(ans)

print(words(st))


['This', 'is', 'a', 'long', 'and', 'boring', 'sentence.', 'This', 'is', 'another,', 'and', 'so', 'on.']


Our state machine should also output the current word, if any, when the input ends (i.e. the last word). Note that the final word 'k' is missing from the output above.

In [6]:
st = " This is a long and boring sentence. \nThis is another, and   so on."
Outside = 0
Inside = 1

def ws(c):
    if c in [' ','\t','\n']:
        return True
    return False

def words(s):
    State = 0 # 0 means outside, 1 means inside 
    ans = []
    curr = ""
    for i in s:
        if State == Inside:
            if ws(i):
                State = Outside
                ans.append(curr)
            else:
                curr = curr+i
        elif State == Outside:
            if not ws(i):
                State = Inside
                curr = i
    if State == Inside:
        ans.append(curr)
    return(ans)

print(words(st))
print(words("This is a test jk"))


['This', 'is', 'a', 'long', 'and', 'boring', 'sentence.', 'This', 'is', 'another,', 'and', 'so', 'on.']
['This', 'is', 'a', 'test', 'jk']
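The four cases listed earlier can also be written down as a lookup table, which makes the link between the state diagram and the program more explicit. This is only a sketch of one possible encoding: the names `TRANSITIONS`, `is_ws`, `words_table` and the action labels are made up for this illustration.

```python
OUTSIDE, INSIDE = 0, 1

def is_ws(c):
    # True for space, tab or newline
    return c in " \t\n"

# (current state, is the character white space?) -> (next state, action)
TRANSITIONS = {
    (INSIDE, True): (OUTSIDE, "emit"),     # case 1(i): word has ended
    (INSIDE, False): (INSIDE, "extend"),   # case 1(ii): word continues
    (OUTSIDE, True): (OUTSIDE, "skip"),    # case 2(i): still between words
    (OUTSIDE, False): (INSIDE, "start"),   # case 2(ii): a new word begins
}

def words_table(s):
    state = OUTSIDE
    ans = []
    curr = ""
    for c in s:
        state, action = TRANSITIONS[(state, is_ws(c))]
        if action == "emit":
            ans.append(curr)
        elif action == "extend":
            curr += c
        elif action == "start":
            curr = c
    # Output the last word if the input ends in the middle of one
    if state == INSIDE:
        ans.append(curr)
    return ans

print(words_table(" This is  a\ttest jk"))
```

Changing the behaviour of the machine then means editing the table rather than the loop.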


Let's make it a bit more interesting by searching for all the **integers** in a string.

In [None]:
'''
This is a long sentence, of more than 20 letters and 8 words. I may have
real numbers such as 3.14 in this, but it should not print them out. But
integers can be the last word in a sentence such as 30. The word a39b contains
the integer 39..45 and .382. That was just to confuse the issue. Does it work with .45?
'''


Here is a possible state machine for this:

There are 3 states:

 <li>  0 --- Outside and unwilling to start a new integer
  
 <li>  1 --- Outside and willing to start a new integer
  
 <li>  2 --- currently inside a potential integer.


The inputs are classified as 

  <li> digits 

  <li> '.' 
  
  <li> all other characters

And the transitions of this machine are:
    
  <li> State 0: (Outside and Unwilling)

       '.' or digit  ---> stay in State 0
       anything else ---> change to State 1

  <li> State 1: (Outside and Willing)
    
       '.'           ---> change to State 0
       digit         ---> change to State 2 (also record this digit)
       anything else ---> stay in State 1

  <li> State 2: (Inside)
    
       digit         ---> stay in State 2 (and record this digit)
       '.'           ---> Peep at the next character (look at it but don't advance the reading point).
                          If that character is:
                             digit     --->  Discard current integer and go to State 0 (it's a real number)
                             otherwise --->  Output this integer and move to State 1.
       anything else ---> Output this integer and move to State 1
       
       

Thus, handling this more involved example needs one more idea: peeping ahead at the input without consuming it. The technical term for this is **look-ahead**.
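To isolate the idea of peeping ahead, here is a minimal sketch. The helpers `peek` and `dot_ends_integer` are hypothetical names introduced only for this illustration; they look at the character after position `i` without moving the reading point.

```python
def peek(s, i):
    """Return the character after position i without advancing,
    or '' if position i is the last one."""
    return s[i + 1] if i + 1 < len(s) else ""

def dot_ends_integer(s, i):
    """Assuming s[i] == '.', return True if this dot ends an integer
    and False if it continues a real number (a digit follows)."""
    return not peek(s, i).isdigit()

print(dot_ends_integer("30. The", 2))   # the dot is followed by a space
print(dot_ends_integer("3.14", 1))      # the dot is followed by a digit
print(dot_ends_integer("30.", 2))       # the dot is the last character
```

The reading position `i` is never changed by these helpers, which is what distinguishes looking ahead from reading.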

For the record, here is a Python program implementing this state machine. Translating a state machine into a Python program is routine.

In [None]:
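def findIntegers(s):
    '''
    0  ---> outside and not ready to enter
    1  ---> outside but ready to enter
    2  ---> inside
    '''
    state = 1          # at the start of the input we are willing to begin an integer
    ans = []
    i = 0
    nextint = []       # digits of the integer currently being read
    e = len(s)
    while i < e:
        if state == 0:
            if (not s[i].isdigit()) and (s[i] != "."):
                state = 1
        elif state == 1:
            if s[i].isdigit():
                state = 2
                nextint = [s[i]]
            elif s[i] == ".":
                state = 0
        elif state == 2:
            if s[i].isdigit():
                nextint.append(s[i])
            elif (s[i] == ".") and (i < e-1) and (s[i+1].isdigit()):
                # look-ahead: '.' followed by a digit means a real number, so discard
                state = 0
            else:
                ans.append("".join(nextint))
                state = 1
        i = i+1
    # Output an integer that runs up to the very end of the input
    if state == 2:
        ans.append("".join(nextint))
    return ans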

ls = '''
This is a long sentence, of more than 20 letters and 8 words. I may have
real numbers such as 3.14 in this, but it should not print them out. But
integers can be the last word in a sentence such as 30. The word a309 contains
the integer 39..45 and .38. 2. That was just to confuse the issue. Does it work with .45?
'''

print(findIntegers(ls))

In summary, pattern matching and searching in strings is best handled with appropriate state machines. These machines sometimes use **look-ahead** to make their choices. Translating these state machines to programs is easy (it can be done automatically).

What about developing the state machines themselves? It turns out that for a very generous language of patterns, the state machine can be generated automatically. This comes from the study of "finite state automata" (essentially state machines) and "regular expressions" and is beyond the scope of this course. But we shall now see how to use the Python library that does this translation.
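As a small preview, and assuming the library meant here is Python's standard `re` module, the sketch below writes the integer search as a pattern. The pattern and the sample text are our own attempt to mirror the three-state machine above (digits not preceded by a digit or '.', and not followed by '.' plus a digit); they are not taken from the course material.

```python
import re

# One or more digits that are
#   - not preceded by a digit or a '.', and
#   - not followed by another digit, nor by a '.' and then a digit.
INTEGER = re.compile(r"(?<![\d.])\d+(?!\d|\.\d)")

# Made-up sample text in the spirit of the example above
text = "more than 20 letters, 3.14 is real, ends with 30. a309 has 39..45 and .38. 2. or .45?"
found = INTEGER.findall(text)
print(found)
```

The two bracketed conditions play the role of the "unwilling" state and the look-ahead in the hand-written machine.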

### State machine based programming

We saw that pattern matching in strings is one possible application of state-based programming. But there are others.
For instance, consider the program that controls a set of lifts. The states record the location of each lift, the status of the doors, the buttons that have been pressed on each floor, etc. The transitions correspond to events such as a button being pressed or a lift reaching a floor.



Another example is the windowing system on a computer. The state records the location of the mouse, the set of windows, the menu that is currently open, and so on. The events correspond to mouse clicks, or to the mouse entering or leaving a window or its header, and so on.


These are examples of **event-based programming**, which is essentially another name for programming with state machines.
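As a toy illustration of the lift example, the sketch below drives a single lift from a list of events. The state names, event names and the `run` function are all hypothetical, and a real controller would track far more (several lifts, floors, pending requests).

```python
# Hypothetical toy lift: (current state, event) -> next state
TRANSITIONS = {
    ("idle_closed", "call"): "moving",       # a button is pressed
    ("moving", "arrive"): "idle_open",       # the lift reaches the floor
    ("idle_open", "timeout"): "idle_closed", # the doors close again
}

def run(events, state="idle_closed"):
    """Feed events to the machine one by one and record every state visited."""
    history = [state]
    for ev in events:
        # An event with no entry for the current state is ignored
        state = TRANSITIONS.get((state, ev), state)
        history.append(state)
    return history

trace = run(["call", "call", "arrive", "timeout"])
print(trace)
```

The second `call` arrives while the lift is already moving and leaves the state unchanged, just as white space left the **Outside** state unchanged in the word splitter.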