## Code challenge - num2word

I am using Jupyter Notebook to solve this task, mainly for two reasons.

- I like how one can run segments of the code in separate cells. This helps people new to Python, because it is easy to play around with the code.
- I also like how one can comment on the flow of the script, explain assumptions, etc.

### Read input from txt file
- The user can select the input file in a pop-up dialog box, for ease of everyday use of the app. Alternatively, type the filename in the code below if the dialog fails.
- The code stores every line of input as a separate item in a list. This starting format helps later when generating auditable outputs.

In [1]:
# prompt user to select input file from pop-up window
from tkinter import Tk, filedialog
root = Tk()
root.withdraw()
root.attributes('-topmost',True)
filename = filedialog.askopenfilename(parent=root)

# fall back to a manually entered filename if the user dialog fails or returns nothing
if not filename:
    filename = 'Test input.txt' # <<-- add the filename of an input stored in the same folder as this notebook, or the full path for a file stored elsewhere

# read input file line by line, and put every line in a list as a separate item
with open(filename, 'r') as f:
    lines = [line.rstrip() for line in f]
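
Where no pop-up dialog is available (for example on a headless server), the reading step can be sketched as a small helper. The name `read_lines`, the explicit `utf-8` encoding and the throwaway sample file are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile

def read_lines(path):
    """Return the lines of a text file as a list, without trailing whitespace."""
    with open(path, 'r', encoding='utf-8') as f:
        return [line.rstrip() for line in f]

# demo with a temporary file holding hypothetical content
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample input.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write('The pump is 536 deep underground.\nWe processed 9121 records.\n')
    sample_lines = read_lines(sample_path)

print(sample_lines)
```

Keeping the reading logic in one function makes it easy to swap the dialog for a fixed path when the notebook runs unattended.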

### Functions
I will build a few functions first to take care of the core tasks: process lines of input text and convert numbers to words. Then I will use the functions via a loop to generate final outputs.

#### Function #1: number to word converter
- Input: integer 
- Output: number spelled out in words 
- Based on expected outputs provided in the exercise, I built the function in such a way that it will return words according to British grammar, hence the few 'if' checks in the second function below.
- Function design: Numbers written in words are combinations of a finite set of building blocks. There is one overarching pattern: every group of three digits is spelled with hundreds, tens (the 'twenties' list below) and ones, and each group is then combined with a scale word (thousand, million, billion, etc.), with one new scale word for every additional 3 digits. The hundreds, tens and ones always come back, and we always write them the same way. One way to construct the converter is therefore to build one subfunction, num000, that spells out the hundreds, tens and ones correctly, and another subfunction, num2word, that loops over num000 to bring in the additional layer of thousands, millions, billions, etc., depending on the original input.
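
The grouping idea can be illustrated on its own, before any spelling is involved. The sketch below is a minimal illustration, not the notebook's implementation: `three_digit_groups` and `scale_words` are hypothetical names, and it works on integers with `divmod` rather than on string slices.

```python
# shortened scale list, for illustration only
scale_words = ['', 'thousand', 'million', 'billion', 'trillion']

def three_digit_groups(num):
    """Split a non-negative integer into (group, scale word) pairs, smallest group first."""
    groups = []
    k = 0
    while True:
        num, group = divmod(num, 1000)   # peel off the last three digits
        groups.append((group, scale_words[k]))
        k += 1
        if num == 0:
            break
    return groups

print(three_digit_groups(66723107008))
# [(8, ''), (107, 'thousand'), (723, 'million'), (66, 'billion')]
```

Each pair is what one pass of the loop has to spell out: the three-digit group, followed by its scale word.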

In [2]:
# building blocks of words in lists
ones = ['','one','two','three','four','five','six','seven','eight','nine','ten','eleven','twelve','thirteen','fourteen','fifteen','sixteen','seventeen','eighteen','nineteen']  
twenties = ['','','twenty','thirty','forty','fifty','sixty','seventy','eighty','ninety'] 
thousands = ['','thousand','million','billion','trillion','quadrillion','quintillion','sextillion']  # every word in this list adds '000. Can add more if need to deal with very large numbers

# num000 takes a number, zooms in on the last 3 digits, and spells those out

def num000(n): 
    c = n % 10 # single digit 
    b = ((n % 100) - c) // 10 # tens digit 
    a = ((n % 1000) - (b * 10) - c) // 100 # hundreds digit 
    t = "" 
    h = "" 
    if a != 0 and b == 0 and c == 0: 
        t = ones[a] + " hundred" 
    elif a != 0: 
        t = ones[a] + " hundred and " 
    if b <= 1: 
        h = ones[n%100] 
    elif b > 1 and c == 0: 
        h = twenties[b] + ones[c]
    elif b > 1 and c != 0:
        h = twenties[b] + '-' + ones[c]
    st = t + h 
    return st 


# num2word splits the number into 3-digit groups, starting from the right, and loops once per group
# i.e. an input > 999 loops 2 times, an input > 999,999 loops 3 times
# every pass refers back to num000 for the correct spelling of the current 3 digits and adds the matching word from the 'thousands' list
# k counts the passes, so thousands[k] is the scale word for the current group
# note that n and nxxx are recalculated at every pass; when n becomes '', there are no more digits, so i is set to 4 and the loop ends

def num2word(num): 
    if int(num) == 0: return 'zero'  # int() so that string inputs such as '0' are handled too
    i = 3 
    n = str(num) 
    word = "" 
    k = 0 
    while(i == 3): 
        nxxx = n[-i:] 
        n = n[:-i] 
        if int(nxxx) == 0: 
            word = word 
        else:  # I create a few different if checks to handle the ',' and 'and' according to British grammar 
            if k == 0 and int(nxxx) < 100 and n != '': 
                word = ' and ' + num000(int(nxxx))
            elif k == 0 and int(nxxx) >= 100 and n != '':
                word = ', ' + num000(int(nxxx))
            elif k==0:
                word = num000(int(nxxx))
            elif k == 1: 
                word = num000(int(nxxx)) + ' ' + thousands[k] + word
            else:
                if word == '' or word[:1] == ',' or word[:5] == ' and ' :
                    word = num000(int(nxxx)) + ' ' + thousands[k] + word
                else:
                    word = num000(int(nxxx)) + ' ' + thousands[k] + ', ' + word
        if n == '': 
            i = i+1 
        k += 1  # every k is '000 jump, i.e. every k is one step in the loop
    return word
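
The digit extraction at the top of num000 can also be written with `divmod`, which some readers find easier to follow than the subtract-and-divide form. This is only an alternative sketch; `digits_of_group` is a hypothetical name and is not used elsewhere in the notebook.

```python
def digits_of_group(n):
    """Return the (hundreds, tens, units) digits of the last three digits of n."""
    hundreds, rest = divmod(n % 1000, 100)
    tens, units = divmod(rest, 10)
    return hundreds, tens, units

print(digits_of_group(536))    # (5, 3, 6)
print(digits_of_group(10022))  # (0, 2, 2)
```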

#### Function #2: text splitter
- Input: 1 line of text
- Output: list of words & numbers that can be easily processed further ('tokens')
- The function splits by whitespace and removes a trailing punctuation mark (. , ; ? !) from each token, so that a number at the end of a sentence or clause is still captured as a proper number.

In [3]:
def splitter(line):
    splitted = []
    for token in line.split():
        # drop one trailing punctuation mark, so that e.g. '536.' is read as '536'
        if token[-1:] in ['.',',',';','?','!']:
            splitted.append(token[:-1])
        else:
            splitted.append(token)
    return splitted
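
Note that splitter removes at most one trailing mark per token, so a token such as '100...' becomes '100..' and is later reported as an invalid number. If that is not wanted, one possible variant is to strip every trailing mark with `str.rstrip`. The sketch below is an assumption about how that could look, with a hypothetical name `split_tokens` and a made-up sample sentence.

```python
def split_tokens(line):
    """Split on whitespace and strip every trailing sentence mark from each token."""
    return [token.rstrip('.,;?!') for token in line.split()]

print(split_tokens('We counted 100... or was it 250?!'))
# ['We', 'counted', '100', 'or', 'was', 'it', '250']
```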

#### Function #3: result generator
- Input: list of words & numbers (output from splitter function)
- Output: dictionary of the numbers identified in the list (both valid and invalid), together with the corresponding outputs described below
- The function checks the words ('tokens') one by one and does the following.
  - Ignores pure text/string.
  - Identifies proper whole numbers, and converts them to words with the num2word function.
  - Identifies invalid numbers and returns 'invalid number'. I defined these as a mix of digits and text or other special characters. I also put numbers that start with 0 into this category, because my working assumption was that those typically represent serial-number-type values as opposed to quantities. This could be an area for development of course.
- For input lines that include more than one number or invalid number, I want to capture all of them and generate outputs for all of them for completeness.
- ADDITION: I extended the original task by capturing numbers that start with a + or - sign and converting these to words too, with a 'positive' or 'negative' starting word. In markets, people sometimes write things like '+50 OAS', so I consider this an easy win in terms of improvements.

Note: For inputs such as 'I received 23 456,9 KGs.', I assumed it is fine to see two outputs, one for '23' and one for '456,9'. This approach keeps the code simpler, and it also works for many similar inputs where two numbers sit next to each other, or where a number is followed by the serial number of an item, for example. An area for development would be to examine, depending on context, whether such pairs are meant to be one number (with whitespace between the thousands) or two separate numbers.

In [4]:
def result(splitted):
    results = {}
    for token in splitted:
        if token[:1] == '0':
            # leading zero: treated as a serial number, not a quantity
            results[token] = 'invalid number'
        elif token.isdigit():
            results[token] = num2word(token)
        elif token[:1] == '+' and token[1:].isdigit():
            results[token] = 'positive ' + num2word(token[1:])
        elif token[:1] == '-' and token[1:].isdigit():
            results[token] = 'negative ' + num2word(token[1:])
        elif any(x.isnumeric() for x in token):
            # digits mixed with letters or special characters
            results[token] = 'invalid number'
    return results
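
The order of the checks matters: the leading-zero test runs first, so '007' is rejected even though it consists only of digits. The sketch below mirrors the branch order of result with labels instead of spelled-out numbers. `classify` and its label strings are hypothetical and only meant to make the rules easy to try out.

```python
def classify(token):
    """Label a token using the same branch order as result()."""
    if token[:1] == '0':
        return 'invalid'        # leading zero: treated as a serial number
    if token.isdigit():
        return 'number'
    if token[:1] in ('+', '-') and token[1:].isdigit():
        return 'signed number'
    if any(ch.isnumeric() for ch in token):
        return 'invalid'        # digits mixed with other characters
    return 'text'               # result() ignores these tokens

for token in ['536', '007', '-48', '+250', '#65678', '456,9', 'KGs']:
    print(token, '->', classify(token))
```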

### Loop to build a dictionary of inputs & final outputs
- Goal #1: provide simple outputs as requested, i.e. simply write out the numbers in words.
- Goal #2: put the original inputs and corresponding outputs in one data container in an intuitive way, so inputs and outputs are easily auditable even for bigger inputs.
- Format that helps reach both goals: dictionary

In [5]:
# I run every line through the splitter function and keep the input & output pairs in a dictionary
splitted_dict = {}
for line in lines:
    splitted_dict[line] = splitter(line)

# I run every token list through the result generator function and keep the original input lines paired with all their outputs in a dictionary
results_dict = {}
for line in splitted_dict:
    results_dict[line] = result(splitted_dict[line])
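
One property of this layout to keep in mind: dictionary keys are unique, so an input file that repeats a line verbatim keeps only one entry for it (and a number repeated within one line appears once in that line's results). If repeated lines matter, a list of (line, results) pairs is one possible alternative. The sketch below uses made-up sample lines and a stand-in function, `fake_result`, purely to show the difference.

```python
sample_lines = ['We processed 9121 records.',
                'We processed 9121 records.',
                'The pump is 536 deep underground.']

def fake_result(line):
    """Stand-in for result(splitter(line)): labels every all-digit token as 'number'."""
    return {tok: 'number' for tok in line.rstrip('.').split() if tok.isdigit()}

as_dict = {line: fake_result(line) for line in sample_lines}
as_pairs = [(line, fake_result(line)) for line in sample_lines]

print(len(as_dict))   # 2: the repeated line collapses into one key
print(len(as_pairs))  # 3: every input line is kept, in order
```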

### Output
- Python's print function writes to stdout, as requested.
- Two sections: 1) simple output; 2) line-by-line input & output pairs to enhance testing, auditing and further development.
- Some whitespace in between to help the eye. 
- Output can be formatted in a lot of different ways depending on what this is used for exactly. Here I assumed we are primarily interested in checking if the main functionalities and text processing all work properly.

In [6]:
print('OUTPUT:')
for i in results_dict:
    for j in results_dict[i]:
        print(results_dict[i][j])
print('')
print('')
print('INPUTS & OUTPUTS FOR AUDITING:')

for i in results_dict:
    print(i)
    print(results_dict[i])
    print('')

OUTPUT:
five hundred and thirty-six
nine thousand, one hundred and twenty-one
invalid number
ten thousand and twenty-two
sixty-six billion, seven hundred and twenty-three million, one hundred and seven thousand and eight
twenty-three
invalid number
negative forty-eight
positive two hundred and fifty
positive one hundred and seventy-five
eighty billion, five hundred and thirty-two million, six hundred and fifty thousand


INPUTS & OUTPUTS FOR AUDITING:
The pump is 536 deep underground.
{'536': 'five hundred and thirty-six'}

We processed 9121 records.
{'9121': 'nine thousand, one hundred and twenty-one'}

Variables reported as having a missing type #65678.
{'#65678': 'invalid number'}

Interactive and printable 10022 ZIP code.
{'10022': 'ten thousand and twenty-two'}

The database has 66723107008 records.
{'66723107008': 'sixty-six billion, seven hundred and twenty-three million, one hundred and seven thousand and eight'}

I received 23 456,9 KGs.
{'23': 'twenty-three', '456,9': 'invali

### Notes

##### 1) For someone new to Python/coding, I recommend the solution below.
This one simply takes all the text input at once, generates a list of words, identifies numbers (and invalid numbers), and produces outputs for them. It uses the same functions, but without the line-by-line loop, to keep things simpler.

In [7]:
# #remove '#' if you want to use this code
# result(splitter(open(filename, 'r').read()))

##### 2) If one needs output in a txt file, the code below takes care of that.
This code prints the same output to a txt file. If someone needs simpler output in a txt file, it is easy to adapt from here, since the technique is all shown.

In [8]:
# # to activate code: select all lines, and hit Ctrl + '/'

# fh=open('Test output.txt','w')  # <<-- enter desired filename

# print('OUTPUT:', file=fh)
# for i in results_dict:
#     for j in results_dict[i]:
#         print(results_dict[i][j], file=fh)
# print('', file=fh)
# print('', file=fh)
# print('INPUTS & OUTPUTS FOR AUDITING:', file=fh)

# for i in results_dict:
#     print(i, file=fh)
#     print(results_dict[i], file=fh)
#     print('', file=fh)
# fh.close()
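
A variant of the same idea uses a `with` block, so the file is closed even if an error occurs midway. This is a sketch under assumptions: `write_report` is a hypothetical helper, and `sample_results` and the temporary output path are placeholders. In the notebook, `results_dict` and a filename of your choice would take their place.

```python
import os
import tempfile

# placeholder data in the same shape as results_dict
sample_results = {
    'The pump is 536 deep underground.': {'536': 'five hundred and thirty-six'},
    'We processed 9121 records.': {'9121': 'nine thousand, one hundred and twenty-one'},
}

def write_report(results, path):
    """Write the simple output, then the line-by-line audit section, to a text file."""
    with open(path, 'w', encoding='utf-8') as fh:
        print('OUTPUT:', file=fh)
        for outputs in results.values():
            for words in outputs.values():
                print(words, file=fh)
        print('', file=fh)
        print('', file=fh)
        print('INPUTS & OUTPUTS FOR AUDITING:', file=fh)
        for line, outputs in results.items():
            print(line, file=fh)
            print(outputs, file=fh)
            print('', file=fh)

# write to a temporary folder and read the report back for inspection
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'Test output.txt')
    write_report(sample_results, out_path)
    with open(out_path, encoding='utf-8') as fh:
        report = fh.read()

print(report)
```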

##### 3) Potential further developments
- Extending the code to deal with decimals, so we can generate outputs for floats.
- Could use dataframes to put the inputs and outputs in nice tables.