01a Scanning #1

Merged
merged 3 commits into from
Apr 14, 2022

ngjunsiang
Contributor

Pseudocode 9608

9608 is a somewhat inconsistent pseudocode syntax that is meant to describe algorithms for GCE A Level Computing. Let's try to turn it into a real-ish programming language, implementing as many of its features as we can with basic concepts.

A sample of 9608 pseudocode is shown below.

DECLARE i : INTEGER
DECLARE String : STRING
i <- 0
WHILE i < 10 DO
    CASE OF i
        0: String <- "0"
        1: String <- "1"
        2: String <- "2"
        3: String <- "3"
        OTHERWISE String <- "large number can't count"
    ENDCASE
    OUTPUT String
    i <- i + 1
ENDWHILE

It's a lot for a program to interpret in one go. To Python, this is just a string of letters, digits, symbols, and whitespace (spaces and line breaks). We've got to break it up into groups of characters that each represent something in 9608 pseudocode.

Lexical analysis

From a quick analysis, we can identify the following:

  • Keywords: These are words which have special meaning in 9608 pseudocode: WHILE, CASE, ENDCASE, etc.
  • Integers: 0, 1, 2, 3, etc.
  • Strings: Unlike keywords, the text inside a string has no special meaning to the language; it is just data. Strings are delimited by double quotes, like "large number can't count".
  • Symbols: +, <, <-, :, which relate numbers and other chunks to each other.

Let's put this all in Python code. We need a process that turns the source code above (call it src) into chunks, which we shall call tokens. Since we are writing a program, that means we need rules for doing so.
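
As a starting point, the four categories above could be written down roughly like this. This is only a sketch: KEYWORDS and SYMBOLS here are collected from the sample program alone (not from the full 9608 syntax), treating INTEGER and STRING as keywords is an assumption, and category() is a hypothetical helper used purely for illustration.

```python
# Assumption: these sets only cover what appears in the sample program.
KEYWORDS = {
    'DECLARE', 'INTEGER', 'STRING', 'WHILE', 'DO', 'ENDWHILE',
    'CASE', 'OF', 'OTHERWISE', 'ENDCASE', 'OUTPUT',
}
SYMBOLS = {'+', '<', '<-', ':'}


def category(token):
    """Name the category of a chunk that has already been separated out.

    Hypothetical helper, not part of the scanner in this PR.
    """
    if token in KEYWORDS:
        return 'keyword'
    if token in SYMBOLS:
        return 'symbol'
    if token.isdigit():
        return 'integer'
    if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
        return 'string'
    # Anything else is treated as a name, such as the variable i.
    return 'name'


for token in ['WHILE', 'i', '<', '10', '<-', '"0"']:
    print(token, '->', category(token))
```

Note that this classifies chunks that are already separated. Separating them is the scanner's job, described next.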

Scanning code into tokens

This is how I would describe that process to a computer:

Check the first character.

  1. If it is a space, ignore it. It typically has no special meaning.
  2. If it is a line break (detected as the special character \n), it forms its own token. This is used to mark the end of a statement, or demarcate parts of statements.
  3. If it is a letter, keep checking the next character until a non-letter is encountered. This gives us a word.
    The word may be a keyword if it is found in KEYWORDS, otherwise it is a variable name that may be used to refer to values.
  4. If it is a digit (0-9), keep checking the next character until a non-digit is encountered. This gives us an integer.
  5. If it is a double-quote ("), keep checking the next character until another double-quote is encountered. This gives us a string.
  6. If it appears to be a symbol ... we'll figure that out in a bit.
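
The rules above decide what to do by looking at a single character, so they can be sketched as one small function. kind_of() is a hypothetical name and is not part of the scanner in this PR; it only names the rule that applies and does not do any scanning itself.

```python
def kind_of(char):
    """Return which scanning rule applies, judging by one character."""
    if char == ' ':
        return 'skip'        # rule 1: spaces are ignored
    if char == '\n':
        return 'linebreak'   # rule 2: a line break is its own token
    if char.isalpha():
        return 'word'        # rule 3: keyword or variable name
    if char.isdigit():
        return 'integer'     # rule 4
    if char == '"':
        return 'string'      # rule 5
    return 'symbol'          # rule 6: everything else, for now


for char in ['W', '1', '"', '<', ' ', '\n']:
    print(repr(char), '->', kind_of(char))
```

The order of the tests matters little here because the cases do not overlap, but treating "everything else" as a symbol is a simplification that a stricter scanner would replace with an error for unknown characters.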

@ngjunsiang
Contributor Author

Helper functions [a48c005]

Throwing all the code we write into a single massive ball of script is going to make it difficult to see the abstract picture of what's going on. Let's make some helper functions for the following tasks:

  • check the current character: this simply returns the character we are currently looking at.
  • consume the current character, removing it from src, and returning the rest of src.
  • atEnd tells us if we are at the end of src. True means there are no more characters to scan.
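
Read literally, the three helpers could look something like the sketch below. The names check, consume and atEnd come from the list above, but the signatures are an assumption: here each one takes the remaining source as a plain string, which may not match the code in the commit.

```python
def check(src):
    """Return the character currently being looked at."""
    return src[0]


def consume(src):
    """Drop the current character and return the rest of src."""
    return src[1:]


def atEnd(src):
    """True means there are no more characters to scan."""
    return len(src) == 0


# Walk through a short piece of source one character at a time.
src = 'i <- 0'
while not atEnd(src):
    print(repr(check(src)))
    src = consume(src)
```

In this form the caller has to remember to store what consume() returns, which is easy to forget once several functions share the same source.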

@ngjunsiang
Contributor Author

Wrapping code

https://github.com/nyjc-computing/pseudo/blob/ccadc8dd31441f9e654debdca1f40f3314b580b0/scanner.py#L62

Because a function cannot change the string that its caller holds (Python strings are immutable), we wrap the source code in an object (here, I use a dict) that the functions can both read and modify.

Later we will rework the scanner so that this wrapper is no longer needed. For now, let's keep it simple.
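
To see why the wrapper helps, compare a function that receives a bare string with one that receives a dict. This is a small illustration, not the code at the linked line: the 'src' key and drop_first() are assumptions made for the example.

```python
def drop_first(text):
    # Rebinds the local name only; the caller's string is untouched.
    text = text[1:]
    return text


def consume(code):
    # Replaces the string stored inside the dict, which the caller also sees.
    code['src'] = code['src'][1:]
    return code['src']


src = 'abc'
drop_first(src)
print(src)            # abc

code = {'src': 'abc'}
consume(code)
print(code['src'])    # bc
```

With the dict, every helper and scanning function can advance through the same source without the caller having to reassign anything.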

@ngjunsiang
Contributor Author

https://github.com/nyjc-computing/pseudo/blob/ccadc8dd31441f9e654debdca1f40f3314b580b0/scanner.py#L61-L91

We loop the scanner as long as there are more characters, checking the first remaining character each time. Based on that character, we pick a scanning function to pass code to, for tokenising a word, symbol, integer, or string. Then we add the token it returns to a list of tokens.
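
A compact sketch of such a loop follows. It is not the code at the linked lines: the dict layout, the take() helper, and the bodies of the scanning functions are all assumptions, kept deliberately short so that the loop itself stays in focus.

```python
def atEnd(code):
    return code['src'] == ''


def check(code):
    return code['src'][0]


def consume(code):
    code['src'] = code['src'][1:]
    return code['src']


def take(code, accept):
    """Hypothetical helper: consume characters while accept(char) holds."""
    chars = []
    while not atEnd(code) and accept(check(code)):
        chars.append(check(code))
        consume(code)
    return ''.join(chars)


def word(code):
    return take(code, str.isalpha)


def integer(code):
    return take(code, str.isdigit)


def string(code):
    consume(code)                            # opening quote
    text = take(code, lambda c: c != '"')
    consume(code)                            # closing quote
    return '"' + text + '"'


def symbol(code):
    first = check(code)
    consume(code)                            # always make progress
    if first == '<' and not atEnd(code) and check(code) == '-':
        consume(code)
        return '<-'
    return first


def scan(src):
    code = {'src': src}
    tokens = []
    while not atEnd(code):
        char = check(code)
        if char == ' ':
            consume(code)                    # spaces carry no meaning
            continue
        if char.isalpha():
            token = word(code)
        elif char.isdigit():
            token = integer(code)
        elif char == '"':
            token = string(code)
        else:
            token = symbol(code)
        tokens.append(token)
    return tokens


print(scan('i <- i + 1'))   # ['i', '<-', 'i', '+', '1']
```

Each scanning function must consume at least one character, otherwise the loop would never reach the end of the source.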

@ngjunsiang
Contributor Author

Scanning functions [396ba6d]

These scanning functions are responsible for recognising words, integers, strings, and symbols. They are invoked by scan(), and run until they detect a terminating condition.
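
To make the terminating conditions concrete, here is one way the four functions might be written out in full. The function names, the dict layout, and the SYMBOLS set are assumptions for illustration and may differ from the commit.

```python
SYMBOLS = {'+', '<', '<-', ':'}   # assumption: only those in the sample


def atEnd(code):
    return code['src'] == ''


def check(code):
    return code['src'][0]


def consume(code):
    code['src'] = code['src'][1:]
    return code['src']


def word(code):
    """Stops at the first non-letter, or at the end of the source."""
    token = ''
    while not atEnd(code) and check(code).isalpha():
        token += check(code)
        consume(code)
    return token


def integer(code):
    """Stops at the first non-digit, or at the end of the source."""
    token = ''
    while not atEnd(code) and check(code).isdigit():
        token += check(code)
        consume(code)
    return token


def string(code):
    """Stops just after the closing double-quote."""
    token = check(code)               # opening quote
    consume(code)
    while not atEnd(code) and check(code) != '"':
        token += check(code)
        consume(code)
    if not atEnd(code):
        token += check(code)          # closing quote
        consume(code)
    return token


def symbol(code):
    """Stops when one more character would no longer form a known symbol."""
    token = check(code)
    consume(code)
    while not atEnd(code) and token + check(code) in SYMBOLS:
        token += check(code)
        consume(code)
    return token


code = {'src': 'WHILE i < 10'}
print(word(code))          # WHILE
print(repr(code['src']))   # ' i < 10'
```

Every function also stops when the source runs out, so an unterminated string or a word at the very end of the source does not raise an IndexError.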

@ngjunsiang
Contributor Author

Detect line breaks [ccadc8d]

The scanner cannot recognise line breaks yet. This change enables it to do so, returning line breaks as a '\n' token.
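
The idea can be shown with a toy scanner that only separates chunks at whitespace. This is not the change in the commit, and scan_with_linebreaks() is a hypothetical name. The point it illustrates is that the test for '\n' has to come before any general whitespace test, because '\n'.isspace() is also True in Python.

```python
def scan_with_linebreaks(src):
    """Toy scanner: whitespace-separated chunks plus line-break tokens."""
    tokens = []
    chunk = ''
    for char in src:
        if char == '\n':
            # Tested first, otherwise the isspace() branch would swallow it.
            if chunk:
                tokens.append(chunk)
                chunk = ''
            tokens.append('\n')
        elif char.isspace():
            if chunk:
                tokens.append(chunk)
                chunk = ''
        else:
            chunk += char
    if chunk:
        tokens.append(chunk)
    return tokens


print(scan_with_linebreaks('i <- 0\nOUTPUT i\n'))
# ['i', '<-', '0', '\n', 'OUTPUT', 'i', '\n']
```

Keeping the line breaks as tokens lets a later stage tell where one statement ends and the next begins.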

@ngjunsiang
Contributor Author

Testing

Testing code:

from scanner import scan

src = """
DECLARE i : INTEGER
DECLARE String : STRING
i <- 0
WHILE i < 10 DO
    CASE OF i
        0: String <- "0"
        1: String <- "1"
        2: String <- "2"
        3: String <- "3"
        OTHERWISE String <- "large number can't count"
    ENDCASE
    OUTPUT String
    i <- i + 1
ENDWHILE
"""

tokens = scan(src)

print('Tokens:', tokens)

Sample output:

Tokens: ['\n', 'DECLARE', 'i', ':', 'INTEGER', '\n', 'DECLARE', 'String', ':', 'STRING', '\n', 'i', '<-', '0', '\n', 'WHILE', 'i', '<', '10', 'DO', '\n', 'CASE', 'OF', 'i', '\n', '0', ':', 'String', '<-', '"0"', '\n', '1', ':', 'String', '<-', '"1"', '\n', '2', ':', 'String', '<-', '"2"', '\n', '3', ':', 'String', '<-', '"3"', '\n', 'OTHERWISE', 'String', '<-', '"large number can\'t count"', '\n', 'ENDCASE', '\n', 'OUTPUT', 'String', '\n', 'i', '<-', 'i', '+', '1', '\n', 'ENDWHILE', '\n']

@ngjunsiang ngjunsiang merged commit b3fdfc1 into main Apr 14, 2022
@ngjunsiang ngjunsiang changed the title 01 Scanning 01a Scanning Apr 14, 2022