Lexer V2

Lexical analysis in a nutshell - Modified version of my lexical analyzer

1015pm 7/13/2022 Wednesday

Go to the compiler-mod branch to see it.

7/13/2022 1130ish pm

Can I read through an external file?

12am 7/14/2022

The answer is a resounding Yes.

const { readFileSync } = require('fs');
console.log(readFileSync('./streamer.txt', 'utf-8'));

1238am

Why did my lexer seem to skip certain characters that could otherwise be used as lexemes by other if-statement blocks?

I noticed that the final period character in the streamer.txt file was present in the input array, but my scanner ignored it entirely when loading up the lexeme array. In fact, the periods that sat right next to alphabet characters and were not separated by a whitespace were ignored as well.

It's because of how the block was made:

	if (char.match(/[A-Za-z]/)) {
		var word = '';
		// Once the while loop reaches a character the match function is not looking for,
		// that loop terminates. The character may be a valid one that could be lexemized by
		// other blocks, but it gets ignored entirely because, when this statement finishes
		// executing, the scanner head moves on to the next element to be scanned.

		while (char.match(/[A-Za-z]/)) {  
			if (index + 1 == train.length - 1) {
				break;
			}
			word += char;
			char = train[++index];
		}

		divvy.push(word);
		// But since I included this line as an experiment, the problem is solved. This line
		// moves the scanner head to the left by one on purpose, so when a while loop that
		// collects a group of characters terminates, the scanner steps back to the last
		// element it stored in the multi-char lexeme. The outer loop then moves the head
		// right by one, since this statement block is already done executing, so the
		// previously forgotten character becomes a lexeme of its own.
		char = train[--index];
	}

1240am

The scanner works well now!

112am

Testing the improved scanner by copying and pasting a portion of the transcript of RWBY Volume 7 Chapter 9, for funsies. Minor flaw: the single-quote handler got triggered by the apostrophe in the word aren't. The scanner thought it was the start of a string literal. I will have to fix both of the quote handlers, for the ' and " characters.

120am

"" is meant for string literals. ' is meant to be used as an apostrophe. Thought it would be a grand feature to handle string literals that look like 'this string'. But it was not a good feature

450am

Fixing the string literal handler now

   if (char.match(/["]/)) {
   	var dstring = '';
   	char = train[++index];
   	while (char.match(/[^"]/)) {
   		if (index + 1 == train.length - 1) {
   			break;
   		}
   		dstring += char;
   		char = train[++index];
   	}
   	//remove this line
   	char = train[++index];

   	divvy.push(dstring);
   	//remove this line too
   	char = train[--index]

   } else 
   //GET RID OF THIS BLOCK
   if (char.match(/[']/)) {
   	var sstring = '';
   	char = train[++index];
   	while (char.match(/[^']/)) {
   		if (index + 1 == train.length - 1) {
   			break;
   		}
   		sstring += char;
   		char = train[++index];
   	}
   	char = train[++index];

   	divvy.push(sstring);
   	char = train[--index] 

   }

516am

File system read error. The lexeme array wouldn't print out to the terminal.

Solution:

const fileSystem = require("fs");
var data = "";

const readStream = fileSystem.createReadStream("input.txt");

readStream.setEncoding("UTF8");

readStream.on("data", (chunk) => {
	data += chunk;
});

readStream.on("end", () => {
	console.log(data);
});

readStream.on("error", (error) => {
	console.log(error.stack);
});


It works. Now to connect the read stream stuff to the lexical analyzer. Strike that. There's an error.

529am

Back to readFileSync. Did I forget to include a space after the end of the file? Yes. Yes I did.

554am

Updated my string literal handler:

	if (char.match(/["]/)) {
		var string = '';

		//add opening quote
		string += char;
		char = train[++index];

		while (!char.match(/["]/)) {
			if (index == train.length - 1) {
				break;
			}
			string += char;
			char = train[++index];
		}

		//add closing quote
		string += char;
		char = train[++index];

		divvy.push(string);
		//leave this line alone.
		char = train[--index]

	} 

917am

The break condition inside all 5 while loops of the scanner statements might be why the scanner can't properly display the lexemes after divvying up the text. The same issue happened 12 days prior because of the lingering whitespace the lexer needed at the very end of the string. Since the while loops use a match function to collect a varied number of specific characters validated by a regular expression, there has to be a way to keep them from either moving the scanner head out of bounds or running forever once the end of file is reached.

250pm

Possible solution:

		while (char.match(/[A-Za-z]/)) {
			if (index == everything.length - 1) {
				break;
			} 
			console.log(`Beep Beep ${index}`);
			word += char;
			char = everything[++index];

			//adding this. next character is not the alphabet letter, break.
			if (!char.match(/[A-Za-z]/)){
				break;
			}
			
		}

Didn't work.

804pm

Analysis - the while loop will still be stuck on the last valid character once the string is fully read.

		//rios
		//char stuck on index 3 - s
		while (char.match(/[A-Za-z]/)) {
			console.log(`Beep Beep ${index}`);
			word += char; //r  i  o  s
			//change
			if (index == everything.length - 1) {
				break;
			}
			char = everything[++index];
		}

822pm

The last whitespace character after the final character serves as the end of the string. Use that. There is literally no workaround for it.
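
A minimal sketch of that idea, assuming the input is still loaded with readFileSync and the scanner reads from the train variable like in the earlier blocks:

	const { readFileSync } = require('fs');
	// Append one trailing space so every multi-character while loop always ends on a
	// separator instead of the scanner head running off the end of the input.
	const train = readFileSync('./streamer.txt', 'utf-8') + ' ';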

904pm

Making local lexeme variable for the scanner statement blocks to use.

926pm

Rewriting the regular expression of the TAB_INDENT constant from /\t/ to /^[ ]{4}$/. Using the tab key seems impractical.

938pm

Got rid of TAB_INDENT. Not needed.

7/15/2022 1201pm

Question since 2 hours ago: can comments be tokenized? Absolutely not, since comments are ignored anyway.

5pm 7/15/2022

Building my lexer now. A few hours ago, I was thinking about how to implement the function. I didn't want to use a long if/else-if conditional chain because it would make the program run slower by forcing it to check every condition, and it would take up too many lines. I could use switch statements to make the code neater and quicker, but in this case they won't work for string literals. I can, however, enclose the decision core in an if/else statement: if the lexeme is a string literal, make a string token; otherwise run the core and generate the corresponding reserved token if the word is found, with the id token generated by default. Another if statement outside the core can be used to make number literal tokens.

Lexer:

	for every element in lexeme array:
		get the element from the array and evaluate it
		if string literal:
			make string token
		else if number literal:
			make number token
		else 
			switch stmt based on lexeme:
				case is any keyword:
					make reserved token
				default
					make id token

//This pseudocode is not accurate as I haven't planned out the algorithm fully yet

I was also marinating on the idea of putting the keyword names and their assigned token info into a separate file. It would drastically reduce lines, but getting the information you want back out of the file will be tricky. I haven't gone over reading and writing files in almost 3 years, and that was for a C++ course.
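
If I ever try that separate-file idea, it would probably look something like this (keywords.json and tokenizeWord are hypothetical names of my own; wordup and bucks are the lexeme and token-array variables from my snippets):

	// keywords.json (hypothetical): { "if": "reserved", "while": "reserved", "return": "reserved" }
	const keywords = require('./keywords.json');

	function tokenizeWord(wordup, bucks) {
		// Look the lexeme up in the table; fall back to an identifier token when it's not a keyword.
		const type = keywords[wordup] || 'identifier';
		bucks.push({ type: type, value: wordup });
	}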

The switch cases will be long. I used arrays for this problem of determining whether the input is a keyword before, and it was a bit of a hassle, especially since it required an if statement to check whether the input matched an element inside that array. Might as well use an if/else-if ladder to replace the array-plus-if approach; the performance is the same.

709pm

What are symbol tables? Arrays? Semantic analysis stuff. Come back to that later.

716pm

Now implementing my lexer. I planned it out

819pm

Pseudocode of lexer V2 is done

	if string literal:
		make string token
	else if number literal:
		make number token
	else if identifier:
		switch stmt based on id lexeme:
			case is any keyword:
				make reserved token
			default
				make id token
	else if punctuation:
		switch stmt based on punctuation lexeme:
			case is any keyword:
				make punctuation token
	else if operator:
		switch stmt based on operator lexeme:
			case is any keyword:
				make operator token
			default
				invalid operator
	else:
		skip through the rest of the lexeme array 
		until a newline or end of multiline comment is found
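
A rough JavaScript skeleton of that pseudocode might look like the following. The regular expressions and the keyword/operator cases here are placeholders of my own, not the final ones, and wordup/bucks are the lexeme and token-array names from the later snippets:

	function evaluate(wordup, bucks) {
		if (wordup.match(/^".*"$/)) {                      // string literal
			bucks.push({ type: 'string', value: wordup });
		} else if (wordup.match(/^\d+$/)) {                // number literal (placeholder regexp)
			bucks.push({ type: 'number', value: wordup });
		} else if (wordup.match(/^[A-Za-z_]\w*$/)) {       // identifier or keyword
			switch (wordup) {
				case 'if':
				case 'while':                              // ...any other keyword
					bucks.push({ type: 'reserved', value: wordup });
					break;
				default:
					bucks.push({ type: 'identifier', value: wordup });
			}
		} else if (wordup.match(/^[(){}\[\];,.:]$/)) {     // punctuation
			bucks.push({ type: 'punctuator', value: wordup });
		} else if (wordup.match(/^[-&|!\+\*\/=<>%?]+$/)) { // operator
			switch (wordup) {
				case '+':
				case '=':                                  // ...any other valid operator
					bucks.push({ type: 'operator', value: wordup });
					break;
				default:
					console.log(`${wordup} is not a valid operator lexeme`);
			}
		}
		// else: comment content gets skipped until a newline or the end of a multiline comment
	}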

844pm

Constructing the lexer for real this time

942pm

Back to work on the implementation

1115pm

Lexer V2 is now complete. Prepare for testing.

Put "The Mandalorian rules!" into the streamer.txt file.

1121pm

The token array is not being displayed.

7 mins later

It could be because I anchored the regular expressions on the lexer's if statements.

Actually, the issue could be the regexp for the operator evaluator section. Before: ^[-&|!\+\*\/=<>%?]$ After: ^[-&|!\+\*\/=<>%?]+$

1137pm

The issue was the excess whitespace in the streamer.txt file itself!!!

1142pm

The tempCodeRunnerFile.js file is keeping the Lexical Analyzer V2 from executing. It keeps popping back up every time I try to run my lexer.

Restarting VSCode to alleviate this problem.

1213am 7/16/2022 Saturday

The lexer v2 is now fully operational. Also, VSCode got updated to v1.69.1

1245am

const fs = require('fs');
const content = 'Some content!';

try {
  fs.writeFileSync('/Users/joe/test.txt', content);
  // file written successfully
} catch (err) {
  console.error(err);
}

Using a separate file to print the data there. The source code characters were too much for the terminal to handle: 1603 characters, 255 whitespaces.

A section of a volume 1 chapter transcript was all it took.

204am 7/16

The divvy array contents can now be written to the wordplay.txt file. The writeFileSync function takes string values only.
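
A minimal sketch of what that looks like, assuming the lexeme array is still called divvy: the array gets joined into one newline-separated string first, since writeFileSync wants a string.

	const fs = require('fs');
	// One lexeme per line in wordplay.txt.
	fs.writeFileSync('./wordplay.txt', divvy.join('\n'));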

233am

Trying to print array contents that are objects.

[
	{ type: 'left_paren', value: '(' },
	{ type: 'name', value: 'add' },
	{ type: 'number', value: '2' },
	{ type: 'left_paren', value: '(' },
	{ type: 'name', value: 'subtract' },
	{ type: 'number', value: '4' },
	{ type: 'number', value: '2' },
	{ type: 'right_paren', value: ')' },
	{ type: 'right_paren', value: ')' }
];

wordplay.txt

[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]

1124am

w3schools.com - solutions to display JavaScript objects

  • Displaying the Object Properties by name
  • Displaying the Object Properties in a Loop
  • Displaying the Object using Object.values()
  • Displaying the Object using JSON.stringify()

1130am

Found the solution: Object.keys(object1) and Object.values(object1).
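
For reference, the JSON.stringify option from that list would also do it in one call (a sketch; bucks is the token array, and tokenplay.txt is a made-up file name):

	const fs = require('fs');
	// Serialize the whole token array at once instead of juggling keys and values.
	fs.writeFileSync('./tokenplay.txt', JSON.stringify(bucks, null, 2));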

1208pm

Getting close to having the token info displayed correctly.

${Object.keys(fruits[0]).toString()} ${Object.values(fruits[0]).toString()}

type,value number,2

1241pm

Trying out Object.entries

Returns an array of [key, value] pairs.
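
For example, for one token object:

	// Object.entries turns { type: 'number', value: '2' } into an array of [key, value] pairs:
	console.log(Object.entries({ type: 'number', value: '2' }));
	// → [ [ 'type', 'number' ], [ 'value', '2' ] ]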

1259pm

Getting even closer in seeing the object info

var see_all_tokens = []
for (let index = 0; index < fruits.length; index++) {
    see_all_tokens.push(`${Object.keys(fruits[index]).toString()} \n ${Object.values(fruits[index]).toString()}`)
    
}
see_all_tokens.toString()

Results:

type,value 
 number,2,type,value 
 number,8,type,value 
 string,hera,type,value 
 number,4,type,value 
 left_paren,(,type,value 
 name,multiply,type,value 
 number,4,type,value 
 number,17,type,value 
 right_paren,)

104pm

I can see the tokens now!

127pm

Multiline comment symbols are throwing invalid operator warnings. I will need to fix my tokenizer code.

130pm

Tokenizer works!!!

341pm

Note when enabling the comment-ignoring functionality: the scanner treats single-line and multi-line comment symbols as operator lexemes, since they're made of the characters / and *, which are primarily used as arithmetic operators.

Move comment ignoring code from else statement to inside the operator lexeme statement in the lexer.

	} else if (wordup.match(/^[-&|!\+\*\/=<>%?]+$/)) {
		switch (wordup){
			case "<op symbol>":
				bucks.push({type: 'operator' , value: wordup});
				break;
			//new
			/*
			case "//":
			case "<start multi line comment symbol>":
			case "<end multi line comment symbol>":
				//to here
				while(!wordup.match(/^(\n|\*\/)$/)){
					wordup = divvy[++index]
					if (index + 1 == divvy.length) {
						break;
					}
				}
				divvy[--index]
				break;
			default:
				//so this line doesn't fire
				console.log(`${wordup} is not a valid operator lexeme`)
		}
	//then erase else statement. you don't need it
	} else {
			//move this block from here
			while(!wordup.match(/^(\n|\*\/)$/)){
				wordup = divvy[++index]
				if (index + 1 == divvy.length) {
					break;
				}
			}
			divvy[--index]
		}

355pm

Now the lexer can better ignore single line comments

4pm

Lexical Analyzer V2 is fully operational; for real this time
It can now write lexemes and tokens 1 line each to their respective files.

649pm

Testing the comment ignorer again. I think it is flawed. The while loop condition may cause the ignorer to not ignore stuff properly

			case "//":
				//do I need to put ignorer code here too?
			case "<start multi line comment symbol>":
			case "<end multi line comment symbol>":	

				//if case "/*", wouldn't the content just get ignored on that line only?
				while(!wordup.match(/^(\n|\*\/)$/)){
					wordup = divvy[++index]
					if (index + 1 == divvy.length) {
						break;
					}
				}
				divvy[--index]
				break;

823pm

Testing the comment ignorer with lines of character Jacques Schnee from RWBY Volume 7 chapters 4 to 9. All 59 lines. Now 38 lines.

1017pm

4512 input characters, 792 whitespaces. The tokenizer is not firing.

?! is not a valid operator lexeme
!? is not a valid operator lexeme

The files are not in the right filepath. Also, the blob is too big.

1047pm

	case "//":
	case "<start multi line comment symbol>":
	case "<end multi line comment symbol>":	 //Is causing to tokenizer to not execute entirely.
		while(!wordup.match(/^(\n|\*\/)$/)){
			wordup = divvy[++index]
				if (index + 1 == divvy.length) {
					break;
				}
			}
		        divvy[--index] //that line may be unnecessary. Make this block skip the closing multiline comment symbol
				break;

Actually, leave the "" case alone. It'll keep the default case from firing.

1059pm

As I suspected, the multi-line comment ignorer's condition is bad. The ignorer stopped after reaching the newline.

1115pm

The multi-line ignorer is working flawlessly now. I also copied and pasted that code into the // case. The prefix decrementer was actually what was keeping the lexer from tokenizing, so now the lexer can handle massive source code blobs.
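
For my own reference, the working shape of the ignorer as I understand the fix (a sketch; the actual case strings in the file may differ, and the prefix decrement at the end is gone):

	case "//":
	case "/*":
	case "*/":
		// Skip lexemes until a newline ends a single-line comment or a closing */
		// ends a multi-line comment.
		while (!wordup.match(/^(\n|\*\/)$/)) {
			wordup = divvy[++index];
			if (index + 1 == divvy.length) {
				break;
			}
		}
		break;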

1120pm

Hopefully this is the last time I'm documenting this. The Lexer V2 is now fully operational. Absolutely operational. 100% fully operational.

916pm 7/23/2022

Having the lexer v2 make digit tokens that carry decimal and scientific notation values. I'm essentially giving the v2 the same functionality as the v1.

1014pm

This regex string ^(\d+|\d+\.\d*|\d*\.\d*[Ee]([-+]\d{1,2}|\d{1,2}))$
Can the regexp handle:

  • 12 - Yes
  • 12.4 - Yes
  • .76 - No. Needs to be a digit before decimal point
  • 0.8 - Yes
  • 3e4 - No. A decimal point must be there
  • 5E+2 - No. A decimal point must still be there
  • 4e-1 - No. The decimal point must absolutely be there
  • .7e-5 - Yes
  • 7.e+3 - Yes

Invalid test; this should be rejected:

  • .E1 - Yes. That's not allowed

1118pm

Updated the regexp to this: ^(\d+|(\d+\.\d*|\d*\.\d+)|(\d+|\d+\.\d*|\d*\.\d+)[Ee]([-+]\d{1,2}|\d{1,2}))$
Can the regexp handle:

  • 12 - Yes
  • 12.4 - Yes
  • .76 - Yes
  • 0.8 - Yes
  • 3e4 - Yes
  • 5E+2 - Yes
  • 4e-1 - Yes
  • .7e-5 - Yes
  • 7.e+3 - Yes

Invalid test; this should be rejected:

  • .E1 - No. It's all good
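
A quick way to rerun that checklist (a sketch; NUMBER_LITERAL is my own constant name here, not the one in the lexer):

	const NUMBER_LITERAL = /^(\d+|(\d+\.\d*|\d*\.\d+)|(\d+|\d+\.\d*|\d*\.\d+)[Ee]([-+]\d{1,2}|\d{1,2}))$/;
	const cases = ['12', '12.4', '.76', '0.8', '3e4', '5E+2', '4e-1', '.7e-5', '7.e+3', '.E1'];
	for (const c of cases) {
		// Prints true for everything except .E1 with the updated expression.
		console.log(c, NUMBER_LITERAL.test(c));
	}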

1211am 7/24

In: 123.e56
Out: [{"type":"number","value":"123."},{"type":"identifier","value":"e"},{"type":"number","value":"56"}]

The algorithm is too impressive. It split the supposed number literal into 3 tokens. I need to fix the if statement inside the digit handler while loop.

} else if (char.match(NUMBER_STEW)) {
	while (char.match(NUMBER_STEW)) {
		if (index + 1 == everything.length) {
			break;
		}

		lexeme += char;
		char = everything[++index];

		//fix this block.
		if (char.match(/[-+\.Ee]/)) {
			lexeme += char;
			char = everything[++index];
		}
			
	}

	divvy.push(lexeme);
	char = everything[--index];
}

1244am

Make that block into a while loop. What happens? The problem is solved! [{"type":"number","value":"123.e56"}]
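
That change, spelled out against the block above (same variable names; only the inner if became a while):

	//fixed: keep consuming sign/decimal/exponent characters in a loop so a literal
	//like 123.e56 stays one lexeme instead of being split into three tokens.
	while (char.match(/[-+\.Ee]/)) {
		lexeme += char;
		char = everything[++index];
	}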

306pm 7/25 Monday First merge

Planning to merge the compiler-mod branch holding the lexer v2 file with the save-point branch. I want the lexer v2 file to be plugged into the JSON formatter file.

536pm

Modularized the scanner and evaluator and plugged them into base.js. The main file will take in the txt input and output tokens for now using the V2. Will update momentarily.

636pm

The formatter algorithm should just be used for the AST. The algorithm indented the rest of the results when it saw the punctuator value {.

[        
        //normal
        {
		"type":"punctuator",
		"value":")"
	},
        //not normal
	{
		"type":"punctuator",
		"value":"{
			"
		},
		{
			"type":"identifier",
			"value":"total"
		},
		{
			"type":"operator",
			"value":"+="
		},
		{
			"type":"identifier",
			"value":"number"
		},
		{
			"type":"punctuator",
			"value":"
		}"
	}
]

927pm

Or I can just write another algorithm for the lexer's arrays. The formatter algorithm was meant for the parser, not the lexer.

1043pm

Deleted wordplay.txt. All I need to see is the array of tokens. I'm not going to make the formatter algorithm for the lexer. Too much of a hassle for me.

1am 7/26/2022 Done for now

The V2 is modular and ready to go!

449pm 7/30/22 Saturday

The lexer 2 is not handling inputs like 5+10*3 well now, because I added a while loop inside the while loop of the number handler. The inner loop collects anything else mixed in with the digits: -, +, ., E, e. That loop was meant to let the lexer accommodate decimal numbers and scientific notation numbers.

5+10 triggers an invalid digit lexeme error

Plan: make the lexer more precise. There is no need for the lexer 2 to rely on whitespace as the main separator.

-3-5. Split to - 3 - 5
-3+5. Split to - 3 + 5
-3++5. Split to - 3 ++ 5
3+++5. Don't throw an invalid error. Instead, split it to 3 ++ + 5
3.0 Stays 3.0
4..0 Split to 4 . . 0

1015pm 7/13/2022

Building a separate scanner and tokenizer to see what I've learned from Vaidehi Joshi's blog.

616pm 7/16/2022

Separate scanner and tokenizer completed as of 4pm. Syntactic analyzer operation now in progress.
