Lexer V2
Go to compiler mod branch to see.
Can I read through an external file?
The answer is a resounding Yes.
const {readFileSync} = require('fs');
console.log(readFileSync('./streamer.txt', 'utf-8'));
Why my lexer seemed to skip certain characters that could otherwise have been used as lexemes by other if statement blocks:
I noticed that the final period character in the streamer.txt file was present in the input array, but my scanner ignored it entirely when loading up the lexeme array. In fact, periods that sat right in front of alphabet characters without a separating whitespace were ignored as well.
It's because of how the block was made:
if (char.match(/[A-Za-z]/)) {
    var word = '';
    //Once the while loop reaches a character the match function is not looking for,
    //that loop terminates. The character may be a valid one that another block could
    //turn into a lexeme, but it gets ignored entirely, because once this statement
    //finishes executing, the scanner head moves on to the next element to be scanned.
    while (char.match(/[A-Za-z]/)) {
        if (index + 1 == train.length - 1) {
            break;
        }
        word += char;
        char = train[++index];
    }
    divvy.push(word);
    //But this line, which I included as an experiment, solves the problem. It moves the
    //scanner head back by one on purpose, so when a while loop that collects a group of
    //characters terminates, the scanner refocuses on the last element it scanned and
    //stored as part of a multi-character lexeme. The outer loop then moves the head right
    //by one once this block is done executing, so the previously forgotten character
    //becomes a lexeme of its own.
    char = train[--index];
}
The scanner works well now!
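A small self-contained sketch of that word handler on made-up input shows the fix in action (variable names mirror the real scanner; this is just a demo, not the actual file):
const train = 'rios. '.split('');   //note the trailing space acting as the end-of-input marker
const divvy = [];

for (let index = 0; index < train.length; index++) {
    let char = train[index];
    if (char.match(/[A-Za-z]/)) {
        let word = '';
        while (char.match(/[A-Za-z]/)) {
            word += char;
            char = train[++index];
        }
        divvy.push(word);
        index--;   //the char = train[--index] trick: step back so the non-letter char gets rescanned
    } else if (char.match(/[\.]/)) {
        divvy.push(char);   //the previously forgotten period now becomes its own lexeme
    }
    //whitespace simply falls through and is skipped
}

console.log(divvy);   // [ 'rios', '.' ]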
Testing the improved scanner by copying and pasting a portion of the transcript of RWBY volume 7 chapter 9 for funsies. Minor flaw: the single quote handler got triggered by the apostrophe in the word aren't. The scanner thought it was the start of a string literal. I will have to fix both of the quote handlers for the ' and " characters.
" is meant for string literals. ' is meant to be used as an apostrophe. I thought it would be a grand feature to handle string literals that look like 'this string', but it was not a good feature.
Fixing the string literal handler now:
if (char.match(/["]/)) {
    var dstring = '';
    char = train[++index];
    while (char.match(/[^"]/)) {
        if (index + 1 == train.length - 1) {
            break;
        }
        dstring += char;
        char = train[++index];
    }
    //remove this line
    char = train[++index];
    divvy.push(dstring);
    //remove this line too
    char = train[--index];
} else
//GET RID OF THIS BLOCK
if (char.match(/[']/)) {
    var sstring = '';
    char = train[++index];
    while (char.match(/[^']/)) {
        if (index + 1 == train.length - 1) {
            break;
        }
        sstring += char;
        char = train[++index];
    }
    char = train[++index];
    divvy.push(sstring);
    char = train[--index];
}
File system read error. The lexeme array wouldn't print out to the terminal.
Solution:
const fileSystem = require("fs");
var data = "";
const readStream = fileSystem.createReadStream("input.txt");
readStream.setEncoding("UTF8");
readStream.on("data", (chunk) => {
    data += chunk;
});
readStream.on("end", () => {
    console.log(data);
});
readStream.on("error", (error) => {
    console.log(error.stack);
});
It works. Now to connect the read stream stuff to the lexical analyzer. Strike that. There's an error.
Back to readFileSync. Did I forget to include a space after the end of the file? Yes. Yes I did.
Updated my string literal handler:
if (char.match(/["]/)) {
    var string = '';
    //add opening quote
    string += char;
    char = train[++index];
    while (!char.match(/["]/)) {
        if (index == train.length - 1) {
            break;
        }
        string += char;
        char = train[++index];
    }
    //add closing quote
    string += char;
    char = train[++index];
    divvy.push(string);
    //leave this line alone.
    char = train[--index];
}
The break condition inside all five while loops of the scanner statements might be causing the problem of the scanner not displaying the lexemes properly after divvying up the text. The same issue happened 12 days prior because of the lingering whitespace at the very end of the string that the lexer needed. Since the while loops use a match function to add a varying number of specific characters, searched for by a regular expression and proven valid, there had to be a way to keep the while loops from either moving the scanner head out of bounds or running forever as soon as the end of file is reached.
Possible solution:
while (char.match(/[A-Za-z]/)) {
    if (index == everything.length - 1) {
        break;
    }
    console.log(`Beep Beep ${index}`);
    word += char;
    char = everything[++index];
    //adding this: if the next character is not an alphabet letter, break.
    if (!char.match(/[A-Za-z]/)) {
        break;
    }
}
Didn't work.
Analysis - the while loop will still be stuck on the last valid character once the string is fully read.
//rios
//char stuck on index 3 - s
while (char.match(/[A-Za-z]/)) {
    console.log(`Beep Beep ${index}`);
    word += char; //r i o s
    //change
    if (index == everything.length - 1) {
        break;
    }
    char = everything[++index];
}
The last whitespace character after the final character serves as the end of the string. Use that. There are literally no workarounds for that.
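In code, that just means guaranteeing the input ends with one whitespace character before scanning. A minimal sketch, assuming the readFileSync setup from earlier:
const { readFileSync } = require('fs');

//Append one space so every character-collecting while loop has a guaranteed
//non-matching character to stop on at the end of the input.
const everything = (readFileSync('./streamer.txt', 'utf-8') + ' ').split('');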
Making a local lexeme variable for the scanner statement blocks to use.
Rewriting the regular expression of the TAB_INDENT constant from /\t/ to /^[ ]{4}$/. Using the tab key seems impractical.
Got rid of TAB_INDENT. Not needed.
Question from 2 hours ago: can comments be tokenized? Absolutely not, since comments are ignored anyway.
Building my lexer now. A few hours ago, I was thinking about how to implement the function. I didn't want to use a long if/else-if conditional chain, because it would make the program run slower by forcing it to check every condition, and it would take up too many lines. I could use switch-case statements to make the code neater and quicker, but in my case that won't work for string literals. I can, however, enclose the decision core in an if/else statement: if the lexeme is a string literal, make a string token; otherwise run the core and generate the corresponding reserved token if the word is found, with the id token generated by default. Another if statement outside the core can be used to make number literal tokens.
Lexer:
for every element in lexeme array:
    get the element from the array and evaluate it
    if string literal:
        make string token
    else if number literal:
        make number token
    else:
        switch stmt based on lexeme:
            case is any keyword:
                make reserved token
            default:
                make id token
//This pseudocode is not accurate as I haven't planned out the algorithm fully yet
I was also marinating on the idea of putting the keyword names and their assigned token info in a separate file. It would drastically reduce lines, but getting the information you want from the file will be tricky. I haven't gone over reading and writing files since almost 3 years ago, and that was for a C++ class.
The switch cases will be long. I used arrays for this problem of determining whether the input is a keyword before, and it was a bit of a hassle, especially since it required an if statement to check whether the input matches an element inside that array. Might as well use an if/else-if ladder to replace the if-array train. The performance is the same.
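A rough sketch of that separate-file idea (keywords.json is a hypothetical file name and the token types are illustrative):
const { readFileSync } = require('fs');

//keywords.json (hypothetical) would hold something like:
//{ "if": "reserved", "else": "reserved", "while": "reserved" }
const KEYWORDS = JSON.parse(readFileSync('./keywords.json', 'utf-8'));

function wordToken(lexeme) {
    //a plain object lookup replaces the long switch / if-else-if ladder
    return { type: KEYWORDS[lexeme] || 'id', value: lexeme };
}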
What are symbol tables? Arrays? Semantic analysis stuff. Come back to that later.
Now implementing my lexer. I planned it out.
Pseudocode of lexer V2 is done:
if string literal:
    make string token
else if number literal:
    make number token
else if identifier:
    switch stmt based on id lexeme:
        case is any keyword:
            make reserved token
        default:
            make id token
else if punctuation:
    switch stmt based on punctuation lexeme:
        case is any keyword:
            make punctuation token
else if operator:
    switch stmt based on operator lexeme:
        case is any keyword:
            make operator token
        default:
            invalid operator
else:
    skip through the rest of the lexeme array
    until a newline or end of multiline comment is found
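A minimal JavaScript sketch of that pseudocode (the regexes, keyword list, and token type names here are only illustrative, not the exact ones in lexer V2; wordup and bucks follow the names used later in the real code):
function evaluate(wordup, bucks) {
    if (wordup.match(/^".*"$/)) {
        bucks.push({ type: 'string', value: wordup });
    } else if (wordup.match(/^\d+$/)) {
        bucks.push({ type: 'number', value: wordup });
    } else if (wordup.match(/^[A-Za-z_]\w*$/)) {
        switch (wordup) {
            case 'if':
            case 'else':
            case 'while':
                bucks.push({ type: 'reserved', value: wordup });
                break;
            default:
                bucks.push({ type: 'id', value: wordup });
        }
    } else if (wordup.match(/^[(){}\[\];,]$/)) {
        bucks.push({ type: 'punctuation', value: wordup });
    } else if (wordup.match(/^[-&|!\+\*\/=<>%?]+$/)) {
        bucks.push({ type: 'operator', value: wordup });
    } else {
        //comment symbols and anything else fall through here; the real lexer
        //skips ahead until a newline or the end of a multiline comment
    }
}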
Constructing the lexer for real this time
Back to work on the implementation
Lexer V2 is now complete. Prepare for testing.
Put "The Mandalorian rules!" into the streamer.txt file.
The token array is not being displayed.
It could be because I anchored the regular expressions on the lexer's if statements.
Actually, the issue could be the regexp for the operator evaluator section.
Before: ^[-&|!\+\*\/=<>%?]$
After: ^[-&|!\+\*\/=<>%?]+$
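A quick check of the difference between the two patterns:
/^[-&|!\+\*\/=<>%?]$/.test('==');   // false - the single-character version rejects "=="
/^[-&|!\+\*\/=<>%?]+$/.test('==');  // true - the + quantifier accepts multi-character operators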
The issue was the excess whitespace in the streamer.txt file itself!!!
The tempCodeRunnerFile.js file is keeping the Lexical Analyzer V2 from executing. It keeps popping back up every time I try to run my lexer.
Restarting VSCode to alleviate this problem.
The lexer v2 is now fully operational. Also, VSCode got updated to v1.69.1
const fs = require('fs');

const content = 'Some content!';
try {
    fs.writeFileSync('/Users/joe/test.txt', content);
    // file written successfully
} catch (err) {
    console.error(err);
}
Using a separate file to print the data to. The source code characters were too much for the terminal to handle: 1603 characters, 255 whitespaces.
A section of a volume 1 chapter transcript was all it took.
The divvy array contents can now be written to the wordplay.txt file. The writeFileSync function takes string values only.
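Something like this is enough, since join collapses the array into one string (a sketch of the idea, not the exact line in my file):
const fs = require('fs');

//writeFileSync only accepts strings (or buffers), so join the lexeme array
//into one string with a newline between each lexeme
fs.writeFileSync('./wordplay.txt', divvy.join('\n'));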
Trying to print array contents that are objects.
[
{ type: 'left_paren', value: '(' },
{ type: 'name', value: 'add' },
{ type: 'number', value: '2' },
{ type: 'left_paren', value: '(' },
{ type: 'name', value: 'subtract' },
{ type: 'number', value: '4' },
{ type: 'number', value: '2' },
{ type: 'right_paren', value: ')' },
{ type: 'right_paren', value: ')' }
];
wordplay.txt
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
[object Object]
w3schools.com - solutions to display JavaScript objects
- Displaying the Object Properties by name
- Displaying the Object Properties in a Loop
- Displaying the Object using Object.values()
- Displaying the Object using JSON.stringify()
Found the solution.
Object.keys(object1)
Object.values(object1)
Getting close to displaying the token info correctly.
${Object.keys(fruits[0]).toString()} ${Object.values(fruits[0]).toString()}
type,value number,2
Trying out Object.entries
Returns an array of [key, value] pairs, one per property.
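For example:
Object.entries({ type: 'number', value: '2' });
// [ [ 'type', 'number' ], [ 'value', '2' ] ]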
Getting even closer to seeing the object info.
var see_all_tokens = [];
for (let index = 0; index < fruits.length; index++) {
    see_all_tokens.push(`${Object.keys(fruits[index]).toString()} \n ${Object.values(fruits[index]).toString()}`);
}
see_all_tokens.toString();
Results:
type,value
number,2,type,value
number,8,type,value
string,hera,type,value
number,4,type,value
left_paren,(,type,value
name,multiply,type,value
number,4,type,value
number,17,type,value
right_paren,)
I can see the tokens now!
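In hindsight, the JSON.stringify option from the w3schools list would have given each token its own clean line. A small alternative sketch using the same fruits array:
const see_all_tokens = fruits.map((token) => JSON.stringify(token)).join('\n');
console.log(see_all_tokens);
// {"type":"number","value":"2"}
// {"type":"number","value":"8"}
// ...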
Multiline comment symbols are throwing invalid operator warnings. I will need to fix my tokenizer code.
Tokenizer works!!!
Note when enabling the comment ignoring functionality: the scanner treats single line and multi line comment symbols as operator lexemes, since they're made of the characters / and *, which are primarily used as arithmetic operators.
Move the comment-ignoring code from the else statement to inside the operator lexeme statement in the lexer:
} else if (wordup.match(/^[-&|!\+\*\/=<>%?]+$/)) {
    switch (wordup) {
        case "<op symbol>":
            bucks.push({type: 'operator', value: wordup});
            break;
        //new cases from here
        case "//":
        case "<start multi line comment symbol>":
        case "<end multi line comment symbol>":
        //to here
            while (!wordup.match(/^(\n|\*\/)$/)) {
                wordup = divvy[++index];
                if (index + 1 == divvy.length) {
                    break;
                }
            }
            divvy[--index];
            break;
        default:
            //so this line doesn't fire
            console.log(`${wordup} is not a valid operator lexeme`);
    }
//then erase the else statement. you don't need it
} else {
    //move this block from here
    while (!wordup.match(/^(\n|\*\/)$/)) {
        wordup = divvy[++index];
        if (index + 1 == divvy.length) {
            break;
        }
    }
    divvy[--index];
}
Now the lexer can better ignore single line comments.
Lexical Analyzer V2 is fully operational, for real this time.
It can now write lexemes and tokens 1 line each to their respective files.
Testing the comment ignorer again. I think it is flawed: the while loop condition may cause the ignorer to not ignore things properly.
case "//":
//do I need to put ignorer code here too?
case "<start multi line comment symbol>":
case "<end multi line comment symbol>":
//if case "/*", wouldn't the content just get ignored on that line only?
while(!wordup.match(/^(\n|\*\/)$/)){
wordup = divvy[++index]
if (index + 1 == divvy.length) {
break;
}
}
divvy[--index]
break;
Testing the comment ignorer with lines from the character Jacques Schnee, taken from RWBY Volume 7 chapters 4 to 9. All 59 lines. Now 38 lines.
4512 input characters, 792 whitespaces. The tokenizer is not firing:
?! is not a valid operator lexeme
!? is not a valid operator lexeme
The files are not in the right filepath. Also, the blob is too big.
case "//":
case "<start multi line comment symbol>":
case "<end multi line comment symbol>": //Is causing to tokenizer to not execute entirely.
while(!wordup.match(/^(\n|\*\/)$/)){
wordup = divvy[++index]
if (index + 1 == divvy.length) {
break;
}
}
divvy[--index] //that line may be unnecessary. Make this block skip the closing multiline comment symbol
break;
Actually, leave the "" case alone. It'll keep the default case from firing.
As I suspected, the multi line comment ignorer's condition is bad. The ignorer stopped after reaching the newline.
The multi line ignorer is working flawlessly now. I also copied and pasted that code into the // case.
The prefix decrementer was actually causing the problem of the lexer not tokenizing. So now the lexer can handle massive source code blobs.
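Roughly, the fixed cases look something like this (a sketch of the end result, reusing the placeholder case labels from above; my notes don't show the final code verbatim):
case "//":
    //single line comment: skip lexemes until the next newline
    while (!wordup.match(/^\n$/)) {
        if (index + 1 == divvy.length) {
            break;
        }
        wordup = divvy[++index];
    }
    break;
case "<start multi line comment symbol>":
    //multi line comment: skip lexemes, newlines included, until the closing symbol
    while (!wordup.match(/^\*\/$/)) {
        if (index + 1 == divvy.length) {
            break;
        }
        wordup = divvy[++index];
    }
    break;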
Hopefully this is the last time I'm documenting this. The Lexer V2 is now fully operational. Absolutely operational. 100% fully operational.
Having the lexer V2 make digit tokens that carry decimal and scientific notation values. I'm essentially giving the V2 the same functionality as the V1.
Trying this regex string:
^(\d+|\d+\.\d*|\d*\.\d*[Ee]([-+]\d{1,2}|\d{1,2}))$
Can the regexp handle:
- 12 - Yes
- 12.4 - Yes
- .76 - No. There needs to be a digit before the decimal point
- 0.8 - Yes
- 3e4 - No. A decimal point must be there
- 5E+2 - No. A decimal point must still be there
- 4e-1 - No. The decimal point must absolutely be there
- .7e-5 - Yes
- 7.e+3 - Yes
Invalid test. Should reject this:
- .E1 - Yes. That's not allowed
Updated the regexp to this:
^(\d+|(\d+\.\d*|\d*\.\d+)|(\d+|\d+\.\d*|\d*\.\d+)[Ee]([-+]\d{1,2}|\d{1,2}))$
Can the regexp handle:
- 12 - Yes
- 12.4 - Yes
- .76 - Yes
- 0.8 - Yes
- 3e4 - Yes
- 5E+2 - Yes
- 4e-1 - Yes
- .7e-5 - Yes
- 7.e+3 - Yes
Invalid test. Should reject this:
- .E1 - No. It's all good
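A quick way to check all of those cases at once (NUMBER_LITERAL is just a name for this test, not a constant from the lexer):
const NUMBER_LITERAL = /^(\d+|(\d+\.\d*|\d*\.\d+)|(\d+|\d+\.\d*|\d*\.\d+)[Ee]([-+]\d{1,2}|\d{1,2}))$/;
const samples = ['12', '12.4', '.76', '0.8', '3e4', '5E+2', '4e-1', '.7e-5', '7.e+3', '.E1'];

for (const sample of samples) {
    //everything prints true except the invalid ".E1"
    console.log(sample, NUMBER_LITERAL.test(sample));
}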
In: 123.e56
[{"type":"number","value":"123."},{"type":"identifier","value":"e"},{"type":"number","value":"56"}]
The algorithm is too impressive. It split the supposed number literal into 3 tokens. I need to fix the if statement inside the digit handler while loop.
} else if (char.match(NUMBER_STEW)) {
    while (char.match(NUMBER_STEW)) {
        if (index + 1 == everything.length) {
            break;
        }
        lexeme += char;
        char = everything[++index];
        //fix this block.
        if (char.match(/[-+\.Ee]/)) {
            lexeme += char;
            char = everything[++index];
        }
    }
    divvy.push(lexeme);
    char = everything[--index];
}
Make that block into a while loop. What happens?
The problem is solved! [{"type":"number","value":"123.e56"}]
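For reference, the digit handler with the inner if turned into a while loop (a sketch, assuming NUMBER_STEW matches a single digit character):
} else if (char.match(NUMBER_STEW)) {
    while (char.match(NUMBER_STEW)) {
        if (index + 1 == everything.length) {
            break;
        }
        lexeme += char;
        char = everything[++index];
        //now a while loop, so a run like ".e" in 123.e56 is fully consumed
        while (char.match(/[-+\.Ee]/)) {
            lexeme += char;
            char = everything[++index];
        }
    }
    divvy.push(lexeme);
    char = everything[--index];
}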
Planning to merge the compiler-mod branch holding the lexer V2 file with the save-point branch. I want the lexer V2 file to be plugged into the JSON formatter file.
Modularized the scanner and evaluator and plugged them into base.js. The main file will take in the txt input and output tokens for now using the V2. Will update momentarily.
The formatter algorithm should just be used for the AST. The algorithm indented the rest of the results when it saw the punctuator value of {.
[
//normal
{
"type":"punctuator",
"value":")"
},
//not normal
{
"type":"punctuator",
"value":"{
"
},
{
"type":"identifier",
"value":"total"
},
{
"type":"operator",
"value":"+="
},
{
"type":"identifier",
"value":"number"
},
{
"type":"punctuator",
"value":"
}"
}
]
Or I can just write another algorithm for the lexer's arrays. The formatter algorithm was meant for the parser, not the lexer.
Deleted wordplay.txt. All I need to see is the array of tokens. I'm not going to make the formatter algorithm for the lexer. Too much of a hassle for me.
The V2 is modular and ready to go!
The lexer V2 is not handling inputs like 5+10*3 well now, because for the number handler I added a while loop inside the while loop. The inner loop collects anything else within the digits: -, +, ., E, e. That loop was meant to let the lexer accommodate decimal numbers and scientific notation numbers.
5+10 triggers an invalid digit lexeme error.
Plan: make the lexer more precise. There's no need for the lexer V2 to rely on whitespace as the main separator. Examples of the intended splitting, with a sketch after the list:
-3-5 splits to: -, 3, -, 5
-3+5 splits to: -, 3, +, 5
-3++5 splits to: -, 3, ++, 5
3+++5 should not throw an invalid error. Instead, split it to: 3, ++, +, 5
3.0 stays 3.0
4..0 splits to: 4, ., ., 0
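A rough sketch of that splitting plan (not the final V2 code; it assumes a small fixed list of known operators and keeps a '.' only when another digit follows it, so 3.0 stays together while 4..0 falls apart):
const KNOWN_OPERATORS = ['++', '--', '+', '-', '*', '/'];   //illustrative subset

function split(source) {
    const divvy = [];
    let index = 0;
    while (index < source.length) {
        const char = source[index];
        if (char.match(/\d/)) {
            //collect a digit run; keep a '.' only when another digit follows it
            let lexeme = '';
            while (index < source.length &&
                   (source[index].match(/\d/) ||
                    (source[index] === '.' && /\d/.test(source[index + 1] || '')))) {
                lexeme += source[index++];
            }
            divvy.push(lexeme);
        } else if (char.match(/[-+*\/]/)) {
            //collect an operator run, then split it greedily against the known operators
            let run = '';
            while (index < source.length && source[index].match(/[-+*\/]/)) {
                run += source[index++];
            }
            while (run.length > 0) {
                const op = KNOWN_OPERATORS.find((o) => run.startsWith(o)) || run[0];
                divvy.push(op);
                run = run.slice(op.length);
            }
        } else if (char.match(/\s/)) {
            index++;   //whitespace is no longer the main separator, just skip it
        } else {
            divvy.push(char);   //anything else, like a lone '.', becomes its own lexeme
            index++;
        }
    }
    return divvy;
}

console.log(split('3+++5'));   // [ '3', '++', '+', '5' ]
console.log(split('-3-5'));    // [ '-', '3', '-', '5' ]
console.log(split('4..0'));    // [ '4', '.', '.', '0' ]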