Skip to content

Latest commit

 

History

History
423 lines (298 loc) · 18.2 KB

bison_analysis.md

File metadata and controls

423 lines (298 loc) · 18.2 KB

table分析

和flex类似,bison会有一系列用于parse的表,而且经过压缩处理。在理顺这些表的作用和索引方式后,对yyparse的理解就方便很多了。

yytranslate

这个表映射了token字符和symbol token之间的关系。如果是预定义的%token,则会映射到别的token序号去;如果只是简单的终结符,则会把对应字符的ascii码映射为对应的token。 例如S: '\n' ;规则在生成后如下

static const yytype_uint8 yytranslate[] =
{
       0,     2,     2,     2,     2,     2,     2,     2,     2,     2,
       3,     2,     2,     2,     2,     2,     2,     2,     2,     2,
...
       2,     2,     2,     2,     2,     2,     2,     2,     2,     2,
       2,     2,     2,     2,     2,     2,     1,     2
};

\n ascii为10,此处yytranslate[10]=3 大部分映射的token为2,而2是YYEMPTY未定义的。当yyparse()调用yylex()然后获取对应字符的symbol token内部码。

yydefact

yydefact存储了每个状态下,产生式规则的序号(rule number)。

  /* YYDEFACT[STATE-NUM] -- Default reduction number in state STATE-NUM.
     Performed when YYTABLE does not specify something else to do.  Zero
     means the default is an error.  */
static const yytype_uint8 yydefact[] =
{
       0,     2,     0,     1
};

在STATE-NUM状态下,通过YYDEFACT[STATE-NUM]获取其对应的产生式序号。

其中,0号表示error。由于默认($accept → L $end)rule(其rule number为1)的存在,我们定义的所有rule索引都会增加1。即我们定义的第一个rule S: '\n' ;, 其rule number在此处为2。

yydefgoto

yydefgoto是GOTO跳转表的压缩形式。它的元素数量为语法中非终结符的数量。元素的值表示每个非终结符要跳转的state

  /* YYDEFGOTO[NTERM-NUM].  */
static const yytype_int8 yydefgoto[] =
{
      -1,     2
};

其规则为

yydefgoto[nth non-terminal] = most common GOTO state for the nth non-terminal.

索引方式为rule number减去非终结符的数目

  /* yyn is the number of a rule to reduce with.  */

yystate = yydefgoto[yyn - YYNTOKENS];

parser在用rule归约栈上内容的时候会查询yydefgoto,但在某些状态下会使用yytable表。

第一个非终结符$accept的rule为-1(即yydefgoto[0]),$accept不会被归约。

yyr1 & yyr2

yyr1是每个rule左部的symbol number。0号不作为rule,其共有NRULES+1个元素

  /* YYR1[YYN] -- Symbol number of symbol that rule YYN derives.  */
static const yytype_uint8 yyr1[] =
{
       0,     4,     5
};

通过yyr1,在归约发生的时候,我们可以知道这个归约式左部的symbol并转移到对应状态。

yyr2表示每个rule右部的symbol数量


  /* YYR2[YYN] -- Number of symbols on the right hand side of rule YYN.  */
static const yytype_uint8 yyr2[] =
{
       0,     2,     1
};

例如rule2的S: '\n'右侧有1个symbol,因此yyr2[2]=1。在归约的时候,通过这个表中查找在对应的rule下,我们要从栈中pop出多少state来作为归约。

yytable

yytable和yycheck, yypact, yypgoto 配合来表示当前的状态是应该放入parse stack中还是符合rule来进行归约。

#define YYTABLE_NINF -1

  /* YYTABLE[YYPACT[STATE-NUM]] -- What to do in state STATE-NUM.  If
     positive, shift that token.  If negative, reduce the rule whose
     number is the opposite.  If YYTABLE_NINF, syntax error.  */
static const yytype_uint8 yytable[] =
{
       1,     3
};

正数表示移入。负数表示归约,(其绝对值表示归约产生式的rule值,即如果为-3,则表示使用rule3来归约栈上的符号)。

YYTABLE_NINF用来表示我们在产生式中指定error情况下的状态。

If the value in YYTABLE is positive, we shift the token and go to that state. If the value is negative, it is minus a rule number to reduce by. If the value is YYTABLE_NINF, it's a syntax error.

yypgoto

因为同一个非终结符,根据之前状态的不同,其跳转的状态也是不同的。公用的GOTO表定义在前文提到的YYDEFGOTO中。那么对于存在不同情况的GOTO情况,则由yypgoto记录。

  /* YYPGOTO[NTERM-NUM].  */
static const yytype_int8 yypgoto[] =
{
      -4,    -4
};

这个表的索引对应着所有的非终结符,此处的两个元素分别是$accept,S的入口

比如现在在rule2 (S -> '\n')归约后,栈顶的状态为4,parser会将yypgoto[S]的值(此处yypgoto[1]=-4)加上原先的状态值作为现在的状态索引,去yytable找到新的状态值,即此时状态索引为0 (4-4), 那么yytable[0]为1,那么现在的状态就是1,被压入状态栈顶。

那么parser怎么知道是选择yypgoto还是yydefgoto来设置接下来的状态呢?答案在接下来会将的yycheck中,即通过yycheck来选择对应的模式(yypgoto或yydefgoto)

yypact

yypact定义了在S状态(初始状态)下的部分yytable。它由token标号来索引。

#define YYPACT_NINF -4
  /* YYPACT[STATE-NUM] -- Index in YYTABLE of the portion describing
     STATE-NUM.  */
static const yytype_int8 yypact[] =
{
      -3,    -4,     1,    -4
};

在parse循环的开始时,如果yypact[cur-state] = YYPACT_NINF, 意味着使用yydefact来归约。也即在这种状态下,只有归约没有移入的操作。 例如此处yypact中为YYPACT_NINF (-4)的状态有1和3.即在1号状态下,只进行对应的归约此操作。

如果现在在状态0,展望符为\n(即3号symbol,由yytranslate可得),那么现在的yypact[0]为-3,所以在yytable中的索引应该为-3+3=0, yytable[0]为1,表示移入并转移到状态1。

yycheck

yycheck主要有两个作用,一个是判断归约还是移入;一个是判断选择哪个跳转表。(设计yydefgoto 和yypgoto,将公共的goto提取出来到yydefgoto ,可以节省空间,而后再通过check的方式选择具体哪个表)

static const yytype_uint8 yycheck[] =
{
       3,     0
};

yytable or yydefact

源码注释src/tables.h#L58中可以看到,YYCHECK通过索引的界限来判断是否采用yytable

YYCHECK = a vector indexed in parallel with YYTABLE. It indicates, in a roundabout way, the bounds of the portion you are trying to examine. Suppose that the portion of YYTABLE starts at index P and the index to be examined within the portion is I. Then if YYCHECK[P+I] != I, I is outside the bounds of what is actually allocated, and the default (from YYDEFACT or YYDEFGOTO) should be used. Otherwise, YYTABLE[P+I] should be used.

如果现在在状态0,展望符为\n(即3号symbol),那么现在的yypact[0]为-3,此时我们要通过yycheck来选择是使用yytable还是yydefact。我们查看yycheck[0]的值为3.

parse此时会判断yycheck[0]是否为所有有效token的symbol number, 如果不是,则进行yydefault的归约。因为要保证在yytable里面索引的值是有效的symbol序号才有意义。

/* If the proper action on seeing token YYTOKEN is to reduce or to
     detect an error, take that action.  */
  yyn += yytoken;
  if (yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken)
    goto yydefault;

此处yycheck[0]为3,而\n正好是3号symbol,所以我们可以从yytable中去索引接下来的状态。

yydefgoto or yypgoto

除此以外,yycheck还帮助判断选择yydefgoto亦或是yypgoto.

  /* Now 'shift' the result of the reduction.  Determine what state
     that goes to, based on the state we popped back to and the rule
     number reduced by.  */

  yyn = yyr1[yyn];

  yystate = yypgoto[yyn - YYNTOKENS] + *yyssp;
  if (0 <= yystate && yystate <= YYLAST && yycheck[yystate] == *yyssp)
    yystate = yytable[yystate];
  else
    yystate = yydefgoto[yyn - YYNTOKENS];

比如现在在rule2 (S -> '\n')归约后,栈顶的状态为4,parser会将yypgoto[S]的值(此处yypgoto[1]=-4)加上原先的状态值作为现在的状态索引4-4=0,查询yycheck,如果为4,则表示状态4为特殊状态,选择yytable; 否则就用yydefgoto表来决定跳转。

其他表

yyrhs - A -1 separated list of RHS symbol numbers of all rules. yyrhs[n] = first symbol on the RHS of rule #n.

yyprhs[n] - Index in yyrhs of the first RHS symbol of rule n.

yyrline[n] - Line # in the .y grammar source file where rule n is defined.

yytname[n] - A string specifying the symbol for symbol number n. yytoknum[n] - Token number of token n.

小结

  • yytranslate: maps token numbers returned by yylex() to Bison's internal symbol
  • yydefact: lists default reduction rules for each state. 0 represents an error.
  • yydefgoto: lists default GOTOs for each non-terminal symbol. It is only used after checking with yypgoto.
  • yyr1: symbol number of lhs of each rule. Used at the time of a reduction to find the next state.
  • yyr2: length of rhs of each rule. Used at the time of reduction to pop the stack.
  • yytable: a highly compressed representation of the actions in each state. Negative entries represent reductions. There is a negative infinity to detect errors.
  • yypgoto: accounts for non-default GOTOs for all non-terminal symbols.
  • yypact: directory into yytable indexed by state number. The displacements in yytable are indexed by symbol number.
  • yycheck: guard table used to check legal bounds within portions of yytable.

bison parse

当bison读入一个终结符token,它会将该终结符及其语意值一起压入分析器栈parser stack。把一个token压入栈就是移入shifting操作。由于采用LALR,故parser会分析展望符,当一个终结符被读进来后,并不会马上移入parser stack,根据对应的展望符来确定接下来的移入归约/状态

yyparse主体框架(详见注释)

/* look-ahead symbol */
int yychar;

/* The semantic value of the look-ahead symbol.  */
YYSTYPE yylval;

int 
yyparse()
{
	int yystate; 	/* current state */
	int yyn;	/* This is an all purpose variable! One time it may represent a state, the next time, it might be a rule! */
	
	int yyresult;	/* Result of parse to be returned to the called */
	
	int yytoken=0;	/* current token */
	
	/* The state stack: This parser does not shift symbols on to the stack. Only a stack of states is maintained. */	
	
	 int yyssa[YYINITDEPTH];	/*YYINITDEPTH is 200 */
	 int *yyss = yyssa	/* Bottom of state stack */
	 int *yyssp;	/* Top of state stack */
	 
	 /* The semantic value stack: This stack grows parallel to the state stack. At each reduction, semantic values are popped 
	  * off this stack and the semantic action is executed
	  */
	  YYSTYPE yyvsa[YYINITDEPTH];
	  YYSTYPE *yyvs = yyvsa;	/* Bottom of semantic stack */
	  YYSTYPE *yyvsp;	/* Top of semantic stack */
	
	  /* POP the state and semantic stacks by N symbols - useful for reduce actions */
	  #define YYPOPSTACK(N)   (yyvsp -= (N), yyssp -= (N))
	  
	  
	  int yylen = 0;	/* This variable is used in reduce actions to store the length of RHS of a rule; */
	  
	  /* Ok done declaring variables. Set the ball rolling */
	  
	  yystate = 0;	/* Initial state */
	  yychar = YYEMPTY /* YYEMPTY is -2 */
	  
	  yyssp = yyss; 	/* Top = bottom for state stack */
	  yyvsp = yyvs;	/* Same for semantic stack */
	  
	  goto yysetstate; 	/* Well, gotos are used for extracting maximum performance. */
	  
	  
	  /* Each label can be thought of as a function */
	  
	  yynewstate:  /* Push a new state on the stack */
	  
	  		yyssp ++;	/*Just increment the stack top; actual 'pushing' will happen in yysetstate */
	  		
	  		
	  yysetstate:
	  		
	  		*yyssp = yystate;	/* Ok pushed state on state stack top */
	  		
	  		goto yybackup;	/* This is where you will find some action */
	  		
	  		
	  yybackup:	/* The main parsing code starts here */
	  	
	  		yyn = yypact[yystate];	/* Refer to what yypact is saying about the current state */
	  		
	  		if ( yyn == YYPACT_NINF) 	/* If negative infinity its time for a default reduction */
	  		
	  			goto yydefault;	/* This label implements default reductions; see below */
	  			
	  		
	  		/* Check if we have a look-ahead token ready. This is LALR(1) parsing */
	  		
	  		if (yychar == YYEMPTY)
	  			
	  			yychar = YYLEX; /* Macro YYLEX is defined as yylex() */
	  				
	  		if (yychar <= YYEOF) 	/* YYEOF is 0 - the token returned by lexer at end of input */
	  			
	  			yychar = yytoken = YYEOF; /* set all to EOF */
	  				
	  		else
	  		
	  			yytoken = yytranslate[yychar];	/* Translate the lexer token into internal symbol number */
	  				
	  		
	  		/* Now we have a look-ahead token. Let the party begin ! */
	  		
	  		
	  		yyn = yyn + yytoken;	/* This is yypact[yystate] + yytoken */
	  		
	  		
	  		/* Observe this check carefully. We are checking that yyn is within the bounds of yytable
	  		 * and also if yycheck contains the current token number.
	  		 */
	  		if ( yyn < 0 || YYLAST < yyn  || yycheck[yyn] != yytoken )	/* YYLAST is the highest index in yytable */
	  			  			 
	  			goto yydefault; 	/* Its time for a default reduction */
	  				
	  		/* Ok, yyn is within bounds of yytable */
	  		
	  		yyn = yytable[yyn];	/* This is yytable[ yypact[yystate] + yytoken ] */
	  		
	  		if (yyn <= 0)	/* If yytable happens to contain a -ve value, its not a shift - its a reduce */
	  		{
	  			if (yyn == 0 || yyn == YYTABLE_NINF)	/* But check for out of bounds condition*/
	  				goto yyerrlab;	/* Label to handle errors */
	  		
	  			/* Other wise reduce with rule # -yyn */
	  			
	  			yyn = -yyn;
	  			goto yyreduce; 	/* Label to implement reductions */
	  		}
	  		
	  		/* Last check: See if we reached final state! */
	  		if (yyn == YYFINAL)	/* YYFINAL is 8 in our case */
	  			YYACCEPT;	/* macro deined as 'goto acceptlab - a label to finish up */
	  			
	  		/* That completes all checks; If we reached here, there is no other option but to shift */
	  		
	  		yystate = yyn;	/* Now, yyn (= yytable[ yypact[yystate] + yytoken ]) is a state that has to be pushed */
	  		
	  		*++yyvsp = yylval; /* Push the semantic value of the symbol on the semantic stack */
	  		
	  		goto yynewstate;	/* This will increment state stack top and the following yysetstate that will do the pushing */
	  		
	  
	  yydefault:	/* A label to implement default reductions */
	  	
	  		yyn = yydefact[yystate];	/* Get the default reduction rule for this state */
	  		
	  		if ( yyn == 0 )	/* This state has no default reduction. Something is wrong */
	  		
	  			goto yyerrlab;
	  			
	  		goto yyreduce;	/* Ok, got the default reduction rule # in yyn; go ahead and reduce the stack */
	  		
	  	
	  
	  yyreduce:	/* A lablel that implements reductions on stack. */
	  	
	  		/* By the time we are here, yyn contains the rule# to use for reducing the stack. */
	  		
	  		/* Steps for reduction:
	  		 * 1. Find the length of RHS of rule #yyn
	  		 * 2. Execute any semantic actions by taking the values from the semantic stack
	  		 * 3. POP 'length' symbols from the state stack and 'length' values from semantic stack
	  		 * 4. Find the LHS of rule #yyn
	  		 * 5. Find the GOTO of state currently on top of stack on LHS symbol
	  		 * 6. Push that state on top of stack
	  		 * 
	  		 */
	  		 
	  		 yylen = yyr2[yyn];	/* Get length of RHS */
	  		 
	  		 /* Default semantic action - $$=$1 */
	  		 yyval = yyvsp[1-yylen];
	  		 
	  		 /* Execute semantic actions */
	  		 switch ( yyn )	/* Each rule has its own semantic action */
	  		 {
	  		 
	  		 	default:	break;	/* We didn't have any semantic actions in the grammar.*/
	  		 	
	  		 }
	  		 
	  		 YYPOPSTACK (yylen);	/* This will pop both state and semantic stacks. See definition of this macro above */
	  		 
	  		 yylen = 0;	/* re-initialize yylen */
	  		 
	  		 *++yyvsp  = yyval;	/* Push the result of semantic evaluation on top of semantic stack */
	  		 
	  		 
	  		 /* Now shift the result of reduction (steps 4 - 6) */
	  		 
	  		 yyn = yyr1[yyn];	/* Reuse yyn at every opportunity.  For now, yyn is the LHS symbol (number) of the rule */
	  		 
			 /* First check for anomalous GOTOs, otherwise use Default GOTO (YYDEFGOTO)
			  * 
			  * Observe that if we subtract no. of terminals (YYNTOKENS) from symbol number of a nonterminal, we get
			  * an index into yypgoto or yydefgoto for that non-terminal.
			  */
			  
	  		 yystate = yypgoto[yyn - YYNTOKENS] + *yyssp;
	  		 
	  		 /* A couple of checks are needed before we know this is not a default GOTO
	  		  * 1. yystate must be within bounds of yytable. ( 0 to YYLAST )
	  		  * 2. yycheck must contain the state currently on top of the stack
	  		  */	  		 
	  		 if ( 0 <= yystate && yystate <= YYLAST && yycheck[yystate] = *yyssp)
	  		 
	  		 	yystate = yytable[yystate];	/* Take the GOTO from yytable */
	  		 	
	  		 else
	  		 	
	  		 	yystate = yydefgoto[yyn - YYNTOKENS];	/* Otherwise use the default GOTO */
	  		 	
	  		 goto yynewstate;	/* Simply push the newly found state on top of stack and continue */
	  		 
}	/* End of yyparse() */
	  		 

总结

bison将上下文无关文法转换为LALR(1)。 可以看到,bison在压缩了table之后的查找索引方面有很多机制,在理顺了这些表之后,整个自动机的雏形也就差不多了。

murmur:感觉此处的表压缩效率还是不太高,有很多表例如yytranslate等等还是有很多重复的字符。这个地方或许可以用类似哈夫曼树的方式来继续压缩,然后在parse前做一段解压的过程,算是时间换空间,这样不会影响parse时的效率(既然追求空间,那就做到极致咯)bison生成的parser代码缩进还是比较舒服的。

另:

  • 使用--report=all选项编译可以输出一个LALR的状态信息。
  • 设置yydebug=1/--debug可以看到parse时移入归约操作的输出,比较直观。

reference