Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LayoutReader.read handling levels incorrectly #86

Open
solvingproblemswithtechnology opened this issue May 23, 2024 · 0 comments
Open

Comments

@solvingproblemswithtechnology
Copy link

solvingproblemswithtechnology commented May 23, 2024

I want to use nlm-ingestor from C#, so I was layout_reader.py to create the code. And I found a really bad bug regarding level handling.

It will hard to explain for me, but I'll do my best and provide the test data I'm using.

Given the following structure:

heading 1
    heading 2
    heading 2
        para 3
    heading 2
para 1

When the reader goes to read the last para, in the if block['tag'] == 'para': it doesn't take into account the level change to change its parent.

Since my knowledge of python is limited, I'll paste some visual debugging information to make it easier to understand:

In this image, you can see a Level 0 header with another Level 0 header as children:
imagen_2024-05-23_180401090

And in this image, you can see a Level 0 paragraph inside a Level 2 header:
imagen_2024-05-23_180531229

This is the translated code with the bug:

public Block Read(List<Block> blocksJson)
{
	var root = new Block();
	Block parent = root;
	var parentStack = new Stack<Block>();
	parentStack.Push(root);
	var prevNode = root;
	var listStack = new Stack<Block>();

	foreach (var block in blocksJson)
	{
		if (block.Tag.ToString() != "list_item" && listStack.Count > 0)
			listStack.Clear();

		Block node = null;

		switch (block.Tag)
		{
			case "para":
				node = new Paragraph(block);
				parent.AddChild(node);
				break;
			case "table":
				node = new Table(block, prevNode);
				parent.AddChild(node);
				break;
			case "list_item":
				node = new ListItem(block);

				if (prevNode.Tag == "para" && prevNode.Level == node.Level)
					listStack.Push(prevNode);
				else if (prevNode.Tag == "list_item")
				{
					if (node.Level > prevNode.Level)
						listStack.Push(prevNode);
					else
					{
						while (listStack.Count > 0 && listStack.Peek().Level > node.Level)
							listStack.Pop();
					}
				}

				if (listStack.Count > 0)
					listStack.Peek().AddChild(node);
				else
					parent.AddChild(node);

				break;
			case "header":
				node = new Section(block);

				if (node.Level > parent.Level)
				{
					parentStack.Push(node);
					parent.AddChild(node);
				}
				else
				{
					while (parentStack.Count > 1 && parentStack.Peek().Level > node.Level)
						parentStack.Pop();

					parentStack.Peek().AddChild(node);
					parentStack.Push(node);
				}

				parent = node;
				break;
		}

		prevNode = node;
	}

	return root;
}

Here's the fixed method, handling the level change within the rest of the cases. I know it's not pretty, I didn't refactor it yet:

public Block Read(List<Block> blocksJson)
{
	var root = new Block();
	Block parent = root;
	var parentStack = new Stack<Block>();
	parentStack.Push(root);
	var prevNode = root;
	var listStack = new Stack<Block>();

	foreach (var block in blocksJson)
	{
		if (block.Tag != "list_item" && listStack.Count > 0)
			listStack.Clear();

		Block node = null;

		switch (block.Tag)
		{
			case "header":
				node = new Section(block);

				if (node.Level > parent.Level)
				{
					parentStack.Push(node);
					parent.AddChild(node);
				}
				else
				{
					while (parentStack.Count > 1 && parent.Level >= node.Level)
					{
						parentStack.Pop();
						parent = parentStack.Peek();
					}

					parent.AddChild(node);
					parentStack.Push(node);
				}

				parent = node;
				break;
			case "para":
				node = new Paragraph(block);

			        if (node.Level < parent.Level)
			        {
				        while (parentStack.Count > 1 && parentStack.Peek().Level > node.Level)
					        parentStack.Pop();
        
				        parent = parentStack.Peek();
			        }

				parent.AddChild(node);
				break;
			case "table":
				node = new Table(block, prevNode);

			        if (node.Level < parent.Level)
			        {
				        while (parentStack.Count > 1 && parentStack.Peek().Level > node.Level)
					        parentStack.Pop();
        
				        parent = parentStack.Peek();
			        }

				parent.AddChild(node);
				break;
			case "list_item":
				node = new ListItem(block);

				if (prevNode.Tag == "para" && prevNode.Level == node.Level)
					listStack.Push(prevNode);
				else if (prevNode.Tag == "list_item")
				{
					if (node.Level > prevNode.Level)
						listStack.Push(prevNode);
					else
					{
						while (listStack.Count > 0 && listStack.Peek().Level > node.Level)
							listStack.Pop();
					}
				}

				if (listStack.Count > 0)
					listStack.Peek().AddChild(node);
				else
				{
					if (node.Level < parent.Level)
					{
						while (parentStack.Count > 1 && parentStack.Peek().Level > node.Level)
							parentStack.Pop();

						parent = parentStack.Peek();
					}

					parent.AddChild(node);
				}

				break;
		}

		prevNode = node;
	}

	return root;
}

Here's the first case fixed. You can notice that there's no Level 0 header and that there's more Level 1 items in there:
imagen_2024-05-23_180850023

The second case, now there's no childrens as it should:
imagen_2024-05-23_180931495

Here's the used json for testing:
https://gist.github.com/solvingproblemswithtechnology/a1d79d1892284375855a7205d5eb4ea5

P.S.: Thanks for your understanding, currently I'm not knowledgable enough on python to fix this issue with confidence, but I tried to add all the context.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant