Sequence is not sorted #21

Closed
abdullah-saal opened this issue Dec 7, 2021 · 9 comments

Comments

@abdullah-saal

I have the following files:

ls trie_data/
1-grams.sorted.gz  2-grams.sorted.gz  3-grams.sorted.gz

I am running this command:

./build_trie  ef_trie 3 count --dir ./trie_data/ --out ef_trie.count.bin

But I get this error:

error at position 23/186616
360087 < 400844
terminate called after throwing an instance of 'std::runtime_error'
  what():  sequence is not sorted

I did run the sort_grams command on the n-gram files, but I am still getting the error.

@abdullah-saal
Author

Full Log

2021-12-07 11:03:45: Reading 1-grams counts
2021-12-07 11:03:45: Reading 2-grams counts
2021-12-07 11:03:45: Reading 3-grams counts
2021-12-07 11:03:46: Building vocabulary
2021-12-07 11:03:46: Hypergraph generation: trial 0
2021-12-07 11:03:46: Using 17 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 76.6394% nodes remaining
2021-12-07 11:03:46: Round 1, 66.0535% nodes remaining
2021-12-07 11:03:46: Round 2, 60.309% nodes remaining
2021-12-07 11:03:46: Round 3, 56.5286% nodes remaining
2021-12-07 11:03:46: Round 4, 53.8259% nodes remaining
2021-12-07 11:03:46: Round 5, 51.7457% nodes remaining
2021-12-07 11:03:46: Round 6, 50.1449% nodes remaining
2021-12-07 11:03:46: Round 7, 48.7746% nodes remaining
2021-12-07 11:03:46: Round 8, 47.6151% nodes remaining
2021-12-07 11:03:46: Round 9, 46.6423% nodes remaining
2021-12-07 11:03:46: Round 10, 45.8192% nodes remaining
2021-12-07 11:03:46: Round 11, 45.1056% nodes remaining
2021-12-07 11:03:46: Round 12, 44.4748% nodes remaining
2021-12-07 11:03:46: Round 13, 43.9136% nodes remaining
2021-12-07 11:03:46: Round 14, 43.4082% nodes remaining
2021-12-07 11:03:46: Round 15, 42.9121% nodes remaining
2021-12-07 11:03:46: Round 16, 42.4299% nodes remaining
2021-12-07 11:03:46: Round 17, 41.9951% nodes remaining
2021-12-07 11:03:46: Round 18, 41.6049% nodes remaining
2021-12-07 11:03:46: Round 19, 41.2314% nodes remaining
2021-12-07 11:03:46: Round 20, 40.8858% nodes remaining
2021-12-07 11:03:46: Round 21, 40.5578% nodes remaining
2021-12-07 11:03:46: Round 22, 40.2466% nodes remaining
2021-12-07 11:03:46: Round 23, 39.9325% nodes remaining
2021-12-07 11:03:46: Round 24, 39.6297% nodes remaining
2021-12-07 11:03:46: Round 25, 39.3389% nodes remaining
2021-12-07 11:03:46: Round 26, 39.0379% nodes remaining
2021-12-07 11:03:46: Round 27, 38.7396% nodes remaining
2021-12-07 11:03:46: Round 28, 38.447% nodes remaining
2021-12-07 11:03:46: Round 29, 38.1552% nodes remaining
2021-12-07 11:03:46: Round 30, 37.8579% nodes remaining
2021-12-07 11:03:46: Round 31, 37.5699% nodes remaining
2021-12-07 11:03:46: Round 32, 37.2652% nodes remaining
2021-12-07 11:03:46: Round 33, 36.9465% nodes remaining
2021-12-07 11:03:46: Round 34, 36.6269% nodes remaining
2021-12-07 11:03:46: Round 35, 36.2692% nodes remaining
2021-12-07 11:03:46: Round 36, 35.8901% nodes remaining
2021-12-07 11:03:46: Round 37, 35.4916% nodes remaining
2021-12-07 11:03:46: Round 38, 35.1014% nodes remaining
2021-12-07 11:03:46: Round 39, 34.6758% nodes remaining
2021-12-07 11:03:46: Round 40, 34.2169% nodes remaining
2021-12-07 11:03:46: Round 41, 33.7143% nodes remaining
2021-12-07 11:03:46: Round 42, 33.1847% nodes remaining
2021-12-07 11:03:46: Round 43, 32.5929% nodes remaining
2021-12-07 11:03:46: Round 44, 31.9314% nodes remaining
2021-12-07 11:03:46: Round 45, 31.2039% nodes remaining
2021-12-07 11:03:46: Round 46, 30.3807% nodes remaining
2021-12-07 11:03:46: Round 47, 29.4368% nodes remaining
2021-12-07 11:03:46: Round 48, 28.4278% nodes remaining
2021-12-07 11:03:46: Round 49, 27.2813% nodes remaining
2021-12-07 11:03:46: Round 50, 25.924% nodes remaining
2021-12-07 11:03:46: Round 51, 24.3325% nodes remaining
2021-12-07 11:03:46: Round 52, 22.4594% nodes remaining
2021-12-07 11:03:46: Round 53, 20.2715% nodes remaining
2021-12-07 11:03:46: Round 54, 17.6924% nodes remaining
2021-12-07 11:03:46: Round 55, 14.6664% nodes remaining
2021-12-07 11:03:46: Round 56, 11.1303% nodes remaining
2021-12-07 11:03:46: Round 57, 7.34526% nodes remaining
2021-12-07 11:03:46: Round 58, 3.72094% nodes remaining
2021-12-07 11:03:46: Round 59, 1.13161% nodes remaining
2021-12-07 11:03:46: Round 60, 0.142148% nodes remaining
2021-12-07 11:03:46: Round 61, 0.000929074% nodes remaining
2021-12-07 11:03:46: Assigning values
2021-12-07 11:03:46: Building 2-grams
2021-12-07 11:03:46: Writing 2-grams
2021-12-07 11:03:46: Hypergraph generation: trial 0
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 79.7639% nodes remaining
2021-12-07 11:03:46: Round 1, 70.1518% nodes remaining
2021-12-07 11:03:46: Round 2, 64.9241% nodes remaining
2021-12-07 11:03:46: Round 3, 62.7319% nodes remaining
2021-12-07 11:03:46: Round 4, 61.2142% nodes remaining
2021-12-07 11:03:46: Round 5, 60.371% nodes remaining
2021-12-07 11:03:46: Round 6, 59.5278% nodes remaining
2021-12-07 11:03:46: Round 7, 59.0219% nodes remaining
2021-12-07 11:03:46: Round 8, 58.516% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 1
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.1667% nodes remaining
2021-12-07 11:03:46: Round 1, 65% nodes remaining
2021-12-07 11:03:46: Round 2, 59.1667% nodes remaining
2021-12-07 11:03:46: Round 3, 55.5% nodes remaining
2021-12-07 11:03:46: Round 4, 52.3333% nodes remaining
2021-12-07 11:03:46: Round 5, 49.3333% nodes remaining
2021-12-07 11:03:46: Round 6, 47.1667% nodes remaining
2021-12-07 11:03:46: Round 7, 45.1667% nodes remaining
2021-12-07 11:03:46: Round 8, 44.3333% nodes remaining
2021-12-07 11:03:46: Round 9, 43.3333% nodes remaining
2021-12-07 11:03:46: Round 10, 42.3333% nodes remaining
2021-12-07 11:03:46: Round 11, 41.3333% nodes remaining
2021-12-07 11:03:46: Round 12, 40.3333% nodes remaining
2021-12-07 11:03:46: Round 13, 39.5% nodes remaining
2021-12-07 11:03:46: Round 14, 38.5% nodes remaining
2021-12-07 11:03:46: Round 15, 38% nodes remaining
2021-12-07 11:03:46: Round 16, 37.6667% nodes remaining
2021-12-07 11:03:46: Round 17, 37.3333% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 2
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.1252% nodes remaining
2021-12-07 11:03:46: Round 1, 63.773% nodes remaining
2021-12-07 11:03:46: Round 2, 57.2621% nodes remaining
2021-12-07 11:03:46: Round 3, 53.5893% nodes remaining
2021-12-07 11:03:46: Round 4, 51.586% nodes remaining
2021-12-07 11:03:46: Round 5, 48.581% nodes remaining
2021-12-07 11:03:46: Round 6, 47.0785% nodes remaining
2021-12-07 11:03:46: Round 7, 45.9098% nodes remaining
2021-12-07 11:03:46: Round 8, 44.2404% nodes remaining
2021-12-07 11:03:46: Round 9, 42.7379% nodes remaining
2021-12-07 11:03:46: Round 10, 41.7362% nodes remaining
2021-12-07 11:03:46: Round 11, 40.4007% nodes remaining
2021-12-07 11:03:46: Round 12, 38.5643% nodes remaining
2021-12-07 11:03:46: Round 13, 36.7279% nodes remaining
2021-12-07 11:03:46: Round 14, 35.7262% nodes remaining
2021-12-07 11:03:46: Round 15, 34.8915% nodes remaining
2021-12-07 11:03:46: Round 16, 34.0568% nodes remaining
2021-12-07 11:03:46: Round 17, 33.389% nodes remaining
2021-12-07 11:03:46: Round 18, 32.8881% nodes remaining
2021-12-07 11:03:46: Round 19, 32.7212% nodes remaining
2021-12-07 11:03:46: Round 20, 32.5543% nodes remaining
2021-12-07 11:03:46: Round 21, 32.2204% nodes remaining
2021-12-07 11:03:46: Round 22, 32.0534% nodes remaining
2021-12-07 11:03:46: Round 23, 31.8865% nodes remaining
2021-12-07 11:03:46: Round 24, 31.7195% nodes remaining
2021-12-07 11:03:46: Round 25, 31.3856% nodes remaining
2021-12-07 11:03:46: Round 26, 31.2187% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 3
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 75.2941% nodes remaining
2021-12-07 11:03:46: Round 1, 63.1933% nodes remaining
2021-12-07 11:03:46: Round 2, 58.1513% nodes remaining
2021-12-07 11:03:46: Round 3, 54.6218% nodes remaining
2021-12-07 11:03:46: Round 4, 51.5966% nodes remaining
2021-12-07 11:03:46: Round 5, 49.4118% nodes remaining
2021-12-07 11:03:46: Round 6, 48.2353% nodes remaining
2021-12-07 11:03:46: Round 7, 47.7311% nodes remaining
2021-12-07 11:03:46: Round 8, 47.2269% nodes remaining
2021-12-07 11:03:46: Round 9, 46.7227% nodes remaining
2021-12-07 11:03:46: Round 10, 46.2185% nodes remaining
2021-12-07 11:03:46: Round 11, 45.8824% nodes remaining
2021-12-07 11:03:46: Round 12, 45.5462% nodes remaining
2021-12-07 11:03:46: Round 13, 45.042% nodes remaining
2021-12-07 11:03:46: Round 14, 44.8739% nodes remaining
2021-12-07 11:03:46: Round 15, 44.7059% nodes remaining
2021-12-07 11:03:46: Round 16, 44.5378% nodes remaining
2021-12-07 11:03:46: Round 17, 44.3697% nodes remaining
2021-12-07 11:03:46: Hypergraph is not peelable
2021-12-07 11:03:46: Hypergraph generation: trial 4
2021-12-07 11:03:46: Using 10 bits per node
2021-12-07 11:03:46: Generating hyperedges
2021-12-07 11:03:46: Sorting 0-orientation edges
2021-12-07 11:03:46: Populating 0-orientation lists
2021-12-07 11:03:46: Sorting 1-orientation edges
2021-12-07 11:03:46: Populating 1-orientation lists
2021-12-07 11:03:46: Sorting 2-orientation edges
2021-12-07 11:03:46: Populating 2-orientation lists
2021-12-07 11:03:46: Round 0, 77.1382% nodes remaining
2021-12-07 11:03:46: Round 1, 65.9539% nodes remaining
2021-12-07 11:03:46: Round 2, 59.2105% nodes remaining
2021-12-07 11:03:46: Round 3, 55.2632% nodes remaining
2021-12-07 11:03:46: Round 4, 52.4671% nodes remaining
2021-12-07 11:03:46: Round 5, 50.1645% nodes remaining
2021-12-07 11:03:46: Round 6, 48.5197% nodes remaining
2021-12-07 11:03:46: Round 7, 47.0395% nodes remaining
2021-12-07 11:03:46: Round 8, 45.7237% nodes remaining
2021-12-07 11:03:46: Round 9, 44.2434% nodes remaining
2021-12-07 11:03:46: Round 10, 43.4211% nodes remaining
2021-12-07 11:03:46: Round 11, 42.4342% nodes remaining
2021-12-07 11:03:46: Round 12, 41.1184% nodes remaining
2021-12-07 11:03:46: Round 13, 39.9671% nodes remaining
2021-12-07 11:03:46: Round 14, 38.6513% nodes remaining
2021-12-07 11:03:46: Round 15, 37.0066% nodes remaining
2021-12-07 11:03:46: Round 16, 35.3618% nodes remaining
2021-12-07 11:03:46: Round 17, 32.8947% nodes remaining
2021-12-07 11:03:46: Round 18, 30.0987% nodes remaining
2021-12-07 11:03:46: Round 19, 25.8224% nodes remaining
2021-12-07 11:03:46: Round 20, 21.2171% nodes remaining
2021-12-07 11:03:46: Round 21, 16.1184% nodes remaining
2021-12-07 11:03:46: Round 22, 10.5263% nodes remaining
2021-12-07 11:03:46: Round 23, 4.76974% nodes remaining
2021-12-07 11:03:46: Round 24, 0.493421% nodes remaining
2021-12-07 11:03:46: Round 25, 0% nodes remaining
2021-12-07 11:03:46: Assigning values
error at position 3/641633
27133 < 131071
terminate called after throwing an instance of 'std::runtime_error'
  what():  sequence is not sorted

@jermp
Owner

jermp commented Dec 7, 2021

Hi,
this is strange. It looks like the files are malformed.
Could you share your input files so that I can take a closer look at the problem?
Thanks!

@abdullah-saal
Author

@jermp
Owner

jermp commented Dec 7, 2021

I took a quick look at your files and they are, as I suspected, malformed:
each line has the count followed by the ngram string, but it should be the opposite,
as explained here: https://github.com/jermp/tongrams#input-data-format.
You should format them as follows:

# of rows
<ngram> TAB <count>
...

And be sure they are sorted with the sort_grams utility as you did before.
Let me know if it works after reformatting.
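
For anyone who wants to check a file against the layout described above before running build_trie, here is a minimal Python sketch. It is not part of tongrams: the function name `check_ngram_lines` is made up for illustration, and it assumes the tokens of an n-gram are separated by whitespace and that counts are plain ASCII integers. It does not check the sort order.

```python
def _is_count(text):
    # Assumption: counts are written with ASCII digits only.
    return text.isascii() and text.isdigit()


def check_ngram_lines(lines, order):
    """Return a list of (line_number, message) problems for one n-gram counts file.

    Expected layout: a header line with the number of rows, then one
    '<ngram> TAB <count>' row per n-gram.
    """
    rows = [line.rstrip("\n") for line in lines]
    if not rows:
        return [(1, "file is empty")]

    problems = []
    header, body = rows[0].strip(), rows[1:]
    if not _is_count(header):
        problems.append((1, "header must be the number of rows"))
    elif int(header) != len(body):
        problems.append((1, f"header says {int(header)} rows, found {len(body)}"))

    for number, row in enumerate(body, start=2):
        fields = row.split("\t")
        if len(fields) != 2:
            problems.append((number, "expected <ngram> TAB <count>"))
            continue
        gram, count = fields
        if len(gram.split()) != order:
            problems.append((number, f"expected {order} tokens in the n-gram"))
        if not _is_count(count):
            problems.append((number, "count is not a non-negative integer"))
    return problems
```

An empty result means the file matches this layout. A row with the count and the n-gram swapped would be reported twice, once for the token count and once for the non-numeric count.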

@abdullah-saal
Copy link
Author

Actually, it's just the rendering. This is a common problem with RTL languages.
If you parse the file, you can confirm that the first field is indeed the ngram:

tail 2-grams.sorted | cut -f 1
ييفارت توفماسيان
ييفانغ ونادي
ييكسيان من
ييلان في
ييمسومارواي مواليد
يينا ونادي
يينال في
يينال من
يينان دياوو
يييانغ في
tail 2-grams.sorted | cut -f 2
1
1
1
1
1
1
1
1
3
1
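
The same check can be done without relying on how the terminal lays out right-to-left text. The Python sketch below is only an illustration (the helper name `describe_row` is made up): it splits a row on the first TAB and prints both fields with non-ASCII characters escaped, so the logical field order is visible regardless of rendering. It assumes the row contains at least one TAB and raises `ValueError` otherwise.

```python
def describe_row(row):
    """Split one '<ngram> TAB <count>' row and label the fields in logical order."""
    # split("\t", 1) cuts at the first TAB only, like `cut -f 1` / `cut -f 2`
    # for a two-field row.
    first, second = row.rstrip("\n").split("\t", 1)
    # ascii() escapes non-ASCII characters, so bidirectional rendering
    # cannot visually reorder the two fields.
    return f"field 1 = {ascii(first)} | field 2 = {ascii(second)}"


if __name__ == "__main__":
    import sys

    # Usage: python describe_rows.py < 2-grams.sorted
    for line in sys.stdin:
        if "\t" in line:
            print(describe_row(line))
```

If field 1 shows a run of `\u06xx` escapes and field 2 shows digits, the file stores the n-gram first and the count second, whatever the display looks like.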

@jermp
Copy link
Owner

jermp commented Dec 7, 2021

Ah ok, I did not know this. Sorry.
I will investigate further then.

@jermp
Owner

jermp commented Dec 7, 2021

It fails on the third bigram because that bigram is آباء متعددي, but متعددي does not appear among the unigrams (vocabulary).
The trie topology must be complete, that is: if a bigram X Y appears, then the unigrams must contain both X and Y.
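
The completeness condition above can be checked before building. The following Python sketch is an illustration only, not a tongrams utility: the name `missing_unigrams` is hypothetical, and it assumes each file starts with a row-count header, uses `<ngram> TAB <count>` rows, and separates tokens with whitespace. It only checks tokens against the unigrams, as stated in this comment.

```python
def missing_unigrams(unigram_lines, ngram_lines):
    """Return (ngram, token) pairs for tokens that are absent from the unigrams."""

    def grams(lines):
        rows = iter(lines)
        next(rows, None)  # skip the header holding the number of rows
        for row in rows:
            row = row.rstrip("\n")
            if row:
                # The n-gram is the first TAB-separated field.
                yield row.split("\t")[0]

    vocabulary = set(grams(unigram_lines))
    return [
        (gram, token)
        for gram in grams(ngram_lines)
        for token in gram.split()
        if token not in vocabulary
    ]
```

Both arguments are iterables of text lines, so the compressed files could be supplied with something like `gzip.open("trie_data/1-grams.sorted.gz", "rt", encoding="utf-8")`. An empty list means every token of every n-gram was found among the unigrams.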

@jermp
Owner

jermp commented Dec 7, 2021

In fact, if I remove that bigram and keep only these bigrams:

6
آباء الأطفال	1
آباء الكنيسة	4
آباء المجلس	3
آباء المجمع	1
آباء بالتبني	1
آباء سلالات	1

i.e., all prefixed by آباء, then it correctly builds a 2-gram model.

@abdullah-saal
Author

Thanks, I just noticed the issue.
With similar problems I used to get an error that looks like this:

2-grams file is incomplete:
        'شتلاند' should have been found among 1-grams

That's why I didn't notice this time.
