Problematic tokens from tokenizer? #6

Closed
jakubzitny opened this issue Jun 7, 2016 · 4 comments

Comments

@jakubzitny

Some tokens from some of the tokenized files seem problematic for SourcererCC.

Here is an example of the stderr output when indexing fails. The offending content is, for example, unusual whitespace or characters like ||:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at noindex.CloneHelper.deserialise(Unknown Source)
    at indexbased.SearchManager.doIndex(Unknown Source)
    at indexbased.SearchManager.main(Unknown Source)

While indexing, there are a lot of EXCEPTION CAUGHT messages coming from ArrayIndexOutOfBoundsExceptions caught in CloneHelper.java. I'm not sure whether that is a problem.

Also, while searching I am getting a lot of ERROR: more that one doc found. some error here. messages.

Maybe these problems come from small issues in the tokenization process. Do you have any idea what it might be? For now, I will update the handling of unusual whitespace and see whether it helps.
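
One way to narrow this down might be to pre-check the tokenized files before indexing. The sketch below is only an illustration: it assumes each line has the form `header@#@token@@::@@count,token@@::@@count,...`, and the three separator constants and both function names are hypothetical, not taken from SourcererCC's code.

```python
# Assumed separators; adjust them if your tokenizer writes a different format.
HEADER_SEP = "@#@"    # between the id part and the token list
ENTRY_SEP = ","       # between token entries
COUNT_SEP = "@@::@@"  # between a token and its frequency


def line_problem(line):
    """Return a short description of what looks wrong with a line, or None."""
    parts = line.rstrip("\n").split(HEADER_SEP)
    if len(parts) != 2:
        return "expected exactly one header separator"
    header, body = parts
    if not header.strip():
        return "empty header"
    for entry in body.split(ENTRY_SEP):
        pieces = entry.split(COUNT_SEP)
        # A well-formed entry is a non-empty token plus a numeric count.
        if len(pieces) != 2 or not pieces[0] or not pieces[1].isdigit():
            return "bad token entry: %r" % entry
    return None


def find_bad_lines(lines):
    """Return (line_number, problem) pairs for every suspicious line."""
    bad = []
    for number, line in enumerate(lines, 1):
        problem = line_problem(line)
        if problem is not None:
            bad.append((number, problem))
    return bad
```

Running `find_bad_lines(open(path, encoding="utf-8", errors="replace"))` over a tokenized file would list the lines worth inspecting by hand.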

@saini
Contributor

saini commented Jun 9, 2016

ERROR: more than one doc found is a problem. It must be happening when SourcererCC searches the forward index for the tokens of a document, using the document id as the query. Ideally we should never get more than one doc, since the document id of each document should be unique. My guess is that we either assigned the same id to more than one document in the parsing stage, or indexed one document twice.
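
A quick way to test the duplicate-id guess is to count ids in the tokenized input before indexing. This is a minimal sketch, assuming each line starts with a comma-separated header followed by `@#@`, and that the last field of that header is the document id. Both the layout and the function name `duplicate_ids` are assumptions for illustration.

```python
from collections import Counter


def duplicate_ids(lines, header_sep="@#@"):
    """Map each document id that occurs more than once to its count."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        header = line.split(header_sep, 1)[0]
        # Assumption: the document id is the last comma-separated header field.
        doc_id = header.split(",")[-1].strip()
        counts[doc_id] += 1
    return {doc_id: n for doc_id, n in counts.items() if n > 1}
```

An empty result suggests the ids are unique in that file, which would point toward the same document being indexed twice instead.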

As for the ArrayIndexOutOfBoundsException, let's keep track of these characters. We should remove them during the tokenizing stage.
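
As a rough sketch of what that could look like, the helper below drops whitespace and Unicode control or format characters from a token and tallies what it removed, so the offending characters can be reviewed later. The name `clean_token` and the choice of which characters count as problematic are assumptions, not the project's actual tokenizer logic.

```python
import unicodedata
from collections import Counter


def clean_token(token, removed_chars):
    """Return the token without whitespace or control/format characters.

    Every dropped character is counted in removed_chars (a Counter).
    """
    kept = []
    for ch in token:
        # Unicode categories starting with "C" cover control, format,
        # surrogate, private-use, and unassigned code points.
        if ch.isspace() or unicodedata.category(ch).startswith("C"):
            removed_chars[ch] += 1
        else:
            kept.append(ch)
    return "".join(kept)
```

Ordinary punctuation such as `||` passes through unchanged here. If such tokens also turn out to break indexing, the filter would need to be extended.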

@Yanming-Yang

I have run into a similar problem. My error message is:

EXCEPTION CAUGHT, invalid line: Bud% @� @ @ @ E%DSDB @ @ @
index size of GTPM: 66012
Directory: dataset
indexing file : .DS_Store
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at noindex.CloneHelper.deserialise(Unknown Source)
    at indexbased.SearchManager.doIndex(Unknown Source)
    at indexbased.SearchManager.main(Unknown Source)

I followed every step in the README. What does this error mean? Is my data wrong?

@saini
Contributor

saini commented Oct 17, 2017

Please delete the .DS_Store file from the dataset directory.
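
The log line `indexing file : .DS_Store` suggests that every file in the directory gets picked up. If you prepare the dataset directory with a script, a small filter like the one below can keep hidden files out. This is only a sketch: `is_input_file` and `list_input_files` are hypothetical helper names, and treating every dot-file as non-input is an assumption.

```python
import os


def is_input_file(name):
    """Treat hidden entries such as .DS_Store as non-input."""
    return not name.startswith(".")


def list_input_files(directory):
    """Return sorted paths of the regular, non-hidden files in directory."""
    paths = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if is_input_file(name) and os.path.isfile(path):
            paths.append(path)
    return sorted(paths)
```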

@pedromartins4
Contributor

I'm hoping this solved the problem. If not, please open another issue.
