Problematic tokens from tokenizer? #6

jakubzitny · 2016-06-07T10:04:57Z

Some tokens from some of the tokenized files seem problematic for SourcererCC.

Here is an example of the stderr output when indexing fails. The offending content is, for example, unusual whitespace or characters such as ||:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at noindex.CloneHelper.deserialise(Unknown Source)
    at indexbased.SearchManager.doIndex(Unknown Source)
    at indexbased.SearchManager.main(Unknown Source)

While indexing, there are a lot of EXCEPTION CAUGHT messages coming from ArrayIndexOutOfBoundsExceptions caught in CloneHelper.java. I'm not sure whether that is a problem.
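
For context, `ArrayIndexOutOfBoundsException: 1` is what Java throws when code reads element 1 of an array that holds only one element, which is the usual result of calling `String.split` on a line that lacks the expected separator. The sketch below is not the actual `CloneHelper.deserialise` code: the `LineParser` class and the `@#@` separator are assumptions made for illustration. It only shows how a parser could check the split result and skip a malformed line instead of failing.

```java
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of SourcererCC.
final class LineParser {
    // Assumed separator between the header and the token list of a line.
    static final String SEPARATOR = "@#@";

    /** Returns {header, tokens}, or empty if the line is malformed. */
    static Optional<String[]> parse(String line) {
        if (line == null) {
            return Optional.empty();
        }
        // Limit 2: split on the first separator only.
        String[] parts = line.split(Pattern.quote(SEPARATOR), 2);
        // A line without the separator yields a one-element array;
        // reading parts[1] from it would throw the exception above.
        if (parts.length < 2 || parts[0].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(parts);
    }

    public static void main(String[] args) {
        System.out.println(parse("1,2@#@foo").isPresent());
        System.out.println(parse("garbage line").isPresent());
    }
}
```

A caller could log and count the lines for which `parse` returns empty, which would make it easier to see how many inputs are affected.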

Also, while searching I am getting a lot of ERROR: more that one doc found. some error here. messages.

Maybe these problems come from small issues in the tokenization process. Do you have any idea what it might be? For now, I will update the handling of unusual whitespace and see whether it helps.

saini · 2016-06-09T22:30:09Z

ERROR: more than one doc found is a problem. It must be happening when SourcererCC searches for the tokens of a document (using the document id as the query) in the forward index. Ideally we should never get more than one doc, as the document id should be unique for each document. My guess is that we assigned the same id to more than one document in the parsing stage, or maybe we indexed one document twice.
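
One way to test the duplicate-id guess before indexing is to collect the ids from the tokenizer output and look for repeats. The following is a minimal sketch, not SourcererCC code: `DuplicateIdCheck` is a hypothetical helper, and how the ids are extracted from the tokenized files is left to the caller.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of SourcererCC.
final class DuplicateIdCheck {
    /** Returns every id that occurs more than once, in first-repeat order. */
    static Set<String> findDuplicates(List<String> ids) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String id : ids) {
            // add() returns false when the id was already present.
            if (!seen.add(id)) {
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("10", "11", "10", "12");
        System.out.println(findDuplicates(ids));
    }
}
```

An empty result would point away from the parsing stage and toward the other guess, that a document was indexed twice.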

About the ArrayIndexOutOfBoundsException: let's keep track of these characters. We should remove them during the tokenizing stage.
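
A rough sketch of what such a cleanup step might look like follows. Everything in it is an assumption: `TokenSanitizer` is a hypothetical class, and the reserved separator strings (`@@::@@`, `@#@`, and `,`) are guesses at what the tokenized format uses and should be checked against the real tokenizer. The idea is only that a token must not contain whitespace, control characters, or anything the parser later splits on.

```java
// Hypothetical helper, not part of SourcererCC.
final class TokenSanitizer {
    // Assumed separator strings of the tokenized file format.
    private static final String[] RESERVED = {"@@::@@", "@#@", ","};

    /** Strips whitespace, control characters and reserved separators. */
    static String sanitize(String token) {
        String cleaned = token;
        String previous;
        // Repeat until stable: removing one piece can join its
        // neighbours into a new separator, e.g. "@@#@#@" -> "@#@".
        do {
            previous = cleaned;
            // \s: ASCII whitespace, \p{Z}: Unicode separators such as
            // the no-break space, \p{Cntrl}: ASCII control characters.
            cleaned = cleaned.replaceAll("[\\s\\p{Z}\\p{Cntrl}]", "");
            for (String reserved : RESERVED) {
                cleaned = cleaned.replace(reserved, "");
            }
        } while (!cleaned.equals(previous));
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("foo\u00A0bar"));
        System.out.println(sanitize("a@#@b"));
    }
}
```

A token can come out empty after cleaning, so the caller would need to drop empty results rather than write them to the tokenized file.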

Yanming-Yang · 2017-10-17T04:10:09Z

I ran into a similar problem. My error message is:
EXCEPTION CAUGHT, invalid line: Bud% @� @ @ @ E%DSDB @ @ @
index size of GTPM: 66012
Directory: dataset
indexing file : .DS_Store
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at noindex.CloneHelper.deserialise(Unknown Source)
    at indexbased.SearchManager.doIndex(Unknown Source)
    at indexbased.SearchManager.main(Unknown Source)
I followed every step in the readme. What does this error mean? Is my data wrong?

saini · 2017-10-17T06:20:33Z

Please delete the .DS_Store file from the dataset directory.
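
.DS_Store is a metadata file that macOS Finder creates in directories, and the log line `indexing file : .DS_Store` suggests it was read as if it were a tokenized file. Besides deleting it, input listing could skip hidden files. The sketch below illustrates that idea only; `DatasetFiles` is a hypothetical class, not how SourcererCC lists its input.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of SourcererCC.
final class DatasetFiles {
    /** Lists regular files in dir, skipping dot-files such as .DS_Store. */
    static List<Path> listInputFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> !p.getFileName().toString().startsWith("."))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // "dataset" matches the directory name shown in the log above.
        Path dir = Paths.get(args.length > 0 ? args[0] : "dataset");
        for (Path file : listInputFiles(dir)) {
            System.out.println("indexing file : " + file.getFileName());
        }
    }
}
```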

pedromartins4 · 2017-11-28T20:27:28Z

I'm hoping this solved the problem. If not, please open another issue.

pedromartins4 closed this as completed Nov 28, 2017
