Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use header_names instead of column_N if available #72

Merged
merged 10 commits into from
Aug 22, 2020

Conversation

kylecbrodie
Copy link
Contributor

Closes #53

This allows header_names to be specified and used as the field names in the Avro schema for files that lack a header row

@kylecbrodie
Copy link
Contributor Author

It looks like Travis had a transient failure. The only tracebacks in the output are,

Exception in thread "unVocity-parsers input reading thread" java.lang.IllegalStateException: Error closing input
	at com.univocity.parsers.common.input.concurrent.ConcurrentCharLoader.stopReading(ConcurrentCharLoader.java:181)
	at com.univocity.parsers.common.input.concurrent.ConcurrentCharLoader.run(ConcurrentCharLoader.java:101)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:475)
	at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:656)
	at java.io.FilterInputStream.close(FilterInputStream.java:181)
	at sun.nio.cs.StreamDecoder.implClose(StreamDecoder.java:378)
	at sun.nio.cs.StreamDecoder.close(StreamDecoder.java:193)
	at java.io.InputStreamReader.close(InputStreamReader.java:199)
	at com.univocity.parsers.common.input.concurrent.ConcurrentCharLoader.stopReading(ConcurrentCharLoader.java:178)
	... 2 more

Which I don't think my code change would cause. If you know why it failed that would be very helpful! Maybe re-running Travis will succeed

@coveralls
Copy link

coveralls commented Jul 19, 2020

Coverage Status

Coverage increased (+0.01%) to 96.261% when pulling 8af913a on reveel-it:allow-no-headers into 8e425e7 on mmolimar:develop.

Copy link
Owner

@mmolimar mmolimar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the interest @kylecbrodie!
Couldn't be done this using file_reader.delimited.settings.header and file_reader.delimited.settings.header_names properties?

@kylecbrodie
Copy link
Contributor Author

@mmolimar Setting file_reader.delimited.settings.header to true turns on header extraction so if the file doesn't have a header row it consumes the first row of data as a header row. Custom header names in file_reader.delimited.settings.header_names replace the extracted header row names. Setting file_reader.delimited.settings.header to false turns off header extraction and, before this PR, file_reader.delimited.settings.header_names is unused and column_1, column_2, etc are the header names.

This PR changes the value of the hasHeader variable in the buildSchema method

@mmolimar
Copy link
Owner

Ok, thanks.
I think it'd be better to do it inside the buildSchema method adding an if. Something like:

...
} else if (settings.getHeaders() != null && settings.getHeaders().length > 0) {
  List<Schema> dataTypes = getDataTypes(config, settings.getHeaders());
  IntStream.range(0, settings.getHeaders().length)
    .forEach(index -> builder.field(settings.getHeaders()[index], dataTypes.get(index)));
}
...

Could you also add a test to validate this change?

On the other hand, this change would be included in the next release and the new features/fixes will be in the develop branch. Could you point the PR to the develop branch pls?

@kylecbrodie kylecbrodie changed the base branch from master to develop July 28, 2020 22:54
to match the formatting of the previous version of this method
@mmolimar
Copy link
Owner

The PR does not compile. Method ifPresentOrElse belongs to JDK9 and the current target is JDK8.
Could you change it please?

@kylecbrodie
Copy link
Contributor Author

The PR does not compile. Method ifPresentOrElse belongs to JDK9 and the current target is JDK8.

Could you change it please?

Definitely! I'll change it tomorrow (July 30th)

@kylecbrodie
Copy link
Contributor Author

@mmolimar I added a test case and formatted my change so it is easier to see what has changed. I wasn't able to get JUnit working in VS Code so I'm hoping Travis can run the test case to see if it passes or fails

@mmolimar
Copy link
Owner

mmolimar commented Aug 4, 2020

It looks like the tests don't pass

Copy link
Owner

@mmolimar mmolimar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be simpler something like this:

private Schema buildSchema(Map<String, Object> config) {
        SchemaBuilder builder = SchemaBuilder.struct();
        if (iterator.hasNext() && !settings.isHeaderExtractionEnabled()) {
            String[] headers;
            if (settings.getHeaders() == null || settings.getHeaders().length == 0) {
                Record first = iterator.next();
                headers = new String[first.getValues().length];
                IntStream.range(0, headers.length)
                        .forEach(index -> headers[index] = DEFAULT_COLUMN_NAME + (index + 1));
                seek(0);
            } else {
                headers = settings.getHeaders();
            }
            List<Schema> dataTypes = getDataTypes(config, headers);
            IntStream.range(0, headers.length)
                    .forEach(index -> builder.field(headers[index], dataTypes.get(index)));
        } else if (settings.isHeaderExtractionEnabled()) {
            Optional.ofNullable(iterator.getContext().headers()).ifPresent(headers -> {
                List<Schema> dataTypes = getDataTypes(config, headers);
                IntStream.range(0, headers.length)
                        .forEach(index -> builder.field(headers[index], dataTypes.get(index)));
            });
        }
        return builder.build();
    }

And rearrange so the case of extracted or user provided headers
is handled first and using default headers is handled second

This stems from the potentially incorrect assumption that wanting to use
provided or extracted headers is more common than wanting to use the
default headers
Copy link
Owner

@mmolimar mmolimar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests don't pass.
You can use the snippet I shared with you ;-)

private Schema buildSchema(Map<String, Object> config) {
        SchemaBuilder builder = SchemaBuilder.struct();
        if (iterator.hasNext() && !settings.isHeaderExtractionEnabled()) {
            String[] headers;
            if (settings.getHeaders() == null || settings.getHeaders().length == 0) {
                Record first = iterator.next();
                headers = new String[first.getValues().length];
                IntStream.range(0, headers.length)
                        .forEach(index -> headers[index] = DEFAULT_COLUMN_NAME + (index + 1));
                seek(0);
            } else {
                headers = settings.getHeaders();
            }
            List<Schema> dataTypes = getDataTypes(config, headers);
            IntStream.range(0, headers.length)
                    .forEach(index -> builder.field(headers[index], dataTypes.get(index)));
        } else if (settings.isHeaderExtractionEnabled()) {
            Optional.ofNullable(iterator.getContext().headers()).ifPresent(headers -> {
                List<Schema> dataTypes = getDataTypes(config, headers);
                IntStream.range(0, headers.length)
                        .forEach(index -> builder.field(headers[index], dataTypes.get(index)));
            });
        }
        return builder.build();
}

@kylecbrodie
Copy link
Contributor Author

@mmolimar I applied your suggestion and it is passing tests now!

@mmolimar
Copy link
Owner

Thanks @kylecbrodie!

@mmolimar mmolimar merged commit e141293 into mmolimar:develop Aug 22, 2020
@kylecbrodie kylecbrodie deleted the allow-no-headers branch August 27, 2020 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Specifying header names for files with no header row
3 participants