Use header_names instead of column_N if available #72

kylecbrodie · 2020-07-11T01:18:07Z

Closes #53

This allows header_names to be specified and used as the field names in the Avro schema for files that lack a header row.

kylecbrodie · 2020-07-11T23:26:11Z

It looks like Travis had a transient failure. The only traceback in the output is:

Exception in thread "unVocity-parsers input reading thread" java.lang.IllegalStateException: Error closing input
	at com.univocity.parsers.common.input.concurrent.ConcurrentCharLoader.stopReading(ConcurrentCharLoader.java:181)
	at com.univocity.parsers.common.input.concurrent.ConcurrentCharLoader.run(ConcurrentCharLoader.java:101)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Filesystem closed
	at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:475)
	at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:656)
	at java.io.FilterInputStream.close(FilterInputStream.java:181)
	at sun.nio.cs.StreamDecoder.implClose(StreamDecoder.java:378)
	at sun.nio.cs.StreamDecoder.close(StreamDecoder.java:193)
	at java.io.InputStreamReader.close(InputStreamReader.java:199)
	at com.univocity.parsers.common.input.concurrent.ConcurrentCharLoader.stopReading(ConcurrentCharLoader.java:178)
	... 2 more

I don't think my code change would cause this. If you know why it failed, that would be very helpful! Maybe re-running Travis will succeed.

coveralls · 2020-07-19T01:19:13Z

Coverage increased (+0.01%) to 96.261% when pulling 8af913a on reveel-it:allow-no-headers into 8e425e7 on mmolimar:develop.

mmolimar

Thanks for the interest @kylecbrodie!
Couldn't this be done using the file_reader.delimited.settings.header and file_reader.delimited.settings.header_names properties?

kylecbrodie · 2020-07-21T18:24:50Z

@mmolimar Setting file_reader.delimited.settings.header to true turns on header extraction, so if the file doesn't have a header row, the first row of data is consumed as the header row. Custom header names in file_reader.delimited.settings.header_names replace the extracted header row names. Setting file_reader.delimited.settings.header to false turns off header extraction and, before this PR, file_reader.delimited.settings.header_names was ignored and column_1, column_2, etc. were used as the header names.

This PR changes the value of the hasHeader variable in the buildSchema method.
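
To make the cases described in this comment concrete, here is a minimal Java sketch of the intended name resolution once this PR is applied. It is not the project's code: `HeaderNameResolver`, `resolve`, and its parameters are hypothetical names used only for illustration, and the `column_` prefix is taken from the default names mentioned above.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the project: it only mirrors the
// behaviour described in the comment above.
final class HeaderNameResolver {

    private static final String DEFAULT_COLUMN_NAME = "column_";

    // headerExtraction: value of file_reader.delimited.settings.header
    // headerNames:      value of file_reader.delimited.settings.header_names (may be null or empty)
    // firstRow:         first row of the file
    static String[] resolve(boolean headerExtraction, String[] headerNames, String[] firstRow) {
        boolean hasCustomNames = headerNames != null && headerNames.length > 0;
        if (hasCustomNames) {
            // Custom names replace extracted names and, with this PR,
            // are also used when header extraction is off.
            return headerNames.clone();
        }
        if (headerExtraction) {
            // The first row is consumed as the header row.
            return firstRow.clone();
        }
        // No extraction and no custom names: column_1, column_2, ...
        String[] names = new String[firstRow.length];
        for (int i = 0; i < names.length; i++) {
            names[i] = DEFAULT_COLUMN_NAME + (i + 1);
        }
        return names;
    }

    public static void main(String[] args) {
        String[] row = {"1", "alice"};
        String[] custom = {"id", "name"};
        System.out.println(Arrays.toString(resolve(false, custom, row)));
        System.out.println(Arrays.toString(resolve(false, null, row)));
        System.out.println(Arrays.toString(resolve(true, null, row)));
    }
}
```

The sketch ignores type inference and file seeking; it only shows which names end up as field names in each configuration.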

mmolimar · 2020-07-27T02:04:22Z

Ok, thanks.
I think it'd be better to do it inside the buildSchema method by adding an else-if branch. Something like:

...
} else if (settings.getHeaders() != null && settings.getHeaders().length > 0) {
  List<Schema> dataTypes = getDataTypes(config, settings.getHeaders());
  IntStream.range(0, settings.getHeaders().length)
    .forEach(index -> builder.field(settings.getHeaders()[index], dataTypes.get(index)));
}
...

Could you also add a test to validate this change?

Also, this change will be included in the next release, and new features and fixes go into the develop branch. Could you point the PR at the develop branch, please?

mmolimar · 2020-07-29T20:31:48Z

The PR does not compile. The ifPresentOrElse method was introduced in JDK 9 and the current target is JDK 8.
Could you change it, please?
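
For context, `Optional.ifPresentOrElse` was added in Java 9, so code targeting Java 8 has to branch explicitly. A minimal sketch of the usual rewrite follows; the class `Java8OptionalBranch`, the method `describe`, and the sample values are hypothetical and unrelated to the PR's actual code.

```java
import java.util.Optional;

// Hypothetical example class, used only to illustrate the Java 8 rewrite.
final class Java8OptionalBranch {

    // Java 9+ form, shown only as a comment because it does not compile on JDK 8:
    //   value.ifPresentOrElse(v -> sb.append("value=").append(v),
    //                         () -> sb.append("empty"));

    static String describe(Optional<String> value) {
        StringBuilder sb = new StringBuilder();
        // Java 8 compatible: test for presence and branch with a plain if/else.
        if (value.isPresent()) {
            sb.append("value=").append(value.get());
        } else {
            sb.append("empty");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(Optional.of("id")));
        System.out.println(describe(Optional.empty()));
    }
}
```

When only the "present" case needs handling, `Optional.ifPresent`, which exists in Java 8, is enough on its own.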

kylecbrodie · 2020-07-30T06:54:47Z

> The PR does not compile. The ifPresentOrElse method was introduced in JDK 9 and the current target is JDK 8.
> Could you change it, please?

Definitely! I'll change it tomorrow (July 30th).

kylecbrodie · 2020-08-04T18:47:57Z

@mmolimar I added a test case and formatted my change so it is easier to see what has changed. I wasn't able to get JUnit working in VS Code, so I'm hoping Travis can run the test case to show whether it passes.

mmolimar · 2020-08-04T23:18:57Z

It looks like the tests don't pass.

mmolimar

Something like this would be simpler:

private Schema buildSchema(Map<String, Object> config) {
    SchemaBuilder builder = SchemaBuilder.struct();
    if (iterator.hasNext() && !settings.isHeaderExtractionEnabled()) {
        // Header extraction is off: use header_names if provided, else column_N.
        String[] headers;
        if (settings.getHeaders() == null || settings.getHeaders().length == 0) {
            // Read the first record to count the columns, then rewind.
            Record first = iterator.next();
            headers = new String[first.getValues().length];
            IntStream.range(0, headers.length)
                    .forEach(index -> headers[index] = DEFAULT_COLUMN_NAME + (index + 1));
            seek(0);
        } else {
            headers = settings.getHeaders();
        }
        List<Schema> dataTypes = getDataTypes(config, headers);
        IntStream.range(0, headers.length)
                .forEach(index -> builder.field(headers[index], dataTypes.get(index)));
    } else if (settings.isHeaderExtractionEnabled()) {
        // Header extraction is on: use the headers from the parsing context, if any.
        Optional.ofNullable(iterator.getContext().headers()).ifPresent(headers -> {
            List<Schema> dataTypes = getDataTypes(config, headers);
            IntStream.range(0, headers.length)
                    .forEach(index -> builder.field(headers[index], dataTypes.get(index)));
        });
    }
    return builder.build();
}

And rearrange so that the case of extracted or user-provided headers is handled first and the default headers second. This stems from the potentially incorrect assumption that wanting to use provided or extracted headers is more common than wanting to use the default headers.

mmolimar

The tests don't pass.
You can use the snippet I shared with you ;-)

kylecbrodie · 2020-08-21T21:51:49Z

@mmolimar I applied your suggestion and it is passing tests now!

mmolimar · 2020-08-22T19:10:31Z

Thanks @kylecbrodie!

Use header_names instead of column_N if available

aa169ef

mmolimar reviewed Jul 19, 2020

kylecbrodie added 2 commits July 28, 2020 15:40

Merge branch 'develop' into allow-no-headers

63cbb47

Move header check into buildSchema

4a23b1b

and remove hasHeader parameter

kylecbrodie changed the base branch from master to develop July 28, 2020 22:54

move .forEach to next line

c968dd7

to match the formatting of the previous version of this method

kylecbrodie added 4 commits August 4, 2020 10:34

Become less functional

cfb8411

since ifPresentOrElse is not available in Java 8

Add test case for custom headers

8846398

and header extraction off

Format method to look similar to before

e8a6fcd

Format checkDataWithHeaders to match checkData

274b3d2

kylecbrodie requested a review from mmolimar August 4, 2020 18:42

mmolimar requested changes Aug 15, 2020

mmolimar requested changes Aug 19, 2020

Apply suggestion to buildSchema

8af913a

mmolimar merged commit e141293 into mmolimar:develop Aug 22, 2020

kylecbrodie deleted the allow-no-headers branch August 27, 2020 17:50

kylecbrodie mentioned this pull request Aug 27, 2020

Specifying header names for files with no header row #53

Closed
