https://github.com/opensearch-project/data-prepper/issues/2445 #2309

Closed
wants to merge 126 commits

Conversation

mahesh724

Description

  • InputCodec interface added in data-prepper-api (a sketch of the interface follows this list)
  • ParquetInputCodec implementation added in data-prepper-plugins/parquet-codecs module
  • Repository Restructured
  • JSONCodec moved from S3Source Plugin to parse-json-processor package
  • CSVCodec moved from S3Source Plugin to csv-processor package
  • NewLine Codec moved from S3Source Plugin to newline-codecs package
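
For reference, a minimal sketch of the new interface, reconstructed from the package name and parse(...) signature discussed in the review below (the actual file in this PR may carry additional javadoc or methods):

package org.opensearch.dataprepper.model.codec;

import org.opensearch.dataprepper.model.event.Event;
import org.opensearch.dataprepper.model.record.Record;

import java.io.IOException;
import java.io.InputStream;
import java.util.function.Consumer;

public interface InputCodec {
    /**
     * Parses an {@link InputStream} from a source plugin (e.g. S3) and hands
     * each decoded {@link Event} to the given consumer.
     */
    void parse(InputStream inputStream, Consumer<Record<Event>> eventConsumer) throws IOException;
}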

Issues Resolved

Resolves #1532

Check List

  • New functionality includes testing.
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

mahesh724 and others added 7 commits January 23, 2023 13:47
-Repository Restructured
-JSONCodec moved from S3Source Plugin to parse-json-processor package
-CSVCodec moved from S3Source Plugin to csv-processor package
-NewLine Codec moved from S3Source Plugin to newline-codecs package
@mahesh724 mahesh724 requested a review from a team as a code owner February 27, 2023 17:25
Member @dlvenable left a comment:

@mahesh724, thank you for your contribution! I left some comments.

I see an .exe file in here. Please remove that.

@@ -0,0 +1,18 @@
package org.opensearch.dataprepper.model.codec;
Member:

All new files should have the following header:

/*
 * Copyright OpenSearch Contributors
 * SPDX-License-Identifier: Apache-2.0
 */

@@ -12,7 +12,7 @@
import java.util.Objects;

/**
* Configuration class for {@link CsvCodec}.
* Configuration class for {@link CsvInputCodec}.
*/
public class CsvCodecConfig {
Member:

Please rename this class to CsvInputCodecConfig. There might be some differences between this and an output codec.

Collaborator:

@dlvenable is this the correct path? data-prepper-plugins/csv-processor/src/main/java/org/opensearch/dataprepper/plugins/csvinputcodec/CsvCodecConfig.java - this has both data-prepper-plugins/csv-processor and dataprepper/plugins in it. I would think the path would be something like data-prepper-plugins/inputcodecs/src/main/java/org/opensearch/csvcodec or something similar.

Member:

I think this should be in the csv-processor. We could reduce code duplication by keeping these CSV plugins together. I don't see any benefit to end users in splitting these out, since they share the same dependencies.

@DataPrepperPlugin(name = "csv", pluginType = Codec.class, pluginConfigurationType = CsvCodecConfig.class)
public class CsvCodec implements Codec {
private static final Logger LOG = LoggerFactory.getLogger(CsvCodec.class);
@DataPrepperPlugin(name = "Csv", pluginType = InputCodec.class, pluginConfigurationType = CsvCodecConfig.class)
Member:

Please use the original case for the name. name = "csv".
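
That is, combining the two diff lines above, the annotation would keep the lower-case name while switching to the new plugin type:

@DataPrepperPlugin(name = "csv", pluginType = InputCodec.class, pluginConfigurationType = CsvCodecConfig.class)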

@@ -3,7 +3,7 @@
* SPDX-License-Identifier: Apache-2.0
*/

package org.opensearch.dataprepper.plugins.source.codec;
package org.opensearch.dataprepper.plugins.csvinputcodec;
Member:

I think a better package name would be either of the following:

org.opensearch.dataprepper.plugins.csv.codec

or

org.opensearch.dataprepper.plugins.codec.csv

The latter fits the current package name for the csv-processor project.

*/

package org.opensearch.dataprepper.plugins.source.codec;
package org.opensearch.dataprepper.jsonCodec;
Member:

Please use either of the following package names:

org.opensearch.dataprepper.plugins.json.codec

or

org.opensearch.dataprepper.plugins.codec.json

Collaborator:

This file path data-prepper-plugins/json-codec/src/main/java/org/opensearch/dataprepper/jsonCodec/JsonInputCodec.java does not look correct; the json codec appears twice in it.

Collaborator:

I see two JsonInputCodec.java files in this PR. Please clean up and re-submit.

Also, for json, we need the json parsing functionality in a library so that we can re-use it in other places (like future DLQ source code)

Author:

What exactly do you mean by "in a library"? The existing JsonInputCodec can be used anywhere within data-prepper-plugins for JSON parsing functionality.

Do you mean to use the JSON parsing functionality inside other data-prepper packages instead of only data-prepper-plugins?

Member:

@mahesh724, we should not have upper-case package names in Java. I think you can change the package to one of the original suggestions, which should resolve both my comment and @kkondaka's comment.

And yes, we should have only one JsonInputCodec.

Contributor:

Mahesh - Wouldn't renaming the package as per David's comment address this issue?

Author:

@ashoktelukuntla yes, renaming the package to org.opensearch.dataprepper.plugins.codec.json and removing one JsonInputCodec.java class would resolve the issue.

@@ -13,6 +13,10 @@ repositories {

dependencies {
implementation project(':data-prepper-api')
implementation project(':data-prepper-plugins:parquet-codecs')
Member:

This project should not include these codecs. The integration test implementation should have them for testing. But, not in implementation.

Author @mahesh724 (Mar 1, 2023):

The S3 source has an integrationTest folder inside the src folder that contains the CsvRecordsGenerator, JsonRecordGenerator and NewLineDelimitedRecordsGenerator classes, which require the CsvInputCodec, JsonInputCodec and NewlineDelimetedInputCodec classes respectively. If we put those codecs in testImplementation, then the mentioned classes would not be accessible inside s3-source's src/integrationTest.

Do we need to use only the old S3 codec implementation inside integrationTest?

settings.gradle Outdated
include 'data-prepper-plugins:json-codec'
include 'data-prepper-plugins:parquet-codecs'
include 'data-prepper-plugins:newline-codecs'
findProject(':data-prepper-plugins:newline-codecs')?.name = 'newline-codecs'
Member:

I'm unsure what you are trying to do here, but this does not seem right.

Author @mahesh724 (Mar 1, 2023):

findProject(':data-prepper-plugins:newline-codecs')?.name = 'newline-codecs'
This line gets autogenerated the first time we build the newline-codecs module, but it is redundant and could be removed.

File tempFile = File.createTempFile(FILE_NAME, FILE_SUFFIX);
Files.copy(inputStream, tempFile.toPath(), StandardCopyOption.REPLACE_EXISTING);

ParquetFileReader parquetFileReader = new ParquetFileReader(HadoopInputFile.fromPath(new Path(tempFile.toURI()), new Configuration()), ParquetReadOptions.builder().build());
Member:

Are there any other options rather than using a local file?

I understand that as a columnar format you need all the lines to put together any given event. But, maybe there is some other approach available here?

Author:

This requires more investigation, as we were not able to find any method/library that lets us create a ParquetReader object directly by passing an InputStream to the constructor and extract all the required details.
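
For reference, one possible direction (an untested sketch, not part of this PR) is to implement Parquet's org.apache.parquet.io.InputFile over an in-memory byte array, which would let the reader seek without a temp file, at the cost of buffering the whole object in memory:

import org.apache.parquet.io.DelegatingSeekableInputStream;
import org.apache.parquet.io.InputFile;
import org.apache.parquet.io.SeekableInputStream;

import java.io.ByteArrayInputStream;

// Sketch: an InputFile backed by a byte[]; the whole stream must fit in memory.
class InMemoryInputFile implements InputFile {
    private final byte[] data;

    InMemoryInputFile(final byte[] data) {
        this.data = data;
    }

    @Override
    public long getLength() {
        return data.length;
    }

    @Override
    public SeekableInputStream newStream() {
        final ByteArrayInputStream stream = new ByteArrayInputStream(data);
        return new DelegatingSeekableInputStream(stream) {
            @Override
            public long getPos() {
                return data.length - stream.available();   // bytes consumed so far
            }

            @Override
            public void seek(final long newPos) {
                stream.reset();   // ByteArrayInputStream's mark defaults to offset 0
                stream.skip(newPos);
            }
        };
    }
}

ParquetFileReader.open(new InMemoryInputFile(bytes)) could then replace the HadoopInputFile/temp-file combination shown above.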


for (int row = 0; row < rows; row++) {
SimpleGroup simpleGroup = (SimpleGroup) recordReader.read();
eventData.put(MESSAGE_FIELD_NAME, simpleGroup.toString());
Member:

Why is this put as a string?

Author:

When we try to put the SimpleGroup into the eventData hashmap and build an Event out of it:

Map<String, SimpleGroup> eventData = new HashMap<>();
eventData.put(MESSAGE_FIELD_NAME, simpleGroup);
final Event event = JacksonLog.builder().withData(eventData).build();

Then we are getting an exception:

No serializer found for class org.apache.parquet.schema.LogicalTypeAnnotation$StringLogicalTypeAnnotation and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: java.util.HashMap["message"]->org.apache.parquet.example.data.simple.SimpleGroup["type"]->org.apache.parquet.schema.MessageType["fields"]->java.util.ArrayList[0]->org.apache.parquet.schema.PrimitiveType["logicalTypeAnnotation"])
at com.fasterxml.jackson.databind.ObjectMapper.valueToTree(ObjectMapper.java:3442)

Member:

I'm not very familiar with the SimpleGroup class and the documentation I found is lacking.

In the end, the Event should have keys and values representing the Parquet results. This will create an Event with a single key-value where the value is some arbitrary string.
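
To make that concrete, a hedged sketch of emitting one key-value pair per Parquet column instead of a single string, reusing the schema/simpleGroup/getValueToString extraction code that appears later in this PR (this is a fragment from inside parse(...), not a complete class):

// Flatten one Parquet row into individual key/value pairs rather than
// one arbitrary "message" string.
final Map<String, Object> eventData = new HashMap<>();
for (int fieldIndex = 0; fieldIndex < schema.getFields().size(); fieldIndex++) {
    final Type field = schema.getFields().get(fieldIndex);
    eventData.put(field.getName(), simpleGroup.getValueToString(fieldIndex, 0));
}
final Event event = JacksonLog.builder().withData(eventData).build();
eventConsumer.accept(new Record<>(event));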

ParquetMetadata footer = parquetFileReader.getFooter();
MessageType schema = createdParquetSchema(footer);

List<SimpleGroup> simpleGroups = new ArrayList<>();
Member:

What is the purpose of this list?

import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.*;
Collaborator:

We should not use "*". Import the individual classes that are used here.

import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoInteractions;
import static org.mockito.Mockito.when;
import static org.mockito.Mockito.*;
Collaborator:

Do not use "*"


mahesh724 and others added 3 commits March 2, 2023 20:49
Description:
- Package naming convention changed
- @TempDir used in ParquetInputCodecTest class
- Gradle build files updated
- Wildcard imports removed

Signed-off-by: Mahesh Kariya <kariyamahesh82@gmail.com>
-Deleted all unwanted files

Signed-off-by: Mahesh Kariya <kariyamahesh82@gmail.com>
implementation project(':data-prepper-api')
implementation 'org.apache.parquet:parquet-hadoop:1.12.0'
implementation 'org.apache.hadoop:hadoop-common:3.3.3'
implementation 'org.apache.parquet:parquet-avro:1.10.1'
Member:

Do we need these Avro dependencies?

Contributor:

Yes, we do require these dependencies inside the test class to generate the schema and to create a random parquet stream. So implementation needs to be changed to testImplementation.

Member:

Yes, please change to use testImplementation.
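
A build-script sketch of that change (artifact versions copied from the build.gradle lines quoted above):

dependencies {
    implementation project(':data-prepper-api')
    implementation 'org.apache.parquet:parquet-hadoop:1.12.0'
    implementation 'org.apache.hadoop:hadoop-common:3.3.3'

    // Only the tests need Avro, to generate schemas and random parquet streams.
    testImplementation 'org.apache.parquet:parquet-avro:1.10.1'
    testImplementation 'org.apache.avro:avro:1.9.0'
}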

Member @dlvenable left a comment:

It seems that when you moved the packages you duplicated a lot of code. I see the same classes repeated. Can you clean these up?

@@ -0,0 +1,59 @@
package org.opensearch.dataprepper.plugins.jsoninputcodec;
Member:

Why do we have this class twice? Can we remove this one?

@mahesh724 mahesh724 force-pushed the main branch 2 times, most recently from 8abf693 to 78880b8 on March 3, 2023 17:26
mahesh724 and others added 12 commits March 9, 2023 12:51
Signed-off-by: mahesh724 <kariyamahesh82@gmail.com>
Signed-off-by: mahesh724 <kariyamahesh82@gmail.com>
Signed-off-by: mahesh724 <kariyamahesh82@gmail.com>
…fixed a bug in the tests where the data was not sent to the correct sink. (opensearch-project#2061)

Signed-off-by: David Venable <dlv@amazon.com>
…oject#2140)

* Adds ScheduledExecutor Service and runnable task
Signed-off-by: Shivani Shukla <sshkamz@amazon.com>
…ue 2123) (opensearch-project#2124)

* Fix for null pointer exception in remote peer forwarding (fix for issue opensearch-project#2123)

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Addressed review comments to add a counter and not skip when an identification key is missing

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Addressed review comments. Modified to increment the counter only when all identification keys are missing

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Added 'final' to the local variable

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Addressed review comments

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Added a test with all missing keys

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Co-authored-by: Krishna Kondaka <krishkdk@amazon.com>
Signed-off-by: Asif Sohail Mohammed <nsifmoh@amazon.com>

Signed-off-by: Asif Sohail Mohammed <nsifmoh@amazon.com>
* Added implementation of s3 support in Opensearch sink

Signed-off-by: Asif Sohail Mohammed <nsifmoh@amazon.com>
…-project#2100)

Bumps [byte-buddy-agent](https://github.com/raphw/byte-buddy) from 1.12.18 to 1.12.20.
- [Release notes](https://github.com/raphw/byte-buddy/releases)
- [Changelog](https://github.com/raphw/byte-buddy/blob/master/release-notes.md)
- [Commits](raphw/byte-buddy@byte-buddy-1.12.18...byte-buddy-1.12.20)

---
updated-dependencies:
- dependency-name: net.bytebuddy:byte-buddy-agent
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Co-authored-by: Krishna Kondaka <krishkdk@amazon.com>
…-project#2160)

Bumps [byte-buddy-agent](https://github.com/raphw/byte-buddy) from 1.12.20 to 1.12.22.
- [Release notes](https://github.com/raphw/byte-buddy/releases)
- [Changelog](https://github.com/raphw/byte-buddy/blob/master/release-notes.md)
- [Commits](raphw/byte-buddy@byte-buddy-1.12.20...byte-buddy-1.12.22)

---
updated-dependencies:
- dependency-name: net.bytebuddy:byte-buddy-agent
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ct#2161)

Bumps [byte-buddy](https://github.com/raphw/byte-buddy) from 1.12.18 to 1.12.22.
- [Release notes](https://github.com/raphw/byte-buddy/releases)
- [Changelog](https://github.com/raphw/byte-buddy/blob/master/release-notes.md)
- [Commits](raphw/byte-buddy@byte-buddy-1.12.18...byte-buddy-1.12.22)

---
updated-dependencies:
- dependency-name: net.bytebuddy:byte-buddy
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
dependabot bot and others added 17 commits March 9, 2023 12:57
…arch-project#2214)

Bumps [org.springframework:spring-context](https://github.com/spring-projects/spring-framework) from 5.3.23 to 5.3.25.
- [Release notes](https://github.com/spring-projects/spring-framework/releases)
- [Commits](spring-projects/spring-framework@v5.3.23...v5.3.25)

---
updated-dependencies:
- dependency-name: org.springframework:spring-context
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…pensearch-project#2223)

Bumps [org.springframework:spring-context](https://github.com/spring-projects/spring-framework) from 5.3.23 to 5.3.25.
- [Release notes](https://github.com/spring-projects/spring-framework/releases)
- [Commits](spring-projects/spring-framework@v5.3.23...v5.3.25)

---
updated-dependencies:
- dependency-name: org.springframework:spring-context
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Fix grok processor to not create a new record

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Fixed checkStyleMain  failure

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

---------

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Co-authored-by: Krishna Kondaka <krishkdk@amazon.com>
Initial commit for OTel trace path changes

Signed-off-by: Asif Sohail Mohammed <nsifmoh@amazon.com>
Signed-off-by: Asif Sohail Mohammed <mdasifsohail7@gmail.com>
…rch-project#2332)

Bumps [org.junit.jupiter:junit-jupiter-api](https://github.com/junit-team/junit5) from 5.9.0 to 5.9.2.
- [Release notes](https://github.com/junit-team/junit5/releases)
- [Commits](junit-team/junit5@r5.9.0...r5.9.2)

---
updated-dependencies:
- dependency-name: org.junit.jupiter:junit-jupiter-api
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ct#2331)

Bumps org.assertj:assertj-core from 3.21.0 to 3.24.2.

---
updated-dependencies:
- dependency-name: org.assertj:assertj-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…arch-project#2333)

Bumps org.apache.logging.log4j:log4j-bom from 2.19.0 to 2.20.0.

---
updated-dependencies:
- dependency-name: org.apache.logging.log4j:log4j-bom
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ertain objects in peer-to-peer connections. Additionally, it refactors some application configurations to improve integration testing. Fixes opensearch-project#2310. (opensearch-project#2311)

Signed-off-by: David Venable <dlv@amazon.com>
Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: asifsmohammed <asifsmohammed@users.noreply.github.com>
Signed-off-by: Asif Sohail Mohammed <nsifmoh@amazon.com>
…oject#2343. Updated the AWS CDK as well. (opensearch-project#2345)

Signed-off-by: David Venable <dlv@amazon.com>
Description:
- Package naming convention changed
- @TempDir used in ParquetInputCodecTest class
- Gradle build files updated
- Wildcard imports removed

Signed-off-by: Mahesh Kariya <kariyamahesh82@gmail.com>
-Deleted all unwanted files

Signed-off-by: mahesh724 <kariyamahesh82@gmail.com>
-Bad field scenario handled.
Signed-off-by: umairofficial <umairhusain1010@gmail.com>
Member @dlvenable left a comment:

Thank you @mahesh724 for this contribution! I have requested a few changes for everything except the Parquet changes. I'll take a look at those shortly.

@@ -0,0 +1,2 @@

ssl: false
Member:

Please remove this change. I understand you may have used it for testing, but let's keep this configured the same way it started.

settings.gradle Outdated
@@ -91,4 +91,7 @@ include 'release:docker'
include 'release:maven'
include 'e2e-test:peerforwarder'
include 'rss-source'
include 'data-prepper-plugins:json-codec'
Member:

I see you have the JsonInputCodec in the parse-json-processor Gradle project. So we don't need a new json-codec Gradle project. You can remove this line.

private final ObjectMapper objectMapper = new ObjectMapper();
private final JsonFactory jsonFactory = new JsonFactory();

@Override
public void parse(final InputStream inputStream, final Consumer<Record<Event>> eventConsumer) throws IOException {
public void parse(InputStream inputStream, Consumer<Record<Event>> eventConsumer) throws IOException {
Member:

You can keep these parameters as final as they were in the original code.

@@ -13,6 +13,9 @@ repositories {

dependencies {
implementation project(':data-prepper-api')
implementation project(':data-prepper-plugins:csv-processor')
Member:

Please make these three dependencies available only as testImplementation.

The S3 Source shouldn't depend on any particular Codec. The integration tests do rely on them, and that is why we need it in the testImplementation.
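
A sketch of the suggested dependency layout (module names taken from elsewhere in this PR; if the integrationTest source set has its own configuration, such as an assumed integrationTestImplementation, the codecs would be added there as well):

dependencies {
    implementation project(':data-prepper-api')

    // Codecs are exercised only by the tests, never by the S3 source itself.
    testImplementation project(':data-prepper-plugins:csv-processor')
    testImplementation project(':data-prepper-plugins:parse-json-processor')
    testImplementation project(':data-prepper-plugins:newline-codecs')
}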

@@ -0,0 +1,14 @@
scan-pipeline:
Member:

Please remove this file. We don't need this file checked in.

*
* @param inputStream The input stream for the source plugin(e.g. S3, Http, RssFeed etc) object
* @param eventConsumer The consumer which handles each event from the stream
*/
Collaborator:

The javadoc should also have an entry for @throws.

Contributor:

I am sorry, could you please elaborate?

import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.*;
Collaborator:

Do not use ".*". Please list individual imports.

Contributor:

Done.

import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoInteractions;
import static org.mockito.Mockito.when;
import static org.mockito.Mockito.*;
Collaborator:

Do not use ".*". Please list individual imports.

Contributor:

Done.

private static final Logger LOG = LoggerFactory.getLogger(ParquetInputCodec.class);

@Override
public void parse(InputStream inputStream, Consumer<Record<Event>> eventConsumer) throws IOException {
Collaborator:

Add final to both the arguments.

try {
eventData.put(field.getName(), simpleGroup.getValueToString(fieldIndex, 0));
}
catch (Exception parquetException){
Collaborator:

We should increment some metric here

Contributor:

As per initial discussions and the HLD, it was concluded that no metrics are to be captured at the codec level. We had initially proposed various metrics in the initial versions of the HLD, but David advised us not to keep them.

eventData.put(field.getName(), simpleGroup.getValueToString(fieldIndex, 0));
}
catch (Exception parquetException){
eventData.put(field.getName(), "unknown");
Collaborator:

I think it is better to indicate which field index was not found. Something like this - "unknown" -> "failed to extract index "?

Member:

I'm not sure we should keep this at all. Once Data Prepper supports tagging (#629), it might make sense to tag this event with a failure.

I think for now we have two options: 1) Drop the entire event; 2) Do not include the field in the event at all.

Contributor:

Alright. Not including that particular field in that event for now.

}
}
} catch (Exception parquetException) {
LOG.error("An exception occurred while parsing parquet InputStream ", parquetException);
Collaborator:

Some metric must be incremented here.

Contributor:

As per initial discussions and the HLD, it was concluded that NO metrics are to be captured at the codec level. We had initially proposed various metrics in the initial versions of the HLD, but David advised us not to keep them.

}
return new ByteArrayInputStream(INVALID_PARQUET_INPUT_STREAM.getBytes());
}
}
Collaborator:

After adding metrics in the exception cases (as commented above), add test cases to test the metrics.

Contributor:

I don't think this will be required anymore, as after today's discussion it was concluded that we have to send the exception back to the source plugin, and that source plugin will capture the metric. The test case that will now be required is "test_when_invalid_field_then_throws_exception". Please correct me if I'm wrong.

}
catch (Exception parquetException){
eventData.put(field.getName(), "unknown");
LOG.error("Unreadable or bad record");
Member:

This error message should give some indication as to what was wrong along with context.

LOG.error("Unable to retrieve value for field with name = '{}' with error '{}'", field.getName(), parquetException.getMessage());

Contributor:

Done


private void parseParquetStream(final InputStream inputStream, final Consumer<Record<Event>> eventConsumer) throws IOException {

final File tempFile = File.createTempFile(FILE_NAME, FILE_SUFFIX);
Member:

It is important to let the user define the location of the directory where the temp files will be created.

Perhaps we should give an option to define the path to the file. If specified, then we use that path. Otherwise, the default is to create a temp file.

Perhaps something like the following?

codec:
  parquet:
    temp_directory: /usr/share/data-prepper/data/parquet/

You can use this version of createTempFile to allow a user-defined temp directory. https://docs.oracle.com/javase/7/docs/api/java/io/File.html#createTempFile(java.lang.String,%20java.lang.String,%20java.io.File)
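
A sketch of how the codec could honor such an option (temp_directory and getTempDirectory() are hypothetical names, not from this PR):

// If the user configured a directory, create the temp file there; passing
// null to createTempFile falls back to the JVM default (java.io.tmpdir).
final File tempDir = config.getTempDirectory() != null
        ? new File(config.getTempDirectory())
        : null;
final File tempFile = File.createTempFile(FILE_NAME, FILE_SUFFIX, tempDir);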


implementation 'org.apache.parquet:parquet-hadoop:1.12.0'
implementation 'org.apache.hadoop:hadoop-common:3.3.3'
implementation 'org.apache.parquet:parquet-avro:1.10.1'
implementation("org.apache.avro:avro:1.9.0")
Member:

This can also be testImplementation, I believe.

Schema schema = parseSchema();
String OS = System.getProperty("os.name").toLowerCase();

if (OS.contains("win")) {
Member:

This test is making a change which then may influence the actual code it is testing. (It has a side-effect)

After writing the test data, please unset this property to verify that this is not needed by the code under test.

Contributor:

Done.

private static InputStream createInvalidParquetStream() {
String OS = System.getProperty("os.name").toLowerCase();
if (OS.contains("win")) {
System.setProperty("hadoop.home.dir", Paths.get("").toAbsolutePath().toString());
Member:

This is setting a property and has a side-effect. Please see my comment above. I'm not sure this is really needed for setting up the test data.

Contributor:

This test case was failing unless we set hadoop.home.dir. After the test cases are done running, we are unsetting this property.
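
For illustration, a minimal JUnit 5 sketch of that cleanup (assuming the property is set in @BeforeAll or a static initializer):

import org.junit.jupiter.api.AfterAll;

// Restore JVM-wide state once every test in the class has run, so the
// property cannot leak into other test classes in the same JVM.
@AfterAll
static void clearHadoopHomeDir() {
    System.clearProperty("hadoop.home.dir");
}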

Signed-off-by: umairofficial <umairhusain1010@gmail.com>
Signed-off-by: umairofficial <umairhusain1010@gmail.com>
@dlvenable dlvenable mentioned this pull request Apr 1, 2023
@svana svana added the v2.3.0 label Apr 20, 2023
@svana svana changed the title Support for Source Codecs #1532 https://github.com/opensearch-project/data-prepper/issues/2445 Apr 20, 2023
@dlvenable dlvenable added this to the v2.3 milestone Apr 20, 2023
@dlvenable dlvenable removed the v2.3.0 label Apr 20, 2023
dlvenable (Member) commented May 1, 2023

@dlvenable dlvenable closed this May 1, 2023
@dlvenable dlvenable removed this from the v2.3 milestone May 4, 2023