Add support for Aliyun Object Storage Service (OSS) #485

Closed · wants to merge 1 commit
10 changes: 10 additions & 0 deletions automation/Makefile
@@ -211,6 +211,16 @@ ifneq "$(PROTOCOL)" ""
sed $(SED_OPTS) "s|YOUR_AZURE_BLOB_STORAGE_ACCOUNT_NAME|$(WASB_ACCOUNT_NAME)|" $(PROTOCOL_HOME)/$(PROTOCOL)-site.xml; \
sed $(SED_OPTS) "s|YOUR_AZURE_BLOB_STORAGE_ACCOUNT_KEY|$(WASB_ACCOUNT_KEY)|" $(PROTOCOL_HOME)/$(PROTOCOL)-site.xml; \
fi; \
if [ $(PROTOCOL) = oss ]; then \
if [ -z "$(OSS_ACCESS_KEY_ID)" ] || [ -z "$(OSS_SECRET_ACCESS_KEY)" ] || [ -z "$(OSS_ENDPOINT)" ]; then \
Contributor: I don't know if these are following an existing convention or if they have to be named this way because third-party deps require them, but the OSS_* names seem very generic. Could we prepend ALIYUN_ to the environment variables?
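For reference, a sketch of how these variables are supplied to this target as the PR is written (the exact make invocation and the endpoint value are illustrative assumptions, not taken from the diff):

    # As submitted, the Makefile reads the generic OSS_* names; the comment
    # above suggests an ALIYUN_ prefix (e.g. ALIYUN_OSS_ACCESS_KEY_ID) instead.
    export OSS_ACCESS_KEY_ID="your-access-key-id"
    export OSS_SECRET_ACCESS_KEY="your-secret-access-key"
    export OSS_ENDPOINT="oss-cn-hangzhou.aliyuncs.com"  # use your region's endpoint
    make PROTOCOL=oss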

echo "Aliyun Keys or Endpoint (OSS_ACCESS_KEY_ID, OSS_SECRET_ACCESS_KEY, OSS_ENDPOINT) not set"; \
rm -rf $(PROTOCOL_HOME); \
exit 1; \
fi; \
sed $(SED_OPTS) "s|YOUR_OSS_ACCESS_KEY_ID|$(OSS_ACCESS_KEY_ID)|" $(PROTOCOL_HOME)/$(PROTOCOL)-site.xml; \
Contributor: Are these calls to sed supposed to be editing in place? Are you missing -i?

Contributor (author): yes
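If the in-place flag is meant to come from SED_OPTS rather than appear inline, a portable definition might look like the following (an assumption for illustration; the actual definition of SED_OPTS lives elsewhere in the Makefile and is not shown in this diff):

    # GNU sed takes -i with no argument; BSD/macOS sed requires a suffix argument.
    if [ "$(uname -s)" = "Darwin" ]; then
        SED_OPTS="-i .bak"   # leaves oss-site.xml.bak next to the edited file
    else
        SED_OPTS="-i"        # in-place edit, no backup
    fi
    sed $SED_OPTS "s|YOUR_OSS_ENDPOINT|${OSS_ENDPOINT}|" oss-site.xml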

sed $(SED_OPTS) "s|YOUR_OSS_SECRET_ACCESS_KEY|$(OSS_SECRET_ACCESS_KEY)|" $(PROTOCOL_HOME)/$(PROTOCOL)-site.xml; \
sed $(SED_OPTS) "s|YOUR_OSS_ENDPOINT|$(OSS_ENDPOINT)|" $(PROTOCOL_HOME)/$(PROTOCOL)-site.xml; \
fi; \
echo "Created $(PROTOCOL) server configuration"; \
fi
endif
16 changes: 16 additions & 0 deletions automation/pom.xml
@@ -285,6 +285,22 @@
<version>5.4.0</version>
</dependency>

<!-- Aliyun Dependencies -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aliyun</artifactId>
<version>${hdp.hadoop.version}</version>
</dependency>

<!-- Use version 3.8.1 (version 3.0.0 would produce a lot of output for
     a NoSuchKey error; the issue is detailed here:
     https://github.com/aliyun/aliyun-oss-java-sdk/issues/145). -->

Contributor: Why would we be hitting the NoSuchKey error?
<dependency>
<groupId>com.aliyun.oss</groupId>
<artifactId>aliyun-sdk-oss</artifactId>
<version>3.8.1</version>
</dependency>

<!-- HADOOP Dependencies -->
<dependency>
<groupId>org.apache.hadoop</groupId>
@@ -16,6 +16,7 @@ public String getExternalTablePath(String basePath, String path) {
return StringUtils.removeStart(path, basePath);
}
},
OSS("oss"),
Contributor: In line with my comment above, is it possible to make this aliyun-oss?

Contributor (author): yeah, I think we need a good name for Aliyun's service. OSS is what their service is called ... but it's pretty confusing.

S3("s3"),
WASBS("wasbs");

@@ -367,7 +367,7 @@ public void textFormatBZip2CopyFromStdin() throws Exception {
*
* @throws Exception if test fails to run
*/
@Test(groups = {"features", "gpdb", "hcfs", "security"}, timeOut = 120000)
@Test(groups = {"features", "gpdb", "hcfs", "security"}, timeOut = 180000)
Contributor: Do we expect the newer aliyun tests to run slower? Why did the timeout need to be increased?

public void textFormatWideRowsInsert() throws Exception {

int rows = 10;
7 changes: 7 additions & 0 deletions automation/src/test/resources/sut/default.xml
@@ -54,6 +54,13 @@
<scheme>wasbs</scheme>
</wasbs>

<oss>
<class>org.greenplum.pxf.automation.components.hdfs.Hdfs</class>
<workingDirectory>data-gpdb-ud-automation/tmp/pxf_automation_data/__UUID__</workingDirectory>
<hadoopRoot>${pxf.base}/servers/oss</hadoopRoot>
<scheme>oss</scheme>
</oss>

<hbase>
<class>org.greenplum.pxf.automation.components.hbase.HBase</class>
<host>localhost</host>
2 changes: 2 additions & 0 deletions concourse/Makefile
@@ -34,6 +34,7 @@ GS ?= false
MINIO ?= false
OEL7 ?= false
FILE ?= false
OSS ?= false

.PHONY: build certification dev pr cloudbuild
build: set-build-pipeline
@@ -114,6 +115,7 @@ set-dev-build-pipeline:
gs=$(GS) \
minio=$(MINIO) \
oel7=$(OEL7) \
oss=$(OSS) \
dev_pipeline=true \
user=$(USER) \
branch=$(BRANCH) \
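For context, a hypothetical invocation that would enable the new job when generating a dev pipeline (the dev target and the BRANCH variable appear in this Makefile; the combination shown is an assumption):

    # Generate a dev build pipeline with the OSS test job included.
    make -C concourse dev OSS=true BRANCH=my-feature-branch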
34 changes: 33 additions & 1 deletion concourse/pipelines/templates/build_pipeline-tpl.yml
@@ -41,6 +41,7 @@ groups:
- Test PXF-GP[[gp_ver]]-S3-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-ADL-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-GS-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-OSS-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-MINIO-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-NO-IMPERS on RHEL7
@@ -85,6 +86,7 @@ groups:
- Test PXF-GP[[gp_ver]]-S3-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-ADL-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-GS-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-OSS-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-MINIO-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-NO-IMPERS on RHEL7
@@ -1154,6 +1156,35 @@ jobs:
PROTOCOL: minio
TARGET_OS: centos

- name: Test PXF-GP[[gp_ver]]-OSS-NO-IMPERS on RHEL7
plan:
- in_parallel:
- get: pxf_src
passed: [Testing Gate for PXF-GP]
trigger: true
- get: pxf_tarball
resource: pxf_gp[[gp_ver]]_tarball_rhel7
passed: [Testing Gate for PXF-GP]
- get: gpdb_package
resource: gpdb[[gp_ver]]_rhel7_rpm_latest-0
passed: [Testing Gate for PXF-GP]
- get: gpdb[[gp_ver]]-pxf-dev-centos7-image
- get: pxf-automation-dependencies
- get: singlecluster
resource: singlecluster-hdp2
- task: Test PXF-GP[[gp_ver]]-OSS-NO-IMPERS on RHEL7
file: pxf_src/concourse/tasks/test.yml
image: gpdb[[gp_ver]]-pxf-dev-centos7-image
params:
GP_VER: [[gp_ver]]
GROUP: hcfs
IMPERSONATION: false
OSS_ACCESS_KEY_ID: ((data-gpdb-ud-oss-access-key-id))
OSS_SECRET_ACCESS_KEY: ((data-gpdb-ud-oss-access-key))
OSS_ENDPOINT: ((data-gpdb-ud-oss-endpoint))
PROTOCOL: oss
TARGET_OS: centos

## ---------- Multi-node tests ----------

- name: Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-IMPERS on RHEL7
@@ -1421,6 +1452,7 @@ jobs:
- Test PXF-GP[[gp_ver]]-S3-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-ADL-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-GS-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-OSS-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-MINIO-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-NO-IMPERS on RHEL7
@@ -1448,6 +1480,7 @@ jobs:
- Test PXF-GP[[gp_ver]]-S3-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-ADL-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-GS-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-OSS-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-MINIO-NO-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-IMPERS on RHEL7
- Test PXF-GP[[gp_ver]]-HDP2-SECURE-MULTI-NO-IMPERS on RHEL7
@@ -1740,4 +1773,3 @@ jobs:
attachment_globs: [pxf_artifacts/((pxf-osl-file-prefix))*.txt]
to_text: "((pxf-releng-email)),((pxf-ud-email))"
{% endif %}

34 changes: 34 additions & 0 deletions concourse/pipelines/templates/dev_build_pipeline-tpl.yml
@@ -941,6 +941,40 @@ jobs:
{% endif %}
{% endif %}

{% if oss %}
- name: Test PXF-GP[[gp_ver]]-OSS-NO-IMPERS on RHEL7
plan:
- in_parallel:
- get: pxf_src
passed: [Build PXF-GP[[gp_ver]] on RHEL7]
trigger: true
- get: pxf_tarball
resource: pxf_gp[[gp_ver]]_tarball_rhel7
passed: [Build PXF-GP[[gp_ver]] on RHEL7]
- get: gpdb_package
resource: gpdb[[gp_ver]]_rhel7_rpm_latest-0
passed: [Build PXF-GP[[gp_ver]] on RHEL7]
- get: gpdb[[gp_ver]]-pxf-dev-centos7-image
- get: pxf-automation-dependencies
- get: singlecluster
resource: singlecluster-hdp2
- task: Test PXF-GP[[gp_ver]]-OSS-NO-IMPERS on RHEL7
file: pxf_src/concourse/tasks/test.yml
image: gpdb[[gp_ver]]-pxf-dev-centos7-image
params:
GP_VER: [[gp_ver]]
GROUP: hcfs
IMPERSONATION: false
OSS_ACCESS_KEY_ID: ((data-gpdb-ud-oss-access-key-id))
OSS_SECRET_ACCESS_KEY: ((data-gpdb-ud-oss-access-key))
OSS_ENDPOINT: ((data-gpdb-ud-oss-endpoint))
PROTOCOL: oss
TARGET_OS: centos
{% if slack_notification %}
<<: *slack_alert
{% endif %}
{% endif %}

{% if minio %}
- name: Test PXF-GP[[gp_ver]]-MINIO-NO-IMPERS on RHEL7
plan:
3 changes: 3 additions & 0 deletions concourse/tasks/test.yml
@@ -21,6 +21,9 @@ params:
GROUP: smoke
HADOOP_CLIENT: HDP
IMPERSONATION: true
OSS_ACCESS_KEY_ID:
OSS_SECRET_ACCESS_KEY:
OSS_ENDPOINT:
PGPORT: 5432
PXF_BASE_DIR:
SECRET_ACCESS_KEY:
2 changes: 1 addition & 1 deletion server/gradle.properties
@@ -17,7 +17,7 @@

version=0.0.0-SNAPSHOT
license=ASL 2.0
hadoopVersion=2.9.2
hadoopVersion=2.10.0
htraceVersion=4.0.1-incubating
hiveVersion=2.3.7
hbaseVersion=1.3.2
6 changes: 6 additions & 0 deletions server/pxf-hdfs/build.gradle
@@ -78,6 +78,12 @@ dependencies {
// GCS jars and dependencies
implementation("com.google.cloud.bigdataoss:gcs-connector:hadoop2-1.9.17:shaded") { transitive = false }

// Aliyun (Alibaba cloud) jars
implementation("org.apache.hadoop:hadoop-aliyun:${hadoopVersion}") { transitive = false }
implementation("com.aliyun.oss:aliyun-sdk-oss:3.8.1") { transitive = false }
implementation("com.aliyun.odps:hadoop-fs-oss:3.3.8-public") { transitive = false }
implementation("org.jdom:jdom:1.1") { transitive = false }

/*******************************
* Test Dependencies
*******************************/
@@ -43,6 +43,8 @@ protected String validateAndNormalizeBasePath(String serverName, String basePath
},
GS,
HDFS,
// OSS is for Aliyun (Alibaba Cloud)
OSS,
S3,
S3A,
S3N,
101 changes: 101 additions & 0 deletions server/pxf-service/src/main/resources/pxf-profiles-default.xml
@@ -829,4 +829,105 @@ under the License.
<resolver>org.greenplum.pxf.plugins.json.JsonResolver</resolver>
</plugins>
</profile>

<!-- Aliyun (Alibaba Cloud) profiles -->
<profile>
<name>oss:text</name>
<description>This profile is suitable for using when reading delimited single line records
from plain text, tab-delimited, files on Alibaba Cloud
</description>
<plugins>
<fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
<accessor>org.greenplum.pxf.plugins.hdfs.LineBreakAccessor</accessor>
<resolver>org.greenplum.pxf.plugins.hdfs.StringPassResolver</resolver>
</plugins>
<protocol>oss</protocol>
</profile>
<profile>
<name>oss:csv</name>
<description>This profile is suitable for using when reading delimited single line records
from plain text CSV files on Alibaba Cloud
</description>
<plugins>
<fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
<accessor>org.greenplum.pxf.plugins.hdfs.LineBreakAccessor</accessor>
<resolver>org.greenplum.pxf.plugins.hdfs.StringPassResolver</resolver>
</plugins>
<protocol>oss</protocol>
</profile>
<profile>
<name>oss:text:multi</name>
<description>This profile is suitable for using when reading delimited single or multi line
records (with quoted linefeeds) from plain text files on Alibaba Cloud. It is not splittable (non
parallel) and slower than HdfsTextSimple.
</description>
<plugins>
<fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsFileFragmenter</fragmenter>
<accessor>org.greenplum.pxf.plugins.hdfs.QuotedLineBreakAccessor</accessor>
<resolver>org.greenplum.pxf.plugins.hdfs.StringPassResolver</resolver>
</plugins>
<protocol>oss</protocol>
</profile>
<profile>
<name>oss:parquet</name>
<description>A profile for reading and writing Parquet data from Alibaba Cloud
</description>
<plugins>
<fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
<accessor>org.greenplum.pxf.plugins.hdfs.ParquetFileAccessor</accessor>
<resolver>org.greenplum.pxf.plugins.hdfs.ParquetResolver</resolver>
</plugins>
<protocol>oss</protocol>
</profile>
<profile>
<name>oss:avro</name>
<description>This profile is suitable for using when reading Avro files (i.e
fileName.avro)
Contributor suggested change:
-            fileName.avro)
+            fileName.avro) from Alibaba Cloud

Contributor: it seems like you specify Alibaba Cloud in all the other profiles, so we should keep this consistent across all the profiles.
</description>
<plugins>
<fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
<accessor>org.greenplum.pxf.plugins.hdfs.AvroFileAccessor</accessor>
<resolver>org.greenplum.pxf.plugins.hdfs.AvroResolver</resolver>
</plugins>
<protocol>oss</protocol>
</profile>
<profile>
<name>oss:json</name>
<description>
Access JSON data either as:
Contributor suggested change:
-            Access JSON data either as:
+            Access JSON data from Alibaba Cloud either as:
* one JSON record per line (default)
* or multiline JSON records with an IDENTIFIER parameter indicating a member name used
to determine the encapsulating json object to return
</description>
<plugins>
<fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
<accessor>org.greenplum.pxf.plugins.json.JsonAccessor</accessor>
<resolver>org.greenplum.pxf.plugins.json.JsonResolver</resolver>
</plugins>
<protocol>oss</protocol>
</profile>
<profile>
<name>oss:AvroSequenceFile</name>
<description>
Read an Avro format stored in sequence file, with separated schema file from Alibaba Cloud
</description>
<plugins>
<fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
<accessor>org.greenplum.pxf.plugins.hdfs.SequenceFileAccessor</accessor>
<resolver>org.greenplum.pxf.plugins.hdfs.AvroResolver</resolver>
</plugins>
<protocol>oss</protocol>
</profile>
<profile>
<name>oss:SequenceFile</name>
<description>
Profile for accessing Sequence files serialized with a custom Writable class
Contributor suggested change:
-            Profile for accessing Sequence files serialized with a custom Writable class
+            Profile for accessing Sequence files serialized with a custom Writable class from Alibaba Cloud
</description>
<plugins>
<fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
<accessor>org.greenplum.pxf.plugins.hdfs.SequenceFileAccessor</accessor>
<resolver>org.greenplum.pxf.plugins.hdfs.WritableResolver</resolver>
</plugins>
<protocol>oss</protocol>
</profile>
</profiles>
1 change: 1 addition & 0 deletions server/pxf-service/src/templates/conf/pxf-log4j2.xml
@@ -31,6 +31,7 @@
<!-- <Logger name="org.greenplum.pxf" level="debug"/> -->

<!-- The levels below are tuned to provide minimal output, change to INFO or DEBUG to troubleshoot -->
<Logger name="com.aliyun.oss" level="error"/>
<Logger name="com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase" level="warn"/>
<Logger name="org.apache.hadoop" level="warn"/>
<Logger name="org.apache.parquet" level="warn"/>
27 changes: 27 additions & 0 deletions server/pxf-service/src/templates/templates/oss-site.xml
@@ -0,0 +1,27 @@
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.oss.endpoint</name>
<value>YOUR_OSS_ENDPOINT</value>
<description>Aliyun OSS endpoint to connect to. An up-to-date list is
provided in the Aliyun OSS Documentation.</description>
</property>
<property>
<name>fs.oss.accessKeyId</name>
<value>YOUR_OSS_ACCESS_KEY_ID</value>
<description>Aliyun Access Key ID</description>
</property>
<property>
<name>fs.oss.accessKeySecret</name>
<value>YOUR_OSS_SECRET_ACCESS_KEY</value>
<description>Aliyun Access Key Secret</description>
</property>
<property>
<name>fs.AbstractFileSystem.oss.impl</name>
<value>org.apache.hadoop.fs.aliyun.oss.OSS</value>
</property>
<property>
<name>fs.oss.impl</name>
<value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
</configuration>
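Once the placeholders above are substituted (by the automation Makefile or by hand), the configuration can be smoke-tested end to end. An illustrative check (the server name, bucket, and object path are hypothetical):

    # Query an OSS object through PXF using the new oss:text profile.
    psql -d postgres <<'SQL'
    CREATE EXTERNAL TABLE pxf_oss_smoke (line text)
        LOCATION ('pxf://my-bucket/tmp/demo.txt?PROFILE=oss:text&SERVER=oss')
        FORMAT 'TEXT';
    SELECT count(*) FROM pxf_oss_smoke;
    SQL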