Change log

Generated on 2022-12-14

Release 1.5.0

Gazelle Plugin

Features


#931	Reuse partition vectors for arrow scan
#955	implement missing expressions
#1120	Support aggregation window functions with order by
#1135	Supports Spark 3.2.2 shims
#1114	Remove tmp directory after application exits
#862	implement row_number window function
#1007	Document how to test columnar UDF
#942	Use hash aggregate for string type input

Performance


#1144	Optimize cast WSCG performance

Bugs Fixed


#1170	Segfault on data source v2
#1164	Limit the column num in WSCG
#1166	Peers' values should be considered in window function for CURRENT ROW in range mode
#1149	Vulnerability issues
#1112	Validate Error: “Invalid: Length spanned by binary offsets (21) larger than values array (size 20)”
#1103	wrong hashagg results
#929	Failed to add user extension while using gazelle
#1100	Wildcard in json path is not supported
#1079	Like function gets wrong result when default escape char is contained
#1046	Fall back to use row-based operators, error is makeStructField is unable to parse from conv
#1053	Exception when there is function expression in pos or len of substring
#1024	ShortType is not supported in ColumnarLiteral
#1034	Exception when there is unix_timestamp in CaseWhen
#1032	Missing WSCG check for ExistenceJoin
#1027	partition by literal in window function
#1019	Support more date formats for from_unixtime & unix_timestamp
#999	The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark SortExec
#984	concat_ws
#958	JVM/Native R2C and CoalesceBatches process time inaccuracy
#979	Failed to find column while reading parquet with case insensitive

PRs


#1175	[NSE-1171] Support merge parquet schema and read missing schema
#1178	[NSE-1161][FOLLOWUP] Remove extra compression type check
#1162	[NSE-1161] Support read-write parquet conversion to read-write arrow
#1014	[NSE-956] allow to write parquet with compression
#1176	bump h2/pgsql version
#1173	[NSE-1171] Throw RuntimeException when reading duplicate fields in case-insensitive mode
#1172	[NSE-1170] Setting correct row number in batch scan w/ partition columns
#1169	[NSE-1161] Format sql config string key
#1167	[NSE-1166] Cover peers' values in sum window function in range mode
#1165	[NSE-1164] Limit the max column num in WSCG
#1160	[NSE-1149] upgrade guava to 30.1.1
#1158	[NSE-1149] upgrade guava to 30.1.1
#1152	[NSE-1149] upgrade guava to 24.1.1
#1153	[NSE-1149] upgrade pgsql to 42.3.3
#1150	[NSE-1149] Remove log4j in shims module
#1146	[NSE-1135] Introduce shim layer for supporting spark 3.2.2
#1145	[NSE-1144] Optimize cast wscg performance
#1136	Remove project from wscg when it's the child of window
#1122	[NSE-1120] Support sum window function with order by statement
#1131	[NSE-1114] Remove temp directory without FileUtils.forceDeleteOnExit
#1129	[NSE-1127] Use larger buffer for hash agg
#1130	[NSE-610] fix hashjoin build time metric
#1126	[NSE-1125] Add status check for hashing GetOrInsert
#1056	[NSE-955] Support window function lag
#1123	[NSE-1118] fix codegen on TPCDS Q88
#1119	[NSE-1118] adding more checks for SMJ codegen
#1058	[NSE-981] Add a test suite for projection codegen
#1117	[NSE-1116] Disable columnar url_decoder
#1113	[NSE-1112] Fix Arrow array meta data validating issue when writing parquet files
#1039	[NSE-1019] fix codegen for all expressions
#1115	[NSE-1114] Remove tmp directory after application exits
#1111	remove debug log
#1098	[NSE-1108] allow to use different cases in column names
#1082	[NSE-1071] Refactor vector resizing in hash aggregate
#1036	[NSE-987] fix string date
#948	[NSE-947] Add a whole stage fallback strategy
#1099	[NSE-1104] fix hashagg w/ empty string
#1102	[NSE-400] Fix memory leak for native C2R and R2C.
#1101	[NSE-1100] Fall back get_json_object when wildcard is contained in json path
#1090	[NSE-1065] fix on count distinct w/ keys
#1097	Ignore two unit tests
#1081	[NSE-1075] Support dynamic merge file partition
#1080	[NSE-1079] Set the default escape char for like function
#1078	[NSE-610] support big keys in hashagg
#1072	[NSE-1071] Add tiny optimizations for hash aggregation functions
#1069	[NSE-800] Remove spark-arrow-datasource-parquet in assembly
#1066	[NSE-1065] Adding hashagg w/ filter support
#1067	[NSE-958] Fix JVM R2C operator metrics
#935	[NSE-931] Reuse partition vectors for arrow scan
#1064	[NSE-955] Implement parse_url
#1063	[NSE-955] Support more date format in unix timestamp
#930	[NSE-929] Support user defined spark extensions
#1038	[NSE-928] allow to sort with big partitions
#1057	[NSE-1019] fix codegen for unixtimestamp
#1055	[NSE-955] Support md5/sha1/sha2 functions
#903	[NSE-610] hashagg opt#3
#1044	[NSE-400] fix memory leakage in native columnartorow
#1041	[NSE-1023] [NSE-1046] Cover more supported expressions in getting AttributeReference
#1054	[NSE-1053] Support function in substring's pos and len
#1049	[NSE-955] Support bin function
#1048	[NSE-955] Support power function
#1042	[NSE-955] Support find_in_set function
#1025	[NSE-1024] Support ShortType in ColumnarLiteral
#1037	[NSE-955] Turn on the support for get_json_object
#1033	[NSE-1032] Adding WSCG check for keys in Join
#1035	[NSE-1034] Add timeZoneId in ColumnarUnixTimestamp
#1028	[NSE-1027] Problem with Literal in window function
#1017	[NSE-999] use TimSort for STRING/DECIMAL onekey based sorting
#1022	[NSE-955] Support remainder function
#1021	[NSE-1019] [NSE-1020] Support more date formats and be aware of local time zone in handling unix timestamp
#1009	[NSE-999] s/string/string_view in sort
#990	[NSE-943] Improve rowtocolumn operator
#1000	[NSE-862] improve row_number()
#1013	[NSE-955] Add Murmur3Hash expression support
#995	[NSE-981] Add more codegen checking in BHJ & SHJ
#1006	[NSE-1007] Add a test guide for columnar UDF
#969	[NSE-943] Optimize data conversion for String/Binary type in Row2Columnar
#973	[NSE-928] Add ARROW_CHECK for batch_size check
#992	[NSE-984] fix concat_ws
#991	[NSE-981] check all expressions in HashAgg
#993	[NSE-979] fix data source
#980	[NSE-979] Support reading parquet with case sensitive
#985	[NSE-981] Implement supportColumnarCodegen to reflect the actual support state
#964	[NSE-955] implement lpad/rpad
#963	[NSE-955] implement concat_ws
#971	[NSE-955] Support hex expression
#968	[NSE-955] implement lower function
#965	[NSE-955] Support expression conv
#949	[NSE-862] implement row_number function
#960	[NSE-955] doc: Add columnar expression development guide
#941	[NSE-942] Force to use hash aggregate for string type input
#959	[NSE-958] Fix SQLMetrics inaccuracy in JVM/Native R2C and CoalesceBatches

Release 1.4.0

Gazelle Plugin

Features


#781	Add spark eventlog analyzer for advanced analyzing
#927	Column2Row further enhancement
#913	Add Hadoop 3.3 profile to pom.xml
#869	implement first agg function
#926	Support UDF URLDecoder
#856	[SHUFFLE] manually split of Variable length buffer (String likely)
#886	Add pmod function support
#855	[SHUFFLE] HugePage support in shuffle
#872	implement replace function
#867	Add substring_index function support
#818	Support length, char_length, locate, regexp_extract
#864	Enable native parquet write by default
#828	CoalesceBatches native implementation
#800	Combine datasource and columnar core jar

Performance


#848	Optimize Columnar2Row performance
#943	Optimize Row2Columnar performance
#854	Enable skipping columnarWSCG for queries with small shuffle size
#857	[SHUFFLE] split by reducer by column

Bugs Fixed


#827	Github action is broken
#987	TPC-H q7, q8, q9 run failed when using String for Date
#892	Q47 and q57 failed on ubuntu 20.04 OS without open-jdk.
#784	Improve Sort Spill
#788	Spark UT of "randomSplit on reordered partitions" encountered "Invalid: Map array child array should have no nulls" issue
#821	Improve Wholestage Codegen check
#831	Support more expression types in getting attribute
#876	Write arrow hang with OutputWriter.path
#891	Spark executor lost while DatasetFileWriter failed with speculation
#909	"INSERT OVERWRITE x SELECT /*+ REPARTITION(2) */ * FROM y LIMIT 2" drains 4 rows into table x using Arrow write extension
#889	Failed to write with ParquetFileFormat while using ArrowWriteExtension
#910	TPCDS failed, segfault caused by PR903
#852	Unit test fix for NSE-843
#843	ArrowDataSource: Arrow dataset inspect() is called every time a file is read

PRs


#1005	[NSE-800] Fix an assembly warning
#1002	[NSE-800] Pack the classes into one single jar
#988	[NSE-987] fix string date
#977	[NSE-126] set default codegen opt to O1
#975	[NSE-927] Add macro AVX512BW check for different CPU architecture
#962	[NSE-359] disable unit tests on spark32 package
#966	[NSE-913] Add support for Hadoop 3.3.1 when packaging
#936	[NSE-943] Optimize IsNULL() function for Row2Columnar
#937	[NSE-927] Implement AVX512 optimization selection in Runtime and merge two C2R code files into one.
#951	[DNM] update sparklog
#938	[NSE-581] implement rlike/regexp_like
#946	[DNM] update on sparklog script
#939	[NSE-581] adding ShortType/FloatType in ColumnarLiteral
#934	[NSE-927] Extract and inline functions for native ColumnartoRow
#933	[NSE-581] Improve GetArrayItem(Split()) performance
#922	[NSE-912] Remove extra handleSafe costs
#925	[NSE-926] Support a UDF: URLDecoder
#924	[NSE-927] Enable AVX512 in Binary length calculation for native ColumnartoRow
#918	[NSE-856] Optimize of string/binary split
#908	[NSE-848] Optimize performance for Column2Row
#900	[NSE-869] Add 'first' agg function support
#917	[NSE-886] Add pmod expression support
#916	[NSE-909] fix slow test
#915	[NSE-857] Further optimizations of validity buffer split
#912	[NSE-909] "INSERT OVERWRITE x SELECT /*+ REPARTITION(2) */ * FROM y L…
#896	[NSE-889] Failed to write with ParquetFileFormat while using ArrowWriteExtension
#911	[NSE-910] fix bug of PR903
#901	[NSE-891] Spark executor lost while DatasetFileWriter failed with speculation
#907	[NSE-857] split validity buffer by reducer
#902	[NSE-892] Allow to use jar cmd not in PATH
#898	[NSE-867][FOLLOWUP] Add substring_index function support
#894	[NSE-855] allocate large block of memory for all reducer #881
#880	[NSE-857] Fill destination buffer by reducer
#839	[DNM] some optimizations to shuffle's split function
#879	[NSE-878]Wip get phyplan bugfix
#877	[NSE-876] Fix writing arrow hang with OutputWriter.path
#873	[NSE-872] implement replace function
#850	[NSE-854] Small Shuffle Size disable wholestagecodegen
#868	[NSE-867] Add substring_index function support
#847	[NSE-818] Support length, char_length, locate & regexp_extract
#865	[NSE-864] Enable native parquet write by default
#811	[NSE-810] disable codegen for SMJ with local limit
#860	remove sensitive info from physical plan
#853	[NSE-852] Unit test fix for NSE-843
#844	[NSE-843] ArrowDataSource: Arrow dataset inspect() is called every tim…
#842	fix in eventlog script
#841	fix bug of script
#829	[NSE-828] Add native CoalesceBatches implementation
#830	[NSE-831] Support more expression types in getting attribute
#815	[NSE-610] Shrink hashmap to use less memory
#822	[NSE-821] Fix Wholestage Codegen on unsupported pattern
#824	[NSE-823] Use `SPARK_VERSION_SHORT` instead of `SPARK_VERSION` to find SparkShims
#826	[NSE-827] fix GHA
#819	[DNM] complete sparklog script
#802	[NSE-794] Fix count() with decimal value
#801	[NSE-786] Adding docs for shim layers
#790	[NSE-781]Add eventlog analyzer tool
#789	[NSE-788] Quick fix for randomSplit on reordered partitions
#780	[NSE-784] fallback Sort after SortHashAgg

OAP MLlib

Performance


#204	Intel-MLlib require more memory to run Bayes algorithm.

PRs


#208	[ML-204][NaiveBayes] Remove cache from NaiveBayes

Release 1.3.1

Gazelle Plugin

Features


#710	Add rand expression support
#745	improve codegen check
#761	Update the document to reflect the changes in build and deployment
#635	Document the incompatibility with Spark on Expressions
#702	Print output datatype for columnar shuffle on WebUI
#712	[Nested type] Optimize Array split and support nested Array
#732	[Nested type] Support Struct and Map nested types in Shuffle
#759	Add spark 3.1.2 & 3.1.3 as supported versions for 3.1.1 shim layer

Performance


#610	refactor on shuffled hash join/hash agg

Bugs Fixed


#755	GetAttrFromExpr unsupported issue when run TPCDS Q57
#764	add java.version to clarify jdk version
#774	Fix runtime issues on spark 3.2
#778	Failed to find include file while running code gen
#725	gazelle failed to run with spark local
#746	Improve memory allocation on native row to column operator
#770	There are cast exceptions and null pointer exceptions in spark-3.2
#772	ColumnarBatchScan name missing in UI for Spark321
#740	Handle exceptions like std::out_of_range in casting string to numeric types in WSCG
#727	Create table failed with TPCH partition dataset
#719	Wrong result on TPC-DS Q38, Q87
#705	Two unit tests failed on master branch

PRs


#834	[NSE-746]Fix memory allocation in row to columnar
#809	[NSE-746]Fix memory allocation in row to columnar
#817	[NSE-761] Update document to reflect spark 3.2.x support
#805	[NSE-772] Code refactor for ColumnarBatchScan
#802	[NSE-794] Fix count() with decimal value
#779	[NSE-778] Failed to find include file while running code gen
#798	[NSE-795] Fix a consecutive SMJ issue in wscg
#799	[NSE-791] fix xchg reuse in Spark321
#773	[NSE-770] [NSE-774] Fix runtime issues on spark 3.2
#787	[NSE-774] Fallback broadcast exchange for DPP to reuse
#763	[NSE-762] Add complex types support for ColumnarSortExec
#783	[NSE-782] prepare 1.3.1 release
#777	[NSE-732]Adding new config to enable/disable complex data type support
#776	[NSE-770] [NSE-774] Fix runtime issues on spark 3.2
#765	[NSE-764] declare java.version for maven
#767	[NSE-610] fix unit tests on SHJ
#760	[NSE-759] Add spark 3.1.2 & 3.1.3 as supported versions for 3.1.1 shim layer
#757	[NSE-746]Fix memory allocation in row to columnar
#724	[NSE-725] change the code style for ExecutorManager
#751	[NSE-745] Improve codegen check for expression
#742	[NSE-359] [NSE-273] Introduce shim layer to fix compatibility issues for gazelle on spark 3.1 & 3.2
#754	[NSE-755] Quick fix for ConverterUtils.getAttrFromExpr for TPCDS queries
#749	[NSE-732] Support Map complex type in Shuffle
#738	[NSE-610] hashjoin opt1
#733	[NSE-732] Support Struct complex type in Shuffle
#744	[NSE-740] fix codegen with out_of_range check
#743	[NSE-740] Catch out_of_range exception in casting string to numeric types in wscg
#735	[NSE-610] hashagg opt#2
#707	[NSE-710] Add rand expression support
#734	[NSE-727] Create table failed with TPCH partition dataset, patch 2
#715	[NSE-610] hashagg opt#1
#731	[NSE-727] Create table failed with TPCH partition dataset
#713	[NSE-712] Optimize Array split and support nested Array
#721	[NSE-719][backport]fix null check in SMJ
#720	[NSE-719] fix null check in SMJ
#718	Following NSE-702, fix for AQE enabled case
#691	[NSE-687]Try to upgrade log4j
#703	[NSE-702] Print output datatype for columnar shuffle on WebUI
#706	[NSE-705] Fallback R2C on unsupported cases
#657	[NSE-635] Add document to clarify incompatibility issues in expressions
#623	[NSE-602] Fix Array type shuffle split segmentation fault
#693	[NSE-692] JoinBenchmark is broken

OAP MLlib

Features


#189	Intel-MLlib does not support spark-3.2.1 version
#186	[Core] Support CDH versions
#187	Intel-MLlib does not support spark-3.1.3 version.
#180	[CI] Refactor CI and add code checks

Bugs Fixed


#202	[SDLe] Update oneAPI version to solve vulnerabilities
#171	[Core] detect if spark.dynamicAllocation.enabled is set true and exit gracefully
#185	[Naive Bayes] Big datasets cause out-of-memory errors.
#184	[Core] Fix code style issues
#179	[GPU][PCA] use distributed covariance as the first step for PCA
#178	[ALS] Fix error when converting buffer to CSRNumericTable
#177	[Naive Bayes] Fix error when converting Vector to CSRNumericTable

PRs


#203	[ML-202] Update oneAPI Base Toolkit version and prepare for OAP 1.3.1 release
#197	[ML-187]Support spark 3.1.3 and 3.2.0 and support CDH
#201	[ML-171] When oap mllib is enabled, spark.dynamicAllocation.enabled should be set to false.
#196	[ML-185]Select label and features columns and cache data
#195	[ML-184]Fix code style issues
#183	[ML-180][CI] Refactor CI and add code checks
#175	[ML-179][GPU] use distributed covariance as the first step for PCA
#182	[ML-178]fix als convert buffer to NumericTable
#176	[ML-177][Naive Bayes] Fix error when converting Vector to CSRNumericTable

Release 1.3.0

Gazelle Plugin

Features


#550	[ORC] Support ORC Format Reading
#188	Support Dockerfile
#574	implement native LocalLimit/GlobalLimit
#684	BufferedOutputStream causes massive futex system calls
#465	Provide option to rely on JVM GC to release Arrow buffers in Java
#681	Enable gazelle to support two math expressions: ceil & floor
#651	Set Hadoop 3.2 as default in pom.xml
#126	speed up codegen
#596	[ORC] Verify whether ORC file format supported complex data types in gazelle
#581	implement regex/trim/split expr
#473	Optimize the ArrowColumnarToRow performance
#647	Leverage buffered write in shuffle
#674	Add translate expression support
#675	Add instr expression support
#645	Add support to cast data in bool type to bigint type or string type
#463	version bump on 1.3
#583	implement get_json_object
#640	Disable compression for tiny payloads in shuffle
#631	Do not write schema in shuffle writing
#609	Implement date related expression like to_date, date_sub
#629	Improve codegen failure handling
#612	Add metric "prepare time" for shuffle writer
#576	columnar FROM_UNIXTIME
#589	[ORC] Add TPCDS and TPCH UTs for ORC Format Reading
#537	Increase partition number adaptively for large SHJ stages
#580	document how to create metadata for data source V1 based testing
#555	support batch size > 32k
#561	document the code generation behavior on driver, suggest to deploy driver on powerful server
#523	Support ArrayType in ArrowColumnarToRow operator
#542	Add rule to propagate local window for rank + filter pattern
#21	JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew
#512	Add strategy to force use of SHJ
#518	Arrow buffer cleanup: Support both manual release and auto release as a hybrid mode
#525	Support AQE in columnWriter
#516	Support External Sort in sort kernel
#503	Could you provide the detailed configuration used for the performance tests on the official website?
#501	Remove ArrowRecordBatchBuilder and its usages
#461	Support ArrayType in Gazelle
#479	Optimize sort materialization
#449	Refactor sort codegen kernel
#667	1.3 RC release
#352	Map/Array/Struct type support for Parquet reading in Arrow Data Source

Bugs Fixed


#660	support string builder in window output
#636	Remove log4j 1.2 Support for security issue
#540	reuse subquery in TPC-DS Q14a
#687	log4j 1.2.17 in spark-core
#617	Exceptions handling for stoi, stol, stof, stod in whole stage code gen
#653	Handle special cases for get_json_object in WSCG
#650	Scala test ArrowColumnarBatchSerializerSuite is failing
#642	Fail to cast unresolved reference to attribute reference
#599	data source unit tests are broken
#604	Sort with special projection key broken
#627	adding security instructions
#615	An exception in trying to cast attribute in getResultAttrFromExpr of ConverterUtils
#588	preallocated memory for shuffle split
#606	NullpointerException getting map values from ArrowWritableColumnVector
#569	CPU overhead on fine grain / concurrent off-heap acquire operations
#553	Support casting string type to types like int, bigint, float, double
#514	Fix the core dump issue in Q93 when enable columnar2row
#532	Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow
#534	Columnar SHJ: Error if probing with empty record batch
#529	Wrong build side may be chosen for SemiJoin when forcing use of SHJ
#504	Fix non-decimal window function unit test failures
#493	Three unit tests newly failed on master branch

PRs


#690	[NSE-667] backport patches to 1.3 branch
#688	[NSE-687]remove exclude log4j when running ut
#686	[NSE-400] Fix the bug for negative decimal data
#685	[NSE-684] BufferedOutputStream causes massive futex system calls
#680	[NSE-667] backport patches to 1.3 branch
#683	[NSE-400] fix leakage in row to column operator
#637	[NSE-400] Native Arrow Row to columnar support
#648	[NSE-647] Leverage buffered write in shuffle
#682	[NSE-681] Add floor & ceil expression support
#672	[NSE-674] Add translate expression support
#676	[NSE-675] Add instr expression support
#652	[NSE-651]Use Hadoop 3.2 as default hadoop.version
#666	[NSE-667] backport patches to 1.3 branch
#644	[NSE-645] Add support to cast bool type to bigint type & string type
#659	[NSE-650] Scala test ArrowColumnarBatchSerializerSuite is failing
#649	[NSE-660] fix window builder with string
#655	[NSE-617] Handle exception in cast expression from string to numeric types in WSCG
#654	[NSE-653] Add validity checking for get_json_object in WSCG
#641	[NSE-640] Disable compression for tiny payloads in shuffle
#646	[NSE-636]Remove log4j1 related unit tests
#488	[NSE-463] version bump to 1.3.0-SNAPSHOT
#639	[NSE-126] improve codegen with pre-compiled header
#638	[NSE-642] Correctly get ResultAttrFromExpr for sql with 'case when IN/AND/OR'
#632	[NSE-631] Do not write schema in shuffle writing
#633	[NSE-601] Fix an issue in the case of group by coalesce
#630	[NSE-629] improve codegen failure handling
#622	[NSE-609] Complete to_date expression support
#628	[NSE-627] Doc: adding security readme
#624	[NSE-609] Add support for date_sub expression
#619	[NSE-583] impl get_json_object in wscg
#614	[NSE-576] Support from_unixtime expression in the case that 'yyyyMMdd' format is required
#616	[NSE-615] Add tackling for ColumnarEqualTo type in getResultAttrFromExpr of ConverterUtils
#613	[NSE-612] Add metric "prepare time" for shuffle writer
#608	[NSE-602] don't enable columnar shuffle on unsupported data types
#601	[NSE-604] fix sort w/ proj keys
#607	[NSE-606] NullpointerException getting map values from ArrowWritableC…
#584	[NSE-583] implement get_json_object
#595	[NSE-576] fix from_unixtime
#582	[NSE-581]impl regexp_replace
#594	[NSE-588] config the pre-allocated memory for shuffle's splitter
#600	[NSE-599] fix datasource unit tests
#597	[NSE-596] Add complex data types validation for ORC file format in gazelle
#590	[NSE-569] CPU overhead on fine grain / concurrent off-heap acquire operations
#586	[NSE-581] Add trim, left trim, right trim support in expression
#578	[NSE-589] Add TPCDS and TPCH suite for Orc fileformat
#538	[NSE-537] Increase partition number adaptively for large SHJ stages
#587	[NSE-580] update doc on data source(DS V1/V2 usage)
#575	[NSE-574]implement columnar limit
#556	[NSE-555] using 32bit selection vector
#577	[NSE-576] implement columnar from_unixtime
#572	[NSE-561] refine docs on sample configurations and code generation behavior
#552	[NSE-553] Complete the support to cast string type to types like int, bigint, float, double
#543	[NSE-540] enable reuse subquery
#554	[NSE-207] change the fallback condition for Columnar Like
#559	[NSE-352] Map/Array/Struct type support for Parquet reading in Arrow Data Source
#551	[NSE-550] Support ORC Format Reading in Gazelle
#545	[NSE-542] Add rule to propagate local window for rank + filter pattern
#541	[NSE-207] improve the fix for join optimization
#495	[NSE-207] Fix NaN in Max and Min
#533	[NSE-532] Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow
#536	[NSE-207] Ignore tests causing test stop
#535	[NSE-534] Columnar SHJ: Error if probing with empty record batch
#531	[NSE-21] JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew
#530	[NSE-529] Wrong build side may be chosen for SemiJoin when forcing use of SHJ
#524	[NSE-523] Support ArrayType in ArrowColumnarToRow optimization
#513	[NSE-512] Add strategy to force use of SHJ
#519	[NSE-518] Arrow buffer cleanup: Support both manual release and auto …
#526	[NSE-525]Support AQE for ColumnarWriter
#517	[NSE-516]Support ExternalSorter to control memory footage
#515	[NSE-514] Fix the core dump issue in Q93 with V2 test
#509	Update README.md for performance result.
#511	[NSE-207] fix full UT test
#502	[NSE-501] Remove ArrowRecordBatchBuilder and its usages
#507	Previous PR removed this UT, fix here
#496	[NSE-461]columnar shuffle support for ArrayType
#480	[NSE-479] optimize sort materialization
#474	[NSE-473]Optimize ArrowColumnarToRow performance
#505	[NSE-504] Fix non-decimal window function unit test
#497	[NSE-493] Three unit tests newly failed on master branch (Python UDF Unit Tests)
#466	[NSE-465] POC release memory using GC
#462	[NSE-461][WIP] Support ArrayType in ArrowWritableColumnVector and ColumnarPandasUDF
#450	[NSE-449] Refactor codegen sort kernel
#471	[NSE-207] Enabling UT report
#445	[NSE-444]Support ArrowColumnarToRowExec when the root plan is ColumnarToRowExec
#447	[NSE-207] Fix date and timestamp functions

OAP MLlib

Features


#158	[GPU] Add convertToSyclHomogen for row merged table for kmeans and pca
#149	[GPU] Add check-gpu utility
#140	[Core] Refactor and support multiple Spark versions in single JAR
#137	[Core] Multiple improvements for build & deploy and integrate oneAPI 2021.4
#133	[Correlation] Add Correlation algorithm
#125	[GPU] Update for Kmeans and PCA

Bugs Fixed


#161	[SDLe][Snyk] Log4j 1.2.* issues brought from Spark when scanning 3rd-party components for vulnerabilities
#155	[POM] Update scala version to 2.12.15
#135	[Core] Fix ccl::gather and Add ccl::gatherv

PRs


#162	[ML-161] Excluding log4j 1.x dependency from Spark core to avoid log4…
#159	[GPU] Add convertToSyclHomogen for row merged table for kmeans and pca
#157	[ML-155] [POM] Update scala version to 2.12.15
#150	[ML-149][GPU] Add check-gpu utility
#144	[ML-151] enable Summarizer with OAP
#141	[Core] Refactor and support multiple Spark versions in single JAR
#139	[ML-137] [Core] Multiple improvements for build & deploy and integrate oneAPI 2021.4
#127	[ML-133][Correlation] Add Correlation algorithm
#126	[ML-125][GPU] Update for Kmeans and PCA

Release 1.2.0

Gazelle Plugin

Features


#394	Support ColumnarArrowEvalPython operator
#368	Encountered Hadoop version (3.2.1) conflict issue on AWS EMR-6.3.0
#375	Implement a series of datetime functions
#183	Add Date/Timestamp type support
#362	make arrow-unsafe allocator as the default
#343	configurable codegen opt level
#333	Arrow Data Source: CSV format support fix
#223	Add Parquet write support to Arrow data source
#320	Add build option to enable unsafe Arrow allocator
#337	UDF: Add test case for validating basic row-based udf
#326	Update Scala unit test to spark-3.1.1

Performance


#400	Optimize ColumnarToRow Operator in NSE.
#411	enable ccache on C++ code compiling

Bugs Fixed


#358	Running TPC DS all queries with native-sql-engine for 10 rounds will have performance degradation problems in the last few rounds
#481	JVM heap memory leak on memory leak tracker facilities
#436	Fix for Arrow Data Source test suite
#317	persistent memory cache issue
#382	Hadoop version conflict when supporting to use gazelle_plugin on Google Cloud Dataproc
#384	ColumnarBatchScanExec reading parquet failed on java.lang.IllegalArgumentException: not all nodes and buffers were consumed
#370	Failed to get time zone: NoSuchElementException: None.get
#360	Cannot compile master branch.
#341	build failed on v2 with -Phadoop-3.2

PRs


#489	[NSE-481] JVM heap memory leak on memory leak tracker facilities (Arrow Allocator)
#486	[NSE-475] restore coalescebatches operator before window
#482	[NSE-481] JVM heap memory leak on memory leak tracker facilities
#470	[NSE-469] Lazy Read: Iterator objects are not correctly released
#464	[NSE-460] fix decimal partial sum in 1.2 branch
#439	[NSE-433]Support pre-built Jemalloc
#453	[NSE-254] remove arrow-data-source-common from jar with dependency
#452	[NSE-254]Fix redundant arrow library issue.
#432	[NSE-429] TPC-DS Q14a/b get slowed down within setting spark.oap.sql.columnar.sortmergejoin.lazyread=true
#426	[NSE-207] Fix aggregate and refresh UT test script
#442	[NSE-254]Issue0410 jar size
#441	[NSE-254]Issue0410 jar size
#440	[NSE-254]Solve the redundant arrow library issue
#437	[NSE-436] Fix for Arrow Data Source test suite
#387	[NSE-383] Release SMJ input data immediately after being used
#423	[NSE-417] fix sort spill on inplace sort
#416	[NSE-207] fix left/right outer join in SMJ
#422	[NSE-421]Disable the wholestagecodegen feature for the ArrowColumnarToRow operator
#369	[NSE-417] Sort spill support framework
#401	[NSE-400] Optimize ColumnarToRow Operator in NSE.
#413	[NSE-411] adding ccache support
#393	[NSE-207] fix scala unit tests
#407	[NSE-403]Add Dataproc integration section to README
#406	[NSE-404]Modify repo name in documents
#402	[NSE-368]Update emr-6.3.0 support
#395	[NSE-394]Support ColumnarArrowEvalPython operator
#346	[NSE-317]fix columnar cache
#392	[NSE-382]Support GCP Dataproc 2.0
#388	[NSE-382]Fix Hadoop version issue
#385	[NSE-384] "Select count(*)" without group by results in error: java.lang.IllegalArgumentException: not all nodes and buffers were consumed
#374	[NSE-207] fix left anti join and support filter wo/ project
#376	[NSE-375] Implement a series of datetime functions
#373	[NSE-183] fix timestamp in native side
#356	[NSE-207] fix issues found in scala unit tests
#371	[NSE-370] Failed to get time zone: NoSuchElementException: None.get
#347	[NSE-183] Add Date/Timestamp type support
#363	[NSE-362] use arrow-unsafe allocator by default
#361	[NSE-273] Spark shim layer infrastructure
#364	[NSE-360] fix ut compile and travis test
#264	[NSE-207] fix issues found from join unit tests
#344	[NSE-343]allow to config codegen opt level
#342	[NSE-341] fix maven build failure
#324	[NSE-223] Add Parquet write support to Arrow data source
#321	[NSE-320] Add build option to enable unsafe Arrow allocator
#299	[NSE-207] fix unsupported types in aggregate
#338	[NSE-337] UDF: Add test case for validating basic row-based udf
#336	[NSE-333] Arrow Data Source: CSV format support fix
#327	[NSE-326] update scala unit tests to spark-3.1.1

OAP MLlib

Features


#110	Update isOAPEnabled for Kmeans, PCA & ALS
#108	Update PCA GPU, LiR CPU and Improve JAR packaging and libs loading
#93	[GPU] Add GPU support for PCA
#101	[Release] Add version update scripts and improve scripts for examples
#76	Reorganize Spark version specific code structure
#82	[Tests] Add NaiveBayes test and refactors

Bugs Fixed


#119	[SDLe][Klocwork] Security vulnerabilities found by static code scan
#121	Memory is not freed after the training stage when using Intel-MLlib to run PCA and K-means algorithms.
#122	Cannot run K-means and PCA algorithm with oap-mllib on Google Dataproc
#123	[Core] Improve locality handling for native lib loading
#116	Cannot run ALS algorithm with oap-mllib due to the commit "2883d3447d07feb55bf5d4fee8225d74b0b1e2b1"
#114	[Core] Improve native lib loading
#94	Failed to run KMeans workload with oap-mllib in JLSE
#95	Some shared libs are missing in 1.1.1 release
#105	[Core] crash when libfabric version conflict
#98	[SDLe][Klocwork] Security vulnerabilities found by static code scan
#88	[Test] Fix ALS Suite "ALS shuffle cleanup standalone"
#86	[NaiveBayes] Fix isOAPEnabled and add multi-version support

PRs


#124	[ML-123][Core] Improve locality handling for native lib loading
#118	[ML-116] use getOneCCLIPPort and fix lib loading
#115	[ML-114] [Core] Improve native lib loading
#113	[ML-110] Update isOAPEnabled for Kmeans, PCA & ALS
#112	[ML-105][Core] Fix crash when libfabric version conflict
#111	[ML-108] Update PCA GPU, LiR CPU and Improve JAR packaging and libs loading
#104	[ML-93][GPU] Add GPU support for PCA
#103	[ML-98] [Release] Clean Service.java code
#102	[ML-101] [Release] Add version update scripts and improve scripts for examples
#90	[ML-88][Test] Fix ALS Suite "ALS shuffle cleanup standalone"
#87	[ML-86][NaiveBayes] Fix isOAPEnabled and add multi-version support
#83	[ML-82] [Tests] Add NaiveBayes test and refactors
#75	[ML-53] [CPU] Add Linear & Ridge Regression
#77	[ML-76] Reorganize multiple Spark version support code structure
#68	[ML-55] [CPU] Add Naive Bayes
#64	[ML-42] [PIP] Misc improvements and refactor code
#62	[ML-30][Coding Style] Add code style rules & scripts for Scala, Java and C++

SQL DS Cache

Features


#155	reorg to support profile based multi spark version

Bugs Fixed


#190	The function of vmem-cache and guava-cache should not be associated with arrow.
#181	[SDLe]Vulnerabilities scanned by Snyk

PRs


#182	[SQL-DS-CACHE-181][SDLe]Fix Snyk code scan issues
#191	[SQL-DS-CACHE-190]put plasma detector in separate object to avoid unnecessary dependency of arrow
#189	[SQL-DS-CACHE-188][POAE7-1253] improvement of fallback from plasma cache to simple cache
#157	[SQL-DS-CACHE-155][POAE7-1187]reorg to support profile based multi spark version

PMem Shuffle

Bugs Fixed


#46	Cannot run Terasort with pmem-shuffle of branch-1.2
#43	Rpmp cannot be compiled due to the lack of boost header file.

PRs


#51	[PMEM-SHUFFLE-50] Remove description about download submodules manually since they can be downloaded automatically.
#49	[PMEM-SHUFFLE-48] Fix the bug about mapstatus tracking and add more connections for metastore.
#47	[PMEM-SHUFFLE-46] Fix the bug that off-heap memory is over used in shuffle reduce stage.
#40	[PMEM-SHUFFLE-39] Fix the bug that pmem-shuffle without RPMP fails to pass Terasort benchmark due to latest patch.
#38	[PMEM-SHUFFLE-37] Add start-rpmp.sh and stop-rpmp.sh
#33	[PMEM-SHUFFLE-28]Add RPMP with HA support and integrate it with Spark3.1.1
#27	[PMEM-SHUFFLE] Change artifact name to make it compatible with naming…

Remote Shuffle

Bugs Fixed


#24	Enhance executor memory release

PRs


#25	[REMOTE-SHUFFLE-24] Enhance executor memory release

Release 1.1.1

Native SQL Engine

Features


#304	Upgrade to Arrow 4.0.0
#285	ColumnarWindow: Support Date/Timestamp input in MAX/MIN
#297	Disable incremental compiler in CI
#245	Support columnar rdd cache
#276	Add option to switch Hadoop version
#274	Comment to trigger tpc-h RAM test
#256	CI: do not run ram report for each PR

Bugs Fixed


#325	java.util.ConcurrentModificationException: mutation occurred during iteration
#329	numPartitions are not the same
#318	fix Spark 311 on data source v2
#311	Build reports errors
#302	test on v2 failed due to an exception
#257	different version of slf4j-log4j
#293	Fix BHJ loss if key = 0
#248	arrow dependency must put after arrow installation

PRs


#332	[NSE-325] fix incremental compile issue with 4.5.x scala-maven-plugin
#335	[NSE-329] fix out partitioning in BHJ and SHJ
#328	[NSE-318]check schema before reuse exchange
#307	[NSE-304] Upgrade to Arrow 4.0.0
#312	[NSE-311] Build reports errors
#272	[NSE-273] support spark311
#303	[NSE-302] fix v2 test
#306	[NSE-304] Upgrade to Arrow 4.0.0: Change basic GHA TPC-H test target …
#286	[NSE-285] ColumnarWindow: Support Date input in MAX/MIN
#298	[NSE-297] Disable incremental compiler in GHA CI
#291	[NSE-257] fix multiple slf4j bindings
#294	[NSE-293] fix unsafemap with key = '0'
#233	[NSE-207] fix issues found from aggregate unit tests
#246	[NSE-245]Adding columnar RDD cache support
#289	[NSE-206]Update installation guide and configuration guide.
#277	[NSE-276] Add option to switch Hadoop version
#275	[NSE-274] Comment to trigger tpc-h RAM test
#271	[NSE-196] clean up configs in unit tests
#258	[NSE-257] fix different versions of slf4j-log4j12
#259	[NSE-248] fix arrow dependency order
#249	[NSE-241] fix hashagg result length
#255	[NSE-256] do not run ram report test on each PR

SQL DS Cache

Features


#118	port to Spark 3.1.1

Bugs Fixed


#121	OAP Index creation stuck issue

PRs


#132	Fix SampleBasedStatisticsSuite UnitTest case
#122	[ sql-ds-cache-121] Fix Index stuck issues
#119	[SQL-DS-CACHE-118][POAE7-1130] port sql-ds-cache to Spark3.1.1

OAP MLlib

Features


#26	[PIP] Support Spark 3.0.1 / 3.0.2 and upcoming 3.1.1

PRs


#39	[ML-26] Build for different spark version by -Pprofile

PMem Spill

Features


#34	Support vanilla spark 3.1.1

PRs


#41	[PMEM-SPILL-34][POAE7-1119]Port RDD cache to Spark 3.1.1 as separate module

PMem Common

Features


#10	add -mclflushopt flag to enable clflushopt for gcc
#8	use clflushopt instead of clflush

PRs


#11	[PMEM-COMMON-10][POAE7-1010]Add -mclflushopt flag to enable clflushop…
#9	[PMEM-COMMON-8][POAE7-896]use clflush optimize version for clflush

PMem Shuffle

Features


#15	Doesn't work with Spark3.1.1

PRs


#16	[pmem-shuffle-15] Make pmem-shuffle support Spark3.1.1

Remote Shuffle

Features


#18	upgrade to Spark-3.1.1
#11	Support DAOS Object Async API

PRs


#19	[REMOTE-SHUFFLE-18] upgrade to Spark-3.1.1
#14	[REMOTE-SHUFFLE-11] Support DAOS Object Async API

Release 1.1.0

Native SQL Engine

Features


#261	ArrowDataSource: Add S3 Support
#239	Adopt ARROW-7011
#62	Support Arrow's Build from Source and Package dependency library in the jar
#145	Support decimal in columnar window
#31	Decimal data type support
#128	Support Decimal in Aggregate
#130	Support decimal in project
#134	Update input metrics during reading
#120	Columnar window: Reduce peak memory usage and fix performance issues
#108	Add end-to-end test suite against TPC-DS
#68	Adaptive compression select in Shuffle.
#97	optimize null check in codegen sort
#29	Support multiple-key sort without codegen
#75	Support HashAggregate in ColumnarWSCG
#73	improve columnar SMJ
#51	Decimal fallback
#38	Supporting expression as join keys in columnar SMJ
#27	Support REUSE exchange when DPP enabled
#17	ColumnarWSCG further optimization

Performance


#194	Arrow Parameters Update when compiling Arrow
#136	upgrade to arrow 3.0
#103	reduce codegen in multiple-key sort
#90	Refine HashAggregate to do everything in CPP

Bugs Fixed


#278	fix arrow dep in 1.1 branch
#265	TPC-DS Q67 failed with memmove exception in native split code.
#280	CMake version check
#241	TPC-DS q67 failed for XXH3_hashLong_64b_withSecret.constprop.0+0x180
#262	q18 has different digits compared with vanilla spark
#196	clean up options for native sql engine
#224	update 3rd party libs
#227	fix vulnerabilities from Klocwork
#237	Add ARROW_CSV=ON to default C++ build commands
#229	Fix the deprecated code warning in shuffle_split_test
#119	consolidate batch size
#217	TPC-H query20 result not correct when use decimal dataset
#211	IndexOutOfBoundsException during running TPC-DS Q2
#167	Cannot successfully run q.14a.sql and q14b.sql when using double format for TPC-DS workload.
#191	libarrow.so and libgandiva.so not copy into the tmp directory
#179	Unable to find Arrow headers during build
#153	Fix incorrect queries after enabled Decimal
#173	fix the incorrect result of q69
#48	unit tests for c++ are broken
#101	ColumnarWindow: Remove obsolete debug code
#100	Incorrect result in Q45 w/ v2 bhj threshold is 10MB sf500
#81	Some ArrowVectorWriter implementations don't implement setNulls method
#82	Incorrect result in TPCDS Q72 SF1536
#70	Duplicate IsNull check in codegen sort
#64	Memleak in sort when SMJ is disabled
#58	Issues when running tpcds with DPP enabled and AQE disabled
#52	memory leakage in columnar SMJ
#53	Q24a/Q24b SHJ tail task took about 50 secs in SF1500
#42	reduce columnar sort memory footprint
#40	columnar sort codegen fallback to executor side
#1	columnar whole stage codegen failed due to empty results
#23	TPC-DS Q8 failed due to unsupported operation in columnar sortmergejoin
#22	TPC-DS Q95 failed due in columnar wscg
#4	columnar BHJ failed on new memory pool
#5	columnar BHJ failed on partitioned table with prefercolumnar=false

PRs


#288	[NSE-119] clean up on comments
#282	[NSE-280]fix cmake version check
#281	[NSE-280] bump cmake to 3.16
#279	[NSE-278]fix arrow dep in 1.1 branch
#268	[NSE-186] backport to 1.1 branch
#266	[NSE-265] Reserve enough memory before UnsafeAppend in builder
#270	[NSE-261] ArrowDataSource: Add S3 Support
#263	[NSE-262] fix remainder loss in decimal divide
#215	[NSE-196] clean up native sql options
#231	[NSE-176]Arrow install order issue
#242	[NSE-224] update third party code
#240	[NSE-239] Adopt ARROW-7011
#238	[NSE-237] Add ARROW_CSV=ON to default C++ build commands
#230	[NSE-229] Fix the deprecated code warning in shuffle_split_test
#225	[NSE-227]fix issues from codescan
#219	[NSE-217] fix missing decimal check
#212	[NSE-211] IndexOutOfBoundsException during running TPC-DS Q2
#187	[NSE-185] Avoid unnecessary copying when simply projecting on fields
#195	[NSE-194]Turn on several Arrow parameters
#189	[NSE-153] Following NSE-153, optimize fallback conditions for columnar window
#192	[NSE-191]Fix issue0191 for .so file copy to tmp.
#181	[NSE-179]Fix arrow include directory not include when using ARROW_ROOT
#175	[NSE-153] Fix window results
#174	[NSE-173] fix incorrect result of q69
#172	[NSE-62]Fixing issue0062 for package arrow dependencies in jar with refresh2
#171	[NSE-170]improve sort shuffle code
#165	[NSE-161] adding format check
#166	[NSE-130] support decimal round and abs
#164	[NSE-130] fix precision loss in divide w/ decimal type
#159	[NSE-31] fix SMJ divide with decimal
#156	[NSE-130] fix overflow and precision loss
#152	[NSE-86] Merge Arrow Data Source
#154	[NSE-153] Fix incorrect queries after enabled Decimal
#151	[NSE-145] Support decimal in columnar window
#129	[NSE-128]Support Decimal in Aggregate/HashJoin
#131	[NSE-130] support decimal in project
#107	[NSE-136]upgrade to arrow 3.0.0
#135	[NSE-134] Update input metrics during reading
#121	[NSE-120] Columnar window: Reduce peak memory usage and fix performance issues
#112	[NSE-97] optimize null check and refactor sort kernels
#109	[NSE-108] Add end-to-end test suite against TPC-DS
#69	[NSE-68][Shuffle] Adaptive compression select in Shuffle.
#98	[NSE-97] remove isnull when null count is zero
#102	[NSE-101] ColumnarWindow: Remove obsolete debug code
#105	[NSE-100]Fix an incorrect result error when using SHJ in Q45
#91	[NSE-90]Refactor HashAggregateExec and CPP kernels
#79	[NSE-81] add missing setNulls methods in ArrowWritableColumnVector
#44	[NSE-29]adding non-codegen framework for multiple-key sort
#76	[NSE-75]Support ColumnarHashAggregate in ColumnarWSCG
#83	[NSE-82] Fix Q72 SF1536 incorrect result
#72	[NSE-51] add more datatype fallback logic in columnar operators
#60	[NSE-48] fix c++ unit tests
#50	[NSE-45] BHJ memory leak
#74	[NSE-73]using data ref in multiple keys based SMJ
#71	[NSE-70] remove duplicate IsNull check in sort
#65	[NSE-64] fix memleak in sort when SMJ is disabled
#59	[NSE-58]Fix empty input issue when DPP enabled
#7	[OAP-1846][oap-native-sql] add more fallback logic
#57	[NSE-56]ColumnarSMJ: fallback on full outer join
#55	[NSE-52]Columnar SMJ: fix memory leak by closing stream batches properly
#54	[NSE-53]Partial fix Q24a/Q24b tail SHJ task materialization performance issue
#47	[NSE-17]TPCDS Q72 optimization
#39	[NSE-38]ColumnarSMJ: support expression as join keys
#43	[NSE-42] early release sort input
#33	[NSE-32] Use Spark managed spill in columnar shuffle
#41	[NSE-40] fixes driver failing to do sort codegen
#28	[NSE-27]Reuse exchange to optimize DPP performance
#36	[NSE-1]fix columnar wscg on empty recordbatch
#24	[NSE-23]fix columnar SMJ fallback
#26	[NSE-22]Fix w/DPP issue when inside wscg smj both sides are smj
#18	[NSE-17] smjwscg optimization:
#3	[NSE-4]fix columnar BHJ on new memory pool
#6	[NSE-5][SCALA] Fix ColumnarBroadcastExchange didn't fallback issue w/ DPP

SQL DS Cache

Features


#36	HCFS doc for Spark
#38	update Plasma dependency for Plasma-based-cache module
#14	Add HCFS module
#17	replace arrow-plasma dependency for hcfs module

Bugs Fixed


#62	Upgrade hadoop dependencies in HCFS

PRs


#83	[SQL-DS-CACHE-82][SDLe]Upgrade Jetty version
#77	[SQL-DS-CACHE-62][POAE7-984] upgrade hadoop version to 3.3.0
#56	[SQL-DS-CACHE-47]Add plasma native get timeout
#37	[SQL-DS-CACHE-36][POAE7-898]HCFS docs for OAP 1.1
#39	[SQL-DS-CACHE-38][POAE7-892]update Plasma dependency
#18	[SQL-DS-CACHE-17][POAE7-905]replace intel-arrow with apache-arrow v3.0.0
#13	[SQL-DS-CACHE-14][POAE7-847] Port HCFS to OAP
#16	[SQL-DS-CACHE-15][POAE7-869]Refactor original code to make it a sub-module

OAP MLlib

Features


#35	Restrict printNumericTable to first 10 eigenvalues with first 20 dimensions
#33	Optimize oneCCL port detecting
#28	Use getifaddrs to get host ips for oneCCL kvs
#12	Improve CI and add pseudo cluster testing
#31	Print time duration for each PCA step
#13	Add ALS with new oneCCL APIs
#18	Auto detect KVS port for oneCCL to avoid port conflict
#10	Porting Kmeans and PCA to new oneCCL API

Bugs Fixed


#43	[Release] Error when installing intel-oneapi-dal-devel-2021.1.1 intel-oneapi-tbb-devel-2021.1.1
#46	[Release] Meet hang issue when running PCA algorithm.
#48	[Release] No performance benefit when using Intel-MLlib to run ALS algorithm.
#25	Fix oneCCL KVS port auto detect and improve logging

PRs


#51	[ML-50] Merge #47 and prepare for OAP 1.1
#49	Revert "[ML-41] Revert to old oneCCL and Prepare for OAP 1.1"
#47	[ML-44] [PIP] Update to oneAPI 2021.2 and Rework examples for validation
#40	[ML-41] Revert to old oneCCL and Prepare for OAP 1.1
#36	[ML-35] Restrict printNumericTable to first 10 eigenvalues with first 20 dimensions
#34	[ML-33] Optimize oneCCL port detecting
#20	[ML-12] Improve CI and add pseudo cluster testing
#32	[ML-31] Print time duration for each PCA step
#14	[ML-13] Add ALS with new oneCCL APIs
#24	[ML-25] Fix oneCCL KVS port auto detect and improve logging
#19	[ML-18] Auto detect KVS port for oneCCL to avoid port conflict

PMem Spill

Bugs Fixed


#22	[SDLe][Snyk]Upgrade Jetty version to fix vulnerability scanned by Snyk
#13	The compiled code failed because the variable name was not changed

PRs


#27	[PMEM-SPILL-22][SDLe]Upgrade Jetty version
#21	[POAE7-961] fix null pointer issue when offheap enabled.
#18	[POAE7-858] disable RDD cache related PMem initialization as default and add PMem related logic in SparkEnv
#19	[PMEM-SPILL-20][POAE7-912]add vanilla SparkEnv.scala for future update
#15	[POAE7-858] port memory extension options to OAP 1.1
#12	Change the variable name so that the passed parameters are correct
#10	Fixing one pmem path on AppDirect mode may cause the pmem initialization path to be empty Path

PMem Shuffle

Features


#7	Enable running in fsdax mode

Bugs Fixed


#10	[pmem-shuffle] There are potential issues reported by Klocwork.

PRs


#13	[PMEM-SHUFFLE-10] Fix potential issues reported by Klocwork for branch 1.1.
#6	[PMEM-SHUFFLE-7] enable fsdax mode in pmem-shuffle

Remote Shuffle

Features


#6	refactor shuffle-daos by abstracting shuffle IO for supporting both synchronous and asynchronous DAOS Object API
#4	check-in remote shuffle based on DAOS Object API

Bugs Fixed


#12	[SDLe][Snyk]Upgrade org.mock-server:mockserver-netty to fix vulnerability scanned by Snyk

PRs


#13	[REMOTE-SHUFFLE-12][SDle][Snyk]Upgrade org.mock-server:mockserver-net…
#5	check-in remote shuffle based on DAOS Object API

Release 1.0.0

Features


#1823	[oap-native-sql][doc] Spark Native SQL Engine installation guide is obsolete and thus broken.
#1545	[oap-data-source][arrow] Add metric: output_batches
#1588	[OAP-CACHE] Make Parquet file splitable
#1337	[oap-cache] Discard OAP data format
#1679	[OAP-CACHE]Remove the code related to reading and writing OAP data format
#1680	[OAP-CACHE]Decouple spark code includes FileFormatDataWriter, FileFormatWriter and OutputWriter
#1846	[oap-native-sql] spark sql unit test
#1811	[OAP-cache]provide one-step starting scripts like plasma-server redis-server
#1519	[oap-native-sql] upgrade cmake
#1873	[oap-native-sql] Columnar shuffle split variable length use UnsafeAppend
#1835	[oap-native-sql] Support ColumnarBHJ to Build and Broadcast HashRelation in driver side
#1848	[OAP-CACHE]Decouple spark code include OneApplicationResource.scala
#1824	[OAP-CACHE]Decouple spark code includes DataSourceScanExec.scala.
#1838	[OAP-CACHE]Decouple spark code includes VectorizedColumnReader.java, VectorizedPlainValuesReader.java, VectorizedRleValuesReader.java and OnHeapColumnVector.java
#1839	[oap-native-sql] Add prefetch to columnar shuffle split
#1756	[Intel MLlib] Add Kmeans "tolerance" support and test cases
#1818	[OAP-Cache]Make Spark webUI OAP Tab more user friendly
#1831	[oap-native-sql] ColumnarWindow: Support reusing same window spec in multiple functions
#1653	[SQL Data Source Cache]Consistency issue on "enable" and "enabled" configuration
#1765	[oap-native-sql] Support WSCG in nativesql
#1517	[oap-native-sql] implement SortMergeJoin
#1535	[oap-native-sql] Add ColumnarWindowExec
#1654	[oap-native-sql] Columnar shuffle TPCDS enabling
#1700	[oap-native-sql] Support inside join condition project
#1717	[oap-native-sql] support null in columnar literal and subquery
#1704	[oap-native-sql] Add ColumnarUnion and ColumnarExpand
#1647	[oap-native-sql] row to columnar for decimal
#1638	[oap-native-sql] adding full TPC-DS support
#1498	[oap-native-sql] stddev_samp support
#1547	[oap-native-sql] adding metrics for input/output batches

Performance


#1956	[OAP-MLlib]Cannot get 5x performance benefit comparing with vanilla spark.
#1955	[OAP-CACHE] Plasma shows lower performance comparing with vanilla spark.
#2023	[OAP-MLlib] Use oneAPI official release instead of beta versions
#1829	[oap-native-sql] Optimize columnar shuffle and option to use AVX512
#1734	[oap-native-sql] use non-codegen for sort with one key
#1706	[oap-native-sql] Optimize columnar shuffle write

Bugs Fixed


#2054	[OAP-MLlib] Failed to run Intel mllib after updating the version of oneapi.
#2012	[SQL Data Source Cache] The task will be suspended when using plasma cache.
#1640	[SQL Data Source Cache] The task will be suspended when using plasma cache and starting 2 executors per worker.
#2028	[OAP-Cache]When using Plasma Spark webUI OAP Tab cache metrics are not right
#1979	[SDLe][native-sql-engine] Issues from Static Code Analysis with Klocwork need to be fixed
#1938	[oap-native-sql] Stability test failed when running TPCH for 10 rounds.
#1924	[OAP-CACHE] Decouple heartbeat message and use conf to determine whether to report locality information
#1937	[rpmem-shuffle] Cannot pass q64.sql of TPC-DS when enable RPmem shuffle.
#1951	[SDLe][PMem-Shuffle]Specify Scala version above 2.12.4 in pom.xml
#1921	[SDLe][rpmem-shuffle] The master branch and branch-1.0-spark-3.0 can't pass BDBA analysis with libsqlitejdbc dependency.
#1743	[oap-native-sql] Error not reported when creating CodeGenerator instance
#1864	[oap-native-sql] hash conflict in hashagg
#1934	[oap-native-sql] backport to 1.0
#1929	[oap-native-sql] memleak in non-codegen aggregate
#1907	[OAP-cache]Cannot find the class of redis-client
#1888	[oap-native-sql] Add hash collision check for all HashJoins and hashAggr
#1903	[oap-native-sql] BHJ related UT fix
#1881	[oap-native-sql] Fix split use avx512
#1742	[oap-native-sql] SortArraysToIndicesKernel: incorrect null ordering with multiple sort keys
#1553	[oap-native-sql] TPCH-Q7 fails in throughput tests
#1854	[oap-native-sql] Fix columnar shuffle file not deleted
#1844	[oap-native-sql] Fix columnar shuffle spilled file not deleted
#1580	[oap-native-sql] Hash Collision in multiple keys scenario
#1754	[Intel MLlib] Improve LibLoader creating temp dir name with UUID
#1815	[oap-native-sql] Memory management: Error on task end if there are unclosed child allocators
#1808	[oap-native-sql] ColumnarWindow: Memory leak on converting input/output batches
#1806	[oap-native-sql] Fix Columnar Shuffle Memory Leak
#1783	[oap-native-sql] ColumnarWindow: Rank() returns wrong result when input row number >= 65536
#1776	[oap-native-sql] memory leakage in native code
#1760	[oap-native-sql] fix columnar sorting on string
#1733	[oap-native-sql]TPCH Q18 memory leakage
#1694	[oap-native-sql] TPC-H q15 failed for ConditionedProbeArraysVisitorImpl MakeResultIterator does not support dependency type other than Batch
#1682	[oap-native-sql] fix aggregate without codegen
#1707	[oap-native-sql] Fix collect batch metric
#1642	[oap-native-sql] Support expression key in Join
#1669	[oap-native-sql] TPCH Q1 results is not correct w/ hashagg codegen off
#1629	[oap-native-sql] clean up building steps
#1602	[oap-native-sql] rework copyfromjar function
#1599	[oap-native-sql] Columnar BHJ fail on TPCH-Q15
#1567	[oap-native-sql] Spark thrift-server does not honor LIBARROW_DIR env
#1541	[oap-native-sql] TreeNode children not replaced by columnar operators

PRs


#2056	[OAP-2054][OAP-MLlib] Fix oneDAL libJavaAPI.so packaging for oneAPI 2021.1 production release
#2039	[OAP-2023][OAP-MLlib] Switch to oneAPI 2021.1.1 official release for OAP 1.0
#2043	[OAP-1981][OAP-CACHE][POAE7-617]fix binary cache core dump issue
#2002	[OAP-2001][oap-native-sql]fix coding style
#2035	[OAP-2028][OAP-cache][POAE7-635] Fix set concurrent access bug
#2037	[OAP-1640][OAP-CACHE][POAE7-593]Fix plasma hang due to threshold
#2036	[OAP-1955][OAP-CACHE][POAE7-660]preferLocation low hit rate fix master branch
#2013	[OAP-CACHE][POAE7-628]port missing commits from branch 0.8/0.9
#2015	[OAP-2016] fix klocwork issues in oap-common/oap-spark
#2022	[OAP-1980][rpmem-shuffle] Fix Klocwork issues for spark3.x version
#2011	[OAP-2010][oap-native-sql] Add abs support in wscg
#1996	[OAP-1998][oap-native-sql] Add support to do numa binding for Columnar Operations
#2004	[OAP-2012][OAP-CACHE][POAE7-635]bug fix: plasma hang - use java thread-safe set
#1988	[OAP-1983][oap-native-sql] Fix Q38 and Q87 when unsafeRow contains null
#1976	[OAP-1983][oap-native-sql] Fix hashCheck performance issue
#1970	[OAP-1947][oap-native-sql][C++] reduce sort kernel memory footprint
#1961	[OAP-1924][OAP-CACHE]Decouple heartbeat message and use conf to determine whether to report locality information for branch branch-1.0-spark-3.x
#1982	[OAP-1981][OAP-CACHE][POAE7-617]Bug fix binary docache
#1952	[OAP-1951][PMem-Shuffle][SDLe]Specify Scala version in pom.xml
#1919	[OAP-1918][OAP-CACHE][POAE7-563]bug fix: plasma get an invalid value
#1589	[OAP-1588][OAP-CACHE][POAE7-363] Make Parquet splitable
#1954	[OAP-1884][OAP-dev]Small fix for arrow build in prepare_oap_env.sh.
#1933	[OAP-1934][oap-native-sql]Backport NativeSQL code to 1.0
#1889	[OAP-1888][oap-native-sql]Add hash collision check for all HashJoins and hashAggr
#1904	[OAP-1903][oap-native-sql] Fix Local Mode BHJ related UT fail issue
#1916	[OAP-1846][oap-native-sql] clean up travis test
#1923	[OAP-1921][rpmem-shuffle] For BDBA analysis to exclude unused library
#1890	[OAP-1846][oap-native-sql] add script for running unit test
#1905	[OAP-1813][POAE7-555] [OAP-CACHE] package redis related dependency
#1908	[OAP-1884][OAP-dev]Add cxx-compiler in oap conda recipes for native-sql.
#1901	[OAP-1884][OAP-dev]Add c-compiler in oap conda recipes for native-sql.
#1895	[OAP-1884][OAP-dev] Checkout arrow branch in case arrow in other branch
#1876	[OAP-1875]Generating changelog automatically for new releases
#1812	[OAP-1811][OAP-cache][POAE7-486]add sbin folder
#1882	[OAP-1881][oap-native-sql] Fix split use avx512
#1847	[OAP-1846][oap-native-sql] add unit tests from spark to native sql
#1836	[OAP-1835][oap-native-sql] Support ColumnarBHJ to build and broadcast hashrelation
#1885	[OAP-1884][OAP-dev]Add oap-mllib to parent pom and fix error when git clone oneccl.
#1868	[OAP-1653][OAP-Cache]Modify enabled and enable compatibility check
#1853	[OAP-1852][oap-native-sql] Memory Management: Use Arrow C++ memory po…
#1859	[OAP-1858][OAP-cache][POAE7-518] Decouple FilePartition.scala
#1857	[OAP-1833][oap-native-sql] Fix HashAggr hasNext won't stop issue
#1855	[OAP-1854][oap-native-sql] Fix columnar shuffle file not deleted
#1840	[OAP-1839][oap-native-sql] Add prefetch to columnar shuffle split
#1843	[OAP-1842][OAP-dev]Add arrow conda build action job.
#1849	[OAP-1848][SQL Data Source Cache] Decouple OneApplicationResource.scala
#1837	[OAP-1838][SQL Data Source Cache] Decouple VectorizedColumnReader.java, VectorizedPlainValuesReader.java, VectorizedRleValuesReader.java and OnHeapColumnVector.java.
#1757	[OAP-1756][Intel MLlib] Add Kmeans "tolerance" support and test cases
#1845	[OAP-1844][oap-native-sql] Fix columnar shuffle spilled file not deleted
#1827	[OAP-1818][SQL-Data-Source-Cache]Modify Spark webUI OAP Tab expressio…
#1832	[OAP-1831][oap-native-sql] ColumnarWindow: Support reusing same windo…
#1834	[OAP-1833][oap-native-sql][Scala] fix CoalesceBatchs after HashAgg
#1830	[OAP-1829][oap-native-sql] Optimize columnar shuffle and option to use AVX-512
#1803	[OAP-1751][oap-native-sql]fix sort on TPC-DS
#1755	[OAP-1754][Intel MLlib] Improve LibLoader creating temp dir name with UUID
#1826	[OAP-1825] disable pmemblk test
#1802	[OAP-1653][OAP-Cache]Keep consistency on 'enabled' of OapConf configu…
#1810	[OAP-1771]Fix README for Arrow Data Source
#1816	[OAP-1815][oap-native-sql] Memory management: Error on task end if th…
#1809	[OAP-1808][oap-native-sql] ColumnarWindow: Memory leak on converting input/output batches
#1467	[OAP-1457][oap-native-sql] Reserve Spark off-heap execution memory after buffer allocation
#1807	[OAP-1806][oap-native-sql] Fix Columnar Shuffle Memory Leak
#1788	[OAP-1765][oap-native-sql] Fix for dropped CoalesceBatches before ColumnarBroadcastExchange
#1799	[OAP-CACHE][OAP-1690][POAE7-430] Cache backend fall back detect bug fix branch master
#1744	[OAP-CACHE][OAP-1748][POAE7-462] Enable externalDB to store CacheMetaInfo branch master
#1787	[OAP-1786][oap-native-sql] ColumnarWindow: Avoid unnecessary mem copies
#1773	[POAE7-471]Handle oap-common build issue about PMemKV
#1782	[OAP-1631]Update compile scripts from 0.9
#1785	[OAP-1765][oap-native-sql] Support WSCG for nativesql(PART 2)
#1781	[OAP-1765][oap-native-sql] fix codegen for SMJ and HashAgg
#1775	[OAP-1776][oap-native-sql]fix sort memleak
#1766	[OAP-1765][oap-native-sql] Support WSCG for nativesql and use non-codegen join for remainings
#1774	[OAP-1631]Add prepare_oap_env.sh.
#1769	[OAP-1768][POAE7-163][OAP-SPARK] Integrate block manager with chunk api
#1763	[OAP-1759][oap-native-sql] ColumnarWindow: Add execution metrics
#1656	[OAP-1517][oap-native-sql] Improve SortMergeJoin Part2
#1761	[oap-native-sql] quick fix sort on string by fallback to row
#1536	[OAP-1535][oap-native-sql] Add ColumnarWindowExec
#1735	[OAP-1734][oap-native-sql]use non-codegen for sort with single key
#1747	[OAP-1741][rpmem-shuffle]To make java side load native library from jar directly
#1725	[OAP-1727][POAE7-358] Spark integration: Memory Spill to PMem
#1738	[OAP-1733][oap-native-sql][Scala] fix mem leak
#1701	[OAP-1700][oap-native-sql] support join-inside condition project
#1736	[oap-1727][POAE7-358] Add native spark files for memory spill module
#1719	[oap-common][POAE7-347]Stream API for PMem storage store
#1723	[OAP-1679][OAP-CACHE] Remove the code related to reading and writing OAP data format
#1716	[OAP-1717][oap-native-sql] support null in columnar literal and subquery
#1713	[OAP-1712] [OAP-SPARK] Remove file change list from dev directory
#1711	[OAP-1694][oap-native-sql][Scala] fix hash join w/ empty batch
#1710	[OAP-1706][oap-native-sql] Optimize shuffle write
#1705	[OAP-1704][oap-native-sql] Support ColumnarUnion and ColumnarExpand
#1683	[OAP-1682][oap-native-sql] fix aggregate without codegen
#1708	[OAP-1707][oap-native-sql] Fix collect batch metric
#1675	[OAP-1651][oap-native-sql] Adding fallback rules for join/shuffle
#1674	[OAP-1673][oap-native-sql] Adding native double round function
#1632	[OAP-1631][Doc] Add Commit Message Requirements
#1672	[OAP-1610][Intel-MLlib]Upgrade the mahout-hdfs to version 14.1
#1641	[OAP-1651][OAP-1642][oap-native-sql] support TPCDS w/ AQE
#1670	[OAP-1669][oap-native-sql] use distinct ordinal list
#1655	[OAP-1654][oap-native-sql]Columnar shuffle tpcds enabling
#1630	[OAP-1629][oap-native-sql] clean up building scripts
#1601	[OAP-1602][oap-native-sql][Java] fix extract resource from jar
#1639	[OAP-1638][oap-native-sql] tpcds enabling (part2)
#1586	[OAP-1587][oap-native-sql] tpcds enabling (part1)
#1600	[oap-1599][oap-native-sql][Scala] fix broadcasthashjoin
#1555	[OAP-1541][oap-native-sql] TreeNode children not replaced by columnar…
#1546	[OAP-1547][oap-native-sql][Scala] Adding metrics for input/output batches
#1472	[OAP-1466] [RDD Cache] [POAE-354] Initialize pmem with AppDirect and KMemDax mode in block manager

Release 0.8.4

Features


#1865	[OAP-CACHE]Decouple spark code include DataSourceScanExec.scala, OneApplicationResource.scala, Decouple VectorizedColumnReader.java, VectorizedPlainValuesReader.java, VectorizedRleValuesReader.java and OnHeapColumnVector.java for OAP-0.8.4.
#1813	[OAP-cache] package redis client jar into oap-cache

Bugs Fixed


#2044	[OAP-CACHE] Build error due to synchronizedSet on branch 0.8
#2027	[oap-shuffle] Should load native library from jar directly
#1981	[OAP-CACHE] Error running q32 binary cache
#1980	[SDLe][RPMem-Shuffle]Issues from Static Code Analysis with Klocwork need to be fixed
#1918	[OAP-CACHE] Plasma throw exception:get an invalid value- branch 0.8

PRs


#2045	[OAP-2044][OAP-CACHE]bug fix: build error due to synchronizedSet
#2031	[OAP-1955][OAP-CACHE][POAE7-667]preferLocation low hit rate fix branch 0.8
#2029	[OAP-2027][rpmem-shuffle] Load native libraries from jar
#2018	[OAP-1980][SDLe][rpmem-shuffle] Fix potential risk issues reported by Klocwork
#1920	[OAP-1924][OAP-CACHE]Decouple heartbeat message and use conf to determine whether to report locality information
#1949	[OAP-1948][rpmem-shuffle] Fix several vulnerabilities reported by BDBA
#1900	[OAP-1680][OAP-CACHE] Decouple FileFormatDataWriter, FileFormatWriter and OutputWriter
#1899	[OAP-1679][OAP-CACHE] Remove the code related to reading and writing OAP data format (#1723)
#1897	[OAP-1884][OAP-dev] Update memkind version and copy arrow plasma jar to conda package build path
#1883	[OAP-1568][OAP-CACHE] Cleanup Oap data format read/write related test cases
#1863	[OAP-1865][SQL Data Source Cache]Decouple spark code include DataSourceScanExec.scala, OneApplicationResource.scala, Decouple VectorizedColumnReader.java, VectorizedPlainValuesReader.java, VectorizedRleValuesReader.java and OnHeapColumnVector.java for OAP-0.8.4.
#1841	[OAP-1579][OAP-cache]Fix web UI to show cache size
#1814	[OAP-cache][OAP-1813][POAE7-481]package redis client related dependency
#1790	[OAP-CACHE][OAP-1690][POAE7-430] Cache backend fallback bugfix
#1740	[OAP-CACHE][OAP-1748][POAE7-453]Enable externalDB to store CacheMetaInfo branch 0.8
#1731	[OAP-CACHE] [OAP-1730] [POAE-428] Add OAP cache runtime enable
