Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync to 1.2: supporting reader linear search on unsorted fake pk col #16461

Conversation

gouhongshen
Copy link
Contributor

@gouhongshen gouhongshen commented May 29, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue ##13959

What this PR does / why we need it:

supporting reader linear search on unsorted fake pk col


PR Type

Enhancement, Tests


Description

  • Added multiple linear search functions for unordered vectors in search.go.
  • Updated filter handling in disttae/reader.go to use new blockio.ReadFilter structure.
  • Refactored filter handling in txn_table.go to accommodate new filter type.
  • Enhanced utility functions in util.go to support new filter structure and unsorted search functions.
  • Updated tests in util_test.go for new filter structure.
  • Refactored blockio/read.go to use ReadFilterSearchFuncType and updated ReadFilter structure.
  • Updated logtail/snapshot.go to replace nil filters with blockio.ReadFilter{}.
  • Added SQL test cases and results for linear search on fake primary key column.

Changes walkthrough 📝

Relevant files
Enhancement
6 files
search.go
Add linear search functions for unordered vectors.             

pkg/container/vector/search.go

  • Added multiple linear search functions for unordered vectors.
  • Introduced factories for creating search functions based on value
    types.
  • Implemented search functions for various data types including byte
    slices and prefixes.
  • +103/-0 
    reader.go
    Update filter handling in disttae reader.                               

    pkg/vm/engine/disttae/reader.go

  • Replaced nil filters with blockio.ReadFilter{}.
  • Updated filter evaluation logic to handle new filter structure.
  • Modified block reading logic to use the new filter structure.
  • +15/-14 
    txn_table.go
    Refactor filter handling in txn_table.                                     

    pkg/vm/engine/disttae/txn_table.go

  • Replaced ReadFilter with ReadFilterSearchFuncType.
  • Updated filter building logic to use new filter structure.
  • Adjusted block read calls to accommodate new filter type.
  • +7/-8     
    util.go
    Enhance util functions for new filter structure.                 

    pkg/vm/engine/disttae/util.go

  • Added support for unsorted search functions.
  • Updated getNonCompositePKSearchFuncByExpr to return new filter
    structure.
  • Modified linear search factory functions to handle new filter types.
  • +100/-49
    read.go
    Refactor blockio read functions for new filter structure.

    pkg/vm/engine/tae/blockio/read.go

  • Introduced ReadFilterSearchFuncType and updated ReadFilter structure.
  • Modified ReadByFilter and BlockRead functions to use new filter
    structure.
  • +19/-5   
    snapshot.go
    Update snapshot logtail for new filter structure.               

    pkg/vm/engine/tae/logtail/snapshot.go

  • Replaced nil filters with blockio.ReadFilter{} in BlockRead calls.
  • +1/-1     
    Tests
    3 files
    util_test.go
    Update util tests for new filter structure.                           

    pkg/vm/engine/disttae/util_test.go

    • Updated test to use SortedSearchFunc from new filter structure.
    +1/-1     
    block_reader_filter.result
    Add test results for fake PK linear search.                           

    test/distributed/cases/disttae_filters/reader_filters/block_reader_filter.result

    • Added test results for linear search on fake primary key column.
    +24/-0   
    block_reader_filter.sql
    Add SQL tests for fake PK linear search.                                 

    test/distributed/cases/disttae_filters/reader_filters/block_reader_filter.sql

    • Added SQL test cases for linear search on fake primary key column.
    +14/-0   

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Review 🔍

    ⏱️ Estimated effort to review [1-5]

    4, because the PR involves multiple complex changes across several files, including the addition of new search functions, updates to filter handling, and modifications to existing functions to accommodate new filter types. The changes impact core functionalities like search and filtering in vectors, which requires a thorough understanding of the existing architecture and careful review to ensure correctness and performance.

    🧪 Relevant tests

    Yes

    ⚡ Possible issues

    Performance Concern: The addition of multiple linear search functions could potentially lead to performance degradation, especially for large datasets. Linear searches are generally less efficient than binary searches, particularly when dealing with unsorted data.

    Possible Bug: The use of nil checks and conversions between different filter types and search functions must be carefully reviewed to ensure that there are no edge cases or conditions where the filters might not behave as expected, leading to incorrect query results or runtime errors.

    🔒 Security concerns

    No

    @matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label May 29, 2024
    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Maintainability
    Remove the duplicate import statement for moerr

    The import statement for moerr is duplicated. It should be removed to avoid redundancy and
    potential confusion.

    pkg/vm/engine/disttae/util.go [27]

    -"github.com/matrixorigin/matrixone/pkg/common/moerr"
    +# Remove the duplicate import statement
     
    Suggestion importance[1-10]: 10

    Why: The import statement for moerr is indeed duplicated at lines 25 and 27. Removing the duplicate is crucial to avoid redundancy and potential confusion.

    10
    Define the buildFilter and buildUnsortedFilter functions outside the main function to improve readability and maintainability

    In the PKPersistedBetween function, the buildFilter and buildUnsortedFilter functions can
    be defined outside the main function to improve readability and maintainability.

    pkg/vm/engine/disttae/txn_table.go [2247-2270]

     var filter blockio.ReadFilterSearchFuncType
    -buildFilter := func() blockio.ReadFilterSearchFuncType {
    +var unsortedFilter blockio.ReadFilterSearchFuncType
    +
    +func buildFilter(keys *Keys, inExpr *Expr, tbl *txnTable) blockio.ReadFilterSearchFuncType {
     	//keys must be sorted.
     	keys.InplaceSort()
     	bytes, _ := keys.MarshalBinary()
     	filter := getNonCompositePKSearchFuncByExpr(
     		inExpr,
     		"pk",
     		tbl.proc.Load())
     	return filter.SortedSearchFunc
     }
    -...
    -var unsortedFilter blockio.ReadFilterSearchFuncType
    -buildUnsortedFilter := func() blockio.ReadFilterSearchFuncType {
    +
    +func buildUnsortedFilter(keys *Keys) blockio.ReadFilterSearchFuncType {
     	return getNonSortedPKSearchFuncByPKVec(keys)
     }
     
    Suggestion importance[1-10]: 6

    Why: Moving these functions outside the main function scope would indeed improve readability and maintainability. However, the impact on functionality is minimal, so it gets a moderate score.

    6
    Add comments to explain the purpose of each SELECT statement

    To improve readability and maintainability, consider adding comments to explain the
    purpose of each SQL statement, especially the SELECT statements.

    test/distributed/cases/disttae_filters/reader_filters/block_reader_filter.sql [10-12]

    +-- Select rows where b equals 1
     select a from t1 where b = 1;
    +-- Select rows where b is between 1 and 3
     select a from t1 where b between 1 and 3;
    +-- Select rows where b is in the set (1, 2, 3)
     select a from t1 where b in (1,2,3);
     
    Suggestion importance[1-10]: 5

    Why: Adding comments improves the readability and maintainability of the code. However, the impact on functionality is minimal, so it receives a moderate score.

    5
    Possible bug
    Check for nil before assigning functions to filter to avoid potential runtime errors

    The sortedSearchFunc and unSortedSearchFunc should be checked for nil before being
    assigned to filter.SortedSearchFunc and filter.UnSortedSearchFunc to avoid potential
    runtime errors.

    pkg/vm/engine/disttae/util.go [696-700]

    -filter.SortedSearchFunc = func(vecs []*vector.Vector) []int32 {
    -    return sortedSearchFunc(vecs[0])
    -}
    -filter.UnSortedSearchFunc = func(vecs []*vector.Vector) []int32 {
    -    return unSortedSearchFunc(vecs[0])
    +if sortedSearchFunc != nil && unSortedSearchFunc != nil {
    +    filter.SortedSearchFunc = func(vecs []*vector.Vector) []int32 {
    +        return sortedSearchFunc(vecs[0])
    +    }
    +    filter.UnSortedSearchFunc = func(vecs []*vector.Vector) []int32 {
    +        return unSortedSearchFunc(vecs[0])
    +    }
     }
     
    Suggestion importance[1-10]: 9

    Why: Checking sortedSearchFunc and unSortedSearchFunc for nil before using them in filter is essential to prevent runtime errors from nil pointer dereferences. This suggestion addresses a potential major bug.

    9
    Initialize function variables to nil to avoid potential nil pointer dereferences

    The sortedSearchFunc and unSortedSearchFunc variables should be initialized to nil to
    ensure they are explicitly set before use, avoiding potential nil pointer dereferences.

    pkg/vm/engine/disttae/util.go [540]

    -var sortedSearchFunc, unSortedSearchFunc func(*vector.Vector) []int32
    +var sortedSearchFunc, unSortedSearchFunc func(*vector.Vector) []int32 = nil, nil
     
    Suggestion importance[1-10]: 7

    Why: Initializing sortedSearchFunc and unSortedSearchFunc to nil is a good practice to ensure they are explicitly set before use, preventing nil pointer dereferences. However, the Go language initializes function variables to nil by default, so this is more about explicitness than a requirement.

    7
    Best practice
    Initialize the filter variable at the beginning of the function to ensure it is always defined

    The filter variable should be initialized at the beginning of the function to ensure it is
    always defined, even if the valExpr is nil.

    pkg/vm/engine/disttae/util.go [531-534]

    +var filter blockio.ReadFilter
     valExpr := getPkExpr(expr, pkName, proc)
     if valExpr == nil {
    -    return false, false, blockio.ReadFilter{}
    +    return false, false, filter
     }
     
    Suggestion importance[1-10]: 8

    Why: Initializing filter at the beginning of the function ensures that it is always defined, which is crucial for the function's robustness, especially when valExpr is nil.

    8
    Use IF NOT EXISTS and IF EXISTS clauses to make the CREATE and DROP statements more robust

    To ensure that the database and table are created and dropped only if they do not already
    exist, consider using IF NOT EXISTS and IF EXISTS clauses for the CREATE and DROP
    statements respectively.

    test/distributed/cases/disttae_filters/reader_filters/block_reader_filter.sql [1-14]

     drop database if exists testdb;
    -create database testdb;
    +create database if not exists testdb;
     ...
    -drop table t1;
    -drop database testdb;
    +drop table if exists t1;
    +drop database if exists testdb;
     
    Suggestion importance[1-10]: 7

    Why: The suggestion correctly identifies a best practice for handling database and table creation and deletion, ensuring operations are idempotent. However, the original code already uses IF EXISTS for the DROP statements, so the improvement is mainly for the CREATE statement.

    7
    Performance
    Optimize the nested loops in the UnOrderedLinearSearchOffsetByValFactory function by using a map for faster lookup

    In the UnOrderedLinearSearchOffsetByValFactory function, the nested loops can be optimized
    by using a map to store the values for faster lookup, which will improve the search
    performance.

    pkg/container/vector/search.go [36-42]

    +valMap := make(map[T]struct{}, len(vals))
    +for _, val := range vals {
    +	valMap[val] = struct{}{}
    +}
     for x := range rows {
    -	for y := range vals {
    -		if (cmp != nil && cmp(rows[x], vals[y]) == 0) || (cmp == nil && rows[x] == vals[y]) {
    -			sels = append(sels, int32(x))
    -			break
    -		}
    +	if _, found := valMap[rows[x]]; found {
    +		sels = append(sels, int32(x))
     	}
     }
     
    Suggestion importance[1-10]: 8

    Why: This is a valid optimization suggestion that would significantly improve the performance of the search function by reducing the complexity from O(n*m) to O(n+m).

    8
    Consolidate the filter.Valid check to avoid redundant evaluations

    In the getPKFilter function, the filter.Valid check is performed multiple times. It would
    be more efficient to perform this check once and store the result in a variable to avoid
    redundant evaluations.

    pkg/vm/engine/disttae/reader.go [144-157]

    -if !ok || !filter.Valid {
    +isValidFilter := ok && filter.Valid
    +if !isValidFilter {
     	mixin.filterState.evaluated = true
     	mixin.filterState.filter = blockio.ReadFilter{}
     	mixin.filterState.hasNull = hasNull
     	return blockio.ReadFilter{}
     }
     ...
    -if filter.Valid {
    +if isValidFilter {
     	mixin.filterState.filter = filter
     }
     
    Suggestion importance[1-10]: 7

    Why: The suggestion correctly identifies a potential performance improvement by reducing redundant checks of filter.Valid. This change would make the code more efficient and cleaner.

    7
    Enhancement
    Add a verification step to check the row count after inserting test data

    To ensure the test data is correctly inserted, consider adding a verification step after
    the INSERT statement to check the row count in the table.

    test/distributed/cases/disttae_filters/reader_filters/block_reader_filter.sql [7]

     insert into t1 select *, * from generate_series(1, 8192000)g;
    +select count(*) from t1;
     
    Suggestion importance[1-10]: 6

    Why: Adding a verification step after data insertion is a good practice for ensuring data integrity, especially in test environments. This suggestion is practical and enhances the reliability of the test setup.

    6
    Add a verification step to check the status of the mo_ctl flush operation

    To ensure the flush operation is successful, consider adding a verification step to check
    the status of the mo_ctl command.

    test/distributed/cases/disttae_filters/reader_filters/block_reader_filter.sql [9]

     select mo_ctl("dn", "flush", "testdb.t1");
    +-- Verify the flush operation
    +select * from mo_ctl_status where table_name = 't1';
     
    Suggestion importance[1-10]: 6

    Why: This suggestion enhances the robustness of the script by verifying the success of a critical operation. It's a valuable addition for ensuring the expected behavior of database operations.

    6
    Readability
    Simplify the searchFunc assignment logic in the BlockRead function using a single if-else block

    In the BlockRead function, the searchFunc assignment logic can be simplified by using a
    single if-else block to improve readability.

    pkg/vm/engine/tae/blockio/read.go [140-152]

     var searchFunc ReadFilterSearchFuncType
     if (filter.HasFakePK || !info.Sorted) && filter.UnSortedSearchFunc != nil {
     	searchFunc = filter.UnSortedSearchFunc
     } else if info.Sorted && filter.SortedSearchFunc != nil {
     	searchFunc = filter.SortedSearchFunc
     }
     
    -if searchFunc != nil {
    -	if sels, err = ReadByFilter(
    -		ctx, info, inputDeletes, filterSeqnums, filterColTypes,
    -		types.TimestampToTS(ts), searchFunc, fs, mp,
    -	); err != nil {
    -		return nil, err
    -	}
    +if searchFunc == nil {
    +	return nil, errors.New("no valid search function found")
     }
     
    +if sels, err = ReadByFilter(
    +	ctx, info, inputDeletes, filterSeqnums, filterColTypes,
    +	types.TimestampToTS(ts), searchFunc, fs, mp,
    +); err != nil {
    +	return nil, err
    +}
    +
    Suggestion importance[1-10]: 5

    Why: While the suggestion aims to improve readability, the proposed change does not significantly simplify the existing logic and introduces a new error check that was not present before. This could potentially alter the behavior of the function.

    5

    @mergify mergify bot merged commit c317b77 into matrixorigin:1.2-dev May 30, 2024
    17 of 18 checks passed
    @aylei aylei mentioned this pull request Jun 5, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    6 participants