
Enhance TiSpark start up task speed when there are a large amount of regions #2520

Closed
wants to merge 4 commits into from

Conversation

qidi1
Contributor

@qidi1 qidi1 commented Aug 24, 2022

Signed-off-by: qidi1 <1083369179@qq.com>

What problem does this PR solve?

What is changed and how it works?

The original logic for getting the regions covering a range looks like this:

public List<RegionTask> splitRangeByRegion(List<KeyRange> keyRanges, TiStoreType storeType) {
    if (keyRanges == null || keyRanges.size() == 0) {
      return ImmutableList.of();
    }

    ...

    while (true) {
      ...
      while (regionStorePair == null) {
        try {
          // one lookup per key; on a cache miss this turns into one PD RPC
          regionStorePair = regionManager.getRegionStorePairByKey(range.getStart(), storeType, bo);

          if (regionStorePair == null) {
            // throw exception
            ...
          }
        } catch (Exception e) {
          ...
        }
      }

      TiRegion region = regionStorePair.first;
      idToRegion.putIfAbsent(region.getId(), regionStorePair);

      // update the range start; if the range start is greater than or equal to
      // the range end, move on to the next range in the list.
      ...
    }
  }
public Pair<TiRegion, Store> getRegionStorePairByKey(
      ByteString key, TiStoreType storeType, BackOffer backOffer) {
    TiRegion region = cache.getRegionByKey(key, backOffer);
    // get store from region.
    ...
    return Pair.create(region, store);
}

public synchronized TiRegion getRegionByKey(ByteString key, BackOffer backOffer) {
  TiRegion region = regionCache.get(getEncodedKey(key));
  ...

  if (region == null) {
    ...
    // only one region (the one containing the key) is fetched from PD here.
    region = pdClient.getRegionByKey(backOffer, key);
    if (!putRegion(region)) {
      throw new TiClientInternalException("Invalid Region: " + region.toString());
    }
  }

  return region;
}

As you can see, many RPC requests have to be made when much of the region information is not in the cache.

To reduce RPC requests, when we ask PD for region information we request not only the region containing the startKey, but also other regions in the range.

We do not request all of the regions in the range at once: most of the regions we need may already be in the cache, and fetching every region from PD in one go could put a large burden on PD and waste work on regions we already have.
Instead, we limit the number of regions returned per request. The default limit is 128.
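
As a rough back-of-the-envelope sketch (my own illustration, assuming none of the regions are cached and using the default limit of 128), the change turns one PD round trip per region into one round trip per batch:

// Illustrative only: compare PD round trips when no region is in the cache.
int totalRegions = 10_000;  // e.g. the 1w case benchmarked below
int scanLimit = 128;        // default number of regions returned per scan

int rpcCallsOld = totalRegions;                               // one getRegionByKey RPC per region
int rpcCallsNew = (totalRegions + scanLimit - 1) / scanLimit; // one scan RPC per batch -> 79

System.out.println("old: " + rpcCallsOld + " RPCs, new: " + rpcCallsNew + " RPCs");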

public List<RegionTask> splitRangeByRegion(List<KeyRange> keyRanges, TiStoreType storeType) {
    // init some data.
    ...
    for (KeyRange range : keyRanges) {
      while (true) {
        try {
          // fetch all region/store pairs covering this range, in batches
          List<RegionStorePair> regionStorePairList =
              regionManager.getAllRegionStorePairsInRange(range, storeType, bo);
          // add data to result.
          ...
          break;
        } catch (Exception e) {
          ...
          bo.doBackOff(BackOffFuncType.BoRegionMiss, e);
        }
      }
    }
    // build result
    ...
    return resultBuilder.build();
  }

public List<RegionStorePair> getAllRegionStorePairsInRange(
      KeyRange range, TiStoreType storeType, BackOffer backOffer) {
    // init some data.
    ...
    while (toRawKey(startKey).compareTo(toRawKey(endKey)) < 0) {
      try {
        ...
        ArrayList<TiRegion> regions =
            cache.getRegionsInRangeWithLimit(range, backOffer, REGION_SCAN_LIMIT);
        // construct region and store pairs
        ...
        // advance the scan window to the end key of the last region returned
        startKey = regions.get(regions.size() - 1).getEndKey();
        range = KeyRange.newBuilder().setStart(startKey).setEnd(endKey).build();
      } catch (Exception e) {
        ...
      }
    }
    return allRegionStorePairs;
  }

public synchronized ArrayList<TiRegion> getRegionsInRangeWithLimit(
      KeyRange range, BackOffer backOffer, int scanLimit) {
    // init some data.
    ...
    while (toRawKey(startKey).compareTo(toRawKey(endKey)) < 0
        && regionsInRange.size() < scanLimit) {
      TiRegion region = regionCache.get(getEncodedKey(startKey));
      // on a cache miss, fill the cache with up to scanLimit regions in one request
      while (region == null) {
        try {
          updateCacheInRangeWithLimit(startKey, endKey, scanLimit, backOffer);
          region = regionCache.get(getEncodedKey(startKey));
          ...
        } catch (Exception e) {
          ...
        }
      }
      ...
      startKey = region.getEndKey();
    }
    return regionsInRange;
  }

private synchronized void updateCacheInRangeWithLimit(
      ByteString startKey, ByteString endKey, int limit, BackOffer backOffer)
      throws GrpcException {
    // get multiple regions from PD in a single RPC.
    List<TiRegion> regions = pdClient.scanRegionWithLimit(backOffer, startKey, endKey, limit);
    for (TiRegion region : regions) {
      // Region without leader.
      if (region.getLeader() == null || region.getLeader().getId() == 0) {
        continue;
      }
      // Regions that have a leader are inserted into the cache.
      if (!putRegion(region)) {
        throw new TiClientInternalException("Invalid Region: " + region);
      }
    }
  }
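
For context, here is a minimal caller-side sketch. Only splitRangeByRegion and TiStoreType appear in the code above; the splitter construction and session accessor are my assumptions about the surrounding client code:

// Hypothetical usage sketch; names other than splitRangeByRegion are placeholders.
RegionManager regionManager = session.getRegionManager();
RangeSplitter splitter = RangeSplitter.newSplitter(regionManager);
List<KeyRange> keyRanges = ...;  // key ranges of the table or index to scan
// region lookups inside this call are now batched through getAllRegionStorePairsInRange
List<RegionTask> tasks = splitter.splitRangeByRegion(keyRanges, TiStoreType.TiKV);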
    
    

We tested the new method against the old method.

When the number of regions we need to request is 10,000 (1w):

The old method took about seven seconds; the new method took about four seconds.

new

spark test, new method, region1000_10

old

spark test, old method, region1000_10

When the number of regions we need to request is 100,000 (10w):

The old method took about thirteen seconds; the new method took about thirty seconds.

new

spark test, new method, region10w

old

spark test, old method, region10w

Check List

Tests

  • Unit test

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@qidi1
Contributor Author

qidi1 commented Aug 24, 2022

/run-all-tests

@qidi1 qidi1 changed the title from "TiSpark start up task slow when there are a large amount of regions" to "Enhance TiSpark start up task speed when there are a large amount of regions" Aug 24, 2022
Signed-off-by: qidi1 <1083369179@qq.com>
@qidi1
Contributor Author

qidi1 commented Aug 24, 2022

/run-all-tests

1 similar comment
@qidi1
Contributor Author

qidi1 commented Aug 25, 2022

/run-all-tests

Signed-off-by: qidi1 <1083369179@qq.com>
@qidi1
Contributor Author

qidi1 commented Aug 25, 2022

/run-all-tests

@ti-chi-bot
Member

@qidi1: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@qidi1
Contributor Author

qidi1 commented Sep 23, 2022

There is no longer a need to improve the performance here: this part of the logic has been moved to client-java.
