Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

curvefs: add a new distributed transaction model to improve rename performance #2884

Merged
merged 2 commits into from
Dec 12, 2023

Conversation

SeanHai
Copy link
Contributor

@SeanHai SeanHai commented Nov 10, 2023

What problem does this PR solve?

first commit related pr: #2748

Issue Number: #xxx

Problem Summary:

What is changed and how it works?

What's Changed:

How it Works:

Side effects(Breaking backward compatibility? Performance regression?):

Check List

  • Relevant documentation/comments is changed or added
  • I acknowledge that all my contributions will be made under the project's license

@SeanHai SeanHai changed the title [draft] curvefs: replace distributed transaction model to improve rename performance curvefs: add a new distributed transaction model to improve rename performance Nov 20, 2023
@SeanHai
Copy link
Contributor Author

SeanHai commented Nov 20, 2023

cicheck

@SeanHai
Copy link
Contributor Author

SeanHai commented Nov 20, 2023

cicheck

@SeanHai SeanHai force-pushed the improve_rename branch 3 times, most recently from 16914e5 to bdbcf7f Compare November 24, 2023 02:43
@SeanHai
Copy link
Contributor Author

SeanHai commented Nov 24, 2023

cicheck

@SeanHai
Copy link
Contributor Author

SeanHai commented Nov 30, 2023

cicheck

Copy link
Contributor

@wu-hanqing wu-hanqing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are you going to solve the compatibility issues introduced?

curvefs/src/mds/topology/topology.cpp Outdated Show resolved Hide resolved
@@ -1780,6 +1785,12 @@ TopologyImpl::AllocOrGetMemcacheCluster(FsIdType fsId,
return ret;
}

bool TopologyImpl::Tso(uint64_t* ts, uint64_t* timestamp) {
*timestamp = curve::common::TimeUtility::GetTimeofDayMs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the man page https://man7.org/linux/man-pages/man2/gettimeofday.2.html, gettimeofday is not monotonical, so that, you may get a lower timestamp for latter request. And it may also have problem when mds leader exchaged.

Copy link
Contributor Author

@SeanHai SeanHai Dec 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed a problem. But normally the server will use NTP to synchronize time, If the difference is not enough to determine the transaction timeout (default 5s) there should be no impact. otherwise it may rollback the ongoing transaction in a transaction conflict scenario.
Do you have any feasible suggestions to get monotonically increasing physical time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed a problem. But normally the server will use NTP to synchronize time, If the difference is not enough to determine the transaction timeout (default 5s) there should be no impact. otherwise it may rollback the ongoing transaction in a transaction conflict scenario. Do you have any feasible suggestions to get monotonically increasing physical time?

If ts is still persisted in each time, you can consider persist timestamp with it. And if GetTimeofDay returns a value that is less than the previous timestamp, then sleep until GetTimeofDay returns a bigger value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can try this function clock_gettime? CLOCK_MONOTONIC args seems can do it. can man clock_gettime see details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can try this function clock_gettime? CLOCK_MONOTONIC args seems can do it. can man clock_gettime see details.

clock_gettime with args CLOCK_REALTIME is better than gettimeofday here but the clocks on different MDS machines may still have discrepancies.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clock_gettime with args CLOCK_REALTIME is better than gettimeofday here but the clocks on different MDS machines may still have discrepancies.

What I mean is that this function can get continuous and greater than over and over timestamps on a single machine, and is not affected by time drift and NTP alignment. If you want to get this on multiple machines, It's going to be a big challenge. For most storage or data services, they have a separate timing service.

curvefs/src/mds/topology/topology.cpp Outdated Show resolved Hide resolved
curvefs/src/mds/topology/topology.h Outdated Show resolved Hide resolved
curvefs/src/metaserver/dentry_storage.cpp Outdated Show resolved Hide resolved
curvefs/src/client/dentry_cache_manager.cpp Outdated Show resolved Hide resolved
curvefs/src/client/fuse_client.cpp Show resolved Hide resolved
@@ -107,6 +114,7 @@ class RenameOperator {
// if dest exist, record the size and type of file or empty dir
int64_t oldInodeSize_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oldInode means new parent inode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oldInode means srcInode, A rename to B is A.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oldInode means srcInode, A rename to B is A.

It seems oldInodeId_ is only assigned with dstDentry_.inodeid() in https://github.com/opencurve/curve/blob/bfd5acbf1bfc24047af857be7218c76cfe27c8e2/curvefs/src/client/client_operator.cpp#L176C44-L176C44

@@ -83,6 +90,10 @@ class DentryCacheManagerImpl : public DentryCacheManager {
const std::shared_ptr<MetaServerClient> &metaClient)
: metaClient_(metaClient) {}

void Init(std::shared_ptr<MdsClient> mdsClient) override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like DentryCacheManager no longer has cache functionality

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the previous dentry and inode cache has been optimized in v2.6.

Copy link
Contributor

@wu-hanqing wu-hanqing Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the previous dentry and inode cache has been optimized in v2.6.

It's better to remove Cache from its name

fixed

Comment on lines +933 to +953
VLOG(3) << "FuseOpRename [start]: " << renameOp.DebugString();
RETURN_IF_UNSUCCESS(Precheck);
RETURN_IF_UNSUCCESS(RecordOldInodeInfo);
// Do not move LinkDestParentInode behind CommitTx.
// If so, the nlink will be lost when the machine goes down
RETURN_IF_UNSUCCESS(LinkDestParentInode);
RETURN_IF_UNSUCCESS(PrewriteTx);
RETURN_IF_UNSUCCESS(CommitTxV2);
VLOG(3) << "FuseOpRename [success]: " << renameOp.DebugString();
// Do not check UnlinkSrcParentInode, beause rename is already success
renameOp.UnlinkSrcParentInode();
renameOp.UnlinkOldInode();
if (parent != newparent) {
renameOp.UpdateInodeParent();
}
renameOp.UpdateInodeCtime();

if (enableSumInDir_.load()) {
xattrManager_->UpdateParentXattrAfterRename(
parent, newparent, newname, &renameOp);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn’t look very different from v1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire process is the same, the difference lies in the handling of dentry (prewriteTx, commitTx). In the v1 version, the rename process needs to be protected by a lock and cannot be concurrent.

@SeanHai
Copy link
Contributor Author

SeanHai commented Dec 1, 2023

How are you going to solve the compatibility issues introduced?

The two transaction models are currently not compatible. The result of the previous discussion is that the old cluster continues to use the old version, and the newly deployed cluster turns on the v2 version of the transaction model.

curvefs/src/client/client_operator.cpp Show resolved Hide resolved
curvefs/src/metaserver/dentry_storage.cpp Show resolved Hide resolved
rt = PrewriteRenameTx(dentrys, txLockIn);
if (rt == CURVEFS_ERROR::OK) {
dentrys[0] = newDentry_;
rt = PrewriteRenameTx(dentrys, txLockIn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should client clean up lock in primary in this case? otherwise, subsequent requests must wait until lock in primary timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the one hand, the current transaction model is to defer error handling to the next transaction involving the same dentry. On the other hand, the clean up lock is likely to fail when PrewriteRenameTx failed.

metaClient_->CommitTx(dentrys, startTs_, commitTs);
}
}
if (rt != MetaStatusCode::OK) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, clean up locks in prewrite phase?

@@ -760,53 +734,31 @@ MetaStatusCode DentryStorage::CommitTx(const std::vector<Dentry>& dentrys,
}
WriteLockGuard lg(rwLock_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since most dentry operations are protected by this lock, it seeems you can pack almost all logic on client side Precheck/Prewrite/Commit into a single RPC when renaming files under the same directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, the txV1 and txV2 all need this improve, a new pr will do this later.

@SeanHai
Copy link
Contributor Author

SeanHai commented Dec 5, 2023

cicheck

@SeanHai SeanHai force-pushed the improve_rename branch 2 times, most recently from b7d6ff1 to 57fbcba Compare December 5, 2023 09:08
@SeanHai
Copy link
Contributor Author

SeanHai commented Dec 5, 2023

cicheck

2 similar comments
@SeanHai
Copy link
Contributor Author

SeanHai commented Dec 5, 2023

cicheck

@SeanHai
Copy link
Contributor Author

SeanHai commented Dec 5, 2023

cicheck

@@ -570,5 +620,35 @@ FSStatusCode PersisKVStorage::DeleteFsUsage(const std::string& fsName) {
return FSStatusCode::OK;
}

FSStatusCode PersisKVStorage::Tso(uint32_t fsId, uint64_t* ts,
uint64_t* timestamp) {
WriteLockGuard lock(tsLock_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lock also causes requests from different file systems to be processed serially.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, change to one tsIdGenerator in cluster.

TS tsInfo;
tsInfo.set_ts(*ts);
// persist to storage
std::string key = codec::EncodeTsKey(fsId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the method of allocating chunkid to allocate ts? so that, you don't have to persist to etch for each ts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -83,6 +90,10 @@ class DentryCacheManagerImpl : public DentryCacheManager {
const std::shared_ptr<MetaServerClient> &metaClient)
: metaClient_(metaClient) {}

void Init(std::shared_ptr<MdsClient> mdsClient) override {
Copy link
Contributor

@wu-hanqing wu-hanqing Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the previous dentry and inode cache has been optimized in v2.6.

It's better to remove Cache from its name

fixed

@SeanHai SeanHai force-pushed the improve_rename branch 2 times, most recently from 670ddfe to 428eb0e Compare December 8, 2023 08:34
@SeanHai
Copy link
Contributor Author

SeanHai commented Dec 10, 2023

cicheck

@SeanHai SeanHai force-pushed the improve_rename branch 2 times, most recently from da75c9e to e1c15c8 Compare December 11, 2023 11:47
Signed-off-by: wanghai01 <seanhaizi@163.com>
@SeanHai
Copy link
Contributor Author

SeanHai commented Dec 11, 2023

cicheck

…rformance

Signed-off-by: wanghai01 <seanhaizi@163.com>
@SeanHai
Copy link
Contributor Author

SeanHai commented Dec 12, 2023

cicheck

@@ -186,14 +188,21 @@ FSStatusCode MemoryFsStorage::DeleteFsUsage(const std::string& fsName) {
return FSStatusCode::OK;
}

FSStatusCode MemoryFsStorage::Tso(uint64_t* ts, uint64_t* timestamp) {
*timestamp = curve::common::TimeUtility::GetTimeofDayMs();
*ts = tsId_.fetch_add(1, std::memory_order_relaxed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is all tso has increasing sequence number, not divide into every timestamp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, tso has a monotonically increasing sequence number and a timestamp which used to determine tx lock timeout.

@SeanHai SeanHai merged commit 0c722ae into opencurve:master Dec 12, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants