Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

curvefs: support curvebs as data backend #1207

Merged
merged 1 commit into from
Apr 8, 2022

Conversation

wu-hanqing
Copy link
Contributor

What problem does this PR solve?

Issue Number: close #1032

Problem Summary:

What is changed and how it works?

What's Changed:

How it Works:

Side effects(Breaking backward compatibility? Performance regression?):

Check List

  • Relevant documentation/comments is changed or added
  • I acknowledge that all my contributions will be made under the project's license

@wu-hanqing wu-hanqing force-pushed the curvefs/volume-space branch 2 times, most recently from 48cbede to f6202d8 Compare March 24, 2022 08:29
@wu-hanqing
Copy link
Contributor Author

recheck

2 similar comments
@wu-hanqing
Copy link
Contributor Author

recheck

@wu-hanqing
Copy link
Contributor Author

recheck

@wu-hanqing wu-hanqing force-pushed the curvefs/volume-space branch 2 times, most recently from 4172780 to 85ec688 Compare March 29, 2022 02:52
@@ -51,6 +51,7 @@ spaceserver.rpcTimeoutMs=1000

#### bdev
bdev.confpath=/etc/curve/client.conf
bdev.threadnum=10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this threadnum used for? You can add some notes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't you say that you have used other learning interfaces before?

volumeBlockSize=4096
volumeName=volume
volumeUser=user
volumePassword=password
volumeBlockGroupSize=134217728
# support |AtStart| and |AtEnd|
volumeBitmapLocation=AtStart
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each bolckGroup has a bitmap and the blockGroupSize is 128MiB and 4MiB is signed by one bit of bitmap. Only 4 bytes of bitmap on each blockGroup? 4B + 128MiB ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The minimum allocate granularity(block size) is 4KiB, so we need a 4KiB bitmap and each bit represents whether a block is used or not.

And 4MiB is an optimization for continuity of file data-block allocation.

you can see #959

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change from AtStart to AtEnd?

@@ -150,6 +150,7 @@ enum FsFileType {
TYPE_FILE = 2;
TYPE_SYM_LINK = 3;
TYPE_S3 = 4;
TYPE_VOLUME = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the meaning of FsFileType here? Kind of backend S3 or Volume ? The file type of filesystem like regular file , dir , sym_link?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are TYPE_S3 and TYPE_VOLUME the same level as the other three?

@@ -194,7 +195,7 @@ message Inode {
required FsFileType type = 14;
optional string symlink = 15; // TYPE_SYM_LINK only
optional uint64 rdev = 16;
optional VolumeExtentList volumeExtentList = 17; // TYPE_FILE only
map<uint64, VolumeExtentList> volumeExtentMap = 17; // TYPE_FILE only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meaning of key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

@@ -67,15 +68,19 @@ class InodeWrapper : public std::enable_shared_from_this<InodeWrapper> {
status_(InodeStatus::Normal),
metaClient_(metaClient),
openCount_(0),
dirty_(false) {}
dirty_(false) {
BuildExtentCache();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will rebuild each time cost much when 'VolumeExtentList' is bigger in inode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, so we should separate inode's attributes and s3chunkinfo/volumeextentlist later.

// 1. open a file with O_RDWR
// 2. write "hello, world"
// 3. unlink this file
// 4. pread from 0 to 13, expected "hello, world"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why expected "hello, world" after unlink the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the short answer is, unlink is delayed until all processes close the file.

you can see unlink's descriptions for more information.

https://linux.die.net/man/2/unlink

availableSize_(availableSize),
totalGroups_(size / blockGroupSize),
allocatedGroups_(allocatedGroups.size()) {
for (auto& g : allocatedGroups) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const auto& g

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

@@ -43,7 +43,7 @@ DEFINE_bool(detail, false, "show more infomation");
DEFINE_string(partitionId, "1,2,3", "partition id");
DEFINE_string(metaserverId, "1,2,3", "metaserver id");
DEFINE_bool(noconfirm, false, "execute command without confirmation");
DEFINE_uint64(blockSize, 1048576, "block size");
DEFINE_uint64(blockSize, 0, "volume size");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set default value same as the item in conf '4096' ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

@wu-hanqing wu-hanqing force-pushed the curvefs/volume-space branch 3 times, most recently from 6f2e72c to fb221fd Compare March 31, 2022 11:31
@@ -43,7 +43,7 @@ DEFINE_bool(detail, false, "show more infomation");
DEFINE_string(partitionId, "1,2,3", "partition id");
DEFINE_string(metaserverId, "1,2,3", "metaserver id");
DEFINE_bool(noconfirm, false, "execute command without confirmation");
DEFINE_uint64(blockSize, 1048576, "block size");
DEFINE_uint64(blockSize, 4096, "volume block size");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following ‘volumeBlockSize’ is the blocksize of the volume, not this ‘blockszie’.
The interface of createfs has a blocksize, and in fsdetail, s3 and volume have their own blockszie.

@@ -312,7 +312,7 @@ int Metaserver::PersistDataToLocalFile(std::shared_ptr<LocalFileSystem> fs,
}

int writtenCount = fs->Write(fd, data.c_str(), 0, data.size());
if (writtenCount < data.size()) {
if (writtenCount < 0 || static_cast<size_t>(writtenCount) != data.size()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why writtenCount < data.size() does not satisfy this requirement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, according to the rule, when a signed integral compare with an unsigned integral, the signed value will be converted to unsigned.

for example, data's size is 4096, and fs->Write returns -1, previous if (writtenCount < data.size) will be false.

see https://godbolt.org/z/fK1z8Penb

if (request.has_volumeextentlist()) {
VLOG(1) << "update inode has extent";
old->mutable_volumeextentlist()->CopyFrom(request.volumeextentlist());
// FIXME(wuhanqing): turncate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why truncate cannot be processed now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

truncate needs release space, and we don't implement yet

@@ -51,6 +51,7 @@ spaceserver.rpcTimeoutMs=1000

#### bdev
bdev.confpath=/etc/curve/client.conf
bdev.threadnum=10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't you say that you have used other learning interfaces before?

volumeBlockSize=4096
volumeName=volume
volumeUser=user
volumePassword=password
volumeBlockGroupSize=134217728
# support |AtStart| and |AtEnd|
volumeBitmapLocation=AtStart
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change from AtStart to AtEnd?

required uint64 blockGroupSize = 6;
required BitmapLocation bitmapLocation = 7;

// TODO(all): maybe we need curvebs cluster's mds ip and port in here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use the id to mark the curvebs cluster. The correspondence between the id and the curvebs cluster is recorded in the curvefs-mds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't you say that you have used other learning interfaces before?

it's because, for unaligned write operation, RMW is performed, so there might be a data consistent problem, like #985. and if this really happens, we should redesign the implementation here. so I just add a dedicated thread pool now.

Can we change from AtStart to AtEnd?

yep, both are supported, but they have no much difference.

Maybe we can use the id to mark the curvebs cluster. The correspondence between the id and the curvebs cluster is recorded in the curvefs-mds.

currently, it's only affects how we extend the volume, now we can only add an extra client.conf in mds.

and if we are going to support volume in different curvebs clusters, we may need ip/port or id.

@@ -306,5 +280,13 @@ void FuseVolumeClient::FlushData() {
// TODO(xuchaojie) : flush volume data
}

void FuseVolumeClient::SetSpaceManagerForTesting(SpaceManager *manager) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement FuseOpFlush. If cto, persist metadata and data, otherwise only persist data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

@@ -350,5 +396,34 @@ InodeWrapper::UpdateInodeStatus(InodeOpenStatusChange statusChange) {
return CURVEFS_ERROR::OK;
}

// TODO(wuhanqing): update extents incrementally
CURVEFS_ERROR InodeWrapper::SyncVolumeExtents() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have an interface for SyncVolumeExtents, why SyncFullInode doesn't just update other parts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SyncVolumeExtents is unused and is been deleted.

But, I think this class is going to be unmaintainable, it has different responsibilities, and somewhere use this interface and somewhere else, but may do the same thing. And, it has to know the data backend.

@@ -148,6 +148,60 @@ void MDSBaseClient::AllocS3ChunkId(uint32_t fsId,
curvefs::mds::MdsService_Stub stub(channel);
stub.AllocateS3Chunk(cntl, &request, response, nullptr);
}

// TODO(all): do we really need pass `fsId` all the time?
// each curve-fuse process only mount one filesystem
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fsid is used to mark curve-fuse. The block group is uniquely identified by the owner,
the representation in the s3 scene is fsid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we can store fsid as a global variable, so that, we can just use this variable, and don't have to pass it as a parameter all the time and everywhere.

<< ", len: " << len << ", block write requests: " << writes;

ssize_t nr = blockDeviceClient_->Writev(writes);
if (nr < 0 /*|| nr != len*/) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why comment out the latter part: nr != len

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for unaligned requests, the return value will be bigger than the origin read/write length.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remark here
image

mutable curve::common::RWLock lock_;

// key is offset
std::unordered_map<uint64_t, std::map<uint64_t, PExtent>> extents_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extents_ needs to be limited in size or length

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, it's the latest extents of an inode.

@wu-hanqing wu-hanqing force-pushed the curvefs/volume-space branch 3 times, most recently from a9a9f52 to a155b56 Compare April 5, 2022 07:43
WORKSPACE Outdated
@@ -20,7 +20,7 @@ load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")

git_repository(
name = "com_github_baidu_braft",
remote = "https://github.com/baidu/braft",
remote = "https://gitee.com/baidu/braft",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to modify here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix


LockGuard lk(mtx_);
for (auto& group : allocatedGroups_) {
auto err = ClearBlockGroup(group.second);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mark as TODO here: etcdclient need to implement remove by prefix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

return MakeNode(req, parent, name, mode, FsFileType::TYPE_FILE, rdev, e);
VLOG(3) << "FuseOpMkNod, parent: " << parent << ", name: " << name
<< ", mode: " << mode << ", rdev: " << rdev;
return MakeNode(req, parent, name, mode, FsFileType::TYPE_VOLUME, rdev, e);
}

CURVEFS_ERROR FuseVolumeClient::FuseOpFsync(fuse_req_t req, fuse_ino_t ino,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FuseOpFlush do not implement in line275
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix

<< ", len: " << len << ", block write requests: " << writes;

ssize_t nr = blockDeviceClient_->Writev(writes);
if (nr < 0 /*|| nr != len*/) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remark here
image

@wu-hanqing wu-hanqing force-pushed the curvefs/volume-space branch 3 times, most recently from 5493b25 to 386fa73 Compare April 7, 2022 08:33
@@ -298,13 +274,23 @@ CURVEFS_ERROR FuseVolumeClient::Truncate(Inode *inode, uint64_t length) {

CURVEFS_ERROR FuseVolumeClient::FuseOpFlush(fuse_req_t req, fuse_ino_t ino,
struct fuse_file_info *fi) {
// Todo: flush data and note cto
return CURVEFS_ERROR::OK;
VLOG(9) << "FuseOpFlush, ino: " << ino;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curvefs::client::common::FLAGS_enableCto = true, FuseOpFlush need to sync inode for cto(close will call flush internal)

@wu-hanqing wu-hanqing merged commit edc5fda into opencurve:master Apr 8, 2022
@wu-hanqing wu-hanqing deleted the curvefs/volume-space branch April 13, 2022 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

curvefs use curve block storage as data storage backend
4 participants