Skip to content

Conversation

@leftwo
Copy link
Contributor

@leftwo leftwo commented Mar 30, 2022

This covers the next piece of reconciliation where we transfer extent files from
one downstairs to another using a dropshot server and generating the API
to stream the extent files.

This builds on several previous parts that did the work of identifying mismatches
between downstairs and creating a solve list.

This piece executes the solve, and results in all three downstairs being the same
on completion of the solving. This also enables a basic functionality that should
work when no outside errors are encountered (like a downstairs reset during
solve).

To get an idea of what happens when a downstairs does an extent repair, I have
copied this block comment from downstairs/src/region.rs:

     /**
       * Repair an extent from another downstairs
       *
       * We need to repair an extent in such a way that an interruption
       * at any time can be recovered from.
       *
       * Let us assume we are repairing extent 012
       *  1. Make new 012.copy dir  (extent name plus: .copy)
       *  2. Get all extent files from source side, put in 012.copy directory
       *  3. fsync files we just downloaded
       *  4. Rename 012.copy dir to 012.replace dir
       *  5. fsync extent directory ( 00/000/ where the extent files live)
       *  6. Replace current extent 012 files with copied files of same name
       *    from 012.replace dir
       *  7. Remove any files in extent dir that don't exist in replacing dir
       *  8. fsync files after copying them (new location).
       *  9. fsync containing extent dir
       * 10. Rename 012.replace dir to 012.completed dir.
       * 11. fsync extent dir again (so dir rename is persisted)
       * 12. Delete completed dir.
       *
       *  This also requires the following behavior on every extent open:
       *   A. If xxx.Copy directory found, delete it.
       *   B. If xxx.Completed directory found, delete it.
       *   C. If xxx.Replace dir found start at step 4 above and continue
       *      on through 6.
       *   D. Only then, open extent.
       */

Testing: 7000 or so loops of: make a downstairs different, then repair it.

Upstairs:
Update the repair logic to now do more than just close and then
re-open extents. The upstairs will now build the complete repair
steps needed to repair an extent. This includes the logic about
which downstairs should receive which repair commands.

Protocol:
New ExtentFlush to flush a specific extent.
New ExtentRepair command for repairing a specific extent.
Replaced a few specific ACKs with a generic repair ACK.

Downstairs:
Made the repair server always on, and starting at port + REPAIR_PORT_OFFSET.
New repair-api command that can dump a json file for the repair-client API
Added the logic to support the new repair messages.

Downstairs region:
The downstairs logic for execution of a repair lives in this file.
Created new functions to get the names for the new repair directories and
a bunch of tests around them. We can now get names for specific
parts of the extent, the housing directory, the copy, replace, and
completed directories, as well as provide an optional file extension
to the file name.
During repair, we check for and remove any copy or completed directories.
If there is a replay directory, then the extent open will do the
replay and replace existing extent files, only if the open is read/write.
New command to just flush a specific extent.

Downstairs repair:
This new file is the dropshot server for extent repair requests.
It listens on the specified port and serves (anyone currently) the
list of files for an extent, or one of the actual extent files.

Repair-client:
Generate a library using progenitor that is used by a downstairs to
request repair extents from another downstairs. This library generation
uses the json file that is output from the downstairs repair-api command.

Tests:
More tests for repair
More tests for building repair work list.
A New test that will in a loop:
Create a workload that will produce downstairs out of sync
Restart and repair the out of sync downstairs.
Improved usage messages for create-generic-ds.sh

What work is still outstanding with this:
The following are not part of this commit, but will be addressed in a follow up
commit and issues will be made for these. I felt this PR was big enough as is, and
the work here does allow more testing to start.

  • Error handling when interrupted during download of repair files.
  • Overall error handling if a repair job returns error to the upstairs.
  • Panic if copy directory exists when going to create it.
  • Panic if copy file exists when going to create it.
  • Panic if replacement directory preconditions are not as expected
  • Panic if move_replacement_extent copy fails
  • fsflush of directories after movement.
  • Progenitor needs an auto detect of a change in the API for downstairs-repair.
  • Repair client/server authentication. Currently anyone can request extent files from the repair server.

TBD: How we handle TLS and keys for ds->ds communication.

leftwo added 2 commits March 29, 2022 18:02
This covers the next piece of reconciliation where we actually copy
extent files from one downstairs to another.

Upstairs:
Update the repair logic to now do more than just close and then
re-open extents. The upstairs will now build the complete repair
steps needed to repair an extent.  This includes the logic about
which downstairs should receive which repair commands.

Protocol:
New ExtentFlush to flush a specific extent.
New ExtentRepair command for repairing a specific extent.
Replaced a few specific ACKs with a generic repair ACK.

Downstairs:
Made repair mode always on, and starting at port + REPAIR_PORT_OFFSET.
New repair-api command that can dump a json file for the repair-client API
Added the logic to support the new repair messages.

Downstairs region:
The downstairs coordination of a repair live in this file
New functions to get the names for the new repair directories and
a bunch of tests around them.  We can now get names for specific
parts of the extent, the housing directory, the copy, replace, and
completed directories, as well as provide an optional file extension
to the file name.
During repair, we check for and remove any copy or completed directories.
If there is a replay directory, then the extent open will do the
replay and replace existing extent files, only if the open is read/write.
New command to just flush a specific extent.

Downstairs repair:
This new file is the dropshot server for extent repair requests.
It listens on the specified port and serves (anyone currently) the
list of files for an extent, or one of the actual extent files.

Repair-client:
Generate a library using progenitor that is used by a downstairs to
request repair extents from another downstairs. This library generation
uses the json file that is output from the downstairs repair-api command.

Tests:
More tests for repair
More tests for building repair work list.
A New test that will in a loop:
 Create a workload that will produce downstairs out of sync
 Restart and repair the out of sync downstairs.
Improved usage messages for create-generic-ds.sh
@leftwo leftwo requested review from ahl, jclulow and jmpesp March 30, 2022 06:45
Copy link
Contributor

@jmpesp jmpesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: 🔥

leftwo added 3 commits March 31, 2022 13:50
Added a new function for validating repair files.

Other small updates based on PR comments.
@jmpesp
Copy link
Contributor

jmpesp commented Apr 4, 2022

👍 re-reviewed, 🚤

Copy link
Contributor

@ahl ahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Please forgive my chattiness, but I haven't been in this code and it's all very cool.

My only lingering question is why the repair API is on a different port rather than being part of the downstairs API? Is there a big theory statement or something that could clarify?

* Produce the string name of the data file for a given extent number
*/
fn extent_path<P: AsRef<Path>>(dir: P, number: u32) -> PathBuf {
pub fn extent_file_name(number: u32, extension: Option<&str>) -> String {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should extension be an Option<&str> or might it make sense to have this be some other enum to limit the set of inputs we might see here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, an enum here is probably the way to do it. I refactored this part a bunch of times and this seems like a good idea. Of course, once I did this, then a bunch of other places then became obvious to update as well.

*/
#[endpoint {
method = GET,
path = "/extent/{eid}/data",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I reading this right that these 4 endpoints are basically the same? What would you think of something like this:

    #[derive(Deserialize, JsonSchema)]
    pub enum FileType {
        #[serde(rename = "data")
        Data,
        #[serde(rename = "db")
        Database,
        #[serde(rename = "db-shm")
        DatabaseSharedMemory, // ??
        #[serde(rename = "db-wal")
        DatabaseLog,
    }

    // ...

    struct FileSpec {
        eid: u32,
        file_type: FileType,
    }

    #[endpoint {
        method = GET,
        path = "/extent/{eid}/{file_type}",
    }]
    // ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's simpler. I've updated it a bit and then updated a bit more now that their is an enum for ExtentExtension. I can probably polish it a bit more, but at some point I want to get this back under test.

Comment on lines 215 to 218
#[derive(Deserialize, Serialize, JsonSchema)]
struct ExtentFiles {
files: Vec<String>,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's fine either way, but you could (I think??) just return Vec<String> directly rather than wrapping them in a struct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, simpler, done.

// The db file should always exist.
full_name.set_extension("db");
if !full_name.exists() {
return Err(HttpError::for_bad_request(None, "EBADF".to_string()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we distinguish a client error (i.e. bogus eid) vs server error (i.e. internal inconsistency)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't. And it can be difficult to know if we really received a bogus eid, or if the file that should exist for a given eid is missing. Right now the extent_file_name just transforms the name without actually checking if that extent is valid in the given region.

For the dropshot client (the downstairs requesting a file), they don't really care why there is a failure, as there is not much they can do about it and the repair will have to abort. It would be nice to know from a debug point of view though, so whomever is cleaning up the mess here will at least have some breadcrumbs as to why.

The one situation where the downstairs client could do something different is if the server either never responded and the client hit a timeout and can retry, or the server responded with some sort of "too busy, try again later" (not even sure if dropshot will do this).

Comment on lines 255 to 282
let mut full_name = extent_dir;

let extent_name = extent_file_name(eid, None);
full_name.push(extent_name.clone());
let mut files = Vec::new();
// The data file should always exist
if !full_name.exists() {
return Err(HttpError::for_bad_request(None, "EBADF".to_string()));
}
files.push(extent_name);
// The db file should always exist.
full_name.set_extension("db");
if !full_name.exists() {
return Err(HttpError::for_bad_request(None, "EBADF".to_string()));
}
files.push(extent_file_name(eid, Some("db")));

// The db-shm file may exist.
full_name.set_extension("db-shm");
if full_name.exists() {
println!("Exists: {:?}", full_name.file_name().unwrap());
files.push(extent_file_name(eid, Some("db-shm")));
}
// The db-wal file may exist.
full_name.set_extension("db-wal");
if full_name.exists() {
files.push(extent_file_name(eid, Some("db-wal")));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut full_name = extent_dir;
let extent_name = extent_file_name(eid, None);
full_name.push(extent_name.clone());
let mut files = Vec::new();
// The data file should always exist
if !full_name.exists() {
return Err(HttpError::for_bad_request(None, "EBADF".to_string()));
}
files.push(extent_name);
// The db file should always exist.
full_name.set_extension("db");
if !full_name.exists() {
return Err(HttpError::for_bad_request(None, "EBADF".to_string()));
}
files.push(extent_file_name(eid, Some("db")));
// The db-shm file may exist.
full_name.set_extension("db-shm");
if full_name.exists() {
println!("Exists: {:?}", full_name.file_name().unwrap());
files.push(extent_file_name(eid, Some("db-shm")));
}
// The db-wal file may exist.
full_name.set_extension("db-wal");
if full_name.exists() {
files.push(extent_file_name(eid, Some("db-wal")));
}
let files = &[
(extent_file_name(eid, None), true),
(extent_file_name(eid, Some("db")), true),
(extent_file_name(eid, Some("db-shm"), false),
(extent_file_name(eid, Some("db-wal"), false),
].into_iter()
.filter_map(|(file, required)| {
if file.exists() {
Some(Ok(file))
} else if required {
Some(Err(HttpError::for_bad_request(None, "EBADF".to_string())))
} else {
None
}
})
.collect::<Result<_>>()?;

I don't think this is perfect, but a sketch for your consideration. It may be more confusing than simplifying, but I wanted to try it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a bunch of ways to get this to work, but I could not come up with one using the
interator like you did here. I did get something that I think is a little better than what was there. Let me know your thoughts on what I cobbled together.

I also added some additional tests that I should have had originally.

Refactor the openapi a bit, less code duplication.
Copy link
Contributor Author

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed a bunch of @ahl comments.

* Produce the string name of the data file for a given extent number
*/
fn extent_path<P: AsRef<Path>>(dir: P, number: u32) -> PathBuf {
pub fn extent_file_name(number: u32, extension: Option<&str>) -> String {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, an enum here is probably the way to do it. I refactored this part a bunch of times and this seems like a good idea. Of course, once I did this, then a bunch of other places then became obvious to update as well.

// The db file should always exist.
full_name.set_extension("db");
if !full_name.exists() {
return Err(HttpError::for_bad_request(None, "EBADF".to_string()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't. And it can be difficult to know if we really received a bogus eid, or if the file that should exist for a given eid is missing. Right now the extent_file_name just transforms the name without actually checking if that extent is valid in the given region.

For the dropshot client (the downstairs requesting a file), they don't really care why there is a failure, as there is not much they can do about it and the repair will have to abort. It would be nice to know from a debug point of view though, so whomever is cleaning up the mess here will at least have some breadcrumbs as to why.

The one situation where the downstairs client could do something different is if the server either never responded and the client hit a timeout and can retry, or the server responded with some sort of "too busy, try again later" (not even sure if dropshot will do this).

Comment on lines 255 to 282
let mut full_name = extent_dir;

let extent_name = extent_file_name(eid, None);
full_name.push(extent_name.clone());
let mut files = Vec::new();
// The data file should always exist
if !full_name.exists() {
return Err(HttpError::for_bad_request(None, "EBADF".to_string()));
}
files.push(extent_name);
// The db file should always exist.
full_name.set_extension("db");
if !full_name.exists() {
return Err(HttpError::for_bad_request(None, "EBADF".to_string()));
}
files.push(extent_file_name(eid, Some("db")));

// The db-shm file may exist.
full_name.set_extension("db-shm");
if full_name.exists() {
println!("Exists: {:?}", full_name.file_name().unwrap());
files.push(extent_file_name(eid, Some("db-shm")));
}
// The db-wal file may exist.
full_name.set_extension("db-wal");
if full_name.exists() {
files.push(extent_file_name(eid, Some("db-wal")));
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a bunch of ways to get this to work, but I could not come up with one using the
interator like you did here. I did get something that I think is a little better than what was there. Let me know your thoughts on what I cobbled together.

I also added some additional tests that I should have had originally.

*/
#[endpoint {
method = GET,
path = "/extent/{eid}/data",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's simpler. I've updated it a bit and then updated a bit more now that their is an enum for ExtentExtension. I can probably polish it a bit more, but at some point I want to get this back under test.

Copy link
Contributor

@ahl ahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

Comment on lines 131 to 138
#[endpoint {
method = GET,
path = "/extent/{eid}/db-shm",
unpublished = false,
path = "/newextent/{eid}/{fileType}",
}]
async fn get_shm(
async fn get_extent_file(
rqctx: Arc<RequestContext<FileServerContext>>,
path: Path<Eid>,
path: Path<FileSpec>,
) -> Result<Response<Body>, HttpError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems better; agreed?

&self,
mut copy_dir: PathBuf,
extension: Option<&str>,
extension: Option<ExtentExtension>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively you could have the "no extension" variant in ExtentExtension or maybe just use FileType directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a "data" and changed it to be called "ExtentType" which I think fits better with what the need is for this file.

Comment on lines 10 to 11
"summary": "For a given extent, return a vec of strings representing the names of",
"description": "the files that exist for that extent.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just note that the first line is presumed to be the terse summary. It's not a big deal for this kind of internal API, but you might want to change the comment to something like:

/**
 * Get the list of files related to an extent.
 *
 * For a given extent, return a set of strings representing the names of
 * the files that exist for that extent.
 */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update that

@leftwo
Copy link
Contributor Author

leftwo commented Apr 20, 2022

Just realized I did not answer this.
The repair and the "main downstairs" run on different ports because the repair data transfer goes between two downstairs and not through the upstairs. While the upstairs is running (and connected to all downstairs) it will issue internal commands to tell a downstairs that it needs to repair, and that it should get the data from this other downstairs. Once a downstairs has received a repair message from an upstairs, it will contact the "repair downstairs" directly, (at the repair port).

Let me figure out where I should put that as a comment somewhere so others can find it. I have lots of smaller comments on specifics, but nothing that ties them all together

Added big picture comment on repair process.
Copy link
Contributor Author

@leftwo leftwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have all of @ahl 's comments addressed now.

Comment on lines 10 to 11
"summary": "For a given extent, return a vec of strings representing the names of",
"description": "the files that exist for that extent.",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update that

Comment on lines 215 to 218
#[derive(Deserialize, Serialize, JsonSchema)]
struct ExtentFiles {
files: Vec<String>,
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, simpler, done.

&self,
mut copy_dir: PathBuf,
extension: Option<&str>,
extension: Option<ExtentExtension>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a "data" and changed it to be called "ExtentType" which I think fits better with what the need is for this file.

@leftwo leftwo merged commit a70d458 into main Apr 21, 2022
@leftwo leftwo deleted the repairs branch April 21, 2022 00:29
@leftwo leftwo restored the repairs branch April 21, 2022 00:29
@leftwo leftwo deleted the repairs branch May 5, 2022 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants