GSoC 2023: Native Backup #120
Replies: 24 comments 120 replies
-
@jesperpedersen I just updated my doc, please let me know your thoughts. I'm not sure if I should move the checklist to the issue #123 page, as @MariamFahmy98 (hi!) did last year. Also, she created multiple PRs and issues, but I'm not sure if I should do the same. I have a feeling that I only need to create one PR addressing #123 and push commits to it, and in the end they could be squashed into a single commit before being merged. Let me know what you think :)
-
Hi @jesperpedersen, I read from the Frontend/Backend Protocol doc that you need to send a startup packet/message to the server before sending the actual query. I then looked this up and verified it in the Postgres source code; there's a function called PQconnectPoll that sets up the TCP connection and sends the startup packet. However, in our query execute functionality this seems to be omitted. Can you tell me if this is intentional? Should I add this? I also find it quite hard and time-consuming to dive into the source code to verify such details of the protocol. I think you mentioned pgprtdbg as a tool to generate trace files of a protocol; do you think it's a good fit for tedious tasks like this? Besides, I'm trying to prioritize things here: before sending the base_backup command, pg_basebackup also does things like generateRecoveryConf or runIdentifySystem. I haven't really dived into their implementation and purpose; do you think they are necessary? I want to get a minimum version that works first, so I think it's necessary for me to be able to tell what's important and what's not (the Postgres source code really is complicated!).
-
@jesperpedersen I just updated the progress above. Let me know what you think~ BTW, do I need to tag you every time I update it?
-
@jesperpedersen I think I need to make some changes to the authentication functions to extract the server version info, because that piece of info arrives together with 'AuthenticationOk'. I put some details above; let me know if you have a better idea for approaching this. Thank you.
-
It is a matter of dealing with the core protocol - https://www.postgresql.org/docs/devel/protocol-message-formats.html - in both cases. They were developed at different times; if you are interested you can find the discussions in the https://www.postgresql.org/list/pgsql-hackers/ archives.
Read about the
Yeah, we will have to - or at least very likely will - look into a solution for that. We could define a setting for maximum memory usage...
-
Hi @jesperpedersen , I'm now moving on to work on writing the data received to disk, I plan to implement a streamer struct to do the job. I have some questions though, about archive and compression in pgmoneta.
And it would be nice if you could take a look at my PR and let me know if it's good to go before I start the next phase's coding work. That'll save me some rebasing work :)
-
Hi @jesperpedersen, currently we only have one tablespace, the default one, and its content is directly stored under the |
-
Hi @jesperpedersen , I think there are existing libraries out there that can untar a file with a function call. Do you have any idea why Postgres chooses to parse and extract the tar file itself as it receives the file stream, instead of, e.g. receive |
-
Hi @jesperpedersen , I'm having a tiny problem here. In order to use libarchive I need to include their Currently |
-
Yeah, I have tried that. Sadly it didn't work. We probably need to rename ours then. Do you have any suggestions?
That is great! I'll adjust my implementation according to it.
I'm not following. Do you mean the ones under |
-
@jesperpedersen I found some memory leaks in both my implementation and our current base backup implementation (the command line one). I was able to locate and fix the current one, but in my implementation the leak is much harder to find. I will submit a fix to #130 once I find it, so please hold off on merging. I'll also submit a separate PR addressing the memory leak in our current implementation shortly.
-
@jesperpedersen Changes you requested for #130 have been submitted. I'm thinking about our next steps. Which one do you want me to prioritize: receiving the WAL, or replacing our archive functionality with libarchive?
-
#130 has been merged - so you can replace |
-
Hi @jesperpedersen , I find the integration not as easy as I expected. Here are some problems I found:
Also If I don't set this |
-
Hi @jesperpedersen, what is the correct way to start postgres server on a restored backup? Is it something like |
-
@jesperpedersen I recreated a similar error after removing the WAL segments under
Could you check yours? There are 3 additional problems I found about the restored server:
-
@jesperpedersen Changes to support tablespaces have been submitted to PR #137. I hope you don't mind some design changes, especially the tar directory structure. Please let me know what you think. I'm curious though: our previous implementation works well, so why do we want to replace it with libarchive?
-
Hi @jesperpedersen , I spent the weekend studying Postgres's WAL replication process. It looks clear enough, but there are some details I'm not sure about. I'm gonna list them below, along with some questions; please correct me if you find any misconception:
What's the difference between
-
Hi @jesperpedersen , I hope you don't mind me borrowing some code snippet in |
-
Hi @jesperpedersen, how did you notice the CPU usage problem? I'm looking into the cause here. |
-
@jesperpedersen The CPU usage problem got fixed by sleeping when the socket read returns -1. But we have another problem: basebackup now takes a tremendously longer time, nearly 300 secs. I'm looking into the reason right now.
-
The progress log in the first section is becoming too long. I reversed the order so that the newest log would be on top. I hope this helps. Though I don't think many people would read this 😅 |
-
Hey @jesperpedersen , should we worry that the WAL process might try to read `config->running` in the shared memory after it's already been unmapped? I don't know whether that could happen or not.
-
Intro
Hi,
My name is Haoran Zhang. I have been selected as a GSoC contributor for pgmoneta this year (2023). I will work on replacing our current backup implementation.
Currently pgmoneta generates backups by invoking the `pg_basebackup` command against the Postgres database server. We want to get rid of this dependency on Postgres binaries by implementing ourselves the process this command triggers. This mainly involves handling the client side of Postgres's streaming replication protocol. And since WAL replication shares the same infrastructure and protocol, we will try to implement that part as well.
You can find the most up-to-date details in the progress log below (I'm a bit chatty). Or check out my proposal, though I should warn you that I wrote the proposal with limited knowledge of this project, so reading this forum will give you more accurate details.
I'm open to suggestions and ideas, feel free to discuss them here or email me at
andrewzhr9911@gmail.com
May the goddess Moneta be with us this summer :)
List of the work done and to do:
Note that more details about the milestones may be added to this list as I start to work on them (M for milestone)
Related issues:
Progress Log
08/04/2023
Finally, #140 is merged into the main branch. We got our native WAL infrastructure! This marks the end of my GSoC project; huge thanks to Jesper for his patient, responsive and inspiring guidance. I'll remain one of the contributors to the pgmoneta community and work towards more advanced backup strategies. For future developers, I sincerely hope the logs I kept below can be of help to you. Good luck!
08/03/2023
Jesper found that our WAL implementation is running at 100% CPU usage. I looked into it and realized that the socket is non-blocking by pgmoneta's default, therefore creating a busy while loop. We fixed that by adding `sleep(1000000L)` in between. Now CPU usage is stable at about 3%.
I also found that backup takes a much longer time now. I initially suspected it's because the child WAL process and the parent backup process share the same socket, because they have the same socket number. But it turns out this is normal, because the two processes have different file descriptor tables (https://stackoverflow.com/questions/27746417/multiple-local-processes-have-the-same-socket). Forgive my rusty network programming knowledge.
The real reason is simply that my database has gotten bigger because of too many pgbench runs. I guess I have to do some cleanup after pgbench. Also, we could look into a faster way to do our backup.
07/31/2023
The termination problem got fixed by `config->running`. I also enhanced our handling of receiving `CopyDone` from the server upon reaching the end of the timeline.
07/30/2023
I spent the weekend working on our receivewal implementation and got an initial but pretty nice version working. The server starts on the received log normally, and the program exits on server shutdown. One thing worth mentioning is that the server sometimes sends data across the WAL segment boundary, so I had to save those bytes and write them to the next WAL file later. Also, it's not stopping on `ctrl+c` and cannot deal with multiple timelines. Other than that, it's a really nice implementation.
07/29/2023
Turns out receiving WAL is a never-ending process, running in an infinite while loop, unless we are requesting WAL from an older timeline. So normally we should not expect to see `CommandComplete` or `CopyDone`.
I also spent some time figuring out the WAL naming format, and here is what I found.
As we know, each row of log in the WAL has an LSN, and multiple log records form one WAL segment, which also corresponds to one WAL file. So assume the segment size is 16MB, which is the Postgres default; then the address, or the offset of each log record within its segment, is 24 bits, and the total address length for an LSN is 64 bits, so we can roughly separate the address into
As for the segment address, the first 32 bits are what I call a segment group ID, or segment group address; call it X. The last 8 bits are what I call a segment in-group offset; call it Y.
Then the WAL file name format can be denoted as
Each part in this name has 4 bytes (32 bits), and the address, ID and offset are all in hex format, so each character in the name stands for 4 bits, and the remaining bits are padded with 0.
So now comes the interesting question we care about: if we know the LSN of a record, which basically is a 64-bit integer, how do we construct the xlog file name? Note that the segment size is likely not the default 16 MB, but the principle is the same. We first need to remove the in-log offset part, so we divide the LSN by the segment size, and we get what I call the segment number (segno); you can also call it segment ID or segment address. This part corresponds to the 40 higher bits we talked about earlier.
Then we must know how many segments there are within each segment group, which is equivalent to the total number of addresses a segment in-group offset can represent. Bear in mind that the segment group ID is always 32 bits, and the total length of the address is 64 bits, so the segment in-group offset together with the in-log offset also takes 32 bits. That means the total number of addresses available to a segment in-group offset is 2^32 / segment size; this number is what we want: the number of segments within each segment group.
To make this clearer, let's take the default 16MB segment size as an example: the in-log offset is 24 bits, and the segment in-group offset has 32 - 24 = 8 bits, so it can represent 2^8 addresses, which can also be derived from 2^32 / 2^24. You can also think about it this way: only when the in-group offset wraps past 0xFF does the higher 32-bit address increment by 1. This is why we call it the number of segments per segment group.
Now that we know the number of segments per group, the rest is easy. The segment group ID is just segno / segments-per-group, and the segment in-group offset is just segno % segments-per-group.
07/25/2023
The naming format of the WAL file is weird. I found a good post explaining it: https://www.crunchydata.com/blog/postgres-wal-files-and-sequuence-numbers
I did some simple coding to verify the process: basically I send the request and then just blindly receive messages and log their types. It is the same trick I used when writing the native backup, but it's not working for the WAL receiver. Firstly, it is much slower to get the `CopyBothResponse` or the log data; I don't know if that's normal. And it seems that the server stops sending me anything after just one `CopyData` message, and sometimes it stops after sending me just the `CopyBothResponse`. I'm still looking into why; I think investigating the sender-side code in Postgres is a good start.
07/23/2023
I'm starting to worry that time isn't on my side now. I'm studying the details of the wal replication process by reading the postgres implementation but I worry I'm not fast enough. Anyway I found a great blog explaining Postgres's timeline https://www.highgo.ca/2021/11/01/the-postgresql-timeline-concept/ and point-in-time recovery https://www.highgo.ca/2021/10/01/postgresql-14-continuous-archiving-and-point-in-time-recovery/
07/19/2023
#137 has been merged. I think it's a good time that we start working on our wal solution. I'll probably take the next few days off to catch a breath and work on some personal issues. But I'll still do some initial investigation and designing on this.
07/18/2023
The bug indeed lies within the WAL, but we have little control right now; it requires understanding the WAL handling in pgmoneta. You can find more details in the discussion below. We replaced symlink() with symlinkat() when restoring a backup. This way we can move the archive around without breaking the symlinks. It is magical.
07/16/2023
Jesper reported a bug on his side with our native backup last Friday (07/14), but I couldn't recreate the same error. I strongly suspect the problem lies within the WAL part. Since I have little control for now, I'll have to wait until Monday to discuss this further with Jesper. But I don't think this is a big problem.
I spent quite some time over the weekend working on our libarchive replacement and it finally worked, thanks to this reference wiki: https://github.com/libarchive/libarchive/wiki/Examples#a-basic-write-example. The PR has been submitted: #137. I couldn't decide what to do with tablespaces; it seems that user-level tablespaces are not restored. I wonder if this is a bug...
I also wonder why Jesper wants to replace his archive implementation with libarchive. His version works perfectly and I'm impressed that it's done from scratch.
07/12/2023
Thanks to Jesper, I think I found out why the server complains about a missing WAL segment. I suspect the problem is that I often just use `ctrl+c` to stop pgmoneta, leaving a partially received WAL segment. When pgmoneta is started again and tries to fetch this segment from the server, it may already have been removed on the server side, because I didn't use a replication slot to preserve it.
#134 is merged into the main branch, hooray!
07/11/2023
#130 has been merged. I'll be working on archive.c for the next few days.
The integration PR #134 is submitted. This integration is not perfect, especially when it comes to receiving WAL; we have too little control over it. Just executing a pg_receivewal command doesn't really go well with our native backup solution, and sometimes crashing bugs can occur.
There are some bugs I can fix though. For example, later I can work on skipping compression of symlinks.
07/06/2023
With the help of valgrind, I located several memory leak issues in both main branch and my own working branch. For the problem in our main branch, I created issue #131 and PR #132. Fix for my own implementation has been submitted to #130
07/05/2023
Ok, my mistake: Azure didn't accept my credit card. After changing the payment method I was able to launch a Fedora 37 instance, so problem solved.
Support for server version > 15 has been submitted to PR #130
07/04/2023
Jesper wants this PR to also work for servers whose version is > 15. I spent today working on the basic structure. I can't believe that the biggest challenge I faced today was finding a Fedora 37 AMI on AWS, because Postgres 15 is only available for Fedora > 36. I also tried Azure for a while, but they wouldn't let me rent a VM for some reason. I was able to find an AMI on AWS. It charges a little more for the software, but it does the job, sort of... I was able to use it after it first launched, but it cannot boot after I stop it. It's driving me crazy, because now I have to relaunch and re-install everything if I stop the instance. Anyway, I was able to verify my basic structure and the copy-out process. The rest of the implementation should be easy enough.
I'm actually quite curious about other developers: how do you set up your environment? Are you using VMs in the cloud, a dual-boot system, or anything else?
07/03/2023
This is just ridiculous: Postgres sends the default tablespace LAST! So I cannot update the symbolic link until I receive everything else. Ugh!
The new PR has been submitted, see #130. I will pause and take a break since Jesper's away and unable to provide feedback. I'm ahead of my schedule, so waiting a little doesn't really hurt :D This is probably a good time to look into how exactly receiving the xlog stream works.
06/30/2023
Great news! I got the tar file receiving and extraction to work! Now all I have to do is to do that again with other tablespaces and change the symbolic link under basedir.
06/26/2023
Ok, here are some details I found about the streamer process. First, regarding the manifest: it's stored directly as `backup_manifest.tmp` in the backup base directory and renamed to `backup_manifest` at the end; the renaming is the sign that streaming has completed. As for the backup data, one streamer called `bbstreamer_tar_parser` will first parse the data according to the tar format; the data will be labeled as `BBSTREAMER_MEMBER_HEADER`, `BBSTREAMER_MEMBER_CONTENTS`, `BBSTREAMER_MEMBER_TRAILER` or `BBSTREAMER_ARCHIVE_TRAILER`. If the data chunk for one particular label is incomplete, I think it will pause to receive more; I need to verify this. Anyway, the labeled chunk of data will then be sent to the next streamer: `bbstreamer_tar_header`. It acts differently according to the label. For example, I think it creates and opens the corresponding file on receiving header-labeled data, and writes subsequent content-labeled data to this file. The server seems to omit the terminating trailing chunks of zeros of the tar file when sending it; if the client doesn't untar the file, one streamer will append these chunks of zeros to it, but when parsing the tar file this seems unnecessary. I need to look into why. So I think I need to first understand the tar file format, and see if there's anything I can reuse in `wf_archive`, then do a cleaner implementation; the Postgres one is too complicated for us.
06/24/2023
I don't think we really need a "streamer". Postgres uses this object-oriented streamer design because it needs to support multiple compression, decompression and extraction ops, but we only need to extract the tar file, so one functionality should be enough.
06/23/2023
Got the third PR merged into the main branch. This one has the same functionality as `pgmoneta_query_execute`. I had to write it again because the data is now in a stream buffer. I also made some changes to `pgmoneta_consume_stream_buffer` so that I can make use of existing APIs. See #128
06/21/2023
The second PR of this project has been merged. I'm looking into what the compression workflow is actually doing, and it seems that it just compresses every single file in the data directory. So I still need to untar everything I receive. The next step is figuring out the relative directory structure and the names of all the files the server sends me, so that I know which directory to save those files to. This should be fairly simple.
06/17/2023
Turns out it was just one stupid mistake. Now it works perfectly. I already submitted PR, see #127
06/16/2023
Ok, I'm pretty sure something's wrong with my implementation, because the server keeps complaining about a broken pipe and losing connection with the client (which, unfortunately, is me). But if I just use pgmoneta_query_execute, the issue doesn't exist. I don't understand why, because I'm basically doing the same thing, except reusing the buffer. But this is a good direction to look into. I'm sorry this is taking so long. Fingers crossed I can get this nasty bug sorted by the end of this week.
06/14/2023
Ok, turns out `SHA256` needs to be enclosed in quotation marks (`'SHA256'`). Now the server accepts the command. Weirdly, `wal.c` is complaining, even though I did nothing to it. I had to disable `wal.c` for now and come back to deal with it later, because even though the server gets the command, it cannot send data to my side. I need to look into this first. This is the error message on the server side:
06/13/2023
I implemented functionality to read data into a stream buffer and consume it one message at a time. It should work, except it doesn't: read() keeps returning -1. I fixed some small issues with my BASE_BACKUP command, but that didn't fix the problem. So it's debugging time!
I also found that in replication mode (with the replication flag set to true when connecting to the server), the server doesn't recognize simple queries such as asking for the server version. So I have to connect to the server with no replication flag set, ask for the version, disconnect, and reconnect with the replication flag set. Let me know if this causes any problems or if you have better ways.
So of all the things I tried, I forgot the easiest one: just checking the primary server log. And it just says 'syntax error'...
The current command is like this:
`BASE_BACKUP LABEL 'pgmoneta_base_backup_20230614034921' FAST MANIFEST 'yes' MANIFEST_CHECKSUMS SHA256`
Let me know if there's anything off.
06/06/2023
So after 2 days of enlightening (and painful) source code reading, I think I finally get most details clear. Let's start with the beginning.
Before sending the `BASE_BACKUP` command to the server, the client does 2 things first. `runIdentifySystem` gets the timeline ID; this is used for WAL backup I think, so I'm not going to implement it in milestone 2 (and we probably don't need it in M3 either, since for server version > 9.3 the timeline ID is also sent as row data when the replication starts). And `generateRecoveryConf` is needed for server version < 12; the generated `recovery.conf` will later be injected into the main tablespace of the backup data. But it seems we don't need this either, since in the `pg_basebackup` command we currently use, the `-R` option is not specified (what a relief lol).
After receiving the `BASE_BACKUP` command, the server sends the backup data (along with a manifest, if we ask for it, and we do) and WAL (if we ask for it using `-X stream`, and we do) over two different connections. In Postgres, receiving the WAL stream is handled in a child process; this part mainly uses a function called `ReceiveXlogStream`, which will be our main focus in M3, and since it's also used in `wal.c`, we can add it to the common functions.
As for our current main focus, receiving backup data, the protocol is actually not very complicated in terms of logic. The server will first send two ordinary result sets (in row data format). The first one tells you the starting position and timeline ID of the WAL; we can ignore this in M2. The second one gives you info like the tablespace name, one row for each tablespace. Then it starts to send out the actual backup data, one tar file for each tablespace, along with a manifest at the end. The data is sent like this: first the server sends a `CopyOutResponse` message, which denotes the starting point of the copy stream, then a bunch of `CopyData` messages, always one row of archive data per message, and finally a `CopyDone` message, marking the end of the copy stream.
For server version < 15, the server starts multiple copy streams, meaning for each tablespace and the manifest it sends one `CopyOutResponse`, followed by the `CopyData` messages and a `CopyDone`. For version >= 15, everything is sent in one copy stream. The copy-out data now has a format to distinguish everything, which I believe is not yet updated in the doc. It's like this: `|--TypeByte--|--Data--|`
There are 4 types of copy data:
- `'n'`: the starting point of a new archive; the payload is two strings, the archive file name and its location/address.
- `'d'`: the actual data, which could be archive or manifest data.
- `'p'`: a progress report; I think we can ignore this for now.
- `'m'`: the starting point of the manifest data; there is no payload for this message.

It is worth mentioning that for the archive name the server sends, if it's `NULL` or `'\0'`, it means this is the main data directory and the file should be named `base.tar`.
As we receive copy-out data, we want to write it to disk. Postgres does this by assigning a streamer to each archive; it has a FILE descriptor and writes each received chunk of data to the file. It also does special things like compression or untarring. Judging from our current `pg_basebackup` command options, I believe we want plain files, so we need to untar the archive at some point. I think it's easier to untar the archive once it's been received completely, but Postgres parses the tar file as it's being received; I need to discuss this with Jesper.
Another problem Jesper and I both noticed is that our current read-data/query-execution functionality is not sufficient to handle a large data stream. For one thing, the `CopyData` message has a different format from the `DataRow` message, so we need to parse it differently. Another, more serious problem is that our buffer memory is ever-growing, i.e. it never shrinks or reuses the memory of messages already consumed. Postgres handles this by maintaining cursors of bytes received and bytes consumed in the buffer. Every time before it reads new data, it left-shifts the unconsumed data using `memmove`, overwriting consumed bytes and making room for new messages. The buffer still grows, but almost only when the available space is less than 8192 B; this is to prevent reading a partial packet, which is usually shorter than 8192 B. This is probably a good fix for our problem, but it needs more discussion.
06/02/2023
I was not very productive this week, because I accumulated so many questions about the details of the replication protocol just by reading the doc (honestly I don't even know what to ask). I had no choice but to dive into the Postgres source code today. The workflow became so much clearer, but the workload is large. I got a lot of the questions answered, but more questions emerged. And I need to mind the server version as well.
So my plan for the next few days is to dig into the source code and figure out every detail I can, then decide what functionalities I should implement to get the protocol working. My plan is to leave out WAL streaming (functionalities in this part can be shared with our `wal.c`) for the time being and just focus on receiving base backup data. My worry right now is that since everything sent over TCP arrives stuck together, and our current query execution implementation receives it all at once, we may run out of memory, because we are not writing backup data to disk until the very end. And there could be trouble handling the logic, since `CopyOutResponse` or `CopyData` are not ordinary query responses (they carry raw data, not just rows and columns).
Several functions that are worth digging into:
- `ReceiveXlogStream`: this is for receiving WAL, so not very urgent for now.
- `PQgetResult`: does Postgres have tricks for handling sticky packets and the `CopyOutResponse` message? Does it prevent itself from reading backup data (because it needs to be received later by special functions)?
- `ReceiveArchiveStream`: receives the backup tar stream all at once, for version >= 15. How does it receive everything without eating up all the memory? Why is it tar if the command allows plain data? How does it separate everything (data and manifest for each file) apart?
- `ReceiveTarFile`: receives the tar data for each file, for version < 15. Again, why only tar? And how does it separate everything apart?
- `ReceiveBackupManifest`: for receiving the manifest data for each file.
- `fsync_pgdata`: makes the data persistent on disk.

Also some conceptual questions, like what's an archive, and what's the difference between archive and data?
05/30/2023
So I struggled a lot with the details of the protocol and the command format, and I think I may be overthinking this. There's no need to stick to everything the protocol says; it's easier to just query the version again. Also, we don't need to use all the options of the BASE_BACKUP command. Only a few are useful to us and a lot can be hard-coded for our purpose. That saves me a lot of trouble here. We already got this part of the functionality merged into the main branch, see #125
05/26/2023
Regarding the version, I did find a function called `pgmoneta_read_version`, but that only reads from a txt file called `/PG_VERSION`, which is generated by `pg_basebackup` AFTER the backup is created. So I need to create some function to read the server version here. According to the doc, after the frontend's password is authenticated, the backend sends the parameters it finds interesting before sending the `ReadyForQuery` message, and the server version is included among them. I just need to receive that message after authentication.
05/24/2023
I'm slowly starting the project; the pace is expected at the beginning. So far I have created two branches: 0001_common and 0002_native_backup. On the native backup branch I just finished sending the startup message. I thought I needed to create and send the startup message myself, but it turns out it's already taken care of in `pgmoneta_server_authenticate`, so I'm now focusing on creating the backup message on the common branch.
I was able to reuse some code from other message creation functions, but BASE_BACKUP still has some differences. For one thing, it has gone through some changes in terms of format and options. Here are the formats for BASE_BACKUP from v10 to v15:
BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ]
BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ]
BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ]
BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ] [ MANIFEST manifest_option ] [ MANIFEST_CHECKSUMS checksum_algorithm ]
BASE_BACKUP [ LABEL 'label' ] [ PROGRESS ] [ FAST ] [ WAL ] [ NOWAIT ] [ MAX_RATE rate ] [ TABLESPACE_MAP ] [ NOVERIFY_CHECKSUMS ] [ MANIFEST manifest_option ] [ MANIFEST_CHECKSUMS checksum_algorithm ]
BASE_BACKUP [ ( option [, ...] ) ]
Guess they finally realized how ridiculously long this command has become :| (in this version there are 14 options). Plus, in this new version the options are separated by ',' instead of spaces.
So naturally I have to accommodate these changes. I think we probably have some functionality to check the server version; if not, I'll just have to implement one. I can use some ideas here~
Another thing that's really bothersome is that, as you can see, there are many options, and all of them are optional. So I'm thinking it might be worthwhile to create a struct `base_backup_options` to wrap these parameters, otherwise the function definition will be too long.
05/17/2023
We got Jesper's patch merged into main branch. This will be the foundation of this project. This patch includes:
- The `pgmoneta_create_XXX_message` functions, which are used to create a given command to be sent to the Postgres server. I can refer to these when I create the BASE_BACKUP command message.
- `pgmoneta_query_execute`, which executes the query, i.e. sends the command message to the server and receives the response. I can probably use that to send the BASE_BACKUP command directly, or at least refer to it. But I'm still not sure what the response to a BASE_BACKUP command is; I'll look into the documentation for this.

The functionalities above follow the Frontend/Backend Protocol; the message format can be found here. Once BASE_BACKUP is received by the server, the Streaming Replication Protocol takes over. This will be the second and major part of this project. These documentations are very important and I'll make sure I understand the details in the next few days.
We also created issue #123 for this project.