Bug 1974364: Change the way of gathering ovn db #245

Merged

Conversation

npinaeva
Member

Change the way of gathering the OVN db from compacting + copying db files to ovsdb-client backup.
This change should solve the following problems:

  • copying the .db file might pick up partial data
  • even after compacting, it's not guaranteed that the file contains no transactions (new ones could be appended after compacting and before copying the file), which makes the file difficult to handle for parsers other than ovsdb-server, e.g. insights

Also, files from ovsdb-client backup may be used by ovsdb-client restore for local debugging.

Test:
Run must-gather, check network_logs/<OVNKUBE_MASTER_POD>_nbdb.gz and network_logs/<OVNKUBE_MASTER_POD>_sbdb.gz files.
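
For reference, a rough sketch of what the new gathering step boils down to, based on the sbdb snippet reviewed later in this thread (the nbdb analogue, the variable names and the gzip step are assumptions, not the exact diff):

# Sketch only: back up the NB and SB dbs via ovsdb-client instead of copying the .db files.
# OVNKUBE_MASTER_POD and NETWORK_LOG_PATH are assumed to be set by the gather script.
for db in nb sb; do
  oc exec -n openshift-ovn-kubernetes "${OVNKUBE_MASTER_POD}" -c "${db}db" -- \
    ovsdb-client backup unix:/var/run/ovn/ovn${db}_db.sock \
    > "${NETWORK_LOG_PATH}/${OVNKUBE_MASTER_POD}_${db}db"
  gzip "${NETWORK_LOG_PATH}/${OVNKUBE_MASTER_POD}_${db}db"
done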

Example of previous nbdb file:

OVSDB CLUSTER 147818 853ae92cae5be5e42481dc8f3d5a1fa390512145
{
        "cluster_id": "9f6509ea-c238-40c5-9c36-ade7b0c84b8c",
        "local_address": "ssl:10.0.140.173:9643",
        "name": "OVN_Northbound",
        "prev_data": [
             {
                "cksum": "2352750632 28701",
                "name": "OVN_Northbound",
                "tables": { ... },
                "version": "5.31.0"
            },
            {
                "ACL": {...},
                "Address_Set": {...},
                ...,
                "_comment": "compacting database online",
            }
        ],
        "prev_eid": "dfe07494-a8af-4c0d-88c0-5846733ad367",
        "prev_election_timer": 4000,
        "prev_index": 2543,
        "prev_servers": {
            "029b740c-993b-48f1-8c42-8b83c1cdd81e": "ssl:10.0.236.27:9643",
            "421f5dd9-8be4-4b8f-a951-ff1f91a7e0ed": "ssl:10.0.182.236:9643",
            "8aea2933-fc67-4b77-a2a9-fc4deccdafd5": "ssl:10.0.140.173:9643"
        },
        "prev_term": 21,
        "server_id": "8aea2933-fc67-4b77-a2a9-fc4deccdafd5"
    },
OVSDB CLUSTER 58 0d4b5064adad35db8615b6ed1ae18f71e8a81332
{
        "term": 21,
        "vote": "8aea2933-fc67-4b77-a2a9-fc4deccdafd5"
}

Example of current nbdb file (contains information from "prev_data" of old nbdb file):

OVSDB JSON 12282 edd18dab995cc426f8b957d0c20ab0b1e8079ffb
{
            "cksum": "2352750632 28701",
            "name": "OVN_Northbound",
            "tables": { ... },
            "version": "5.31.0"
}
OVSDB JSON 194111 882154e9ab233dc48ba3ef523db3bf220a7a5da1
{
            "ACL": {...},
            "Address_Set": {...},
            ...,
            "_comment": "produced by \"ovsdb-client backup\"",
}

The new version of the db files also contains more fields, e.g.:
old version:

"ACL": {
           "002d0da5-d575-4e18-b1b3-4f6d335361b1": {
               "action": "allow-related",
                "direction": "to-lport",
                "match": "ip4.src==10.128.0.2",
                "name": "",
                "priority": 1001
         },

new version:

"ACL": {
            "002d0da5-d575-4e18-b1b3-4f6d335361b1": {
                "action": "allow-related",
                "direction": "to-lport",
                "external_ids": [
                    "map",
                    []
                ],
                "log": false,
                "match": "ip4.src==10.128.0.2",
                "meter": [
                    "set",
                    []
                ],
                "name": "",
                "priority": 1001,
                "severity": [
                    "set",
                    []
                ]
            },

About ephemeral columns - the new db files contain ephemeral columns like Connection.is_connected (https://github.com/ovn-org/ovn/blob/master/ovn-nb.ovsschema#L467), although the ovsdb-client backup docs say:

The output does not include ephemeral columns, which by design do not survive across restarts of ovsdb-server.

"Connection": {
            "32a56ea9-b04e-491a-a811-0915ee0d52a3": {
                "external_ids": [
                    "map",
                    []
                ],
                "inactivity_probe": 60000,
                "is_connected": false,
                "max_backoff": [
                    "set",
                    []
                ],
                "other_config": [
                    "map",
                    []
                ],
                "status": [
                    "map",
                    []
                ],
                "target": "pssl:9641"
            }
        },

Not sure if it's the case, but this link says

Clustered OVSDB does not support the OVSDB “ephemeral columns” feature. ovsdb-tool and ovsdb-client change ephemeral columns into persistent ones when they work with schemas for clustered databases. Future versions of OVSDB might add support for this feature.

@openshift-ci openshift-ci bot added the bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. label Jun 29, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 29, 2021

@npinaeva: This pull request references Bugzilla bug 1974364, which is invalid:

  • expected the bug to target the "4.9.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1974364: Change the way of gathering ovn db

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jun 29, 2021
@npinaeva
Member Author

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/severity-unspecified Referenced Bugzilla bug's severity is unspecified for the PR. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jun 30, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jun 30, 2021

@npinaeva: This pull request references Bugzilla bug 1974364, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (anusaxen@redhat.com), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@amorenoz

Thanks for the PR @npinaeva.

About ephemeral columns - the new db files contain ephemeral columns like Connection.is_connected (https://github.com/ovn-org/ovn/blob/master/ovn-nb.ovsschema#L467), although the ovsdb-client backup docs say:

The output does not include ephemeral columns, which by design do not survive across restarts of ovsdb-server.

Can you check if, as you also point out, ovsdb-server has removed the "ephemeral" flag of that column? You should be able to see it in the schema, which is the first OVSDB JSON object:

OVSDB JSON 12282 edd18dab995cc426f8b957d0c20ab0b1e8079ffb
{
            "cksum": "2352750632 28701",
            "name": "OVN_Northbound",
            "tables": { ... },
            "version": "5.31.0"
}
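
A quick way to check in the gathered file (zgrep and the .gz file name are illustrative, assuming the backup ends up gzipped as described in the test section):

# Illustrative only: the first OVSDB JSON record of the backup is the schema; if ovsdb-server
# had kept the flag, the column definitions would contain "ephemeral": true somewhere.
# Prints 0 if no "ephemeral" markers survive in the dump.
zgrep -c '"ephemeral"' "${NETWORK_LOG_PATH}/${OVNKUBE_MASTER_POD}_nbdb.gz"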

If we want to keep ephemeral columns, another approach that could at least guarantee we don't store transactions (which make the file bigger and harder to parse for programs other than ovsdb-server, such as insights) is:
first cp the file and then run ovsdb-tool compact on the copy. What do you think?

@amorenoz

Also, if you end up using ovsdb-client backup, you might want to use the --timeout option, which would ensure that a broken/buggy ovsdb-server does not block the must-gather script. If the timeout does expire, maybe cp is the best we can do.

@npinaeva
Member Author

Thank you for the comments, @amorenoz

  1. Yes, I forgot to attach the schema description; it says nothing about ephemeral:
"Connection": {
                "columns": {
                    ...
                    "is_connected": {
                        "type": "boolean"
                    },
                   ...
                },

But the cp'ed report has the same description for the is_connected column, although its Connection instance looks like this:

                "Connection": {
                    "32a56ea9-b04e-491a-a811-0915ee0d52a3": {
                        "inactivity_probe": 60000,
                        "target": "pssl:9641"
                    }
                },

Also found this comment https://bugzilla.redhat.com/show_bug.cgi?id=1818754#c8, which states that ephemeral columns are not supported for clustered databases and are changed to persistent ones (our Connection table has 2 ephemeral columns: status and is_connected), so maybe these columns have not been ephemeral for a long time now, which also means it shouldn't be a problem for gathering.

The next thing we should decide on is the desired format for the gathered db files:

  1. cp + compact: smaller file size (may also get partial data while copying)
  2. backup: can be used by ovsdb directly

If we first copy the file and then compact it, I think there is no difference from creating a backup as far as parsers are concerned (both methods just create a snapshot of the db)?

About file size: the difference for just created cluster is (old way->new way): nbdb=148kb->206kb, sbdb=2mb->2.2mb. Not sure if this difference is considerable and if it's going to increase on a real-life cluster.

So, I think there is no need to copy db file manually, we can use the benefits of backup considering the difference in file size is not so important to us. What do you think? I also think we may need @astoycos opinion on that (and also to confirm that ephemeral columns may not exist anymore)

> "${NETWORK_LOG_PATH}"/"${OVNKUBE_MASTER_POD}"_sbdb_size_pre_compact

oc exec -n openshift-ovn-kubernetes "${OVNKUBE_MASTER_POD}" -c sbdb -- ovs-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact
oc exec -n openshift-ovn-kubernetes "${OVNKUBE_MASTER_POD}" -c sbdb -- ovsdb-client backup unix:/var/run/ovn/ovnsb_db.sock> \
Contributor

nit: space before >

@astoycos
Contributor

astoycos commented Jul 1, 2021

Also, if you end up using ovsdb-client backup, you might want to use the --timeout option

We should definitely do this. You could implement logic so that, if the ovsdb-client command times out, we just manually cp out the file instead.
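
A rough sketch of that fallback logic, assuming the same pod/socket layout as the snippet reviewed above (the timeout value and the in-pod db path are illustrative):

# Illustrative only: try ovsdb-client backup with a timeout; fall back to copying the raw file.
if ! oc exec -n openshift-ovn-kubernetes "${OVNKUBE_MASTER_POD}" -c sbdb -- \
     ovsdb-client --timeout=30 backup unix:/var/run/ovn/ovnsb_db.sock \
     > "${NETWORK_LOG_PATH}/${OVNKUBE_MASTER_POD}_sbdb"; then
  # ovsdb-server did not answer in time: grab the on-disk file instead (path is an assumption).
  oc exec -n openshift-ovn-kubernetes "${OVNKUBE_MASTER_POD}" -c sbdb -- \
    cat /etc/ovn/ovnsb_db.db > "${NETWORK_LOG_PATH}/${OVNKUBE_MASTER_POD}_sbdb.db"
fi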

About file size: the difference for just created cluster is (old way->new way): nbdb=148kb->206kb, sbdb=2mb->2.2mb. Not sure if this difference is considerable and if it's going to increase on a real-life cluster.

No, I don't think this is considerable; we should try not to move back to cp'ing.

On the ephemeral columns, I don't think they are super important for debugging... If the server is not connected to the db then we would most likely be seeing other "no connection" errors all over OVN/OVN-K.

@amorenoz

amorenoz commented Jul 1, 2021

The next thing we should decide on is the desired format for the gathered db files:

1. cp + compact: smaller file size (may also get partial data while copying)

2. backup: can be used by ovsdb directly

If we first copy the file and then compact it, I think there is no difference from creating a backup as far as parsers are concerned (both methods just create a snapshot of the db)?

About file size: the difference for just created cluster is (old way->new way): nbdb=148kb->206kb, sbdb=2mb->2.2mb. Not sure if this difference is considerable and if it's going to increase on a real-life cluster.

So, I think there is no need to copy db file manually, we can use the benefits of backup considering the difference in file size is not so important to us. What do you think? I also think we may need @astoycos opinion on that (and also to confirm that ephemeral columns may not exist anymore)

Looking at some of the problems we're having with large-scale clusters, I think I've changed my mind.
I'd say the best option is to copy (first) and then compact. The reason is simple: ovsdb-server is single-threaded, and both compact+cp (the current approach) and backup could take the ovsdb-server "down" for some time. On big, busy clusters this could impact the overall OVN control plane.
(Note that compacting after cp is done with ovsdb-tool instead of ovs-appctl.)
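
A sketch of that ordering as proposed, run against the sbdb container (the in-pod file locations are assumptions, and whether ovsdb-tool compact accepts clustered db files comes up later in the thread):

# Illustrative only: snapshot the on-disk file first, then compact the copy offline
# so ovsdb-server itself is never asked to do the work.
oc exec -n openshift-ovn-kubernetes "${OVNKUBE_MASTER_POD}" -c sbdb -- \
  sh -c 'cp /etc/ovn/ovnsb_db.db /tmp/ovnsb_db.snapshot.db && \
         ovsdb-tool compact /tmp/ovnsb_db.snapshot.db'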

@amorenoz

amorenoz commented Jul 1, 2021

No, I don't think this is considerable; we should try not to move back to cp'ing.

Thinking about this "to cp or not to cp" dilemma: writing into the file is done using buffered IO, which is flushed just after each transaction:

https://github.com/openvswitch/ovs/blob/f8be30acf2eb60d567bb7386b98f5cb58ddb9119/ovsdb/log.c#L622-L624

Without a locking mechanism, it's technically possible to copy partial data. However, this should not happen very often; in fact, the possibility is neglected even by the ovsdb-client man page, which claims:

    Another way to back up a standalone or active-backup database is to copy its database file, e.g. with cp. This is safe even if the database is in use.

If this rare situation does happen, we can always ask for another must-gather report.

However, I think interacting with the system we're trying to gather information from is wrong by design. Besides, if ovsdb-server itself is crashed, hung, buggy, etc. (which might even be the reason we want to collect information in the first place), all these interactions could fail.

On the ephemeral columns, I don't think they are super important for debugging... If the server is not connected to the db then we would most likely be seeing other "no connection" errors all over OVN/OVN-K.

Agree

@npinaeva
Member Author

npinaeva commented Jul 2, 2021

Sooo, the key point here from ovsdb-client man page is:

Another way to back up a standalone or active-backup database is to copy its database file, e.g. with cp.

And ovs docs says

OVSDB supports three service models for databases: standalone, active-backup, and clustered.

So, our case is clustered db, and ovsdb-tool doesn't work with clustered dbs:

This command also does not work with clustered databases. Instead, in either case, send the ovsdb-server/compact command to ovsdb-server, via ovs-appctl).

Thus, if we don't want to bother ovsdb-server we can only cp the ovsdb files, but without compacting it's very difficult to analyse/use these files (because the db is clustered, it looks like this):

OVSDB CLUSTER 158 004a93d515217d124e07e6bff2458f2b5ea6f33b
{"servers":{"88d82f10-adfb-456c-b221-264101a35ea4":"ssl:10.0.129.35:9643","c46f5142-875c-474c-a3f8-6d5b3e881647":"ssl:10.0.214.113:9643"},"term":1,"index":3}
OVSDB CLUSTER 19 0130e3a167ccf41c3c4248c8e49bb926e93fd1dd
{"commit_index":3}
OVSDB CLUSTER 221 cf0ab7d175713bc716d536488e854c9c71db5311
{"servers":{"88d82f10-adfb-456c-b221-264101a35ea4":"ssl:10.0.129.35:9643","c46f5142-875c-474c-a3f8-6d5b3e881647":"ssl:10.0.214.113:9643","792b8cd4-9561-4fad-9df8-57f7951a5d57":"ssl:10.0.160.184:9643"},"term":1,"index":4}
OVSDB CLUSTER 19 dcb67953db57bc4dd10cf2f30aa6a23d12cb9a08
{"commit_index":4}
OVSDB CLUSTER 447 97cadefe63a253326e69e1422cb8c490ae276aeb
{"eid":"bc607e80-58f2-4764-a817-dddbf2aab0b2","data":[null,{"_date":1625155578930,"NB_Global":{"08f12187-3683-472c-b0b2-101ad2520f21":{"connections":["uuid","e7bd7c13-ac10-435a-b08f-23ddc94b22ea"]}},"Connection":{"e7bd7c13-ac10-435a-b08f-23ddc94b22ea":{"target":"pssl:9641","inactivity_probe":60000}},"_comment":"ovs-nbctl: ovn-nbctl --no-leader-only -t 5 set-connection pssl:9641 -- set connection . inactivity_probe=60000"}],"term":1,"index":5}
OVSDB CLUSTER 19 b5f2ad8b2cee96398cdb1a8c7e88747aa0095c32
{"commit_index":5}

We could also try to run ovsdb-server somewhere outside the cluster and make it serve the copied ovsdb files, but since the db is clustered and you can see fields like "local_address": "ssl:10.0.140.173:9643" in the db files, I don't think you can just swap in its db file; making "a copy" of an ovsdb cluster may be a struggle.

The only way to get nice output directly is to ask ovsdb-server either to compact the db or to back it up, meaning we have to bother ovsdb-server, so I've run out of ideas here.

Any thoughts please?

@npinaeva
Member Author

npinaeva commented Jul 2, 2021

Also, from ovsdb(7)

A more common backup strategy is to periodically take and store a snapshot. For the standalone and active-backup service models, making a copy of the database file, e.g. using cp, effectively makes a snapshot, and because OVSDB database files are append-only, it works even if the database is being modified when the snapshot takes place. This approach does not work for clustered databases.
Another way to make a backup, which works with all OVSDB service models, is to use ovsdb-client backup, which connects to a running database server and outputs an atomic snapshot of its schema and content, in the same format used for standalone and active-backup databases.

Which I think means that using ovsdb-client backup is the only way to gather data from a cluster and also the only way to restore data from a snapshot later.

@amorenoz

amorenoz commented Jul 2, 2021

Sooo, the key point here from ovsdb-client man page is:

Another way to back up a standalone or active-backup database is to copy its database file, e.g. with cp.

And ovs docs says

OVSDB supports three service models for databases: standalone, active-backup, and clustered.

So, our case is clustered db, and ovsdb-tool doesn't work with clustered dbs:

This command also does not work with clustered databases. Instead, in either case, send the ovsdb-server/compact command to ovsdb-server, via ovs-appctl).

Ugh. We could use cluster-to-standalone, which we would need to run anyway if we want to restore the content of the db on another instance of ovsdb-server, but we would lose the raft logs.

The only way to get nice output directly is to ask ovsdb-server either to compact db or to backup it, meaning we have to bother ovsdb-server, so I've run out of ideas here.

Any thoughts please?

If we 'cp' first we need to drop the raft information, but maybe we can print it in another manner. I'll take a deeper look at the cluster-related commands to see which is less intrusive (backup, compact or dumping the clustered data).

@npinaeva
Member Author

npinaeva commented Jul 2, 2021

I think ovsdb-client backup also returns the db in a standalone format, because there is no specific format for clustered db dumps (a clustered db can also be restored from the standalone format). I am also wondering: do you think we need this "clustered" format of the db? I can't see how it can be used as opposed to standalone.

@amorenoz

amorenoz commented Jul 2, 2021

If there are issues with the ovsdb-server clustering algorithm (raft), we would need to look into the clustered-only columns/logs. So, we should see if this information can be dumped by other means.

@npinaeva
Member Author

npinaeva commented Jul 5, 2021

I see, but then I suppose we could use ovsdb-tool show-log, as https://docs.openvswitch.org/en/latest/ref/ovsdb.7/#viewing-history suggests.
It returns the same information that's written in the db files but in a different, more readable format:
raw_db_file:

OVSDB CLUSTER 22 88864d6c5d12d321baf3e0078adf50e9d4b9de60
{"commit_index":2408}
OVSDB CLUSTER 208 4ab7a94a45694fd8f98c06ae431d0a78142e64e1
{"eid":"e968064f-42d7-45d6-bd06-cb5cf92019f0","data":[null,{"_date":1625477813378,"_comment":"ovn-northd","Logical_Switch_Port":{"cea64fe9-37ba-40e3-bbab-f0bfa32a90ef":{"up":false}}}],"term":17,"index":2409}

show-log:

record 4821:
 commit_index: 2408

record 4822:
 term: 17
 index: 2409
 eid: e968
 2021-07-05 09:36:53.378 "ovn-northd"
  table Logical_Switch_Port row "openshift-marketplace_redhat-operators-kbcxj" (cea64fe9):
    up=false

And then we can have ovsdb-client backup output as a database snapshot and ovsdb-tool show-log output as cluster logs. What do you think?
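
For completeness, a sketch of how that could be captured (the in-pod db path is an assumption, and -m is assumed to be the verbosity that produces the per-record summary shown above):

# Illustrative only: dump the raft/transaction history of the on-disk clustered db.
oc exec -n openshift-ovn-kubernetes "${OVNKUBE_MASTER_POD}" -c nbdb -- \
  ovsdb-tool show-log -m /etc/ovn/ovnnb_db.db \
  > "${NETWORK_LOG_PATH}/${OVNKUBE_MASTER_POD}_nbdb_log"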

@amorenoz

amorenoz commented Jul 8, 2021

I see, but then I suppose we could use ovsdb-tool show-log, as https://docs.openvswitch.org/en/latest/ref/ovsdb.7/#viewing-history suggests.
It returns the same information that's written in the db files but in a different, more readable format:
raw_db_file:

OVSDB CLUSTER 22 88864d6c5d12d321baf3e0078adf50e9d4b9de60
{"commit_index":2408}
OVSDB CLUSTER 208 4ab7a94a45694fd8f98c06ae431d0a78142e64e1
{"eid":"e968064f-42d7-45d6-bd06-cb5cf92019f0","data":[null,{"_date":1625477813378,"_comment":"ovn-northd","Logical_Switch_Port":{"cea64fe9-37ba-40e3-bbab-f0bfa32a90ef":{"up":false}}}],"term":17,"index":2409}

show-log:

record 4821:
 commit_index: 2408

record 4822:
 term: 17
 index: 2409
 eid: e968
 2021-07-05 09:36:53.378 "ovn-northd"
  table Logical_Switch_Port row "openshift-marketplace_redhat-operators-kbcxj" (cea64fe9):
    up=false

And then we can have ovsdb-client backup output as a database snapshot and ovsdb-tool show-log output as cluster logs. What do you think?

My biggest concern is the potential impact on the cluster itself if the dbs are big (e.g., stopping pods from coming up for tens of seconds).

I would say the safest thing would be:

  • run 'ovsdb-client' on the db to dump the important information that will be lost in the next steps
  • copy the db file
  • run ovsdb-tool cluster-to-standalone
  • run ovsdb-tool compact
  • save the standalone-compacted db

cc @dceara
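
A rough sketch of that sequence, run inside the db container (paths and file names are assumptions; the first step is written with ovsdb-tool show-log, which, as clarified below, is probably what was meant by 'ovsdb-client'):

# 1. dump the raft/cluster history that the conversion below will drop
ovsdb-tool show-log -m /etc/ovn/ovnsb_db.db > /tmp/ovnsb_db.log
# 2. snapshot the on-disk clustered db
cp /etc/ovn/ovnsb_db.db /tmp/ovnsb_db.clustered.db
# 3. convert the copy to a standalone db (output file first, clustered input second)
ovsdb-tool cluster-to-standalone /tmp/ovnsb_db.standalone.db /tmp/ovnsb_db.clustered.db
# 4. compact the standalone copy offline
ovsdb-tool compact /tmp/ovnsb_db.standalone.db
# 5. save /tmp/ovnsb_db.standalone.db (and /tmp/ovnsb_db.log) into the must-gather archive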

@npinaeva
Member Author

npinaeva commented Jul 8, 2021

I would say the safest thing would be:

* run 'ovsdb-client' on the db to dump the important information that will be lost in the next steps

* copy the db file

* run ovsdb-tool cluster-to-standalone

* run ovsdb-tool compact

* save the standalone-compacted db

Do you mean run ovsdb-tool show-logs first to collect logs and then cp db file to only get db snapshot? Or what command for ovsdb-client should be run first?

@dceara

dceara commented Jul 8, 2021

I would say the safest thing would be:

* run 'ovsdb-client' on the db to dump the important information that will be lost in the next steps

* copy the db file

* run ovsdb-tool cluster-to-standalone

* run ovsdb-tool compact

* save the standalone-compacted db

Do you mean run ovsdb-tool show-logs first to collect logs and then cp db file to only get db snapshot? Or what command for ovsdb-client should be run first?

I think @amorenoz meant ovsdb-tool show-log indeed, or at least that's what seems to me to be missing from this process.

But this makes me wonder, why not just copy the original, uncompacted, db file; gzip it and that's it? There's the small risk that we get some inconsistent data (because of buffered I/O) but except for that we would be getting the most useful information. Other tools can later run offline compaction/change to standalone, if it makes their life easier.

We only have 3 SB DB files (maybe larger) and 3 NB DB files (usually way smaller than the SB ones).
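
In script terms this is about as small as it gets (a sketch; the in-pod db path is an assumption):

# Sketch only: grab the raw clustered db files and compress them, with no ovsdb-server interaction.
for db in nb sb; do
  oc exec -n openshift-ovn-kubernetes "${OVNKUBE_MASTER_POD}" -c "${db}db" -- \
    cat /etc/ovn/ovn${db}_db.db > "${NETWORK_LOG_PATH}/${OVNKUBE_MASTER_POD}_${db}db.db"
  gzip "${NETWORK_LOG_PATH}/${OVNKUBE_MASTER_POD}_${db}db.db"
done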

@amorenoz

amorenoz commented Jul 8, 2021

I would say the safest thing would be:

* run 'ovsdb-client' on the db to dump the important information that will be lost in the next steps

* copy the db file

* run ovsdb-tool cluster-to-standalone

* run ovsdb-tool compact

* save the standalone-compacted db

Do you mean run ovsdb-tool show-logs first to collect logs and then cp db file to only get db snapshot? Or what command for ovsdb-client should be run first?

I think @amorenoz meant ovsdb-tool show-log indeed, or at least that's what seems to me like is missing with this process.

ovsdb-tool cannot be filtered, right? I don't see the benefit of 'ovsdb-tool show-log' vs copying the uncompacted, clustered db file.

But this makes me wonder, why not just copy the original, uncompacted, db file; gzip it and that's it? There's the small risk that we get some inconsistent data (because of buffered I/O) but except for that we would be getting the most useful information. Other tools can later run offline compaction/change to standalone, if it makes their life easier.

We only have 3 SB DB files (maybe larger) and 3 NB DB files (usually way smaller than the SB ones).

The problem is those tools have to be run manually, since insights cannot run ovs binaries automatically. This whole discussion started as a way to better integrate with insights. However, the conclusion seems to be that we only have 4 ways to go:

  1. we duplicate the db content (copy both the db and the compacted-standalone db)
  2. we lose raft (+ ephemeral) content (just copy the compacted-standalone db)
  3. we interact with the DB (via ovs-appctl or ovsdb-client)
  4. we just copy the db as is and let the rest of the tooling happen at debugging time (i.e., we do the work on the insights side)

We can keep discussing things like:

  • would the size of the archive be a blocker for 1?
  • how bad would 2 be in terms of lost information? @dceara?

But the more we discuss, the more I lean towards 4

@dceara

dceara commented Jul 8, 2021

The problem is those tools have to be run manually, since insights cannot run ovs binaries automatically. This whole discussion started as a way to better integrate with insights. However, the conclusion seems to be that we only have 4 ways to go:

1. we duplicate the db content (copy both the db and the compacted-standalone db)

2. we lose raft (+ ephemeral) content (just copy the compacted-standalone db)

3. we interact with the DB (via ovs-appctl or ovsdb-client)

4. we just copy the db as is and let the rest of the tooling happen at debugging time (i.e., we do the work on the insights side)

We can keep discussing things like:

* would the size of the archive be a blocker for `1`?

* how bad would `2` be in terms of lost information? @dceara?

With 2 we also lose recent transaction history, which, if the data is collected early, is often very useful for debugging. E.g., it allows checking when something was deleted, whereas in the compacted version we just don't have the record.

But the more we discuss, the more I lean towards 4

Me too.

@npinaeva
Member Author

npinaeva commented Jul 8, 2021

I think we will still lose some information about transactions, because ovsdb-server automatically compacts databases when they grow too much (and compacting only leaves the current db state in the file), but copying the db file really seems like the only way to collect data without interacting with ovsdb.

And since just copying the raw db file is the most informative way to collect the data, maybe we should do that (although extracting transaction data from that file may be difficult).

Do we agree on that?

@dceara

dceara commented Jul 9, 2021

I think we will still lose some information about transactions, because ovsdb-server automatically compacts databases when they grow too much (and compacting only leaves the current db state in the file), but copying the db file really seems like the only way to collect data without interacting with ovsdb.

And since just copying the raw db file is the most informative way to collect the data, maybe we should do that (although extracting transaction data from that file may be difficult).

It's actually the opposite: having the raw db file is the only way to get transaction data, so this is a plus in my opinion.

Do we agree on that?

This sounds good to me (from core OVN perspective) but I'll let @amorenoz share his opinions too from tools (insights) perspective.

@npinaeva
Member Author

npinaeva commented Jul 9, 2021

It's actually the opposite: having the raw db file is the only way to get transaction data, so this is a plus in my opinion.

I'm comparing extracting transaction data from the db file with reading the nice output of ovsdb-tool show-log, but since we decided not to interact with the db, copying the db really is the only way to get that data.

@amorenoz

I think we don't have any other way. We'll have to work around the format being difficult to parse at the insights/tooling stage. +1 for just copying the file

@sferich888
Contributor

I am a bit worried about the size of these db files that we're collecting and how much that will affect the overall size of the must-gather archive. If I recall correctly, we made a switch away from collecting the db files directly so that we could reduce the overall size of the archive.

That said, given that the gather_network_logs script is optional (and not run by default), I don't see an issue with switching back to this if engineering feels it's the only way to debug situations within the SDN. We may want to warn users about the DB/archive size that might be generated.

@amorenoz

From another thread, @dceara reported the following sizes of the db files in a scale test (i.e., we could consider these worst-case):

uncompacted: 319M (66M gzipped)
compacted: 131M (27M gzipped)

I agree that if this was to be collected by default, we would have to find a way to reduce the footprint significantly. If these numbers are still too high, we could consider having two flavors of the gather_network_logs script.

@npinaeva
Member Author

So, do you think this PR needs some changes? Or should we leave it as is?

@sferich888
Contributor

Leave it

@sferich888
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 16, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 16, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: npinaeva, sferich888

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 16, 2021
@openshift-merge-robot openshift-merge-robot merged commit d4b3f38 into openshift:master Jul 16, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 16, 2021

@npinaeva: All pull requests linked via external trackers have merged:

Bugzilla bug 1974364 has been moved to the MODIFIED state.

In response to this:

Bug 1974364: Change the way of gathering ovn db

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@npinaeva npinaeva deleted the change-ovn-db-gathering branch July 20, 2021 09:34