Functional pipeline UI #8

Merged
merged 27 commits into from
Jun 10, 2023

Conversation

kalensk
Owner

@kalensk kalensk commented Jun 10, 2023

No description provided.

jimmyyih and others added 27 commits May 23, 2023 14:19
It's very important to make sure the master and standby master are
synchronized before running Initialize and Execute. This is because
Revert will rsync the standby master data directory onto the master
data directory which can cause severe master corruption if the standby
master is not up-to-date. We have this check for GPDB 6X+ but it was
seemingly missed for 5X because the catalog view gp_stat_replication
does not exist on 5X. However, we're only checking the master and
standby master replication status so there's no need to use the
cluster-wide gp_stat_replication view; we can simply use
pg_stat_replication which will only report the replication status of
the individual segment (in this case, just the master segment).
We already check that the cluster is synchronized when creating the
source cluster config file at the beginning of Initialize. However,
the state of the source cluster could have changed any time afterwards
between that early Initialize step and upgrading the master segment in
Execute. If the cluster was out-of-sync and we run Revert after
Execute, there is potential for severe catalog corruption. In
particular, this is more likely to happen to the master and standby
master since there's a GPDB-specific mechanism to bypass the
synchronous WAL replication when the standby master is not
responsive. To prevent this, we add a quick sanity check for cluster
synchronization before the Initialize pg_upgrade checks and more
importantly before doing the actual upgrade operation in Execute.
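
A minimal sketch of the kind of check described above, not the actual gpupgrade implementation. It assumes the rows have already been read from pg_stat_replication on the master, and that a healthy standby shows `state = 'streaming'` and `sync_state = 'sync'`. The query text, type, and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// replicationQuery is a hypothetical query against the per-segment view.
// The column names are an assumption about the source cluster's catalog.
const replicationQuery = `SELECT state, sync_state FROM pg_stat_replication`

// replicationRow holds the two columns this sketch looks at.
type replicationRow struct {
	State     string
	SyncState string
}

var errNotSynchronized = errors.New("master and standby master are not synchronized")

// checkStandbySynchronized returns nil only if at least one replication row
// is streaming and in sync. An empty result (no standby connected) is
// treated as out-of-sync.
func checkStandbySynchronized(rows []replicationRow) error {
	for _, row := range rows {
		if row.State == "streaming" && row.SyncState == "sync" {
			return nil
		}
	}
	return fmt.Errorf("%w: %d replication row(s), none streaming and in sync",
		errNotSynchronized, len(rows))
}

func main() {
	// In a real check the rows would come from running replicationQuery.
	rows := []replicationRow{{State: "streaming", SyncState: "sync"}}
	if err := checkStandbySynchronized(rows); err != nil {
		fmt.Println("refusing to continue:", err)
		return
	}
	fmt.Println("standby master is synchronized")
}
```

Running the same function both before the Initialize pg_upgrade checks and again right before the upgrade in Execute would cover the window in which the cluster state can change.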
Pass the DUMP_PATH environment variable with the path to the dump file
when making the functional pipeline such as
`make DUMP_PATH=dump/5X/dump.sql.xz functional-pipeline`
Set max_statement_mem to be half of the total memory, and set
statement_mem to be 1/16th of the max_statement_mem.

For an n2-standard-16 instance with 64GB of memory this sets
max_statement_mem to 32GB and statement_mem to 2GB.
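
The arithmetic can be sketched as follows. The function name and the choice of megabytes as the unit are assumptions made for illustration only.

```go
package main

import "fmt"

// statementMemSettings derives both settings from total memory in MB:
// max_statement_mem is half of total memory, and statement_mem is
// 1/16th of max_statement_mem.
func statementMemSettings(totalMemoryMB int) (maxStatementMemMB, statementMemMB int) {
	maxStatementMemMB = totalMemoryMB / 2
	statementMemMB = maxStatementMemMB / 16
	return maxStatementMemMB, statementMemMB
}

func main() {
	// 64GB instance, expressed in MB.
	maxMB, stmtMB := statementMemSettings(64 * 1024)
	fmt.Printf("max_statement_mem=%dMB statement_mem=%dMB\n", maxMB, stmtMB)
	// prints "max_statement_mem=32768MB statement_mem=2048MB"
}
```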
Ensure resource filenames are unique with git branch name to avoid
collisions when multiple pipelines are run.

Specify BranchName and resource filenames during pipeline creation since
the Concourse GCS resource does not guarantee pulling the latest
resource. See issue frodenas/gcs-resource#47
When printing the duration just pass string rather than the timer object
to keep things simple. No need for the function to have access to the
entire object when it's not needed.
Remove AddClusters which is no longer used. Rename add_clusters.go to
for clarity.
To be consistent with the online documentation and internal naming use
--upgrade-id when running `gpupgrade config show --upgrade-id`.
Keep it simple! Since upgrade ID is only ever used as a string there is
no reason not to just have it be that type. It also allows us to
serialize it in config.json as a string rather than a random number
making it easy to correlate and debug on customer systems.
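
A minimal sketch of the idea, assuming a trimmed-down config struct. The struct, its field name, and the JSON key are hypothetical and do not reflect the real config.json layout.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config is a hypothetical stand-in for the contents of config.json.
// The upgrade ID is a plain string, so it is serialized exactly as it is
// displayed elsewhere.
type config struct {
	UpgradeID string `json:"UpgradeID"`
}

// marshalConfig returns the JSON form of the config as a string.
func marshalConfig(c config) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := marshalConfig(config{UpgradeID: "example-id"})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out) // prints {"UpgradeID":"example-id"}
}
```

With a string type, the value in the file can be searched for directly in logs or command output without any conversion.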
Add yearMonthDay to filename format.
Show coverage in unit and integration output.
We need to wait for both the work being done and the progress bars
being incremented (i.e., rendered).

The progress bar is tied to the number of seed scripts executed rather
than the number of actual generated scripts written. This is because the
progress bar needs to know up front how many units of work to do. It
makes sense to increment the bar right after the seed scripts are
executed since this is what the actual bar is tied to. The previous
implementation had that.

However, there is a small section of code in which the bar is
incremented before the generated script is fully written to. This may
show the bar being completed even though there is still a little work to
do in writing the file. In reality this is negligible since for even
extremely large scripts the file is written very fast on current
systems.

Given all of the above some people find it easier to increment the bar
after the script was written even though this is technically not
correct. Given the way the code is structured this results in the bar
being incremented in two places: 1) once after the seed scripts are run
and there is no generated script output to write, and 2) after the seed
scripts are run and after the generated script is written.

So it is a tradeoff of having the increment scattered in two places
versus a negligible scenario where the progress bar could show being
completed while the file is being written.
Consolidate all (non-unit) test files into the test directory.

The motivation is that golang acceptance tests will soon be added to slowly
replace the BATs tests. While working out where best to place the upcoming
golang acceptance tests, we found that the existing integration tests were
scattered across the repository.

Co-authored-by: Kevin Yeap <kyeap@vmware.com>
Previously, we were using a passed constraint on the terraform resource
such as:
```
- get: terraform
  passed: [ generate-cluster ]
```

However, for long-running jobs (3+ days) it appears Concourse would expire
or roll off the specific terraform resource associated with the
generate-cluster job. This would cause the destroy jobs to hang while
"waiting for a suitable set of input versions":
`terraform - no satisfiable builds from passed jobs found for set of
inputs`

Thus, pass the saved_cluster_env_files from GCP to the ccp_destroy task
to avoid this issue.
The previous version of golangci/golangci-lint-action@v2 was showing the
following warning:

Warning: The `save-state` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/
For extra post-upgrade validation in the end-to-end tests, run gpcheckcat.
This is in addition to validating the standby and mirrors post-upgrade
by failing over and failing back.

Note, we do not run gpcheckcat or validate mirrors after the pg_upgrade
acceptance tests or multi-host gpupgrade acceptance tests. This would
require leaving the cluster intact after running the tests which is
cumbersome during dev iteration and upon failures.
The timestamp in the log prefix was not being updated. The logger from
the go standard library sets a static prefix during initialization,
and does not re-evaluate it before each log message. Thus, the timestamp
was fixed at startup. Use the built-in standard log timestamp instead.

Update log prefix to:
2023/06/07 14:44:11 gpupgrade:gpadmin:sdw-1:161174 [INFO]: State directory /home/gpadmin/.gpupgrade already present. Skipping.

from:
20230607:14:44:11 gpupgrade:gpadmin:sdw-1:161174-[INFO]:-State directory /home/gpadmin/.gpupgrade already present. Skipping.
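
A minimal sketch of how the standard library can produce the new format, assuming the logger is built with `log.LstdFlags` and `log.Lmsgprefix`. Whether gpupgrade uses exactly these flags is an assumption, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// newLogger builds a logger whose timestamp is generated by the log package
// for every message. Lmsgprefix places the static prefix after the
// timestamp rather than at the start of the line.
func newLogger(w io.Writer, user, host string, pid int) *log.Logger {
	prefix := fmt.Sprintf("gpupgrade:%s:%s:%d ", user, host, pid)
	return log.New(w, prefix, log.LstdFlags|log.Lmsgprefix)
}

// renderLine captures a single INFO line as a string so it can be inspected.
func renderLine(user, host string, pid int, msg string) string {
	var sb strings.Builder
	newLogger(&sb, user, host, pid).Printf("[INFO]: %s", msg)
	return sb.String()
}

func main() {
	logger := newLogger(os.Stdout, "gpadmin", "sdw-1", 161174)
	logger.Printf("[INFO]: %s",
		"State directory /home/gpadmin/.gpupgrade already present. Skipping.")
}
```

Only the user, host, and PID are baked into the prefix; the date and time come from the log package on each call, so they can no longer go stale.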
Tsquery types cannot be part of a distribution key. When generating
ALTER statements that change the tsquery datatype to the text datatype
there is no need to check whether the tsquery column is being used as a
distribution key. This check was likely left over from when
gen_fix_tsquery_to_text.sql was copied from the script that handled
deprecated name types.
Revert and finalize data migration scripts were generating duplicate
tsquery partition table indexes. Indexes on partition tables using
tsquery fall under the jurisdiction of two different data migration
scripts.

1. recreate_partition_indexes_step
2. recreate_indexes_on_deprecated_built_in_types

To fix this, recreate_partition_indexes_step_1.sql will no longer
generate CREATE INDEX statements for tsquery partition tables.

This was decided based on the following considerations:
1. recreate_indexes_on_deprecated_built_in_types is a script with a
   narrower scope.
2. We do not want to create a script order dependency where a CREATE
   INDEX statement in recreate_partition_indexes_step depends on an
   ALTER statement in the corresponding
   gen_fix_tsquery_to_text/gen_change_tsquery_to_text script.
Altering a column type when the column is a partition key is not
allowed. For this reason, the initialize migration scripts for the
deprecated tsquery type do not generate an ALTER statement for tables
partitioned on a tsquery column. The revert and finalize migration scripts
should likewise not generate an ALTER statement for tables that are
partitioned on a tsquery column.

When recreating indexes on partition tables during revert and finalize,
dropped child partition indexes are also brought back. This is a known
behavior that has not been addressed. Tsquery partition index
recreation is not immune to this behavior.

Commit where tsquery ALTER statements were disabled in initialize for tables
partitioned on a tsquery column:
1c455b8
Update the name parameter to Run Tests for clarity when viewing in the
github actions page.
@kalensk kalensk merged commit 0f5f120 into main Jun 10, 2023
3 participants