diff --git a/.gitignore b/.gitignore
index eea397c6..8e499dbd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,4 @@
 # Hugo
 public/
 resources/_gen
+.hugo_build.lock
diff --git a/content/en/docs/release-oversight/disruption-testing/_index.md b/content/en/docs/release-oversight/disruption-testing/_index.md
new file mode 100644
index 00000000..c89f8f9b
--- /dev/null
+++ b/content/en/docs/release-oversight/disruption-testing/_index.md
@@ -0,0 +1,4 @@
+---
+title: "Disruption Testing"
+description: An overview of how disruption tests work and are configured.
+---
diff --git a/content/en/docs/release-oversight/disruption-testing/backend_queries.md b/content/en/docs/release-oversight/disruption-testing/backend_queries.md
new file mode 100644
index 00000000..ddbab876
--- /dev/null
+++ b/content/en/docs/release-oversight/disruption-testing/backend_queries.md
@@ -0,0 +1,88 @@
+---
+title: "Testing Backends For Availability"
+description: An overview of how backends are queried for their availability status.
+---
+
+### Overview Diagram
+
+This diagram shows how backends are queried to determine their availability:
+
+![Query Backends1](/query_backends1.png)
+
+
+* (1) Starting from a call to
+  [StartAllAPIMonitoring](https://github.com/openshift/origin/blob/08eb7795276c45f2be16e980a9687e34f6d2c8ec/test/extended/util/disruption/controlplane/known_backends.go#L13),
+  one of several BackendSamplers is created:
+
+{{% card-code header="[origin/test/extended/util/disruption/controlplane/known_backends.go](https://github.com/openshift/origin/blob/08eb7795276c45f2be16e980a9687e34f6d2c8ec/test/extended/util/disruption/controlplane/known_backends.go#L54)" %}}
+
+```go
+	backendSampler, err := createKubeAPIMonitoringWithNewConnections(clusterConfig)
+```
+
+{{% /card-code %}}
+
+* (2) Then a `disruptionSampler` is created with that `BackendSampler`:
+
+{{% card-code header="[origin/pkg/monitor/backenddisruption/disruption_backend_sampler.go](https://github.com/openshift/origin/blob/08eb7795276c45f2be16e980a9687e34f6d2c8ec/pkg/monitor/backenddisruption/disruption_backend_sampler.go#L410)" %}}

+```go
+	disruptionSampler := newDisruptionSampler(b)
+	go disruptionSampler.produceSamples(producerContext, interval)
+	go disruptionSampler.consumeSamples(consumerContext, interval, monitorRecorder, eventRecorder)
+```
+
+{{% /card-code %}}
+
+* (3) The `produceSamples` function is called to produce the disruptionSamples. This function is built around
+  a [`Ticker`](https://go.dev/src/time/tick.go) that fires every second. On each tick, the `checkConnection` function is
+  called to send an HTTP GET request to the backend and check for a response.
+
+{{% card-code header="[origin/pkg/monitor/backenddisruption/disruption_backend_sampler.go](https://github.com/openshift/origin/blob/08eb7795276c45f2be16e980a9687e34f6d2c8ec/pkg/monitor/backenddisruption/disruption_backend_sampler.go#L506)" %}}
+
+
+```go
+func (b *disruptionSampler) produceSamples(ctx context.Context, interval time.Duration) {
+	ticker := time.NewTicker(interval)
+	defer ticker.Stop()
+	for {
+		// the sampleFn may take a significant period of time to run. In such a case, we want our start interval
+		// for when a failure started to be the time when the request was first made, not the time when the call
+		// returned. Imagine a timeout set on a DNS lookup of 30s: when the GET finally fails and returns, the outage
+		// was actually 30s before.
+		currDisruptionSample := b.newSample(ctx)
+		go func() {
+			sampleErr := b.backendSampler.checkConnection(ctx)
+			currDisruptionSample.setSampleError(sampleErr)
+			close(currDisruptionSample.finished)
+		}()
+
+		select {
+		case <-ticker.C:
+		case <-ctx.Done():
+			return
+		}
+	}
+}
+```
+
+{{% /card-code %}}
+
+* (4) The `checkConnection` function produces `disruptionSamples`, which record the start time of the HTTP GET and
+  an associated `sampleErr` that tracks whether the HTTP GET succeeded (`sampleErr` set to `nil`) or failed (the error
+  is saved). The `disruptionSamples` are stored in a slice referenced by the `disruptionSampler`.
+
+* (5) The `consumeSamples` function takes the disruptionSamples and determines when disruption started and stopped. It
+  then records Events and Intervals/Conditions on the monitorRecorder.
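Conceptually, the consumer's job is to coalesce runs of consecutive failed samples into disruption intervals with a start and stop time. A simplified, self-contained sketch of that reduction (types and names are illustrative, not origin's):

```go
package main

import (
	"fmt"
	"time"
)

// sample is one probe result: when the request started and whether it failed.
type sample struct {
	at     time.Time
	failed bool
}

// interval is a contiguous window of failing samples.
type interval struct {
	from, to time.Time
}

// coalesce walks samples in order and merges consecutive failures into
// intervals, mirroring how the consumer decides when disruption started
// and stopped.
func coalesce(samples []sample) []interval {
	var out []interval
	open := false
	for _, s := range samples {
		switch {
		case s.failed && !open:
			out = append(out, interval{from: s.at, to: s.at})
			open = true
		case s.failed && open:
			out[len(out)-1].to = s.at
		default:
			open = false
		}
	}
	return out
}

func main() {
	base := time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC)
	var samples []sample
	// one-second cadence: ok, fail, fail, ok, fail
	for i, failed := range []bool{false, true, true, false, true} {
		samples = append(samples, sample{at: base.Add(time.Duration(i) * time.Second), failed: failed})
	}
	for _, iv := range coalesce(samples) {
		fmt.Printf("disruption from %s to %s\n", iv.from.Format("15:04:05"), iv.to.Format("15:04:05"))
	}
}
```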
+
+
+{{% card-code header="[origin/pkg/monitor/backenddisruption/disruption_backend_sampler.go](https://github.com/openshift/origin/blob/master/pkg/monitor/backenddisruption/disruption_backend_sampler.go#L504)" %}}
+
+```go
+func (b *disruptionSampler) consumeSamples(ctx context.Context, interval time.Duration, monitorRecorder Recorder, eventRecorder events.EventRecorder) {
+```
+
+{{% /card-code %}}
+
+* (6) Intervals on the monitorRecorder are used by the synthetic tests.
\ No newline at end of file
diff --git a/content/en/docs/release-oversight/disruption-testing/code-implementation.md b/content/en/docs/release-oversight/disruption-testing/code-implementation.md
new file mode 100644
index 00000000..0836249c
--- /dev/null
+++ b/content/en/docs/release-oversight/disruption-testing/code-implementation.md
@@ -0,0 +1,252 @@
+---
+title: "Code Implementation"
+description: An overview of how disruption tests are implemented, the core logic that makes use of the historical data, and how to go about adding new tests.
+---
+
+## Overview
+
+{{% alert title="Note!" color="primary" %}}
+In the examples below we use the `Backend Disruption` tests, but the same will hold true for the alert durations.
+{{% /alert %}}
+
+To measure our ability to provide upgrades to OCP clusters with minimal
+downtime, the Disruption Testing framework monitors select backends and
+records disruptions in the backend service availability.
+This document serves as an overview of the framework used to provide
+disruption testing and how to configure new disruption tests when needed.
+
+## Matcher Code Implementation
+
+Now that we have a better understanding of how the disruption test data is generated and updated, let's discuss how the code makes use of it.
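At its core, the way the code uses the historical data is simple: look up the allowed disruption for this backend and job configuration, then fail the test when the observed disruption exceeds it. Here is a standalone sketch with simplified, made-up types and values (not origin's actual API):

```go
package main

import "fmt"

// jobKey identifies a backend within a particular job configuration.
type jobKey struct {
	Backend  string
	Platform string
	Release  string
}

// allowedP95 holds a historical 95th-percentile disruption allowance,
// in seconds, per backend and job configuration (values invented here).
var allowedP95 = map[jobKey]float64{
	{Backend: "kube-api-new-connections", Platform: "aws", Release: "4.11"}: 2.0,
}

// check returns an error when observed disruption exceeds the allowance.
func check(key jobKey, observedSeconds float64) error {
	allowed, ok := allowedP95[key]
	if !ok {
		// origin falls back to fuzzy matching here; the sketch just
		// uses a conservative default.
		allowed = 1.0
	}
	if observedSeconds > allowed {
		return fmt.Errorf("%s disrupted for %.1fs, more than allowed %.1fs", key.Backend, observedSeconds, allowed)
	}
	return nil
}

func main() {
	key := jobKey{Backend: "kube-api-new-connections", Platform: "aws", Release: "4.11"}
	fmt.Println(check(key, 1.5)) // within the historical allowance
	fmt.Println(check(key, 4.0)) // exceeds it
}
```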
+
+### Best Matcher
+
+The [origin/pkg/synthetictests/allowedbackenddisruption/query_results.json](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedbackenddisruption/query_results.json) file that we updated previously is embedded into the `openshift-tests` binary. At runtime, we ingest the raw data and create a `historicaldata.NewMatcher()` object which implements the `BestMatcher` interface.
+
+{{% card-code header="[origin/pkg/synthetictests/allowedbackenddisruption/types.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedbackenddisruption/types.go#L53-L77)" %}}
+
+```go
+//go:embed query_results.json
+var queryResults []byte
+
+var (
+	readResults    sync.Once
+	historicalData historicaldata.BestMatcher
+)
+
+const defaultReturn = 2.718
+
+func getCurrentResults() historicaldata.BestMatcher {
+	readResults.Do(
+		func() {
+			var err error
+			genericBytes := bytes.ReplaceAll(queryResults, []byte(` "BackendName": "`), []byte(` "Name": "`))
+			historicalData, err = historicaldata.NewMatcher(genericBytes, defaultReturn)
+			if err != nil {
+				panic(err)
+			}
+		})
+
+	return historicalData
+}
+```
+
+{{% /card-code %}}
+
+### Best Guesser
+
+The core logic of the current best matcher checks whether we have an exact match in the historical data. An exact match is one that contains the same `Backend Name` and [JobType](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/platformidentification/types.go#L16-L23). When we don't have an exact match, we make a best-guess effort via fuzzy matching: we iterate through all the `nextBestGuessers`, stop at the first one whose criteria fit, and check whether the resulting key is contained in the data set.
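That lookup order — exact key first, then a chain of fallback guessers — can be sketched in isolation with simplified types (illustrative only; origin's real types live in `platformidentification` and `historicaldata`):

```go
package main

import "fmt"

// jobType is a simplified stand-in for platformidentification.JobType.
type jobType struct {
	Release     string
	FromRelease string
	Platform    string
}

type dataKey struct {
	Name    string
	JobType jobType
}

// nextBestKey proposes a fallback JobType, or reports false when the
// guess does not apply.
type nextBestKey func(jobType) (jobType, bool)

// microReleaseUpgrade turns a minor-version upgrade into a micro upgrade
// within the target release (e.g. 4.10->4.11 becomes 4.11->4.11).
func microReleaseUpgrade(in jobType) (jobType, bool) {
	if in.FromRelease == "" || in.FromRelease == in.Release {
		return jobType{}, false
	}
	out := in
	out.FromRelease = in.Release
	return out, true
}

// bestMatch tries the exact key, then walks the guessers until one
// produces a key that is present in the data set.
func bestMatch(data map[dataKey]float64, name string, jt jobType, guessers []nextBestKey) (float64, string, bool) {
	if v, ok := data[dataKey{Name: name, JobType: jt}]; ok {
		return v, "exact match", true
	}
	for _, guess := range guessers {
		next, ok := guess(jt)
		if !ok {
			continue
		}
		if v, ok := data[dataKey{Name: name, JobType: next}]; ok {
			return v, fmt.Sprintf("fell back to %+v", next), true
		}
	}
	return 0, "no match", false
}

func main() {
	data := map[dataKey]float64{
		// only micro-upgrade data exists for this backend
		{Name: "kube-api", JobType: jobType{Release: "4.11", FromRelease: "4.11", Platform: "aws"}}: 2.0,
	}
	p95, how, ok := bestMatch(data, "kube-api",
		jobType{Release: "4.11", FromRelease: "4.10", Platform: "aws"},
		[]nextBestKey{microReleaseUpgrade})
	fmt.Println(p95, how, ok)
}
```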
+
+{{% card-code header="[origin/pkg/synthetictests/historicaldata/types.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/types.go#L89-L111)" %}}
+
+```go
+	exactMatchKey := DataKey{
+		Name:    name,
+		JobType: jobType,
+	}
+
+	if percentiles, ok := b.historicalData[exactMatchKey]; ok {
+		return percentiles, "", nil
+	}
+
+	for _, nextBestGuesser := range nextBestGuessers {
+		nextBestJobType, ok := nextBestGuesser(jobType)
+		if !ok {
+			continue
+		}
+		nextBestMatchKey := DataKey{
+			Name:    name,
+			JobType: nextBestJobType,
+		}
+		if percentiles, ok := b.historicalData[nextBestMatchKey]; ok {
+			return percentiles, fmt.Sprintf("(no exact match for %#v, fell back to %#v)", exactMatchKey, nextBestMatchKey), nil
+		}
+	}
+```
+
+{{% /card-code %}}
+
+### Default Next Best Guessers
+
+[Next Best Guessers](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L13-L53) are functions that can be chained together and return `true` or `false` depending on whether the current `JobType` matches the desired logic. In the code snippet below, we check whether `MicroReleaseUpgrade` matches the current `JobType`; if it returns false, we continue down the list. The [combine](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L179-L191) helper function gives you the option to chain and compose a more sophisticated check.
In the example below, if we can do a [PreviousReleaseUpgrade](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L100-L113), its result is fed into [MicroReleaseUpgrade](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L80-L98); if no function returns `false` during this chain, we have successfully fuzzy matched and can now check whether the historical data has information for this match.
+
+{{% card-code header="Ex: `nextBestGuessers` [origin/pkg/synthetictests/historicaldata/next_best_guess.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L13-L53)" %}}
+
+```go
+var nextBestGuessers = []NextBestKey{
+	MicroReleaseUpgrade,
+	PreviousReleaseUpgrade,
+	...
+	combine(PreviousReleaseUpgrade, MicroReleaseUpgrade),
+	...
+}
+```
+
+{{% /card-code %}}
+
+{{% card-code header="Ex: `PreviousReleaseUpgrade` [origin/pkg/synthetictests/historicaldata/next_best_guess.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L100-L113)" %}}
+
+```go
+// PreviousReleaseUpgrade if we don't have data for the current toRelease, perhaps we have data for the congruent test
+// on the prior release. A 4.11 to 4.11 upgrade will attempt a 4.10 to 4.10 upgrade. A 4.11 no upgrade, will attempt a 4.10 no upgrade.
+func PreviousReleaseUpgrade(in platformidentification.JobType) (platformidentification.JobType, bool) { + toReleaseMajor := getMajor(in.Release) + toReleaseMinor := getMinor(in.Release) + + ret := platformidentification.CloneJobType(in) + ret.Release = fmt.Sprintf("%d.%d", toReleaseMajor, toReleaseMinor-1) + if len(in.FromRelease) > 0 { + fromReleaseMinor := getMinor(in.FromRelease) + ret.FromRelease = fmt.Sprintf("%d.%d", toReleaseMajor, fromReleaseMinor-1) + } + return ret, true +} +``` + +{{% /card-code %}} + +{{% card-code header="Ex: `MicroReleaseUpgrade` [origin/pkg/synthetictests/historicaldata/next_best_guess.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/historicaldata/next_best_guess.go#L80-L98)" %}} + +```go +// MicroReleaseUpgrade if we don't have data for the current fromRelease and it's a minor upgrade, perhaps we have data +// for a micro upgrade. A 4.10 to 4.11 upgrade will attempt a 4.11 to 4.11 upgrade. +func MicroReleaseUpgrade(in platformidentification.JobType) (platformidentification.JobType, bool) { + if len(in.FromRelease) == 0 { + return platformidentification.JobType{}, false + } + + fromReleaseMinor := getMinor(in.FromRelease) + toReleaseMajor := getMajor(in.Release) + toReleaseMinor := getMinor(in.Release) + // if we're already a micro upgrade, this doesn't apply + if fromReleaseMinor == toReleaseMinor { + return platformidentification.JobType{}, false + } + + ret := platformidentification.CloneJobType(in) + ret.FromRelease = fmt.Sprintf("%d.%d", toReleaseMajor, toReleaseMinor) + return ret, true +} +``` + +{{% /card-code %}} + +## Adding new disruption tests + +Currently disruption tests are focused on disruptions created during upgrades. 
+To add a new backend to monitor during the upgrade test, add a new `backendDisruptionTest` via `NewBackendDisruptionTest`:
+
+{{% card-code header="Ex: `NewBackendDisruptionTest` [origin/test/extended/util/disruption/backend_sampler_tester.go](https://github.com/openshift/origin/blob/master/test/extended/util/disruption/backend_sampler_tester.go#L34-L41)" %}}
+```go
+func NewBackendDisruptionTest(testName string, backend BackendSampler) *backendDisruptionTest {
+	ret := &backendDisruptionTest{
+		testName: testName,
+		backend:  backend,
+	}
+	ret.getAllowedDisruption = alwaysAllowOneSecond(ret.historicalP95Disruption)
+	return ret
+}
+
+```
+{{% /card-code %}}
+and register it in the e2e upgrade `AllTests`:
+
+{{% card-code header="Ex: `AllTests` [origin/test/e2e/upgrade/upgrade.go](https://github.com/openshift/origin/blob/master/test/e2e/upgrade/upgrade.go#L54-L86)" %}}
+```go
+func AllTests() []upgrades.Test {
+	return []upgrades.Test{
+		&adminack.UpgradeTest{},
+		controlplane.NewKubeAvailableWithNewConnectionsTest(),
+		controlplane.NewOpenShiftAvailableNewConnectionsTest(),
+		controlplane.NewOAuthAvailableNewConnectionsTest(),
+		controlplane.NewKubeAvailableWithConnectionReuseTest(),
+		controlplane.NewOpenShiftAvailableWithConnectionReuseTest(),
+		controlplane.NewOAuthAvailableWithConnectionReuseTest(),
+
+		...
+	}
+}
+```
+{{% /card-code %}}
+
+{{% card-code header="Ex: `NewKubeAvailableWithNewConnectionsTest` [origin/test/extended/util/disruption/controlplane/controlplane.go](https://github.com/neisw/origin/blob/ce3a9bb9e3f5662873214cc0d2dd03e9748f3c14/test/extended/util/disruption/controlplane/controlplane.go#L13-L22)" %}}
+```go
+func NewKubeAvailableWithNewConnectionsTest() upgrades.Test {
+	restConfig, err := monitor.GetMonitorRESTConfig()
+	utilruntime.Must(err)
+	backendSampler, err := createKubeAPIMonitoringWithNewConnections(restConfig)
+	utilruntime.Must(err)
+	return disruption.NewBackendDisruptionTest(
+		"[sig-api-machinery] Kubernetes APIs remain available for new connections",
+		backendSampler,
+	)
+}
+
+```
+{{% /card-code %}}
+
+
+
+If this is a completely new backend being tested, then [query_results](https://github.com/openshift/origin/blob/master/pkg/synthetictests/allowedbackenddisruption/query_results.json)
+data will need to be added. Alternatively, `NewBackendDisruptionTestWithFixedAllowedDisruption` can be used instead of `NewBackendDisruptionTest` and the allowable disruption hardcoded.
+
+### Updating test data
+
+{{% alert color="primary" %}}
+For information on how to get the historical data please refer to the [Architecture Diagram](../data-architecture)
+{{% /alert %}}
+
+Allowable disruption values can be added / updated in [query_results](https://github.com/openshift/origin/blob/master/pkg/synthetictests/allowedbackenddisruption/query_results.json).
+Disruption data can be queried from BigQuery using [p95Query](https://github.com/openshift/origin/blob/master/pkg/synthetictests/allowedbackenddisruption/types.go).
+
+## Disruption test framework overview
+
+
+{{< inlineSVG file="/static/disruption_test_flow.svg" >}}
+
+
+To check for disruptions while upgrading OCP clusters:
+
+- The tests are defined by [AllTests](https://github.com/neisw/origin/blob/46f376386ab74ecfe0091552231d378adf24d5ea/test/e2e/upgrade/upgrade.go#L53)
+- The disruption is defined by [clusterUpgrade](https://github.com/neisw/origin/blob/46f376386ab74ecfe0091552231d378adf24d5ea/test/e2e/upgrade/upgrade.go#L270)
+- These are passed into [disruption.Run](https://github.com/neisw/origin/blob/2a97f51d4981a12f0cadad53db133793406db575/test/extended/util/disruption/disruption.go#L81)
+- Which creates a new [Chaosmonkey](https://github.com/neisw/origin/blob/59599fad87743abf4c84f05952552e6d42728781/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go#L48) and [executes](https://github.com/neisw/origin/blob/59599fad87743abf4c84f05952552e6d42728781/vendor/k8s.io/kubernetes/test/e2e/chaosmonkey/chaosmonkey.go#L78) the disruption monitoring tests and the disruption
- The [backendDisruptionTest](https://github.com/neisw/origin/blob/0c50d9d8bedbd2aa0af5c8a583418601891ee9d4/test/extended/util/disruption/backend_sampler_tester.go#L34) is responsible for
+  - Creating the event broadcaster, recorder and monitor
+  - [Attempting to query the backend](../backend_queries) and timing out after the max interval (1 second typically)
+  - Analyzing the disruption events for disruptions that exceed allowable values
+- When the disruption is complete, the disruption tests are validated via Matches / BestMatcher to find periods that exceed allowable thresholds
+  - [Matches](https://github.com/neisw/origin/blob/43d9e9332d5fb148b2e68804200a352a9bc683a5/pkg/synthetictests/allowedbackenddisruption/matches.go#L11) will look for an entry in
[query_results](https://github.com/openshift/origin/blob/master/pkg/synthetictests/allowedbackenddisruption/query_results.json); if an exact match is not found, it will utilize [BestMatcher](#best-matcher) to look for the data with the closest matching variants
+
+{{% comment %}}
+
+Remove comment block when we populate these sections
+### Relationship to test aggregations
+
+TBD
+
+### Testing disruption tests
+
+TBD
+
+{{% /comment %}}
diff --git a/content/en/docs/release-oversight/disruption-testing/data-architecture.md b/content/en/docs/release-oversight/disruption-testing/data-architecture.md
new file mode 100644
index 00000000..21cb1006
--- /dev/null
+++ b/content/en/docs/release-oversight/disruption-testing/data-architecture.md
@@ -0,0 +1,118 @@
+---
+title: "Architecture Data Flow"
+description: A high level look at how the disruption historical data is gathered and updated.
+weight: 1
+---
+
+### Resources
+
+{{% alert title="⚠️ Note!" color="warning" %}}
+You'll need access to the appropriate groups to work with disruption data; please reach out to the TRT team for access.
+{{% /alert %}}
+
+- [Periodic Jobs](https://github.com/openshift/release/tree/master/ci-operator/jobs/openshift/release)
+- [BigQuery](https://console.cloud.google.com/bigquery?project=openshift-ci-data-analysis)
+- [DPCR Job Aggregation Configs](https://github.com/openshift/continuous-release-jobs/tree/master/config/clusters/dpcr/services/dpcr-ci-job-aggregation)
+- [Origin Synthetic Backend Tests](https://github.com/openshift/origin/tree/master/pkg/synthetictests/allowedbackenddisruption)
+
+## Disruption Data Architecture
+
+{{% alert color="info" %}}
+The diagram below presents a high-level overview of how we use our `periodic jobs`, `job aggregation` and `BigQuery` to generate the disruption historical data.
+It does not cover how the tests themselves are run against a cluster.
+{{% /alert %}}
+
+### High Level Diagram
+
+{{< inlineSVG file="/static/disruption_test_diagram.svg" >}}
+
+### How The Data Flows
+
+1. `Disruption uploader` jobs are run in the DPCR cluster; the current configuration can be found in [openshift/continuous-release-jobs](https://github.com/openshift/continuous-release-jobs/tree/master/config/clusters/dpcr/services/dpcr-ci-job-aggregation).
+
+1. We grab a list of the job names that we should gather from the `Jobs` table in `BigQuery`.
+
+1. When e2e tests are done, the results are uploaded to `GCS` and can be viewed in the artifacts folder for a particular job run. We only pull disruption data for job names specified in the `Jobs` table.
+
+    Clicking the artifact link on the top right of a prow job and navigating to the `openshift-e2e-test` folder will show you the disruption results. (ex. `.../openshift-e2e-test/artifacts/junit/backend-disruption_[0-9]+-[0-9]+.json`)
+
+1. We fetch and parse out the results from the e2e runs. They are then pushed to the [openshift-ci-data-analysis](https://console.cloud.google.com/bigquery?project=openshift-ci-data-analysis) project in BigQuery.
+
+1. Currently, backend disruption data is queried from BigQuery and downloaded in `json` format. The resulting `json` file is then committed to [origin/pkg/synthetictests/allowedbackenddisruption/query_results.json](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedbackenddisruption/query_results.json) for **backend disruption** or [origin/pkg/synthetictests/allowedalerts/query_results.json](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedalerts/query_results.json) for **alert data** (see [how to query](#how-to-query-the-data)).
+
+### How To Query The Data
+
+Once you have access to BigQuery in the `openshift-ci-data-analysis` project, you can run the query below to fetch the latest results.
+ +#### Query + +{{% card-code header="[origin/pkg/synthetictests/allowedbackenddisruption/types.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedbackenddisruption/types.go#L13-L43)" %}} + +```sql +SELECT + BackendName, + Release, + FromRelease, + Platform, + Architecture, + Network, + Topology, + ANY_VALUE(P95) AS P95, + ANY_VALUE(P99) AS P99, +FROM ( + SELECT + Jobs.Release, + Jobs.FromRelease, + Jobs.Platform, + Jobs.Architecture, + Jobs.Network, + Jobs.Topology, + BackendName, + PERCENTILE_CONT(BackendDisruption.DisruptionSeconds, 0.95) OVER(PARTITION BY BackendDisruption.BackendName, Jobs.Network, Jobs.Platform, Jobs.Release, Jobs.FromRelease, Jobs.Topology) AS P95, + PERCENTILE_CONT(BackendDisruption.DisruptionSeconds, 0.99) OVER(PARTITION BY BackendDisruption.BackendName, Jobs.Network, Jobs.Platform, Jobs.Release, Jobs.FromRelease, Jobs.Topology) AS P99, + FROM + openshift-ci-data-analysis.ci_data.BackendDisruption as BackendDisruption + INNER JOIN + openshift-ci-data-analysis.ci_data.BackendDisruption_JobRuns as JobRuns on JobRuns.Name = BackendDisruption.JobRunName + INNER JOIN + openshift-ci-data-analysis.ci_data.Jobs as Jobs on Jobs.JobName = JobRuns.JobName + WHERE + JobRuns.StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 21 DAY) +) +GROUP BY +BackendName, Release, FromRelease, Platform, Architecture, Network, Topology +``` + +{{% /card-code %}} + +{{% card-code header="[origin/pkg/synthetictests/allowedalerts/types.go](https://github.com/openshift/origin/blob/a93ac08b2890dbe6dee760e623c5cafb1d8c9f97/pkg/synthetictests/allowedalerts/types.go#L17-L35)" %}} + +```sql +SELECT * FROM openshift-ci-data-analysis.ci_data.Alerts_Unified_LastWeek_P95 +where + alertName = "etcdMembersDown" or + alertName = "etcdGRPCRequestsSlow" or + alertName = "etcdHighNumberOfFailedGRPCRequests" or + alertName = "etcdMemberCommunicationSlow" or + alertName = "etcdNoLeader" or + alertName = 
"etcdHighFsyncDurations" or + alertName = "etcdHighCommitDurations" or + alertName = "etcdInsufficientMembers" or + alertName = "etcdHighNumberOfLeaderChanges" or + alertName = "KubeAPIErrorBudgetBurn" or + alertName = "KubeClientErrors" or + alertName = "KubePersistentVolumeErrors" or + alertName = "MCDDrainError" or + alertName = "PrometheusOperatorWatchErrors" or + alertName = "VSphereOpenshiftNodeHealthFail" +order by + AlertName, Release, FromRelease, Topology, Platform, Network +``` + +{{% /card-code %}} + +#### Downloading + +Once the query is run, you can download the data locally. + +![BigQuery Download](/bigquery_download.png) diff --git a/content/en/docs/release-oversight/disruption-testing/job-primer.md b/content/en/docs/release-oversight/disruption-testing/job-primer.md new file mode 100644 index 00000000..b392c077 --- /dev/null +++ b/content/en/docs/release-oversight/disruption-testing/job-primer.md @@ -0,0 +1,99 @@ +--- +title: "Job Primer" +description: Job Primer ci-tool used to generate job names for Big Query. +--- + +## Overview + +{{% alert title="⚠️ NOTE" color="warning" %}} +In `Job Primer` a job name is very important. Please make sure that the job names contain correct information. ([see options below](#naming-convention)) +{{% /alert %}} + +[JobPrimer](https://github.com/openshift/ci-tools/tree/master/pkg/jobrunaggregator/jobtableprimer) is the `ci-tool` that is used to populate the `BigQuery` `Jobs` table. The `Jobs` table is what dictates the periodic jobs are grabbed during `disruption` data gathering. Currently this tool is ran manually. + +## High Level Diagram + +{{< inlineSVG file="/static/job_primer_diagram.svg" >}} + +### How The Data Flows + +1. We first look at the `origin/release` repo to gather a list of the current release jobs that were created. The below command is ran to look through the current configuration and generate the job names. 
+
+    ```sh
+    ./job-run-aggregator generate-job-names > pkg/jobrunaggregator/jobtableprimer/generated_job_names.txt
+    ```
+
+1. That `generated_job_names.txt` is then committed to the repo.
+
+    **You must then rebuild the binary so the newly generated list is correctly embedded.**
+
+1. We then create the jobs in the BigQuery table by running the `prime-job-table` command. This will use the embedded `generated_job_names.txt` data and generate the `Jobs` rows based on the naming convention (see below). After this, the `Jobs` table should be updated with the latest jobs.
+
+    ```sh
+    ./job-run-aggregator prime-job-table
+    ```
+
+### Naming Convention
+
+Please make sure your job names follow the convention defined below. All job names must include adequate information to allow proper data aggregation.
+
+{{% pageinfo color="primary" %}}
+
+- Platform:
+  - aws, gcp, azure, etc...
+- Architecture: (default: `amd64`)
+  - arm64, ppc64le, s390x
+- Upgrade: (default: `assumes NOT upgrade`)
+  - upgrade
+- Network: (default: `sdn && ipv4`)
+  - sdn, ovn
+  - ipv6, ipv4
+- Topology: (default: `assumes ha`)
+  - single
+- Serial: (default: `assumes parallel`)
+  - serial
+
+{{% /pageinfo %}}
+
+{{% card-code header="[Code Location](https://github.com/openshift/ci-tools/blob/659fc3fed6ebe7ed7fb0bde25330fe2f47e20d0b/pkg/jobrunaggregator/jobtableprimer/job_typer.go#L13-L114)" %}}
+
+```go
+func newJob(name string) *jobRowBuilder {
+	platform := ""
+	switch {
+	case strings.Contains(name, "gcp"):
+		platform = gcp
+	case strings.Contains(name, "aws"):
+		platform = aws
+	case strings.Contains(name, "azure"):
+		platform = azure
+	case strings.Contains(name, "metal"):
+		platform = metal
+	case strings.Contains(name, "vsphere"):
+		platform = vsphere
+	case strings.Contains(name, "ovirt"):
+		platform = ovirt
+	case strings.Contains(name, "openstack"):
+		platform = openstack
+	case strings.Contains(name, "libvirt"):
+		platform = libvirt
+	}
+
+	architecture := ""
+	switch {
+	case 
strings.Contains(name, "arm64"): + architecture = arm64 + case strings.Contains(name, "ppc64le"): + architecture = ppc64le + case strings.Contains(name, "s390x"): + architecture = s390x + default: + architecture = amd64 + } + +... + + +``` + +{{% /card-code %}} diff --git a/layouts/partials/head-css.html b/layouts/partials/head-css.html index dedf6d07..8ea8cd19 100644 --- a/layouts/partials/head-css.html +++ b/layouts/partials/head-css.html @@ -38,4 +38,11 @@ tr.shown td.details-control { background: url('https://datatables.net/examples/resources/details_close.png') no-repeat center center; } + +.card-body > .highlight, +.card-body > .highlight pre { + overflow-x: auto; + max-width: 100%; + margin: 0; +} diff --git a/layouts/shortcodes/card-code.html b/layouts/shortcodes/card-code.html new file mode 100644 index 00000000..59d8d814 --- /dev/null +++ b/layouts/shortcodes/card-code.html @@ -0,0 +1,10 @@ +
+<div class="card">
+  {{- with $.Get "header" -}}
+  <div class="card-header">
+    {{- $.Get "header" | markdownify -}}
+  </div>
+  {{end}}
+  <div class="card-body">
+    {{ $.Inner }}
+  </div>
+</div>
diff --git a/layouts/shortcodes/comment.html b/layouts/shortcodes/comment.html new file mode 100644 index 00000000..3a6ecbe7 --- /dev/null +++ b/layouts/shortcodes/comment.html @@ -0,0 +1 @@ +
{{ if .Inner }}{{ end }}
\ No newline at end of file diff --git a/layouts/shortcodes/inlineSVG.html b/layouts/shortcodes/inlineSVG.html new file mode 100644 index 00000000..62d17f93 --- /dev/null +++ b/layouts/shortcodes/inlineSVG.html @@ -0,0 +1,3 @@ +
+ {{ .Get "file" | readFile | safeHTML }} +
diff --git a/static/bigquery_download.png b/static/bigquery_download.png new file mode 100644 index 00000000..dbdb89af Binary files /dev/null and b/static/bigquery_download.png differ diff --git a/static/disruption_test_diagram.svg b/static/disruption_test_diagram.svg new file mode 100644 index 00000000..c042c047 --- /dev/null +++ b/static/disruption_test_diagram.svg @@ -0,0 +1,111 @@ + + + + + + + DPCR Cluster + + + + + + + + + Job Run + Uploader CronJobs + + + + + + Disruption Uploader + + GCP + + + + + job artifacts + + Buckets + + Big Query + + + + + + + + Fetch jUnit Results + for Jobs + + + e2e test cluster + + + + + e2e tests + + + + + + + Store + Results + + + + + + openshift-ci-data-analysis + + Job Run + + Update BigQuery + Disruption Table + + + + + + User + + + + + Query JSON Results + + GitHub + + + + + openshift/origin + query_results.json + + + + + Commit Results + + + + + Disruption Table + Jobs Table + + Fetch List of + Jobs To Gather + + + + + + \ No newline at end of file diff --git a/static/disruption_test_flow.svg b/static/disruption_test_flow.svg new file mode 100644 index 00000000..3783f3af --- /dev/null +++ b/static/disruption_test_flow.svg @@ -0,0 +1,16 @@ + + + + + + + Chaosmonkey.DoInitialize & Run AllTests (backendDisruptionTests)Execute the disruption (Upgrade)Stop Async Test & AnalyzeClose Stop ChannelbackendDisruptionTestCall backend & monitor \ No newline at end of file diff --git a/static/job_primer_diagram.svg b/static/job_primer_diagram.svg new file mode 100644 index 00000000..ade51f10 --- /dev/null +++ b/static/job_primer_diagram.svg @@ -0,0 +1,58 @@ + + + + + + + GCP + + Big Query + openshift-ci-data-analysis + + + + + Jobs Table + + + + + Job Primer + + GitHub + + + + + openshift/release + + + + + + + + + + + + Fetch List of Jobs + Generate list text file + + Parse job names + (network, topology, type) + + + + + + + Update Jobs Table + + + + + + \ No newline at end of file diff --git a/static/query_backends1.png 
b/static/query_backends1.png new file mode 100644 index 00000000..ce81c39e Binary files /dev/null and b/static/query_backends1.png differ