Merged
32 commits
20d6f11
Add procedure to disable services on OSP side (#407)
leifmadsen Dec 1, 2022
b3ad5e1
Don't remove existing 'stable-1.5' generated files (#412)
leifmadsen Dec 2, 2022
e23a1b3
Minor updates to dashboarding guide (#413)
leifmadsen Dec 7, 2022
1a6cecf
Fix syntax error in certificate renewal module (#416)
leifmadsen Dec 7, 2022
652b913
Removed sending-metrics-to-gnocchi-and-to-stf m… (#414)
JoanneOFlynn2018 Dec 7, 2022
762b386
Jof mas minor edits 1.5 (#421)
JoanneOFlynn2018 Dec 9, 2022
2e3fd4b
Update link to STF life cycle page (#423)
leifmadsen Dec 14, 2022
fe6d916
updated link (#428)
JoanneOFlynn2018 Jan 13, 2023
eb0a98d
Updated path to match PR#52 in dashboard repo (#427)
csibbitt Jan 19, 2023
f720011
Fixed alertmanager verification command (#430)
csibbitt Jan 19, 2023
251a2c2
Eliminate mentions of sensubility in OSP13 (#431)
csibbitt Jan 19, 2023
bfb73c3
A list of low hanging docs changes from our feature testing (#426)
csibbitt Jan 19, 2023
c964773
mg_master_2161659_minor-style-edit changed note text and position (#437)
mickogeary Jan 20, 2023
d252a4e
Remove note from importing dashboards procedure (#439)
leifmadsen Jan 24, 2023
674be31
Adjust network polling meter for ceilometer (#440)
leifmadsen Jan 25, 2023
53d2d36
Eliminate vestiges of "stf-default" (#442)
csibbitt Feb 3, 2023
7132367
Link to the amqp1 plugin header directly (#443)
leifmadsen Feb 6, 2023
ba383bd
Bump base image for building to Fedora 37 (#445)
leifmadsen Feb 9, 2023
665e823
mg_master_2168184_adding section with procedures for upgrade from 1.4…
mickogeary Feb 17, 2023
a9aa450
Fix alertmanager verification command
csibbitt Mar 1, 2023
7b066e1
Revert "Fix alertmanager verification command"
csibbitt Mar 1, 2023
bbd4edf
Fix alertmanager verification command (#450)
csibbitt Mar 1, 2023
6b4288f
Reference event enablement for virtual machine view (#451)
leifmadsen Mar 7, 2023
ca4a151
Expand supported OCP range through to 4.12 (#452)
leifmadsen Mar 9, 2023
f60f728
Adjust path to triple-ansible-inventory file (#454)
leifmadsen Mar 9, 2023
cc14175
Remove DCN related configuration artifacts (#455)
leifmadsen Mar 10, 2023
e85e177
Add SNMP trap configuration parameters (#449)
leifmadsen Mar 10, 2023
420f4ff
Expose ability to set certificate renewal target times (#453)
vkmc Mar 16, 2023
40bb993
[OSP13] Replacing "allovercloud" with "overcloud" in ansible command …
lnatapov Mar 16, 2023
d929d60
[OSP13] Replacing podman with docker in ansible command. (#457)
lnatapov Mar 20, 2023
9f92c95
Merge contents of master into stable-1.5
leifmadsen Mar 27, 2023
a026235
Fix improper merge conflict for build_tools
leifmadsen Mar 27, 2023
4 changes: 2 additions & 2 deletions common/global/stf-attributes.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ ifeval::["{build}" == "upstream"]
:ProjectShort: STF
:MessageBus: Apache{nbsp}Qpid{nbsp}Dispatch{nbsp}Router
:SupportedOpenShiftVersion: 4.10
:NextSupportedOpenShiftVersion: 4.10
:NextSupportedOpenShiftVersion: 4.12
:CodeReadyContainersVersion: 2.6.0
endif::[]

Expand All @@ -60,5 +60,5 @@ ifeval::["{build}" == "downstream"]
:ProjectShort: STF
:MessageBus: AMQ{nbsp}Interconnect
:SupportedOpenShiftVersion: 4.10
:NextSupportedOpenShiftVersion: 4.10
:NextSupportedOpenShiftVersion: 4.12
endif::[]
Original file line number Diff line number Diff line change
Expand Up @@ -43,15 +43,24 @@ include::../modules/proc_creating-an-alert-route-in-alertmanager.adoc[leveloffse
include::../modules/proc_creating-an-alert-route-with-templating-in-alertmanager.adoc[leveloffset=+2]

//SNMP Traps
include::../modules/proc_configuring-snmp-traps.adoc[leveloffset=+1]
include::../modules/con_snmp-traps.adoc[leveloffset=+1]
include::../modules/proc_configuring-snmp-traps.adoc[leveloffset=+2]

//TLS Certificates duration
ifdef::include_when_13,include_when_17[]
include::../modules/con_tls-certificates-duration.adoc[leveloffset=+1]
include::../modules/proc_configuring-tls-certificates-duration.adoc[leveloffset=+2]
endif::include_when_13,include_when_17[]

//High availability
include::../modules/con_high-availability.adoc[leveloffset=+1]
include::../modules/proc_configuring-high-availability.adoc[leveloffset=+2]

//Configuring ephemeral storage
include::../modules/con_ephemeral-storage.adoc[leveloffset=+1]
ifeval::["{build}" == "upstream"]
include::../modules/proc_configuring-ephemeral-storage.adoc[leveloffset=+2]
endif::[]

//Observability strategy
include::../modules/con_observability-strategy.adoc[leveloffset=+1]
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,8 @@ To collect metrics, events, or both, and to send them to the {Project} ({Project
* To plan your {OpenStackShort} installation and configuration {ProjectShort} for multiple clouds, see xref:configuring-multiple-clouds_assembly-completing-the-stf-configuration[].

* As part of an {OpenStackShort} overcloud deployment, you might need to configure additional features in your environment:

** To deploy data collection and transport to {ProjectShort} on {OpenStackShort} cloud nodes that employ routed L3 domains, such as distributed compute node (DCN) or spine-leaf, see xref:deploying-to-non-standard-network-topologies_assembly-completing-the-stf-configuration[].

// NOTE: removing this for now because it's not clear that this is necessary, and that recommendations here may actually be harmful. See RHBZ#2023902.
//** To deploy data collection and transport to {ProjectShort} on {OpenStackShort} cloud nodes that employ routed L3 domains, such as distributed compute node (DCN) or spine-leaf, see xref:deploying-to-non-standard-network-topologies_assembly-completing-the-stf-configuration[].
** To disable the data collector services, see xref:disabling-openstack-services-used-with-stf_assembly-completing-the-stf-configuration[].

ifdef::include_when_13[]
Expand All @@ -38,7 +37,8 @@ include::../modules/proc_validating-clientside-installation.adoc[leveloffset=+2]
include::../modules/proc_disabling-openstack-services-used-with-stf.adoc[leveloffset=+1]

// Gather information for deployment in non-standard network topologies in the OSP overcloud
include::../modules/proc_deploying-to-non-standard-network-topologies.adoc[leveloffset=+1]
// NOTE: removing this for now because it's not clear that this is necessary, and that recommendations here may actually be harmful. See RHBZ#2023902.
//include::../modules/proc_deploying-to-non-standard-network-topologies.adoc[leveloffset=+1]

ifdef::include_when_13[]
// If you synchronized container images to a local registry, create an environment file and include the paths to the container images
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
ifdef::include_when_13,include_when_17[]
ifdef::context[:parent-context: {context}]

[id="assembly-renewing-the-amq-interconnect-certificate_{context}"]
= Renewing the {MessageBus} certificate

Expand All @@ -18,3 +18,4 @@ include::../modules/proc_updating-the-amq-interconnect-ca-certificate.adoc[level
//reset the context
ifdef::parent-context[:context: {parent-context}]
ifndef::parent-context[:!context:]
endif::include_when_13,include_when_17[]
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ Use the cloud view dashboard to view panels to monitor service resource usage, A
** For more information about {OpenStackShort} service monitoring, see xref:resource-usage-of-openstack-services_assembly-advanced-features[].

Virtual machine view dashboard::
Use the virtual machine view dashboard to view panels to monitor virtual machine infrastructure usage. Select a cloud and project from the upper left corner of the dashboard.
Use the virtual machine view dashboard to view panels to monitor virtual machine infrastructure usage. Select a cloud and project from the upper left corner of the dashboard. You must enable event storage if you want to enable the event annotations on this dashboard. For more information, see xref:creating-a-servicetelemetry-object-in-openshift_assembly-installing-the-core-components-of-stf[].

Memcached view dashboard::
Use the memcached view dashboard to view panels to monitor connections, availability, system metrics and cache performance. Select a cloud from the upper left corner of the dashboard.
Original file line number Diff line number Diff line change
Expand Up @@ -179,7 +179,7 @@ endif::[]
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
name: stf-default
name: default
namespace: service-telemetry
spec:
clouds:
93 changes: 93 additions & 0 deletions doc-Service-Telemetry-Framework/modules/con_snmp-traps.adoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
[id="snmp-traps_{context}"]
= Sending alerts as SNMP traps

[role="_abstract"]
To enable SNMP traps, modify the `ServiceTelemetry` object and configure the `snmpTraps` parameters. SNMP traps are sent using version 2c.

[id="configuration-parameters-for-snmptraps_{context}"]
== Configuration parameters for snmpTraps

The `snmpTraps` parameter contains the following sub-parameters for configuring the alert receiver:

enabled:: Set the value of this sub-parameter to true to enable the SNMP trap alert receiver. The default value is false.
target:: Target address to send SNMP traps. Value is a string. Default is `192.168.24.254`.
port:: Target port to send SNMP traps. Value is an integer. Default is `162`.
community:: Target community to send SNMP traps to. Value is a string. Default is `public`.
retries:: SNMP trap retry delivery limit. Value is an integer. Default is `5`.
timeout:: SNMP trap delivery timeout defined in seconds. Value is an integer. Default is `1`.
alertOidLabel:: Label name in the alert that defines the OID value to send the SNMP trap as. Value is a string. Default is `oid`.
trapOidPrefix:: SNMP trap OID prefix for variable bindings. Value is a string. Default is `1.3.6.1.4.1.50495.15`.
trapDefaultOid:: SNMP trap OID when no alert OID label has been specified with the alert. Value is a string. Default is `1.3.6.1.4.1.50495.15.1.2.1`.
trapDefaultSeverity:: SNMP trap severity when no alert severity has been set. Value is a string. Defaults to an empty string.

Configure the `snmpTraps` parameter as part of the `alerting.alertmanager.receivers` definition in the `ServiceTelemetry` object:

[source,yaml,options="nowrap"]
----
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  alerting:
    alertmanager:
      receivers:
        snmpTraps:
          alertOidLabel: oid
          community: public
          enabled: true
          port: 162
          retries: 5
          target: 192.168.25.254
          timeout: 1
          trapDefaultOid: 1.3.6.1.4.1.50495.15.1.2.1
          trapDefaultSeverity: ""
          trapOidPrefix: 1.3.6.1.4.1.50495.15
...
----
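As an illustration of how the documented defaults combine with the values you set, the following minimal Python sketch merges a partial `snmpTraps` definition onto the defaults listed above. The helper name `effective_snmp_traps` is hypothetical, and this is not how the operator itself applies defaults; it only shows which values end up in effect.

```python
# Defaults copied from the sub-parameter list above.
SNMP_TRAPS_DEFAULTS = {
    "enabled": False,
    "target": "192.168.24.254",
    "port": 162,
    "community": "public",
    "retries": 5,
    "timeout": 1,
    "alertOidLabel": "oid",
    "trapOidPrefix": "1.3.6.1.4.1.50495.15",
    "trapDefaultOid": "1.3.6.1.4.1.50495.15.1.2.1",
    "trapDefaultSeverity": "",
}


def effective_snmp_traps(overrides):
    """Return the defaults with overrides applied (hypothetical helper)."""
    unknown = set(overrides) - set(SNMP_TRAPS_DEFAULTS)
    if unknown:
        # Reject keys that are not documented sub-parameters.
        raise KeyError(f"unknown snmpTraps sub-parameters: {sorted(unknown)}")
    return {**SNMP_TRAPS_DEFAULTS, **overrides}


# Only the receiver is enabled and the target changed; the rest stay default.
config = effective_snmp_traps({"enabled": True, "target": "192.168.25.254"})
print(config["target"], config["port"], config["community"])
```

In this sketch, setting only `enabled` and `target` leaves `port`, `community`, and the OID parameters at their documented defaults.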

[id="overview-of-the-mib-definition_{context}"]
== Overview of the MIB definition

Delivery of SNMP traps uses object identifier (OID) value `1.3.6.1.4.1.50495.15.1.2.1` by default. The management information base (MIB) schema is available at https://github.com/infrawatch/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt.

The OID number consists of the following component values:

* The value `1.3.6.1.4.1` is a global OID defined for private enterprises.
* The next identifier `50495` is a private enterprise number assigned by IANA for the Ceph organization.
* The other values are child OIDs of the parent.

15:: prometheus objects
15.1:: prometheus alerts
15.1.2:: prometheus alert traps
15.1.2.1:: prometheus alert trap default

The prometheus alert trap default is an object composed of several sub-objects under OID `1.3.6.1.4.1.50495.15`, which is defined by the `alerting.alertmanager.receivers.snmpTraps.trapOidPrefix` parameter:

<trapOidPrefix>.1.1.1:: alert name
<trapOidPrefix>.1.1.2:: status
<trapOidPrefix>.1.1.3:: severity
<trapOidPrefix>.1.1.4:: instance
<trapOidPrefix>.1.1.5:: job
<trapOidPrefix>.1.1.6:: description
<trapOidPrefix>.1.1.7:: labels
<trapOidPrefix>.1.1.8:: timestamp
<trapOidPrefix>.1.1.9:: rawdata
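
The mapping above can be sketched in Python as follows. The function name `variable_binding_oids` and the field list are illustrative only; the sketch assumes the sub-objects are numbered `.1.1.1` through `.1.1.9` in the order listed.

```python
# Field names in the order listed above; index 1 maps to <trapOidPrefix>.1.1.1.
TRAP_FIELDS = [
    "alert name",
    "status",
    "severity",
    "instance",
    "job",
    "description",
    "labels",
    "timestamp",
    "rawdata",
]


def variable_binding_oids(trap_oid_prefix="1.3.6.1.4.1.50495.15"):
    """Map each trap field to its full OID under the given prefix."""
    return {
        name: f"{trap_oid_prefix}.1.1.{index}"
        for index, name in enumerate(TRAP_FIELDS, start=1)
    }


oids = variable_binding_oids()
print(oids["severity"])
```

Changing `trapOidPrefix` moves all nine sub-objects together, because each one is derived from the prefix.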

The following is example output from a simple SNMP trap receiver that outputs the received trap to the console:

[source,options="nowrap"]
----
SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-SMI::enterprises.50495.15.1.2.1
SNMPv2-SMI::enterprises.50495.15.1.1.1 = STRING: "TEST ALERT FROM PROMETHEUS PLEASE ACKNOWLEDGE"
SNMPv2-SMI::enterprises.50495.15.1.1.2 = STRING: "firing"
SNMPv2-SMI::enterprises.50495.15.1.1.3 = STRING: "warning"
SNMPv2-SMI::enterprises.50495.15.1.1.4 = ""
SNMPv2-SMI::enterprises.50495.15.1.1.5 = ""
SNMPv2-SMI::enterprises.50495.15.1.1.6 = STRING: "TEST ALERT FROM "
SNMPv2-SMI::enterprises.50495.15.1.1.7 = STRING: "{\"cluster\": \"TEST\", \"container\": \"sg-core\", \"endpoint\": \"prom-https\", \"prometheus\": \"service-telemetry/default\", \"service\": \"default-cloud1-coll-meter\", \"source\": \"SG\"}"
SNMPv2-SMI::enterprises.50495.15.1.1.8 = Timeticks: (1676476389) 194 days, 0:52:43.89
SNMPv2-SMI::enterprises.50495.15.1.1.9 = STRING: "{\"status\": \"firing\", \"labels\": {\"cluster\": \"TEST\", \"container\": \"sg-core\", \"endpoint\": \"prom-https\", \"prometheus\": \"service-telemetry/default\", \"service\": \"default-cloud1-coll-meter\", \"source\": \"SG\"}, \"annotations\": {\"action\": \"TESTING PLEASE ACKNOWLEDGE, NO FURTHER ACTION REQUIRED ONLY A TEST\"}, \"startsAt\": \"2023-02-15T15:53:09.109Z\", \"endsAt\": \"0001-01-01T00:00:00Z\", \"generatorURL\": \"http://prometheus-default-0:9090/graph?g0.expr=sg_total_collectd_msg_received_count+%3E+1&g0.tab=1\", \"fingerprint\": \"feefeb77c577a02f\"}"
----
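
When reading output like this, two fields benefit from decoding. The following Python sketch assumes that the `Timeticks` number in the sample is a Unix epoch value in seconds (an inference from the sample, where it lines up with the `startsAt` field in the raw data) and that the labels string is JSON. The labels literal here is shortened from the sample for readability.

```python
import json
from datetime import datetime, timezone

# Value taken from the .1.1.8 (timestamp) binding in the sample output.
timeticks_value = 1676476389

# Shortened copy of the .1.1.7 (labels) binding from the sample output.
labels_json = '{"cluster": "TEST", "container": "sg-core", "source": "SG"}'

# Assumption: the number is seconds since the Unix epoch, not device uptime.
sent_at = datetime.fromtimestamp(timeticks_value, tz=timezone.utc)
labels = json.loads(labels_json)

print(sent_at.isoformat())
print(labels["container"])
```

The "194 days" rendering in the sample comes from the receiver interpreting the number as hundredths of a second, which is how `Timeticks` values are normally displayed.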


Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
[id="tls-certificates-duration_{context}"]
= Configuring the duration for the TLS certificates

[role="_abstract"]
To configure the duration of the TLS certificates that you use for the connections with
Elasticsearch and {MessageBus} in {Project} ({ProjectShort}),
modify the `ServiceTelemetry` object and configure the `certificates` parameters.

[id="configuration-parameters-for-tls-certificates-duration_{context}"]
== Configuration parameters for the TLS certificates

You can configure the duration of the certificate with the following sub-parameters of the `certificates` parameter:

endpointCertDuration:: The requested 'duration' or lifetime of the endpoint Certificate.
Minimum accepted duration is 1 hour. Value must be in units accepted by Go time.ParseDuration https://golang.org/pkg/time/#ParseDuration.
The default value is `70080h`.
caCertDuration:: The requested 'duration' or lifetime of the CA Certificate.
Minimum accepted duration is 1 hour. Value must be in units accepted by Go time.ParseDuration https://golang.org/pkg/time/#ParseDuration.
The default value is `70080h`.

NOTE: The default duration of certificates is long because you usually copy a subset of them into the {OpenStack} deployment when the certificates renew. For more information about the QDR CA Certificate renewal process, see xref:assembly-renewing-the-amq-interconnect-certificate_assembly[].
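
To get a feel for what `70080h` means and for the one-hour minimum, the following Python sketch parses a simplified subset of the Go duration syntax (only the `h`, `m`, and `s` units, no sign) and applies the minimum check. The function names are hypothetical, and the real validation is done by the components that consume the `ServiceTelemetry` object, not by this code.

```python
import re
from datetime import timedelta

# Simplified subset of Go's time.ParseDuration units (assumption: h, m, s only).
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(h|m|s)")


def parse_duration(text):
    """Parse strings such as '70080h' or '1h30m' into a timedelta."""
    position = 0
    total_seconds = 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            raise ValueError(f"invalid duration: {text!r}")
        total_seconds += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(seconds=total_seconds)


def validate_cert_duration(text):
    """Reject durations shorter than the documented 1 hour minimum."""
    duration = parse_duration(text)
    if duration < timedelta(hours=1):
        raise ValueError(f"duration {text!r} is below the 1 hour minimum")
    return duration


default_duration = validate_cert_duration("70080h")
print(default_duration.days, default_duration.days / 365)
```

The default of `70080h` works out to 2920 days, or eight 365-day years.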

The `certificates` parameter for Elasticsearch is part of the `backends.events.elasticsearch` definition and is configured in the `ServiceTelemetry` object:

[source,yaml,options="nowrap"]
----
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
...
  backends:
  ...
    events:
      elasticsearch:
        enabled: true
        version: 7.16.1
        certificates:
          endpointCertDuration: 70080h
          caCertDuration: 70080h
...
----

You can configure the `certificates` parameter for QDR that is part of the `transports.qdr` definition in the `ServiceTelemetry` object:

[source,yaml,options="nowrap"]
----
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
...
  transports:
  ...
    qdr:
      enabled: true
      certificates:
        endpointCertDuration: 70080h
        caCertDuration: 70080h
...
----
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ $ oc edit stf default
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
name: stf-default
name: default
namespace: service-telemetry
spec:
alerting:
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,13 @@ spec:
EOF
----
+
. Delete the leftover objects that are managed by community operators:
+
[source,bash]
----
$ for o in alertmanager/default prometheus/default elasticsearch/elasticsearch grafana/default lokistack/lokistack; do oc delete $o; done
----
+
. To verify that all workloads are operating correctly, view the pods and the status of each pod:
+
[source,bash,options="nowrap"]
Original file line number Diff line number Diff line change
Expand Up @@ -17,5 +17,5 @@ endif::include_when_13,include_when_17[]

ifdef::include_when_16_1[]
.Additional resources
* To collect data through {MessageBus}, see https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/{vernum}/html/operational_measurements/collectd-plugins_assembly[the amqp1 plug-in].
* To collect data through {MessageBus}, see https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/{vernum}/html/operational_measurements/collectd-plugins_assembly#collectd_plugin_amqp1[the amqp1 plug-in].
endif::include_when_16_1[]
Original file line number Diff line number Diff line change
Expand Up @@ -2,23 +2,28 @@
[id="configuring-snmp-traps_{context}"]
= Configuring SNMP traps

[role="_abstract"]
You can integrate {Project} ({ProjectShort}) with an existing infrastructure monitoring platform that receives notifications through SNMP traps. To enable SNMP traps, modify the `ServiceTelemetry` object and configure the `snmpTraps` parameters.

For more information about configuring alerts, see xref:alerts_assembly-advanced-features[].

.Prerequisites

* Know the IP address or hostname of the SNMP trap receiver where you want to send the alerts
* Ensure that you know the IP address or hostname of the SNMP trap receiver where you want to send the alerts to.

.Procedure

. Log in to {OpenShift}.

. Change to the `service-telemetry` namespace:
+
[source,bash]
----
$ oc project service-telemetry
----

. To enable SNMP traps, modify the `ServiceTelemetry` object:
+
[source,bash]
----
$ oc edit stf default
----

. Set the `alerting.alertmanager.receivers.snmpTraps` parameters:
+
[source,yaml]
Expand All @@ -37,3 +42,55 @@ spec:
----

. Ensure that you set the value of `target` to the IP address or hostname of the SNMP trap receiver.

.Additional information

For more information about available parameters for `snmpTraps`, see xref:configuration-parameters-for-snmptraps_assembly-advanced-features[].

[id="creating-alerts-for-snmp-traps_{context}"]
= Creating alerts for SNMP traps

You can create alerts that are configured for delivery by SNMP traps by adding labels that are parsed by the prometheus-webhook-snmp middleware to define the trap information and delivered object identifiers (OID). Adding the `oid` or `severity` labels is only required if you need to change the default values for a particular alert definition.

NOTE: When you set the `oid` label, the top-level SNMP trap OID changes, but the sub-OIDs remain defined by the global `trapOidPrefix` value plus the child OID values `.1.1.1` through `.1.1.9`. For more information about the MIB definition, see xref:overview-of-the-mib-definition_{context}[].

.Procedure

. Log in to {OpenShift}.

. Change to the `service-telemetry` namespace:
+
[source,bash]
----
$ oc project service-telemetry
----

. Create a `PrometheusRule` object that contains the alert rule and an `oid` label that contains the SNMP trap OID override value:
+
[source,bash]
----
$ oc apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: default
    role: alert-rules
  name: prometheus-alarm-rules-snmp
  namespace: service-telemetry
spec:
  groups:
    - name: ./openstack.rules
      rules:
        - alert: Collectd metrics receive rate is zero
          expr: rate(sg_total_collectd_msg_received_count[1m]) == 0
          labels:
            oid: 1.3.6.1.4.1.50495.15.1.2.1
            severity: critical
EOF
----
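
The effect of the `oid` and `severity` labels can be sketched in Python as follows. This is a simplified model of the behavior described in this section, not the actual prometheus-webhook-snmp logic; the function name `resolve_trap` and the override OID ending in `.2.2` are hypothetical.

```python
def resolve_trap(
    alert_labels,
    alert_oid_label="oid",
    trap_default_oid="1.3.6.1.4.1.50495.15.1.2.1",
    trap_default_severity="",
):
    """Pick the trap OID and severity for one alert (simplified model)."""
    # Use the label named by alertOidLabel if present, else trapDefaultOid.
    trap_oid = alert_labels.get(alert_oid_label, trap_default_oid)
    # Use the alert's severity label if present, else trapDefaultSeverity.
    severity = alert_labels.get("severity", trap_default_severity)
    return trap_oid, severity


# An alert without labels falls back to the configured defaults.
plain = resolve_trap({})

# Hypothetical override OID used purely for illustration.
custom = resolve_trap({"oid": "1.3.6.1.4.1.50495.15.1.2.2", "severity": "critical"})

print(plain, custom)
```

Under this model, only the top-level trap OID and the severity change per alert, which matches the note above that the sub-OIDs stay tied to `trapOidPrefix`.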

.Additional information

For more information about configuring alerts, see xref:alerts_assembly-advanced-features[].
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ resource_registry:

parameter_defaults:
MetricsQdrConnectors:
- host: stf-default-interconnect-5671-service-telemetry.apps.infra.watch
- host: default-interconnect-5671-service-telemetry.apps.infra.watch
port: 443
role: edge
verifyHostname: false