A Sample Tools Implementation of Incident Management Solution

Authors:

	Arundhati Bhowmick (aruna1@us.ibm.com)

	Detlef Kleinfelder (detlef.kleinfelder@de.ibm.com)

	Melody Bienfang (mbienfan@us.ibm.com)

The tools in the Incident Management solution are implemented to provide an end-to-end view of the application. You may choose to use multiple tools to handle different functions of the application. The picture below shows multiple tools that manage cloud native as well as hybrid applications.

CSMO Incident Management Implementation

Reference Tools Mapping

There are various ways to build the toolchain for an Incident Management solution. For this project we used the following set of tools to showcase end-to-end incident management of the BlueCompute application, which is hybrid in nature.

Monitoring - New Relic for resource monitoring and E2E monitoring of URLs, and IBM Bluemix Application Management (BAM) for synthetic monitoring of the hybrid application and its components.

Event Correlation - IBM Netcool Operations Insight (NOI) to fulfill the event management and correlation activities.

Notification - IBM Alert Notification for notifying first responders on call via their preferred notification means.

Collaboration - Slack for collaborating on incidents with the various personas of the resolution process.

Dashboard - Grafana to display the overall status of the BlueCompute business service with key performance metrics, allowing users to drill down into detailed pages or launch additional details in other tools of the toolchain such as New Relic and NOI.

Ticket & Trending - IBM Control Desk to record the incident ticket and all pertinent data so that various personas can perform ticket trending analysis.

Understanding System Context Flows for the Tools in CSMO Toolchain Connecting BlueCompute Application

Here is a view of the system context of each of these tools to give you a deeper and broader perspective of the flows and integrations.

System Context Flow for New Relic

System Context Flow New Relic

The above figure shows the deep dive of New Relic Resource Monitoring and its various components and various integrated tools for incident management and their interactions.

  1. New Relic offers instrumentation of BlueCompute components such as Node.js applications, nginx web servers/load balancers, Java microservices and MySQL databases, and allows monitoring of key performance indicators for those resources. It also detects Bluemix services used by various Bluemix applications. Both Public and SoftLayer instances are monitored. In Bluemix we find native Cloud Foundry applications as well as Docker containers.

  2. The data is transferred to the New Relic management system, which is accessible through the UI and API calls.

  3. If thresholds are exceeded based on the defined alert policy settings, one or more alerting channels can be used to forward the identified incidents.

  4. These channels forward the incident to external event correlation, notification or email systems. In this scenario we use NOI for event correlation, and New Relic forwards its events to a Netcool Omnibus message bus probe via a WebHook channel.

  5. The New Relic REST API allows external tools, such as a dashboarding solution, to query the data. In this scenario the Grafana runtime continuously polls the New Relic data via the REST API to display status and key performance metrics.
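Steps 3 and 4 can be illustrated with a minimal Python sketch of threshold evaluation and channel fan-out. This is not New Relic's implementation or API: the `AlertPolicy` class, the `evaluate` function, the metric name and the channel labels are all hypothetical and only show the idea of one policy forwarding an incident to several channels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertPolicy:
    """Hypothetical alert policy: one metric, one threshold, several channels."""
    metric: str
    threshold: float
    channels: tuple


def evaluate(policy, sample):
    """Return one incident per channel when the sample exceeds the threshold."""
    value = sample.get(policy.metric)
    if value is None or value <= policy.threshold:
        return []
    return [
        {
            "channel": channel,
            "metric": policy.metric,
            "value": value,
            "threshold": policy.threshold,
            "source": sample.get("entity", "unknown"),
        }
        for channel in policy.channels
    ]


# Assumed example values; "webhook:noi-probe" stands in for the WebHook channel.
policy = AlertPolicy("response_time_ms", 500.0, ("webhook:noi-probe", "email:ops"))
incidents = evaluate(policy, {"entity": "bluecompute-web", "response_time_ms": 742.0})
for incident in incidents:
    print(incident["channel"], incident["source"], incident["value"])
```

A sample below the threshold, or one that lacks the metric, produces no incidents, so nothing is forwarded.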

System Context Flow for IBM Netcool Operations Insight (NOI)

System Context Flow NOI

The above figure shows the deep dive of NOI and its various components and various integrated tools for incident management and their interactions. One of the key takeaways from the diagram is that the solution supports a heterogeneous mixture of products and solutions, each feeding or being fed by the central NOI solution.

The following flow describes the setup and operations of this solution in an overall cloud service management space:

  1. BlueCompute application components & infrastructure are monitored by the third-party solution New Relic for resource monitoring and URL response <and IBM BAM for synthetic monitoring (via ANS)>.

  2. The probes normalize the events into a common format and send them to the central Omnibus system. The monitoring events sent to NOI via these probes are then correlated, de-duplicated, analyzed & enriched. Further actions may be automated or performed by a first responder/incident owner/runbook automated service.

  3. Impact has three roles:

    • It extracts events from the Bluemix infrastructure and forwards events on to the collaboration and notification solutions.
    • The analytics component of NOI enriches the events and finds correlations between events based on seasonality or relationship, limiting the number of alarms forwarded and making sure that important issues are prioritized.
    • Impact also enriches technical events with organizational and environmental context information such as deployment location, service relationships or affected client. Here a MySQL configuration data source is leveraged.

    The dashboards are used both to display events and to allow manual forwarding of events if an operator decides to do so.

  4. The correlated events are forwarded to collaboration and notification tools, where action may be taken to solve the detected issues. NOI supports a variety of such solutions; in this document we look at integration with Slack for collaboration and IBM Alert Notification System for notification and escalation. NOI can publish events to generic targets (e.g. a single Slack channel that is used for all alerts) or specific ones (e.g. NOI can be automated to create a dedicated Slack channel for a single generated event).

  5. Runbooks connected to NOI are automated to update the event status based on the resolution of the issue. Status updates can also be handled manually within NOI. NOI is also capable of bi-directional communication with the notification tool, so that an event status update can take place in either tool. The updated status is then propagated.
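The normalize, de-duplicate and enrich stages described in steps 2 and 3 can be sketched in a few lines of Python. This is a simplified illustration, not how the probes, Omnibus or Impact are implemented: the payload keys (`host`, `check`, `message`), the event field names and the context table are assumptions made for the example.

```python
def normalize(raw):
    """Map a source-specific payload onto a common event shape."""
    return {
        "identifier": f"{raw['host']}:{raw['check']}",
        "node": raw["host"],
        "summary": raw["message"],
        "severity": int(raw.get("severity", 1)),
        "tally": 1,
    }


def deduplicate(events):
    """Collapse events that share an identifier and count repeats in 'tally'."""
    merged = {}
    for event in events:
        key = event["identifier"]
        if key in merged:
            merged[key]["tally"] += 1
            merged[key]["severity"] = max(merged[key]["severity"], event["severity"])
            merged[key]["summary"] = event["summary"]  # keep the most recent text
        else:
            merged[key] = dict(event)
    return list(merged.values())


def enrich(event, context_by_node):
    """Add environment context (for example location) looked up by node name."""
    return {**event, **context_by_node.get(event["node"], {})}


# Hypothetical input: two repeats of one problem plus one unrelated event.
raw_events = [
    {"host": "web-1", "check": "http", "message": "slow response", "severity": 3},
    {"host": "web-1", "check": "http", "message": "timeout", "severity": 5},
    {"host": "db-1", "check": "disk", "message": "disk 91% full", "severity": 4},
]
# Stand-in for the MySQL configuration data source.
context = {"web-1": {"location": "softlayer", "service": "bluecompute-web"}}

enriched = [enrich(e, context) for e in deduplicate([normalize(r) for r in raw_events])]
for event in enriched:
    print(event)
```

Three raw events become two correlated events. The repeated `web-1:http` event carries a tally of 2, the highest severity seen and the context fields from the lookup table.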

System Context Flow for IBM Alert Notification System

System Context Flow ANS

The above figure shows the deep dive of ANS and its various components and various integrated tools for incident management and their interactions.

  1. An alert is raised via NOI or the POST API and sent to the Alert Notification Service (API).

  2. Alert Notification processes the alerts via Notification Policies and delivers each alert as specified in the policy (email, SMS, Slack or voice).

  3. The alert is delivered to external targets via one or more of these options: email, SMS, Slack or voice.

  4. The First Responder, Development and the Incident Owner use collaboration tools for alert resolution.
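The policy-driven delivery in steps 2 and 3 can be pictured with the following Python sketch. It does not reflect the actual Alert Notification policy model or API: the policy fields (`min_severity`, `applications`, `recipients`) and the `plan_deliveries` function are hypothetical and only show how an alert could be matched against policies and mapped to each recipient's preferred channel.

```python
def plan_deliveries(alert, policies):
    """Return (recipient, channel) pairs for every policy the alert matches."""
    deliveries = []
    for policy in policies:
        if alert["severity"] < policy["min_severity"]:
            continue
        if alert["application"] not in policy["applications"]:
            continue
        for recipient in policy["recipients"]:
            delivery = (recipient["name"], recipient["channel"])
            if delivery not in deliveries:  # avoid notifying the same target twice
                deliveries.append(delivery)
    return deliveries


# Assumed policies for illustration only.
policies = [
    {
        "min_severity": 4,
        "applications": {"bluecompute-web"},
        "recipients": [
            {"name": "first-responder", "channel": "sms"},
            {"name": "incident-owner", "channel": "email"},
        ],
    },
    {
        "min_severity": 1,
        "applications": {"bluecompute-web", "bluecompute-db"},
        "recipients": [{"name": "ops-channel", "channel": "slack"}],
    },
]

critical = {"application": "bluecompute-web", "severity": 5, "summary": "timeout"}
for name, channel in plan_deliveries(critical, policies):
    print(f"notify {name} via {channel}")
```

A low-severity alert for `bluecompute-db` would match only the second policy and reach the Slack channel alone.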

System Context Flow for Grafana

System Context Flow Grafana

The above figure shows the deep dive of Grafana and its various components and various integrated tools for incident management and their interactions.

  1. The Dashboard relies on data from various data sources, which are accessed via various interfaces.

    • The configuration management data is read from the database with the help of SQL client tools.
    • Status and key performance metrics from the New Relic APM system are collected via the REST APIs.
    • Event information is read from the Netcool Omnibus system by means of a REST API.
    • Status and configuration information for Bluemix applications and containers is also retrieved via the Bluemix API/CLI.

    Data can be accessed either directly from the Dashboard REST API data provider or from a separate runtime instance.

  2. A Perl runtime collects data on a regular schedule from various data sources that provide monitoring and status information for the BlueCompute application. In this scenario this includes

    • the resource monitoring data from New Relic via a REST API,
    • the Bluemix Cloud Foundry information for applications and containers via the CF API,
    • the NOI status information via the Netcool REST API and
    • the configuration data from the configuration database data source on MySQL to read and enrich the monitoring data with environment context data such as deployment location and service relationships.

  3. The Perl runtime mashes up all relevant data and writes the consolidated data into the Grafana database based on InfluxDB.

  4. Grafana accesses the data via its defined data sources and displays the mashed-up data from InfluxDB and individual New Relic data inside the configured dashboard pages. Grafana also allows external URL pages to be launched in new browser tabs as part of the use case scenarios. This includes the launch of

    • the event viewer page from NOI, which displays the events associated with the selected page item via an ad-hoc filter, and
    • the BlueCompute Service Map from New Relic.

  5. Via the event viewer, Runbook Automation can be triggered and displayed.
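The consolidation in step 3 ends with a write to InfluxDB, which accepts points in its text-based line protocol. The project used a Perl runtime for this; the Python sketch below only illustrates how one consolidated status record could be formatted as a line protocol string. The measurement, tag and field names are assumptions, and the actual write to the database is left out.

```python
def _escape(text, chars):
    """Backslash-escape each character in `chars` wherever it occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value: boolean, integer (i suffix), float or quoted string."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + _escape(str(value), '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line protocol record; keys are sorted for stable output."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(fields[key]) for key in sorted(fields)
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical mashed-up record: New Relic metric, NOI event count, CMDB context.
line = to_line_protocol(
    "service_status",
    {"service": "bluecompute-web", "location": "softlayer"},
    {"response_ms": 742.0, "open_events": 3},
    1500000000000000000,
)
print(line)
```

Context values such as deployment location are stored as tags so that Grafana queries can filter and group by them, while the measured values are stored as fields.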

System Context Flow for IBM Control Desk

System Context Flow IBM Control Desk

The above figure shows the deep dive of IBM Control Desk and its various components and various integrated tools for incident management and their interactions.

  1. The user or request fulfillment system reports an incident. Stakeholders (e.g., the Application Owner) are continuously informed about the status of the incident.
  2. The Monitoring and Logging tools, which include IBM and third-party tools, are connected to the managed solutions; they detect issues early and send alerts to the Event Correlation tool and the unified Dashboard.
  3. The Event Correlation tool correlates events from multiple sources and helps identify and isolate the problem by alerting the Collaboration and Notification systems. The First Responder team typically uses correlated events to narrow down the issue quickly. For complex issues the Incident Owner and Subject Matter Experts collaborate on the investigation and resolution.
  4. The Notification system creates a collaboration channel with alerts specific to an incident, giving the Incident Owner and Subject Matter Experts a record of the incident investigation and mitigation.
  5. The Notification system creates an incident with specific details so that the First Responder can resolve it using the incident record, either independently or in collaboration with others in a channel.
  6. The Dashboards are preconfigured to provide a single view of events from the Event Correlation and Monitoring systems, guiding the First Responder and Subject Matter Experts to isolate and resolve issues by executing Runbooks.
  7. The First Responder team is equipped with automation and well-defined Runbooks to resolve the issue quickly. The automated process also updates the status of the event so that the dashboard, notification and collaboration channels are synchronized.

System Context Flow for IBM Runbook Automation

System Context Flow IBM Runbook Automation

The above figure shows the deep dive of IBM Runbook Automation (RBA) and the Runbook Automation workflow to resolve incidents.

  1. Subject Matter Experts create runbooks to resolve various application issues/problems.
  2. The 1st Responder sees an alert on the Dashboard and launches into RBA from the Dashboard.
  3. The 1st Responder then searches RBA for a runbook that will resolve the issue/problem.
  4. The 1st Responder then executes the runbook.
  5. The 1st Responder then evaluates the runbook and comments on its execution.
  6. The 1st Responder acknowledges the alert.
  7. Automated triggers send acknowledgements and, upon completion, send notification of the success of the runbook.
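The execute-and-notify part of this workflow (steps 4 to 7) can be sketched as follows. This is not the RBA API: `execute_runbook`, the step names, the event dictionary and the `notify` callback are hypothetical and only illustrate running a repeatable list of steps, updating the event status and sending a notification on completion.

```python
def execute_runbook(steps, event, notify):
    """Run steps in order, stop at the first failure, then update and notify."""
    log = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            log.append((name, f"failed: {exc}"))
            notify(f"Runbook failed at '{name}' for event {event['id']}")
            return log
        log.append((name, "ok"))
    event["status"] = "resolved"
    notify(f"Runbook succeeded for event {event['id']}")
    return log


# Assumed example: both steps are placeholders that always succeed.
messages = []
event = {"id": "evt-1", "status": "acknowledged"}
steps = [
    ("restart web tier", lambda: None),
    ("verify health check", lambda: None),
]
log = execute_runbook(steps, event, messages.append)
print(event["status"], log, messages)
```

If a step raises an exception, the remaining steps are skipped and the event keeps its previous status, so the responder can follow up manually.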

How to Use the toolchain

The following walkthrough guides you through how to use the toolchain for end-to-end monitoring of the hybrid application. You will learn how to implement basic incident management capabilities and how to build a more advanced, robust incident management solution.

Step 1: Installation prerequisites

When deployed using an instant runtime, the solution for incident management requires the following items:

Step 2: Incident Management walkthrough

The cloud native Cloud Service Management and Operations incident management walkthrough is provided with the tools in the toolchain.

The following sections focus only on the updates needed to instrument or use a hybrid application and defer to the published how-to documents for the selected tools of the toolchain.

Step 3: Monitoring

Tool option a: How to Use New Relic for BlueCompute

New Relic is a Software-as-a-Service (SaaS) offering in which agents are injected into Bluemix Runtimes, IBM Bluemix Containers or SoftLayer Containers and automatically start reporting metrics back to the New Relic service over the internet.

Please be aware that the instrumented components need an active outbound internet connection, either direct or via gateway services.

For detailed steps please continue with How to setup New Relic for BlueCompute

Step 4: Event Management

Tool option a: How to use IBM Netcool Operations Insight for BlueCompute

IBM Netcool Operations Insight accelerates the operations management lifecycle from problem detection to fix. It receives events from resource monitoring solutions and enriches, correlates and escalates them based on rule automation.

For detailed steps please continue with How to setup NOI for BlueCompute

Step 5: Notification

Tool option a: How to use IBM Alert Notification System for BlueCompute

IBM Alert Notification System is an IBM Bluemix® service that instantly delivers notifications of problem occurrences in your Bluemix environment using automated email, Short Message Service (SMS), and voice messaging.

For detailed steps please continue with How to setup ANS for BlueCompute

Step 6: Collaboration

Tool option a: How to use Slack for BlueCompute

Slack is an instant messaging and collaboration system. Slack’s channels help you focus by letting you separate messages, discussions and notifications by purpose, department or topic. For incidents that occur in a BlueCompute environment, channels can be used to collaborate on remediation with various users such as the First Responder, Subject Matter Experts and Incident Owner, as well as with numerous tools via prebuilt integrations.

For detailed steps please continue with How to setup Slack for BlueCompute

Step 7: Ticketing & Trending

Tool option a: How to use IBM Control Desk for BlueCompute

IBM Control Desk unified IT asset and service management software provides a common control center for managing business processes for both digital and physical assets. It enables control, governance and compliance to applications, endpoints and assets to protect critical data and prevent outages.

For detailed steps please continue with How to setup IBM Control Desk for BlueCompute

Step 8: Dashboarding

Tool option a: How to use Grafana Dashboarding for BlueCompute

Grafana is one of the leading tools for querying and visualizing time series and metrics. In this project we used it to create dashboards for the First Responder persona. Grafana features a variety of panels, including fully featured graph panels with rich visualization options. There is built-in support for many time series data sources such as InfluxDB or Graphite. We used InfluxDB, a time series database for metrics, as a data source for Grafana, and a Perl script to collect data from various APIs of the BlueCompute CSMO infrastructure such as New Relic, Bluemix, NOI or the CMDB.

For detailed steps please continue with How to setup Grafana for BlueCompute

Step 9: Runbook

Tool option a: How to use IBM Runbook Automation for BlueCompute

IBM Runbook Automation (RBA) is an easy-to-use service. It can help IT Operations simplify and automate the resolution of operational issues with applications. It can help eliminate the reliance on manual effort and remove issues caused by varying skill sets. With RBA, you can reduce delays, disruption and risk of errors by creating a repeatable set of instructions to resolve an issue.

For detailed steps please continue with How to setup IBM Runbook Automation for BlueCompute

Reference Product Links

IBM Netcool Operations Insight

IBM Runbook Automation

IBM Alert Notification

New Relic

Slack

Grafana