# Risk Identification, Monitoring, & Analysis

## Understand the Risk Management Process

Defense is a set of strategies, management is about making decisions, and mitigation is a set of tactics chosen to implement those decisions. **Integrated information risk management** is about protecting what's important to the organization. It's about what to protect and why; risk mitigation addresses how.

The key to risk and risk management is simple: it's about making decisions in reliable ways and using the CIA triad to help you know when the decision you’re about to make is a reliable one…and when it is a blind leap into the dark. From the SSCP’s perspective, information security is necessary because it enables more decisions to be made on time and on target. Reliable decision making is as much about long-range planning as it is about incident response. This means that you can rely on the following:

* Your individual and organizational memory (the information and knowledge you think you already have, know, and understand)
* New information that you've gathered, processed, and used as inputs to this decision
* Your ability to deliberate, examine, review, think, and then to decide, free from disruption
* Your ability to communicate your decision (the "new marching orders") to those elements of your organization and systems that have to carry them out

![Risk](images/risk)

Two important questions must be asked about such failures or risk occurrences as incidents:

* First, how predictable are incidents like these? How often do the sorts of mistakes that lead to such incidents happen? When might they happen? If we can predict how often such circumstances might occur, or identify conditions that increase the likelihood of such mistakes or failures, we might gain insight into ways to prevent them. In risk management terms, this asks us to make reasonable assumptions that help us estimate the frequencies of occurrence and probabilities of occurrence for such events.

* Second, how much impact do they have on the organization, its goals and objectives, and its assets, people, or reputation? What did this cost us, in terms of money, lost business, real damages, injuries or deaths, and loss of goodwill among our customers and suppliers?

These answers suggest that if something we do, use, or depend on can fail, no matter what the cause, then we can start to look at the how of those failures—but we let those frequencies, probabilities, and possible impacts guide us to prioritize which risks we look at first, and which we can choose to look at later.

We care about risks because when they occur (when they become an incident), they disrupt our plans. Incidents disrupt us in two ways:

* They break our chain of thought. They interrupt the flow of decision making that we “normally” would be using to carry out our planned, regular, normal activities.

* They cause us to react to their occurrence. We divert time, labor, money, effort, and decision making into responding to that incident.

Every one of those decisions, large or small, is an opportunity for somebody or something to "mess with" what you had planned and what you want and need to accomplish:

* Competitors can learn what you’re planning to do.
* Customer requests can be mishandled, misrouted, or ignored, which may lead to customers taking their business elsewhere.
* Costs can be erroneously increased, and revenues can be lost.

**Decision assurance**, then, consists of protecting the availability, reliability, and integrity of the four main components of the decision process:

* The knowledge we already have (our memory and experience), including knowledge of our goals, objectives, and priorities
* New information we receive from others (the marketplace, customers, others in the organization, and so on)
* Our cognitive ability to think and reason with these two sets of information and to come to a decision
* Taking action to carry out that decision or to communicate that decision to others, who will then be responsible for taking action

One of the most powerful decision assurance tools that managers and leaders can use at almost any organizational level is to “sanity-check” the inputs, the thinking, and the proposed actions with other people before committing to a course of action. “Does this make sense?” is a question that experience suggests ought to be asked often but isn’t. For information security specialists, checking your facts, your stored knowledge, your logic, and your planning with others can take many different forms:

* Sharing or pooling risk management information with others in your marketplace, with insurers or re-insurers, or with key stakeholders
* Actively participating in threat and risk reduction communities of practice, information exchanges, and community emergency response planning groups, which might include representation from local and national government authorities
* Using “anti-groupthink” processes and techniques to prevent your decision processes from stifling new voices or contrary views
* Finding ways to be “surprise-tolerant” so that unanticipated observations about day-to-day operational events can generate possible new insight
* Building, maintaining, and using mentors, peer groups, and trusted advisory groups, both from within the organization and from outside

### Risk Visibility & Reporting (e.g., Risk Register, Sharing Threat Intelligence, Common Vulnerability Scoring System (CVSS))

Three observations are important here, so important that they are worth considering as rules in and of themselves:
* Rule 1: All things will end. Systems will fail; parts will wear out. People will get sick, quit, die, or change their minds. Information will never be complete or absolutely accurate or true.
* Rule 2: The best you can do in the face of Rule 1 is spend money, time, and effort making some things more robust and resilient at the expense of others, and thus trade off the risk of one kind of failure for another.
* Rule 3: There’s nothing you can do to avoid Rule 1 and Rule 2.

![Four Faces of Risk](images/four-faces-of-risk.png)

Risk management, then, is trading off effort and resources now to reduce the possibility of a risk occurring later and, if it does occur, to limit the damage it can cause to us or to the things, people, and objectives we hold important. The impact or loss that can happen to us when a risk goes from being a possibility to a real occurrence - when it becomes an incident - is often looked at first in terms of how it affects our organization's goals, objectives, systems, and our people. This provides four ways of looking at risk (by outcomes, processes, assets, and threats or vulnerabilities), no one of which is the single right way. All of these perspectives have something to reveal to us about the information risks our organization may be facing.

![The Layered View](images/the-layered-view.png)

These layers of function may take physical, logical, and administrative forms throughout every human enterprise:

* **Physical systems elements** are typically things such as buildings, machinery, wiring systems, and the hardware elements of IT systems. The land surrounding the buildings, the fences and landscaping, lighting, and pavements are also some of the physical elements you need to consider as you plan for information risk management. The physical components of infrastructures, such as electric power, water, sewer, storm drains, streets and transportation, and trash removal, are also important. What’s missing from this list? People. People are of course physical (perhaps illogical?) elements that should not be left out of our risk management considerations!
* **Administrative elements** are the policies, procedures, training, and expectations that we spell out for the humans in the organization to follow. These are typically the first level at which legal and regulatory constraints or directives become a part of the way the organization functions.
* **Logical elements** (sometimes called **technical elements**) are the software, firmware, database, or other control systems settings that you use to make the physical elements of the organization’s IT systems obey the dictates and meet the needs of the administrative ones.

#### Outcomes-Based Risk
This face of risk looks at why people or organizations do what they do or set out to achieve their goals or objectives. The outcomes of achieving those goals or objectives are the tangible or intangible results we produce, the harvest we reap.

Here's a hypothetical example: Search Improvement Engineering (SIE) is a small software development company that makes and markets web search optimization aids targeted to mobile phone users. SIE’s chief of product development wants to move away from in-house computers, servers, and networks and start using cloud-based integrated development and test tools instead; this, she argues, will reduce costs, improve overall product quality and sustainability, and eliminate risks of disruption that owning (and maintaining) their own development computer systems can bring. The outcome is to improve software product quality, lower costs, and enable the company to make new products for new markets. This further supports the higher-level outcomes of organizational survival, financial health, growth, and expansion. One outcomes-based risk would be the disclosure, compromise, or loss of control over SIE’s designs, algorithms, source code, or test data to other customers operating on the cloud service provider’s systems.

#### Process-Based Risk

Everything we want to achieve or do requires us to take some action; action requires us to make a decision. Even if it’s only one action that flows from one decision, that's a process. In organizational terms, a **business process** takes a logical sequence of purpose, intention, conditions, and constraints and structures them as a set of systematic actions and decisions in order to carry them out. This **business logic**, and the business processes that implement it, also typically provide indicators or measurements that allow operators and managers to monitor the execution of the process, assess whether key steps are working correctly, signal completion of the process (and thus perhaps trigger the next process), or issue an alarm to indicate that attention and action are required. When a task (a process step) fails to function properly, this can either stop the process completely or lead to erroneous results.

#### Asset-Based Risk

Broadly speaking, an asset is anything that the organization (or the individual) has, owns, uses, or produces as part of its efforts to achieve some of its goals and objectives. Buildings, machinery, or money on deposit in a bank are examples of hard, or tangible assets. The people in your organization (including you!), the knowledge that is recorded in the business logic of your business processes, your reputation in the marketplace, the intellectual property that you own as patents or trade secrets, and every bit of information that you own or use are examples of soft, or intangible assets. Assets are the tools you use to perform the steps in your business processes; without assets, the best business logic cannot do anything.

#### Threat-Based (or Vulnerability-Based) Risk

These are two sides of the same coin, really. Threat actors (natural or human) are things that can cause damage and destruction leading to loss. Vulnerabilities are weaknesses within systems, processes, assets, and so forth that are points of potential failure. When (not if) they fail, they result in damage, disruption, and loss. Typically, threats or threat actors exploit (make use of) vulnerabilities. Threats can be natural (such as storms or earthquakes), accidental (failures of processes or systems due to unintentional actions or normal wear and tear, causing a component to fail), or deliberate actions taken by humans or instigated by humans. Such intentional attackers have purposes, goals, or objectives they seek to accomplish; Mother Nature or a careless worker does not intend to cause disruption, damage, or loss.

As an example, consider a typical small office/home office (SOHO) IT network, consisting of a modem/router, a few PCs or laptops, and maybe a network attached printer and storage system. A thunderstorm can interrupt electrical power; the lack of a backup power supply is a weakness or vulnerability that the thunderstorm unintentionally exploits. By contrast, the actions of the upstairs neighbors or passers-by who try to “borrow some bandwidth” and make use of the SOHO network’s wireless connection will most likely degrade service for authorized users, quite possibly leading to interruptions in important business or personal tasks. This is deliberate action, taken by threat actors, that succeeds perhaps by exploiting poorly configured security settings in the wireless network, whether its intention was hostile (e.g., willful disruption) or merely inconsiderate.

#### Risk Register

At this point, the organization or business needs to be building a risk register, a central repository or knowledge bank of the risks that have been identified in its business and business process systems. This register should be a living document, constantly refreshed as the company moves from risk identification through mitigation to the “new normal” of operations after instituting risk controls or countermeasures.

As an internal document, a company’s risk register is a compendium of its weaknesses and should be considered as closely held, confidential, proprietary business information. It provides a would-be attacker, competitors, or a disgruntled employee with powerful insight into ways that the company might be vulnerable to attacks. This need to protect the confidentiality of the risk register becomes even more acute as the risk register is updated from first-level outcomes or process-based identification through impact assessments, and then linked (as you’ll see in the next chapter, “Operationalizing Risk Mitigation”) with systems vulnerability or root cause/proximate cause assessments.
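As a rough illustration of the "living document" idea, a risk register can be sketched as a small in-memory data structure. This is a minimal sketch, not a prescribed format: the field names, the `Status` values, and the `RiskRegister` class are all hypothetical, and a real register would also need the access controls and confidentiality protections described above.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Dict, List, Optional


class Status(Enum):
    """Hypothetical lifecycle stages, from identification to the 'new normal'."""
    IDENTIFIED = "identified"
    ASSESSED = "assessed"
    MITIGATED = "mitigated"


@dataclass
class RiskEntry:
    risk_id: str
    description: str
    basis: str  # assumed labels: "outcome", "process", "asset", or "threat"
    status: Status = Status.IDENTIFIED
    last_reviewed: date = field(default_factory=date.today)
    notes: List[str] = field(default_factory=list)


class RiskRegister:
    """Central repository of identified risks, keyed by risk ID."""

    def __init__(self) -> None:
        self._entries: Dict[str, RiskEntry] = {}

    def add(self, entry: RiskEntry) -> None:
        if entry.risk_id in self._entries:
            raise ValueError(f"duplicate risk ID: {entry.risk_id}")
        self._entries[entry.risk_id] = entry

    def update(self, risk_id: str, status: Status, note: str,
               when: Optional[date] = None) -> None:
        # Every update records a note and refreshes the review date,
        # so the register shows how current each entry is.
        entry = self._entries[risk_id]
        entry.status = status
        entry.notes.append(note)
        entry.last_reviewed = when or date.today()

    def by_status(self, status: Status) -> List[RiskEntry]:
        return [e for e in self._entries.values() if e.status is status]


register = RiskRegister()
register.add(RiskEntry("R-001", "Source code disclosed to other cloud tenants", "outcome"))
register.add(RiskEntry("R-002", "No backup power for the office network", "threat"))
register.update("R-001", Status.MITIGATED, "Encrypted repositories; access reviewed")

for entry in register.by_status(Status.IDENTIFIED):
    print(entry.risk_id, entry.description)
```

Keeping the review date and notes on each entry is one way to make staleness visible; which fields an organization actually tracks is its own decision.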


### Risk Management Concepts (e.g., Impact Assessments, Threat Modelling, Business Impact Analysis (BIA))

Information security best practices suggest a good minimum set of "when in doubt" actions to ensure that the organization:
* Physically protects and secures information systems, information storage (paper or electronic), and supporting infrastructure
* Controls access by all users, visitors, and guests, such as with usernames and passwords, for all computer systems
* Controls disclosure and disposal of information and information systems
* Trains all staff (or anyone with access) on these minimum security measures

This "safe computing" or **computing hygiene** standard is a proven place for any organization to start.

Two sets of information provide a rich source of information security requirements for an organization. The first is the legal, regulatory, and cultural context in which the organization must exist. As stated before, failure to fulfill these obligations can put the organization out of existence, and its leaders, owners, stakeholders (and even its employees) at risk of civil or criminal prosecution. The second set of information that should drive the synthesis of information security requirements is the organization’s BIA.

There are typically two major ways that information security requirements take form or are expressed or stated within an organization. The first is to write a system requirements specification (SRS), which is a formal document used to capture high-level statements of function, purpose, and intent. An SRS also contains important system-level constraints. It guides or directs analysts and developers as they design, build, test, deploy, and maintain an information system; it also drives end-user training activities.

Organizations also write and implement policies and procedures that state what the information security requirements are and what the people in the organization need to do to fulfill them and comply with them:
* **Policies** are broad statements of direction and intention; in most organizations, they establish direction and provide constraints to leaders, managers, and the workforce. Policies direct or dictate what should be done, to what standards of compliance, who does it, and why they should do it. Policies are usually approved (“signed out”) by senior leadership, and are used to guide, shape, direct, and evaluate the performance of the people who are affected by the policies; they are thus considered administrative in nature.
* **Procedures** take the broad statements expressed in policies and break them down into step-by-step detailed instructions to those people who are assigned responsibility to perform them. Procedures state how a task needs to be performed and should also state what constraints or success criteria apply. As instructions to people who perform these tasks, procedures are administrative in nature.

You might ask which should come first, the SRS or the policies and procedures. Once senior leadership agrees to a statement of need, it's probably faster to publish a policy and a new procedure than it is to write the SRS, design the system, test it, deliver it, and train users on the right ways to use it. But be careful! It often takes a lot of time and effort for the people in an organization to operationalize a new policy and the procedures that come with it. Overlooking this training hurdle can cause the new policy or procedures to fail.

#### Business Impact Analysis (BIA)

The business impact analysis (BIA) is where the rubber hits the road, so to speak. Risk management must be a balance of priorities, resources, probabilities, and impacts, as you’ve seen throughout this chapter. All this comes together in the BIA. As its name implies, the BIA is a consolidated statement of how different risks could impact the prioritized goals and objectives of an organization.

The BIA reflects a combination of due care and due diligence in that it combines "how we do business" with "how we know how well we're doing it".

There is no one right, best format for a BIA; instead, each organization must determine what its BIA needs to capture and how it has to present it to achieve a mix of purposes:
* BIAs should inform, guide, and shape risk management decisions by senior leadership.
* BIAs should provide the insight to choose a balanced, prudent mix of risk mitigation tactics and techniques.
* BIAs should guide the organization in accepting residual risk to goals, objectives, processes, or assets in areas where this is appropriate.
* BIAs may be required to meet external stakeholder needs, such as for insurance, financial, regulatory, or other compliance purposes.

You must recognize one more important requirement at this point: to be effective, a BIA must be kept up to date. The BIA must reflect today's set of concerns, priorities, assets, and processes; it must reflect today's understanding of threats and vulnerabilities. Outdated information in a BIA could at best lead to wasted expenditures and efforts on risk mitigation; at worst, it could lead to failures to mitigate, prevent, or contain risks that could lead to serious damage, injury, or death, or possibly put the organization out of business completely.

At its heart, making a BIA is pretty simple: you identify what's important, estimate how often it might fail, and estimate the costs to you of those failures. You then rank those possible impacts in terms of which basis for risk best suits your organization, be that outcomes, processes, assets, or vulnerabilities. For all but the simplest and smallest of organizations, however, the amount of information that has to be gathered, analyzed, organized, assessed, and then brought together in the BIA can be overwhelming. The BIA is one of the most critical steps in the information risk management process, end to end; it's also perhaps the most iterative, the most open to reconsideration as things change, and the most in need of being kept alive, current, and useful. Most of that is well beyond the scope of the SSCP examination, and so we won’t go into the mechanics of the business impact analysis process in any further detail. As an SSCP, however, you’ll be expected to continue to grow your knowledge and skills, thus becoming a valued contributor to your organization’s BIA.
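The "identify, estimate, rank" core of a BIA can be sketched in a few lines. This is only an illustration under stated assumptions: the `ImpactEstimate` class, the `rank_impacts` helper, and every figure in the sample data are hypothetical, and a real BIA draws on far more information than a frequency and a cost per item.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ImpactEstimate:
    name: str
    basis: str                # assumed labels: "outcome", "process", "asset", "vulnerability"
    failures_per_year: float  # estimate of how often it might fail
    cost_per_failure: float   # estimated cost to the organization of one failure

    @property
    def expected_annual_cost(self) -> float:
        return self.failures_per_year * self.cost_per_failure


def rank_impacts(estimates: List[ImpactEstimate],
                 basis: Optional[str] = None) -> List[ImpactEstimate]:
    """Rank estimates by expected annual cost, optionally for one basis of risk."""
    selected = [e for e in estimates if basis is None or e.basis == basis]
    return sorted(selected, key=lambda e: e.expected_annual_cost, reverse=True)


# Made-up sample data, purely for illustration.
estimates = [
    ImpactEstimate("Order-entry process halts", "process", 2.0, 15_000.0),
    ImpactEstimate("Customer database server fails", "asset", 0.5, 80_000.0),
    ImpactEstimate("Unpatched web server exploited", "vulnerability", 0.25, 200_000.0),
]

for item in rank_impacts(estimates):
    print(f"{item.name}: {item.expected_annual_cost:,.0f} per year")
```

Note that the frequent but cheap failure ranks below the rare but expensive one here, which is why both frequency and cost need to be estimated before priorities are set.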

### Risk Management Frameworks (e.g., ISO, NIST)

A **risk management framework** is a set of concepts, tools, processes, and techniques that help organize information about risk. As you’ve no doubt started to see, the job of managing risks to your information is a set of many jobs, layered together. More than that, it’s a set of jobs that changes and evolves with time as the organization, its mission, and the threats it faces evolve.

Let’s start by taking a quick look at NIST Special Publication 800-37, Risk Management Framework (RMF) for Information Systems and Organizations: A System Life Cycle Approach for Security and Privacy. In its May 2018 draft updated form, this RMF establishes a broad, overarching perspective on what it calls the fundamentals of information systems risk management. Organizational leadership and management must address these areas of concern, shown conceptually in the figure below:

![NIST RMF Areas of Concern](images/nist-rmf-areas-of-concern.png)

1. Organization-wide risk management
2. Information security and privacy
3. System and system elements
4. Control allocation
5. Security and privacy posture
6. Supply chain risk management

You can see that there’s an expressed top-down priority or sequence here. It makes little sense to worry about your IT supply chain (which might be a source of malware-infested hardware, software, and services) if leadership and stakeholders have not first come to consensus about risks and risk management at the broader, strategic level. (You should also note that in NIST’s eyes, the big-to-little picture goes from strategic, to operational, to tactical, which is how many in government and the military think of these levels. Business around the world, though, sees it as strategic, to tactical, to day-to-day operations.)

The RMF goes on by specifying seven major phases (which it calls steps) of activities for information risk management:
1. Prepare
2. Categorize
3. Select
4. Implement
5. Assess
6. Authorize
7. Monitor

It is tempting to think of these as step-by-step sets of activities - for example, once all risks have been categorized, you then start selecting which are the most urgent and compelling to make mitigation decisions about. Real-world experience shows, though, that each step in the process reveals things that may challenge the assumptions we just finished making, causing us to reevaluate what we thought we knew or decided in that previous step. It is perhaps more useful to think of these steps as overlapping sets of attitudes and outlooks that frame and guide how overlapping sets of people within the organization do the data gathering, inspection, analysis, problem solving, and implementation of the chosen risk controls. The figure below shows that there's a continual ebb and flow of information, insight, and decision between and across all elements of these "steps".

![NIST RMF Phased Approach](images/nist-rmf-phased-approach.png)

Although NIST publications are directive in nature for U.S. government systems, and indirectly provide strong guidance to the IT security market in the United States and elsewhere, many other information risk management frameworks are in widespread use around the world. For example, the International Organization for Standardization publishes ISO Standard 31000:2018, Risk Management Guidelines, in which the same concepts are arranged in slightly different fashion. First, it suggests that three main tasks must be done (and in broad terms, done in the order shown):
1. Scope, Context, Criteria
2. Risk Assessment, consisting of Risk Identification, Risk Analysis, and Risk Evaluation
3. Risk Treatment 

Three additional, broader functions support or surround these central risk mitigation tasks:

4. Recording and Reporting
5. Monitoring and Review
6. Communication and Consultation

As you can see in the figure below, the ISO RMF also conveys a sense that on the one hand, there is a sequence of major activities, but on the other hand, these major steps or phases are closely overlapping.

![ISO 31000:2018 Conceptual RMF](images/iso-31000-2018-conceptual-rmf.png)

#### Plan, Do, Check, Act (PDCA)

The Project Management Institute and many other organizations talk about the basic cycle of making decisions, taking steps to carry out those decisions, monitoring and assessing the outcomes, and taking further actions to correct what's not working and strengthen or improve what is.

One important idea to keep in mind is that these cycles of Plan, Do, Check, Act (PDCA) don’t just happen one time: they repeat, they chain together in branches and sequels, and they nest one inside the other, as you can see in the figure below. Note too that planning is a forward-looking, predictive, thoughtful, and deliberate process. We plan our next vacation before we put in for leave or make hotel and travel arrangements; we plan how to deal with a major disruption due to bad weather before the tornado season starts!

![PDCA Cycle Diagram with Subcycles](images/pdca-cycle-diagram-with-subcycles.png)

* **Planning** is the process of laying out the step-by-step path we need to take to go from “where we are” to “where we want to be.” It’s a natural human activity; we do this every moment of our lives. Our most potent tools for planning are what Kipling called his “six honest men”—asking what, why, when, how, where, and who of almost everything we are confronted with and every decision we have to make. As an SSCP, you need those six honest teammates with you at all times!
* **Doing** encompasses everything it takes to accomplish the plan. From the decisions to “execute the plan” on through all levels of action, this phase is where we see people using new or different business processes to achieve what the plan needs to accomplish, using the steps the plan asks for.
* **Checking** is part of conducting due diligence on what the plan asked us to achieve and how it asked us to get it done. We check that tasks are getting done, on time, to specification; we check that errors or exceptions are being handled correctly. And of course, we gather this feedback data and make it available for further analysis, process improvement, and leadership decision making.
* **Acting** involves making decisions and taking corrective or amplifying actions based on what the checking activities revealed. In this phase, leaders and managers may agree that a revised plan is needed, or that the existing plan is working fine but some individual processes need some fine-tuning to achieve better results.

#### Risk Assessment

Risk assessment is a systematic process of identifying risks to achieving organizational priorities.

At the heart of a risk assessment process must be the organizational goals and objectives, suitably prioritized. Typically, the highest priorities are existential ones—ones that relate to the continued existence and health of the organization. These often involve significant threats to continued operation or significant and strategic opportunities for growth. Other priorities may be vitally important in the near term, but other options may be available if the chosen favorite fails to be successful. The “merely nice to have” objectives may fall lower in the risk assessment process. This continual reevaluation of priorities allows the risk assessment team to focus on the most important, most compelling risks first.

The next major element of risk assessment is to thoroughly examine and evaluate the processes, assets, systems, information, and other elements of the organization as they relate to or support achieving these prioritized goals and objectives. This linkage of “what” and “how” with “why” helps narrow the search for system elements or process steps that, if they fail or are vulnerable to exploitation, could put these goals in jeopardy.

Most risk assessment processes typically summarize their findings in some form of BIA. This relates costs (in money, time, and resources) to the organization that could be faced if the risk events do occur. It also takes each risk and assesses how frequently it might occur. The expected cost of these risks (their costs multiplied by their frequencies and probabilities of occurrences, across the organization) represents the anticipated financial impact of that risk, over time; this is a key input to making risk mitigation or control choices.

What happens when an organization’s information is lost, compromised by disclosure to unauthorized parties, or corrupted? These questions (which reflect the CIA triad) indicate what the organization stands to lose if such a breach of information security happens. Let’s illustrate with a few examples:

* **Personally identifying information (PII)** Loss or compromise can cause customers to take their business elsewhere and can lead to criminal and civil penalties for the organization and its owners, stakeholders, leaders, and employees.
* **Company financial data, and price and cost information** Loss or compromise can lead to loss of business, to investors withdrawing their funds, or to loss of business opportunities as vendors and partners go elsewhere. Can also result in civil and criminal penalties.
* **Details about internal business processes** Loss could lead to failures of business processes to function correctly; compromise could lead to loss of competitive advantage, as others in the marketplace learn how to do your business better.
* **Risk management information** Loss or compromise could lead to insurance policies being canceled or premiums being increased, as insurers conclude that the organization cannot adequately fulfill its due diligence responsibilities.

When we view information in such terms—as “What does it cost us if we lose it?”—we decide how vital the information is to us. What this categorization or classification really does is tell us how important it is to protect that information, based on possible loss or impact. We categorize our possible losses, in terms of severity of damage, impact, or costs; we also categorize them in terms of outcomes, processes, and assets they have or depend on. Finally, we categorize them by threat or common vulnerabilities. This kind of risk analysis can help us identify critical locations, elements, or objectives that could be putting the entire organization at risk; in doing so, that focuses our risk analysis further.

Risk analysis is a complex undertaking and often involves trying to sort out what can cause a risk to become an incident. **Root cause analysis** looks to find what the underlying vulnerability or mechanism of failure is that leads to the incident, for example. By contrast, **proximate cause analysis** asks, “What was the last thing that happened that caused the risk to occur?” (This is sometimes called the “last clear opportunity to prevent” the incident, a term that insurance underwriters and their lawyers often use.) Our earlier example of backing your car out of the driveway, only to run over a child’s bicycle left in the wrong place, illustrates these ideas. You could have looked first, maybe even walked around the car before you got in and started to drive; you had the last clear opportunity to prevent damage, and thus your actions were the proximate cause. (You failed in your due diligence, in other words.) Your child, however, is the one who left the bicycle in the wrong place; the root of the problem may be the failure to help your child learn and appreciate what his responsibility of due care for his bicycle requires. And who was responsible for teaching due care to your child? (A word of advice: don’t say “My spouse.”)

We’ve looked at a number of examples of risks becoming incidents; for each, we’ve identified an outcome that describes what might happen (customers go to our competitors; we must get our car and the bicycle repaired). Outcomes are part of the basis of estimate with which we can make two kinds of **risk assessments**: quantitative and qualitative.

#### Quantitative Risk Assessment

Quantitative assessments use simple techniques (like counting possible occurrences, or estimating how often they might occur) along with estimates of the typical cost of each loss:

* Single loss expectancy (SLE): Usually measured in monetary terms, SLE is the total cost you can reasonably expect should the risk event occur. It includes immediate and delayed costs, direct and indirect costs, costs of repairs, and restoration. In some circumstances, it also includes lost opportunity costs, or lost revenues due to customers needing or choosing to go elsewhere.
* Annual rate of occurrence (ARO): ARO is an estimate of how often during a single year this event could reasonably be expected to occur.
* Annual loss expectancy (ALE): ALE is the total expected losses for a given year and is determined by multiplying the SLE by the ARO.
* Safeguard value: This is the estimated cost to implement and operate the chosen risk mitigation control. You cannot know this until you’ve chosen a risk control or countermeasure and an implementation plan for it; we’ll cover that in the next chapter.
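The relationship among SLE, ARO, and ALE can be sketched in a few lines of Python. This is only an illustration: the class name, the dollar figures, and the cost-effectiveness check (a common rule of thumb that compares the ALE a control removes against its yearly cost) are assumptions for the example, not values or rules taken from this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskEstimate:
    """One risk event with hypothetical, illustrative estimates."""
    name: str
    sle: float  # single loss expectancy: cost per occurrence
    aro: float  # annual rate of occurrence: expected events per year

    @property
    def ale(self) -> float:
        # ALE = SLE x ARO, as defined in the list above
        return self.sle * self.aro


def safeguard_is_cost_effective(ale_before: float, ale_after: float,
                                annual_safeguard_cost: float) -> bool:
    """Assumed rule of thumb: a control is worth considering when the
    ALE it removes is larger than what it costs per year."""
    return (ale_before - ale_after) > annual_safeguard_cost


# Hypothetical numbers: an outage costing 20,000 per event,
# expected once every four years (ARO = 0.25).
outage = RiskEstimate("server outage", sle=20_000.0, aro=0.25)
print(outage.ale)  # 5000.0

# A control costing 3,000 per year that cuts the ALE to 1,000.
print(safeguard_is_cost_effective(outage.ale, 1_000.0, 3_000.0))  # True
```

Real estimates are rarely this tidy; the point is only that ALE gives a single yearly figure against which the safeguard value can be weighed.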

Other numbers associated with risk assessment relate to how the business or organization deals with time when its systems, processes, and people are not available to do business. This “downtime” can often be expressed as a mean (or average) allowable downtime, or a maximum downtime. Times to repair or restore minimum functionality, and times to get everything back to normal, are also some of the numbers the SSCP will need to deal with. For example:

* The maximum acceptable outage (MAO) is the maximum time that a business process or task cannot be performed without causing intolerable disruption or damage to the business. Sometimes referred to as the maximum tolerable outage (MTO), or the maximum tolerable period of disruption (MTPOD), determining this maximum outage time starts with first identifying mission-critical outcomes. These outcomes, by definition, are vital to the ongoing success (and survival!) of the organization; thus, the processes, resources, systems, and no doubt people they require to properly function become mission-critical resources. If only one element of a mission-critical process is unavailable, and no immediate substitute or workaround is at hand, then the MAO clock starts ticking.
* The mean time to repair (MTTR), or mean time to restore, reflects our average experience in doing whatever it takes to get the failed system, component, or process repaired or replaced. The MTTR must include time to get suitable staff on scene who can diagnose the failure, identify the right repair or restoration needed, and draw from parts or replacement components on hand to effect repairs. MTTR calculations should also include time to verify that the repair has been done correctly and that the repaired system works correctly. This last requirement is very important: it does no good at all to swap out parts and say that something is fixed if you cannot assure management and users that the repaired system is now working the way it needs to in order to fulfill mission requirements.

These types of quantitative assessments help the organization understand what a risk can do when it actually happens (becomes an incident) and what it will take to get back to normal operations and clean up the mess it caused. One more important question remains: how long to repair and restore is too long? Two more “magic numbers” shed light on this question:

* The recovery time objective (RTO) is the amount of time in which system functionality or ability to perform the business process must be back in operation. Note that the RTO must be less than or equal to the MAO (if not, there’s an error in somebody’s thinking). As an objective, RTO asks systems designers, builders, maintainers, and operators to strive for a better, faster result. But be careful what you ask for; demanding too rapid an RTO can cause more harm than it deflects by driving the organization to spend far more than makes bottom-line sense.
* The recovery point objective (RPO) measures the data loss that is tolerable to the organization, typically expressed in terms of how much data needs to be loaded from backup systems in order to bring the operational system back up to where it needs to be. For example, an airline ticketing and reservations system takes every customer request as a transaction, copies the transactions into log files, and processes the transactions (which causes updates to its databases). Once that’s done, the transaction is considered completed. If the database is backed up in its entirety once a week, let’s say, then if the database crashes five days after the last backup, that backup is reloaded and then five days’ worth of transactions must be reapplied to the database to bring it up to where customers, aircrew, airport staff, and airplanes expect it to be. Careful consideration of an RPO allows the organization to balance costs of routine backups with time spent reapplying transactions to get back into business.
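The two rules of thumb above (RTO must not exceed MAO, and the backup schedule determines how much must be replayed) can be checked mechanically. The sketch below is a simplification under stated assumptions: it treats the RPO as a span of time, assumes complete transaction logs, and uses hypothetical function names and values.

```python
from datetime import timedelta


def rto_is_consistent(rto: timedelta, mao: timedelta) -> bool:
    """The RTO must be less than or equal to the MAO."""
    return rto <= mao


def worst_case_replay(backup_interval: timedelta) -> timedelta:
    """A crash just before the next full backup forces replay of
    (almost) one whole interval of logged transactions."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Assumption: the RPO is expressed as the largest span of
    transactions the organization is willing to reapply."""
    return worst_case_replay(backup_interval) <= rpo


# Hypothetical targets for a reservations system.
mao = timedelta(hours=12)
rto = timedelta(hours=4)
print(rto_is_consistent(rto, mao))  # True

# Weekly full backups against a one-day RPO: too much to replay.
print(meets_rpo(timedelta(days=7), rpo=timedelta(days=1)))  # False
# Daily backups satisfy the same RPO.
print(meets_rpo(timedelta(days=1), rpo=timedelta(days=1)))  # True
```

Tightening the backup interval raises routine backup costs, which is exactly the trade-off the RPO discussion asks the organization to weigh.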

We’ll go into these numbers (and others) in greater depth in Chapter 10 as you learn how to help your organization plan for and manage its response to actual information security and assurance incidents. It’s important that you realize that these numbers play three critical roles in your integrated, proactive information defense efforts. All of these quantitative assessments (plus the qualitative ones as well) help you achieve the following:

* Establish the “pain points” that lead to information security requirements that can be measured, assessed, implemented, and verified.
* Shape and guide the organization’s thinking about risk mitigation control strategies, tactics, and operations, and keep this thinking within cost-effective bounds.
* Dictate key business continuity planning needs and drive the way incident response activities must be planned, managed, and performed.

One final thought about the “magic numbers” is worth considering. The organization’s leadership holds its stakeholders’ personal and professional fortunes and futures in its hands. Exercising due diligence requires that management and leadership be able to show, by the numbers, that they’ve fulfilled that obligation and brought the organization back from the brink of irreparable harm when disaster strikes. Those stakeholders (the organization’s investors, customers, neighbors, and workers) need to trust in the leadership and management team’s ability to meet the bottom line every day. Solid, well-substantiated numbers like these help the stakeholders trust, but verify, that their team is doing its job.

#### Qualitative Risk Assessment
Qualitative assessments focus on an inherent quality, aspect, or characteristic of the risk as it relates to the outcome(s) of a risk occurrence. “Loss of business” could be losing a few customers, losing many customers, or closing the doors and going out of business entirely!

So, which assessment strategy works best? The answer is both. Some risk situations may present us with things we can count, measure, or make educated guesses about in numerical terms, but many do not. Some situations clearly identify existential threats to the organization (the occurrence of the threat puts the organization completely out of business); again, many situations are not as clear-cut. Senior leadership and organizational stakeholders find both qualitative and quantitative assessments useful and revealing.

Qualitative assessment of information is most often used as the basis of an information classification system, which labels broad categories of data to indicate the range of possible harm or impact. Most of us are familiar with such systems through their use by military and national security communities. Such simple hierarchical information classification systems often start with “Unclassified” and move up through “For Official Use Only,” “Confidential,” “Secret,” and “Top Secret” as their way of broadly outlining how severely the nation would be impacted if the information was disclosed, stolen, or otherwise compromised. Yet even these cannot stay simple for long.

Businesses, private organizations, and the military have another aspect of data categorization in common: the concept of need to know. Need to know limits who has access to read, use, or modify data based on whether their job functions require them to do so. Thus, a school’s purchasing department staff have a need to know about suppliers, prices, specific purchases, and so forth, but they do not need to know any of the PII pertaining to students, faculty, or other staff members. Need to know leads to compartmentalization of information, which creates procedural boundaries (administrative controls) around such sets of information.
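One way to picture how a hierarchical classification and need to know combine is a two-part access check. The sketch below is illustrative only: the level names follow the hierarchy described above, while the compartment names, the idea of a numeric clearance, and the `may_access` function are assumptions made for the example.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Hierarchy from the text, ordered from least to most sensitive."""
    UNCLASSIFIED = 0
    FOR_OFFICIAL_USE_ONLY = 1
    CONFIDENTIAL = 2
    SECRET = 3
    TOP_SECRET = 4


def may_access(clearance: Classification,
               compartments_held: frozenset,
               label: Classification,
               compartment: str) -> bool:
    """Grant access only when BOTH conditions hold:
    the clearance is high enough, and the job function
    gives a need to know for that compartment."""
    return clearance >= label and compartment in compartments_held


# Hypothetical purchasing clerk at a school.
clerk_clearance = Classification.CONFIDENTIAL
clerk_compartments = frozenset({"suppliers", "purchases"})

print(may_access(clerk_clearance, clerk_compartments,
                 Classification.CONFIDENTIAL, "suppliers"))    # True
print(may_access(clerk_clearance, clerk_compartments,
                 Classification.CONFIDENTIAL, "student_pii"))  # False
```

The second call fails even though the clearance level is sufficient, which is the essence of compartmentalization.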

### Risk Treatment (e.g., Accept, Transfer, Mitigate, Avoid, Recast)

Four strategic choices exist when we think of how to protect prioritized assets, outcomes, or processes. These choices are at the strategic level because, by their very nature, they are comparable to “life-or-death” choices for the organization. A strategic risk might force the company to choose between abandoning a market or opportunity and taking on a fundamental, gut-wrenching level of change throughout its ethics, culture, processes, or people, for example. We see such choices almost before we’ve started to think about what the alternatives might cost and what they might gain us. These strategic choices are often used in combination to achieve the desired level of assurance against risk. As an SSCP, you’ll assist your organization in making these choices across strategic, tactical, and operational levels of planning, decision making, and actions that people and the organization must take. Note that each of these choices is a verb; these are things that you do, actions you perform. This is key to understanding which ones to choose and how to use them successfully. We’ll look at each individually, and then take a closer look at how they combine and mutually reinforce each other to attain greater protective effect.

There are choices at the strategic and tactical level that seem quite similar and are often mistaken as identical. The best way to keep them separate in your mind might be as follows:

* If you’ve just completed the risk assessment and BIA, your strategic choices are about operational risk mitigation planning and which risks to deal with in other ways. This is the strategic choice (as you’ll see) of deterring, detecting, preventing, or avoiding a risk altogether. Note that prevent, deter, and detect will probably involve choices of risk mitigation controls, but you cannot make those choices until after you’ve done the architectural and vulnerability assessments.

* If you’ve already done the architectural and vulnerability assessments, as we’ll cover in Chapter 4, you’re ready to start making hard mitigation choices for the risks you’re not going to avoid altogether. These are tactical choices you’ll be making, as they will dictate how, when, and to what degree of completeness you implement operational (day-to-day), functional choices in the ways you try to control risks.

Having identified the risks and prioritized them, what next? What realistic options exist? One (more!) thing to keep in mind is that as you delve into the details of your architecture, and find, characterize, and assess its vulnerabilities against the prioritized set of risks, you will probably find that some risks you thought you could and should “fix” prove far too costly or disruptive to address. That’s okay. Like any planning process, risk management and risk mitigation taken together are a living, breathing, dynamic set of activities. Let these assessments shed light on what you’ve already thought about, as well as what you haven’t seen before.

#### Deter

To **deter** means to discourage or dissuade someone from taking an action because of their fear or dislike of the possible consequences. Deterring an attacker means that you get them to change their mind and choose to do something else instead. Your actions and your posture convince the attacker that what they stand to gain by launching the attack will probably not be worth the costs to them in time, resources, or other damages they might suffer (especially if they are caught by law enforcement!). Your actions do this by working on the attacker’s decision cycle. Why did they pick you as a target? What do they want to achieve? How probable is it that they can complete the attack and escape without being caught? What does it cost them to prepare for and conduct the attack? If you can cast sufficient doubt into the attacker’s mind on one or more of these questions, you may erode their confidence; at some point, the attacker gives up and chooses not to go through with their contemplated or planned attack.

By its nature, deterrence is directed onto an active, willful threat actor. Try as you might, you cannot deter an accident, nor can you command the tides not to flood your datacenter. You do have, however, many different ways of getting into the attacker’s decision cycle, demotivating them, and shaping their thinking so that they go elsewhere:

* Physical assets such as buildings (which probably contain or protect other kinds of assets) may have very secure and tamper-proof doors, windows, walls, or rooflines that prevent physical forced entry. Guard dogs, human guards or security patrols, fences, landscaping, and lighting can make it obvious that an attacker has very little chance to approach the building without being detected or prevented from carrying out their attack.
* Strong passwords and other access control technologies can make it visibly difficult for an attacker to hack into your computer systems (be they local or cloud-hosted).
* Policies and procedures can be used to train your people to make them less vulnerable to social-engineering attacks.

Deterrence can be passive, active, or a combination of the two. Fences, the design of parking, access roads and landscaping, and lighting tend to be passive deterrence measures; they don’t take actions in response to the presence of an attacker, for example. Active measures give the defender the opportunity to create doubt in the attacker’s mind: Is the guard looking my way? Is anybody watching those CCTV cameras?

#### Detect

To **detect** means to notice or consciously observe that an event of interest is happening. Notice the built-in limitation here: you have to first decide what set of events to “be on the lookout for” and therefore which events you possibly need to make action decisions about in real time. While you’re driving your car down a residential street, for example, you know you have to be watching for other cars, pedestrians, kids, dogs, and others darting out from between parked cars—but you normally would “tune out” watching the skies to see if an airplane was about to try to land on the street behind you. You also need to decide what to do about false alarms, both the false positives (that alarm when an event of interest hasn’t occurred) and the false negatives (the absence of an alarm when an event is actually happening).
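The four possible outcomes of a detection decision can be tallied with a small helper. This is a minimal sketch; the function name and the sample observations are made up for illustration.

```python
from collections import Counter


def classify(alarm_raised: bool, event_occurred: bool) -> str:
    """Label one observation by comparing the alarm with reality."""
    if alarm_raised and event_occurred:
        return "true positive"
    if alarm_raised and not event_occurred:
        return "false positive"   # alarm, but nothing happened
    if not alarm_raised and event_occurred:
        return "false negative"   # event happened, no alarm
    return "true negative"


# Hypothetical observations: (alarm_raised, event_occurred)
observations = [
    (True, True),
    (True, False),
    (True, False),
    (False, True),
    (False, False),
]

tally = Counter(classify(a, e) for a, e in observations)
print(tally["false positive"])  # 2
print(tally["false negative"])  # 1
```

Tracking these counts over time is one way to decide whether a detector is tuned too sensitively (many false positives) or too loosely (false negatives).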

If you think of how many false alarms you hear every week from car alarms or residential burglar alarms in your neighborhood, you might ask why we bother to try to detect that an event of interest might possibly be happening. Fundamentally, you cannot respond to something if you do not know it is happening. Your response might be to prevent or disrupt the event, to limit or contain the damage being caused by it, or to call for help from emergency responders, law enforcement, or other response teams. You may also need to activate alternative operations plans so that your business is not severely disrupted by the event. Finally, you do need to know what actually happened so that you can decide what corrective actions (or remediation) to take—what you must do to repair what was damaged and to recover from the disruption the incident has caused.

#### Prevent
To **prevent** an attack means to stop it from happening or, if it is already underway, to halt it in its tracks, thus limiting its damage. A thunderstorm might knock out your commercial electrical power (which is an attack, even if a nondeliberate one), but the uninterruptible power supplies keep your critical systems up and running. Heavy steel fire doors and multiple dead-bolt locks resist all but very determined attempts to cut, pry, or force an entry into your building. Strong access control policies and technologies prevent unauthorized users from logging into your computer systems. Fire-resistant construction of your home’s walls and doors is designed to increase the time you and your family have to detect the fire and get out safely before the fire spreads from its source to where you’re sleeping. (We in the computer trades owe the idea of a firewall to this pre-computer-era, centuries-old idea of keeping harm on one side of a barrier from spreading through to the other.)

Preventive defense measures provide two immediate paybacks to the defender: they limit or contain damage to that which you are defending, and they cost the attacker time and effort to get past them. Combination locks, for example, are often rated in terms of how long it would take someone to just “play with the dial” to guess the combination or somehow sense that they’ve started to make good guesses at it. Fireproof construction standards aim to prevent the fire from burning through (or initiating a fire inside the protected space through heat transfer) for a desired amount of time.

Note that we gain these benefits whether we are dealing with a natural, nonintentional threat, an accident, or a deliberate, intentional attack.

#### Avoid

To avoid an attack means to change what you do, and how you do it, in such ways as to not be where your attacker is expecting you to be when they try to attack you. This can be a temporary change to your planned activities or a permanent change to your operations. In this way, you can reduce or eliminate the possible disruptions or damages of an attack from natural, accidental, or deliberate causes:

* Physically avoiding an attack might involve relocating part of your business or its assets to other locations, shutting down a location during times of extremely bad weather, or even closing a branch location that’s in too dangerous a market or location.

* Logically avoiding an attack can be done by using cloud service providers to eliminate your business’s dependence on a specific computer system or set of services in a particular place. At a smaller scale, you do this by making sure that the software, data, and communications systems allow your employees to get business done from any location or while traveling, without regard to where the data and software are hosted. Using a virtual private network (VPN) to mask your IP address from the systems you connect to is another example of using logical means to avoid the possible consequences of an attack on your IT infrastructure and information systems.

* A variety of administrative methods can be used, usually in conjunction with physical or logical ones such as those we’ve discussed. Typically they will be implemented in policies, procedural documents, and quite possibly contracts or other written agreements.

Like everything in risk management and risk mitigation, these basic elements of choice can be combined in a wide variety of ways:

* Alarms combine detection and notification to users and systems owners; by alerting the attacker that they’ve been spotted “in the act,” the sound of the alarms may motivate the attacker to stop the attack and leave the scene (which is a combination of preventing further damage while it deters and prevents continued or repeated attack).
* Strong protective systems can limit or contain damage during an attack, which prevents the attack from spreading; to the degree that these protective systems are visible to the attacker, they may also deter the attack by raising the costs to the attacker to commence or continue the attack. They may also raise the attacker’s fear of capture, arrest, or other losses and thus further deter attack.
* Most physical and logical attack avoidance methods require a solid policy and procedural framework, and they quite often require users and staff members to be familiar with them and even trained in their operational use.

This last point bears some further emphasis. Organizations will often spend substantial amounts of money, time, and effort to put physical and even logical risk management systems into use, only to then put minimal effort into properly defining the who, what, when, where, how, and why of their use, maintenance, and ongoing monitoring. The money spent on a strong, imposing fence around your property will ultimately go to waste without routinely inspecting it and keeping it maintained. (Has part of it been knocked down by frost heave or a fallen tree? Has someone cut an opening in it? You’ll never know if you don’t walk the fence line often.)

This suggests that continuous follow-through is in fact the weakest link in our information risk management and mitigation efforts. We’ll look at ways to improve on this in the remainder of this book.

## Perform Security Assessment Activities

### Participate in Security Testing

### Interpretation & Reporting of Scanning & Testing Results

### Remediation Validation

### Audit Finding Remediation

## Operate & Maintain Monitoring Systems

It’s often been said that the attackers have to get lucky only once, whereas the defenders have to be lucky every moment of every day. When it comes to advanced persistent threats (APTs), which pose potentially the most damaging attacks to our information systems, another, more operationally useful rule applies. APTs must of necessity use a robust kill chain to discover, reconnoiter, characterize, infiltrate, gain control, and further identify resources to attack within the system; make their “target kill”; and copy, exfiltrate, or destroy the data and systems of their choice, cover their tracks, and then leave. Things get worse: for most businesses, nongovernmental organizations (NGOs), and government departments and agencies, they are probably the object of interest of dozens of different, unrelated attackers, each following its own kill chain logic to achieve its own set of goals (which may or may not overlap with those of other attackers). Taken together, there may be thousands if not hundreds of thousands of APTs out there in the wild, each seeking its own dominance, power, and gain. The millions of information systems owned and operated by businesses and organizations worldwide are their hunting grounds.

The good news, however, is that as you’ve seen in previous chapters, SSCPs have some field-proven information risk management and mitigation strategies that they can help their companies or organizations adopt. These frameworks, and the specific risk mitigation controls, are tailored to the information security needs of your specific organization. With them, you can first deter, prevent, and avoid attacks. Then you can detect the ones that get past that first set of barriers, and characterize them in terms of real-time risks to your systems. You then take steps to contain the damage they’re capable of causing, and help the organization recover from the attack and get back up on its feet.

You probably will not do battle with an APT directly; you and your team won’t have the luxury (if we can call it that!) of trying to design to defeat a particular APT and thwart its attempts to seek its objectives at your expense. Instead, you’ll wage your defensive campaign one skirmish at a time. You’ll deflect or defeat one scouting party as you strengthen one perimeter; you’ll detect and block a probe from gaining entry into your systems. You’ll find where an illicit user ID has made itself part of your system, and you’ll contain it, quarantine it, and ultimately block its attempts to expand its presence inside your operations. As you continually work with your systems’ designers and maintainers, you’ll help them find ways to tighten down a barrier here or mitigate a vulnerability there. Step by step, you strengthen your information security posture.

Why should SSCPs put so much emphasis on APTs and their use of the kill chain? In virtually every major data breach in the past decade, the attack pattern was low and slow: sequences of small-scale efforts designed to not cause alarm, each of which gathered information or enabled the attacker to take control of a target system. More low and slow attacks launched from that first target against other target systems. More reconnaissance. Finally, with all command, control, and hacking capabilities in place, the attack began in earnest to exfiltrate sensitive, private, or otherwise valuable data out of the target’s systems.

Note that if any of those low and slow attack steps had been thwarted, or if any of those early reconnaissance efforts, or attempts to install command and control tools, had been detected and stopped, then the attacker might have given up and moved on to another lucrative target.

Preparation and planning are the keys to survival. 

#### Kill Chain

The name kill chain comes from military operational planning (which, after all, is the business of killing the opponent’s forces and breaking their systems). Kill chains are outcomes-based planning concepts and are geared to achieving national strategic, operational, or tactical outcomes as part of larger battle plans. These kill chains tend to be planned from the desired outcome back toward the starting set of inputs: if you want to destroy the other side’s naval fleet while at anchor at its home port, you have to figure out what kind of weapons you have or can get that can destroy such ships. Then you work out how to get those weapons to where they can damage the ships (by air drop, surface naval weapons fire, submarine, small boats, cargo trucks, or other stealthy means). And so on. You then look at each way the other side can deter, defeat, or prevent you from attacking. By this point, you probably realize that you need to know more about their naval base, its defenses, its normal patterns of activity, its supply chains, and its communications systems. With all of that information, you start to winnow down the pile of options into a few reasonably sensible ways to defeat their navy while it’s at home port, or you realize that’s beyond your capabilities and you look for some other target that might be easier to attack that can help achieve the same outcome you want to achieve by defeating their navy.

With that as a starting point, we can see that an information systems kill chain is the total set of actions, plans, tasks, and resources used by an advanced persistent threat to

1. Identify potential target information systems that suit their objectives.
2. Gain access to those targets, and establish command and control over portions of those targets’ systems.
3. Use that command and control to carry out further tasks in support of achieving their objectives.

How do APTs apply this kill chain in practice? In broad general terms, APT actors do the following:

* Survey the marketplaces for potential opportunities to achieve an outcome that supports their objectives
* Gather intelligence data about potential targets, building an initial profile on each target
* Use that intelligence to inform the way they conduct probes against selected targets, building up fingerprints of the target’s systems and potentially exploitable vulnerabilities
* Conduct initial intrusions on selected targets and their systems, gathering more technical intelligence
* Establish some form of command and control presence on the target systems
* Elevate privilege so as to enable broader, deeper search for exploitable information assets in the target’s systems and networks
* Conduct further reconnaissance to discover internetworked systems that may be worth reconnaissance or exploitation
* Begin the exploitation of the selected information assets: exfiltrate the data, disrupt or degrade the targeted information processes, and so on
* Complete the exploitation activities
* Obfuscate or destroy evidence of their activities in the target’s system
* Disconnect from the target

The more complex, pernicious APTs will use multiple target systems as proxies in their kill chains, using one target’s systems to become a platform from which they can run reconnaissance and exploitation against other targets.
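The ordered steps listed above can be modeled as an ordered enumeration, which makes it easy to ask how much of the attacker’s plan is disrupted when an intrusion is caught at a given point. The stage names below are shorthand labels invented for this sketch; they paraphrase the bullets and are not an official taxonomy.

```python
from enum import IntEnum


class KillChainStage(IntEnum):
    """Shorthand labels (assumed) for the steps listed above, in order."""
    SURVEY = 1
    GATHER_INTELLIGENCE = 2
    PROBE = 3
    INITIAL_INTRUSION = 4
    COMMAND_AND_CONTROL = 5
    ELEVATE_PRIVILEGE = 6
    FURTHER_RECONNAISSANCE = 7
    BEGIN_EXPLOITATION = 8
    COMPLETE_EXPLOITATION = 9
    COVER_TRACKS = 10
    DISCONNECT = 11


def stages_denied(detected_at: KillChainStage) -> list:
    """Stages the attacker has not yet reached when stopped here."""
    return [s for s in KillChainStage if s > detected_at]


def caught_before_exploitation(detected_at: KillChainStage) -> bool:
    """True when the intrusion is stopped before data is touched."""
    return detected_at < KillChainStage.BEGIN_EXPLOITATION


print(len(stages_denied(KillChainStage.PROBE)))                     # 8
print(caught_before_exploitation(KillChainStage.ELEVATE_PRIVILEGE)) # True
print(caught_before_exploitation(KillChainStage.COVER_TRACKS))      # False
```

The earlier in the sequence a defender detects and stops the activity, the more of the chain the attacker never gets to execute.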

#### Incident Response Framework

What is a computer security incident? Several definitions by NIST, ITIL, and the IETF suggest that computer security incidents are events involving a target information system in ways that

* Are unplanned
* Are disruptive
* Are hostile, malicious, or harmful in intent
* Compromise the confidentiality, integrity, availability, authenticity, or other security characteristics of the affected information systems
* Willfully violate the system owners’ policies for acceptable use, security, or access

Consider the unplanned shutdown of an email server within your systems. You’d need to do a quick investigation to rule out natural causes (such as a thunderstorm-induced power surge) and accidental causes (the maintenance technician who stumbled and pulled the power cord loose on his way to the floor). Yes, your vulnerability assessment might have discovered these and made recommendations as to how to reduce their potential for disruption. But if neither weather nor a hardware-level accident caused the shutdown, you still have a dilemma: was it a software design problem that caused the crash, or a vulnerability that was exploited by a person or persons unknown?

Or consider the challenges of differentiating phishing attacks from innocent requests for information. An individual caller to your main business phone number, seeking contact information for your IT team, might be making an honest and innocent inquiry (perhaps an SSCP looking for a job!). However, if a number of such seemingly innocent inquiries across many days have attempted to map out your entire organization’s structure, complete with individual names, phone numbers, and email addresses, you’re being scouted!

What this leads to is that your organization needs to clearly spell out a triage process by which the IT and information security teams can recognize an event, quickly characterize it, and decide the right process to apply to it. The figure below illustrates such a process.

![Incident Triage and Response Process](images/incident-triage-and-response-process.png)

(ISC)2 and others define the incident response framework as a formal plan or process for managing the organization’s response to a suspected information security incident. It consists of a series of steps that start with detection and run through response, mitigation, reporting, recovery, and remediation, ending with a lessons learned and onward preparation phase. Please note that this is a conceptual flow of the steps involved; reality tells us that incidents unfold in strange and complex ways, and your incident response team needs to be prepared to cycle around these steps in different ways based on what they learn and what results they get from the actions they take.

NIST, in its Special Publication 800-61 Revision 2, adds an initial preparation phase to this flow and further focuses attention on the detection process by emphasizing the role of prompt analysis to support incident identification and characterization. NIST also refines the mitigation efforts by breaking them down into containment and eradication steps, and it breaks the lessons learned phase into information sharing and coordination activities.

![Incident Response Process](images/incident-response-process.png)

#### Incident Response Team

Unless you’re in a very small organization, and as the SSCP you wear all of the hats of network and systems administration, security, and incident response, your organization will need to formally designate a team of people who have the “watch-standing” duty of a real-time incident response team. This team might be called a computer emergency response team (CERT), a computer incident response team or cyber incident response team (both abbreviated CIRT), or a computer security incident response team (CSIRT). For ease of reference, let’s call ours a CSIRT for the remainder of this chapter. (Note that CERTs tend to have a broader charter, responding whether systems are put out of action by acts of nature, accidents, or hostile attackers. CERTs, too, tend to be more involved with broader disaster recovery efforts than a team focused primarily on security-related incidents.)

Your organization’s risk appetite and its specific CIANA needs should determine whether this CSIRT provides around-the-clock, on-site support, or supports on a rapid-response, on-call basis after business hours. These needs will also help determine whether the incident response team is a separate and distinct group of people or is a part of preexisting groups in your IT, systems, or networks departments. In Chapter 5, “Communications and Network Security,” for example, we looked at segregating the day-to-day network operations jobs of the network operations center (NOC) from the time-critical security and incident response tasks of a security operations center (SOC).

Whether your organization calls them a CSIRT or an SOC, or they’re just a subset of the IT department’s staff, there are a number of key functions that this incident response team should perform. We’ll look at them in more detail in subsequent sections, but by way of introduction, they are as follows:

**Serve as a single point of contact for incident response**. Having a single point of contact between the incident and the organization makes incident command, control, and communication much more effective. This should include the following:

* Focus reporting and rumor control with users and managers regarding suspicious events, systems anomalies, or other security concerns.
* Coordinate responses, and dispatch or call in additional resources as needed.
* Escalate computer security incident reports to senior managers and leadership.
* Coordinate with other security teams (such as physical security), and with local police, fire, and rescue departments as required.

**Take control of the incident and the scene**. Taking control of the incident, as an event that’s taking place in real time, is vital. Without somebody taking immediate control of the incident, and where it’s taking place, you risk bad decisions placing people, property, information, or the business at greater risk of harm or loss than they already are. Taking control of the incident scene protects information about the incident, where it happened, and how it happened. This preserves physical and digital evidence that may be critical to determining how the incident began, how it progressed, and what happened as it spread. This information is vital both to problem analysis and recovery efforts and to legal investigations of fault, liability, or unlawful activity.
* Response procedures should specify the chain of command relationships, and designate who (by position, title, or name) is the “on-scene commander,” so to speak. Incident situations can be stressful, and often you’re dealing with incomplete information. Even the simplest of decisions needs to be clearly made and communicated to those who need to carry it out; committees usually cannot do this very well in real time.
* The scene itself, and the systems, information, and even the rooms or buildings themselves, represent investments that the organization has made. Due care requires that the incident response team minimize further damage to the organization’s property or the property of others that may be involved in the incident scene.

**Investigate, analyze, and assess the incident**. This is where all of your skills as a troubleshooter, as an investigator, or simply as someone who is good at making informed guesses start to pay off. Gather data; ask questions; dig for information.

**Escalate, report, and engage with leadership**. Once they’ve determined that a security-related incident might in fact be happening, the team needs to promptly escalate this to senior leadership and management. This may involve a judgment call on the response team chief’s part, as preplanned incident checklists and procedures cannot anticipate everything that might go wrong. Experience dictates that it’s best to err on the side of caution, and report or escalate to higher management and leadership.

**Keep a running incident response log**. The incident response team should keep accurate logs of what happened, what decisions got made (and by whom), and what actions were taken. Logging should also build a time-ordered catalog of event artifacts—files, other outputs, or physical changes to systems, for example. This time history of the event, as it unfolds, is also vital to understanding the event, and mitigating or taking remedial action to prevent its reoccurrence. Logs and the catalogs of artifacts that go with them are an important part of establishing the chain of custody of evidence (digital or other) in support of any subsequent forensics investigation.

**Coordinate with external parties**. External parties can include systems vendors and maintainers, service bureaus or cloud-hosting service providers, outside organizations that have shared access to information systems (such as extranets or federated access privileges), and others whose own information and information systems may be put at risk by this incident as it unfolds. By acting as the organization’s focal point for coordination with external parties, the team can keep those partners properly informed, reduce risk to their systems and information, and make better use of technical, security, and other support those parties may be able to provide.

**Contain the incident**. Prevent it from infecting, disrupting, or gaining access to any other elements of your systems or networks, as well as preventing it from using your systems as launchpads to attack other external systems.

**Eradicate the incident**. Remove, quarantine, or otherwise eliminate all elements of the attack from your systems.

**Recover from the incident**. Restore systems to their pre-attack state by resetting and reloading network systems, routers, servers, and so forth as required. Finally, inform management that the systems should be back up and ready for operational use by end users.

**Document what you’ve learned**. Capture everything possible regarding systems deficiencies, vulnerabilities, or procedural errors that contributed to the incident taking place for subsequent mitigation or remediation. Review your incident response procedures for what worked and what didn’t, and update accordingly.

No matter how your organization breaks up the incident response management process into a series of steps, or how they are assigned to different individuals or teams within the organization, the incident response team must keep three basic priorities firmly in mind.

The first one is easy: the safety of people comes first. Nothing you are going to try to accomplish is more important than protecting people from injury or death. It does not matter whether those people are your coworkers on the incident response team, other staff members at the site of the incident, or even people who might have been responsible for causing the incident; your first priority is preventing harm from coming to any of them, yourself included! Your organization should have standing policies and procedures that dictate how calls for assistance to local fire, police, or emergency medical services should be made; these should be part of your incident response procedures.

The next two priority choices, when taken together, are actually one of the most difficult decisions facing an organization, especially when it’s in the midst of a computer security incident: should it prioritize getting back into normal business operations or supporting a digital forensics investigation that may establish responsibility, guilt, or liability for the incident and resultant loss and damages? This is not a decision that the on-scene response team leader makes! Simply put, the longer it takes to secure the scene, and gather and protect evidence (such as memory dumps, systems images, disk images, log files, etc.), the longer it takes to restore systems to their normal business configurations and get users back to doing productive work. This is not a binary, either-or decision; it is something that the incident response team and senior leaders need to keep a constant watch over throughout all phases of incident response.

Increasingly, we see that government regulators, civic watchdog groups, shareholders, and the courts are becoming impatient with senior management teams that fail in their due diligence. This impatience is translating into legal and market action that can and will bring self-inflicted damage (negligence, in other words) home to roost where it belongs. The reasonable fear of that outcome should lead organizations to task all members of the IT organization, including its information security specialists, with becoming more proficient at protecting and preserving the digital evidence related to an incident while getting the systems and business processes promptly restored to normal operations.

The details of how to preserve an incident scene for a possible digital forensics investigation, and how such investigations are conducted, are beyond the scope of the SSCP exam and this book. They are, however, great avenues for you to journey along as you continue to grow in your chosen profession as a white hat!

#### Preparation 

##### Preparation Planning

Let’s break this preparation task down into more manageable steps, using the Plan-Do-Check-Act (PDCA) model we used in earlier chapters, as part of risk management and mitigation. It may seem redundant to plan for a plan, but it’s not—you have to start somewhere, after all. Note that the boundaries between planning, doing, checking, and acting are not hard and fast; you’ll no doubt find that some steps can and should be taken almost immediately, while others need a more deliberative approach. Every step of the way, keep senior management and leadership engaged and involved. This is their emergency response capability you’re planning and building, after all.

This first set of tasks focuses on gathering what the organization already knows about its information systems and IT infrastructures, its business processes and its people, which become the foundation on which you can build the procedures, resources, and training that your incident responders will need. As you build those procedures and training plans, you’ll also need to build out the support relationships you’ll need when that first incident (or the next incident) happens.

**Build, maintain, and use a knowledge base of critical systems support information**. You’ll need this information to identify and properly scope the CSIRT’s monitoring and detection job, as well as identify the internal systems support teams, critical users, and recovery and restoration processes that already exist. The CSIRT should have these information products available to them as a living library of reference and guidance materials. These include but are not limited to the following:
* Information architecture documentation, plans, and support information
* IT systems documentation, such as servers, endpoints, special-purpose systems, etc.
* IT security systems documentation, including connectivity, current settings, alarm indications, and system documentation
* Clean, trusted backup images of systems and critical files, including digitally signed copies or cryptographic hashes, from which trusted restoration can take place
* Networks and other communications systems design, installation, and support, including data plane, control plane, and management plane views
* Platform and service systems documentation
* Physical layout drawings, showing equipment location, points of presence, alarm systems, entrances, and exits
* Power supply information, including commercial and backup sources, switching, power conditioning, etc.
* Current status of known vulnerabilities on all systems, connections, and endpoints
* Current status of systems, applications, platforms and database backups, age of last backup, and physical location of backup images
* Contact information or directory of key staff members, managers, and support personnel, both in-house and for any service providers, systems vendors, or federated access partners
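
The list above calls for backup images that come with cryptographic hashes. As a minimal sketch of how a responder might check an image against its recorded hash before restoring from it, consider the following. The function names are hypothetical, and it assumes that the expected SHA-256 value was recorded when the backup was taken and stored somewhere the incident could not have touched.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_trusted(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the value recorded at backup time."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())
```

A matching hash only shows that the image has not changed since the hash was recorded; it says nothing about whether the system was already compromised when the backup was taken.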

Whether you put this information into a separate knowledge base for your incident responders, or it is part of your overall software, systems, and IT knowledge base, is perhaps a question of scale and of survivability. During an incident itself, you need this knowledge base reliably available to your responders, without having to worry if it’s been tainted by this incident or a prior but undetected one.

Use that list to identify the set of business process, systems architecture, and technology-focused critical knowledge that each CSIRT team member must be proficient in, and add this to your team training and requalification planning set.

**Assemble critical data collection, collation, and analysis tools**. Characterizing an event in real time, and quickly determining its nature and the urgency of the response it demands, requires that your incident response team be able to analyze and assess what all of the information from your systems is trying to tell them. You do not help the team get this done by letting the team find the tools they need right when they’re trying to deal with an ongoing incident. Instead, identify a broad set of systems and event information analysis tools, and bring them together in what we might call a responder’s workbench. This workbench can provide your response team with a set of known, clean systems to use as they capture data, analyze it, and draw conclusions about the event in question. Some of the current generation of security information and event management systems may provide good starting points for growing your own workbench. Other tools may need to be developed in house, tailored to the nature of critical business processes or information flows, for example.

**Establish minimum standards for event logging**. Virtually all of your devices, be they servers, endpoints, or connectivity systems, have the capability to capture event information at the hardware, systems software, and applications levels. These logs can quickly narrow down your hunt for the broken or infected system, or the unauthorized subject(s) and the objects they’ve accessed. You’ll also need to establish a comprehensive and uniform policy about log file retention if you’re hoping to correlate logs from different systems and devices with each other in any meaningful way. Higher-priority, mission-critical systems should have higher levels of logging, capturing more events and at greater time granularity, to better empower your response capability regarding these systems.

**Identify forensics requirements, capabilities, and relationships**. Although many information security incidents may come and go without generating legal repercussions, you need to take steps now to prepare for those incidents that will. You’ll need to put in place the minimum required capabilities to establish and maintain a chain of custody for evidence. This may surface the need for additional training for CSIRT team members and managers. Use this as the opportunity to understand the support relationships your team will need when (not if) such an incident occurs, and start thinking through how you’d select the certified forensics examiners you’d need when it does.

By the end of this preparation planning phase, you should have some concrete ideas about what you’ll need for the CSIRT:

* SOC, or NOC? Does your organization need a security operations center, with its crew of watch-standers? Or can the CSIRT be an on-call team of responders drawn from the IT department’s networks, systems, and applications support specialists? In either case, how many people will be needed for ongoing alert and monitoring during normal business hours, for round-the-clock watch-standing, and for emergency response?
* Physical work space, responder’s workbenches, and communication needs must also be identified at this point. These will no doubt need to be budgeted for, and their acquisition, installation, and ongoing support need to fit into your overall incident response budget and schedule.
* Reporting, escalation, and incident management chain of command procedures should be put together in draft form at this point; coordinate with management and leadership to gain their endorsement and commitment to these.

##### Put the Preparation Plan in Motion

This is where the doing of our PDCA gets going in earnest. Some of the actions you’ll take are strictly internal and technical; some relate to improvements in administrative controls:

**Synchronize all system clocks**. Many service handshakes can allow up to 5 minutes or more misalignment of clocks across all elements participating in the service, but this can play havoc with attempts to correlate event logs.
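
To see why even a few minutes of clock misalignment matters, here is a small sketch that sorts two log entries first by their raw timestamps and then after correcting for each host's measured clock offset. The host names, offsets, and timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical measured offsets: how far each host's clock runs ahead of
# the reference clock (a negative value would mean it runs behind).
CLOCK_OFFSETS = {
    "mail01": timedelta(seconds=0),
    "fw01": timedelta(minutes=4, seconds=30),
}


def to_reference_time(host: str, logged_at: datetime) -> datetime:
    """Convert a host's logged timestamp to reference time in UTC."""
    offset = CLOCK_OFFSETS.get(host, timedelta(0))
    return logged_at.astimezone(timezone.utc) - offset


# Two made-up log entries that belong to the same sequence of events.
fw_event = ("fw01", datetime(2024, 1, 1, 12, 4, 40, tzinfo=timezone.utc))
mail_event = ("mail01", datetime(2024, 1, 1, 12, 0, 15, tzinfo=timezone.utc))

raw_order = sorted([fw_event, mail_event], key=lambda e: e[1])
corrected_order = sorted([fw_event, mail_event],
                         key=lambda e: to_reference_time(*e))

print([host for host, _ in raw_order])        # mail01 appears first
print([host for host, _ in corrected_order])  # fw01 actually came first
```

With the firewall clock running four and a half minutes fast, the raw logs put the events in the wrong order, which is exactly the kind of confusion that synchronized clocks prevent.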

**Frequently profile your systems**. System profiles help you understand the “normal” types, patterns, and amounts of traffic and load on the systems, as well as capturing key security and performance settings. Whether you use automated change-detection tools or manual inspection, comparing a current profile to a previous one may surface an indicator of an event of interest in progress or shed light on your search to find it, fix it, and remove it.
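
Comparing a current profile to a previous one can be sketched very simply, assuming each profile has been captured as a dictionary of settings. The setting names and values here are hypothetical, and a real profile would hold far more detail.

```python
def diff_profiles(baseline: dict, current: dict) -> dict:
    """Return settings added, removed, or changed since the baseline."""
    added = {k: current[k] for k in current.keys() - baseline.keys()}
    removed = {k: baseline[k] for k in baseline.keys() - current.keys()}
    changed = {
        k: (baseline[k], current[k])
        for k in baseline.keys() & current.keys()
        if baseline[k] != current[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Made-up profiles of one host, taken at two different times.
baseline = {"open_ports": (22, 443), "admin_accounts": 2, "av_enabled": True}
current = {"open_ports": (22, 443, 4444), "admin_accounts": 3,
           "av_enabled": True, "new_service": "unknown"}

report = diff_profiles(baseline, current)
print(report)
```

Any entry in the report that does not trace back to an approved change is a candidate event of interest.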

**Establish channels for outside parties to report information security incidents to you**. Whether these are other organizations you do routine business with or complete strangers, you make it much easier on your shared community of information security professionals when you set up an email form or phone number for anyone to report such problems to you. And it should go without saying that somebody in your response team needs to be paying attention to that email inbox, or the phone messages, or the forms-generated trouble tickets that flow from such a “contact us” page!

**Establish external incident response support relationships**. Many of the organizations you work with routinely—your cloud-hosting providers, other third-party services, your systems and software vendors and maintainers, even and especially your ISP—can be valuable teammates when you’re in the midst of an incident response. Gather them up into a community of practice before the lightning strikes. Get to know each other, and understand the normal limits of what you can call upon each other for in the way of support. Clearly identify what you have to warn them about as you’re working through a real-time incident response yourself.

**Develop and document CSIRT response procedures**. These will, of course, be living documents; as your team learns with each incident they respond to, they’ll need to update these procedures as they discover what they were well-prepared and equipped to deal with effectively, and what caught them by surprise. Checklist-oriented procedures can be very powerful, especially if they’re suitable for deployment to CSIRT team members’ smartphones or phablets. Don’t forget the value of a paper backup copy, along with emergency lighting and flashlights with fresh batteries, for when the lights go out!

**Initiate CSIRT personnel training and certification as required**. Take the minimum proficiency sets of knowledge, skills, and abilities (often called KSAs in human resources management terms), review the personnel assigned to the CSIRT and your recall rosters, and identify the gaps. Focus training, whether informal on-the-job or formal coursework, that each person needs, and get that training organized, planned, scheduled, and accomplished. Keep CSIRT proficiency qualification files for each team member, note the completion of training activities, and be able to inform management regarding this aspect of your readiness for incident response. (Your organization’s HR team may be able to help you with these tasks, and with organizing the training recordkeeping.)

##### Are You Prepared?

Maybe your preparation achieves a “ready to respond” state incrementally; maybe you’re just not ready for an incident at all, until you’ve achieved a certain minimum set of verified, in-place knowledge, tools, people, and procedures. Your organization’s mission, goals, objectives, and risk posture will shape whether you can get incrementally ready or have to achieve an identifiable readiness posture. Regardless, there are several things you and the CSIRT should do to determine whether they are ready or not:

**Understand your “business normal” as seen by your IT systems**. Establish a routine pattern or rhythm for your incident response team members to steep themselves in the day-to-day normal of the business and how people in the business use the IT infrastructure to create value in that normal way. Stay current with internal and external events that you’d reasonably expect would change that normal—the weather-related shutdown of a branch office, or a temporary addition of new federation partners into your extranets. The more each team member knows about how “normal” is reflected in fine-grained system activity, the greater the chance that those team members will sniff out trouble before it starts to cause problems. They’ll also be better informed and thus more capable of restoring systems to a useful normal state as a result.

While you’re at it, don’t forget to translate that business normal into fine-tuning of your automated and semiautomated security tools, such as your security information and event management (SIEM) systems, intrusion detection systems (IDS), intrusion prevention systems (IPS), or other tools that drive your alerting and monitoring channels. Business normal may also be reflected in the control and filter settings for access control and identity management systems, as well as for firewall settings and their access control lists. This is especially important if your organization’s business activities have seasonal variations.

**Routinely demonstrate and test backup and restore capabilities**. You do not want to be in the middle of an incident response only to find out that you’ve been taking backup images or files all wrong and that none of them can be reloaded or work right when they are loaded.

**Exercise your alert/recall, notification, escalation, and reporting processes**. At the cost of a few extra phone calls and a bit of time from key leaders and managers, you gain confidence in two critical aspects of your incident response management process. For starters, you demonstrate that the phone tree or the recall and alert processes work; this builds confidence that they’ll work when you really need them to. A second, add-on bonus is that you get to “table-top” or exercise the protocols you’d want to use had this been an actual information systems security incident.

**Document your incident response procedures, and use these documents as part of training and readiness**. Do not trust human memory or the memory of a well-intended and otherwise effective committee or team! Take the time to write up each major procedure in your incident response management process. Make it an active, living part of the knowledge base your responders will need. Exercise these procedures. Train with them as part of initial training for IT and incident response team members, for line and senior managers, and for your general user base as applicable.

Taken all at once, that looks like a lot of preparation! Yet much of what’s needed by your incident response team, if they’re going to be well prepared, comes right from the architectural assessments, your vulnerability assessments, and your risk mitigation implementation activities. Other key information comes from your overall approach to managing and maintaining configuration control over your information systems and your IT infrastructure. And you should already be carrying out good “IT hygiene” and safety and security measures, such as clock synchronization, event logging, testing, and so forth. The new effort is in creating the team, defining its tasks, writing them up in procedural form, and then using those procedures as an active part of your ongoing training, readiness, and operational evaluation of your overall information security posture.

#### Detection & Analysis

On a typical day, a typical medium-sized organization might see millions of IP packets knocking on its point of presence, most of them in response to legitimate traffic generated inside the organization, solicited by its Web presence, or generated by its external partners, customers, prospective customers, and vendors. Internally, the traffic volume on the company’s internetworks and the event loads on servers that support end users at their endpoints could be of comparable volume. Detecting that something is not quite right, and that that something might be part of an attack, is as much art as it is science. Three different factors combine to make this art-and-science difficult and challenging:

* Multiple, different means of detection: Many different technologies are in use to flag circumstances that might be a security-related incident in the making. Quite often, different technologies measure, assess, characterize, and report their observations at different levels of granularity and accuracy. Sometimes, technologies cannot detect a potential incident, and a human end user or administrator is the first to suspect something’s not quite right. Often, however, the first signs of an incident in progress go undetected.
* Incredibly high volumes of events that might be incidents: Inline intrusion detection systems might detect and report a million or more events per day as possible intrusion-related events. Filtering approaches, even with machine learning capabilities, can reduce this, while introducing both false positive and false negative alarms into the response team’s workload.
* Deep, specialist knowledge and considerable experience: A response team member needs both to be able to make sense of the noise and find the signal (the real events worth investigating) in all of it.

First, let’s define some important terms related to incident detection. Earlier we talked about events of interest—that is, some kind of occurrence or activity that takes place that just might be worth paying closer attention to. Without getting too philosophical about it, events make something in our systems change state. The user, with hand on mouse, does not cause an event to take place until they do something with the mouse, and it signals the system it’s attached to. That movement, click, or thumbwheel roll causes a series of changes in the system. Those changes are events. Whether they are interesting ones, or not, from a security perspective, is the question!

A **precursor** is a sign, signal, or observable characteristic of the occurrence of an event that in and of itself is not an attack but that might indicate that an attack could happen in the future. Let’s look at a few common examples to illustrate this concept:
* Server or other logs that indicate a vulnerability scanner has been used against a system
* An announcement of a newly found vulnerability by a systems or applications vendor, information security service, or reputable vulnerabilities and exploits reporting service that might relate to your systems or platforms
* Media coverage of events that put your organization’s reputation at risk (deservedly or not)
* Email, phone calls, or postal mail threatening attack on your organization, your systems, your staff, or those doing business with you
* Increasingly hostile or angry content in social media postings regarding customer service failures by your company
* Anonymous complaints in employee-facing suggestion boxes, ombudsman communications channels, or even graffiti in the restrooms or lounge areas

Genuine precursors—ones that give you actionable intelligence—are quite rare. They are often akin to the “travel security advisory codes” used by many national governments. They rarely provide enough insight that something specific is about to take place. The best you can do when you see such potential precursors is to pay closer attention to your indicators and warnings systems, perhaps by opening up the filters a bit more. You might also consider altering your security posture in ways that might increase protection for critical systems, perhaps at the cost of reduced throughput due to additional access control processing.

An **indicator** is a sign, signal, or observable characteristic of the occurrence of an event indicating that an information security incident may have occurred or may be occurring right now. Again, a few very common examples will illustrate:
* Network intrusion detectors generate an alert when input buffer overflows might indicate attempts to inject SQL or other script commands into a webpage or database server.
* Antivirus software detects that a device, such as an endpoint or removable media, has a suspected infection on it.
* Systems administrators, or automated search tools, notice filenames containing unusual or unprintable characters.
* Access control systems notice a device attempting to connect, which does not have required software or malware definition updates applied to it.
* A host or an endpoint device does an unplanned restart.
* A new or unmanaged host or endpoint attempts to join the network.
* A host or an endpoint device notices a change to a configuration-controlled element in its baseline configuration.
* An applications platform logs multiple failed login attempts, seemingly from an unfamiliar system or IP address.
* Email systems and administrators notice an increase in the number of bounced, refused, or quarantined emails with suspicious content or ones with unknown addressees.
* Unusual deviations in network traffic flows or systems loading are observed.
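
One of the indicators above, multiple failed login attempts from an unfamiliar address, lends itself to a short sketch. The function name, the threshold of five failures, and the ten-minute window are all assumptions chosen for illustration; suitable values depend on your own business normal.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def flag_failed_logins(events, threshold=5, window=timedelta(minutes=10)):
    """Return source addresses with `threshold` or more failures in any `window`.

    `events` is an iterable of (timestamp, source_ip, succeeded) tuples,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)  # source_ip -> timestamps of recent failures
    flagged = set()
    for when, source_ip, succeeded in events:
        if succeeded:
            continue
        failures = recent[source_ip]
        failures.append(when)
        # Drop failures that have aged out of the sliding window.
        while when - failures[0] > window:
            failures.popleft()
        if len(failures) >= threshold:
            flagged.add(source_ip)
    return flagged
```

A flagged address is only an indicator; a responder still has to decide whether it reflects an attack or a user who has forgotten a password.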

One type of indicator worth special attention is called an indicator of compromise (IOC), which is an observable artifact that with high confidence signals that an information system has been compromised or is in the process of being compromised. Such artifacts might include recognizable malware signatures, attempts to access IP addresses or URLs known or suspected to be of hostile or compromising intent, or domain names associated with known or suspected botnet control servers. The information security community is working to standardize the format and structure of IOC information to aid in rapid dissemination and automated use by security systems.
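
As a rough sketch of automated IOC use, the following checks observed artifacts against a small set of known-bad values. The feed contents, field names, and host names are hypothetical; the addresses and domain come from ranges reserved for documentation and are not real threat intelligence, and real IOC feeds use richer, standardized formats.

```python
# Hypothetical IOC feed, keyed by the kind of artifact.
KNOWN_BAD = {
    "ip": {"203.0.113.66", "198.51.100.23"},
    "domain": {"c2.example.net"},
    "sha256": {"0" * 64},
}


def match_iocs(observations, iocs=KNOWN_BAD):
    """Yield each observation whose (kind, value) pair appears in the IOC set."""
    for obs in observations:
        kind, value = obs["kind"], obs["value"].lower()
        if value in iocs.get(kind, set()):
            yield obs


# Made-up observations pulled from proxy and firewall logs.
observed = [
    {"host": "ws-14", "kind": "domain", "value": "C2.example.net"},
    {"host": "ws-14", "kind": "ip", "value": "192.0.2.10"},
    {"host": "db-02", "kind": "ip", "value": "203.0.113.66"},
]

hits = list(match_iocs(observed))
print([hit["host"] for hit in hits])
```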

In one respect, the fact that detection is a war of numbers is both a blessing and a curse; in many cases, even the first few low and slow steps in an attack may create dozens or hundreds of indicators, each of which may, if you’re lucky, contain information that correlates them all into a suspicious pattern. Of course, you’re probably dealing with millions of events to correlate, assess, screen, filter, and dig through to find those few needles in that field of haystacks.

##### Initial Detection

Initial incident detection is the iterative process by which human members of the incident response team assemble, collate, and analyze any number of indicators (and precursors, if available and applicable), usually with a SIEM tool or data aggregator of some sort, and then come to the conclusion that there is most likely an information security event in progress or one that has recently occurred. This is a human-centric, analytical, thoughtful process; it requires team members to make educated guesses (that is, generate hypotheses), test those hypotheses against the indicators and other systems event information, and then reasonably conclude that the alarm ought to be sounded.

That alarm might be best phrased to say that a “probable information security incident” has been detected, along with reporting when it is believed to have first started to occur and whether it is still ongoing.

Ongoing analysis will gather more data, from more systems; run tests, possibly including internal profiling of systems suspected to have been affected or accessed by the attack (if attack it was); and continue to refine its characterization or classification of the incident. At some point, the response team should consult predefined priority lists that help them allocate people and systems resources to continuing this analysis.

Note the dilemma here: paying too much attention, too soon, to too many alarms may distract attention, divert resources, and even build in a “Chicken Little” kind of reaction within management and leadership circles. When a security incident actually does occur, everyone may be just too desensitized to care about it. And of course, if you’ve got your thresholds set too high, you ignore the alarms that your investments in intrusion detection and security systems are trying to bring to your attention. In many of the headline-grabbing data breach incidents of the past 10 years, such as the attack that struck Target stores in 2013, the victim organization had set this balance wrong: on one side, the cost of dealing with too many false alarms (false rejections, or Type 1 errors); on the other, the risk of missing a few more dangerous false acceptances (or Type 2 errors).
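The trade-off can be shown with a few invented numbers. In this sketch, each event carries a suspicion score and an alarm is raised when the score meets the threshold; the scores, labels, and threshold values are made up purely to show how moving the threshold trades Type 1 errors against Type 2 errors.

```python
def count_errors(scored_events, threshold):
    """Count errors for a given alarm threshold.

    scored_events: list of (score, is_real_attack) pairs.
    An alarm is raised when score >= threshold.
    Returns (false_alarms, missed_attacks), i.e. (Type 1, Type 2) counts.
    """
    false_alarms = sum(
        1 for score, real in scored_events if score >= threshold and not real
    )
    missed_attacks = sum(
        1 for score, real in scored_events if score < threshold and real
    )
    return false_alarms, missed_attacks


# Invented sample data: (suspicion score, was it really an attack?)
SAMPLE = [
    (0.20, False), (0.40, False), (0.55, False),
    (0.60, True), (0.70, False), (0.90, True),
]

for threshold in (0.3, 0.5, 0.8):
    print(threshold, count_errors(SAMPLE, threshold))
```

With this sample, a low threshold produces only false alarms, while the highest threshold silences them at the cost of missing one real attack.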

##### Timeline Analysis

This may seem obvious, but one of the most powerful analytical tools is often overlooked. Timeline analysis reconstructs the sequence of events in order to focus analysis, raise questions, generate insight, and aid in organizing information discovered during the response to the incident. Responders should start building their own reconstructed event timeline or sequence of events, starting from well before the last known good system state, through any precursor or indicator events, and up to and including each new event that occurs. The timeline is different from the response team’s log: the log chronicles actions and decisions taken by the response team, directions they’ve received from management, and key coordination the team has had with external parties.

Some IDS, IPS, or SIEM product systems may contain timeline analysis tools that your teams can use. Digital forensic workbenches usually have excellent timeline analysis capabilities. Even a simple spreadsheet file can be used to record the sequence of events as it reveals itself to the responders, and as they deduce or infer other events that might have happened.

This last is a powerful component of timeline analysis. Timeline analysis should focus you on asking, “How did event A cause event B?” Just asking the question may lead you to infer some other event that event A actually caused, with this heretofore undiscovered event being the actual or proximate cause of event B. Making these educated guesses, and making note of them in your timeline analysis, is a critical part of trying to figure out what happened.

And without figuring out what happened, your search for all of the elements that might have caused the incident to occur in the first place will be limited to lucky guesswork.
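A timeline of this kind needs very little machinery. The sketch below merges events from several sources into one time-ordered list and lets responders mark events they have inferred rather than observed. The `TimelineEntry` fields and the sample events are hypothetical; a spreadsheet with the same columns would serve equally well.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    when: datetime
    source: str             # where the event was recorded or who inferred it
    description: str
    inferred: bool = False  # True for educated guesses not yet confirmed


def build_timeline(*sources):
    """Merge any number of event lists into one list ordered by time."""
    merged = [entry for source in sources for entry in source]
    return sorted(merged, key=lambda entry: entry.when)


firewall = [
    TimelineEntry(datetime(2024, 1, 5, 9, 30), "firewall", "outbound traffic spike"),
]
mail = [
    TimelineEntry(datetime(2024, 1, 5, 8, 10), "mail gateway", "suspicious attachment delivered"),
]
analyst = [
    # Hypothesis linking the two observed events: event A may have caused
    # this undiscovered event, which in turn caused event B.
    TimelineEntry(datetime(2024, 1, 5, 8, 45), "analyst", "attachment opened on ws-07", inferred=True),
]

for entry in build_timeline(firewall, mail, analyst):
    tag = "inferred" if entry.inferred else "observed"
    print(entry.when.isoformat(), f"[{tag}]", entry.source, "-", entry.description)
```

Keeping the `inferred` flag separate makes it easy to see which links in the chain still need evidence.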

##### Notification

Now that the incident response team has determined that an incident has probably already occurred or is ongoing, the team must notify managers and leaders in the organization. Each organization should specify how this notification is to be done and who the team contacts to deliver the bad news. In some organizations, this may direct that some types of incidents need immediate notification to all users on the affected systems; other circumstances may dictate that only key departmental or functional managers be advised. In any event, these notification procedures should specify how and when to inform senior leadership and management. (It’s a sign of inadequate planning and preparation if the incident responders have to ask, “Who should we call?” in the heat of battle.)

Notification also includes getting local authorities, such as fire or rescue services, or law enforcement agencies, involved in the real-time response to the incident. This should always be coordinated with senior leadership and management, even if the team phones them immediately after following the company’s process for calling the fire department.

Senior leadership and management may also have notification and reporting responsibilities of their own, which may include very short time frames in which notification must be given to regulatory authorities, or even the public. The incident response team should not have to do this kind of reporting, but it does owe its own leadership and management the information they will need to meet these obligations.

As incident containment, eradication, and recovery continue, the CSIRT will have continuing notification responsibilities. Management may ask for their assistance or direct them to reach out directly via webpage updates, updated voice prompt menus on the IT Help Desk contact line, emails, or phone calls to various internal and external stakeholders. Separate voice contact lines may also need to be used to help coordinate activities and keep everyone informed.

#### Prioritization
There are several ways to prioritize the team’s efforts in responding to an incident. These consider the potential for impact to the organization and its business objectives; whether confidentiality, integrity, or availability of information resources will be impacted; and just how possible it will be to recover from the incident should it continue. Let’s take a closer look at these:

* Functional impact looks to the nature of the business processes, objectives, or outcomes that are put at risk by the incident. At one end of this spectrum are the mission-critical systems, the failure of which puts the very survival of the organization at risk. At the other end might be routine but necessary business processes, for which there are readily available alternatives or where the impact is otherwise tolerable. A hospital, for example, might consider systems that directly engage with real-time patient care—instrumentation control, laboratory and pharmacy, and surgical robots—as mission-critical (since losing a patient, terminally, because of an IT systems failure can severely jeopardize the hospital’s ongoing existence!). On the other hand, the same hospital could consider post-release patient follow-up care management to be less urgent (no one will die today if this system fails to work today).
* Information impact considers whether the incident risks unauthorized disclosure, exfiltration, corruption, deletion, or other unauthorized changes to information assets, and the relative strategic, tactical, or operational value or sensitivity of that information asset to the organization. The annual holiday party plans, if compromised or deleted, probably have a very low impact to the organization; exfiltration of business proposals being developed with a strategic partner, on the other hand, could have significant impact to both organizations.
* Recoverability involves whether the impact of the incident is eliminated or significantly reduced if the incident is promptly and thoroughly contained. A data exfiltration attack that is detected and contained before copies of sensitive data have left the facility is a recoverable incident; after copies of PII, customer credit card data, or other sensitive data have left, it is not.

Taken together, these factors help the incident response team advise senior leadership and management on how to deal with the incident. It’s worth stressing, again, that senior leadership and management need to make this prioritization decision; the SSCPs on the incident response team must advise their leaders by means of the best, most complete, and most current assessment of the incident and its impacts that they can develop. That advice also should address options for containment and eradication of the incident and its effects on the organization.
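One way to package that advice is to record the three factors for each open incident and sort on a simple score. The sketch below does this; the `Level` scale, the `Assessment` fields, and the scoring rule (including the extra point when prompt containment can still reduce the impact) are illustrative assumptions, not an established formula. The result is advisory only, since the decision remains with leadership.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Assessment:
    incident: str
    functional_impact: Level
    information_impact: Level
    still_recoverable: bool  # would prompt containment still reduce the impact?


def advisory_score(assessment):
    """Illustrative scoring: sum the two impact levels, plus one point
    when acting quickly can still eliminate or reduce the impact."""
    score = int(assessment.functional_impact) + int(assessment.information_impact)
    if assessment.still_recoverable:
        score += 1
    return score


def rank(assessments):
    """Order assessments from most to least urgent under the rule above."""
    return sorted(assessments, key=advisory_score, reverse=True)


party = Assessment("holiday party plans deleted", Level.LOW, Level.LOW, False)
exfil = Assessment("partner proposal exfiltration in progress", Level.MEDIUM, Level.HIGH, True)

for item in rank([party, exfil]):
    print(advisory_score(item), item.incident)
```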

#### Containment and Eradication
These two goals are the next major task areas that the CSIRT needs to take on and accomplish. As you can imagine, the nature of the specific incident or attack in question all but defines the containment and eradication tactics, techniques, and procedures you’ll need to bring to bear to keep the mess from spreading and to clean up the mess itself.

More formally, containment is the process of identifying the affected or infected systems elements, whether hardware, software, communications systems, or data, and isolating them from the rest of your systems to prevent the disruption-causing agent and the disruption it is causing from affecting the rest of your systems or other systems external to your own. Pay careful attention to the need to not only isolate the causal agent, be that malware or an unauthorized user ID with superuser privileges, but also keep the damage from spreading to other systems. As an example, consider a denial of service (DoS) attack that’s started on your systems at one local branch office and its subnets and is using malware payloads to spread itself throughout your systems. You may be able to filter any outbound traffic from that system to keep the malware itself from spreading, but until you’ve thoroughly cleansed all hosts within that local set of subnets, each of them could be suborned into launching DoS attacks on other hosts inside your system or out on the Internet.

Some typical containment tactics might include:
* Logically or physically disconnecting systems from the network or network segments from the rest of the infrastructure
* Disconnecting key servers (logically or physically), such as domain name system (DNS), dynamic host configuration protocol (DHCP), or access control systems
* Disconnecting your internal networks from your ISP at all points of presence
* Disabling Wi-Fi or other wireless and remote login and access
* Disabling outgoing and incoming connections to known services, applications, platforms, or sites
* Disabling outgoing and incoming connections to all external services, applications, platforms, or sites
* Disconnecting from any extranets or VPNs
* Disconnecting some or all external partners and user domains from any federated access to your systems
* Disabling internal users, processes, or applications, either in functional or logical groups or by physical or network locations

A familiar term should come to mind as you read this list: quarantine. In general, that’s what containment is all about. Suspect elements of your system are quarantined off from the rest of the system, which certainly can prevent damage from spreading. It also can isolate a suspected causal agent, allowing you a somewhat safer environment in which to examine it, perhaps even identify it, and track down all of its pieces and parts. As a result, containment and eradication often blur into each other as interrelated tasks rather than remain as distinctly different phases of activity.
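The quarantine idea can be sketched as a small graph operation: given the links between network segments and a set of suspect segments, sever every link that crosses the quarantine boundary and keep the rest. The segment names and the `quarantine` helper are hypothetical; in practice the severing is done with firewall rules, switch configuration, or by physically unplugging cables.

```python
def quarantine(links, suspects):
    """Split links into those kept and those severed.

    links: set of frozenset({a, b}) pairs naming connected segments.
    suspects: set of segment names to isolate.
    A link is severed when exactly one of its endpoints is a suspect,
    so suspects stay connected to each other but not to anything else.
    """
    severed = {link for link in links if len(link & suspects) == 1}
    return links - severed, severed


links = {
    frozenset({"branch-a", "core"}),
    frozenset({"branch-a", "branch-a-wifi"}),
    frozenset({"core", "hq"}),
}
kept, severed = quarantine(links, {"branch-a", "branch-a-wifi"})
print("severed:", [sorted(link) for link in severed])
```

Leaving the links inside the quarantined group intact reflects the point made above: the isolated segment becomes a somewhat safer place to examine the suspected causal agent.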

This gives us another term worthy of a definition: a causal agent is a software process, data object, hardware element, human-performed procedure, or any combination of those that performs the actions on the targeted systems that constitute the incident, attack, or disruption. Malware payloads, their control and parameter files, and their carriers are examples of causal agents. Bogus user IDs, hardware sniffer devices, or systems on your network that have already been suborned by an attacker are further examples. As you might suspect, the more sophisticated APT kill chains may use multiple methods to get into your systems and in doing so leave multiple bits of stuff behind to help them achieve their objectives each time they come on in.

Eradication is the process of identifying and removing every instance of the causal agent and its associated files, executables, and other artifacts from all elements of your system. For example, a malware infection would require you to thoroughly scrub every CPU’s memory, as well as all file storage systems (local and in the clouds), to ensure you’d found and removed all copies of the malware and any associated files, data, or code fragments. You’d also have to do this for all backup media for all of those systems in order to ensure you’d looked everywhere, removed the malware and its components, and clobbered or zeroized the space they were occupying in whatever storage media you found them on. Depending on the nature of the causal agent, the incident, and the storage technologies involved, you may need to do a full low-level reformat of the media and completely initialize its directory structures to ensure that eradication has been successfully completed.

Eradication should result in a formal declaration that the system, a segment or subsystem, or a particular host, server, or communications device has been inspected and verified to be free from any remnants of the causal agent. This declaration is the signal that recovery of that element or subsystem can begin.
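One narrow piece of that verification can be sketched in code: scanning a directory tree for files whose SHA-256 digest matches a known-bad list, and treating the tree as clean only when nothing matches. The helper names are hypothetical, and this is only a partial check under a strong assumption, namely that every remnant has a known, unchanging hash. It says nothing about memory, backups, or altered copies of the malware.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_remnants(root, known_bad_hashes):
    """List every file under root whose digest is on the known-bad list."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and sha256_of(path) in known_bad_hashes
    )


def verified_clean(root, known_bad_hashes):
    """True only when no known-bad file remains under root."""
    return not find_remnants(root, known_bad_hashes)


# Demonstration in a throwaway directory with a harmless stand-in file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "payload.bin"
    sample.write_bytes(b"stand-in for a malicious file")
    bad_hashes = {sha256_of(sample)}
    clean_before = verified_clean(tmp, bad_hashes)  # remnant still present
    sample.unlink()                                 # "eradicate" it
    clean_after = verified_clean(tmp, bad_hashes)   # nothing left to match

print(clean_before, clean_after)
```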

It’s beyond the scope of the SSCP exam to get into the many different techniques your incident response team may need to use as part of containment and eradication; quite frankly, there are just far too many potential causal agents out there in the wild, and more are being created daily. What matters is having a working sense of how detection and identification provide you with the starting point for your containment, and then your eradication, of the threat.

##### Evidence Gathering, Preservation, and Use

During all stages of an incident, responders need to be gathering information about the status, state, and health of all systems, particularly those affected by the attack. They need to be correlating event log files from many different elements of their IT infrastructure, while at the same time constructing their own timeline of the event. Incident response teams are expected to figure out what happened, take steps to keep the damage from spreading, remove the cause(s) of the incident, and restore systems to normal use as quickly as they can.

There’s a real danger that the incident response team can spread itself too thin if the same group of people are containing and eradicating the threat, while at the same time trying to gather evidence, preserve it, and examine it for possible clues. Management and leadership need to be aware of this conflict. They are the ones who can allocate more resources, either during preparation and planning, incident response, or both, to provide a digital forensics capability.

As in all things, a balance needs to be struck, and response team leaders need to be sensitive to these different needs as they develop and maintain their team’s battle rhythm in working through the incident.

##### Constant Monitoring

From the first moment that the responders believe that an incident has occurred or is ongoing, the team needs to sharpen their gaze at the various monitoring tools that are already in place, watching over the organization’s IT infrastructure. The incident itself may be starting to cause disruptions to the normal state of the infrastructure and systems; containment and eradication responses will no doubt further disrupt operations. All of that aside, a new monitoring priority and question now needs to occupy center stage for the response team’s attention: are their chosen containment, eradication, and (later on) restoration efforts working properly?

On the one hand, the team should be actively predicting the most likely outcomes of each step they are about to take before they take it. This look-ahead should also be suggesting additional alarm conditions or signs of trouble that might indicate that the chosen step is not working correctly or in fact is adding to the impact the incident is causing. Training and experience with each tool and tactic is vital, as this gives the team the depth of specialist knowledge to draw on as they assess the situation, choose among possible actions to take, and then perform that action as part of their overall response.
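The predict-then-check habit can be captured in a very small structure: write down the expected result before acting, then compare it with what monitoring reports afterward. In this sketch, the `PlannedAction` fields, the metric name, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAction:
    name: str
    metric: str           # the monitoring value the team will watch
    expected_max: float   # prediction: metric at or below this afterward


def evaluate(action, observed):
    """Compare a prediction with post-action monitoring data.

    observed: dict mapping metric names to their latest values.
    """
    value = observed.get(action.metric)
    if value is None:
        return "no-data"        # cannot tell; a warning sign in itself
    if value <= action.expected_max:
        return "as-predicted"
    return "investigate"        # the step may not be working as intended


block = PlannedAction(
    name="block outbound traffic from the branch subnet",
    metric="outbound_mbps",
    expected_max=5.0,
)
print(evaluate(block, {"outbound_mbps": 2.0}))
print(evaluate(block, {"outbound_mbps": 40.0}))
```

Recording the prediction first is what turns a containment step into a testable hypothesis rather than a reaction.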

The incident response team is, first and foremost, supposed to be managing their responses to the incident. Without well-informed predictions of the results of a selected action, the team is not managing the incident; they’re not even experimenting, which is how we test such predictions as part of confirming our logic and reasoning. Without informed guesswork and thoughtful consideration of alternatives, the team is being out-thought by its adversaries; the attackers are still managing and directing the incident, and defense is trapped into reacting as they call the shots.

### Events of Interest (e.g., Anomalies, Intrusions, Unauthorized Changes, Compliance Monitoring)

### Logging

### Source Systems

### Legal & Regulatory Concerns (e.g., Jurisdiction, Limitations, Privacy)

## Analyze Monitoring Results

### Security Baselines & Anomalies

### Visualizations, Metrics, & Trends (e.g., Dashboards, Timelines)

### Event Data Analysis

### Document & Communicate Findings (e.g., Escalation)