# Incident Response & Recovery

#### Recovery
**Recovery** is the process by which the organization’s IT infrastructure, applications, data, and workflows are reestablished and declared operational. In an ideal world, recovery starts when the eradication phase is complete, and the hardware, networks, and other systems elements are declared safe to restore to their required normal state. The ideal recovery process brings all elements of the system back to the moment in time just before the incident started to inflict damage or disruption to your systems. When recovery is complete, end users should be able to log back in and start working again, just as if they’d last logged off at the end of a normal set of work-related tasks.

It’s important to stress that every step of a recovery process must be validated as correctly performed and complete. This may need nothing more than using some simple tools to check status, state, and health information, or using preselected test suites of software and procedures to determine whether the system or element in question is behaving as it should be. It’s also worth noting that the more complex a system is, the more it may need to have a specific order in which subsystems, elements, and servers are reinitialized as part of an overall recovery and restart process.
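
The ordering and validation points above can be sketched in code. This is a minimal illustration, assuming a hypothetical dependency map (`DEPENDS_ON`) and caller-supplied `restart` and `validate` functions; none of these names come from a real recovery tool. It uses Python's standard `graphlib` module to derive an order in which every subsystem is restarted only after the subsystems it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical subsystems, each mapped to the subsystems it depends on.
DEPENDS_ON = {
    "network": set(),
    "directory_service": {"network"},
    "database": {"directory_service"},
    "app_server": {"database"},
    "user_endpoints": {"app_server"},
}


def restart_order(depends_on):
    """Return an order in which each subsystem follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def recover(depends_on, restart, validate):
    """Restart subsystems in dependency order, validating each step.

    `restart` and `validate` are supplied by the caller (assumed here);
    recovery halts at the first subsystem that fails its validation check.
    """
    completed = []
    for name in restart_order(depends_on):
        restart(name)
        if not validate(name):
            raise RuntimeError(f"validation failed for {name}; halting recovery")
        completed.append(name)
    return completed
```

Halting at the first failed check reflects the idea that a later step should not be trusted if an earlier one was never confirmed as complete.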

With that in mind, let’s look at this step by step, in general terms:

**Eradication complete**. Ideally, this is a formal declaration by the CSIRT that the systems elements in question have been verified to be free of any instances of the causal agent (malware, illicit user IDs, corrupted or falsified data, etc.).

**Restore from bare metal to working OS**. Servers, hosts, endpoints, and many network devices should be reset to a known good set of initial software, firmware, and control parameters. In many cases, the IT department has made standard image sets that they use to do a full initial load of new hardware of the same type. This should include setting up systems or device administrator identities, passwords, or other access control parameters. At the end of this task, the device meets your organization’s security and operational policy requirements and can now have applications, data, and end users restored to it.

**Ensure all OS updates and patches are installed correctly**. This applies if any have been released for the versions of software installed by your distribution kits or pristine system image copies.

**Restore applications as well as links to applications platforms and servers on your network**. Many endpoint devices in your systems will need locally installed applications, such as email clients, productivity tools, or even multifactor access control tools, as part of normal operations. These will need to be reinstalled from pristine distribution kits if they were not in the standard image used to reload the OS. This set of steps also includes reloading the connections to servers, services, and applications platforms on your organization’s networks (including extranets). This step should also verify that all updates and patches to applications have been installed correctly.

**Restore access to resources via federated access controls and resources beyond your security perimeter out on the Internet**. This step may require coordination with these external resource operators, particularly if your containment activities had to temporarily disable such access.

At this point, the systems and infrastructure are ready for normal operations. Aren’t they?

#### Data Recovery
Remember that the IT systems and the information architecture exist because the organization’s business logic needs to gather, create, make use of, and produce information to support decisions and action. Restoring the data plane of the total IT architecture is the next step that must be taken before declaring the system ready for business again.

In most cases, incident recovery will include restoring databases and storage systems content to the last known good configuration. This requires, of course, that the organization has a routine process in place for making backups of all of its operational data. Those backups might be:

* Complete copies of every data item in every record in every database and file
* Incremental or partial copies, which copy a subset of records or files on a regular basis
* Differential, update, or change copies, which consist of records, fields, or files changed since a particular time
* Transaction logs, which are chronologically ordered sets of input data

Restoring all databases and file systems to their “ready for business as usual” state may take the combined efforts of the incident response team, database administrators, application support programmers, and others in the IT department. Key end users may also need to be part of this process, particularly as they are probably best suited to verifying that the systems and the data are all back to normal.

For example, a small wholesale distributor might use a backup strategy that makes a full copy of its databases once per week, and then a differential backup at the end of every business day. Individual transactions (reflecting customer orders, payments to vendors, inventory changes, etc.) would be reflected in the transaction logs kept for specific applications or by end users. In the event that the firm’s database has been corrupted by an attacker (or a serious systems malfunction), it would need to restore the last complete backup copy, then apply the most recent differential backup made since that complete copy (or, if each daily copy holds only that day’s changes, every daily copy in order). Finally, the firm would have to step through each transaction logged since that last backup, either using built-in applications functions that recover transactions from saved log files or by hand.
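
The distributor's restore sequence can be sketched as a small selection routine. This is a sketch under stated assumptions, not a description of any particular backup product: it assumes the common definitions in which a differential backup holds all changes since the last full backup (so only the newest one is needed), while an incremental backup holds only the changes since the previous backup (so every one is applied in order). The `Backup` type and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Backup:
    taken_on: date
    kind: str  # "full", "differential", or "incremental" (assumed labels)


def restore_plan(backups, incident_day):
    """Return the backups to load, in order, using only pre-incident copies."""
    usable = sorted(
        (b for b in backups if b.taken_on < incident_day),
        key=lambda b: b.taken_on,
    )
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup predates the incident")
    base = fulls[-1]
    later = [b for b in usable if b.taken_on > base.taken_on]

    plan = [base]
    differentials = [b for b in later if b.kind == "differential"]
    if differentials:
        # A differential covers everything since the full: newest one suffices.
        plan.append(differentials[-1])
    else:
        # Incrementals each cover one interval: apply all of them in order.
        plan.extend(b for b in later if b.kind == "incremental")
    return plan


def replay_logs_from(plan):
    """Transactions logged after this date still need to be re-applied."""
    return plan[-1].taken_on
```

A schedule that mixes differential and incremental copies within the same week is deliberately not handled here; a real restore procedure would need to state how those interleave.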

Now, that distributor is ready to start working on new transactions, reflecting new business. Its CSIRT’s response to the incident is over, and it moves on to the post-incident activities we’ll look at in just a moment.

#### Post-Recovery: Notification and Monitoring
One of the last tasks that the incident response team has is to ensure that end users, functional managers, and senior leaders and managers in the organization know that the recovery operations are now complete. This notice serves several important purposes:

* **Back in business**. This notice gives the green light to the organization to get back into normal business operations. Each department or functional division of the organization may have a different approach to this, based on its business logic and processes. This is particularly true as to how each department addresses any work lost during the overall downtime.
* **Proceed with caution**. Users and their managers should be extra vigilant as they start to use the systems, applications, and data once again. They may wish to start with load-balancing constraints in place so that processes can be closely monitored as they start up slowly and then throttle up to the normal pace of business.
* **Get the word out**. Senior leaders and managers should help make sure that key external stakeholders, partners, and others are properly informed about the successful recovery operation. They may also need to meet legal and regulatory obligations, and keep government officials, shareholders or investors, customers, and the general public properly informed. This is also a great opportunity for leadership and management, from the top down to the first-rung supervisors, to help ensure that every member of the team can be confident in the post-recovery state of the organization.

At this point, the incident response team’s real-time sense of urgency can relax; they’ve met the challenges of this latest information security incident to confront their organization. Now it’s time to take a deep breath and capture their lessons learned.

#### Post-Incident Activities

Before you as team chief send your responder crews home for some rest, you need to get them to look at their notes and the team log, and make some quick memory-jogging notes about anything that happened that’s not immediately obvious in those logs. Then (perhaps the next morning), the team should walk through a formal debrief process, using their logs and their event timeline as a framework. This debrief needs to capture, as completely as possible, the immediate memory of the experiences the team has just shared.

The process of appreciative inquiry can be a great help in such a team debrief. Appreciative inquiry starts from the assumption that what happened was good and useful, even if it didn’t quite fit what was needed; this can lead the team to a blame-free examination of why or how the chosen procedures didn’t suit the situation as well as they could have. Appreciative inquiry sets the stage for learning from experience by valuing that experience and, in doing so, reassuring those on the team that they played valued roles in the incident recovery process.

Good questions can and should be used to drive this debriefing process:

* Exactly what happened, and when?
* How well did we observe each event and capture information about it?
* Did we have documented procedures for such an event? If so, were they used? Did they help?
* What information did we need sooner than we actually discovered or received it?
* What did we do that actually hindered our recovery efforts? What mistakes did we make? How could we have done such steps more effectively?
* What can leadership, management, and staff do differently, both before the next incident and during the next incident, to make containment and recovery work more effectively?
* How could our information sharing with other organizations be improved?
* What precursors and indicators did we miss, or do we still not have insight about, that might have made a key difference to our recovery process?
* What other tools, resources, or talent and experience do we need to help us better detect, analyze, and respond to such incidents in the future?

This debriefing process may take several iterations as the team discovers that they need to learn more from the data collected from the systems during the incident and their response actions. They may also need to consult with others, such as system developers, key end users, or other partners, to more fully appreciate just what did happen and how well the team and the organization responded to it.

The debriefing process will no doubt surface a number of actions, suggestions, and areas for further exploration and analysis. All of these need to be captured in a manageable form, which the team leader, IT director, chief information security officer, or others in leadership and management can use to manage and direct the learning process that’s been started by the debrief. In general, you’ll see several broad types or categories of action items flowing out from the start of this “lessons learned” process:

* Immediate updates to administrative, technical, and physical controls, including the response team’s procedures
* Prompt updates to procedures and content for internal and external communication and coordination during and after an incident response
* Prompt development, installation, and use of new or modified controls and their corresponding procedures
* Updated training and education of response team members, IT and other support staff, managers, leaders, and the overall workforce
* Longer-term, additional investment in information security risk mitigation and management approaches

The question is often asked: did we really learn lessons from such an experience, or did we just write them down and put them in the files for later? That set of action item categories bears a striking resemblance to how software, systems, or product developers manage successive builds or versions of their own products. They plan what should be in each of the next several releases or versions; they task members of their teams to develop those incremental changes, write them, test, and validate them, and then the team integrates them together into the next release.

Make the observations you and your team wrote down more than just observations: prioritize them, plan and schedule their resolution, and assign resources and people to update systems, controls, procedures, and training as required to get the learning from those lessons reflected in your new and improved ways of doing incident response.

## Support Incident Lifecycle

## Understand & Support Forensic Investigations

The incident responders may be done at this point, but other investigations may still be ongoing. Criminal or civil proceedings may mean that digital discovery motions have been served on the organization, or it’s anticipated that they’ll be served very soon. Ongoing internal investigations may be examining suspicious or careless behavior on the part of one or more employees, which could lead to disciplinary actions or even dismissal for cause. Most employers will not take such actions unless they are reasonably certain that they’ve got the evidence to back up such accusations, should the employee seek redress via a labor relations tribunal or the courts. In addition, the nature of the incident may bring with it still more regulatory or legal burdens that require the organization to thoroughly document exactly what happened; what information was compromised, disclosed, or corrupted; and whether any business decisions and actions were taken unadvisedly based on such loss or impact to decision support data.

#### Information and Evidence Retention

In almost any jurisdiction, there are many different and sometimes conflicting rules, regulations, laws, and expectations regarding how long information pertaining to such an incident must be retained. There are even laws and regulations that set maximum retention periods, and companies and individuals can cause themselves more legal troubles if they don’t dispose of information when required to do so. When any aspect of an incident becomes a matter for the courts to consider, these retention timelines can change yet again.

As an SSCP, your role in the midst of all of this may be as simple as ensuring that somebody in the organization produces a records and information retention schedule and that this schedule states how long data collected during an information security incident and response activity must be retained.

You’ll also need to be aware that storage and retention of evidence requires more stringent controls than the storage and retention of other forms of business records, including data gathered or produced during an incident response. Any of that information that has been deemed evidence to a legal proceeding of any kind will probably require a separate storage and accountability process. Most digital evidence is a copy of the original—the contents of a system’s RAM when it was executing malware has to be read out and written onto some kind of systems image media, and that disk image is what must be kept free from harm and under positive accountability. The chain of custody is the sequence of each step taken to originally gather the evidence, record or copy it, put it into storage, and then control and keep account of persons or processes who accessed that evidence; it further has to account for anything that was done to the evidence. Gaps in this chain of custody suggest that someone had the opportunity to tamper with the evidence, at which point the evidence is worthless.
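
The accountability idea behind a chain of custody can be illustrated with a small sketch. It assumes, purely for illustration, that each custody entry records a SHA-256 digest of the evidence image and the digest of the previous entry, so that a missing or altered entry becomes detectable. The function names and log fields are hypothetical, and a real evidence-handling process is defined by legal and organizational procedure, not by this code.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash an evidence image in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _entry_digest(body):
    """Digest of an entry's fields, serialized in a stable key order."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def add_custody_entry(log, actor, action, evidence_sha256):
    """Append an entry that is linked to the previous entry's digest."""
    body = {
        "time": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "evidence_sha256": evidence_sha256,
        "prev_entry_hash": log[-1]["entry_hash"] if log else "",
    }
    entry = dict(body, entry_hash=_entry_digest(body))
    log.append(entry)
    return entry


def verify_log(log):
    """Return True only if no entry was altered, removed, or reordered."""
    prev = ""
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_entry_hash"] != prev or entry["entry_hash"] != _entry_digest(body):
            return False
        prev = entry["entry_hash"]
    return True
```

Recomputing `sha256_of_file` on the stored image and comparing it with the digest in the log is one way to show that the copy has not changed since it was collected.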

You probably won’t encounter questions on the SSCP exam as to the details of records retention, evidence protection and its chain of custody, and the many different laws, regulations, and standards that apply to all of this. You may very well encounter these topics on the job, and the more you know about the nature of these requirements, the better you’ll be able to serve your organization’s overall information security needs.

#### Information Sharing with the Larger IT Security Community

It’s good practice to be an established, respected, and trusted member of your local area information security communities of practice, as well as of larger communities. Once you’re into the post-event phase, it’s a good time to share information about the incident, your responses to it, and the residual damage or actions, if any, that you’re facing. (Such sharing must of course be tempered by your organization’s information security classification guidelines!) Those communities, much like your fellow (ISC)² members, are there to help each other learn from experiences such as you and your team have just been through. Share the wealth, as well as the pain, of that learning with them.

## Understand & Support Business Continuity Plan (BCP) and Disaster Recovery Plan (DRP) Activities

The figure below puts these many different planning processes in a loosely arranged hierarchy; while many formal frameworks, such as those from NIST, ISO, and ITIL, offer sage advice, no one specific set of plans in any particular relationship is the most correct or most compliant with law and regulation. As a result, this figure shows no direct connecting lines or arrows from one plan to another; they all mesh together in the context of business planning and risk management planning. What this figure does illustrate, however, is that there are many interrelated and mutually supportive planning tasks or processes that organizations can and should use to be better prepared to adapt to, survive, and overcome anomalies. As an SSCP, you won’t need to have deep knowledge of each of these plans or the planning processes that produce them. You will, however, serve your employers or clients best when you can offer advice and assistance in helping them achieve their CIANA needs by protecting their information, information systems, IT infrastructures, and people from harm.

![Continuity of operations planning and supporting planning processes](images/bcp.png)

Each of these layers of planning is (or should be) driven by the business impact analysis (BIA), which took the results of the risk assessment process to produce a prioritized approach to which risks, leading to which impacts to the organization, were the most important, urgent, or compelling to protect against. Let’s take a brief look in more detail at some of the planning processes that SSCPs will typically participate in and the plans those processes produce:

* **Business continuity planning** considers how to keep core business logic and processes operating safely and reliably in the face of disruptive incidents; it also looks at how to restore these core processes after they have been disrupted. The business continuity plans (BCPs) that are produced are at the “high tactical” level; they use the strategic plans of the organization as context to take the prioritized core business processes (as defined by the BIA), specifying the tasks needed to recover from such a disruption. This includes all phases of incident response, as you saw in Chapter 10. BCPs do not normally go into the step-by-step operational details necessary to achieve effective preparation, response, or recovery; they rely on other, subordinate plans and procedures to do so.
* **Disaster recovery planning** must concern itself with significant loss of life, injury to people, damage to organizational assets (or the property or assets of others), and significant disruption to normal business operations. As a result, disaster recovery plans (DRPs) look to ways to prevent a disruption from turning into panic or hysteria, while at the same time meeting the organization’s due care and due diligence responsibilities to keep both stakeholders and the community informed. DRPs, for example, often must consider that organizational cash flow will probably suffer significantly as business operations are suspended, or greatly reduced, perhaps for months.
* **Contingency operations planning** takes business continuity considerations a few steps further by examining and selecting how to provide alternate means of getting business operations up and running again. This can embrace a variety of approaches, depending on the nature of the business logic in question:
    * Alternate work locations for employees to use
    * Alternate communications systems, internal and external, to keep employees, stakeholders, customers, or partners in touch, informed, and engaged
    * Information backup, archive, and restore capabilities, whether for physical backup of information and key documents or digital backups
    * Alternate processing capabilities
    * Alternate storage, support, and logistics processes 
    * Temporary staffing, financial, and other key considerations
* **Critical asset protection planning** looks at the protection required for strategic, high-value or high-risk assets in order to prevent significant loss of value, utility, or availability of these assets to serve the organization’s needs. As you saw in Chapter 3, these can be people, intellectual property, databases, assembly lines, or almost anything that is hard to replace and almost impossible to carry on business without.
* **Physical security and safety planning** focuses on preventing unauthorized physical access to the organization’s premises, property, systems, and people; it focuses on fire, environmental, or other hazards that might cause human injury or death, property damage, or otherwise reduce the value of the organization and its ability to function. It works to identify safety hazards and reduce accidents. (Chapter 4, “Operationalizing Risk Mitigation,” identified key approaches to physical risk mitigation controls with which SSCPs should be familiar.)

Finally, we as SSCPs come back to the information security incident response planning processes, as shown in Chapter 10. That planning process rightly focuses our attention on detecting IT and information systems events (or anomalies) that might be security incidents in the making, characterizing them, notifying appropriate organizational managers and leaders, and working through containment, eradication, and recovery tasks as we respond to such incidents.

The conclusion is inescapable: planning is what keeps us prepared, so that we can respond, but our planning has to be multifaceted and allow us to look at our organization, our operations, our information architectures, and our risks across the whole spectrum of business strategic, tactical, and operational concerns and details.

For example, consider how businesses in the Midwestern United States can combine forecasting and weather trending data to minimize the risk of loss from tornados and severe thunderstorms. As Yossi Sheffi points out in The Resilient Enterprise (2005, MIT Press), General Motors plant managers in Oklahoma City used data that shows how “tornado season” spans April, May, and June, with the peak time of day being from 3 p.m. to 9 p.m. local time. This insight empowered managers to focus attention on keeping the swing shift (afternoon and evening) workforce safer. The same data might suggest better, more cost-effective solutions to preserving and protecting IT infrastructures and key systems, especially if the commercial power, communications, and Internet systems can be prone to interruption or collapse during these peak periods of storm activity.

It’s important to make a distinction here between plans and planning. Plans are sets of tasks, objectives, resources, constraints, schedules, and success criteria, brought together in a coherent way to show us what we need to do and how we do it to achieve a set of goals. Planning is a process—an activity that people do to gather all of that information, understand it, and put it to use. Planning is iterative; you do it over and over again, and each time through, you learn more about the objectives, the tasks, the constraints; you learn more about what “success” (or “failure”) really means in the context of the planning you’re doing. In the worst of all worlds, plans become documents that sit on shelves; they are taken down every year, dusted off, thumbed through, and put back on the shelf with minor updates perhaps. These plans are not living documents; they are useless. Plans that people use every day become living documents through use; they stay alive, current, and real, because the people served by those plans take each step of those plans and develop detailed procedures that they then use on the job to accomplish the intent of the plan.

In a very real sense, the planning you’ll do to meet the CIANA needs of your organization or business does not and should not end until that organization or business does. Ongoing, continuous planning is in touch with what the knowledge workers and knowledge-seeking workers on your team are doing, every day, in every aspect of their jobs.

#### Cloud-Based “Do-Over” Buttons for Continuity, Security, and Resilience

This is arguably the greatest boon to the organization that migrating its business logic correctly into the right cloud-hosting environment and service model can bring. In this chapter, as well as earlier chapters, we’ve examined some of the arguments for moving services into the clouds. Let’s take a 50,000-foot view of this (as our aviator friends call it), and see just how these three attributes of secure, safe, and survivable computing show up in a typical organization by means of a common feature in almost every video game: the do-over button. This can show up in a number of everyday activities:

* Transaction do-over: Almost everything businesses and organizations need to achieve can be modeled as a series of transactions. Transactions are atomic by definition; that is, either you complete a transaction successfully, in its entirety, or you don’t. (You don’t partially make a deposit into your bank account, do you?) Undeleting a file is probably the most common IT example of this; this is straightforward when done on your local storage devices but requires multiple versions of files (or other approaches) to deal with shared, cloud-hosted, and synchronized storage supporting multiple users on multiple devices.
* Session do-over: As a writer, I might spend an hour or more editing a document, only to realize that I’ve made some horrible mistakes; I really want to just throw away this hour’s work and start over from where I was first thing this morning. Document file versioning (or even frequent “save as” with a new name) is the simplest approach to this, since auto-save and cloud-synchronized backups often capture each change as an update to the file being edited as they occur. In fact, document versioning—saving a completely new instance of the file under a new name—is just about the only way to provide this kind of fallback at the user’s work task level.
* Complex service do-over: Installing a new version of an OS or a major applications suite is complex, can take a lot of time, and may require a number of system reboots. Most systems and applications installation kits provide some kind of fallback capability, allowing the user to retry by reconfiguring the system to the way it was before the (aborted) installation or update was started.
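
The all-or-nothing nature of a transaction do-over can be shown with a short sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names, and the `transfer` function are hypothetical; the point is only that a failure partway through leaves the data exactly as it was before the transaction began.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [("checking", 100), ("savings", 0)],
    )
    conn.commit()
    return conn


def balances(conn):
    """Return {account name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move funds as one atomic transaction: both updates happen or neither."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception escapes it.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        (remaining,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if remaining < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
```

If `transfer` raises partway through, the debit that was already applied is rolled back, which is the database-level version of the do-over button.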

You might say that a threat actor is the cause of every do-over capability that we need: the user made a mistake and needs to correct it; Mother Nature has intervened and shut the systems down in unclean ways; or an attacker, or just a software or data error, has caused the system to malfunction. The systems’ managers and owners then detect that they’re facing the Hobson’s choice of the computer era: abort what’s already been done, retry the tasks by falling back to a known good point in the cumulative transaction history and redoing everything since, or ignore the compounding of errors and somehow move on. In the larger case of a total systems outage lasting days or weeks, the activation of disaster recovery and business continuity plans determines what to do about three categories of work (as reflected in the physical nature of the business and in its information systems).

What all of these do-overs have in common is that first, users and systems planners need to identify some baseline configurations of systems, software, and data that are worthy of extra efforts to keep available. These baselines need to be time and date stamped to be effective, of course! Whether we need to fall back to this morning’s version of a file or completely re-create the system image onto new hardware after a disaster, users will need to know what backup set from what moment in time to reload. Once it’s loaded, users can step through offline records of work steps taken and either redo them or deliberately choose to ignore them. Business logic should dictate this choice in advance or at least define the criteria to use in making this choice.

Different cloud systems providers, their deployment models, and their services models provide different selections of features and capabilities that support a cloud-based do-over capability. Critical to making the right choice is to know your organization’s real needs in each of these three areas, and the BIA should give you a solid foundation from which to start. Some key questions to consider include these:

* How many connections to the cloud-hosted business platforms, from how many of our end users, must meet what degree of reliability?
* Does our current physical communications architecture, including connections via our ISPs, provide us with that degree of reliability?
* If our fallback options include greater reliance on end-user mobile devices, what happens to our connectivity and business continuity when the local area mobile networks are overloaded or crash (as often happens during severe storms, earthquakes, and accidental or deliberate large-scale disruptions)?

We also must consider where our cloud-hosted data and applications platform systems actually, physically reside. Can the same natural or man-made hazard that disrupts our on-site business operations and people also disrupt our cloud host? If there’s a possibility of this, we need to explore how the cloud host itself can provide backup, distributed, or alternate site storage, as well as processing and access control support. Again, our business needs for this should drive the CIANA components of our discussions and negotiations with alternative cloud services providers. (One comforting thought: the major cloud host providers, such as Amazon, Google, and Microsoft, have already worked out solutions to these problems for very large, multinational customer organizations using their clouds; this drove them to build in capabilities that smaller organizations, be they local, regional, or international, can benefit from.)

This is the point in the information security risk management process where the “magic numbers” of risk come back into play:
* Exposure factor (EF)—The fraction of the value of the asset, process, or outcome that will be lost from a single occurrence of the risk event.
* Single loss expectancy (SLE)—The total direct and indirect costs (or losses) from a single occurrence of a risk event.
* Annual rate of occurrence (ARO)—The anticipated number of times per year that such an event may occur.
* Annual loss expectancy (ALE)—The anticipated losses for the year, which is the ARO multiplied by the SLE: 
  ALE = SLE × ARO
* Safeguard value (SV)—The costs to install, activate, and use the risk mitigation controls that provide protection from the impact of this risk event.
* Maximum allowable outage (MAO)—The greatest time period that business operations can be allowed to be disrupted by this risk event.
* Recovery time objective (RTO)—The time by which the systems must be restored to normal operational function after the occurrence of this risk event.
* Recovery point objective (RPO)—The maximum allowable latency or lag between having all data current versus the state of the data as a result of the risk event. The shorter the RPO, the more frequently data needs to be backed up. Longer RPOs reflect a willingness to operate on restored systems, handling new data (new business transactions) while still working to restore ones lost by the event.
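
These numbers lend themselves to a simple record per risk. The sketch below is an illustration only: the `RiskEntry` type and its field names are hypothetical, the safeguard comparison assumes the control's only effect is to lower the ARO and that the safeguard cost is expressed per year, and the two schedule checks (RTO no longer than MAO, backup interval no longer than RPO) are one reading of the definitions above.

```python
from dataclasses import dataclass


@dataclass
class RiskEntry:
    name: str
    sle: float                    # single loss expectancy, in currency units
    aro: float                    # expected occurrences per year
    annual_safeguard_cost: float  # safeguard value (SV), assumed annualized
    mao_hours: float              # maximum allowable outage
    rto_hours: float              # recovery time objective
    rpo_hours: float              # recovery point objective

    @property
    def ale(self):
        """Annual loss expectancy: ALE = SLE x ARO."""
        return self.sle * self.aro

    def safeguard_pays_off(self, aro_with_safeguard):
        """Compare the yearly loss avoided with the yearly safeguard cost."""
        avoided = self.ale - self.sle * aro_with_safeguard
        return avoided > self.annual_safeguard_cost

    def schedule_problems(self, backup_interval_hours):
        """List mismatches between the objectives and the backup schedule."""
        problems = []
        if self.rto_hours > self.mao_hours:
            problems.append("RTO exceeds MAO")
        if backup_interval_hours > self.rpo_hours:
            problems.append("backup interval exceeds RPO")
        return problems
```

Keeping one such record per line of the risk register makes it straightforward to total the ALE across a subset of risks when aggregating them upward.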

Remember, these magic numbers are needed for each risk in the organization’s risk register. Complex organizations, with their resultant complex IT infrastructures supporting their rich sets of information systems and services, may have hundreds or more lower-level, more granular risks, each with its set of risk assessment numbers. They may have many ways of aggregating subsets of these risks up into a much smaller set of numbers. Note that the reverse process is also true: first, you face a large, almost terrifying single risk idea or event, and you break it down into smaller, more manageable elements of risk, each of which has its own set of numbers, including the leftover residual risk that you don’t quite know what to do with yet.

SLE is a good example of a magic number that can be quite simple to derive, or quite complex. Think about what happens when an electrical power transient “bricks” your modem/router in a SOHO environment, leaving you with no LAN and no ISP connection at all. This may mean that your staff can work to only 75 percent of their normal (budgeted) productivity until Internet services are restored. There’s also the possibility that for every 10-hour day that you’re offline, you might lose a customer order. The components of this single loss event might break down as shown in the following table:

If we assume that it’s going to take you all day to get the ISP to bring you a replacement modem/router, your total single loss expectancy is $400 ($150 for the replacement modem/router, 10 hours of lost productivity, and a 10 percent chance of a lost order averaging $500). Depending on the ARO for this event as a whole, it might (or might not) be worthwhile to invest in a spare modem/router.
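
The arithmetic behind that $400 figure can be laid out explicitly. The replacement cost, the lost-order probability, the average order value, and the total come from the text; the dollar value of the lost productivity is not stated, so the sketch back-computes it from the total and labels it as implied. The `spare_is_worthwhile` helper and its inputs (spare price, planning horizon, and the idea that a spare on the shelf avoids the downtime losses but not the eventual replacement purchase) are assumptions added for illustration.

```python
# Figures stated in the text.
MODEM_REPLACEMENT = 150.00
LOST_ORDER_PROBABILITY = 0.10
AVERAGE_ORDER_VALUE = 500.00
STATED_SLE = 400.00

# Expected value of the possibly lost order.
EXPECTED_LOST_ORDER = LOST_ORDER_PROBABILITY * AVERAGE_ORDER_VALUE

# Not stated in the text: implied by the other components and the total.
IMPLIED_LOST_PRODUCTIVITY = STATED_SLE - MODEM_REPLACEMENT - EXPECTED_LOST_ORDER

# Losses caused by the downtime itself rather than by the hardware purchase.
DOWNTIME_LOSS = IMPLIED_LOST_PRODUCTIVITY + EXPECTED_LOST_ORDER


def annual_loss_expectancy(sle, aro):
    """ALE = SLE x ARO."""
    return sle * aro


def spare_is_worthwhile(aro, spare_cost, years):
    """Assumption: a spare avoids the downtime loss for each event.

    Compares the downtime loss avoided over the planning horizon with
    the one-time cost of keeping a spare modem/router on hand.
    """
    return DOWNTIME_LOSS * aro * years > spare_cost
```

Under these assumptions the decision turns almost entirely on the ARO: a failure expected once a year justifies a spare far more readily than one expected once a decade.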

Thus, the circle should close: business or organizational goals and objectives drive the creation and design of business processes, which dictate IT infrastructure capabilities to deliver enough CIANA to achieve those goals while managing risk in cost-effective ways. Each step further refines the strategies, tactics, and operational approaches to risk management and mitigation, producing more entries in the risk register, each of which is smaller in impact. At some point, each risk can be affordably managed, controlled, and monitored. Each step in that circle of decision, design, implementation, and operation can benefit greatly from what the SSCP can bring to the table.