# Incident Response & Recovery

#### Recovery
**Recovery** is the process by which the organization’s IT infrastructure, applications, data, and workflows are reestablished and declared operational. In an ideal world, recovery starts when the eradication phase is complete, and the hardware, networks, and other systems elements are declared safe to restore to their required normal state. The ideal recovery process brings all elements of the system back to the moment in time just before the incident started to inflict damage or disruption to your systems. When recovery is complete, end users should be able to log back in and start working again, just as if they’d last logged off at the end of a normal set of work-related tasks.

It’s important to stress that every step of a recovery process must be validated as correctly performed and complete. This may need nothing more than using some simple tools to check status, state, and health information, or using preselected test suites of software and procedures to determine whether the system or element in question is behaving as it should be. It’s also worth noting that the more complex a system is, the more it may need to have a specific order in which subsystems, elements, and servers are reinitialized as part of an overall recovery and restart process.
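The two ideas in that paragraph, a required restart order and validation after every step, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subsystem names and the `DEPENDS_ON` map are hypothetical, and `restart` and `health_check` stand in for whatever tools or test suites your organization actually uses.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each subsystem lists what must be running first.
DEPENDS_ON = {
    "network": [],
    "storage": ["network"],
    "database": ["storage"],
    "auth": ["network", "database"],
    "app_server": ["database", "auth"],
}

def restart_order(depends_on):
    """Return subsystems ordered so every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())

def recover(depends_on, restart, health_check):
    """Restart each subsystem in order; stop at the first failed validation."""
    restored = []
    for name in restart_order(depends_on):
        restart(name)
        if not health_check(name):
            raise RuntimeError(f"validation failed for {name}; halting recovery")
        restored.append(name)
    return restored
```

Halting on the first failed check keeps later subsystems from being started on top of an element that has not been validated.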

With that in mind, let’s look at this step by step, in general terms:

**Eradication complete**. Ideally, this is a formal declaration by the CSIRT that the systems elements in question have been verified to be free of any instances of the causal agent (malware, illicit user IDs, corrupted or falsified data, etc.).

**Restore from bare metal to working OS**. Servers, hosts, endpoints, and many network devices should be reset to a known good set of initial software, firmware, and control parameters. In many cases, the IT department has made standard image sets that they use to do a full initial load of new hardware of the same type. This should include setting up systems or device administrator identities, passwords, or other access control parameters. At the end of this task, the device meets your organization’s security and operational policy requirements and can now have applications, data, and end users restored to it.

**Ensure all OS updates and patches are installed correctly**. Apply any that have been released for the versions of software installed from your distribution kits or pristine system image copies.

**Restore applications as well as links to applications platforms and servers on your network**. Many endpoint devices in your systems will need locally installed applications, such as email clients, productivity tools, or even multifactor access control tools, as part of normal operations. These will need to be reinstalled from pristine distribution kits if they were not in the standard image used to reload the OS. This set of steps also includes reloading the connections to servers, services, and applications platforms on your organization’s networks (including extranets). This step should also verify that all updates and patches to applications have been installed correctly.

**Restore access to resources via federated access controls and resources beyond your security perimeter out on the Internet**. This step may require coordination with these external resource operators, particularly if your containment activities had to temporarily disable such access.

At this point, the systems and infrastructure are ready for normal operations. Aren’t they?

#### Data Recovery
Remember that the IT systems and the information architecture exist because the organization’s business logic needs to gather, create, make use of, and produce information to support decisions and action. Restoring the data plane of the total IT architecture is the next step that must be taken before declaring the system ready for business again.

In most cases, incident recovery will include restoring databases and storage systems content to the last known good configuration. This requires, of course, that the organization has a routine process in place for making backups of all of its operational data. Those backups might be:

* Complete copies of every data item in every record in every database and file
* Incremental or partial copies, which copy a subset of records or files on a regular basis
* Differential, update, or change copies, which consist of records, fields, or files changed since a particular time
* Transaction logs, which are chronologically ordered sets of input data

Restoring all databases and file systems to their “ready for business as usual” state may take the combined efforts of the incident response team, database administrators, application support programmers, and others in the IT department. Key end users may also need to be part of this process, particularly as they are probably best suited to verifying that the systems and the data are all back to normal.

For example, a small wholesale distributor might use a backup strategy that makes a full copy of its databases once per week, and then a differential backup at the end of every business day. Individual transactions (reflecting customer orders, payments to vendors, inventory changes, etc.) would be reflected in the transaction logs kept for specific applications or by end users. In the event that the firm’s database has been corrupted by an attacker (or a serious systems malfunction), it would need to restore the last complete backup copy and then apply the differential backups made since that copy. If each differential captures every change since the full backup, only the most recent one is needed; if each daily backup captures only that day’s changes, every one of them must be applied in order. Finally, the firm would have to step through each transaction recorded since the last backup it applied, either using built-in applications functions that recover transactions from saved log files or by hand.
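The distributor's restore sequence can be sketched as follows. This is a minimal illustration, not a real database procedure: it assumes records can be modeled as a key-value store, that each differential is cumulative (it holds every record changed since the full backup, so only the newest is needed), and that timestamps are simple comparable numbers. If the daily copies were incremental instead, each would have to be applied in order. Deleted records are not handled here.

```python
def restore(full_backup, differentials, transaction_log):
    """Rebuild a record store from a full backup, differentials, and a log.

    full_backup:     (taken_at, {key: value}) snapshot of every record
    differentials:   list of (taken_at, {key: value}); each holds every
                     record changed since the full backup (assumed cumulative)
    transaction_log: list of (timestamp, key, value) in chronological order
    """
    restored_to, records = full_backup[0], dict(full_backup[1])

    # Only the newest differential taken after the full backup is needed.
    later = [d for d in differentials if d[0] > restored_to]
    if later:
        restored_to, changes = max(later, key=lambda d: d[0])
        records.update(changes)

    # Replay only the transactions recorded after the last backup applied.
    for timestamp, key, value in transaction_log:
        if timestamp > restored_to:
            records[key] = value
    return records


# Hypothetical data: day 0 full backup, differentials on days 1 and 2.
full = (0, {"widgets": 100, "gears": 50})
diffs = [(1, {"widgets": 90}), (2, {"widgets": 85, "gears": 45})]
log = [(1.5, "widgets", 88), (2.5, "gears", 40), (2.7, "bolts", 10)]
state = restore(full, diffs, log)
```

In this example the day 1.5 transaction is skipped because the day 2 differential already reflects it; only the two transactions after day 2 are replayed.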

Now, that distributor is ready to start working on new transactions, reflecting new business. Its CSIRT’s response to the incident is over, and it moves on to the post-incident activities we’ll look at in just a moment.

#### Post-Recovery: Notification and Monitoring
One of the last tasks that the incident response team has is to ensure that end users, functional managers, and senior leaders and managers in the organization know that the recovery operations are now complete. This notice serves several important purposes:

* **Back in business**. This notice gives the green light to the organization to get back into normal business operations. Each department or functional division of the organization may have a different approach to this, based on its business logic and processes. This is particularly true as to how each department addresses any work lost during the overall downtime.
* **Proceed with caution**. Users and their managers should be extra vigilant as they start to use the systems, applications, and data once again. They may wish to start with load-balancing constraints in place so that processes can be closely monitored as they start up slowly and then throttle up to the normal pace of business.
* **Get the word out**. Senior leaders and managers should help make sure that key external stakeholders, partners, and others are properly informed about the successful recovery operation. They may also need to meet legal and regulatory obligations, and keep government officials, shareholders or investors, customers, and the general public properly informed. This is also a great opportunity for leadership and management, from the top down to the first-rung supervisors, to help ensure that every member of the team can be confident in the post-recovery state of the organization.

At this point, the incident response team’s real-time sense of urgency can ease; they’ve met the challenges of this latest information security incident to confront their organization. Now it’s time to take a deep breath, relax, and capture their lessons learned.

#### Post-Incident Activities

Before you as team chief send your responder crews home for some rest, you need to get them to look at their notes and the team log, and make some quick memory-jogging notes about anything that happened that’s not immediately obvious in those logs. Then (perhaps the next morning), the team should walk through a formal debrief process, using their logs and their event timeline as a framework. This debrief needs to capture, as completely as possible, the immediate memory of the experiences the team has just shared.

The process of appreciative inquiry can be a great help in such a team debrief. Appreciative inquiry starts from the assumption that what happened was good and useful, even if it didn’t quite fit what was needed; this can lead the team to a blame-free examination of why or how the chosen procedures didn’t suit the situation as well as they could have. Appreciative inquiry sets the stage for learning from experience by valuing that experience and, in doing so, reassuring those on the team that they played valued roles in the incident recovery process.

Good questions can and should be used to drive this debriefing process:

* Exactly what happened, and when?
* How well did we observe each event and capture information about it?
* Did we have documented procedures for such an event? If so, were they used? Did they help?
* What information did we need sooner than we actually discovered or received it?
* What did we do that actually hindered our recovery efforts? What mistakes did we make? How could we have done such steps more effectively?
* What can leadership, management, and staff do differently, both before the next incident and during the next incident, to make containment and recovery work more effectively?
* How could our information sharing with other organizations be improved?
* What precursors and indicators did we miss, or do we still not have insight about, that might have made a key difference to our recovery process?
* What other tools, resources, or talent and experience do we need to help us better detect, analyze, and respond to such incidents in the future?

This debriefing process may take several iterations as the team discovers that they need to learn more from the data collected from the systems during the incident and their response actions. They may also need to consult with others, such as system developers, key end users, or other partners, to more fully appreciate just what did happen and how well the team and the organization responded to it.

The debriefing process will no doubt surface a number of actions, suggestions, and areas for further exploration and analysis. All of these need to be captured in a manageable form, which the team leader, IT director, chief information security officer, or others in leadership and management can use to manage and direct the learning process that’s been started by the debrief. In general, you’ll see several broad types or categories of action items flowing out from the start of this “lessons learned” process:

* Immediate updates to administrative, technical, and physical controls, including the response team’s procedures
* Prompt updates to procedures and content for internal and external communication and coordination during and after an incident response
* Prompt development, installation, and use of new or modified controls and their corresponding procedures
* Updated training and education of response team members, IT and other support staff, managers, leaders, and the overall workforce
* Longer-term, additional investment in information security risk mitigation and management approaches

The question is often asked: did we really learn lessons from such an experience, or did we just write them down and put them in the files for later? That set of action item categories bears a striking resemblance to how software, systems, or product developers manage successive builds or versions of their own products. They plan what should be in each of the next several releases or versions; they task members of their teams to develop those incremental changes, write them, test, and validate them, and then the team integrates them together into the next release.

Make the observations you and your team wrote down more than just observations: prioritize them, plan and schedule their resolution, and assign resources and people to update systems, controls, procedures, and training as required, so that the learning from those lessons is reflected in your new and improved ways of doing incident response.

## Support Incident Lifecycle

## Understand & Support Forensic Investigations

The incident responders may be done at this point, but other investigations may still be ongoing. Criminal or civil proceedings may mean that digital discovery motions have been served on the organization, or it’s anticipated that they’ll be served very soon. Ongoing internal investigations may be examining suspicious or careless behavior on the part of one or more employees, which could lead to disciplinary actions or even dismissal for cause. Most employers will not take such actions unless they are reasonably certain that they’ve got the evidence to back up such accusations, should the employee seek redress via a labor relations tribunal or the courts. In addition, the nature of the incident may bring with it still more regulatory or legal burdens that require the organization to thoroughly document exactly what happened; what information was compromised, disclosed, or corrupted; and whether any business decisions and actions were taken unadvisedly based on such loss or impact to decision support data.

#### Information and Evidence Retention

In almost any jurisdiction, there are many different and sometimes conflicting rules, regulations, laws, and expectations regarding how long information pertaining to such an incident must be retained. There are even laws and regulations that set maximum retention periods, and companies and individuals can cause themselves more legal troubles if they don’t dispose of information when required to do so. When any aspect of an incident becomes a matter for the courts to consider, these retention timelines can change yet again.

As an SSCP, your role in the midst of all of this may be as simple as ensuring that somebody in the organization produces a records and information retention schedule and that this schedule states how long data collected during an information security incident and response activity must be retained.
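A retention schedule can be as simple as a lookup table plus a rule for legal holds. The sketch below is illustrative only: the categories and periods in `RETENTION_DAYS` are hypothetical placeholders, not legal requirements, and the real values must come from your organization's legal and compliance advisers.

```python
from datetime import date, timedelta

# Hypothetical schedule: record category -> retention period in days.
RETENTION_DAYS = {
    "incident_log": 365,
    "system_image": 730,
}

def disposal_date(category, collected_on, legal_hold=False):
    """Return the date a record may be disposed of, or None while on legal hold."""
    if legal_hold:
        # Assumption: a legal hold suspends the normal schedule entirely.
        return None
    return collected_on + timedelta(days=RETENTION_DAYS[category])
```

Keeping the legal hold as an explicit override reflects the point above that retention timelines can change once a matter goes before the courts.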

You’ll also need to be aware that storage and retention of evidence requires more stringent controls than the storage and retention of other forms of business records, including data gathered or produced during an incident response. Any of that information that has been deemed evidence to a legal proceeding of any kind will probably require a separate storage and accountability process. Most digital evidence is a copy of the original: the contents of a system’s RAM when it was executing malware have to be read out and written onto some kind of systems image media, and that disk image is what must be kept free from harm and under positive accountability. The chain of custody is the sequence of each step taken to originally gather the evidence, record or copy it, put it into storage, and then control and keep account of persons or processes who accessed that evidence; it further has to account for anything that was done to the evidence. Gaps in this chain of custody suggest that someone had the opportunity to tamper with the evidence, at which point the evidence may be challenged or ruled inadmissible and lose its value.
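One common way to support positive accountability is to record a cryptographic hash of the evidence image at every custody step and to link each log entry to the one before it. The sketch below illustrates that idea only; the field names and functions are hypothetical, and a hash-linked log complements, but does not replace, the documented handling procedures that a court or regulator will expect.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def add_entry(chain, actor, action, evidence_hash, timestamp):
    """Append a custody record linked to the previous record's hash."""
    entry = {
        "actor": actor,
        "action": action,
        "evidence_sha256": evidence_hash,
        "timestamp": timestamp,
        "prev": chain[-1]["entry_hash"] if chain else "",
    }
    # Hash the record itself so later edits to it can be detected.
    entry["entry_hash"] = sha256_hex(json.dumps(entry, sort_keys=True).encode())
    chain.append(entry)
    return entry

def verify(chain, evidence: bytes) -> bool:
    """Check the evidence hash and that each entry still links to the one before."""
    prev = ""
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev"] != prev:
            return False
        if entry["entry_hash"] != sha256_hex(json.dumps(body, sort_keys=True).encode()):
            return False
        if entry["evidence_sha256"] != sha256_hex(evidence):
            return False
        prev = entry["entry_hash"]
    return True
```

A check like `verify` fails if the image no longer matches its recorded hash, if a recorded entry has been edited, or if an entry has been removed from the middle of the log. It cannot detect entries trimmed from the end, which is one reason the log itself also needs controlled storage.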

You probably won’t encounter questions on the SSCP exam as to the details of records retention, evidence protection and its chain of custody, and the many different laws, regulations, and standards that apply to all of this. You may very well encounter these topics on the job, and the more you know about the nature of these requirements, the better you’ll be able to serve your organization’s overall information security needs.

#### Information Sharing with the Larger IT Security Community

It’s good practice to be an established, respected, and trusted member of your local area information security communities of practice, as well as of larger communities. Once you’re into the post-event phase, it’s a good time to share information about the incident, your responses to it, and the residual damage or actions, if any, that you’re facing. (Such sharing must of course be tempered by your organization’s information security classification guidelines!) Those communities, much like your fellow (ISC)² members, are there to help each other learn from experiences such as the one you and your team have just been through. Share the wealth, as well as the pain, of that learning with them.

## Understand & Support Business Continuity Plan (BCP) and Disaster Recovery Plan (DRP) Activities