# 01 INTRODUCTION TO CYBER SECURITY DATA ANALYTICS

Many reports and white papers now discuss big data analytics for cyber security. You will have seen the earlier [Big Data Security Analytics](https://youtu.be/x1B9WrRPtOc) video report, which found that whilst only a few companies currently use big data analytics for security, many more are expected to do so in the future. Part of the challenge identified in the report is a lack of resources and inadequate knowledge of how to carry out cyber security data analytics. This course will help you get started, and it may be you who helps industry address this resource gap.

## What do we mean by visualisation?

Data visualisation is used in a variety of different domains, and at the very core, visualisation is about effective communication. Often, we think about two forms of effective communication with visualisation: exploratory visualisation and explanatory visualisation.

Exploratory visualisation is about how I, as an analyst, can examine and understand data. I may not know much about the data to begin with, but I can explore it using visualisation techniques to understand what interesting properties may exist. There may be lots of information that I am not interested in; however, through filtering and selection of the data, I may find something worthy of further investigation. Explanatory visualisation is then about how I communicate my findings to someone else. Suppose I have found a particularly useful or intriguing attribute within the original data: can I emphasise this within my visual display so that a reader can observe it clearly, in the context of the overall data? We can think of exploratory visualisation as supporting our analysis of data, and explanatory visualisation as supporting our presentation of data.

“The greatest value of a picture is when it forces us to notice what we never expected to see” 
John W. Tukey, 1977

## What do we mean by analysis?

Data analysis is about trying to make sense of data: to understand the wider context of what the data represents, what the implications are for our given investigation, and what course of action we should take as a result of this data and its underlying meaning. The data may take different forms, such as qualitative and quantitative data.

Specifically, data without context is rather meaningless. I may have some data, let’s say the value '42', but what does this represent and therefore what does it mean? It may represent 42 emails received in one day, meaning that it is a fairly normal work day. However, if it represents 42 data breaches on a client server, suddenly my day has become a lot worse! How we analyse data will depend on the context of the data, what the data represents, and what we may hope (or expect) to find from analysis of this attribute. We can think about four levels of understanding: data, information, knowledge, and wisdom (often described as the [DIKW pyramid](https://en.wikipedia.org/wiki/DIKW_pyramid)). With each level of the pyramid comes greater organisation of the discrete data elements, and greater contextual understanding (see the [Hierarchy of Visual Understanding](https://informationisbeautiful.net/2010/data-information-knowledge-wisdom/)).

![Alt text](./images/image1.png)

## So, what do we mean by cyber security data analytics?

For an organisation to be secure, a clear understanding of the operational environment is required. This is often described as [situational awareness](https://en.wikipedia.org/wiki/Situation_awareness), which is the perception of environmental elements and events, the comprehension of their meaning, and the projection of their future states. Increasingly, organisations are deploying **security operations centres (SOCs)**, where analysts seek to identify suspicious behaviour and understand its context and relevance to the organisational mission, often using **Security Information and Event Management (SIEM)** systems.

Technology underpins modern organisations, and insight into the business operational environment is crucial to protecting it. Ensuring the safe and correct operation of our computer systems and networking infrastructure is a good place to start. Network traffic data (e.g., packet captures) can help to indicate what data has been communicated over a network, and what actions have been carried out as a result (e.g., access to a particular URL, or the download of large files). Intrusion Detection Systems (IDS) are commonly used to inspect inbound and outbound network traffic to identify suspicious activities. IDSs generate log files, and these logs constitute another informative data source. Similarly, firewall rules can help us understand how the network is configured, and Intrusion Prevention Systems (IPS) make decisions and act on IDS activity to prevent potential harm.
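As a rough illustration of how IDS logs can become a data source for analysis, the sketch below counts alerts per source address and per alert type. The log format, column names, addresses, and the `count_alerts_by` helper are all assumptions invented for this example; real IDS products each have their own log formats.

```python
import csv
import io
from collections import Counter

# Hypothetical IDS alert log: the column names and rows below are invented
# for illustration and do not follow any particular IDS product's format.
SAMPLE_LOG = """timestamp,src_ip,dst_ip,signature
2024-01-01T09:00:01,10.0.0.5,192.0.2.10,Port scan
2024-01-01T09:00:02,10.0.0.5,192.0.2.11,Port scan
2024-01-01T09:00:07,10.0.0.9,192.0.2.10,Suspicious download
2024-01-01T09:01:15,10.0.0.5,192.0.2.12,Port scan
"""


def count_alerts_by(log_text: str, field: str) -> Counter:
    """Count alerts grouped by one column of the CSV log."""
    reader = csv.DictReader(io.StringIO(log_text))
    return Counter(row[field] for row in reader)


alerts_by_source = count_alerts_by(SAMPLE_LOG, "src_ip")
alerts_by_signature = count_alerts_by(SAMPLE_LOG, "signature")

# List the noisiest sources first, a common starting point for triage.
for src_ip, count in alerts_by_source.most_common():
    print(f"{src_ip}: {count} alert(s)")
```

Even a simple count like this turns a raw log into something an analyst can scan quickly, and it is the kind of summary that a bar chart could then present visually.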

The remit of cyber security is far and wide, and goes beyond traditional computer and network security. A holistic view is required of what we want to protect, and what attack vectors may be used to gain access. Therefore, aspects such as physical security, people security, and process security also need to be understood. Physical security may require CCTV, IoT sensor monitoring, or GPS tracking. People security may require text analytics of social media and email usage. Process security may require analysis of business process models, supply chain security, organisational hierarchy information, and operational practice. Technology continues to influence how we conduct business across the globe, and therefore we need to ensure that we understand our threat landscape and have clear monitoring in place to understand potential harms. In many cases, we are interested in spatial-temporal data, i.e., in what location did the activity occur and at what time? Given our highly connected society, location is becoming increasingly challenging to pin down (are we looking at the location of the attacker, the location of the data, or the location of the breached system?), and as for time, devices are logging activities faster than we can humanly inspect them. There is therefore a need for big data cyber security analytics: to make this flow of data manageable and insightful, to highlight key attributes in the data, and to enable informed decisions to be made to respond and react to potential threats.
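One simple way to make a fast flow of spatial-temporal events more manageable is to aggregate them into time buckets per location. The following is a minimal sketch under stated assumptions: the event list, the location names, and the `bucket_events` helper are hypothetical, and one-minute buckets are an arbitrary choice for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical events as (ISO timestamp, location) pairs; the values are
# invented for illustration.
EVENTS = [
    ("2024-01-01T09:00:01", "London"),
    ("2024-01-01T09:00:40", "London"),
    ("2024-01-01T09:00:59", "Cardiff"),
    ("2024-01-01T09:01:05", "London"),
    ("2024-01-01T09:01:30", "London"),
    ("2024-01-01T09:01:45", "London"),
]


def bucket_events(events):
    """Count events per (minute, location) bucket."""
    counts = Counter()
    for timestamp, location in events:
        # Truncate to the minute so many raw events collapse into one bucket.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        counts[(minute.isoformat(), location)] += 1
    return counts


summary = bucket_events(EVENTS)
for (minute, location), count in sorted(summary.items()):
    print(minute, location, count)
```

Six raw events reduce to three rows here; at realistic volumes the same idea reduces millions of log lines to a table that a human analyst can inspect or plot over time.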

**Security is about understanding systems, the people, and the processes that act upon these systems, such that they remain secure.** Can we ever be fully secure? Probably not, but with greater insight into observed activities, we can manage this more effectively. Data analytics and visualisation techniques are one step towards achieving this.

![Alt text](./images/image2.png)

# Security Data Visualisation Skills

Data science and security visualisation require the [following blend of skills](https://www.sans.org/white-papers/36387/), which combines the ability to hack and manipulate data, an understanding of statistical techniques, and the domain knowledge of what information is relevant and important for the purpose of security.

- **Substantive Expertise:** This is the security domain knowledge that enables the security practitioner to understand the data, determine what is expected, and find anomalies or metrics from the visualisation.
- **Hacking Skills:** These are the data science skills required to work with massive amounts of data that must be acquired, cleaned, and sanitised.
- **Math & Statistics Knowledge:** This knowledge is critical for understanding which tools to use, and for understanding the spread and other characteristics of the data in order to derive insight from it.

![Alt text](./images/image3.png)
 
# What stories may our data tell?

We discussed earlier the idea that data visualisation can be used for exploratory and explanatory purposes. In the latter, [we may want to tell a story about our data, often described as Data-Driven Storytelling](https://www2.deloitte.com/us/en/insights/topics/analytics/data-driven-storytelling.html). So what makes for a good data story? There are five aspects that may be pertinent to the story:

- **Novelty:** We may want to observe when something is new within our observations.
- **Outlier:** We may want to observe when something that is not new appears different within our observations.
- **Trend:** We may want to observe the historical pattern of observations.
- **Forecasting:** We may want to observe how the historical pattern forecasts what may come in the future.
- **Debunking:** We may want to observe how our data contradicts an opinion of what may come.

Typically, in security, we are looking for novelty and outliers, based on the historical trend, to then provide a forecast of what may be if we do not intervene. Where we can couple observations against either “known-bad” activities (e.g., malware attacks), or show that observations are clearly deviating from “known-good” activities (e.g., insider threat), we can provide some insight into the underlying activity to determine whether action is required.
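To make the distinction between novelty and outliers concrete, here is a minimal sketch that labels a new observation against a historical trend. It assumes a hypothetical history of daily login counts per user and uses a z-score threshold of 3 as one simple, commonly used notion of "deviating from known-good"; the data, the `classify` function, and the threshold are illustrative assumptions, not a recommended detection method.

```python
from statistics import mean, stdev

# Hypothetical history: daily login counts per user; values are invented.
HISTORY = {
    "alice": [20, 22, 19, 21, 20, 23, 21],
    "bob": [5, 6, 5, 7, 6, 5, 6],
}


def classify(user: str, todays_count: int, history: dict, threshold: float = 3.0) -> str:
    """Label an observation as 'novelty', 'outlier', or 'normal'."""
    if user not in history:
        # Novelty: this user has never been observed before.
        return "novelty"
    past = history[user]
    mu, sigma = mean(past), stdev(past)
    if sigma == 0:
        # No historical variation: any change from the constant value stands out.
        return "normal" if todays_count == mu else "outlier"
    # Outlier: a known user whose activity deviates strongly from their own trend.
    z_score = abs(todays_count - mu) / sigma
    return "outlier" if z_score > threshold else "normal"


print(classify("alice", 21, HISTORY))    # known user, typical activity
print(classify("bob", 40, HISTORY))      # known user, far above their usual range
print(classify("mallory", 3, HISTORY))   # user never seen before
```

Comparing each user against their own history, rather than a global average, means that 21 logins is unremarkable for one user while 40 is highly unusual for another.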

# AI and Cyber Security

DarkTrace, CheckPoint, Symantec, Sophos, FireEye, Cynet, Fortinet, Vectra, and Cylance are just a handful of the vendors that now use Artificial Intelligence and Machine Learning as part of their products and services for cyber security defence. There are plenty of others too, but this gives you [an impression of the direction that the industry is moving in](https://www.comparitech.com/blog/information-security/leading-ai-cybersecurity-companies/).

Cyber security requires a holistic view to identify what should be protected, and how it may be vulnerable to attack. The volume of data generated by today’s systems means that humans cannot analyse the raw data effectively. With AI and machine learning techniques, we can filter and manage data observations, whilst visualisation can help human analysts understand and communicate observations and appropriate responsive actions.

# Further reading

- [Sarker, I.H., Kayes, A.S.M., Badsha, S. et al. Cybersecurity data science: an overview from machine learning perspective. J Big Data 7, 41 (2020). https://doi.org/10.1186/s40537-020-00318-5](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00318-5)
- [Maayan, G. How Data Science Has Changed Cybersecurity. Datasciencedojo (2020).](https://online.datasciencedojo.com/blogs/how-data-science-has-changed-cybersecurity)
