<a href="https://colab.research.google.com/github/guilhermelaviola/CybersecurityProblemSolvingWithDataScience/blob/main/Class02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Analysis in Cybersecurity**
Data analysis and machine learning are fundamental to modern cybersecurity, enabling organizations to detect, predict, and respond to cyber threats more effectively. By analyzing logs, network traffic, and user behavior, security teams can identify anomalies such as suspicious logins, malware activity, or insider threats. Tools for log management and data visualization help transform large volumes of security data into actionable insights, while data science techniques support data protection, regulatory compliance (such as LGPD), and cloud security. Together with strong policies and employee awareness, these approaches significantly strengthen an organization’s overall cybersecurity posture.

In [1]:
# Importing all the necessary libraries and resources:
import pandas as pd
from sklearn.ensemble import IsolationForest

## **Detecting Anomalous Login Behavior**
Below is a simple example showing how machine learning concepts can be applied to detect unusual login behavior using an anomaly detection algorithm.

In [2]:
# Sample login data (hour of login and number of login attempts):
data = {
    'login_hour': [9, 10, 11, 9, 10, 11, 2, 3],
    'login_attempts': [1, 1, 2, 1, 1, 2, 10, 8]
}

df = pd.DataFrame(data)

# Training an anomaly detection model:
model = IsolationForest(contamination=0.25, random_state=42)
df['anomaly'] = model.fit_predict(df)

# Marking anomalies (-1 = suspicious, 1 = normal):
df['anomaly'] = df['anomaly'].map({1: 'Normal', -1: 'Suspicious'})

print(df)

   login_hour  login_attempts     anomaly
0           9               1      Normal
1          10               1      Normal
2          11               2      Normal
3           9               1      Normal
4          10               1      Normal
5          11               2      Normal
6           2              10  Suspicious
7           3               8  Suspicious


## **Detecting Anomalies in Log Data**
This example simulates system log data and uses machine learning to detect unusual activity, such as abnormal request sizes that could indicate an attack.

In [3]:
# Simulated log data:
logs = {
    'response_time_ms': [120, 110, 115, 130, 125, 118, 900, 850],
    'request_size_kb': [20, 22, 21, 23, 19, 20, 300, 280]
}

df_logs = pd.DataFrame(logs)

# Training anomaly detection model:
model = IsolationForest(contamination=0.25, random_state=0)
df_logs['anomaly'] = model.fit_predict(df_logs)

# Converting output to readable labels:
df_logs['anomaly'] = df_logs['anomaly'].map({1: 'Normal', -1: 'Anomalous'})

print(df_logs)

   response_time_ms  request_size_kb    anomaly
0               120               20     Normal
1               110               22     Normal
2               115               21     Normal
3               130               23     Normal
4               125               19     Normal
5               118               20     Normal
6               900              300  Anomalous
7               850              280  Anomalous
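
Once a model like the one above is fitted, it can also score log entries it has not seen before. The following is a minimal sketch under assumptions: the two rows in `new_entries` are invented values, and the model is refitted here only so the block runs on its own. `decision_function` returns lower scores for more anomalous rows, and `predict` returns -1 for anomalies.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Same simulated log data as in the cell above.
train = pd.DataFrame({
    'response_time_ms': [120, 110, 115, 130, 125, 118, 900, 850],
    'request_size_kb': [20, 22, 21, 23, 19, 20, 300, 280],
})
model = IsolationForest(contamination=0.25, random_state=0).fit(train)

# Hypothetical new log entries: one typical request and one extreme request.
new_entries = pd.DataFrame({
    'response_time_ms': [119, 2000],
    'request_size_kb': [21, 600],
})

# Lower decision scores mean "more anomalous"; predict() returns -1 for anomalies.
scores = model.decision_function(new_entries)
labels = model.predict(new_entries)

new_entries['score'] = scores
new_entries['anomaly'] = pd.Series(labels).map({1: 'Normal', -1: 'Anomalous'})
print(new_entries)
```

Inspecting the continuous score, rather than only the label, makes it possible to rank alerts by severity.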


## **Think & Answer**
**Considering the initial machine learning techniques applied to cybersecurity, discuss the importance of anomaly detection for identifying malicious activity on a computer network. Explain how an anomaly detection algorithm can be used in this context, highlighting the benefits and limitations of this approach.**

Anomaly detection is a fundamentally important area of data analysis and visualization in network cybersecurity: it can be used not only to identify attacks in progress but also to help anticipate future attacks and to support user-identity verification. It can be defined as the identification of an observation, event, or data point that deviates from what is standard or expected, making it inconsistent with the rest of the dataset.

As an example, consider the event logs collected by a security information and event management (SIEM) system, such as logins at unusual times, failed login attempts, or unexpected changes to system settings. These logs can be visualized and used to identify suspicious activity. A time-series visualization shows the events over time and highlights any deviations from the norm.
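
The idea of spotting a deviation in a time series can be sketched in a few lines. This is an illustration only: the hourly failed-login counts are invented, and the robust threshold of 5 median absolute deviations is an arbitrary assumption, not a recommended setting.

```python
import pandas as pd

# Hypothetical failed-login counts per hour (illustrative values, not real SIEM data).
hours = pd.date_range('2024-01-01 00:00', periods=12, freq=pd.Timedelta(hours=1))
failed_logins = pd.Series([2, 1, 3, 2, 2, 1, 2, 40, 3, 2, 1, 2], index=hours)

# Robust baseline: the median and the median absolute deviation (MAD)
# are less affected by a spike than the mean and standard deviation.
median = failed_logins.median()
mad = (failed_logins - median).abs().median()

# Flag hours whose count lies more than 5 MADs from the median
# (the threshold of 5 is an arbitrary choice for this sketch).
deviation = (failed_logins - median).abs() / mad
flagged = failed_logins[deviation > 5]

print(flagged)
```

Plotting `failed_logins` with the flagged hours marked would give the kind of time-series view described above.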

Today, anomaly detection commonly relies on artificial intelligence and machine learning to automatically identify unexpected changes in the normal behavior of a dataset.

It can be applied in various sectors. Some examples are:
- Finance: to detect fraud;
- Manufacturing: to identify defects or malfunctions in equipment;
- Cybersecurity: to detect unusual network activity;
- Healthcare: to identify abnormal patient conditions.

Machine Learning algorithms can be used to detect anomalies. They learn the implicit pattern in the data and then identify any deviations from that pattern. Some examples of these algorithms are:
- Decision trees
- One-class support vector machine (SVM)
- K-nearest neighbors (k-NN)
- Naive Bayes
- Autoencoders
- Local outlier factor (LOF)
- k-means clustering
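
As one illustration of the algorithms listed above, here is a minimal sketch that applies the local outlier factor (LOF) to the same login data used earlier in this notebook. The name `df_lof` and the setting `n_neighbors=3` are assumptions chosen for this tiny sample; real data would need a larger neighborhood and, usually, feature scaling.

```python
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Same sample login data as in the first example of this notebook.
df_lof = pd.DataFrame({
    'login_hour': [9, 10, 11, 9, 10, 11, 2, 3],
    'login_attempts': [1, 1, 2, 1, 1, 2, 10, 8],
})

# LOF compares the local density of each point with that of its neighbors;
# points in much sparser regions than their neighbors are labeled -1.
lof = LocalOutlierFactor(n_neighbors=3, contamination=0.25)
predictions = lof.fit_predict(df_lof[['login_hour', 'login_attempts']])

df_lof['anomaly'] = pd.Series(predictions).map({1: 'Normal', -1: 'Suspicious'})
print(df_lof)
```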

An anomaly detection algorithm can learn to identify patterns and detect anomalous data using various machine learning training techniques. The choice among unsupervised, supervised, and semi-supervised techniques depends mainly on how much labeled data, if any, the team's training dataset contains.

In unsupervised techniques, data engineers train a model on unlabeled datasets and let it discover patterns or anomalies on its own. These techniques are the most commonly used because they apply to the widest range of problems; however, they often need large datasets and considerable computational power to model normal behavior reliably.

In supervised techniques, an algorithm is trained on a labeled dataset that includes both normal and anomalous instances. Because labeled training data is generally unavailable and the classes are heavily imbalanced, these anomaly detection techniques are rarely used.
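
The supervised setting can be sketched as follows, assuming labels are available. The `is_malicious` column and the two rows in `new_logins` are hypothetical, and `class_weight='balanced'` is one common way to compensate for the class imbalance mentioned above, not the only one.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled login data: 0 = normal, 1 = malicious.
labeled = pd.DataFrame({
    'login_hour': [9, 10, 11, 9, 10, 11, 2, 3],
    'login_attempts': [1, 1, 2, 1, 1, 2, 10, 8],
    'is_malicious': [0, 0, 0, 0, 0, 0, 1, 1],
})
features = ['login_hour', 'login_attempts']

# class_weight='balanced' gives the rare malicious class more weight during training.
clf = DecisionTreeClassifier(class_weight='balanced', random_state=0)
clf.fit(labeled[features], labeled['is_malicious'])

# Classify two hypothetical new logins.
new_logins = pd.DataFrame({'login_hour': [10, 3], 'login_attempts': [1, 9]})
predictions = clf.predict(new_logins[features])
print(predictions)
```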

Semi-supervised techniques combine the strengths of unsupervised and supervised anomaly detection. An algorithm is first partially trained on the portion of the data that is labeled. Data engineers then use the partially trained algorithm to label a larger dataset autonomously, a step known as 'pseudo-labeling'. These newly labeled data points are then combined with the original dataset to refine the algorithm.
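
The pseudo-labeling steps described above can be sketched by hand. Everything here is an assumption made for illustration: the data values, the choice of Gaussian Naive Bayes as the classifier, and the 0.9 confidence cutoff for accepting a pseudo-label.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Small labeled set (hypothetical): [login_hour, login_attempts]; 0 = normal, 1 = malicious.
X_labeled = np.array([[9, 1], [10, 1], [11, 2], [2, 10], [3, 8]])
y_labeled = np.array([0, 0, 0, 1, 1])

# Larger unlabeled set (hypothetical).
X_unlabeled = np.array([[9, 1], [10, 2], [11, 1], [3, 9], [2, 11], [10, 1]])

# Step 1: partially train the model on the labeled data only.
model = GaussianNB().fit(X_labeled, y_labeled)

# Step 2: pseudo-label the unlabeled data, keeping only confident predictions.
proba = model.predict_proba(X_unlabeled)
confident = proba.max(axis=1) >= 0.9
pseudo_labels = model.classes_[proba.argmax(axis=1)]

# Step 3: combine original and pseudo-labeled data, then refit to refine the model.
X_combined = np.vstack([X_labeled, X_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels[confident]])
model = GaussianNB().fit(X_combined, y_combined)

new_prediction = model.predict(np.array([[4, 7]]))
print(pseudo_labels[confident])
print(new_prediction)
```

In practice the confidence cutoff matters: accepting low-confidence pseudo-labels can reinforce the model's early mistakes.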

Each approach has the advantages and disadvantages noted above, and choosing the right combination of supervised and unsupervised approaches is key to automating threat detection with machine learning.