# Phase 1 – Data Selection (IT326)

## Project Goal
This project analyzes cybercrime data to study how attack characteristics and security-related factors (e.g., attack type, severity, duration, affected data size, targeted systems, security tools, and response time) affect the outcome of a cyberattack (success or failure).

This phase focuses on dataset selection and initial exploration in preparation for classification and clustering tasks.

## Dataset Source
Cyber Crimes Dataset – Kaggle  
https://www.kaggle.com/datasets/shakirul09/cyber-crimes-dataset

> Note: This notebook loads the working dataset from a local CSV file, `cybersecurity_large_synthesized_data.csv`.

In [None]:
from pathlib import Path

import pandas as pd

# Name of the dataset CSV and the locations to search, in priority order.
DATASET_FILE = 'cybersecurity_large_synthesized_data.csv'
DATASET_PATHS = [
    Path(DATASET_FILE),                           # current working directory
    Path('/content') / DATASET_FILE,              # Colab default directory
    Path('/content') / 'Dataset' / DATASET_FILE,  # Colab "Dataset" subfolder
]

# Use the first candidate path that exists, or None if none of them do.
found_path = next((p for p in DATASET_PATHS if p.exists()), None)

if found_path is None:
    raise FileNotFoundError(
        f"Could not find '{DATASET_FILE}'. Please upload it to Colab (Files → Upload) and re-run this cell."
    )

print('Loading dataset from:', found_path)
df = pd.read_csv(found_path)
df.head()  # preview the first five rows

## Dataset Description
In this section, we report:
- Number of attributes and their data types
- Number of records (instances)
- Class label (outcome) and the number of instances for each class
- A sample of the raw dataset

In [None]:
df.info()

In [None]:
df.shape
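
The record count, attribute count, and data types reported above can also be gathered into one summary. The sketch below is one possible way to do that. The helper name `summarize_dataset` and the toy DataFrame (including its column names) are hypothetical and stand in for the real `df`.

```python
import pandas as pd


def summarize_dataset(frame: pd.DataFrame) -> dict:
    """Return record count, attribute count, and how many columns use each dtype."""
    n_records, n_attributes = frame.shape
    # Convert dtype objects to their string names so they can be counted and printed.
    dtype_counts = frame.dtypes.astype(str).value_counts().to_dict()
    return {
        "n_records": n_records,
        "n_attributes": n_attributes,
        "dtype_counts": dtype_counts,
    }


# Toy data for illustration only; these column names are assumptions,
# not the actual columns of the project dataset.
toy = pd.DataFrame(
    {
        "attack_type": ["phishing", "malware", "ddos"],
        "severity": [3, 5, 2],
        "outcome": ["success", "failure", "success"],
    }
)

summary = summarize_dataset(toy)
print(summary)
```

In the notebook, calling `summarize_dataset(df)` would give the same kind of summary for the loaded dataset.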

### Class Label (Target Variable)
The selected target (class label) for this project is **`outcome`**, which, per the project goal, indicates whether an attack succeeded or failed.

In [None]:
df['outcome'].value_counts()
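
Raw counts alone can hide class imbalance, so it may help to show each class's share of the records next to its count. The following is a minimal sketch of that idea. The helper name `class_distribution` and the toy labels `success` and `failure` are assumptions, since the actual values stored in `outcome` may be encoded differently.

```python
import pandas as pd


def class_distribution(frame: pd.DataFrame, target: str = "outcome") -> pd.DataFrame:
    """Return per-class counts and proportions for the target column."""
    if target not in frame.columns:
        raise KeyError(f"Column '{target}' not found in the DataFrame.")
    # dropna=False keeps missing labels visible as their own row.
    counts = frame[target].value_counts(dropna=False)
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})


# Toy labels for illustration only; the real label encoding is an assumption.
toy = pd.DataFrame({"outcome": ["success", "failure", "success", "success"]})

dist = class_distribution(toy)
print(dist)
```

In the notebook, `class_distribution(df)` would report the same two columns for the real target. A heavily skewed proportion would be worth noting before the classification task.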

In [None]:
df.head()