# CIC-IDS-2017 Dataset Analysis

This notebook analyzes the CIC-IDS-2017 network intrusion detection dataset: its file layout, file sizes and record counts, column structure, attack label distribution, and data quality.

## Dataset Overview
The CIC-IDS-2017 dataset contains labeled network traffic flows captured over 5 days (Monday to Friday) in July 2017.

### Dataset Structure:
- **Raw Data** (`GeneratedLabelledFlows/TrafficLabelling/`): Original labeled network flow data
- **Processed Data** (`MachineLearningCSV/MachineLearningCVE/`): The same flows with identifying columns removed, prepared for machine learning

### Key Differences (Processed vs. Raw):
- **Removed**: Flow ID, Source IP, Source Port, Destination IP, Protocol, Timestamp
- **Kept**: Destination Port and all flow-statistic features (78 feature columns) + Label, 79 columns in total
- **Purpose**: Privacy protection while maintaining ML utility
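
As a rough illustration of the relationship listed above, the processed layout can be approximated from a raw frame by dropping the identifier columns. This is a sketch, not the dataset authors' preprocessing: `IDENTIFIER_COLUMNS` and `to_processed_layout` are names introduced here, and the column list follows the removed-column output in Section 4.

```python
import pandas as pd

# Identifier columns absent from the processed CSVs (per the Section 4 comparison).
IDENTIFIER_COLUMNS = {
    "Flow ID", "Source IP", "Source Port",
    "Destination IP", "Protocol", "Timestamp",
}

def to_processed_layout(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of raw_df without the identifier columns.

    Header names are compared after stripping whitespace because many
    headers in these files carry a leading space (e.g. ' Source IP').
    """
    keep = [c for c in raw_df.columns if c.strip() not in IDENTIFIER_COLUMNS]
    return raw_df[keep].copy()

# Tiny hypothetical frame, only to show the call.
demo = pd.DataFrame({
    "Flow ID": ["a-b-1-2-6"],
    " Source IP": ["10.0.0.1"],
    " Destination Port": [80],
    " Flow Duration": [1200],
    " Label": ["BENIGN"],
})
processed_like = to_processed_layout(demo)
print(list(processed_like.columns))
```

Comparing on stripped names keeps the helper independent of whether the headers have already been cleaned.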


## 1. Import Required Libraries


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("Libraries imported successfully!")


Libraries imported successfully!


## 2. Dataset Structure Analysis


In [2]:
# Define paths
raw_data_path = "CIC-IDS-2017/GeneratedLabelledFlows/TrafficLabelling/"
processed_data_path = "CIC-IDS-2017/MachineLearningCSV/MachineLearningCVE/"

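# Get file lists (sorted, because os.listdir returns names in arbitrary order
# and the later comparison pairs raw and processed files by position)
raw_files = sorted(f for f in os.listdir(raw_data_path) if f.endswith('.csv'))
processed_files = sorted(f for f in os.listdir(processed_data_path) if f.endswith('.csv'))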

print("Raw Data Files:")
for i, file in enumerate(raw_files, 1):
    print(f"{i}. {file}")

print("\nProcessed Data Files:")
for i, file in enumerate(processed_files, 1):
    print(f"{i}. {file}")

print(f"\nTotal files in each directory: {len(raw_files)} vs {len(processed_files)}")


Raw Data Files:
1. Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
2. Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
3. Friday-WorkingHours-Morning.pcap_ISCX.csv
4. Monday-WorkingHours.pcap_ISCX.csv
5. Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
6. Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
7. Tuesday-WorkingHours.pcap_ISCX.csv
8. Wednesday-workingHours.pcap_ISCX.csv

Processed Data Files:
1. Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
2. Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
3. Friday-WorkingHours-Morning.pcap_ISCX.csv
4. Monday-WorkingHours.pcap_ISCX.csv
5. Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
6. Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
7. Tuesday-WorkingHours.pcap_ISCX.csv
8. Wednesday-workingHours.pcap_ISCX.csv

Total files in each directory: 8 vs 8
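
Both listings contain the same eight file names, which the positional comparison in Section 3 relies on. A small sketch of checking that assumption by name instead of by position follows; `match_file_names` is a helper introduced here for illustration.

```python
def match_file_names(raw_names, processed_names):
    """Split two file-name lists into shared names and names unique to each side."""
    raw_set, processed_set = set(raw_names), set(processed_names)
    return {
        "common": sorted(raw_set & processed_set),
        "raw_only": sorted(raw_set - processed_set),
        "processed_only": sorted(processed_set - raw_set),
    }

# Hypothetical listings; in the notebook these would be raw_files and processed_files.
result = match_file_names(
    ["Monday-WorkingHours.pcap_ISCX.csv", "Tuesday-WorkingHours.pcap_ISCX.csv"],
    ["Tuesday-WorkingHours.pcap_ISCX.csv", "Monday-WorkingHours.pcap_ISCX.csv"],
)
print(result)
```

If `raw_only` or `processed_only` is non-empty, pairing the two lists by index would compare unrelated files.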


## 3. File Size and Record Count Analysis


In [3]:
def get_file_info(file_path, file_name):
    """Get file information including record count and size"""
    try:
        # Get file size
        file_size = os.path.getsize(file_path) / (1024 * 1024)  # MB
        
        # Count lines minus the header; blank lines, if any, are included.
        # Binary mode avoids decoding errors on non-UTF-8 bytes.
        with open(file_path, 'rb') as f:
            line_count = sum(1 for _ in f) - 1
        
        return {
            'file_name': file_name,
            'size_mb': round(file_size, 2),
            'records': line_count
        }
    except Exception as e:
        return {
            'file_name': file_name,
            'size_mb': 0,
            'records': 0,
            'error': str(e)
        }

# Analyze raw data files
print("=== RAW DATA ANALYSIS ===")
raw_info = []
for file in raw_files:
    info = get_file_info(os.path.join(raw_data_path, file), file)
    raw_info.append(info)
    print(f"{file}: {info['records']:,} records, {info['size_mb']} MB")

print("\n=== PROCESSED DATA ANALYSIS ===")
processed_info = []
for file in processed_files:
    info = get_file_info(os.path.join(processed_data_path, file), file)
    processed_info.append(info)
    print(f"{file}: {info['records']:,} records, {info['size_mb']} MB")

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'File': [f.replace('.pcap_ISCX.csv', '') for f in raw_files],
    'Raw_Records': [info['records'] for info in raw_info],
    'Processed_Records': [info['records'] for info in processed_info],
    'Raw_Size_MB': [info['size_mb'] for info in raw_info],
    'Processed_Size_MB': [info['size_mb'] for info in processed_info]
})

comparison_df['Record_Diff'] = comparison_df['Processed_Records'] - comparison_df['Raw_Records']
comparison_df['Size_Diff_MB'] = comparison_df['Processed_Size_MB'] - comparison_df['Raw_Size_MB']

print("\n=== COMPARISON SUMMARY ===")
print(comparison_df.to_string(index=False))


=== RAW DATA ANALYSIS ===
Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv: 225,745 records, 91.65 MB
Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv: 286,467 records, 97.16 MB
Friday-WorkingHours-Morning.pcap_ISCX.csv: 191,033 records, 71.89 MB
Monday-WorkingHours.pcap_ISCX.csv: 529,918 records, 256.2 MB
Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv: 288,602 records, 103.69 MB
Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv: 458,968 records, 87.77 MB
Tuesday-WorkingHours.pcap_ISCX.csv: 445,909 records, 166.6 MB
Wednesday-workingHours.pcap_ISCX.csv: 692,703 records, 272.41 MB

=== PROCESSED DATA ANALYSIS ===
Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv: 225,745 records, 73.55 MB
Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv: 286,467 records, 73.34 MB
Friday-WorkingHours-Morning.pcap_ISCX.csv: 191,033 records, 55.62 MB
Monday-WorkingHours.pcap_ISCX.csv: 529,918 records, 168.73 MB
Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv: 288,602 re
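
The raw Thursday morning file reports 458,968 lines here, whereas the processed version of the same file parses to 170,366 rows in Section 5. One possible explanation, not verified in this notebook, is that a plain line count also includes blank or delimiter-only lines. The sketch below reports both numbers so the two can be compared; `count_lines_and_rows` is a hypothetical helper, and the temporary file exists only for the demonstration.

```python
import os
import tempfile

def count_lines_and_rows(path):
    """Return (physical_lines, data_rows) for a CSV file, header excluded.

    data_rows skips lines that are empty or hold only commas and whitespace.
    """
    physical_lines = 0
    data_rows = 0
    with open(path, "rb") as f:      # bytes: no decoding errors on odd characters
        next(f, None)                # skip the header line
        for line in f:
            physical_lines += 1
            if line.strip(b", \t\r\n"):
                data_rows += 1
    return physical_lines, data_rows

# Demonstration on a small temporary file with one delimiter-only and one blank line.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"a,b,Label\n1,2,BENIGN\n,,\n\n3,4,DDoS\n")
counts = count_lines_and_rows(tmp.name)
os.remove(tmp.name)
print(counts)
```

A gap between the two numbers for a real file would point to rows that carry no data.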

## 4. Data Structure Comparison


In [4]:
# Read sample data from both raw and processed datasets
raw_sample = pd.read_csv(os.path.join(raw_data_path, raw_files[0]), nrows=5)
processed_sample = pd.read_csv(os.path.join(processed_data_path, processed_files[0]), nrows=5)

print("=== RAW DATA STRUCTURE ===")
print(f"Shape: {raw_sample.shape}")
print(f"Columns: {len(raw_sample.columns)}")
print("\nFirst 10 columns:")
for i, col in enumerate(raw_sample.columns[:10], 1):
    print(f"{i:2d}. {repr(col)}")

print("\n=== PROCESSED DATA STRUCTURE ===")
print(f"Shape: {processed_sample.shape}")
print(f"Columns: {len(processed_sample.columns)}")
print("\nFirst 10 columns:")
for i, col in enumerate(processed_sample.columns[:10], 1):
    print(f"{i:2d}. {repr(col)}")

print("\n=== REMOVED COLUMNS ===")
raw_cols = set(raw_sample.columns)
processed_cols = set(processed_sample.columns)
removed_cols = raw_cols - processed_cols
print(f"Removed columns: {len(removed_cols)}")
for col in sorted(removed_cols):
    print(f"- {repr(col)}")


=== RAW DATA STRUCTURE ===
Shape: (5, 85)
Columns: 85

First 10 columns:
 1. 'Flow ID'
 2. ' Source IP'
 3. ' Source Port'
 4. ' Destination IP'
 5. ' Destination Port'
 6. ' Protocol'
 7. ' Timestamp'
 8. ' Flow Duration'
 9. ' Total Fwd Packets'
10. ' Total Backward Packets'

=== PROCESSED DATA STRUCTURE ===
Shape: (5, 79)
Columns: 79

First 10 columns:
 1. ' Destination Port'
 2. ' Flow Duration'
 3. ' Total Fwd Packets'
 4. ' Total Backward Packets'
 5. 'Total Length of Fwd Packets'
 6. ' Total Length of Bwd Packets'
 7. ' Fwd Packet Length Max'
 8. ' Fwd Packet Length Min'
 9. ' Fwd Packet Length Mean'
10. ' Fwd Packet Length Std'

=== REMOVED COLUMNS ===
Removed columns: 6
- ' Destination IP'
- ' Protocol'
- ' Source IP'
- ' Source Port'
- ' Timestamp'
- 'Flow ID'
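
The `repr` output above shows that most headers carry a leading space (for example `' Source IP'`), so a lookup such as `df['Label']` would fail unless the exact padded name is used. A minimal sketch of normalizing the headers after loading follows; `strip_column_names` is a helper introduced here, not part of pandas.

```python
import pandas as pd

def strip_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column names have surrounding whitespace removed."""
    cleaned = df.copy()
    cleaned.columns = cleaned.columns.str.strip()
    # Stripping could make two headers collide; fail loudly instead of guessing.
    if cleaned.columns.duplicated().any():
        dupes = cleaned.columns[cleaned.columns.duplicated()].tolist()
        raise ValueError(f"Duplicate column names after stripping: {dupes}")
    return cleaned

# Hypothetical one-row frame using header spellings from the output above.
demo = pd.DataFrame({" Destination Port": [80], " Flow Duration": [1200], " Label": ["BENIGN"]})
cleaned = strip_column_names(demo)
print(list(cleaned.columns))
```

Returning a copy leaves the original frame untouched, so set comparisons on the raw header names keep working.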


## 5. Attack Type Distribution Analysis


In [5]:
def analyze_attack_distribution(file_path, file_name):
    """Analyze attack type distribution in a file"""
    try:
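        # Read the file; cp1252 is an assumed encoding, chosen so that the
        # non-ASCII dash in the Web Attack labels decodes cleanly
        df = pd.read_csv(file_path, encoding='cp1252')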
        
        # Get label column (last column)
        label_col = df.columns[-1]
        
        # Count attack types
        attack_counts = df[label_col].value_counts()
        
        return {
            'file_name': file_name,
            'total_records': len(df),
            'attack_distribution': attack_counts.to_dict()
        }
    except Exception as e:
        return {
            'file_name': file_name,
            'total_records': 0,
            'attack_distribution': {},
            'error': str(e)
        }

# Analyze attack distribution in processed data
print("=== ATTACK TYPE DISTRIBUTION ===")
attack_analysis = []

for file in processed_files:
    file_path = os.path.join(processed_data_path, file)
    analysis = analyze_attack_distribution(file_path, file)
    attack_analysis.append(analysis)
    
    print(f"\n{file}:")
    if 'error' in analysis:
        # Report read failures instead of printing an empty distribution
        print(f"  Error: {analysis['error']}")
        continue
    print(f"Total records: {analysis['total_records']:,}")
    for attack_type, count in analysis['attack_distribution'].items():
        percentage = (count / analysis['total_records']) * 100
        print(f"  {attack_type}: {count:,} ({percentage:.2f}%)")

# Calculate total distribution
print("\n=== TOTAL DISTRIBUTION ACROSS ALL FILES ===")
total_distribution = {}
total_records = 0

for analysis in attack_analysis:
    total_records += analysis['total_records']
    for attack_type, count in analysis['attack_distribution'].items():
        total_distribution[attack_type] = total_distribution.get(attack_type, 0) + count

for attack_type, count in sorted(total_distribution.items()):
    percentage = (count / total_records) * 100
    print(f"{attack_type}: {count:,} ({percentage:.2f}%)")

print(f"\nTotal records: {total_records:,}")


=== ATTACK TYPE DISTRIBUTION ===

Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv:
Total records: 225,745
  DDoS: 128,027 (56.71%)
  BENIGN: 97,718 (43.29%)

Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv:
Total records: 286,467
  PortScan: 158,930 (55.48%)
  BENIGN: 127,537 (44.52%)

Friday-WorkingHours-Morning.pcap_ISCX.csv:
Total records: 191,033
  BENIGN: 189,067 (98.97%)
  Bot: 1,966 (1.03%)

Monday-WorkingHours.pcap_ISCX.csv:
Total records: 529,918
  BENIGN: 529,918 (100.00%)

Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv:
Total records: 288,602
  BENIGN: 288,566 (99.99%)
  Infiltration: 36 (0.01%)

Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv:
Total records: 170,366
  BENIGN: 168,186 (98.72%)
  Web Attack – Brute Force: 1,507 (0.88%)
  Web Attack – XSS: 652 (0.38%)
  Web Attack – Sql Injection: 21 (0.01%)

Tuesday-WorkingHours.pcap_ISCX.csv:
Total records: 445,909
  BENIGN: 432,074 (96.90%)
  FTP-Patator: 7,938 (1.78%)
  SSH-Patator: 5,897 (1.32%)
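
The Web Attack labels above contain a non-ASCII separator whose rendering depends on how the file is decoded, which makes exact string matching fragile. A minimal sketch of normalizing labels before grouping or encoding them follows; `normalize_label` is a helper introduced here, and mapping the separator to a plain hyphen is an assumed convention, not part of the dataset.

```python
import re

def normalize_label(label: str) -> str:
    """Map a raw label to a form with a plain hyphen separator and tidy spacing."""
    # U+FFFD is the replacement character left by a lossy decode;
    # U+2013 is an en dash. Treating both as the separator is an assumption.
    text = label.replace("\ufffd", "-").replace("\u2013", "-")
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

samples = ["Web Attack \ufffd XSS", "Web Attack \u2013 Sql Injection", " BENIGN "]
normalized = [normalize_label(s) for s in samples]
print(normalized)
```

With a loaded frame, the same function could be applied as `df[label_col].map(normalize_label)`.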


## 6. Data Quality Analysis


In [6]:
def analyze_data_quality(file_path, file_name, sample_size=1000):
    """Analyze data quality issues in the first `sample_size` rows of a file"""
    try:
        # Read only the first sample_size rows (not a random sample)
        df = pd.read_csv(file_path, nrows=sample_size)
        
        # Check missing values
        missing_values = df.isnull().sum()
        total_missing = missing_values.sum()
        
        # Check -1 values (common placeholder)
        minus_one_count = (df == -1).sum().sum()
        
        # Check infinite values
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        inf_count = 0
        for col in numeric_cols:
            inf_count += np.isinf(df[col]).sum()
        
        # Check for constant columns
        constant_cols = []
        for col in df.columns:
            if df[col].nunique() == 1:
                constant_cols.append(col)
        
        # Check for duplicate rows
        duplicates = df.duplicated().sum()
        
        return {
            'file_name': file_name,
            'sample_size': len(df),
            'total_missing': total_missing,
            'minus_one_count': minus_one_count,
            'inf_count': inf_count,
            'constant_cols': len(constant_cols),
            'duplicates': duplicates,
            'missing_by_column': missing_values[missing_values > 0].to_dict()
        }
    except Exception as e:
        return {
            'file_name': file_name,
            'error': str(e)
        }

# Analyze data quality for all files
print("=== DATA QUALITY ANALYSIS ===")
quality_analysis = []

for file in processed_files:
    file_path = os.path.join(processed_data_path, file)
    analysis = analyze_data_quality(file_path, file)
    quality_analysis.append(analysis)
    
    print(f"\n{file}:")
    if 'error' in analysis:
        print(f"  Error: {analysis['error']}")
    else:
        print(f"  Sample size: {analysis['sample_size']}")
        print(f"  Missing values: {analysis['total_missing']}")
        print(f"  -1 values: {analysis['minus_one_count']}")
        print(f"  Infinite values: {analysis['inf_count']}")
        print(f"  Constant columns: {analysis['constant_cols']}")
        print(f"  Duplicate rows: {analysis['duplicates']}")
        
        if analysis['missing_by_column']:
            print(f"  Missing by column:")
            for col, count in analysis['missing_by_column'].items():
                print(f"    {col}: {count}")

# Calculate totals across the per-file samples (1,000 rows each), not the full dataset
print("\n=== SUMMARY STATISTICS ===")
total_missing = sum(analysis.get('total_missing', 0) for analysis in quality_analysis)
total_minus_one = sum(analysis.get('minus_one_count', 0) for analysis in quality_analysis)
total_inf = sum(analysis.get('inf_count', 0) for analysis in quality_analysis)
total_constant = sum(analysis.get('constant_cols', 0) for analysis in quality_analysis)
total_duplicates = sum(analysis.get('duplicates', 0) for analysis in quality_analysis)

print(f"Total missing values: {total_missing:,}")
print(f"Total -1 values: {total_minus_one:,}")
print(f"Total infinite values: {total_inf:,}")
print(f"Total constant columns: {total_constant}")
print(f"Total duplicate rows: {total_duplicates:,}")


=== DATA QUALITY ANALYSIS ===

Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv:
  Sample size: 1000
  Missing values: 0
  -1 values: 996
  Infinite values: 2
  Constant columns: 13
  Duplicate rows: 0

Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv:
  Sample size: 1000
  Missing values: 0
  -1 values: 1277
  Infinite values: 2
  Constant columns: 13
  Duplicate rows: 11

Friday-WorkingHours-Morning.pcap_ISCX.csv:
  Sample size: 1000
  Missing values: 0
  -1 values: 1146
  Infinite values: 2
  Constant columns: 13
  Duplicate rows: 11

Monday-WorkingHours.pcap_ISCX.csv:
  Sample size: 1000
  Missing values: 0
  -1 values: 771
  Infinite values: 20
  Constant columns: 11
  Duplicate rows: 51

Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv:
  Sample size: 1000
  Missing values: 0
  -1 values: 1132
  Infinite values: 0
  Constant columns: 13
  Duplicate rows: 2

Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv:
  Sample size: 1000
  Missing values: 0
  -1 values: 