# Week 1: Dataset Acquisition and Exploration

**Project:** SentinelNet – AI Powered Network Intrusion Detection System  
**Dataset:** NSL-KDD  
**Contributor:** Sowbharanika S R  

## Objectives
- Load NSL-KDD dataset
- Explore dataset structure
- Perform statistical analysis
- Check for missing and duplicate values
- Analyze attack class distribution

```python
import os

import pandas as pd

# NSL-KDD files have no header row: 41 features, then the label and a difficulty score.
COLUMNS = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
    "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "class", "difficulty_level"
]

DATA_DIR = "data"

train_path = os.path.join(DATA_DIR, "KDDTrain+.txt")
test_path = os.path.join(DATA_DIR, "KDDTest+.txt")

# Passing names=COLUMNS makes pandas read the first line as data, not as a header.
train_df = pd.read_csv(train_path, names=COLUMNS)
test_df = pd.read_csv(test_path, names=COLUMNS)

print("Data loaded successfully!")

print("Training data shape:", train_df.shape)
print("Testing data shape:", test_df.shape)

# Explicit print() so the output appears even when this is not the last line of a cell.
print("\nFirst 5 rows of training data:")
print(train_df.head())

print("\n--- Training Data Info ---")
train_df.info()  # info() prints directly and returns None
print(train_df.describe())

# Total number of missing cells across all columns.
print("Missing values in training data:")
print(train_df.isnull().sum().sum())

print("\nMissing values in testing data:")
print(test_df.isnull().sum().sum())

# Rows that exactly repeat an earlier row.
print("Duplicate rows in training data:",
      train_df.duplicated().sum())

print("Duplicate rows in testing data:",
      test_df.duplicated().sum())

print("Unique classes:")
print(train_df["class"].unique())

print("\nClass distribution:")
print(train_df["class"].value_counts())
```
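
Raw counts alone can hide how imbalanced the classes are. The sketch below is one possible way to extend the class distribution step with percentage shares and a binary normal-versus-attack view. The helper `class_distribution` and the toy labels are hypothetical and stand in for `train_df["class"]`; they are not part of the original notebook.

```python
import pandas as pd


def class_distribution(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and percentage share, largest class first."""
    counts = labels.value_counts()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy labels standing in for train_df["class"] (hypothetical values).
toy_labels = pd.Series(
    ["normal", "neptune", "normal", "smurf", "normal", "neptune", "normal", "normal"]
)

dist = class_distribution(toy_labels)
print(dist)

# Binary view: every label other than "normal" is treated as an attack.
is_attack = toy_labels.ne("normal")
print("Attack share: {:.1%}".format(is_attack.mean()))
```

On the real data, the same call would be `class_distribution(train_df["class"])`, assuming the `class` column holds the string labels loaded above.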


## Week 1 Summary
- Dataset successfully loaded
- Training samples: 125,973
- Testing samples: 22,544
- No missing or duplicate values found
- 23 distinct traffic classes identified in the training data (`normal` plus 22 attack types)
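
The figures in this summary can be gathered in one place rather than read off separate print statements. The following is a minimal sketch of that idea, assuming a label column named `class` as in the notebook. The helper `summarize_split` and the toy frame are hypothetical, used only so the example runs without the dataset files.

```python
import pandas as pd


def summarize_split(df: pd.DataFrame, label_col: str = "class") -> dict:
    """Collect the basic quality checks for one dataset split."""
    return {
        "rows": len(df),
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "n_classes": int(df[label_col].nunique()),
    }


# Toy frame standing in for train_df (hypothetical values);
# the last row repeats the first, so one duplicate is expected.
toy_df = pd.DataFrame({
    "duration": [0, 0, 2, 0],
    "protocol_type": ["tcp", "udp", "tcp", "tcp"],
    "class": ["normal", "normal", "neptune", "normal"],
})

summary = summarize_split(toy_df)
print(summary)
```

Calling `summarize_split(train_df)` and `summarize_split(test_df)` in the notebook should then give the row counts, missing-value totals, duplicate counts, and class counts for each split, which can be compared against the numbers listed above.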
