# Malicious Network Traffic Detection in IoT Networks
This project develops a machine learning model to automatically detect malicious network traffic patterns in real-time IoT environments.

## Project Overview
This notebook trains a Random Forest classifier on the RT-IoT2022 dataset to classify network traffic in IoT (Internet of Things) environments, separating normal device behavior from malicious activity.

## Business Context
The rapid growth of IoT devices has created new security challenges for organizations:
- IoT networks generate massive volumes of traffic data
- Traditional manual analysis is no longer feasible
- Attackers increasingly target IoT devices for botnets and data theft
- Real-time detection of threats is crucial for network security

## Technical Approach
We implement a supervised machine learning solution using:
- Random Forest Classification
- Feature engineering for network traffic data
- Label encoding for categorical variables
- Performance evaluation with multiple metrics
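
To make the label-encoding step concrete, here is a minimal pure-Python sketch of the idea: each distinct category is assigned an integer in sorted order, which is also how scikit-learn's `LabelEncoder` orders its `classes_`. The helper names and the sample protocol values are hypothetical and only serve as illustration.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace every value with its integer code."""
    return [mapping[v] for v in values]


# Hypothetical protocol values, for illustration only
protocols = ["tcp", "udp", "tcp", "icmp"]

proto_mapping = fit_label_mapping(protocols)
encoded = encode(protocols, proto_mapping)

print(proto_mapping)  # {'icmp': 0, 'tcp': 1, 'udp': 2}
print(encoded)        # [1, 2, 1, 0]
```

The integer codes carry no real ordering. Tree-based models such as Random Forest tolerate this reasonably well, which is one reason the simple encoding is usable here.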

## Dataset Description
The RT-IoT2022 dataset from the UCI Machine Learning Repository includes:
- Size: 123,117 network traffic samples
- Features: 83 network traffic characteristics (85 CSV columns once the index column and the `Attack_type` label are counted)
- Types: Mix of normal and attack traffic patterns
- Source: Real-world IoT infrastructure data
- Categories: Various attack types including DDoS, scanning, and malware

Key Features Include:
- Protocol types
- Service types
- Flow duration
- Packet statistics
- Network behavior patterns

## Success Metrics
The model's performance will be evaluated against these criteria:

1. Accuracy Metrics:
   - Minimum accuracy: 80% on test set
   - Minimum precision: 80% (reduce false positives)
   - Minimum recall: 80% (detect most attacks)
   - Confidence threshold: 70% (minimum predicted probability for a prediction to count as confident)

2. Operational Requirements:
   - Process streaming data efficiently
   - Provide clear attack classifications
   - Generate confidence scores for predictions
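
The accuracy criteria above can be checked mechanically once evaluation results are available. The sketch below is one possible way to do that, assuming a binary view in which "malicious" is the positive class; the function names and the example counts are hypothetical. For the multi-class model trained in this notebook, macro- or weighted-average scores from `classification_report` could be compared against the same thresholds instead.

```python
# Minimum values taken from the success criteria above
THRESHOLDS = {"accuracy": 0.80, "precision": 0.80, "recall": 0.80}


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def check_success(metrics, thresholds=THRESHOLDS):
    """Return, per metric, whether it meets its minimum threshold."""
    return {name: metrics[name] >= minimum for name, minimum in thresholds.items()}


# Hypothetical counts, for illustration only
metrics = binary_metrics(tp=90, fp=10, tn=880, fn=20)
results = check_success(metrics)

for name, passed in results.items():
    print(f"{name}: {metrics[name]:.3f} ({'pass' if passed else 'fail'})")
```

Precision penalizes false alarms and recall penalizes missed attacks, so the two thresholds guard against different failure modes and should both be checked even when accuracy looks high.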

## Implementation Steps
1. Data preprocessing and exploration
2. Feature engineering and selection
3. Model training using Random Forest
4. Performance evaluation
5. Feature importance analysis

## Expected Outputs
For each network traffic sample, the model will provide:
- Binary classification (Malicious/Benign), derived from the predicted `Attack_type` class
- Confidence score (0-100%)
- Feature importance insights
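
Because the classifier in this notebook is trained on the multi-class `Attack_type` label, one way to obtain the binary verdict and confidence score listed above is to sum the predicted probabilities of all attack classes. The sketch below illustrates that idea under stated assumptions: the set of benign class names, the 0.5 decision cut-off, and the example probabilities are all hypothetical, and the benign set should be verified against `data['Attack_type'].value_counts()` before use.

```python
CONFIDENCE_THRESHOLD = 0.70  # confidence threshold from the success metrics

# Assumption: class names treated as benign traffic; verify against the dataset
BENIGN_CLASSES = {"MQTT_Publish", "Thing_Speak", "Wipro_bulb"}


def summarize_prediction(class_names, probabilities,
                         benign=BENIGN_CLASSES, threshold=CONFIDENCE_THRESHOLD):
    """Collapse per-class probabilities into a Malicious/Benign verdict."""
    # Probability mass assigned to any class that is not in the benign set
    malicious_prob = sum(
        p for name, p in zip(class_names, probabilities) if name not in benign
    )
    is_malicious = malicious_prob >= 0.5
    confidence = malicious_prob if is_malicious else 1.0 - malicious_prob
    return {
        "label": "Malicious" if is_malicious else "Benign",
        "confidence_pct": round(confidence * 100, 1),
        "confident": confidence >= threshold,
    }


# Hypothetical class names and probabilities for a single sample
names = ["DOS_SYN_Hping", "MQTT_Publish", "NMAP_UDP_SCAN"]
summary = summarize_prediction(names, [0.80, 0.15, 0.05])
print(summary)
```

With the trained model, the probabilities would come from `model.predict_proba(...)`, and the matching class names from `attack_encoder.inverse_transform(model.classes_)`.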

## Usage Notes
- The notebook requires pandas, NumPy, scikit-learn, Matplotlib, and seaborn
- Designed for use with the RT-IoT2022 dataset
- Can be adapted for similar network traffic analysis tasks


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

## 1. Load the Dataset

In [None]:
# Load the dataset
data = pd.read_csv('RT_IOT2022.csv')
print("Dataset shape:", data.shape)

## 2. Data Exploration

In [None]:
# Display the first few rows
print(data.head())

# Summary statistics
print("\nSummary statistics:")
print(data.describe())

# Check for missing values
print("\nMissing values:")
print(data.isnull().sum())

## 3. Encode Categorical Data

In [None]:
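# 1. Reload the data so this cell always starts from the raw, unencoded columns
#    (re-running it on already-encoded data would print numbers instead of names)
data = pd.read_csv('RT_IOT2022.csv')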

# 2. Examine original data
print("Sample rows with full context:")
print(data[['proto', 'service', 'Attack_type']].head(10))

print("\nData type of Attack_type column:", data['Attack_type'].dtype)

print("\nOriginal Attack Types and their frequencies:")
print(data['Attack_type'].value_counts())

# 3. Create encoders
attack_encoder = LabelEncoder()
proto_encoder = LabelEncoder()
service_encoder = LabelEncoder()

# 4. Create and store mappings
# Attack types mapping
attack_encoder.fit(data['Attack_type'])
print("\nMapping of numbers to attack types:")
for i, attack_type in enumerate(attack_encoder.classes_):
    print(f"{i}: {attack_type}")

# Protocol and service types
print("\nProtocol types:")
print(data['proto'].unique())
print("\nService types:")
print(data['service'].unique())

# 5. Transform the data
data['Attack_type'] = attack_encoder.transform(data['Attack_type'])
data['proto'] = proto_encoder.fit_transform(data['proto'])
data['service'] = service_encoder.fit_transform(data['service'])

## 4. Prepare Data for Training

In [None]:
# Prepare features and target
X = data.drop(['Attack_type', 'Unnamed: 0'], axis=1)
y = data['Attack_type']

# Split the data; stratify so rare attack types keep the same proportion in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

## 5. Train the Model

In [None]:
# Initialize the RandomForestClassifier
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

# Train the model
model.fit(X_train, y_train)
print("Model training completed!")

## 6. Evaluate the Model

In [None]:
# Make predictions
y_pred = model.predict(X_test)

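# Print metrics, labeling each class with its original attack-type name
class_ids = np.arange(len(attack_encoder.classes_))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, labels=class_ids,
                            target_names=attack_encoder.classes_, zero_division=0))

# Plot confusion matrix with attack-type names on both axes
cm = confusion_matrix(y_test, y_pred, labels=class_ids)
plt.figure(figsize=(12, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=attack_encoder.classes_, yticklabels=attack_encoder.classes_)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()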

## 7. Feature Importance

In [None]:
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
})

# Sort by importance
feature_importance = feature_importance.sort_values('importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance.head(10))
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.show()

## Production Deployment (Optional)
An optional next step is to deploy this model into a production network environment. This involves four key actions:

1. **Upload the Model to an S3 Bucket**: 
   - Save your trained machine learning model (as a .joblib or .pickle file) 
   - Package the file as a `model.tar.gz` archive, the artifact format SageMaker expects
   - Upload this archive to an Amazon S3 bucket

2. **Create a SageMaker Model**: 
   - Create an AWS SageMaker model that references the saved model file in your S3 bucket
   - Specify the path to the model in S3
   - Select the appropriate Docker container image for your model type

3. **Deploy the Model to an Endpoint**: 
   - Deploy the SageMaker model to an endpoint
   - This endpoint will be used for making real-time predictions

4. **Send Real-Time Traffic for Prediction**: 
   - Route your real-time production network traffic to the SageMaker endpoint
   - The endpoint will use your deployed model to predict the attack types
   - Results will be returned in real-time
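
As a rough illustration of step 1, the sketch below serializes a model with `pickle` and wraps it in a `model.tar.gz` archive, the artifact format SageMaker expects. The function name, file names, and output directory are hypothetical, and a plain dictionary stands in for the trained classifier so the example runs without the dataset.

```python
import pickle
import tarfile
import tempfile
from pathlib import Path


def package_model(model, out_dir, model_filename="model.pkl"):
    """Pickle the model and wrap it in a model.tar.gz archive."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Serialize the model object to disk
    model_path = out_dir / model_filename
    with open(model_path, "wb") as f:
        pickle.dump(model, f)

    # Store the file at the archive root, without the local directory prefix
    archive_path = out_dir / "model.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_path, arcname=model_filename)
    return archive_path


# Stand-in object for illustration; in the notebook this would be `model`
stand_in_model = {"note": "placeholder for the trained RandomForestClassifier"}
archive = package_model(stand_in_model, tempfile.mkdtemp())
print("Created:", archive.name)
```

The resulting archive could then be uploaded with boto3, for example `boto3.client("s3").upload_file(str(archive), bucket_name, object_key)`, where `bucket_name` and `object_key` are placeholders for your own values. The file name the inference container looks for inside the archive depends on the container and its inference script, so `model.pkl` is only an assumption.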

This deployment process will enable the model to analyze network traffic in real-time and identify potential malicious activities, enhancing your network's security posture.
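
For step 4, a request to a deployed endpoint might look like the sketch below. It assumes the endpoint accepts one sample as a `text/csv` line with features in the same order as the training columns, which depends on the container and inference script in use. The endpoint name and feature values are hypothetical, and `StubRuntimeClient` is an offline stand-in so the example runs without AWS credentials; in practice the client would be `boto3.client("sagemaker-runtime")`.

```python
import io


def row_to_csv(feature_values):
    """Serialize one sample's feature values as a single CSV line."""
    return ",".join(str(v) for v in feature_values)


def predict_via_endpoint(runtime_client, endpoint_name, feature_values):
    """Send one sample to a SageMaker endpoint and return the raw response text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=row_to_csv(feature_values).encode("utf-8"),
    )
    # The response body is a stream; its format depends on the container
    return response["Body"].read().decode("utf-8")


class StubRuntimeClient:
    """Offline stand-in for the SageMaker runtime client; returns a fixed reply."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        self.last_request = {
            "EndpointName": EndpointName,
            "ContentType": ContentType,
            "Body": Body,
        }
        return {"Body": io.BytesIO(b"3")}


client = StubRuntimeClient()
result = predict_via_endpoint(client, "iot-traffic-detector", [6, 80, 0.25])
print("Endpoint response:", result)
```

If the endpoint returns an encoded class number, as the stub does here, it can be mapped back to an attack-type name with `attack_encoder.inverse_transform`.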
