# Closed-World Experiment
- Objective
  
  Classify each traffic instance as one of the 95 monitored websites.
- Data Used

  **mon_standard.pkl**: data from "monitored" websites.
   - Class count: 95
   - Instance count: 19,000 (95 websites, each with 10 subpages which are non-index pages, observed 20 times each)
- Features

  - X1_mon : the sequence of **packet timestamps**
  - X2_mon : the sequence of **packet size * direction**
  - X3_mon : the sequence of **cumulative packet sizes**
  - X4_mon : the sequence of **bursts**

- Steps
  1. Setup
      - Upload the data
      - Set the file paths
  2. Data Preprocessing
      - Feature Extraction
      - Sequence Length Normalization
  3. Random Forest Model Experiments
      - X1_mon
      - X1_mon + X2_mon
      - X1_mon + X2_mon + X3_mon
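
To make the four feature definitions listed under Features concrete, here is a minimal sketch that derives them from a made-up six-packet trace. The trace values and the names `toy_trace` and `PACKET_SIZE` are illustrative assumptions; the signed-timestamp encoding and the fixed size of 512 follow the extraction code later in this notebook.

```python
import itertools

# Toy trace in the dataset's raw encoding: each value is a timestamp
# whose sign carries the packet direction (values are made up).
toy_trace = [0.1, -0.2, -0.3, 0.5, 0.6, -0.9]

PACKET_SIZE = 512  # fixed size used in place of real packet sizes

x1 = [abs(c) for c in toy_trace]                                  # timestamps
x2 = [PACKET_SIZE if c > 0 else -PACKET_SIZE for c in toy_trace]  # direction * size
x3 = list(itertools.accumulate(x2))                               # cumulative sizes
# Bursts: sum of each run of consecutive packets that share a direction.
x4 = [sum(run) for _, run in itertools.groupby(x2, key=lambda v: v > 0)]

print(x1)  # [0.1, 0.2, 0.3, 0.5, 0.6, 0.9]
print(x2)  # [512, -512, -512, 512, 512, -512]
print(x3)  # [512, 0, -512, 0, 512, 0]
print(x4)  # [512, -1024, 1024, -512]
```

Note that X4 has one entry per burst rather than one per packet, so it is shorter than X1 through X3.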

## Setup
1. Upload the data
    - mon_standard.pkl
    - X3_mon_data.pkl
    - X4_mon_data.pkl
2. Set the file paths
    - Assign the directory where the data was uploaded to the `path` variable


In [None]:
# Set the file paths
path = ''
mon_file = f'{path}/mon_standard.pkl'
unmon_file = f'{path}/unmon_standard10.pkl'
X3_mon_file = f'{path}/X3_mon_data.pkl'
X4_mon_file = f'{path}/X4_mon_data.pkl'

## Data Preprocessing
1. Feature Extraction
    - Extract X1_mon, X2_mon from mon_standard.pkl
    - Extract X3_mon, X4_mon using X2_mon
2. Sequence Length Normalization
    - X1 ~ X3 Sequence Length Normalization


### Feature Extraction



#### Extract X1_mon, X2_mon from mon_standard.pkl


In [4]:
import pickle

USE_SUBLABEL = False
URL_PER_SITE = 10
TOTAL_URLS_MON = 950

# Load the pickle file
print("Loading datafile...")
with open(mon_file, 'rb') as fi: # Path to mon_standard.pkl in Colab
    data = pickle.load(fi)

X1_mon = [] # Array to store instances (timestamps) - 19,000 instances, e.g., [[0.0, 0.5, 3.4, ...], [0.0, 4.5, ...], [0.0, 1.5, ...], ... [... ,45.8]]
X2_mon = [] # Array to store instances (direction*size) - size information
y_mon = [] # Array to store the site of each instance - 19,000 instances, e.g., [0, 0, 0, 0, 0, 0, ..., 94, 94, 94, 94, 94]

# Each raw value in the pickle is direction * timestamp: its sign gives the packet direction.
# Split it into X1 (timestamps) and X2 (direction * 512), and store the site label in y.
for i in range(TOTAL_URLS_MON):
    if USE_SUBLABEL:
        label = i
    else:
        label = i // URL_PER_SITE # Calculate which site's URL the current URL being processed belongs to and set that value as the label. Thus, URLs fetched from the same site are labeled identically.
    for sample in data[i]:
        size_seq = []
        time_seq = []
        for c in sample:
            dr = 1 if c > 0 else -1
            time_seq.append(abs(c))
            size_seq.append(dr * 512)
        X1_mon.append(time_seq)
        X2_mon.append(size_seq)
        y_mon.append(label)
size = len(y_mon)

print(f'Total samples: {size}') # Output: 19000


Loading datafile...
Total samples: 19000
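
The label arithmetic in the cell above can be checked without loading the data. This sketch assumes exactly 20 observations per URL, as stated in the dataset description; `SAMPLES_PER_URL` is a name introduced here for illustration.

```python
from collections import Counter

URL_PER_SITE = 10      # subpages per site (from the cell above)
TOTAL_URLS_MON = 950   # 95 sites * 10 subpages
SAMPLES_PER_URL = 20   # assumption taken from the dataset description

# Site label for every URL index, repeated once per observation.
labels = [
    i // URL_PER_SITE
    for i in range(TOTAL_URLS_MON)
    for _ in range(SAMPLES_PER_URL)
]

per_site = Counter(labels)
print(len(labels))             # 19000
print(len(per_site))           # 95
print(set(per_site.values()))  # {200}
```

Each site therefore contributes 200 instances when `USE_SUBLABEL` is `False`.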


#### Check what X1_mon and X2_mon look like

In [5]:
def inspect_data(X, num_samples=5):
    print(f"Total number of samples: {len(X)}")
    print(f"Length of each sample (first {num_samples} samples): {[len(sample) for sample in X[:num_samples]]}")
    print(f"First {num_samples} samples:")
    for i, sample in enumerate(X[:num_samples]):
        print(f"Sample {i + 1}: {sample}")

print("Inspecting X1 (timestamps):")
inspect_data(X1_mon)

print("\nInspecting X2 (direction * size):")
inspect_data(X2_mon)


Inspecting X1 (timestamps):
Total number of samples: 19000
Length of each sample (first 5 samples): [1421, 518, 1358, 1446, 1406]
First 5 samples:
Sample 1: [0.0, 0.14, 0.14, 0.31, 0.31, 0.51, 0.51, 0.51, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88, 0.89, 1.13, 1.13, 1.3, 1.3, 1.3, 1.3, 1.3, 1.47, 1.47, 1.58, 1.72, 1.72, 1.72, 1.97, 2.21, 2.21, 2.21, 2.38, 2.38, 2.38, 2.38, 2.38, 2.38, 2.38, 2.47, 2.47, 2.48, 2.68, 2.68, 2.68, 2.98, 2.98, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.05, 3.08, 3.08, 3.08, 3.08, 3.08, 3.08, 3.08, 3.08, 3.08, 3.08, 3.08, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.15, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.17, 3.19, 3.48, 3.48, 3.48, 3.48, 3.48, 3.

#### Extract X3_mon, X4_mon using X2_mon

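- X3_mon and X4_mon are derived from X2_mon as shown below.
- The results were saved as pkl files in advance and are loaded from there, so the extraction code below is disabled (wrapped in a string literal).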
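
The save-and-reload step is not shown in this notebook. A minimal sketch of how it might look, using a temporary directory and a small stand-in list (`X3_demo` and the file name `X3_demo.pkl` are hypothetical):

```python
import os
import pickle
import tempfile

# Small stand-in for the real feature lists (illustrative values only).
X3_demo = [[-512, -1024, -512], [512, 0]]

# Write to a temporary directory so the sketch does not touch real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, "X3_demo.pkl")  # hypothetical file name
    with open(demo_file, "wb") as f_out:
        pickle.dump(X3_demo, f_out)
    with open(demo_file, "rb") as f_in:
        X3_loaded = pickle.load(f_in)

print(X3_loaded == X3_demo)  # True
```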

In [6]:
'''
import numpy as np
from collections import Counter

# Function to calculate X3 (Cumulative Packet Sizes)
def compute_cumulative_sizes(X2):
    return [np.cumsum(seq).tolist() for seq in X2]

# Function to calculate X4 (Bursts)
def compute_bursts(X2):
    bursts = []
    for seq in X2:
        current_burst = 0
        burst_sequence = []
        for i, value in enumerate(seq):
            if i == 0 or np.sign(value) == np.sign(seq[i - 1]):
                current_burst += value
            else:
                burst_sequence.append(current_burst)
                current_burst = value
        burst_sequence.append(current_burst)  # Append the last burst
        bursts.append(burst_sequence)
    return bursts

# Extract X3 and X4
X3 = compute_cumulative_sizes(X2)
X4 = compute_bursts(X2)
'''


##### Load X3_mon, X4_mon

In [7]:
# Load X3 and X4
with open(X3_mon_file, 'rb') as f_x3:
    X3_mon = pickle.load(f_x3)

with open(X4_mon_file, 'rb') as f_x4:
    X4_mon = pickle.load(f_x4)

# Check (inspect the first sample)
print("First sample of X3_mon loaded:", X3_mon[0])
print("First sample of X4_mon loaded:", X4_mon[0])

First sample of X3_mon loaded: [-512, -1024, -512, -1024, -512, -1024, -512, 0, -512, -1024, -1536, -2048, -2560, -3072, -3584, -4096, -4608, -5120, -5632, -6144, -6656, -7168, -6656, -7168, -6656, -7168, -7680, -7168, -6656, -6144, -6656, -6144, -6656, -7168, -7680, -7168, -6656, -7168, -6656, -6144, -6656, -7168, -7680, -8192, -8704, -9216, -9728, -10240, -10752, -10240, -10752, -11264, -10752, -11264, -10752, -11264, -11776, -12288, -12800, -13312, -13824, -14336, -14848, -15360, -15872, -16384, -16896, -17408, -17920, -18432, -18944, -19456, -19968, -20480, -20992, -21504, -22016, -22528, -23040, -23552, -24064, -24576, -25088, -25600, -26112, -26624, -27136, -27648, -28160, -28672, -29184, -29696, -30208, -30720, -31232, -31744, -32256, -32768, -33280, -33792, -34304, -34816, -35328, -35840, -36352, -36864, -37376, -37888, -38400, -38912, -39424, -39936, -40448, -40960, -41472, -41984, -42496, -43008, -43520, -44032, -44544, -44032, -44544, -45056, -44544, -45056, -45568, -46080, 

##### Check X3_mon, X4_mon

In [9]:
# Test X3_mon: Cumulative Sum
for i in range(len(X2_mon)):
    x2_seq = X2_mon[i]
    x3_seq = X3_mon[i]
    assert x3_seq[-1] == sum(x2_seq), f"Error: Cumulative sum mismatch at index {i}. X3[-1] != sum(X2)."
print("X3 Test passed: Cumulative sum is correct.")

# Print values for a few samples
print(f"X2 (Packet Size * Direction) for sample 0: {X2_mon[0]}")
print(f"X3 (Cumulative Sizes) for sample 0: {X3_mon[0]}")
print(f"X4 (Bursts) for sample 0: {X4_mon[0]}")

X3 Test passed: Cumulative sum is correct.
X2 (Packet Size * Direction) for sample 0: [-512, -512, 512, -512, 512, -512, 512, 512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, 512, -512, 512, -512, -512, 512, 512, 512, -512, 512, -512, -512, -512, 512, 512, -512, 512, 512, -512, -512, -512, -512, -512, -512, -512, -512, -512, 512, -512, -512, 512, -512, 512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, 512, -512, -512, 512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, 512, 512, 512, 512, 512, 512, 512, 512, -512, 512, 512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, -512, 
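
The test above compares only the last cumulative value with the total. A stricter check could compare every prefix sum and also validate the bursts. The following is a sketch under that assumption; `check_features` and the toy sequences are hypothetical names and values, not part of the notebook.

```python
from itertools import accumulate

def check_features(x2_seq, x3_seq, x4_seq):
    """Return True if x3 is the running sum of x2 and x4 is consistent with x2."""
    # Every prefix sum must match, not only the final total.
    if list(accumulate(x2_seq)) != list(x3_seq):
        return False
    # Bursts partition the packets, so their sum equals the total of x2.
    if sum(x4_seq) != sum(x2_seq):
        return False
    # Adjacent bursts must alternate in direction.
    return all((a > 0) != (b > 0) for a, b in zip(x4_seq, x4_seq[1:]))

x2_toy = [-512, -512, 512, -512]
x3_toy = [-512, -1024, -512, -1024]
x4_toy = [-1024, 512, -512]

print(check_features(x2_toy, x3_toy, x4_toy))            # True
print(check_features(x2_toy, [0, 0, 0, -1024], x4_toy))  # False
```

The second call fails even though its last element equals the total, which is the case the last-element check would miss.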

### Sequence Length Normalization
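- Sequence length normalization is applied to X1 through X3.
  (X4 is excluded because burst sequences inherently differ in length from X1 through X3.)
- The mode was chosen from the following candidate lengths:
    - Mode
    - Mean
    - Minimum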
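
The three candidate lengths can be computed side by side before picking one. A minimal sketch on toy sequences (`candidate_lengths` and `toy_sequences` are illustrative names; rounding the mean down is an assumption):

```python
from collections import Counter
from statistics import mean

def candidate_lengths(sequences):
    """Return the mode, mean (rounded down), and minimum of the sequence lengths."""
    lengths = [len(seq) for seq in sequences]
    return {
        "mode": Counter(lengths).most_common(1)[0][0],
        "mean": int(mean(lengths)),
        "min": min(lengths),
    }

toy_sequences = [[1, 2, 3], [1, 2, 3], [1], [1, 2, 3, 4, 5]]
print(candidate_lengths(toy_sequences))  # {'mode': 3, 'mean': 3, 'min': 1}
```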
  

#### (Using the mode, zero padding)
- Mode length: 9977

In [10]:
import numpy as np
from collections import Counter

def adjust_sequences_to_mode_length(sequences, padding_value=0):
    # Step 1: Calculate the lengths of all sequences
    lengths = [len(seq) for seq in sequences]

    # Step 2: Find the mode of the lengths
    mode_length = Counter(lengths).most_common(1)[0][0]
    print(f"Mode length: {mode_length}")

    # Step 3: Adjust all sequences to the mode length
    adjusted_sequences = []
    for seq in sequences:
        if len(seq) > mode_length:
            # Truncate sequences longer than the mode length
            adjusted_seq = seq[:mode_length]
        else:
            # Pad sequences shorter than the mode length
            adjusted_seq = np.pad(seq, (0, mode_length - len(seq)), constant_values=padding_value)
        adjusted_sequences.append(adjusted_seq)

    return np.array(adjusted_sequences), mode_length

# Adjust to mode length
X1_mon_adjusted, mode_length_X1_mon = adjust_sequences_to_mode_length(X1_mon)
X2_mon_adjusted, mode_length_X2_mon = adjust_sequences_to_mode_length(X2_mon)
X3_mon_adjusted, mode_length_X3_mon = adjust_sequences_to_mode_length(X3_mon)

print(f"X1_mon adjusted to mode length {mode_length_X1_mon}:\n", X1_mon_adjusted)
print(f"X2_mon adjusted to mode length {mode_length_X2_mon}:\n", X2_mon_adjusted)
print(f"X3_mon adjusted to mode length {mode_length_X3_mon}:\n", X3_mon_adjusted)

Mode length: 9977
Mode length: 9977
Mode length: 9977
X1_mon adjusted to mode length 9977:
 [[0.   0.14 0.14 ... 0.   0.   0.  ]
 [0.   0.13 0.13 ... 0.   0.   0.  ]
 [0.   0.11 0.11 ... 0.   0.   0.  ]
 ...
 [0.   0.11 0.11 ... 0.   0.   0.  ]
 [0.   0.17 0.17 ... 0.   0.   0.  ]
 [0.   0.12 0.12 ... 0.   0.   0.  ]]
X2_mon adjusted to mode length 9977:
 [[-512 -512  512 ...    0    0    0]
 [-512 -512  512 ...    0    0    0]
 [-512 -512  512 ...    0    0    0]
 ...
 [-512 -512  512 ...    0    0    0]
 [-512 -512  512 ...    0    0    0]
 [-512 -512  512 ...    0    0    0]]
X3_mon adjusted to mode length 9977:
 [[ -512 -1024  -512 ...     0     0     0]
 [ -512 -1024  -512 ...     0     0     0]
 [ -512 -1024  -512 ...     0     0     0]
 ...
 [ -512 -1024  -512 ...     0     0     0]
 [ -512 -1024  -512 ...     0     0     0]
 [ -512 -1024  -512 ...     0     0     0]]
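
The effect of the adjustment on a single sequence is easier to see on short inputs. This sketch reimplements the pad-or-truncate step with plain lists; `pad_or_truncate` is a hypothetical helper, not the function used above.

```python
def pad_or_truncate(seq, target_length, padding_value=0):
    """Cut seq to target_length, or right-pad it with padding_value."""
    if len(seq) >= target_length:
        return list(seq[:target_length])
    return list(seq) + [padding_value] * (target_length - len(seq))

print(pad_or_truncate([0.0, 0.14, 0.31, 0.51, 0.75], 3))  # [0.0, 0.14, 0.31]
print(pad_or_truncate([-512, -1024], 4))                  # [-512, -1024, 0, 0]
```

One side effect is visible in the rows printed above: padding X3 with 0 makes the cumulative sequence drop back to 0 after its last real value.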


## Random Forest Model Experiments
   - X1_mon
   - X1_mon + X2_mon
   - X1_mon + X2_mon + X3_mon
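
Combining feature sets here means placing the length-normalized vectors of each sample side by side. The experiment cells below do this with a per-sample `np.concatenate`; the sketch below, on toy matrices with made-up values, suggests that `np.hstack` gives the same result in one call.

```python
import numpy as np

# Toy stand-ins for two length-normalized feature matrices (2 samples, 3 columns each).
X1_toy = np.array([[0.0, 0.1, 0.2], [0.0, 0.3, 0.4]])
X2_toy = np.array([[-512, 512, 0], [512, -512, -512]])

# Per-sample concatenation, as in the experiment cells.
combined_loop = np.array([np.concatenate((a, b)) for a, b in zip(X1_toy, X2_toy)])
# Equivalent vectorized form.
combined_hstack = np.hstack((X1_toy, X2_toy))

print(combined_hstack.shape)                           # (2, 6)
print(np.array_equal(combined_loop, combined_hstack))  # True
```

With the real data, each added feature set contributes another 9977 columns per sample.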

#### RF (X1 only)
Accuracy: 0.4666

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Prepare the data
# X1 represents timestamps, X2 represents direction * size
# Choose one feature set for the model (X1 or X2)
X_mon = X1_mon_adjusted  # or X2
y_mon = np.array(y_mon)

# Step 2: Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_mon, y_mon, test_size=0.2, random_state=0, stratify=y_mon)

# Step 3: Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit(X_train, y_train)

# Step 4: Evaluate the model
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Optional: Feature importance (only when X2 is the selected feature set)
if X_mon is X2_mon_adjusted:
    feature_importances = rf_model.feature_importances_
    print("Feature Importances:", feature_importances)

Accuracy: 0.4666
Classification Report:
              precision    recall  f1-score   support

           0       0.45      0.23      0.30        40
           1       0.38      0.38      0.38        40
           2       0.39      0.57      0.46        40
           3       0.54      0.55      0.54        40
           4       0.51      0.47      0.49        40
           5       0.42      0.28      0.33        40
           6       0.40      0.42      0.41        40
           7       0.38      0.33      0.35        40
           8       0.55      0.60      0.57        40
           9       0.39      0.35      0.37        40
          10       0.40      0.45      0.42        40
          11       0.54      0.38      0.44        40
          12       0.48      0.65      0.55        40
          13       0.19      0.10      0.13        40
          14       0.46      0.28      0.34        40
          15       0.41      0.33      0.36        40
          16       0.45      0.50      0.
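
The support of 40 per class in the report follows from the split settings. A short worked check, which also gives the chance-level accuracy for 95 balanced classes as a reference point (the constant names are introduced here for illustration):

```python
TOTAL_SAMPLES = 19_000
NUM_CLASSES = 95
TEST_SIZE = 0.2

samples_per_class = TOTAL_SAMPLES // NUM_CLASSES       # instances per site
test_samples = round(TOTAL_SAMPLES * TEST_SIZE)        # size of the test set
test_per_class = round(samples_per_class * TEST_SIZE)  # stratified share per site
chance_accuracy = 1 / NUM_CLASSES                      # uniform random guessing

print(samples_per_class, test_samples, test_per_class)  # 200 3800 40
print(round(chance_accuracy, 4))                        # 0.0105
```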

#### RF (X1 and X2)

Accuracy: 0.6300

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Combine X1 and X2
# Ensure X1 and X2 have the same number of samples
X_combined = [np.concatenate((x1, x2)) for x1, x2 in zip(X1_mon_adjusted, X2_mon_adjusted)]
X_combined = np.array(X_combined)
y_mon = np.array(y_mon)

# Step 2: Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X_combined, y_mon, test_size=0.2, random_state=0, stratify=y_mon)

# Step 3: Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit(X_train, y_train)

# Step 4: Evaluate the model
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Optional: Feature importance
feature_importances = rf_model.feature_importances_
print("Feature Importances:", feature_importances)


Accuracy: 0.6300
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.45      0.58        40
           1       0.63      0.65      0.64        40
           2       0.48      0.75      0.59        40
           3       0.74      0.78      0.76        40
           4       0.70      0.65      0.68        40
           5       0.88      0.55      0.68        40
           6       0.57      0.65      0.60        40
           7       0.46      0.40      0.43        40
           8       0.77      0.68      0.72        40
           9       0.47      0.57      0.52        40
          10       0.66      0.53      0.58        40
          11       0.86      0.45      0.59        40
          12       0.53      0.82      0.65        40
          13       0.42      0.25      0.31        40
          14       0.78      0.53      0.63        40
          15       0.54      0.53      0.53        40
          16       0.53      0.53      0.
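
The per-class columns of the report derive from three counts: true positives, false positives, and false negatives. The counts below are an inference chosen to reproduce the rounded values reported for class 0 above (precision 0.82, recall 0.45, f1-score 0.58, support 40); they are not taken from the notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from per-class counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Assumed counts for class 0: 18 of 40 test instances found, 4 false alarms.
p, r, f = precision_recall_f1(tp=18, fp=4, fn=22)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.82 0.45 0.58
```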

#### RF (X1, X2, and X3)
Accuracy: 0.9100

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Combine X1, X2, and X3
# Ensure X1, X2, and X3 have the same number of samples
X123_combined = [np.concatenate((x1, x2, x3)) for x1, x2, x3 in zip(X1_mon_adjusted, X2_mon_adjusted, X3_mon_adjusted)]
X123_combined = np.array(X123_combined)
y_mon = np.array(y_mon)

# Step 2: Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X123_combined, y_mon, test_size=0.2, random_state=0, stratify=y_mon)

# Step 3: Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
rf_model.fit(X_train, y_train)

# Step 4: Evaluate the model
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.4f}')
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Optional: Feature importance
feature_importances = rf_model.feature_importances_
print("Feature Importances:", feature_importances)

Accuracy: 0.9100
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.85      0.92        40
           1       0.97      0.88      0.92        40
           2       0.93      0.93      0.93        40
           3       0.95      0.97      0.96        40
           4       0.93      0.95      0.94        40
           5       0.93      0.97      0.95        40
           6       0.91      0.97      0.94        40
           7       0.78      0.90      0.84        40
           8       0.92      0.90      0.91        40
           9       0.95      0.93      0.94        40
          10       0.85      0.85      0.85        40
          11       1.00      0.90      0.95        40
          12       0.95      1.00      0.98        40
          13       0.92      0.90      0.91        40
          14       0.94      0.82      0.88        40
          15       0.74      0.93      0.82        40
          16       0.97      0.82      0.