<a href="https://colab.research.google.com/github/juwetta/DLI_Group-B/blob/main/DLI_Malicious_URL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


Import important libraries

In [4]:
import os
from sklearn.datasets import load_svmlight_file
import glob
import scipy.sparse
import numpy as np

Access the data folder

In [5]:
folder_path = '/content/drive/My Drive/DLI Group B/url_svmlight'

svm_files = glob.glob(os.path.join(folder_path, "*.svm"))
print(f"Found {len(svm_files)} SVM files in: {folder_path}")

Found 121 SVM files in: /content/drive/My Drive/DLI Group B/url_svmlight


In the output of the loading step, "shape" refers to the dimensions of a data structure such as a NumPy array or a SciPy sparse matrix.

For a 2D structure like combined_X, the shape (2396130, 3231961) means 2,396,130 rows and 3,231,961 columns. In this dataset, each row is a sample (a URL) and each column is a feature.
For a 1D structure like combined_y, the shape (2396130,) means 2,396,130 elements, one label for each sample in combined_X.
So the shape tells you how many samples there are, how many features describe each sample, and how many labels accompany them.
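As a small illustration of the point about shapes, the sketch below builds a toy sparse matrix and label array and reads their dimensions. The names `toy_X` and `toy_y` and all values are made up for the example; they only stand in for `combined_X` and `combined_y`.

```python
import numpy as np
import scipy.sparse

# Toy stand-ins for combined_X and combined_y: 4 samples, 6 features.
toy_X = scipy.sparse.csr_matrix(
    np.array([
        [0.0, 1.0, 0.0, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    ])
)
toy_y = np.array([-1, 1, -1, 1])

# A 2D shape unpacks into (rows, columns) = (samples, features).
n_samples, n_features = toy_X.shape
print(f"X shape: {toy_X.shape} -> {n_samples} samples, {n_features} features")

# A 1D shape has a single entry: the number of labels.
print(f"y shape: {toy_y.shape} -> {toy_y.shape[0]} labels")

# A sparse matrix stores only its non-zero entries.
print(f"Stored non-zero entries: {toy_X.nnz} of {n_samples * n_features}")
```

The number of rows in the feature matrix should always equal the number of labels; a mismatch usually means a file failed to load.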

Read the SVM files and store the dataset


The SVM files have varying numbers of features, which causes an error when trying to combine them. I'll add a step that determines the largest feature count across all files, then load each file with that same number of features so the matrices can be stacked.
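The mismatch can be reproduced on a small scale. The sketch below uses two tiny in-memory files with made-up contents (`file_a` and `file_b` are hypothetical, not part of the dataset) and assumes `load_svmlight_file` accepts a binary file-like object, as its documentation describes.

```python
import io
import scipy.sparse
from sklearn.datasets import load_svmlight_file

# Two tiny in-memory svmlight "files" (hypothetical data).
# The highest feature index differs between them.
file_a = b"1 1:0.5 3:1.0\n-1 2:1.0\n"
file_b = b"-1 1:0.2 5:1.0\n"

# Without n_features, each file's column count is inferred from its own contents.
Xa, ya = load_svmlight_file(io.BytesIO(file_a))
Xb, yb = load_svmlight_file(io.BytesIO(file_b))
print("Inferred shapes:", Xa.shape, Xb.shape)

# Stacking matrices with different column counts fails.
try:
    scipy.sparse.vstack([Xa, Xb])
except ValueError as e:
    print(f"vstack failed: {e}")

# Fix: reload every file with the largest column count seen.
n_features = max(Xa.shape[1], Xb.shape[1])
Xa, ya = load_svmlight_file(io.BytesIO(file_a), n_features=n_features)
Xb, yb = load_svmlight_file(io.BytesIO(file_b), n_features=n_features)

stacked = scipy.sparse.vstack([Xa, Xb])
print("Stacked shape:", stacked.shape)
```

The two-pass pattern (scan for the maximum, then reload) is the same one used in the cells that follow, only on a toy scale.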

In [6]:
# max_features = 0

# for file_path in svm_files:
#   try:
#     X, _ = load_svmlight_file(file_path)
#     if X.shape[1] > max_features:
#       max_features = X.shape[1]

#   except Exception as e:
#     print(f"Error loading file {os.path.basename(file_path)}: {e}")


# print(f"Maximum number of features found: {max_features}") #146.758s used
max_features = 3231961
print(f"Maximum number of features found: {max_features}")

Maximum number of features found: 3231961


In [7]:

all_X = []
all_y = []
max_features = 3231961
try:
    print("\nLoading and combining data...")

    # Load and combine data in a single pass, specifying the number of features
    for file_path in svm_files:
        try:
            X, y = load_svmlight_file(file_path, n_features=max_features)
            all_X.append(X)
            all_y.append(y)
            print(f"{os.path.basename(file_path)}", end="| ")

        except Exception as e:
            print(f"Error loading file {os.path.basename(file_path)}: {e}")

    if all_X and all_y:
        # Vertically stack the sparse feature matrices
        combined_X = scipy.sparse.vstack(all_X)
        # Concatenate the label arrays
        combined_y = np.concatenate(all_y)

        print("\nSuccessfully combined data from all files.")
        print(f"Shape of combined data (X): {combined_X.shape}")
        print(f"Shape of combined labels (y): {combined_y.shape}")
    else:
        print("\nNo data was loaded from the SVM files.")


except FileNotFoundError:
    print(f"Folder not found at: {folder_path}")
except Exception as e:
    print(f"An error occurred: {e}") #154.318s used


Loading and combining data...
Day0.svm| Day1.svm| Day2.svm| Day3.svm| Day4.svm| Day5.svm| Day6.svm| Day7.svm| Day8.svm| Day9.svm| Day10.svm| Day11.svm| Day12.svm| Day13.svm| Day14.svm| Day15.svm| Day16.svm| Day17.svm| Day18.svm| Day19.svm| Day20.svm| Day21.svm| Day22.svm| Day23.svm| Day24.svm| Day25.svm| Day26.svm| Day27.svm| Day28.svm| Day29.svm| Day30.svm| Day31.svm| Day32.svm| Day33.svm| Day34.svm| Day35.svm| Day36.svm| Day37.svm| Day38.svm| Day39.svm| Day40.svm| Day41.svm| Day42.svm| Day43.svm| Day44.svm| Day45.svm| Day46.svm| Day47.svm| Day48.svm| Day49.svm| Day50.svm| Day51.svm| Day52.svm| Day53.svm| Day54.svm| Day55.svm| Day56.svm| Day57.svm| Day58.svm| Day59.svm| Day60.svm| Day61.svm| Day62.svm| Day63.svm| Day64.svm| Day65.svm| Day66.svm| Day67.svm| Day68.svm| Day69.svm| Day70.svm| Day71.svm| Day72.svm| Day73.svm| Day74.svm| Day75.svm| Day76.svm| Day77.svm| Day78.svm| Day79.svm| Day80.svm| Day81.svm| Day82.svm| Day83.svm| Day84.svm| Day85.svm| Day86.svm| Day87.svm| Day88.svm| 

Identify the indexes of real-valued features

In [8]:
import os

feature_types_path = '/content/drive/My Drive/DLI Group B/url_svmlight/FeatureTypes'
real_valued_feature_indices = set()

try:
    with open(feature_types_path, 'r') as f:
        for line in f:
            # Assuming each line in FeatureTypes is a feature index
            try:
                index = int(line.strip())
                real_valued_feature_indices.add(index)
            except ValueError:
                # Handle potential non-integer lines in the file
                print(f"Skipping non-integer line in FeatureTypes: {line.strip()}")

    print(f"Identified {len(real_valued_feature_indices)} real-valued feature indices.")
    # print("First 10 real-valued feature indices:", list(real_valued_feature_indices)[:10]) # Optional: print a few indices

except FileNotFoundError:
    print(f"FeatureTypes file not found at: {feature_types_path}")
except Exception as e:
    print(f"An error occurred while reading FeatureTypes: {e}")

# Now you have the set of real-valued feature indices and can use it
# For example, you could filter your data or analyze these specific features.

Identified 64 real-valued feature indices.
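As the closing comment in the cell suggests, the index set can be used to filter the data. A minimal sketch on a toy matrix follows; `toy_X` and `toy_real_indices` are made-up stand-ins, and the indices are assumed to be zero-based column positions, which should be checked against how the files were actually loaded.

```python
import numpy as np
import scipy.sparse

# Toy 3 x 8 sparse matrix standing in for combined_X (hypothetical values).
toy_X = scipy.sparse.csr_matrix(
    np.array([
        [0.0, 0.0, 0.3, 0.0, 1.0, 0.0, 0.0, 0.7],
        [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
        [0.0, 0.0, 0.5, 0.0, 0.0, 1.0, 0.0, 0.0],
    ])
)

# Hypothetical set of real-valued column indices, assumed to be zero-based
# positions into toy_X.
toy_real_indices = {7, 2, 6}

# Sort for a stable column order, then select those columns in one step.
cols = sorted(toy_real_indices)
real_part = toy_X[:, cols]

print("Selected columns:", cols)
print(real_part.toarray())
```

Selecting columns this way keeps the result sparse, so the same indexing should scale to the full matrix without densifying it.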


Briefly explore the dataset

In [9]:
# Select the first few rows to inspect
num_rows_to_inspect = 5
sample_rows = combined_X[:num_rows_to_inspect]

print(f"Values of real-valued features in the first {num_rows_to_inspect} rows:")

# Iterate through the selected rows
for i in range(sample_rows.shape[0]):
    print(f"\nRow {i+1}:")
    row = sample_rows[i]  # slice the sparse row once instead of once per feature
    # Iterate through the real-valued feature indices (sorted for consistent output)
    for feature_index in sorted(real_valued_feature_indices):
        # Positions among the row's stored (non-zero) entries that match this feature
        matches = np.where(row.indices == feature_index)[0]
        if matches.size:
            # Get the value of the feature
            feature_value = row.data[matches[0]]
            print(f"  Feature {feature_index}: {feature_value}")
        # A feature absent from row.indices has value 0 in the sparse matrix,
        # so only non-zero real-valued features are printed.

Values of real-valued features in the first 5 rows:

Row 1:
  Feature 4: 0.124138
  Feature 5: 0.117647
  Feature 16: 0.749633
  Feature 17: 0.843029
  Feature 18: 0.197344
  Feature 21: 0.142857
  Feature 22: 0.142857
  Feature 55: 1.0
  Feature 63: 1.0
  Feature 69: 1.0
  Feature 71: 1.0
  Feature 73: 1.0
  Feature 75: 1.0
  Feature 81: 1.0
  Feature 83: 1.0
  Feature 85: 1.0
  Feature 87: 1.0
  Feature 89: 1.0
  Feature 91: 1.0
  Feature 93: 1.0
  Feature 95: 1.0
  Feature 101: 1.0
  Feature 103: 1.0
  Feature 105: 1.0
  Feature 107: 1.0
  Feature 109: 1.0
  Feature 111: 1.0

Row 2:
  Feature 4: 0.103448
  Feature 5: 0.176471
  Feature 16: 0.72266
  Feature 17: 0.836498
  Feature 18: 0.6189
  Feature 21: 0.0119048
  Feature 23: 1.0
  Feature 55: 1.0
  Feature 61: 1.0
  Feature 63: 1.0
  Feature 65: 1.0
  Feature 67: 1.0
  Feature 69: 1.0
  Feature 71: 1.0
  Feature 73: 1.0
  Feature 75: 1.0
  Feature 81: 1.0
  Feature 83: 1.0
  Feature 85: 1.0
  Feature 87: 1.0
  Feature 89: 1.0
  F

# Task
Balance the classes in the `combined_y` variable of the dataset.

## Check class distribution

### Subtask:
Analyze the current distribution of the target variable (`combined_y`) to see how imbalanced the classes are.


**Reasoning**:
Calculate and print the counts of each unique class in the `combined_y` array to understand the class distribution.



In [10]:
unique_classes, class_counts = np.unique(combined_y, return_counts=True)

print("Class distribution in combined_y:")
for class_val, count in zip(unique_classes, class_counts):
    print(f"Class {int(class_val)}: {count}")

Class distribution in combined_y:
Class -1: 1603985
Class 1: 792145
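
To put the counts above in proportion, the short sketch below recomputes each class's share and the minority-to-majority ratio. The counts are copied from the output; the variable names are illustrative only.

```python
# Counts copied from the class distribution output.
counts = {-1: 1603985, 1: 792145}

total = sum(counts.values())
majority = max(counts, key=counts.get)
minority = min(counts, key=counts.get)

# Share of each class in the full dataset.
for label, count in counts.items():
    print(f"Class {label}: {count} ({count / total:.1%})")

# Ratio used by resampling tools: minority count divided by majority count.
ratio = counts[minority] / counts[majority]
print(f"Minority/majority ratio: {ratio:.3f}")
```

Class -1 is roughly twice as large as Class 1, so the dataset is imbalanced but not severely so.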


## Choose a balancing technique

### Subtask:
Decide on an appropriate method for balancing the classes, such as oversampling (e.g., SMOTE), undersampling, or a combination.


## Apply the balancing technique

### Subtask:
Apply the chosen balancing technique (SMOTE followed by random undersampling) to the combined dataset.


**Reasoning**:
To balance the dataset, I will first import `SMOTE` and `RandomUnderSampler`. I will then apply SMOTE to oversample the minority class up to a minority-to-majority ratio of 0.5, followed by RandomUnderSampler to undersample the majority class until that ratio reaches 0.8. This will create a more balanced dataset for model training.
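
The effect of the two ratios can be estimated before running the resampling. The sketch below is rough arithmetic only, assuming that a float `sampling_strategy` in imbalanced-learn means the desired minority-to-majority ratio after resampling for a binary problem; the exact counts the library produces may differ slightly through rounding.

```python
# Class counts from the distribution check (Class -1 is the majority).
n_majority = 1603985
n_minority = 792145

smote_ratio = 0.5  # sampling_strategy passed to SMOTE
rus_ratio = 0.8    # sampling_strategy passed to RandomUnderSampler

# Step 1: oversampling grows the minority until minority/majority is about smote_ratio.
minority_after_smote = int(n_majority * smote_ratio)
synthetic_samples = minority_after_smote - n_minority

# Step 2: undersampling shrinks the majority until minority/majority is about rus_ratio.
majority_after_rus = round(minority_after_smote / rus_ratio)

final_ratio = minority_after_smote / majority_after_rus
print(f"Synthetic minority samples added: {synthetic_samples}")
print(f"Minority after SMOTE: {minority_after_smote}")
print(f"Majority after undersampling: {majority_after_rus}")
print(f"Final minority/majority ratio: {final_ratio:.3f}")
print(f"Total rows after resampling: {minority_after_smote + majority_after_rus}")
```

Because the starting ratio is already close to 0.5, this estimate suggests SMOTE adds comparatively few synthetic samples and most of the rebalancing comes from discarding majority rows.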



In [1]:
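# Depends on combined_X and combined_y from the loading cell; after a runtime
# restart, re-run that cell first or this cell fails with a NameError.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler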

# 1. Instantiate SMOTE
smote = SMOTE(sampling_strategy=0.5, random_state=42)

# 2. Apply SMOTE to the data
X_smote, y_smote = smote.fit_resample(combined_X, combined_y)

# 3. Instantiate RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy=0.8, random_state=42)

# 4. Apply RandomUnderSampler to the SMOTE-resampled data
X_resampled, y_resampled = rus.fit_resample(X_smote, y_smote)

# Print the shapes of the resampled data to verify
print(f"Shape of X after SMOTE and RandomUnderSampler: {X_resampled.shape}")
print(f"Shape of y after SMOTE and RandomUnderSampler: {y_resampled.shape}")

# Verify the new class distribution
unique_classes_resampled, class_counts_resampled = np.unique(y_resampled, return_counts=True)
print("\nClass distribution after resampling:")
for class_val, count in zip(unique_classes_resampled, class_counts_resampled):
    print(f"Class {int(class_val)}: {count}")

NameError: name 'combined_X' is not defined