<a href="https://colab.research.google.com/github/hasan-rakibul/AI-cybersec/blob/main/Lab%2011/lab_11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering domain URLs using the K-means clustering algorithm
Spam domains are malicious or suspicious domain names used for various illicit activities, including phishing, malware distribution and spam email campaigns. Clustering these domains helps identify patterns, similarities and clusters of spam domains, enabling the development of more effective detection and mitigation strategies.

By applying clustering algorithms to spam domain datasets, we can group similar domains together based on various features such as domain names, registration dates, IP addresses or textual content. This clustering process helps in identifying common characteristics and patterns that are prevalent among spam domains.

K-means clustering is a popular unsupervised machine learning algorithm used for partitioning data into distinct groups based on similarities. It aims to minimise the variance within each cluster by iteratively assigning data points to the nearest cluster centroid and updating the centroids. K-means is a simple and efficient algorithm that works well for large datasets. However, it requires specifying the number of clusters (K) in advance.
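
The assign-then-update loop described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the function name `kmeans_sketch` and the toy `points` are hypothetical, and the centroids are initialised from the first `k` points, whereas scikit-learn's `KMeans` uses the more careful `k-means++` initialisation by default.

```python
def kmeans_sketch(points, k, n_iter=20):
    """Toy K-means loop on a list of equal-length numeric tuples."""
    # Simplifying assumption: start from the first k points as centroids
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance)
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Two visibly separated groups of 2-D points (toy data)
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Even though both starting centroids lie in the same group here, the repeated assignment and update steps pull one centroid towards each group.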

## Dataset
To demonstrate both the cluster generation and cluster scoring steps, we will work with a labeled dataset of internet domain names. The good names are the top 500,000 Alexa sites from May 2014, and the bad names are 13,789 “toxic domains”.

Download the datasets and keep them in a Google Drive folder (e.g., `Colab Notebooks/AICS/`).

In [1]:
import numpy as np

In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [3]:
good_domains = []
toxic_domains = []

In [4]:
# Read good_domains.txt
with open('/content/drive/MyDrive/Colab Notebooks/AICS/good_domains.txt', 'r') as f:
  good_domains = f.read().splitlines()

# Read toxic_domains_whole.txt
with open('/content/drive/MyDrive/Colab Notebooks/AICS/toxic_domains_whole.txt', 'r') as f:
  toxic_domains = f.read().splitlines()

In [5]:
print("Five samples of good domains: ", good_domains[1000:1005])
print("Five samples of toxic domains: ", toxic_domains[1000:1005])

Five samples of good domains:  ['101prikaz.ru', '101razasdeperros.com', '101recipe.com', '101secureonline.com', '101shans.ru']
Five samples of toxic domains:  ['allaboutemarketing.info', 'allaboutlabyrinths.com', 'alladyn.unixstorm.org', 'allairjordanoutlet.us', 'allairmaxsaleoutlet.us']


In [6]:
print("Number of good domains: ", len(good_domains))
print("Number of toxic domains: ", len(toxic_domains))

Number of good domains:  500000
Number of toxic domains:  13789


## Feature extraction
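Convert the domain URLs into numerical features using a bag-of-words representation, in which each unique token in the domain URLs becomes a feature. We'll use scikit-learn's `CountVectorizer` for this purpose. With its default settings, it lowercases the text and splits it on punctuation, so a name such as `101recipe.com` yields the tokens `101recipe` and `com`.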
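
To get a feel for what this representation looks like on domain names, the sketch below imitates the default tokenisation with the standard library only. It reuses the regular expression that `CountVectorizer` uses as its default `token_pattern`; the helper `tokenise` and the three-domain `sample` are illustrative names, not part of the lab code.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenise(domain):
    """Lowercase a domain name and split it into word tokens."""
    return TOKEN_PATTERN.findall(domain.lower())


sample = ["101recipe.com", "alladyn.unixstorm.org", "allairjordanoutlet.us"]

# Vocabulary: every distinct token, in sorted order (one column per token)
vocabulary = sorted({tok for d in sample for tok in tokenise(d)})

# Count matrix: one row per domain, one count per vocabulary entry
rows = []
for d in sample:
    counts = Counter(tokenise(d))
    rows.append([counts[tok] for tok in vocabulary])

print(vocabulary)
for d, row in zip(sample, rows):
    print(d, row)
```

Each domain contributes one or two name tokens plus its top-level domain, so most rows contain only a handful of non-zero entries, and the features shared between domains tend to be the top-level domains.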

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
# Combine the good and toxic domain lists; the algorithm is not supposed to know beforehand which domains are good and which are toxic
all_domains = good_domains + toxic_domains

In [9]:
# Create an instance of CountVectorizer
vectoriser = CountVectorizer()

In [10]:
# Fit and transform the data to obtain the feature matrix
features = vectoriser.fit_transform(all_domains)

## Applying K-means clustering
Now, we can apply the K-means clustering algorithm to the feature matrix obtained in the previous step. We'll use scikit-learn's KMeans class for this task.

In [11]:
from sklearn.cluster import KMeans

In [12]:
# Specify the number of clusters (K)
K = 2

In [13]:
# Create an instance of KMeans
kmeans = KMeans(n_clusters=K, random_state=0)

In [14]:
# Fit the model to the feature matrix
kmeans.fit(features)



## Analysing the clustering results
After clustering, we can analyse the results to see which domains belong to which cluster.

In [15]:
# Get the cluster labels for each domain
labels = kmeans.labels_

In [16]:
# Separate the domains into their respective clusters
# (treating cluster 0 as "good" is an assumption: K-means cluster IDs are arbitrary)
good_domains_cluster = []
toxic_domains_cluster = []

In [17]:
for i, domain in enumerate(all_domains):
  if labels[i] == 0:
    good_domains_cluster.append(domain)
  else:
    toxic_domains_cluster.append(domain)
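
Before saving anything, it can help to check how many domains ended up in each cluster and to look at a few members. The sketch below does this on a tiny hand-made example; the `all_domains` and `labels` values here are hypothetical stand-ins for the notebook variables (in the notebook, `labels` is a NumPy array, so `labels.tolist()` could be passed to `Counter` instead).

```python
from collections import Counter

# Hypothetical stand-ins for the notebook's `all_domains` and `kmeans.labels_`
all_domains = ["101prikaz.ru", "101recipe.com", "101shans.ru", "allairjordanoutlet.us"]
labels = [0, 0, 0, 1]

# Number of domains assigned to each cluster ID
cluster_sizes = Counter(labels)

for cluster_id, size in sorted(cluster_sizes.items()):
    # Show up to three example members of this cluster
    members = [d for d, lab in zip(all_domains, labels) if lab == cluster_id]
    print(f"cluster {cluster_id}: {size} domains, e.g. {members[:3]}")
```

A very lopsided split, with almost every domain in a single cluster, would suggest that the features do not separate the data well.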

## Saving clusters

In [18]:
# Specify the output file paths
good_domains_file = "good_domains_cluster.txt"
toxic_domains_file = "toxic_domains_cluster.txt"

In [19]:
# Open the output files in write mode
with open(good_domains_file, 'w') as f:
  # Write the good domains to the file
  f.write('\n'.join(good_domains_cluster))

with open(toxic_domains_file, 'w') as f:
  f.write('\n'.join(toxic_domains_cluster))

## Calculate clustering performance
We use the `accuracy_score` function from scikit-learn's `metrics` module to calculate the accuracy, passing it the ground truth labels (`ground_truth_labels`) and the cluster labels obtained from K-means (`labels`). Note that K-means numbers its clusters arbitrarily, so this score implicitly assumes that cluster 0 corresponds to the good domains and cluster 1 to the toxic ones.

In [20]:
from sklearn.metrics import accuracy_score

# Create ground truth labels
ground_truth_labels = [0] * len(good_domains) + [1] * len(toxic_domains)

# Calculate accuracy
accuracy = accuracy_score(ground_truth_labels, labels)

# Print the accuracy value
print("Accuracy:", accuracy)

Accuracy: 0.4107386495234425
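
Since the cluster IDs carry no meaning of their own, an accuracy below 0.5 with two clusters may simply mean that the IDs are swapped relative to the ground truth. One possible way to handle this, sketched below with hypothetical helper names and toy labels, is to score both mappings and keep the better one. For the run above, the reported 0.4107 would correspond to roughly 0.5893 under the opposite mapping.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the two label lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def aligned_accuracy_two_clusters(y_true, cluster_ids):
    """Accuracy under the better of the two possible ID-to-class mappings.

    Assumes exactly two clusters with IDs 0 and 1.
    """
    direct = accuracy(y_true, cluster_ids)
    flipped = accuracy(y_true, [1 - c for c in cluster_ids])
    return max(direct, flipped)


# Toy example: the clustering is mostly right, but the IDs are swapped
y_true = [0, 0, 0, 0, 1, 1]
cluster_ids = [1, 1, 1, 0, 0, 0]

direct_score = accuracy(y_true, cluster_ids)
aligned_score = aligned_accuracy_two_clusters(y_true, cluster_ids)
print("Direct accuracy: ", direct_score)
print("Aligned accuracy:", aligned_score)
```

With only two clusters the flipped score is always one minus the direct score; for larger K, a proper matching between cluster IDs and classes would be needed instead.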


## Practice task
Instead of the whole dataset, use only the last 5000 domains from each dataset (`good_domains.txt` and `toxic_domains_whole.txt`). Cluster the good and toxic domains and report accuracy, precision and recall.
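
As a possible starting point, the sketch below shows the two building blocks the task needs on toy data: taking the last items of a list with a negative slice, and computing precision and recall from true-positive, false-positive and false-negative counts. The helper `precision_recall` and the toy lists are hypothetical; scikit-learn's `precision_score` and `recall_score` in `sklearn.metrics` could be used instead.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or absent
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A negative slice keeps the last n items of a list
items = ["a.com", "b.com", "c.com", "d.com"]
last_two = items[-2:]
print(last_two)

# Toy labels: 1 marks a toxic domain, 0 a good one
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print("Precision:", precision)
print("Recall:   ", recall)
```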