In this notebook, we try to determine how many hackers attacked the company. The start-up isn't sure whether there were 2 or 3 hackers, so it hired us to work out the exact number.

First, let's set up Spark in Colab:

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://mirrors.sonic.net/apache/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar xzf spark-3.1.2-bin-hadoop3.2.tgz
!pip install -q findspark


import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"


import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Now that Spark is installed in our Colab session, it is time to import the classes we need for this exercise:

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

With the imports in place, let's load our dataset.

In [None]:
df = spark.read.options(header = True, inferSchema = True).csv('drive/MyDrive/Colab Notebooks/hack_data.csv')

Let's check the columns and their types to see whether any of them needs indexing.

In [None]:
df.printSchema()

root
 |-- Session_Connection_Time: double (nullable = true)
 |-- Bytes Transferred: double (nullable = true)
 |-- Kali_Trace_Used: integer (nullable = true)
 |-- Servers_Corrupted: double (nullable = true)
 |-- Pages_Corrupted: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- WPM_Typing_Speed: double (nullable = true)



No indexing is needed: the only string column is Location, and since we know that hackers use VPNs, it would not be a useful feature, so we leave it out.

Next, I list the column names so I can copy them into the assembler instead of typing them by hand.

In [None]:
df.columns

['Session_Connection_Time',
 'Bytes Transferred',
 'Kali_Trace_Used',
 'Servers_Corrupted',
 'Pages_Corrupted',
 'Location',
 'WPM_Typing_Speed']

Now it's time to build the assembler, which creates the features column that the model will use to make its predictions.

In [None]:
assembler = VectorAssembler(inputCols=['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
                                       'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed'],
                            outputCol='features')

final_df = assembler.transform(df)
db = final_df.select('features')  # Keep only the column the model needs
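
To make the assembling step more concrete, here is a minimal plain-Python sketch of the idea: take the chosen numeric columns, in order, and pack them into one vector per row, leaving Location out. The `assemble` helper and the values in `example_row` are hypothetical and only for illustration; this is not how Spark implements `VectorAssembler` internally.

```python
# Columns fed to the assembler above (Location is deliberately left out).
FEATURE_COLS = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
                'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed']


def assemble(row, input_cols=FEATURE_COLS):
    """Pack the selected columns of one row into a list of floats, in order."""
    return [float(row[col]) for col in input_cols]


# Hypothetical row: these values are made up, not taken from hack_data.csv.
example_row = {
    'Session_Connection_Time': 10.0,
    'Bytes Transferred': 500.0,
    'Kali_Trace_Used': 1,
    'Servers_Corrupted': 2.5,
    'Pages_Corrupted': 7.0,
    'Location': 'Nowhere',
    'WPM_Typing_Speed': 70.0,
}

print(assemble(example_row))  # prints [10.0, 500.0, 1.0, 2.5, 7.0, 70.0]
```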

Now, let's create our model, starting with **k=2**:

In [None]:
kmeans = KMeans(k=2, seed=None)  # The silhouette value is the same whether seed=1 or seed=None
k2 = kmeans.fit(db)
k2_pred = k2.transform(db)

In [None]:
k2_pred.show(5)

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|[8.0,391.09,1.0,2...|         1|
|[20.0,720.99,0.0,...|         0|
|[31.0,356.32,1.0,...|         1|
|[2.0,228.08,1.0,2...|         1|
|[20.0,408.5,0.0,3...|         1|
+--------------------+----------+
only showing top 5 rows



In [None]:
evaluator = ClusteringEvaluator(metricName='silhouette', distanceMeasure='squaredEuclidean')
k2_sil = evaluator.evaluate(k2_pred)

print("The silhouette for k=2 is " + str(k2_sil))

The silhouette for k=2 is 0.8048521975748283
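
To get a feel for what this number means, here is a small plain-Python sketch of the silhouette score with the squared Euclidean distance, run on a made-up toy dataset. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and averages (b - a) / max(a, b). The function names and toy points are assumptions for illustration; this follows the textbook definition directly and is not Spark's implementation.

```python
def squared_euclidean(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def average(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def silhouette_score(points, labels):
    """Mean silhouette over all points (direct O(n^2) version of the definition)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        # Other members of the point's own cluster (the point itself is excluded).
        same = [q for j, (q, other) in enumerate(zip(points, labels))
                if other == label and j != i]
        if not same:
            scores.append(0.0)  # convention used here for single-point clusters
            continue
        a = average([squared_euclidean(point, q) for q in same])
        b = min(average([squared_euclidean(point, q) for q in members])
                for other, members in clusters.items() if other != label)
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return average(scores)


# Toy data: two obvious groups, labelled once sensibly and once badly.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]

print(silhouette_score(toy_points, good_labels))  # close to 1: tight, well-separated clusters
print(silhouette_score(toy_points, bad_labels))   # negative: points sit in the wrong cluster
```

Values close to 1 mean compact, well-separated clusters, while values near 0 or below mean the clusters overlap or points are assigned to the wrong one.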


Now, for **k=3**

In [None]:
kmeans = KMeans(k=3, seed=None)  # The silhouette value is the same whether seed=1 or seed=None
k3 = kmeans.fit(db)
k3_pred = k3.transform(db)

In [None]:
k3_pred.show(5)

+--------------------+----------+
|            features|prediction|
+--------------------+----------+
|[8.0,391.09,1.0,2...|         0|
|[20.0,720.99,0.0,...|         2|
|[31.0,356.32,1.0,...|         0|
|[2.0,228.08,1.0,2...|         0|
|[20.0,408.5,0.0,3...|         0|
+--------------------+----------+
only showing top 5 rows



In [None]:
k3_sil = evaluator.evaluate(k3_pred)

print("The silhouette for k=3 is " + str(k3_sil))

The silhouette for k=3 is 0.6946221547026241


As we can see, the silhouette value for k=3 (about 0.69) is noticeably lower than the value for k=2 (about 0.80). A higher silhouette means tighter, better-separated clusters, so k=2 fits the data better, and with this information we can conclude that 2 hackers attacked the company.
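
The comparison can also be written down explicitly. The sketch below assumes the usual reading of the silhouette score, where a higher value means better-defined clusters, and uses the two values printed above; the `best_k` helper is a hypothetical name introduced only for this illustration.

```python
def best_k(silhouettes):
    """Return the k with the highest silhouette score."""
    return max(silhouettes, key=silhouettes.get)


# Silhouette values printed by the evaluator cells above.
silhouettes = {2: 0.8048521975748283, 3: 0.6946221547026241}

print(best_k(silhouettes))  # prints 2
```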