# DS/CMPSC 410 MiniProject #3
# Spring 2022
## Instructor: Professor John Yen
## TA: Rupesh Prajapati 
## LAs: Lily Jakielaszek and Cayla Shan Pun

### Learning Objectives
- Be able to apply bucketing to numerical variables for their integration with OHE features.
- Be able to apply k-means clustering to the Darknet dataset by combining buketing and one-hot encoding, first in local mode, then in cluster mode using the big data.
- Be able to use external labels (e.g., mirai) to evaluate and compare the results of k-means clustering.
- Be able to compare different mirai clusters identified by different k-means clustering and gain insights about characteristics of the clusters.  

### Total points: 100 
- Exercise 1: 5 points
- Exercise 2: 10 points 
- Exercise 3: 15 points 
- Exercise 4: 5 points
- Exercise 5: 10 points
- Exercise 6: 5 points
- Exercise 7: 5 points
- Exercise 8: 10 points
- Exercise 9: 15 points
- Exercise 10: 20 points
  
### Due: 11:59 pm, April 17th, 2022

In [1]:
import pyspark
import csv

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql.functions import array_contains
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString, PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [3]:
import pandas as pd
import numpy as np
import math

In [4]:
ss = SparkSession.builder.master("local").appName("MiniProject #3").getOrCreate()

# Exercise 1 (5 points)
Enter your name in this Markdown cell and complete the path for your data:
- Name: Haichen Wei

In [5]:
Scanners_big_df = ss.read.csv("/storage/home/hxw5245/MiniProj1/Day_2020_profile.csv", header=True, inferSchema=True )

In [6]:
Scanners_big_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: long (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)



# We want to find out the range of the "Packets" column for the entire big dataset so that we can design bucketing based on the range.
## The rest of the Jupyter Notebook (for local mode) will use the small dataset.

In [7]:
Scanners_big_df.select("Packets").describe().show()

+-------+-----------------+
|summary|          Packets|
+-------+-----------------+
|  count|          2270625|
|   mean|826.3704949077897|
| stddev|95219.37995087758|
|    min|                1|
|    max|         29924992|
+-------+-----------------+



## The higest value for the "Packets" column is 29,924,992 (close to 30 million). The minimum value is 1.

In [8]:
Scanners_df = ss.read.csv("/storage/home/hxw5245/MiniProj1/sampled_profile.csv", header= True, inferSchema=True )

## We can use printSchema() to display the schema of the DataFrame Scanners_df to see whether it was inferred correctly.

In [9]:
Scanners_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)



## Part A Compute the Correlation among features.
### Reason: Using two highly correlated features is (almost) like using a feature twice -- It can bias the clustering result because the distance calculation gives the feature more weight.

In [10]:
Scanners_df.stat.corr("Packets", "Bytes")

0.9996587858794584

In [11]:
from pyspark.sql.functions import corr
Scanners_df.select(corr("Packets", "Bytes")).show()

+--------------------+
|corr(Packets, Bytes)|
+--------------------+
|  0.9996587858794584|
+--------------------+



In [12]:
Scanners_df.select(corr("lifetime", "Packets")).show()

+-----------------------+
|corr(lifetime, Packets)|
+-----------------------+
|     0.8327176197182774|
+-----------------------+



In [13]:
Scanners_df.select(corr("lifetime", "numports")).show()

+------------------------+
|corr(lifetime, numports)|
+------------------------+
|     0.41013843142637924|
+------------------------+



In [14]:
Scanners_df.select(corr("numports", "MinUniqueDests")).show()

+------------------------------+
|corr(numports, MinUniqueDests)|
+------------------------------+
|          -2.46489881261615...|
+------------------------------+



# Part B: Transforming Numerical Features Using Bucketing

## B.1 Bucketing
Transform a numerical feature into multiple "buckets" by specifying their boundaries.
- A benefit of this transformation is controlling the maximal distance of this feature (so that it does not overweigh other features such as One-Hot-Encoded features)


In [15]:
Scanners_df.select("Packets").describe().show()

+-------+-----------------+
|summary|          Packets|
+-------+-----------------+
|  count|           227062|
|   mean|772.6227946552043|
| stddev|85686.56151383772|
|    min|                1|
|    max|         23726033|
+-------+-----------------+



## Even though the highest value of "Packets" column for this smaller scanner dataset is 23,726,033.  We have found earlier that the higest value of "Packets" column for the big scanner dataset is close to 30 million.

In [16]:
Packets_RDD=Scanners_df.select("Packets").rdd

In [17]:
Packets_rdd = Packets_RDD.map(lambda row: row[0])

In [18]:
Packets_rdd.histogram([0,10,100,1000,10000,100000,1000000,10000000,100000000])

([0, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000],
 [83571, 106530, 33670, 2857, 307, 110, 13, 4])

# Exercise 2 (10 points)
Complete the code below to convert the feature ``Packets`` into 10 buckets (based on 11 borders in ``bucketBorders``)

In [19]:
from pyspark.ml.feature import Bucketizer
bucketBorders=[-1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0, 100000000.0]
bucketer = Bucketizer().setSplits(bucketBorders).setInputCol("Packets").setOutputCol("Packets_B10")
Scanners2_df = bucketer.transform(Scanners_df)

In [20]:
Scanners2_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Packets_B10: double (nullable = true)



In [21]:
Scanners2_df.select("Packets","Packets_B10").where("Packets > 100000").show(30)

+--------+-----------+
| Packets|Packets_B10|
+--------+-----------+
| 5156788|        9.0|
|  111408|        8.0|
|  106182|        8.0|
|  423063|        8.0|
|  128252|        8.0|
|  101375|        8.0|
|22989586|        9.0|
|  257781|        8.0|
|  454434|        8.0|
|  110155|        8.0|
|  135825|        8.0|
|  533497|        8.0|
|  120583|        8.0|
|  173663|        8.0|
|  433552|        8.0|
|  158969|        8.0|
|  125822|        8.0|
|  213661|        8.0|
|  132546|        8.0|
|  136777|        8.0|
|  148520|        8.0|
|  678087|        8.0|
|  360056|        8.0|
|  204668|        8.0|
|  258245|        8.0|
|  212670|        8.0|
| 5503873|        9.0|
|  122909|        8.0|
|  177976|        8.0|
|  431381|        8.0|
+--------+-----------+
only showing top 30 rows



In [22]:
Scanners2_df.select("numports").describe().show()

+-------+-----------------+
|summary|         numports|
+-------+-----------------+
|  count|           227062|
|   mean|6.280645814799482|
| stddev|352.0095436248727|
|    min|                1|
|    max|            65386|
+-------+-----------------+



In [23]:
Scanners_df.where(col('mirai')).count()

17132

# Part C: One Hot Encoding 
## This part is identical to that of Miniproject Deliverable #2
We want to apply one hot encoding to the set of ports scanned by scanners.  
- C.1 Like Mini Project deliverable 1 and 2, we first convert the feature "ports_scanned_str" to a feature that is an Array of ports
- C.2 We then calculate the total number of scanners for each port
- C.3 We identify the top n port to use for one-hot encoding (You choose the number n).
- C.4 Generate one-hot encoded feature for these top n ports.

In [None]:
# Scanners_df.select("ports_scanned_str").show(30)

In [24]:
Scanners3_df=Scanners2_df.withColumn("Ports_Array", split(col("ports_scanned_str"), "-") )
# Scanners_df2.persist().show(10)

# C.1 Convert the `ports_scanned_str` column to an array
## We only need the column ```Ports_Array``` to calculate the top ports being scanned

In [25]:
Ports_Scanned_RDD = Scanners3_df.select("Ports_Array").rdd

In [None]:
# Ports_Scanned_RDD.persist().take(5)

# C.2 Calculate the Total Number of scanners for each port
### Because each port number in the Ports_Array column for each row occurs only once, we can count the total occurance of each port number through flatMap.

In [26]:
Ports_list_RDD = Ports_Scanned_RDD.map(lambda row: row[0] )

In [None]:
# Ports_list_RDD.persist()

In [27]:
Ports_list2_RDD = Ports_Scanned_RDD.flatMap(lambda row: row[0] )

In [28]:
Port_count_RDD = Ports_list2_RDD.map(lambda x: (x, 1))
# Port_count_RDD.take(2)

In [29]:
Port_count_total_RDD = Port_count_RDD.reduceByKey(lambda x,y: x+y, 1)
# Port_count_total_RDD.persist().take(5)

In [30]:
Sorted_Count_Port_RDD = Port_count_total_RDD.map(lambda x: (x[1], x[0])).sortByKey( ascending = False)

In [None]:
# Sorted_Count_Port_RDD.persist().take(50)

# C.3 Identify top n ports to use for one-hot encoding.
### Select top_ports to be the number of top ports you want to use for one-hot encoding.  I recommend a number between 20 and 60.

In [31]:
top_ports= 50
Sorted_Ports_RDD= Sorted_Count_Port_RDD.map(lambda x: x[1])
Top_Ports_list = Sorted_Ports_RDD.take(top_ports)

In [32]:
Top_Ports_list

['17132',
 '17140',
 '17128',
 '17138',
 '17130',
 '17136',
 '23',
 '445',
 '54594',
 '17142',
 '17134',
 '80',
 '8080',
 '0',
 '2323',
 '5555',
 '81',
 '1023',
 '52869',
 '8443',
 '49152',
 '7574',
 '37215',
 '34218',
 '34220',
 '33968',
 '34224',
 '34228',
 '33962',
 '33960',
 '33964',
 '34216',
 '34226',
 '33970',
 '33972',
 '50401',
 '34222',
 '34230',
 '33966',
 '33974',
 '3389',
 '1433',
 '22',
 '5353',
 '21',
 '8291',
 '8728',
 '443',
 '5900',
 '8000']

In [None]:
# Scanners_df3=Scanners_df2.withColumn(FeatureName, array_contains("Ports_Array", Top_Ports_list[0]))

In [None]:
# Scanners_df3.show(10)

# C.4 Generate Hot-One Encoded Feature for each of the top ports in the Top_Ports_list

- Iterate through the Top_Ports_list so that each top port is one-hot encoded.

The following PySpark code encodes top n ports using One Hot Encoding, where n is specified by the variable ```top_ports```

In [33]:
for i in range(0, top_ports):
    # "Port" + Top_Ports_list[i]  is the name of each new feature created through One Hot Encoding
    Scanners_df3 = Scanners3_df.withColumn("Port" + Top_Ports_list[i], array_contains("Ports_Array", Top_Ports_list[i]))
    Scanners3_df = Scanners_df3

In [34]:
Scanners3_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Packets_B10: double (nullable = true)
 |-- Ports_Array: array (nullable = true)
 |    |-- element: st

# Exercise 3 (15 points)
Complete the following code to use k-means to cluster the scanners using the following features
- bucketing of 'packets' numerical feature
- one-hot encoding of top k ports (k=50)

# Part D: Clustering
## Specify Parameters for k Means Clustering
## We use the variable `cluster_num` for the number of clusters to be generated by k-means clustering.

In [35]:
cluster_num = 100
seed = 123
km = KMeans(featuresCol="features", predictionCol="prediction").setK(cluster_num).setSeed(seed)
km.explainParams()

'distanceMeasure: the distance measure. Supported options: \'euclidean\' and \'cosine\'. (default: euclidean)\nfeaturesCol: features column name. (default: features, current: features)\ninitMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)\ninitSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)\nk: The number of clusters to create. Must be > 1. (default: 2, current: 100)\nmaxIter: max number of iterations (>= 0). (default: 20)\npredictionCol: prediction column name. (default: prediction, current: prediction)\nseed: random seed. (default: -2471357743403942232, current: 123)\ntol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)\nweightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)'

In [36]:
input_features = ["Packets_B10"]
for i in range(0, top_ports):
    input_features.append( "Port"+Top_Ports_list[i] )

In [37]:
print(input_features)

['Packets_B10', 'Port17132', 'Port17140', 'Port17128', 'Port17138', 'Port17130', 'Port17136', 'Port23', 'Port445', 'Port54594', 'Port17142', 'Port17134', 'Port80', 'Port8080', 'Port0', 'Port2323', 'Port5555', 'Port81', 'Port1023', 'Port52869', 'Port8443', 'Port49152', 'Port7574', 'Port37215', 'Port34218', 'Port34220', 'Port33968', 'Port34224', 'Port34228', 'Port33962', 'Port33960', 'Port33964', 'Port34216', 'Port34226', 'Port33970', 'Port33972', 'Port50401', 'Port34222', 'Port34230', 'Port33966', 'Port33974', 'Port3389', 'Port1433', 'Port22', 'Port5353', 'Port21', 'Port8291', 'Port8728', 'Port443', 'Port5900', 'Port8000']


In [38]:
va = VectorAssembler().setInputCols(input_features).setOutputCol("features")

In [39]:
data= va.transform(Scanners3_df)

In [40]:
data.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: double, Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: double, mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Packets_B10: double, Ports_Array: array<string>, Port17132: boolean, Port17140: boolean, Port17128: boolean, Port17138: boolean, Port17130: boolean, Port17136: boolean, Port23: boolean, Port445: boolean, Port54594: boolean, Port17142: boolean, Port17134: boolean, Port80: boolean, Port8080: boolean, Port0: boolean, Port2323: boolean, Port5555: boolean, Port81: boolean, Port1023: boolean, Port52869: boolean, Port8443: boolean, Port49152: boolean, Port7574: boolean, Port37215: boolean, Port34218: boolean, Port34220: boolean, Port33968: boolean, Port34224: boolean, Port34228: bo

In [41]:
kmModel=km.fit(data)

In [42]:
kmModel

KMeansModel: uid=KMeans_8fe228837210, k=100, distanceMeasure=euclidean, numFeatures=51

In [43]:
predictions = kmModel.transform(data)

In [44]:
predictions.persist().show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+-----------+--------------+---------+---------+---------+---------+---------+---------+------+-------+---------+---------+---------+------+--------+-----+--------+--------+------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+--------+--------+------+--------+------+--------+--------+-------+--------+--------+--------------------+----------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|ports_

In [45]:
Cluster0_df=predictions.where(col("prediction")==0)

In [46]:
Cluster0_df.count()

705

# The following code shows the sizes of the clusters generated.

In [47]:
summary = kmModel.summary

In [48]:
summary.clusterSizes

[705,
 36653,
 6176,
 7142,
 4685,
 1176,
 675,
 1109,
 67,
 728,
 6551,
 2562,
 855,
 6292,
 519,
 4434,
 4722,
 3837,
 606,
 2056,
 859,
 3377,
 4706,
 134,
 2483,
 418,
 913,
 2312,
 6810,
 3282,
 1493,
 4160,
 883,
 1312,
 1755,
 1687,
 986,
 999,
 559,
 2310,
 966,
 722,
 936,
 428,
 2011,
 1271,
 330,
 5888,
 372,
 1360,
 4272,
 569,
 461,
 3992,
 2914,
 3599,
 1106,
 1034,
 725,
 5897,
 696,
 241,
 2068,
 524,
 6055,
 528,
 821,
 3701,
 968,
 425,
 1580,
 773,
 2545,
 3651,
 744,
 637,
 264,
 312,
 421,
 1409,
 1668,
 685,
 366,
 1113,
 6395,
 196,
 674,
 690,
 635,
 3670,
 1514,
 572,
 2672,
 936,
 695,
 1183,
 1034,
 1021,
 337,
 802]

# Exercise 4 (5 points)
## Complete the following code to find the Silhouette Score of the clustering result.

In [49]:
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)

In [50]:
print('Silhouette Score of the Clustering Result is ', silhouette)

Silhouette Score of the Clustering Result is  0.523235331825663


In [51]:
centers = kmModel.clusterCenters()

In [52]:
centers[0]

array([5.01278409e+00, 9.54545455e-01, 9.61647727e-01, 9.65909091e-01,
       9.97159091e-01, 9.61647727e-01, 9.65909091e-01, 0.00000000e+00,
       0.00000000e+00, 2.84090909e-03, 9.81534091e-01, 1.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 5.68181818e-03, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.17897727e-01, 1.84659091e-02, 2.13068182e-02, 5.49715909e-01,
       3.40909091e-02, 1.84659091e-02, 1.03693182e-01, 1.22159091e-01,
       6.49147727e-01, 1.16477273e-01, 1.13636364e-01, 1.20738636e-01,
       0.00000000e+00, 6.67613636e-02, 7.24431818e-02, 8.66477273e-02,
       8.80681818e-02, 1.42045455e-03, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

In [None]:
# print("Cluster Centers:")
# i=0
# for center in centers:
#    print("Cluster ", str(i+1), center)
#    i = i+1

In [53]:
len(input_features)

51

In [54]:
centers[0]

array([5.01278409e+00, 9.54545455e-01, 9.61647727e-01, 9.65909091e-01,
       9.97159091e-01, 9.61647727e-01, 9.65909091e-01, 0.00000000e+00,
       0.00000000e+00, 2.84090909e-03, 9.81534091e-01, 1.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 5.68181818e-03, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       1.17897727e-01, 1.84659091e-02, 2.13068182e-02, 5.49715909e-01,
       3.40909091e-02, 1.84659091e-02, 1.03693182e-01, 1.22159091e-01,
       6.49147727e-01, 1.16477273e-01, 1.13636364e-01, 1.20738636e-01,
       0.00000000e+00, 6.67613636e-02, 7.24431818e-02, 8.66477273e-02,
       8.80681818e-02, 1.42045455e-03, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

In [55]:
centers[0][50]

0.0

# Part E Use Mirai Signature (External Label) to Evaluate/Interpret Clustering Results

# Exercise 5 (10 points)
## Complete the following code to compute the percentage of Mirai Malwares in each cluster.

In [56]:
# Define columns of the Pandas dataframe
column_list = ['cluster ID', 'size', 'mirai_ratio' ]
for feature in input_features:
    column_list.append(feature)
mirai_clusters_df = pd.DataFrame( columns = column_list )
threshold = 0.2
for i in range(0, top_ports):
    cluster_row = [ ]
    cluster_i = predictions.where(col('prediction')==i)
    cluster_i_size = cluster_i.count()
    cluster_i_mirai_count = cluster_i.where(col("mirai")).count()
    cluster_i_mirai_ratio = cluster_i_mirai_count/cluster_i_size
    if cluster_i_mirai_count > 0:
        print("Cluster ", i, "; Mirai Ratio:", cluster_i_mirai_ratio, "; Cluster Size: ", cluster_i_size)
    if cluster_i_mirai_ratio > threshold:
        cluster_row = [i, cluster_i_size, cluster_i_mirai_ratio]
        # Add the cluster center (average) value for each input feature for cluster i to cluster_row
        for j in range(0, len(input_features)):
            cluster_row.append(centers[i][j])
        mirai_clusters_df.loc[i]= cluster_row

Cluster  1 ; Mirai Ratio: 0.01380514555425204 ; Cluster Size:  36653
Cluster  2 ; Mirai Ratio: 0.010362694300518135 ; Cluster Size:  6176
Cluster  3 ; Mirai Ratio: 0.6300756090730888 ; Cluster Size:  7142
Cluster  5 ; Mirai Ratio: 0.027210884353741496 ; Cluster Size:  1176
Cluster  21 ; Mirai Ratio: 0.015398282499259699 ; Cluster Size:  3377
Cluster  27 ; Mirai Ratio: 0.8944636678200693 ; Cluster Size:  2312
Cluster  29 ; Mirai Ratio: 0.7269957343083485 ; Cluster Size:  3282
Cluster  34 ; Mirai Ratio: 0.007407407407407408 ; Cluster Size:  1755
Cluster  38 ; Mirai Ratio: 0.04830053667262969 ; Cluster Size:  559
Cluster  45 ; Mirai Ratio: 0.025177025963808025 ; Cluster Size:  1271
Cluster  47 ; Mirai Ratio: 0.034986413043478264 ; Cluster Size:  5888


# Exercise 6 (5 points) 
## Complete the code below to save the Pandas dataframe in a CSV file in your directory.

In [57]:
mirai_clusters_df.to_csv("/storage/home/hxw5245/MiniProj3/Bucketing10_CC_Mirai.csv")

# Part F Comparing with Clustering Results Using Only OHE 
## The following code cells are similiar to those in MiniProject 2, which generates a k-means clusteringresult based only on One-Hot-Encoding of ports being scanned. Execute all of the code cells below, including new ones (not in MiniProject 2) that uses Mirai labels to interpret the clusters that contain a significant portion of Mirai signatures.

In [58]:
input_features2 = [ ]
for i in range(0, top_ports ):
    input_features2.append( "Port"+Top_Ports_list[i] )

In [59]:
print(input_features2)

['Port17132', 'Port17140', 'Port17128', 'Port17138', 'Port17130', 'Port17136', 'Port23', 'Port445', 'Port54594', 'Port17142', 'Port17134', 'Port80', 'Port8080', 'Port0', 'Port2323', 'Port5555', 'Port81', 'Port1023', 'Port52869', 'Port8443', 'Port49152', 'Port7574', 'Port37215', 'Port34218', 'Port34220', 'Port33968', 'Port34224', 'Port34228', 'Port33962', 'Port33960', 'Port33964', 'Port34216', 'Port34226', 'Port33970', 'Port33972', 'Port50401', 'Port34222', 'Port34230', 'Port33966', 'Port33974', 'Port3389', 'Port1433', 'Port22', 'Port5353', 'Port21', 'Port8291', 'Port8728', 'Port443', 'Port5900', 'Port8000']


In [60]:
va2 = VectorAssembler().setInputCols(input_features2).setOutputCol("features2")

In [61]:
data2= va2.transform(Scanners3_df)

In [62]:
data2.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: double, Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: double, mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Packets_B10: double, Ports_Array: array<string>, Port17132: boolean, Port17140: boolean, Port17128: boolean, Port17138: boolean, Port17130: boolean, Port17136: boolean, Port23: boolean, Port445: boolean, Port54594: boolean, Port17142: boolean, Port17134: boolean, Port80: boolean, Port8080: boolean, Port0: boolean, Port2323: boolean, Port5555: boolean, Port81: boolean, Port1023: boolean, Port52869: boolean, Port8443: boolean, Port49152: boolean, Port7574: boolean, Port37215: boolean, Port34218: boolean, Port34220: boolean, Port33968: boolean, Port34224: boolean, Port34228: bo

In [63]:
km2 = KMeans(featuresCol="features2", predictionCol="prediction2").setK(cluster_num).setSeed(seed)
km2.explainParams()

'distanceMeasure: the distance measure. Supported options: \'euclidean\' and \'cosine\'. (default: euclidean)\nfeaturesCol: features column name. (default: features, current: features2)\ninitMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)\ninitSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)\nk: The number of clusters to create. Must be > 1. (default: 2, current: 100)\nmaxIter: max number of iterations (>= 0). (default: 20)\npredictionCol: prediction column name. (default: prediction, current: prediction2)\nseed: random seed. (default: -2471357743403942232, current: 123)\ntol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)\nweightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)'

In [64]:
kmModel2=km2.fit(data2)

In [65]:
kmModel2

KMeansModel: uid=KMeans_171cc7b9ccd7, k=100, distanceMeasure=euclidean, numFeatures=50

In [66]:
predictions2 = kmModel2.transform(data2)

In [67]:
predictions2.persist().show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+-----------+--------------+---------+---------+---------+---------+---------+---------+------+-------+---------+---------+---------+------+--------+-----+--------+--------+------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+--------+--------+------+--------+------+--------+--------+-------+--------+--------+--------------------+-----------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|ports

In [68]:
summary2 = kmModel2.summary

In [69]:
summary2.clusterSizes

[1226,
 7751,
 760,
 25079,
 430,
 2808,
 37512,
 532,
 318,
 611,
 8408,
 22175,
 17349,
 7438,
 220,
 505,
 966,
 1518,
 979,
 1226,
 168,
 369,
 8087,
 1182,
 412,
 229,
 531,
 448,
 256,
 1160,
 287,
 226,
 634,
 4042,
 7079,
 7044,
 1204,
 641,
 8115,
 894,
 1042,
 381,
 4057,
 8360,
 189,
 89,
 1802,
 249,
 515,
 312,
 924,
 53,
 222,
 1048,
 253,
 432,
 446,
 903,
 284,
 974,
 320,
 445,
 545,
 638,
 1567,
 603,
 227,
 743,
 343,
 437,
 466,
 780,
 200,
 609,
 255,
 323,
 395,
 606,
 185,
 304,
 2402,
 985,
 854,
 643,
 326,
 736,
 411,
 94,
 656,
 787,
 1109,
 1253,
 464,
 1120,
 403,
 374,
 14,
 712,
 157,
 217]

In [70]:
centers2 = kmModel2.clusterCenters()

In [71]:
evaluator2 = ClusteringEvaluator(featuresCol='features2', predictionCol='prediction2')
silhouette2 = evaluator2.evaluate(predictions2)

In [72]:
print('Silhouette Score of the Clustering Result is ', silhouette2)

Silhouette Score of the Clustering Result is  0.7394072054440451


In [73]:
input_features2

['Port17132',
 'Port17140',
 'Port17128',
 'Port17138',
 'Port17130',
 'Port17136',
 'Port23',
 'Port445',
 'Port54594',
 'Port17142',
 'Port17134',
 'Port80',
 'Port8080',
 'Port0',
 'Port2323',
 'Port5555',
 'Port81',
 'Port1023',
 'Port52869',
 'Port8443',
 'Port49152',
 'Port7574',
 'Port37215',
 'Port34218',
 'Port34220',
 'Port33968',
 'Port34224',
 'Port34228',
 'Port33962',
 'Port33960',
 'Port33964',
 'Port34216',
 'Port34226',
 'Port33970',
 'Port33972',
 'Port50401',
 'Port34222',
 'Port34230',
 'Port33966',
 'Port33974',
 'Port3389',
 'Port1433',
 'Port22',
 'Port5353',
 'Port21',
 'Port8291',
 'Port8728',
 'Port443',
 'Port5900',
 'Port8000']

In [74]:
# Define columns of the Pandas dataframe
column_list2 = ['cluster ID', 'size', 'mirai_ratio' ]
for feature in input_features2:
    column_list2.append(feature)
mirai_clusters2_df = pd.DataFrame( columns = column_list2 )
threshold = 0.2
for i in range(0, top_ports):
    cluster_i = predictions2.where(col('prediction2')==i)
    cluster_i_size = cluster_i.count()
    cluster_i_mirai_count = cluster_i.where(col('mirai')).count()
    cluster_i_mirai_ratio = cluster_i_mirai_count/cluster_i_size
    if cluster_i_mirai_count > 0:
        print("Cluster ", i, "; Mirai Ratio:", cluster_i_mirai_ratio, "; Cluster Size: ", cluster_i_size)
    if cluster_i_mirai_ratio > threshold:
        cluster_row2 = [i, cluster_i_size, cluster_i_mirai_ratio]
        for j in range(0, len(input_features2)):
            cluster_row2.append(centers2[i][j])
        mirai_clusters2_df.loc[i]= cluster_row2

Cluster  6 ; Mirai Ratio: 0.0018127532522925996 ; Cluster Size:  37512
Cluster  12 ; Mirai Ratio: 0.7860971813937403 ; Cluster Size:  17349
Cluster  38 ; Mirai Ratio: 0.006654343807763401 ; Cluster Size:  8115
Cluster  46 ; Mirai Ratio: 0.24250832408435072 ; Cluster Size:  1802
Cluster  48 ; Mirai Ratio: 0.8970873786407767 ; Cluster Size:  515


# Exercise 7 (5 points)
Complete the code below to save the Pandas dataframe in a CSV file in your directory

In [75]:
mirai_clusters2_df.to_csv("/storage/home/hxw5245/MiniProj3/OHE_CC_Mirai.csv")

# Exercise 8 (10 points) 
## Select a cluster that has a high mirai ratio. Complete the code below to see the distribtion of scanner's country in a cluster.

In [76]:
# This is an example.  You should modify it based on the cluster you want to investigate.
cluster = predictions2.where(col('prediction2')==12)

In [77]:
cluster.groupBy("Country").count().orderBy("count", ascending=False).show()

+-------+-----+
|Country|count|
+-------+-----+
|     EG|12161|
|     MX| 1668|
|     GR|  533|
|     CN|  517|
|     BR|  309|
|     ID|  212|
|     IR|  209|
|     IN|  168|
|     TW|  148|
|     KR|  148|
|     US|  147|
|     TH|  119|
|     RU|   88|
|     AR|   75|
|     TR|   59|
|     VN|   54|
|     LK|   49|
|     IT|   43|
|     ES|   38|
|     ZA|   32|
+-------+-----+
only showing top 20 rows



# Exercise 9 (15 points)
Modify a copy of this Jupyter Notebook for clustering the big data (Day_2020_profile) using the cluster mode in ICDS. 
You will need to modify the two output files so that they you can later compare the output files between the local mode and the cluster mode.
- You need to submit 
- .py file for running in the cluster mode
- The log file that contains the run time information.
- One output file (CSV) for clusters generated from Bucketing Packets and OHE whose percentage of scanners matching the Mirai signature is higher than the threshold.
- One output file (CSV) for clusters generated from only OHE whose percentage of scanners matching the Mirai signature is higher than the threshold.

# Exercise 10 (20 points) Submit a word file that answers the following two questions.
- (a) Discuss the characteristics for clusters generated from each approach (in the cluster mode).
- (b) Compare the characteristics for clusters formed from each approach.

In [None]:
ss.stop()