# DS/CMPSC 410 MiniProject Deliverable #2

# Spring 2022
### Instructor: Prof. John Yen
### TA: Rupesh Prajapati
### LA: Lily Jakielaszek and Cayla Shan Pun

### Learning Objectives
- Be able to apply k-means clustering to the Darknet dataset.
- Be able to identify the set of top k ports for one-hot encoding ports scanned.
- Be able to characterize generated clusters using cluster centers.
- Be able to compare and evaluate the result of k-means clustering with different features using Silhouette score and external labels.

### Total points: 100 
- Exercise 1: 5 points
- Exercise 2: 5 points 
- Exercise 3: 10 points 
- Exercise 4: 5 points
- Exercise 5: 10 points
- Exercise 6: 10 points
- Exercise 7: 15 points
- Exercise 8: 10 points
- Exercise 9: 30 points
  
### Due: 11:59 pm, April 10, 2022

In [1]:
import pyspark
import csv

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql.functions import array_contains
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [3]:
ss = SparkSession.builder.master("local").appName("MiniProject 2 Clustering OHE").getOrCreate()

## Exercise 1 (5 points)
Complete the path for the input file in the code below and enter your name in this Markdown cell:
- Name: Haichen Wei

In [4]:
Scanners_df = ss.read.csv("/storage/home/hxw5245/MiniProj1/sampled_profile.csv", header= True, inferSchema=True )

## We can use printSchema() to display the schema of the DataFrame Scanners_df to see whether it was inferred correctly.

In [5]:
Scanners_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)



# Exercise 1: Write Your Name Below: (5 points)
## Solution to Exercise 1: Haichen Wei

# Part A: One Hot Encoding of Top 50 Ports
We want to apply one-hot encoding to the top k ports scanned by scanners.
- A1: Find the top k ports scanned by the most scanners (this is similar to the first part of MiniProject 1).
- A2: Generate a one-hot encoding for these top k ports.

In [6]:
Scanners_df.select("ports_scanned_str").show(4)

+-----------------+
|ports_scanned_str|
+-----------------+
|            13716|
|      17128-17136|
|            35134|
|            17140|
+-----------------+
only showing top 4 rows



# Count the Total Number of Scanners That Scan a Given Port, Regardless of How Many Ports They Scan
As in MiniProject 1, to calculate this, we need to
- (a) Convert ports_scanned_str into an array/list of ports.
- (b) Convert the DataFrame into an RDD.
- (c) Use flatMap to count the total number of scanners for each port.

# The Following Code Implements the Three Steps.
## (a) Split the column "ports_scanned_str" into an array of ports, stored in the new column "Ports_Array".

In [7]:
# (a)
Scanners_df2=Scanners_df.withColumn("Ports_Array", split(col("ports_scanned_str"), "-") )
Scanners_df2.show(10)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|   ports_scanned_str|host_tags_per_censys|host_services_per_censys|         Ports_Array|
+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+
|1645181|1645181|       1|     0.0|   60|      1|                60|             1|             1|               1|         

## (b) We convert the column ```Ports_Array``` into an RDD so that we can apply flatMap for counting.

In [8]:
Ports_Scanned_RDD = Scanners_df2.select("Ports_Array").rdd

In [9]:
Ports_Scanned_RDD.take(5)

[Row(Ports_Array=['13716']),
 Row(Ports_Array=['17128', '17136']),
 Row(Ports_Array=['35134']),
 Row(Ports_Array=['17140']),
 Row(Ports_Array=['54594'])]

## (c) Because each port number in the Ports_Array column occurs only once per row/scanner, we can count the total number of scanners for each port by counting the total occurrences of each port number through flatMap.
### Because each element of the RDD is a Row object, we first need to extract the first element of the Row object, which is the list of ports, from ``Ports_Scanned_RDD``.
### We can then count the total number of occurrences of each port using map and reduceByKey, like counting word/hashtag frequency in tweets.

In [10]:
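# This step is for demonstration purposes only: map keeps one list of ports per row, whereas flatMap flattens all lists into a single sequence of ports.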
Ports_list_RDD = Ports_Scanned_RDD.map(lambda row: row[0] )

In [11]:
Ports_list_RDD.take(3)

[['13716'], ['17128', '17136'], ['35134']]

In [12]:
Ports_list2_RDD = Ports_Scanned_RDD.flatMap(lambda row: row[0] )

In [13]:
Ports_list2_RDD.take(7)

['13716', '17128', '17136', '35134', '17140', '54594', '17130']

In [14]:
Port_count_RDD = Ports_list2_RDD.map(lambda x: (x, 1))
Port_count_RDD.take(7)

[('13716', 1),
 ('17128', 1),
 ('17136', 1),
 ('35134', 1),
 ('17140', 1),
 ('54594', 1),
 ('17130', 1)]

In [15]:
Port_count_total_RDD = Port_count_RDD.reduceByKey(lambda x,y: x+y)
Port_count_total_RDD.take(5)

[('13716', 14),
 ('17128', 31850),
 ('17136', 31617),
 ('35134', 13),
 ('17140', 31865)]
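As a cross-check on the flatMap/map/reduceByKey chain above, the same per-port tally can be sketched in plain Python on a handful of rows. The sample strings below are hypothetical (modelled on the `ports_scanned_str` values shown earlier), and `count_scanners_per_port` is an illustrative helper name, not part of the assignment code.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample of ports_scanned_str values, one string per scanner.
sample_ports_scanned = ["13716", "17128-17136", "35134", "17128", "17128-35134"]


def count_scanners_per_port(port_strings):
    """Count, for each port, how many scanners list it."""
    # Step (a): split each string on "-", as split(col(...), "-") does.
    port_lists = [s.split("-") for s in port_strings]
    # Step (c): flatten the lists (like flatMap) and tally (like map + reduceByKey).
    return Counter(chain.from_iterable(port_lists))


port_counts = count_scanners_per_port(sample_ports_scanned)
print(port_counts["17128"])  # 3 scanners in the sample list port 17128
print(len(port_counts))      # 4 distinct ports in the sample
```

The number of distinct keys in `port_counts` plays the same role as calling `count()` on the reduced RDD.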

# Exercise 2 (5 points)
### Find the total number of distinct ports being scanned by completing the code below.

In [16]:
Port_count_total_RDD.count()

65536

# Exercise 3 (10 points)
### Complete the code below for finding top 50 ports.

In [17]:
Sorted_Count_Port_RDD = Port_count_total_RDD.map(lambda x: (x[1], x[0])).sortByKey( ascending = False)

In [18]:
Sorted_Count_Port_RDD.take(50)

[(32014, '17132'),
 (31865, '17140'),
 (31850, '17128'),
 (31805, '17138'),
 (31630, '17130'),
 (31617, '17136'),
 (29199, '23'),
 (25466, '445'),
 (25216, '54594'),
 (21700, '17142'),
 (21560, '17134'),
 (15010, '80'),
 (13698, '8080'),
 (8778, '0'),
 (6265, '2323'),
 (5552, '5555'),
 (4930, '81'),
 (4103, '1023'),
 (4058, '52869'),
 (4012, '8443'),
 (3954, '49152'),
 (3885, '7574'),
 (3874, '37215'),
 (3318, '34218'),
 (3279, '34220'),
 (3258, '33968'),
 (3257, '34224'),
 (3253, '34228'),
 (3252, '33962'),
 (3236, '33960'),
 (3209, '33964'),
 (3179, '34216'),
 (3167, '34226'),
 (3155, '33970'),
 (3130, '33972'),
 (2428, '50401'),
 (1954, '34222'),
 (1921, '34230'),
 (1919, '33966'),
 (1819, '33974'),
 (1225, '3389'),
 (1064, '1433'),
 (885, '22'),
 (878, '5353'),
 (604, '21'),
 (594, '8291'),
 (554, '8728'),
 (512, '443'),
 (382, '5900'),
 (330, '8000')]

In [19]:
top_ports= 50
Sorted_Ports_RDD= Sorted_Count_Port_RDD.map(lambda x: x[1] )
Top_Ports_list = Sorted_Ports_RDD.take(top_ports)

In [20]:
Top_Ports_list

['17132',
 '17140',
 '17128',
 '17138',
 '17130',
 '17136',
 '23',
 '445',
 '54594',
 '17142',
 '17134',
 '80',
 '8080',
 '0',
 '2323',
 '5555',
 '81',
 '1023',
 '52869',
 '8443',
 '49152',
 '7574',
 '37215',
 '34218',
 '34220',
 '33968',
 '34224',
 '34228',
 '33962',
 '33960',
 '33964',
 '34216',
 '34226',
 '33970',
 '33972',
 '50401',
 '34222',
 '34230',
 '33966',
 '33974',
 '3389',
 '1433',
 '22',
 '5353',
 '21',
 '8291',
 '8728',
 '443',
 '5900',
 '8000']
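Sorting every (count, port) pair is more work than strictly needed when only the top k are wanted. A minimal plain-Python sketch of top-k selection with a heap is shown below; the counts are copied from the output above, and `top_k_ports` is a hypothetical helper name. PySpark RDDs also provide `takeOrdered(num, key)`, which may serve the same purpose without sorting the whole RDD.

```python
import heapq

# A few (port, count) pairs copied from the output above.
port_counts = [("17132", 32014), ("23", 29199), ("445", 25466),
               ("17140", 31865), ("80", 15010), ("8000", 330)]


def top_k_ports(pairs, k):
    """Return the k port strings with the largest counts, highest first."""
    largest = heapq.nlargest(k, pairs, key=lambda pair: pair[1])
    return [port for port, _count in largest]


print(top_k_ports(port_counts, 3))  # ['17132', '17140', '23']
```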

#  A.2 One Hot Encoding of Top K Ports
## One-Hot-Encoded Feature/Column Name
Because we need to create a name for each one-hot-encoded feature, which is one of the top k ports, we can adopt the convention that the column name is "PortXXXX", where "XXXX" is a port number. This can be done by concatenating two strings using ``+``.

In [21]:
Top_Ports_list[0]

'17132'

In [22]:
FeatureName = "Port"+Top_Ports_list[0]

In [23]:
FeatureName

'Port17132'

## One-Hot-Encoding using withColumn and array_contains

In [24]:
from pyspark.sql.functions import array_contains

In [25]:
Scanners_df3=Scanners_df2.withColumn(FeatureName, array_contains("Ports_Array", Top_Ports_list[0]))

In [26]:
Scanners_df3.show(10)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+---------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|   ports_scanned_str|host_tags_per_censys|host_services_per_censys|         Ports_Array|Port17132|
+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+---------+
|1645181|1645181|       1|     0.0|   60|      1|                60|             1|           

## Verify the Correctness of One-Hot-Encoded Feature
## Exercise 4 (5 points)
### Check whether the first top port is one-hot encoded correctly by completing the code below, and enter your answer in the next Markdown cell.

In [32]:
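# Port17132 is a boolean column, so it can be used directly as the filter; count() works on the DataFrame without converting to an RDD.
First_top_port_scanners_count = Scanners_df3.where(col("Port17132")).count()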

In [33]:
print(First_top_port_scanners_count)

32014


## Answer for Exercise 4:
- The total number of scanners that scan the first top port, based on ``Sorted_Count_Port_RDD``, is 32014.
- Is this number the same as the number of scanners whose one-hot-encoded feature for the first top port is True?

    Yes. The number of scanners whose one-hot-encoded feature ``Port17132`` is True is also 32014, so the two counts match.
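The check above can be sketched in plain Python for every top port at once. The port arrays and the two-element top list below are hypothetical, and `one_hot_encode` is an illustrative helper rather than a Spark function. The equality holds only under the assumption stated earlier, that a port appears at most once per scanner.

```python
from collections import Counter

# Hypothetical port arrays, one list per scanner (like the Ports_Array column).
ports_arrays = [["13716"], ["17128", "17136"], ["17128"], ["35134", "17128"]]
top_ports_list = ["17128", "35134"]  # hypothetical top-2 list


def one_hot_encode(ports, top_ports):
    """Map a scanner's port list to {"Port<port>": bool} for each top port."""
    return {"Port" + p: (p in ports) for p in top_ports}


encoded_rows = [one_hot_encode(ports, top_ports_list) for ports in ports_arrays]

# Frequency count, as produced by the flatMap/reduceByKey step.
frequency = Counter(p for ports in ports_arrays for p in ports)

# For every top port, the number of True flags should equal its frequency.
for p in top_ports_list:
    true_count = sum(row["Port" + p] for row in encoded_rows)
    assert true_count == frequency[p]
    print(p, true_count)
```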

## Generate One-Hot-Encoded Features for Each of the Top k Ports in Top_Ports_list

- Iterate through Top_Ports_list so that each top port is one-hot encoded.

## Exercise 5 (10 points)
Complete the following PySpark code to one-hot encode the top n ports, where n is specified by the variable ```top_ports```.

In [34]:
top_ports

50

In [35]:
Top_Ports_list[49]

'8000'

In [36]:
for i in range(0, top_ports):
    # "Port" + Top_Ports_list[i]  is the name of each new feature created through One Hot Encoding
    Scanners_df3 = Scanners_df2.withColumn("Port" + Top_Ports_list[i], array_contains("Ports_Array", Top_Ports_list[i]))
    Scanners_df2 = Scanners_df3

In [37]:
Scanners_df2.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Ports_Array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Port17132: 

## Exercise 6 (10 points)
Complete the code below to use k-means to cluster the scanners using the one-hot-encoded top 50 ports and the following two features:
- lifetime  : The average lifetime of scanners.
- Packets   : The average number of packets scanned by each scanner.

## Specify Parameters for k Means Clustering

In [38]:
input_features = [ "lifetime", "Packets"]
for i in range(0, top_ports ):
    input_features.append( "Port"+ Top_Ports_list[i] )

In [39]:
print(input_features)

['lifetime', 'Packets', 'Port17132', 'Port17140', 'Port17128', 'Port17138', 'Port17130', 'Port17136', 'Port23', 'Port445', 'Port54594', 'Port17142', 'Port17134', 'Port80', 'Port8080', 'Port0', 'Port2323', 'Port5555', 'Port81', 'Port1023', 'Port52869', 'Port8443', 'Port49152', 'Port7574', 'Port37215', 'Port34218', 'Port34220', 'Port33968', 'Port34224', 'Port34228', 'Port33962', 'Port33960', 'Port33964', 'Port34216', 'Port34226', 'Port33970', 'Port33972', 'Port50401', 'Port34222', 'Port34230', 'Port33966', 'Port33974', 'Port3389', 'Port1433', 'Port22', 'Port5353', 'Port21', 'Port8291', 'Port8728', 'Port443', 'Port5900', 'Port8000']
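As a rough mental model (not Spark's actual implementation), assembling the feature list above amounts to concatenating the named columns into one numeric vector per row, with boolean one-hot flags contributing 1.0 or 0.0. The row and the helper `assemble_features` below are hypothetical and only illustrate that idea.

```python
def assemble_features(row, feature_names):
    """Build a flat list of floats from the named fields of a row (dict)."""
    # float(True) == 1.0 and float(False) == 0.0, so boolean flags become 0/1.
    return [float(row[name]) for name in feature_names]


# Hypothetical scanner row with two numeric features and two one-hot flags.
row = {"lifetime": 0.0, "Packets": 1, "Port17132": False, "Port23": True}
names = ["lifetime", "Packets", "Port17132", "Port23"]

print(assemble_features(row, names))  # [0.0, 1.0, 0.0, 1.0]
```

With 2 numeric features and 50 port flags, each scanner would end up with a 52-dimensional vector.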


In [40]:
va = VectorAssembler().setInputCols(input_features).setOutputCol("features")

In [41]:
data= va.transform(Scanners_df2)

In [42]:
data.show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+--------------+---------+---------+---------+---------+---------+---------+------+-------+---------+---------+---------+------+--------+-----+--------+--------+------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+--------+--------+------+--------+------+--------+--------+-------+--------+--------+--------------------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|ports_scanned_str|host_tags_p

In [43]:
data.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: double, Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: double, mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Ports_Array: array<string>, Port17132: boolean, Port17140: boolean, Port17128: boolean, Port17138: boolean, Port17130: boolean, Port17136: boolean, Port23: boolean, Port445: boolean, Port54594: boolean, Port17142: boolean, Port17134: boolean, Port80: boolean, Port8080: boolean, Port0: boolean, Port2323: boolean, Port5555: boolean, Port81: boolean, Port1023: boolean, Port52869: boolean, Port8443: boolean, Port49152: boolean, Port7574: boolean, Port37215: boolean, Port34218: boolean, Port34220: boolean, Port33968: boolean, Port34224: boolean, Port34228: boolean, Port33962: boo

In [44]:
km = KMeans(featuresCol= "features", predictionCol="prediction").setK(100).setSeed(123)
km.explainParams()

'distanceMeasure: the distance measure. Supported options: \'euclidean\' and \'cosine\'. (default: euclidean)\nfeaturesCol: features column name. (default: features, current: features)\ninitMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)\ninitSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)\nk: The number of clusters to create. Must be > 1. (default: 2, current: 100)\nmaxIter: max number of iterations (>= 0). (default: 20)\npredictionCol: prediction column name. (default: prediction, current: prediction)\nseed: random seed. (default: -2704299597103909330, current: 123)\ntol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)\nweightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)'

In [45]:
kmModel=km.fit(data)

In [46]:
kmModel

KMeansModel: uid=KMeans_afd1ed92556e, k=41, distanceMeasure=euclidean, numFeatures=52

In [47]:
predictions = kmModel.transform(data)

In [48]:
predictions.show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+--------------+---------+---------+---------+---------+---------+---------+------+-------+---------+---------+---------+------+--------+-----+--------+--------+------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+--------+--------+------+--------+------+--------+--------+-------+--------+--------+--------------------+----------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|ports_scanned_str|

In [49]:
Cluster1_df=predictions.where(col("prediction")==0)

In [50]:
Cluster1_df.count()

202107

In [51]:
summary = kmModel.summary

In [52]:
summary.clusterSizes

[202107,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 16223,
 1,
 1,
 1,
 37,
 7082,
 1,
 1,
 1,
 7,
 37,
 1,
 2,
 142,
 1,
 1,
 17,
 1,
 1,
 7,
 48,
 1,
 7,
 1,
 1,
 1,
 1247,
 2,
 71,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]

In [53]:
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)

In [54]:
print('Silhouette Score of the Clustering Result is ', silhouette)

Silhouette Score of the Clustering Result is  0.9083484568569015
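To make the score above easier to interpret, here is a minimal plain-Python sketch of the Silhouette computation on a toy dataset. It uses squared Euclidean distance, which, as far as the Spark documentation states, is the default distance measure of `ClusteringEvaluator`. The toy points, the helper names, and the convention of scoring singleton clusters as 0 are assumptions made for illustration; Spark's own implementation is optimized differently and may handle edge cases differently.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(squared_distance(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(squared_distance(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy 1-D data: two tight, well-separated groups.
toy_points = [[0.0], [1.0], [10.0], [11.0]]
good_score = silhouette_score(toy_points, [0, 0, 1, 1])
bad_score = silhouette_score(toy_points, [0, 1, 0, 1])
print(good_score, bad_score)  # good labelling is close to 1, bad labelling is negative
```

A score near 1 means points sit much closer to their own cluster than to the nearest other cluster; it says nothing by itself about whether cluster sizes are balanced.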


In [55]:
centers = kmModel.clusterCenters()

In [56]:
print(centers)

[array([1.09118185e+03, 4.54591828e+01, 1.24042864e-01, 1.23592447e-01,
       1.23538001e-01, 1.23359814e-01, 1.22419383e-01, 1.22290692e-01,
       1.01913035e-01, 1.13183359e-01, 1.13871359e-01, 7.69718118e-02,
       7.62491647e-02, 3.46573613e-02, 2.99354072e-02, 4.02850991e-02,
       9.03308833e-03, 9.08258470e-03, 5.28621278e-03, 4.51901898e-03,
       4.77640013e-03, 4.57841463e-03, 4.40022768e-03, 4.38537877e-03,
       4.17254436e-03, 7.89467172e-03, 7.86497389e-03, 7.72638404e-03,
       7.60759274e-03, 7.64718984e-03, 7.55314673e-03, 7.63729057e-03,
       7.72143440e-03, 7.36506051e-03, 7.21162175e-03, 7.38980870e-03,
       7.25616849e-03, 8.34508872e-03, 4.27648675e-03, 4.22204074e-03,
       4.15274581e-03, 4.04385379e-03, 5.33075952e-03, 4.34083203e-03,
       3.53404113e-03, 5.24661569e-04, 2.10854555e-03, 2.53916401e-03,
       2.59361002e-03, 2.01450244e-03, 1.24235900e-03, 2.92028609e-04]), array([5.39254186e+08, 2.29895860e+07, 0.00000000e+00, 0.00000000e+00,
   

In [57]:
sc = ss.sparkContext

In [58]:
centers_rdd = sc.parallelize(centers)

In [59]:
centers_rdd.saveAsTextFile("MiniProject 2 ClusterResult A")

## Exercise 7 (15 points)
Complete the code below to perform k-means clustering using only one-hot-encoded features.

In [60]:
input_features2 = [ ]
for i in range(0, top_ports ):
    input_features2.append( "Port"+ Top_Ports_list[i] )

In [61]:
print(input_features2)

['Port17132', 'Port17140', 'Port17128', 'Port17138', 'Port17130', 'Port17136', 'Port23', 'Port445', 'Port54594', 'Port17142', 'Port17134', 'Port80', 'Port8080', 'Port0', 'Port2323', 'Port5555', 'Port81', 'Port1023', 'Port52869', 'Port8443', 'Port49152', 'Port7574', 'Port37215', 'Port34218', 'Port34220', 'Port33968', 'Port34224', 'Port34228', 'Port33962', 'Port33960', 'Port33964', 'Port34216', 'Port34226', 'Port33970', 'Port33972', 'Port50401', 'Port34222', 'Port34230', 'Port33966', 'Port33974', 'Port3389', 'Port1433', 'Port22', 'Port5353', 'Port21', 'Port8291', 'Port8728', 'Port443', 'Port5900', 'Port8000']


In [62]:
va2 = VectorAssembler().setInputCols(input_features2).setOutputCol("features2")

In [63]:
data2= va2.transform(Scanners_df2)

In [64]:
data2.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: double, Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: double, mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Ports_Array: array<string>, Port17132: boolean, Port17140: boolean, Port17128: boolean, Port17138: boolean, Port17130: boolean, Port17136: boolean, Port23: boolean, Port445: boolean, Port54594: boolean, Port17142: boolean, Port17134: boolean, Port80: boolean, Port8080: boolean, Port0: boolean, Port2323: boolean, Port5555: boolean, Port81: boolean, Port1023: boolean, Port52869: boolean, Port8443: boolean, Port49152: boolean, Port7574: boolean, Port37215: boolean, Port34218: boolean, Port34220: boolean, Port33968: boolean, Port34224: boolean, Port34228: boolean, Port33962: boo

In [65]:
data2.show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+--------------+---------+---------+---------+---------+---------+---------+------+-------+---------+---------+---------+------+--------+-----+--------+--------+------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+--------+--------+------+--------+------+--------+--------+-------+--------+--------+--------------------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|ports_scanned_str|host_tags_p

In [66]:
km2 = KMeans(featuresCol="features2", predictionCol="prediction2").setK(100).setSeed(123)
km2.explainParams()

'distanceMeasure: the distance measure. Supported options: \'euclidean\' and \'cosine\'. (default: euclidean)\nfeaturesCol: features column name. (default: features, current: features2)\ninitMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)\ninitSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)\nk: The number of clusters to create. Must be > 1. (default: 2, current: 100)\nmaxIter: max number of iterations (>= 0). (default: 20)\npredictionCol: prediction column name. (default: prediction, current: prediction2)\nseed: random seed. (default: -2704299597103909330, current: 123)\ntol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)\nweightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)'

In [67]:
kmModel2=km2.fit(data2)

In [68]:
kmModel2

KMeansModel: uid=KMeans_96198e02a611, k=100, distanceMeasure=euclidean, numFeatures=50

In [69]:
predictions2 = kmModel2.transform(data2)

In [70]:
predictions2.persist().show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+--------------+---------+---------+---------+---------+---------+---------+------+-------+---------+---------+---------+------+--------+-----+--------+--------+------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+--------+--------+------+--------+------+--------+--------+-------+--------+--------+--------------------+-----------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|ports_scanned_str

In [71]:
summary2 = kmModel2.summary

In [72]:
summary2.clusterSizes

[1226,
 7751,
 760,
 25079,
 430,
 2808,
 37512,
 532,
 318,
 611,
 8408,
 22175,
 17349,
 7438,
 220,
 505,
 966,
 1518,
 979,
 1226,
 168,
 369,
 8087,
 1182,
 412,
 229,
 531,
 448,
 256,
 1160,
 287,
 226,
 634,
 4042,
 7079,
 7044,
 1204,
 641,
 8115,
 894,
 1042,
 381,
 4057,
 8360,
 189,
 89,
 1802,
 249,
 515,
 312,
 924,
 53,
 222,
 1048,
 253,
 432,
 446,
 903,
 284,
 974,
 320,
 445,
 545,
 638,
 1567,
 603,
 227,
 743,
 343,
 437,
 466,
 780,
 200,
 609,
 255,
 323,
 395,
 606,
 185,
 304,
 2402,
 985,
 854,
 643,
 326,
 736,
 411,
 94,
 656,
 787,
 1109,
 1253,
 464,
 1120,
 403,
 374,
 14,
 712,
 157,
 217]

In [73]:
evaluator2 = ClusteringEvaluator(featuresCol='features2', predictionCol='prediction2')
silhouette2 = evaluator2.evaluate(predictions2)

In [74]:
print('Silhouette Score of the Clustering Result is ', silhouette2)

Silhouette Score of the Clustering Result is  0.7394072054440451


In [75]:
centers2 = kmModel2.clusterCenters()

In [76]:
centers2_rdd = sc.parallelize(centers2)

In [77]:
centers2_rdd.saveAsTextFile("MiniProject 2 Cluster Centers Only OHE")
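When only one-hot-encoded features are used, each coordinate of a cluster center is the mean of 0/1 values, so it can be read as the fraction of scanners in that cluster that scan the corresponding port. A small sketch of using this to characterize a cluster follows; the center values, the four-feature list, the helper `dominant_ports`, and the 0.5 threshold are all hypothetical choices for illustration.

```python
def dominant_ports(center, feature_names, threshold=0.5):
    """Return the features whose center coordinate exceeds the threshold."""
    return [name for name, value in zip(feature_names, center) if value > threshold]


# Hypothetical center over four one-hot features (values are fractions in [0, 1]).
feature_names = ["Port17132", "Port23", "Port445", "Port80"]
center = [0.02, 0.97, 0.00, 0.61]

print(dominant_ports(center, feature_names))  # ['Port23', 'Port80']
```

Applied to a real center, the same idea would pair each entry of the center with the matching name in the feature list used to assemble the vectors.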

# Exercise 8 (10 points) 
- (a) Compare the clustering results of the two approaches above: (1) OHE + two numerical features (lifetime and Packets), and (2) OHE only. (5 points)
- (b) Discuss the reasons one approach is worse than the other. (5 points)

# Answer to Exercise 8:
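- (a) The Silhouette score of the OHE + numerical features clustering is 0.9083484568569015. The Silhouette score of the OHE-only clustering is 0.7394072054440451. By Silhouette score, OHE + numerical features is better. Its fitted model, however, reports k=41, and one cluster holds 202107 scanners while many clusters hold a single scanner; the OHE-only model uses all 100 clusters.
- (b) Selecting specific features can help find clusters more efficiently, make the data easier to understand, and reduce the amount of data to store, collect, and process.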
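One factor that may help interpret this comparison: k-means with Euclidean distance is sensitive to feature scale. The sketch below uses hypothetical feature vectors to show how a difference in an unscaled numeric feature such as lifetime can outweigh a difference in the 0/1 port flags. Rescaling the numeric columns (for example with `pyspark.ml.feature.StandardScaler`) is one common option, although the assignment does not ask for it.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical feature vectors: [lifetime, Packets, Port23, Port445].
scanner_a = [1000.0, 40.0, 1.0, 0.0]
scanner_b = [1000.0, 40.0, 0.0, 1.0]  # same numeric features, different ports
scanner_c = [5000.0, 40.0, 1.0, 0.0]  # same ports as scanner_a, longer lifetime

print(euclidean(scanner_a, scanner_b))  # differs only in ports: about 1.41
print(euclidean(scanner_a, scanner_c))  # differs only in lifetime: 4000.0
```

Under these assumed values, two scanners probing entirely different ports look far more alike than two scanners probing the same port with different lifetimes.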

# Exercise 9 (30 points)
Modify the Jupyter Notebook to run in cluster mode using the big dataset (Day_2020_profile.csv).
Submit the .py file using spark-submit in cluster mode to calculate the cluster centers of the two approaches (one using OHE + numerical features, the other using only OHE).
- Submit the .py file and the log file that contains the run-time information.
- Submit a screenshot showing the output directories (both in local mode and in cluster mode).
- Submit the output files for each approach in cluster mode.

In [None]:
ss.stop()