# DS/CMPSC 410 MiniProject Deliverable #1 

# Spring 2022
## Instructor: John Yen
## TA: Rupesh Prajapati
## LA: Lily Jakielaszek and Cayla Shan Pun
## Learning Objectives
- Be able to identify frequent 2-port sets and 3-port sets that are scanned by scanners in the Darknet dataset.
- Be able to adapt the Apriori algorithm by incorporating a suitable threshold.
- Be able to improve the performance of frequent port set mining by reusing RDDs, with appropriate persist and unpersist on each reused RDD.
- Be able to enhance the performance of mining frequent port sets from a big dataset using persist.

### Total points: 100 
- Exercise 1: 10 points
- Exercise 2: 10 points
- Exercise 3: 5 points
- Exercise 4: 15 points
- Exercise 5: 20 points
- Exercise 6: 10 points
- Exercise 7: 30 points
  
### Due: midnight, April 1, 2022

In [1]:
import pyspark
import csv
import pandas as pd

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.clustering import KMeans

In [3]:
ss = SparkSession.builder.master("local").appName("Mini Project #1 Frequent Port Sets").getOrCreate()

# Exercise 1 (10 points)
- Complete the path below for reading "sampled_profile.csv", which you downloaded from Canvas and uploaded to your Mini Project 1 folder. (5 points)
- Fill in your Name (5 points): Haichen Wei

In [4]:
# Scanners_df = ss.read.csv("/gpfs/group/juy1/default/private/Day_2021_profile.csv", header= True, inferSchema=True )
Scanners_df = ss.read.csv("/storage/home/hxw5245/MiniProj1/sampled_profile.csv", header=True, inferSchema=True)

## We can use printSchema() to display the schema of the DataFrame Scanners_df to see whether it was inferred correctly.

In [5]:
Scanners_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: double (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: double (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)



# Part A Transform the feature "ports_scanned_str" into an array of ports.
### The original value of the column is a string that joins all the ports scanned by a scanner, separated by a dash "-". For example, "81-161-2000" indicates that the scanner has scanned three ports: port 81, port 161, and port 2000. Therefore, we use split to separate the string into an array of ports for each scanner. This transformation is important because it enables the identification of frequent ports scanned by scanners.
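As a minimal sketch of what the split does to a single value, the plain-Python example below uses `str.split` on the example string from the text. It is only an illustration: the notebook itself uses `pyspark.sql.functions.split`, whose second argument is a regular-expression pattern (a lone "-" matches a literal dash). The variable names here are hypothetical.

```python
# Plain-Python illustration of splitting a dash-joined port string.
ports_scanned_str = "81-161-2000"  # example value taken from the text above
ports_array = ports_scanned_str.split("-")
print(ports_array)

# The ports remain strings after the split, so membership tests
# must compare against strings rather than integers.
print("161" in ports_array)
print(161 in ports_array)
```

Because the ports stay strings, later lookups such as `Top_Ports[i] in x` compare string to string, which is consistent with the `Ports_Array` values shown below.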

## The original value of the column "ports_scanned_str" 

In [6]:
Scanners_df.select("ports_scanned_str").show(10)

+--------------------+
|   ports_scanned_str|
+--------------------+
|               13716|
|         17128-17136|
|               35134|
|               17140|
|               54594|
|               17130|
|               54594|
|               37876|
|               17142|
|17128-17130-17132...|
+--------------------+
only showing top 10 rows



## Convert the Column 'ports_scanned_str' into an Array of ports scanned by each scanner (row)

In [7]:
Scanners_df2=Scanners_df.withColumn("Ports_Array", split(col("ports_scanned_str"), "-") )

## For Mining Frequent Port Sets being scanned, we only need the column `Ports_Array`

In [8]:
Scanners_df2.select("ports_scanned_str", "Ports_Array").show(10)

+--------------------+--------------------+
|   ports_scanned_str|         Ports_Array|
+--------------------+--------------------+
|               13716|             [13716]|
|         17128-17136|      [17128, 17136]|
|               35134|             [35134]|
|               17140|             [17140]|
|               54594|             [54594]|
|               17130|             [17130]|
|               54594|             [54594]|
|               37876|             [37876]|
|               17142|             [17142]|
|17128-17130-17132...|[17128, 17130, 17...|
+--------------------+--------------------+
only showing top 10 rows



In [9]:
Ports_Scanned_RDD = Scanners_df2.select("Ports_Array").rdd

In [10]:
Ports_Scanned_RDD.take(5)

[Row(Ports_Array=['13716']),
 Row(Ports_Array=['17128', '17136']),
 Row(Ports_Array=['35134']),
 Row(Ports_Array=['17140']),
 Row(Ports_Array=['54594'])]

# Convert an RDD of Ports_Array into an RDD of port list.
## Because the Row object in the RDD only has Ports_Array, we can access the content of Ports_Array using index 0.
## The created RDD `multi_Ports_list_RDD` will be used in finding frequent port sets below (Part B).

In [11]:
multi_Ports_list_RDD = Ports_Scanned_RDD.map(lambda x: x[0])

In [12]:
multi_Ports_list_RDD.take(5)

[['13716'], ['17128', '17136'], ['35134'], ['17140'], ['54594']]

# Part B: Finding all ports that have been scanned by at least 1000 scanners.

## Because each port number in the Ports_Array column occurs only once per row/scanner, we can count the total occurrences of each port number through flatMap.

## `flatMap` flattens the RDD into a single list of ports. We can then count the occurrences of each port in the flattened RDD, which equals the number of scanners that scan the port.
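The flatten-then-count idea can be sketched without Spark. The example below is an assumption-laden analogue, not the notebook's code: the toy data and variable names are hypothetical, `chain.from_iterable` stands in for `flatMap`, and `Counter` stands in for the `(port, 1)` mapping followed by `reduceByKey`.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy data: one list of ports per scanner (ports are strings).
scanner_port_lists = [["23", "80"], ["23"], ["80", "445", "23"]]

# Analogue of flatMap(lambda x: x): flatten into one sequence of ports.
flat_ports = list(chain.from_iterable(scanner_port_lists))

# Analogue of map(lambda x: (x, 1)) followed by reduceByKey(lambda x, y: x + y).
port_counts = Counter(flat_ports)

print(flat_ports)
print(port_counts["23"], port_counts["80"], port_counts["445"])
```

Each count equals the number of scanners only because a port appears at most once in any scanner's list; with repeated ports inside a row, the lists would need deduplicating first.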

# Exercise 2 (10 points) Complete the code below to calculate the total number of scanners that scan each port. 

In [13]:
port_list_RDD = multi_Ports_list_RDD.flatMap(lambda x: x)

In [14]:
Port_count_RDD = port_list_RDD.map(lambda x: (x,1) )
Port_count_RDD.take(5)

[('13716', 1), ('17128', 1), ('17136', 1), ('35134', 1), ('17140', 1)]

In [15]:
Port_count_total_RDD = Port_count_RDD.reduceByKey(lambda x,y: x+y, 1)
Port_count_total_RDD.take(5)

[('13716', 14),
 ('17128', 31850),
 ('17136', 31617),
 ('35134', 13),
 ('17140', 31865)]

## How many ports are being scanned?

In [16]:
Port_count_total_RDD.count()

65536

In [17]:
Sorted_Count_Port_RDD = Port_count_total_RDD.map(lambda x: (x[1], x[0])).sortByKey( ascending = False)

In [18]:
Sorted_Count_Port_RDD.take(10)

[(32014, '17132'),
 (31865, '17140'),
 (31850, '17128'),
 (31805, '17138'),
 (31630, '17130'),
 (31617, '17136'),
 (29199, '23'),
 (25466, '445'),
 (25216, '54594'),
 (21700, '17142')]

## Since we are interested in ports that are scanned by at least 1000 scanners, we can use 999 as the threshold and keep the ports whose count is greater than it.
## Exercise 3 (5 points) Complete the following code to filter for ports that have at least 1000 scanners.

In [19]:
threshold = 999
Filtered_Sorted_Count_Port_RDD= Sorted_Count_Port_RDD.filter(lambda x: x[0] > threshold)
Filtered_Sorted_Count_Port_RDD

PythonRDD[33] at RDD at PythonRDD.scala:53

In [20]:
Filtered_Sorted_Count_Port_RDD.count()

42
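The sort-and-filter step can be illustrated on a handful of values. This is a sketch on hypothetical `(count, port)` pairs, written in plain Python rather than with RDDs; it shows why a strict `>` comparison against 999 keeps exactly the ports with at least 1000 scanners.

```python
# Hypothetical (count, port) pairs, shaped like the elements of Sorted_Count_Port_RDD.
count_port_pairs = [(31850, "17128"), (14, "13716"), (1000, "3389"), (999, "9999")]

threshold = 999

# Analogue of sortByKey(ascending=False): order by the count only.
sorted_pairs = sorted(count_port_pairs, key=lambda pair: pair[0], reverse=True)

# Analogue of filter(lambda x: x[0] > threshold) followed by map(lambda x: x[1]).
frequent_ports = [port for count, port in sorted_pairs if count > threshold]

print(frequent_ports)
```

The port with exactly 1000 scanners is kept and the one with 999 is dropped, which matches the "at least 1000" requirement.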

# After we apply collect to the RDD, we get a list of single ports that are scanned by at least 1000 scanners in the small dataset.

In [21]:
Top_Ports = Filtered_Sorted_Count_Port_RDD.map(lambda x: x[1]).collect()

In [22]:
Top_1_Port_count = len(Top_Ports)

In [23]:
print(Top_Ports)

['17132', '17140', '17128', '17138', '17130', '17136', '23', '445', '54594', '17142', '17134', '80', '8080', '0', '2323', '5555', '81', '1023', '52869', '8443', '49152', '7574', '37215', '34218', '34220', '33968', '34224', '34228', '33962', '33960', '33964', '34216', '34226', '33970', '33972', '50401', '34222', '34230', '33966', '33974', '3389', '1433']


In [24]:
print(Top_1_Port_count)

42


# Part C Finding Frequent Two-Port Sets

# We use the multi_Ports_list_RDD generated earlier to find frequent 2-port sets being scanned.

In [25]:
multi_Ports_list_RDD.take(5)

[['13716'], ['17128', '17136'], ['35134'], ['17140'], ['54594']]

In [26]:
top_port_RDD = multi_Ports_list_RDD.filter(lambda x: Top_Ports[0] in x)

In [27]:
top_port_RDD.take(10)

[['17128', '17130', '17132', '17134', '17136', '17138', '17140'],
 ['17128', '17132', '17136', '17140', '17142', '34230'],
 ['17128', '17130', '17132', '17136', '17138', '17140'],
 ['17128', '17132', '17136', '17138', '17140'],
 ['17130', '17132', '17136', '17138', '17140', '17142', '34218'],
 ['17132'],
 ['17128',
  '17130',
  '17132',
  '17134',
  '17136',
  '17138',
  '17140',
  '17142',
  '33960',
  '33966',
  '33968',
  '33972',
  '34224'],
 ['17132'],
 ['17132'],
 ['17128',
  '17130',
  '17132',
  '17134',
  '17136',
  '17138',
  '17140',
  '33964',
  '34226',
  '34228']]

# Exercise 4 (15 points)
- Complete the following code for finding 2 port sets (7 points)
- Add suitable persist and unpersist to suitable RDD (8 points)

In [28]:
# Initialize a Pandas DataFrame to store frequent 2-port sets and their counts
Two_Port_Sets_df = pd.DataFrame( columns= ['Port Sets', 'count'])
# Initialize the row index into Two_Port_Sets_df to 0
index = 0
# A port set is frequent if at least 1000 scanners scan it, i.e. count > 999
threshold = 999
for i in range(0, Top_1_Port_count-1):
    Scanners_port_i_RDD = multi_Ports_list_RDD.filter(lambda x: Top_Ports[i] in x)
    # Persist: this RDD is reused by count() and by every filter in the inner loop
    Scanners_port_i_RDD.persist()
    one_port_count = Scanners_port_i_RDD.count()
    if one_port_count > threshold:
        # j must reach the last index so that the final port in Top_Ports is paired too
        for j in range(i+1, Top_1_Port_count):
            # This RDD feeds a single count(), so persisting it would add cost without any reuse
            Scanners_port_i_j_RDD = Scanners_port_i_RDD.filter(lambda x: Top_Ports[j] in x)
            two_ports_count = Scanners_port_i_j_RDD.count()
            if two_ports_count > threshold:
                Two_Port_Sets_df.loc[index]=[ [Top_Ports[i], Top_Ports[j]], two_ports_count]
                index = index +1
                print("Two Ports: ", Top_Ports[i], ", ", Top_Ports[j], ": Count ", two_ports_count)
    Scanners_port_i_RDD.unpersist()

Two Ports:  17132 ,  17140 : Count  16317
Two Ports:  17132 ,  17128 : Count  16279
Two Ports:  17132 ,  17138 : Count  16299
Two Ports:  17132 ,  17130 : Count  16336
Two Ports:  17132 ,  17136 : Count  16148
Two Ports:  17132 ,  17142 : Count  12722
Two Ports:  17132 ,  17134 : Count  12761
Two Ports:  17132 ,  34218 : Count  2658
Two Ports:  17132 ,  34220 : Count  2666
Two Ports:  17132 ,  33968 : Count  2608
Two Ports:  17132 ,  34224 : Count  2619
Two Ports:  17132 ,  34228 : Count  2624
Two Ports:  17132 ,  33962 : Count  2591
Two Ports:  17132 ,  33960 : Count  2628
Two Ports:  17132 ,  33964 : Count  2567
Two Ports:  17132 ,  34216 : Count  2552
Two Ports:  17132 ,  34226 : Count  2555
Two Ports:  17132 ,  33970 : Count  2540
Two Ports:  17132 ,  33972 : Count  2564
Two Ports:  17132 ,  34222 : Count  1630
Two Ports:  17132 ,  34230 : Count  1599
Two Ports:  17132 ,  33966 : Count  1594
Two Ports:  17132 ,  33974 : Count  1484
Two Ports:  17140 ,  17128 : Count  16161
Two Port
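The candidate-pair logic of the loop above can be sketched in plain Python. The example rests on the Apriori observation that a pair cannot be scanned by more scanners than either of its member ports, so only ports that are frequent on their own need to be paired. The data, the small threshold, and all names are hypothetical, and `itertools.combinations` replaces the nested `i`/`j` loops.

```python
from itertools import combinations

# Hypothetical toy data: one list of ports per scanner.
scanner_port_lists = [
    ["23", "80", "445"],
    ["23", "80"],
    ["23", "445"],
    ["80"],
    ["23", "80", "8080"],
]
top_ports = ["23", "80", "445"]  # ports assumed to be frequent on their own
threshold = 1                    # keep pairs seen more than once (the notebook uses 999)

frequent_pairs = {}
for port_a, port_b in combinations(top_ports, 2):
    # Count the scanners whose port list contains both ports of the candidate pair.
    count = sum(
        1 for ports in scanner_port_lists if port_a in ports and port_b in ports
    )
    if count > threshold:
        frequent_pairs[(port_a, port_b)] = count

print(frequent_pairs)
```

`combinations(top_ports, 2)` visits every unordered pair exactly once, including pairs that involve the last element, which is the coverage the index-based loops need to reproduce.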

# Part D: Finding Frequent 2-Port Sets and 3-Port Sets

# Exercise 5 (20 points)  Modify and complete the following code to find BOTH frequent 2-port sets AND frequent 3-port sets
## Hint 1: Add the code for saving frequent 2-port sets in the Pandas dataframe `Two_Port_Sets_df` (similar to the code in the previous exercise).
## Hint 2: You need two `index` variables, one for each Pandas dataframe.

In [29]:
# Initialize Pandas DataFrames to store frequent port sets and their counts
# (Two_Port_Sets_df is re-created here so this cell also runs when Part C is removed)
Two_Port_Sets_df = pd.DataFrame( columns= ['Port Sets', 'count'])
Three_Port_Sets_df = pd.DataFrame( columns= ['Port Sets', 'count'])
# One row index per DataFrame: index2 for Two_Port_Sets_df, index3 for Three_Port_Sets_df
index2 = 0
index3 = 0
# A port set is frequent if at least 1000 scanners scan it, i.e. count > 999
threshold = 999
for i in range(0, Top_1_Port_count-1):
    Scanners_port_i_RDD = multi_Ports_list_RDD.filter(lambda x: Top_Ports[i] in x)
    # Persist: this RDD is reused by every filter in the j loop
    Scanners_port_i_RDD.persist()
    # j and k must reach the last index so that the final port in Top_Ports is included
    for j in range(i+1, Top_1_Port_count):
        Scanners_port_i_j_RDD = Scanners_port_i_RDD.filter(lambda x: Top_Ports[j] in x)
        # Persist: this RDD is reused by count() and by every filter in the k loop
        Scanners_port_i_j_RDD.persist()
        two_ports_count = Scanners_port_i_j_RDD.count()
        if two_ports_count > threshold:
            Two_Port_Sets_df.loc[index2]=[ [Top_Ports[i], Top_Ports[j]], two_ports_count]
            index2 = index2 +1
            print("Two Ports: ", Top_Ports[i], ", ", Top_Ports[j], ": Count ", two_ports_count)
            # Only frequent 2-port sets are extended to 3-port candidates (Apriori pruning)
            for k in range(j+1, Top_1_Port_count):
                Scanners_port_i_j_k_RDD = Scanners_port_i_j_RDD.filter(lambda x: Top_Ports[k] in x)
                three_ports_count = Scanners_port_i_j_k_RDD.count()
                if three_ports_count > threshold:
                    Three_Port_Sets_df.loc[index3] = [ [Top_Ports[i], Top_Ports[j], Top_Ports[k]], three_ports_count]
                    index3 = index3 + 1
                    print("Three Ports: ", Top_Ports[i], ", ", Top_Ports[j], ",  ", Top_Ports[k], ": Count ", three_ports_count)
        Scanners_port_i_j_RDD.unpersist()
    Scanners_port_i_RDD.unpersist()

Two Ports:  17132 ,  17140 : Count  16317
Three Ports:  17132 ,  17140 ,   17128 : Count  12594
Three Ports:  17132 ,  17140 ,   17138 : Count  12562
Three Ports:  17132 ,  17140 ,   17130 : Count  12665
Three Ports:  17132 ,  17140 ,   17136 : Count  12522
Three Ports:  17132 ,  17140 ,   17142 : Count  10461
Three Ports:  17132 ,  17140 ,   17134 : Count  10454
Three Ports:  17132 ,  17140 ,   34218 : Count  2463
Three Ports:  17132 ,  17140 ,   34220 : Count  2453
Three Ports:  17132 ,  17140 ,   33968 : Count  2420
Three Ports:  17132 ,  17140 ,   34224 : Count  2461
Three Ports:  17132 ,  17140 ,   34228 : Count  2443
Three Ports:  17132 ,  17140 ,   33962 : Count  2397
Three Ports:  17132 ,  17140 ,   33960 : Count  2413
Three Ports:  17132 ,  17140 ,   33964 : Count  2361
Three Ports:  17132 ,  17140 ,   34216 : Count  2359
Three Ports:  17132 ,  17140 ,   34226 : Count  2382
Three Ports:  17132 ,  17140 ,   33970 : Count  2334
Three Ports:  17132 ,  17140 ,   33972 : Count  239
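To see why persisting a reused intermediate result helps, the toy model below may be useful. It is not the Spark API: `LazyDataset`, `load_base`, and `count_pairs` are hypothetical names, and the class only imitates lazy evaluation, in which every action recomputes its whole lineage unless an intermediate result has been persisted. The counter records how often the source data is scanned.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset; this is not the Spark API."""

    def __init__(self, compute):
        self._compute = compute  # function that produces the rows on demand
        self._persist = False
        self._cached_rows = None

    def persist(self):
        self._persist = True
        return self

    def unpersist(self):
        self._persist = False
        self._cached_rows = None
        return self

    def rows(self):
        # Serve rows from the cache when this dataset has been persisted and computed.
        if self._persist and self._cached_rows is not None:
            return self._cached_rows
        result = self._compute()
        if self._persist:
            self._cached_rows = result
        return result

    def filter(self, predicate):
        # Lazy: nothing is computed until rows() or count() is called.
        return LazyDataset(lambda: [r for r in self.rows() if predicate(r)])

    def count(self):
        return len(self.rows())


scans = {"base": 0}


def load_base():
    scans["base"] += 1  # record every full pass over the source data
    return [["23", "80"], ["23", "445"], ["80"], ["23", "80", "445"]]


def count_pairs(use_persist):
    scans["base"] = 0
    with_23 = LazyDataset(load_base).filter(lambda ports: "23" in ports)
    if use_persist:
        with_23.persist()
    counts = [
        with_23.filter(lambda ports, p=p: p in ports).count() for p in ("80", "445")
    ]
    with_23.unpersist()
    return counts, scans["base"]


print(count_pairs(use_persist=False))
print(count_pairs(use_persist=True))
```

In this model the counts are the same either way, but the source is scanned once per action without persist and only once with it. The same reasoning suggests that persisting a dataset consumed by a single action brings no benefit.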

# Convert the Pandas dataframes into PySpark DataFrame

In [30]:
Two_Port_Sets_DF = ss.createDataFrame(Two_Port_Sets_df)
Three_Port_Sets_DF = ss.createDataFrame(Three_Port_Sets_df)

# Exercise 6 (10 points)
Complete the following code to save your frequent 2 port sets and 3 port sets in an output file.

In [31]:
output_path_2_port = "/storage/home/hxw5245/MiniProj1/MiniProj1_2Ports_3_23_2022"
output_path_3_port = "/storage/home/hxw5245/MiniProj1/MiniProj1_3Ports_3_23_2022"
Two_Port_Sets_DF.rdd.saveAsTextFile(output_path_2_port)
Three_Port_Sets_DF.rdd.saveAsTextFile(output_path_3_port)

# Exercise 7 (30 points)
- Remove .master("local") from SparkSession statement
- Change the input file to "/gpfs/group/juy1/default/private/Day_2021_profile.csv"
- Remove part C. 
- Change the output files to two different directories from the ones you used in Exercise 6
- Export the notebook as a .py file
- Run spark-submit on ICDS Roar 
- Record the performance time.

performance time: 


real	19m4.872s

user	10m54.398s

sys	1m57.307s