### Welcome to the Colab Spark Tutorial.

We will be using Spark a few times in this course, and the _colab_ environment provides the compute (for 12 hours at a time) we need, along with this wonderful web-based notebook.

Today we will configure PySpark and compare Spark SQL queries with their equivalents in the Spark DataFrame API.

Source material includes [[1](https://opensource.com/article/19/3/apache-spark-and-dataframes-tutorial)]

Sections:

 1. Configuring your _colab_
 2. Using PySpark


First, we need to configure the _colab_ instance.

In [None]:
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.3 LTS
Release:	18.04
Codename:	bionic


In [None]:
!apt-get update

Get:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ InRelease [3,626 B]
Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran35/ Packages [95.7 kB]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Get:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release [564 B]
Get:7 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release.gpg [833 B]
Hit:8 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:9 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease [21.3 kB]
Get:10 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:12 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:13 https://developer.

In [None]:
# Install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null


In [None]:
# get spark 
!wget https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz

--2020-09-02 08:19:06--  https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 2a01:4f8:10a:201a::2
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 220272364 (210M) [application/x-gzip]
Saving to: ‘spark-3.0.0-bin-hadoop2.7.tgz’


2020-09-02 08:19:28 (9.84 MB/s) - ‘spark-3.0.0-bin-hadoop2.7.tgz’ saved [220272364/220272364]



In [None]:
# decompress spark
!tar xf spark-3.0.0-bin-hadoop2.7.tgz

# install python package to help with system paths
!pip install -q findspark

In [None]:
# Let Colab know where the java and spark folders are

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop2.7"
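
Both paths are assumptions about where the earlier cells put things: the JDK folder installed by the `openjdk-8-jdk-headless` package and the folder that `tar xf` created under `/content`. A minimal sketch of a sanity check, using a hypothetical helper `missing_dirs` that is not part of findspark or Spark, can flag a wrong path before `findspark.init()` runs:

```python
import os

def missing_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point at a directory.

    Hypothetical helper for illustration; `env` is any mapping such as os.environ.
    """
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# An empty list means both folders exist; otherwise re-check the install cells above.
print(missing_dirs(os.environ))
```

If a name is reported, it is usually quicker to fix the path here than to debug the error that a later Spark call would raise.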

In [None]:
# add pyspark to sys.path using findspark
import findspark
findspark.init()

# get a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [None]:
spark

Let's download some network-connection data: the 10% subset of the KDD Cup 1999 intrusion-detection dataset (`kddcup.data_10_percent.gz`) and its accompanying names file (`kddcup.names`), both hosted at `kdd.ics.uci.edu`.

In [None]:
! wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz
! wget http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names

--2020-09-02 08:19:41--  http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz
Resolving kdd.ics.uci.edu (kdd.ics.uci.edu)... 128.195.1.86
Connecting to kdd.ics.uci.edu (kdd.ics.uci.edu)|128.195.1.86|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2144903 (2.0M) [application/x-gzip]
Saving to: ‘kddcup.data_10_percent.gz’


2020-09-02 08:19:42 (1.89 MB/s) - ‘kddcup.data_10_percent.gz’ saved [2144903/2144903]

--2020-09-02 08:19:42--  http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
Resolving kdd.ics.uci.edu (kdd.ics.uci.edu)... 128.195.1.86
Connecting to kdd.ics.uci.edu (kdd.ics.uci.edu)|128.195.1.86|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1307 (1.3K)
Saving to: ‘kddcup.names’


2020-09-02 08:19:43 (206 MB/s) - ‘kddcup.names’ saved [1307/1307]



In [None]:
!gunzip kddcup.data_10_percent.gz

In [None]:
import pandas as pd
df = pd.read_csv('kddcup.data_10_percent', header=None)

In [None]:
df[2].value_counts()

ecr_i      281400
private    110893
http        64293
smtp         9723
other        7237
            ...  
X11            11
tim_i           7
pm_dump         1
red_i           1
tftp_u          1
Name: 2, Length: 66, dtype: int64


In [None]:
raw_rdd = spark.sparkContext.textFile('kddcup.data_10_percent').cache()
raw_rdd.take(5)

['0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.',
 '0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.',
 '0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',
 '0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal.',
 '0,tcp,http,SF,217,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,49,49,1.00,0.00,0.02,0.00,0.00,0.00,0.00,0.00,normal.']

In [None]:
csv_rdd = raw_rdd.map(lambda row: row.split(","))
print(csv_rdd.take(2))
print(type(csv_rdd))

[['0', 'tcp', 'http', 'SF', '181', '5450', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '8', '8', '0.00', '0.00', '0.00', '0.00', '1.00', '0.00', '0.00', '9', '9', '1.00', '0.00', '0.11', '0.00', '0.00', '0.00', '0.00', '0.00', 'normal.'], ['0', 'tcp', 'http', 'SF', '239', '486', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '8', '8', '0.00', '0.00', '0.00', '0.00', '1.00', '0.00', '0.00', '19', '19', '1.00', '0.00', '0.05', '0.00', '0.00', '0.00', '0.00', '0.00', 'normal.']]
<class 'pyspark.rdd.PipelinedRDD'>


In [None]:
from pyspark.sql import Row

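# Keep a subset of the 42 comma-separated fields, addressed by position,
# and cast the numeric ones to int.
parsed_rdd = csv_rdd.map(lambda r: Row(
    duration=int(r[0]),
    protocol_type=r[1],
    service=r[2],
    flag=r[3],
    src_bytes=int(r[4]),
    dst_bytes=int(r[5]),
    wrong_fragment=int(r[7]),
    urgent=int(r[8]),
    hot=int(r[9]),
    num_failed_logins=int(r[10]),
    num_compromised=int(r[12]),
    su_attempted=r[14],  # left as a string, so SUM(su_attempted) later comes back as a double
    num_root=int(r[15]),
    num_file_creations=int(r[16]),
    label=r[-1]  # last field is the class label, e.g. 'normal.'
    )
)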
parsed_rdd.take(5)

[Row(duration=0, protocol_type='tcp', service='http', flag='SF', src_bytes=181, dst_bytes=5450, wrong_fragment=0, urgent=0, hot=0, num_failed_logins=0, num_compromised=0, su_attempted='0', num_root=0, num_file_creations=0, label='normal.'),
 Row(duration=0, protocol_type='tcp', service='http', flag='SF', src_bytes=239, dst_bytes=486, wrong_fragment=0, urgent=0, hot=0, num_failed_logins=0, num_compromised=0, su_attempted='0', num_root=0, num_file_creations=0, label='normal.'),
 Row(duration=0, protocol_type='tcp', service='http', flag='SF', src_bytes=235, dst_bytes=1337, wrong_fragment=0, urgent=0, hot=0, num_failed_logins=0, num_compromised=0, su_attempted='0', num_root=0, num_file_creations=0, label='normal.'),
 Row(duration=0, protocol_type='tcp', service='http', flag='SF', src_bytes=219, dst_bytes=1337, wrong_fragment=0, urgent=0, hot=0, num_failed_logins=0, num_compromised=0, su_attempted='0', num_root=0, num_file_creations=0, label='normal.'),
 Row(duration=0, protocol_type='tcp',
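
To see what the `Row(...)` mapping does to a single record without starting Spark, the same positional indexing can be tried in plain Python. This is only a sketch: the sample line is the first record printed by `raw_rdd.take(5)`, while `Record` and `parse_line` are hypothetical stand-ins that cover just a few of the fields.

```python
from collections import namedtuple

# First record printed by raw_rdd.take(5) above.
SAMPLE = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,"
          "0.00,0.00,0.00,0.00,normal.")

# Hypothetical stand-in for pyspark.sql.Row, limited to a few of the fields.
Record = namedtuple("Record", "duration protocol_type service src_bytes dst_bytes label")

def parse_line(line):
    """Split one CSV line and pick fields by the same positions as the Row mapping."""
    r = line.split(",")
    return Record(
        duration=int(r[0]),
        protocol_type=r[1],
        service=r[2],
        src_bytes=int(r[4]),
        dst_bytes=int(r[5]),
        label=r[-1],
    )

fields = SAMPLE.split(",")
print(len(fields))          # number of comma-separated fields in one record
print(parse_line(SAMPLE))
```

Indexing by position is brittle if the file layout changes, which is one reason to check the field count before trusting the mapping.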

In [None]:
df = spark.createDataFrame(parsed_rdd)
df.show()

+--------+-------------+-------+----+---------+---------+--------------+------+---+-----------------+---------------+------------+--------+------------------+-------+
|duration|protocol_type|service|flag|src_bytes|dst_bytes|wrong_fragment|urgent|hot|num_failed_logins|num_compromised|su_attempted|num_root|num_file_creations|  label|
+--------+-------------+-------+----+---------+---------+--------------+------+---+-----------------+---------------+------------+--------+------------------+-------+
|       0|          tcp|   http|  SF|      181|     5450|             0|     0|  0|                0|              0|           0|       0|                 0|normal.|
|       0|          tcp|   http|  SF|      239|      486|             0|     0|  0|                0|              0|           0|       0|                 0|normal.|
|       0|          tcp|   http|  SF|      235|     1337|             0|     0|  0|                0|              0|           0|       0|                 0|normal.

In [None]:
from pyspark.sql import functions as f

In [None]:
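# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
df.createOrReplaceTempView('data')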

---
# 0. Select columns

In [None]:
select = spark.sql("""SELECT protocol_type, service
                      FROM data""")

In [None]:
select.show(10)

+-------------+-------+
|protocol_type|service|
+-------------+-------+
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
+-------------+-------+
only showing top 10 rows



In [None]:
select_spark = df.select('protocol_type', 'service')

In [None]:
select_spark.show(10)

+-------------+-------+
|protocol_type|service|
+-------------+-------+
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
+-------------+-------+
only showing top 10 rows



#### OR using a list also works

In [None]:
select_spark = df.select(['protocol_type', 'service'])
select_spark.show(10)

+-------------+-------+
|protocol_type|service|
+-------------+-------+
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
|          tcp|   http|
+-------------+-------+
only showing top 10 rows



---
# 1. select as alias

In [None]:
alias = spark.sql("""SELECT protocol_type,
                            label as flag
                     FROM data
                  """)

In [None]:
alias.show()

+-------------+-------+
|protocol_type|   flag|
+-------------+-------+
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
+-------------+-------+
only showing top 20 rows



In [None]:
alias_spark = df.select('protocol_type', 'label').withColumnRenamed('label', 'flag')

In [None]:
alias_spark.show()

+-------------+-------+
|protocol_type|   flag|
+-------------+-------+
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
+-------------+-------+
only showing top 20 rows



#### OR using dataframe column-objects with .alias

In [None]:
alias_spark = df.select(df.protocol_type, df.label.alias('flag'))
alias_spark.show(10)

+-------------+-------+
|protocol_type|   flag|
+-------------+-------+
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
|          tcp|normal.|
+-------------+-------+
only showing top 10 rows



# 2. group by, count, order by

In [None]:
protocols = spark.sql("""
      SELECT protocol_type, count(*) as freq
      FROM data
      GROUP BY protocol_type
      ORDER BY 2 DESC
                           """)
protocols.show()

+-------------+------+
|protocol_type|  freq|
+-------------+------+
|         icmp|283602|
|          tcp|190065|
|          udp| 20354|
+-------------+------+



In [None]:
df.groupBy('protocol_type').count().orderBy('count', ascending=False).show()

+-------------+------+
|protocol_type| count|
+-------------+------+
|         icmp|283602|
|          tcp|190065|
|          udp| 20354|
+-------------+------+



In [None]:
df.count()

494021
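
The total from `df.count()` gives a quick consistency check on the grouped result: every row has exactly one `protocol_type`, so the three frequencies should add up to it. A small sketch in plain Python, with the numbers copied from the outputs above:

```python
# Frequencies copied from the GROUP BY protocol_type output above.
protocol_freq = {"icmp": 283602, "tcp": 190065, "udp": 20354}
total_rows = 494021  # value returned by df.count()

# Groups partition the rows, so their counts must sum to the row count.
assert sum(protocol_freq.values()) == total_rows

# Share of each protocol, largest first (mirrors ORDER BY 2 DESC).
for name, freq in sorted(protocol_freq.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>4}: {freq / total_rows:.1%}")
```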

---
# 3. group by, count, order by (using agg)

In [None]:
labels = spark.sql("""
  SELECT label, count(*) as freq
  FROM data
  GROUP BY label
  ORDER BY 2 DESC
""")

In [None]:
labels.show()

+----------------+------+
|           label|  freq|
+----------------+------+
|          smurf.|280790|
|        neptune.|107201|
|         normal.| 97278|
|           back.|  2203|
|          satan.|  1589|
|        ipsweep.|  1247|
|      portsweep.|  1040|
|    warezclient.|  1020|
|       teardrop.|   979|
|            pod.|   264|
|           nmap.|   231|
|   guess_passwd.|    53|
|buffer_overflow.|    30|
|           land.|    21|
|    warezmaster.|    20|
|           imap.|    12|
|        rootkit.|    10|
|     loadmodule.|     9|
|      ftp_write.|     8|
|       multihop.|     7|
+----------------+------+
only showing top 20 rows



In [None]:
labels_spark = df.groupBy('label')\
                 .agg(f.count(f.lit(1)).alias('freq'))\
                 .orderBy('freq', ascending=False)

In [None]:
labels_spark.show()

+----------------+------+
|           label|  freq|
+----------------+------+
|          smurf.|280790|
|        neptune.|107201|
|         normal.| 97278|
|           back.|  2203|
|          satan.|  1589|
|        ipsweep.|  1247|
|      portsweep.|  1040|
|    warezclient.|  1020|
|       teardrop.|   979|
|            pod.|   264|
|           nmap.|   231|
|   guess_passwd.|    53|
|buffer_overflow.|    30|
|           land.|    21|
|    warezmaster.|    20|
|           imap.|    12|
|        rootkit.|    10|
|     loadmodule.|     9|
|      ftp_write.|     8|
|       multihop.|     7|
+----------------+------+
only showing top 20 rows



---
# 4. case, group by, count, order by

In [None]:
attack_protocol = spark.sql("""
                           SELECT
                             protocol_type,
                             CASE label
                               WHEN 'normal.' THEN 'no attack'
                               ELSE 'attack'
                             END AS state,
                             COUNT(*) as freq
                           FROM data
                           GROUP BY protocol_type, state
                           ORDER BY 3 DESC
                           """)

In [None]:
attack_protocol.show()

+-------------+---------+------+
|protocol_type|    state|  freq|
+-------------+---------+------+
|         icmp|   attack|282314|
|          tcp|   attack|113252|
|          tcp|no attack| 76813|
|          udp|no attack| 19177|
|         icmp|no attack|  1288|
|          udp|   attack|  1177|
+-------------+---------+------+



In [None]:
att_prot_spark = df.withColumn('state', f.when(df.label=='normal.', 'no attack').otherwise('attack'))\
                  .groupBy('protocol_type', 'state')\
                  .agg(f.count(f.lit(1)).alias('freq'))\
                  .orderBy('freq', ascending=False)


In [None]:
att_prot_spark.show()

+-------------+---------+------+
|protocol_type|    state|  freq|
+-------------+---------+------+
|         icmp|   attack|282314|
|          tcp|   attack|113252|
|          tcp|no attack| 76813|
|          udp|no attack| 19177|
|         icmp|no attack|  1288|
|          udp|   attack|  1177|
+-------------+---------+------+
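
Both versions apply the same two-way mapping from `label` to `state`. In plain Python the rule, plus a check against the counts printed above, can be sketched as follows; `to_state` is a hypothetical name used only for illustration, not a Spark function.

```python
def to_state(label):
    """Mirror of CASE label WHEN 'normal.' THEN 'no attack' ELSE 'attack' END."""
    return "no attack" if label == "normal." else "attack"

# (protocol_type, state) frequencies copied from the output above.
state_freq = {
    ("icmp", "attack"): 282314, ("icmp", "no attack"): 1288,
    ("tcp", "attack"): 113252,  ("tcp", "no attack"): 76813,
    ("udp", "attack"): 1177,    ("udp", "no attack"): 19177,
}
# Per-protocol totals from section 2.
protocol_freq = {"icmp": 283602, "tcp": 190065, "udp": 20354}

# Splitting each protocol into two states must preserve its total.
for proto, total in protocol_freq.items():
    assert state_freq[(proto, "attack")] + state_freq[(proto, "no attack")] == total

# All 'no attack' rows together must equal the 'normal.' count from section 3.
no_attack_total = sum(v for (_, state), v in state_freq.items() if state == "no attack")
assert no_attack_total == 97278

print(to_state("normal."), "|", to_state("smurf."))
```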



---
# 5. group by, aggregations

In [None]:
attack_stats = spark.sql("""
                          SELECT
                            protocol_type,
                            CASE label
                              WHEN 'normal.' THEN 'no attack'
                              ELSE 'attack'
                            END AS state,
                            COUNT(*) as total_freq,
                            ROUND(AVG(src_bytes), 2) as mean_src_bytes,
                            ROUND(AVG(dst_bytes), 2) as mean_dst_bytes,
                            ROUND(AVG(duration), 2) as mean_duration,
                            SUM(num_failed_logins) as total_failed_logins,
                            SUM(num_compromised) as total_compromised,
                            SUM(num_file_creations) as total_file_creations,
                            SUM(su_attempted) as total_root_attempts,
                            SUM(num_root) as total_root_acceses
                          FROM data
                          GROUP BY protocol_type, state
                          ORDER BY 3 DESC
                          """)

In [None]:
attack_stats.show()

+-------------+---------+----------+--------------+--------------+-------------+-------------------+-----------------+--------------------+-------------------+------------------+
|protocol_type|    state|total_freq|mean_src_bytes|mean_dst_bytes|mean_duration|total_failed_logins|total_compromised|total_file_creations|total_root_attempts|total_root_acceses|
+-------------+---------+----------+--------------+--------------+-------------+-------------------+-----------------+--------------------+-------------------+------------------+
|         icmp|   attack|    282314|        932.14|           0.0|          0.0|                  0|                0|                   0|                0.0|                 0|
|          tcp|   attack|    113252|       9880.38|        881.41|        23.19|                 57|             2269|                  76|                1.0|               152|
|          tcp|no attack|     76813|       1439.31|       4263.97|        11.08|                 18|     

In [None]:
attack_stats_spark = df.withColumn('state', f.when(df.label=='normal.', 'no attack').otherwise('attack'))\
.groupBy('protocol_type', 'state')\
.agg(f.count(f.lit(1)).alias('total_freq'),
     f.avg('src_bytes').alias('mean_src_bytes'),
     f.avg('dst_bytes').alias('mean_dst_bytes'),
     f.avg('duration').alias('mean_duration'),
     f.sum('num_failed_logins').alias('total_failed_logins'),
     f.sum('num_compromised').alias('total_compromised'),
     f.sum('num_file_creations').alias('total_file_creations'),
     f.sum('su_attempted').alias('total_root_attempts'),
     f.sum('num_root').alias('total_root_acceses'),
     )\
     .orderBy('total_freq', ascending=False)

In [None]:
attack_stats_spark.show()

+-------------+---------+----------+------------------+-------------------+------------------+-------------------+-----------------+--------------------+-------------------+------------------+
|protocol_type|    state|total_freq|    mean_src_bytes|     mean_dst_bytes|     mean_duration|total_failed_logins|total_compromised|total_file_creations|total_root_attempts|total_root_acceses|
+-------------+---------+----------+------------------+-------------------+------------------+-------------------+-----------------+--------------------+-------------------+------------------+
|         icmp|   attack|    282314| 932.1362985895138|                0.0|               0.0|                  0|                0|                   0|                0.0|                 0|
|          tcp|   attack|    113252| 9880.375225161586|  881.4052467064599| 23.19422173559849|                 57|             2269|                  76|                1.0|               152|
|          tcp|no attack|     76813

---
# 6. filter, group by 

In [None]:
tcp_attack_stats = spark.sql("""
                              SELECT
                                service,
                                label as attack_type,
                                COUNT(*) as total_freq,
                                ROUND(AVG(duration), 2) as mean_duration,
                                SUM(num_failed_logins) as total_failed_logins,
                                SUM(num_file_creations) as total_file_creations,
                                SUM(su_attempted) as total_root_attempts,
                                SUM(num_root) as total_root_acceses
                              FROM data
                              WHERE protocol_type = 'tcp'
                              AND label != 'normal.'
                              GROUP BY service, attack_type
                              ORDER BY total_freq DESC
                              """)

In [None]:
tcp_attack_stats.show()

+----------+------------+----------+-------------+-------------------+--------------------+-------------------+------------------+
|   service| attack_type|total_freq|mean_duration|total_failed_logins|total_file_creations|total_root_attempts|total_root_acceses|
+----------+------------+----------+-------------+-------------------+--------------------+-------------------+------------------+
|   private|    neptune.|    101317|          0.0|                  0|                   0|                0.0|                 0|
|      http|       back.|      2203|         0.13|                  0|                   0|                0.0|                 0|
|     other|      satan.|      1221|          0.0|                  0|                   0|                0.0|                 0|
|   private|  portsweep.|       725|      1915.81|                  0|                   0|                0.0|                 0|
|  ftp_data|warezclient.|       708|       403.71|                  0|             

In [None]:
tcp_attack_stats_spark = df.filter((df.protocol_type  == "tcp") & (df.label  != "normal.")).groupBy('service', df.label.alias('attack_type'))\
.agg(f.count(f.lit(1)).alias('total_freq'),
     f.avg('duration').alias('mean_duration'),
     f.sum('num_failed_logins').alias('total_failed_logins'),
     f.sum('num_file_creations').alias('total_file_creations'),
     f.sum('su_attempted').alias('total_root_attempts'),
     f.sum('num_root').alias('total_root_acceses'))\
.orderBy('total_freq', ascending=False)

In [None]:
tcp_attack_stats_spark.show()

+----------+------------+----------+--------------------+-------------------+--------------------+-------------------+------------------+
|   service| attack_type|total_freq|       mean_duration|total_failed_logins|total_file_creations|total_root_attempts|total_root_acceses|
+----------+------------+----------+--------------------+-------------------+--------------------+-------------------+------------------+
|   private|    neptune.|    101317|                 0.0|                  0|                   0|                0.0|                 0|
|      http|       back.|      2203|  0.1289151157512483|                  0|                   0|                0.0|                 0|
|     other|      satan.|      1221|0.004914004914004914|                  0|                   0|                0.0|                 0|
|   private|  portsweep.|       725|  1915.8110344827587|                  0|                   0|                0.0|                 0|
|  ftp_data|warezclient.|       70

---
# 7. sub-queries

In [None]:
tcp_attack_stats = spark.sql("""
                              SELECT
                                t.service,
                                t.attack_type,
                                t.total_freq
                              FROM
                              (SELECT
                                service,
                                label as attack_type,
                                COUNT(*) as total_freq,
                                ROUND(AVG(duration), 2) as mean_duration,
                                SUM(num_failed_logins) as total_failed_logins,
                                SUM(num_file_creations) as total_file_creations,
                                SUM(su_attempted) as total_root_attempts,
                                SUM(num_root) as total_root_acceses
                              FROM data
                              WHERE protocol_type = 'tcp'
                              AND label != 'normal.'
                              GROUP BY service, attack_type
                              ORDER BY total_freq DESC) as t
                                WHERE t.mean_duration > 0
                              """)

In [None]:
tcp_attack_stats.show()

+--------+----------------+----------+
| service|     attack_type|total_freq|
+--------+----------------+----------+
|    http|           back.|      2203|
| private|      portsweep.|       725|
|ftp_data|    warezclient.|       708|
|     ftp|    warezclient.|       307|
|   other|      portsweep.|       260|
| private|          satan.|       170|
|  telnet|   guess_passwd.|        53|
|  telnet|buffer_overflow.|        21|
|ftp_data|    warezmaster.|        18|
|   imap4|           imap.|        12|
|  telnet|        rootkit.|         5|
|  telnet|     loadmodule.|         5|
|   other|    warezclient.|         5|
|    http|            phf.|         4|
|  supdup|      portsweep.|         4|
|  telnet|           perl.|         3|
|   pop_3|      portsweep.|         3|
|    http|        ipsweep.|         3|
|csnet_ns|      portsweep.|         3|
|  finger|          satan.|         3|
+--------+----------------+----------+
only showing top 20 rows



In [None]:
tcp_attack_stats_spark = df.filter((df.protocol_type  == "tcp") & (df.label  != "normal."))\
.groupBy('service', df.label.alias('attack_type'))\
.agg(f.count(f.lit(1)).alias('total_freq'),
     f.avg('duration').alias('mean_duration'),
     f.sum('num_failed_logins').alias('total_failed_logins'),
     f.sum('num_file_creations').alias('total_file_creations'),
     f.sum('su_attempted').alias('total_root_attempts'),
     f.sum('num_root').alias('total_root_acceses'))\
.orderBy('total_freq', ascending=False)\
.filter(f.col('mean_duration') > 0)\
.select('service', 'attack_type', 'total_freq')

In [None]:
tcp_attack_stats_spark.show()

+--------+----------------+----------+
| service|     attack_type|total_freq|
+--------+----------------+----------+
|    http|           back.|      2203|
|   other|          satan.|      1221|
| private|      portsweep.|       725|
|ftp_data|    warezclient.|       708|
|     ftp|    warezclient.|       307|
|   other|      portsweep.|       260|
| private|          satan.|       170|
|  telnet|   guess_passwd.|        53|
|  telnet|buffer_overflow.|        21|
|ftp_data|    warezmaster.|        18|
|   imap4|           imap.|        12|
|  telnet|        rootkit.|         5|
|  telnet|     loadmodule.|         5|
|   other|    warezclient.|         5|
|    http|            phf.|         4|
|  supdup|      portsweep.|         4|
|  telnet|           perl.|         3|
|   pop_3|      portsweep.|         3|
|  finger|          satan.|         3|
|  gopher|        ipsweep.|         3|
+--------+----------------+----------+
only showing top 20 rows
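
The two outputs above are not identical: `other | satan.` appears only in the DataFrame result. The likely cause is that the SQL sub-query filters on `ROUND(AVG(duration), 2)` while the DataFrame version filters on the unrounded `f.avg('duration')`. The plain-Python sketch below replays that comparison with the `mean_duration` values printed in section 6. Wrapping the average in `f.round(..., 2)` should bring the DataFrame version in line with the SQL one, though that is not run here.

```python
# Unrounded mean_duration values copied from the DataFrame output in section 6.
mean_duration = {
    ("private", "neptune."): 0.0,
    ("http", "back."): 0.1289151157512483,
    ("other", "satan."): 0.004914004914004914,
    ("private", "portsweep."): 1915.8110344827587,
}

# DataFrame version: filter on the raw average.
kept_unrounded = [k for k, v in mean_duration.items() if v > 0]
# SQL version: filter on the average rounded to 2 decimals.
kept_rounded = [k for k, v in mean_duration.items() if round(v, 2) > 0]

# Groups that pass only when the mean is not rounded first.
print(sorted(set(kept_unrounded) - set(kept_rounded)))  # [('other', 'satan.')]
```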



---
# 8. group by, having (WIP)

In [None]:
tcp_attack_stats = spark.sql("""
                              SELECT
                                service,
                                label as attack_type,
                                COUNT(*) as total_freq,
                                ROUND(AVG(duration), 2) as mean_duration,
                                SUM(num_failed_logins) as total_failed_logins,
                                SUM(num_file_creations) as total_file_creations,
                                SUM(su_attempted) as total_root_attempts,
                                SUM(num_root) as total_root_acceses
                              FROM data
                              WHERE (protocol_type = 'tcp'
                                    AND label != 'normal.')
                              GROUP BY service, attack_type
                              HAVING (mean_duration >= 50
                                      AND total_file_creations >= 5
                                      AND total_root_acceses >= 1)
                              ORDER BY total_freq DESC
                              """)

In [None]:
tcp_attack_stats.show()

+-------+----------------+----------+-------------+-------------------+--------------------+-------------------+------------------+
|service|     attack_type|total_freq|mean_duration|total_failed_logins|total_file_creations|total_root_attempts|total_root_acceses|
+-------+----------------+----------+-------------+-------------------+--------------------+-------------------+------------------+
| telnet|buffer_overflow.|        21|       130.67|                  0|                  15|                0.0|                 5|
| telnet|     loadmodule.|         5|         63.8|                  0|                   9|                0.0|                 3|
| telnet|       multihop.|         2|        458.0|                  0|                   8|                0.0|                93|
+-------+----------------+----------+-------------+-------------------+--------------------+-------------------+------------------+



In [None]:
# Each comparison gets its own parentheses because & binds more tightly than >=.
# The thresholds use >= to match the HAVING clause, and the trailing backslash is gone.
tcp_attack_stats_spark = df.filter((df.protocol_type == "tcp") & (df.label != "normal."))\
.groupBy('service', df.label.alias('attack_type'))\
.agg(f.count(f.lit(1)).alias('total_freq'),
     f.avg('duration').alias('mean_duration'),
     f.sum('num_failed_logins').alias('total_failed_logins'),
     f.sum('num_file_creations').alias('total_file_creations'),
     f.sum('su_attempted').alias('total_root_attempts'),
     f.sum('num_root').alias('total_root_acceses'))\
.orderBy('total_freq', ascending=False)\
.filter((f.col('mean_duration') >= 50) & (f.col('total_file_creations') >= 5) & (f.col('total_root_acceses') >= 1))
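
When several `Column` conditions are combined in one `.filter(...)`, each comparison needs its own parentheses. Python gives `&` higher precedence than `>` and `>=`, so an unparenthesized condition is grouped differently from how it reads. The sketch below shows the grouping with plain numbers (the telnet / buffer_overflow. row from the SQL output above) instead of Spark columns, so it runs without a Spark session:

```python
# Aggregates for the telnet / buffer_overflow. group, copied from the SQL output.
mean_duration, total_file_creations, total_root_acceses = 130.67, 15, 5

# Without parentheses Python evaluates 50 & total_file_creations and
# 5 & total_root_acceses first, then chains the comparisons:
#   mean_duration > (50 & 15) >= (5 & 5) >= 1
unparenthesized = mean_duration > 50 & total_file_creations >= 5 & total_root_acceses >= 1

# With parentheses each comparison is evaluated first and the results are AND-ed.
parenthesized = (mean_duration >= 50) & (total_file_creations >= 5) & (total_root_acceses >= 1)

print(unparenthesized, parenthesized)  # False True
```

With real `Column` objects the unparenthesized form typically raises an error rather than returning a wrong boolean, but the underlying cause is the same grouping.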