### 0. Dataset availability

Dataset home page: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Full dataset: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz

10% dataset: http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz

### 1. Import libraries

In [1]:
import os
import sys
import re

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
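# Note: this wildcard import shadows Python built-ins such as sum, min, max and round.
# Prefer the `func` alias below in new code.
from pyspark.sql.functions import *
import pyspark.sql.functions as func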
from pyspark.mllib.clustering import KMeans, KMeansModel

# Uncomment these if you would like to make any graphs using Matplotlib
# from mpl_toolkits.mplot3d import Axes3D
# import matplotlib.pyplot as plt
# import matplotlib.patches as mpatches

# plt.style.use('ggplot')
# plt.rcParams['figure.figsize'] = (20.0, 8.0)

# %matplotlib inline


### 2. Initiate Spark session & load dataset

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('app').getOrCreate()

In [3]:
input_file = "kddcup.data.gz"

In [6]:
dataSchema = StructType([ \
    StructField('duration', IntegerType(), True), \
    StructField('protocol_type', StringType(), True), \
    StructField('service', StringType(), True), \
    StructField('flag', StringType(), True), \
    StructField('src_bytes', IntegerType(), True), \
    StructField('dst_bytes', IntegerType(), True), \
    StructField('land', StringType(), True), \
    StructField('wrong_fragment', IntegerType(), True), \
    StructField('urgent', IntegerType(), True), \
    StructField('hot', IntegerType(), True), \
    StructField('num_failed_logins', IntegerType(), True), \
    StructField('logged_in', StringType(), True), \
    StructField('num_compromised', IntegerType(), True), \
    StructField('root_shell', IntegerType(), True), \
    StructField('su_attempted', IntegerType(), True), \
    StructField('num_root', IntegerType(), True), \
    StructField('num_file_creations', IntegerType(), True), \
    StructField('num_shells', IntegerType(), True), \
    StructField('num_access_files', IntegerType(), True), \
    StructField('num_outbound_cmds', IntegerType(), True), \
    StructField('is_host_login', StringType(), True), \
    StructField('is_guest_login', StringType(), True), \
    StructField('count', IntegerType(), True), \
    StructField('srv_count', IntegerType(), True), \
    StructField('serror_rate', FloatType(), True), \
    StructField('srv_serror_rate', FloatType(), True), \
    StructField('rerror_rate', FloatType(), True), \
    StructField('srv_rerror_rate', FloatType(), True), \
    StructField('same_srv_rate', FloatType(), True), \
    StructField('diff_srv_rate', FloatType(), True), \
    StructField('srv_diff_host_rate', FloatType(), True), \
    StructField('dst_host_count', IntegerType(), True), \
    StructField('dst_host_srv_count', IntegerType(), True), \
    StructField('dst_host_same_srv_rate', FloatType(), True), \
    StructField('dst_host_diff_srv_rate', FloatType(), True), \
    StructField('dst_host_same_src_port_rate', FloatType(), True), \
    StructField('dst_host_srv_diff_host_rate', FloatType(), True), \
    StructField('dst_host_serror_rate', FloatType(), True), \
    StructField('dst_host_srv_serror_rate', FloatType(), True), \
    StructField('dst_host_rerror_rate', FloatType(), True), \
    StructField('dst_host_srv_rerror_rate', FloatType(), True), \
    StructField('type', StringType(), True) \
])

In [7]:
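# The KDD Cup 99 data files have no header row. With header='True' Spark would
# treat the first record as column names and drop it, so header is set to false.
df = spark.read \
    .format('csv') \
    .options(header='false') \
    .options(delimiter=',') \
    .load(input_file, schema=dataSchema)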

In [8]:
df.show(5)

+--------+-------------+-------+----+---------+---------+----+--------------+------+---+-----------------+---------+---------------+----------+------------+--------+------------------+----------+----------------+-----------------+-------------+--------------+-----+---------+-----------+---------------+-----------+---------------+-------------+-------------+------------------+--------------+------------------+----------------------+----------------------+---------------------------+---------------------------+--------------------+------------------------+--------------------+------------------------+-------+
|duration|protocol_type|service|flag|src_bytes|dst_bytes|land|wrong_fragment|urgent|hot|num_failed_logins|logged_in|num_compromised|root_shell|su_attempted|num_root|num_file_creations|num_shells|num_access_files|num_outbound_cmds|is_host_login|is_guest_login|count|srv_count|serror_rate|srv_serror_rate|rerror_rate|srv_rerror_rate|same_srv_rate|diff_srv_rate|srv_diff_host_rate|dst_hos

### 3. Tasks

1. Using only the numerical features, identify anomalies in the network connections. Remember that an anomaly is a data point that does not fit into any of a 'reasonable' set of clusters for the dataset.

2. Extend the model above to include the categorical features as well, and again determine the anomalies in the network connections.

3. Finally, make a 3D graph of the data points, using three of the feature dimensions, to visualize the anomalies.
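
One common way to read "does not fit in a reasonable set of clusters" is to score each point by its distance to the nearest cluster centre and flag the points whose score exceeds a cutoff. The sketch below illustrates that idea in plain Python so it runs without a Spark cluster. It is not the notebook's solution: the centres, the sample points, the threshold of 3.0 and the helper names (`distance_to_nearest_center`, `flag_anomalies`, `one_hot`) are all assumptions made for illustration. In the notebook itself, the centres would presumably come from a model fitted with `KMeans.train` from `pyspark.mllib.clustering`.

```python
import math


def distance_to_nearest_center(point, centers):
    """Euclidean distance from `point` to the closest cluster centre."""
    return min(math.dist(point, center) for center in centers)


def flag_anomalies(points, centers, threshold):
    """Return the points that lie farther than `threshold` from every centre."""
    return [p for p in points
            if distance_to_nearest_center(p, centers) > threshold]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector, one slot per category."""
    return [1.0 if value == category else 0.0 for category in categories]


# Task 1 idea: numeric features only.
# Hypothetical centres, standing in for those of a fitted K-means model.
centers = [(0.0, 0.0), (10.0, 10.0)]
# Hypothetical 2-D points: four close to a centre, one far from both.
points = [(0.1, 0.2), (0.2, -0.1), (9.9, 10.2), (10.1, 9.8), (5.0, 30.0)]

anomalies = flag_anomalies(points, centers, threshold=3.0)
print(anomalies)

# Task 2 idea: turn a categorical column into numbers before clustering.
# tcp, udp and icmp are the values of protocol_type in this dataset.
protocols = ["tcp", "udp", "icmp"]
encoded = [2.0, 5.0] + one_hot("udp", protocols)  # numeric part + encoded part
print(encoded)
```

Because Euclidean distance is dominated by features with large ranges (for example `src_bytes`), it is usually worth standardizing the numeric columns before clustering. The cutoff itself is a judgment call; a high percentile of the distance distribution is one reasonable starting point.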