### Run the code below if pyspark operations throw random errors

In [1]:
import shutil
shutil.rmtree("C:/Users/m/AppData/Local/Temp", ignore_errors=True) 

### Importing libraries, starting spark session and taking a look at the data

In [11]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

In [3]:
spark = SparkSession.builder.appName("Finance_Model_Training").getOrCreate()

In [5]:
df = spark.read.csv("data/Train_data.csv", header=True, inferSchema=True)

In [6]:
df.show()

+--------+-------------+----------+----+---------+---------+----+--------------+------+---+-----------------+---------+---------------+----------+------------+--------+------------------+----------+----------------+-----------------+-------------+--------------+-----+---------+-----------+---------------+-----------+---------------+-------------+-------------+------------------+--------------+------------------+----------------------+----------------------+---------------------------+---------------------------+--------------------+------------------------+--------------------+------------------------+-------+
|duration|protocol_type|   service|flag|src_bytes|dst_bytes|land|wrong_fragment|urgent|hot|num_failed_logins|logged_in|num_compromised|root_shell|su_attempted|num_root|num_file_creations|num_shells|num_access_files|num_outbound_cmds|is_host_login|is_guest_login|count|srv_count|serror_rate|srv_serror_rate|rerror_rate|srv_rerror_rate|same_srv_rate|diff_srv_rate|srv_diff_host_rate|d

In [7]:
df.columns

['duration',
 'protocol_type',
 'service',
 'flag',
 'src_bytes',
 'dst_bytes',
 'land',
 'wrong_fragment',
 'urgent',
 'hot',
 'num_failed_logins',
 'logged_in',
 'num_compromised',
 'root_shell',
 'su_attempted',
 'num_root',
 'num_file_creations',
 'num_shells',
 'num_access_files',
 'num_outbound_cmds',
 'is_host_login',
 'is_guest_login',
 'count',
 'srv_count',
 'serror_rate',
 'srv_serror_rate',
 'rerror_rate',
 'srv_rerror_rate',
 'same_srv_rate',
 'diff_srv_rate',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate',
 'class']

### To understand what features we are playing with, description of all the features is given below:

##### **Basic features which describe the connection without looking at the payload**
- **duration** – Length (in seconds) of the connection.
- **protocol_type** – Network protocol used (e.g., TCP, UDP, ICMP).
- **service** – Destination service (e.g., http, telnet, ftp).
- **flag** – Status flag of the connection (e.g., S0, SF, REJ).
- **src_bytes** – Number of data bytes sent from source to destination.
- **dst_bytes** – Number of data bytes sent from destination to source.
- **land** – 1 if connection is from/to the same host/port; 0 otherwise.
- **wrong_fragment** – Number of wrong fragments in the packet.
- **urgent** – Number of urgent packets.
##### **Content features which describe what is within the connection payload**
- **hot** – Number of “hot” indicators (e.g., suspicious commands).
- **num_failed_logins** – Number of failed login attempts.
- **logged_in** – 1 if successfully logged in; 0 otherwise.
- **num_compromised** – Number of compromised conditions.
- **root_shell** – 1 if root shell is obtained; 0 otherwise.
- **su_attempted** – 1 if “su root” command attempted; 0 otherwise.
- **num_root** – Number of “root” accesses.
- **num_file_creations** – Number of file creation operations.
- **num_shells** – Number of shell prompts opened.
- **num_access_files** – Number of attempts to access control files.
- **num_outbound_cmds** – Number of outbound commands.
- **is_host_login** – 1 if user is a “host login”; 0 otherwise.
- **is_guest_login** – 1 if user is a “guest login”; 0 otherwise.
##### **Traffic features based on the past 2 seconds wiindow for same host
- **count** – Number of connections to the same host in the past 2 seconds.
- **srv_count** – Number of connections to the same service.
- **serror_rate** – % of connections with “SYN” errors.
- **srv_serror_rate** – % of connections with “SYN” errors to the same service.
- **rerror_rate** – % of connections with “REJ” errors.
- **srv_rerror_rate** – % of REJ errors to the same service.
- **same_srv_rate** – % of connections to the same service.
- **diff_srv_rate** – % of connections to different services.
- **srv_diff_host_rate** – % of connections to the same service but different host.
- **dst_host_count** – Number of connections to the destination host.
- **dst_host_srv_count** – Connections to the destination host using the same service.
- **dst_host_same_srv_rate** – % of same-service connections among dst_host_count.
- **dst_host_diff_srv_rate** – % of different-service connections.
- **dst_host_same_src_port_rate** – % of connections from same source port.
- **dst_host_srv_diff_host_rate** – % of same-service connections to different hosts.
- **dst_host_serror_rate** – % of connections with SYN errors.
- **dst_host_srv_serror_rate** – % of same-service SYN errors.
- **dst_host_rerror_rate** – % of connections with REJ errors.
- **dst_host_srv_rerror_rate** – % of same-service REJ errors.
##### **Label**
- **class** - either "normal" or "anomally"

#### Let's see if the data has duplicates and/or missing values

In [27]:
if df.count() == df.na.drop(how="any").count():
    print("No missing values in the dataset")
else:
    print("Missing values, need to treat the dataset")

if df.count() == df.dropDuplicates().count():
    print("No duplicates in the dataset")
else:
    print("Duplicate values, need to treat the dataset")

No missing values in the dataset
No duplicates in the dataset


Lets check the data distribution for all the columns

In [29]:
df.describe().show()

+-------+------------------+-------------+-------+-----+------------------+------------------+--------------------+--------------------+--------------------+-------------------+--------------------+-------------------+-------------------+--------------------+--------------------+------------------+--------------------+--------------------+--------------------+-----------------+-------------+--------------------+------------------+-----------------+-------------------+-------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+----------------------+----------------------+---------------------------+---------------------------+--------------------+------------------------+--------------------+------------------------+-------+
|summary|          duration|protocol_type|service| flag|         src_bytes|         dst_bytes|                land|      wrong_fragment|              urgent|           

You can notice that some of the columns min, max, average and median is the same value, hence they don't have any value variation. Dropping these columns will help make our training more efficient

In [32]:
df.drop(["num_outbound_cmds", "is_host_login"])

PySparkTypeError: [NOT_COLUMN_OR_STR] Argument `col` should be a Column or str, got list.

Lets check if the datatypes are assigned accurately or if they need ot be changed

In [35]:
df.show()

+--------+-------------+----------+----+---------+---------+----+--------------+------+---+-----------------+---------+---------------+----------+------------+--------+------------------+----------+----------------+-----------------+-------------+--------------+-----+---------+-----------+---------------+-----------+---------------+-------------+-------------+------------------+--------------+------------------+----------------------+----------------------+---------------------------+---------------------------+--------------------+------------------------+--------------------+------------------------+-------+
|duration|protocol_type|   service|flag|src_bytes|dst_bytes|land|wrong_fragment|urgent|hot|num_failed_logins|logged_in|num_compromised|root_shell|su_attempted|num_root|num_file_creations|num_shells|num_access_files|num_outbound_cmds|is_host_login|is_guest_login|count|srv_count|serror_rate|srv_serror_rate|rerror_rate|srv_rerror_rate|same_srv_rate|diff_srv_rate|srv_diff_host_rate|d

In [33]:
df.printSchema()

root
 |-- duration: integer (nullable = true)
 |-- protocol_type: string (nullable = true)
 |-- service: string (nullable = true)
 |-- flag: string (nullable = true)
 |-- src_bytes: integer (nullable = true)
 |-- dst_bytes: integer (nullable = true)
 |-- land: integer (nullable = true)
 |-- wrong_fragment: integer (nullable = true)
 |-- urgent: integer (nullable = true)
 |-- hot: integer (nullable = true)
 |-- num_failed_logins: integer (nullable = true)
 |-- logged_in: integer (nullable = true)
 |-- num_compromised: integer (nullable = true)
 |-- root_shell: integer (nullable = true)
 |-- su_attempted: integer (nullable = true)
 |-- num_root: integer (nullable = true)
 |-- num_file_creations: integer (nullable = true)
 |-- num_shells: integer (nullable = true)
 |-- num_access_files: integer (nullable = true)
 |-- num_outbound_cmds: integer (nullable = true)
 |-- is_host_login: integer (nullable = true)
 |-- is_guest_login: integer (nullable = true)
 |-- count: integer (nullable = true