In [1]:
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

### Getting attack interactions using subtract

For illustrative purposes, imagine we already have an RDD of non-attack (normal) interactions from some previous analysis.

In [2]:
normal_raw_data = raw_data.filter(lambda x:'normal.' in x)

We can take the part of `raw_data` that is not in `normal_raw_data`, i.e. subtract one RDD from the other.

In [3]:
attack_raw_data = raw_data.subtract(normal_raw_data)
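To make the semantics of `subtract` concrete, here is a minimal local-Python sketch (plain lists, not Spark code) on a handful of made-up lines. It assumes, as in the cell above, that the RDD being subtracted was itself produced by a predicate. In that case a negated filter gives the same elements; on an RDD this would read `raw_data.filter(lambda x: 'normal.' not in x)`, which does not need the shuffle that `subtract` performs.

```python
from collections import Counter

# Hypothetical sample lines; the real records have many more fields.
lines = [
    "0,tcp,http,SF,normal.",
    "0,icmp,ecr_i,SF,smurf.",
    "0,tcp,private,S0,neptune.",
    "0,icmp,ecr_i,SF,smurf.",
]

# Same predicate as in the notebook
normal = [x for x in lines if "normal." in x]

# Subtraction: keep every element of `lines` that does not occur in `normal`
normal_set = set(normal)
attacks_by_subtract = [x for x in lines if x not in normal_set]

# Alternative: filter directly with the negated predicate
attacks_by_filter = [x for x in lines if "normal." not in x]

# Both approaches select the same elements (compared as multisets)
print(Counter(attacks_by_subtract) == Counter(attacks_by_filter))
```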

In [4]:
from time import time
# count all
t0 = time()
raw_data_count = raw_data.count()
tt = time() - t0

print("All count in {} secs".format(round(tt,3)))

All count in 3.241 secs


In [6]:
# count normal
t0 = time()
normal_raw_data_count = normal_raw_data.count()
tt = time() - t0
print("Normal count in {} secs".format(round(tt,3)))

Normal count in 1.503 secs


In [7]:
# count attacks
t0 = time()
attack_raw_data_count = attack_raw_data.count()
tt = time() - t0
print("Attack count in {} secs".format(round(tt,3)))

Attack count in 6.151 secs


In [8]:
print("There are {} normal interactions and {} attacks, \
from a total of {} interactions".format(normal_raw_data_count,attack_raw_data_count,raw_data_count))

There are 97278 normal interactions and 396743 attacks, from a total of 494021 interactions


So now we have two RDDs, one with normal interactions and another one with attacks.
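As a quick sanity check, the two counts should add up to the total, since every interaction is either normal or an attack. The sketch below simply reuses the numbers printed above; the variable names are chosen for this illustration only.

```python
# Counts copied from the notebook output above
normal_count = 97278
attack_count = 396743
total_count = 494021

# The two subsets should partition the full dataset
assert normal_count + attack_count == total_count

# Share of attacks in the 10 percent sample
attack_share = attack_count / total_count
print("Attack share: {:.3f}".format(attack_share))
```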

### Protocol and service combinations using cartesian

Deduplicate: extract the protocol and service fields and keep only their distinct values.

In [9]:
csv_data = raw_data.map(lambda x: x.split(","))
protocols = csv_data.map(lambda x: x[1]).distinct()
protocols.collect()

['tcp', 'udp', 'icmp']
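What `map` followed by `distinct` computes can be sketched locally with plain Python. The sample lines below are made up for illustration; a set plays the role of `distinct`, and the result is sorted only because neither a set nor `distinct` guarantees any particular order.

```python
# Hypothetical sample lines; the protocol is the second comma-separated field
lines = [
    "0,tcp,http,SF,normal.",
    "0,icmp,ecr_i,SF,smurf.",
    "0,udp,domain_u,SF,normal.",
    "0,tcp,smtp,SF,normal.",
]

# Equivalent of raw_data.map(lambda x: x.split(","))
csv_rows = [x.split(",") for x in lines]

# Equivalent of csv_data.map(lambda x: x[1]).distinct()
protocols = sorted({row[1] for row in csv_rows})
print(protocols)
```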

In [10]:
services = csv_data.map(lambda x: x[2]).distinct()
services.collect()

['http',
 'smtp',
 'finger',
 'domain_u',
 'auth',
 'telnet',
 'ftp',
 'eco_i',
 'ntp_u',
 'ecr_i',
 'other',
 'private',
 'pop_3',
 'ftp_data',
 'rje',
 'time',
 'mtp',
 'link',
 'remote_job',
 'gopher',
 'ssh',
 'name',
 'whois',
 'domain',
 'login',
 'imap4',
 'daytime',
 'ctf',
 'nntp',
 'shell',
 'IRC',
 'nnsp',
 'http_443',
 'exec',
 'printer',
 'efs',
 'courier',
 'uucp',
 'klogin',
 'kshell',
 'echo',
 'discard',
 'systat',
 'supdup',
 'iso_tsap',
 'hostnames',
 'csnet_ns',
 'pop_2',
 'sunrpc',
 'uucp_path',
 'netbios_ns',
 'netbios_ssn',
 'netbios_dgm',
 'sql_net',
 'vmnet',
 'bgp',
 'Z39_50',
 'ldap',
 'netstat',
 'urh_i',
 'X11',
 'urp_i',
 'pm_dump',
 'tftp_u',
 'tim_i',
 'red_i']

`cartesian` returns the Cartesian product of two RDDs, that is, all pairs (a, b) where a comes from one RDD and b comes from the other.
For example:

In [12]:
rdd = sc.parallelize([1,2])
sorted(rdd.cartesian(rdd).collect())

[(1, 1), (1, 2), (2, 1), (2, 2)]

In [11]:
product = protocols.cartesian(services).collect()
print("There are {} combinations of protocol X service".format(len(product)))

There are 198 combinations of protocol X service


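Obviously, for such small RDDs it does not really make sense to use Spark's cartesian product. We could perfectly well have collected the values after using `distinct` and computed the cartesian product locally. Moreover, `distinct` and `cartesian` are expensive operations, so they must be used with care when the datasets involved are large.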
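A minimal sketch of the local alternative, using `itertools.product` from the standard library. The lists here are hard-coded stand-ins for what `protocols.collect()` and `services.collect()` would return (only the first three services are used to keep the example short).

```python
from itertools import product

# Stand-ins for protocols.collect() and services.collect()
protocols = ["tcp", "udp", "icmp"]
services = ["http", "smtp", "finger"]

# Cartesian product computed on the driver, without any Spark job
combinations = list(product(protocols, services))

print("There are {} combinations of protocol X service".format(len(combinations)))
```

With the full list of 66 services this gives the same 3 x 66 = 198 pairs as the Spark version.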