# 7 - Data visualization and an introduction to clustering

In this notebook we introduce some techniques for data visualization and clustering, which will be useful for the assignment.

If you click on **[this](http://205.174.165.80/CICDataset/NSL-KDD/Dataset/NSL-KDD.zip)** link the download of the dataset should start.

If the link above doesn't work, follow these instructions:
- go to https://www.unb.ca/cic/datasets/nsl.html
- scroll to the end of the page, there is a link to the actual download;
- you will be redirected to another page asking for some information (there should be no check on the data you provide, so you can even fill everything with *asd* if you want);
- download the NSL-KDD.zip file

---

Regardless of the link you used for downloading the dataset, you should now have an archive named *NSL-KDD.zip*; extract it into the folder that contains this notebook.

You should now have a directory named NSL-KDD, containing several files. You have to focus on the following ones:
- *index.html*: contains a brief description of the files, you should read it
- *KDDTrain+.txt*: the file containing the training data
- *KDDTest+.txt*: the file containing the test data

As in previous sessions, you might have trouble running this notebook on the whole dataset. If that is the case, you can use the reduced training set, which is stored in:
- *KDDTrain+_20Percent.txt*: reduced training set

---

#### For this notebook you will need the `matplotlib` library. You can install it in the same way as you installed the other libraries, e.g. scikit-learn. For instance, if you performed the installation via pip, you can run:
```
pip install matplotlib
```

---

#### Import the required libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#### Define the filename for the training data

In [None]:
TRAIN_DATA_FILENAME = 'NSL-KDD/KDDTrain+.txt'

#### Define the columns of the dataframe

In [None]:
headers = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 
    'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 
    'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 
    'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 
    'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 
    'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
    'dst_host_srv_rerror_rate', 'class', 'difficulty_level'
]

#### Read the file

In [None]:
train_df = pd.read_csv(TRAIN_DATA_FILENAME, names=headers)
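
A minimal sketch of why `names=headers` is passed, assuming (as the call above implies) that the data file has no header row. The two rows and the shortened column list are made up for illustration; without `names`, pandas would consume the first record as column names.

```python
import io

import pandas as pd

# Hypothetical header-less content standing in for the real file
raw_text = "0,tcp,http\n5,udp,domain_u\n"
toy_headers = ["duration", "protocol_type", "service"]

# With names=..., every line is treated as data and the columns get our names
toy_df = pd.read_csv(io.StringIO(raw_text), names=toy_headers)

# Without names=..., the first record would be consumed as the header
wrong_df = pd.read_csv(io.StringIO(raw_text))

print(toy_df.shape, wrong_df.shape)
```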

---

<div class="alert alert-block alert-danger">
<b>Q: Display 10 random rows of the dataframe.</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Show the columns of the dataframe</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Print the number of rows of the dataframe</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Print the number of columns of the dataframe</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: How many features are there in the original dataset?</b> [Be careful, not all the columns are features...]
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Find the type of each feature (i.e. categorical, binary, numerical, etc.)</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: How many rows are there for each unique value of "root_shell"?</b>

\[BTW, it is a binary feature. If you thought it was not a binary feature, go back to the previous question and focus a bit more on that!\]
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: How many rows are there for each unique value of "logged_in"?</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: How many rows are there for each unique value of "class"?</b>
</div>

---

#### As in the previous sessions, we will now map each value of the 'class' column to 1 of 5 possible categories

In [None]:
category_mapping = {
    'normal': 'benign',
    'back': 'dos',
    'buffer_overflow': 'u2r',
    'ftp_write': 'r2l',
    'guess_passwd': 'r2l',
    'imap': 'r2l',
    'ipsweep': 'probe',
    'land': 'dos',
    'loadmodule': 'u2r',
    'multihop': 'r2l',
    'neptune': 'dos',
    'nmap': 'probe',
    'perl': 'u2r',
    'phf': 'r2l',
    'pod': 'dos',
    'portsweep': 'probe',
    'rootkit': 'u2r',
    'satan': 'probe',
    'smurf': 'dos',
    'spy': 'r2l',
    'teardrop': 'dos',
    'warezclient': 'r2l',
    'warezmaster': 'r2l',
}

In [None]:
train_df['attack_type'] = train_df['class'].map(category_mapping)
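
A minimal sketch of this mapping step on toy data, assuming `Series.map` is used to look up each label in the dictionary. The series, the shortened mapping and the label `made_up_attack` are invented for illustration; the point is that a label missing from the mapping becomes `NaN`, which makes gaps easy to spot.

```python
import pandas as pd

# Toy stand-ins for the 'class' column and the mapping (not the real data)
toy_class = pd.Series(["normal", "neptune", "nmap", "made_up_attack"])
toy_mapping = {"normal": "benign", "neptune": "dos", "nmap": "probe"}

toy_attack_type = toy_class.map(toy_mapping)

# Labels that were not found in the mapping show up as NaN
unmapped = toy_class[toy_attack_type.isna()].unique()
print(toy_attack_type.tolist())
print("unmapped labels:", list(unmapped))
```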

<div class="alert alert-block alert-danger">
<b>Q: How many rows are there for each unique value of "attack_type"?</b>
</div>

---

# matplotlib

## Histogram

#### Let's display the distribution of feature `duration`

- In order to do this, you can compute the number of occurrences of each possible value of `duration` (they are limited, since it is an integer), and then print these values

In [None]:
df_distribution_duration = train_df.groupby('duration').size().reset_index().sort_values('duration')
display(df_distribution_duration)

- but that's definitely not readable! Instead, you can plot those values, using matplotlib.

In [None]:
# bins represent the number of "buckets" to group the data in
plt.hist(train_df['duration'].values, bins=50)
plt.show()

- in order to see the effects of `bins`, look at the difference between the previous plot and the following one.

In [None]:
# bins represent the number of "buckets" to group the data in
plt.hist(train_df['duration'].values, bins=10)
plt.show()
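
To make the effect of `bins` concrete without a plot, the bucket counts can be computed directly. This is a sketch on a made-up, skewed sample using `np.histogram`, which returns the counts and the bucket edges; the values are hypothetical and only meant to show how the same data is grouped differently.

```python
import numpy as np

# Toy, skewed sample (hypothetical values)
toy_values = np.array([0, 0, 0, 1, 2, 10])

for n_bins in (2, 5):
    # counts[i] is the number of values falling between edges[i] and edges[i + 1]
    counts, edges = np.histogram(toy_values, bins=n_bins)
    print(n_bins, "bins -> edges", edges, "counts", counts)
```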

- the plot above shows that the distribution is extremely skewed towards 0 (as you could also see from the table above). However, it says little about the distribution of the other values. We can address that by removing the values that are too close to 0.

In [None]:
plt.hist(train_df[train_df['duration'] > 1 ]['duration'].values, bins=20)
plt.show()

<div class="alert alert-block alert-danger">
<b>Q: What is the main difference between the two plots above?</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

- Alternatively, we could plot the whole dataset but using a log scale for the y-axis:

In [None]:
fig, ax = plt.subplots()
ax.hist(train_df['duration'].values, bins=25)
ax.set_yscale('log')
plt.show()

<div class="alert alert-block alert-danger">
<b>Q: Can you see how the scale was set to a logarithmic one?</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

#### Using matplotlib in a more readable way

For a more readable plot, you have to use matplotlib a bit differently:

In [None]:
fig, ax = plt.subplots()

ax.hist(train_df[train_df['duration']>1]['duration'].values, bins=20, label='duration')

ax.set_title('Distribution of the feature "duration"')
ax.set_xlabel('duration')
ax.set_ylabel('number of occurrences')
ax.legend()
plt.show()

There is no difference in how the data is shown, but there are many differences in the formatting (axis labels, legend, title of the plot).

---

<div class="alert alert-block alert-danger">
<b>Q: Plot the distribution of the feature "src_bytes".</b>
</div>

In [None]:
fig, ax = plt.subplots()

ax.  # TODO: COMPLETE THIS LINE

ax.set_title('Distribution of the feature "src_bytes"')
ax.set_xlabel('src_bytes')
ax.set_ylabel('frequency')
ax.legend()
plt.show()

<div class="alert alert-block alert-danger">
<b>Q: Print the possible unique values of "protocol_type".</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Now plot the distribution of "src_bytes" separately for each "protocol_type".</b>

Remember, if you want to filter a dataframe df in order to keep the entries that have "protocol_type" equal to 'asd' you can do: <code>df[df['protocol_type']=='asd']</code>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Try to analyse the differences between the distributions of "src_bytes" for each protocol type.</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Now do the same as above but for the "dst_bytes". Plot the distribution, the distribution for each protocol type and analyse the differences, if any.</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

## Scatter plot

#### Let's analyse whether there is some easily-visible correlation between src_bytes and dst_bytes.

In [None]:
x, y = train_df['src_bytes'].values, train_df['dst_bytes'].values

In [None]:
fig, ax = plt.subplots()
ax.scatter(x, y)
plt.show()

- most of the points are very close to 0.0 (be careful with the scale of the axes: it is 1e9, which means 10^9!), so let's focus on them

In [None]:
fig, ax = plt.subplots(figsize=(8, 8)) # with figsize you can set the size of the plot
ax.scatter(x, y, s=5, alpha=0.5)

# with set_xlim and set_ylim you can restrict the view to a portion of the plot
ax.set_xlim(0, 10**5)
ax.set_ylim(0, 10**5)

plt.show()

- still a bit too far, probably...

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(x, y, s=5, alpha=0.5)

# with set_xlim and set_ylim you can restrict the view to a portion of the plot
ax.set_xlim(0, 10**3)
ax.set_ylim(0, 10**3)

plt.show()

- It looks like there are different "groups". We can use different colours for the points depending on some attributes. For instance, let's assume we want to have different colours depending on the protocol type. We can create a new list containing the colours of each point.

In [None]:
colors = []
for protocol in train_df['protocol_type'].values:
    if protocol == 'tcp':
        colors.append('red')
    elif protocol == 'udp':
        colors.append('darkgreen')
    elif protocol == 'icmp':
        colors.append('lightblue')
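
The same list of colours can also be built with a dictionary lookup. The sketch below uses a toy frame (hypothetical rows) and `Series.map`; it always yields exactly one entry per row, whereas an `if`/`elif` chain without an `else` branch would silently skip an unexpected value and leave the list shorter than the data. Converting the result with `.tolist()` gives a plain list that can be passed as `c=`.

```python
import pandas as pd

# Toy frame standing in for train_df (hypothetical rows)
toy_df = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})

color_by_protocol = {"tcp": "red", "udp": "darkgreen", "icmp": "lightblue"}

# One colour per row, in row order; an unknown protocol would become NaN
toy_colors = toy_df["protocol_type"].map(color_by_protocol)
print(toy_colors.tolist())
```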

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

# with c you can set the color.
# It can be a single color for the whole scatter or a list containing the color of each point
# alpha lets us choose how transparent the points should be
ax.scatter(x, y, s=5, alpha=0.5, c=colors)

ax.set_xlim(-10, 10**3)
ax.set_ylim(-10, 10**3)

ax.set_xlabel('src_bytes')
ax.set_ylabel('dst_bytes')

plt.show()

- this plot, with different colours, gives a different insight into the data!

<div class="alert alert-block alert-danger">
<b>Q: The scatter plots above do not suggest any easily-visible correlation between dst_bytes and src_bytes. If you think about it, this makes a lot of sense; why?</b>
    
\[hint: think about the type of data we are looking at!\]
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Try to see if there is any easily visible correlation between the "duration" and the "src_bytes"</b>. 
    
\[HINT: as before, try to focus on specific areas and possibly use different colours\]
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: Try to see if there is any correlation between the "duration" and the "dst_bytes"</b>.
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

## Bar plot

#### Let's display, for the 10 most common services, the number of occurrences.

In [None]:
groupedby_df = train_df.groupby('service').size().reset_index().sort_values(0, ascending=False)
service_list = groupedby_df['service'][:10].values
count_list = groupedby_df[0][:10].values
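
As a possible alternative to the `groupby` chain, `value_counts()` counts the rows per value and returns them sorted by count in descending order. The sketch below runs on a toy frame; the names `toy_service_list` and `toy_count_list` are hypothetical and only mirror the variables used above.

```python
import pandas as pd

# Toy frame standing in for train_df (hypothetical rows)
toy_df = pd.DataFrame({"service": ["http", "http", "http", "ftp", "ftp", "smtp"]})

# Keep only the 2 most common values (the notebook keeps 10)
top_services = toy_df["service"].value_counts().head(2)

toy_service_list = top_services.index.to_numpy()
toy_count_list = top_services.to_numpy()
print(toy_service_list, toy_count_list)
```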

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(service_list, count_list)

plt.show()

<div class="alert alert-block alert-danger">
<b>Q: Plot, for the 10 LEAST common services, the number of occurrences.</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Plot the number of occurrences of each protocol.</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

# Some more visualizations

<div class="alert alert-block alert-danger">
<b>Q: Plot the number of occurrences of each possible "urgent" value</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Plot the number of occurrences of the most frequent "flag" values</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

<div class="alert alert-block alert-danger">
<b>Q: Try to see if there is any visible correlation between "same_srv_rate" and "diff_srv_rate"</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---

# Clustering

In this session we will have a look only at the K-Means algorithm. In the next sessions, we will focus a bit more on clustering. 

In [None]:
from sklearn.cluster import KMeans

In order to be able to perform clustering, we have to perform the usual preprocessing of the data:
- one hot encoding
- scaling
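
As a reminder of what these two steps do, here is a sketch on a toy frame, assuming `pd.get_dummies` for the encoding and scikit-learn's `StandardScaler` for the scaling. The column names `proto` and `n_bytes` and all values are hypothetical; the exercises below ask you to apply the same ideas to the real dataframe.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values and column names)
toy_df = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp", "icmp"],
    "n_bytes": [0.0, 10.0, 20.0, 30.0],
})

# One-hot encoding: one indicator column per category
toy_encoded = pd.get_dummies(toy_df, columns=["proto"])

# Scaling: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
toy_encoded[["n_bytes"]] = scaler.fit_transform(toy_encoded[["n_bytes"]])

print(toy_encoded.columns.tolist())
print(toy_encoded["n_bytes"].tolist())
```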

In [None]:
col_names = np.array(headers)

nominal_idx = [1, 2, 3]
binary_idx = [6, 11, 13, 14, 20, 21]
numeric_idx = sorted(set(range(41)).difference(nominal_idx).difference(binary_idx))

nominal_cols = col_names[nominal_idx].tolist()
binary_cols = col_names[binary_idx].tolist()
numeric_cols = col_names[numeric_idx].tolist()

In [None]:
print("Nominal cols:\n", nominal_cols, "\n")
print("Binary cols:\n", binary_cols, "\n")
print("numeric_cols:\n", numeric_cols, "\n")

---

Some clustering algorithms tend to become intractable on small machines as the dimensionality increases.
Thus, in order to (hopefully) avoid problems on your machine, we remove here the 'service' column, which is categorical and would greatly increase the dimensionality of the dataset once one-hot encoded.

In [None]:
train_df = train_df.drop('service', axis=1)
nominal_cols = [x for x in nominal_cols if x != 'service']
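
To get a feel for how much a nominal column contributes to the dimensionality, one can count its distinct values, since one-hot encoding typically produces one column per distinct value. The sketch below uses a toy frame with made-up rows; on the real data the same call would have to run before the column is dropped.

```python
import pandas as pd

# Toy frame with the three nominal columns (hypothetical rows)
toy_df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "service": ["http", "domain_u", "ftp", "eco_i"],
    "flag": ["SF", "SF", "S0", "SF"],
})

# Number of distinct values per column, i.e. one-hot columns it would produce
n_categories = toy_df.nunique()
print(n_categories)
```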

---

<div class="alert alert-block alert-danger">
<b>Q: Perform one hot encoding of the nominal cols</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

In [None]:
train_df =   # TODO: complete this line

---

<div class="alert alert-block alert-danger">
<b>Q: Perform scaling with the StandardScaler</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
standard_scaler =   # TODO: complete this line
train_df[numeric_cols] =   # TODO: complete this line

---

# Let's perform the actual clustering

While performing clustering, we do not want to use any information about the label, since that information is missing in unseen data (it is the target).
So we will fit the clustering model on the dataframe after dropping the label-related columns, as in:
```
    train_df.drop(['class', 'attack_type', 'difficulty_level'], axis=1)
```

## K-MEANS

- training the clustering algorithm

In [None]:
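# n_init is set explicitly so that the number of restarts does not depend on the scikit-learn version
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    train_df.drop(['class', 'attack_type', 'difficulty_level'], axis=1)
)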

- you can print the centers of the clusters (which are, in this case, two 52-D points) 

In [None]:
kmeans.cluster_centers_

- if you want, you can print the cluster label assigned to each row of the dataframe

In [None]:
kmeans.labels_
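
To see what `cluster_centers_` and `labels_` contain, here is a sketch on six made-up 2-D points forming two well-separated groups. The names `toy_points` and `toy_kmeans` are hypothetical, and `n_init=10` is an explicit choice for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups in 2-D (hypothetical points)
toy_points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)

# One centre per cluster, each with as many coordinates as there are features
print(toy_kmeans.cluster_centers_.shape)
# One integer label per input row, in the same order as the rows
print(toy_kmeans.labels_)
```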

- let's save those labels in a new column of the dataframe

In [None]:
train_df['cluster'] = kmeans.labels_

<div class="alert alert-block alert-danger">
<b>Q: In this case, we know the ground truth (i.e. the attack_type). Try to evaluate the accuracy of clustering by looking at the attack types of the entries of each cluster.</b>
</div>

E.g., after a perfect clustering, each cluster would contain only elements belonging to a single attack_type.
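
One way to look at this is a contingency table of cluster against attack type. The sketch below uses a toy frame with invented rows and `pd.crosstab`; in a perfect clustering, each row of the table would have a single non-zero cell.

```python
import pandas as pd

# Toy frame with a known label and an assigned cluster (hypothetical rows)
toy_df = pd.DataFrame({
    "attack_type": ["benign", "benign", "dos", "dos", "dos", "probe"],
    "cluster": [0, 0, 1, 1, 0, 1],
})

# Rows: clusters, columns: attack types, cells: number of entries
toy_table = pd.crosstab(toy_df["cluster"], toy_df["attack_type"])
print(toy_table)
```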

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

<div class="alert alert-block alert-danger">
<b>Q: That is far from perfect. Do you have any idea why? What could you try to improve that?</b>
</div>

<div class="alert alert-block alert-success">
<b>ANS</b>
</div>

---