# Mini Project - Net Attack

To experience some more-or-less real data we will explore webserver logs from a honeypot machine.
A honeypot is a server set up specifically so that attacks on a website (or service)
are likely to be directed at that server instead of the actual server running the site.

A honeypot has several heuristics that allow it to determine whether
a connection was an attempt at something malicious.
These heuristics include information about the connection
and about any extra activity performed during the connection.
For example, if there was excessive memory or CPU usage from a process linked
to a connection, the connection is considered malicious.
Since honeypots attempt to attract malicious connections,
malicious connections make up a large enough share of the data to learn from.

We will look at the webserver logs for more than 10k connections,
and at the evaluation of the connection by the honeypot heuristics.
We will try to use only the webserver log information to build a machine learning model
to classify connections as malicious or not malicious based on that information alone.
We could then use this model to catch malicious connections that were not directed at the honeypot,
and possibly prevent them from reaching deep into the server actually running the service.

## The Data

The file `net-attack-access.log` (contained in `net-attack-access.log.zip`) contains the webserver logs.
This is a *binary file* despite containing mostly text.
Each line of the file contains a JSON object encoded in the ISO-8859-1 (latin-1) character set,
possibly with a couple of JSON-escaped bytes.
All JSON objects contain the exact same keys, and all keys are (byte) strings.
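
The format described above can be read with the standard library alone. The following is a minimal sketch, not the required approach: the helper name `parse_log_lines` and the keys in the sample are hypothetical, and it assumes one JSON object per line separated by `\n`.

```python
import json


def parse_log_lines(raw: bytes) -> list:
    """Decode latin-1 bytes and parse one JSON object per non-empty line."""
    records = []
    for line in raw.split(b"\n"):
        if not line.strip():
            continue
        # latin-1 maps every byte 0-255 to one code point, so decoding cannot fail
        records.append(json.loads(line.decode("latin-1")))
    return records


# Hypothetical two-line sample; the keys in the real log may differ.
sample = (
    b'{"path": "/index.html", "agent": "caf\xe9"}\n'
    b'{"path": "/admin", "agent": "x\\u0000y"}\n'
)
records = parse_log_lines(sample)
print(records)
```

For the real file, open it in binary mode (`open("net-attack-access.log", "rb")`) and pass the result of `read()` to the helper. Calling `value.encode("latin-1")` on a parsed string gives back the original bytes.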

The file `net-attack-access.labels` (contained in `net-attack-access.labels.zip`) contains
the class label for each corresponding JSON log entry.
The positive class (1) means a malicious connection,
and the negative class (0) means a not malicious connection.
About 1/10 of all connections in the file are malicious.

## Objective

We want to build an ML model that will reasonably classify connections as malicious
or not malicious based on webserver log data alone.
This model shall have a reasonable *generalization* score.
The requirements are as follows.

- Select a simple model that needs no hyperparameter tuning for a baseline (e.g. Naive Bayes)
- Perform data pre-processing and evaluate if you can improve the baseline on preprocessing alone
- Select a different model which improves the generalization of the classification

Describe each decision, e.g. why did you try one model over another,
or why you decided on one form of preprocessing over another.

Note: The data are binary strings but ML models accept numbers only.
You will need to extract features from the data before attempting any classification.
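
As a rough illustration of turning one byte string into numbers, the sketch below computes a few simple features. The function name `byte_features` and the specific features chosen are assumptions for illustration only; which features actually help is for you to find out.

```python
def byte_features(value: bytes) -> dict:
    """Turn one byte string into a few numeric features."""
    return {
        "length": len(value),
        # iterating over bytes yields integers in the range 0-255
        "min_byte": min(value) if value else 0,
        "max_byte": max(value) if value else 0,
        # bytes outside the printable ASCII range 0x20-0x7e
        "non_printable": sum(1 for b in value if b < 0x20 or b > 0x7E),
    }


features = byte_features(b"GET /\x00\xff")
print(features)
```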

## Tips

- This is byte data, therefore byte ranges and byte string lengths are easy to retrieve as features.

- Not all features are useful.
  The data is actually quite dirty: some keys have the exact same value across all samples.

- See the Die Hard example for string processing in pandas.
  Whether you work on the strings as byte strings or as converted UTF-8 strings does not alter the result.

- Some features are more important than others, despite having smaller variance.
  Do not blindly perform dimensionality reduction expecting to perform better.

- One-Hot-Encoding will be useful.
  Several features have only a handful of values, i.e. they are categorical.

- Explore the data!  Find all values and the number of values within each feature.
  On UNIX a useful tool is [jq][], which is capable of querying the combined JSON file directly.

[jq]: https://stedolan.github.io/jq/
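
Two of the tips above, dropping keys that never change and one-hot encoding categorical features, can be sketched in pandas as follows. The column names and values are hypothetical and do not come from the real log.

```python
import pandas as pd

# Hypothetical columns; the keys in the real log may differ.
df = pd.DataFrame({
    "method": ["GET", "POST", "GET", "HEAD"],
    "protocol": ["HTTP/1.1"] * 4,  # identical in every sample
    "path_length": [11, 6, 40, 1],
})

# Count distinct values per column to spot constant and categorical features
n_unique = df.nunique()
constant = n_unique[n_unique == 1].index.tolist()
df = df.drop(columns=constant)

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=["method"])
print(encoded.columns.tolist())
```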

## Extra

If you want a bigger challenge (i.e. this is not strictly part of the project),
try clustering the connections.
In other words, had we not had the honeypot heuristics, could we still
tell malicious from non-malicious connections from the data alone?

Using the same data preparation as for the classification, use `kmeans` to find clusters,
and compare how well the clusters match the honeypot labels.
Here dimensionality reduction may help.
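
A minimal sketch of this comparison, again on synthetic stand-in data, might look like the following. Using two clusters and the adjusted Rand index as the agreement measure are assumptions; other cluster counts and metrics are equally valid choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 samples, about 10% in the "malicious" group
y = (rng.random(500) < 0.1).astype(int)
X = rng.normal(size=(500, 4)) + y[:, None] * 4.0

# The labels are NOT given to k-means; they are only used for the comparison
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# The adjusted Rand index ignores how clusters are numbered:
# 1.0 is a perfect match, values near 0.0 are chance-level agreement
ari = adjusted_rand_score(y, clusters)
print(ari)
```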

For an even bigger challenge you can try [other forms of clustering][cluster]
available in `sklearn`.

[cluster]: http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html