# Network Intrusion Detection

### Types of Network Intrusion Detection

* Misuse Detection
    * Exact descriptions of known bad behavior
* Anomaly detection
    * Deviations from profiles of normal behavior
    * First proposed in 1987 by Dorothy Denning
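
The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, not a real detector: the signature strings, the baseline values, and the `k = 3` threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev
from typing import Sequence

# Hypothetical signatures: substrings describing known bad payloads.
SIGNATURES = ["/etc/passwd", "' OR 1=1"]

def misuse_alert(payload: str) -> bool:
    """Misuse detection: alert only when a known-bad pattern matches."""
    return any(sig in payload for sig in SIGNATURES)

def anomaly_alert(value: float, baseline: Sequence[float], k: float = 3.0) -> bool:
    """Anomaly detection: alert when a value lies more than k standard
    deviations away from the profile of normal behavior."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(value - mu) > k * sigma

# Hypothetical profile: requests per minute observed for one host.
baseline = [48, 52, 50, 47, 53, 51, 49, 50]

known_attack = misuse_alert("GET /download?file=../../etc/passwd")
unknown_attack = misuse_alert("GET /download?file=report.pdf")
normal_rate = anomaly_alert(51, baseline)
unusual_rate = anomaly_alert(400, baseline)
```

The misuse detector stays silent on anything it has no signature for, while the anomaly detector flags any large deviation, whether or not it is malicious.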

### Why ML for security: Attack landscape

* Attack sophistication
    * 403M new variants of malware created in 2011
    * 100K unique malware samples daily in 2012 Q1
* Required attacker knowledge decreasing
* Highly motivated attackers

### FACT
* Almost all NIDS used in operational environments are misuse-based
    * Despite lots of research on anomaly detection
    * Despite appeal of anomaly detection to find new attacks
    * Despite success of ML in other domains

### Challenges
* Outlier detection
* High cost of errors
* Lack of appropriate training data
* Interpretation of results
* Variability in network traffic
* Adaptive adversaries
* Evaluation difficulties

### Challenge: Outlier Detection

* Training Sample - Classification
    * Many from both classes
* Required quality - Classification
    * Enough to distinguish two classes
* Training Sample - Outlier Detection
    * Almost all from one class
* Required quality - Outlier Detection
    * Perfect model of normal
    
* Premise: Anomaly detection can find novel attacks
* Fact: ML is better at finding similar patterns than at finding outliers
    * Example: recommend similar products; similarity: products purchased together
* Conclusion: ML is better for finding variants of known attacks

* Underlying assumptions
    * Malicious activity is anomalous
    * Anomalies correspond to malicious activity
* Do these assumptions hold?
    * Former employee requests authorization code
        * Account revocation bug? Insider threat?
        * Username typo
    * User authentication fails 10k times
        * Brute force attack
        * User changed password, forgot to update script
        
### Challenge: High Cost of Errors
* Product Recommendation - Cost of False Negatives
    * Low: potential missed sales
* Product Recommendation - Cost of False Positives
    * Low: continue shopping
* Spam Detection - Cost of False Negatives
    * Low: spam finding way to inbox
* Spam detection - Cost of False Positives
    * High: missed important email
* Intrusion Detection - Cost of False Negatives
    * High: Arbitrary damage
* Intrusion Detection - Cost of False Positives
    * High: Wasted precious analyst time
* Contrast: other domains can absorb errors cheaply through post-processing
    * Spelling/grammar checkers to clean up results
    * Proofreading: much easier than verifying a network intrusion
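
A short worked calculation shows why false positives are so costly when attacks are rare. The base rate, detection rate, and false positive rate below are hypothetical numbers picked for illustration, not measurements.

```python
def alert_precision(base_rate: float, tpr: float, fpr: float) -> float:
    """P(attack | alert) by Bayes' rule.

    base_rate: fraction of events that are attacks
    tpr:       fraction of attacks that raise an alert
    fpr:       fraction of benign events that raise an alert
    """
    true_alerts = base_rate * tpr
    false_alerts = (1.0 - base_rate) * fpr
    return true_alerts / (true_alerts + false_alerts)

# Assumed numbers: 1 in 100,000 connections is malicious, the detector
# catches 99% of attacks and misfires on 1% of benign connections.
precision = alert_precision(base_rate=1e-5, tpr=0.99, fpr=0.01)
print(f"share of alerts that are real attacks: {precision:.4%}")
```

Under these assumptions roughly one alert in a thousand is a real attack, so almost all analyst time goes to verifying false alarms.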

### Challenge: Lack of appropriate training data
* Attack-free data hard to obtain
* Labeled data expensive to obtain
* Product recommendation - training (supervised)
* Spam detection - training (supervised)
* Intrusion detection - training (unsupervised)

### Challenge: Interpretation of results
* Network operator needs actionable reports
    * What does that anomaly mean?
    * Abnormal activity vs. attack
    * Incorporation of site-specific security policies
    * Relation between features of anomaly detection & semantics of environment
    
### Challenge: Variability in network traffic
* Variability across all layers of the network
    * Even most basic characteristics: Bandwidth, duration of connections, application mix
* Large bursts of activity
* What is a stable notion of normality?
* Anomalies do not equal attacks
* One solution: Reduce granularity
    * Example: time of day, day of week
    * Pro: More stable
    * Con: Reduced visibility
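
The stability gained from coarser granularity can be illustrated with synthetic data. The traffic pattern below (a steady load with a burst every 17th minute) is an assumption made up for this sketch, and the coefficient of variation is just one possible stability measure.

```python
from statistics import mean, pstdev

def coefficient_of_variation(xs):
    """Standard deviation relative to the mean: lower means more stable."""
    return pstdev(xs) / mean(xs)

def aggregate(xs, width):
    """Sum consecutive non-overlapping windows of `width` samples."""
    return [sum(xs[i:i + width]) for i in range(0, len(xs) - width + 1, width)]

# Synthetic per-minute connection counts for one day (not real traffic):
# 10 connections per minute, with a burst of 200 every 17th minute.
per_minute = [200 if m % 17 == 0 else 10 for m in range(24 * 60)]
per_hour = aggregate(per_minute, 60)

cv_minute = coefficient_of_variation(per_minute)
cv_hour = coefficient_of_variation(per_hour)
print(f"per-minute CV: {cv_minute:.2f}, per-hour CV: {cv_hour:.2f}")
```

The hourly series varies far less than the per-minute series, but it can no longer tell which minute inside an hour was unusual.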

### Challenge: Adaptive adversaries
* Adversaries adapt
    * ML assumptions do not necessarily hold
        * I.i.d. samples, stationary distributions, linear separability, etc.
    * ML algorithm itself can be an attack target
        * Mistraining, evasion
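
Both attacks on the learner can be sketched against a toy threshold detector. The detector (mean plus three standard deviations), the training values, and the attack rates are hypothetical; real systems and real attacks are more involved.

```python
from statistics import mean, stdev

def fit_threshold(training, k=3.0):
    """Toy anomaly detector: alert on values above mean + k * stdev."""
    return mean(training) + k * stdev(training)

# Hypothetical clean training data: requests per minute.
clean = [48, 52, 50, 47, 53, 51, 49, 50]
attack_rate = 90

clean_threshold = fit_threshold(clean)
detected_when_clean = attack_rate > clean_threshold

# Mistraining: the attacker slips inflated samples into the training
# window, which raises the learned threshold.
poisoned = clean + [70, 80, 90, 100]
poisoned_threshold = fit_threshold(poisoned)
detected_when_poisoned = attack_rate > poisoned_threshold

# Evasion: the attacker stays just below the clean threshold instead.
evasive_rate = 55
evasion_detected = evasive_rate > clean_threshold
```

With clean training data the rate of 90 is flagged. After poisoning it passes unnoticed, and the evasive rate of 55 is never flagged at all.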

### Challenge: Evaluation
* Difficulties with data
    * Data's sensitive nature
    * Lack of appropriate public data
        * Contrast with automated translation: large public corpora such as European Union documents
    * Simulation
        * Capturing characteristics of real data
        * Capturing novel attack detection
    * Anonymization
        * Fear of de-anonymization
        * Removing features of interest to anomaly detection
    * Interpreting the results
        * "HTTP traffic of host did not match profile"
        * Contrast with spam detection: little room for interpretation
    * Adversarial environment
        * Contrast with product recommendation: little incentive to mislead the recommendation system
        
### Root cause
* Using tools borrowed from ML in inappropriate ways
* Goal: effective adoption of ML for large-scale operational environments
    * not a black box approach
    * crisp definition of context
    * understanding semantics of detection

### Guidelines
* understand the threat model
* keep the scope narrow
* reduce the costs
* use secure ML
* evaluation
* gain insights into the problem space

### Guideline: Understand the threat model
* What kind of target environment?
    * academic vs enterprise; small vs large/backbone
* Cost of missed attacks
    * security demands, other deployed detectors
* Attackers' skills and resources
    * targeted vs background radiation
* Risk posed by evasion

### Guideline: Keep the scope narrow
* What are the specific attacks to detect?
* Choose the right tool for the task
    * ML is not a silver bullet
    * Common pitfall: starting with the intention to use ML or, even worse, a particular ML tool
    * No free lunch theorem
* Identify the appropriate features

### Guideline: Reduce the costs
* Reduce the system's scope
* Classification over outlier detection
* Aggregate features over suitable intervals
* Post-process the alerts
* Provide meta-information to analyst to speed up inspection
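
Post-processing alerts and attaching meta-information could look roughly like this. The alert fields (`ts`, `src`, `dst`, `rule`) and the sample values are hypothetical, and grouping by source and rule is only one possible aggregation.

```python
from collections import defaultdict

def summarize_alerts(alerts):
    """Collapse raw alerts into one summary per (source, rule) pair and
    attach context that speeds up inspection."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert["src"], alert["rule"])].append(alert)

    summaries = []
    for (src, rule), items in groups.items():
        times = [a["ts"] for a in items]
        summaries.append({
            "src": src,
            "rule": rule,
            "count": len(items),
            "first_seen": min(times),
            "last_seen": max(times),
            "targets": sorted({a["dst"] for a in items}),
        })
    # Most frequent groups first, so the analyst sees them first.
    return sorted(summaries, key=lambda s: s["count"], reverse=True)

# Hypothetical raw alerts (addresses from documentation ranges).
alerts = [
    {"ts": 100, "src": "192.0.2.10", "dst": "198.51.100.5", "rule": "ssh-bruteforce"},
    {"ts": 130, "src": "192.0.2.10", "dst": "198.51.100.5", "rule": "ssh-bruteforce"},
    {"ts": 160, "src": "192.0.2.10", "dst": "198.51.100.6", "rule": "ssh-bruteforce"},
    {"ts": 140, "src": "192.0.2.77", "dst": "198.51.100.5", "rule": "port-scan"},
]
summary = summarize_alerts(alerts)
```

Four raw alerts become two items to inspect, each with a count, a time span, and the affected targets.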

### Guideline: Evaluation
* Develop insight into anomaly detection system's capabilities
    * What can/can't it detect? Why?
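
One simple way to start answering the "what" is to break results down per class instead of reporting a single number. The labels and outcomes below are made up for illustration; the "why" still requires inspecting individual cases.

```python
from collections import Counter

def per_class_report(records):
    """records: iterable of (label, alerted) pairs.

    Returns the alert rate per label. For attack labels this is the
    detection rate; for the 'benign' label it is the false positive rate.
    """
    total, hits = Counter(), Counter()
    for label, alerted in records:
        total[label] += 1
        hits[label] += int(alerted)
    return {label: hits[label] / total[label] for label in total}

# Hypothetical evaluation results.
records = (
    [("benign", False)] * 18 + [("benign", True)] * 2
    + [("port-scan", True)] * 4
    + [("sql-injection", True)] * 1 + [("sql-injection", False)] * 3
)
report = per_class_report(records)
```

In this made-up run the detector finds every port scan but misses most SQL injections, which an aggregate detection rate would hide.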

### Guideline: Gain insights into the problem space
* ML as means to identify important features
* Use those features to build non-ML detectors
* ML as a means to an end
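
As a sketch of this workflow, a very simple learner (a one-feature decision stump) can point at the most informative feature, which is then turned into a plain rule. The feature names, the events, and the choice of a stump are assumptions for illustration, and the search only considers rules of the form `feature >= threshold`.

```python
def best_stump(rows, labels):
    """Exhaustively fit a one-feature rule `feature >= threshold` and
    return (feature, threshold, accuracy) for the best one found."""
    best = None
    for feature in rows[0]:
        for threshold in sorted({r[feature] for r in rows}):
            preds = [r[feature] >= threshold for r in rows]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or acc > best[2]:
                best = (feature, threshold, acc)
    return best

# Hypothetical labeled events (True = malicious).
rows = [
    {"failed_logins": 1, "bytes_out": 500, "hour": 9},
    {"failed_logins": 0, "bytes_out": 1200, "hour": 14},
    {"failed_logins": 3, "bytes_out": 300, "hour": 22},
    {"failed_logins": 2, "bytes_out": 900, "hour": 3},
    {"failed_logins": 40, "bytes_out": 700, "hour": 10},
    {"failed_logins": 25, "bytes_out": 400, "hour": 15},
    {"failed_logins": 60, "bytes_out": 1000, "hour": 2},
]
labels = [False, False, False, False, True, True, True]

feature, threshold, accuracy = best_stump(rows, labels)

def rule_detector(event):
    """Non-ML detector built from the feature the learner singled out."""
    return event[feature] >= threshold
```

The resulting detector is an ordinary threshold rule that an operator can read, audit, and tune without running any ML in production.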