# Final Capstone

## Overview
For my final capstone in the Thinkful Data Science Course, I will implement multiple models in an attempt to estimate the probability of sequences in computer log activity.  I will determine the likelihood that a user follows a particular path and map these paths to the MITRE ATT&CK Framework, available <a href="https://attack.mitre.org/wiki/Main_Page">here</a>.

Certain activities by a user or program on a computer network can generally be mapped to categories of behavior, known as Tactics in the ATT&CK ontology.  Once mapped, the behaviors form a graphical model that can be represented by a Markov chain.  The following are the MITRE ATT&CK Tactics:
* Persistence
* Privilege Escalation
* Defense Evasion
* Credential Access
* Discovery
* Lateral Movement
* Execution
* Collection
* Exfiltration
* Command and Control
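
As a rough illustration of how Tactics can serve as the states of a Markov chain, the sketch below estimates a transition matrix from sequences of Tactic labels. The function name `estimate_transition_matrix`, the additive `smoothing` pseudo-count, and the example sequences are all assumptions made for illustration; none of them come from the capstone data.

```python
import numpy as np

# The ten Tactics listed above, used as the states of the chain.
TACTICS = [
    "Persistence", "Privilege Escalation", "Defense Evasion",
    "Credential Access", "Discovery", "Lateral Movement",
    "Execution", "Collection", "Exfiltration", "Command and Control",
]
STATE_INDEX = {name: i for i, name in enumerate(TACTICS)}


def estimate_transition_matrix(sequences, smoothing=1.0):
    """Estimate row-stochastic transition probabilities from Tactic sequences.

    `smoothing` is an additive (Laplace) pseudo-count. It is an assumption of
    this sketch, chosen so unseen transitions keep a small non-zero
    probability and every row can be normalized.
    """
    n = len(TACTICS)
    counts = np.full((n, n), smoothing, dtype=float)
    for seq in sequences:
        # Count each observed (current Tactic -> next Tactic) pair.
        for current, nxt in zip(seq, seq[1:]):
            counts[STATE_INDEX[current], STATE_INDEX[nxt]] += 1
    # Normalize each row so it sums to 1.
    return counts / counts.sum(axis=1, keepdims=True)


# Hypothetical, hand-written sequences purely for illustration.
example_sequences = [
    ["Discovery", "Credential Access", "Lateral Movement", "Exfiltration"],
    ["Discovery", "Credential Access", "Privilege Escalation"],
]
P = estimate_transition_matrix(example_sequences)
```

Each row of `P` is the estimated distribution over the next Tactic given the current one. In practice the sequences would come from anomalous events that have already been mapped to Tactics.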

When a malicious user or attacker attacks a network, traces of this behavior are generally left behind in logs: endpoint logs (e.g., Windows or Linux operating system event, access, and security logs), network-based logs (e.g., Intrusion Detection System (IDS) logs, netflow logs, firewall logs, router logs), and application logs (e.g., HyperText Transfer Protocol (HTTP) web server logs, database logs). With so many different logs, identifying a malicious user is difficult, and anomaly detection plays a critical role in identifying malicious behavior.

While anomaly detection identifies interesting events in these logs, a single anomaly is typically not sufficient to determine whether a particular event is actually malicious. To do that, multiple anomalous events must be correlated; together they provide a strong indication that actual malicious activity is present.

In this Capstone, I will identify anomalous activity in log data, relate those anomalies back to the Tactics, and use a Markov chain to estimate the probability that a user performed that sequence of activities.  A very low probability would indicate that an unusual sequence of events has happened and would be a strong indication that an attack has occurred.  The utility of this model is two-fold:
* Correlating anomalous events together automatically reduces the human requirement to manually pull these events together, freeing security analysts to perform other tasks
* Providing the probability of a particular sequence being followed can aid in attribution of specific attacks
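
To make the idea of scoring a sequence concrete, here is a minimal sketch of computing the log-probability of a Tactic sequence under a first-order Markov chain. The `initial` and `transitions` probabilities are hypothetical numbers invented for illustration, restricted to three Tactics, and the sketch assumes every probability it looks up is non-zero.

```python
import math


def sequence_log_probability(sequence, initial, transitions):
    """Log-probability of a state sequence under a first-order Markov chain.

    Assumes every initial and transition probability used is non-zero;
    math.log raises ValueError for a zero probability.
    """
    log_p = math.log(initial[sequence[0]])
    for current, nxt in zip(sequence, sequence[1:]):
        log_p += math.log(transitions[current][nxt])
    return log_p


# Hypothetical probabilities over three Tactics, for illustration only.
initial = {"Discovery": 0.7, "Credential Access": 0.2, "Exfiltration": 0.1}
transitions = {
    "Discovery": {"Discovery": 0.6, "Credential Access": 0.3, "Exfiltration": 0.1},
    "Credential Access": {"Discovery": 0.5, "Credential Access": 0.3, "Exfiltration": 0.2},
    "Exfiltration": {"Discovery": 0.8, "Credential Access": 0.1, "Exfiltration": 0.1},
}

common = ["Discovery", "Discovery", "Discovery"]
rare = ["Discovery", "Credential Access", "Exfiltration"]

lp_common = sequence_log_probability(common, initial, transitions)
lp_rare = sequence_log_probability(rare, initial, transitions)
```

Working in log space avoids numerical underflow on long sequences. A sequence whose log-probability falls well below that of typical sequences is a candidate for closer review.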

In [30]:
import pandas as pd
from datetime import datetime
import numpy as np

In [3]:
df = pd.read_csv('C:\\Users\\Kim\\Downloads\\r1\\logon.csv')

In [4]:
df.head()

Unnamed: 0,id,date,user,pc,activity
0,{Y6O4-A7KC67IN-0899AOZK},01/04/2010 00:10:37,DTAA/KEE0997,PC-1914,Logon
1,{O5Y6-O7CJ02JC-6704RWBS},01/04/2010 00:52:16,DTAA/KEE0997,PC-1914,Logoff
2,{D2D1-C6EB14QJ-2100RSZO},01/04/2010 01:17:20,DTAA/KEE0997,PC-3363,Logon
3,{H9W1-X0MC70BT-6065RPAT},01/04/2010 01:28:34,DTAA/KEE0997,PC-3363,Logoff
4,{H3H4-S5AZ00AZ-9560IYHC},01/04/2010 01:57:30,DTAA/BJM0992,PC-3058,Logon


In [7]:
df.activity.value_counts()

Logon     470877
Logoff    378702
Name: activity, dtype: int64

In [14]:
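# Reduce each timestamp to its calendar day so logons can be counted per day.
# Note: "%m-%d-%Y" strings sort lexicographically (by month first), not chronologically.
df['date'] = df['date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y %H:%M:%S").strftime("%m-%d-%Y"))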

In [15]:
# We are not going to concern ourselves with Logoff activity
df = df[df['activity']=='Logon']

In [20]:
count_df = df.groupby(['date', 'user']).size().reset_index(name="Count")

In [21]:
test = count_df[count_df['user']=='DTAA/TPC0102']

In [22]:
test.head()

Unnamed: 0,date,user,Count
872,01-03-2011,DTAA/TPC0102,2
1868,01-04-2010,DTAA/TPC0102,3
2828,01-04-2011,DTAA/TPC0102,2
3824,01-05-2010,DTAA/TPC0102,3
4784,01-05-2011,DTAA/TPC0102,2


In [28]:
import pymc3 as pm
import theano.tensor as tt

# Daily logon counts for the selected user, as a NumPy array
count_data = test['Count'].values
n_count_data = len(count_data)

with pm.Model() as model:
    # Rate of the Exponential priors: inverse of the mean daily logon count
    alpha = 1.0 / count_data.mean()
    lambda_1 = pm.Exponential("lambda_1", alpha, shape=1)
    lambda_2 = pm.Exponential("lambda_2", alpha, shape=1)
    
    # Switchpoint: index of the day at which the logon rate changes
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)

In [31]:
with model:
    idx = np.arange(n_count_data) # Index
    # Use lambda_1 before the switchpoint and lambda_2 from it onward
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

In [32]:
with model:
    # Poisson likelihood over the observed daily logon counts
    observation = pm.Poisson("obs", lambda_, observed=count_data)
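
The `pm.math.switch(tau > idx, lambda_1, lambda_2)` expression above assigns one Poisson rate to the days before the switchpoint and another to the days from the switchpoint onward. The NumPy-only sketch below mirrors that logic outside of PyMC3 so the behavior is easy to inspect. The helper names `piecewise_rate` and `poisson_log_likelihood` and the example counts are hypothetical and are not part of the model above.

```python
import math

import numpy as np


def piecewise_rate(tau, lambda_1, lambda_2, n):
    """Rate for each of n days: lambda_1 where index < tau, else lambda_2."""
    idx = np.arange(n)
    return np.where(tau > idx, lambda_1, lambda_2)


def poisson_log_likelihood(counts, rates):
    """Sum of Poisson log-probabilities of `counts` given per-day `rates`."""
    counts = np.asarray(counts, dtype=float)
    log_factorials = np.array([math.lgamma(c + 1) for c in counts])
    return float(np.sum(counts * np.log(rates) - rates - log_factorials))


# Hypothetical daily counts with an obvious change after the third day.
example_counts = [2, 2, 2, 5, 5, 5]

rates = piecewise_rate(tau=3, lambda_1=2.0, lambda_2=5.0, n=len(example_counts))
ll_with_switch = poisson_log_likelihood(example_counts, rates)

# With tau=0 no index satisfies tau > idx, so every day uses lambda_2.
flat_rates = piecewise_rate(tau=0, lambda_1=2.0, lambda_2=5.0, n=len(example_counts))
ll_without_switch = poisson_log_likelihood(example_counts, flat_rates)
```

Here the likelihood is higher when the switchpoint matches the change in the counts, which is the signal the sampler uses when it infers `tau`.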