# Methodology


 We want to extend the insights that Zeek and RITA give us about how suspicious particular connections are. The goal of this notebook is to explain the data analysis clearly enough that other, more fluid tools can be built on it.
 
 
**RITA's statistical analysis plus locally gathered heuristics**  
 RITA and Zeek are fantastic tools for exploring traffic and narrowing down bad actors, but published blacklists are laggy and incomplete even when they work and are still supported. The scoring also doesn't filter out DNS or other known-good services, which adds triage time for SecOps. Here we bring some low-cost tools and local knowledge to bear to narrow down what is worth investing time in, good or bad.


Some methods we're using:
- combine the Beacons and Conns files to identify unique talkers
- retrieve BGP Autonomous System info to identify originators (or listeners) in Wild West areas

Also, are we getting connections from nets that practice good hygiene?
- does an IP have a DNS entry?
- does an IP have a PTR record?

What local tools can add dimension?
- was the sender identified as malicious by other means? (fail2ban ICMP type 3 returns)


Unfortunately these don't work anymore:
- MalwareDomains.com
- MalwareDomainList.com
- malware-domains.com

Fresher Blacklist providers (as of 8/2021):
- https://urlhaus.abuse.ch/downloads/csv/
- https://github.com/curbengh/urlhaus-filter
- https://github.com/StevenBlack/hosts



**Home and Home Office Networks**  
You may want to analyze office or home network traffic and find chatty corporate tools. You can add your own entries, but this notebook adds a score that tags "friendly surveillance" from Apple, Google, et al. My lists are US-centric, so tailor them to your locale. These may or may not be things you want in your custom RITA blacklist, but you may not know what they are yet. 





In [1]:
# imports
import pandas as pd
import numpy as np

# Viz imports
import matplotlib.pyplot as plt
import seaborn as sns

# Config matplotlib
%matplotlib inline
plt.rcParams["patch.force_edgecolor"] = True # in matplotlib, edge borders are turned off by default.
sns.set_style("darkgrid") # set a grey grid as a background

# turn off warnings
import warnings
warnings.filterwarnings('ignore')

import csv
import json
import datetime
import time

# ip/AS lookup tools
import socket
from ipwhois import IPWhois
from ipwhois.net import Net
from ipwhois.asn import IPASN

In [2]:
# define corporate target AS Descriptors
invasive_corps = ['AMAZON','APPLE','GOOGLE','MICROSOFT','CLOUDFLARENET','SALESFORCE','AKAMAI','OPENDNS']
sketchy_countries = ['CN','RU','VN','HK','TW','IN','BR','RO','HU','KR','IT','UG','TR','MY','BO','CO']


In [3]:
# keep the first token of each non-blank line in the provider file
with open('beaconish_asns', 'r') as f:
    sketchy_providers = [line.split()[0] for line in f if line.strip()]

In [4]:
sketchy_providers[:5]

['AS-SONICTELECOM,', 'ASIANET', 'ASN-SPIN,', 'ASN-WINDTRE', 'BAIDU']

### load data

This takes the output of `rita show-long-connections` (loaded as `dfconns`) and `rita show-beacons` (loaded as `dfbeacons`).

The obscured IPs must be the same in each file because we merge the two on a matched ipsrc->ipdst key. If no keys match, the inner merge produces an empty df.

In [5]:
dfbeacons = pd.read_csv('records/scrubbed_ext_20210315062437_beacons.csv')
dfconns = pd.read_csv('records/scrubbed_ext_20210315062437_longconns.csv')
#dfdns = pd.read_csv('dns.csv')

In [6]:
dfbeacons.head(1)

Unnamed: 0,Score,Source IP,Destination IP,Connections,Avg Bytes,Intvl Range,Size Range,Top Intvl,Top Size,Top Intvl Count,Top Size Count,Intvl Skew,Size Skew,Intvl Dispersion,Size Dispersion
0,0.874,122.194.229.37,192.168.23.89,43244,60,308072,2583,11,60,7773,43188,0.0,0.0,1,0


In [7]:
dfconns.head(1)

Unnamed: 0,Source IP,Destination IP,Port:Protocol:Service,Duration
0,97.113.95.12,192.168.23.89,53718:tcp:- 53716:tcp:- 44496:tcp:- 44494:tcp:...,367302.0


### merge 

In [8]:
dfconns['ConnString'] = dfconns['Source IP'] + '->' + dfconns['Destination IP']

In [9]:
dfbeacons['ConnString'] = dfbeacons['Source IP'] + '->'+ dfbeacons['Destination IP']

In [10]:
df = pd.merge(dfbeacons, dfconns, on=['ConnString'], how='inner')
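
An alternative to the concatenated `ConnString` key, sketched here on made-up rows rather than real RITA output: pandas can merge on the two IP columns directly, which avoids the duplicated `_x`/`_y` columns, and an inner merge with no matching pairs returns an empty frame that can be checked for. The frame names and sample values below are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the beacons and long-connections exports (hypothetical rows).
beacons = pd.DataFrame({
    "Source IP": ["203.0.113.5", "192.168.23.89"],
    "Destination IP": ["192.168.23.89", "198.51.100.7"],
    "Score": [0.83, 0.82],
})
conns = pd.DataFrame({
    "Source IP": ["203.0.113.5", "203.0.113.99"],
    "Destination IP": ["192.168.23.89", "192.168.23.89"],
    "Duration": [596.079, 12.5],
})

# Merging on both columns keeps a single copy of each IP column (no _x/_y suffixes).
merged = pd.merge(beacons, conns, on=["Source IP", "Destination IP"], how="inner")

# An inner merge with nothing in common yields an empty frame, so check before continuing.
if merged.empty:
    print("no shared source->destination pairs; were both files scrubbed the same way?")
else:
    print(merged)
```

The `ConnString` column is still handy for display, so this is a complement to the cells above rather than a replacement.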

In [12]:
df.head(3)

Unnamed: 0,Score,Source IP_x,Destination IP_x,Connections,Avg Bytes,Intvl Range,Size Range,Top Intvl,Top Size,Top Intvl Count,Top Size Count,Intvl Skew,Size Skew,Intvl Dispersion,Size Dispersion,ConnString,Source IP_y,Destination IP_y,Port:Protocol:Service,Duration
0,0.835,65.254.18.118,192.168.23.89,1224,88,101090,7306,600,60,1052,1201,0.0,0.0,0,0,65.254.18.118->192.168.23.89,65.254.18.118,192.168.23.89,25:tcp:- 25:tcp:smtp,596.079
1,0.834,104.153.105.82,192.168.23.89,272,1762,356385,147842,1,0,81,189,0.0,0.0,0,0,104.153.105.82->192.168.23.89,104.153.105.82,192.168.23.89,80:tcp:- 443:tcp:- 53:udp:dns,149.911
2,0.832,192.168.23.89,212.70.149.71,3005,104,7209,45408,110,88,930,2965,0.0,0.0,1,0,192.168.23.89->212.70.149.71,192.168.23.89,212.70.149.71,3:icmp:-,7123.61


### clean up merge data

Since the merge key is an amalgam of the source and destination IPs, the colliding `_x` and `_y` copies of Source IP and Destination IP are identical, so one set is redundant.

**delete _ys and rename _x**

In [13]:
del df['Source IP_y']

In [14]:
del df['Destination IP_y']

In [15]:
df.rename(columns={"Destination IP_x": "Destination IP",'Source IP_x':'Source IP'}, inplace=True)

In [16]:
df.columns

Index(['Score', 'Source IP', 'Destination IP', 'Connections', 'Avg Bytes',
       'Intvl Range', 'Size Range', 'Top Intvl', 'Top Size', 'Top Intvl Count',
       'Top Size Count', 'Intvl Skew', 'Size Skew', 'Intvl Dispersion',
       'Size Dispersion', 'ConnString', 'Port:Protocol:Service', 'Duration'],
      dtype='object')

### adding hostname lookups

In [17]:
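def iplookup(ipaddress):
    """Return the reverse-DNS (PTR) hostname for an IP, or the IP itself if the lookup fails."""
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        hostname = socket.gethostbyaddr(ipaddress)[0]
    except Exception:
        # no PTR record (or a timeout): fall back to the address
        hostname = ipaddress
    return hostname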

#### This takes awhile.

In [18]:
df['Source Name'] = df['Source IP'].apply(iplookup)

#### This takes even longer.

Garbage connections often don't have a DNS record, and each failed lookup has to time out, so this column takes a while to build.

In [None]:
# this takes awhile thanks to DNS timeouts
#start = datetime.datetime.now()
df['Destination Name'] = df['Destination IP'].apply(iplookup)
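
One way to cut the lookup time, sketched under the assumption that many rows share the same address: resolve each distinct IP once and reuse the answer. `reverse_name` and `build_name_cache` are hypothetical helper names, not part of the notebook above.

```python
import socket

def reverse_name(ip):
    """Return the PTR hostname for ip, or the ip itself if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return ip

def build_name_cache(ips, resolver=reverse_name):
    """Resolve each distinct address once and return {ip: name}."""
    cache = {}
    for ip in ips:
        if ip not in cache:
            cache[ip] = resolver(ip)
    return cache
```

With the cache built from `df['Destination IP'].unique()`, the column could then be filled with `df['Destination IP'].map(cache)`, so a timeout is paid once per address instead of once per row.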


In [None]:
df.head()

### Describing the IP sets

In [None]:
# unique localhosts
len(df['Source IP'].unique())

In [None]:
df['Source IP'].unique()

In [None]:
# unique targets
len(df['Destination IP'].unique())

In [None]:
# unique connections
len(df['ConnString'].unique())

### Adding AS info

If you're reading this, you probably already know what all this is and why we're grading data this way. If not, read on:

**Quick BGP/AS intro (stolen liberally from Cloudflare's great tutorial):**  
The *Border Gateway Protocol (BGP)* is the postal service of the Internet. The Internet is broken into smaller networks known as *Autonomous Systems (AS)*, each of which is essentially a large pool of routers run by a single organization. 

If we continue to think of BGP as the postal service of the Internet, ASes are like individual post office branches. A town may have hundreds of mailboxes, but the mail in those boxes must go through the local postal branch before being routed to another destination. The internal routers within an AS are like mailboxes: they forward their outbound transmissions to the AS, which then uses BGP routing to get these transmissions to their destinations.

To get on the Internet you need an IP block, which needs to be announced by a BGP AS. The organizations that run an AS are responsible for the traffic that goes through it. If a lot of bad traffic comes from one server in an AS, there's reason to suspect that other IP space controlled by that AS is also poorly managed. 

**Grading traffic from a particular AS block**  
This may be part of a decision to drop traffic coming from a single server or from the entire IP space as a Network Admin, but in this context we're simply going to grade traffic to that AS as more suspicious.

**Grading traffic coming from a Country**  
AS numbers and address blocks are handed out by regional registries to networks in individual countries, each of which has its own laws regarding internet traffic, hacking, etc. Some countries are more permissive than others with regard to hacking, fraud and spam. While it's incorrect and unfair to grade the citizens or services of a country based on the worst of their netizens, it's reasonable to grade countries with overly permissive (or non-existent) laws about hacking higher for further review. 

**Grading traffic coming from a Company**  
Many of the same rules apply here: if a company has a policy for its devices to send tracking data home through your networks, you should be able to know about it. If it has lax policies concerning network access or services that could host C2 or bad traffic, you should be able to know about that, too. 

In [29]:
import ipaddress

# category name -> key in the IPASN lookup result
DST_AS_KEYS = {
    'reg': 'asn_registry',
    'asnnum': 'asn',
    'asncidr': 'asn_cidr',
    'asncc': 'asn_country_code',
    'asndate': 'asn_date',
    'asndesc': 'asn_description',
}

def getDstAsInfo(ip, category='asnnum', **kwargs):
    """
    Return one AS field for a public address; private (non-routable) addresses are marked 'rfc1918'.
    Usage: df['Destination IP'].apply(getDstAsInfo, category='asncidr')
    """
    # is_private covers 10/8, 172.16/12 and 192.168/16 without matching substrings elsewhere in the address
    if ipaddress.ip_address(ip).is_private:
        return 'rfc1918'
    if category is None:
        return "no category"
    if category not in DST_AS_KEYS:
        return False
    # only query once we know the category is valid
    results = IPASN(Net(ip)).lookup()
    return results[DST_AS_KEYS[category]]
    

In [26]:
import ipaddress

def getAsInfo(item, category='asn', **kwargs):
    """
    Look up AS info for the public side of a connection, whether it is the sender or the receiver.

    Usage: df[['Source IP','Destination IP']].apply(getAsInfo, category='asn_cidr', axis=1)

    category 'all' returns (asn_cidr, asn, asn_desc, asn_country)
    """
    src, dst = item['Source IP'], item['Destination IP']

    # get AS info for the non-private address: use the destination when the source is private
    target_ip = dst if ipaddress.ip_address(src).is_private else src

    results = IPASN(Net(target_ip)).lookup()

    # category name -> key in the lookup result
    keys = {
        'reg': 'asn_registry',
        'asn': 'asn',
        'asn_cidr': 'asn_cidr',
        'asn_country': 'asn_country_code',
        'asn_date': 'asn_date',
        'asn_desc': 'asn_description',
    }
    if category == 'all':
        return results['asn_cidr'], results['asn'], results['asn_description'], results['asn_country_code']
    if category in keys:
        return results[keys[category]]
    return False
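
A minimal sketch of picking the routable side of a connection with the standard-library `ipaddress` module, which recognizes all three RFC 1918 ranges instead of relying on a `'192.168'` substring. `is_private` and `external_endpoint` are hypothetical helper names; note that `ipaddress` also treats some other special-purpose ranges (loopback, link-local) as private.

```python
import ipaddress

def is_private(ip):
    """True for RFC 1918 and other special-purpose addresses the stdlib treats as private."""
    return ipaddress.ip_address(ip).is_private

def external_endpoint(src, dst):
    """Pick the routable side of a connection; None if both sides are private."""
    if not is_private(src):
        return src
    if not is_private(dst):
        return dst
    return None

print(external_endpoint("192.168.23.89", "212.70.149.71"))
print(external_endpoint("10.0.0.1", "192.168.1.1"))
```

Returning `None` when both ends are private makes it easy to skip the AS lookup for purely internal traffic.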
    

In [23]:
asdata = ['asn','asn_cidr','asn_country','asn_desc']

In [27]:
for a in asdata:
    df[a] =  df[['Source IP','Destination IP']].apply(getAsInfo,category=a,axis=1)

In [28]:
df[:10]


Unnamed: 0,Score,Source IP,Destination IP,Connections,Avg Bytes,Intvl Range,Size Range,Top Intvl,Top Size,Top Intvl Count,...,Intvl Dispersion,Size Dispersion,ConnString,Port:Protocol:Service,Duration,Source Name,asn,asn_cidr,asn_country,asn_desc
0,0.835,65.254.18.118,192.168.23.89,1224,88,101090,7306,600,60,1052,...,0,0,65.254.18.118->192.168.23.89,25:tcp:- 25:tcp:smtp,596.079,smtp.jobdivabk.com,46887,65.254.0.0/19,US,"LIGHTOWER, US"
1,0.834,104.153.105.82,192.168.23.89,272,1762,356385,147842,1,0,81,...,0,0,104.153.105.82->192.168.23.89,80:tcp:- 443:tcp:- 53:udp:dns,149.911,v-104-153-105-82.unman-vds.premium-chicago.nfo...,14586,104.153.105.0/24,US,"NUCLEARFALLOUT-CHI, US"
2,0.832,192.168.23.89,212.70.149.71,3005,104,7209,45408,110,88,930,...,1,0,192.168.23.89->212.70.149.71,3:icmp:-,7123.61,192.168.23.89,204428,212.70.149.0/24,BG,"SS-NET, BG"
3,0.828,49.235.37.144,192.168.23.89,112,126,179,1853,44,60,26,...,1,0,49.235.37.144->192.168.23.89,22:tcp:- 22:tcp:ssh,103.858,49.235.37.144,45090,49.235.32.0/20,CN,CNNIC-TENCENT-NET-AP Shenzhen Tencent Computer...
4,0.828,192.168.23.89,119.28.83.164,38,159,241,2192,62,88,9,...,1,0,192.168.23.89->119.28.83.164,3:icmp:-,237.145,192.168.23.89,132203,119.28.82.0/23,CN,"TENCENT-NET-AP-CN Tencent Building, Kejizhongy..."
5,0.823,192.168.23.89,157.230.210.84,42,226,220,1944,208,88,7,...,2,0,192.168.23.89->157.230.210.84,3:icmp:-,117.083,192.168.23.89,14061,157.230.208.0/20,US,"DIGITALOCEAN-ASN, US"
6,0.823,192.168.23.89,49.233.77.12,102,162,268,2564,85,88,15,...,2,0,192.168.23.89->49.233.77.12,3:icmp:-,258.82,192.168.23.89,45090,49.233.64.0/20,CN,CNNIC-TENCENT-NET-AP Shenzhen Tencent Computer...
7,0.823,192.168.23.89,27.128.236.189,114,140,162,2036,80,88,19,...,2,0,192.168.23.89->27.128.236.189,3:icmp:-,156.136,192.168.23.89,4134,27.128.0.0/15,CN,"CHINANET-BACKBONE No.31,Jin-rong Street, CN"
8,0.823,192.168.23.89,179.97.86.254,137,133,146,2120,68,88,23,...,2,0,192.168.23.89->179.97.86.254,3:icmp:-,136.993,192.168.23.89,28361,179.97.86.0/23,BR,"RR conect, BR"
9,0.823,192.168.23.89,118.24.134.15,65,177,153,2152,136,88,11,...,2,0,192.168.23.89->118.24.134.15,3:icmp:-,136.034,192.168.23.89,45090,118.24.132.0/22,CN,CNNIC-TENCENT-NET-AP Shenzhen Tencent Computer...


In [None]:
def testAsn(ip):
    try:
        net = Net(ip)
        obj = IPASN(net)
        results = obj.lookup()
    except Exception as e:
        return e
    stuff = results['asn_cidr'], results['asn'], results['asn_description'],results['asn_country_code']
    #print(results[['asn_cidr','asn','asn_description','asn_country_code']])
#     print(results['asn_cidr'], results['asn'], results['asn_description'],results['asn_country_code'])
#     return len(results), results
    print(len(stuff),stuff)

In [None]:
df[['Source IP','Destination IP']][:5]

In [None]:
def testAsInfo(item):
    one, two = item['Source IP'], item['Destination IP']
    target_ip = one    
    if '192.168' in one:
        target_ip = two
        
    print(one, two, target_ip)

In [None]:
df[['Source IP','Destination IP']][:5].apply(getAsInfo,category='all',axis=1)

In [None]:
df[['asn_cidr','asn', 'asn_desc', 'asn_country']][:5]

In [None]:
df['Source IP'].apply(testAsn)

In [None]:
# seems to hate this because I want to infer 4 values from 2 values
#df[['asncidr','asn', 'asn_desc', 'asn_country']] = df[['Source IP','Destination IP']].apply(getAsInfo,category='all',axis=1)
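
One way the four-values-from-two-columns assignment can be made to work is `result_type='expand'`, which asks `DataFrame.apply` to spread each returned tuple across columns. The sketch below uses a stand-in lookup (`fake_as_lookup`, hypothetical, with made-up return values) so it runs without network access.

```python
import pandas as pd

def fake_as_lookup(row):
    """Stand-in for getAsInfo(..., category='all'); returns a 4-tuple without touching the network."""
    ip = row["Destination IP"] if row["Source IP"].startswith("192.168.") else row["Source IP"]
    # Hypothetical values: a real lookup would come from IPASN(...).lookup().
    return ("0.0.0.0/0", "64512", "EXAMPLE-AS for " + ip, "ZZ")

pairs = pd.DataFrame({
    "Source IP": ["65.254.18.118", "192.168.23.89"],
    "Destination IP": ["192.168.23.89", "212.70.149.71"],
})

# result_type='expand' turns each returned tuple into one column per element.
expanded = pairs.apply(fake_as_lookup, axis=1, result_type="expand")
expanded.columns = ["asn_cidr", "asn", "asn_desc", "asn_country"]

# join aligns on the row index, so the new columns line up with the original rows.
pairs = pairs.join(expanded)
print(pairs)
```

This does one lookup per row instead of one per row per column, which matters when each lookup goes out over the network.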

In [None]:
df[:3]

#### add ASN columns

takes a little time for the lookups

In [None]:
# add asncidr (Series.apply works element by element, so no axis argument is needed)
df['asncidr'] = df['Destination IP'].apply(getDstAsInfo, category='asncidr')

In [None]:
df['asn'] = df['Destination IP'].apply(getDstAsInfo, category='asnnum')

In [None]:
df['asn_desc'] = df['Destination IP'].apply(getDstAsInfo, category='asndesc')

In [None]:
df['asn_country'] = df['Destination IP'].apply(getDstAsInfo, category='asncc')

In [None]:
df.head(3)

#### for all the AS entries where we punted in dst, redo for src

In [None]:
# fill in the blanks for sources: rows whose destination was private get the source's AS info
mask = df['asncidr'] == 'rfc1918'
df.loc[mask, 'asncidr'] = df.loc[mask, 'Source IP'].apply(getDstAsInfo, category='asncidr')

In [None]:
mask = df['asn'] == 'rfc1918'
df.loc[mask, 'asn'] = df.loc[mask, 'Source IP'].apply(getDstAsInfo, category='asnnum')

In [None]:
mask = df['asn_desc'] == 'rfc1918'
df.loc[mask, 'asn_desc'] = df.loc[mask, 'Source IP'].apply(getDstAsInfo, category='asndesc')

In [None]:
# countries 
mask = df['asn_country'] == 'rfc1918'
df.loc[mask, 'asn_country'] = df.loc[mask, 'Source IP'].apply(getDstAsInfo, category='asncc')

In [None]:
df[:10]

In [None]:
# rows where both ends are private still carry the marker
df[df['asn'] == 'rfc1918']

**how many are unique?**

In [None]:
len(df['asn'].unique())

In [None]:
df['asn'].unique()[:10]

In [None]:
len(df['asn_country'].unique())

In [None]:
df['asn_country'].unique()

In [None]:
countries = df['asn_country'].unique()

In [None]:
df[df['asn_desc']!= 'rfc1918'][:5]

In [None]:
df[['asn','asn_desc','Source IP']].value_counts()

#### add AS Features

 Add booleans if the connection is either a known invasive tech company or in the sketchy country list.

In [None]:
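def is_sketchy(country_code):
    # asn_country holds a two-letter country code
    return country_code in sketchy_countries

In [None]:
def is_corp(asn_desc):
    # descriptions look like "LIGHTOWER, US", so match on substring rather than equality
    return isinstance(asn_desc, str) and any(corp in asn_desc.upper() for corp in invasive_corps)

In [None]:
def is_sketchy_provider(asn_desc):
    # sketchy_providers holds the first token of each provider line, so compare first tokens
    if not isinstance(asn_desc, str) or not asn_desc.split():
        return False
    return asn_desc.split()[0] in sketchy_providers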

In [None]:
len(df[df['asn_country'].apply(is_sketchy)])

In [None]:
df['sketchy'] = df['asn_country'].apply(is_sketchy)

In [None]:
df['iscorp'] = df['asn_desc'].apply(is_corp)

In [None]:
df['sketchy_provider'] = df['asn_desc'].apply(is_sketchy_provider)

In [None]:
df.head(3)

### Network sanity

Are DNS/Reverse protocols handled in a friendly way?

- reverse pointers
- DNS entries

In [None]:
df['Source Name'][0]

In [None]:
def isip(name):
    """
    is the string an ipv4 address?
    """
    try: 
        socket.inet_aton(name)
        return True
    except (OSError, TypeError):
        return False

In [None]:
def has_dns(name):
    """
    earlier we looked up a hostname and returned the IP if none was found.
    here we say "if that name is still an IP then there was no DNS record"
    """
    return not isip(name)

In [None]:
def has_ptr(name):
    """
    the hostname came from a reverse (PTR) lookup, so this is the same test as has_dns:
    "if that name is still an IP then there was no PTR record"
    """
    return not isip(name)

In [None]:
df['Source Name'].apply(isip)

In [None]:
df['src_ptr'] = df['Source Name'].apply(has_ptr)
df['dst_ptr'] = df['Destination Name'].apply(has_ptr)
df['src_dns'] = df['Source Name'].apply(has_dns)
df['dst_dns'] = df['Destination Name'].apply(has_dns)
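
The `*_ptr` and `*_dns` flags are both derived from one reverse lookup. A stricter hygiene check, sketched here as an assumption about what a separate DNS test could look like, is forward-confirmed reverse DNS: the PTR name must resolve back to the same address. `forward_confirmed` is a hypothetical helper; the resolver arguments are injectable so the logic can be exercised without live DNS, and `gethostbyname_ex` covers IPv4 only.

```python
import socket

def forward_confirmed(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """
    True when ip has a PTR name and that name resolves back to ip.
    reverse/forward are injectable so the check can be exercised without DNS.
    """
    try:
        name = reverse(ip)[0]            # (hostname, aliaslist, ipaddrlist)
        addresses = forward(name)[2]     # (hostname, aliaslist, ipaddrlist)
    except OSError:
        return False
    return ip in addresses
```

A result could be stored in its own column, for example `df['src_fcrdns'] = df['Source IP'].apply(forward_confirmed)`, at the cost of a second lookup per address.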


In [None]:
df.head(3)

**did we miss any?**

In [None]:
# weird entry - whois returns the AS info, but no description or prefix
# AS      | IP               | BGP Prefix          | CC | Registry | Allocated  | AS Name
# NA      | 69.195.171.128   | NA                  | US | arin     | 2017-09-18 | NA
# From Hurricane Electric - Twitter:
# AS13414 IRR Valid 69.195.171.0/24 Twitter Inc.
df[df['asn'] == 'NA']

### Checking for fail2ban entries

https://www.fail2ban.org/wiki/index.php/Main_Page

If you aren't familiar, fail2ban scans log files (e.g. /var/log/apache/error_log) and bans IPs that show malicious signs such as too many password failures or probing for exploits. If something hammers the logs enough to trigger a fail2ban entry, that adds suspicion to the originating connection. 

On this host, banned senders are answered with ICMP type 3 (destination unreachable), so we treat ICMP type 3 in the logs as a sign that the remote host was caught by fail2ban.





In [None]:
# here we show the unique protocols available in our test
len(df['Port:Protocol:Service'].unique())

In [None]:
# and here's a count of which protocols are represented in our sample
df['Port:Protocol:Service'].value_counts()
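
`value_counts()` on the raw field counts whole combinations such as `80:tcp:- 443:tcp:- 53:udp:dns` as one value. A small parsing sketch, assuming every token has the `port:proto:service` shape seen in the sample rows, counts each protocol/service pair once per connection instead. `parse_services` and `service_counts` are hypothetical helper names.

```python
from collections import Counter

def parse_services(field):
    """Split 'port:proto:service' tokens into (port, proto, service) tuples."""
    parsed = []
    for token in field.split():
        parts = token.split(":")
        if len(parts) != 3:
            continue  # skip anything that is not port:proto:service
        parsed.append(tuple(parts))
    return parsed

def service_counts(fields):
    """Count each distinct (proto, service) pair once per connection row."""
    counts = Counter()
    for field in fields:
        counts.update({(proto, service) for _, proto, service in parse_services(field)})
    return counts

sample = ["25:tcp:- 25:tcp:smtp", "80:tcp:- 443:tcp:- 53:udp:dns", "3:icmp:-"]
print(service_counts(sample))
```

On the real frame this would be called as `service_counts(df['Port:Protocol:Service'])`.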

In [None]:
#services = {'icmp':3,'ssh':22,'smtp':25,'dns':53,'ssl':443,'http':80}
services = ['icmp','ssh','smtp','dns','ssl','http']

In [None]:
def f2b_marked(s):
    """
    fail2ban responds to connection overload by replying with ICMP type 3 "unreachable"
    if this exists in the connection, we'll presume that this host was flooding
    """
    # note: this flags any ICMP entry in the field, not only type 3
    return 'icmp' in s

In [None]:
# multiple match list
# [s for s in my_list if any(xs in s for xs in matchers)] # greedy - returns too much
# {s for s in my_list for xs in matchers if xs in s}

In [None]:
# add fail2ban hit feature
df['fail2ban'] = df['Port:Protocol:Service'].apply(f2b_marked)

### some simple aggregated term analysis


In [None]:
# sketchy is false
df[~df['sketchy']][:3]

In [None]:
# connections flagged by fail2ban with no DNS entry
df[(~df['dst_dns'])&(df['fail2ban'])]

### Extracts using the flags

Now we can use pandas and the features to test the output.

In [None]:
# all providers where connection has no dst_ptr or dst_dns and has a fail2ban hit
df[(~df['dst_ptr'])&(~df['dst_dns']) &(df['fail2ban'])].asn_desc.unique()

In [None]:
# rows that are not sketchy and have no source DNS name
df[(~df['sketchy']) & (~df['src_dns'])][:3]

In [None]:
df[(~df['sketchy']) &(df['fail2ban'])][:3]

### what AS regions get the most traffic?

In [None]:
df[['asn','asn_desc','Source IP']].value_counts()

In [None]:
df[['asn','asn_desc','Source IP']][:11].value_counts()

### Stats analysis

In [None]:
# describe the stats
df.describe()

mean score

In [None]:
df['Score'].mean().round(3)

mean duration in seconds

In [None]:
df['Duration'].mean().round(3)

relative item correlation

In [None]:
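# correlate only numeric and boolean columns; text columns such as ConnString can't be correlated
df.select_dtypes(include=['number', 'bool']).corr()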

#### adding a heatmap to the correlation

This data doesn't have corporate returns or ASs from the sketchy provider map. 
TODO: sort out sketchy providers from the data at the start of the definitions

In [None]:
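fig = plt.figure(figsize=(15,8))
# restrict to numeric and boolean columns before correlating
sns.heatmap(df.select_dtypes(include=['number', 'bool']).corr(), linewidths=.1, linecolor='black')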

### adding viz and stats

What are the most prevalent AS Numbers?

In [None]:
df['asn'][:30].value_counts().plot(kind='bar')

Where are they coming from?

In [None]:
df['asn_desc'][:30].value_counts().plot(kind='bar')

What countries account for the most traffic?

In [None]:
df['asn_country'][:10].value_counts().plot(kind='bar')

Is there a correlation between average bytes and number of connections?

In [None]:
df[['Avg Bytes','Connections']][:10].plot()

**What's the relative occurrence of high beaconish traffic?**

In [None]:
df[['Score']][:30].plot(y='Score')

In [None]:
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df['Score'], kde=True)

**how about long duration**

In [None]:
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(df['Duration'], kde=True)

#### how about services?


In [None]:
# Most of the hits come from fail2ban attempting to quash traffic, so we'll remove the ICMP entries
df[~df['Port:Protocol:Service'].str.contains('icmp')]['Port:Protocol:Service'].value_counts().plot(kind='bar')

### High Beaconish Originators

In [None]:
df[(df['Score'] >.80)][3:]['asn_desc']

In [None]:
df['asn_country'].value_counts()

In [None]:
# highest traffic country entries
df[df['asn_country'] =='CN']

### build a view of connections where duration value is short and beaconish is high

In [None]:
# What are the relative duration statistics?
df['Duration'].describe()

In [None]:
# What is the relative score distribution?
df['Score'].describe()

In [None]:
# looking at raw duration length values
df['Duration'].sort_values()

**Start drilling down**

Find the mean of all the Duration values. Use the Mean to determine how ordinary the duration of the traffic is

In [None]:
# mean of all Duration values
df['Duration'].mean()

In [None]:
# Show only durations below the mean
df[df['Duration'] < df['Duration'].mean()]

**what are connections where duration is below a particular quantile?**

In [None]:
df[df['Duration'] < df['Duration'].quantile(.2)]

**connections where duration value is short and beaconish is high**

- only get low-duration connections that score above 75% on beaconism

In this case, there are a bunch of ICMP messages originating from my host heading to (mostly) China. If fail2ban weren't running this might be cause for further investigation, but fail2ban sends ICMP type 3 packets to an originator when it gets jailed. We're catching this upstream in the fail2ban column. 

In [None]:
print(len(df[(df['Duration'] < df['Duration'].quantile(.2)) & (df['Score'] > .75)]))
df[(df['Duration'] < df['Duration'].quantile(.2)) & (df['Score'] > .75)][:3]

**if anything is left that's not fail2ban there is something to dig further into**

In [None]:
# if anything is left that's not fail2ban there is something to dig further into
df[(df['Duration'] < df['Duration'].quantile(.2)) & (df['Score'] > .75)&(~df['fail2ban'])][:10]

**is anything not originating from my ip?**

In [None]:
# is anything not originating from my ip?
df[(df['Duration'] < df['Duration'].quantile(.2)) & (df['Score'] > .75)]['Source IP'].unique()

**look at only non-fail2ban items where Duration is below the 20% quantile, Score is greater than .75 and the source is my server**

In [None]:
# look at only non-fail2ban items where Duration is below the 20% quantile, Score is greater than .75 and the source is my server
# nothing here - so it looks like beaconish activity here is fail2ban related (handled by )
df[(df['Duration'] < df['Duration'].quantile(.2)) & (df['Score'] > .75)& (~df['fail2ban']) &(df['Source IP'].str.contains('192.168.23.89'))]

In [None]:
df[(df['Duration'] < df['Duration'].quantile(.2)) & (df['Score'] > .75)&(df['Source IP'] != '192.168.23.89')]

**a quick tool to see if an ipaddress is in the pandas datastore**

In [None]:
def showline(ipaddress):
    return df[df['Source IP'] == ipaddress]

In [None]:
showline('123.127.244.100')

**where are ssh connections coming/going?**

In [None]:
# where are ssh connections coming/going?
# everything appears to be incoming, so we aren't launching any attacks
df[(df['Port:Protocol:Service'].str.contains('ssh')) & (df['Source IP'] != "192.168.23.89")]['ConnString'].unique()

**show all unique source names with scores above 80%**

In [None]:
# show all unique source names with scores above 80%
df[df['Score']> .8]['Source Name'].unique()

### Blacklists

Adding a blacklist heuristic. Most of rita-bl seems borked right now (stale data, backends offline, etc.). In the meantime, let's get visibility using the Spamhaus data.

In [None]:
# importing a custom, line-delimited list
with open('20210827154850_blacklisted_ips.txt', 'r') as f:
    # a set makes the per-row membership test fast
    blacklist = {line.strip() for line in f if line.strip()}

In [None]:
def blacklist_test(ip):
    badreturns = []
    hits = []
    if ip in blacklist:
        hits.append(ip)
#     else:
#         badreturns.append(ip)
    if len(hits) > 0:
        return hits

In [None]:
def in_blacklist(ip):
    return ip in blacklist

In [None]:
# add feature
df['blacklisted'] = df['Destination IP'].apply(in_blacklist)

In [None]:
# are there any hits?
df[df['blacklisted']]
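
Some published lists carry CIDR ranges rather than single addresses, which an exact string match would miss. The sketch below assumes entries may be plain IPs or CIDR ranges with optional trailing `;` or `#` comments; `load_blacklist` and `ip_in_blacklist` are hypothetical helpers, not part of rita-bl.

```python
import ipaddress

def load_blacklist(lines):
    """Parse lines of IPs or CIDR ranges into networks; skip blanks, comments and junk."""
    networks = []
    for line in lines:
        # drop trailing comments introduced by '#' or ';'
        entry = line.split("#", 1)[0].split(";", 1)[0].strip()
        if not entry:
            continue
        try:
            networks.append(ipaddress.ip_network(entry, strict=False))
        except ValueError:
            continue  # ignore lines that are not addresses
    return networks

def ip_in_blacklist(ip, networks):
    """True if ip falls inside any listed address or range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)
```

Because inbound connections have the remote host in `Source IP`, it may be worth applying the same test to both IP columns.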

### Heuristics 
#### show the tally points

Here we want to score based on the conditions. Some things are bad if they're True (a sketchy country code such as Russia or China); some are bad if they're False (no reverse PTR). Scoring needs a scale: some things are inherently worse (the host is the source of an attack in the wild) and some are less so (missing reverse DNS).

Reasons to believe the traffic is not good (this could use expansion):
- sketchy - if True (the connection is from a poorly managed country), add 3
- fail2ban - if True (the host is spawning attacks in the wild), add 3
- sketchy_provider - if True, add 3

Formal laziness:
- src_ptr - if they are the source and this is False, add 2
- dst_ptr - if they are the dst and this is False, add 2
- src_dns - if they are the source and this is False, add 2
- dst_dns - if they are the dst and this is False, add 2

Corporate canaries:
- iscorp - corporate canaries (Apple, Google, Microsoft, etc.). If True, then bad (though probably harmless): add 1



so what I need is:
- a feature that lets me know if src/dst is important for ptr and dns
- a function that returns the value if the feature is present for each item and then tallies a score to be added as a feature.
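
A minimal sketch of those two pieces, assuming the weights listed above and a caller that already knows which end of the connection is the remote host. `BAD_IF_TRUE`, `BAD_IF_FALSE` and `tally` are hypothetical names; keeping the weights in dictionaries is one way to make the heuristics easy to extend.

```python
# Hypothetical weight tables mirroring the tally points listed above.
BAD_IF_TRUE = {"sketchy": 3, "fail2ban": 3, "sketchy_provider": 3, "iscorp": 1}
BAD_IF_FALSE = {"src_ptr": 2, "dst_ptr": 2, "src_dns": 2, "dst_dns": 2}

def tally(row, remote_side):
    """
    Sum the weights for one connection.
    row: mapping of flag name -> bool (a dict or a pandas row both work).
    remote_side: 'src' or 'dst', whichever end is the external host.
    """
    total = 0
    for flag, weight in BAD_IF_TRUE.items():
        if row.get(flag, False):
            total += weight
    for flag, weight in BAD_IF_FALSE.items():
        # Only the remote end is expected to publish PTR/DNS records.
        if flag.startswith(remote_side + "_") and not row.get(flag, True):
            total += weight
    return total

example = {"sketchy": True, "fail2ban": False, "sketchy_provider": False, "iscorp": False,
           "src_ptr": False, "src_dns": False, "dst_ptr": True, "dst_dns": True}
print(tally(example, "src"), tally(example, "dst"))
```

Missing flags are treated as harmless here (absent `True`-is-bad flags count as False, absent `False`-is-bad flags count as True), which is a design choice rather than a requirement.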

In [None]:
# adding a "score" feature first - Beaconish Score (how likely is this a problematic beacon?)
df['bscore'] = 0

In [None]:
# we're tallying on these columns
df[['sketchy','src_ptr','dst_ptr','iscorp','sketchy_provider','src_dns','dst_dns','fail2ban', 'blacklisted']][:5]

In [None]:
def tally_total(item):
    """
    tally up scores per row. 
    where are we **sending** data (beacons)?
    need the columns
    df[['bscore','asn','sketchy','src_ptr','dst_ptr','iscorp','sketchy_provider','src_dns','dst_dns','fail2ban','blacklisted']]

    must call apply with axis=1 e.g.
    df[['bscore','asn','sketchy','src_ptr','dst_ptr','iscorp','sketchy_provider','src_dns','dst_dns','fail2ban','blacklisted']].apply(tally_total,axis=1)
    """

    total = item['bscore']
    
    # presuming our internal network is in RFC1918. Open Internet Addresses should have reverse pointers and DNS, even if we don't internally
    if item['asn'] == 'rfc1918':
        if not item['src_ptr']:
            total += 2
        if not item['src_dns']:
            total += 2
    else:
        if not item['dst_ptr']:
            total += 2
        if not item['dst_dns']:
            total += 2
    # each remaining heuristic adds its own points, so several can apply to one row
    # fail2ban hit
    if item['fail2ban']:
        total += 3
    # is the connection to a sketchy country?
    if item['sketchy']:
        total += 3
    # how about to a sketchy provider?
    if item['sketchy_provider']:
        total += 3
    # is the IP in the spamhaus blacklist?
    if item['blacklisted']:
        total += 3
    # corporate spyware is the lowest priority. This scoring should make it easier to build filters, also.
    if item['iscorp']:
        total += 1
    return total
        
        

In [None]:
# using tally_total
df[['bscore','asn','sketchy','src_ptr','dst_ptr','iscorp','sketchy_provider','src_dns','dst_dns','fail2ban','blacklisted']][:10].apply(tally_total,axis=1)

In [None]:
# full scoring
# using tally_total
df['bscore'] = df[['bscore','asn','sketchy','src_ptr','dst_ptr','iscorp','sketchy_provider','src_dns','dst_dns','fail2ban','blacklisted']].apply(tally_total,axis=1)

In [None]:
# sort the whole frame first, then show the top five
df.sort_values(by='bscore', ascending=False)[:5]

#### Now create a total score

Now to make the single value that combines RITA's statistical analysis score ('Score') and our heuristic score ('bscore'). For the moment Score * bscore seems useful because Score is a fraction between 0 and 1, which should scale the raw heuristic tally nicely.

In [None]:
def total_score(row):
    """
    multiply RITA score and bscore
    Usage:
    df[['Score','bscore']].apply(total_score,axis=1)
    """
    return row['Score'] * row['bscore']

In [None]:
df[['Score','bscore']][:10].apply(total_score,axis=1)

##### create the new feature

In [None]:
df['total_score'] = df[['Score','bscore']].apply(total_score,axis=1)

##### Sort the list by total score 

Non-corp connections should bubble up and we should only be grading on bad actors and malware.

In [None]:
# sort the whole frame first, then show the top ten
df.sort_values(by='total_score', ascending=False)[:10]

## Final

And that's the basic process. I want to be able to scan through connections at least daily, then export the outputs to a datastore or to reporting for the Sec Team to follow up on. 

TODOs include: 
- building this into a running script
- summarize this as a report (that could be used to kick off tickets)
- auto export the ranking to a datastore that other apps could use
- make the heuristics modular such that we can import blacklists, etc
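
As a starting point for the export TODO, here is a minimal sketch that writes ranked rows as JSON lines, a format most datastores and ticketing scripts can ingest. `export_ranking` and the output path are hypothetical, and the rows are assumed to be plain dictionaries with a `total_score` key.

```python
import json

def export_ranking(records, path, top=None):
    """Write records sorted by total_score (highest first) as JSON lines; return the count written."""
    ranked = sorted(records, key=lambda r: r["total_score"], reverse=True)
    if top is not None:
        ranked = ranked[:top]
    with open(path, "w", encoding="utf-8") as handle:
        for record in ranked:
            # default=str is a fallback for values json cannot serialize natively
            handle.write(json.dumps(record, default=str) + "\n")
    return len(ranked)
```

Rows could come from `df.to_dict('records')`, for example `export_ranking(df.to_dict('records'), 'ranking.jsonl', top=50)`.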



In [None]:
df.columns

## Summarize reporting

In [None]:
df[['Score','bscore', 'total_score','Source Name', 'Destination Name', 'Connections', 'Avg Bytes','asn_desc','asn_country']].sort_values(by='total_score', ascending=False)