## MSTICPy and Notebooks in InfoSec
---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">Session 3 - Acquiring Data Using MSTICPy</a>

---

## What this session covers:
 - Setting up query providers
 - Connecting to providers
 - Querying for data
 - Offline data options

## Prerequisites
- Python >= 3.8 Environment
- Jupyter installed
- MSTICPy
- The msticpyconfig.yaml file you recently populated


### MSTICPy has a number of supported data providers
- Microsoft Sentinel
- Microsoft Defender/Defender for Endpoint
- Splunk
- Sumologic
- Microsoft Graph
- Local data
- Mordor/Security Datasets
- Kusto/Azure Data Explorer
- Azure Resource Graph

These provide a way to connect to and query data from these sources in a structured and standardized way.<br>
The providers also provide a way to create, store and call templated queries simply and easily.

Ref: https://msticpy.readthedocs.io/en/latest/DataAcquisition.html

In [None]:
#Set up MSTICPy
import msticpy as mp 
mp.init_notebook()

The QueryProvider handles this functionality and can be configured to work with the supported data sources.

`list_data_environments` shows us the names of the providers available to us.

In [None]:
mp.QueryProvider.list_data_environments()

You can then pass the name of the required provider to `QueryProvider`.

In [None]:
qry_prov = mp.QueryProvider("MSSentinel")

---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">Authenticating to Providers</a>

---

Once we have created our QueryProvider for the data source we want, the next step is to connect the provider to the source and authenticate. <br>
In order to connect we need to tell the provider which instance to connect to, i.e. what workspace, cluster, or database.<br>

To do that we need to provide a set of connection parameters or a *connection string*.<br> 
We can do this **manually** or we can store these details in our `msticpyconfig.yaml` file and pull them directly from there.<br>

First, we are going to connect using a manually-created connection string, and later using our config file, which is a much more manageable way of handling it.

The connection parameters typically require the following information:
- An ID of the resource to connect to
- An indicator of the credential type you want to use to authenticate
- The ID of the authority (AAD) that will authenticate/authorize the connection.
- Data source specific parameters (e.g. DefaultDatabase)

> Note: Some of these may inherit from your account or other settings

The authentication method for the provider will depend on the type of provider and what it supports.<br>
We don't have time to cover all of the options here today, but most providers have an authentication method that requires the user to log in each time, either via an interactive login or a device code login.<br>
However, we can also configure most providers to use tokens already on a host, such as MSI and Azure CLI tokens. This removes the need to authenticate each time.<br>

Generally for Microsoft services the following options are supported:
 - Interactive/Device Code 
 - Azure CLI
 - MSI
 - Creds stored as Environment Variables
 - VSCode or PowerShell Credentials

Some other providers (such as Defender) use app level authentication instead. The documentation will detail what authentication options are possible for each provider.

### Using a connection string to connect
Below we will connect with a specific connection string, and the default auth method for this provider - Device Code.

Ref: https://msticpy.readthedocs.io/en/latest/data_acquisition/DataProviders.html

In [None]:
la_connection_string = 'loganalytics://code().tenant("72f988bf-86f1-41af-91ab-2d7cd011db47").workspace("8ecf8077-cf51-4820-aadd-14040956f35d")'
qry_prov.connect(connection_str=la_connection_string)
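
The connection string follows a fixed pattern, so it can be assembled from its parts. The sketch below only reproduces the format used in the cell above; the helper name `build_la_connection_string` and the placeholder IDs are hypothetical and not part of MSTICPy.

```python
def build_la_connection_string(tenant_id: str, workspace_id: str) -> str:
    """Return a device-code connection string in the format shown above.

    Hypothetical helper for illustration; not an MSTICPy function.
    """
    return f'loganalytics://code().tenant("{tenant_id}").workspace("{workspace_id}")'


# Placeholder IDs (assumptions) - substitute your own tenant and workspace IDs.
conn_str = build_la_connection_string(
    tenant_id="00000000-0000-0000-0000-000000000000",
    workspace_id="11111111-1111-1111-1111-111111111111",
)
print(conn_str)
```

The result could then be passed as `qry_prov.connect(connection_str=conn_str)`.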

As we can see, the above method is a bit cumbersome for everyday use. Having a more seamless authentication method and storing workspace details in config is much smoother.

To use the settings from our config instead of the connection string, we can use the<br>
`workspace` parameter to pull the settings from file and pass them to the connection method.<br>

We are also going to explicitly request Azure CLI credentials using the `auth_methods` parameter.<br>
You typically don't need to do this unless you want to override the defaults in `msticpyconfig.yaml`
<br>

---
**Note -**
You only need to perform the CLI authentication once per token lifetime rather than every time you connect.<br>
If you've done this already today, you probably don't need to do it again.

In [None]:
!az login

Now when we connect to our QueryProvider we can tell the provider to use CLI authentication. 

---
**Note -**
The authentication methods are passed as a list because you can often provide multiple options, which are tried in order until one successfully authenticates.

If you have configured default credential types in your `msticpyconfig.yaml`,
you don't need to use the `auth_methods` parameter unless you
need to override these.

```yaml
Azure:
  auth_methods:
  - cli
  - msi
  - devicecode
  cloud: global
```
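
The "try each method in order" behaviour described in this note can be pictured with a small stand-in. This is only a conceptual sketch: the `authenticate` function and the fake credential checks are hypothetical and do not reflect how MSTICPy or the Azure libraries implement it.

```python
from typing import Callable, Dict, List


def authenticate(auth_methods: List[str], checks: Dict[str, Callable[[], bool]]) -> str:
    """Return the first method in auth_methods whose check succeeds."""
    for method in auth_methods:
        if checks[method]():
            return method
    raise RuntimeError(f"None of {auth_methods} succeeded")


# Stand-in checks: pretend no CLI token is cached but a managed identity exists.
fake_checks = {
    "cli": lambda: False,
    "msi": lambda: True,
    "devicecode": lambda: True,
}

used = authenticate(["cli", "msi", "devicecode"], fake_checks)
print(used)  # msi
```

Listing the non-interactive methods first means the interactive device code prompt only appears when nothing else is available.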

In [None]:
qry_prov = mp.QueryProvider("MSSentinel_New")
qry_prov.connect(workspace="Default", auth_methods=['cli'])

Once connected we can start running queries to get data.
We can do this with the built in queries or with our own queries.

We will start with the built in queries. We can list the available queries with `list_queries`.

Ref: https://msticpy.readthedocs.io/en/latest/DataAcquisition.html#built-in-data-queries

In [None]:
qry_prov.list_queries()

We can also use `browse` to get a clearer view of what's available

<div style="border: solid; padding: 5pt; background-color: blue"><b>Warning</b> BUG - browser is broken</div>

In [None]:
qry_prov.browse()

You can also search for a query:
- just supply a string or regex (or a list of search terms) to search over all query metadata
- search for queries using a specific table name (`table="DeviceProcessEvents"`)
- search for queries using a specific parameter name

Examples:
```python
qry_prov.search("ip_address")
qry_prov.search("ip_address", table="Office")
qry_prov.search(param="URL")
```

In [None]:
qry_prov.search("Network")


In [None]:
qry_prov.search(param="url")

### Running a query is a function call

```python3
qry_prov.list_queries()
```
```
['Azure.get_vmcomputer_for_host',
 'Azure.get_vmcomputer_for_ip',
 'Azure.list_aad_signins_for_account',
 'Azure.list_aad_signins_for_ip',
 'Azure.list_all_signins_geo',  <<<<--- The query we want
 'Azure.list_azure_activity_for_account',
 'Azure.list_azure_activity_for_ip',
 'Azure.list_azure_activity_for_resource',
```

Append to the query provider with a dot
```python3
qry_prov.Azure.list_all_signins_geo()
```
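
Because each name returned by `list_queries` is a dotted attribute path, a query can also be looked up from its name string. The sketch below uses a `SimpleNamespace` stand-in (an assumption, so the example runs without a connection); with a real provider you would pass `qry_prov` instead. The `get_query` helper is hypothetical, not an MSTICPy function.

```python
from functools import reduce
from types import SimpleNamespace


def get_query(provider, query_name: str):
    """Walk a dotted query name (e.g. 'Azure.list_all_signins_geo') to the callable."""
    return reduce(getattr, query_name.split("."), provider)


# Stand-in provider for illustration only - not a real QueryProvider.
fake_prov = SimpleNamespace(
    Azure=SimpleNamespace(list_all_signins_geo=lambda **kwargs: "query would run here")
)

query_func = get_query(fake_prov, "Azure.list_all_signins_geo")
print(query_func())
```

This can be handy when looping over several names picked from the `list_queries` output.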

In [None]:
df = qry_prov.Azure.list_all_signins_geo()
df.head()

Some queries require parameters, such as an account or host name to search for.

In [None]:
help(qry_prov.Office365.list_activity_for_account)

In [None]:
office_activity = qry_prov.Office365.list_activity_for_account(account_name="KDickens@seccxp.ninja")
office_activity.head()

### Debug Tip

You can get a clearer view of what a built in query actually is by adding the `"print"` keyword as the first parameter when calling it.<br>
This will print the parameterized query rather than run it. The printed query will include any parameter values you passed it.

In [None]:
from pprint import pprint
query_text = qry_prov.Office365.list_activity_for_account("print", account_name="KDickens@seccxp.ninja")
pprint(query_text)


In [None]:
from rich import print as rprint
rprint(qry_prov.Office365.list_activity_for_account("print", account_name="KDickens@seccxp.ninja"))

## Where is it getting the start/end time parameters from?

Every query provider has a `query_time` attribute that you can use to set
the time boundaries of the query.

Having a single query timespan is useful when you are doing lots
of related queries.


In [None]:
qry_prov.query_time

You can also supply these parameters manually
- as datetimes
- as a parsable datestring

In [None]:
office_activity = qry_prov.Office365.list_activity_for_account(
    account_name="KDickens@seccxp.ninja",
    start="2023-06-22 00:00:00",
    end="2023-06-23 00:00:00"
)
office_activity.head()
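
The same time range can be built as `datetime` objects, which is convenient when the window is relative to a known end point or to the current time. A minimal sketch using only the standard library; the `time_window` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def time_window(days: int, end: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """Return (start, end) covering the `days` days up to `end` (default: now, UTC)."""
    end = end or datetime.now(timezone.utc)
    return end - timedelta(days=days), end


# One-day window ending at midnight UTC on 2023-06-23.
start, end = time_window(days=1, end=datetime(2023, 6, 23, tzinfo=timezone.utc))
print(start.isoformat(), end.isoformat())
```

These values can then be supplied as `start=start, end=end` in place of the date strings.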

In [None]:
from msticpy.nbwidgets import QueryTime
qt = QueryTime(start="2023-06-20 00:00:00", end="2023-06-21 00:00:00")
display(qt)
qry_prov.Office365.list_activity_for_account(
    account_name="KDickens@seccxp.ninja",
    start=qt.start,
    end=qt.end,
)


## Extending queries with the `add_query_items` parameter

We can also customize built in queries by adding query items to them.

In [None]:
office_activity_filtered = qry_prov.Office365.list_activity_for_account(
    account_name="KDickens@seccxp.ninja",
    add_query_items="| where Operation != 'MailItemsAccessed'"
)
office_activity_filtered.head()
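
When several values need excluding, the `add_query_items` string can be generated rather than typed by hand. The helper below is a hypothetical sketch (not part of MSTICPy) that builds a KQL `!in` filter; the second operation name is only a placeholder, and the helper assumes the values contain no quote characters.

```python
from typing import Iterable


def exclude_values(column: str, values: Iterable[str]) -> str:
    """Build a KQL clause that drops rows where `column` matches any of `values`."""
    quoted = ", ".join(f"'{value}'" for value in values)
    return f"| where {column} !in ({quoted})"


extra_filter = exclude_values("Operation", ["MailItemsAccessed", "FileAccessed"])
print(extra_filter)
```

The resulting string can be passed as `add_query_items=extra_filter` when calling a built in query.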

You can also add your own built in queries by specifying them in a yaml file and adding the required path to your msticpyconfig.yaml file. 

We can also use `exec_query` to run our own queries.

In [None]:
query = """
OfficeActivity 
| where TimeGenerated > ago(7d) 
| where UserId =~ 'KDickens@seccxp.ninja' 
| summarize count() by Operation
"""
custom_query_df = qry_prov.exec_query(query)
custom_query_df

When writing our own queries for a Log Analytics (or Kusto) based data source we can check the schema of any table in our connected workspace with `.schema`.<br>
This will return a dictionary with all the tables, their column names, and the data type of each field.

In [None]:
qry_prov.schema['W3CIISLog']
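
Since the schema is a plain dictionary, ordinary Python can be used to explore it. The sketch below uses a small hand-written stand-in with the same table, column name, data type shape described above; the table contents are illustrative assumptions, not the real schema, and `tables_with_column` is a hypothetical helper.

```python
# Stand-in for qry_prov.schema: {table_name: {column_name: data_type}}
sample_schema = {
    "W3CIISLog": {"TimeGenerated": "datetime", "cIP": "string", "scStatus": "string"},
    "OfficeActivity": {"TimeGenerated": "datetime", "UserId": "string", "ClientIP": "string"},
}


def tables_with_column(schema: dict, fragment: str) -> dict:
    """Map each table to its columns whose name contains `fragment` (case-insensitive)."""
    matches = {}
    for table, columns in schema.items():
        hits = [col for col in columns if fragment.lower() in col.lower()]
        if hits:
            matches[table] = hits
    return matches


ip_columns = tables_with_column(sample_schema, "ip")
print(ip_columns)
```

With a connected provider, passing `qry_prov.schema` instead of `sample_schema` is a quick way to find which tables hold a field you want to query.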

---
**Extra**

It is also possible to add your own queries to the built in queries in MSTICPy.<br>
Our ReadTheDocs documentation shows how to structure the required files and reference them in your configuration.<br>
Adding queries to MSTICPy: https://msticpy.readthedocs.io/en/latest/extending/Queries.html<br>
Also see this notebook: https://github.com/ianhelle/pycon2021/blob/main/Extending-MSTICPy.ipynb<br>


## <a style="border: solid; padding:5pt; color:black; background-color:#309030">1st Exercise - Run a query</a>

Execute a query against the created `qry_prov`. This can be a built in query or a custom query - it's up to you.

If using a built-in query, experiment with changing the `qry_prov.query_time` time range.

<details>
<summary>Hints...</summary>
<ul>
<li>If you add "print" as a parameter when calling a query it will print out the query rather than executing it.</li>
<li>help(qry_prov.CAT.query_name) will show you the code and the required params needed to run that query</li>
<li>qry_prov.SecurityAlert.list_alerts() doesn't need any extra parameters - uses the time defaults</li>
</ul>
</details>


In [None]:
qry_prov.SecurityAlert.list_alerts()

---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">Kusto</a>

---

Sentinel isn't the only data provider available and we have plenty more that we can use to connect to.<br>
Kusto is a popular data source for a lot of uses.

We added a Kusto cluster to our `msticpyconfig.yaml` file in the previous session. We will use this Kusto cluster (`https://msticpytraining.eastus.kusto.windows.net`).

In [None]:
from msticpy.config import MpConfigEdit

mp.get_config("KustoClusters")



## <a style="border: solid; padding:5pt; color:black; background-color:#309030">2nd Exercise - Kusto</a>

1. Connect to the Kusto cluster https://msticpytraining.eastus.kusto.windows.net/ and the `msticpydata` database. <br>
2. Run a query to understand the schema of the Syslog table and get some data


<details>
<summary>Hints...</summary>
<ul>
<li>You need to specify a cluster to connect to - the cluster can be specified as:
    <ul>
    <li>A cluster friendly name - the entry name in our configuration</li>
    <li>The full URL</li>
    <li>Just the host part of the URL - e.g. "msticpytraining"</li>
    </ul>
</li>
<li>We gave the Kusto cluster the short name "Kusto-Firecon23" in our config.</li>
<li>https://msticpy.readthedocs.io/en/latest/data_acquisition/DataProv-Kusto-New.html has the details you need</li>
<li>The query <pre>`Syslog | getschema`</pre> returns the schema of the Syslog table.</li>
<li>You can specify the default database in the 'connect' call (database="msticpydata") or by passing this parameter
to 'exec_query()'
</li>
<li>You can also get the schema using the qry_prov.get_database_schema() function</li>
</ul>
</details>


In [None]:
kusto_prov = mp.QueryProvider("Kusto_New")
# your answer here...


There are also some helper functions in the Kusto query provider
to retrieve the schema:
- `qry_prov.get_database_schema(<database>)`
- `qry_prov.schema[<TableName>]`

In [None]:
# Using the get_database_schema method
print("get_database_schema")
display(kusto_prov.get_database_schema("msticpydata")["Syslog"])

# setting a default database and using the schema property
print("schema attribute")
kusto_prov.set_database("msticpydata")
kusto_prov.schema["Syslog"]

---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">Microsoft Defender</a>

---
Some data providers have different connection options. For example, the Microsoft Defender for Endpoint and Microsoft 365 Defender APIs require a client application to handle authentication.<br>
You can pass in these application details when connecting, but if we are using an application secret it's better to keep these in KeyVault and reference them in our config file.

In [None]:
# if you didn't run these earlier in the notebook
# import msticpy as mp 
# mp.init_notebook()

You can store multiple instances in your config file. To select which instance to connect to, use the `instance` keyword.<br>
In this example we will connect to our pre-configured Training instance.

Ref: https://msticpy.readthedocs.io/en/latest/data_acquisition/DataProv-MSDefender.html#connecting-to-m365-defender

In [None]:
defender_prov = mp.QueryProvider("M365D")
defender_prov.connect()

In [None]:
defender_prov.list_queries()

We can also execute our own queries in the same format as with the other providers.

In [None]:
defender_prov.exec_query("DeviceInfo | take 10")

## <a style="border: solid; padding:5pt; color:black; background-color:#309030">3rd Exercise - Defender Investigation</a>

1. Find the remote IP address associated with MDE connections to the URL 'davlenwindows.com' on 10/14/2022
2. Find all the hosts that have connected to that URL since 10/01/2022
3. Get the file hash of the initiating process for these connections on 10/14/2022 and get all the file names associated with this hash on that day


<details>
<summary>Hints...</summary>
<ul>
<li>You can do this with built in queries or your own queries</li>
<li>The Query Browser is your friend `qry_prov.browse()`</li>
<li>Don't forget you can use add_query_items to add to the built in queries to customize the returned data.</li>
</ul>
</details>

---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">Azure Resource Graph</a>

---


The Azure Resource Graph provides a way to get details about Azure resources using KQL. This is really useful for adding context during an investigation.<br>
Below we are going to load our Resource Graph provider and connect using the Azure CLI tokens that we generated earlier.

In [None]:
res_qry_prov = mp.QueryProvider("ResourceGraph")
res_qry_prov.connect()

As with the other providers, we can use built in queries or write our own custom queries. Hopefully by now you are familiar with this model.


## <a style="border: solid; padding:5pt; color:black; background-color:#309030">4th Exercise - Azure Resource Graph</a>

 1. Find out how many KeyVaults you have access to. <br>
 2. What resources exist in the msticpy resource group?<br>
 3. Find the Key Vault that is detailed in your msticpyconfig.yaml file.<br>


<details>
<summary>Hints...</summary>
<ul>
<li>All data in the Resource Graph is in the Resources table</li>
<li>https://learn.microsoft.com/en-us/azure/governance/resource-graph/samples/starter?tabs=azure-cli gives you some query examples</li>
<li>`Resources | where type =~ 'microsoft.keyvault/vaults'` will show you all Keyvaults</li>
<li>You will need to use .exec_query here</li>
</ul>
</details>


In [None]:
res_qry_prov.exec_query("""
Resources 
| where type contains 'key' 
| summarize count() by type, kind 
| order by count_ desc
""")


## <a style="border: solid; padding:5pt; color:black; background-color:#309030">Bonus Exercise - Azure Resource Graph</a>

CDOC received a report that the VM MSTICAlertsWin1 has been compromised.
You need to answer the following questions:
1. Is this a real host?
2. Is it currently in use?
3. What IPs is it associated with?
4. Is it a production host?
5. What other resources might have been compromised?
6. Are there any users we can contact about this host?


Hints:
VMs type = "microsoft.compute/virtualmachines"
NetInterface type = "microsoft.network/networkinterfaces"
Interface VM = "properties.virtualMachine.id"




In [None]:
vm_df = res_qry_prov.exec_query("""
Resources 
| where type == 'microsoft.compute/virtualmachines' 
| where name contains 'MSTIC'
""").dropna(axis=1)
vm_df.head()

In [None]:
# Use vm_id rather than id to avoid shadowing the Python built-in id()
vm_id = vm_df.iloc[0].id
interface_df = res_qry_prov.exec_query(f"""
Resources 
| where type == 'microsoft.network/networkinterfaces' 
| where properties.virtualMachine.id == '{vm_id}'
""").dropna(axis=1)
interface_df.head()

In [None]:
interface_df.iloc[0]["properties.ipConfigurations"]

In [None]:
public_ip = interface_df.iloc[0]["properties.ipConfigurations"][0]["properties"]["publicIPAddress"]["id"]

pub_ip_df = res_qry_prov.exec_query(f"""
Resources 
| where id == '{public_ip}'
""").dropna(axis=1)
pub_ip_df.head()
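
The chained indexing used to reach the public IP ID raises an error if an interface has no public IP attached. A more defensive variant might look like the sketch below; the `public_ip_ids` helper and the sample records are hypothetical, modelled on the nesting used in the cell above.

```python
from typing import List, Optional


def public_ip_ids(ip_configurations: Optional[List[dict]]) -> List[str]:
    """Collect public IP resource IDs, skipping configurations that have none."""
    ids = []
    for ip_config in ip_configurations or []:
        public_ip = ip_config.get("properties", {}).get("publicIPAddress") or {}
        if "id" in public_ip:
            ids.append(public_ip["id"])
    return ids


# Stand-in for interface_df.iloc[0]["properties.ipConfigurations"]
sample_configs = [
    {
        "properties": {
            "privateIPAddress": "10.0.0.4",
            "publicIPAddress": {"id": "/subscriptions/example/publicIPAddresses/ip1"},
        }
    },
    {"properties": {"privateIPAddress": "10.0.0.5"}},  # no public IP attached
]

print(public_ip_ids(sample_configs))
```

Each returned ID could then be looked up in the `Resources` table in the same way as the single `public_ip` value.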

---

# <a style="border: solid; padding:5pt; color:black; background-color:#909090">The <b>Panel</b> DataViewer</a>

---


- Uses [Holoviz Panel](https://panel.holoviz.org/) to display an interactive data widget
- Uses the [Tabulator widget](https://panel.holoviz.org/reference/widgets/Tabulator.html)
- Requires `panel` to be installed (you should have this installed - `pip install msticpy[all]`)

Benefits
- Allows interactive:
  - Filtering
  - Sorting
  - Column selection
- Uses paging and scrolling by default
- Row selection can return indices or dataframe subset
- Works in most notebook environments - does not require a Jupyter or JupyterLab extension
- Has many built-in capabilities - parameters passed to underlying control

In [None]:
# Get some data to display
result_df = qry_prov.MDE.list_host_processes(host_name="workstation8.seccxp.ninja")

In [None]:
result_df.head()

In [None]:
from msticpy.vis.data_viewer import DataViewer
dv = DataViewer(result_df)
dv

In [None]:
result_df.columns

In [None]:
selected_columns = [
    "TimeGenerated",
    "AccountName",
    "FileName",
    "ProcessCommandLine",
    "InitiatingProcessFileName",
    "InitiatingProcessCommandLine",
]
dv = DataViewer(
    data=result_df, 
    selected_cols=selected_columns,
)
dv
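
A misspelled or accidentally merged column name in a list like `selected_columns` is easy to miss, so it can be worth checking the list against the DataFrame's columns first. The sketch below uses plain lists as stand-ins so it runs without any data; `missing_columns` is a hypothetical helper, not part of MSTICPy.

```python
from typing import Iterable, List


def missing_columns(available: Iterable[str], wanted: Iterable[str]) -> List[str]:
    """Return the names in `wanted` that are not present in `available`."""
    available_set = set(available)
    return [col for col in wanted if col not in available_set]


# Stand-in for result_df.columns
available_cols = ["TimeGenerated", "AccountName", "FileName", "ProcessCommandLine"]
requested_cols = ["TimeGenerated", "AccountName", "FileNme"]  # deliberate typo

print(missing_columns(available_cols, requested_cols))
```

In the notebook this would be called as `missing_columns(result_df.columns, selected_columns)`; an empty list means every requested column exists.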

In [None]:
dv = DataViewer(
    data=result_df, 
    selected_cols=selected_columns,
    detail_cols=["ProcessCommandLine", "InitiatingProcessCommandLine"],
)
dv

In [None]:
dv.selection

In [None]:
dv.selected_dataframe