# Digital Fingerprinting (DFP) — Feature Selection

This notebook includes an introduction to DFP, the suggested steps for DFP feature selection, and the code segments demonstrating how each step can be done in practice.<br>
Note. This notebook uses Azure AD logs as an example, but the same process can be applied to any data source.<br>
<font color='#C00000'>Disclaimer.</font> The data was generated using the python [faker](https://faker.readthedocs.io/en/master/#) package. If there is any resemblance to real individuals, it is purely coincidental.

## Table of Contents
1. [The Application of DFP](#1.2)
2. [DFP Features In Use - Azure AD Logs](#1.3)
3. [DFP Features In Use - DUO Authentication Logs](#1.4)
4. [Feature Engineering and Feature Types](#1.5)
5. [Steps for Selecting Raw Features](#1.6)<br>
    - 5.0. [Load Data](#1.6.1)<br>
    - 5.1. [Data Overview](#1.6.2)<br>
    - 5.2. [Overall Statistics](#1.6.3)<br>
        - 5.2.1 [Signs of a bad feature](#1.6.3.1)<br>
		- 5.2.2 [Good feature candidates](#1.6.3.5)<br>
	- 5.3. [Per-entity Statistics](#1.6.4)<br>
		- 5.3.1 [Good feature candidates](#1.6.4.1) <br>
	- 5.4. [Feature Correlation](#1.6.5)<br>
		- 5.4.1 [Pearson Correlation Coefficients - Numerical Feature Correlation](#1.6.5.1)<br>
		- 5.4.2 [Cramer's V - Categorical Feature Correlation](#1.6.5.2)<br>
	- 5.5. [Review with Security Experts](#1.6.6)<br>
6. [Ideas on Derived Features](#1.7)
7. [Conclusion](#1.8)

## 1. The Application of DFP <a class="anchor" id="1.2"></a>
- DFP is a general pipeline that can ingest various data sources to do behavioral anomaly detection
- POC was done on <font color='#76B900'>Azure AD logs</font> and <font color='#76B900'>DUO authentication logs</font>, but the application of DFP can easily be expanded to other data sources
- The key to applying DFP to a new data source is <font color='#76B900'>feature selection</font>
    - DFP supports all types of features (numerical/categorical/binary)
    - Any data source can be fed into DFP after some preprocessing to get a feature vector per log/data point
- Note that DFP builds a targeted model for each entity (user, service, machine, etc.), so it works best when the chosen data source has a field that uniquely identifies the entity we’re trying to model

## 2. DFP Features In Use - Azure AD Logs <a class="anchor" id="1.3"></a>
**<font color='#76B900'>1. </font>appDisplayName**: e.g., Windows sign in, MS Teams, Office 365<br>
**<font color='#76B900'>2. </font>clientAppUsed**: e.g., IMAP4, Browser<br>
**<font color='#76B900'>3. </font>deviceDetail.displayName**: e.g., username-LT<br>
**<font color='#76B900'>4. </font>deviceDetail.browser**: e.g., EDGE 98.0.xyz, Chrome 98.0.xyz<br>
**<font color='#76B900'>5. </font>deviceDetail.operatingSystem**: e.g., Linux, IOS 15, Windows 10<br>
**<font color='#76B900'>6. </font>status.failureReason**: e.g., external security challenge not satisfied, error validating credentials<br>
**<font color='#76B900'>7. </font>riskEventTypes_v2**: e.g., AzureADThreatIntel, unfamiliarFeatures<br>
**<font color='#76B900'>8. </font>location.countryOrRegion**: country or region name<br>
**<font color='#76B900'>9. </font>location.city**: city name<br>

<ins>Derived features</ins>:<br>
**<font color='#76B900'>10. </font>Log count**: tracks the number of logs generated by a user within that day (increments with every log)<br>
**<font color='#76B900'>11. </font>Location increment**: increments every time we observe a new city (location.city) in a user’s logs within that day<br>
**<font color='#76B900'>12. </font>App increment**: increments every time we observe a new app (appDisplayName) in a user’s logs within that day<br>
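
The three derived features above are running counters that reset per user and per day. The following is a minimal sketch of how they could be computed with pandas. It is not the DFP implementation: the helper `add_daily_increments`, the short column names (`user`, `time`, `city`, `app`), and the output column names are all assumptions made for illustration, and the rows are toy data.

```python
import pandas as pd

# Toy sign-in log. Column names are shortened stand-ins for the Azure AD fields.
logs = pd.DataFrame({
    "user": ["a", "a", "a", "b", "a", "b"],
    "time": pd.to_datetime([
        "2022-08-01T08:00:00", "2022-08-01T09:00:00", "2022-08-01T10:00:00",
        "2022-08-01T09:30:00", "2022-08-02T08:00:00", "2022-08-01T11:00:00",
    ]),
    "city": ["X", "X", "Y", "Z", "Y", "Z"],
    "app": ["Teams", "Office", "Teams", "Teams", "Teams", "Office"],
})


def add_daily_increments(df, user_col="user", time_col="time"):
    """Add running per-user, per-day counters (hypothetical helper)."""
    df = df.sort_values(time_col).copy()
    df["day"] = df[time_col].dt.date
    keys = [user_col, "day"]

    # Log count: 1 for the user's first log of the day, 2 for the second, ...
    df["logcount"] = df.groupby(keys).cumcount() + 1

    for src, dst in [("city", "locincrement"), ("app", "appincrement")]:
        # A value is "new" the first time it appears for that user on that day.
        is_new = ~df.duplicated(subset=keys + [src])
        # Running total of new values within each (user, day) group.
        df[dst] = is_new.astype(int).groupby([df[user_col], df["day"]]).cumsum()
    return df


features = add_daily_increments(logs)
print(features[["user", "day", "logcount", "locincrement", "appincrement"]])
```

Because each counter depends on the logs that came before it on the same day, the rows must be processed in time order, which is why the sketch sorts by timestamp first.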

## 3. DFP Features In Use - DUO Authentication Logs <a class="anchor" id="1.4"></a>
**<font color='#76B900'>1. </font>auth_device.name**: phone number<br>
**<font color='#76B900'>2. </font>access_device.browser**: e.g., Edge, Chrome, Chrome Mobile<br>
**<font color='#76B900'>3. </font>access_device.os**: e.g., Android, Windows<br>
**<font color='#76B900'>4. </font>result**: SUCCESS or FAILURE <br>
**<font color='#76B900'>5. </font>reason**: reason for the results, e.g., User Cancelled, User Approved, User Mistake, No Response<br>
**<font color='#76B900'>6. </font>access_device.location.city**: city name<br>

<ins>Derived features</ins>:<br>
**<font color='#76B900'>7. </font>Log count**: tracks the number of logs generated by a user within that day (increments with every log)<br>
**<font color='#76B900'>8. </font>Location increment**: increments every time we observe a new city (access_device.location.city) in a user’s logs within that day<br>

## 4. Feature Engineering and Feature Types <a class="anchor" id="1.5"></a>
- Feature engineering is key to every successful machine learning application
    - The more <font color='#76B900'>relevant</font> the features are to the problem being solved the better
    - <font color='#76B900'>Excluding redundant fields</font> in the raw data helps the model concentrate on key information
- The different feature types:<br>
    **<font color='#76B900'>1. </font>Raw feature**: raw value from a data field (e.g., operation name)<br>
    **<font color='#76B900'>2. </font>Derived feature**: processed value from one or more data fields and/or data records<br>
    - **Single-record extracted feature**: value computed from one or more fields of a single data record<br>
        E.g.: parse from IPv4 address the /16 CIDR block to get a “subnet” feature<br>
        E.g.: concatenate first and last name fields into a “name” feature<br>
    - **Aggregated feature**: value computed across multiple data records over time<br>
        E.g.: a “log count” feature that counts the number of logs generated by a user within a given day<br>
        (this is <font color='#76B900'>time-dependent</font> as the count increments with each log throughout the day)<br>
        <font color='#C00000'>Note. Aggregated features can require significant resources to compute. Use with caution to avoid unnecessary performance penalties.</font>
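
To make the single-record extracted features concrete, here is a small sketch of the two examples given above: deriving a /16 "subnet" feature from an IPv4 address and concatenating name fields. The dataframe, its column names (`first_name`, `last_name`, `ip`), and the helper `to_subnet16` are assumptions for illustration only.

```python
import ipaddress

import pandas as pd

# Toy records; the column names are hypothetical.
records = pd.DataFrame({
    "first_name": ["Thomas", "Aaron"],
    "last_name": ["Price", "Cole"],
    "ip": ["44.22.19.201", "99.116.100.205"],
})


def to_subnet16(ip):
    """Return the /16 CIDR block that contains an IPv4 address."""
    # strict=False accepts a host address and masks off the host bits.
    return str(ipaddress.ip_network(f"{ip}/16", strict=False))


# Each derived value depends only on fields of the same record.
records["subnet"] = records["ip"].map(to_subnet16)
records["name"] = records["first_name"] + " " + records["last_name"]
print(records[["name", "subnet"]])
```

Both features are computed row by row, so they are cheap compared with aggregated features, which need state across records.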

## 5. Steps for Selecting Raw Features <a class="anchor" id="1.6"></a>
- Each data source provides a unique set of information about cyber activities
    - There can be a high number of fields while many of them are <font color='#76B900'>unpopulated</font> or <font color='#76B900'>irrelevant</font> to our problem
    - Data analysis can help us quickly identify good and bad candidates for raw features
- The following steps are a general guideline on feature selection for custom DFP applications:<br>
    **<font color='#76B900'>1. </font>Data overview**: scan through all the features to understand what is available<br>
    **<font color='#76B900'>2. </font>Overall statistics**: collect global statistics for each feature to rule out bad fits<br>
    **<font color='#76B900'>3. </font>Per-entity statistics**: collect entity-level statistics for each feature to further evaluate their “usefulness”<br>
    **<font color='#76B900'>4. </font>Feature correlation**: evaluate the correlation between feature candidates to remove redundancy<br>
    **<font color='#76B900'>5. </font>Review with security experts**: run the feature candidates by security experts to make sure they are meaningful and relevant to the problem being solved<br>


<img src="steps.png" width="1000"/>

In [1]:
import json
import pandas as pd
import numpy as np
import scipy.stats as ss

### 5.0 Load Data <a class="anchor" id="1.6.1"></a>
The following steps show how to load a __nested json__ file and flatten it into a pandas dataframe.<br>
If your data doesn't have nested fields or is in other formats, you can load it by `pd.read_json` or other `pd.read_*` methods.<br>

__Note.__ Make sure you load the __entire__ dataset, or a sample that is as representative of it as possible.<br>
A small, non-representative sample can lead you to underestimate the cardinality of each feature.

In [2]:
with open('../../datasets/training-data/azure/azure-ad-logs-sample-training-data.json', 'r') as f:
    json_obj = json.load(f)  # the context manager closes the file handle after loading

In [3]:
print(f'# rows: {len(json_obj)}\nExample:\n{json.dumps(json_obj[0], indent=2, sort_keys=True)}')

# rows: 3239
Example:
{
  "Level": 4,
  "callerIpAddress": "44.22.19.201",
  "category": "NonInteractiveUserSignInLogs",
  "correlationId": "84ca338d-f4ff-4f34-9f2a-5a6e23f78c0b",
  "durationMs": 0,
  "identity": "Thomas Price",
  "location": "XN",
  "operationName": "Sign-in activity",
  "operationVersion": "1.0",
  "properties": {
    "appDisplayName": "Adobe Identity Management",
    "appId": "9a7e67c7-6f05-42a3-b226-97c7ec3e9696",
    "appServicePrincipalId": null,
    "appliedConditionalAccessPolicies": [],
    "authenticationContextClassReferences": [],
    "authenticationDetails": [],
    "authenticationProcessingDetails": [],
    "authenticationProtocol": "none",
    "authenticationRequirement": "singleFactorAuthentication",
    "authenticationRequirementPolicies": [],
    "autonomousSystemNumber": 230297,
    "clientAppUsed": "Mobile Apps and Desktop clients",
    "clientCredentialType": "none",
    "conditionalAccessStatus": "failure",
    "correlationId": "bf37c95d-08e3-4342

In [4]:
data = pd.json_normalize(json_obj)
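
As a small illustration of what `pd.json_normalize` does to nested records, the sketch below flattens two toy records. The records and their values are made up; only the key names echo fields from the example above.

```python
import pandas as pd

# Toy nested records (values are invented for illustration).
toy_records = [
    {
        "identity": "user one",
        "properties": {
            "appDisplayName": "App A",
            "appliedConditionalAccessPolicies": [],
            "location": {"city": "City X"},
        },
    },
    {
        "identity": "user two",
        "properties": {
            "appDisplayName": "App B",
            "appliedConditionalAccessPolicies": [],
            "location": {"city": "City Y"},
        },
    },
]

# Nested dictionaries become dot-separated column names.
flat = pd.json_normalize(toy_records)
print(sorted(flat.columns))

# List values are not expanded; they stay as Python lists inside the cells.
print(type(flat.loc[0, "properties.appliedConditionalAccessPolicies"]))
```

List-valued cells are not hashable, which is why some of the statistics helpers later in this notebook serialize such columns to strings before counting unique values.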

### 5.1. Data Overview <a class="anchor" id="1.6.2"></a>
Scan through the columns and understand what is available.<br>
- A quick glance over the data helps us identify whether the data source is a good fit for DFP and what options we have on potential features
- In the POC, DFP works well with the following set of information:
    - **<span style='background:#76B900;color:white'>WHO</span>** Unique identifier of the entity: e.g., user ID, email address, machine name
    - **<span style='background:#76B900;color:white'>WHAT</span>** Device involved: e.g., device ID, browser, OS version
    - **<span style='background:#76B900;color:white'>WHERE</span>** Location of the event: e.g., country, state, city, latitude, longitude
    - **<span style='background:#76B900;color:white'>WHY</span>** Application used: e.g., app name, resource ID, service principal name
    - **<span style='background:#76B900;color:white'>WHEN</span>** Time stamp: for temporal analysis
    - Any features that can provide the above information are good candidates to consider!

In [5]:
pd.set_option('display.max_columns', None)  # ask pandas to show all the columns

In [6]:
data

Unnamed: 0,time,resourceId,operationName,operationVersion,category,tenantId,resultType,resultSignature,resultDescription,durationMs,callerIpAddress,correlationId,identity,Level,location,properties.id,properties.createdDateTime,properties.userDisplayName,properties.userPrincipalName,properties.userId,properties.appId,properties.appDisplayName,properties.ipAddress,properties.status.errorCode,properties.status.failureReason,properties.clientAppUsed,properties.userAgent,properties.deviceDetail.deviceId,properties.deviceDetail.displayName,properties.deviceDetail.operatingSystem,properties.deviceDetail.browser,properties.deviceDetail.trustType,properties.location.city,properties.location.state,properties.location.countryOrRegion,properties.location.geoCoordinates.latitude,properties.location.geoCoordinates.longitude,properties.correlationId,properties.conditionalAccessStatus,properties.appliedConditionalAccessPolicies,properties.authenticationContextClassReferences,properties.originalRequestId,properties.isInteractive,properties.tokenIssuerName,properties.tokenIssuerType,properties.authenticationProcessingDetails,properties.networkLocationDetails,properties.clientCredentialType,properties.processingTimeInMilliseconds,properties.riskDetail,properties.riskLevelAggregated,properties.riskLevelDuringSignIn,properties.riskState,properties.riskEventTypes,properties.riskEventTypes_v2,properties.resourceDisplayName,properties.resourceId,properties.resourceTenantId,properties.homeTenantId,properties.authenticationDetails,properties.authenticationRequirementPolicies,properties.authenticationRequirement,properties.servicePrincipalId,properties.userType,properties.flaggedForReview,properties.isTenantRestricted,properties.autonomousSystemNumber,properties.crossTenantAccessType,properties.ssoExtensionVersion,properties.uniqueTokenIdentifier,properties.incomingTokenType,properties.authenticationProtocol,properties.appServicePrincipalId,properties.resourceServicePrincipalId,properties.rngcStatus,properties.status.additionalDetails,properties.deviceDetail.isCompliant,properties.deviceDetail.isManaged,properties.ipAddressFromResourceProvider,properties.alternateSignInName,properties.signInIdentifier
0,2022-08-01T00:03:56.207532Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,NonInteractiveUserSignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,50158,,External security challenge was not satisfied.,0,44.22.19.201,84ca338d-f4ff-4f34-9f2a-5a6e23f78c0b,Thomas Price,4,XN,df70b726-7756-4baa-9a7d-5ac965198e00,2022-08-01T00:03:56.371532Z,Thomas Price,tprice@domain.com,32109ee2-ee4f-4d11-9c38-2556aec0f2b5,9a7e67c7-6f05-42a3-b226-97c7ec3e9696,Adobe Identity Management,44.22.19.201,50158,External security challenge was not satisfied.,Mobile Apps and Desktop clients,Mozilla/5.0 (X11; Linux i686) AppleWebKit/535....,0927e60c-8dfa-4ecf-be85-ad63bccf40a1,THOMASPRICE-LT,Windows 10,Edge 118.12158,Hybrid Azure AD joined,Littlemouth,Alexanderfurt,XN,25.443725,-109.530885,bf37c95d-08e3-4342-b83f-4827b020e5b4,failure,[],[],dce375bd-b82e-494f-98e4-1ad345dda0bf,False,,AzureAD,[],[],none,164,none,none,none,none,[],[],Adobe Identity Management Service,30663ca7-f8a9-43ab-8ca1-7906bdbc1485,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,230297,none,,sRma0NKgsBO3hkJGXjMCmsRgf5phSAdeuQ2CpFyvZiqp9BWu,primaryRefreshToken,none,,a6c259e5-7f16-48b2-a9f3-75becd6daa9b,0.0,,,,,,
1,2022-08-01T00:19:37.909827Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,SignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,0,,,0,99.116.100.205,7641103c-1db3-4e14-9ebc-6a9555ba02b2,Aaron Cole,4,XD,c98bb980-53fe-43a8-afd2-72b917706b00,2022-08-01T00:19:38.009827Z,Aaron Cole,acole@domain.com,5d557969-1645-4ba4-be83-b6fd943659f7,e47a1d38-5f61-45cd-b1b9-bc92f525c598,Altoura,99.116.100.205,0,,Browser,Mozilla/5.0 (Linux; Android 2.3.5) AppleWebKit...,,,Windows 10,Edge 105.19198,,Carrollstad,Norrisbury,XD,-38.854047,-17.674718,690fb113-a50a-46bc-b5ca-90b2c6988d8a,success,[],[],84920ffc-d938-4eb7-97ac-3f2769d09bba,True,,AzureAD,[],[],none,100,none,none,none,none,[],[],Altoura Online,f0630e90-b752-4960-bfb9-a0794fc34930,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,214655,none,,JmqXv1yjmpLCVKYQkwoaQn0ibst82O1aYzAfql41BAoBA52Q,none,none,,1f37c851-4be4-455b-928a-fbde7845a68a,0.0,,,,,,
2,2022-08-01T00:25:38.530749Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,NonInteractiveUserSignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,0,,,0,86.154.193.190,ef72b144-8295-493d-8231-c12e755a74d8,Kristen Howell,4,XR,4ef53074-987d-44ae-a8dd-b6e418929900,2022-08-01T00:25:38.621749Z,Kristen Howell,khowell@domain.com,20e1e3f9-665f-45bd-9984-5e74b526d3b5,9c5b7fe3-0ad2-4ea6-94e5-9e0001f367e3,Articulate 360,86.154.193.190,0,,Mobile Apps and Desktop clients,Mozilla/5.0 (X11; Linux i686) AppleWebKit/533....,6ea76864-5f18-47dd-adb9-2b1dfcbfc425,KRISTENHOWELL-LT,Windows 10,Edge 40.11325,Azure AD joined,Port Denisetown,Smithton,XR,3.551896,131.871582,106735ee-529f-4d3e-a513-557fac957792,success,[],[],f2b94018-fbfd-4349-9243-96169a0e79bf,False,,AzureAD,[],[],none,91,none,none,none,none,[],[],Articulate 360 Online,5e244d4f-26b9-4ef2-a497-f0c533628ee1,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,256668,none,,fDViSwd7VgvV1pXGIvJrscgRMoytG09bt259jU5BiY1GqRIZ,primaryRefreshToken,none,,36673087-318c-4304-8804-602d45a6f290,0.0,MFA requirement satisfied by claim in the token,True,True,,,
3,2022-08-01T00:37:00.149031Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,NonInteractiveUserSignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,50158,,External security challenge was not satisfied.,0,42.62.103.34,9901c16e-f768-4891-9b92-f1ab68223893,Joseph Taylor,4,XF,7f19788f-2e61-49ad-9601-4fe6e5b87200,2022-08-01T00:37:00.269031Z,Joseph Taylor,jtaylor@domain.com,af9847eb-c773-41ae-bf94-42747c5e986c,9c5b7fe3-0ad2-4ea6-94e5-9e0001f367e3,Articulate 360,42.62.103.34,50158,External security challenge was not satisfied.,Mobile Apps and Desktop clients,Mozilla/5.0 (iPad; CPU iPad OS 9_3_6 like Mac ...,76b96bd3-e5d7-49ee-b195-bf363facdb3d,JOSEPHTAYLOR-LT,Windows 10,Edge 99.14477,Azure AD joined,Jessicafurt,Smithbury,XF,17.508239,-115.304215,58556cc8-a955-4756-a85b-1f1bfd847881,failure,[],[],afe3b133-af74-4023-9ecc-473d415fe2a0,False,,AzureAD,[],[],none,120,none,none,none,none,[],[],Articulate 360 Online,5e244d4f-26b9-4ef2-a497-f0c533628ee1,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,393508,none,,1I6QLAKmwNYcV5FEKrwaPn10Nbxo4FP8BqYL5evsfG9mOo5g,primaryRefreshToken,none,,36673087-318c-4304-8804-602d45a6f290,0.0,,True,True,,,
4,2022-08-01T00:44:19.056251Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,NonInteractiveUserSignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,0,,,0,42.62.103.34,c25d344f-ea74-4470-94bb-ea652c630dd3,Joseph Taylor,4,XF,20f677f0-ddf9-46df-bf50-9e296caa9100,2022-08-01T00:44:19.137251Z,Joseph Taylor,jtaylor@domain.com,af9847eb-c773-41ae-bf94-42747c5e986c,9a7e67c7-6f05-42a3-b226-97c7ec3e9696,Adobe Identity Management,42.62.103.34,0,,Mobile Apps and Desktop clients,Mozilla/5.0 (iPad; CPU iPad OS 9_3_6 like Mac ...,76b96bd3-e5d7-49ee-b195-bf363facdb3d,JOSEPHTAYLOR-LT,Windows 10,Edge 99.14477,Azure AD joined,Jessicafurt,Smithbury,XF,17.508239,-115.304215,a8685bf7-ee79-4bdc-b8f0-68ccaad1026c,success,[],[],8f8b7fc8-c76f-4d21-b481-90dcee28c1de,False,,AzureAD,[],[],none,81,none,none,none,none,[],[],Adobe Identity Management Service,2c3383ff-432a-49ee-bb8b-8699a7d0416f,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,393508,none,,vjgAV7fm3LZCwDrHJQ7TPcc94vukyhQfpTAEP8i8ra8Jg8CN,primaryRefreshToken,none,,35efb8e5-3b35-4a86-b7ce-660b69bfa773,0.0,MFA requirement satisfied by claim in the token,True,True,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3234,2022-08-29T22:23:15.895201Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,NonInteractiveUserSignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,50158,,External security challenge was not satisfied.,0,147.132.189.119,d83e9693-78d9-4035-bb81-4f7766ea452d,Cassie Fernandez,4,XR,3252cc26-bd51-4e86-91d8-5aec768e6900,2022-08-29T22:23:16.018201Z,Cassie Fernandez,cfernandez@domain.com,67ef8be6-3f44-4db7-8172-fd6211428c1e,79510a8f-6421-4348-a2ea-bcc2b02485c1,Cisco AnyConnect,147.132.189.119,50158,External security challenge was not satisfied.,Mobile Apps and Desktop clients,Mozilla/5.0 (iPad; CPU iPad OS 9_3_6 like Mac ...,bf67c857-685a-4457-89c4-db2afa581bd2,CASSIEFERNANDEZ-LT,Windows 10,Edge 99.14477,Hybrid Azure AD joined,Port Denisetown,Smithton,XR,3.551896,131.871582,2323aef7-8894-45e5-aaf6-99bc8bc2b249,failure,[],[],6a7e00ad-14c2-4c26-a00d-bc33a75e88d0,False,,AzureAD,[],[],none,123,none,none,none,none,[],[],Cisco AnyConnect Online,5d5b8d28-6a28-472e-b837-a4094e466965,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,256668,none,,8KGkjOpLZPLP8EDPmW7kNvd4rLZtyqlbqEBuHnmozdiPYMoS,primaryRefreshToken,none,,15655e1f-5005-4c53-b3c0-c4a1d1ca5e9e,0.0,,,,,,
3235,2022-08-29T22:37:38.943454Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,NonInteractiveUserSignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,0,,,0,42.62.103.34,c8ff49e3-999e-4e3d-8193-c59fdcaf5805,Joseph Taylor,4,XF,86ec8f81-b065-45d1-84e2-abcf1c788600,2022-08-29T22:37:39.026454Z,Joseph Taylor,jtaylor@domain.com,af9847eb-c773-41ae-bf94-42747c5e986c,2eae4ad1-8cae-48fe-b718-76bd66e2fc7f,Bipsync,42.62.103.34,0,,Mobile Apps and Desktop clients,Mozilla/5.0 (iPad; CPU iPad OS 9_3_6 like Mac ...,76b96bd3-e5d7-49ee-b195-bf363facdb3d,JOSEPHTAYLOR-LT,Windows 10,Edge 99.14477,Azure AD joined,Jessicafurt,Smithbury,XF,17.508239,-115.304215,3c754d1e-33de-421b-8f6f-ea693b4d8fbb,success,[],[],3b82a0a7-f88a-4156-9f87-831da11ed0ab,False,,AzureAD,[],[],none,83,none,none,none,none,[],[],Bipsync Service,40a5d724-be1a-443b-8964-2d2e99910144,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,393508,none,,T5pXSDW5yEpayDDfpZpXHoBVHoFYuXxtJS3kKnuK2InhsOZn,primaryRefreshToken,none,,8678b6d2-0d07-4480-b90d-d1a18a815502,0.0,MFA requirement satisfied by claim in the token,True,True,,,
3236,2022-08-29T22:56:13.322849Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,NonInteractiveUserSignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,0,,,0,205.33.24.92,10b4d505-2bb4-4556-8406-39dbc1d1e8c8,Aaron Cole,4,XD,ada0cd79-9550-4249-a9fe-bc578c978c00,2022-08-29T22:56:13.389849Z,Aaron Cole,acole@domain.com,5d557969-1645-4ba4-be83-b6fd943659f7,cb9ad245-0b24-4c75-ba23-398500baf7a8,FileCloud,205.33.24.92,0,,Browser,,,,Windows,Rich Client 4.35.1.0,,Kristinmouth,Norrisbury,XD,26.278784,-92.042573,dd2c7972-f095-4c78-bb86-5cb43d99a562,success,[],[],c89fe99a-088b-4a21-8eb2-0bb46b111d9d,False,,AzureAD,[],[],none,67,none,none,none,none,[],[],FileCloud Service,120f310e-002d-4cf0-803a-8e3f1188c691,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,214655,none,,41bWMZvKD7wzmAhXxWOsYsRlmP2b1UA4kksFPP4ts2zHjEwu,primaryRefreshToken,none,,64fdf9a3-2711-4206-8c5a-100403727b93,0.0,,,,,,
3237,2022-08-29T22:56:55.676126Z,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,Sign-in activity,1.0,NonInteractiveUserSignInLogs,d3e5a967-5657-4a42-afcc-6106b6c3c299,0,,,0,13.113.40.157,c8abd3f4-3162-4637-9cfb-7187fd3fa340,Attack Target,4,XR,05126ca7-c65f-4544-8acb-9d5046044200,2022-08-29T22:56:55.778126Z,Attack Target,attacktarget@domain.com,d735e84b-dcca-404d-9f7d-700f360f41a6,9a7e67c7-6f05-42a3-b226-97c7ec3e9696,Adobe Identity Management,13.113.40.157,0,,Mobile Apps and Desktop clients,Mozilla/5.0 (iPad; CPU iPad OS 9_3_6 like Mac ...,a44625dc-6f81-449a-9799-8005f7209b42,ATTACKTARGET-LT,Windows 10,Edge 99.14477,Azure AD registered,Smithfort,Smithfort,XR,3.756410,-121.574606,a33f73ab-6941-46d7-87a9-995cc9750d41,success,[],[],786fcb98-1251-45f8-a208-f0df74c50fd3,False,,AzureAD,[],[],none,102,none,none,none,none,[],[],Adobe Identity Management Service,30663ca7-f8a9-43ab-8ca1-7906bdbc1485,d3e5a967-5657-4a42-afcc-6106b6c3c299,d3e5a967-5657-4a42-afcc-6106b6c3c299,[],[],singleFactorAuthentication,,Member,False,False,34974,none,,Vt6Ua2rvFtjHPoAoeuzIxNGe5ikYE472MrABYhK8yvYZmLex,primaryRefreshToken,none,,a6c259e5-7f16-48b2-a9f3-75becd6daa9b,0.0,,,,,,


Note that it's important to make sure there is a field that **uniquely identifies** each entity you wish to monitor with DFP.<br>
E.g. `properties.userPrincipalName` for Azure AD.
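
One way to sanity-check a candidate identifier is to see whether it maps one-to-one onto another ID field for the same entity. The sketch below does this on toy data; the helper `ids_per_entity` is hypothetical, and pairing `properties.userPrincipalName` with `properties.userId` is just an assumed example of two fields that should agree.

```python
import pandas as pd

# Toy data in which one userId is shared by two principal names.
toy = pd.DataFrame({
    "properties.userPrincipalName": [
        "a@domain.com", "a@domain.com", "b@domain.com", "c@domain.com",
    ],
    "properties.userId": ["id-1", "id-1", "id-2", "id-2"],
})


def ids_per_entity(df, entity_col, other_id_col):
    """Count the distinct `other_id_col` values seen for each `entity_col` value."""
    return df.groupby(entity_col)[other_id_col].nunique()


# Each principal name should map to exactly one user ID, and vice versa.
forward = ids_per_entity(toy, "properties.userPrincipalName", "properties.userId")
backward = ids_per_entity(toy, "properties.userId", "properties.userPrincipalName")

# IDs that map to more than one principal name deserve a closer look.
ambiguous = backward[backward > 1]
print(ambiguous)

# An identifier with missing values cannot label every log.
print(toy["properties.userPrincipalName"].isnull().mean())
```

If either direction shows more than one match, the two fields do not describe the same set of entities, and it is worth asking a domain expert which one to model on.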


### 5.2. Overall Statistics <a class="anchor" id="1.6.3"></a>
Collect global statistics for each feature to rule out bad fits.

- Interesting stats to collect over the entire dataset for each feature:
    - Unique value count (cardinality)
    - Percentage of missing values or “null”
    - Distribution of the unique values 
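
For the third statistic, the distribution of values in a single column can be inspected with `value_counts`. This is a minimal sketch on a toy series; the values are invented and the series name merely reuses a field from this dataset.

```python
import pandas as pd

# Toy column: six "Browser" rows, three "IMAP4" rows, one missing value.
toy = pd.Series(
    ["Browser"] * 6 + ["IMAP4"] * 3 + [None],
    name="properties.clientAppUsed",
)

# Share of rows per value, most frequent first.
# dropna=False keeps missing values as their own bucket.
distribution = toy.value_counts(normalize=True, dropna=False)
print(distribution)

# A dominant top value means the field is close to constant.
top_share = distribution.iloc[0]
print(f"Most common value covers {top_share:.0%} of rows")
```

A heavily skewed distribution can make a field behave like a low-cardinality one even when its unique-value count looks reasonable.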

In [7]:
def json_dumps_keep_null(obj):
    """A wrapper around json.dumps that keeps `null` from being converted to the string `'null'`."""
    if not isinstance(obj, list) and pd.isnull(obj):
        return obj
    return json.dumps(obj)

def collect_overall_stats(data, n_example=3):
    """Take the data and return a dataframe that summarizes the data into stats and example values for each column."""
    overall_stats = []
    total_row_count = len(data)
    for col in data.columns:
        try:
            uniq_values = data[col].unique()
        except TypeError:
            # unique() will throw an error if the values in the column are not hashable
            # dump into strings if so
            uniq_values = data[col].apply(json_dumps_keep_null).unique()

        null_ratio = round(data[col].isnull().sum() / total_row_count, 4) # round the numbers to be more readable
        
        # Collect `n_example` examples for each column to include in the result dataframe (non-null values only)
        examples = []
        for val in uniq_values:
            if pd.isnull(val):
                continue
                
            examples.append(val)
            
            if len(examples) >= n_example:
                break
                
        # Pad sentinel values for columns with fewer than `n_example` unique values
        while len(examples) < n_example:
            examples.append('(empty)')

        overall_stats.append(
            [col, type(data[col][0]).__name__, len(uniq_values), len(uniq_values)/total_row_count, null_ratio, *examples]
        )
        
    result = pd.DataFrame(
        overall_stats, 
        columns=[
            'field', 
            'type', 
            'cardinality', 
            'uniq_ratio', 
            'null_ratio', 
            *[f'example{i+1}' for i in range(n_example)],
        ]
    )
    return result

In [8]:
pd.set_option('display.max_rows', 200) # ask pandas to display more rows

In [9]:
overall_stats = collect_overall_stats(data).sort_values(['null_ratio', 'cardinality'], ascending=[True, False]).reset_index(drop=True)

In [10]:
overall_stats

Unnamed: 0,field,type,cardinality,uniq_ratio,null_ratio,example1,example2,example3
0,time,str,3239,1.0,0.0,2022-08-01T00:03:56.207532Z,2022-08-01T00:19:37.909827Z,2022-08-01T00:25:38.530749Z
1,properties.createdDateTime,str,3239,1.0,0.0,2022-08-01T00:03:56.371532Z,2022-08-01T00:19:38.009827Z,2022-08-01T00:25:38.621749Z
2,properties.correlationId,str,3239,1.0,0.0,bf37c95d-08e3-4342-b83f-4827b020e5b4,690fb113-a50a-46bc-b5ca-90b2c6988d8a,106735ee-529f-4d3e-a513-557fac957792
3,properties.originalRequestId,str,3239,1.0,0.0,dce375bd-b82e-494f-98e4-1ad345dda0bf,84920ffc-d938-4eb7-97ac-3f2769d09bba,f2b94018-fbfd-4349-9243-96169a0e79bf
4,properties.uniqueTokenIdentifier,str,3239,1.0,0.0,sRma0NKgsBO3hkJGXjMCmsRgf5phSAdeuQ2CpFyvZiqp9BWu,JmqXv1yjmpLCVKYQkwoaQn0ibst82O1aYzAfql41BAoBA52Q,fDViSwd7VgvV1pXGIvJrscgRMoytG09bt259jU5BiY1GqRIZ
5,properties.id,str,3233,0.998148,0.0,df70b726-7756-4baa-9a7d-5ac965198e00,c98bb980-53fe-43a8-afd2-72b917706b00,4ef53074-987d-44ae-a8dd-b6e418929900
6,correlationId,str,3146,0.971287,0.0,84ca338d-f4ff-4f34-9f2a-5a6e23f78c0b,7641103c-1db3-4e14-9ebc-6a9555ba02b2,ef72b144-8295-493d-8231-c12e755a74d8
7,properties.processingTimeInMilliseconds,int64,307,0.094782,0.0,164,100,91
8,properties.appId,str,78,0.024082,0.0,9a7e67c7-6f05-42a3-b226-97c7ec3e9696,e47a1d38-5f61-45cd-b1b9-bc92f525c598,9c5b7fe3-0ad2-4ea6-94e5-9e0001f367e3
9,properties.appDisplayName,str,77,0.023773,0.0,Adobe Identity Management,Altoura,Articulate 360


Definition of `null`: NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike<br>
Note. Unique count includes null values

#### 5.2.1 Signs of a bad feature<a class="anchor" id="1.6.3.1"></a>
#### High Cardinality
Each row has a unique value (cardinality/(# of all data points) ≅ 1)<br>
E.g.: Log ID (correlationId) that uniquely identifies each log entry

In [11]:
overall_stats.loc[overall_stats.uniq_ratio > 0.95]

Unnamed: 0,field,type,cardinality,uniq_ratio,null_ratio,example1,example2,example3
0,time,str,3239,1.0,0.0,2022-08-01T00:03:56.207532Z,2022-08-01T00:19:37.909827Z,2022-08-01T00:25:38.530749Z
1,properties.createdDateTime,str,3239,1.0,0.0,2022-08-01T00:03:56.371532Z,2022-08-01T00:19:38.009827Z,2022-08-01T00:25:38.621749Z
2,properties.correlationId,str,3239,1.0,0.0,bf37c95d-08e3-4342-b83f-4827b020e5b4,690fb113-a50a-46bc-b5ca-90b2c6988d8a,106735ee-529f-4d3e-a513-557fac957792
3,properties.originalRequestId,str,3239,1.0,0.0,dce375bd-b82e-494f-98e4-1ad345dda0bf,84920ffc-d938-4eb7-97ac-3f2769d09bba,f2b94018-fbfd-4349-9243-96169a0e79bf
4,properties.uniqueTokenIdentifier,str,3239,1.0,0.0,sRma0NKgsBO3hkJGXjMCmsRgf5phSAdeuQ2CpFyvZiqp9BWu,JmqXv1yjmpLCVKYQkwoaQn0ibst82O1aYzAfql41BAoBA52Q,fDViSwd7VgvV1pXGIvJrscgRMoytG09bt259jU5BiY1GqRIZ
5,properties.id,str,3233,0.998148,0.0,df70b726-7756-4baa-9a7d-5ac965198e00,c98bb980-53fe-43a8-afd2-72b917706b00,4ef53074-987d-44ae-a8dd-b6e418929900
6,correlationId,str,3146,0.971287,0.0,84ca338d-f4ff-4f34-9f2a-5a6e23f78c0b,7641103c-1db3-4e14-9ebc-6a9555ba02b2,ef72b144-8295-493d-8231-c12e755a74d8


Fields whose values change with almost every record are poor feature candidates: there is little repeatable pattern for a model to learn.

#### Low Cardinality
All rows share the same value ( cardinality ≅ 1 )<br>
E.g.: tenantId if all logs are coming from the same tenant<br>

In [12]:
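# Intended to exclude bool columns, which have a low cardinality by default; their usefulness needs to be determined by the security context they carry.
# Caveat: numpy booleans are reported as type 'bool_' in overall_stats (see properties.isInteractive in the output), so comparing against 'bool' does not filter them out.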
overall_stats.loc[(overall_stats.cardinality <= 3) & (overall_stats.type != 'bool')]

Unnamed: 0,field,type,cardinality,uniq_ratio,null_ratio,example1,example2,example3
27,properties.conditionalAccessStatus,str,3,0.000926,0.0,failure,success,notApplied
28,category,str,2,0.000617,0.0,NonInteractiveUserSignInLogs,SignInLogs,(empty)
29,properties.isInteractive,bool_,2,0.000617,0.0,False,True,(empty)
30,properties.authenticationRequirementPolicies,list,2,0.000617,0.0,[],"[{""requirementProvider"": ""request"", ""detail"": ...",(empty)
31,properties.authenticationRequirement,str,2,0.000617,0.0,singleFactorAuthentication,multiFactorAuthentication,(empty)
32,properties.incomingTokenType,str,2,0.000617,0.0,primaryRefreshToken,none,(empty)
33,resourceId,str,1,0.000309,0.0,/tenants/d3e5a967-5657-4a42-afcc-6106b6c3c299/...,(empty),(empty)
34,operationName,str,1,0.000309,0.0,Sign-in activity,(empty),(empty)
35,operationVersion,str,1,0.000309,0.0,1.0,(empty),(empty)
36,tenantId,str,1,0.000309,0.0,d3e5a967-5657-4a42-afcc-6106b6c3c299,(empty),(empty)


Note that `cardinality` here includes `null` values too, so fields with one possible value + `null` will have 2 as the cardinality. 
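
The difference between counting with and without `null` can be seen on a toy column. This sketch only illustrates the pandas behaviour; the values are made up.

```python
import numpy as np
import pandas as pd

# Toy column with one real value and some missing entries.
col = pd.Series(["AzureAD", np.nan, "AzureAD", np.nan])

# unique() returns NaN as one of the values, so it is counted here.
with_null = len(col.unique())

# nunique() drops missing values by default (dropna=True).
without_null = col.nunique()

print(with_null, without_null)
```

A field with a cardinality of 2 by the first count may therefore hold only a single real value.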

#### Mostly Null
Most values are null ((# of null values)/(# of all data points) ≅ 1)<br>
The field might be redundant or unpopulated and so provides little information<br>
! Beware that a missing value is sometimes stored as a string such as "None", which bypasses the `pd.isnull` check

In [13]:
overall_stats.loc[overall_stats.null_ratio > 0.9]

Unnamed: 0,field,type,cardinality,uniq_ratio,null_ratio,example1,example2,example3
77,properties.ipAddressFromResourceProvider,float,15,0.004631,0.9373,155.96.237.219,13.113.40.157,11.5.130.220
78,properties.alternateSignInName,float,5,0.001544,0.9904,cperry1@domain.com,acole1@domain.com,ksheppard1@domain.com
79,properties.signInIdentifier,float,5,0.001544,0.9904,cperry1@domain.com,acole1@domain.com,ksheppard1@domain.com
80,properties.appServicePrincipalId,NoneType,1,0.000309,1.0,(empty),(empty),(empty)
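
Regarding the warning above about string-form "None": one possible way to account for it is to treat a list of placeholder strings as missing when computing the null ratio. The sketch below uses toy data, and the `PLACEHOLDERS` list is an assumption that should be adjusted to whatever your data source actually emits.

```python
import pandas as pd

# Toy columns: one uses the string "none" for "no value", the other empty strings.
toy = pd.DataFrame({
    "properties.riskState": ["none", "none", "valueA", "none"],
    "properties.tokenIssuerName": ["", None, "", ""],
})

# Assumed placeholder strings that stand for "no value" (hypothetical list).
PLACEHOLDERS = ["none", "None", "null", ""]

# pd.isnull only sees real nulls (None/NaN), so placeholders slip through.
naive_null_ratio = toy.isnull().mean()

# Count a cell as missing if it is a real null OR one of the placeholders.
is_missing = toy.isnull() | toy.isin(PLACEHOLDERS)
adjusted_null_ratio = is_missing.mean()

print(pd.DataFrame({"naive": naive_null_ratio, "adjusted": adjusted_null_ratio}))
```

Whether a placeholder really means "missing" depends on the field, so this is best applied column by column after checking the example values.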


#### 5.2.2 Good feature candidates<a class="anchor" id="1.6.3.5"></a>
The thresholds used here are just for demonstration. They can be tuned to suit your data and your requirements better.

In [14]:
good_feat_candidate_criteria = (
    (overall_stats.uniq_ratio < 0.95)
    & ~((overall_stats.cardinality <= 3) & (overall_stats.type != 'bool'))
    & (overall_stats.null_ratio < 0.9)
)
feature_candidates = overall_stats.loc[good_feat_candidate_criteria]

In [15]:
print(f'Number of all columns: {len(overall_stats)}\nPotential feature candidate count: {len(feature_candidates)}')

Number of all columns: 81
Potential feature candidate count: 30


In [16]:
feature_candidates

Unnamed: 0,field,type,cardinality,uniq_ratio,null_ratio,example1,example2,example3
7,properties.processingTimeInMilliseconds,int64,307,0.094782,0.0,164,100,91
8,properties.appId,str,78,0.024082,0.0,9a7e67c7-6f05-42a3-b226-97c7ec3e9696,e47a1d38-5f61-45cd-b1b9-bc92f525c598,9c5b7fe3-0ad2-4ea6-94e5-9e0001f367e3
9,properties.appDisplayName,str,77,0.023773,0.0,Adobe Identity Management,Altoura,Articulate 360
10,callerIpAddress,str,42,0.012967,0.0,44.22.19.201,99.116.100.205,86.154.193.190
11,properties.ipAddress,str,42,0.012967,0.0,44.22.19.201,99.116.100.205,86.154.193.190
12,properties.location.geoCoordinates.longitude,float64,41,0.012658,0.0,-109.530885,-17.674718,131.871582
13,properties.location.geoCoordinates.latitude,float64,40,0.012349,0.0,25.443725,-38.854047,3.551896
14,properties.deviceDetail.deviceId,str,36,0.011115,0.0,0927e60c-8dfa-4ecf-be85-ad63bccf40a1,,6ea76864-5f18-47dd-adb9-2b1dfcbfc425
15,properties.location.city,str,34,0.010497,0.0,Littlemouth,Carrollstad,Port Denisetown
16,properties.autonomousSystemNumber,int64,27,0.008336,0.0,230297,214655,256668


Now we can double-check if these are good candidates under the per-entity scope.

### 5.3. Per-entity Statistics<a class="anchor" id="1.6.4"></a>
Collect entity-level statistics for each feature to further evaluate their “usefulness”.

In [17]:
def get_data_with_selected_features(data, features):
    """Return a new dataframe with only the selected features and json dumps the unhashable fields for `nunique` to work."""
    data_candidate_cols = pd.DataFrame()
    for col in features:
        # `nunique` raises TypeError on unhashable values (e.g., lists/dicts), so JSON-dump those fields
        try:
            data[col].nunique()
            data_candidate_cols[col] = data[col]
        except TypeError:
            data_candidate_cols[col] = data[col].apply(json_dumps_keep_null)
    return data_candidate_cols

def get_entity_feature_cardinality(data, entity_id_field, trainable_user_activity_limit=100):
    """Return a dataframe of the per-entity cardinality (unique count) of each feature, including only the entities with enough activity to be "trainable" by DFP. 
    (DFP doesn't train a model for entities with little activity. Instead, it uses a shared model for the light-traffic entities.)
    """
    kwargs = {col: (col, 'nunique') for col in data.columns if col != entity_id_field}
    kwargs['count'] = (data.columns[0], 'count')
    per_entity_nunique_all = data.groupby(entity_id_field).agg(**kwargs).reset_index()
    
    per_entity_nunique = per_entity_nunique_all.loc[per_entity_nunique_all['count'] > trainable_user_activity_limit]
    return per_entity_nunique

def get_entity_feature_null_ratio(data, entity_id_field, trainable_user_activity_limit=100):
    """Return a dataframe of the per-entity null ratio of each feature, including only the entities with enough activity to be "trainable" by DFP. 
    (DFP doesn't train a model for entities with little activity. Instead, it uses a shared model for the light-traffic entities.)
    """
    data_with_null_eval = data.join(data.isnull(), rsuffix='_isnull')
    kwargs = {f'{col}': (f'{col}_isnull', 'sum') for col in data.columns if col != entity_id_field}
    kwargs['count'] = (data.columns[0], 'count')
    per_entity_null_count_all = data_with_null_eval.groupby(entity_id_field).agg(**kwargs).reset_index()
    per_entity_null_count = per_entity_null_count_all.loc[per_entity_null_count_all['count'] > trainable_user_activity_limit]
    
    # calculate null ratio
    per_entity_null_ratio = pd.DataFrame()
    per_entity_null_ratio[entity_id_field] = per_entity_null_count[entity_id_field]
    for col in per_entity_null_count.columns:
        if col == entity_id_field or col == 'count':
            continue
        per_entity_null_ratio[col] = per_entity_null_count[col] / per_entity_null_count['count']
    
    return per_entity_null_ratio

def get_feature_per_entity_stats(entity_feature_cardinality, entity_feature_null_ratio, entity_id_field):
    """ Given the cardinality and the null ratio of each feature, return a summary dataframe of the per-entity stats for each feature.
    """
    stat_types = ['max', 'med', 'mean', 'min']
    stat_funcs = [max, np.median, np.mean, min]
    
    per_user_feature_dist = []
    for col in entity_feature_cardinality.columns:
        if col == entity_id_field or col == 'count':
            continue

        vals = entity_feature_cardinality[col]
        uniq_ratios = vals / entity_feature_cardinality['count']
        null_ratios = entity_feature_null_ratio[col]
        
        
        per_user_feature_dist.append([
            col,
            # cardinality
            *[func(vals) for func in stat_funcs],
            # unique ratio
            *[func(uniq_ratios) for func in stat_funcs],
            # null ratio
            *[func(null_ratios) for func in stat_funcs],
        ])

    return pd.DataFrame(
        data=per_user_feature_dist, 
        columns=[
            'field', 
            *[f'cardinality_{stat}' for stat in stat_types], 
            *[f'uniq_ratio_{stat}' for stat in stat_types],
            *[f'null_ratio_{stat}' for stat in stat_types],
        ]
    ).sort_values('cardinality_med', ascending=False).reset_index(drop=True)

In [18]:
data_candidate_cols = get_data_with_selected_features(data, feature_candidates.field)
user_feature_cardinality = get_entity_feature_cardinality(data_candidate_cols, entity_id_field='properties.userPrincipalName')
user_feature_null_ratio = get_entity_feature_null_ratio(data_candidate_cols, entity_id_field='properties.userPrincipalName')
per_entity_stats = get_feature_per_entity_stats(user_feature_cardinality, user_feature_null_ratio, entity_id_field='properties.userPrincipalName')

In [19]:
per_entity_stats

Unnamed: 0,field,cardinality_max,cardinality_med,cardinality_mean,cardinality_min,uniq_ratio_max,uniq_ratio_med,uniq_ratio_mean,uniq_ratio_min,null_ratio_max,null_ratio_med,null_ratio_mean,null_ratio_min
0,properties.processingTimeInMilliseconds,144,72.0,90.0,61,0.631148,0.540323,0.495453,0.241379,0.0,0.0,0.0,0.0
1,properties.resourceId,51,30.0,29.266667,11,0.285714,0.206349,0.187282,0.029101,0.016393,0.0,0.001945,0.0
2,properties.resourceServicePrincipalId,39,23.0,22.333333,10,0.230769,0.166667,0.143457,0.026455,0.464623,0.180328,0.186251,0.014199
3,properties.resourceDisplayName,47,20.0,21.066667,5,0.205128,0.156667,0.133773,0.013228,0.02,0.0,0.004131,0.0
4,properties.appId,44,19.0,19.666667,5,0.196721,0.147541,0.126275,0.013228,0.0,0.0,0.0,0.0
5,properties.appDisplayName,43,19.0,19.6,5,0.196721,0.147541,0.126052,0.013228,0.0,0.0,0.0,0.0
6,properties.deviceDetail.browser,18,7.0,8.2,2,0.09322,0.057377,0.05108,0.005291,0.858491,0.081967,0.160295,0.005291
7,properties.deviceDetail.operatingSystem,5,4.0,3.666667,2,0.038462,0.031746,0.023793,0.005291,0.0,0.0,0.0,0.0
8,callerIpAddress,11,3.0,3.266667,1,0.032787,0.017094,0.01769,0.002646,0.0,0.0,0.0,0.0
9,properties.ipAddress,11,3.0,3.266667,1,0.032787,0.017094,0.01769,0.002646,0.0,0.0,0.0,0.0


We can see that some features look less promising at the per-entity level. The same criteria used for the overall statistics can be reused here. A feature is a weak candidate if, for a typical entity, it shows any of the following:
- High cardinality (each row has a unique value, i.e. high uniqueness)
- Low cardinality (all rows share 1 or 2 constant values)
- High null ratio 
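
To make the difference between the overall and the per-entity view concrete, here is a minimal sketch on made-up data (the column names `user`, `deviceId`, and `app` are illustrative only). A field can look reasonably varied across the whole dataset while being constant for every individual entity:

```python
import pandas as pd

toy = pd.DataFrame({
    "user": ["alice"] * 3 + ["bob"] * 3 + ["carol"] * 3,
    # Each user always signs in from the same device.
    "deviceId": ["d1"] * 3 + ["d2"] * 3 + ["d3"] * 3,
    "app": ["mail", "chat", "mail", "chat", "wiki", "mail", "wiki", "wiki", "chat"],
})

# Overall view: both fields have three distinct values.
overall_cardinality = toy[["deviceId", "app"]].nunique()

# Per-entity view: count distinct values per user, then take the median over users.
per_user_cardinality = toy.groupby("user")[["deviceId", "app"]].nunique()
per_user_median = per_user_cardinality.median()

print(overall_cardinality)
print(per_user_median)
```

In this toy data `deviceId` has an overall cardinality of 3 but a per-user median of 1, so it would be caught by the low-cardinality criterion at the per-entity level only.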

#### 5.3.1 Good feature candidates<a class="anchor" id="1.6.4.1"></a>
The thresholds used here are just for demonstration. They can be tuned to suit your data and your requirements better.

In [20]:
per_entity_stats.columns = per_entity_stats.columns.map(lambda col: f'per_ent_{col}' if col != 'field' else 'field')
feature_stats = feature_candidates.merge(per_entity_stats, on='field')  # merge overall and per-entity stats

In [21]:
feature_stats

Unnamed: 0,field,type,cardinality,uniq_ratio,null_ratio,example1,example2,example3,per_ent_cardinality_max,per_ent_cardinality_med,per_ent_cardinality_mean,per_ent_cardinality_min,per_ent_uniq_ratio_max,per_ent_uniq_ratio_med,per_ent_uniq_ratio_mean,per_ent_uniq_ratio_min,per_ent_null_ratio_max,per_ent_null_ratio_med,per_ent_null_ratio_mean,per_ent_null_ratio_min
0,properties.processingTimeInMilliseconds,int64,307,0.094782,0.0,164,100,91,144,72.0,90.0,61,0.631148,0.540323,0.495453,0.241379,0.0,0.0,0.0,0.0
1,properties.appId,str,78,0.024082,0.0,9a7e67c7-6f05-42a3-b226-97c7ec3e9696,e47a1d38-5f61-45cd-b1b9-bc92f525c598,9c5b7fe3-0ad2-4ea6-94e5-9e0001f367e3,44,19.0,19.666667,5,0.196721,0.147541,0.126275,0.013228,0.0,0.0,0.0,0.0
2,properties.appDisplayName,str,77,0.023773,0.0,Adobe Identity Management,Altoura,Articulate 360,43,19.0,19.6,5,0.196721,0.147541,0.126052,0.013228,0.0,0.0,0.0,0.0
3,callerIpAddress,str,42,0.012967,0.0,44.22.19.201,99.116.100.205,86.154.193.190,11,3.0,3.266667,1,0.032787,0.017094,0.01769,0.002646,0.0,0.0,0.0,0.0
4,properties.ipAddress,str,42,0.012967,0.0,44.22.19.201,99.116.100.205,86.154.193.190,11,3.0,3.266667,1,0.032787,0.017094,0.01769,0.002646,0.0,0.0,0.0,0.0
5,properties.location.geoCoordinates.longitude,float64,41,0.012658,0.0,-109.530885,-17.674718,131.871582,11,3.0,3.333333,1,0.040984,0.016949,0.017825,0.005291,0.0,0.0,0.0,0.0
6,properties.location.geoCoordinates.latitude,float64,40,0.012349,0.0,25.443725,-38.854047,3.551896,10,3.0,3.266667,1,0.040984,0.016949,0.01769,0.005291,0.0,0.0,0.0,0.0
7,properties.deviceDetail.deviceId,str,36,0.011115,0.0,0927e60c-8dfa-4ecf-be85-ad63bccf40a1,,6ea76864-5f18-47dd-adb9-2b1dfcbfc425,13,2.0,3.0,1,0.033333,0.016807,0.016332,0.004717,0.0,0.0,0.0,0.0
8,properties.location.city,str,34,0.010497,0.0,Littlemouth,Carrollstad,Port Denisetown,10,2.0,3.0,1,0.032787,0.016667,0.015883,0.004717,0.0,0.0,0.0,0.0
9,properties.autonomousSystemNumber,int64,27,0.008336,0.0,230297,214655,256668,7,2.0,2.466667,1,0.02459,0.016129,0.014127,0.002646,0.0,0.0,0.0,0.0


In [22]:
good_feat_candidate_criteria = (
    # median is a more objective measurement but we can also use mean if we want to take into account everything including the outliers
    (feature_stats.per_ent_uniq_ratio_med < 0.95)
    & ~((feature_stats.per_ent_cardinality_med < 2) & (feature_stats.type != 'bool'))
    & (feature_stats.per_ent_null_ratio_med < 0.9)
)
final_feature_candidates = feature_stats.loc[good_feat_candidate_criteria]

In [23]:
final_feature_candidates

Unnamed: 0,field,type,cardinality,uniq_ratio,null_ratio,example1,example2,example3,per_ent_cardinality_max,per_ent_cardinality_med,per_ent_cardinality_mean,per_ent_cardinality_min,per_ent_uniq_ratio_max,per_ent_uniq_ratio_med,per_ent_uniq_ratio_mean,per_ent_uniq_ratio_min,per_ent_null_ratio_max,per_ent_null_ratio_med,per_ent_null_ratio_mean,per_ent_null_ratio_min
0,properties.processingTimeInMilliseconds,int64,307,0.094782,0.0,164,100,91,144,72.0,90.0,61,0.631148,0.540323,0.495453,0.241379,0.0,0.0,0.0,0.0
1,properties.appId,str,78,0.024082,0.0,9a7e67c7-6f05-42a3-b226-97c7ec3e9696,e47a1d38-5f61-45cd-b1b9-bc92f525c598,9c5b7fe3-0ad2-4ea6-94e5-9e0001f367e3,44,19.0,19.666667,5,0.196721,0.147541,0.126275,0.013228,0.0,0.0,0.0,0.0
2,properties.appDisplayName,str,77,0.023773,0.0,Adobe Identity Management,Altoura,Articulate 360,43,19.0,19.6,5,0.196721,0.147541,0.126052,0.013228,0.0,0.0,0.0,0.0
3,callerIpAddress,str,42,0.012967,0.0,44.22.19.201,99.116.100.205,86.154.193.190,11,3.0,3.266667,1,0.032787,0.017094,0.01769,0.002646,0.0,0.0,0.0,0.0
4,properties.ipAddress,str,42,0.012967,0.0,44.22.19.201,99.116.100.205,86.154.193.190,11,3.0,3.266667,1,0.032787,0.017094,0.01769,0.002646,0.0,0.0,0.0,0.0
5,properties.location.geoCoordinates.longitude,float64,41,0.012658,0.0,-109.530885,-17.674718,131.871582,11,3.0,3.333333,1,0.040984,0.016949,0.017825,0.005291,0.0,0.0,0.0,0.0
6,properties.location.geoCoordinates.latitude,float64,40,0.012349,0.0,25.443725,-38.854047,3.551896,10,3.0,3.266667,1,0.040984,0.016949,0.01769,0.005291,0.0,0.0,0.0,0.0
7,properties.deviceDetail.deviceId,str,36,0.011115,0.0,0927e60c-8dfa-4ecf-be85-ad63bccf40a1,,6ea76864-5f18-47dd-adb9-2b1dfcbfc425,13,2.0,3.0,1,0.033333,0.016807,0.016332,0.004717,0.0,0.0,0.0,0.0
8,properties.location.city,str,34,0.010497,0.0,Littlemouth,Carrollstad,Port Denisetown,10,2.0,3.0,1,0.032787,0.016667,0.015883,0.004717,0.0,0.0,0.0,0.0
9,properties.autonomousSystemNumber,int64,27,0.008336,0.0,230297,214655,256668,7,2.0,2.466667,1,0.02459,0.016129,0.014127,0.002646,0.0,0.0,0.0,0.0


In [24]:
len(final_feature_candidates)

18

Note that `properties.processingTimeInMilliseconds` doesn't seem to carry meaningful security context, even though it looks good from the data science perspective.<br>
Reviewing with domain experts can help filter out such deceptively promising features!

### 5.4. Feature Correlation<a class="anchor" id="1.6.5"></a>
Evaluate the correlation between feature candidates to remove redundancy.
- A data source can have multiple fields representing similar information
    - E.g., `callerIpAddress` and `properties.ipAddress` are two separate but highly correlated fields in Azure AD logs.<br>
      They both represent the IP address of the actor in the log event and will always share the same value.
- <font color='#76B900'>Minimizing redundancy</font> in the feature space boosts the DFP pipeline’s <font color='#76B900'>efficacy</font> and <font color='#76B900'>efficiency</font>
    - Including fields with overlapping information can distract the model and lead to extra computational cost
- Measuring the <font color='#76B900'>correlation</font> between each pair of features helps identify and remove redundancy
    - Make note of the highly correlated features and discuss them with security experts to rule out redundant information
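
Before measuring correlation, exact duplicates such as the two IP address fields can be spotted with a direct comparison. The following is a rough sketch on toy data; the helper name `find_duplicate_columns` is hypothetical and not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "callerIpAddress": ["44.22.19.201", "99.116.100.205", "44.22.19.201"],
    "properties.ipAddress": ["44.22.19.201", "99.116.100.205", "44.22.19.201"],
    "properties.location.city": ["Littlemouth", "Carrollstad", "Littlemouth"],
})

def find_duplicate_columns(df):
    """Return the pairs of columns whose values are identical in every row."""
    cols = list(df.columns)
    return [
        (a, b)
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if df[a].equals(df[b])
    ]

duplicates = find_duplicate_columns(toy)
print(duplicates)
```

This only catches fields that are literally identical. Fields that encode the same information differently (an ID and a display name, for instance) still need the correlation analysis.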

In [25]:
# get a dataframe that only includes the final candidate columns
final_candidate_data = data_candidate_cols[final_feature_candidates.field].copy()

#### 5.4.1 Pearson Correlation Coefficients - Numerical Feature Correlation<a class="anchor" id="1.6.5.1"></a>
Measure the correlation between numerical features (value range: [-1, 1])
- A value near ±1 indicates strong linear correlation, while a value near 0 indicates no linear correlation
- A value \> 0.5 (or < -0.5) is considered a correlation strong enough to be aware of<br>
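
As a quick illustration of how the coefficient behaves, the sketch below computes it by hand in plain Python on made-up numbers. The function name `pearson` and the sample lists are illustrative only; the notebook itself relies on the pandas `corr` method.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

x = [10.0, 15.0, 20.0, 25.0]
linear = [2 * v + 1 for v in x]       # exact linear function of x
unrelated = [3.0, -1.0, -1.0, 3.0]    # symmetric pattern with no linear trend in x

r_linear = pearson(x, linear)
r_unrelated = pearson(x, unrelated)
print(r_linear, r_unrelated)
```

The first pair yields a coefficient of (numerically) 1 and the second yields 0, even though the second series is clearly structured. Pearson only measures linear association.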

Analyze the numerical columns:

In [26]:
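# restrict to numeric columns explicitly; recent pandas versions no longer drop non-numeric columns silently
final_candidate_data.select_dtypes(include='number').corr(method='pearson')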

Unnamed: 0,properties.processingTimeInMilliseconds,properties.location.geoCoordinates.longitude,properties.location.geoCoordinates.latitude,properties.autonomousSystemNumber,properties.status.errorCode
properties.processingTimeInMilliseconds,1.0,0.008551,0.015966,-0.112487,-0.038145
properties.location.geoCoordinates.longitude,0.008551,1.0,-0.164056,0.011657,0.035172
properties.location.geoCoordinates.latitude,0.015966,-0.164056,1.0,0.20457,0.062535
properties.autonomousSystemNumber,-0.112487,0.011657,0.20457,1.0,0.126132
properties.status.errorCode,-0.038145,0.035172,0.062535,0.126132,1.0


We can see that there is very little correlation between the numerical columns.

#### 5.4.2 Cramer's V - Categorical Feature Correlation<a class="anchor" id="1.6.5.2"></a>
- Measure the correlation between categorical features (value range: [0, 1])
- A value near 1 indicates strong correlation, while a value near 0 indicates no correlation
- A value \> 0.5 is considered a correlation strong enough to be aware of 
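
For reference, the quantities computed by the `cramers_v` function later in this section can be written out as follows. For a contingency table with $r$ rows, $k$ columns, $n$ observations, and chi-squared statistic $\chi^2$, the uncorrected coefficient is

$$
\varphi^2 = \frac{\chi^2}{n}, \qquad V = \sqrt{\frac{\varphi^2}{\min(k-1,\ r-1)}}
$$

The function uses the bias-corrected variant, which replaces each term by

$$
\tilde{\varphi}^2 = \max\!\left(0,\ \varphi^2 - \frac{(k-1)(r-1)}{n-1}\right), \qquad
\tilde{r} = r - \frac{(r-1)^2}{n-1}, \qquad
\tilde{k} = k - \frac{(k-1)^2}{n-1}
$$

$$
\tilde{V} = \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}
$$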

##### Preprocessing the dataframe
To measure the correlation between a numerical and a categorical feature, we can bin the numerical feature values into buckets and treat them as categorical.

In [27]:
# Drop properties.processingTimeInMilliseconds as there is no security relevancy
final_candidate_data = final_candidate_data.drop(columns=['properties.processingTimeInMilliseconds'])

# Longitude (float in [-180, 180]) and latitude (float in [-90, 90]) can be binned into size-5 buckets,
# giving roughly 72 and 36 categories respectively instead of infinitely many possible values.
# Note that we use a bin size of 5 here for demonstration, but it can be any value you see fit for your use case.
# Floor division maps each value to the lower edge of its bucket (e.g., -17.67 -> -20.0).
def floor_to_multiple_of_5(x):
    return x // 5 * 5
final_candidate_data['latitude_binned'] = final_candidate_data['properties.location.geoCoordinates.latitude'].apply(floor_to_multiple_of_5)
final_candidate_data['longitude_binned'] = final_candidate_data['properties.location.geoCoordinates.longitude'].apply(floor_to_multiple_of_5)
final_candidate_data = final_candidate_data.drop(columns=['properties.location.geoCoordinates.latitude', 'properties.location.geoCoordinates.longitude'])

In [28]:
def cramers_v(confusion_matrix):
    """ Take a contingency table (cross-tabulation) of two features and return the Cramér's V correlation coefficient.
    Note that Cramér's V tends to overestimate the strength of association, hence this function applies the suggested bias-correction terms.
    
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r, k = confusion_matrix.shape
    phi2_corrected = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    r_corrected = r - ((r-1)**2)/(n-1)
    k_corrected = k - ((k-1)**2)/(n-1)
    v_corrected = np.sqrt(phi2_corrected / min( (k_corrected-1), (r_corrected-1)))
    return v_corrected

def measure_correlation(data):
    """ Go through all possible pairs of features, evaluate the Cramer's correlation value, and return the result in a dataframe."""
    correlation_results = []
    for i, col_i in enumerate(data.columns):
        for j, col_j in enumerate(data.columns):
            if i >= j:
                # no need to repeat for the same pairs
                continue

            conf_matrix = pd.crosstab(data[col_i], data[col_j])
            v = cramers_v(conf_matrix)
            correlation_results.append([col_i, col_j, v])
    return pd.DataFrame(correlation_results, columns=['column1', 'column2', 'cramers_v'])

##### Loop through all pairs of features and evaluate their correlation

In [29]:
correlations = measure_correlation(final_candidate_data).sort_values('cramers_v', ascending=False)

##### Strongly correlated features

In [30]:
correlations.loc[correlations.cramers_v > 0.5]

Unnamed: 0,column1,column2,cramers_v
31,callerIpAddress,properties.ipAddress,1.0
91,resultType,properties.status.errorCode,1.0
130,properties.deviceDetail.browser,properties.userAgent,1.0
0,properties.appId,properties.appDisplayName,0.999842
116,properties.resourceId,properties.resourceServicePrincipalId,0.999806
25,properties.appDisplayName,properties.resourceDisplayName,0.998252
47,properties.ipAddress,properties.autonomousSystemNumber,0.997662
34,callerIpAddress,properties.autonomousSystemNumber,0.997662
112,properties.deviceDetail.operatingSystem,properties.userAgent,0.99557
10,properties.appId,properties.resourceDisplayName,0.991551


Each pair of features in the above table is correlated strongly enough to warrant attention. <br>
- E.g., `properties.appId` and `properties.appDisplayName` both identify which app was accessed in an event: one is the ID string and the other is the human-readable name. <br>
  The two fields share a **~1.0** Cramér's V score because there is a one-to-one mapping between them.<br>
  We should keep only one of the two to avoid adding redundancy to the feature space.
- E.g., `properties.location.city` and `properties.deviceDetail.operatingSystem` have a correlation of **0.52**, which is not insignificant.<br>
  However, the two variables are independent from the security perspective, so we would want to keep both of them as potential feature candidates.<br>

The correlation analysis helps us quickly identify the possible redundancy in the feature space, but **<font color='#76B900'>a strong correlation doesn't indicate dependency</font>**.<br>
It is important to understand the security context and the dependency of features before ruling out any features. (Two features can show a strong correlation in a sampled dataset by coincidence!)
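
The point about coincidental correlation can be reproduced on a tiny sample. The sketch below is a plain-Python, uncorrected Cramér's V (so it differs from the bias-corrected `cramers_v` used above), and the function name and toy values are illustrative only.

```python
from collections import Counter
from math import sqrt

def cramers_v_uncorrected(xs, ys):
    """Uncorrected Cramér's V of two equally long lists of categorical values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    # Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for x, row_total in row_totals.items():
        for y, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if min_dim == 0:
        return 0.0  # a constant column carries no association
    return sqrt(chi2 / (n * min_dim))

# One-to-one mapping (an ID and its display name): truly redundant.
app_ids = ["a1", "b2", "c3", "a1", "b2", "c3"]
app_names = ["Mail", "Chat", "Wiki", "Mail", "Chat", "Wiki"]
v_redundant = cramers_v_uncorrected(app_ids, app_names)

# Tiny sample where city and operating system happen to line up.
cities = ["Littlemouth", "Littlemouth", "Carrollstad", "Carrollstad"]
systems = ["Windows", "Windows", "MacOs", "MacOs"]
v_coincidence = cramers_v_uncorrected(cities, systems)

print(v_redundant, v_coincidence)
```

Both pairs score 1.0, yet only the first pair is redundant by construction. The second is an artifact of the small sample, which is why the security context has to be the deciding factor.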

### 5.5. Review with Security Experts<a class="anchor" id="1.6.6"></a>
Validate the feature candidates with security experts.
- A “good feature” identified by data science methods can be irrelevant to the problem being solved
    - E.g., `processingTimeInMilliseconds` is a feature in Azure AD logs which has a proper cardinality and is never null.<br>
      However, it represents the milliseconds taken to process the log, which helps monitor the health of the pipeline but carries very little security context about the event. 
- An identified “bad feature” can carry important security context and just needs some extra feature engineering to be useful
    - E.g., the `timestamp` of every log is unique. Hence, as a feature, the timestamp will have a very high cardinality.<br>
      However, the time an event happened can contribute largely to the “anomalousness” of a behavior.<br>
      As a workaround, we can capture the time information by parsing “hour of day” or “day of week” and use them as derived features. 
- Reviewing feature candidates with security experts helps keep our model <font color='#76B900'>relevant</font> and <font color='#76B900'>effective</font>
    - A deeper understanding of the data and the target domain is always beneficial for all ML applications
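
The timestamp workaround mentioned above could look roughly like the following sketch. It assumes the timestamp is available as an ISO 8601 string; the function name `time_features` and the output keys are hypothetical.

```python
from datetime import datetime

def time_features(timestamp):
    """Derive low-cardinality time features from an ISO 8601 timestamp string."""
    ts = datetime.fromisoformat(timestamp)
    return {
        "hour_of_day": ts.hour,            # 0-23
        "day_of_week": ts.weekday(),       # 0 = Monday ... 6 = Sunday
        "is_weekend": ts.weekday() >= 5,   # Saturday or Sunday
    }

features = time_features("2022-08-01T13:45:10+00:00")
print(features)
```

Note that this keeps the timestamp's own UTC offset. Whether to convert to the user's local time first is a design choice worth discussing with the security experts, since "unusual hour" is usually meant relative to the user's working day.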

## 6. Ideas on Derived Features<a class="anchor" id="1.7"></a>
The previous sections demonstrate a way to select from the raw features. However, raw values do not always capture the information that is helpful for modeling.
- Derived features make <font color='#76B900'>key</font> information that is <font color='#76B900'>hidden</font> in plain sight explicit to the model
- Examples of useful insights to feed into the model: 
    - **Strong relation between fields**<br>
    E.g.: City, state, and country fields together provide information on location.<br>
    Concatenating them into a single <font color='#76B900'>location</font> feature can inform the model about it and avoid collisions between multiple cities with the same names.
    - **Semantics behind the plain value of a field**<br>
    E.g.: App names are strings and are hence treated as categorical features by the model. However, `Microsoft Teams` and `Microsoft Teams Services` might be semantically closer to each other than either is to `Office 365 Exchange Online`.<br>
    Adding an <font color='#76B900'>app category</font> feature (e.g., `MS Teams`/`Exchange`) can help capture the meaning of the field better.
    - **Anomalous pattern to target**<br>
    Depending on the use case, there may be specific patterns we wish to highlight for the model.<br>
    E.g.: Once compromised, a user account may be used to access a high number of resources in the environment.<br>
    In this case, <font color='#76B900'>incremental app count</font> for the day would be a good feature to bring attention to the targeted red flags.
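
Two of the ideas above, the combined location feature and the incremental app count, can be sketched in plain Python as follows. This is only an illustration under assumed field names (`time`, `user`, `app`, `city`, `state`, `country`) and a hypothetical helper `add_derived_features`; it is not how the DFP pipeline itself computes these features.

```python
from datetime import datetime

def add_derived_features(events):
    """Return copies of the events, in time order, with two derived features added."""
    apps_seen = {}  # (user, calendar day) -> set of apps accessed so far that day
    enriched = []
    for event in sorted(events, key=lambda e: datetime.fromisoformat(e["time"])):
        day = datetime.fromisoformat(event["time"]).date()
        apps = apps_seen.setdefault((event["user"], day), set())
        apps.add(event["app"])
        enriched.append({
            **event,
            # Concatenated location avoids collisions between same-named cities.
            "location": ", ".join([event["city"], event["state"], event["country"]]),
            # Number of distinct apps this user has accessed so far on this day.
            "app_inc_count": len(apps),
        })
    return enriched

events = [
    {"time": "2022-08-01T09:00:00", "user": "alice", "app": "Mail",
     "city": "Springfield", "state": "Illinois", "country": "US"},
    {"time": "2022-08-01T09:05:00", "user": "alice", "app": "Chat",
     "city": "Springfield", "state": "Illinois", "country": "US"},
    {"time": "2022-08-01T09:10:00", "user": "alice", "app": "Mail",
     "city": "Springfield", "state": "Missouri", "country": "US"},
    {"time": "2022-08-02T08:00:00", "user": "alice", "app": "Wiki",
     "city": "Springfield", "state": "Illinois", "country": "US"},
]

enriched = add_derived_features(events)
print([(e["location"], e["app_inc_count"]) for e in enriched])
```

The count resets at the start of each calendar day and only grows when the user touches an app not yet seen that day, so a sudden climb stands out against the user's usual range.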

## 7. Conclusion<a class="anchor" id="1.8"></a>
- Feature selection and feature engineering are fundamental to all machine learning applications
- A good feature set should cover the key information about an event (actor, time, location, resource accessed, etc.) without redundancy
- Any information that a human analyst would find useful during threat investigation should be considered for inclusion in the DFP feature set<br>

With a good set of raw and derived features, DFP can be a powerful tool that helps monitor the activities in the network and detect anomalies at scale.