# RiskIQ PassiveTotal Python Library

#### *Using the Trackers dataset*

## Getting Started

This notebook leverages the RiskIQ Illuminate / PassiveTotal API through the `passivetotal` Python library. 

Documentation for the library, including how to install it and configure API keys, is available here:
https://passivetotal.readthedocs.io/en/latest/getting-started.html

You will need API credentials that authenticate you with the API server and provide access to the datasets queried in this notebook. Ask your RiskIQ contact for details or visit https://info.riskiq.net/ to contact the support team.

### Optional Dependencies

This notebook uses the `pandas` Python library primarily to improve the visual output of data tables retrieved from the API. You will need to install that library in your Python (virtual) environment (`pip install pandas`) or change the code examples to return a Python dictionary instead of a dataframe. Simply change `.as_df` to `.as_dict`.

Some examples may use special features in `pandas` to filter or aggregate data, but these can also be implemented in pure Python.
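
As a minimal sketch of the pure-Python route, the snippet below filters and counts tracker records held in plain dictionaries. The record layout (`hostname`, `trackerType`, and `value` keys) and the sample values are assumptions made for illustration; inspect the actual `.as_dict` output before relying on any key names.

```python
from collections import Counter

# Hypothetical records: key names and values are illustrative assumptions,
# not the documented shape of the library's `.as_dict` output.
records = [
    {"hostname": "www.example.com", "trackerType": "OtherType", "value": "xyz-1"},
    {"hostname": "www.example.com", "trackerType": "NewRelicId", "value": "abc123"},
    {"hostname": "shop.example.com", "trackerType": "NewRelicId", "value": "abc123"},
]

# Filter: keep only the records of one tracker type.
new_relic = [r for r in records if r["trackerType"] == "NewRelicId"]

# Aggregate: count how many records exist for each tracker type.
counts_by_type = Counter(r["trackerType"] for r in records)

print(len(new_relic))
print(counts_by_type.most_common())
```

A list comprehension covers most filtering needs, and `collections.Counter` handles simple group-and-count aggregation without a dataframe.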

By default, `pandas` will only show a subset of rows in notebooks. To display more, set the `max_rows` option to a higher value.

In [None]:
import pandas as pd
pd.options.display.max_rows=500

### Product Context

[Trackers](https://info.riskiq.net/hc/en-us/articles/360057824494-PassiveTotal-Datasets-Trackers)
are unique codes or values found within web pages and are often used to track user interaction. These codes can be used to correlate a disparate group of websites to a central entity.
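
The correlation idea can be sketched in a few lines: index observations by tracker type and value, then look for values shared by more than one site. The hostnames and tracker values here are made up for illustration and do not come from the API.

```python
from collections import defaultdict

# Made-up (hostname, tracker type, tracker value) observations.
observations = [
    ("www.example.com", "NewRelicId", "abc123"),
    ("login.example.net", "NewRelicId", "abc123"),
    ("blog.example.org", "NewRelicId", "zzz999"),
]

# Map each (type, value) pair to the set of hosts where it was seen.
hosts_by_tracker = defaultdict(set)
for hostname, tracker_type, value in observations:
    hosts_by_tracker[(tracker_type, value)].add(hostname)

# A tracker seen on two or more hosts links those hosts together.
shared = {
    tracker: sorted(hosts)
    for tracker, hosts in hosts_by_tracker.items()
    if len(hosts) > 1
}
print(shared)
```

In this toy data, the two hosts sharing the value `abc123` would be candidates for belonging to the same entity, or for one being a copy of the other.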


### Setup Notebook
*If this returns errors, ensure you have followed the Getting Started document linked above to install necessary dependencies and configure your API keys.*

In [None]:
from passivetotal import analyzer
analyzer.init()

### Table of Contents

* [Tracker History](#Tracker-History): Start with a hostname and get a history of trackers observed on that host.
* [Tracker Observations](#Tracker-Observations): Start with a tracker type and value to discover other sites where it has been observed.
* [Reference Trackers](#Reference-Trackers): Query a derived dataset starting with a host to find other hosts where a site's content has been copied and re-published.

---
## Tracker History

#### Hostname and IP Tracker History
RiskIQ gathers details on trackers during regular web crawls. The specific meaning of a tracker varies based on the "tracker type" assigned by RiskIQ analysts, but generally, one would expect a tracker value to uniquely identify a given site or organization. 

For example, a site admin may use a website monitoring product called New Relic to track the performance of their site. They will embed JavaScript code in their webpage that includes a uniquely assigned identifier. RiskIQ crawlers will see and index that identifier as a tracker of type "NewRelicId" and associate the observation with the site where it was observed.

Here, we consider the trackers observed on www.irs.gov.

In [None]:
analyzer.set_date_range(days_back=30)

In [None]:
analyzer.Hostname('www.irs.gov').trackers.as_df

> NOTE: If you change the `days_back` value above and re-run the query, you won't get a different set of results. This is due to caching in the `analyzer.Hostname` objects. To clear the cache, restart the notebook kernel or run `analyzer.Hostname('www.irs.gov').reset('trackers')`.
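
The caching behavior described in this note can be pictured with a small stand-in class. This is an illustrative sketch only, not the library's implementation: `CachedHost`, `_fetch`, and `fetch_count` are hypothetical names invented here.

```python
class CachedHost:
    """Hypothetical stand-in showing why a cached property ignores new settings."""

    days_back = 30  # stands in for the module-wide date range setting

    def __init__(self, name):
        self.name = name
        self._cache = {}
        self.fetch_count = 0

    def _fetch(self):
        # Pretend API call; it captures the date range in effect when it runs.
        self.fetch_count += 1
        return f"trackers for {self.name} over {CachedHost.days_back} days"

    @property
    def trackers(self):
        # Only query when nothing is cached yet.
        if "trackers" not in self._cache:
            self._cache["trackers"] = self._fetch()
        return self._cache["trackers"]

    def reset(self, prop):
        # Drop the cached value so the next access queries again.
        self._cache.pop(prop, None)


host = CachedHost("www.example.com")
first = host.trackers
CachedHost.days_back = 90
stale = host.trackers      # still the 30-day result, served from the cache
host.reset("trackers")
fresh = host.trackers      # fetched again, now with the 90-day setting
print(first == stale)
print(fresh)
```

The trade-off is the usual one for caches: repeated access costs no extra queries, but a changed setting only takes effect after an explicit reset.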

---
The `trackers` property of the `analyzer.Hostname` object returns an object of type `TrackerHistory` that behaves like a normal Python list, but also offers additional features through various properties. See the [reference docs](https://passivetotal.readthedocs.io/en/latest/analyzer.html?highlight=trackers#passivetotal.analyzer.trackers.TrackerHistory) for a complete list. 

In [None]:
for tracker in analyzer.Hostname('www.irs.gov').trackers:
    print(tracker)

Like most `analyzer` objects, each entry in a list of trackers can be treated like a string for easy display, but they also contain a set of properties and attributes for direct access to the data. These properties are explained in the [reference docs](https://passivetotal.readthedocs.io/en/latest/analyzer.html?highlight=trackers#passivetotal.analyzer.trackers.TrackerRecord) for `TrackerRecord` objects. 

In [None]:
analyzer.Hostname('www.irs.gov').trackers.filter(category='NewRelicId')[0].value

> The `trackers` property is also available for [IP addresses](https://passivetotal.readthedocs.io/en/latest/analyzer.html?highlight=trackers#ip-analysis), with similar functionality, though in most cases we recommend starting with a fully-qualified domain name for best results.

---
## Tracker Observations

Trackers can be an effective way of discovering other internet sites controlled by legitimate entities, but they can also be used for threat investigations and phishing site detection.

When malicious actors copy website content with the intent to set up a phishing site, they often use automated tools that copy the entire HTML of the web page, including the JavaScript and link parameters that set up trackers. In those cases, shared tracker values can be used to detect these copycat sites. 

The `analyzer` offers a top-level `Tracker` object you can use to search for all observations of a specific tracker type and value across hosts or IP addresses. 

In [None]:
analyzer.Tracker('NewRelicId','b67fc6a152').observations_by_hostname.as_df

The `analyzer.Tracker` object provides two properties to aid discovery of related sites: `observations_by_hostname` and `observations_by_ip`. Both return a list of observations as a `TrackerSearchResults` object that offers many of the same capabilities as a `TrackerHistory` object. 

You can instantiate an `analyzer.Tracker` object directly as shown above, or obtain an instance from the `tracker` property of a record returned in the `TrackerHistory` of a hostname or IP address.

In [None]:
analyzer.Hostname('www.irs.gov').trackers.filter(category='NewRelicId')[0].tracker

In [None]:
(
    analyzer.Hostname('www.irs.gov')
    .trackers
    .filter(category='NewRelicId')[0]
    .tracker
    .observations_by_hostname
    .totalrecords
)

> This syntax can be a bit strange when you first encounter it. Python style guides generally discourage long lines of code, but when they are unavoidable or justified, the syntax permits enclosing blocks in parentheses.  
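
For readers new to this style, here is the same idea using only built-in string methods, so it runs without API access. The input string is arbitrary.

```python
text = "  Tracker-History  "

# One long chained expression on a single line...
one_line = text.strip().lower().replace("-", " ").title()

# ...and the same chain wrapped in parentheses, one call per line.
wrapped = (
    text
    .strip()
    .lower()
    .replace("-", " ")
    .title()
)

print(wrapped)
```

Inside parentheses, Python ignores line breaks, so each step of the chain can sit on its own line without backslashes.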

These observations show other sites where RiskIQ has observed the same value for the NewRelicId tracker that the IRS has configured on their site. If these observations are subdomains of the 'irs.gov' domain they are likely benign, but if not, they are suspicious and worth further research.

We can leverage features of the `analyzer` module and these specific tracker objects to focus on those suspicious sites.

In [None]:
whitelist = ['irs.gov','translate.goog','t.co']
suspicious_trackers = (
    analyzer.Tracker('NewRelicId','b67fc6a152')
    .observations_by_hostname
    .exclude_domains_in(whitelist)
)
suspicious_trackers.as_df

> Filtering on the registered domain is possible because the `host` attribute of a tracker record returns an object of type `analyzer.Hostname`, and those objects offer several properties provided by the `tldextract` Python library, including `tld` and `registered_domain`.
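
The effect of excluding known-good domains can be approximated in pure Python by comparing hostname suffixes. This is a simplified sketch, not how the library does it: `in_allowed_domain` is a hypothetical helper, the lookalike hostnames are made up, and a plain suffix check does not understand public suffixes such as `co.uk` the way `tldextract` does.

```python
def in_allowed_domain(hostname, allowed_domains):
    """Return True if hostname equals, or is a subdomain of, an allowed domain."""
    hostname = hostname.lower().rstrip(".")
    return any(
        hostname == domain or hostname.endswith("." + domain)
        for domain in allowed_domains
    )


allowed = ["irs.gov", "translate.goog", "t.co"]
observed = [
    "www.irs.gov",
    "apps.irs.gov",
    "irs.gov.example.com",  # made-up lookalike: irs.gov is only a prefix label
    "notirs.gov",           # made-up lookalike: no dot boundary before irs.gov
]

# Keep only the hosts that fall outside every allowed domain.
suspicious = [h for h in observed if not in_allowed_domain(h, allowed)]
print(suspicious)
```

Requiring a leading dot before the domain is what keeps lookalikes such as `notirs.gov` from slipping through a naive `endswith` check.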

As a further validation, we could examine the age of these domains and the registrant owner using whois data available in the `whois` property of the hostnames.

In [None]:
suspicious_tracker_analysis = []
for tracker in suspicious_trackers.sorted_by('lastseen', True)[0:5]:
    analysis = { 
        'host': str(tracker.host),
        'whois_age': tracker.host.whois.age,
        'whois_org': tracker.host.whois.registrant_org.value
    }
    suspicious_tracker_analysis.append(analysis)
suspicious_tracker_analysis

> The `tracker.host` attribute returns an `analyzer.Hostname` object. Cast it to a string to get just the text value.
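
The pattern in this note, an object that prints as plain text yet is not itself a string, can be sketched with a small class. `Host` is a hypothetical stand-in, not the library's `analyzer.Hostname`.

```python
class Host:
    """Hypothetical stand-in for an object that is richer than its text form."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Controls what str(obj) and print(obj) produce.
        return self.name


host = Host("www.example.com")
row = {"as_object": host, "as_text": str(host)}

print(type(row["as_object"]).__name__)
print(row["as_text"])
```

Storing the string form keeps the resulting dictionary simple to display or serialize, while the original object remains available for further lookups.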

> The second parameter of `sorted_by('lastseen', True)` activates a reverse sort, and together with the slice notation `[0:5]` gives us the top 5 `TrackerSearchRecord` objects. 
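
The same "newest five" selection can be sketched with the built-in `sorted()` function on plain dictionaries. The hostnames and dates are made up, and the `lastseen` key is an assumption used only to mirror the field name in the note.

```python
from datetime import date

# Made-up records; only the 'lastseen' field matters for this sketch.
days = [5, 20, 11, 2, 28, 16, 9]
records = [
    {"host": f"h{i}.example.com", "lastseen": date(2021, 1, day)}
    for i, day in enumerate(days, start=1)
]

# reverse=True puts the most recent date first.
newest_first = sorted(records, key=lambda r: r["lastseen"], reverse=True)

# The slice keeps the first five entries of the sorted list.
top_five = newest_first[0:5]

for record in top_five:
    print(record["host"], record["lastseen"])
```

Slicing never raises an error when fewer than five records exist; it simply returns however many are available.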

## Reference Trackers

RiskIQ researchers have identified several instances where the value of a tracker provides an indication of where an Internet asset was originally hosted or where an Internet asset’s response body was originally copied from. We have merged these identifiers into our tracker dataset under one of several categories (or types).

In the `analyzer` module, these trackers are available in the `tracker_references` property of `analyzer.Hostname` and `analyzer.IPAddress` objects.

Among other use cases, this enables you to find websites hosting files that were originally downloaded from a given site, often with malicious intent.

In [None]:
analyzer.Hostname('www.irs.gov').tracker_references.as_df

> This property performs multiple API queries to search both IPs and hosts for several types of trackers. If you need to conserve API queries, instantiate an `analyzer.Tracker` object and use the `observations_by_hostname` or `observations_by_ip` properties directly.

Combining features from `pandas` and the `analyzer` module, we can create a custom dataframe with the RiskIQ Illuminate Reputation Score for each domain. 

In [None]:
whitelist = ['irs.gov','translate.goog','t.co']
tracker_df = (
    analyzer.Hostname('www.irs.gov')
    .tracker_references
    .filter(searchtype='hosts')
    .exclude_domains_in(whitelist)
    .as_df
)
tracker_df['reputation_score'] = tracker_df.apply(
    lambda row: analyzer.Hostname(str(row['host'])).reputation.score, 
    axis=1
)
del tracker_df['query']
del tracker_df['searchtype']
tracker_df.nlargest(10,'reputation_score')

The `reputation` property of `Hostname` and `IPAddress` objects includes a `rules` property that offers insight into how the score was calculated. We can access the property directly or display it using `pandas`.

In [None]:
analyzer.Hostname('severvice0utkook[.]cf').reputation.to_dataframe(explode_rules=True)

> The `as_df` property is a shortcut to the `to_dataframe()` method available on nearly all `analyzer` objects. In some cases, `to_dataframe()` offers unique behavior specific to the object it is acting on. Here, it uses the `pandas.DataFrame.explode()` method to unpack a list of rules and present them as rows, hence the `explode_rules` parameter.
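
To make the row-per-rule reshaping concrete, here is a rough pure-Python equivalent of exploding a list-valued field. The score, rule names, and field names are invented for illustration and are not the library's actual output. Unlike `explode()`, this sketch also flattens each rule's fields into the row.

```python
# Invented reputation summary: one record holding a list of rules.
reputation = {
    "score": 80,
    "rules": [
        {"name": "Example rule A", "severity": 3},
        {"name": "Example rule B", "severity": 5},
    ],
}

# "Explode" the list: emit one row per rule, repeating the shared fields.
rows = [
    {"score": reputation["score"], **rule}
    for rule in reputation["rules"]
]

for row in rows:
    print(row)
```

Each rule becomes its own row while the overall score is repeated, which is the shape that makes a rules table easy to scan.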