# RiskIQ PassiveTotal Python Library

#### *Using the Hostpairs dataset*

## Getting Started

This notebook leverages the RiskIQ Illuminate / PassiveTotal API through the `passivetotal` Python library. 

Documentation for the library, including how to install it and configure API keys, is available here:
https://passivetotal.readthedocs.io/en/latest/getting-started.html

You will need API credentials to authenticate with the API server, and they must provide access to the datasets queried in this notebook. Ask your RiskIQ contact for details or visit https://info.riskiq.net/ to contact the support team.

### Optional Dependencies

This notebook uses the `pandas` Python library primarily to improve the visual output of data tables retrieved from the API. You will need to install that library in your Python (virtual) environment (`pip install pandas`) or change the code examples to return a Python dictionary instead of a dataframe. Simply change `.as_df` to `.as_dict`.

Some examples may use special features in `pandas` to filter or aggregate data, but these can also be implemented in pure Python.

By default, `pandas` will only show a subset of rows in notebooks. To display more, set the `max_rows` option to a higher value.

In [126]:
import pandas as pd
pd.options.display.max_rows=100

### Product Context

[Host Pairs](https://info.riskiq.net/hc/en-us/articles/360057823834-PassiveTotal-Datasets-Host-Pairs)
are two pieces of infrastructure (a parent and a child) that shared a connection observed from a RiskIQ web crawl. The connection could range from a top-level redirect (HTTP 302) to something more complex like an iframe or script source reference.
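To make the parent/child vocabulary concrete, here is a minimal sketch of how a single host pair could be modeled in plain Python. The `HostPair` class and its field names are illustrative assumptions that mirror the columns shown in the outputs later in this notebook; they are not the library's own record type.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HostPair:
    """Hypothetical stand-in for one observed parent -> child relationship."""
    parent: str          # host whose page initiated the connection
    child: str           # host that was requested, linked, or redirected to
    cause: str           # how the two were connected, e.g. 'redirect' or 'img.src'
    firstseen: datetime  # first crawl that observed the relationship
    lastseen: datetime   # most recent crawl that observed it


# Sample values copied from a row in the output shown later in this notebook.
pair = HostPair(
    parent='www.savingsbonds.gov',
    child='www.irs.gov',
    cause='meta.refresh',
    firstseen=datetime(2016, 2, 4, 10, 33, 47),
    lastseen=datetime(2021, 8, 23, 11, 33, 27),
)
print(f'{pair.parent} -> {pair.child} via {pair.cause}')
```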


### Setup Notebook
*If this returns errors, ensure you have followed the Getting Started document linked above to install necessary dependencies and configure your API keys.*

In [None]:
from passivetotal import analyzer
analyzer.init()

### Table of Contents

* [Hostpair Fundamentals](#Hostpair-Fundamentals): Understanding and querying the hostpairs dataset.
* [Filtering Hostpairs](#Filtering-Hostpair-Results): Leverage `analyzer` features to focus on foreign hosts.
* [Use Case: Find Inbound Redirects](#Use-Case:-Find-Inbound-Redirects): Discover sites redirecting traffic to your focus host.
* [Use Case: Find Site Copies](#Use-Case:-Find-Site-Copies): Discover copy-cat sites that have likely cloned the entire HTML of your focus host.

---
## Hostpair Fundamentals

### Hostpair Parents
Hostpair relationships are oriented around a specific hostname (defined here in the `focus_host` variable). Other hosts with a "parent" relationship to our focus hostname are "upstream" of our focus host. The "upstream" or "parent" hosts are publishing a web page that requests data or resources from, redirects to, or links to a child host.

Here, we consider the hostpairs relationships that are "upstream" of our focus host, www.irs.gov.

First, set a narrow date range to limit the amount of data returned from the API (the hostpairs dataset is also a historical dataset that captures host relationships over time).

In [124]:
analyzer.set_date_range(days_back=30)
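As a rough illustration of what a 30-day look-back means, the sketch below computes the start and end of such a window with the standard library. The `date_window` helper is hypothetical and is not part of `passivetotal`; how the library derives its own range internally may differ.

```python
from datetime import date, timedelta


def date_window(days_back, today=None):
    """Return (start, end) dates covering the last `days_back` days."""
    end = today or date.today()
    start = end - timedelta(days=days_back)
    return start, end


# A fixed 'today' keeps the example reproducible.
start, end = date_window(30, today=date(2021, 9, 20))
print(start.isoformat(), end.isoformat())
```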

Next, we assign a variable for our focus host to make it easy to change later, then access the `hostpair_parents` property of the `analyzer.Hostname()` object.

In [125]:
focus_host = 'www.irs.gov'
analyzer.Hostname(focus_host).hostpair_parents

<passivetotal.analyzer.hostpairs.HostpairHistory at 0x10aa20e80>

Results are returned as an instance of a `HostpairHistory` object. This is a list-like object provided by the `analyzer` package and offers many of the same filtering and sorting capabilities as other Analyzer objects. See the
[reference docs](https://passivetotal.readthedocs.io/en/latest/analyzer.html#hostpairs-record-lists)
for complete details.

We can iterate through the results like a standard Python list.

In [None]:
for pair in analyzer.Hostname(focus_host).hostpair_parents:
    print(pair)

> NOTE: If you change the `days_back` value above and re-run the query, you won't get a different set of results. This is due to caching in the `analyzer.Hostname` objects. You can restart the notebook kernel, or run `analyzer.Hostname('www.irs.gov').reset('hostpair_parents')` to clear the cache.
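
The caching behavior described in the note can be pictured with a small, self-contained sketch. The `CachedLookup` class below is hypothetical and only illustrates the general pattern of loading a value once and reusing it until it is explicitly reset; it does not reproduce the library's implementation.

```python
class CachedLookup:
    """Hypothetical object that fetches a value once and then reuses it."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that performs the (expensive) query
        self._cache = {}      # attribute name -> previously fetched value

    def get(self, name):
        # Only call the fetch function the first time a name is requested.
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]

    def reset(self, name):
        # Drop the cached value so the next get() fetches again.
        self._cache.pop(name, None)


settings = {'days_back': 30}
lookup = CachedLookup(lambda name: f"{name} for last {settings['days_back']} days")

first = lookup.get('hostpair_parents')
settings['days_back'] = 90
stale = lookup.get('hostpair_parents')   # still reflects the old setting
lookup.reset('hostpair_parents')
fresh = lookup.get('hostpair_parents')   # re-fetched with the new setting
print(stale)
print(fresh)
```

Changing the setting alone has no effect on the cached value; only the explicit reset causes a new fetch, which is the same reason the note suggests clearing the cache or restarting the kernel.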

Each pair can be printed like a string, but is itself an instance of a `HostpairRecord` that provides distinct properties for all the hostpair data it represents. It can also be rendered as a Python dictionary using the `as_dict` property. Complete details are available in
[the reference docs](https://passivetotal.readthedocs.io/en/latest/analyzer.html#passivetotal.analyzer.hostpairs.HostpairRecord).

For the rest of this notebook, we will use the `as_df` property of a hostpair list to view hostpairs as a Pandas dataframe. See above for more details on prerequisites to use the Pandas package. 

In [None]:
analyzer.Hostname(focus_host).hostpair_parents.as_df

As the product context above suggests, the `cause` property is critical to understanding the exact nature of a hostpair relationship. We can view the unique set of causes within a hostpairs list with the `causes` property.

In [None]:
analyzer.Hostname(focus_host).hostpair_parents.causes
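For readers working with plain dictionaries instead of the library's record lists, a comparable summary can be computed with the standard library. The sketch below assumes each hostpair has already been converted to a dictionary with a `cause` key; the sample rows are taken from outputs shown elsewhere in this notebook, and `collections.Counter` is an addition here to show how often each cause occurs.

```python
from collections import Counter

records = [
    {'parent': 'redirection.irs-human-detection.com', 'cause': 'redirect'},
    {'parent': 'irs.gov-departements.com', 'cause': 'redirect'},
    {'parent': 'www.savingsbonds.gov', 'cause': 'meta.refresh'},
    {'parent': 'payment.irs.benefit.marypoesia.com', 'cause': 'img.src'},
]

# Distinct causes present in the list.
unique_causes = {r['cause'] for r in records}
# Number of records observed for each cause.
cause_counts = Counter(r['cause'] for r in records)

print(sorted(unique_causes))
print(cause_counts.most_common())
```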

Optimal use of the hostpairs dataset requires a focus on specific types of causes, often grouped by similar function. These groupings of causes, within the context of a specific direction (parent or child), define investigative use cases. We will explore several of them in this notebook.

---
## Filtering Hostpair Results

One of the most critical techniques for effectively leveraging hostpairs is to focus on "foreign" hosts: hosts that are not under the control of the focus host's organization. Looking at intra-host relationships (within the same domain) or relationships within the same organization can be useful in some cases, but when investigating or detecting suspicious activity, we want to home in on sites likely controlled by malicious actors.

The `analyzer` module provides methods on the `HostpairHistory` object that let an analyst maintain a "whitelist" of known-good or self-owned hosts and exclude them from hostpair results.

In [127]:
whitelist = ['irs.gov','treasury.gov','google-analytics.com','revolut.com','civicplus.com']
analyzer.Hostname(focus_host).hostpair_parents.exclude_domains_in(whitelist).as_df

Unnamed: 0,query,direction,firstseen,lastseen,child,parent,cause
0,www.irs.gov,parents,2021-09-15 15:17:06,2021-09-20 19:10:54,www.irs.gov,redirection.irs-human-detection.com,redirect
1,www.irs.gov,parents,2021-09-19 20:18:25,2021-09-20 18:18:11,www.irs.gov,irs.gov-departements.com,redirect
2,www.irs.gov,parents,2021-01-15 05:09:04,2021-09-20 17:37:33,www.irs.gov,www.paycheckpirate.com,redirect
3,www.irs.gov,parents,2021-09-20 14:15:46,2021-09-20 15:33:12,www.irs.gov,api.uisderes.com,redirect
4,www.irs.gov,parents,2021-09-19 15:03:21,2021-09-20 15:17:27,www.irs.gov,redirection.dkim-cloudflare-human-verification...,redirect
...,...,...,...,...,...,...,...
77,www.irs.gov,parents,2020-09-04 11:24:37,2021-08-26 18:46:01,www.irs.gov,douglascountywi.org,redirect
78,www.irs.gov,parents,2020-09-04 04:34:15,2021-08-25 05:58:14,www.irs.gov,www.lakecountyil.gov,redirect
79,www.irs.gov,parents,2016-02-04 10:33:47,2021-08-23 11:33:27,www.irs.gov,www.savingsbonds.gov,meta.refresh
80,www.irs.gov,parents,2021-07-19 09:39:35,2021-08-23 09:58:29,www.irs.gov,payment.irs.benefit.marypoesia.com,img.src


> The `exclude_domains_in` method uses the built-in capabilities of `Hostname` objects to filter on the registered domain name of each hostname in a hostpair, using the Python package `tldextract`. This lets you exclude hosts by domain name without having to parse or extract that portion from the hostname yourself. If you need different behavior, consider using the `exclude_hosts_in` method.

Occasionally, it can be helpful to further narrow the list by TLD.

In [None]:
analyzer.Hostname(focus_host).hostpair_parents.exclude_domains_in(whitelist).exclude_tlds_in('gov').as_df
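The idea behind domain-based and TLD-based exclusion can be sketched in pure Python. This example is a deliberate simplification: the hypothetical `naive_registered_domain` helper just keeps the last two labels of a hostname, which is wrong for multi-part suffixes such as `.co.uk`, and that is the kind of case a suffix-aware package like `tldextract` exists to handle. Apart from the one host marked as hypothetical, the hostnames are sample values from this notebook.

```python
def naive_registered_domain(hostname):
    """Keep the last two labels of a hostname (simplification, see lead-in)."""
    return '.'.join(hostname.lower().split('.')[-2:])


def tld(hostname):
    """Return the final label of a hostname."""
    return hostname.lower().rsplit('.', 1)[-1]


parents = [
    'apps.irs.gov',               # hypothetical host under an allowed domain
    'www.lakecountyil.gov',
    'www.paycheckpirate.com',
    'api.uisderes.com',
]
allowed_domains = {'irs.gov', 'treasury.gov'}
excluded_tlds = {'gov'}

# Keep only hosts outside the allowed domains and outside the excluded TLDs.
foreign = [
    host for host in parents
    if naive_registered_domain(host) not in allowed_domains
    and tld(host) not in excluded_tlds
]
print(foreign)
```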

---
## Hostpair Use Cases
---
### Use Case: Find Inbound Redirects

The hostpairs dataset can be used to find websites that are redirecting to our focus host. It is unusual (and often suspicious) when a website we do not control redirects to our site. For example, landing pages for credential harvesting and other phishing attacks will often redirect to the site they are copying if you simply visit the domain name directly without following a link, in an (often successful) attempt to lend legitimacy.

First, we assign a variable to store a reference to our filtered hostpairs recordlist for easy access.

In [128]:
foreign_hostpairs = analyzer.Hostname(focus_host).hostpair_parents.exclude_domains_in(whitelist).exclude_tlds_in('gov')

Next, we define a list of causes that apply to this use case (inbound redirects), then use the `filter_in` method available on all analyzer RecordList-type objects to filter the list to only those causes.

In [129]:
redirect_causes = ['redirect','topLevelRedirect','location.refresh','meta.refresh']
foreign_hostpairs.filter_in(cause=redirect_causes).as_df

Unnamed: 0,query,direction,firstseen,lastseen,child,parent,cause
0,www.irs.gov,parents,2021-09-15 15:17:06,2021-09-20 19:10:54,www.irs.gov,redirection.irs-human-detection.com,redirect
1,www.irs.gov,parents,2021-09-19 20:18:25,2021-09-20 18:18:11,www.irs.gov,irs.gov-departements.com,redirect
2,www.irs.gov,parents,2021-01-15 05:09:04,2021-09-20 17:37:33,www.irs.gov,www.paycheckpirate.com,redirect
3,www.irs.gov,parents,2021-09-20 14:15:46,2021-09-20 15:33:12,www.irs.gov,api.uisderes.com,redirect
4,www.irs.gov,parents,2021-09-19 15:03:21,2021-09-20 15:17:27,www.irs.gov,redirection.dkim-cloudflare-human-verification...,redirect
...,...,...,...,...,...,...,...
56,www.irs.gov,parents,1970-01-01 00:00:00,2021-09-01 18:46:09,www.irs.gov,bit.ly,redirect
57,www.irs.gov,parents,2020-09-04 04:54:54,2021-09-01 09:49:52,www.irs.gov,shreve-lib.org,redirect
58,www.irs.gov,parents,2021-08-31 13:32:12,2021-08-31 23:33:28,www.irs.gov,google-saveurl.com,redirect
59,www.irs.gov,parents,2019-11-30 15:42:55,2021-08-28 12:32:37,www.irs.gov,lnks.gd,location.refresh
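A rough pure-Python equivalent of this kind of membership filter is shown below. It assumes the hostpairs are available as dictionaries with a `cause` key, and the hypothetical `filter_by_cause` helper stands in for the record list's own filtering; the sample rows are taken from outputs in this notebook.

```python
def filter_by_cause(records, causes):
    """Return only the records whose 'cause' is one of `causes`."""
    wanted = set(causes)
    return [r for r in records if r['cause'] in wanted]


records = [
    {'parent': 'www.paycheckpirate.com', 'cause': 'redirect'},
    {'parent': 'lnks.gd', 'cause': 'location.refresh'},
    {'parent': 'www.savingsbonds.gov', 'cause': 'meta.refresh'},
    {'parent': 'payment.irs.benefit.marypoesia.com', 'cause': 'img.src'},
]
redirect_causes = ['redirect', 'topLevelRedirect', 'location.refresh', 'meta.refresh']

redirects = filter_by_cause(records, redirect_causes)
print([r['parent'] for r in redirects])
```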


We can obtain the distinct list of hostnames with the `hosts` property of the hostpair results.

In [None]:
suspect_hosts = foreign_hostpairs.filter_in(cause=redirect_causes).hosts
suspect_hosts

#### Hostname Triage
We can use the features in the `analyzer` module to easily check attributes of the domain registration, such as the age of the domain, registrant org, and registrar, to further refine our list. Reputation and summary data may also be helpful.

First, create a unique list of registered domains from our list of suspicious hosts.

In [None]:
suspect_domains = set([ host.registered_domain for host in suspect_hosts])
suspect_domains

Next, check the RiskIQ Illuminate reputation score for each domain:

In [None]:
for domain in suspect_domains:
    print(domain, analyzer.Hostname(domain).reputation)

Or, consider attributes of the domain registration (from the Whois record):

In [None]:
for domain in suspect_domains:
    try:
        age = analyzer.Hostname(domain).whois.age
        registrar = analyzer.Hostname(domain).whois.registrar
        org = analyzer.Hostname(domain).whois.organization
        print(f'{domain}: registered for {age} days at {registrar} by {org}')
    except analyzer.AnalyzerError:
        print(f'{domain}: [no whois data available]')
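Domain age is often a useful triage signal, since a domain registered only days before it appears in redirect data deserves a closer look. The sketch below shows how an age in days could be derived from a registration date and used to flag young domains. The `domain_age_days` helper, the 90-day threshold, the domain names, and the registration dates are all made up for illustration; none of them are values returned by the API.

```python
from datetime import date


def domain_age_days(registered_on, as_of):
    """Number of whole days between registration and the reference date."""
    return (as_of - registered_on).days


# Hypothetical registration dates, for illustration only.
registrations = {
    'long-lived.example': date(2015, 6, 1),
    'brand-new.example': date(2021, 9, 10),
}
as_of = date(2021, 9, 20)
max_age = 90  # assumed threshold for a "young" domain

young = sorted(
    domain for domain, registered_on in registrations.items()
    if domain_age_days(registered_on, as_of) < max_age
)
print(young)
```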

#### Enriched Dataframe with Pandas
Consider constructing a Pandas dataframe to make the data easier to consume in a notebook and obtain options to easily export the data to CSV.

In [132]:
hostpairs_df = foreign_hostpairs.filter_in(cause=redirect_causes).as_df

# Remove some extra columns
del(hostpairs_df['query'])
del(hostpairs_df['direction'])
del(hostpairs_df['child'])

# Create a parent_domain column with just the registered domain
hostpairs_df['parent_domain'] = hostpairs_df.apply(lambda r: str(analyzer.Hostname(r['parent']).registered_domain), axis=1)

# Create a reputation dataframe
reputation_df = pd.concat([analyzer.Hostname(h).reputation.as_df for h in suspect_domains])

# Join the reputation dataframe to the hostpairs dataframe and cleanup extra columns
hostpairs_df = hostpairs_df.merge(reputation_df, left_on='parent_domain', right_on='query')
del(hostpairs_df['query'])
del(hostpairs_df['rules'])
hostpairs_df.sort_values('score',ascending=False, inplace=True)
hostpairs_df

Unnamed: 0,firstseen,lastseen,parent,cause,parent_domain,score,classification
25,2021-09-16 18:18:56,2021-09-16 18:18:56,greatvaluebyte.online,topLevelRedirect,greatvaluebyte.online,74,SUSPICIOUS
55,2021-09-01 20:04:04,2021-09-01 20:04:22,yahooosearchsh.com,redirect,yahooosearchsh.com,72,SUSPICIOUS
32,2021-03-26 00:23:50,2021-09-13 21:46:07,bidenmoney.com,redirect,bidenmoney.com,72,SUSPICIOUS
21,2021-09-17 21:00:45,2021-09-17 21:01:24,enrelief-tax-review-returns.online,redirect,enrelief-tax-review-returns.online,70,SUSPICIOUS
26,2021-09-16 18:03:03,2021-09-16 18:09:09,traffic-visitor.eng-us-claim-finance.com,redirect,eng-us-claim-finance.com,69,SUSPICIOUS
...,...,...,...,...,...,...,...
18,2020-09-04 23:06:07,2021-09-18 14:35:22,www.ci.webster.ny.us,redirect,webster.ny.us,0,UNKNOWN
17,2021-09-17 22:17:12,2021-09-19 15:05:58,en-claims-funds-verify.com,redirect,en-claims-funds-verify.com,0,UNKNOWN
12,2021-08-02 11:59:19,2021-09-20 13:27:32,info.surepayroll.com,location.refresh,surepayroll.com,0,UNKNOWN
11,2021-09-20 14:15:45,2021-09-20 14:15:45,77funds-available.com,redirect,77funds-available.com,0,UNKNOWN


#### Without Pandas
If you prefer not to use `pandas` for this task, or you're planning to send the data to another system, you can use the `as_dict` property provided by all `analyzer` objects to achieve the same outcome.

In [None]:
analysis = []
for pair in foreign_hostpairs.filter_in(cause=redirect_causes):
    try:
        whois = { 
            'available': True,
            'org': pair.parent.whois.organization.value,
            'age': pair.parent.whois.age,
            'registrar': pair.parent.whois.registrar
        }
    except analyzer.AnalyzerError:
        whois = {'available':False}
    analysis.append({
        'hostpair': pair.as_dict,
        'whois': whois,
        'reputation': pair.parent.reputation.as_dict
    })
analysis
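If the goal is to hand this structure to another system, it usually needs to be serialized first. The sketch below assumes the list contains only dictionaries, strings, numbers, booleans, and `datetime` values; the sample entry is hypothetical, and the `default=str` fallback is one simple way to handle values that `json` cannot encode natively.

```python
import json
from datetime import datetime

# Hypothetical entry shaped like the items appended to `analysis` above.
sample_analysis = [
    {
        'hostpair': {
            'parent': 'brand-new.example',
            'child': 'www.irs.gov',
            'cause': 'redirect',
            'firstseen': datetime(2021, 9, 20, 14, 15, 45),
        },
        'whois': {'available': False},
        'reputation': {'score': 0, 'classification': 'UNKNOWN'},
    }
]

# default=str converts datetimes (and other unsupported types) to strings.
payload = json.dumps(sample_analysis, default=str, indent=2)
print(payload)

# Round-trip to confirm the payload is valid JSON.
restored = json.loads(payload)
```

Note that the round trip returns timestamps as strings, so a receiving system would need to parse them again if it requires real date objects.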

#### Reputation Details
The Reputation API also delivers a list of reasons why a domain was scored a certain way. Access the rules with the `rules` property of the `reputation` property on a hostname.

Here, we examine the reputation score associated with the first domain in the (now sorted) dataframe we built above.

In [None]:
suspicious_domain = hostpairs_df.iloc[0]['parent_domain']
print(suspicious_domain)
analyzer.Hostname(suspicious_domain).reputation.rules

#### Find other redirect targets

The `hostpair_children` property searches in the opposite direction: it finds hostpair relationships that are "downstream" or "outbound" from a focus host. As with `hostpair_parents`, the exact meaning depends on the `cause` of the relationship. In the case of redirects, child hostpairs are sites the focus host is redirecting to.

Here, we consider one of the suspicious hosts we discovered in the list above.

In [133]:
analyzer.Hostname('refunds2[.]com').hostpair_children.as_df

Unnamed: 0,query,direction,firstseen,lastseen,child,parent,cause
0,refunds2.com,children,2021-09-20 06:33:29,2021-09-20 06:33:29,www.irs.gov,claim.redirect.refunds2.com,redirect
1,refunds2.com,children,2021-09-18 14:41:34,2021-09-19 07:54:25,fonts.googleapis.com,refunds2.com,link.href
2,refunds2.com,children,2021-09-18 16:22:53,2021-09-19 02:33:16,fonts.googleapis.com,claim.redirect2.refunds2.com,link.href
3,refunds2.com,children,2021-09-18 14:48:09,2021-09-19 02:18:30,fonts.googleapis.com,claim.redirect.refunds2.com,link.href
4,refunds2.com,children,2021-09-19 00:32:55,2021-09-19 01:18:27,www.irs.gov,claim.redirect2.refunds2.com,redirect
5,refunds2.com,children,2021-09-18 16:27:00,2021-09-18 16:27:00,fonts.googleapis.com,claim.redirect3.refunds2.com,link.href
6,refunds2.com,children,2021-09-18 16:25:55,2021-09-18 16:25:55,fonts.googleapis.com,claim.redirect4.refunds2.com,link.href
7,refunds2.com,children,2021-09-18 16:22:50,2021-09-18 16:22:50,unagi.amazon.de,claim.redirect2.refunds2.com,topLevelRedirect
8,refunds2.com,children,2021-09-18 15:13:54,2021-09-18 15:13:54,fonts.googleapis.com,claim.redirect1.refunds2.com,link.href
9,refunds2.com,children,2021-09-18 15:13:53,2021-09-18 15:13:53,fonts.googleapis.com,claim.redirect5.refunds2.com,link.href


> The same techniques we used with the `hostpair_parents` property above can filter this list to specific causes or exclude other domains. This time, we're showing all the hostpairs available within the specified date range.

There are a few interesting observations from this list that may provide further avenues of investigation, including:
1. The site appears to have been active only briefly: there is just one day's difference between the first and last seen dates. Legitimate sites should have more history, even within the 30-day window we've configured for this notebook.
2. By querying for the domain name, and not the specific hostname, we discovered other subdomains this malicious actor is using. Each could provide a set of IP addresses we could use for further research. 
3. We learn of an additional redirect target: Amazon, and specifically its German site (`unagi.amazon.de`).
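These observations can be reproduced from the rows above with a few lines of standard-library Python. This sketch hard-codes four rows from the output as tuples, so the variable names and structure are illustrative rather than anything the library provides.

```python
from datetime import datetime

FMT = '%Y-%m-%d %H:%M:%S'

# (firstseen, lastseen, child, parent, cause) copied from rows 0, 1, 4 and 7.
rows = [
    ('2021-09-20 06:33:29', '2021-09-20 06:33:29', 'www.irs.gov',
     'claim.redirect.refunds2.com', 'redirect'),
    ('2021-09-18 14:41:34', '2021-09-19 07:54:25', 'fonts.googleapis.com',
     'refunds2.com', 'link.href'),
    ('2021-09-19 00:32:55', '2021-09-19 01:18:27', 'www.irs.gov',
     'claim.redirect2.refunds2.com', 'redirect'),
    ('2021-09-18 16:22:50', '2021-09-18 16:22:50', 'unagi.amazon.de',
     'claim.redirect2.refunds2.com', 'topLevelRedirect'),
]

# Observation 1: how long was the infrastructure observed?
first = min(datetime.strptime(r[0], FMT) for r in rows)
last = max(datetime.strptime(r[1], FMT) for r in rows)
active_days = (last - first).days

# Observation 2: which hosts under the domain appear as parents?
parents = sorted({r[3] for r in rows})

# Observation 3: which hosts are being redirected to?
redirect_targets = sorted(
    {r[2] for r in rows if r[4] in ('redirect', 'topLevelRedirect')}
)

print(active_days, parents, redirect_targets)
```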

---
### Use Case: Find Site Copies

Many phishing and credential harvesting attacks use automated toolkits that copy the HTML code for a source site directly, with no human review and usually without downloading assets to the local webserver. When the copycat site is loaded, images and other resources will be fetched from the source site. These actions are detected by RiskIQ crawlers and indexed as a parent hostpair relationship between the original site and the copycat site.

Here, we find other sites using images loaded from our focus host, `usps.com`. Consider expanding the list of `causes` to include other resource types, such as `css.import` for CSS files or `script.src` for JavaScript.

In [134]:
whitelist_domains = ['usps.com','google-analytics.com','facebook.net','googletagmanager.com']
whitelist_tlds = ['goog','gov','ca']
causes = ['img.src']
filtered_hostpairs = (
    analyzer.Hostname('usps.com')
    .hostpair_parents
    .filter_in(cause=causes)
    .exclude_domains_in(whitelist_domains)
    .exclude_tlds_in(whitelist_tlds)
)
filtered_hostpairs.as_df

Unnamed: 0,query,direction,firstseen,lastseen,child,parent,cause
0,usps.com,parents,2021-09-20 16:16:42,2021-09-20 23:09:37,www.usps.com,cryptoptions.online,img.src
1,usps.com,parents,2021-09-20 05:10:48,2021-09-20 23:08:27,www.usps.com,blog.berabo.com,img.src
2,usps.com,parents,2021-09-20 09:23:37,2021-09-20 23:08:08,www.usps.com,trackusupssdelivery.co.vu,img.src
3,usps.com,parents,2021-08-04 02:04:56,2021-09-20 23:06:37,www.usps.com,wp.grandplaza.sa,img.src
4,usps.com,parents,2021-09-20 22:14:19,2021-09-20 23:00:36,www.usps.com,rentexchangeph.com,img.src
...,...,...,...,...,...,...,...
103,usps.com,parents,2020-10-14 06:24:09,2021-09-17 22:57:44,link.usps.com,ruspsaldahanada.tk,img.src
104,usps.com,parents,2021-08-07 20:27:16,2021-09-17 17:05:50,www.usps.com,www.ncpleg.gov.za,img.src
105,usps.com,parents,2021-09-17 15:37:36,2021-09-17 15:37:37,www.usps.com,mhnetsb.net,img.src
106,usps.com,parents,2021-09-17 13:17:54,2021-09-17 13:17:56,www.usps.com,www.futurelinx.com.au,img.src


Note that there are many completely legitimate reasons why an image on a site might be loaded from a different host. Further investigation is warranted to validate the findings.

Consider applying the techniques above to evaluate the reputation, registration, and other factors to prioritize further investigations.

In [135]:
analyzer.Hostname('cryptoptions[.]online').reputation.to_dataframe(explode_rules=True)

Unnamed: 0,query,score,classification,name,description,severity,link
0,cryptoptions.online,74,SUSPICIOUS,TLD,Domains in this TLD are more likely to be mali...,4,
1,cryptoptions.online,74,SUSPICIOUS,Name server,Domain is using a name server that is more lik...,3,
2,cryptoptions.online,74,SUSPICIOUS,ASN,Infrastructure hosted by this ASN are more lik...,3,


The `hostpair_children` property will reveal other "downstream" sites that a suspicious site is drawing resources from.

In [136]:
analyzer.Hostname('cryptoptions[.]online').hostpair_children.as_df

Unnamed: 0,query,direction,firstseen,lastseen,child,parent,cause
0,cryptoptions.online,children,2021-09-20 16:16:42,2021-09-20 23:09:37,www.usps.com,cryptoptions.online,img.src
1,cryptoptions.online,children,2021-09-20 16:16:13,2021-09-20 23:08:21,www.usps.com,cryptoptions.online,script.src
2,cryptoptions.online,children,2021-09-20 16:16:15,2021-09-20 23:08:19,www.usps.com,cryptoptions.online,link.href
3,cryptoptions.online,children,2021-09-20 16:16:13,2021-09-20 23:08:17,fast.fonts.net,cryptoptions.online,unknown
4,cryptoptions.online,children,2021-09-20 16:16:10,2021-09-20 23:08:15,tools.usps.com,cryptoptions.online,link.href
5,cryptoptions.online,children,2021-09-20 16:16:10,2021-09-20 23:08:11,www.googleoptimize.com,cryptoptions.online,script.src
6,cryptoptions.online,children,2021-09-20 16:16:10,2021-09-20 23:08:10,tools.usps.com,cryptoptions.online,script.src
7,cryptoptions.online,children,2021-09-20 15:24:39,2021-09-20 15:24:39,www.usps.com,cryptoptions.online,parentPage
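One way to summarize output like this is to group the causes by child host, which makes it easy to see how many kinds of resources the suspicious site pulls from each one. The sketch below hard-codes the child and cause columns from the output above; the variable names are illustrative.

```python
from collections import defaultdict

# (child, cause) pairs copied from the output above.
pairs = [
    ('www.usps.com', 'img.src'),
    ('www.usps.com', 'script.src'),
    ('www.usps.com', 'link.href'),
    ('fast.fonts.net', 'unknown'),
    ('tools.usps.com', 'link.href'),
    ('www.googleoptimize.com', 'script.src'),
    ('tools.usps.com', 'script.src'),
    ('www.usps.com', 'parentPage'),
]

# Collect the distinct causes observed for each child host.
causes_by_child = defaultdict(set)
for child, cause in pairs:
    causes_by_child[child].add(cause)

# List children with the most distinct causes first.
ranked = sorted(causes_by_child.items(), key=lambda kv: len(kv[1]), reverse=True)
for child, causes in ranked:
    print(child, sorted(causes))
```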
