# Simple String Matching
This notebook shows some simple techniques for filtering a pandas DataFrame, mostly by string matching.

In [34]:
import pandas as pd

Here is some passive DNS and IP resolution data for a few domains from the [Cyber Threat Coalition COVID19 Blacklist](https://blacklist.cyberthreatcoalition.org/).

In [13]:
df

Unnamed: 0,query,resolution,asn,isp_org
0,abncoronaveiligomgeving.live,104.219.248.111,22612,Namecheap
1,acccorona.com,107.180.46.153,26496,"GoDaddy.com, LLC"
2,19covid-gouv12.com,111.90.156.123,201133,Verdina Ltd.
3,acccorona.com,162.213.251.171,22612,Namecheap
4,abncoronaveiligomgeving.live,162.255.119.7,22612,Namecheap
5,academydea.com,165.227.16.98,14061,Digital Ocean
6,07q6zf664zv.mceco.co,172.217.12.179,15169,Google
7,464004iiq76.waxcapital.net,172.217.13.83,15169,Google
8,07q6zf664zv.mceco.co,172.217.16.147,15169,Google
9,464004iiq76.waxcapital.net,172.217.164.115,15169,Google


Usually it's convenient to access DataFrame columns as attributes (i.e. dot notation). In this case, however, the column name `query` conflicts with the `DataFrame.query()` method, so `df.query` returns the method instead of the column and the call fails:

In [32]:
df.query.head(5)

AttributeError: 'function' object has no attribute 'head'

The alternative is bracket notation, which references the column by its name as a string and works for any column name:

In [31]:
df['query'].head(5)

0    abncoronaveiligomgeving.live
1                   acccorona.com
2              19covid-gouv12.com
3                   acccorona.com
4    abncoronaveiligomgeving.live
Name: query, dtype: object
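
The conflict can be reproduced on a tiny frame. This is only a sketch: `sample` is a hypothetical three-row frame typed in by hand from the table above, not the notebook's `df`.

```python
import pandas as pd

# Hypothetical three-row sample copied by hand from the table above.
sample = pd.DataFrame({
    "query": ["abncoronaveiligomgeving.live", "acccorona.com", "19covid-gouv12.com"],
    "asn": [22612, 26496, 201133],
})

# Attribute access resolves to the DataFrame.query method, not the column.
attr_is_method = callable(sample.query)

# Bracket access returns the column as a Series.
col = sample["query"]
col_is_series = isinstance(col, pd.Series)

# A column whose name does not clash with a DataFrame attribute works either way.
same_asn = sample.asn.equals(sample["asn"])

print(attr_is_method, col_is_series, same_asn)
```

Because bracket notation never collides with method names, it is the safer default when column names come from external data.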

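If we want to apply string methods to a column, use the `.str` accessor on the column. It provides vectorized versions of many built-in Python string methods and returns a boolean Series that can be used as a filter: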

In [15]:
df[df['query'].str.endswith('.com')]

Unnamed: 0,query,resolution,asn,isp_org
1,acccorona.com,107.180.46.153,26496,"GoDaddy.com, LLC"
2,19covid-gouv12.com,111.90.156.123,201133,Verdina Ltd.
3,acccorona.com,162.213.251.171,22612,Namecheap
5,academydea.com,165.227.16.98,14061,Digital Ocean
13,abccoronavirus.com,181.214.86.147,52284,CH GROUP CORP
14,15mincovid19test.com,184.168.131.241,26496,"GoDaddy.com, LLC"
15,15mintestcoronavirus.com,184.168.131.241,26496,"GoDaddy.com, LLC"
16,360corona.com,184.168.221.33,26496,"GoDaddy.com, LLC"
17,360corona.com,184.168.221.35,26496,"Mega International Investment Trust Co., Ltd."
18,1stcoronatest.com,184.168.221.37,26496,"GoDaddy.com, LLC"
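
To make the role of `.str` concrete, here is a minimal sketch comparing the vectorized call with an explicit Python loop. The Series `queries` is a hypothetical sample built from a few domains in the table above.

```python
import pandas as pd

# Hypothetical sample of domains taken from the table above.
queries = pd.Series([
    "acccorona.com",
    "abncoronaveiligomgeving.live",
    "academydea.com",
    "07q6zf664zv.mceco.co",
])

# Vectorized: .str applies str.endswith to every element.
vectorized = queries.str.endswith(".com")

# Equivalent explicit loop over the same values.
manual = pd.Series([q.endswith(".com") for q in queries])

# The boolean mask selects the matching rows.
com_only = queries[vectorized]
print(com_only.tolist())
```

The two masks are the same for this data; the `.str` form is shorter and, unlike the plain loop, does not raise on missing values.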


In [16]:
df[df['query'].str.contains('corona')]

Unnamed: 0,query,resolution,asn,isp_org
0,abncoronaveiligomgeving.live,104.219.248.111,22612,Namecheap
1,acccorona.com,107.180.46.153,26496,"GoDaddy.com, LLC"
3,acccorona.com,162.213.251.171,22612,Namecheap
4,abncoronaveiligomgeving.live,162.255.119.7,22612,Namecheap
13,abccoronavirus.com,181.214.86.147,52284,CH GROUP CORP
15,15mintestcoronavirus.com,184.168.131.241,26496,"GoDaddy.com, LLC"
16,360corona.com,184.168.221.33,26496,"GoDaddy.com, LLC"
17,360corona.com,184.168.221.35,26496,"Mega International Investment Trust Co., Ltd."
18,1stcoronatest.com,184.168.221.37,26496,"GoDaddy.com, LLC"
19,360corona.com,184.168.221.39,26496,"GoDaddy.com, LLC"
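
`str.contains` has a few options worth knowing about. The sketch below assumes a hypothetical Series that includes an upper-case entry and a missing value, neither of which appears in the data above, to show what `case`, `na` and `regex` do.

```python
import pandas as pd

# Hypothetical values: the upper-case and missing entries are made up for illustration.
queries = pd.Series(["acccorona.com", "ABCCORONAVIRUS.COM", None, "07q6zf664zv.mceco.co"])

# case=False ignores letter case; na=False counts missing values as non-matches,
# so the result is a clean boolean mask.
corona = queries.str.contains("corona", case=False, na=False)

# regex=False treats the pattern as a literal string, so "." is a plain dot.
dot_co = queries.str.contains(".co", regex=False, na=False)

print(corona.tolist())
print(dot_co.tolist())
```

Without `na=False`, a missing value yields a missing result, which cannot be used directly as a boolean filter.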


Regular expressions work the same way. `str.match` tests the pattern against the start of each value, so the `^` below is optional, while the `$` is needed to anchor the end:

In [18]:
df[df['query'].str.match('^360.+com$')]

Unnamed: 0,query,resolution,asn,isp_org
16,360corona.com,184.168.221.33,26496,"GoDaddy.com, LLC"
17,360corona.com,184.168.221.35,26496,"Mega International Investment Trust Co., Ltd."
19,360corona.com,184.168.221.39,26496,"GoDaddy.com, LLC"
23,360corona.com,184.168.221.43,26496,"GoDaddy.com, LLC"
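
The regex methods differ in how they anchor the pattern. The following sketch uses a hypothetical Series (the last two values are invented) and assumes a pandas version that provides `str.fullmatch`.

```python
import pandas as pd

# Hypothetical values; the last two are invented to show the edge cases.
queries = pd.Series([
    "360corona.com",
    "1stcoronatest.com",
    "360corona.community",
    "360coronaXcom",
])

# match: the pattern must match at the start of the string.
starts_360 = queries.str.match(r"360")

# contains: the pattern may match anywhere in the string.
has_corona = queries.str.contains(r"corona")

# fullmatch: the pattern must cover the whole string; \. is a literal dot.
strict = queries.str.fullmatch(r"360.+\.com")

# An unescaped "." matches any character, so "Xcom" is accepted here.
loose = queries.str.match(r"^360.+com$")

print(starts_360.tolist(), has_corona.tolist())
print(strict.tolist(), loose.tolist())
```

Writing patterns as raw strings (`r"..."`) keeps backslashes such as `\.` from being interpreted by Python before the regex engine sees them.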


Here's an example of matching on a numeric column:

In [19]:
df[df['asn']==26496]

Unnamed: 0,query,resolution,asn,isp_org
1,acccorona.com,107.180.46.153,26496,"GoDaddy.com, LLC"
14,15mincovid19test.com,184.168.131.241,26496,"GoDaddy.com, LLC"
15,15mintestcoronavirus.com,184.168.131.241,26496,"GoDaddy.com, LLC"
16,360corona.com,184.168.221.33,26496,"GoDaddy.com, LLC"
17,360corona.com,184.168.221.35,26496,"Mega International Investment Trust Co., Ltd."
18,1stcoronatest.com,184.168.221.37,26496,"GoDaddy.com, LLC"
19,360corona.com,184.168.221.39,26496,"GoDaddy.com, LLC"
20,2020-covid-19.com,184.168.221.40,26496,"GoDaddy.com, LLC"
21,2020-covid-19.com,184.168.221.42,26496,"GoDaddy.com, LLC"
22,1stcoronatest.com,184.168.221.42,26496,"GoDaddy.com, LLC"


To match a column against several values at once, pass a list to `isin`:

In [20]:
df[df['asn'].isin([14061,52284])]

Unnamed: 0,query,resolution,asn,isp_org
5,academydea.com,165.227.16.98,14061,Digital Ocean
13,abccoronavirus.com,181.214.86.147,52284,CH GROUP CORP
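
`isin` is shorthand for a chain of equality tests joined with `|`. A minimal sketch on a hypothetical four-row frame (`sample`, with values copied from the table above) shows the equivalence:

```python
import pandas as pd

# Hypothetical four-row sample copied by hand from the table above.
sample = pd.DataFrame({
    "query": ["academydea.com", "abccoronavirus.com", "acccorona.com", "07q6zf664zv.mceco.co"],
    "asn": [14061, 52284, 26496, 15169],
})

wanted = [14061, 52284]

# One call, however long the list is.
by_isin = sample[sample["asn"].isin(wanted)]

# The same filter written out by hand; each comparison needs its own parentheses.
by_or = sample[(sample["asn"] == 14061) | (sample["asn"] == 52284)]

print(by_isin["query"].tolist())
```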


To invert a match (a NOT match), prefix the boolean mask with the `~` operator:

In [22]:
df[~df['asn'].isin([26496,22612])]

Unnamed: 0,query,resolution,asn,isp_org
2,19covid-gouv12.com,111.90.156.123,201133,Verdina Ltd.
5,academydea.com,165.227.16.98,14061,Digital Ocean
6,07q6zf664zv.mceco.co,172.217.12.179,15169,Google
7,464004iiq76.waxcapital.net,172.217.13.83,15169,Google
8,07q6zf664zv.mceco.co,172.217.16.147,15169,Google
9,464004iiq76.waxcapital.net,172.217.164.115,15169,Google
10,07q6zf664zv.mceco.co,172.217.22.19,15169,Google
11,464004iiq76.waxcapital.net,172.217.3.115,15169,Google
12,07q6zf664zv.mceco.co,173.194.73.121,15169,Google
13,abccoronavirus.com,181.214.86.147,52284,CH GROUP CORP
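
Negated masks combine with other masks using `&` (and) and `|` (or). This sketch, on a hypothetical four-row frame built from values in the table above, keeps the `.com` domains that are not in the excluded ASNs:

```python
import pandas as pd

# Hypothetical four-row sample copied by hand from the table above.
sample = pd.DataFrame({
    "query": ["acccorona.com", "19covid-gouv12.com", "academydea.com", "07q6zf664zv.mceco.co"],
    "asn": [26496, 201133, 14061, 15169],
})

excluded = [26496, 22612]

# Name each mask first; this avoids operator-precedence surprises with ~, & and |.
not_excluded = ~sample["asn"].isin(excluded)
is_com = sample["query"].str.endswith(".com")

result = sample[not_excluded & is_com]
print(result["query"].tolist())
```

When the masks are written inline instead, each one should be wrapped in parentheses, because `&` and `|` bind more tightly than comparisons such as `==`.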


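The filtered result is itself a DataFrame, so further methods can be chained onto it, for example to count or list the remaining ISPs:

In [27]: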
df[~df['asn'].isin([26496,22612])].isp_org.value_counts()

Google           7
CH GROUP CORP    1
Verdina Ltd.     1
Digital Ocean    1
Name: isp_org, dtype: int64

In [28]:
df[~df['asn'].isin([26496,22612])].isp_org.unique()

array(['Verdina Ltd.', 'Digital Ocean', 'Google', 'CH GROUP CORP'],
      dtype=object)
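
The two summaries above answer slightly different questions. As a sketch on a hypothetical Series of ISP names (not the full filtered data), `value_counts` gives frequencies sorted from most to least common, `unique` gives the distinct values in order of first appearance, and `nunique` gives just their number:

```python
import pandas as pd

# Hypothetical sample of ISP names; the counts do not match the real data.
isp = pd.Series(
    ["Verdina Ltd.", "Digital Ocean", "Google", "Google", "Google", "CH GROUP CORP"],
    name="isp_org",
)

counts = isp.value_counts()   # frequency of each value, most common first
distinct = isp.unique()       # distinct values, in order of first appearance
n_distinct = isp.nunique()    # number of distinct values

print(counts.index[0], counts.iloc[0])
print(list(distinct), n_distinct)
```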