In [19]:
# Import Jabbic class
from jabbic.main import Jabbic

In [24]:
# Define list of query observations for which matches are to be found in a different dataset.
queries = ['c57c5709cbbe22e107352bddf79f279985f2bb9588ad75ffe0d56894a3edf654',
          'c6a01308f27e003c4a1d723f41cd05dd57004f4bc35c40a768ed7e2417c63e16',
          'c72f3e59b06d46ca364aee721d522f3e765ed4940a6654627e72be232c4f7510']

# Create a Jabbic class object.
jabbic = Jabbic(b_fn='20151031', # filename of csv file where query observations are
                t_fn='20151001', # filename of csv file where the match for each query observation is to be looked for
                f_dir='data_ip_split', # name of the directory where these two files are
                m_dir='models_october_ip_split', # name of the directory where the trained Word2Vec models are stored
                kvi=3, # positional index of queries/matches in each row
                anchors=[0, 1, 2, 4, 5], # positional indices of observations to be considered when searching for matches
                sw=0.5, # semantic weight
                queries=queries) # queries (e.g. observations in 20151031.csv file for which we want to find a match in 20151001.csv file)

As seen in both dataframes below, the file_sha column contains the query observations, which are all at positional index 3 in each row.
This means that, for a given file SHA in the base data dataframe, the match will also be a file SHA, but in the target data.

The anchor points, as defined by indices [0, 1, 2, 4, 5], are observations in all other columns.
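
To make the roles of kvi and anchors concrete, the following sketch splits one row into its key and its anchor points. It is a minimal illustration rather than Jabbic's internal code: the row values are copied from the third query used in this notebook, and it assumes that the positional indices do not count the dataframe's index column (Unnamed: 0).

```python
# Illustrative only: this mimics the meaning of `kvi` and `anchors`;
# it is not Jabbic's internal implementation.
columns = ['as_name', 'ip_network', 'ip_host', 'file_sha', 'netloc', 'path']
row = ['amazon-02 - amazon.com, inc.', '54', '149.60.150',
       'c72f3e59b06d46ca364aee721d522f3e765ed4940a6654627e72be232c4f7510',
       'www.metaappdl.com', '/c']

kvi = 3                    # position of the query/match value (file_sha)
anchors = [0, 1, 2, 4, 5]  # positions of the anchor points (all other columns)

# The key is the observation we want to match; the anchors are its context.
key = row[kvi]
anchor_columns = [columns[i] for i in anchors]
anchor_values = [row[i] for i in anchors]

print(key)
print(dict(zip(anchor_columns, anchor_values)))
```

Joining anchor_values with ', ' gives a string in the same form as the anchor strings that appear in jabbic.lm.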

In [62]:
# Preview of the base data
jabbic.bd.head(5)

Unnamed: 0,as_name,ip_network,ip_host,file_sha,netloc,path
0,"amazon-aes - amazon.com, inc.",50,19.109.131,c53d7ae782c0ef13e0215cfe995b973a421a5adf4895af...,get.desk2opapps.com,/downloadmanager/getmar
1,yunet-as,109,121.100.32,c540fa5d902cec1ea1da7ef8b463b6bd2c01e775c4c994...,109.121.100.32,/mlccbh/deploy/mlccbh/application/program
2,"chinanet-backbone no.31,jin-rong street",115,231.171.46,c5503ff7bfd2d56aee8e280e897c14e0346967f24a5a3c...,ftp-idc.pconline.com.cn,/356e3ba60a561b80067d3d63e7a0dad2/pub/download...
3,ovh,5,39.99.49,c5620bf52b79d4c4d29f555c21004735af3f0a2d359fcd...,zilliontoolkitusa.info,/download/v356
4,internode-as internode pty ltd,203.0.178,91,c563e474950ffcaca412c31f43ef6e47d78a945113be26...,www.sportspage.com.au,/downloads


In [63]:
# Preview of the target data
jabbic.td.head(5)

Unnamed: 0,as_name,ip_network,ip_host,file_sha,netloc,path
0,"highwinds3 - highwinds network group, inc.",69.0,16.175.10,eb2f9b4ca6a01f6eaae5d9d54a846d2e1066b9ef00b19e...,dl.randkeygen.com,/25/all/hd/in
1,"amazon-aes - amazon.com, inc.",23.0,23.167.169,eb3dba7b53b59bedc4fd2571933594463d250f92e29b8a...,google-chrome.todownload.com,/get/file/id/853326
2,"amazon-aes - amazon.com, inc.",184.73,238.150,eb3dba7b53b59bedc4fd2571933594463d250f92e29b8a...,google-chrome.todownload.com,/get/file/id/853326
3,"amazon-02 - amazon.com, inc.",54.0,213.72.9,eb3e870755cdfaf7b1c84c56a1a3dfaaac2059a849d881...,best-gets.info,/hp
4,"amazon-aes - amazon.com, inc.",54.0,197.245.47,eb439456afd48214d2f9b6bcebbb8f09815c43457ca837...,get.0142g.info,/1443678706/1443678706/1443678706


We are now ready to find the file hashes in target data (20151001.csv) that best represent the query file hashes in base data (20151031.csv) which were defined in the queries variable.

In [47]:
# There are only 3 query file hashes that we are interested in for this example.
# This is a small number of query observations, so we look for all their matches at once.
# If, for example, we had 300,000 query observations, we might want to split them into 20 batches (or more, depending
# on how much RAM is available), and then find the matches for each batch of query observations separately.
jabbic.find_matches(n_batches=1)
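
How Jabbic splits the queries internally is not shown in this notebook. As a rough sketch of the idea only, the hypothetical helper split_into_batches below (not part of Jabbic) divides a list of queries into n_batches contiguous chunks of near-equal size, so that each chunk can be processed separately and memory use stays bounded.

```python
def split_into_batches(items, n_batches):
    """Split `items` into `n_batches` contiguous chunks of near-equal size.

    Hypothetical helper for illustration; not Jabbic's implementation.
    """
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    size, extra = divmod(len(items), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches receive one additional item.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


# Stand-in queries: ten integers split into three batches.
batches = split_into_batches(list(range(10)), 3)
print([len(batch) for batch in batches])
```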





We can now look at the matches returned by Jabbic for each of the query file hashes, but first let us visualise the dataframe containing the query file hashes and their anchors.

In [49]:
# Queries dataframe
jabbic.q_df

Unnamed: 0,as_name,ip_network,ip_host,file_sha,netloc,path
7,"amazon-02 - amazon.com, inc.",54,148.248.147,c57c5709cbbe22e107352bddf79f279985f2bb9588ad75...,admin.magnodnw.com,/vxlqzveknnmrocggumlvgig-nhtf_h07zrqqxyyy50z7i...
28,ovh,37,59.30.196,c6a01308f27e003c4a1d723f41cd05dd57004f4bc35c40...,37.59.30.196,/download/dlshr
36,"amazon-02 - amazon.com, inc.",54,149.60.150,c72f3e59b06d46ca364aee721d522f3e765ed4940a6654...,www.metaappdl.com,/c


In [50]:
# Look at local matches.
# Each sublist contains:
    # query observation
    # match observation
    # row index of query in base dataframe
    # row index of match in target dataframe
    # anchor points of query observation
    # anchor points of match observation
    # Ratcliff/Obershelp similarity
jabbic.lm

[['c57c5709cbbe22e107352bddf79f279985f2bb9588ad75ffe0d56894a3edf654',
  '873c02f2750634d6887f401db7ac1eac65242eb8ecff4d74b2c62b2e2725246f',
  0,
  301420,
  'amazon-02 - amazon.com, inc., 54, 148.248.147, admin.magnodnw.com, /vxlqzveknnmrocggumlvgig-nhtf_h07zrqqxyyy50z7i2pcqniypphozqrofnnwrpkg_dnsm35dwodnumgjxax3d4xiulyqzq8rj1f2vp17iwoj8aiugunuyku4o369gwu33oweloy2nfpzn_e36ukg-m2utqcs42-makh3adekctijahsifm_sywwzxkujhlr7de5dzimogokxmkjfgc0g29_cgz8yk2xi0pxhirzlv5hfqpj8nrbcebwxrpkpreg87vlzpt77xhfjktp6ovjc4iajcefqmtega_19jntxxdpmitdz_wyvrcacuq773y4hv9xcsyceq6bnqc9fxydyxhozvah8axpt5ft8rqvitvk1mcwbq8qiokukt-o065cwbovwmz39yfqeurvaopxvtdnfrb-7858g7f_lm54bpclosm-9zu-s7ajnyg8z',
  'amazon-02 - amazon.com, inc., 54, 231.162.96, s3-us-west-2.amazonaws.com, /cyngn-oneclick/builds/2.0.3.0/lib/net-4.0',
  0.54],
 ['c6a01308f27e003c4a1d723f41cd05dd57004f4bc35c40a768ed7e2417c63e16',
  '64161437c15d48cb2c18d82b6267a5e3b4d5cd717b57ea7f110126a0784b3cc3',
  1,
  570035,
  'ovh, 37, 59.30.196, 37.59.30.196, 

Now let us see how to interpret the findings. We take as an example the last query file hash and its match file hash.

As seen below,

The query file hash is at index 2 in the queries dataframe
The match file hash is at index 643392 in the target data (the 20151001.csv file)

The row data for each query and match observation is already stored in jabbic.lm, but as a proof of concept we show
how to access it separately if needed.

In [54]:
jabbic.lm[2]

['c72f3e59b06d46ca364aee721d522f3e765ed4940a6654627e72be232c4f7510',
 '670c44859ab4bbaccbc0dca6d8a6ccedcb1ee5a4feb2bb0ca1d79f1aa4b7f619',
 2,
 643392,
 'amazon-02 - amazon.com, inc., 54, 149.60.150, www.metaappdl.com, /c',
 'amazon-02 - amazon.com, inc., 54, 149.60.150, www.jdtlrtaenraogggdsdraccapitaltour.com, /c',
 0.89]
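
Because each entry of jabbic.lm is a plain list, it can help to give the positions names. The sketch below wraps the entry shown above in a namedtuple. The field names are hypothetical labels chosen here to follow the order described in the comments of the jabbic.lm cell; they are not attributes provided by Jabbic.

```python
from collections import namedtuple

# Hypothetical field names, in the order documented for each jabbic.lm sublist.
LocalMatch = namedtuple('LocalMatch', [
    'query', 'match', 'query_row', 'match_row',
    'query_anchors', 'match_anchors', 'similarity',
])

# The entry is copied from the output of jabbic.lm[2].
entry = ['c72f3e59b06d46ca364aee721d522f3e765ed4940a6654627e72be232c4f7510',
         '670c44859ab4bbaccbc0dca6d8a6ccedcb1ee5a4feb2bb0ca1d79f1aa4b7f619',
         2,
         643392,
         'amazon-02 - amazon.com, inc., 54, 149.60.150, www.metaappdl.com, /c',
         'amazon-02 - amazon.com, inc., 54, 149.60.150, www.jdtlrtaenraogggdsdraccapitaltour.com, /c',
         0.89]

local_match = LocalMatch(*entry)
print(local_match.query_row, local_match.match_row, local_match.similarity)
```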

In [60]:
# Row data of the query observation is at index 2 in the q_df dataframe.
jabbic.q_df.iloc[[2]]

Unnamed: 0,as_name,ip_network,ip_host,file_sha,netloc,path
36,"amazon-02 - amazon.com, inc.",54,149.60.150,c72f3e59b06d46ca364aee721d522f3e765ed4940a6654...,www.metaappdl.com,/c


In [61]:
# Row data of the match file hash is at index 643392 in the target data (20151001.csv file)
jabbic.td.iloc[[643392]]

Unnamed: 0,as_name,ip_network,ip_host,file_sha,netloc,path
643392,"amazon-02 - amazon.com, inc.",54,149.60.150,670c44859ab4bbaccbc0dca6d8a6ccedcb1ee5a4feb2bb...,www.jdtlrtaenraogggdsdraccapitaltour.com,/c


#### Possible interpretation of results in malware detection

We see that the Ratcliff/Obershelp (R/O) similarity between the anchors of the query observation and the anchors of the match observation is 0.89, which means that the returned match is a good representative, both semantically and relationally, of the query observation. The R/O similarity is used as a confidence factor that measures how alike the anchors (contexts) of the query observation and its match observation are. It takes values between 0 and 1, where 1 means that the query and match observations are fully representative of one another and 0 means that they are not representative of one another at all.
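
For readers who want to experiment with this kind of score, Python's standard library offers difflib.SequenceMatcher, whose ratio() method follows an approach closely related to the Ratcliff/Obershelp "gestalt pattern matching" algorithm. The sketch below applies it to the two anchor strings from jabbic.lm[2]. Whether Jabbic computes its score this way, and on exactly these strings, is an assumption, so the value printed here need not equal the 0.89 reported above.

```python
from difflib import SequenceMatcher

# Anchor strings copied from the output of jabbic.lm[2].
query_anchors = 'amazon-02 - amazon.com, inc., 54, 149.60.150, www.metaappdl.com, /c'
match_anchors = ('amazon-02 - amazon.com, inc., 54, 149.60.150, '
                 'www.jdtlrtaenraogggdsdraccapitaltour.com, /c')


def ro_similarity(a, b):
    """Return a similarity in [0, 1]: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b).ratio()


score = ro_similarity(query_anchors, match_anchors)
print(round(score, 2))
```

Identical strings score 1.0 and strings with no characters in common score 0.0, which matches the range described above.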

In the example above we can see that both the query file hash c72f3... and match 670c4... are representative of one another because they have the same as_name, ip_network, ip_host, and path. This information is useful for a number of reasons:
    
    - assume the file hash c72f3..., downloaded on 31 October 2015, is a new file which has never been downloaded before;
    - the aim is to determine whether this file is likely to be malicious or not, and maybe even infer its malware family to learn more about its behaviour;
    - also assume that file hash 670c4..., downloaded on 1 October 2015, is already known to be malicious and to belong to the malware family 'amonetize';
    - because Jabbic returned file hash 670c4... as most related to c72f3... out of all files downloaded on 1 October 2015, and because the R/O similarity was very high, we can have a high level of confidence that c72f3... is also malicious and may even belong to the same malware family as the match file hash 670c4...
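
The reasoning in the list above can be written as a simple rule. This is a minimal sketch under stated assumptions: known_labels is a hypothetical lookup of already-labelled file hashes (the 'amonetize' label is the assumption made above, not a verified fact), and the 0.8 threshold is an arbitrary illustrative value, not one recommended by Jabbic.

```python
# Hypothetical lookup of file hashes whose malware family is already known.
known_labels = {
    '670c44859ab4bbaccbc0dca6d8a6ccedcb1ee5a4feb2bb0ca1d79f1aa4b7f619': 'amonetize',
}

# Arbitrary illustrative cut-off; tune it for your own data.
SIMILARITY_THRESHOLD = 0.8


def infer_label(local_match, labels, threshold=SIMILARITY_THRESHOLD):
    """Return the match's known label if the similarity is high enough, else None."""
    match_hash = local_match[1]   # second element: match observation
    similarity = local_match[-1]  # last element: R/O similarity
    label = labels.get(match_hash)
    if label is not None and similarity >= threshold:
        return label
    return None


# Entry copied from the output of jabbic.lm[2].
entry = ['c72f3e59b06d46ca364aee721d522f3e765ed4940a6654627e72be232c4f7510',
         '670c44859ab4bbaccbc0dca6d8a6ccedcb1ee5a4feb2bb0ca1d79f1aa4b7f619',
         2,
         643392,
         'amazon-02 - amazon.com, inc., 54, 149.60.150, www.metaappdl.com, /c',
         'amazon-02 - amazon.com, inc., 54, 149.60.150, www.jdtlrtaenraogggdsdraccapitaltour.com, /c',
         0.89]

print(infer_label(entry, known_labels))
```

A result of None means either that the match has no known label or that the similarity is too low to carry the label over with confidence.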