PAM is a near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences. PAM largely avoids returning redundant and spurious sequences, unlike API mining approaches based on frequent pattern mining.
This is an implementation of the API miner from our paper:
Parameter-Free Probabilistic API Mining across GitHub
J. Fowkes and C. Sutton. FSE 2016.
Installing in Eclipse
It's also possible to export a runnable jar from Eclipse using the File -> Export... menu option.
Compiling a Runnable Jar
To compile a standalone runnable jar, simply run
in the top-level directory (note that this requires Maven). This will create the standalone runnable jar api-mining-1.0.jar in the api-mining/target subdirectory. The main class is apimining.pam.main.PAM (see below).
PAM uses a probabilistic model to determine which API patterns are the most interesting in a given dataset.
Mining API Patterns
The main class apimining.pam.main.PAM mines API patterns from a specified API call sequence file. It has the following command line options:
- -f API call sequence file to mine (in ARFF format, see below)
- -o output file
- -i max. no. iterations
- -s max. no. structure steps
- -r max. runtime (min)
- -l log level (INFO/FINE/FINER/FINEST)
- -v log to console instead of log file
See the individual file javadocs in apimining.pam.main.PAM for information on the Java interface. In Eclipse you can set command line arguments for the PAM interface using the Run Configurations... menu option.
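When PAM is driven from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch in Python, not part of PAM itself: the helper name `build_pam_command` and its parameter names are hypothetical, and only the flags listed above are used.

```python
from typing import List, Optional

# Log levels accepted by the -l option, as listed above.
LOG_LEVELS = ("INFO", "FINE", "FINER", "FINEST")


def build_pam_command(
    jar_path: str,
    sequence_file: str,
    output_file: str,
    max_iterations: Optional[int] = None,
    max_structure_steps: Optional[int] = None,
    max_runtime_min: Optional[int] = None,
    log_level: Optional[str] = None,
    log_to_console: bool = False,
) -> List[str]:
    """Build the argument list for running the PAM jar (hypothetical helper)."""
    cmd = ["java", "-jar", jar_path]
    if max_iterations is not None:
        cmd += ["-i", str(max_iterations)]       # max. no. iterations
    if max_structure_steps is not None:
        cmd += ["-s", str(max_structure_steps)]  # max. no. structure steps
    if max_runtime_min is not None:
        cmd += ["-r", str(max_runtime_min)]      # max. runtime (min)
    cmd += ["-f", sequence_file, "-o", output_file]
    if log_level is not None:
        if log_level not in LOG_LEVELS:
            raise ValueError(f"unknown log level: {log_level!r}")
        cmd += ["-l", log_level]
    if log_to_console:
        cmd.append("-v")                         # log to console, not a file
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the Python standard library, which avoids shell quoting issues with paths that contain spaces.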
As a complete example of using the command line interface on a runnable jar, we can mine the provided dataset netty.arff as follows:
$ java -jar api-mining/target/api-mining-1.0.jar -i 1000 -f datasets/calls/all/netty.arff -o patterns.txt -v
which will write the mined API patterns to
patterns.txt. Omitting the
-v flag will redirect logging to a log file in
PAM takes as input a list of API call sequences in ARFF file format.
The ARFF format is very simple and best illustrated by example. The first few lines from
@relation netty
@attribute fqCaller string
@attribute fqCalls string
@data
'com.torrent4j.net.peerwire.AbstractPeerWireMessage.write','io.netty.buffer.ChannelBuffer.writeByte'
'com.torrent4j.net.peerwire.messages.BitFieldMessage.writeImpl','io.netty.buffer.ChannelBuffer.writeByte'
'com.torrent4j.net.peerwire.messages.BitFieldMessage.readImpl','io.netty.buffer.ChannelBuffer.readable io.netty.buffer.ChannelBuffer.readByte'
'com.torrent4j.net.peerwire.messages.BlockMessage.writeImpl','io.netty.buffer.ChannelBuffer.writeInt io.netty.buffer.ChannelBuffer.writeInt io.netty.buffer.ChannelBuffer.writeBytes'
'com.torrent4j.net.peerwire.messages.BlockMessage.readImpl','io.netty.buffer.ChannelBuffer.readInt io.netty.buffer.ChannelBuffer.readInt io.netty.buffer.ChannelBuffer.readableBytes io.netty.buffer.ChannelBuffer.readBytes'
The @relation declaration names the dataset, and the following two @attribute statements declare that the dataset consists of two comma-separated attributes:
- fqCaller: the fully-qualified name of the client method, enclosed in single quotes
- fqCalls: a space-separated list of fully-qualified names of API method calls, enclosed in single quotes
The dataset is listed after the @data declaration: each line contains a specific method (fqCaller) and its API call sequence (fqCalls). Note that the fqCaller attribute can be empty for PAM and UPMiner; it is only required for MAPO (see below).
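To inspect a call sequence file outside of PAM, the format described above can be read with a few lines of Python. This is a minimal sketch rather than PAM's own reader: the function name `parse_call_sequences` is hypothetical, and it assumes that each data row has exactly two single-quoted fields and that names contain no embedded single quotes.

```python
import re
from typing import List, Tuple

# One data row: 'fqCaller','call1 call2 ...' (fqCaller may be empty).
_ROW = re.compile(r"^'([^']*)'\s*,\s*'([^']*)'$")


def parse_call_sequences(arff_text: str) -> List[Tuple[str, List[str]]]:
    """Return (fqCaller, [fqCalls...]) pairs from the @data section."""
    rows = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        if not in_data:
            # Skip the header until the @data declaration is reached.
            in_data = line.lower() == "@data"
            continue
        match = _ROW.match(line)
        if match is None:
            raise ValueError(f"unrecognised data row: {line!r}")
        caller, calls = match.groups()
        rows.append((caller, calls.split()))
    return rows
```

Each returned pair holds the client method name and its API call sequence as a list, which is convenient for counting sequence lengths or filtering by library before mining.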
Note that while this example uses Java, PAM is language agnostic and can use API call sequences from any language.
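Because the input is just a list of named call sequences, a dataset for another language can be produced by any script that writes this layout. The sketch below is one way to do so in Python; the function name `format_call_sequences` and the example names in the usage note are hypothetical, and it assumes names contain no single quotes or whitespace.

```python
from typing import Iterable, List, Tuple


def format_call_sequences(
    relation: str, rows: Iterable[Tuple[str, List[str]]]
) -> str:
    """Render (fqCaller, [fqCalls...]) pairs in the ARFF layout shown above."""
    lines = [
        f"@relation {relation}",
        "@attribute fqCaller string",
        "@attribute fqCalls string",
        "@data",
    ]
    for caller, calls in rows:
        # Quotes or whitespace inside a name would break the simple
        # quoted, space-separated layout, so reject them up front.
        for name in [caller, *calls]:
            if "'" in name or any(ch.isspace() for ch in name):
                raise ValueError(f"unsupported character in name: {name!r}")
        joined = " ".join(calls)
        lines.append(f"'{caller}','{joined}'")
    return "\n".join(lines) + "\n"
```

For instance, `format_call_sequences("demo", [("pkg.mod.f", ["lib.A.open", "lib.A.close"])])` yields a header followed by a single data row for the hypothetical method `pkg.mod.f`.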
PAM outputs a list of the most interesting API call patterns (i.e. subsequences of the original API call sequences) ordered by their probability under the model.
For example, the first few lines in the output file patterns.txt for the usage example above are:
prob: 0.04878 [io.netty.channel.Channel.write]
prob: 0.04065 [io.netty.channel.ExceptionEvent.getCause, io.netty.channel.ExceptionEvent.getChannel]
prob: 0.04065 [io.netty.channel.ChannelHandlerContext.getChannel]
prob: 0.03252 [io.netty.channel.Channel.close]
See the accompanying paper for details.
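For post-processing, the output shown above can be loaded into Python. This is a minimal sketch, assuming every pattern is written as `prob: <number>` followed by a bracketed, comma-separated list of calls; the function name `parse_patterns` is hypothetical, and the regular expression tolerates either spaces or line breaks between the two parts.

```python
import re
from typing import List, Tuple

# "prob: <number>" followed by "[call, call, ...]", separated by any whitespace.
_PATTERN = re.compile(
    r"prob:\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\s*\[([^\]]*)\]"
)


def parse_patterns(text: str) -> List[Tuple[float, List[str]]]:
    """Return (probability, [calls...]) pairs in the order they appear."""
    patterns = []
    for match in _PATTERN.finditer(text):
        prob = float(match.group(1))
        calls = [c.strip() for c in match.group(2).split(",") if c.strip()]
        patterns.append((prob, calls))
    return patterns
```

The pairs keep the file's ordering, so the first entry is the pattern with the highest probability under the model.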
Java API Call Extractor
The class apimining.java.APICallExtractor contains our 'best-effort' API call sequence extractor for Java source files. We used it to create the API call sequence datasets for our paper.
It takes folders of API client source files as input and generates API call sequence files (in ARFF format) for each API library given. For best performance, it requires a folder of namespaces used in the libraries so that it can resolve wildcarded namespaces. These can be collected using the provided Wildcard Namespace Collector class: apimining.java.WildcardNamespaceCollector.
See the individual class javadocs in apimining.java for details of their use.
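To illustrate why a namespace listing helps with wildcarded imports, the sketch below resolves a simple class name against a set of known fully-qualified class names. It is an illustration in Python only, not the logic of apimining.java.APICallExtractor: the function name `resolve_simple_name` is hypothetical, and the in-memory set `known_classes` merely stands in for the namespace folder, whose file format is not described here.

```python
from typing import Iterable, Optional, Set


def resolve_simple_name(
    simple_name: str,
    wildcard_imports: Iterable[str],
    known_classes: Set[str],
) -> Optional[str]:
    """Resolve e.g. 'ChannelBuffer' via imports such as 'io.netty.buffer.*'.

    Returns the fully-qualified name when exactly one wildcard import
    matches a known class, and None when there is no match or the
    match is ambiguous.
    """
    matches = []
    for imp in wildcard_imports:
        if not imp.endswith(".*"):
            continue  # only wildcard imports need a namespace lookup
        candidate = imp[:-1] + simple_name  # 'pkg.*' -> 'pkg.' + name
        if candidate in known_classes:
            matches.append(candidate)
    return matches[0] if len(matches) == 1 else None
```

Without the set of known classes there is no way to tell which of several wildcard imports a simple name such as `ChannelBuffer` belongs to, which is why calls through unresolved names cannot be given fully-qualified names.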
MAPO and UPMiner
For comparison purposes, we implemented the API miners MAPO and UPMiner from scratch using the Weka hierarchical clusterer. These are provided in the apimining.mapo.MAPO and apimining.upminer.UPMiner classes respectively. They have the following command line options:
- -f API call sequence file to mine (in ARFF format, see above)
- -o output folder
- -s minimum support threshold
See the individual class files for information on the Java interface. Note that these are not particularly fast implementations as Weka's hierarchical clusterer is rather slow and inefficient. Moreover, as both API miners are based on frequent pattern mining algorithms, they can suffer from pattern explosion (this is a known problem with frequent pattern mining).
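The effect of the minimum support threshold, and the pattern explosion it can lead to, can be seen on a toy example. The following Python sketch uses the standard definition of support (the number of sequences that contain a pattern as a subsequence, gaps allowed) and a brute-force enumeration. It is not the MAPO or UPMiner algorithm, and all function names and the toy call names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def is_subsequence(pattern: Sequence[str], sequence: Sequence[str]) -> bool:
    """True if pattern occurs in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    # 'call in remaining' advances the iterator past the first match,
    # so later calls must appear after earlier ones.
    return all(call in remaining for call in pattern)


def support(pattern: Sequence[str], sequences: List[List[str]]) -> int:
    """Number of sequences that contain the pattern."""
    return sum(1 for seq in sequences if is_subsequence(pattern, seq))


def frequent_patterns(
    sequences: List[List[str]], min_support: int
) -> Dict[Tuple[str, ...], int]:
    """Brute force: every subsequence with support >= min_support."""
    candidates = set()
    for seq in sequences:
        for k in range(1, len(seq) + 1):
            for idx in combinations(range(len(seq)), k):
                candidates.add(tuple(seq[i] for i in idx))
    result = {}
    for pattern in candidates:
        count = support(pattern, sequences)
        if count >= min_support:
            result[pattern] = count
    return result


if __name__ == "__main__":
    toy = [
        ["open", "read", "close"],
        ["open", "read", "read", "close"],
        ["open", "close"],
    ]
    for threshold in (3, 2, 1):
        print(threshold, len(frequent_patterns(toy, threshold)))
```

Lowering the threshold on the toy data makes the number of returned patterns grow quickly, because every subsequence of a frequent pattern is itself frequent. This redundancy is what the probabilistic approach in PAM is designed to avoid.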
All datasets used in the paper are available in the
- datasets/calls/all contains API call sequences for each of the 17 Java libraries described in our paper (see Table 1)
- datasets/calls/train contains the subset of API call sequences used as the 'training set' in the paper
Both datasets use the ARFF file format described above. In addition, so that it is possible to replicate our evaluation, we have provided the Java source files for:
- each of the library client classes in
- the library example classes in
- the namespaces necessary for our API Call Extractor in
The datasets/source/test_train_split subdirectory details the training/test set assignments for each client class.
Please report any bugs using GitHub's issue tracker.
This algorithm is released under the GNU GPLv3 license. Other licenses are available on request.