`astminer`

A library for mining path-based representations of code (and more), supported by the Machine Learning Methods for Software Engineering group at JetBrains Research.

Supported input languages:

Java
Python
C/C++
JavaScript

Version history

See the changelog (changelog.md).

About

astminer was first implemented as part of the pipeline in the code style extraction project and was later converted into a reusable tool.

Currently, it supports extraction of:

Path-based representations of files
Path-based representations of methods
Raw ASTs

Supported languages are Java, Python, C/C++, and JavaScript; the tool is designed to be easily extensible to others.

For the output format, see the Output format section below.

Usage

There are two ways to use astminer:

As a standalone CLI tool with pre-implemented logic for common processing and mining tasks;
As a Gradle dependency integrated into your Kotlin/Java mining pipelines.

Using `astminer` CLI

Building or installing `astminer` CLI

astminer CLI can be either built from source or pulled as a pre-built Docker image.

Building locally

Running ./cli.sh triggers a Gradle build on the first run, so no separate build step is needed.

Installing the Docker image

The C++ parser in astminer relies on g++. To avoid misconfiguring this and any future external dependencies, you can run astminer from a Docker container.

Install the image with the latest release by pulling it from Docker Hub:

docker pull voudy/astminer

To rebuild the image locally, run:

docker build -t voudy/astminer .

Running `astminer` CLI

Run:

./cli.sh optionName parameters

where optionName is one of the following options:

Preprocess

Run preprocessing on a C/C++ project to unfold #define directives. In the other tasks, if you feed in C/C++ files that contain macros, the macros are dropped along with their occurrences in the code.

./cli.sh preprocess --project path/to/project --output path/to/preprocessedProject

Parse

Extract ASTs from all the files in supported languages.

./cli.sh parse --lang py,java,c,cpp,js --project path/to/project --output path/to/result --storage dot

PathContexts

Extract path contexts from all the files in the supported languages and store them in the form fileName triplesOfPathContexts.

./cli.sh pathContexts --lang py,java,c,cpp,js --project path/to/project --output path/to/results --maxL L --maxW W --maxContexts C --maxTokens T --maxPaths P

Code2vec

Extract data suitable as input for the code2vec model. Parse all files written in the specified languages into ASTs, split them into methods, and store the result in the form method|name triplesOfPathContexts.

./cli.sh code2vec --lang py,java,c,cpp,js --project path/to/project --output path/to/results --maxL L --maxW W --maxContexts C --maxTokens T --maxPaths P --split-tokens --granularity method

Using `astminer` as a dependency

Import

astminer is available in the JetBrains Space package repository. You can add the dependency in your build.gradle file:

repositories {
    maven {
        url "https://packages.jetbrains.team/maven/p/astminer/astminer"
    }
}

dependencies {
    implementation 'io.github.vovak:astminer:<VERSION>'
}

If you use build.gradle.kts:

repositories {
    maven(url = uri("https://packages.jetbrains.team/maven/p/astminer/astminer"))
}

dependencies {
    implementation("io.github.vovak", "astminer", "<VERSION>")
}

Local development

To use a specific version of the library, navigate to the required branch and build a local version of astminer:

./gradlew publishToMavenLocal

After that, add mavenLocal() to the repositories section of your Gradle configuration.

Examples

If you want to use astminer as a library in your Java/Kotlin-based data mining tool, check the following examples:

A few simple usage examples can be run with ./gradlew run.
A somewhat more verbose example of usage in Java is available as well.

Please consider trying Kotlin for your data mining pipelines: in our experience, it is much better suited for data collection and transformation tools.

Output format

For path-based representations, astminer supports two output formats. In both of them, it stores four .csv files:

node_types.csv contains numeric ids and the corresponding node types with directions (up/down, as described in the paper);
tokens.csv contains numeric ids and the corresponding tokens;
paths.csv contains numeric ids and AST paths in the form of space-separated sequences of node type ids;
path_contexts.csv contains labels and sequences of path contexts (triples of two tokens and a path between them).
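To make the term "path context" concrete, here is a minimal sketch that enumerates leaf-to-leaf paths in a tiny hand-built tree for `x = 1 + y`. Everything in it is an assumption made for illustration: ToyNode and ToyPathContexts are hypothetical classes, not part of astminer, and the `^` (up) and `_` (down) markers are a toy notation rather than the library's own encoding. It also prints tokens and node types directly instead of the numeric ids that the .csv files use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy AST node for illustration; not astminer's node class. */
final class ToyNode {
    final String type;   // node type, e.g. "Assign"
    final String token;  // token text for leaves, null for inner nodes
    final ToyNode parent;
    final List<ToyNode> children = new ArrayList<>();

    ToyNode(String type, String token, ToyNode parent) {
        this.type = type;
        this.token = token;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }
}

final class ToyPathContexts {
    /** Collects leaves in left-to-right order. */
    static void collectLeaves(ToyNode node, List<ToyNode> out) {
        if (node.children.isEmpty()) {
            out.add(node);
            return;
        }
        for (ToyNode child : node.children) {
            collectLeaves(child, out);
        }
    }

    /** Returns the chain from the given node up to the root, inclusive. */
    static List<ToyNode> ancestors(ToyNode node) {
        List<ToyNode> chain = new ArrayList<>();
        for (ToyNode n = node; n != null; n = n.parent) {
            chain.add(n);
        }
        return chain;
    }

    /** Builds "startToken,path,endToken" for two distinct leaves. */
    static String pathContext(ToyNode start, ToyNode end) {
        List<ToyNode> up = ancestors(start);
        List<ToyNode> down = ancestors(end);
        // Both chains end at the root; walk down to the lowest common ancestor.
        int i = up.size() - 1;
        int j = down.size() - 1;
        while (i > 0 && j > 0 && up.get(i - 1) == down.get(j - 1)) {
            i--;
            j--;
        }
        StringBuilder path = new StringBuilder();
        for (int k = 0; k < i; k++) {
            path.append(up.get(k).type).append('^');      // moving up
        }
        path.append(up.get(i).type);                       // top node of the path
        for (int k = j - 1; k >= 0; k--) {
            path.append('_').append(down.get(k).type);     // moving down
        }
        return start.token + "," + path + "," + end.token;
    }

    /** Enumerates path contexts for every ordered pair of leaves (left before right). */
    static List<String> allPathContexts(ToyNode root) {
        List<ToyNode> leaves = new ArrayList<>();
        collectLeaves(root, leaves);
        List<String> result = new ArrayList<>();
        for (int a = 0; a < leaves.size(); a++) {
            for (int b = a + 1; b < leaves.size(); b++) {
                result.add(pathContext(leaves.get(a), leaves.get(b)));
            }
        }
        return result;
    }

    /** Builds the toy AST of `x = 1 + y`. */
    static ToyNode sampleTree() {
        ToyNode assign = new ToyNode("Assign", null, null);
        new ToyNode("Name", "x", assign);
        ToyNode sum = new ToyNode("BinaryExpr", null, assign);
        new ToyNode("Literal", "1", sum);
        new ToyNode("Name", "y", sum);
        return assign;
    }
}
```

For the sample tree, allPathContexts yields three triples: `x,Name^Assign_BinaryExpr_Literal,1`, `x,Name^Assign_BinaryExpr_Name,y`, and `1,Literal^BinaryExpr_Name,y`. The sketch applies no limits on path length, width, or the number of contexts.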

If the code2vec-compatible format is used, each line in path_contexts.csv starts with a label, followed by a sequence of space-separated triples. Each triple consists of a start token id, a path id, and an end token id, separated by commas.

If the csv format is used, each line in path_contexts.csv contains a label, then a comma, then a sequence of ;-separated triples. Each triple consists of a start token id, a path id, and an end token id, separated by spaces.
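If you read path_contexts.csv from your own Java code, the two line layouts described above could be parsed roughly as follows. This is a sketch under stated assumptions, not code from astminer: PathContextLines and its methods are hypothetical names, the ids are assumed to fit into int, and labels in the csv format are assumed to contain no commas.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for reading path_contexts.csv lines; not part of astminer. */
final class PathContextLines {
    /** A label plus its {startTokenId, pathId, endTokenId} triples. */
    static final class Parsed {
        final String label;
        final List<int[]> triples;

        Parsed(String label, List<int[]> triples) {
            this.label = label;
            this.triples = triples;
        }
    }

    /** code2vec-style line: "label s,p,e s,p,e ...". */
    static Parsed parseCode2vecLine(String line) {
        String[] fields = line.trim().split("\\s+");
        List<int[]> triples = new ArrayList<>();
        for (int i = 1; i < fields.length; i++) {
            triples.add(parseTriple(fields[i], ","));
        }
        return new Parsed(fields[0], triples);
    }

    /** csv-style line: "label,s p e;s p e;...". Assumes the label has no comma. */
    static Parsed parseCsvLine(String line) {
        int comma = line.indexOf(',');
        String label = comma < 0 ? line.trim() : line.substring(0, comma);
        String rest = comma < 0 ? "" : line.substring(comma + 1).trim();
        List<int[]> triples = new ArrayList<>();
        if (!rest.isEmpty()) {
            for (String field : rest.split(";")) {
                triples.add(parseTriple(field.trim(), " "));
            }
        }
        return new Parsed(label, triples);
    }

    /** Splits one triple into its three numeric ids. */
    private static int[] parseTriple(String text, String separator) {
        String[] parts = text.split(separator);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 ids in triple: " + text);
        }
        return new int[] {
            Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Integer.parseInt(parts[2])
        };
    }
}
```

For example, `parseCode2vecLine("get|name 1,5,2 3,7,4")` and `parseCsvLine("get|name,1 5 2;3 7 4")` both produce the label `get|name` with the triples (1, 5, 2) and (3, 7, 4). To turn the ids back into tokens and paths, look them up in tokens.csv and paths.csv.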

Other languages

Support for a new programming language can be implemented in a few simple steps.

If there is an ANTLR grammar for the language:

Add the corresponding ANTLR4 grammar file to the antlr directory;
Run the generateGrammarSource Gradle task to generate the parser;
Implement a small wrapper around the generated parser. See JavaParser or PythonParser for an example of a wrapper.

If the language has a parsing tool that is available as a Java library:

Add the library as a dependency in build.gradle.kts;
Implement a wrapper for the parsing tool. See FuzzyCppParser for an example of a wrapper.

Contribution

We believe that astminer could find use beyond our own mining tasks.

Please help make astminer easier to use by sharing your use cases. Pull requests are welcome as well. Support for other languages and documentation are the key areas of improvement.

Citing astminer

A paper dedicated to astminer (more precisely, to its older version PathMiner) was presented at MSR'19. If you use astminer in your academic work, please cite it.

@inproceedings{kovalenko2019pathminer,
  title={PathMiner: a library for mining of path-based representations of code},
  author={Kovalenko, Vladimir and Bogomolov, Egor and Bryksin, Timofey and Bacchelli, Alberto},
  booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
  pages={13--17},
  year={2019},
  organization={IEEE Press}
}
