Python script to mine for GitHub issues + comments and classify them using analysis and detection of information types of open source software issue discussions. Our tool automates the process of querying GitHub for particular issues and feeding them into classification models. It supports both interactive and non-interactive modes.
- Ensure Python
3.9.1
and corrosponding pip3.9
are installed - Install requirements:
pip3.9 install -r requirements.txt
- Download nltk stop word packages (only need to do it once per environment)
- Download
nltk
stop words packageswordnet
andstopwords
via the python terminal
import nltk nltk.download('wordnet') nltk.download('stopwords')
- If you run intl SSL error, make sure python ssl certificates are installed:
bash /Applications/Python\ 3.9/Install\ Certificates.command
- Download
- Add a GitHub Personal Access Token in the access token file:
access_token.json
. Replace<YOUR_GITHUB_PERSONAL_ACCESS_TOKEN>
with your token.
Instructions on how to create a GitHub Personal Access Token.
There are two ways to run this program, either through the interactive command line, or by directly passing in command line argument via argparser
:
Below is the -h
help/man page:
Usage: python mine-issues.py [-h] [-i] [-v] [-m MAX_RESULTS] [-s SORT_BY] [-p PREFIX_FILENAME] query
positional arguments:
query
optional arguments:
-h, --help show this help message and exit
-i, --interactive **toggle the interactive CLI**
-v, --verbose print additional logs
-m MAX_RESULTS, --max-results MAX_RESULTS
(int) max results to query
-s SORT_BY, --sort-by SORT_BY
sort by one of: [comments, best-match]
-p PREFIX_FILENAME, --prefix-filename PREFIX_FILENAME
(string) file name prefix for result output files
Run python3.9 mine-issues.py <QUERY>
with the following optional parameters:
-i
or --interactive
: will cause the script to trigger the interactive CLI and ignore all other params. See screenshot above of the interface.
-v
or --verbose
: will print out extra logging such as printing out the entire result object in neat JSON format.
-m
or --max-results
: filter the number of results we want to retrieve from the search query (1000 by default).
-s
or --sort-by
: Pick either comments
or best-match
to sort the search query result by (comments
by default).
-p
or --prefix-filename
: String to prefix to the resulting file name (results_
by default).
-f
or --filter
: String of one of the 16 categories to filter out from results. Can provide multiple args (i.e -f foo -f bar
...)
Categories that can be filtered:
['Expected Behaviour', 'Motivation', 'Observed Bug Behaviour', 'Bug Reproduction', 'Investigation and Exploration', 'Solution Discussion', 'Contribution and Commitment', 'Task Progression', 'Testing', 'Future Plan', 'New Issues and Requests', 'Solution Usage', 'WorkArounds', 'Issue Content Management', 'Action on Issue', 'Social Conversation']
Running tests to ensure that the script is functioning properly. Travis CI build also runs this as part of build status checks.
- Ensure that
pytest
is properly set up for your python3.9
env. - Simply run
pytest
to check that all tests are passing.
/models
- Contains the serialized files of the classification model.
/utils
- Contains utility function files, such as IO, filtering results and processing comments.
/test
- Contains test file being ran by pytest
.
/result
- Folder to output result files to. Contains a .gitignore
to ignore all files in this folder to prevent results from being committed.
/config
- Contains configuration files for the app to run, such as the personal access token.
Please cite our tool using the bibliographic information on Zenodo.