This tool is part of the master's thesis "Clustering Software Projects at Large-scale using Time-Series". COYOTE is a clustering tool that works on time series created from git history logs.
The following instructions were performed on Linux Mint 19.2 Cinnamon. They are expected to work with Ubuntu as well.
sudo apt-get install python2.7 python-pip python-dev build-essential python-tk python-setuptools
pip install pandas
sudo pip install --upgrade setuptools
sudo pip install matplotlib
sudo pip install scikit-learn
sudo pip install docopt
COYOTE requires some files to work. These files contain data collected during the thesis and are not published in this git repository. You may contact the authors to receive these files.
Make sure to copy the feature tables to your directory.
- "featuretable/features_18M.csv" (large dataset)
- "featuretable/features_org.csv" (organisation dataset)
- "featuretable/features_util.csv" (utility dataset)
- "featuretable/features_neg.csv" (negative instances dataset)
- "featuretable/features_val_p.csv" (well-engineered project of the validation dataset)
- "featuretable/features_val_np.csv" (not well-engineered project of the validation dataset)
You may change the paths to these files in dataset_utils.py.
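Before running COYOTE it can help to confirm that all six feature tables are in place. The following is a minimal sketch, not part of COYOTE: the helper name `missing_feature_tables` is hypothetical, and the paths are the defaults listed above, so adjust the list if you changed them in dataset_utils.py.

```python
import os

# Default paths as listed above, relative to the COYOTE root directory.
REQUIRED_FEATURE_TABLES = [
    "featuretable/features_18M.csv",
    "featuretable/features_org.csv",
    "featuretable/features_util.csv",
    "featuretable/features_neg.csv",
    "featuretable/features_val_p.csv",
    "featuretable/features_val_np.csv",
]


def missing_feature_tables(root="."):
    """Return the required feature tables that do not exist under `root`.

    Hypothetical helper for illustration; COYOTE itself does not ship it.
    """
    return [
        path for path in REQUIRED_FEATURE_TABLES
        if not os.path.isfile(os.path.join(root, path))
    ]


if __name__ == "__main__":
    missing = missing_feature_tables()
    if missing:
        print("Missing feature tables:")
        for path in missing:
            print("  " + path)
    else:
        print("All feature tables are in place.")
```

Run it from the root directory of COYOTE; an empty result means every expected file was found.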
Open a terminal in the root directory of COYOTE. Executing
python coyote.py
will show you your options.
Extract a featuretable from a timeseries. Executing the command
python coyote.py extract ./timeseries.csv ./features.csv
will read timeseries (as created by https://github.com/Ionman64/PHANTOM) and transform them to features.
Clusters projects (represented by feature vectors) into two classes: P (Project) and NP (Non-Project). Executing
python coyote.py cluster --config=./cfg.json --accuracy_file=./acc.csv --prediction_file=./pred.csv
will
- train COYOTE using the organisation, utility and negative instances feature tables,
- validate COYOTE against the validation feature tables (results are saved to the accuracy file and to the "ll_data" directory),
- predict the labels of the large dataset (results are saved to the prediction file).
You may omit the --prediction_file flag to perform only the first two steps (training and validation).
NOTE: Predicting the large dataset requires more than 8GB of RAM. On a laptop with 8GB, COYOTE exits with a memory error.
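To find out in advance whether a machine exceeds the 8GB mentioned in the note, the physical memory can be queried from Python. This is a rough sketch that assumes a Linux or similar Unix system where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`; the function names are hypothetical and not part of COYOTE. The note gives 8GB only as a lower bound that is not sufficient, so passing this check does not guarantee that the prediction succeeds.

```python
import os

EIGHT_GB = 8 * 1024 ** 3  # threshold taken from the note above, in bytes


def total_ram_bytes():
    """Return the physical memory in bytes (Linux and most Unix systems)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def has_more_ram_than(threshold_bytes=EIGHT_GB):
    """Return True if the machine has more physical memory than the threshold."""
    return total_ram_bytes() > threshold_bytes


if __name__ == "__main__":
    gib = total_ram_bytes() / float(1024 ** 3)
    print("Physical memory: {:.1f} GiB".format(gib))
    if not has_more_ram_than():
        print("Predicting the large dataset may fail with a memory error; "
              "consider omitting --prediction_file.")
```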
COYOTE analyses the features from the feature tables before using them for training. The configuration file states correlation thresholds for the measures on the training datasets. Any feature violating the correlation threshold will not be used for training.
Below you will find an example configuration. This configuration was created using COYOTE explore.
{"util": {"merges": 0.85, "commits": 0.95, "commiters": 0.75, "integrations": 0.95, "integrators": 0.45}, "org": {"merges": 0.9, "commits": 0.9, "commiters": 0.75, "integrations": 0.9, "integrators": 0.75}}
You may save it to a file and pass it to the cluster command via --config.
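A hand-edited configuration can be checked before it is passed to COYOTE. The sketch below writes the example configuration to a file after a basic sanity check. It rests on two assumptions that the source does not state: that COYOTE expects exactly the "util" and "org" entries with the five measures shown above, and that thresholds are correlation values between 0 and 1. `validate_config` and `save_config` are hypothetical helpers, not COYOTE functions.

```python
import json

# Dataset and measure names exactly as they appear in the example configuration.
DATASETS = ["util", "org"]
MEASURES = ["merges", "commits", "commiters", "integrations", "integrators"]

EXAMPLE_CONFIG = {
    "util": {"merges": 0.85, "commits": 0.95, "commiters": 0.75,
             "integrations": 0.95, "integrators": 0.45},
    "org": {"merges": 0.9, "commits": 0.9, "commiters": 0.75,
            "integrations": 0.9, "integrators": 0.75},
}


def validate_config(config):
    """Raise ValueError if an entry is missing or a threshold is outside [0, 1].

    The [0, 1] range is an assumption made for this sketch.
    """
    for dataset in DATASETS:
        if dataset not in config:
            raise ValueError("missing dataset: " + dataset)
        for measure in MEASURES:
            if measure not in config[dataset]:
                raise ValueError("missing measure: {}/{}".format(dataset, measure))
            threshold = config[dataset][measure]
            if not 0.0 <= threshold <= 1.0:
                raise ValueError("threshold out of range: {}/{}".format(dataset, measure))


def save_config(config, path):
    """Validate the configuration and write it as JSON to `path`."""
    validate_config(config)
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)


if __name__ == "__main__":
    save_config(EXAMPLE_CONFIG, "cfg.json")
```

The resulting cfg.json can then be used with `--config=./cfg.json`.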
- use https://github.com/Ionman64/PHANTOM to create timeseries of the dataset.
- extract the features from the timeseries using COYOTE extract.
COYOTE is hard-coded against the large dataset. However, you may want to tweak the program to predict another dataset.
- update dataset_utils.py
Find the following section in "dataset_utils.py"
def load_18M():
    features, _ = __load_feature_table_and_labels("featuretable/features_18M.csv", label="-")
    return features
and update the path to the feature table of your dataset, for example like this
def load_18M():
    features, _ = __load_feature_table_and_labels("featuretable/my_dataset.csv", label="-")
    return features
Explores a range of correlation thresholds to find the best threshold configuration. Executing
python coyote.py explore --explore=./exp.csv --config=./cfg.json
will
- train COYOTE several times with different thresholds
- store the accuracy (precision, recall, f-measure) of all trained classifiers in exp.csv
- determine the best configuration and save it in cfg.json
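For readers interpreting the accuracy values in exp.csv, the three measures relate as sketched below. This uses the standard definitions with the balanced F-measure (F1), computed from counts of true positives (tp), false positives (fp) and false negatives (fn). Whether COYOTE computes its f-measure in exactly this way is an assumption; the function names are illustrative only.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / float(tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / float(tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration, not results from the thesis.
    p = precision(tp=6, fp=2)
    r = recall(tp=6, fn=4)
    print("precision={:.3f} recall={:.3f} f-measure={:.3f}".format(p, r, f_measure(p, r)))
```

Because the F-measure is a harmonic mean, a classifier scores high only when both precision and recall are high.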
- Clone COYOTE
    git clone https://github.com/joshuaju/COYOTE.git
- Go into the directory
    cd COYOTE
  and execute:
    mkdir featuretable ll_data
- Place the required files in the "featuretable" directory (see "Setup" above)
- Extract features from timeseries
    python coyote.py extract ./timeseries/example_ts.csv ./featuretable/example_ft.csv
- Explore measures to find the best configuration
    python coyote.py explore --explore=./exp.csv --config=./cfg.json
- Analyse exp.csv to see all results and take a look at the configuration (cfg.json), which holds the thresholds of the best classifier
- Cluster the large dataset with your configuration. You may change "dataset_utils.py" to point to your own dataset. See "How to predict other datasets?" above.
    python coyote.py cluster --config=cfg.json --accuracy_file=./acc.csv --prediction_file=./pred.csv
  Note: If you get a MemoryError you may need more RAM. For example, the large dataset cannot be clustered with 8GB of RAM.
- Analyse acc.csv to see the accuracy of the trained classifier
- Analyse pred.csv to see how COYOTE classified the projects
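The extract, explore and cluster steps of this walkthrough could be chained from a small driver script. The sketch below only assembles and prints the command lines by default, and runs them when `dry_run=False`. It assumes the subcommands and flags used in this README and that it is started from the root directory of COYOTE; the function names are hypothetical.

```python
import subprocess


def build_walkthrough_commands(timeseries, features, exploration="./exp.csv",
                               config="./cfg.json", accuracy="./acc.csv",
                               prediction="./pred.csv"):
    """Return the COYOTE invocations of the walkthrough as argument lists."""
    return [
        ["python", "coyote.py", "extract", timeseries, features],
        ["python", "coyote.py", "explore",
         "--explore=" + exploration, "--config=" + config],
        ["python", "coyote.py", "cluster", "--config=" + config,
         "--accuracy_file=" + accuracy, "--prediction_file=" + prediction],
    ]


def run_walkthrough(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check_call raises CalledProcessError, so the run stops at the first failing step.
            subprocess.check_call(command)


if __name__ == "__main__":
    commands = build_walkthrough_commands(
        "./timeseries/example_ts.csv", "./featuretable/example_ft.csv")
    run_walkthrough(commands)  # dry run: prints the three commands
```

Keeping the dry run as the default makes it easy to review the commands before starting the memory-intensive cluster step.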