This project contains the source code of the GitHub README content classifier from the paper "Categorizing the Content of GitHub README Files" (Gede Artha Azriadi Prana, Christoph Treude, Ferdian Thung, Thushari Atapattu, David Lo), published in 2018 in Empirical Software Engineering. DOI: 10.1007/s10664-018-9660-3
This project is written in Python 3.
- Set up file paths in `config/config.cfg`. By default, CSV files listing the section titles and their labels are in `input/`. `dataset_1.csv` contains the section titles and labels for the development set, whereas `dataset_2.csv` contains the section titles and labels for the evaluation set. The README files corresponding to the CSV files are in the `input/ReadMes/` directory.
- Empty all database tables by running the script `script/loading/empty_all_tables.py`.
- Run `script/loading/load_section_dataset_25pct.py` to extract and load the section overview (title text, labels) and content of the development set into the database.
- Run `script/loading/load_section_dataset_75pct.py` to extract and load the section overview (title text, labels) and content of the evaluation set into the database.
- Run the `script/experiment/*` scripts as required, e.g. `script/experiment/classifier_75pct_tfidf.py` for the SVM version.
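
The first step above relies on paths defined in `config/config.cfg`. As a minimal sketch of how such a file could be read with Python's standard `configparser`: only the file name and the `target_readme_file_dir` key come from this README, while the section name, the other keys, and the `input/target/` path are assumptions made for illustration.

```python
import configparser

# Hypothetical contents of config/config.cfg. Only `target_readme_file_dir`
# is named in this README; every other key and value here is an assumption.
SAMPLE_CFG = """
[DEFAULT]
dev_csv = input/dataset_1.csv
eval_csv = input/dataset_2.csv
readme_dir = input/ReadMes/
target_readme_file_dir = input/target/
"""


def load_paths(cfg_text):
    """Parse configuration text and return its DEFAULT section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return dict(parser["DEFAULT"])


paths = load_paths(SAMPLE_CFG)
print(paths["target_readme_file_dir"])
```

To read the real file instead of the sample string, `parser.read("config/config.cfg")` can replace `read_string`; the actual section and key names should be checked against the shipped configuration.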

- Run `script/classifier/load_combined_set_and_train_model` to extract and load the contents and titles listed in the combined development and evaluation sets (by default, defined as `dataset_combined.csv` in `config/config.cfg`) into the database.
- Run `script/classifier/load_and_classify_target` to extract and load the contents of the README files in the directory specified in the `target_readme_file_dir` variable in `config/config.cfg`.
- By default, the resulting section labels will be saved in `output/output_section_codes.csv`. The classifier will also identify which codes exist for each file, and which codes don't yet exist (i.e. potential for README expansion). This information will be saved in `output/output_file_codes.csv`.
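
The per-file summary described in the last step can be pictured as a set difference between all known codes and the codes found in a file's sections. The sketch below is illustrative only: the column names, the comma-separated code format, and the assumption that codes are the integers 1 to 8 are not taken from the project's actual output files.

```python
import csv
import io
from collections import defaultdict

# Assumption: section codes are the integers 1..8. The real code set is
# defined by the project, not by this sketch.
ALL_CODES = set(range(1, 9))


def summarize_file_codes(section_rows):
    """Map each file name to (codes present, codes missing).

    `section_rows` is an iterable of (file_name, codes) pairs, where `codes`
    is a comma-separated string such as "1,3".
    """
    present = defaultdict(set)
    for file_name, codes in section_rows:
        present[file_name].update(int(c) for c in codes.split(",") if c.strip())
    return {
        name: (sorted(found), sorted(ALL_CODES - found))
        for name, found in present.items()
    }


# Hypothetical section-level output: one row per README section.
SECTION_CSV = """file,codes
a.md,"1,3"
a.md,3
b.md,6
"""

rows = [(r["file"], r["codes"]) for r in csv.DictReader(io.StringIO(SECTION_CSV))]
summary = summarize_file_codes(rows)
print(summary["a.md"])
```

The second element of each tuple lists the codes that no section of the file carries, which is the "potential for README expansion" mentioned above.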

- Run `script/loading/load_section_dataset_combined.py` to extract and load the section overview (title text, labels) and content of the combined development and evaluation sets (by default, defined as `dataset_combined.csv` in `config/config.cfg`) into the database.
- Place the README files whose sections are to be classified in the directory specified in the `target_readme_file_dir` variable in `config/config.cfg`.
- Run `script/loading/load_target_section_data.py` to load the section heading and content data into the database.
- Run `script/classifier/classifier_train_model.py`. This script will train the SVM model using the combined dataset in the `*combined` tables. The resulting model, TFIDF vectorizer, and matrix label binarizer will be saved in the `model/` directory.
- Run `script/classifier/classifier_classify_target.py`. This script will use the saved model, vectorizer, and binarizer to classify the target README files in the directory specified in the `target_readme_file_dir` variable in `config/config.cfg`.
- By default, the resulting section labels will be saved in `output/output_section_codes.csv`. The classifier will also identify which codes exist for each file, and which codes don't yet exist (i.e. potential for README expansion). This information will be saved in `output/output_file_codes.csv`.
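
The train-then-classify steps above revolve around three artifacts: a model, a TFIDF vectorizer, and a label binarizer. The following is a rough sketch of that pattern using scikit-learn, not the project's actual implementation. The toy texts, the label codes, the choice of `OneVsRestClassifier` around `LinearSVC`, and the use of `pickle` for persistence are all assumptions made for illustration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy training data (made up for illustration): section text and label sets.
texts = [
    "install with pip and run the setup script",
    "usage example run the command line tool",
    "license mit copyright authors",
    "authors and maintainers contact the team",
    "overview of the project and how to install it",
    "released under the apache license",
]
labels = [{"3"}, {"3"}, {"5"}, {"5"}, {"1", "3"}, {"5"}]

# Fit the vectorizer on section text and the binarizer on label sets.
vectorizer = TfidfVectorizer()
binarizer = MultiLabelBinarizer()
X = vectorizer.fit_transform(texts)
Y = binarizer.fit_transform(labels)

# One binary linear SVM per label, so a section can receive several codes.
model = OneVsRestClassifier(LinearSVC())
model.fit(X, Y)

# Serialize all three artifacts together, then restore them for classification.
# Kept in memory here; writing the bytes to a file under model/ is analogous.
blob = pickle.dumps((model, vectorizer, binarizer))
model2, vectorizer2, binarizer2 = pickle.loads(blob)

targets = ["how to install and run", "license information"]
predicted = binarizer2.inverse_transform(
    model2.predict(vectorizer2.transform(targets))
)
print(predicted)
```

The vectorizer and binarizer have to be saved alongside the model because the classifier only understands the feature columns and label columns they were fitted with; refitting either on new data would silently change that mapping.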

All scripts will log output (such as F1 score, execution times) into the `log/` directory. Preprocessed README files (with numbers, `mailto:` links etc. abstracted out) are saved in the `temp/` directory. Patterns used for heuristics are listed in `doc/Patterns.ods`.
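
The abstraction of numbers and `mailto:` links mentioned above can be approximated with regular expressions. This is a minimal sketch under stated assumptions: the placeholder tokens `@abstr_mailto` and `@abstr_number`, the two patterns, and the function name `abstract_out` are illustrative and may differ from what the project's preprocessing does.

```python
import re

# Placeholder tokens and patterns are assumptions for illustration; the ones
# actually used by the project may differ.
MAILTO_RE = re.compile(r"mailto:[^\s)>\]]+")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)*\b")


def abstract_out(text):
    """Replace mailto: links and numbers with generic placeholder tokens."""
    # Handle mailto: links first so digits inside addresses are not
    # rewritten separately by the number pattern.
    text = MAILTO_RE.sub("@abstr_mailto", text)
    text = NUMBER_RE.sub("@abstr_number", text)
    return text


print(abstract_out("Requires Python 3.6, contact mailto:dev@example.org"))
```

Replacing such tokens before vectorization keeps the vocabulary small, since every version number or address would otherwise become its own rare feature.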