- Data Science Project for Galvanize, Seattle, August 2016.
- Lucie Klimosova (LinkedIn).
- See the web application live on bugtriaging.com.
Large software projects can produce hundreds of bug reports every day. Many of these reports contain incomplete, incorrect, or redundant information.
Bug reporters and developers would benefit from a "smarter" bug reporting system. This projects shows that severity and priority predictions can be assigned to tickets automatically. It also attempts to prevent duplicate creation by suggesting similar bug reports.
The Eclipse and Mozilla defect tracking dataset... 
- Mozilla subset (mostly Firefox, Thunderbird and Mozilla projects).
- Almost 400,000 bug reports resolved between 1998 and 2013.
- Additional information obtained from Bugzilla API
The bug report is provided as a series of incremental updates appearing during its lifetime. Thus, the report can be analyzed both at the time of creation and at the time of closure.
There are many areas for improvement.
- 72% of severity normal (default), 8% changes over the lifetime.
- 98% of priority empty (default), only 9% filled in over the lifetime.
- Similar change rate applies to product, component and version.
- The bug is assigned to many developers during the lifetime.
- 26% bug reports end up being closed as duplicates.
Modeling and Results
Severity and Priority
Default values (normal severity, empty priority) were treated as unlabeled.
Three different models were trained:
- Multinomial Naive Bayes on TfIdf of short descritption (250 characters).
- Multinomial Naive Bayes on TfIdf of long description.
- Gradient Boosting on other features, including
- "Skill" of the reporter - how many bug reports she has created before.
- How many people are on the cc list.
- Is the bug report assigned immediately.
Linear stacking was used to combine the probability predictions of the three models:
- Gradient Boosting on predicted probabilities for each class.
Better than or comparable with previously published researches. 
blocker | 0.74 | 0.29 | 0.42 | 957
critical | 0.80 | 0.78 | 0.79 | 8816 major | 0.61 | 0.77 | 0.68 | 8770 minor | 0.59 | 0.47 | 0.52 | 4437 trivial | 0.68 | 0.43 | 0.53 | 1827 avg / total | 0.68 | 0.68 | 0.67 | 24807
p1 | 0.61 | 0.74 | 0.67 | 2449 p2 | 0.52 | 0.63 | 0.57 | 2378 p3 | 0.57 | 0.41 | 0.48 | 1497 p4 | 0.70 | 0.18 | 0.28 | 574 p5 | 0.70 | 0.25 | 0.37 | 250
avg / total | 0.58 | 0.57 | 0.55 | 7148
This problem exists for pairs of observations rather than for a single observation.
- Typically highly imbalanced - each bug report has one or two duplicates and there are thousands or more non-duplicates for every set of duplicates.
- The goal is to return the actual duplicate as high in the search as possible.
Assumptions that help reduce the number of potential candidates searched:
- The original should exist and should not be resolved at the time of duplicate creation.
- Both the duplicate and the original are reported for the same product.
- Cosine similarity of TfIdf of short descritption (250 characters).
- Cosine similarity of TfIdf of long descritption.
- Similarity of component and operating system.
- Reporter (the assumption is that the same person is not likely to report the same thing twice).
I have also experimented with locality sensitive hashing (LSH) implmented in NearPy and doc2vec and text summarization implemented in gensim to see if it could be used instead of simple cosine similarity on TfIdf matrices but my results were not better.
- Actual duplicate returned in top 3: 10%
- Actual duplicate returned in top 10: 15%
- Actual duplicate returned in top 50: 26%
The results are not impressive but it is important to keep in mind that there were 7,000 candidates considered on average. The model would perform much better on smaller set of observations (hundreds to thousands of bug reports).
To learn more about the code and how to use it, look at technical documentation.
Improve existing predictions:
- Explore more features available in the API (newer bug reports have information including keywords or votes).
- Improve the duplicate search, spend more time on text similarity detection.
Make the bug triaging solution more complete by adding more functionality:
- Predict component and product.
- Suggest assignees.
- Predict resolution time.
 Lamkanfi, A., Pérez, J. and Demeyer, S., 2013, May. The eclipse and mozilla defect tracking dataset: a genuine dataset for mining bug information. In Proceedings of the 10th Working Conference on Mining Software Repositories (pp. 203-206). IEEE Press.
 Tian, Y., Lo, D. and Sun, C., 2012, October. Information retrieval based nearest neighbor classification for fine-grained bug severity prediction. In 2012 19th Working Conference on Reverse Engineering (pp. 215-224). IEEE.
 Lamkanfi, A., Demeyer, S., Giger, E. and Goethals, B., 2010, May. Predicting the severity of a reported bug. In 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010) (pp. 1-10). IEEE.
 Menzies, T. and Marcus, A., 2008, September. Automated severity assessment of software defect reports. In Software Maintenance, 2008. ICSM 2008. IEEE International Conference on (pp. 346-355). IEEE.