# Predicting Long-Lived Bugs

# 1. Experiment A 

This seventh experiment uses a dataset of bug reports extracted from the Eclipse Bugzilla tracking system. The protocol parameters and their values are shown in the following table:

| Parameter                  |         Value         |
|----------------------------|:---------------------:|
| OSS Project                |        Eclipse        |
| Number of Bug Reports      |        12,200         |
| Fixed Threshold (days)     |           8           |
| Variable Thresholds (days) |         8, 64         |
| Method for Balancing Class | Downsampling (Manual) |
| Classifiers                |     knn, svm, rf      |
| Resampling Techniques      |         None          |

Every bug whose report indicates that the number of days to resolve is less than or equal to the **fixed threshold** is considered a **non-long-lived bug**; a bug whose number of days to resolve is greater than this threshold is considered a **long-lived bug**.
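
The labeling rule and the class balancing listed in the protocol table can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the sample rows are made up, the column name `long_lived`, the 0/1 encoding, and the `random_state` value are assumptions, and random sampling stands in for whatever manual downsampling procedure was actually applied. Only `days_to_resolve` and the threshold of 8 days come from the source.

```python
import pandas as pd

FIXED_THRESHOLD = 8  # days, taken from the protocol table

# Made-up sample rows; only the column name days_to_resolve matches the dataset.
reports = pd.DataFrame({
    "bug_id": ["A-1", "A-2", "A-3", "A-4", "A-5", "A-6"],
    "days_to_resolve": [0, 6, 8, 9, 56, 3],
})

# Assumed encoding: 1 = long-lived (more than 8 days), 0 = non-long-lived.
reports["long_lived"] = (reports["days_to_resolve"] > FIXED_THRESHOLD).astype(int)

# Downsample every class to the size of the smallest class.
minority_size = reports["long_lived"].value_counts().min()
balanced = pd.concat(
    group.sample(n=minority_size, random_state=42)
    for _, group in reports.groupby("long_lived")
)

print(reports[["bug_id", "days_to_resolve", "long_lived"]])
print(balanced["long_lived"].value_counts())
```

A bug resolved in exactly 8 days falls on the non-long-lived side, because the rule uses "less than or equal to" for that class.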

# 2. Data Analysis

## 2.1 Setup Python Packages

In [2]:
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


%matplotlib inline 

plt.style.use('default')
sns.set_context("paper")

## 2.2 Describe Bug Reports File

In [4]:
reports_file = 'datasets/20190207_eclipse_bug_reports.csv'

!echo "Header of reports file"
!head -n 2 $reports_file

!echo "\nNumber of reports in the file:"
!wc -l $reports_file


Header of reports file
bug_id,creation_date,component_name,product_name,short_description,long_description,assignee_name,reporter_name,resolution_category,resolution_code,status_category,status_code,update_date,quantity_of_votes,quantity_of_comments,resolution_date,days_to_resolve,severity_category,severity_code
COMMUNITY-455431,2014-12-17,Servers,COMMUNITY,Need SSH access to build.eclipse.org for uwe.stieber@windriver.com,I'm a committer on the tools.cdt.tcf project and got asked to take over some release engineering stuff from my project lead Martin Oberhuber. In order to do this I need to be able to SSH login to build.eclipse.org. Please re-enable the real shell access for my user (Uwe Stieber uwe.stieber@windriver.com).,webmaster,uwe.st,fixed,1,resolved,4,2014-12-17,0,3,2014-12-17,0,normal,2

Number of reports in the file:
  197505 datasets/20190207_eclipse_bug_reports.csv


In [5]:
reports = pd.read_csv(reports_file)
rows_and_cols = reports.shape
print('There are {} rows and {} columns.\n'.format(
        rows_and_cols[0], rows_and_cols[1]
    )
)

# DataFrame.info() prints its summary directly and returns None,
# so there is nothing useful to capture or print afterwards.
reports.info()

reports.head()

There are 12200 rows and 19 columns.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12200 entries, 0 to 12199
Data columns (total 19 columns):
bug_id                  12200 non-null object
creation_date           12200 non-null object
component_name          12200 non-null object
product_name            12200 non-null object
short_description       12197 non-null object
long_description        12011 non-null object
assignee_name           12200 non-null object
reporter_name           12200 non-null object
resolution_category     12200 non-null object
resolution_code         12200 non-null int64
status_category         12200 non-null object
status_code             12200 non-null int64
update_date             12200 non-null object
quantity_of_votes       12200 non-null int64
quantity_of_comments    12200 non-null int64
resolution_date         12200 non-null object
days_to_resolve         12200 non-null int64
severity_category       12200 non-null object
severity_code           12200 

| | bug_id | creation_date | component_name | product_name | short_description | long_description | assignee_name | reporter_name | resolution_category | resolution_code | status_category | status_code | update_date | quantity_of_votes | quantity_of_comments | resolution_date | days_to_resolve | severity_category | severity_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COMMUNITY-455431 | 2014-12-17 | Servers | COMMUNITY | Need SSH access to build.eclipse.org for uwe.s... | I'm a committer on the tools.cdt.tcf project a... | webmaster | uwe.st | fixed | 1 | resolved | 4 | 2014-12-17 | 0 | 3 | 2014-12-17 | 0 | normal | 2 |
| 1 | JDT-31738 | 2003-02-12 | UI | JDT | Weird behavior setting project libraries | Open the properties for a project then Java Bu... | martinae | bogofilter+eclipse.org | fixed | 1 | resolved | 4 | 2003-02-18 | 0 | 4 | 2003-02-18 | 6 | normal | 2 |
| 2 | ORION-389073 | 2012-09-07 | Git | ORION | Pull gives me an auth fail error without promp... | With the latest changes in git credentials (to... | simon_kaegi | susan | fixed | 1 | resolved | 4 | 2012-09-12 | 0 | 7 | 2012-09-12 | 5 | major | 4 |
| 3 | JETTY-306226 | 2010-03-17 | client | JETTY | HttpClient should allow changing to the keysto... | (Originally JETTY-1190 @ Codehaus JIRA)\n\nCur... | mgorovoy | mgorovoy | fixed | 1 | resolved | 4 | 2010-10-13 | 0 | 3 | 2010-05-12 | 56 | normal | 2 |
| 4 | WTP_SOURCE_EDITING-185183 | 2007-05-02 | wst.xsd | WTP_SOURCE_EDITING | Update copyright headers for WSDL and XSD comp... | Update copyright headers for WSDL and XSD comp... | kchong | kchong | fixed | 1 | closed | 6 | 2007-07-05 | 0 | 4 | 2007-05-02 | 0 | normal | 2 |
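
The `wc -l` count above (197505) is far larger than the 12200 rows that pandas reports. One plausible explanation, which the source does not confirm, is that free-text fields such as `long_description` contain embedded newlines, so one CSV record can span many physical lines. The sketch below illustrates the effect on a made-up two-record sample using only the standard library.

```python
import csv
import io

# Made-up CSV text: the first record has a quoted field spanning three lines.
sample = (
    "bug_id,long_description\n"
    'A-1,"first line\n\nthird line"\n'
    "B-2,single line\n"
)

# What `wc -l` would count: physical newline characters.
physical_lines = sample.count("\n")

# What a CSV parser sees: logical records (header excluded below).
with io.StringIO(sample, newline="") as handle:
    records = list(csv.reader(handle))
data_rows = len(records) - 1

print(f"physical lines: {physical_lines}, data rows: {data_rows}")
```

Under this reading, the row count returned by `pd.read_csv` is the figure that matches the number of bug reports, and `wc -l` only bounds it from above.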


## 2.3 Evaluation Metrics

In [6]:
results_file = 'datasets/20190213232312_predicting-metrics-manual.csv'
results = pd.read_csv(results_file)
rows_and_cols = results.shape
print('There are {} rows and {} columns.\n'.format(
        rows_and_cols[0], rows_and_cols[1]
    )
)

# Summarize the results frame (info() prints directly and returns None).
results.info()

results.head()

There are 6 rows and 24 columns.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 24 columns):
dataset               6 non-null object
classifier            6 non-null object
resampling            6 non-null object
balancing             6 non-null object
fixed_threshold       6 non-null int64
variable_threshold    6 non-null int64
train_size            6 non-null int64
train_size_class_0    6 non-null int64
train_size_class_1    6 non-null int64
test_size             6 non-null int64
test_size_class_0     6 non-null int64
test_size_class_1     6 non-null int64
feature               6 non-null object
n_terms               6 non-null int64
tp                    6 non-null int64
fp                    6 non-null int64
tn                    6 non-null int64
fn                    6 non-null int64
acc_class_0           6 non-null float64
acc_class_1           6 non-null float64
balanced_acc          6 non-null float64
precision             6 non-null fl

| | dataset | classifier | resampling | balancing | fixed_threshold | variable_threshold | train_size | train_size_class_0 | train_size_class_1 | test_size | ... | tp | fp | tn | fn | acc_class_0 | acc_class_1 | balanced_acc | precision | recall | fmeasure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Eclipse | knn | none | manual | 8 | 8 | 5354 | 2677 | 2677 | 5354 | ... | 1153 | 1059 | 1524 | 1618 | 0.521248 | 0.485041 | 0.503145 | 0.521248 | 0.416095 | 0.462773 |
| 1 | Eclipse | svmRadial | none | manual | 8 | 8 | 5354 | 2677 | 2677 | 5354 | ... | 1464 | 1178 | 1213 | 1499 | 0.554126 | 0.447271 | 0.500699 | 0.554126 | 0.494094 | 0.522391 |
| 2 | Eclipse | rf | none | manual | 8 | 8 | 5354 | 2677 | 2677 | 5354 | ... | 1418 | 1171 | 1259 | 1506 | 0.547702 | 0.455335 | 0.501518 | 0.547702 | 0.484952 | 0.51442 |
| 3 | Eclipse | knn | none | manual | 8 | 64 | 2730 | 1365 | 1365 | 2728 | ... | 697 | 637 | 667 | 727 | 0.522489 | 0.478479 | 0.500484 | 0.522489 | 0.489466 | 0.505439 |
| 4 | Eclipse | svmRadial | none | manual | 8 | 64 | 2730 | 1365 | 1365 | 2728 | ... | 742 | 596 | 622 | 768 | 0.554559 | 0.447482 | 0.501021 | 0.554559 | 0.491391 | 0.521067 |
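
The derived metric columns above appear to follow directly from the confusion-matrix counts `tp`, `fp`, `tn`, and `fn`. The sketch below recomputes them for the first row (knn, variable threshold 8). The formulas are inferred from the reported numbers rather than taken from the experiment's code, and the function name `metrics_from_counts` is hypothetical.

```python
def metrics_from_counts(tp, fp, tn, fn):
    """Derive the metric columns from confusion-matrix counts.

    Formulas are inferred from the reported table values (an assumption).
    """
    acc_class_0 = tp / (tp + fp)   # share of positive predictions that are correct
    acc_class_1 = tn / (tn + fn)   # share of negative predictions that are correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "acc_class_0": acc_class_0,
        "acc_class_1": acc_class_1,
        "balanced_acc": (acc_class_0 + acc_class_1) / 2,
        "precision": precision,
        "recall": recall,
        "fmeasure": 2 * precision * recall / (precision + recall),
    }


# Counts from the first row of the results table.
row0 = metrics_from_counts(tp=1153, fp=1059, tn=1524, fn=1618)
for name, value in row0.items():
    print(f"{name}: {value:.6f}")
```

Under these inferred formulas, `acc_class_0` coincides with `precision`, and `balanced_acc` averages the two per-class predictive values rather than recall and specificity, which is the more common definition of balanced accuracy. That difference is worth keeping in mind when comparing these results with other studies.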
