# Predicting Long-Lived Bugs

# 1. Setup Python Packages

In [6]:
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


%matplotlib inline 

plt.style.use('default')
sns.set_context("paper")

# 2. Experiment

This eighth experiment used a dataset of bug reports extracted from the Eclipse Bugzilla tracking system. The protocol parameters and values employed in this experiment are shown in the following table:

| Parameter                                          |                      Value                            |
|----------------------------------------------------|:-----------------------------------------------------:|
| OSS                                                |        eclipse                                        |
| Number of bug reports                              |        12,200                                         |
| Days to resolve range                              |   from 0 to 730                                       |
| Number of bug reports within days to resolve range |        10,970                                         |
| Textual features                                   | summary + description                                 |
| Number of terms                                    |           200                                         |
| Fixed threshold                                    |            64                                         |
| Variable threshold range                           |       from 4 to 64 (step 4)                           |
| Method for balancing class                         | none, downsampling (manual), downsampling (R), smote  |
| Classifiers                                        | knn                                                   |
| Resampling techniques                              | none, bootstrap, cv5x2, repeated cv5x2, loocv, loogcv |
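
As a rough sanity check on the table above, the sketch below enumerates the parameter grid. It assumes (this is not stated in the protocol) that one run is performed per combination of threshold, balancing method, classifier and resampling technique; the variable names are illustrative only.

```python
from itertools import product

# Values copied from the parameter table.
thresholds = list(range(4, 65, 4))  # 4, 8, ..., 64
balancing = ["none", "downsampling (manual)", "downsampling (R)", "smote"]
classifiers = ["knn"]
resampling = ["none", "bootstrap", "cv5x2", "repeated cv5x2", "loocv", "loogcv"]

# Assumption: one run per combination of the four parameters.
grid = list(product(thresholds, balancing, classifiers, resampling))

print(len(thresholds), len(grid))  # 16 384
```

Under that assumption the grid has 384 combinations, which is consistent with the 384 rows reported when the results file is loaded in the Evaluation Metrics section.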

Every bug whose report indicated a number of days to resolve less than or equal to the **fixed threshold** was considered a **non-long-lived bug**, and every bug whose number of days to resolve was greater than this threshold was considered a **long-lived bug**.
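
The labelling rule can be sketched as follows. This is a minimal illustration, not the code used in the experiment: the sample data, the column names `days_to_resolve` and `long_lived`, and the choice of 1 as the long-lived class are all assumptions.

```python
import pandas as pd

FIXED_THRESHOLD = 64  # days, taken from the parameter table

# Hypothetical bug reports; the column names are assumptions.
reports = pd.DataFrame(
    {"bug_id": [1, 2, 3, 4], "days_to_resolve": [0, 64, 65, 730]}
)


def label_long_lived(days, threshold=FIXED_THRESHOLD):
    """Return 1 where days > threshold (long-lived), 0 otherwise."""
    return (days > threshold).astype(int)


reports["long_lived"] = label_long_lived(reports["days_to_resolve"])
print(reports["long_lived"].tolist())  # [0, 0, 1, 1]
```

A bug resolved in exactly 64 days falls on the non-long-lived side, because the rule uses "less than or equal to" for that class.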

## 3.1 Evaluation Metrics

In [7]:
results_file = 'datasets/20190213110244-predicting-metrics.csv'
results = pd.read_csv(results_file)
rows_and_cols = results.shape
print('There are {} rows and {} columns.\n'.format(
        rows_and_cols[0], rows_and_cols[1]
    )
)

# DataFrame.info() prints its summary and returns None, so call it directly
results.info()

results.sort_values('balanced_acc', ascending=False)

There are 384 rows and 24 columns.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384 entries, 0 to 383
Data columns (total 24 columns):
dataset               384 non-null object
classifier            384 non-null object
resampling            384 non-null object
balancing             384 non-null object
threshold             384 non-null int64
fixed_threshold       384 non-null int64
train_size            384 non-null int64
train_size_class_0    384 non-null int64
train_size_class_1    384 non-null int64
test_size             384 non-null int64
test_size_class_0     384 non-null int64
test_size_class_1     384 non-null int64
feature               384 non-null object
n_term                384 non-null int64
tp                    384 non-null int64
fp                    384 non-null int64
tn                    384 non-null int64
fn                    384 non-null int64
acc_class_0           384 non-null float64
acc_class_1           384 non-null float64
balanced_acc          384 non-

NameError: name 'reports_information' is not defined
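
The results file stores the confusion counts (`tp`, `fp`, `tn`, `fn`) alongside `acc_class_0`, `acc_class_1` and `balanced_acc`. The notebook does not say how the accuracy columns were computed. The sketch below uses the usual definitions (per-class recall and their mean) and treats class 1 as the positive class; both points are assumptions, and the counts are made up.

```python
def class_accuracies(tp, fp, tn, fn):
    """Per-class and balanced accuracy from confusion counts (assumed definitions)."""
    # Accuracy on class 1: share of actual positives predicted as positive.
    acc_class_1 = tp / (tp + fn) if (tp + fn) else float("nan")
    # Accuracy on class 0: share of actual negatives predicted as negative.
    acc_class_0 = tn / (tn + fp) if (tn + fp) else float("nan")
    # Balanced accuracy: unweighted mean of the two per-class accuracies.
    balanced_acc = (acc_class_0 + acc_class_1) / 2
    return acc_class_0, acc_class_1, balanced_acc


# Made-up counts, for illustration only.
acc0, acc1, bal = class_accuracies(tp=30, fp=20, tn=80, fn=10)
print(acc0, acc1, bal)
```

If the stored columns follow these definitions, recomputing them from `tp`, `fp`, `tn` and `fn` should reproduce the values in the file; a mismatch would suggest the experiment used a different convention.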

In [None]:
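def plot_line(data, x, y):
    """Plot y against x: one panel per balancing method, one line per resampling technique."""
    sns.set(font_scale=1.5)
    sns.set_style("whitegrid")
    # set_palette applies the palette; color_palette alone only returns it
    sns.set_palette("bright")
    g = sns.FacetGrid(data=data, hue="resampling", col="balancing", col_wrap=2, height=10)
    # Use the x and y arguments instead of hard-coded column names
    g = g.map(sns.lineplot, x, y)
    g.set(xlim=(4, 64))
    g.set(xticks=range(4, 65, 4))  # stop at 65 so the tick at 64 is included
    g.add_legend()
    return g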

In [None]:
plot_line(results, 'threshold', 'acc_class_1')