# Project Description

Given any high impact bug, identify its priority.

Methods in research of Software Engineering focus on predicting, localizing, and triaging bugs, but do not consider their impact or weight on the users and on the developers.

For this reason, we want to distinguish different kinds of bugs by placing in them in different priority categories: **Critical**, **Blocker**, **Major**, **Minor**, **Trivial**.

Below are their definitions$^1$:

* Blocker: we can not move forward until this task completed or this bug fixed. Blocker JIRAs block the completion of a design or development iteration, or the releaseof the product.
    * development or design of a critical component or activity on the project can not move forward until this is resolved.
    * breaks the build process.
    * causes the UI to not start up or function at all.
    * renders an area of finished functionality unusable for most users.
    * not completed task which is needed for the kernel to function.
* Critical: this task or issue will most likely be resolved for release.
    * needed for the next iteration
    * major browser incompatibility
    * renders an isolated particular feature completely unusable
    * affects kernel - front-end communication
    * major accessibility issue
    * major security issue
* Major:(default) Issue should be resolved for release.
    * majority of bugs and tasks
    * if unsure, chose this
* Minor:issue might be resolved for release.
    * issues and bugs which can be fixed with entry level knowledge
    * does not affect functionality in any way, presentation may be broken
* Trivial: issue may or may not be resolved.
    * this should be for borderline tasks and issues. For that moment when you're not sure whether you should write something up or not.

<br><font color=red>
*Note: Consider using regularization to penalize certain parameters in order to obtain more balanced data.*
</font>

Possible research questions:

- Can we predict the priority of a bug given their summary, description, and type?
- Can suggest an assignee to fix the bug given a bug and its priority?
    - Can we predict the time it will take for the bug to be resolved?

<hr>

# Table of Contents

`//TODO`

In [31]:
run_data_cleaning = True
run_nn = False
testing = True # will only run the first 10 tuples

In [3]:
import numpy as np

# Data Extraction
import csv
import pandas as pd

# SentiStrength
import subprocess
import shlex
import os.path
import sys

# Machine Learning
import tensorflow as tf
import sklearn

# 1. Data <a class="anchor" id="data"></a>

We will work with a large dataset of high impact bugs, which was created by manually reviewing four thousand issue reports in four open source projects (Ambari, Camel, Derby and Wicket). The projects were extracted from JIRA, a platform for managing reported issues.

There are 1000 examples per project; there will be 4000 examples to work with in total.

These projects were selected because they met the following criteria for project selection:

* Target projects have a large number of (at least several thousand) reported issues , which enables the use for prediction model building and/or machine learning.

* Target projects use JIRA as an issue tracking system.

* Target projects are different from each other in application domains.

## 1.1. Data Extraction <a class="anchor" id="data-extraction"></a>

In [4]:
def get_data(path):

    # read in data
    df = pd.read_csv(path, sep=',', encoding='ISO-8859-1')
    data = np.array(df)    
    
    # get headers with feature labels
    with open(path, 'r', newline='', encoding="ISO-8859-1") as f:
        reader = csv.reader(f)
        feature_headers = next(reader)
        
    # merge data and feature headers
    data = np.insert(data, 0, feature_headers, 0)  
        
    return data

## 1.2. Manual Feature Selection <a class="anchor" id="manual-feature-selection"></a>

First, the most relevant features are selected manually. The original features included in the dataset were the following:

<img src="./resources/images/features-in-dataset.png" width="300">

However, after careful consideration, only the following columns will be taken into account for further analysis:

<img src="./resources/images/manual-feature-selection-table.jpg" width="600">

In [5]:
def manually_selected_features(data):
    
    print("> Getting manually selecting features...")
    
    # cols to keep: 1, 5, 6, 13, 14, 19
    # we're also keeping the priority column (5) for now
    cols_to_delete = (0, 2, 3, 4, 7, 8, 9, 10, 11, 12, 15, 16, 17, 18, 20, 21,
                        22, 23, 24, 25, 26, 27, 28, 29, 30, 31)
    
    data = np.delete(data, cols_to_delete, axis=1)
    
    return data

## 1.3. Data Cleaning <a class="anchor" id="data-cleaning"></a>

In [21]:
def clean_data(data):
    
    # ---------- Only keep columns selected manually ----------
    
    print('> Cleaning data...')
    print("\n  Tuples before data cleaning: " + str(data[1:].shape[0]) + '\n')
    data = manually_selected_features(data)
    
    # ---------- Remove rows where data is missing ----------

    rows_to_delete = []

    for i, row in enumerate(data):
        for j, val in enumerate(row):
            if (str(row[j]).strip() == 'null'):
                # print("deleting row " + str(i) + ": " + str(row))
                rows_to_delete.append(i)
                break
    
    data = np.delete(data, rows_to_delete, 0)
    # np.savetxt('../dataset/all_data_null_removed.csv', data, delimiter=',', fmt="%s")

    print("\n  Tuples after data cleaning: " + str(data[1:].shape[0]) + '\n')
        
    # ---------- Split total data into design matrix and feature headers ----------
    
    # extract labels (y) column from total dataset
    # labels = one_hot(data[1:,1])
    
    # transform labels into integer encodings
    labels = [str(val).strip() for val in  data[:,1]]
    labels = LabelEncoder().fit_transform(labels)
    
    data = np.delete(data, 1, 1) # deleting labels column from data matrix
    data = np.c_[data, labels] # add labels column to the end

    # strip white space from features array and ignore headers in data matrix
    feature_headers = [str(header).strip() for header in  data[0]] # remove white space around strings
    data = data[1:] # excluding headers from data matrix

    # Quantify issue "type" (0) and "reporter" (1)
    data[:,0] = quantify_to_int(data[:,0])
    data[:,1] = quantify_to_int(data[:,1])

    # Apply sentiment analysis to "summary" (2) and "description" (3) features
    data[:,2] = get_sentiment_feature(data[:,2])
    data[:,3] = get_sentiment_feature(data[:,3])
    
    return data, feature_headers

## 1.4. Generating New Features <a class="anchor" id="generating-new-features"></a>

### 1.4.1. Sentiment Analysis <a class="anchor" id="sentiment-analysis"></a>

[SentiStrength](http://sentistrength.wlv.ac.uk) estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths:

<br><center>$-1$ *(not negative)* to $-5$ *(extremely negative)*</center>

<center>$1$ *(not positive)* to $5$ *(extremely positive)*</center><br>

For this project, they will both be added in order to be used as a new column for data analysis.

In [7]:
# allows SentiStrength to be called and ran on a single line of text.
def rate_sentiment(senti_string):
    
    if senti_string == '': return 0
    
    # Set the proper paths
    sentistrength_location = "./resources/SentiStrength/SentiStrength.jar" # The location of SentiStrength on your computer
    sentistrength_language_folder = "./resources/SentiStrength/data/" # The location of the unzipped SentiStrength data files on your computer
    
    # Tests the paths are correct.
    # An error will be displayed if there is an issue.
    if not os.path.isfile(sentistrength_location):
        print("SentiStrength not found at: ", sentistrength_location)
    if not os.path.isdir(sentistrength_language_folder):
        print("SentiStrength data folder not found at: ", sentistrength_language_folder)
       
    # Open a subprocess using shlex to get the command line string into the correct args list format
    p = subprocess.Popen(shlex.split("java -jar '" + sentistrength_location + "' stdin sentidata '" + sentistrength_language_folder + "'"),stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
    # Communicate via stdin the string to be rated. Note that all spaces are replaced with "+"
    b = bytes(senti_string.replace(" ","+"), 'utf-8') # Can't send string in Python 3, must send bytes
    stdout_byte, stderr_text = p.communicate(b)
    stdout_text = stdout_byte.decode("utf-8")  # Convert from byte
    # -------- Edit - Nov 9 2017 --------
    stdout_list = stdout_text.split("\t")      # Split by tab: ['2', '-1','\n']
    del stdout_list[-1]                        # Get rid of the last newline element: ['2', '-1']
    results = list(map(int, stdout_list))      # Convert the characters to integers
    results = results[0] + results[1]          # Combine the positive and the negative
    # -------- END: Edit - Nov 9 2017 --------
    #stdout_text = stdout_text.rstrip().replace("\t"," ") # Remove the tab spacing between the positive and negative ratings. e.g. 1    -5 -> 1 -5
    #return stdout_text + " " + senti_string
    
    return results

In [8]:
# Test to ensure that it works correctly
print(rate_sentiment(""))

0


Given a column of the data, return another that will include a representation of the input.

In [9]:
# https://stackoverflow.com/questions/3173320/text-progress-bar-in-the-console
def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█'):
    
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
    """
    
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = '\r')
    
    # Print New Line on Complete
    if iteration == total: 
        print()

In [10]:
def get_sentiment_feature(strings):
    
    print("> Applying sentiment analysis...")
    l = len(strings)
    results = np.zeros(l)
    
    # Initial call to print 0% progress
    printProgressBar(0, l, prefix = '  Progress:', suffix = 'Complete', length = 50)
    
    for i, element in enumerate(strings):
        results[i] = rate_sentiment(element.strip())
        printProgressBar(i + 1, l, prefix = '  Progress:', suffix = 'Complete', length = 50)       
    
    return results

### 1.4.3. Quantify Features <a class="anchor" id="quantify-features"></a>

In [11]:
from sklearn.preprocessing import LabelEncoder
def quantify_to_int(array):
    
    print("> Quantifying feature...")

    label_encoder = LabelEncoder()
    results = label_encoder.fit_transform(array)
                
    return results

### 1.4.4. Adjust Given Labels

The labels are provided in string format; however, we will need to convert them into one_hot vectors in order to use them as different classes in the Neural Network.

In [12]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
def one_hot(array):
    
    print("> Transforming labels into one-hot vectors...")
    
    onehot_encoder = OneHotEncoder(sparse=False)
    
    # assuming array has already been transformed into integer encodings
    # now, convert to binary (one-hot)
    array = array.reshape(len(array), 1)
    results = onehot_encoder.fit_transform(array)
            
    return results

# 2. Implementation <a class="anchor" id="implementation"></a>

## 2.1 Fetch and Clean Data

In [13]:
all_data_path = "../dataset/all_data.csv"
clean_data_path = "../dataset/clean_data.csv"

In [32]:
rows = 11
if (not testing): rows=4001

data, feature_headers = clean_data(get_data(all_data_path)[:11])

# Saving the clean data into a csv file for future use
np.savetxt(clean_data_path, data, delimiter=',')

tuples_to_print = 5
print("\n  Features considered: " + str(feature_headers) + " = " + str(len(feature_headers)))
print("\n  First " + str(tuples_to_print) + " tuples example:\n" + str(data[:tuples_to_print]))

> Cleaning data...

  Tuples before data cleaning: 10

> Getting manually selecting features...

  Tuples after data cleaning: 10

> Quantifying feature...
> Quantifying feature...
> Applying sentiment analysis...
  Progress: |██████████████████████████████████████████████████| 100.0% Complete
> Applying sentiment analysis...
  Progress: |██████████████████████████████████████████████████| 100.0% Complete

  Features considered: ['type', 'reporter', 'summary', 'description', 'description_words', '2'] = 6

  First 5 tuples example:
[[0 9 -2.0 0.0 245 0]
 [1 3 0.0 1.0 36 0]
 [1 2 0.0 2.0 23 0]
 [0 4 -1.0 1.0 187 0]
 [0 7 0.0 0.0 70 1]]


## 2.2. Fetch Clean Data

In [15]:
from sklearn.model_selection import train_test_split
def split_data(data, labels, train_perc):
    
    test_perc = round(1-train_perc, 2)
    x_train, x_test, y_train, y_test = train_test_split(data, labels, train_size=train_perc, test_size=test_perc, random_state=42)

    return x_train, x_test, y_train, y_test

In [17]:
df = pd.read_csv(clean_data_path, sep=',', encoding='ISO-8859-1')
clean_data = np.array(df)

# get rid of rows containing "nan" in clean data file
rows_to_delete = []
for i, row in enumerate(clean_data):
    for j, val in enumerate(row):
        if (str(row[j]).strip() == 'nan'):
            rows_to_delete.append(i)
            break
clean_data = np.delete(clean_data, rows_to_delete, 0)

# don't include the last column; where the labels are
data = (clean_data[:,:-1])

# reshape from (m,) to (m,1), then convert into one-hot vector (m,k)
y = one_hot((clean_data[:,-1]).reshape((-1, 1)))
print("  data shape: " + str(data.shape))
print("  y shape: " + str(y.shape))

train_perc = .7 # percentage of total data used for training
x_train, x_test, y_train, y_test = split_data(data, y, train_perc) # randomly splitting up the data
m = x_train.shape[0] # number of tuples for training
n = data.shape[1] # number of features
k = len(y[0]) # number of classes

> Transforming labels into one-hot vectors...
  data shape: (3905, 12)
  y shape: (3905, 5)


In [18]:
y_rand = one_hot(np.floor(np.random.rand(len(y_test),1)*5).astype(int))
print("  y_rand shape: " + str(y_rand.shape))

> Transforming labels into one-hot vectors...
  y_rand shape: (1172, 5)


## 2.2 Neural Network

In [19]:
def apply_activation_function(X, W, b, func='softmax'):
    
    if (func == 'softmax'): # softmax
       
        return tf.nn.softmax(tf.add(tf.matmul(X, W), b))
    
    if (func == 'relu'): # relu
        
        return tf.nn.relu(tf.add(tf.matmul(X, W), b))

    else: # sigmoid
    
        return tf.sigmoid(tf.add(tf.matmul(X, W), b))    

In [20]:
def get_cost(y, y_, epsilon):
    # return (-tf.reduce_mean(y * tf.log(y_ + epsilon) + (1 - y) * tf.log(1 - y_ + epsilon)))
    return tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_, labels=y))
    # return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=y_))

In [22]:
# Using multiple layers
def get_output_layer(n_hidden_layers, X, n, k, n_perceptrons):
    
    layer_weights = []
    
    # input layer to first hidden layer
    layer_weights.append({'W': tf.Variable(tf.random_normal([n, n_perceptrons])),
                          'b': tf.Variable(tf.random_normal([n_perceptrons]))})
    
    # generate this many hidden layers
    for i in range(n_hidden_layers):
        layer_weights.append({'W': tf.Variable(tf.random_normal([n_perceptrons, n_perceptrons])),
                              'b': tf.Variable(tf.random_normal([n_perceptrons]))})

    # last hidden layer to output layer
    layer_weights.append({'W': tf.Variable(tf.random_normal([n_perceptrons, k])),
                          'b': tf.Variable(tf.random_normal([k]))})
            
    # calculate output-first hidden inner layer
    aggregated_val = apply_activation_function(X, layer_weights[0]['W'], layer_weights[0]['b'])
    
    # print("  aggregated_val.shape: " + str(aggregated_val.shape))
    
    # calculate all hidden layers and output layer
    for i in range(1, len(layer_weights)):
        aggregated_val = apply_activation_function(aggregated_val, layer_weights[i]['W'], layer_weights[i]['b'])
    
    # return final layer
    return aggregated_val

In [23]:
def run_model(n_hidden_layers, X, y, n, epsilon, learning_rate, epochs, k, init_perceptrons, total_perceptrons, step):
   
    # to store the different accuracy values for each number of perceptrons used
    total_accuracy = []
    
    # if we are only trying with one set of perceptrons, adjust the upper bound for the "range" function below
    if (init_perceptrons == total_perceptrons):
        stop_cond = init_perceptrons + 1
    # otherwise, set the upper bound taking into accout both the initial perceptrons, and the total wanted
    else:
        stop_cond = init_perceptrons + total_perceptrons + 1

    # perform the training for each number of perceptrons specified
    for n_nodes in range(init_perceptrons, stop_cond, step):

        print("> Using ", n_nodes, " perceptrons and " + str(n_hidden_layers) + " hidden layers ...")

        y_ = get_output_layer(n_hidden_layers, X, n, k, n_nodes)
        cost_function = get_cost(y, y_, epsilon)
        
        # using gradient descent to minimize the cost
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)

        correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1)) # checking how many were predicted correctly
        benchmark_prediction = tf.equal(tf.argmax(y_rand, 1), tf.argmax(y, 1))
        
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        benchmark_accuracy = tf.reduce_mean(tf.cast(benchmark_prediction, tf.float32))

        # --- TRAINING ---

        # collecting cost for each epoch for plotting
        total_cost = []
        init_op = tf.global_variables_initializer()

        with tf.Session() as sess:

            sess.run(init_op)

            for epoch in range(epochs):

                _, c = sess.run([optimizer, cost_function], feed_dict={X:x_train, y:y_train})
                total_cost.append(c)

                if (epoch+1) % 1000 == 0:
                    print("  EPOCH:", (epoch+1), "Cost =", "{:.15f}".format(c))

            a = sess.run(accuracy, feed_dict={X: x_test, y: y_test})
            b_a = sess.run(benchmark_accuracy, feed_dict={y: y_test})
            total_accuracy.append(a)
            print("  >> Accuracy = " + "{:.5f}%".format(a*100) + " vs. Random = " + "{:.5f}%".format(b_a*100))
            

**Set variables and placeholders**

In [24]:
n_hidden_layers = 5
learning_rate = 0.01
epochs = 10000 # cycles of feed forward + backpropagation
epsilon = 0.000001 # used to avoid "nan" values from log in cost function

# used to observe the change in accuracy as number of perceptrons increases
init_perceptrons = 250
total_perceptrons = 250
step = 25

# declare training data placeholders
X = tf.placeholder(tf.float32, [None, n]) # input x1, x2, x3, ..., x12 (12 nodes)
y = tf.placeholder(tf.float32, [None, k]) # output (5 nodes)

In [25]:
if run_nn:
    # run model
    total_acc = run_model(n_hidden_layers, X, y, n, epsilon, learning_rate, epochs, k, init_perceptrons,
                          total_perceptrons, step)

> Using  250  perceptrons and 5 hidden layers ...
  EPOCH: 1000 Cost = 1.616793155670166
  EPOCH: 2000 Cost = 1.525791883468628
  EPOCH: 3000 Cost = 1.377161026000977
  EPOCH: 4000 Cost = 1.310011029243469
  EPOCH: 5000 Cost = 1.287080168724060
  EPOCH: 6000 Cost = 1.276657462120056
  EPOCH: 7000 Cost = 1.270870923995972
  EPOCH: 8000 Cost = 1.267209649085999
  EPOCH: 9000 Cost = 1.264721989631653
  EPOCH: 10000 Cost = 1.262904405593872
  >> Accuracy = 65.10239% vs. Random = 21.50171%


# 3. Evaluation <a class="anchor" id="evaluation"></a>

*Note: Include comparisson with other related work*

| Paper | Method |
|-------|--------|
| [Automated Identification of High Impact Bug<br>Reports Leveraging Imbalanced Learning Strategies](http://ieeexplore.ieee.org.uproxy.library.dc-uoit.ca/stamp/stamp.jsp?arnumber=7552013&tag=1 "Paper") |  Naive Bayes Multinominal |

# 4. Conclusion <a class="anchor" id="conclusion"></a>

# 5. References

1: https://confluence.ets.berkeley.edu/confluence/display/MYB/JIRA%2C+Definitions