# Project Description

Given any high impact bug, identify its priority.

Software engineering research methods focus on predicting, localizing, and triaging bugs, but they do not consider the impact or weight of those bugs on users and developers.

For this reason, we want to distinguish different kinds of bugs by placing them in different priority categories: **Critical**, **Blocker**, **Major**, **Minor**, **Trivial**.

<hr>

# Table of Contents

1. [Data](#data)<br>
    1.1. [Data Extraction](#data-extraction)<br>
    1.2. [Manual Feature Selection](#manual-feature-selection)<br>
    1.3. [Data Cleaning](#data-cleaning)<br>
    1.4. [Generating New Features](#generating-new-features)<br>
    &nbsp;&nbsp;&nbsp;1.4.1. [Sentiment Analysis](#sentiment-analysis)<br>
    &nbsp;&nbsp;&nbsp;1.4.2. [Number of Versions Affected](#number-versions)<br>
    &nbsp;&nbsp;&nbsp;1.4.3. [Quantify Features](#quantify-features)<br>

2. [Implementation](#implementation)<br>
3. [Evaluation](#evaluation)<br>
4. [Conclusion](#conclusion)<br>

In [2]:
import numpy as np

# Data Extraction
import csv
import pandas as pd

# SentiStrength
import subprocess
import shlex
import os.path
import sys

# Machine Learning
import tensorflow as tf
import sklearn

# 1. Data <a class="anchor" id="data"></a>

We will work with a large dataset of high impact bugs, which was created by manually reviewing four thousand issue reports in four open source projects (Ambari, Camel, Derby and Wicket). The projects were extracted from JIRA, a platform for managing reported issues.

Each project contributes 1000 examples, for 4000 examples in total.

These projects were selected because they met the following criteria:

* Target projects have a large number of reported issues (at least several thousand), which makes them suitable for building prediction models and/or applying machine learning.

* Target projects use JIRA as an issue tracking system.

* Target projects are different from each other in application domains.

## 1.1. Data Extraction <a class="anchor" id="data-extraction"></a>

In [2]:
def get_data(path):

    # read in data
    df = pd.read_csv(path, sep=',', encoding='ISO-8859-1')
    data = np.array(df)    
    
    # get headers with feature labels
    with open(path, 'r', newline='', encoding="ISO-8859-1") as f:
        reader = csv.reader(f)
        feature_headers = next(reader)
        
    # merge data and feature headers
    data = np.insert(data, 0, feature_headers, 0)  
        
    return data

## 1.2. Manual Feature Selection <a class="anchor" id="manual-feature-selection"></a>

First, the most relevant features are selected manually. The original features included in the dataset were the following:

<img src="./resources/images/features_in_dataset.png" alt="Features in Dataset" width="360">

However, after careful consideration, only the following columns will be taken into account for further analysis:

| NAME | DESCRIPTION |
|:-----|:------------|
| `status` | Status of an issue (Resolved or Closed) |
| `assignee` | Assignee's Name |
| `time_fixed` | Time to fix an issue (assigned to resolved) |
| `summary` | Summary of an issue |
| `description` | Description of an issue |
| `affected version` | Versions affected by an issue |
| `fixed_version` | Versions of a fixed issue |
| `votes` | Number of votes |
| `watches` | Number of watchers |
| `description_words` | Number of words used in description |
| `assignee_count` | Number of assignees |
| `comment_count` | Number of comments for an issue |
| `commenter_count` | Number of developers who comment on an issue |
| `commit_count` | Number of commits to resolve an issue |

Criteria for manual feature filtering:
- Keep features that are actually provided in the dataset
- Keep features with a possible influence on bug priority prediction
- Drop features with too little data
- Drop features that cannot be quantified
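As an illustration of the filtering step, the sketch below keeps columns by name instead of by position, which tends to be easier to audit than a list of numeric indices. It is only a sketch under assumptions: the sample rows are made up, `reporter` is a hypothetical stand-in for a dropped column, and `select_columns` is not part of the project code.

```python
import csv
import io

# Columns to keep (a small subset of the table above, for brevity).
KEEP = ["status", "votes", "watches", "comment_count"]

# Hypothetical two-row CSV; "reporter" stands in for a column that is dropped.
SAMPLE = """status,reporter,votes,watches,comment_count
Resolved,alice,0,2,5
Closed,bob,1,3,14
"""

def select_columns(csv_text, keep):
    """Return the rows of csv_text restricted to the named columns, in order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [[row[name] for name in keep] for row in reader]

selected = select_columns(SAMPLE, KEEP)
print(selected)
# [['Resolved', '0', '2', '5'], ['Closed', '1', '3', '14']]
```

Selecting by name also fails loudly (with a `KeyError`) if an expected column is missing from the file.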

In [3]:
def manually_selected_features(data):
    
    print("> Getting manually selecting features...")
    
    # note: we're keeping the priority column for now
    data = np.delete(data,(0,1,3,4,6,7,8,10,11,15,16,23,24,25,26,27,28,30,31), axis=1)
    
    return data

## 1.3. Data Cleaning <a class="anchor" id="data-cleaning"></a>

In [4]:
def clean_data(data):
    
    # ---------- Only keep columns selected manually ----------
    
    print('> Cleaning data...')
    data = manually_selected_features(data)
    
    # ---------- Remove rows where data is missing ----------

    rows_to_delete = []

    for i, row in enumerate(data):
        for j, val in enumerate(row):
            if (str(row[j]).strip() == 'null'):
                # print("deleting row " + str(i) + ": " + str(row))
                rows_to_delete.append(i)
                break
    
    data = np.delete(data, rows_to_delete, 0)
    # np.savetxt('../dataset/all_data_null_removed.csv', data, delimiter=',', fmt="%s")

    print("\n  Number of tuples after data cleaning: " + str(data[1:].shape[0]) + '\n')
        
    # ---------- Split total data into design matrix and feature headers ----------
    
    # extract labels (y) column from total dataset
    # labels = one_hot(data[1:,1])
    
    # transform labels into integer encodings
    labels = [str(val).strip() for val in  data[:,1]]
    labels = LabelEncoder().fit_transform(labels)
    
    data = np.delete(data, 1, 1) # deleting labels column from data matrix
    data = np.c_[data, labels] # add labels column to the end

    # strip white space from features array and ignore headers in data matrix
    feature_headers = [str(header).strip() for header in  data[0]] # remove white space around strings
    data = data[1:] # excluding headers from data matrix

    # rename some features for more descriptive names
    feature_headers = [ft.replace('assigned', 'time_fix') for ft in feature_headers]
    feature_headers = [ft.replace('commenter', 'commenter_count') for ft in feature_headers]

    # Quantify "status" (0) and "assignee" (1)
    data[:,0] = quantify_status(data[:,0])
    data[:,1] = quantify_assignee(data[:,1])

    # Converting time_fix from string to float
    data[:,2] = data[:,2].astype(float) # np.float was removed from recent NumPy releases

    # Apply sentiment analysis to "summary" (3) and "description" (4) features
    data[:,3] = get_sentiment_feature(data[:,3])
    data[:,4] = get_sentiment_feature(data[:,4])
    
    return data, feature_headers

## 1.4. Generating New Features <a class="anchor" id="generating-new-features"></a>

### 1.4.1. Sentiment Analysis <a class="anchor" id="sentiment-analysis"></a>

[SentiStrength](http://sentistrength.wlv.ac.uk) estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths:

<br><center>$-1$ *(not negative)* to $-5$ *(extremely negative)*</center>

<center>$1$ *(not positive)* to $5$ *(extremely positive)*</center><br>

For this project, the two scores are summed into a single value per text, which is then used as a new column for data analysis.
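A minimal sketch of that summation, assuming SentiStrength prints the positive and negative scores separated by a tab (for example `2\t-1`). The helper name `combine_sentiment` is hypothetical and the snippet does not call SentiStrength itself.

```python
def combine_sentiment(raw_output):
    """Sum the positive and negative scores from one line of SentiStrength output."""
    # split() handles the tab separator and the trailing newline
    positive, negative = (int(token) for token in raw_output.split()[:2])
    return positive + negative

print(combine_sentiment("2\t-1\n"))  # 1
print(combine_sentiment("1\t-5\n"))  # -4
```

With this encoding, a combined score of 0 means the positive and negative strengths cancel out, negative values indicate predominantly negative text, and positive values indicate predominantly positive text.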

In [5]:
# allows SentiStrength to be called and run on a single line of text.
def rate_sentiment(senti_string):
    
    if senti_string == '': return 0
    
    # Set the proper paths
    sentistrength_location = "./resources/SentiStrength/SentiStrength.jar" # The location of SentiStrength on your computer
    sentistrength_language_folder = "./resources/SentiStrength/data/" # The location of the unzipped SentiStrength data files on your computer
    
    # Tests the paths are correct.
    # An error will be displayed if there is an issue.
    if not os.path.isfile(sentistrength_location):
        print("SentiStrength not found at: ", sentistrength_location)
    if not os.path.isdir(sentistrength_language_folder):
        print("SentiStrength data folder not found at: ", sentistrength_language_folder)
       
    # Open a subprocess using shlex to get the command line string into the correct args list format
    p = subprocess.Popen(shlex.split("java -jar '" + sentistrength_location + "' stdin sentidata '" + sentistrength_language_folder + "'"),stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
    # Communicate via stdin the string to be rated. Note that all spaces are replaced with "+"
    b = bytes(senti_string.replace(" ","+"), 'utf-8') # Can't send string in Python 3, must send bytes
    stdout_byte, stderr_text = p.communicate(b)
    stdout_text = stdout_byte.decode("utf-8")  # Convert from byte
    # -------- Edit - Nov 9 2017 --------
    stdout_list = stdout_text.split("\t")      # Split by tab: ['2', '-1','\n']
    del stdout_list[-1]                        # Get rid of the last newline element: ['2', '-1']
    results = list(map(int, stdout_list))      # Convert the characters to integers
    results = results[0] + results[1]          # Combine the positive and the negative
    # -------- END: Edit - Nov 9 2017 --------
    #stdout_text = stdout_text.rstrip().replace("\t"," ") # Remove the tab spacing between the positive and negative ratings. e.g. 1    -5 -> 1 -5
    #return stdout_text + " " + senti_string
    
    return results

In [6]:
# Test to ensure that it works correctly
print(rate_sentiment(""))

0


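The helpers below turn a column of text into a numeric column of the same length: `printProgressBar` reports progress, and `get_sentiment_feature` returns the combined sentiment score of each entry.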

In [7]:
# https://stackoverflow.com/questions/3173320/text-progress-bar-in-the-console
def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█'):
    
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
    """
    
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = '\r')
    
    # Print New Line on Complete
    if iteration == total: 
        print()

In [8]:
def get_sentiment_feature(strings):
    
    print("> Applying sentiment analysis...")
    l = len(strings)
    results = np.zeros(l)
    
    # Initial call to print 0% progress
    printProgressBar(0, l, prefix = '  Progress:', suffix = 'Complete', length = 50)
    
    for i, element in enumerate(strings):
        results[i] = rate_sentiment(element.strip())
        printProgressBar(i + 1, l, prefix = '  Progress:', suffix = 'Complete', length = 50)       
    
    return results

### 1.4.2. Number of Versions Affected <a class="anchor" id="number-versions"></a>

In [9]:
# def get_num_versions(v1, v2):
    
#     print("> Getting number of versions...")
    
#     v1, v2 = v1.astype(np.float), v2.astype(np.float)
#     return (np.subtract(v2, v1))

### 1.4.3. Quantify Features <a class="anchor" id="quantify-features"></a>

In [10]:
from sklearn.preprocessing import LabelEncoder
def quantify_status(array):
    
    print("> Quantifying \"status\" feature...")

    label_encoder = LabelEncoder()
    results = label_encoder.fit_transform(array)
                
    return results

In [11]:
from sklearn.preprocessing import LabelEncoder
def quantify_assignee(array):
    
    print("> Quantifying \"assignee\" feature...")
    
    label_encoder = LabelEncoder()
    results = label_encoder.fit_transform(array)
            
    return results

### 1.4.4. Adjust Given Labels

The labels are provided in string format; however, we need to convert them into one-hot vectors so that the neural network can treat them as distinct classes.
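The two steps (strings to integer codes, then codes to one-hot vectors) can be sketched without any dependencies. This is an illustration only: `encode_labels` and `one_hot_rows` are hypothetical helpers, and the alphabetical ordering of classes is an assumption chosen to mirror how `LabelEncoder` orders its classes.

```python
def encode_labels(labels):
    """Map each label string to an integer code, with classes in sorted order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[name] for name in labels], classes

def one_hot_rows(codes, num_classes):
    """Turn integer codes into one-hot rows of length num_classes."""
    return [[1.0 if j == code else 0.0 for j in range(num_classes)]
            for code in codes]

sample = ["Major", "Blocker", "Trivial", "Critical", "Minor", "Major"]
codes, classes = encode_labels(sample)
vectors = one_hot_rows(codes, len(classes))

print(classes)     # ['Blocker', 'Critical', 'Major', 'Minor', 'Trivial']
print(codes)       # [2, 0, 4, 1, 3, 2]
print(vectors[0])  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

Each row contains exactly one `1.0`, at the index of its class, so the network's five output nodes line up with the five priority categories.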

In [13]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
def one_hot(array):
    
    print("> Transforming labels into one-hot vectors...")
    
    onehot_encoder = OneHotEncoder(sparse=False)
    
    # assuming array has already been transformed into integer encodings
    # now, convert to binary (one-hot)
    array = array.reshape(len(array), 1)
    results = onehot_encoder.fit_transform(array)
            
    return results

# 2. Implementation <a class="anchor" id="implementation"></a>

## 2.1. Fetch and Clean Data

In [4]:
all_data_path = "../dataset/all_data.csv"
clean_data_path = "../dataset/clean_data.csv"

In [14]:
data, feature_headers = clean_data(get_data(all_data_path))

# Saving the clean data into a file for future use
np.savetxt(clean_data_path, data, delimiter=',')

print("\n  Features considered: " + str(feature_headers) + " = " + str(len(feature_headers)))
print("\n  Tuple example:\n" + str(data))

> Cleaning data...
> Getting manually selecting features...

  Number of tuples after data cleaning: 3907

> Quantifying "status" feature...
> Quantifying "assignee" feature...
> Applying sentiment analysis...
  Progress: |██████████████████████████████████████████████████| 100.0% Complete
> Applying sentiment analysis...
  Progress: |██████████████████████████████████████████████████| 100.0% Complete

  Features considered: ['status', 'assignee', 'time_fix', 'summary', 'description', 'votes', 'watches', 'description_words', 'assingnee_count', 'comment_count', 'commenter_count', 'commit_count', '5'] = 13

  Tuple example:
[[0 0 353.0791898 ..., 2 0.0 2]
 [2 87 0.0 ..., 14 1.0 2]
 [2 48 0.001157407 ..., 1 0.0 2]
 ..., 
 [2 170 41.776875 ..., 2 0.0 2]
 [2 5 0.005347222 ..., 2 0.0 2]
 [2 5 2.795162037 ..., 2 0.0 2]]


## 2.2. Fetch Clean Data

In [161]:
from sklearn.model_selection import train_test_split
def split_data(data, labels, train_perc):
    
    test_perc = round(1-train_perc, 2)
    x_train, x_test, y_train, y_test = train_test_split(data, labels, train_size=train_perc, test_size=test_perc, random_state=42)

    return x_train, x_test, y_train, y_test
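To make the idea behind the split concrete, here is a dependency-free sketch of a seeded shuffle followed by a 70/30 cut. It is an illustrative stand-in, not scikit-learn's implementation; `shuffle_split` and the ten-row input are assumptions made for the example.

```python
import random

def shuffle_split(rows, train_fraction, seed=42):
    """Shuffle row indices reproducibly, then cut them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test

rows = list(range(10))  # hypothetical stand-in for 10 tuples
train_rows, test_rows = shuffle_split(rows, train_fraction=0.7)
print(len(train_rows), len(test_rows))  # 7 3
```

Fixing the seed, as `random_state=42` does in `split_data`, keeps the train and test sets identical across runs so that accuracy figures remain comparable.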

In [172]:
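# note: read_csv treats the first line of the file as a header by default, so the
# first data row is consumed here (3907 -> 3906 tuples); pass header=None to keep it
df = pd.read_csv(clean_data_path, sep=',', encoding='ISO-8859-1')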
clean_data = np.array(df)

data = (clean_data[:,:-1])

# reshape from (m,) to (m,1), then convert into one-hot vector (m,k)
y = one_hot((clean_data[:,-1]).reshape((-1, 1)))
print("  data shape: " + str(data.shape))
print("  y shape: " + str(y.shape))

train_perc = .7 # percentage of total data used for training
x_train, x_test, y_train, y_test = split_data(data, y, train_perc) # randomly splitting up the data
m = x_train.shape[0] # number of tuples for training
n = data.shape[1] # number of features
k = len(y[0]) # number of classes

> Transforming labels into one-hot vectors...
  data shape: (3906, 12)
  y shape: (3906, 5)


In [173]:
y_rand = one_hot(np.floor(np.random.rand(len(y_test),1)*5).astype(int))
print("  y_rand shape: " + str(y_rand.shape))

> Transforming labels into one-hot vectors...
  y_rand shape: (1172, 5)
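The random labels serve as a benchmark. A uniform random guess over $k = 5$ classes is expected to be right about $1/k = 20\%$ of the time, whatever the class balance of the true labels. The sketch below checks this on made-up, heavily imbalanced labels; `random_baseline_accuracy` and the label counts are assumptions for illustration.

```python
import random

def random_baseline_accuracy(true_labels, num_classes, seed=0):
    """Accuracy of uniformly random guesses against the given integer labels."""
    rng = random.Random(seed)
    guesses = [rng.randrange(num_classes) for _ in true_labels]
    hits = sum(guess == truth for guess, truth in zip(guesses, true_labels))
    return hits / len(true_labels)

# Hypothetical imbalanced labels: class 2 dominates.
labels = [2] * 9000 + [0] * 500 + [4] * 500
baseline = random_baseline_accuracy(labels, num_classes=5)
print("random baseline: {:.3f}".format(baseline))  # close to 0.2
```

A model should clear this bar comfortably before its accuracy is considered meaningful.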


## 2.3. Neural Network

In [182]:
def apply_activation_function(X, W, b, func='relu'):
    
    if (func == 'softmax'): # softmax
       
        return tf.nn.softmax(tf.add(tf.matmul(X, W), b))
    
    elif (func == 'relu'): # relu
        
        return tf.nn.relu(tf.add(tf.matmul(X, W), b))

    else: # sigmoid
    
        return tf.sigmoid(tf.add(tf.matmul(X, W), b))    

In [183]:
def get_cost(y, y_, epsilon):
    # return (-tf.reduce_mean(y * tf.log(y_ + epsilon) + (1 - y) * tf.log(1 - y_ + epsilon)))
    return tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_, labels=y))
    # return tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=y_))
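For reference, the quantity that a softmax cross-entropy cost averages over the batch can be written out in plain Python. This is a sketch under assumptions, not TensorFlow's implementation: the function names are hypothetical, and the loss is computed through a log-sum-exp so that no `log(0)` can occur, which is the problem the `epsilon` term in the commented-out formula is meant to guard against.

```python
import math

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label, for one example."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    log_probs = [v - log_sum_exp for v in logits]  # log of the softmax outputs
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))

def mean_cost(batch_logits, batch_labels):
    """Average the per-example losses over a batch."""
    losses = [softmax_cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

# With equal logits over 5 classes, every class gets probability 1/5.
uniform_loss = softmax_cross_entropy([0.0] * 5, [0, 0, 1, 0, 0])
print(uniform_loss, math.log(5))
```

A cost near $\ln 5 \approx 1.609$ therefore corresponds to a network that has not yet learned to separate the five classes.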

In [184]:
# Using one hidden layer
def get_output_layer(X, n, k, n_perceptrons):
    
    # declare the weights connecting the input to the hidden layer
    hidden_layer = {'W': tf.Variable(tf.random_normal([n, n_perceptrons])),
                    'b': tf.Variable(tf.random_normal([n_perceptrons]))}

    # declare the weights connecting the hidden layer to the output layer
    output_layer = {'W': tf.Variable(tf.random_normal([n_perceptrons, k])),
                    'b': tf.Variable(tf.random_normal([k]))}
    
    # calculate hidden layer
    hidden_out = apply_activation_function(X, hidden_layer['W'], hidden_layer['b'])

    # calculate output layer
    y_ = apply_activation_function(hidden_out, output_layer['W'], output_layer['b'])
    
    return y_

In [185]:
# Using multiple layers
def get_output_layer2(X, n, k, n_perceptrons, n_hidden_layers=1):
        
    layer_weights = {}
        
    for i in range(n_hidden_layers): # generate this many hidden layers
        
        if (i == 0): # from input layer to first hidden layer
        
            layer_weights[i] = {'W': tf.Variable(tf.random_normal([n, n_perceptrons])),
                                'b': tf.Variable(tf.random_normal([n_perceptrons]))}
            
        elif (i == (n_hidden_layers-1)): # from last hidden layer to output
            
            layer_weights[i] = {'W': tf.Variable(tf.random_normal([n_perceptrons, k])),
                                'b': tf.Variable(tf.random_normal([k]))}

        else: # just some hidden layer
            
            layer_weights[i] = {'W': tf.Variable(tf.random_normal([n_perceptrons, n_perceptrons])),
                                'b': tf.Variable(tf.random_normal([n_perceptrons]))}
            
    # calculate first hidden layer from the input
    aggregated_val = apply_activation_function(X, layer_weights[0]['W'], layer_weights[0]['b'])
    
    # calculate all hidden layers and output layer
    for i in range(1, len(layer_weights)):
        aggregated_val = apply_activation_function(aggregated_val, layer_weights[i]['W'], layer_weights[i]['b'])
    
    # return final layer
    return aggregated_val

<font color=red>`run_model()`: **IN PROGRESS**</font>

In [186]:
def run_model(X, y, n, epsilon, learning_rate, epochs, k, init_perceptrons, total_perceptrons, step):
   
    # to store the different accuracy values for each number of perceptrons used
    total_accuracy = []
    
    # if we are only trying with one set of perceptrons, adjust the upper bound for the "range" function below
    if (init_perceptrons == total_perceptrons):
        stop_cond = init_perceptrons + 1
    # otherwise, set the upper bound taking into account both the initial perceptrons, and the total wanted
    else:
        stop_cond = init_perceptrons + total_perceptrons + 1

    # perform the training for each number of perceptrons specified
    for n_nodes in range(init_perceptrons, stop_cond, step):

        print("> Using ", n_nodes, " perceptrons in the hidden layer ...")

        y_ = get_output_layer(X, n, k, n_nodes)
        cost_function = get_cost(y, y_, epsilon)
        
        # using gradient descent to minimize the cost
        optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)

        correct_prediction = tf.equal(tf.argmax(y_, 1), tf.argmax(y, 1)) # checking how many were predicted correctly
        benchmark_prediction = tf.equal(tf.argmax(y_rand, 1), tf.argmax(y, 1))
        
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        benchmark_accuracy = tf.reduce_mean(tf.cast(benchmark_prediction, tf.float32))

        # --- TRAINING ---

        # collecting cost for each epoch for plotting
        total_cost = []
        init_op = tf.global_variables_initializer()

        with tf.Session() as sess:

            sess.run(init_op)

            for epoch in range(epochs):

                _, c = sess.run([optimizer, cost_function], feed_dict={X:x_train, y:y_train})
                total_cost.append(c)

                if (epoch+1) % 1000 == 0:
                    print("  EPOCH:", (epoch+1), "Cost =", "{:.15f}".format(c))

            a = sess.run(accuracy, feed_dict={X: x_test, y: y_test})
            b_a = sess.run(benchmark_accuracy, feed_dict={y: y_test})
            total_accuracy.append(a)
            print("  > Accuracy = " + "{:.5f}%".format(a*100) + " vs. Random = " + "{:.5f}%".format(b_a*100))

    # return the accuracy obtained for each number of perceptrons tried
    return total_accuracy

**Set variables and placeholders**

In [187]:
# n_hidden_layers = 1 # default is 1
learning_rate = 0.01
epochs = 10000 # cycles of feed forward + backprop
epsilon = 0.000001 # used to avoid "nan" values from log in cost function

# used to observe the change in accuracy as number of perceptrons increases
init_perceptrons = 50
total_perceptrons = 50
step = 25

# declare training data placeholders
X = tf.placeholder(tf.float32, [None, n]) # input x1, x2, x3, ..., x12 (12 nodes)
y = tf.placeholder(tf.float32, [None, k]) # output (5 nodes)

In [188]:
# run model
total_acc = run_model(X, y, n, epsilon, learning_rate, epochs, k, init_perceptrons, total_perceptrons, step)

> Using  50  perceptrons in the hidden layer ...
  EPOCH: 1000 Cost = 1.138464570045471


KeyboardInterrupt: 

# 3. Evaluation <a class="anchor" id="evaluation"></a>

*Note: Include comparison with other related work*

| Paper | Method |
|-------|--------|
| [Automated Identification of High Impact Bug<br>Reports Leveraging Imbalanced Learning Strategies](http://ieeexplore.ieee.org.uproxy.library.dc-uoit.ca/stamp/stamp.jsp?arnumber=7552013&tag=1 "Paper") |  Naive Bayes Multinomial |

# 4. Conclusion <a class="anchor" id="conclusion"></a>