# Project Description

Given any high impact bug, identify its priority.

Software engineering research offers methods for predicting, localizing, and triaging bugs, but these methods do not consider the impact or weight of a bug on users and developers.

For this reason, we want to distinguish different kinds of bugs by placing them in different priority categories: **Critical**, **Blocker**, **Major**, **Minor**, **Trivial**.

<hr>

# Table of Contents

1. [Data](#data)<br>
    1.1. [Data Extraction](#data-extraction)<br>
    1.2. [Manual Feature Selection](#manual-feature-selection)<br>
    1.3. [Data Cleaning](#data-cleaning)<br>
    1.4. [Generating New Features](#generating-new-features)<br>
    &nbsp;&nbsp;&nbsp;1.4.1. [Sentiment Analysis](#sentiment-analysis)<br>
    &nbsp;&nbsp;&nbsp;1.4.2. [Fix Duration](#fix-duration)<br>
    &nbsp;&nbsp;&nbsp;1.4.3. [Quantify Features](#quantify-features)<br>

2. [Implementation](#implementation)<br>
3. [Evaluation](#evaluation)<br>
4. [Conclusion](#conclusion)<br>

In [1]:
import numpy as np
import pandas as pd

# Data Extraction
import csv

# SentiStrength
import subprocess
import shlex
import os.path
import sys

# Machine Learning
import tensorflow as tf
import sklearn

# 1. Data <a class="anchor" id="data"></a>

We will work with a large dataset of high impact bugs, which was created by manually reviewing four thousand issue reports in four open source projects (Ambari, Camel, Derby and Wicket). The issue reports were extracted from JIRA, a platform for managing reported issues.

Each project contributes 1000 examples, for 4000 examples in total.

These projects were selected because they met the following criteria for project selection:

* Target projects have a large number of reported issues (at least several thousand), which makes them suitable for building prediction models and/or machine learning.

* Target projects use JIRA as an issue tracking system.

* Target projects are different from each other in application domains.

## 1.1. Data Extraction <a class="anchor" id="data-extraction"></a>

In [2]:
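def get_data(path):

    # Load the rows (pandas treats the first line of the file as the header)
    df = pd.read_csv(path, sep=',')
    data = np.array(df)

    # Read the raw header names separately so they can be kept as row 0
    with open(path, newline='') as f:
        reader = csv.reader(f)
        feature_headers = next(reader)

    # Prepend the header row to the data
    data = np.insert(data, 0, feature_headers, 0)

    return data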

## 1.2. Manual Feature Selection <a class="anchor" id="manual-feature-selection"></a>

First, the most relevant features are selected manually. The original features included in the dataset were the following:

<img src="./resources/images/features_in_dataset.png" alt="Features in Dataset" width="360">

However, after careful consideration, only the following columns will be taken into account for further analysis:

| NAME | DESCRIPTION |
|:-----|:------------|
| `status` | Status of an issue (Resolved or Closed) |
| `assignee` | Assignee's Name |
| `summary` | Summary of an issue |
| `description` | Description of an issue |
| `affected_version` | Versions affected by an issue |
| `fixed_version` | Versions of a fixed issue |
| `votes` | Number of votes |
| `watches` | Number of watchers |
| `description_words` | Number of words used in description |
| `assignee_count` | Number of assignees |
| `comment_count` | Number of comments for an issue |
| `commenter_count` | Number of developers who comment on an issue |
| `commit_count` | Number of commits to resolve an issue |

Criteria for manual feature filtering:
- Kept: features that are actually provided in the dataset
- Kept: features with a possible influence on bug priority prediction
- Dropped: features with too little data
- Dropped: features with non-quantifiable data

In [3]:
def manually_selected_features(data):
    
    print("> Getting manually selecting features...")
    
    # keep 14 columns (13 features plus priority): 2, 5, 9, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 29
    # note: we're keeping the priority column for now; it is split off as the label later
    data = np.delete(data,(0,1,3,4,6,7,8,10,11,12,23,24,25,26,27,28,30,31), axis=1)
    
    return data
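
The column indices listed in the comment and the tuple passed to `np.delete` should be complements of each other. A minimal sketch of that check, assuming the raw CSV has 32 columns (inferred from the highest deleted index, 31); the names `N_COLUMNS`, `KEEP_COLUMNS`, and `columns_to_drop` are illustrative and not part of the notebook:

```python
# Hypothetical names; the 32-column width is an assumption inferred
# from the highest index passed to np.delete in the cell above.
N_COLUMNS = 32
KEEP_COLUMNS = [2, 5, 9, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 29]

def columns_to_drop(keep, n_columns):
    """Return every column index in range(n_columns) that is not kept."""
    kept = set(keep)
    return tuple(i for i in range(n_columns) if i not in kept)

drop = columns_to_drop(KEEP_COLUMNS, N_COLUMNS)
print(drop)
```

Deriving the drop list from the keep list would stop the two from drifting apart if the selection changes.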

## 1.3. Data Cleaning <a class="anchor" id="data-cleaning"></a>

In [4]:
# Remove rows where data is missing
def clean_data(data):
    
    print('> Cleaning data...')

    # Row 0 holds the feature headers and is always kept.
    # pandas parses the literal 'null' as NaN by default, so both forms are checked.
    # A mask is used because deleting rows while iterating shifts the row indices.
    keep = [True] + [
        not any(pd.isnull(val) or val == 'null' for val in row)
        for row in data[1:]
    ]
    data = data[np.array(keep)]

    print("> Number of examples after data cleaning: " + str(data[1:].shape[0]))
    
    return data
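
A quick way to check the cleaning behaviour is to run the same kind of filter on a toy array. The sketch below assumes missing values may appear either as the string `'null'` or as `NaN`, and uses a boolean mask so that no rows are removed while the array is being iterated; `is_missing`, `drop_incomplete_rows`, and the sample values are illustrative only:

```python
import numpy as np

def is_missing(val):
    """True for the literal string 'null' or a float NaN."""
    return val == 'null' or (isinstance(val, float) and np.isnan(val))

def drop_incomplete_rows(data):
    """Keep the header row (row 0) and every row without a missing value."""
    keep = [True] + [not any(is_missing(v) for v in row) for row in data[1:]]
    return data[np.array(keep)]

# Toy data: a header row followed by three examples, two of them incomplete.
toy = np.array([
    ['status', 'votes'],
    ['Closed', 3],
    ['null', 1],
    ['Resolved', float('nan')],
], dtype=object)

cleaned = drop_incomplete_rows(toy)
print(cleaned.shape)
```

Only the header and the single complete example should remain.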

## 1.4. Generating New Features <a class="anchor" id="generating-new-features"></a>

### 1.4.1. Sentiment Analysis <a class="anchor" id="sentiment-analysis"></a>

[SentiStrength](http://sentistrength.wlv.ac.uk) estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths:

<br><center>$-1$ *(not negative)* to $-5$ *(extremely negative)*</center>

<center>$1$ *(not positive)* to $5$ *(extremely positive)*</center><br>

For this project, the two scores are summed into a single value, which is used as a new column for data analysis.
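
A minimal sketch of how the two scores could be combined, assuming SentiStrength prints the positive and negative strengths as whitespace-separated integers (for example `2` and `-1`); `combine_sentiment` is a hypothetical helper, not part of SentiStrength:

```python
def combine_sentiment(raw_output):
    """Sum the positive and negative strengths from one line of output."""
    positive, negative = (int(token) for token in raw_output.split()[:2])
    return positive + negative

# Positive strength 2 and negative strength -1 give an overall score of 1.
print(combine_sentiment("2\t-1\n"))
```

Under this scheme the combined score ranges from $-4$ to $4$, and a score of $0$ means the positive and negative strengths cancel out.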

In [5]:
# Set the proper paths
sentistrength_location = "./resources/SentiStrength/SentiStrength.jar" # The location of SentiStrength on your computer
sentistrength_language_folder = "./resources/SentiStrength/data/" # The location of the unzipped SentiStrength data files on your computer

In [6]:
# The following code tests that the above two locations are correct.
# An error will be displayed if there is an issue.
if not os.path.isfile(sentistrength_location):
    print("SentiStrength not found at: ", sentistrength_location)
if not os.path.isdir(sentistrength_language_folder):
    print("SentiStrength data folder not found at: ", sentistrength_language_folder)

In [7]:
# Allows SentiStrength to be called and run on a single line of text.
def rate_sentiment(senti_string):

    # Open a subprocess; passing the arguments as a list avoids quoting issues with the paths
    p = subprocess.Popen(
        ["java", "-jar", sentistrength_location, "stdin", "sentidata", sentistrength_language_folder],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Communicate via stdin the string to be rated. Note that all spaces are replaced with "+"
    b = bytes(senti_string.replace(" ","+"), 'utf-8') # Can't send string in Python 3, must send bytes
    stdout_byte, stderr_text = p.communicate(b)
    stdout_text = stdout_byte.decode("utf-8")  # Convert from byte
    # -------- Edit - Nov 9 2017 --------
    stdout_list = stdout_text.split("\t")      # Split by tab: ['2', '-1','\n']
    del stdout_list[-1]                        # Get rid of the last newline element: ['2', '-1']
    results = list(map(int, stdout_list))      # Convert the characters to integers
    results = results[0] + results[1]          # Combine the positive and the negative
    # -------- END: Edit - Nov 9 2017 --------
    #stdout_text = stdout_text.rstrip().replace("\t"," ") # Remove the tab spacing between the positive and negative ratings. e.g. 1    -5 -> 1 -5
    #return stdout_text + " " + senti_string
    
    return results

In [8]:
# Test to ensure that it works correctly
print(rate_sentiment("i am happy but not really"))

1


### 1.4.2. Fix Duration <a class="anchor" id="fix-duration"></a>

### 1.4.3. Quantify Features <a class="anchor" id="quantify-features"></a>

* **`status`**: {`resolved`, `closed`}
* **`assignee`**: `string`

# 2. Implementation <a class="anchor" id="implementation"></a>

*Note: Automatic Feature Selection will be included here*

### Defining helper functions

In [9]:
def apply_sigmoid(X, W, b):
    return tf.sigmoid(tf.add(tf.matmul(X, W), b))

In [10]:
def get_cost(y, y_, epsilon):
    return (-tf.reduce_mean(y * tf.log(y_ + epsilon) + (1 - y) * tf.log(1 - y_ + epsilon)))
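
For reference, the same binary cross-entropy can be written in NumPy, which makes the role of `epsilon` easy to see: without it, a prediction of exactly 0 or 1 would evaluate `log(0)`. This is an illustrative sketch with made-up values, not part of the TensorFlow graph; `cross_entropy` is a hypothetical name:

```python
import numpy as np

def cross_entropy(y, y_pred, epsilon=1e-6):
    """Mean binary cross-entropy over all entries, mirroring get_cost."""
    return -np.mean(y * np.log(y_pred + epsilon)
                    + (1 - y) * np.log(1 - y_pred + epsilon))

y = np.array([[1.0, 0.0], [0.0, 1.0]])
perfect = cross_entropy(y, y)        # predictions equal to the labels
wrong = cross_entropy(y, 1.0 - y)    # predictions opposite to the labels
print(perfect, wrong)
```

The cost for perfect predictions is close to zero, while the fully wrong predictions give a large but finite cost instead of `nan` or `inf`.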

In [11]:
# TODO: Implement this function with optional multiple layers
def get_output_layer(X, n, k, n_perceptrons):
    
    # declare the weights connecting the input to the hidden layer
    hidden_layer = {'W': tf.Variable(tf.random_normal([n, n_perceptrons])),
                    'b': tf.Variable(tf.random_normal([n_perceptrons]))}

    # declare the weights connecting the hidden layer to the output layer
    output_layer = {'W': tf.Variable(tf.random_normal([n_perceptrons, k])),
                    'b': tf.Variable(tf.random_normal([k]))}
    
    # calculate hidden layer
    hidden_out = apply_sigmoid(X, hidden_layer['W'], hidden_layer['b'])

    # calculate output layer
    y_ = apply_sigmoid(hidden_out, output_layer['W'], output_layer['b'])
    
    return y_
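
The shapes involved in this forward pass can be checked with a small NumPy sketch. It mirrors one hidden sigmoid layer followed by a sigmoid output layer, under the assumption of random weights and purely numeric inputs; `sigmoid`, `forward`, and the sizes used are illustrative:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, n_perceptrons, k, rng):
    """One hidden sigmoid layer followed by a sigmoid output layer."""
    n = X.shape[1]
    W_hidden = rng.normal(size=(n, n_perceptrons))
    b_hidden = rng.normal(size=n_perceptrons)
    W_out = rng.normal(size=(n_perceptrons, k))
    b_out = rng.normal(size=k)
    hidden = sigmoid(X @ W_hidden + b_hidden)   # shape [m, n_perceptrons]
    return sigmoid(hidden @ W_out + b_out)      # shape [m, k]

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 13))   # 7 examples, 13 features
y_pred = forward(X, n_perceptrons=2, k=5, rng=rng)
print(y_pred.shape)
```

Each row of the result holds one value per priority class, which is the shape the cost function expects.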

In [12]:
from sklearn.model_selection import train_test_split
def split_data(data, labels, train_perc):
    
    x_train, x_test, y_train, y_test = train_test_split(data, labels, train_size=train_perc, random_state=42)

    return x_train, x_test, y_train, y_test

### Fetch data

In [18]:
# NOTE: only using 10 examples for now
path_all_data = "../dataset/data_test.csv"
data = clean_data(manually_selected_features(get_data(path_all_data)))

# extract labels (y) column from total dataset
labels = data[1:,1]
data = np.delete(data, 1, 1)

feature_labels = [label.strip() for label in  data[0]] # remove white space around strings
data = data[1:]

train_perc = .7 # percentage of total data used for training
x_train, x_test, y_train, y_test = split_data(data, labels, train_perc)
m = x_train.shape[0]
n = data.shape[1] # number of features
k = 5 # number of classes

print("> Features considered: " + str(feature_labels))

> Getting manually selecting features...
> Cleaning data...
> Number of examples after data cleaning: 10
> Features considered: ['status', 'assignee', 'summary', 'description', 'affected_version', 'fixed_version', 'votes', 'watches', 'description_words', 'assingnee_count', 'comment_count', 'commenter', 'commit_count']


### Set variables and placeholders

In [14]:
learning_rate = 0.01
epochs = 10000 # cycles of feed forward + backprop
epsilon = 0.000001 # used to avoid "nan" values from log

# used to observe the change in accuracy as number of perceptrons increases
init_perceptrons = 2
total_perceptrons = 2
step = 25

# declare training data placeholders
X = tf.placeholder(tf.float32, [None, n]) # input: n features per example
y = tf.placeholder(tf.float32, [None, k]) # output: one value per priority class (k classes)
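
The placeholder `y` expects one row of `k` values per example, while the labels extracted from the dataset are priority names. A minimal sketch of a one-hot encoding, assuming the labels match the five category names from the project description exactly; `PRIORITIES` and `one_hot` are hypothetical names, and the class order is an arbitrary choice:

```python
import numpy as np

# Assumed class order; the notebook does not fix one.
PRIORITIES = ['Critical', 'Blocker', 'Major', 'Minor', 'Trivial']

def one_hot(labels, classes=PRIORITIES):
    """Map each label to a row with a single 1 at its class index."""
    index = {name: i for i, name in enumerate(classes)}
    encoded = np.zeros((len(labels), len(classes)), dtype=np.float32)
    for row, label in enumerate(labels):
        encoded[row, index[label.strip()]] = 1.0
    return encoded

print(one_hot(['Major', 'Trivial']))
```

An array built this way has shape `[m, k]` and could be fed to `y` directly.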

# 3. Evaluation <a class="anchor" id="evaluation"></a>

*Note: Include comparison with other related work*

| Paper | Method |
|-------|--------|
| [Automated Identification of High Impact Bug<br>Reports Leveraging Imbalanced Learning Strategies](http://ieeexplore.ieee.org.uproxy.library.dc-uoit.ca/stamp/stamp.jsp?arnumber=7552013&tag=1 "Paper") |  Naive Bayes Multinomial |

# 4. Conclusion <a class="anchor" id="conclusion"></a>