# Project Description

Given any high impact bug, identify its priority.

Software engineering research has produced methods for predicting, localizing, and triaging bugs, but these methods do not consider the impact or weight of a bug on users and developers.

For this reason, we want to distinguish between different kinds of bugs by placing them in priority categories: **Critical**, **Blocker**, **Major**, **Minor**, **Trivial**.

<hr>

# Table of Contents

1. [Data](#data)<br>
    1.1. [Data Extraction](#data-extraction)<br>
    1.2. [Manual Feature Selection](#manual-feature-selection)<br>
    1.3. [Data Cleaning](#data-cleaning)<br>
    1.4. [Generating New Features](#generating-new-features)<br>
    &nbsp;&nbsp;&nbsp;1.4.1. [Quantify Features](#quantify-features)<br>
    &nbsp;&nbsp;&nbsp;1.4.2. [Sentiment Analysis](#sentiment-analysis)<br>
    &nbsp;&nbsp;&nbsp;1.4.3. [Fix Duration](#fix-duration)<br>

2. [Implementation](#implementation)<br>
3. [Evaluation](#evaluation)<br>
4. [Conclusion](#conclusion)<br>

In [8]:
import numpy as np
import pandas as pd

# Data Extraction
import csv

# SentiStrength
import subprocess
import shlex
import os.path
import sys

# 1. Data <a class="anchor" id="data"></a>

We will work with a large dataset of high impact bugs, which was created by manually reviewing four thousand issue reports from four open source projects (Ambari, Camel, Derby and Wicket). The issue reports were extracted from JIRA, a platform for managing reported issues.

Each project contributes 1000 examples, for a total of 4000 examples to work with.

These projects were selected because they met the following criteria:

* Target projects have a large number of reported issues (at least several thousand), which makes them suitable for building prediction models and/or applying machine learning.

* Target projects use JIRA as an issue tracking system.

* Target projects are different from each other in application domains.

## 1.1. Data Extraction <a class="anchor" id="data-extraction"></a>

In [9]:
path_all_data = "../dataset/data_test.csv" # only using 10 examples

df = pd.read_csv(path_all_data, sep=',')
data = np.array(df)

# TODO: Only use columns where the hand-selected features lie

## 1.2. Manual Feature Selection <a class="anchor" id="manual-feature-selection"></a>

First, the most relevant features are selected manually. The original features included in the dataset were the following:

<img src="./resources/images/features_in_dataset.png" alt="Features in Dataset" width="360">

However, after careful consideration, only the following columns will be taken into account for further analysis:

| NAME | DESCRIPTION |
|:-----|:------------|
| `status` | Status of an issue (Resolved or Closed) |
| `assignee` | Assignee's Name |
| `summary` | Summary of an issue |
| `description` | Description of an issue |
| `affected version` | Versions affected by an issue |
| `votes` | Number of votes |
| `watches` | Number of watchers |
| `description_words` | Number of words used in description |
| `assignee_count` | Number of assignees |
| `comment_count` | Number of comments for an issue |
| `commenter_count` | Number of developers who comment on an issue |
| `commit_count` | Number of commits to resolve an issue |

Criteria for manual feature filtering:
- Possible influence on bug priority prediction
- Lack of data
- Non-quantifiable data
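
The column selection described above can be sketched with pandas as follows. This is only an illustration, not the notebook's final code: the names `SELECTED_FEATURES` and `select_features` are hypothetical, the sample frame is made up, and the column spellings are assumed to match the table (the actual CSV headers may differ).

```python
import pandas as pd

# Column names copied from the table above; the real CSV headers may be spelled differently.
SELECTED_FEATURES = [
    "status", "assignee", "summary", "description", "affected version",
    "votes", "watches", "description_words", "assignee_count",
    "comment_count", "commenter_count", "commit_count",
]

def select_features(frame, columns=SELECTED_FEATURES):
    """Return a copy of `frame` restricted to the hand-selected columns that exist in it."""
    present = [name for name in columns if name in frame.columns]
    return frame[present].copy()

# Tiny made-up frame, used only to show the call.
example = pd.DataFrame({
    "id": [1, 2],
    "status": ["Resolved", "Closed"],
    "votes": [0, 3],
    "reporter": ["a", "b"],
})
selected = select_features(example)
print(list(selected.columns))  # ['status', 'votes']
```

Columns that are listed but absent from the frame are skipped silently here; raising an error instead may be preferable if a missing column should be treated as a data problem.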

## 1.3. Data Cleaning <a class="anchor" id="data-cleaning"></a>

In [10]:
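# Remove rows where data is missing
print("> SIZE BEFORE CLEANING: " + str(data.shape))
# Flag the rows first and filter once: deleting rows inside the loop shifts
# the indices of the rows that follow, so later deletions can hit the wrong row.
has_null = np.array([any(val == 'null' for val in row) for row in data], dtype=bool)
for row in data[has_null]:
    print("> REMOVING ROW:\n" + str(row))
data = data[~has_null]

print("> SIZE AFTER CLEANING: " + str(data.shape))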

> SIZE BEFORE CLEANING: (10, 32)
> REMOVING ROW:
[226 'Bug' 'Resolved' 'Fixed' '   ' 'Minor' 'Suresh Srinivas'
 '2012/05/11 18:29:32 +0100' 'null' 'Unassigned'
 '2012/05/15 03:31:24 +0100' 3.376296296 'null'
 'Make the daemon names and other field names consistent   '
 'Following names need to be consistent: Hdfs -&gt; HDFS Mapreduce -&gt; MapReduce Zookeeper -&gt; ZooKeeper HADOOP -&gt; Hadoop   '
 '0.9.0' '0.9.0' 0 1 20 0 1 1 1 0 0 0 0 0 0 0 nan]
> SIZE AFTER CLEANING: (9, 32)
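
The same filter can also be applied to the DataFrame before it is converted to an array. The sketch below assumes, as in the cell above, that missing values appear as the literal string `'null'`; the helper name `drop_null_rows` and the sample frame are made up for illustration.

```python
import pandas as pd

def drop_null_rows(frame, marker="null"):
    """Drop every row in which at least one cell equals the `marker` string."""
    has_marker = frame.isin([marker]).any(axis=1)
    return frame[~has_marker].reset_index(drop=True)

# Made-up sample: the first row has a 'null' assignee and should be removed.
sample = pd.DataFrame({
    "priority": ["Minor", "Major", "Trivial"],
    "assignee": ["null", "Alice", "Bob"],
    "votes": [0, 2, 1],
})
cleaned = drop_null_rows(sample)
print(cleaned.shape)  # (2, 3)
```

Because the mask is computed once and applied once, no row indices shift during removal.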


## 1.4. Generating New Features <a class="anchor" id="generating-new-features"></a>

### 1.4.1. Quantify Features <a class="anchor" id="quantify-features"></a>

### 1.4.2. Sentiment Analysis <a class="anchor" id="sentiment-analysis"></a>

[SentiStrength](http://sentistrength.wlv.ac.uk) estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts. SentiStrength reports two sentiment strengths:

<br><center>$-1$ *(not negative)* to $-5$ *(extremely negative)*</center>

<center>$1$ *(not positive)* to $5$ *(extremely positive)*</center><br>

For this project, the two scores are summed into a single value, which is used as a new column for data analysis.
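
As a small sketch of how the two strengths combine, the function below parses one line of output and sums the scores. It assumes the output holds the positive score followed by the negative score, separated by whitespace; the name `combine_sentiment` is hypothetical and is not part of SentiStrength.

```python
def combine_sentiment(raw_output):
    """Sum the positive and negative scores from a 'pos<TAB>neg' output line."""
    fields = raw_output.split()  # splits on tabs/newlines and drops empty fields
    positive, negative = int(fields[0]), int(fields[1])
    # Reject values outside the ranges described above.
    if not (1 <= positive <= 5 and -5 <= negative <= -1):
        raise ValueError("scores outside the documented ranges: %r" % raw_output)
    return positive + negative

print(combine_sentiment("2\t-1\t\n"))  # 1
print(combine_sentiment("1\t-5\n"))    # -4
```

With these ranges the combined value lies between $-4$ and $4$, and it is $0$ when the positive and negative strengths balance out.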

In [11]:
# Set the proper paths
SentiStrengthLocation = "./resources/SentiStrength/SentiStrength.jar" # The location of SentiStrength on your computer
SentiStrengthLanguageFolder = "./resources/SentiStrength/data/" # The location of the unzipped SentiStrength data files on your computer

In [12]:
# The following code tests that the above two locations are correct.
# A message will be printed if either of them is missing.
if not os.path.isfile(SentiStrengthLocation):
    print("SentiStrength not found at: ", SentiStrengthLocation)
if not os.path.isdir(SentiStrengthLanguageFolder):
    print("SentiStrength data folder not found at: ", SentiStrengthLanguageFolder)

In [13]:
# The code below allows SentiStrength to be called and run on a single line of text.
def RateSentiment(sentiString):
    # Open a subprocess using shlex to get the command line string into the correct args list format
    p = subprocess.Popen(shlex.split("java -jar '" + SentiStrengthLocation + "' stdin sentidata '" + SentiStrengthLanguageFolder + "'"),stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
    # Communicate via stdin the string to be rated. Note that all spaces are replaced with "+"
    b = bytes(sentiString.replace(" ","+"), 'utf-8') # Can't send string in Python 3, must send bytes
    stdout_byte, stderr_text = p.communicate(b)
    stdout_text = stdout_byte.decode("utf-8")  # Convert from byte
    # -------- Edit - Nov 9 2017 --------
    stdout_list = stdout_text.split("\t")      # Split by tab: ['2', '-1','\n']
    del stdout_list[-1]                        # Get rid of the last newline element: ['2', '-1']
    results = list(map(int, stdout_list))      # Convert the characters to integers
    results = results[0] + results[1]          # Combine the positive and the negative
    # -------- END: Edit - Nov 9 2017 --------
    #stdout_text = stdout_text.rstrip().replace("\t"," ") # Remove the tab spacing between the positive and negative ratings. e.g. 1    -5 -> 1 -5
    #return stdout_text + " " + sentiString
    return results

In [14]:
# Test to ensure that it works correctly
print(RateSentiment("i am happy but not really"))

1


### 1.4.3. Fix Duration <a class="anchor" id="fix-duration"></a>
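
A minimal sketch of how a fix duration in days might be computed from two timestamps written in the format seen in the sample row printed in the Data Cleaning section. That those two timestamps are the creation and resolution times of the issue is an assumption, and `fix_duration_days` is a hypothetical helper name.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M:%S %z"  # e.g. '2012/05/11 18:29:32 +0100'

def fix_duration_days(created, resolved):
    """Return the time between two timestamp strings, in fractional days."""
    start = datetime.strptime(created, TIMESTAMP_FORMAT)
    end = datetime.strptime(resolved, TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 86400  # 86400 seconds per day

days = fix_duration_days("2012/05/11 18:29:32 +0100", "2012/05/15 03:31:24 +0100")
print(round(days, 9))  # 3.376296296
```

For the two timestamps in that sample row, the result agrees with the value `3.376296296` printed next to them.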

# 2. Implementation <a class="anchor" id="implementation"></a>

*Note: Automatic Feature Selection will be included here*

# 3. Evaluation <a class="anchor" id="evaluation"></a>

*Note: Include comparison with other related work*

| Paper | Method |
|-------|--------|
| [Automated Identification of High Impact Bug<br>Reports Leveraging Imbalanced Learning Strategies](http://ieeexplore.ieee.org.uproxy.library.dc-uoit.ca/stamp/stamp.jsp?arnumber=7552013&tag=1 "Paper") |  Naive Bayes Multinomial |

# 4. Conclusion <a class="anchor" id="conclusion"></a>