
# Project: Back-End (BE) and Front-End (FE) bug classification

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
</ul>

<a id='intro'></a>
## Introduction
> Knowing whether a bug is a front-end (FE) or a back-end (BE) bug can be helpful for software production teams. Most companies have separate FE and BE teams, so tagging a bug as FE or BE can improve bug assignment. In this work, we try to classify bugs into FE and BE classes. For this purpose, we use the **[Eclipse Platform](https://github.com/logpai/bugrepo/blob/master/EclipsePlatform/eclipse_platform.zip)** dataset. We follow these steps:
> <ol>
    <li>Understanding the dataset</li>
    <li>Manually classifying a limited sample of the dataset</li>
    <li>Running machine learning algorithms on the sample and measuring their accuracy</li>
    <li>Asking some developers to classify another sample of the dataset (the evaluation set)</li>
    <li>Applying the trained machine learning models to the evaluation set</li>
</ol>
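
Steps 3 and 5 above can be sketched as follows. This is a minimal illustration, not the pipeline used in this project: the choice of TF-IDF features with logistic regression, the hand-written toy reports, and the names `train_texts`, `train_tags`, `eval_texts`, and `eval_tags` are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical reports standing in for the manually tagged sample.
train_texts = [
    "Button label is truncated in the preferences dialog",
    "Toolbar icon is not rendered after resizing the window",
    "Wrong font color in the editor tab",
    "Null pointer exception while committing to the repository",
    "Workspace lock file is not released after a crash",
    "Resource delta is not broadcast to listeners",
]
train_tags = ["FE", "FE", "FE", "BE", "BE", "BE"]

# Hypothetical evaluation set standing in for the developer-tagged sample.
eval_texts = [
    "Dialog button overlaps the label",
    "Exception while reading the repository index",
]
eval_tags = ["FE", "BE"]

# Turn the text into TF-IDF vectors and fit a linear classifier on them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_tags)

# Score the trained model on the separate evaluation set.
predicted = model.predict(eval_texts)
accuracy = accuracy_score(eval_tags, predicted)
print(f"Accuracy on the evaluation set: {accuracy:.2f}")
```

With only a handful of toy reports the accuracy value itself means little; the point is the shape of the workflow: fit on the manually tagged sample, then score on a separately tagged evaluation set.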

<a id='wrangling'></a>
## Data Wrangling

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

%matplotlib inline

# Set the random seed so that the same sample is drawn every time the code runs
np.random.seed(55)

In [26]:
# Read the dataset from the CSV file
df = pd.read_csv('eclipse_platform.csv')

df.head()

Unnamed: 0,Issue_id,Priority,Component,Duplicated_issue,Title,Description,Status,Resolution,Version,Created_time,Resolved_time
0,1,P3,Team,,Usability issue with external editors (1GE6IRL),- Setup a project that contains a *.gif resour...,CLOSED,FIXED,2.0,2001-10-10 21:34:00 -0400,2012-02-09 15:57:47 -0500
1,2,P5,Team,,Opening repository resources doesnt honor type...,Opening repository resource always open the de...,RESOLVED,FIXED,2.0,2001-10-10 21:34:00 -0400,2002-05-07 10:33:56 -0400
2,3,P5,Team,,Sync does not indicate deletion (1GIEN83),KM (10/2/2001 5:55:18 PM); \tThis PR about the...,RESOLVED,FIXED,2.0,2001-10-10 21:34:00 -0400,2010-05-07 10:28:53 -0400
3,4,P5,Team,,need better error message if catching up over ...,- become synchronized with some project in a r...,RESOLVED,FIXED,2.0,2001-10-10 21:34:00 -0400,2002-03-01 16:27:31 -0500
4,5,P3,Team,,ISharingManager sharing API inconsistent (1GAU...,For getting/setting the managed state of a res...,RESOLVED,WONTFIX,2.0,2001-10-10 21:34:00 -0400,2008-08-15 08:04:36 -0400


In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85156 entries, 0 to 85155
Data columns (total 11 columns):
Issue_id            85156 non-null int64
Priority            85156 non-null object
Component           85156 non-null object
Duplicated_issue    14404 non-null float64
Title               85156 non-null object
Description         85027 non-null object
Status              85156 non-null object
Resolution          85156 non-null object
Version             85156 non-null object
Created_time        85156 non-null object
Resolved_time       85156 non-null object
dtypes: float64(1), int64(1), object(9)
memory usage: 7.1+ MB


In [30]:
# Select 200 bug reports to classify manually
df_sample = df.sample(n=200)

df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 7419 to 42783
Data columns (total 11 columns):
Issue_id            200 non-null int64
Priority            200 non-null object
Component           200 non-null object
Duplicated_issue    33 non-null float64
Title               200 non-null object
Description         200 non-null object
Status              200 non-null object
Resolution          200 non-null object
Version             200 non-null object
Created_time        200 non-null object
Resolved_time       200 non-null object
dtypes: float64(1), int64(1), object(9)
memory usage: 18.8+ KB
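
A global NumPy seed only reproduces the sample when the cells are run in the same order from a fresh kernel, because every earlier random draw advances the generator. As a hedged alternative, `DataFrame.sample` accepts a `random_state` argument that pins the sample to the call itself. The frame `toy_df` below is a made-up stand-in for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the bug-report DataFrame.
toy_df = pd.DataFrame({"Issue_id": range(1, 1001)})

# Passing random_state makes this call reproducible on its own,
# regardless of the global NumPy random state or cell execution order.
first = toy_df.sample(n=200, random_state=55)
second = toy_df.sample(n=200, random_state=55)

# Both calls select exactly the same rows.
print(first.index.equals(second.index))
```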


In [31]:
# Generate a file name based on time (we don't want to overwrite the file after each run)
filename = ('sample_200_' + str(datetime.datetime.now())+ '.csv').replace(':','_').replace(' ','_')
df_sample.to_csv(filename)
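
Building the name from `str(datetime.datetime.now())` leaves a `.` before the microseconds, which can be mistaken for an extension separator. A minimal alternative, assuming a fixed `YYYYmmdd_HHMMSS` pattern is acceptable, formats the timestamp with `strftime`. The helper name `timestamped_filename` is introduced here for illustration only.

```python
import datetime


def timestamped_filename(prefix, when=None, extension=".csv"):
    """Return a name such as 'sample_200_20240131_154500.csv'."""
    when = when or datetime.datetime.now()
    # This strftime pattern emits only digits and one underscore,
    # so no character replacement is needed afterwards.
    return f"{prefix}{when.strftime('%Y%m%d_%H%M%S')}{extension}"


filename = timestamped_filename("sample_200_")
print(filename)
```

Names built this way have one-second resolution, so two runs within the same second would still overwrite each other.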

In [32]:
# Reading the manually classified file
df = pd.read_csv('sample_1000.csv', index_col='Issue_id')

In [4]:
# Keep only the rows that have a manual tag
df = df[~df['Tag'].isna()]

In [5]:
# Keep a subset of the columns together with the Tag label
df = df[['Priority','Component','Title','Description','Tag']]
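
The effect of the last two cells can be checked on a small frame. The rows below are invented for illustration, and the tag values `FE` and `BE` are assumed; only the column names come from the notebook.

```python
import pandas as pd

# Invented rows; the second one has no manual tag.
toy = pd.DataFrame(
    {
        "Issue_id": [1, 2, 3],
        "Priority": ["P3", "P5", "P3"],
        "Component": ["Team", "Team", "UI"],
        "Title": ["a", "b", "c"],
        "Description": ["x", "y", "z"],
        "Status": ["CLOSED", "RESOLVED", "RESOLVED"],
        "Tag": ["BE", None, "FE"],
    }
).set_index("Issue_id")

# Drop untagged rows, then keep only the columns the notebook retains.
tagged = toy[~toy["Tag"].isna()]
tagged = tagged[["Priority", "Component", "Title", "Description", "Tag"]]

print(tagged)
```

Issue 2 is dropped because its tag is missing, and the `Status` column is discarded by the column selection.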