# Bug Severity Predictor for Mozilla

In this project, I'll build a severity predictor for the [Mozilla project](https://www.mozilla.org/en-US/) that uses the description of a bug report stored in the [Bugzilla Tracking System](https://bugzilla.mozilla.org/home) to predict its severity.

The severity in the Mozilla project indicates how severe the problem is, from blocker ("application unusable") to trivial ("minor cosmetic issue"). The field can also mark a bug as an enhancement request. In this project, I consider five severity levels: **trivial**, **minor**, **major**, **critical**, and **blocker**. I ignore the default severity level (often **"normal"**) because it is generally regarded as the choice users make when they are unsure of the correct severity level.

## Data Preparation

One of the first steps in any machine learning project is data preparation: loading the data, exploring it, and cleaning the information that will go into the working dataset. This notebook prepares the data and notes patterns in the given features and in the distribution of the data.

### Project setup

The cell below imports the Python packages needed to run the code in this notebook.

In [1]:
# standard packages
import os

# local packages.
from data_preparation import clean_data_fn, load_data_fn, convert_to_ordinal_fn

### Read in the data

The cell below downloads the raw data file into the folder **data/raw**.

This data is a version of a dataset created by me, Ricardo Torres, and Mario Côrtes at the University of Campinas for long-lived bug prediction research. You can read all about the data collection at [Mendeley Data](https://data.mendeley.com/datasets/v446tfssgj/2).

> **Citation for data:** Gomes, Luiz; Torres, Ricardo; Côrtes, Mario (2021), “A Dataset for Long-lived Bug Prediction in FLOSS”, Mendeley Data, V2, doi: 10.17632/v446tfssgj.2

In [2]:
reports_input_url = 'https://data.mendeley.com/public-files/datasets/v446tfssgj/files/8666b62f-ef75-45e5-89cd-f49795b9cbee/file_downloaded'
raw_reports_path = os.path.join('..', 'data', 'raw')

In [3]:
# create the target folder if it does not exist yet.
if not os.path.exists(raw_reports_path):
    os.makedirs(raw_reports_path)

raw_reports_path = os.path.join(raw_reports_path, 'mozilla_bug_report_data.csv')
!wget -O {raw_reports_path} {reports_input_url}

--2021-01-25 10:31:52--  https://data.mendeley.com/public-files/datasets/v446tfssgj/files/8666b62f-ef75-45e5-89cd-f49795b9cbee/file_downloaded
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving data.mendeley.com (data.mendeley.com)... 162.159.133.86, 162.159.130.86
Connecting to data.mendeley.com (data.mendeley.com)|162.159.133.86|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com/829a4fd4-ba89-4bc2-b4f8-5c18f49a699d [following]
--2021-01-25 10:31:53--  https://md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com/829a4fd4-ba89-4bc2-b4f8-5c18f49a699d
Resolving md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com (md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com)... 52.218.88.184
Connecting to md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com (md-datasets-public-files-prod.s3.eu-west-1.amazonaws.com)|52.218.88.184|:443... connected.
HTTP request sent, awai
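
Where `wget` is not available (on Windows, for example), the same download step can be sketched with the standard library alone. The helper name `download_if_missing` and its `fetch` parameter are my own assumptions for illustration and are not part of the `data_preparation` module. The sketch creates the target folder and skips the download when the file is already present.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, dest_path, fetch=urlretrieve):
    """Download url to dest_path unless the file is already there.

    Hypothetical helper: fetch is any callable taking (url, dest_path);
    it defaults to urllib.request.urlretrieve.
    """
    parent = os.path.dirname(dest_path)
    if parent:
        # Create the target folder (and its parents) if needed.
        os.makedirs(parent, exist_ok=True)
    if not os.path.exists(dest_path):
        fetch(url, dest_path)
    return dest_path
```

In this notebook it would be called as `download_if_missing(reports_input_url, os.path.join('..', 'data', 'raw', 'mozilla_bug_report_data.csv'))`. Passing a different `fetch` callable makes the helper easy to exercise without network access.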

In [4]:
!head -2 {raw_reports_path}

bug_id,creation_date,component_name,product_name,short_description,long_description,assignee_name,reporter_name,resolution_category,resolution_code,status_category,status_code,update_date,quantity_of_votes,quantity_of_comments,resolution_date,bug_fix_time,severity_category,severity_code
BUGZILLA-294734,2005-05-18,Bugzilla-General,BUGZILLA,Emergency 2.16.10 Release,"2.16.9 is broken -- many users can't enter bugs on it particularly not from a


In [5]:
reports_data = load_data_fn(raw_reports_path)
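
The code of `load_data_fn` and `clean_data_fn` lives in the local `data_preparation` module and is not shown in this notebook. Comparing the raw row printed by `head -2` with the cleaned preview further down, the descriptions appear to be lowercased, with digits and punctuation replaced by spaces. Here is a minimal sketch of such a step under that assumption; `clean_text_sketch` is a hypothetical name and not the module's actual code.

```python
import re


def clean_text_sketch(text):
    """Lowercase text and replace every run of non-letters with one space."""
    lowered = text.lower()
    # Anything that is not a-z (digits, punctuation, whitespace) becomes a space.
    letters_only = re.sub(r'[^a-z]+', ' ', lowered)
    return letters_only.strip()


print(clean_text_sketch("2.16.9 is broken -- many users can't enter bugs"))
# is broken many users can t enter bugs
```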

In [6]:
# filtering out bugs with normal severity level.
# .copy() avoids pandas' SettingWithCopyWarning when a column is added later.
reports_data = reports_data.loc[reports_data['severity_category'] != 'normal'].copy()

#### Basic data exploration

In [7]:
reports_data.head()

| Unnamed: 0 | long_description | severity_category |
| ---------: | :--------------- | :---------------- |
| 0 | is broken many users can t enter bugs on it p... | blocker |
| 2 | adding support for custom headers and cookie n... | blocker |
| 9 | the patch in bug regressed the fix from bug th... | major |
| 15 | from bugzilla helper user agent mozilla x u li... | major |
| 20 | i found it odd that relogin cgi didn t clear o... | minor |


In [8]:
reports_data['severity_category'].value_counts()

major       737
critical    605
minor       540
trivial     302
blocker     204
Name: severity_category, dtype: int64
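
The counts above show that the classes are imbalanced: **major** reports are more than three times as frequent as **blocker** reports. A quick sketch of the relative share of each class follows, with the counts copied by hand from the output above so that it runs without the dataset.

```python
# Counts copied from the value_counts() output above.
severity_counts = {
    'major': 737,
    'critical': 605,
    'minor': 540,
    'trivial': 302,
    'blocker': 204,
}

total = sum(severity_counts.values())
shares = {label: count / total for label, count in severity_counts.items()}

print(f'total reports: {total}')
for label, share in shares.items():
    # Left-align the label, then print the share as a percentage.
    print(f'{label:<10}{share:6.1%}')
```

With the DataFrame at hand, `reports_data['severity_category'].value_counts(normalize=True)` gives the same shares directly.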

### Data conversion

The cell below converts the categorical severity level to an ordinal code according to the following conversion table:

| severity category | ordinal code | 
| :---------------- | -----------: |
| trivial | 0 |
| minor | 1 |
| major | 2 |
| critical | 3 |
| blocker | 4 |
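
The table can be expressed as a plain dictionary lookup. The sketch below is one possible shape for such a converter. It is not the actual `convert_to_ordinal_fn` from `data_preparation`, and the names `SEVERITY_TO_ORDINAL` and `convert_to_ordinal_sketch` are hypothetical.

```python
# Mapping taken from the conversion table above.
SEVERITY_TO_ORDINAL = {
    'trivial': 0,
    'minor': 1,
    'major': 2,
    'critical': 3,
    'blocker': 4,
}


def convert_to_ordinal_sketch(severity_category):
    """Return the ordinal code for a severity label; raise on unknown labels."""
    try:
        return SEVERITY_TO_ORDINAL[severity_category]
    except KeyError:
        raise ValueError(
            f'unknown severity category: {severity_category!r}'
        ) from None
```

Raising on unknown labels, rather than returning a default, makes a leftover **normal** row fail loudly instead of slipping into the dataset with an arbitrary code.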

In [9]:
reports_data['severity_code'] = reports_data['severity_category'].apply(convert_to_ordinal_fn) 

In [10]:
reports_data.head()

| Unnamed: 0 | long_description | severity_category | severity_code |
| ---------: | :--------------- | :---------------- | ------------: |
| 0 | is broken many users can t enter bugs on it p... | blocker | 4 |
| 2 | adding support for custom headers and cookie n... | blocker | 4 |
| 9 | the patch in bug regressed the fix from bug th... | major | 2 |
| 15 | from bugzilla helper user agent mozilla x u li... | major | 2 |
| 20 | i found it odd that relogin cgi didn t clear o... | minor | 1 |


In [11]:
reports_data['severity_code'].value_counts()

2    737
3    605
1    540
0    302
4    204
Name: severity_code, dtype: int64

### Export cleaned data

In [12]:
cleaned_reports_path = os.path.join('..', 'data', 'cleaned')

In [13]:
if not os.path.exists(cleaned_reports_path):
    os.makedirs(cleaned_reports_path)

In [14]:
cleaned_reports_path = os.path.join(cleaned_reports_path, 'mozilla_bug_report_data.csv')
reports_data[['long_description', 'severity_code']].to_csv(cleaned_reports_path, index=False)
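
As an optional sanity check, the exported file can be read back to confirm that it has exactly the two expected columns and only the codes 0 to 4. This is a sketch using the standard `csv` module; `validate_export_sketch` is a hypothetical helper, not part of `data_preparation`.

```python
import csv

EXPECTED_COLUMNS = ['long_description', 'severity_code']
VALID_CODES = {'0', '1', '2', '3', '4'}


def validate_export_sketch(csv_path):
    """Return the number of data rows; raise ValueError on a malformed file."""
    with open(csv_path, newline='', encoding='utf-8') as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f'unexpected columns: {reader.fieldnames}')
        row_count = 0
        for row in reader:
            # Codes are compared as strings because csv does not parse numbers.
            if row['severity_code'] not in VALID_CODES:
                raise ValueError(
                    f'unexpected severity code: {row["severity_code"]!r}'
                )
            row_count += 1
    return row_count
```

Calling `validate_export_sketch(cleaned_reports_path)` on the file written above should return 2388, the sum of the class counts shown earlier, assuming no rows were dropped in between.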