# Bug Severity Predictor for Mozilla

In this project, I'll build a severity predictor for the [Mozilla project](https://www.mozilla.org/en-US/) that uses the description of a bug report stored in the [Bugzilla tracking system](https://bugzilla.mozilla.org/home) to predict its severity.

Severity in the Mozilla project indicates how serious a problem is, from blocker ("application unusable") to trivial ("minor cosmetic issue"). The same field can also mark a bug as an enhancement request. In this project, I consider five severity levels: **trivial (0)**, **minor (1)**, **major (2)**, **critical (3)**, and **blocker (4)**. I ignore the default severity level (usually **"normal"**) because users tend to leave it in place when they are not sure about the correct severity level.
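To make the label scheme concrete, here is a minimal sketch of how severity names could be turned into the integer codes listed above. The `severity` key and the shape of the raw report records are assumptions made for illustration; the cleaned dataset used in this notebook already contains `severity_code`.

```python
# Severity names mapped to the integer codes used in this notebook.
SEVERITY_CODES = {"trivial": 0, "minor": 1, "major": 2, "critical": 3, "blocker": 4}


def encode_severity(name):
    """Return the integer code for a severity name, or None if the level is not modeled."""
    return SEVERITY_CODES.get(name.strip().lower())


def label_reports(reports):
    """Keep only reports with a modeled severity and attach `severity_code`.

    `reports` is a list of dicts with a hypothetical `severity` key.
    """
    labeled = []
    for report in reports:
        code = encode_severity(report["severity"])
        if code is not None:  # drops "normal", "enhancement", and anything unknown
            labeled.append({**report, "severity_code": code})
    return labeled
```

With this mapping, reports marked "normal" or "enhancement" simply fall out of the dataset instead of receiving a sixth class.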

## Feature engineering

This step of the machine learning workflow extracts features from the raw data. These features can be used to improve the performance of machine learning algorithms. In this project, I'll use [BERT (Bidirectional Encoder Representations from Transformers)](https://arxiv.org/abs/1810.04805) to extract features for my prediction model.

## Project setup

The cell below imports the required packages.

In [1]:
import logging 
import os

import numpy as np
import pandas as pd
import torch
import transformers as ppb

from sklearn.model_selection import train_test_split
from feature_engineering import extract_features_fn
#from google.colab import drive 
#drive.mount('/drive')

## Read in the data

The cell below loads the cleaned bug report dataset. This dataset has the following attributes:

| **Attribute** | **Description** |
| :------------ | :-------------- |
| long_description |  The description of a report written when the bug report was opened. |
| severity_code | The target label that represents the bug severity level.|

In [2]:
batch_len = 1000  # use only the first 1000 bug reports to save computational resources.
reports_input_path = os.path.join('..', 'data', 'clean')
reports_data = pd.read_csv(os.path.join(reports_input_path, 'mozilla_bug_report_data.csv'))[:batch_len]

In [3]:
reports_data.head()

Unnamed: 0,long_description,severity_code
0,is broken many users can t enter bugs on it p...,4
1,adding support for custom headers and cookie n...,4
2,the patch in bug regressed the fix from bug th...,2
3,from bugzilla helper user agent mozilla x u li...,2
4,i found it odd that relogin cgi didn t clear o...,1
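The `Unnamed: 0` column in the output above is the row index that was written into the CSV file and then read back as an ordinary column. A minimal sketch of one way to avoid it, using a small made-up CSV in place of the real file:

```python
import io

import pandas as pd

# A tiny in-memory CSV shaped like the exported dataset (the rows are made up).
csv_text = (
    ",long_description,severity_code\n"
    "0,first report text,4\n"
    "1,second report text,2\n"
    "2,third report text,1\n"
)

# index_col=0 reads the unnamed first column back as the index;
# nrows stops parsing early instead of loading the whole file and slicing it.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, nrows=2)
print(sample.columns.tolist())
```

Whether this is worth changing depends on how `extract_features_fn` selects its columns; if it picks `long_description` and `severity_code` by name, the extra column is harmless.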


## Extracting features

The cells below extract the features from the dataset that will be fed to the model that predicts the bug severity level. I've chosen [DistilBERT](https://medium.com/huggingface/distilbert-8cf3380435b5) as the feature extractor for this project.

> "DistilBERT processes the sentence and passes along some information it extracted from it on to the
> next model. DistilBERT is a smaller version of [BERT](https://arxiv.org/abs/1810.04805) developed and open sourced by the
> team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its
> performance." (Jay Alammar in [A Visual Guide to Using BERT for the First Time](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/))


In [4]:
# import pre-trained DistilBERT model and tokenizer
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, 
                                                    ppb.DistilBertTokenizer, 
                                                    'distilbert-base-uncased')

In [5]:
# load the pretrained weights into the model and tokenizer objects.
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model     = model_class.from_pretrained(pretrained_weights)

In [6]:
# extract features using extract_features_fn from the local
# feature_engineering package.
features, labels  = extract_features_fn(reports_data, model, tokenizer)
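The implementation of `extract_features_fn` is not shown in this notebook. The sketch below is only my guess at the usual recipe from the tutorial quoted above: tokenize each description, pad the batch, build an attention mask, run the model without gradients, and keep the hidden state of the first (`[CLS]`) token as the feature vector. The function name `extract_cls_features`, the `max_length` and `batch_size` parameters, and the use of the `truncation` argument (available in recent `transformers` releases) are assumptions, not the project's actual code.

```python
import numpy as np
import torch


def extract_cls_features(texts, model, tokenizer, max_length=128, batch_size=32):
    """Return one vector per text: the last hidden state of the first ([CLS]) token.

    Assumes `texts` is a non-empty list of strings.
    """
    # Turn each text into vocabulary ids, truncating long reports.
    token_ids = [
        tokenizer.encode(text, add_special_tokens=True,
                         max_length=max_length, truncation=True)
        for text in texts
    ]
    pad_id = tokenizer.pad_token_id or 0
    chunks = []
    for start in range(0, len(token_ids), batch_size):
        batch = token_ids[start:start + batch_size]
        width = max(len(ids) for ids in batch)
        # Right-pad every sequence in the batch to the same length.
        input_ids = torch.tensor(
            [ids + [pad_id] * (width - len(ids)) for ids in batch])
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask = torch.tensor(
            [[1] * len(ids) + [0] * (width - len(ids)) for ids in batch])
        with torch.no_grad():  # inference only, no gradients needed
            outputs = model(input_ids, attention_mask=attention_mask)
        # outputs[0] is the last hidden state: (batch, seq_len, hidden_size).
        chunks.append(outputs[0][:, 0, :].cpu().numpy())
    return np.concatenate(chunks, axis=0)
```

Under these assumptions, a call such as `extract_cls_features(reports_data['long_description'].tolist(), model, tokenizer)` would return an array with one row per report, and the labels would come straight from `reports_data['severity_code']`.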

In [7]:
# split features and labels into training and testing sets.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)
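The `stratify=labels` argument matters here because severity levels are usually imbalanced. A small sketch with toy labels (the class counts are made up) shows that a stratified split keeps the class shares of the test set in line with those of the full dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy, deliberately imbalanced labels standing in for severity codes 0-4.
labels = np.repeat([0, 1, 2, 3, 4], [40, 80, 120, 100, 60])
features = np.arange(len(labels) * 3).reshape(len(labels), 3)  # dummy features

train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)


def class_shares(y):
    """Fraction of samples per class, for classes 0-4."""
    return np.bincount(y, minlength=5) / len(y)


print(class_shares(labels))
print(class_shares(test_y))
```

Without stratification, a rare class such as trivial could end up under-represented or missing from the 25% test split.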

## Exporting the extracted features

The cell below saves the training and testing sets to disk for the training and testing steps of the
machine learning workflow.

In [8]:
#processed_output_path = os.path.join('/','drive', 'My Drive', 'data', 'processed')
processed_output_path = os.path.join('..', 'data', 'processed')
os.makedirs(processed_output_path, exist_ok=True)

# append the labels as the last column so each file holds features and target together.
torch.save(np.column_stack((train_features, train_labels)),
           os.path.join(processed_output_path, 'mozilla_bug_report_train_data.pt'))
torch.save(np.column_stack((test_features, test_labels)),
           os.path.join(processed_output_path, 'mozilla_bug_report_test_data.pt'))
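Each saved file is a single 2-D array whose last column is the label. The sketch below shows how a later step could split such an array back into features and labels; the helper names `pack` and `unpack` are hypothetical, and the arrays are random stand-ins for the real features.

```python
import numpy as np


def pack(features, labels):
    """Stack features and labels side by side, labels in the last column."""
    return np.column_stack((features, labels))


def unpack(data):
    """Split a packed array back into (features, integer labels)."""
    # The labels were upcast to float when stacked, so cast them back.
    return data[:, :-1], data[:, -1].astype(np.int64)


rng = np.random.default_rng(0)
toy_features = rng.normal(size=(6, 4)).astype(np.float32)
toy_labels = np.array([0, 1, 2, 3, 4, 2])

toy_x, toy_y = unpack(pack(toy_features, toy_labels))
print(toy_x.shape, toy_y)
```

When reading the `.pt` files back with `torch.load`, note that they contain pickled NumPy arrays rather than tensors, so recent PyTorch releases may require `weights_only=False`.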