# Issue Classification Example

In this example we fit a logistic regression on a dataset of GitHub issues to predict the labels of newly entered tickets.

In a second step we try to find issues that deal with similar problems.
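
As a rough illustration of that second step, the sketch below ranks issues by the cosine similarity of their word counts. This is an assumed approach, not necessarily the one used later in this notebook; the helper names (`tokenize`, `cosine_similarity`, `most_similar`) and the sample issue texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", (text or "").lower())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (Counter objects)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, issues, top_n=3):
    """Rank issues (number -> text) by similarity to the query text."""
    query_vec = Counter(tokenize(query))
    scored = [
        (number, cosine_similarity(query_vec, Counter(tokenize(text))))
        for number, text in issues.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Made-up sample issues, for illustration only
sample_issues = {
    1: "Hot reload fails after changing application.properties",
    2: "Native image build fails with missing reflection config",
    3: "Dev mode hot reload does not pick up properties changes",
}
ranked = most_similar("hot reload ignores properties", sample_issues, top_n=2)
print(ranked)
```

Raw word counts are the simplest possible representation. Weighting schemes such as TF-IDF would likely separate issues better, at the cost of an extra fitting step.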

In [None]:
# required libraries
from github import Github
import os
import pandas as pd
import numpy as np
import pickle

# some of the code below is not optimal; suppress the resulting warnings for now
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

## Data Preparation

TODO: In the final stage, we need to keep track of the unique labels and how they are represented in the NN (by index or otherwise). Changes to the set (labels added or removed) need to be anticipated. This can happen when new issues arrive that carry labels that were unknown at training time.
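
One possible way to address this TODO is an append-only mapping from label name to index, so that known labels keep their position when new ones show up. The sketch below is only an assumption about how this could look; the `LabelIndex` class and the label names are hypothetical and not part of this notebook.

```python
class LabelIndex:
    """Append-only mapping from label name to column index."""

    def __init__(self, labels=()):
        self._index = {}
        for name in sorted(labels):  # sorted for a reproducible order
            self.add(name)

    def add(self, name):
        """Register a label; existing labels keep their index."""
        if name not in self._index:
            self._index[name] = len(self._index)
        return self._index[name]

    def encode(self, names):
        """Return a multi-hot vector and the list of labels not known yet."""
        vector = [0] * len(self._index)
        unknown = []
        for name in names:
            if name in self._index:
                vector[self._index[name]] = 1
            else:
                unknown.append(name)
        return vector, unknown

    def __len__(self):
        return len(self._index)


# Made-up label names, for illustration only
index = LabelIndex({"kind/bug", "area/core"})
vector, unknown = index.encode(["kind/bug", "area/new-extension"])
print(vector, unknown)
```

Reporting unknown labels separately, instead of silently dropping them, leaves the decision to the caller: ignore them, or register them with `add` and retrain. Removing labels is not covered here, because it would shift or orphan indices the trained model depends on.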

In [None]:
# retrieve issues from github

force_fetch = False

access_token = os.getenv('GH_API_ACCESS')
token = Github(access_token)
repo = token.get_repo('quarkusio/quarkus')

# fetch the issues if no cached copy exists (or if a fetch is forced)
if force_fetch or not os.path.exists('../data/issues.pkl'):
    issues = repo.get_issues(state='open')

    cols = ['number', 'title', 'body', 'labels', 'state']
    records = []
    unique_labels = set()

    for issue in issues:
        label_names = []
        for label in issue.labels:
            label_names.append(label.name)
            # skip "triage" labels; this clause is specific to the data of this repo
            if not label.name.startswith("triage"):
                unique_labels.add(label.name)
        records.append([issue.number, issue.title, issue.body, label_names, issue.state])

    # build the DataFrame once instead of concatenating row by row
    df = pd.DataFrame(records, columns=cols)

    # cache the results so later runs can skip the API calls
    with open('../data/issues.pkl', 'wb') as f:
        pickle.dump(df, f)
    with open('../data/labels.pkl', 'wb') as f:
        pickle.dump(unique_labels, f)
else:
    print("Loading issues from file...")
    with open('../data/labels.pkl', 'rb') as f:
        unique_labels = pickle.load(f)
    with open('../data/issues.pkl', 'rb') as f:
        df = pickle.load(f)

# let's see what we have
print("Number of issues in total: ", len(df))
print("Unique labels ({0})".format(len(unique_labels)))    
df.head()


Once we have the raw data, we need to project the dependent variables (labels) into the DataFrame and tokenize the text (title, body).

Let's start with the labels.

In [None]:
# project the keys of the known labels
for key in unique_labels:
    df.insert(len(df.columns), key, 0)

# project the labels actually used on each issue
for index, row in df.iterrows():
    if isinstance(row["labels"], list):  # skip rows without a label list
        for label_used in row["labels"]:
            if label_used in unique_labels:
                df.loc[index, label_used] = 1

df.head()

# let's see what we got
first_label = next(iter(unique_labels))
df[first_label].value_counts().plot(kind='pie')
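
The pie chart shows the class balance for a single label only. To get a comparable number for every label at once, the share of issues carrying each label can be counted directly from the label lists. The sketch below is a plain-Python illustration with made-up data; `label_frequencies` is a hypothetical helper, and in this notebook its inputs would presumably be `df["labels"]` and `unique_labels`.

```python
from collections import Counter


def label_frequencies(labels_per_issue, known_labels):
    """Return the share of issues that carry each known label."""
    labels_per_issue = list(labels_per_issue)
    known = set(known_labels)
    counts = Counter()
    for labels in labels_per_issue:
        # count each label at most once per issue and ignore unknown labels
        counts.update(set(labels) & known)
    total = len(labels_per_issue)
    if total == 0:
        return {}
    return {name: counts[name] / total for name in sorted(known)}


# Made-up label lists, for illustration only
sample = [["kind/bug", "area/core"], ["kind/bug"], [], ["triage/needs-triage"]]
frequencies = label_frequencies(sample, {"kind/bug", "area/core"})
print(frequencies)
```

Labels with a very small share are worth noting before training, since a classifier sees few positive examples for them.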