# Data Acquisition

As a first step, we need to fetch the raw data and bring it into a shape that we can use in later stages (e.g. model training).

In [48]:
# required libraries
from github import Github
import os
import pandas as pd
import numpy as np
import pickle

# we are using some less optimal code, suppress the warnings for now
import warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)

## Data Preparation

TODO: In the final stage, we need to keep track of the unique labels and how they are represented in the NN (by index or otherwise). Changes to the label set (labels added or removed) need to be anticipated. This can happen when new issues arrive that carry labels that were unknown at training time.
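
One possible way to approach this TODO is to keep an explicit, append-only mapping between label names and indices. The sketch below is only an illustration: the class name `LabelIndex`, the `<unknown>` placeholder, and the choice to reserve index 0 for unseen labels are assumptions, not part of the existing notebook.

```python
class LabelIndex:
    """Hypothetical helper: a stable label -> index mapping.

    Index 0 is reserved for labels that were unknown at training time.
    New labels are only ever appended, so existing indices never change.
    """

    UNKNOWN = "<unknown>"

    def __init__(self, labels):
        # sort once so the initial assignment is reproducible across runs
        self.index_to_label = [self.UNKNOWN] + sorted(set(labels))
        self.label_to_index = {
            label: i for i, label in enumerate(self.index_to_label)
        }

    def encode(self, label):
        """Return the index of a label, or 0 if it has never been seen."""
        return self.label_to_index.get(label, 0)

    def extend(self, new_labels):
        """Register unseen labels at the end and return the ones added."""
        added = []
        for label in sorted(set(new_labels)):
            if label not in self.label_to_index:
                self.label_to_index[label] = len(self.index_to_label)
                self.index_to_label.append(label)
                added.append(label)
        return added


# toy usage with two labels known at training time
label_index = LabelIndex({"area/core", "area/native-image"})
unseen_before = label_index.encode("area/new")   # falls back to the unknown slot
newly_added = label_index.extend(["area/new", "area/core"])
```

Removed labels are deliberately not handled here; one option would be to keep their slots reserved so that a trained model's output layer stays aligned with the mapping.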

In [49]:
# retrieve issues from github

force_fetch = False

access_token = os.getenv('GH_API_ACCESS')  # personal access token, read from the environment
gh = Github(access_token)
repo = gh.get_repo('quarkusio/quarkus')

# fetch the issues from GitHub if they don't exist locally (or if forced)
# note: check the same path that the issues are written to below
if force_fetch or not os.path.exists('../data/raw/issues.pkl'):
    issues = repo.get_issues(state='open')

    cols = ['title', 'body', 'labels']
    df = pd.DataFrame(columns=cols)
    unique_labels = set()

    for issue in issues:    
        label_names = []
        for label in issue.labels:        
            label_names.append(label.name)
            if label.name.startswith("area/"): # narrow down to `area/*` labels
                unique_labels.add(label.name)
        new_record = pd.DataFrame([[issue.title, issue.body, label_names]], columns=cols)
        df = pd.concat([df, new_record], ignore_index=True)    

    # use context managers so the files are flushed and closed
    with open('../data/raw/issues.pkl', 'wb') as f:
        pickle.dump(df, f)
    with open('../data/raw/labels.pkl', 'wb') as f:
        pickle.dump(unique_labels, f)
else:
    print("Loading issues from file...")
    with open('../data/raw/labels.pkl', 'rb') as f:
        unique_labels = pickle.load(f)
    with open('../data/raw/issues.pkl', 'rb') as f:
        df = pickle.load(f)

# let's see what we have
print("Number of issues in total: ", len(df))
print("Unique labels ({0})".format(len(unique_labels)))    
df.head()


Number of issues in total:  2312
Unique labels (101)


| | title | body | labels |
|---|---|---|---|
| 0 | Native build fails with signed jars containing... | Building native applications which depend on s... | [kind/bug, area/native-image] |
| 1 | Drop duplicate source plugin entry in release ... | | [] |
| 2 | Adopt --strict-image-heap that will land in Gr... | oracle/graal#7393 adds a new --strict-image-he... | [area/core] |
| 3 | Adopt new option `--strict-image-heap` that wi... | ### Description\n\nhttps://github.com/oracle/g... | [kind/enhancement, triage/needs-triage] |
| 4 | Add note about unsupported @Lock in Spring Dat... | Relates to: #35891 | [area/documentation, triage/backport?, triage/... |
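
The fetch loop above grows the DataFrame with one `pd.concat` call per issue, which copies the frame on every iteration. As a sketch of an alternative, the rows can be collected in a plain list and turned into a DataFrame once. The function name `issues_to_frame` and the `fake_issue` stand-ins are hypothetical; they only mimic the `title`, `body` and `labels[*].name` attributes used in the cell above so that the example runs without GitHub access.

```python
from types import SimpleNamespace

import pandas as pd


def issues_to_frame(issues, prefix="area/"):
    """Collect issues into a DataFrame and gather the `prefix*` labels."""
    rows = []
    unique_labels = set()
    for issue in issues:
        label_names = [label.name for label in issue.labels]
        # narrow down to labels with the given prefix, as in the cell above
        unique_labels.update(n for n in label_names if n.startswith(prefix))
        rows.append(
            {"title": issue.title, "body": issue.body, "labels": label_names}
        )
    # build the frame once instead of concatenating per issue
    frame = pd.DataFrame(rows, columns=["title", "body", "labels"])
    return frame, unique_labels


def fake_issue(title, body, names):
    """Hypothetical stand-in for an issue object returned by the API."""
    return SimpleNamespace(
        title=title,
        body=body,
        labels=[SimpleNamespace(name=n) for n in names],
    )


sample_issues = [
    fake_issue("Native build fails", "details", ["kind/bug", "area/native-image"]),
    fake_issue("Drop duplicate entry", None, []),
]
sample_df, sample_labels = issues_to_frame(sample_issues)
```

With the real repository, the same function could presumably be called as `issues_to_frame(repo.get_issues(state='open'))`, since it relies only on the attributes already used above.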


Once we have the raw data, we project the dependent variables (labels) into the DataFrame and tackle the tokenization of the text (title, body) in a separate step.

Let's start with the labels, but narrow them down to just `area/*` labels to simplify this example.
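
To make the idea of the projection concrete, here is a minimal sketch on toy data: every known label becomes its own 0/1 column (a multi-hot encoding). It is not the notebook's own implementation; the helper name `multi_hot` and the toy rows are assumptions. It builds all label columns in one go, which avoids inserting columns one at a time.

```python
import pandas as pd


def multi_hot(df, known_labels, source_col="labels"):
    """Return a copy of df with one 0/1 column per known label."""
    ordered = sorted(known_labels)  # fixed column order, independent of set order
    encoded = pd.DataFrame(
        {
            label: [
                # rows without a label list (e.g. None) encode as all zeros
                int(isinstance(labels, list) and label in labels)
                for labels in df[source_col]
            ]
            for label in ordered
        },
        index=df.index,
    )
    return pd.concat([df, encoded], axis=1)


# toy data: one labelled issue, one with an empty list, one without labels
toy = pd.DataFrame(
    {
        "title": ["a", "b", "c"],
        "labels": [["kind/bug", "area/core"], [], None],
    }
)
toy_encoded = multi_hot(toy, {"area/core", "area/docs"})
```

Labels outside the known set, such as `kind/bug` in the toy data, are simply ignored, which matches the narrowing to `area/*` described above.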

In [50]:
# project the keys of the known labels
for key in unique_labels:
    df.insert(len(df.columns), key, 0)

# project the values of the actual labels used on each issue
for index, row in df.iterrows():
    if isinstance(row["labels"], list):  # skip rows without a label list
        for label_used in row["labels"]:
            if label_used in unique_labels:
                df.loc[index, label_used] = 1
                    
# let's see what we got
print(df.iloc[:,6:len(df.columns)-1])

# write a separate df for inference that contains the label encodings
with open('../data/prepared/issues.pkl', 'wb') as f:
    pickle.dump(df, f)

      area/keycloak  area/infinispan  area/documentation  area/jbang  \
0                 0                0                   0           0   
1                 0                0                   0           0   
2                 0                0                   0           0   
3                 0                0                   0           0   
4                 0                0                   1           0   
...             ...              ...                 ...         ...   
2307              0                0                   0           0   
2308              0                0                   0           0   
2309              0                0                   0           0   
2310              0                0                   0           0   
2311              0                0                   0           0   

      area/kogito  area/core  area/elasticsearch  area/codestarts  area/stork  \
0               0          0                   0      