## Background

The MLapp project has the following goals:

1. Illustrate how to build machine-learning-powered developer tools using the [GitHub API](https://developer.github.com/v3/) and Flask.  We would like to show data scientists how to build data products on the GitHub marketplace that use machine learning and that developers can use.  Specifically, we will build an illustrative data product that automatically labels issues.  

2. Use Argo & Kubeflow to process the data, train the model, and serve the predictions. 

3. Gather feedback and iterate 


The scope of this notebook is to address part of goal #1 by illustrating how to acquire a dataset of GitHub issue labels and train a classifier.  

The top issue labels on GitHub by count are listed in [this spreadsheet](https://docs.google.com/spreadsheets/d/1NPacnVsyZMBneeewvPGhCx512A1RPYf8ktDN_RpKeS4/edit?usp=sharing).  To keep things simple, we will build a model that classifies an issue as a `bug`, `feature` or `question`.  We use heuristics to collapse a set of issue labels into these three categories, which can be viewed [in this query](https://console.cloud.google.com/bigquery?sq=123474043329:01abf8866144486f932c756730ddaff1).  

The heuristics for these class labels are contained in the CASE statements below:

```{sql}
  CASE when labels like '%bug%' and labels not like '%not bug%' then True else False end as Bug_Flag,
  CASE when labels like '%feature%' or labels like '%enhancement%' or labels like '%improvement%' or labels like '%request%' then True else False end as Feature_Flag,
  CASE when labels like '%question%' or labels like '%discussion%' then True else False end as Question_Flag,
```
The above CASE statements are located within [this query](https://console.cloud.google.com/bigquery?sq=123474043329:01abf8866144486f932c756730ddaff1).
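To make the matching rules concrete, here is a minimal Python sketch that mirrors the three CASE expressions on a single `labels` string. The function name `label_flags` is hypothetical and not part of the original query; the substring checks are meant to behave like the `LIKE '%...%'` patterns.

```python
def label_flags(labels: str) -> dict:
    """Mirror the SQL CASE heuristics on one labels string (illustrative only)."""
    return {
        # '%bug%' matches, but '%not bug%' does not
        "bug": "bug" in labels and "not bug" not in labels,
        "feature": any(k in labels for k in ("feature", "enhancement", "improvement", "request")),
        "question": any(k in labels for k in ("question", "discussion")),
    }


print(label_flags('["enhancement", "wish"]'))
print(label_flags('["not bug"]'))
print(label_flags('["bug", "question"]'))
```

Because these are substring matches, a label such as `debugger` would also set the bug flag, and a single issue can set more than one flag.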
    

We tried the following alternative approaches before this task and did not pursue them further:
 - Transfer learning using the [GitHub Issue Summarizer](https://github.com/hamelsmu/Seq2Seq_Tutorial) to predict custom labels on existing repos.  This did not work well: custom labels in repositories are very noisy, and there is often not enough data to predict them adequately.  
 - Classifying more than the above three classes.  However, human-applied issue labels are very subjective, and it is not clear what is a question vs. a bug.  
 - Multi-label classification, since labels can co-occur.  There is very little overlap between `bug`, `feature` and `question` labels, so we decided to simplify things and make this a multi-class classification problem instead.  
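
One way to check how often the three flags co-occur is to count the number of flags set per issue. The sketch below uses a handful of made-up rows, not the real dataset, and plain Python so it runs without any downloads.

```python
from collections import Counter

# Hypothetical (c_bug, c_feature, c_question) rows, invented for illustration.
rows = [
    (True, False, False),
    (False, True, False),
    (False, False, True),
    (True, True, False),   # an issue that carries both a bug and a feature label
    (False, True, False),
]

# True counts as 1, so sum(row) is the number of flags set on that issue.
flags_per_issue = Counter(sum(row) for row in rows)
overlap_share = sum(n for k, n in flags_per_issue.items() if k > 1) / len(rows)

print(dict(flags_per_issue))
print(f"{overlap_share:.0%} of issues have more than one flag")
```

A small overlap share is what makes it reasonable to assign a single class per issue.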


Note: the code in this notebook was executed on a [p3.8xlarge](https://aws.amazon.com/ec2/instance-types/p3/) instance on AWS.

## Outline 

This notebook will follow these steps:

1. Download and partition dataset
2. Pre-process dataset
3. Build model architecture
4. Train the model


# Download and Partition Dataset

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

pd.set_option('max_colwidth', 1000)

In [4]:
df = pd.concat([pd.read_csv(f'https://storage.googleapis.com/codenet/issue_labels/00000000000{i}.csv.gz')
                for i in range(10)])

#split data into train/test
traindf, testdf = train_test_split(df, test_size=.15)

traindf.to_pickle('traindf.pkl')
testdf.to_pickle('testdf.pkl')

#print out stats about shape of data
print(f'Train: {traindf.shape[0]:,} rows {traindf.shape[1]:,} columns')
print(f'Test: {testdf.shape[0]:,} rows {testdf.shape[1]:,} columns')

Train: 2,698,578 rows 10 columns
Test: 476,220 rows 10 columns


In [15]:
class_names = ['bug', 'feature', 'question']

### Discussion of the data

In [7]:
# preview data
traindf.head(3)

Unnamed: 0,url,repo,title,body,num_labels,labels,c_bug,c_feature,c_question,class_int
112473,"""https://github.com/DockStation/dockstation/issues/6""",DockStation/dockstation,feature request: image layers window hide,"some feedback for this window:\r \r ! image https://cloud.githubusercontent.com/assets/9369080/25952392/3769f842-3669-11e7-8a91-4ed2de640246.png \r \r \r it's really cool, but blocking. i think users would prefer to be able hide it the \ ok\ was not hiding it in my case - don't know if it should?",2,"[""enhancement"", ""wish""]",False,True,False,1
33329,"""https://github.com/julianschritt/secreth_telegrambot/issues/2""",julianschritt/secreth_telegrambot,grant group admins permissions to /cancelgame and /startgame,ending a game requires the person who started it to end it. even kicking and re-adding the bot doesn't close an existing game. so if the person who started a game is afk for a long time the bot becomes useless for that group.,1,"[""enhancement"", ""enhancement""]",False,True,False,1
95978,"""https://github.com/remotestorage/remotestorage-bookmarks-chrome/issues/3""",remotestorage/remotestorage-bookmarks-chrome,deprecate in favor of memm?,"this thing is super old and nobody's actively maintaining it. i don't know if someone's actively using it, but i think it would make sense to direct people coming here to https://github.com/lesion/memm, which works and is maintained by @lesion, who's also a core contributor to rs.js.",1,"[""question""]",False,False,True,2


Description of the columns:  

- url:        url where you can find this issue
- repo:       owner/repo name
- title:      title of the issue
- body:       body of the issue, not including comments
- num_labels: number of issue labels
- labels:     an array of labels applied to the issue, represented as a string
- c_bug:      boolean flag that indicates if the issue label corresponds to a bug
- c_feature:  boolean flag that indicates if the issue label corresponds to a feature
- c_question: boolean flag that indicates if the issue label corresponds to a question
- class_int:  integer from 0 to 2 that corresponds to the class label.  **0=bug, 1=feature, 2=question**
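
As a small illustration of working with these columns, the sketch below decodes a `labels` string back into a list and maps a `class_int` to its name. It assumes the string is JSON-encoded, which matches the preview rows but is not stated explicitly.

```python
import json

class_names = ['bug', 'feature', 'question']

# Values copied from the first two preview rows.
first = json.loads('["enhancement", "wish"]')
second = json.loads('["enhancement", "enhancement"]')

print(first, len(first))
print(second, len(set(second)))   # duplicates collapse to one distinct label
print(class_names[1])             # class_int 1 -> 'feature'
```

Judging from the second preview row, `num_labels` appears to count distinct labels, so `len(set(...))` is the closer match than `len(...)`.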

### Summary Statistics

Class frequency **0=bug, 1=feature, 2=question**

In [14]:
traindf.groupby('class_int').size()

class_int
0    1211335
1    1231499
2     255744
dtype: int64
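
The counts above are easier to compare as shares of the training set. This short sketch recomputes them from the printed counts, so it does not need the DataFrame.

```python
class_names = ['bug', 'feature', 'question']
counts = [1211335, 1231499, 255744]   # copied from the output above

total = sum(counts)
shares = {name: n / total for name, n in zip(class_names, counts)}

for name, share in shares.items():
    print(f'{name:>8}: {share:.1%}')
```

The `question` class makes up less than a tenth of the training set, so the classes are noticeably imbalanced.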

Average number of issues per repo and per org:

In [28]:
print(f' Avg # of issues per repo: {len(traindf) / traindf.repo.nunique():.1f}')
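# Note: split('/')[-1] selects the repo name (the part after the slash); the owner/org is split('/')[0].
print(f" Avg # of issues per org: {len(traindf) / traindf.repo.apply(lambda x: x.split('/')[-1]).nunique():.1f}")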

 Avg # of issues per repo: 7.6
 Avg # of issues per org: 8.7
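
Since `repo` has the form `owner/repo`, the owner (org) is the part before the slash. Here is a minimal sketch of the per-owner average on a few made-up rows; the sample values are hypothetical.

```python
# Hypothetical `repo` values, one per issue.
repos = ['Microsoft/vscode', 'Microsoft/TypeScript', 'rancher/rancher', 'Microsoft/vscode']

owners = {r.split('/')[0] for r in repos}    # part before the slash
names = {r.split('/')[-1] for r in repos}    # part after the slash

issues_per_repo = len(repos) / len(set(repos))
issues_per_owner = len(repos) / len(owners)

print(sorted(owners), sorted(names))
print(f'{issues_per_repo:.1f} issues per repo, {issues_per_owner:.1f} issues per owner')
```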


In [38]:
pareto_df = pd.DataFrame({'pcnt': df.groupby('repo').size() / len(df), 'count': df.groupby('repo').size()})
pareto_df.sort_values('pcnt', ascending=False).head(25)

repo,pcnt,count
Microsoft/vscode,0.005128,16281
rancher/rancher,0.00245,7779
MicrosoftDocs/azure-docs,0.001963,6233
godotengine/godot,0.001952,6198
ansible/ansible,0.00195,6192
hashicorp/terraform,0.001649,5235
kubernetes/kubernetes,0.001621,5147
lionheart/openradar-mirror,0.00133,4221
elastic/kibana,0.001201,3813
magento/magento2,0.001129,3583
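
To see how concentrated the data is, the shares in the table can be accumulated. The sketch below does this for the ten rows shown, using only the standard library.

```python
from itertools import accumulate

# (repo, pcnt) pairs copied from the table above.
top_repos = [
    ('Microsoft/vscode', 0.005128),
    ('rancher/rancher', 0.00245),
    ('MicrosoftDocs/azure-docs', 0.001963),
    ('godotengine/godot', 0.001952),
    ('ansible/ansible', 0.00195),
    ('hashicorp/terraform', 0.001649),
    ('kubernetes/kubernetes', 0.001621),
    ('lionheart/openradar-mirror', 0.00133),
    ('elastic/kibana', 0.001201),
    ('magento/magento2', 0.001129),
]

cumulative = list(accumulate(pcnt for _, pcnt in top_repos))
for (repo, _), running in zip(top_repos, cumulative):
    print(f'{repo:<28} {running:.4f}')
```

Together, the ten repositories shown account for only about 2% of all issues, so no single project dominates the dataset.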


# Pre-Process Data

To process the raw text data, we will use [ktext](https://github.com/hamelsmu/ktext).

In [50]:
from ktext.preprocess import processor
import dill as dpickle
import numpy as np

Clean and tokenize the text, then pad or truncate each document so that its length equals the 75th percentile of document lengths in the dataset.
Retain only the top `keep_n` words in the vocabulary and map all remaining words to index 1, which serves as the common index for rare words.
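
The idea can be sketched in plain Python. This is not ktext's implementation: the whitespace tokenizer, the use of 0 as the padding index, padding on the right, and the nearest-rank percentile are all simplifying assumptions, and the helper names are hypothetical.

```python
from collections import Counter

PAD, RARE = 0, 1  # index 1 for rare words follows the text; 0 for padding is an assumption


def fit_vocab(docs, keep_n):
    """Assign ids 2, 3, ... to the keep_n most frequent tokens."""
    counts = Counter(tok for doc in docs for tok in doc.split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(keep_n))}


def pick_length(docs, pct):
    """Nearest-rank approximation of the pct quantile of document lengths."""
    lengths = sorted(len(doc.split()) for doc in docs)
    return lengths[int(pct * (len(lengths) - 1))]


def vectorize(doc, vocab, max_len):
    """Map tokens to ids, truncate to max_len, then right-pad with PAD."""
    ids = [vocab.get(tok, RARE) for tok in doc.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


docs = [
    'fix crash on start',
    'add dark mode',
    'crash on save',
    'how to install on windows please',
]
vocab = fit_vocab(docs, keep_n=2)
max_len = pick_length(docs, 0.75)
vectors = [vectorize(d, vocab, max_len) for d in docs]

print(vocab, max_len)
print(vectors)
```

Every document ends up with the same length, which is what allows the results to be stacked into a single array for training.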

**Warning:** the below block of code can take a long time to execute.

#### Learn the vocabulary from the training dataset

In [40]:
%%time

train_body_raw = traindf.body.tolist()
train_title_raw = traindf.title.tolist()

# Clean, tokenize, and pad / truncate so that each document length equals the 75th percentile for the dataset.
#  Also retain only the top keep_n words in the vocabulary and map the remaining words
#  to index 1, which becomes the common index for rare words.

# process the issue body data
body_pp = processor(hueristic_pct_padding=.75, keep_n=8000)
train_body_vecs = body_pp.fit_transform(train_body_raw)

# process the title data
title_pp = processor(hueristic_pct_padding=.75, keep_n=4500)
train_title_vecs = title_pp.fit_transform(train_title_raw)

 See full histogram by insepecting the `document_length_stats` attribute.
 See full histogram by insepecting the `document_length_stats` attribute.


CPU times: user 8min 8s, sys: 32.3 s, total: 8min 40s
Wall time: 20min 24s


#### Apply transformations to Test Data

In [41]:
%%time

test_body_raw = testdf.body.tolist()
test_title_raw = testdf.title.tolist()

test_body_vecs = body_pp.transform_parallel(test_body_raw)
test_title_vecs = title_pp.transform_parallel(test_title_raw)



CPU times: user 57.5 s, sys: 31 s, total: 1min 28s
Wall time: 4min 16s


#### Extract Labels

Add an additional dimension to the end to facilitate compatibility with Keras.

In [53]:
train_labels = np.expand_dims(traindf.class_int.values, -1)
test_labels = np.expand_dims(testdf.class_int.values, -1)
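
For reference, this is what the extra dimension looks like on a small, made-up label array:

```python
import numpy as np

class_int = np.array([0, 1, 2, 1])        # hypothetical labels
labels = np.expand_dims(class_int, -1)    # append a trailing axis of size 1

print(class_int.shape, '->', labels.shape)
```

The values are unchanged; only the shape goes from `(n,)` to `(n, 1)`.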

#### Check shapes

In [54]:
assert train_body_vecs.shape[0] == train_title_vecs.shape[0] == train_labels.shape[0]
assert test_body_vecs.shape[0] == test_title_vecs.shape[0] == test_labels.shape[0]

#### Save pre-processors and data to disk

In [55]:
# Save the preprocessor
with open('body_pp.dpkl', 'wb') as f:
    dpickle.dump(body_pp, f)

with open('title_pp.dpkl', 'wb') as f:
    dpickle.dump(title_pp, f)

# Save the processed data
np.save('train_title_vecs.npy', train_title_vecs)
np.save('train_body_vecs.npy', train_body_vecs)
np.save('test_body_vecs.npy', test_body_vecs)
np.save('test_title_vecs.npy', test_title_vecs)
np.save('train_labels.npy', train_labels)
np.save('test_labels.npy', test_labels)
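
The saved arrays can be read back with `np.load`. The sketch below round-trips a small made-up array through a temporary directory so that it does not touch the files written above.

```python
import os
import tempfile

import numpy as np

vecs = np.array([[1, 3, 2, 1], [1, 1, 1, 0]])   # hypothetical stand-in for train_title_vecs

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train_title_vecs.npy')
    np.save(path, vecs)
    restored = np.load(path)

print(np.array_equal(vecs, restored))
```

The pre-processors saved with dill can be restored in a similar way, by opening the `.dpkl` file in `'rb'` mode and calling `dpickle.load(f)`.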

# Build Model Architecture