In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

The data was produced by this [SQL query](https://console.cloud.google.com/bigquery?sq=123474043329:01abf8866144486f932c756730ddaff1) in BigQuery.

### Download and partition data

In [30]:
base_url = 'https://storage.googleapis.com/codenet/issue_labels/'
df = pd.concat([pd.read_csv(base_url+f'00000000000{i}.csv.gz') for i in range(10)])
df = df[df['num_concurrent_classes'] <= 1]
df.to_pickle('labeled_issues_df.pkl')

Note on filtering based on **num_concurrent_classes** field:  

A value of 0 means the issue was not classified as `Bug`, `Feature`, or `Question`. A value of 1 means exactly one of `Bug`, `Feature`, or `Question` applies. Because the overlap between the three classes is very small, we drop issues that fall into more than one class in order to simplify the problem (turning multi-label classification into multi-class classification).
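As a rough illustration of the filter above, here is a minimal sketch on a toy DataFrame. The rows are made up, and computing `num_concurrent_classes` as the row-wise sum of the three class flags is an assumption for illustration; in this notebook the field arrives precomputed from the BigQuery export.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names come from the dataset.
toy = pd.DataFrame({
    'title': ['crash on start', 'add dark mode', 'bug or feature?', 'misc'],
    'c_bug': [True, False, True, False],
    'c_feature': [False, True, True, False],
    'c_question': [False, False, False, False],
})

# Assumption: the field counts how many of the three classes apply to an issue.
toy['num_concurrent_classes'] = toy[['c_bug', 'c_feature', 'c_question']].sum(axis=1)

# Keep issues with at most one class; the ambiguous multi-label row is dropped.
filtered = toy[toy['num_concurrent_classes'] <= 1]
print(filtered['title'].tolist())
```

Only the row that is flagged as both a bug and a feature is removed, so each remaining issue maps to a single class.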

In [31]:
# see the size of the serialized dataframe
!ls -lah labeled_issues_df.pkl

-rw-r--r-- 1 root root 4.6G Mar 26 16:52 labeled_issues_df.pkl


In [32]:
# split the data into train and test sets, holding out 15% for testing
traindf, testdf = train_test_split(df, test_size=.15)


# print out stats about the shape of the data
print(f'Train: {traindf.shape[0]:,} rows {traindf.shape[1]:,} columns')
print(f'Test: {testdf.shape[0]:,} rows {testdf.shape[1]:,} columns')

# preview data
traindf.head(3)

Train: 4,984,719 rows 11 columns
Test: 879,657 rows 11 columns


| | url | repo | title | body | num_labels | labels | num_concurrent_classes | c_bug | c_feature | c_question | c_other |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 68065 | "https://github.com/ESMCI/cime/issues/2435" | ESMCI/cime | can we stop copying log files to the case dire... | i was talking to @cecilehannay about disk spac... | 3 | ["in progress", "tp: CIMElib", "ty: enhancement"] | 1 | False | True | False | False |
| 75697 | "https://github.com/prescottprue/react-redux-f... | prescottprue/react-redux-firebase | isloaded helper always returns false when a de... | if you provide a default value to datatojs t... | 1 | ["enhancement"] | 1 | False | True | False | False |
| 567001 | "https://github.com/denvaar/base64/issues/4" | denvaar/base64 | any improvement in general | any improvement/optimization to the code in ge... | 1 | ["hacktoberfest"] | 0 | False | False | False | True |


In [71]:
traindf.to_pickle('traindf.pkl')
testdf.to_pickle('testdf.pkl')

### Discussion of The Dataset

This dataset consists of issues and their labels, with labels for certain categories grouped together.  The intention is to use this data to build a classifier that can auto-label issues.  There are three classes, `Bug`, `Feature`, and `Question`, indicated by the `c_bug`, `c_feature`, and `c_question` columns, respectively, in the pandas DataFrame shown above.

These classes were formed by grouping together human-annotated issue labels, using the following heuristics (the code below is SQL):

```{sql}
  CASE when labels like '%bug%' and labels not like '%not bug%' then True else False end as c_bug,
  
  CASE when labels like '%feature%' or labels like '%enhancement%' or labels like '%improvement%' or labels like '%request%' then True else False end as c_feature,
  
  CASE when labels like '%question%' or labels like '%discussion%' then True else False end as c_question,
```

If the issue does not fall into any of these three categories, the `c_other` flag is set to True.  This can be used as a fourth class.
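The heuristics above can be sketched in Python for readers who want to experiment outside BigQuery. The function name `classify_labels` is hypothetical, and the sketch assumes `labels` is the raw serialized label string, so that SQL `LIKE '%x%'` corresponds to a plain substring test.

```python
def classify_labels(labels: str) -> dict:
    """Mirror the SQL CASE expressions with substring checks on the labels string."""
    c_bug = 'bug' in labels and 'not bug' not in labels
    c_feature = any(k in labels for k in ('feature', 'enhancement', 'improvement', 'request'))
    c_question = any(k in labels for k in ('question', 'discussion'))
    # c_other is set only when none of the three classes matched.
    c_other = not (c_bug or c_feature or c_question)
    return {'c_bug': c_bug, 'c_feature': c_feature,
            'c_question': c_question, 'c_other': c_other}


# Label strings taken from the preview rows shown earlier in the notebook.
print(classify_labels('["in progress", "tp: CIMElib", "ty: enhancement"]'))
print(classify_labels('["hacktoberfest"]'))
```

Because the checks are substring matches, a label such as `"not bug"` suppresses `c_bug` for the whole issue, even if another label on the same issue contains `bug`.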

Distribution of labels in the training set:

In [34]:
traindf[['c_bug', 'c_feature', 'c_question', 'c_other']].sum()

c_bug         1212503
c_feature     1231369
c_question     255189
c_other       2285658
dtype: int64

## Pre-Process Data For Deep Learning

See this [repo](https://github.com/hamelsmu/ktext) for documentation on the ktext package

In [36]:
from ktext.preprocess import processor

In [37]:
train_body_raw = traindf.body.tolist()
train_title_raw = traindf.title.tolist()
# preview the first element
train_body_raw[0]

"i was talking to @cecilehannay about disk space issues on cheyenne. she pointed out that it's often useful to keep case directories around for a long time, but they take up much more space than they need to because the log files are copied into your case directory. can we stop copying log files into your case directory, just leaving them in your archive directory?"

In [38]:
%%time
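# Clean, tokenize, and pad / truncate so that every document has the same length.
#  No fixed length is passed here; with hueristic_pct_padding=.8 the length is presumably
#  chosen by ktext from the document-length histogram. Also, retain only the top 8,000 words
#  in the vocabulary and map the remaining words to 1, the common index for rare words.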
body_pp = processor(hueristic_pct_padding=.8, keep_n=8000)
train_body_vecs = body_pp.fit_transform(train_body_raw)

 See full histogram by insepecting the `document_length_stats` attribute.


CPU times: user 20min 2s, sys: 2min 58s, total: 23min
Wall time: 40min 47s


In [39]:
%%time
# Instantiate a text processor for the titles
title_pp = processor(hueristic_pct_padding=.8, keep_n=5000)

# process the title data
train_title_vecs = title_pp.fit_transform(train_title_raw)

 See full histogram by insepecting the `document_length_stats` attribute.


CPU times: user 3min 22s, sys: 2min 14s, total: 5min 37s
Wall time: 5min 53s


**Look at one example of processed issue titles**

In [40]:
print('\noriginal string:\n', train_title_raw[0])
print('after pre-processing:\n', train_title_vecs[0])


original string:
 can we stop copying log files to the case directory?
after pre-processing:
 [  31  238  534 2330  162   60    2    5  347  305]
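To make the transformation above more concrete, here is a minimal pure-Python sketch of the general idea: keep the most frequent words, map everything else to index 1, and pad or truncate to a fixed length. This is not ktext's implementation. The helper names `fit_vocab` and `transform`, the whitespace tokenizer, the use of 0 as the padding index, and the choice of left-padding with end-truncation are all assumptions made for illustration.

```python
from collections import Counter


def fit_vocab(docs, keep_n):
    """Assign indices 2, 3, ... to the keep_n most frequent tokens."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    # Indices 0 and 1 are reserved: 0 for padding (assumed), 1 for rare words.
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(keep_n))}


def transform(doc, vocab, maxlen):
    """Convert a document to a fixed-length list of integer token ids."""
    ids = [vocab.get(tok, 1) for tok in doc.lower().split()]
    ids = ids[:maxlen]                      # truncate long documents at the end
    return [0] * (maxlen - len(ids)) + ids  # left-pad short documents with zeros


docs = ['fix the crash', 'fix the typo', 'add the feature']
vocab = fit_vocab(docs, keep_n=2)
print(vocab)
print(transform('fix the unknown crash', vocab, maxlen=6))
```

With `keep_n=2`, only `the` and `fix` receive their own indices, so `unknown` and `crash` both collapse to the rare-word index 1.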


### Apply Transforms to Test Data

In [52]:
test_body_raw = testdf.body.tolist()
test_title_raw = testdf.title.tolist()

test_body_vecs = body_pp.transform_parallel(test_body_raw)
test_title_vecs = title_pp.transform_parallel(test_title_raw)



Serialize all of this to disk for later use

In [53]:
import dill as dpickle
import numpy as np

# Save the preprocessor
with open('body_pp.dpkl', 'wb') as f:
    dpickle.dump(body_pp, f)

with open('title_pp.dpkl', 'wb') as f:
    dpickle.dump(title_pp, f)

# Save the processed data
np.save('train_title_vecs.npy', train_title_vecs)
np.save('train_body_vecs.npy', train_body_vecs)
np.save('test_body_vecs.npy', test_body_vecs)
np.save('test_title_vecs.npy', test_title_vecs)
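A later notebook can read these arrays back with `np.load`. The sketch below checks the round trip on a small stand-in array written to a temporary directory; the file name `toy_vecs.npy` and the array contents are hypothetical.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for one of the processed arrays.
vecs = np.array([[0, 0, 3, 2], [5, 4, 1, 1]], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_vecs.npy')
    np.save(path, vecs)     # write in NumPy's .npy format
    loaded = np.load(path)  # read back with shape and dtype preserved

print(loaded.shape, loaded.dtype)
print(np.array_equal(loaded, vecs))
```

The preprocessors saved with `dpickle.dump` can be restored in the same spirit by opening the file in `'rb'` mode and calling `dpickle.load(f)`.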

### Extract Labels

In [76]:
train_labels = traindf[['c_bug', 'c_feature', 'c_question', 'c_other']].astype(int).values.argmax(axis=1)
test_labels = testdf[['c_bug', 'c_feature', 'c_question', 'c_other']].astype(int).values.argmax(axis=1)
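The `argmax` call above turns the four boolean flag columns into a single integer class index per issue. A small sketch with made-up rows shows the mapping; the `class_names` list is introduced here only for readability.

```python
import numpy as np

class_names = ['c_bug', 'c_feature', 'c_question', 'c_other']

# Each row is one issue; exactly one flag is set per row.
one_hot = np.array([
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])

# argmax returns the column position of the set flag, i.e. the class index.
labels = one_hot.argmax(axis=1)
print(labels.tolist())
print([class_names[i] for i in labels])

# Sanity check: argmax is only unambiguous when each row has exactly one flag.
assert (one_hot.sum(axis=1) == 1).all()
```

Note that `argmax` returns the first maximum on ties, so a row with several flags set would silently map to the earliest column. The earlier filter on `num_concurrent_classes` is what keeps this encoding safe.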

In [77]:
np.save('train_labels.npy', train_labels)
np.save('test_labels.npy', test_labels)

### Check the data

In [78]:
assert train_body_vecs.shape[0] == train_title_vecs.shape[0] == train_labels.shape[0]
assert test_body_vecs.shape[0] == test_title_vecs.shape[0] == test_labels.shape[0]

### Notes about artifacts and datasets

In [79]:
! pwd

/ds/MLapp/notebooks


In [80]:
!ls -lah

total 13G
drwxr-xr-x 3 root root 6.0K Mar 26 22:36 .
drwxr-xr-x 4 root root 6.0K Mar 26 05:47 ..
drwxr-xr-x 2 root root 6.0K Mar 26 05:49 .ipynb_checkpoints
-rw-r--r-- 1 root root  20K Mar 26 22:32 1_Download_and_Preprocess.ipynb
-rw-r--r-- 1 root root  44K Mar 26 05:47 Demo.ipynb
-rw-r--r-- 1 root root 249M Mar 26 22:01 body_pp.dpkl
-rw-r--r-- 1 root root 4.6G Mar 26 16:52 labeled_issues_df.pkl
-rw-r--r-- 1 root root 470M Mar 26 22:14 test_body_vecs.npy
-rw-r--r-- 1 root root 6.8M Mar 26 22:36 test_labels.npy
-rw-r--r-- 1 root root  34M Mar 26 22:14 test_title_vecs.npy
-rw-r--r-- 1 root root 712M Mar 26 22:29 testdf.pkl
-rw-r--r-- 1 root root  23M Mar 26 22:01 title_pp.dpkl
-rw-r--r-- 1 root root 2.6G Mar 26 22:02 train_body_vecs.npy
-rw-r--r-- 1 root root  39M Mar 26 22:36 train_labels.npy
-rw-r--r-- 1 root root 191M Mar 26 22:01 train_title_vecs.npy
-rw-r--r-- 1 root root 3.9G Mar 26 22:29 traindf.pkl
