# Preprocess Data
This notebook contains materials to parse raw python files into function and docstring pairs, tokenize both function and docstring into tokens, and split these pairs into train, valid, and test sets.  

*This step is optional, as we provide links to download pre-processed data at various points in the tutorial.  However, you might find it useful to go through these steps in order to understand how the data is prepared.*

If you are following the recommended approach of using a `p3.8xlarge` instance for this entire tutorial, you can run this notebook in this Docker container: [hamelsmu/ml-gpu](https://hub.docker.com/r/hamelsmu/ml-gpu/).

Alternatively, if you wish to speed up *this notebook* by using an instance with lots of cores (because everything in this notebook is CPU bound), you can use this container [hamelsmu/ml-cpu](https://hub.docker.com/r/hamelsmu/ml-gpu/).

In [15]:
%load_ext autoreload
%autoreload 2

import ast
import glob
import re
from pathlib import Path

import astor
import pandas as pd
import spacy
from tqdm import tqdm
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from IPython.display import display
from general_utils import apply_parallel, flattenlist
EN = spacy.load('en')


In [16]:
! python -V
! mkdir -p data/processed_data
! mkdir -p data/lang_model
! mkdir -p data/lang_model_emb
! mkdir -p data/seq2seq

Python 3.6.3 :: Anaconda custom (64-bit)


## Download and read raw python files

The first thing we want to do is gather python code.  Google hosts an open dataset on [BigQuery](https://cloud.google.com/bigquery/) that contains code from open source projects on GitHub.  You can use [BigQuery](https://cloud.google.com/bigquery/) to get the python files as a tabular dataset by executing the following SQL query in the BigQuery console:

```{sql}
SELECT 
 max(concat(f.repo_name, ' ', f.path)) as repo_path,
 c.content
FROM `bigquery-public-data.github_repos.files` as f
JOIN `bigquery-public-data.github_repos.contents` as c on f.id = c.id
JOIN (
      --this part of the query makes sure repo is watched at least twice since 2017
      SELECT repo FROM(
        SELECT 
          repo.name as repo
        FROM `githubarchive.year.2017` WHERE type="WatchEvent"
        UNION ALL
        SELECT 
          repo.name as repo
        FROM `githubarchive.month.2018*` WHERE type="WatchEvent"
        )
      GROUP BY 1
      HAVING COUNT(*) >= 2
      ) as r on f.repo_name = r.repo
WHERE 
  f.path like '%.py' and --with python extension
  c.size < 15000 and --get rid of ridiculously long files
  REGEXP_CONTAINS(c.content, r'def ') --contains function definition
group by c.content
```


Here is a link to the [SQL Query](https://bigquery.cloud.google.com/savedquery/506213277345:009fa66f301240e5ad9e4006c59a4762) in case it is helpful.  The raw data contains approximately 1.2 million distinct python code files.

**To make things easier for this tutorial, the folks on the Google [Kubeflow team](https://kubernetes.io/blog/2017/12/introducing-kubeflow-composable/) have hosted the raw data for this tutorial in the form of 10 csv files, available at the url: https://storage.googleapis.com/kubeflow-examples/code_search/raw_data/00000000000{i}.csv as illustrated in the below code:**
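
The ten file URLs follow directly from the pattern above. The snippet below is a minimal sketch that only builds the list; the names `BASE_URL` and `urls` are illustrative, and the commented-out line is an assumption about how the files might then be loaded with pandas rather than a tested recipe.

```python
# Build the URLs of the ten raw CSV files from the pattern given above.
BASE_URL = 'https://storage.googleapis.com/kubeflow-examples/code_search/raw_data/00000000000{i}.csv'
urls = [BASE_URL.format(i=i) for i in range(10)]

for url in urls:
    print(url)

# Assumption: each file can be read with pandas and the pieces concatenated, e.g.
# df = pd.concat([pd.read_csv(url) for url in urls])
```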

In [17]:
%%time
import json

# each record is [function_name, code_tokens, doc_tokens, doc_origin, code, beginline, endline]
COLUMNS = ['function_name', 'function_tokens', 'docstring_tokens', 'doc_origin', 'code', 'beginline', 'endline']

def load_functions(path):
    "Read a json file of function records into a dataframe, joining list-valued fields into strings."
    with open(path, 'r') as f:
        raw_data = json.load(f)
    df = pd.DataFrame(raw_data, columns=COLUMNS)
    df['function_tokens'] = df['function_tokens'].apply(lambda x: ' '.join(x))
    df['docstring_tokens'] = df['docstring_tokens'].apply(lambda x: ' '.join(x))
    df['doc_origin'] = df['doc_origin'].apply(lambda x: '\n'.join(x))
    df['code'] = df['code'].apply(lambda x: ''.join(x))
    return df

# functions with docstrings
wdf = load_functions('ans.json')
display(wdf.head())
print(wdf.shape)

# functions without docstrings
ndf = load_functions('nodoc.json')
display(ndf.head())
print(ndf.shape)

Unnamed: 0,function_name,function_tokens,docstring_tokens,doc_origin,code,beginline,endline
0,baseInRange,function baseInRange number start end return n...,base implementation _ .inrange n't coerce argu...,/**\n * The base implementation of `_.inRange`...,"function baseInRange(number, start, end) {\n ...",16,18
1,Entrypoint,function Entrypoint name this name name this c...,mit license http www.opensource.org licenses m...,/*\n\tMIT License http://www.opensource.org/li...,function Entrypoint(name) {\n\tthis.name = nam...,7,10
2,__blank__,function buildModuleUrl Check Color defined de...,widget displaying information description @ali...,/**\n * A widget for displaying informatio...,function (value) {\n // Set the...,86,111
3,__blank__,function var a iD osmNode id a loc 0 0 var b i...,--- > b > c,\n a ---> b ===> c\n\n,function () {\n //\n // a ---> ...,142,160
4,__blank__,function var a iD osmNode id a loc 0 0 var b i...,--- > b > c,\n a ---> b ===> c\n\n,function () {\n //\n // a ---> ...,162,180


(35000, 7)


Unnamed: 0,function_name,function_tokens,docstring_tokens,doc_origin,code,beginline,endline
0,_interopRequireDefault,function _interopRequireDefault obj return obj...,,,function _interopRequireDefault(obj) { return ...,9,9
1,extern,function t var walt n import extern Add from e...,,,"function extern(k, i) {\n return k + i;...",15,17
2,__blank__,function mod t is mod instance exports test 4,,,function (mod) {\n t.is(mod.instance.export...,18,20
3,__blank__,function t var walt n shadowing variables shou...,,,function (mod) {\n t.is(mod.instance.export...,26,28
4,__blank__,function t var src n For pointers n const tabl...,,,"function () {\n return (0, _.compile)('func...",33,35


(1000, 7)
CPU times: user 1.02 s, sys: 115 ms, total: 1.13 s
Wall time: 1.13 s


## Functions to parse data and tokenize

Our goal is to parse the python files into (code, docstring) pairs.  Fortunately, the standard library in python comes with the wonderful [ast](https://docs.python.org/3.6/library/ast.html) module which helps us extract code from files as well as extract docstrings.  

We also use the [astor](http://astor.readthedocs.io/en/latest/) library to strip the code of comments by doing a round trip of converting the code to an [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree) and then from AST back to code. 
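
As a rough illustration of the idea, the sketch below uses only the standard-library `ast` module to pull (function name, docstring) pairs out of a small piece of source code. The helper name `get_function_docstring_pairs` and the sample `source` string are made up for this example and are not the notebook's own parsing code. Comments never reach the AST, which is why a round trip through the tree (for instance with `astor.to_source`) drops them.

```python
import ast

# A tiny, made-up python file to parse.
source = '''
def add(a, b):
    """Return the sum of a and b."""
    # this comment is not part of the AST
    return a + b

def undocumented(x):
    return x * 2
'''

def get_function_docstring_pairs(code):
    """Return a list of (function name, docstring) tuples found in `code`."""
    tree = ast.parse(code)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.get_docstring returns None when a function has no docstring
            docstring = ast.get_docstring(node) or ''
            pairs.append((node.name, docstring))
    return pairs

pairs = get_function_docstring_pairs(source)
print(pairs)
```

Functions without a docstring come back with an empty string, which makes it easy to separate them from the documented ones later on.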

The convenience function `apply_parallel` (imported from `general_utils`) parses the code in parallel using multiple processes.  Adjust the `cpu_cores` parameter according to your system resources!

## Flatten code, docstring pairs and extract meta-data

Flatten (code, docstring) pairs

Extract meta-data and format dataframe.  

We have not optimized this code.  Pull requests are welcome!

## Remove Duplicates

In [18]:
%%time
# remove observations where the same function appears more than once
before_dedup = len(wdf)
wdf = wdf.drop_duplicates(['code', 'function_tokens'])
after_dedup = len(wdf)

print(f'Removed {before_dedup - after_dedup:,} duplicate rows')
before_dedup = len(ndf)
ndf = ndf.drop_duplicates(['code', 'function_tokens'])
after_dedup = len(ndf)

print(f'Removed {before_dedup - after_dedup:,} duplicate rows')

Removed 2,258 duplicate rows
Removed 180 duplicate rows
CPU times: user 80.9 ms, sys: 0 ns, total: 80.9 ms
Wall time: 80 ms


In [19]:
print(wdf.shape, ndf.shape)

(32742, 7) (820, 7)


## Separate function w/o docstrings

## Partition code by repository to minimize leakage between train, valid & test sets. 
This relies on the rough assumption that each repository has its own style.  We want to avoid having code from the same repository in the training set as well as in the validation or holdout set.
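
Note that `train_test_split` on its own shuffles individual rows, so it does not keep a repository together. A group-aware split could be sketched as follows, using only the standard library. The `repo` field and the helper `split_by_group` are hypothetical (the dataframe in this notebook carries no repository column); scikit-learn's `GroupShuffleSplit` offers similar behavior if you prefer a library routine.

```python
import random

def split_by_group(records, group_key, test_fraction=0.1, seed=8081):
    """Split records so that every group lands entirely in train or entirely in test."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # hold out at least one whole group
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test

# Toy data: 20 functions spread over 5 hypothetical repositories.
records = [{'repo': f'repo_{i % 5}', 'code': f'function f{i}() {{}}'} for i in range(20)]
train_rows, test_rows = split_by_group(records, 'repo', test_fraction=0.2)

print(len(train_rows), len(test_rows))
```

Because whole groups are assigned to one side, the realized test fraction depends on how many rows each held-out repository contributes.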

In [20]:
# train, valid, test splits
train, test = train_test_split(wdf, train_size=0.99, shuffle=True, random_state=8081)
train, valid = train_test_split(train, train_size=0.98, random_state=8081)



In [21]:
def add_full_path(df):
    "Reset the index and add a `full_path` column holding a synthetic file name (`<row number>.js`)."
    df = df.reset_index(drop=True)
    return df.assign(full_path=[str(i)+'.js' for i in range(df.shape[0])])

train = add_full_path(train)
valid = add_full_path(valid)
test = add_full_path(test)
ndf = add_full_path(ndf)


In [22]:
print(f'train set num rows {train.shape[0]:,}')
print(f'valid set num rows {valid.shape[0]:,}')
print(f'test set num rows {test.shape[0]:,}')
print(f'without docstring rows {ndf.shape[0]:,}')

train set num rows 31,765
valid set num rows 649
test set num rows 328
without docstring rows 820


Preview what the training set looks like.  The function tokens and docstring tokens are what will be fed downstream into the models.  The other information is important for diagnostics and bookkeeping.

In [23]:
display(train.head())

Unnamed: 0,function_name,function_tokens,docstring_tokens,doc_origin,code,beginline,endline,full_path
0,__blank__,function code globals args closure closure2 va...,@constructor @param function code javascript c...,/**\n * @constructor\n * @param {Function} cod...,"function (code, globals, args, closure, closur...",16,50,0.js
1,_callee5$,function _callee5 var manager actual return re...,_ _ pure _ _,/*#__PURE__*/,function _callee5$(_context5) {\n while...,311,334,1.js
2,getResponseHeader,function getResponseHeader header,get single response header response @param hea...,/**\n * Get a single response header from ...,function getResponseHeader(header) {},121,121,2.js
3,_callee$,function var _ref _asyncToGenerator regenerato...,_ _ pure _ _,/*#__PURE__*/,function _callee$(_context) {\n\t\t\t\t\twhile...,62,81,3.js
4,dematerializeOperatorFunction,function dematerialize return function demater...,converts observable @link notification objects...,/**\n * Converts an Observable of {@link Notif...,function dematerializeOperatorFunction(source)...,60,62,4.js


## Output each set to train/valid/test.function/docstrings/lineage files
Original functions are also written to compressed json files. (Raw functions contain `,`, `\t`, `\n`, etc., so the json format is less error-prone than csv.)

`{train,valid,test}.lineage` are files that contain a reference to the original location where the code was retrieved. 
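
To see why json is the safer choice for raw source code, here is a small standard-library sketch that writes two made-up functions to a gzipped json array and reads them back unchanged. The file name and sample functions are illustrative only, and the sketch assumes the files produced by this notebook are plain json arrays of strings.

```python
import gzip
import json
import os
import tempfile

# Made-up functions containing commas, tabs, newlines and quotes.
functions = [
    'function add(a, b) {\n\treturn a + b;\n}',
    'function greet(name) { return "hi, " + name; }',
]

path = os.path.join(tempfile.mkdtemp(), 'example_original_function.json.gz')

# Write the list as a gzip-compressed json array.
with gzip.open(path, 'wt', encoding='utf-8') as f:
    json.dump(functions, f)

# Read it back; json escaping preserves every character.
with gzip.open(path, 'rt', encoding='utf-8') as f:
    restored = json.load(f)

print(restored == functions)
```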

In [24]:
def write_to(df, filename, path='./data/processed_data/'):
    "Helper function to write processed files to disk."
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    df.full_path.to_csv(out/'{}.full_path'.format(filename), index=False)
    df.function_tokens.to_csv(out/'{}.function'.format(filename), index=False)
    # raw code goes to gzipped json because it contains commas, tabs and newlines
    df.code.to_json(out/'{}_original_function.json.gz'.format(filename), orient='values', compression='gzip')
    df.beginline.to_csv(out/'{}.line1'.format(filename), index=False)
    df.endline.to_csv(out/'{}.line2'.format(filename), index=False)
    # functions without docstrings have no docstring tokens to write
    if filename != 'without_docstrings':
        df.docstring_tokens.to_csv(out/'{}.docstring'.format(filename), index=False)


def write_functions(df, path):
    "Write the code of each function to its own `<row number>.js` file under `path`."
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    name_df = pd.DataFrame([str(i)+'.js' for i in range(df.shape[0])])
    df = df.assign(full_path=name_df)
    display(df.head())
    for _, row in df.iterrows():
        with open(out/row['full_path'], 'w') as f:
            f.write(row['code'])

In [25]:
# write to output files
write_functions(train, './data/processed_data/train_func')
write_functions(valid, './data/processed_data/valid_func')
write_functions(test, './data/processed_data/test_func')
write_functions(ndf, './data/processed_data/without_docstrings')
write_to(train, 'train')
write_to(valid, 'valid')
write_to(test, 'test')
write_to(ndf, 'without_docstrings')

Unnamed: 0,function_name,function_tokens,docstring_tokens,doc_origin,code,beginline,endline,full_path
0,__blank__,function code globals args closure closure2 va...,@constructor @param function code javascript c...,/**\n * @constructor\n * @param {Function} cod...,"function (code, globals, args, closure, closur...",16,50,0.js
1,_callee5$,function _callee5 var manager actual return re...,_ _ pure _ _,/*#__PURE__*/,function _callee5$(_context5) {\n while...,311,334,1.js
2,getResponseHeader,function getResponseHeader header,get single response header response @param hea...,/**\n * Get a single response header from ...,function getResponseHeader(header) {},121,121,2.js
3,_callee$,function var _ref _asyncToGenerator regenerato...,_ _ pure _ _,/*#__PURE__*/,function _callee$(_context) {\n\t\t\t\t\twhile...,62,81,3.js
4,dematerializeOperatorFunction,function dematerialize return function demater...,converts observable @link notification objects...,/**\n * Converts an Observable of {@link Notif...,function dematerializeOperatorFunction(source)...,60,62,4.js


Unnamed: 0,function_name,function_tokens,docstring_tokens,doc_origin,code,beginline,endline,full_path
0,__blank__,function use strict sigma utils pkg sigma canv...,method renders edge two parallel lines @param ...,/**\n * This method renders the edge as two ...,"function (edge, source, target, context, setti...",17,72,0.js
1,LifetimeAction,function function LifetimeAction _classCallChe...,action trigger performed key vault lifetime ce...,/**\n * Action and its trigger that will be pe...,function LifetimeAction() {\n _classCallChe...,39,41,1.js
2,__blank__,function common logging info common i18n t not...,hammertime called stop,/**\n * ### Hammertime\n * To be called after ...,function () {\n common.logging.info(common....,146,150,2.js
3,__blank__,function if xhr readyState 4 window setTimeout...,check readystate timeout changes allow onerror...,Check readyState before timeout as it changes...,function () {\n\t\t\t\t\t\t\t\t\tif (_callback...,117,121,3.js
4,_callee22$,function _callee22 var result return regenerat...,_ _ pure _ _,/*#__PURE__*/,function _callee22$(_context22) {\n whi...,530,548,4.js


Unnamed: 0,function_name,function_tokens,docstring_tokens,doc_origin,code,beginline,endline,full_path
0,__blank__,function testcase var f1 function f1 return fu...,@path ch10 10.4 10.4.3 10.4.3 1 50-s.js @descr...,/**\n * @path ch10/10.4/10.4.3/10.4.3-1-50-s.j...,"function () {\n ""use strict"";\n\n ...",18,22,0.js
1,__blank__,function var _ref _asyncToGenerator regenerato...,_ _ pure _ _ stackedit v4 format stackedit v5 ...,/*#__PURE__*/\n StackEdit v4 format\n\n StackE...,function (_ref2) {\n var _ref3 ...,46,87,1.js
2,__blank__,function exec cmd args return new Promise func...,execute command,Execute command\n,function (data) {\n process.stdout.write(...,19,21,2.js
3,__blank__,function repeatWhen notifier return function s...,returns observable mirrors source observable e...,/**\n * Returns an Observable that mirrors the...,function (source) {\n return source.lif...,43,45,3.js
4,_curry,function curry func var argsLength func length...,creates functions returns function cached argu...,/**\n * Creates the functions that returns the...,function _curry() {\n for (var _len8 = ar...,227,242,4.js


Unnamed: 0,function_name,function_tokens,docstring_tokens,doc_origin,code,beginline,endline,full_path
0,_interopRequireDefault,function _interopRequireDefault obj return obj...,,,function _interopRequireDefault(obj) { return ...,9,9,0.js
1,extern,function t var walt n import extern Add from e...,,,"function extern(k, i) {\n return k + i;...",15,17,1.js
2,__blank__,function mod t is mod instance exports test 4,,,function (mod) {\n t.is(mod.instance.export...,18,20,2.js
3,__blank__,function t var walt n shadowing variables shou...,,,function (mod) {\n t.is(mod.instance.export...,26,28,3.js
4,__blank__,function t var src n For pointers n const tabl...,,,"function () {\n return (0, _.compile)('func...",33,35,4.js


In [3]:
!nvidia-smi

Tue Jul 23 06:06:31 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00004A30:00:00.0 Off |                    0 |
| N/A   57C    P0    59W / 149W |    408MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000A061:00:00.0 Off |                    0 |
| N/A   55C    P0    55W / 149W |    390MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000E323:00:00.0 Off |                    

## The pre-processed data is also hosted on Google Cloud, at the following URLs:

In [27]:
# # cool trick to send shell command results into a python variable in a jupyter notebook!
# files = ! ls ./data/processed_data/ | grep -E '*.function$|*.docstring$|*.lineage$|*_original_function.json.gz$'

# # print the urls
# urls = [f'https://storage.googleapis.com/kubeflow-examples/code_search/data/{f}' for f in files]
# for s in urls:
#     print(s)