# Preprocess Data
This notebook contains materials to parse raw Python files into (function, docstring) pairs, tokenize both the function and the docstring, and split these pairs into train, valid, and test sets.  

*This step is optional, as we provide links to download pre-processed data at various points in the tutorial.  However, you might find it useful to go through these steps in order to understand how the data is prepared.*

If you are following the recommended approach of using a `p3.8xlarge` instance for this entire tutorial, you can use this docker container to run this notebook: [hamelsmu/ml-gpu](https://hub.docker.com/r/hamelsmu/ml-gpu/).

Alternatively, if you wish to speed up *this notebook* by using an instance with many cores (everything in this notebook is CPU-bound), you can use this container: [hamelsmu/ml-cpu](https://hub.docker.com/r/hamelsmu/ml-cpu/).

In [1]:
import ast
import glob
import re
from pathlib import Path

import astor
import pandas as pd
import spacy
from tqdm import tqdm
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split

from general_utils import apply_parallel, flattenlist



In [2]:
! python -V

Python 2.7.15+


## Download and read raw Python files

The first step is to gather Python code.  Google hosts an open dataset on [BigQuery](https://cloud.google.com/bigquery/) that contains code from open-source projects on GitHub.  You can retrieve the Python files as a tabular dataset by executing the following SQL query in the BigQuery console:

```{sql}
SELECT 
 max(concat(f.repo_name, ' ', f.path)) as repo_path,
 c.content
FROM `bigquery-public-data.github_repos.files` as f
JOIN `bigquery-public-data.github_repos.contents` as c on f.id = c.id
JOIN (
      --this part of the query makes sure repo is watched at least twice since 2017
      SELECT repo FROM(
        SELECT 
          repo.name as repo
        FROM `githubarchive.year.2017` WHERE type="WatchEvent"
        UNION ALL
        SELECT 
          repo.name as repo
        FROM `githubarchive.month.2018*` WHERE type="WatchEvent"
        )
      GROUP BY 1
      HAVING COUNT(*) >= 2
      ) as r on f.repo_name = r.repo
WHERE 
  f.path like '%.py' and --with python extension
  c.size < 15000 and --get rid of ridiculously long files
  REGEXP_CONTAINS(c.content, r'def ') --contains function definition
group by c.content
```


Here is a link to the [SQL Query](https://bigquery.cloud.google.com/savedquery/506213277345:009fa66f301240e5ad9e4006c59a4762) in case it is helpful.  The raw data contains approximately 1.2 million distinct Python code files.

**To make things easier for this tutorial, the folks on the Google [Kubeflow team](https://kubernetes.io/blog/2017/12/introducing-kubeflow-composable/) have hosted the raw data in the form of 10 csv files, available at URLs of the form https://storage.googleapis.com/kubeflow-examples/code_search/raw_data/00000000000{i}.csv. Note that the cell below reads local pickle files (`../py{i}.pkl`) instead of these csv files.**
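
As a rough sketch, the URL pattern can be expanded into concrete file URLs as follows. The helper name `raw_data_urls` is hypothetical, and the numbering 0 through 9 is an assumption inferred from the statement that there are 10 files. The download itself is left commented out because it needs network access and pandas.

```python
# Hypothetical helper: expand the URL pattern from the text into concrete URLs.
# Assumption: the ten files are numbered 0 through 9.
RAW_DATA_PATTERN = (
    'https://storage.googleapis.com/kubeflow-examples/'
    'code_search/raw_data/00000000000{i}.csv'
)

def raw_data_urls(n_files=10):
    """Return the raw csv URLs for files 0 .. n_files - 1."""
    return [RAW_DATA_PATTERN.format(i=i) for i in range(n_files)]

urls = raw_data_urls()
print(urls[0])

# With network access, the files could then be loaded along these lines:
# import pandas as pd
# df = pd.concat([pd.read_csv(url) for url in urls])
```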

In [3]:
# Read the data into a pandas dataframe, and parse out some meta-data

# df = pd.read_pickle('py0.pkl')
df = pd.concat([pd.read_pickle(f'../py{i}.pkl') for i in range(5)])
# df = pd.read_csv(f'01.csv')

df['nwo'] = df['repo_path'].apply(lambda r: r.split()[0])
df['path'] = df['repo_path'].apply(lambda r: r.split()[-1])
df.drop(columns=['repo_path'], inplace=True)
df = df[['nwo', 'path', 'content']]
df.head()

Unnamed: 0,nwo,path,content
0,modoboa,modoboa/core/models.py,"# -*- coding: utf-8 -*-\n\n""""""Core models.""""""\..."
1,modoboa,modoboa/core/utils.py,"# -*- coding: utf-8 -*-\n\n""""""Utility function..."
2,modoboa,modoboa/core/password_hashers/base.py,"# -*- coding: utf-8 -*-\n\n""""""\nBase password ..."
3,modoboa,modoboa/core/mocks.py,"# -*- coding: utf-8 -*-\n\n""""""Mocks used for t..."
4,modoboa,modoboa/core/views/base.py,"# -*- coding: utf-8 -*-\n\n""""""Base core views...."


In [4]:
# Inspect shape of the raw data
df.shape

(181175, 3)

## Functions to parse data and tokenize

Our goal is to parse the Python files into (code, docstring) pairs.  Fortunately, Python's standard library comes with the wonderful [ast](https://docs.python.org/3.6/library/ast.html) module, which helps us extract code from files as well as extract docstrings.  

We also use the [astor](http://astor.readthedocs.io/en/latest/) library to strip the code of comments by doing a round trip of converting the code to an [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree) and then from AST back to code. 
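
To make the idea concrete, here is a minimal sketch that uses only the standard library. The sample function `add` is made up for illustration, and `ast.unparse` (Python 3.9+) stands in for `astor.to_source`, so the exact output formatting may differ from what astor produces.

```python
import ast

# A made-up source blob containing one documented function.
source = '''
def add(a, b):
    """Return the sum of a and b.

    The second paragraph is ignored by the summary.
    """
    # comments never reach the AST
    return a + b
'''

module = ast.parse(source)
func = module.body[0]                  # the FunctionDef node for `add`
docstring = ast.get_docstring(func)    # cleaned (dedented) docstring
summary = docstring.split('\n\n')[0]   # first paragraph only

print(func.name, func.lineno)
print(summary)

# Round trip AST -> source. Comments are not stored in the AST, so they
# disappear. Assumption: ast.unparse is used here in place of astor.
if hasattr(ast, 'unparse'):
    roundtrip = ast.unparse(func)
    print(roundtrip)
```

Taking only the first paragraph of the docstring mirrors the `docstring.split('\n\n')[0]` step used later in this notebook.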

In [5]:
EN = spacy.load('en')
stop_words =  ["a", "about", "above", "after", "again", "against", "ain", "all", "am", "an", "and", "any", "are", "aren", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can", "couldn", "couldn't", "d", "did", "didn", "didn't", "do", "does", "doesn", "doesn't", "doing", "don", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn", "hadn't", "has", "hasn", "hasn't", "have", "haven", "haven't", "having", "he", "her", "here", "hers", "herself", "him", "himself", "his", "how", "i", "if", "in", "into", "is", "isn", "isn't", "it", "it's", "its", "itself", "just", "ll", "m", "ma", "me", "mightn", "mightn't", "more", "most", "mustn", "mustn't", "my", "myself", "needn", "needn't", "no", "nor", "not", "now", "o", "of", "off", "on", "once", "only", "or", "other", "our", "ours", "ourselves", "out", "over", "own", "re", "s", "same", "shan", "shan't", "she", "she's", "should", "should've", "shouldn", "shouldn't", "so", "some", "such", "t", "than", "that", "that'll", "the", "their", "theirs", "them", "themselves", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "until", "up", "ve", "very", "was", "wasn", "wasn't", "we", "were", "weren", "weren't", "what", "when", "where", "which", "while", "who", "whom", "why", "will", "with", "won", "won't", "wouldn", "wouldn't", "y", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves", "could", "he'd", "he'll", "he's", "here's", "how's", "i'd", "i'll", "i'm", "i've", "let's", "ought", "she'd", "she'll", "that's", "there's", "they'd", "they'll", "they're", "they've", "we'd", "we'll", "we're", "we've", "what's", "when's", "where's", "who's", "why's", "would"]
for stopword in stop_words:
    EN.vocab[stopword].is_stop = True
    
cop = re.compile("[^a-z0-9]")  # matches every character that is not a lowercase letter or a digit

In [15]:
from textacy.preprocess import preprocess_text

def tokenize_docstring(text):
    "Apply tokenization using spacy to docstrings."
    all_tokens = EN.tokenizer(text.lower())
    selected_tokens = [cop.sub('', token.text) for token in all_tokens if not token.is_space and not token.is_stop]
    return [token for token in selected_tokens if token != '']

def tokenize_code(text):
    "A very basic procedure for tokenizing code strings."
    return RegexpTokenizer(r'\w+').tokenize(text)


def get_function_docstring_pairs(blob):
    "Extract (function/method, docstring) pairs from a given code blob."
    pairs = []
    fc_dict = {}
    try:
        module = ast.parse(blob)
        classes = [node for node in module.body if isinstance(node, ast.ClassDef)]
        functions = [node for node in module.body if isinstance(node, ast.FunctionDef)]
        for _class in classes:
            for node in _class.body:
                if isinstance(node, ast.FunctionDef):
                    functions.append(node)
                    fc_dict[node] = _class.name

        for f in functions:
            source = astor.to_source(f)
            docstring = ast.get_docstring(f) if ast.get_docstring(f) else ''
            function = source.replace(ast.get_docstring(f, clean=False), '') if docstring else source
            class_name = fc_dict.get(f, '')
            pairs.append((class_name + '_' + f.name,
                          f.lineno,
                          source,
                          ' '.join(tokenize_code(function)),
                          ' '.join(tokenize_docstring(docstring.split('\n\n')[0]))
                         ))
    except (AssertionError, MemoryError, SyntaxError, UnicodeEncodeError):
        pass
    return pairs

err_content = []
def get_function_docstring_pairs_list(blob_list):
    """apply the function `get_function_docstring_pairs` on a list of blobs"""
    res = []
    global err_content
    for b in blob_list:
        try:
            pairs = get_function_docstring_pairs(str(b))
            res.append(pairs)
        except Exception:
            print(b)
            err_content.append(b)
    return res
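
The two tokenizers above can be approximated without spaCy or nltk. The following is a simplified sketch, not the notebook's implementation: it splits docstrings on whitespace (spaCy's tokenizer also splits off punctuation and contractions) and uses a tiny, made-up stop-word set. The code tokenizer relies on `re.findall`, which should behave like `RegexpTokenizer(r'\w+').tokenize`.

```python
import re

# Assumption: a tiny stop-word set for illustration only.
STOP_WORDS = {'a', 'an', 'the', 'of', 'to', 'is', 'and'}
NON_ALNUM = re.compile(r'[^a-z0-9]')

def tokenize_docstring_sketch(text):
    """Lowercase, split on whitespace, drop stop words, strip non-alphanumerics."""
    tokens = []
    for raw in text.lower().split():
        if raw in STOP_WORDS:
            continue
        cleaned = NON_ALNUM.sub('', raw)
        if cleaned:                      # skip tokens that became empty
            tokens.append(cleaned)
    return tokens

def tokenize_code_sketch(text):
    """Return every run of word characters (letters, digits, underscore)."""
    return re.findall(r'\w+', text)

print(tokenize_docstring_sketch('Return the sum of a and b.'))
print(tokenize_code_sketch('def add(a, b): return a + b'))
```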

In [21]:
blob = '''

class Seq2Seq_Inference(object):
    def __init__(self,
                 encoder_preprocessor,
                 decoder_preprocessor,
                 seq2seq_model):

        self.enc_pp = encoder_preprocessor
        self.dec_pp = decoder_preprocessor
        self.seq2seq_model = seq2seq_model
        self.encoder_model = extract_encoder_model(seq2seq_model)
        self.decoder_model = extract_decoder_model(seq2seq_model)
        self.default_max_len = self.dec_pp.padding_maxlen
        self.nn = None
        self.rec_df = None
    '''
print(blob)
get_function_docstring_pairs(blob)



class Seq2Seq_Inference(object):
    def __init__(self,
                 encoder_preprocessor,
                 decoder_preprocessor,
                 seq2seq_model):

        self.enc_pp = encoder_preprocessor
        self.dec_pp = decoder_preprocessor
        self.seq2seq_model = seq2seq_model
        self.encoder_model = extract_encoder_model(seq2seq_model)
        self.decoder_model = extract_decoder_model(seq2seq_model)
        self.default_max_len = self.dec_pp.padding_maxlen
        self.nn = None
        self.rec_df = None
    


[('Seq2Seq_Inference___init__',
  4,
  'def __init__(self, encoder_preprocessor, decoder_preprocessor, seq2seq_model):\n    self.enc_pp = encoder_preprocessor\n    self.dec_pp = decoder_preprocessor\n    self.seq2seq_model = seq2seq_model\n    self.encoder_model = extract_encoder_model(seq2seq_model)\n    self.decoder_model = extract_decoder_model(seq2seq_model)\n    self.default_max_len = self.dec_pp.padding_maxlen\n    self.nn = None\n    self.rec_df = None\n',
  'def __init__ self encoder_preprocessor decoder_preprocessor seq2seq_model self enc_pp encoder_preprocessor self dec_pp decoder_preprocessor self seq2seq_model seq2seq_model self encoder_model extract_encoder_model seq2seq_model self decoder_model extract_decoder_model seq2seq_model self default_max_len self dec_pp padding_maxlen self nn None self rec_df None',
  '')]

The convenience function `apply_parallel` below parses the code in parallel using multiple processes.  Adjust the `cpu_cores` parameter to match your system resources!  Note that the cell below leaves the parallel call commented out and runs the sequential version instead.
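
The source of `general_utils.apply_parallel` is not shown in this notebook. As a hedged illustration of the general pattern (split the list into chunks, process each chunk in its own process, merge the results), a sketch could look like the following. All three function names are hypothetical, and unlike the notebook's usage this version flattens the per-chunk results itself.

```python
from multiprocessing import Pool

def chunk(items, n_chunks):
    """Split `items` into at most `n_chunks` contiguous, roughly equal pieces."""
    size = max(1, -(-len(items) // n_chunks))   # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def flatten(list_of_lists):
    """Concatenate a list of lists into a single list."""
    return [item for sub in list_of_lists for item in sub]

def apply_parallel_sketch(func, items, cpu_cores=4):
    """Run `func` on each chunk in a separate process and merge the results.

    `func` must accept a list and return a list, and must be defined at
    module level so that it can be pickled and sent to worker processes.
    """
    with Pool(cpu_cores) as pool:
        return flatten(pool.map(func, chunk(items, cpu_cores)))

print(chunk(list(range(10)), 4))
```

When running this as a script on platforms that start worker processes by spawning, the call to `apply_parallel_sketch` should sit under an `if __name__ == '__main__':` guard.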

In [23]:
# print(df.head())
# pairs = flattenlist(apply_parallel(get_function_docstring_pairs_list, df.content.tolist(), cpu_cores=4))
pairs = get_function_docstring_pairs_list(df.content.tolist())
df = df[~df['content'].isin(err_content)]

In [55]:
len(err_content)

0

In [57]:
df.shape

(181175, 4)

In [24]:
assert len(pairs) == df.shape[0], f'Row count mismatch. `df` has {df.shape[0]:,} rows; `pairs` has {len(pairs):,} rows.'
df['pairs'] = pairs
df.to_pickle('temp3.pkl')

In [17]:
df = pd.read_pickle('temp.pkl')
df.head()
df['pairs'].head()

0    [(populate_callback, 393, def populate_callbac...
1    [(parse_map_file, 16, def parse_map_file(path)...
2    [(name, 28, @property\ndef name(cls):\n    """...
3    [(modo_api_instance_search, 12, @httmock.urlma...
4    [(find_nextlocation, 15, def find_nextlocation...
Name: pairs, dtype: object

## Flatten code, docstring pairs and extract meta-data

Flatten (code, docstring) pairs

In [25]:
# flatten pairs
df = df.set_index(['nwo', 'path'])['pairs'].apply(pd.Series).stack()
df = df.reset_index()
df.columns = ['nwo', 'path', '_', 'pair']

Extract meta-data and format dataframe.  

We have not optimized this code.  Pull requests are welcome!

In [26]:
df['function_name'] = df['pair'].apply(lambda p: p[0])
df['lineno'] = df['pair'].apply(lambda p: p[1])
df['original_function'] = df['pair'].apply(lambda p: p[2])
df['function_tokens'] = df['pair'].apply(lambda p: p[3])
df['docstring_tokens'] = df['pair'].apply(lambda p: p[4])
df = df[['nwo', 'path', 'function_name', 'lineno', 'original_function', 'function_tokens', 'docstring_tokens']]
df['url'] = df[['nwo', 'path', 'lineno']].apply(lambda x: 'https://github.com/{}/blob/master/{}#L{}'.format(x[0], x[1], x[2]), axis=1)
df.head()
df.to_pickle('total3.pkl')
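
The flattening and meta-data steps above can be pictured in plain Python. The records below are made up for illustration; the field names and the URL template are taken from the cells above. In this sketch, files without any pairs simply contribute no rows.

```python
# Hypothetical miniature of the `pairs` column: one list of 5-tuples per file.
files = [
    {'nwo': 'owner/repo', 'path': 'pkg/mod.py',
     'pairs': [('_f', 3, 'def f(): ...', 'def f', 'do thing'),
               ('C_g', 9, 'def g(self): ...', 'def g self', '')]},
    {'nwo': 'owner/repo', 'path': 'pkg/empty.py', 'pairs': []},
]

FIELDS = ('function_name', 'lineno', 'original_function',
          'function_tokens', 'docstring_tokens')

rows = []
for f in files:
    for pair in f['pairs']:
        # One output row per (function, docstring) pair.
        row = {'nwo': f['nwo'], 'path': f['path']}
        row.update(zip(FIELDS, pair))
        # Link back to the line where the function was defined.
        row['url'] = 'https://github.com/{}/blob/master/{}#L{}'.format(
            row['nwo'], row['path'], row['lineno'])
        rows.append(row)

for row in rows:
    print(row['function_name'], row['url'])
```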

In [60]:
df.head()

Unnamed: 0,nwo,path,function_name,lineno,original_function,function_tokens,docstring_tokens,url
0,modoboa,modoboa/core/models.py,populate_callback,393,"def populate_callback(user, group='SimpleUsers...",def populate_callback user group SimpleUsers f...,populate callback,https://github.com/modoboa/blob/master/modoboa...
1,modoboa,modoboa/core/models.py,__init__,89,"def __init__(self, *args, **kwargs):\n """"""L...",def __init__ self args kwargs super User self ...,load parameter manager,https://github.com/modoboa/blob/master/modoboa...
2,modoboa,modoboa/core/models.py,_crypt_password,94,"def _crypt_password(self, raw_value):\n """"""...",def _crypt_password self raw_value scheme para...,crypt local password appropriate scheme,https://github.com/modoboa/blob/master/modoboa...
3,modoboa,modoboa/core/models.py,set_password,112,"def set_password(self, raw_value, curvalue=Non...",def set_password self raw_value curvalue None ...,password update,https://github.com/modoboa/blob/master/modoboa...
4,modoboa,modoboa/core/models.py,check_password,137,"def check_password(self, raw_value):\n """"""C...",def check_password self raw_value match self p...,compare rawvalue current password,https://github.com/modoboa/blob/master/modoboa...


## Remove Duplicates

In [27]:
# remove observations where the same function appears more than once
before_dedup = len(df)
df = df.drop_duplicates(['original_function', 'function_tokens'])
after_dedup = len(df)

print(f'Removed {before_dedup - after_dedup:,} duplicate rows')

Removed 284,904 duplicate rows
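
For intuition, `drop_duplicates` with its default settings keeps the first row for each distinct key. A plain-Python sketch of the same idea, using made-up rows and a hypothetical helper name, might look like this:

```python
def drop_duplicate_functions(rows):
    """Keep the first row for each (original_function, function_tokens) key."""
    seen = set()
    unique = []
    for row in rows:
        key = (row['original_function'], row['function_tokens'])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Made-up rows: the same function vendored into two repositories.
rows = [
    {'nwo': 'a/x', 'original_function': 'def f(): pass', 'function_tokens': 'def f pass'},
    {'nwo': 'b/y', 'original_function': 'def f(): pass', 'function_tokens': 'def f pass'},
    {'nwo': 'b/y', 'original_function': 'def g(): pass', 'function_tokens': 'def g pass'},
]
deduped = drop_duplicate_functions(rows)
print(f'Removed {len(rows) - len(deduped):,} duplicate rows')
```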


In [28]:
df.shape

(1345732, 8)

In [29]:
def listlen(x):
    if not isinstance(x, list):
        return 0
    return len(x)

# functions should not be too long
df = df[df.function_tokens.str.split().apply(listlen) <= 500]
df.to_pickle('lessthan500_3.pkl')

In [30]:
df.shape

(1340176, 8)

In [31]:
df.head()

Unnamed: 0,nwo,path,function_name,lineno,original_function,function_tokens,docstring_tokens,url
0,modoboa,modoboa/core/models.py,_populate_callback,393,"def populate_callback(user, group='SimpleUsers...",def populate_callback user group SimpleUsers f...,populate callback,https://github.com/modoboa/blob/master/modoboa...
1,modoboa,modoboa/core/models.py,User___init__,89,"def __init__(self, *args, **kwargs):\n """"""L...",def __init__ self args kwargs super User self ...,load parameter manager,https://github.com/modoboa/blob/master/modoboa...
2,modoboa,modoboa/core/models.py,User__crypt_password,94,"def _crypt_password(self, raw_value):\n """"""...",def _crypt_password self raw_value scheme para...,crypt local password appropriate scheme,https://github.com/modoboa/blob/master/modoboa...
3,modoboa,modoboa/core/models.py,User_set_password,112,"def set_password(self, raw_value, curvalue=Non...",def set_password self raw_value curvalue None ...,password update,https://github.com/modoboa/blob/master/modoboa...
4,modoboa,modoboa/core/models.py,User_check_password,137,"def check_password(self, raw_value):\n """"""C...",def check_password self raw_value match self p...,compare rawvalue current password,https://github.com/modoboa/blob/master/modoboa...


## Separate functions w/o docstrings

In [32]:
# separate functions w/o docstrings
# a docstring must contain at least 3 tokens to be considered valid

with_docstrings = df[df.docstring_tokens.str.split().apply(listlen) >= 3]
with_docstrings.to_pickle('withdoc3.pkl')
without_docstrings = df[df.docstring_tokens.str.split().apply(listlen) < 3]
without_docstrings.to_pickle('withoutdoc3.pkl')
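
The rule applied above is a token count on the space-joined docstring tokens. A small stand-alone sketch with made-up rows (the helper name `count_tokens` is hypothetical and plays the role of `listlen` combined with `str.split`):

```python
def count_tokens(docstring_tokens):
    """Number of whitespace-separated tokens; non-strings (e.g. NaN) count as 0."""
    if not isinstance(docstring_tokens, str):
        return 0
    return len(docstring_tokens.split())

# Made-up rows for illustration.
rows = [
    {'function_name': 'f', 'docstring_tokens': 'load parameter manager'},
    {'function_name': 'g', 'docstring_tokens': 'password update'},
    {'function_name': 'h', 'docstring_tokens': ''},
]
with_doc = [r for r in rows if count_tokens(r['docstring_tokens']) >= 3]
without_doc = [r for r in rows if count_tokens(r['docstring_tokens']) < 3]
print(len(with_doc), len(without_doc))
```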

In [33]:
with_docstrings.shape

(373272, 8)

In [54]:
# build the column with .assign, which returns a new frame and avoids pandas' SettingWithCopyWarning
with_docstrings = with_docstrings.assign(full_path=with_docstrings.apply(lambda row: '/'.join([row['nwo'], row['path'].replace('.', '_'), row['function_name']]), axis=1))


In [55]:
# build the column with .assign, which returns a new frame and avoids pandas' SettingWithCopyWarning
without_docstrings = without_docstrings.assign(full_path=without_docstrings.apply(lambda row: '/'.join([row['nwo'], row['path'].replace('.', '_'), row['function_name']]), axis=1))


In [43]:
with_docstrings.shape

(373272, 9)

## Partition code by repository to minimize leakage between train, valid & test sets. 
We make the rough assumption that each repository has its own style.  To limit leakage, code from a given repository should appear in only one of the training, validation, or holdout (test) sets.
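
The idea of splitting by repository rather than by row can be sketched in plain Python. This is an illustration only: it does not reproduce scikit-learn's shuffling, so the resulting membership differs from the cells below, and the function name `split_by_repo` and the demo rows are made up. With the fractions used in this notebook, roughly 0.87 × 0.82 ≈ 71% of repositories end up in train, 0.87 × 0.18 ≈ 16% in valid, and 13% in test.

```python
import random
from collections import defaultdict

def split_by_repo(rows, first_train_size=0.87, second_train_size=0.82, seed=8081):
    """Assign whole repositories (keyed by 'nwo') to train/valid/test."""
    by_repo = defaultdict(list)
    for row in rows:
        by_repo[row['nwo']].append(row)

    repos = sorted(by_repo)                  # deterministic starting order
    random.Random(seed).shuffle(repos)

    n_first = round(len(repos) * first_train_size)   # train + valid repos
    n_train = round(n_first * second_train_size)     # train repos
    parts = {
        'train': repos[:n_train],
        'valid': repos[n_train:n_first],
        'test': repos[n_first:],
    }
    return {name: [r for repo in names for r in by_repo[repo]]
            for name, names in parts.items()}

# Made-up data: 100 repositories with 2 functions each.
demo_rows = [{'nwo': 'repo%d' % i, 'fn': j} for i in range(100) for j in range(2)]
splits = split_by_repo(demo_rows)
repo_sets = {name: {r['nwo'] for r in part} for name, part in splits.items()}
print({name: len(part) for name, part in splits.items()})
```

Because every row of a repository travels together, no repository can appear in more than one split.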

In [56]:
grouped = with_docstrings.groupby('nwo')

In [57]:
# train, valid, test splits
train, test = train_test_split(list(grouped), train_size=0.87, shuffle=True, random_state=8081)
train, valid = train_test_split(train, train_size=0.82, random_state=8081)

In [58]:
train = pd.concat([d for _, d in train]).reset_index(drop=True)
valid = pd.concat([d for _, d in valid]).reset_index(drop=True)
test = pd.concat([d for _, d in test]).reset_index(drop=True)

In [59]:
print(f'train set num rows {train.shape[0]:,}')
print(f'valid set num rows {valid.shape[0]:,}')
print(f'test set num rows {test.shape[0]:,}')
print(f'without docstring rows {without_docstrings.shape[0]:,}')

train set num rows 263,327
valid set num rows 59,026
test set num rows 50,919
without docstring rows 966,904


In [73]:
train.to_pickle('train3.pkl')
valid.to_pickle('valid3.pkl')
test.to_pickle('test3.pkl')

Preview what the training set looks like.  The function tokens and docstring tokens are what will be fed downstream into the models; the other columns are important for diagnostics and bookkeeping.

In [60]:
train.head()

Unnamed: 0,nwo,path,function_name,lineno,original_function,function_tokens,docstring_tokens,url,full_path
0,TextRank4ZH,textrank4zh/TextRank4Keyword.py,TextRank4Keyword___init__,18,"def __init__(self, stop_words_file=None, allow...",def __init__ self stop_words_file None allow_s...,keyword arguments stopwordsfile str delimiters...,https://github.com/TextRank4ZH/blob/master/tex...,TextRank4ZH/textrank4zh/TextRank4Keyword_py/Te...
1,TextRank4ZH,textrank4zh/TextRank4Sentence.py,TextRank4Sentence___init__,18,"def __init__(self, stop_words_file=None, allow...",def __init__ self stop_words_file None allow_s...,keyword arguments stopwordsfile strstr delimit...,https://github.com/TextRank4ZH/blob/master/tex...,TextRank4ZH/textrank4zh/TextRank4Sentence_py/T...
2,TextRank4ZH,textrank4zh/TextRank4Sentence.py,TextRank4Sentence_analyze,43,"def analyze(self, text, lower=False, source='n...",def analyze self text lower False source no_st...,keyword arguments text lower false source word...,https://github.com/TextRank4ZH/blob/master/tex...,TextRank4ZH/textrank4zh/TextRank4Sentence_py/T...
3,TextRank4ZH,textrank4zh/Segmentation.py,WordSegmentation___init__,23,"def __init__(self, stop_words_file=None, allow...",def __init__ self stop_words_file None allow_s...,keyword arguments stopwordsfile utf8str allows...,https://github.com/TextRank4ZH/blob/master/tex...,TextRank4ZH/textrank4zh/Segmentation_py/WordSe...
4,TextRank4ZH,textrank4zh/Segmentation.py,SentenceSegmentation___init__,85,"def __init__(self, delimiters=util.sentence_de...",def __init__ self delimiters util sentence_del...,keyword arguments delimiters,https://github.com/TextRank4ZH/blob/master/tex...,TextRank4ZH/textrank4zh/Segmentation_py/Senten...


## Output each set to train/valid/test.function/docstrings/lineage files
Original functions are also written to compressed json files.  (Raw functions contain `,`, `\t`, `\n`, etc., so the json format is less error-prone than csv.)

`{train,valid,test}.lineage` are files that contain a reference to the original location where the code was retrieved. 
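
To illustrate why json is the safer container for raw source text, here is a small round trip through a gzip-compressed json file using only the standard library. The file name and sample functions are made up; the `*_original_function.json.gz` files written below should be readable the same way, assuming each one holds a single JSON array of strings.

```python
import gzip
import json
import os
import tempfile

# Made-up raw functions containing commas, tabs, newlines, and quotes.
functions = [
    'def f(a, b):\n\treturn a, b\n',
    'def g():\n    return "x,y"\n',
]

path = os.path.join(tempfile.mkdtemp(), 'demo_original_function.json.gz')

# Write the list as one JSON array, compressed with gzip.
with gzip.open(path, 'wt', encoding='utf-8') as fh:
    json.dump(functions, fh)

# Read it back; the special characters survive unchanged.
with gzip.open(path, 'rt', encoding='utf-8') as fh:
    restored = json.load(fh)

print(restored == functions)
```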

In [61]:
def write_to(df, filename, path='./data/processed_data/'):
    "Helper function to write processed files to disk."
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)  # also create missing parent directories such as ./data
    df.function_tokens.to_csv(out/'{}.function'.format(filename), index=False)
    df.full_path.to_csv(out/'{}.full_path'.format(filename), index=False)
    df.original_function.to_json(out/'{}_original_function.json.gz'.format(filename), orient='values', compression='gzip')
    if filename != 'without_docstrings':
        df.docstring_tokens.to_csv(out/'{}.docstring'.format(filename), index=False)
    df.url.to_csv(out/'{}.lineage'.format(filename), index=False)

In [62]:
# write to output files
write_to(train, 'train')
write_to(valid, 'valid')
write_to(test, 'test')
write_to(without_docstrings, 'without_docstrings')


In [72]:
# get astminer input
import os
def write_functions(prefix):
    path = './' + prefix + '/'
    def func(row):
        os.makedirs(path + '/'.join([row['nwo'], row['path'].replace('.', '_')]), exist_ok=True)
        with open(Path(path)/'{}.py'.format(row['full_path']), 'w') as outfile:
            outfile.write(row['original_function'])
    return func

# Iterate explicitly so the cell does not echo a long Series of `None` values.
for frame, prefix in [(train, "train_functions"), (valid, "valid_functions"), (test, "test_functions")]:
    frame.apply(write_functions(prefix), axis=1)

In [42]:
!ls -lah ./data/processed_data2/

total 2.6G
drwxr-xr-x 2 root root 6.0K May 22 00:59 .
drwxr-xr-x 9 root root 6.0K May 22 00:53 ..
-rw-r--r-- 1 root root  13M May 22 00:55 test.docstring
-rw-r--r-- 1 root root  55M May 22 00:55 test.function
-rw-r--r-- 1 root root  16M May 22 00:55 test.lineage
-rw-r--r-- 1 root root  25M May 22 00:55 test_original_function.json.gz
-rw-r--r-- 1 root root  74M May 22 00:55 train.docstring
-rw-r--r-- 1 root root 312M May 22 00:53 train.function
-rw-r--r-- 1 root root  89M May 22 00:55 train.lineage
-rw-r--r-- 1 root root 140M May 22 00:55 train_original_function.json.gz
-rw-r--r-- 1 root root  15M May 22 00:55 valid.docstring
-rw-r--r-- 1 root root  67M May 22 00:55 valid.function
-rw-r--r-- 1 root root  18M May 22 00:55 valid.lineage
-rw-r--r-- 1 root root  30M May 22 00:55 valid_original_function.json.gz
-rw-r--r-- 1 root root 1.1G May 22 00:56 without_docstrings.function
-rw-r--r-- 1 root root 345M May 22 00:59 without_docstrings.lineage
-rw-r--r-- 1 root root 357M M

In [33]:
# get astminer input
              
# docdf = with_docstrings[['nwo', 'path', 'function_name', 'docstring_tokens']]
# docdf['full_path'] = docdf.apply(lambda row: '/'.join([row['nwo'], row['path'].replace('.','_'), row['function_name']]),axis=1)
# docdf = docdf[['full_path', 'docstring_tokens']]
# docdf.to_csv('ast.csv')



## The pre-processed data is also hosted on Google Cloud, at the following URLs:

In [24]:
# # cool trick to send shell command results into a python variable in a jupyter notebook!
# files = ! ls ./data/processed_data/ | grep -E '*.function$|*.docstring$|*.lineage$|*_original_function.json.gz$'

# # print the urls
# urls = [f'https://storage.googleapis.com/kubeflow-examples/code_search/data/{f}' for f in files]
# for s in urls:
#     print(s)

https://storage.googleapis.com/kubeflow-examples/code_search/data/test.docstring
https://storage.googleapis.com/kubeflow-examples/code_search/data/test.function
https://storage.googleapis.com/kubeflow-examples/code_search/data/test.lineage
https://storage.googleapis.com/kubeflow-examples/code_search/data/test_original_function.json.gz
https://storage.googleapis.com/kubeflow-examples/code_search/data/train.docstring
https://storage.googleapis.com/kubeflow-examples/code_search/data/train.function
https://storage.googleapis.com/kubeflow-examples/code_search/data/train.lineage
https://storage.googleapis.com/kubeflow-examples/code_search/data/train_original_function.json.gz
https://storage.googleapis.com/kubeflow-examples/code_search/data/valid.docstring
https://storage.googleapis.com/kubeflow-examples/code_search/data/valid.function
https://storage.googleapis.com/kubeflow-examples/code_search/data/valid.lineage
https://storage.googleapis.com/kubeflow-examples/code_search/data/valid_origina