Table of Contents
Each step in the diagram above corresponds to a Jupyter notebook in this repo. Below is a high-level description of each step:
1 - Preprocess Data: Retrieve Python files from BigQuery and use the AST module to clean the code and extract docstrings.
2 - Train Function Summarizer: Build a sequence-to-sequence model that predicts a docstring given a Python function or method. The primary purpose of this model is transfer learning: it is later used to extract feature vectors from code.
3 - Train Language Model: Build a language model using Fastai on a corpus of docstrings. We will use this model for transfer learning to encode short phrases or sentences, such as docstrings and search queries.
4 - Train Code2Emb Model: Fine-tune the model from step 2 to predict vectors instead of docstrings. This model will be used to represent code in the same vector space as the sentence embeddings produced in step 3.
5 - Build Search Engine: Use the assets you created in steps 3 and 4 to build a semantic search tool.
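To make step 1 concrete, here is a minimal sketch of docstring extraction with Python's built-in `ast` module. The helper name and the sample source are illustrative, not the repo's actual API; the notebooks apply the same idea at scale to files pulled from BigQuery.

```python
import ast

# Illustrative sample source; in practice this comes from Python files
# retrieved from BigQuery.
source = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def helper():
    return None
'''

def extract_pairs(code):
    """Yield (function_name, docstring) for each documented function."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # keep only documented functions
                yield node.name, doc

pairs = list(extract_pairs(source))
print(pairs)  # → [('add', 'Return the sum of a and b.')]
```

Functions without docstrings (like `helper` above) are dropped, since each (code, docstring) pair serves as a training example for the summarizer in step 2.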
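The search step (5) boils down to nearest-neighbor lookup in the shared vector space: embed the query with the language model from step 3, then rank code vectors from step 4 by similarity. Below is a toy sketch using cosine similarity over hand-made vectors; the snippet names and numbers are invented, and a real index would use a dedicated nearest-neighbor library rather than a linear scan.

```python
import math

# Pretend these came from the code2emb model (step 4); values are made up.
code_vectors = {
    "open_file": [0.9, 0.1, 0.0],
    "sort_list": [0.1, 0.8, 0.2],
    "http_get":  [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def search(query_vec, k=2):
    """Return the k snippet names closest to the query vector."""
    ranked = sorted(code_vectors,
                    key=lambda name: cosine(query_vec, code_vectors[name]),
                    reverse=True)
    return ranked[:k]

# A query vector as the sentence encoder from step 3 might produce it.
print(search([0.85, 0.2, 0.05]))  # → ['open_file', 'sort_list']
```

Because steps 3 and 4 place queries and code in the same vector space, this one similarity function is all the search engine needs at query time.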