# Scraping Julia Source Code
To train an autoencoder that produces low-dimensional vector representations of Julia source code, we first need a labeled training dataset. We build it by traversing the directory structure of the base Julia language repository and extracting the Julia expressions found in each source file.

This notebook illustrates the workflow within the src/validation/julia_code_scraping.jl file. 

In [None]:
using DelimitedFiles
include("../../../src/parse.jl")


We select only those expressions that are 500 characters or fewer. This excludes only 0.5% of the available Julia code snippets and keeps our RNN autoencoder model computationally tractable. The longest expressions are on the order of 16,000 characters and consist mostly of lists of available characters.


In [None]:
maxlen = 500;
dir = expanduser("~/Documents/git/julia");  # Julia does not expand "~" in paths on its own
file_type = "jl";
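
To get a feel for what a cutoff such as `maxlen` does to a collection of snippets, here is a minimal sketch that computes the share of strings longer than a threshold. The toy `snippets` vector and the helper name `share_excluded` are invented for illustration and are not part of the scraping code.

```julia
# Hypothetical helper: fraction of snippets longer than `maxlen` characters.
share_excluded(snippets, maxlen) =
    count(s -> length(s) > maxlen, snippets) / length(snippets)

# Toy data standing in for scraped expressions (illustrative only).
snippets = ["x = 1", "f(a, b) = a + b", "y = " * "1 + "^200 * "1"]

maxlen = 500
excluded = share_excluded(snippets, maxlen)
kept = filter(s -> length(s) <= maxlen, snippets)

println("excluded share: ", excluded)
println("kept: ", length(kept), " of ", length(snippets))
```

Running the same calculation on the real scraped snippets is one way to check the excluded share for a given `maxlen`.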


We define two utility functions to extract and save our Julia expressions as labeled code snippets. The first, `read_code()`, traverses the Julia repo directory structure and parses each Julia source code file into Julia `Expr` objects. 

`read_code()` then calls our second function, `get_expr()`, to convert the `Expr` objects into strings for storage and analysis. If a given `Expr` object is a block that can be decomposed further, `get_expr()` recurses into it and returns every bottom-level expression it finds, each paired with the path of its source file. `read_code()` collects these results into an array named `all_funcs`.
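
The directory traversal step can be sketched on its own, without any parsing. The example below builds a small throwaway directory tree so it runs anywhere; the helper name `list_source_files` and the file names are assumptions made for illustration only.

```julia
# Hypothetical helper: collect paths of files with a given extension under `dir`.
function list_source_files(dir, file_type="jl")
    paths = String[]
    for (root, dirs, files) in walkdir(dir)
        for file in files
            if endswith(file, "." * file_type)
                push!(paths, joinpath(root, file))
            end
        end
    end
    return sort(paths)
end

# Build a temporary directory tree, list its .jl files, then clean up.
found = mktempdir() do tmp
    mkpath(joinpath(tmp, "base", "strings"))
    touch(joinpath(tmp, "base", "array.jl"))
    touch(joinpath(tmp, "base", "strings", "util.jl"))
    touch(joinpath(tmp, "README.md"))
    # Report paths relative to the temporary root for readability.
    [relpath(p, tmp) for p in list_source_files(tmp)]
end

println(found)
```

Only the two `.jl` files are returned; `README.md` is skipped by the `endswith` check.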

In [None]:
function read_code(dir, maxlen=500, file_type="jl", verbose=false)
    # Patterns for comments and docstrings; defined here but not applied below.
    comments = r"\#.*\n"
    docstring = r"\"{3}.*?\"{3}"s

    all_funcs = []

    # Walk the directory tree and parse every file with the requested extension.
    for (root, dirs, files) in walkdir(dir)
        for file in files
            if endswith(file, "." * file_type)
                path = joinpath(root, file)
                s = Parsers.parsefile(path)
                if s !== nothing
                    append!(all_funcs, get_expr(s, path, verbose))
                end
            end
        end
    end

    # Each entry is a (snippet, path) tuple, so filter on the snippet string.
    filter!(x -> first(x) != "", all_funcs)
    filter!(x -> length(first(x)) <= maxlen, all_funcs)
    unique!(all_funcs)

    return all_funcs
end


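function get_expr(exp_tree, path, verbose=false)
    leaves = []

    for arg in exp_tree.args
        if verbose
            println(arg)
        end
        # Only Expr nodes are kept; line-number nodes, symbols and literals are skipped.
        if arg isa Expr
            if arg.head != :block
                # Non-block expression: store its string form with the source path.
                if verbose
                    println("Pushed!")
                end
                push!(leaves, (string(arg), path))
            else
                # Block expression: descend and collect its children instead.
                if verbose
                    println("Recursing!")
                end
                append!(leaves, get_expr(arg, path, verbose))
            end
        end
    end

    return leaves
end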
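
To illustrate the decomposition on a small input, the sketch below parses a string with `Meta.parse` and walks the resulting `Expr` the same way. It is a simplified stand-in, not the notebook's own code: `collect_leaves` is a hypothetical name, and the source path is omitted for brevity.

```julia
# Recursively collect the string form of every non-block sub-expression.
function collect_leaves(ex::Expr)
    leaves = String[]
    for arg in ex.args
        arg isa Expr || continue                  # skip LineNumberNode, symbols, literals
        if arg.head == :block
            append!(leaves, collect_leaves(arg))  # descend into nested blocks
        else
            push!(leaves, string(arg))            # keep as a single snippet
        end
    end
    return leaves
end

src = """
begin
    x = 1
    begin
        y = x + 1
    end
    f(a) = a * 2
end"""

tree = Meta.parse(src)
leaves = collect_leaves(tree)
foreach(println, leaves)
```

The nested `begin ... end` is flattened, so `y = x + 1` appears as its own snippet. The definition of `f` is kept whole, because its top-level head is an assignment rather than a block, even though its body contains one.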


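Finally, `all_funcs` is saved to disk for eventual input to our autoencoding model. Because `writedlm` uses a tab delimiter by default, the resulting file is tab-separated despite its `.csv` extension.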

In [None]:
all_funcs = read_code(dir, maxlen, file_type);
writedlm("all_funcs.csv", all_funcs, quotes=true);
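
As a sanity check on the storage format, the following sketch writes a couple of toy (snippet, path) pairs with `writedlm` and reads them back with `readdlm`. The rows and the file name `snippets.tsv` are placeholders invented for this example. It assumes simple single-line snippets and does not exercise embedded quotes or newlines.

```julia
using DelimitedFiles

# Toy (snippet, source path) pairs standing in for `all_funcs` (illustrative only).
rows = [("f(x) = x + 1", "base/a.jl"), ("g(y) = 2y", "base/b.jl")]

loaded = mktempdir() do tmp
    path = joinpath(tmp, "snippets.tsv")       # hypothetical file name
    writedlm(path, rows, quotes=true)          # one row per tuple, tab-delimited by default
    readdlm(path, '\t', String, quotes=true)   # read every cell back as a String
end

println(size(loaded))
println(loaded[1, 1], "  <-  ", loaded[1, 2])
```

Each tuple becomes one row, with the snippet in the first column and its source path in the second.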


In [None]:
println(size(all_funcs))
println()
println.(all_funcs[1:5]);