# Data Exploration

This notebook explores the preprocessed data and shows some basic statistics that may be useful.

In [1]:
import json
from pathlib import Path
from pprint import pprint

import pandas as pd

# Show up to 300 characters per cell so code and docstrings stay readable.
pd.set_option('display.max_colwidth', 300)

Definitions of each of the above fields are located in the README.md file in the root of this repository.

## Part 2: Exploring The Full Dataset

You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding.

The training data is located in `/resources/data`, which contains approximately 3.2 million (code, comment) pairs across the train, validation, and test partitions. You can learn more about the directory structure and associated files by viewing `/resources/README.md`.

The preprocessed data are stored in [JSON Lines](http://jsonlines.org/) format. First, we can get a list of all these files for further inspection:

In [3]:
data_path = Path('/tf/data/datasets')

In [4]:
java_files = sorted((data_path / 'java').glob('**/*.gz'))

In [5]:
print(f'Total number of java files: {len(java_files):,}')

Total number of java files: 18
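
Before loading everything, it can help to look at a single raw record to see which fields a file actually carries. The following is a minimal sketch using only the standard library; `peek_jsonl_gz` is a hypothetical helper name that is not part of this repository, and it assumes each non-blank line of the file is one complete JSON object, as the JSON Lines format specifies.

```python
import gzip
import json
from itertools import islice


def peek_jsonl_gz(path, n=1):
    """Return the first n records of a gzipped JSON Lines file as dicts.

    Hypothetical helper for illustration; assumes one JSON object per line.
    """
    # 'rt' opens the gzip stream in text mode, so each line is a str.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        # Skip blank lines lazily so only the first n records are parsed.
        non_blank = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(non_blank, n)]
```

For example, `pprint(sorted(peek_jsonl_gz(java_files[0])[0].keys()))` would list the field names of the first record in the first file without decompressing the rest of it.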


To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: 

In [6]:
columns_long_list = ['repo', 'path', 'url', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = ['code_tokens', 'docstring_tokens', 
                      'language', 'partition']

cols = ['code', 'docstring']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    """Load a list of jsonl.gz files into a pandas DataFrame."""
    return pd.concat([pd.read_json(f, 
                                   orient='records', 
                                   compression='gzip',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)
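
The loader above parses every column of each file and only then selects the requested ones. If memory becomes a concern, one alternative is to stream the files and keep only the needed keys as each record is parsed. The sketch below is an assumption about how that might look, not the approach this notebook uses; `iter_jsonl_records` is a hypothetical name, and the code again assumes one JSON object per non-blank line.

```python
import gzip
import json


def iter_jsonl_records(file_list, columns):
    """Yield one dict per record, keeping only the requested columns.

    Hypothetical streaming alternative to jsonl_list_to_dataframe.
    """
    for path in file_list:
        with gzip.open(path, 'rt', encoding='utf-8') as f:
            for line in f:
                if not line.strip():
                    continue  # tolerate blank lines between records
                record = json.loads(line)
                # A missing key raises KeyError, much like df[columns] would.
                yield {col: record[col] for col in columns}
```

The result can be materialized with `pd.DataFrame(list(iter_jsonl_records(java_files, cols)))`. Note one difference from `jsonl_list_to_dataframe`: this yields a single continuous index, whereas `pd.concat` without `ignore_index=True` restarts the index at 0 for each file.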

This is what the Java dataset looks like:

In [9]:
java_df = jsonl_list_to_dataframe(java_files, cols)

In [10]:
java_df.head(3)

,code,docstring
0,"protected final void fastPathOrderedEmit(U value, boolean delayError, Disposable disposable) {\n final Observer<? super V> observer = downstream;\n final SimplePlainQueue<U> q = queue;\n\n if (wip.get() == 0 && wip.compareAndSet(0, 1)) {\n if (q.isEmpty()) {\n ...","Makes sure the fast-path emits in order.\n@param value the value to emit or queue up\n@param delayError if true, errors are delayed until the source has terminated\n@param disposable the resource to dispose if the drain terminates"
1,"@CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> amb(Iterable<? extends ObservableSource<? extends T>> sources) {\n ObjectHelper.requireNonNull(sources, ""sources is null"");\n return RxJavaPlugins.onAssembly(new Obser...","Mirrors the one ObservableSource in an Iterable of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</..."
2,"@SuppressWarnings(""unchecked"")\n @CheckReturnValue\n @NonNull\n @SchedulerSupport(SchedulerSupport.NONE)\n public static <T> Observable<T> ambArray(ObservableSource<? extends T>... sources) {\n ObjectHelper.requireNonNull(sources, ""sources is null"");\n int len = sources...","Mirrors the one ObservableSource in an array of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</b><..."


## Summary Statistics