> Simply copied from original notebook https://github.com/github/CodeSearchNet/blob/master/notebooks/ExploreData.ipynb

In [1]:
import sys
sys.path.append('../../..')

In [14]:
import os
from pathlib import Path
import pandas as pd

from codenets.codesearchnet.copied_code.utils import read_file_samples

## Exploring The Full Dataset

You will need to complete the setup steps in the README.md file located in the root of this repository before proceeding.

The training data is located in `/resources/data`, which contains approximately 3.2 million (code, comment) pairs across the train, validation, and test partitions. You can learn more about the directory structure and associated files by viewing `/resources/README.md`.

The preprocessed data are stored in [JSON Lines](http://jsonlines.org/) format. First, we can get a list of all these files for further inspection:

In [10]:
root_path = Path("/home/mandubian/workspaces/tools/CodeSearchNet/")
python_files = sorted((root_path / "resources/data/python/").glob('**/*.gz'))
java_files = sorted((root_path / "resources/data/java/").glob('**/*.gz'))
go_files = sorted((root_path / "resources/data/go/").glob('**/*.gz'))
php_files = sorted((root_path / "resources/data/php").glob('**/*.gz'))
javascript_files = sorted((root_path / "resources/data/javascript").glob('**/*.gz'))
ruby_files = sorted((root_path / "resources/data/ruby").glob('**/*.gz'))
all_files = python_files + go_files + java_files + php_files + javascript_files + ruby_files

In [12]:
print(f'Total number of files: {len(all_files):,}')

Total number of files: 77
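
Before loading everything, it can help to peek at a single record. The following is a minimal sketch, assuming each `.jsonl.gz` file holds one JSON object per line; `peek_jsonl_gz` is a hypothetical helper introduced here for illustration and is not part of the CodeSearchNet code base.

```python
import gzip
import json


def peek_jsonl_gz(path, n=1):
    """Return the first n records of a gzip-compressed JSON Lines file."""
    records = []
    # 'rt' opens the gzip stream in text mode, so each line is a str
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

With the file lists defined above, something like `peek_jsonl_gz(python_files[0])[0].keys()` should list the fields of one record without reading the whole file.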


To make analysis of this dataset easier, we can load all of the data into a pandas dataframe: 

In [15]:
columns_long_list = ['repo', 'path', 'url', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = ['code_tokens', 'docstring_tokens', 
                      'language', 'partition']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    """Load a list of jsonl.gz files into a pandas DataFrame."""
    return pd.concat([pd.read_json(f, 
                                   orient='records', 
                                   compression='gzip',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)

Two columns that will be heavily used in this dataset are `code_tokens` and `docstring_tokens`, which form a parallel corpus that can be used for interesting tasks like information retrieval (for example, trying to retrieve a code snippet using the docstring). You can find more information regarding the definition of the above columns in the README of this repo.
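
To make the retrieval idea concrete, here is a toy sketch that ranks records by the lexical overlap between a query and each record's `docstring_tokens`. This is only an illustrative baseline, not the approach used by the CodeSearchNet models; the helpers `jaccard` and `retrieve` and the two sample records are made up for this example.

```python
def jaccard(a, b):
    """Jaccard similarity between two token lists, compared case-insensitively."""
    sa, sb = {t.lower() for t in a}, {t.lower() for t in b}
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve(query_tokens, corpus):
    """Return the corpus record whose docstring tokens best match the query."""
    return max(corpus, key=lambda rec: jaccard(query_tokens, rec["docstring_tokens"]))


# Toy records that mimic the two columns; the contents are made up.
corpus = [
    {"docstring_tokens": ["Reads", "a", "file", "into", "a", "string"],
     "code_tokens": ["def", "read_file", "(", "path", ")", ":"]},
    {"docstring_tokens": ["Sorts", "a", "list", "of", "numbers"],
     "code_tokens": ["def", "sort_numbers", "(", "xs", ")", ":"]},
]

best = retrieve(["sort", "numbers"], corpus)
print(best["code_tokens"][1])  # name of the best-matching function
```

Note that exact token matching misses "sort" versus "Sorts" here; the second record still wins only because "numbers" overlaps, which hints at why learned representations are attractive for this task.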

Next, we will read all of the data for a limited subset of these columns into memory so we can compute summary statistics. **Warning:** This step takes about 20 minutes.

In [38]:
all_df = jsonl_list_to_dataframe(all_files, columns=columns_short_list)

In [39]:
all_df.head(3)

| | code_tokens | docstring_tokens | language | partition |
|---|---|---|---|---|
| 0 | [def, train, (, train_dir, ,, model_save_path,... | [Trains, a, k, -, nearest, neighbors, classifi... | python | train |
| 1 | [def, predict, (, X_img_path, ,, knn_clf, =, N... | [Recognizes, faces, in, given, image, using, a... | python | train |
| 2 | [def, show_prediction_labels_on_image, (, img_... | [Shows, the, face, recognition, results, visua... | python | train |


## Summary Statistics

### Row Counts

#### By Partition

In [40]:
all_df.partition.value_counts()

train    1880853
valid      89154
test       78353
Name: partition, dtype: int64

#### By Language

In [41]:
all_df.language.value_counts()

php           578118
java          496688
python        435285
go            346365
javascript    138625
ruby           53279
Name: language, dtype: int64

#### By Partition & Language

In [42]:
all_df.groupby(['partition', 'language'])['code_tokens'].count()

partition  language  
test       go             14291
           java           26909
           javascript      6483
           php            28391
           ruby            2279
train      go            317832
           java          454451
           javascript    123889
           php           523712
           python        412178
           ruby           48791
valid      go             14242
           java           15328
           javascript      8253
           php            26015
           python         23107
           ruby            2209
Name: code_tokens, dtype: int64
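
The long-format listing above simply omits combinations that have no rows (for instance, there is no `test`/`python` entry in it). A wide table makes such gaps explicit. The sketch below uses a small made-up `toy_df` as a stand-in for `all_df`; the same `groupby(...).size().unstack(fill_value=0)` chain should apply to the real frame.

```python
import pandas as pd

# Toy stand-in for all_df; the rows are made up for illustration.
toy_df = pd.DataFrame({
    "partition": ["train", "train", "train", "valid", "test"],
    "language": ["python", "python", "go", "python", "go"],
    "code_tokens": [["a"], ["b"], ["c"], ["d"], ["e"]],
})

# size() counts rows per group; unstack() moves 'language' into columns,
# and fill_value=0 makes absent combinations show up as explicit zeros.
counts = (
    toy_df.groupby(["partition", "language"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```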

### Token Lengths By Language

In [43]:
# Number of tokens per code snippet and per docstring (used as the query)
all_df['code_len'] = all_df.code_tokens.apply(len)
all_df['query_len'] = all_df.docstring_tokens.apply(len)

#### Code Length Percentile By Language

For example, the 80th percentile code length for Python is 72 tokens.

In [44]:
code_len_summary = all_df.groupby('language')['code_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(code_len_summary))

| language | | code_len |
|---|---|---|
| go | 0.5 | 61.0 |
| go | 0.7 | 100.0 |
| go | 0.8 | 138.0 |
| go | 0.9 | 217.0 |
| go | 0.95 | 319.0 |
| java | 0.5 | 66.0 |
| java | 0.7 | 104.0 |
| java | 0.8 | 142.0 |
| java | 0.9 | 224.0 |
| java | 0.95 | 331.0 |
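
As a reminder of how to read these numbers, the small sketch below computes an 80th percentile on made-up lengths and checks what share of samples fall at or below it. The values are hypothetical; pandas interpolates linearly between neighbouring values by default, so the result need not be one of the observed lengths.

```python
import pandas as pd

# Made-up token counts for ten hypothetical functions.
lengths = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

p80 = lengths.quantile(0.8)
# Fraction of samples that fit within the 80th-percentile length
coverage = (lengths <= p80).mean()

print(p80, coverage)
```

Read this way, a percentile gives a rough sense of how many samples would fit untruncated under a given maximum sequence length.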


#### Query Length Percentile By Language

For example, the 80th percentile query length for Python is 19 tokens.

In [45]:
query_len_summary = all_df.groupby('language')['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

| language | | query_len |
|---|---|---|
| go | 0.5 | 12.0 |
| go | 0.7 | 19.0 |
| go | 0.8 | 28.0 |
| go | 0.9 | 49.0 |
| go | 0.95 | 92.0 |
| java | 0.5 | 11.0 |
| java | 0.7 | 18.0 |
| java | 0.8 | 25.0 |
| java | 0.9 | 39.0 |
| java | 0.95 | 61.0 |


#### Query Length All Languages

In [46]:
query_len_summary = all_df['query_len'].quantile([.5, .7, .8, .9, .95])
display(pd.DataFrame(query_len_summary))

| | query_len |
|---|---|
| 0.5 | 9.0 |
| 0.7 | 15.0 |
| 0.8 | 20.0 |
| 0.9 | 32.0 |
| 0.95 | 50.0 |
