# Project 3 - Semantic Code Search
## Submitted by:
### Dhaval Patel - DJP526, Akshat Khare - AK7674, Disha Papneja - DP3074

## Introduction

In this project we used the CodeSearchNet Corpus and participated in the corresponding challenge. We focused only on the Python dataset, which contains about 0.5 million function-documentation pairs and about another 1.1 million functions without associated documentation. We then submitted our Normalized Discounted Cumulative Gain (NDCG) score for the human-annotated examples only.
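
For reference, NDCG compares the discounted gain of a ranked result list with that of the ideal ordering of the same results. The sketch below uses one common formulation (linear gain, log2 discount). The function names `dcg` and `ndcg` are illustrative, and the challenge's official evaluation may differ in its details.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance scores given in ranked order."""
    # rank 0 is the top result, so its discount is log2(2) = 1
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1, 0]))  # results already in the ideal order
print(ndcg([3, 2, 0, 1]))  # the two lowest-ranked results are swapped
```

A ranking that is already ideally ordered scores 1, and any misordering lowers the score.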



In this notebook we implement a bag-of-words approach. A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things: a vocabulary of known words and a measure of the presence of known words. Further details can be found in this research paper: https://arxiv.org/pdf/1909.09436.pdf.

# Importing the data

In this section we import the dataset and explore its format and structure. It is useful to explore a small sample first in order to understand the data. While the full dataset can be automatically downloaded with the /script/setup script located in this repo, we can alternatively download a subset of the data from S3.

In [0]:
import json

import pandas as pd
from pathlib import Path
pd.set_option('max_colwidth',300)
from pprint import pprint


## Downloading and decompressing the dataset

First we download the Python dataset from https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

In [2]:
#Ref: https://github.com/github/CodeSearchNet/blob/master/notebooks/ExploreData.ipynb
!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

--2020-05-11 01:08:31--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.9.222
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.9.222|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 940909997 (897M) [application/zip]
Saving to: ‘python.zip’


2020-05-11 01:09:28 (15.9 MB/s) - ‘python.zip’ saved [940909997/940909997]



Now, we unzip the dataset

In [3]:
# -o option overwrites the files without prompting    
!unzip -o python.zip

Archive:  python.zip
   creating: python/
   creating: python/final/
   creating: python/final/jsonl/
   creating: python/final/jsonl/train/
  inflating: python/final/jsonl/train/python_train_9.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_12.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_10.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_0.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_6.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_2.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_4.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_8.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_11.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_5.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_13.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_3.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_1.jsonl.gz  
  inflating: python/fin

The unzipped dataset also contains .gz files, so we now decompress all the gzip files.

The Python training dataset is divided into 14 chunks. Each chunk contains 30000 rows, except the last one, which contains 22178 rows.

In [0]:
# decompress this gzip file
!gzip -f -d python/final/jsonl/train/python_train_0.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_1.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_2.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_3.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_4.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_5.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_6.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_7.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_8.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_9.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_10.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_11.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_12.jsonl.gz
!gzip -f -d python/final/jsonl/train/python_train_13.jsonl.gz
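
The fourteen shell commands above can also be written as a loop. The following is a minimal sketch using only the standard library; `decompress_all` is a hypothetical helper name, and it assumes the same folder layout as the commands above.

```python
import gzip
import shutil
from pathlib import Path

def decompress_all(folder):
    """Decompress every *.jsonl.gz file in `folder`, then delete the archive."""
    extracted = []
    for gz_path in sorted(Path(folder).glob("*.jsonl.gz")):
        out_path = gz_path.with_suffix("")  # drops the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # mirrors `gzip -d`, which removes the .gz file
        extracted.append(out_path)
    return extracted

# Example call (commented out so the block runs without the dataset present):
# decompress_all("python/final/jsonl/train")
```

Because the files are found with a glob pattern, the same helper would work for the test folder without changes.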


Now we can inspect any of the unzipped files to see its contents.
We read in one file; the data is stored in JSON Lines format.
Since each line in the file is valid JSON, we can display a row in a more human-readable form.

In [0]:
with open('python/final/jsonl/train/python_train_0.jsonl', 'r') as f:
    sample_file_train = f.readlines()

# Preprocessing the data

In this section we convert the training and testing datasets into dataframes.

The function below returns the index of the shortest row in a given file. We use it for display purposes only: a long row would be difficult to read, so we show the shortest one.

In [0]:
# function to select the minimum data row out of the whole dataset
def getMinimumDataRow(passedDataset):
  minDataDisplay=0
  for i in range(len(passedDataset)):
    if(len(passedDataset[minDataDisplay])>len(passedDataset[i])):
      minDataDisplay=i
  return minDataDisplay
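
The same search can be expressed with the built-in `min` and a key function. This is a sketch of an equivalent formulation, not a replacement used elsewhere in the notebook; `shortest_row_index` is an illustrative name.

```python
def shortest_row_index(rows):
    """Index of the shortest element; ties resolve to the earliest index."""
    return min(range(len(rows)), key=lambda i: len(rows[i]))

print(shortest_row_index(["aaaa", "bb", "ccc", "dd"]))  # → 1
```

One difference to keep in mind: on an empty list this version raises `ValueError`, whereas the loop above returns 0.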

The shortest row of the file is shown below. We use pprint (pretty printer) to print the parsed JSON with proper indentation and spacing.

In [7]:
indexToDisplay=getMinimumDataRow(sample_file_train)
#print(json.loads(sample_file_train[indexToDisplay]))
# for formatted print --> use pprint
pprint(json.loads(sample_file_train[indexToDisplay]))

{'code': 'async def add(ctx, left: int, right: int):\n'
         '    """Adds two numbers together."""\n'
         '    await ctx.send(left + right)',
 'code_tokens': ['async',
                 'def',
                 'add',
                 '(',
                 'ctx',
                 ',',
                 'left',
                 ':',
                 'int',
                 ',',
                 'right',
                 ':',
                 'int',
                 ')',
                 ':',
                 'await',
                 'ctx',
                 '.',
                 'send',
                 '(',
                 'left',
                 '+',
                 'right',
                 ')'],
 'docstring': 'Adds two numbers together.',
 'docstring_tokens': ['Adds', 'two', 'numbers', 'together', '.'],
 'func_name': 'add',
 'language': 'python',
 'original_string': 'async def add(ctx, left: int, right: int):\n'
                    '    """Adds two numbers together."""\n'
 

The code below combines all the training dataset files into a single list of JSON lines:

In [8]:
howManyTrainDataSet=14
train_data_list_json=[]

for i in range(howManyTrainDataSet):
  oneFile=[]
  with open('python/final/jsonl/train/python_train_'+str(i)+'.jsonl', 'r') as f:
    oneFile = f.readlines()
  print("Total dataset in Train_"+str(i)+" is : "+str(len(oneFile)))
  train_data_list_json=train_data_list_json+oneFile

print("Train Dataset has",len(train_data_list_json),"rows")


Total dataset in Train_0 is : 30000
Total dataset in Train_1 is : 30000
Total dataset in Train_2 is : 30000
Total dataset in Train_3 is : 30000
Total dataset in Train_4 is : 30000
Total dataset in Train_5 is : 30000
Total dataset in Train_6 is : 30000
Total dataset in Train_7 is : 30000
Total dataset in Train_8 is : 30000
Total dataset in Train_9 is : 30000
Total dataset in Train_10 is : 30000
Total dataset in Train_11 is : 30000
Total dataset in Train_12 is : 30000
Total dataset in Train_13 is : 22178
Train Dataset has 412178 rows


## Conversion from text to Dataframes

The function below converts a list of JSON lines into a dataframe. The column names are taken from the keys of the first row:

In [0]:
def convertToDataFrame(data_list_json):
  dfList=[]
  oneRow=[]
  columnNames=list(json.loads((data_list_json[0])).keys())
  for i in range(len(data_list_json)):
    oneList=[]
    oneRow=data_list_json[i]
    jsonString=json.loads(oneRow)
    for oneKey in columnNames:
      oneList.append(jsonString[oneKey])
    dfList.append(oneList)
  df=pd.DataFrame(dfList, columns =columnNames) 

  return df
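
The function above raises a `KeyError` if a later row lacks one of the keys of the first row. As a hedged alternative, the sketch below parses the lines into a list of dicts and fills missing keys with `None`. `jsonl_to_records` is a hypothetical helper; the resulting list of dicts can be passed directly to `pd.DataFrame`.

```python
import json

def jsonl_to_records(lines, columns=None):
    """Parse JSON Lines into a list of dicts restricted to `columns`.

    Keys missing from a row become None instead of raising KeyError.
    """
    rows = [json.loads(line) for line in lines if line.strip()]
    if columns is None and rows:
        columns = list(rows[0].keys())  # same convention as above: first row defines the columns
    return [{key: row.get(key) for key in (columns or [])} for row in rows]

sample = ['{"func_name": "add", "language": "python"}\n', '{"func_name": "sub"}\n']
records = jsonl_to_records(sample)
print(records)
```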


We now convert the training dataset into dataframes.

In [0]:
df_train=convertToDataFrame(train_data_list_json)

Now, after converting the dataset into dataframe, we display the first 5 rows of it:

In [11]:
df_train.head()

Unnamed: 0,repo,path,func_name,original_string,language,code,code_tokens,docstring,docstring_tokens,sha,url,partition
0,ageitgey/face_recognition,examples/face_recognition_knn.py,train,"def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo='ball_tree', verbose=False):\n """"""\n Trains a k-nearest neighbors classifier for face recognition.\n\n :param train_dir: directory that contains a sub-directory for each known person, with its name.\n\n (View in s...",python,"def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo='ball_tree', verbose=False):\n """"""\n Trains a k-nearest neighbors classifier for face recognition.\n\n :param train_dir: directory that contains a sub-directory for each known person, with its name.\n\n (View in s...","[def, train, (, train_dir, ,, model_save_path, =, None, ,, n_neighbors, =, None, ,, knn_algo, =, 'ball_tree', ,, verbose, =, False, ), :, X, =, [, ], y, =, [, ], # Loop through each person in the training set, for, class_dir, in, os, ., listdir, (, train_dir, ), :, if, not, os, ., path, ., isdir...","Trains a k-nearest neighbors classifier for face recognition.\n\n :param train_dir: directory that contains a sub-directory for each known person, with its name.\n\n (View in source code to see train_dir example tree structure)\n\n Structure:\n <train_dir>/\n ├── <person...","[Trains, a, k, -, nearest, neighbors, classifier, for, face, recognition, .]",c96b010c02f15e8eeb0f71308c641179ac1f19bb,https://github.com/ageitgey/face_recognition/blob/c96b010c02f15e8eeb0f71308c641179ac1f19bb/examples/face_recognition_knn.py#L46-L108,train
1,ageitgey/face_recognition,examples/face_recognition_knn.py,predict,"def predict(X_img_path, knn_clf=None, model_path=None, distance_threshold=0.6):\n """"""\n Recognizes faces in given image using a trained KNN classifier\n\n :param X_img_path: path to image to be recognized\n :param knn_clf: (optional) a knn classifier object. if not specified, model_s...",python,"def predict(X_img_path, knn_clf=None, model_path=None, distance_threshold=0.6):\n """"""\n Recognizes faces in given image using a trained KNN classifier\n\n :param X_img_path: path to image to be recognized\n :param knn_clf: (optional) a knn classifier object. if not specified, model_s...","[def, predict, (, X_img_path, ,, knn_clf, =, None, ,, model_path, =, None, ,, distance_threshold, =, 0.6, ), :, if, not, os, ., path, ., isfile, (, X_img_path, ), or, os, ., path, ., splitext, (, X_img_path, ), [, 1, ], [, 1, :, ], not, in, ALLOWED_EXTENSIONS, :, raise, Exception, (, ""Invalid im...","Recognizes faces in given image using a trained KNN classifier\n\n :param X_img_path: path to image to be recognized\n :param knn_clf: (optional) a knn classifier object. if not specified, model_save_path must be specified.\n :param model_path: (optional) path to a pickled knn classifie...","[Recognizes, faces, in, given, image, using, a, trained, KNN, classifier]",c96b010c02f15e8eeb0f71308c641179ac1f19bb,https://github.com/ageitgey/face_recognition/blob/c96b010c02f15e8eeb0f71308c641179ac1f19bb/examples/face_recognition_knn.py#L111-L150,train
2,ageitgey/face_recognition,examples/face_recognition_knn.py,show_prediction_labels_on_image,"def show_prediction_labels_on_image(img_path, predictions):\n """"""\n Shows the face recognition results visually.\n\n :param img_path: path to image to be recognized\n :param predictions: results of the predict function\n :return:\n """"""\n pil_image = Image.open(img_path).conv...",python,"def show_prediction_labels_on_image(img_path, predictions):\n """"""\n Shows the face recognition results visually.\n\n :param img_path: path to image to be recognized\n :param predictions: results of the predict function\n :return:\n """"""\n pil_image = Image.open(img_path).conv...","[def, show_prediction_labels_on_image, (, img_path, ,, predictions, ), :, pil_image, =, Image, ., open, (, img_path, ), ., convert, (, ""RGB"", ), draw, =, ImageDraw, ., Draw, (, pil_image, ), for, name, ,, (, top, ,, right, ,, bottom, ,, left, ), in, predictions, :, # Draw a box around the face u...",Shows the face recognition results visually.\n\n :param img_path: path to image to be recognized\n :param predictions: results of the predict function\n :return:,"[Shows, the, face, recognition, results, visually, .]",c96b010c02f15e8eeb0f71308c641179ac1f19bb,https://github.com/ageitgey/face_recognition/blob/c96b010c02f15e8eeb0f71308c641179ac1f19bb/examples/face_recognition_knn.py#L153-L181,train
3,ageitgey/face_recognition,face_recognition/api.py,_rect_to_css,"def _rect_to_css(rect):\n """"""\n Convert a dlib 'rect' object to a plain tuple in (top, right, bottom, left) order\n\n :param rect: a dlib 'rect' object\n :return: a plain tuple representation of the rect in (top, right, bottom, left) order\n """"""\n return rect.top(), rect.right(...",python,"def _rect_to_css(rect):\n """"""\n Convert a dlib 'rect' object to a plain tuple in (top, right, bottom, left) order\n\n :param rect: a dlib 'rect' object\n :return: a plain tuple representation of the rect in (top, right, bottom, left) order\n """"""\n return rect.top(), rect.right(...","[def, _rect_to_css, (, rect, ), :, return, rect, ., top, (, ), ,, rect, ., right, (, ), ,, rect, ., bottom, (, ), ,, rect, ., left, (, )]","Convert a dlib 'rect' object to a plain tuple in (top, right, bottom, left) order\n\n :param rect: a dlib 'rect' object\n :return: a plain tuple representation of the rect in (top, right, bottom, left) order","[Convert, a, dlib, rect, object, to, a, plain, tuple, in, (, top, right, bottom, left, ), order]",c96b010c02f15e8eeb0f71308c641179ac1f19bb,https://github.com/ageitgey/face_recognition/blob/c96b010c02f15e8eeb0f71308c641179ac1f19bb/face_recognition/api.py#L32-L39,train
4,ageitgey/face_recognition,face_recognition/api.py,_trim_css_to_bounds,"def _trim_css_to_bounds(css, image_shape):\n """"""\n Make sure a tuple in (top, right, bottom, left) order is within the bounds of the image.\n\n :param css: plain tuple representation of the rect in (top, right, bottom, left) order\n :param image_shape: numpy shape of the image array...",python,"def _trim_css_to_bounds(css, image_shape):\n """"""\n Make sure a tuple in (top, right, bottom, left) order is within the bounds of the image.\n\n :param css: plain tuple representation of the rect in (top, right, bottom, left) order\n :param image_shape: numpy shape of the image array...","[def, _trim_css_to_bounds, (, css, ,, image_shape, ), :, return, max, (, css, [, 0, ], ,, 0, ), ,, min, (, css, [, 1, ], ,, image_shape, [, 1, ], ), ,, min, (, css, [, 2, ], ,, image_shape, [, 0, ], ), ,, max, (, css, [, 3, ], ,, 0, )]","Make sure a tuple in (top, right, bottom, left) order is within the bounds of the image.\n\n :param css: plain tuple representation of the rect in (top, right, bottom, left) order\n :param image_shape: numpy shape of the image array\n :return: a trimmed plain tuple representation of th...","[Make, sure, a, tuple, in, (, top, right, bottom, left, ), order, is, within, the, bounds, of, the, image, .]",c96b010c02f15e8eeb0f71308c641179ac1f19bb,https://github.com/ageitgey/face_recognition/blob/c96b010c02f15e8eeb0f71308c641179ac1f19bb/face_recognition/api.py#L52-L60,train


Now it is the turn of the test dataset to be converted into a dataframe.

In [0]:
# decompress this gzip file
!gzip -f -d python/final/jsonl/test/python_test_0.jsonl.gz

In [13]:
with open('python/final/jsonl/test/python_test_0.jsonl', 'r') as f:
    test_list = f.readlines()

print("Total Data rows in Test dataset",len(test_list))
df_test=convertToDataFrame(test_list)

Total Data rows in Test dataset 22176


Displaying the first 5 rows of the testing dataset:

In [14]:
df_test.head()

Unnamed: 0,repo,path,func_name,original_string,language,code,code_tokens,docstring,docstring_tokens,sha,url,partition
0,soimort/you-get,src/you_get/extractors/youtube.py,YouTube.get_vid_from_url,"def get_vid_from_url(url):\n """"""Extracts video ID from URL.\n """"""\n return match1(url, r'youtu\.be/([^?/]+)') or \\n match1(url, r'youtube\.com/embed/([^/?]+)') or \\n match1(url, r'youtube\.com/v/([^/?]+)') or \\n match1(url, r'youtube\.com/watch/...",python,"def get_vid_from_url(url):\n """"""Extracts video ID from URL.\n """"""\n return match1(url, r'youtu\.be/([^?/]+)') or \\n match1(url, r'youtube\.com/embed/([^/?]+)') or \\n match1(url, r'youtube\.com/v/([^/?]+)') or \\n match1(url, r'youtube\.com/watch/...","[def, get_vid_from_url, (, url, ), :, return, match1, (, url, ,, r'youtu\.be/([^?/]+)', ), or, match1, (, url, ,, r'youtube\.com/embed/([^/?]+)', ), or, match1, (, url, ,, r'youtube\.com/v/([^/?]+)', ), or, match1, (, url, ,, r'youtube\.com/watch/([^/?]+)', ), or, parse_query_param, (, url, ,, '...",Extracts video ID from URL.,"[Extracts, video, ID, from, URL, .]",b746ac01c9f39de94cac2d56f665285b0523b974,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143,test
1,soimort/you-get,src/you_get/extractors/miomio.py,sina_xml_to_url_list,"def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.chil...",python,"def sina_xml_to_url_list(xml_data):\n """"""str->list\n Convert XML to URL List.\n From Biligrab.\n """"""\n rawurl = []\n dom = parseString(xml_data)\n for node in dom.getElementsByTagName('durl'):\n url = node.getElementsByTagName('url')[0]\n rawurl.append(url.chil...","[def, sina_xml_to_url_list, (, xml_data, ), :, rawurl, =, [, ], dom, =, parseString, (, xml_data, ), for, node, in, dom, ., getElementsByTagName, (, 'durl', ), :, url, =, node, ., getElementsByTagName, (, 'url', ), [, 0, ], rawurl, ., append, (, url, ., childNodes, [, 0, ], ., data, ), return, r...",str->list\n Convert XML to URL List.\n From Biligrab.,"[str, -, >, list, Convert, XML, to, URL, List, ., From, Biligrab, .]",b746ac01c9f39de94cac2d56f665285b0523b974,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/miomio.py#L41-L51,test
2,soimort/you-get,src/you_get/extractors/fc2video.py,makeMimi,"def makeMimi(upid):\n """"""From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\n L110""""""\n strSeed = ""gGddgPfeaf_gzyr""\n prehash = upid + ""_"" + strSeed\n return md5(prehash.encode('utf-8')).hexdigest()",python,"def makeMimi(upid):\n """"""From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\n L110""""""\n strSeed = ""gGddgPfeaf_gzyr""\n prehash = upid + ""_"" + strSeed\n return md5(prehash.encode('utf-8')).hexdigest()","[def, makeMimi, (, upid, ), :, strSeed, =, ""gGddgPfeaf_gzyr"", prehash, =, upid, +, ""_"", +, strSeed, return, md5, (, prehash, ., encode, (, 'utf-8', ), ), ., hexdigest, (, )]",From http://cdn37.atwikiimg.com/sitescript/pub/dksitescript/FC2.site.js\n Also com.hps.util.fc2.FC2EncrptUtil.makeMimiLocal\n L110,"[From, http, :, //, cdn37, ., atwikiimg, ., com, /, sitescript, /, pub, /, dksitescript, /, FC2, ., site, ., js, Also, com, ., hps, ., util, ., fc2, ., FC2EncrptUtil, ., makeMimiLocal, L110]",b746ac01c9f39de94cac2d56f665285b0523b974,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L11-L17,test
3,soimort/you-get,src/you_get/extractors/fc2video.py,fc2video_download,"def fc2video_download(url, output_dir = '.', merge = True, info_only = False, **kwargs):\n """"""wrapper""""""\n #'http://video.fc2.com/en/content/20151021bTVKnbEw'\n #'http://xiaojiadianvideo.asia/content/20151021bTVKnbEw'\n #'http://video.fc2.com/ja/content/20151021bTVKnbEw'\n #'http:...",python,"def fc2video_download(url, output_dir = '.', merge = True, info_only = False, **kwargs):\n """"""wrapper""""""\n #'http://video.fc2.com/en/content/20151021bTVKnbEw'\n #'http://xiaojiadianvideo.asia/content/20151021bTVKnbEw'\n #'http://video.fc2.com/ja/content/20151021bTVKnbEw'\n #'http:...","[def, fc2video_download, (, url, ,, output_dir, =, '.', ,, merge, =, True, ,, info_only, =, False, ,, *, *, kwargs, ), :, #'http://video.fc2.com/en/content/20151021bTVKnbEw', #'http://xiaojiadianvideo.asia/content/20151021bTVKnbEw', #'http://video.fc2.com/ja/content/20151021bTVKnbEw', #'http://v...",wrapper,[wrapper],b746ac01c9f39de94cac2d56f665285b0523b974,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/fc2video.py#L46-L57,test
4,soimort/you-get,src/you_get/extractors/dailymotion.py,dailymotion_download,"def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Dailymotion videos by URL.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""(...",python,"def dailymotion_download(url, output_dir='.', merge=True, info_only=False, **kwargs):\n """"""Downloads Dailymotion videos by URL.\n """"""\n\n html = get_content(rebuilt_url(url))\n info = json.loads(match1(html, r'qualities"":({.+?}),""'))\n title = match1(html, r'""video_title""\s*:\s*""(...","[def, dailymotion_download, (, url, ,, output_dir, =, '.', ,, merge, =, True, ,, info_only, =, False, ,, *, *, kwargs, ), :, html, =, get_content, (, rebuilt_url, (, url, ), ), info, =, json, ., loads, (, match1, (, html, ,, r'qualities"":({.+?}),""', ), ), title, =, match1, (, html, ,, r'""video_t...",Downloads Dailymotion videos by URL.,"[Downloads, Dailymotion, videos, by, URL, .]",b746ac01c9f39de94cac2d56f665285b0523b974,https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/dailymotion.py#L13-L35,test


Finding the shortest data row for display purposes, as we did for the training dataset:

In [15]:
# check minimum length data row from Test Dataset
indexToDisplay=getMinimumDataRow(test_list)
#print(json.loads(test_list[indexToDisplay]))
# for formatted print --> use pprint
pprint(json.loads(test_list[indexToDisplay]))

{'code': 'def t_COMMA(self, t):\n'
         "        r','\n"
         '        t.endlexpos = t.lexpos + len(t.value)\n'
         '        return t',
 'code_tokens': ['def',
                 't_COMMA',
                 '(',
                 'self',
                 ',',
                 't',
                 ')',
                 ':',
                 't',
                 '.',
                 'endlexpos',
                 '=',
                 't',
                 '.',
                 'lexpos',
                 '+',
                 'len',
                 '(',
                 't',
                 '.',
                 'value',
                 ')',
                 'return',
                 't'],
 'docstring': "r',",
 'docstring_tokens': ['r'],
 'func_name': 'ModelLoader.t_COMMA',
 'language': 'python',
 'original_string': 'def t_COMMA(self, t):\n'
                    "        r','\n"
                    '        t.endlexpos = t.lexpos + len(t.value)\n'
                    '    

Displaying the 10 most frequent function names in the test dataset:

In [16]:
# Top 10 most frequent function names in the test dataset
df_test['func_name'].value_counts()[:10]

main                      105
dump                       21
get                        18
parse                      17
run                        16
load                       14
register                   14
pull                       12
Client._update_secrets     11
connect                     9
Name: func_name, dtype: int64

Displaying the names of columns in the training dataset

In [17]:
df_train.columns

Index(['repo', 'path', 'func_name', 'original_string', 'language', 'code',
       'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url',
       'partition'],
      dtype='object')

The function below converts tokens into strings:

In [0]:
# Just to convert tokens to string
def makeStr(listOfWords):
  return ' '.join(listOfWords)

Now we join the code_tokens column and the docstring_tokens column into a single string per row, which will be used for further processing.

In [0]:
# use code_tokens and docstring_tokens to create a single string
df_train['merged_tokens_str']=df_train['code_tokens'].apply(makeStr)+" "+df_train['docstring_tokens'].apply(makeStr)

In [20]:
##############################
# Training Parameter Setting #
##############################
#trainingDatasetSize=30
trainingDatasetSize=df_train.shape[0]
#############################
#############################

docs_train=list(df_train['merged_tokens_str'][:trainingDatasetSize])
# Display few documents from training set
print(docs_train[:10])

['def train ( train_dir , model_save_path = None , n_neighbors = None , knn_algo = \'ball_tree\' , verbose = False ) : X = [ ] y = [ ] # Loop through each person in the training set for class_dir in os . listdir ( train_dir ) : if not os . path . isdir ( os . path . join ( train_dir , class_dir ) ) : continue # Loop through each training image for the current person for img_path in image_files_in_folder ( os . path . join ( train_dir , class_dir ) ) : image = face_recognition . load_image_file ( img_path ) face_bounding_boxes = face_recognition . face_locations ( image ) if len ( face_bounding_boxes ) != 1 : # If there are no people (or too many people) in a training image, skip the image. if verbose : print ( "Image {} not suitable for training: {}" . format ( img_path , "Didn\'t find a face" if len ( face_bounding_boxes ) < 1 else "Found more than one face" ) ) else : # Add face encoding for current image to the training set X . append ( face_recognition . face_encodings ( image , kn

The code below preprocesses the data by lowercasing it, replacing special characters with spaces, removing short words, and stemming the remaining words. Right now we remove words that are one or two characters long.

In [0]:
import re 
import string 
from nltk.stem import PorterStemmer
 
# init stemmer
porter_stemmer=PorterStemmer()

def textPreprocessor(text):

  text=text.lower()
  # Removes special characters
  # Ref: https://kavita-ganesan.com/how-to-use-countvectorizer/#.XrXebWhKhPY
  text=re.sub("\\W"," ",text)

  # Ref: https://www.w3resource.com/python-exercises/re/python-re-exercise-49.php
  # Removes words of less length --> right now it will remove the words which are of length between 1 and 2
  shortword = re.compile(r'\W*\b\w{1,2}\b')
  text=shortword.sub('', text)

  # stem words
  words=re.split("\\s+",text)
  stemmed_words=[porter_stemmer.stem(word=word) for word in words]
  return ' '.join(stemmed_words)
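
To see what the two regular expressions do on their own, here is a small sketch that applies the same cleaning steps but skips stemming, so it needs only the standard library. `clean_without_stemming` is an illustrative name, and the input string is made up.

```python
import re

# same pattern as in textPreprocessor: optional non-word characters
# followed by a whole word of one or two characters
SHORT_WORD = re.compile(r"\W*\b\w{1,2}\b")

def clean_without_stemming(text):
    """Lowercase, blank out non-word characters, drop 1-2 character words."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)
    text = SHORT_WORD.sub("", text)
    return text.split()

tokens = clean_without_stemming("def add ( a , b ) : return a + b Adds two numbers")
print(tokens)  # → ['def', 'add', 'return', 'adds', 'two', 'numbers']
```

Note that this step also drops short Python keywords and names such as `if`, `in`, and `os`, since they are only two characters long.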


# Training the model

Now after loading and preprocessing the data, it's time to train the model.

## Using sklearn's TfidfVectorizer:


TfidfVectorizer cleans each document with the textPreprocessor declared above, tokenizes it, learns the vocabulary, calculates the idf (inverse document frequency) weights, and allows you to encode new documents. Alternatively, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.

Each document is encoded as a vector with the length of the entire vocabulary, holding a tf-idf weight for each word (a CountVectorizer would instead hold an integer count of how often each word appeared). Because these vectors contain a lot of zeros, they are stored as sparse vectors.

The code below sets the parameters of TfidfVectorizer. Each parameter is explained as follows:
* use_idf = whether to use idf weighting (True) or tf only (False)
* smooth_idf = adds one to the document frequencies, which prevents zero divisions in the tf-idf equation
* ngram_range=(min_n, max_n) = lower and upper bound of the n-values of the n-grams to be extracted from the documents; (1,1) means single words only
* min_df=0.10 means: ignore words that appear in fewer than 10% of the documents, as they are too rare
* max_df=0.85 means: ignore words that appear in more than 85% of the documents, as they are too common
* preprocessor = function that cleans the text (stemming, removing special characters, etc.)
* max_features = keep only the top max_features terms ordered by term frequency across the corpus and drop the rest
* binary = use the presence or absence of a term instead of its raw count. This is useful in tasks such as text classification, where the frequency of occurrence may be insignificant
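
To make the effect of `min_df`, `max_df`, and `smooth_idf` concrete, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation: the helper names and the toy corpus are assumptions for illustration, and the thresholds are applied to document-frequency fractions. The idf formula is the smoothed one documented by scikit-learn, ln((1 + n) / (1 + df)) + 1.

```python
import math

def document_frequencies(tokenized_docs):
    """Fraction of documents in which each term occurs at least once."""
    n_docs = len(tokenized_docs)
    counts = {}
    for doc in tokenized_docs:
        for term in set(doc):
            counts[term] = counts.get(term, 0) + 1
    return {term: count / n_docs for term, count in counts.items()}

def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    df = document_frequencies(tokenized_docs)
    return sorted(term for term, freq in df.items() if min_df <= freq <= max_df)

def smoothed_idf(n_docs, n_docs_with_term):
    """idf with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + n_docs_with_term)) + 1

docs = [["def", "return", "url"], ["def", "return", "image"], ["def", "path"], ["def", "return"]]
print(filter_vocabulary(docs, min_df=0.30, max_df=0.85))  # → ['return']
print(smoothed_idf(n_docs=4, n_docs_with_term=4))         # → 1.0
```

In the toy corpus, `def` occurs in every document and is dropped as too common, while `url`, `image`, and `path` each occur in a quarter of the documents and are dropped as too rare.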



#### Note: The above parameter settings can be tried in various combinations given enough computation power.
#### For this experiment, we kept the settings as low as possible so that the notebook runs within Google Colab's memory limitations.

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Ref: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/
# Parameter Ref: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
# Understanding ref: https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.XrXeA2hKhPY

# without Ngrams
tfidf_vectorizer = TfidfVectorizer(use_idf=True,smooth_idf=True,ngram_range=(1,1),min_df=0.10,max_df=0.85,max_features=5000,preprocessor=textPreprocessor,binary=False)
# With Ngrams of 1 and 2
# tfidf_vectorizer = TfidfVectorizer(use_idf=True,smooth_idf=True,ngram_range=(1,2),min_df=0.10,max_df=0.85,max_features=5000,preprocessor=textPreprocessor,binary=False)

Now, after setting the parameters, we fit the vectorizer on the training dataset to learn the vocabulary.

In [23]:
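# fit the tf-idf vectorizer (learn vocabulary and idf weights) on the training dataset
tfidf_vectorizer.fit(docs_train)
# number of terms kept in the vocabulary (dictionary of words and their indexes)
print(len(tfidf_vectorizer.vocabulary_))
# number of terms dropped because of min_df, max_df or max_features
print(len(tfidf_vectorizer.stop_words_))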

34
1074432


Now we convert the training dataset into tf-idf vectors:

In [24]:
# Convert each document to vector
tfidf_vector_train=tfidf_vectorizer.transform(docs_train)
print(tfidf_vector_train.shape)
#print(tfidf_vector_train.toarray())

(412178, 34)


# Testing phase

Now we convert the testing set into vectors:


In [25]:
#############################
# Testing Parameter Setting #
#############################
#testingDatasetSize=10
#testingDatasetSize=df_test.shape[0]
testingDatasetSize=25
getTop=3
#############################
#############################

df_test['merged_tokens_str']=df_test['code_tokens'].apply(makeStr)+" "+df_test['docstring_tokens'].apply(makeStr)

docs_test=list(df_test['merged_tokens_str'][:testingDatasetSize])
# Display few documents from testing set
print(docs_test)

tfidf_vector_test=tfidf_vectorizer.transform(docs_test)
print(tfidf_vector_test.shape)


["def get_vid_from_url ( url ) : return match1 ( url , r'youtu\\.be/([^?/]+)' ) or match1 ( url , r'youtube\\.com/embed/([^/?]+)' ) or match1 ( url , r'youtube\\.com/v/([^/?]+)' ) or match1 ( url , r'youtube\\.com/watch/([^/?]+)' ) or parse_query_param ( url , 'v' ) or parse_query_param ( parse_query_param ( url , 'u' ) , 'v' ) Extracts video ID from URL .", "def sina_xml_to_url_list ( xml_data ) : rawurl = [ ] dom = parseString ( xml_data ) for node in dom . getElementsByTagName ( 'durl' ) : url = node . getElementsByTagName ( 'url' ) [ 0 ] rawurl . append ( url . childNodes [ 0 ] . data ) return rawurl str - > list Convert XML to URL List . From Biligrab .", 'def makeMimi ( upid ) : strSeed = "gGddgPfeaf_gzyr" prehash = upid + "_" + strSeed return md5 ( prehash . encode ( \'utf-8\' ) ) . hexdigest ( ) From http : // cdn37 . atwikiimg . com / sitescript / pub / dksitescript / FC2 . site . js Also com . hps . util . fc2 . FC2EncrptUtil . makeMimiLocal L110', "def fc2video_download ( ur

Now we calculate the cosine similarity between each testing document and every training document, and keep the top-scoring training documents for each testing document. In other words, we give the top 3 recommendations (from the training set) for each testing query.



In [0]:
# Finding the cosine similarity between tfidf_vector_train and tfidf_vector_test
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

topRelevence=[]

for eachTestVector in tfidf_vector_test:
  # compute cosine similarity between one testing document and all documents of the training set
  distanceBetweenOneTestRowAndAllTrainingRow=cosine_similarity(eachTestVector, tfidf_vector_train)
  scores = distanceBetweenOneTestRowAndAllTrainingRow[0]
  topScoreIndexes= list(np.argsort(scores))[-1*getTop:]
  topScoreIndexes.reverse()
  topRelevence.append(topScoreIndexes)


Displaying the indexes of the top-ranked training documents for the first test document

In [27]:
# Displaying the indexes of the top-ranked training documents for the first test document
print(topRelevence[0])

[84874, 85721, 20217]
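
The per-document loop above can also be written as one batched similarity call. Below is a minimal sketch on a toy corpus; the names `toy_train`, `toy_queries` and `top_k_indexes` are our own assumptions for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for docs_train and docs_test (hypothetical data).
toy_train = [
    "def read file path open path read",
    "def sort list items sorted items",
    "def parse url extract video id from url",
]
toy_queries = ["extract video id from url", "sort a list"]

toy_vectorizer = TfidfVectorizer()
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)
toy_query_matrix = toy_vectorizer.transform(toy_queries)

def top_k_indexes(query_matrix, train_matrix, k):
    """Return, for each query row, the k training row indexes with the
    highest cosine similarity, best match first."""
    scores = cosine_similarity(query_matrix, train_matrix)  # (n_queries, n_train)
    # argsort sorts ascending, so negate the scores to rank best-first.
    return np.argsort(-scores, axis=1)[:, :k]

toy_top = top_k_indexes(toy_query_matrix, toy_train_matrix, k=2)
print(toy_top)
```

The batched call builds a dense score matrix of shape (number of queries, number of training documents), so for a large query set it may be necessary to process the queries in chunks.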


In the code below, we collect the columns required for the CSV file that will be submitted.

In [28]:
# Now that we have the top relevant documents from the training set for each testing document,
# it's time to create the final submission file

dfList=[]
for oneIndex in range(testingDatasetSize):
  for oneScore in range(getTop):
    oneRow=[]
    oneRow.append(df_test['docstring'][oneIndex])
    oneRow.append("python")
    oneRow.append(df_test['func_name'][oneIndex])
    oneRow.append(df_train['url'][topRelevence[oneIndex][oneScore]])
    dfList.append(oneRow)
  
df=pd.DataFrame(dfList, columns =['query','language','identifier','url'])

print(df.head())


                                                         query  ...                                                                                                                                                 url
0                                  Extracts video ID from URL.  ...                  https://github.com/JdeRobot/base/blob/303b18992785b2fe802212f2d758a60873007f1f/src/libs/comm_py/comm/ros/listenerBumper.py#L11-L36
1                                  Extracts video ID from URL.  ...  https://github.com/JdeRobot/base/blob/303b18992785b2fe802212f2d758a60873007f1f/src/drivers/MAVLinkServer/MAVProxy/modules/lib/mp_util.py#L207-L214
2                                  Extracts video ID from URL.  ...                     https://github.com/spyder-ide/spyder/blob/f76836ce1b924bcc4efd3f74f2960d26a4e528e0/spyder/utils/syntaxhighlighters.py#L852-L867
3  str->list\n    Convert XML to URL List.\n    From Biligrab.  ...                                     https://github.com/adubkov/py-za

# Evaluate with CodeSearchNet Challenge Benchmark



Now that we can retrieve the top relevant documents from the training set for any query, we apply the same procedure to the benchmark queries to create the final submission file.

The benchmark evaluation dataset for the challenge can be found here: https://github.com/github/CodeSearchNet/blob/master/README.md#evaluation

We have downloaded the benchmark query dataset (https://github.com/github/CodeSearchNet/blob/master/resources/queries.csv) into our GitHub repository, from which we now load the queries.
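
The benchmark is scored with NDCG on the human-annotated results. As a rough illustration of the metric, here is a minimal sketch of a generic NDCG computation. The function names `dcg` and `ndcg` and the relevance scores are our own assumptions for illustration; the official CodeSearchNet scorer may differ in details such as the gain function or which results form the ideal ranking.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance scores for the results returned for one query,
# listed in the order the model ranked them (higher = more relevant).
print(ndcg([3, 2, 1, 0]))  # results already in the ideal order
print(ndcg([0, 1, 2, 3]))  # same results in the worst order
```

A ranking that places the most relevant results first gets a score of 1, and the score drops as relevant results move down the list.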

In [29]:
!rm -rf AI-Project-3-CS-GY-6613
!git clone https://github.com/dhavalpatel290/AI-Project-3-CS-GY-6613.git

Cloning into 'AI-Project-3-CS-GY-6613'...
remote: Enumerating objects: 28, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 28 (delta 4), reused 23 (delta 2), pack-reused 0
Unpacking objects: 100% (28/28), done.


In [0]:
# Store the results of predictions from Testing dataset
df.to_csv("./AI-Project-3-CS-GY-6613/TF-IDF/results_On_Testing_Set_Of_2000_Rows/test_dataset_model_predictions.csv",index=False)

Printing the first 5 rows of the benchmark query dataset from the CodeSearchNet Challenge

In [58]:
df_test_leaderboard=pd.read_csv("./AI-Project-3-CS-GY-6613/TF-IDF/queries.csv")
print(df_test_leaderboard.head())

                   query
0  convert int to string
1         priority queue
2         string to date
3       sort string list
4      save list to file


In [59]:
df_test_leaderboard['merged_tokens_str']=df_test_leaderboard['query']
leaderboardTestingDatasetSize=df_test_leaderboard.shape[0]

docs_test_leaderboard=list(df_test_leaderboard['merged_tokens_str'][:leaderboardTestingDatasetSize])
# Display a few queries from the benchmark test set
print(docs_test_leaderboard[:10])

# get vector representation of each query
tfidf_vector_test_leaderboard=tfidf_vectorizer.transform(docs_test_leaderboard)
print(tfidf_vector_test_leaderboard.shape)


['convert int to string', 'priority queue', 'string to date', 'sort string list', 'save list to file', 'postgresql connection', 'confusion matrix', 'set working directory', 'group by count', 'binomial distribution']
(99, 34)


In [0]:
# Get top 20 relevant documents for each test query
topRelevenceLeaderboard=[]
getTop=20
for eachTestVector in tfidf_vector_test_leaderboard:
  distanceBetweenOneTestRowAndAllTrainingRow=cosine_similarity(eachTestVector, tfidf_vector_train)
  scores = distanceBetweenOneTestRowAndAllTrainingRow[0]
  topScoreIndexes= list(np.argsort(scores))[-1*getTop:]
  topScoreIndexes.reverse()
  topRelevenceLeaderboard.append(topScoreIndexes)


In [61]:
dfListLeaderboard=[]
for oneIndex in range(leaderboardTestingDatasetSize):
  for oneScore in range(getTop):
    oneRow=[]
    oneRow.append(df_test_leaderboard['query'][oneIndex])
    oneRow.append("python")
    oneRow.append(df_train['func_name'][topRelevenceLeaderboard[oneIndex][oneScore]])
    oneRow.append(df_train['url'][topRelevenceLeaderboard[oneIndex][oneScore]])
    dfListLeaderboard.append(oneRow)
  
df_leaderboard=pd.DataFrame(dfListLeaderboard,columns=['query','language','identifier','url'])

# Display the computed results 
print(df_leaderboard.head())


                   query  ...                                                                                                                         url
0  convert int to string  ...      https://github.com/pjuren/pyokit/blob/fddae123b5d817daa39496183f19c000d9c3791f/src/pyokit/datastruct/read.py#L264-L269
1  convert int to string  ...  https://github.com/csparpa/pyowm/blob/cdd59eb72f32f7238624ceef9b2e2329a5ebd472/pyowm/alertapi30/alert_manager.py#L120-L148
2  convert int to string  ...                https://github.com/csparpa/pyowm/blob/cdd59eb72f32f7238624ceef9b2e2329a5ebd472/pyowm/commons/tile.py#L71-L87
3  convert int to string  ...               https://github.com/csparpa/pyowm/blob/cdd59eb72f32f7238624ceef9b2e2329a5ebd472/pyowm/commons/tile.py#L90-L107
4  convert int to string  ...          https://github.com/csparpa/pyowm/blob/cdd59eb72f32f7238624ceef9b2e2329a5ebd472/pyowm/tiles/tile_manager.py#L34-L51

[5 rows x 4 columns]
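
The nested loop above can be condensed into a small helper. This is a minimal sketch under our own assumptions: `build_submission`, `toy_train_df` and the example URLs are hypothetical and only illustrate the four-column submission layout used above.

```python
import pandas as pd

def build_submission(queries, top_indexes, train_df, language="python"):
    """Build one row per (query, retrieved function) pair with the
    columns query, language, identifier and url."""
    rows = [
        (query, language, train_df["func_name"].iloc[idx], train_df["url"].iloc[idx])
        for query, indexes in zip(queries, top_indexes)
        for idx in indexes
    ]
    return pd.DataFrame(rows, columns=["query", "language", "identifier", "url"])

# Hypothetical miniature inputs standing in for df_train and the ranked indexes.
toy_train_df = pd.DataFrame({
    "func_name": ["int_to_str", "PriorityQueue.push", "parse_date"],
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
})
toy_submission = build_submission(
    ["convert int to string", "priority queue"], [[0, 2], [1, 0]], toy_train_df
)
print(toy_submission)
```

The sketch uses `.iloc` because the ranked indexes are row positions in the TF-IDF matrix, so positional lookup stays correct even if the DataFrame index is not the default 0..n-1 range.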


We save the CSV back to the repository; the results can be viewed in our GitHub repository.

In [0]:
df_leaderboard.to_csv("AI-Project-3-CS-GY-6613/TF-IDF/results_On_Benchmark_Set_Of_100_Rows/model_predictions.csv",index=False)


In [36]:
!pip install pickle5


In [0]:
# Saving Model
# Ref: https://www.kaggle.com/mattwills8/fit-transform-and-save-tfidfvectorizer

import pickle5 as pickle

modelSaveFolder="AI-Project-3-CS-GY-6613/TF-IDF/savedModels/"

# Use context managers so that each file handle is closed after writing
for fileName, obj in [("tfidf.pickle", tfidf_vectorizer),
                      ("train_features.pickle", tfidf_vector_train),
                      ("test_features.pickle", tfidf_vector_test),
                      ("test_leaderboard_features.pickle", tfidf_vector_test_leaderboard)]:
  with open(modelSaveFolder+fileName, "wb") as f:
    pickle.dump(obj, f)
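
To check that a saved vectorizer can be reused later, a round trip like the following can help. This is a minimal sketch under our own assumptions: it uses the standard library `pickle` module (`pickle5` is a backport for older Python versions), a temporary folder instead of the repository folder, and a hypothetical `toy_vectorizer` in place of `tfidf_vectorizer`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical small vectorizer standing in for tfidf_vectorizer.
toy_vectorizer = TfidfVectorizer().fit(["sort string list", "save list to file"])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "tfidf.pickle")
    # Context managers close the file handles even if pickling fails.
    with open(model_path, "wb") as fh:
        pickle.dump(toy_vectorizer, fh)
    with open(model_path, "rb") as fh:
        restored_vectorizer = pickle.load(fh)

# The restored vectorizer should produce the same features as the original.
original_features = toy_vectorizer.transform(["sort list"]).toarray()
restored_features = restored_vectorizer.transform(["sort list"]).toarray()
print(np.allclose(original_features, restored_features))
```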



Now we download the result archive to our local system, from which we will upload it to our GitHub repository.

In [0]:
!tar -czf downloadFinalResult.tar.gz ./"AI-Project-3-CS-GY-6613"

In [0]:
from google.colab import files
files.download('./downloadFinalResult.tar.gz')

## Submission on Wandb for the above run

In [43]:
!pip install wandb


In [44]:
!wandb login

wandb: You can find your API key in your browser here: https://app.wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: <API key redacted>
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
Successfully logged in to Weights & Biases!


In [45]:
import wandb
wandb.init(project="AI-Project-3-CS-GY-6613-djp526")

W&B Run: https://app.wandb.ai/djp526/AI-Project-3-CS-GY-6613-djp526/runs/2a7yfa69

In [46]:
wandb.save("AI-Project-3-CS-GY-6613/TF-IDF/results_On_Benchmark_Set_Of_100_Rows/model_predictions.csv")
modelSaveFolder="AI-Project-3-CS-GY-6613/TF-IDF/savedModels/"
wandb.save(modelSaveFolder+"tfidf.pickle")
wandb.save(modelSaveFolder+"train_features.pickle")
wandb.save(modelSaveFolder+"test_features.pickle")
wandb.save(modelSaveFolder+"test_leaderboard_features.pickle")


['/content/wandb/run-20200511_013851-2a7yfa69/test_leaderboard_features.pickle']

# Conclusions


TF-IDF could give better results on the benchmark queries if we used **n-gram models** with a larger vocabulary; the vectorizer in this run kept only 34 terms.
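
As a rough illustration of the n-gram idea, the sketch below compares a unigram vectorizer with a small vocabulary cap against a unigram-plus-bigram vectorizer with no cap. The corpus `toy_docs` and the parameter values are our own assumptions for illustration, not the settings used in this run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpus standing in for docs_train.
toy_docs = [
    "convert int to string",
    "convert string to date",
    "sort string list",
    "save list to file",
]

# Unigrams only, capped at a small vocabulary.
unigram_vectorizer = TfidfVectorizer(ngram_range=(1, 1), max_features=5)
unigram_vectorizer.fit(toy_docs)

# Unigrams and bigrams with no cap on the vocabulary size.
ngram_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=None)
ngram_vectorizer.fit(toy_docs)

print(len(unigram_vectorizer.vocabulary_))
print(len(ngram_vectorizer.vocabulary_))
print("string to" in ngram_vectorizer.vocabulary_)
```

With bigrams in the vocabulary, a query such as "string to date" can match on the phrase "string to" and not only on the individual words, at the cost of a much larger feature matrix.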

To improve accuracy further, we can also leverage other NLP techniques for document search. Current search research, including work reported for the Google search engine, uses neural models such as **RNN**s to obtain language representations of the search strings.



Within the project's scope, we also tried the model and training method provided in the CodeSearchNet GitHub repository.

We followed the instructions given here:  https://github.com/github/CodeSearchNet#setup

Below are the settings for GPU: <br>
Tesla M60 GPU <br> CPU count 16 <br> Memory 122 EBS


The code and documentation for our CodeSearchNet implementation are available here:
https://github.com/dhavalpatel290/AI-Project-3-CS-GY-6613/tree/master/Baselinemodel