
# Code Intent Prediction
## With Applied Machine Learning Techniques
***
### Justin Hugh
#### Data Science Diploma Candidate, BrainStation
##### December 18, 2020

***

## Table of Contents
#### [ 1.0 - Introduction](#1.0---Introduction)
- [ 1.1 - Problem](#1.1---Problem)

#### [ 2.0 - Background](#2.0---Background)
- [ 2.1 - Stack Overflow](#2.1---Stack-Overflow)
- [ 2.2 - Packages and Libraries](#2.2---Packages-and-Libraries)

#### [ 3.0 - Limitations and Assumptions](#3.0---Limitations-and-Assumptions)

#### [ 4.0 - The Data](#4.0---The-Data)
- [ 4.1 - Sources of Data](#4.1---Sources-of-Data)  
    - [ 4.1.1 - CoNaLa](#4.1.1---CoNaLa)
    - [ 4.1.2 - Other Sources of Data](#4.1.2---Others-Sources-of-Data)

#### [ 5.0 - Exploratory Data Analysis](#5.0---Exploratory-Data-Analysis)  
- [ 5.1 - Importing Data](#5.1---Importing-Data)   
    - [ 5.1.1 - Importing CoNaLa Competition Data](#5.1.1---Importing-CoNaLa-Competition-Data)
        - [ 5.1.1.1 - Importing CoNaLa Training Data](#5.1.1.1---Importing-CoNaLa-Training-Data)
        - [ 5.1.1.2 - Importing CoNaLa Test Data](#5.1.1.2---Importing-CoNaLa-Test-Data)
    - [ 5.1.2 - DataFrames from CoNaLa Competition Data](#5.1.2---DataFrames-from-CoNaLa-Competition-Data)
    - [ 5.1.3 - Pickling Data](#5.1.3---Pickling-Data)
    - [ 5.1.4 - CoNaLa Mined Data](#5.1.4---CoNaLa-Mined-Data)
        - [ 5.1.2.1 - Proposed `prob` Cutoff](#Proposed-prob-Cutoff)
- [ 5.2 - Combining Data](#Combining-Data)
    - [ 5.2.1 - Cleaning Data](#Cleaning-Data)
- [ 5.3 - Vectorizing Text Data](#Vectorizing-Text-Data)
    - [ 5.3.1 - Simple Bag of Words Vectorization](#Simple-Bag-of-Words-Vectorization)
        - [ 5.3.1.1 - Vectorizing `conala_train_df` with Bag of Words](#Vectorizing-conala_train_df-with-Bag-of-Words)
        - [ 5.3.1.2 - Vectorizing `conala_mined_df` with Bag of Words](#Vectorizing-conala_mined_df-with-Bag-of-Words)
        - [ 5.3.1.3 - Comparing Vectorized `conala_mined_df` and `conala_trained_df`](#Comparing-Vectorized-conala_mined_df-and-conala_trained_df)
        - [ 5.3.1.4 - Combining DataFrames](#Combining-DataFrames)
        - [ 5.3.1.5 - Dimension Reduction of Bag of Words](#Dimension-Reduction-of-Bag-of-Words)
            - [ 5.3.1.5.1 - PCA on Bag of Words](#PCA-on-Bag-of-Words)
            - [ 5.3.1.5.2 - T-SNE on Bag of Words](#T-SNE-on-Bag-of-Words)
    - [ 5.3.2 - Word2Vec Text Vectorization](#Word2Vec-Text-Vectorization)
        - [ 
        
- [ Intent Paradigms](#Intent-Paradigms)  

#### [ Modelling and Analysis](#Modelling-and-Analysis)

#### [ Next Steps](#Next-Steps) 

#### [ Conclusion](#Conclusion)

#### [ References](#References)

***

# 1.0 - Introduction
[[Back to TOC]](#Table-of-Contents)

This report and project have been conducted as a final requirement and submission for the BrainStation Data Science Bootcamp. I participated in the Toronto Fall 2020 cohort. The duration of this project was about 5 weeks of development, ending December 20, 2020. 

This workbook contains only the code required for the final approach taken to address the problem. A wide range of exploration and alternative preprocessing and modelling approaches were also tried, but these have been excluded from this notebook for brevity. Supplementary information provided alongside this notebook covers these other approaches.

## 1.1 - Problem
[[Back To TOC]](#Table-of-Contents)

Software and code are becoming present nearly everywhere in our daily lives, both personal and professional. Whether it helps us accomplish complex and massive tasks or powers the applications and products we rely on, the digital world is expanding. Yet only a fraction of us are literate in code. Even among those who are, the range of languages and frameworks is so wide that no one is familiar with all of it, and mistakes or misinterpretations can be made even in languages we know well.

I propose a model which could predict the intent or purpose of a sample of code. A tool like this would be helpful in understanding more of the world around us and would be hugely impactful for:  
- Education. Making code more accessible and interpretable.  
- Security. Identifying code with malicious intent.  
- Development. Providing contextual tooltips, suggestions, resources.   

The goal of this project is to develop an ML model which employs NLP tools to interpret a sample of code and predict its intent.

***

# 2.0 - Background
[[Back To TOC]](#Table-of-Contents)

Currently, there are no commercialized or productized options available to a programmer who is looking to understand code that they are unable to interpret, whether in an academic, professional, or hobbyist setting. 

For this project, I conducted preliminary research to find alternatives or studies on the topic. Others have begun studying this problem, and state-of-the-art applications in the field can perform well in translating Python code to pseudocode [[1]](#References). However, at the moment there is no high-performing model that can convert code into an English description. This difficulty occurs because the structure and syntax of the two types of text (code and intent) are quite different from one another, whereas this difference is less pronounced between code and pseudocode. Being able to create pseudocode is a wonderful start to the problem, but is insufficient for my goal of making code more accessible. For many hobbyists, or even professionals who do not program as their main function, pseudocode will not be a familiar or fully interpretable format.

This will clearly be a difficult task, but I'm looking forward to learning and sharing more about the problem space in hopes of moving towards my goal myself, or sharing learning with others so that they can make their own progress on the matter.

Natural Language Processing (NLP) models have accelerated rapidly in recent years as a result of the success of smart assistants and the commoditization of computing power. With this, very powerful libraries and approaches for language modelling are widely available and well supported, and there is extensive research on the topic. Because there are so many resources in the area, and because momentum remains strong, this is a great time to apply these techniques to my difficult problem space.

Below I'll discuss other contexts surrounding the problems I'm tackling.

## 2.1 - Stack Overflow
[[Back To TOC]](#Table-of-Contents)

**Stack Overflow** is an online community for programmers [[2]](#References). The website provides a question and answer experience in which programmers can submit questions about how to accomplish various tasks or identify bugs in their code. Respondents to the queries are commonly other programmers who know of a solution or can provide helpful direction, often in the form of a code snippet. The site has become widely popular and as a result hosts a large collection of code snippets, each paired with a question or query. It is now one of the largest collections of coding knowledge online.

Much of the code examined and used in this project originally comes from a post on Stack Overflow.

## 2.2 - Packages and Libraries
[[Back To TOC]](#Table-of-Contents)

There's a wealth of openly available Python packages and libraries which are indispensable in tackling Machine Learning problems. These are dependencies of this project, and I'll import those needed in this section.

In [1]:
# Libraries for general array/dataframe use
import numpy as np 
import pandas as pd 

# Libraries for data visualization with python
import matplotlib.pyplot as plt
import seaborn as sns
# Magic function to help presentation of these visuals
%matplotlib inline

# Functions for saving and importing data
import pickle

# Data Processing/Transformation Packages
import json
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier

# Word Vectorization and Natural Language Processing Libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk

# Runtime tools and helpers
from tqdm import tqdm
import time 
import warnings
warnings.filterwarnings("ignore")

These libraries are commonly used and are all easily acquired with [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/macos.html) [[3]](#References).

***

# 3.0 - Limitations and Assumptions
[[Back To TOC]](#Table-of-Contents)

In this section I'll recognize some of the limitations and assumptions of the modelling and analysis I will conduct below. Those listed here are generally applicable to the project at large. Any that are more specifically applicable to a certain step are discussed at that point in the analysis.

- Some of this data is not current. The main source of the data is a competition which was conducted in 2018. Software changes quickly and frequently, so my model's performance may be less applicable to presently written code, and will degrade over time as libraries and languages are updated and the prevalence of various functions shifts accordingly.

- I assume the data set originating from the competition training data does not have known significant errors, such as incorrect application of code or erroneous syntax. If these are present in abundance, then this system will have "learned" incorrect code application.  

- Developers are not uniquely identified in the data I've used. Not having this information prevents me from drawing deeper insights into code and intent on a developer-by-developer level, which could potentially mean more accurate interpretations, since different developers can accomplish the same task with quite different coding styles. However, not uniquely identifying developers is a good and necessary practice from a privacy standpoint. If developers were uniquely identified in the data, this could potentially be used to reconstruct personal data, constituting a notable privacy concern.

- The amount of data that is used and analyzed in this project is quite restricted. This is largely as a result of limitations of computing resources and time. As a next step in this project, one of the first considerations should be expanding the amount of data analyzed.

***

# 4.0 - The Data
[[Back To TOC]](#Table-of-Contents)

This section describes the sources of data used as inputs in this project: where the data comes from, its characteristics, and how I intend to use it.

## 4.1 - Sources of Data
[[Back To TOC]](#Table-of-Contents)

To acquire the data used in this project, I conducted research, seeking sets of annotated code snippets. I found a good number of resources with much code available, but with varying characteristics. On one hand, I was able to find some well-structured and cleaned data, and on the other, I found large quantities of stored code, but either poorly formatted, or more difficult to process. 

I'll discuss below where I obtained the data.

### 4.1.1 - CoNaLa
[[Back To TOC]](#Table-of-Contents)

[_The Code/Natural Language Challenge (CoNaLa)_](https://conala-corpus.github.io/#dataset-information) is a challenge that was created by [_Carnegie Mellon University (CMU)_](https://www.cmu.edu/) along with [_NeuLab_](http://www.cs.cmu.edu/~neulab/) and [_STRUDEL Lab_](https://cmustrudel.github.io/) on May 31, 2018 in order to test systems for generating programs from natural language [[4]](#References). The original intent was, given an English input such as "sort list x in reverse order", to have a system produce output such as `x.sort(reverse=True)` in Python. 

_CoNaLa_ is a competition with no end date. The data are offered for use within the challenge itself, but are also licensed for any other research on the intersection of code and natural language, a use case which this project falls nicely into.

_CoNaLa_ provides a wealth of publicly available data which is well suited for the needs of this project including: 
- Training Data: Data crawled from _Stack Overflow_ with 2,379 training examples.
    - These data include user-submitted queries and the corresponding code responses by other users on Stack Overflow. Each of these has been manually annotated with a revised intent field (the query rewritten to be more clear) by volunteers for the _CoNaLa_ competition.
- Test Data: 500 test examples. These have been curated by annotators.
    - These data were separated from the training data so that competitors could test their code with the same set. These are in the same format as the training data.
- Mined Data: Automatically-mined data with 600,000 examples. 
    - _CoNaLa_ also provides extra data to be considered if participants are interested. These were automatically scraped from Stack Overflow in pairs of query, and response. However, the quality and accuracy of the responses are not verified at all. Thus, there is no revised intent field.
    - A "probability" is provided with these data. This score was programmatically generated by the scraping protocol and indicates the level of relevance the scraped intent field (query) has to the code response.
- Links to other helpful and similar data sets:
    - [Django Dataset](https://ahcweb01.naist.jp/pseudogen/)  
    - [StaQC](https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset) 
    - [Code Docstring Corpus](https://github.com/EdinburghNLP/code-docstring-corpus)
        - These are simply links to other places where large amounts of code can be found, so very little cleaning or confirmation of its quality has been conducted. 

### 4.1.2 - Others Sources of Data
[[Back To TOC]](#Table-of-Contents)

Through the course of working on this project, it became quickly apparent when beginning to preprocess and preliminarily model with the data that I needed to work with a small sample of data. This is because the computing resources I had access to were restricted, and in order to complete the study within time constraints, I needed my models to train quickly, and without tying up large amounts of memory. For this reason, the CoNaLa Training Data and CoNaLa Test data discussed above were the only data sets which are used below. A priority next step to this project will be to acquire and clean a wider set of data and transfer the approach generated in this report to more data.

I've done a wealth of work in finding other sets of data that would be appropriate to incorporate in the future. These are left out of this report for brevity, but are provided in the companion markdown file `other_data_sources.md`.

I also made the decision to not include the CoNaLa Mined data further in this report after extensive exploration was conducted on that data. From that exploration I found that the data was for the most part of low quality that would harm or not improve the results of this report. Examples of the exploration conducted on this data and discussion on findings are included in the provided notebook, `conala_mined_data.ipynb`. 

***

# 5.0 - Exploratory Data Analysis
[[Back To TOC]](#Table-of-Contents)

The purpose of the Exploratory Data Analysis (EDA) step is to familiarize myself with the data and determine whether it has missing values or other deficiencies. This step is also an opportunity to clean the data so it may be analyzed further, and to peek at some of the more immediately evident insights of the data and parameters I'm working with. By the end of these activities, I'll have a cleaned set of data which is prepared for modelling and deeper analysis. 

## 5.1 - Importing Data
[[Back To TOC]](#Table-of-Contents)

Before conducting exploration, I'll have to load in the data to this notebook. The data I use below all came from the _CoNaLa_ competition introduced above, and so a singular approach for loading the data will be sufficient. In this section I'll outline the method I used for doing this. I'll also conduct the data import itself.

### 5.1.1 - Importing CoNaLa Competition Data
[[Back To TOC]](#Table-of-Contents)


Recall from the [**CoNaLa Section**](#CoNaLa) above, the Competition data is comprised of data crawled from Stack Overflow with 2,379 training examples, and 500 test examples. The data has been made conveniently available. 

This data can be accessed through direct download by clicking this link: [CoNaLa Corpus v1.1 (.zip file, 52.1 MB)](http://www.phontron.com/download/conala-corpus-v1.1.zip). It's also linked on the [CoNaLa Competition Page](https://conala-corpus.github.io/).

This download produces a compressed `.zip` file. Once unzipped, the folder contains the data, with the training examples and the test examples stored in separate folders. Accordingly, I've loaded each of these into this notebook in a separate step.

In order for the data to be loaded for use with this notebook, and for organization, the corresponding data has been moved into the `data` folder in the same directory as this notebook.

#### 5.1.1.1 - Importing CoNaLa Training Data
[[Back To TOC]](#Table-of-Contents)


I'll start first with the training data.

This data is provided in the format of a .json file, in a set of subfolders. I peeked at the data by opening the json file and viewing the text contained. A preview of what the data looks like is included below:

![Screen Shot 2020-12-06 at 2.43.21 PM.png](attachment:9b32217e-99d6-4518-9035-0ae5af08586c.png)

This looks like the data is contained in a list of JSON objects. This is great news, since this format is quite friendly for use with the `pandas` library. This will allow for conversion of this data into a DataFrame format which can be worked with in the upcoming process of data munging.

Knowing this, I can construct my data extraction process accordingly.

In [2]:
# CoNaLa Training Data

# Open the json file, handle with `with` statement so the file is closed once finished
# load the contents which are contained as a json object.
with open('data/conala-corpus/conala-train.json') as f:
    # Instantiate conala_train_data to hold the data.
    conala_train_data = json.load(f)

In [3]:
# Peek at the loaded data
conala_train_data

[{'intent': 'How to convert a list of multiple integers into a single integer?',
  'rewritten_intent': "Concatenate elements of a list 'x' of multiple integers to a single integer",
  'snippet': 'sum(d * 10 ** i for i, d in enumerate(x[::-1]))',
  'question_id': 41067960},
 {'intent': 'How to convert a list of multiple integers into a single integer?',
  'rewritten_intent': 'convert a list of integers into a single integer',
  'snippet': "r = int(''.join(map(str, x)))",
  'question_id': 41067960},
 {'intent': 'how to convert a datetime string back to datetime object?',
  'rewritten_intent': "convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'",
  'snippet': "datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')",
  'question_id': 4170655},
 {'intent': 'Averaging the values in a dictionary based on the key',
  'rewritten_intent': 'get the average of a list values for each key in dictionary `d`)',
  'snippet': '[(i, sum(j) / len(j)) for 

In [4]:
# Check number of json objects in the list.
len(conala_train_data)

2379

This variable successfully loaded the data in a similar format, that is as a list of json objects, or equivalently, a list of dictionaries. I'll move onto the Test data and take the same approach.

#### 5.1.1.2 - Importing CoNaLa Test Data
[[Back To TOC]](#Table-of-Contents)

Similarly, this data is provided in the format of a .json file, in a set of subfolders. I peeked at the data, and show a preview included below:

![Screen Shot 2020-12-06 at 3.51.09 PM.png](attachment:482c8fe0-f451-4716-91fd-cf1e5a302dba.png)

This is also a list of JSON objects.

In [5]:
# CoNaLa Test Data

# Open file with `with` statement.
with open('data/conala-corpus/conala-test.json') as f:
    # Instantiate conala_test_data to hold the data.
    conala_test_data = json.load(f)

In [6]:
# Peek
conala_test_data

[{'intent': 'How can I send a signal from a python program?',
  'rewritten_intent': 'send a signal `signal.SIGUSR1` to the current process',
  'snippet': 'os.kill(os.getpid(), signal.SIGUSR1)',
  'question_id': 15080500},
 {'intent': 'Decode Hex String in Python 3',
  'rewritten_intent': "decode a hex string '4a4b4c' to UTF-8.",
  'snippet': "bytes.fromhex('4a4b4c').decode('utf-8')",
  'question_id': 3283984},
 {'intent': 'check if all elements in a list are identical',
  'rewritten_intent': 'check if all elements in list `myList` are identical',
  'snippet': 'all(x == myList[0] for x in myList)',
  'question_id': 3844801},
 {'intent': 'Format string dynamically',
  'rewritten_intent': 'format number of spaces between strings `Python`, `:` and `Very Good` to be `20`',
  'snippet': "print('%*s : %*s' % (20, 'Python', 20, 'Very Good'))",
  'question_id': 4302166},
 {'intent': 'How to convert a string from CP-1251 to UTF-8?',
  'rewritten_intent': None,
  'snippet': "d.decode('cp1251').en

In [7]:
# Check number of json objects in the list.
len(conala_test_data)

500

This was also successful.

### 5.1.2 - DataFrames from CoNaLa Competition Data
[[Back To TOC]](#Table-of-Contents)

In this section, I'll manipulate the data into a more workable format, namely a pandas DataFrame. Conveniently, pandas includes a method, `from_dict`, which can ingest the JSON data in its current list form and create a DataFrame from it.

In [8]:
# Create DataFrames from the CoNaLa train and test sets, both from a list of dictionary objects
conala_train_df = pd.DataFrame.from_dict(conala_train_data)
conala_test_df = pd.DataFrame.from_dict(conala_test_data)

# Peek at the dfs
display(conala_test_df.head())
display(conala_train_df.head())

# Check Shapes of dfs
print(conala_train_df.shape)
print(conala_test_df.shape)

Unnamed: 0,intent,rewritten_intent,snippet,question_id
0,How can I send a signal from a python program?,send a signal `signal.SIGUSR1` to the current ...,"os.kill(os.getpid(), signal.SIGUSR1)",15080500
1,Decode Hex String in Python 3,decode a hex string '4a4b4c' to UTF-8.,bytes.fromhex('4a4b4c').decode('utf-8'),3283984
2,check if all elements in a list are identical,check if all elements in list `myList` are ide...,all(x == myList[0] for x in myList),3844801
3,Format string dynamically,format number of spaces between strings `Pytho...,"print('%*s : %*s' % (20, 'Python', 20, 'Very G...",4302166
4,How to convert a string from CP-1251 to UTF-8?,,d.decode('cp1251').encode('utf8'),7555335


Unnamed: 0,intent,rewritten_intent,snippet,question_id
0,How to convert a list of multiple integers int...,Concatenate elements of a list 'x' of multiple...,"sum(d * 10 ** i for i, d in enumerate(x[::-1]))",41067960
1,How to convert a list of multiple integers int...,convert a list of integers into a single integer,"r = int(''.join(map(str, x)))",41067960
2,how to convert a datetime string back to datet...,convert a DateTime string back to a DateTime o...,datetime.strptime('2010-11-13 10:33:54.227806'...,4170655
3,Averaging the values in a dictionary based on ...,get the average of a list values for each key ...,"[(i, sum(j) / len(j)) for i, j in list(d.items...",29565452
4,zip lists in python,"zip two lists `[1, 2]` and `[3, 4]` into a lis...","zip([1, 2], [3, 4])",13704860


(2379, 4)
(500, 4)
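
As a small, self-contained illustration of the conversion above, the sketch below builds a DataFrame from a list of dictionaries shaped like the CoNaLa records. The two records are copied from the test data preview earlier in this notebook; the names `sample_records` and `sample_df` are made up for this example and are not part of the project code.

```python
import pandas as pd

# Two records in the same shape as the CoNaLa JSON objects previewed above.
sample_records = [
    {"intent": "Decode Hex String in Python 3",
     "rewritten_intent": "decode a hex string '4a4b4c' to UTF-8.",
     "snippet": "bytes.fromhex('4a4b4c').decode('utf-8')",
     "question_id": 3283984},
    {"intent": "How to convert a string from CP-1251 to UTF-8?",
     "rewritten_intent": None,
     "snippet": "d.decode('cp1251').encode('utf8')",
     "question_id": 7555335},
]

# Each dictionary becomes a row; each key becomes a column.
sample_df = pd.DataFrame.from_dict(sample_records)

# Shape is (rows, columns); a None value shows up as a missing entry.
print(sample_df.shape)
print(sample_df.isna().sum())
```

Records whose `rewritten_intent` is `None` end up as missing values in that column, which is worth keeping in mind before any text vectorization.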


### 5.1.3 - Pickling Data
[[Back To TOC]](#Table-of-Contents)

The process of conducting analysis and modelling on this data took place over multiple days and weeks. As a result the data needed to be stored in an efficient manner to avoid re-loading multiple times. 

To address this need, I used Python's `pickle` module, which serializes Python objects to binary files so that their state can be saved and reloaded later. This is particularly useful for saving trained models so they do not have to be retrained.

For the purpose of a clean report submission, I've removed these pickle saving and loading cells from this workbook, but these were used very frequently. As a supplement, I've included a file titled `cells_for_pickling_data.ipynb` which provides example cells for how pickling was used in this process.
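
For readers without the supplementary file at hand, here is a minimal sketch of the save-and-reload round trip described above. It uses a tiny stand-in DataFrame and a temporary directory; the names `example_df`, `pickle_path`, and `restored_df` are illustrative assumptions, not the names used in the project.

```python
import os
import pickle
import tempfile

import pandas as pd

# A tiny stand-in DataFrame; in the project this role would be played by
# something like conala_train_df.
example_df = pd.DataFrame({"intent": ["zip lists in python"],
                           "snippet": ["zip([1, 2], [3, 4])"]})

# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example_df.pkl")

    # "wb" = write binary; the file is closed when the block exits.
    with open(pickle_path, "wb") as f:
        pickle.dump(example_df, f)

    # "rb" = read binary; loading restores an equal copy of the object.
    with open(pickle_path, "rb") as f:
        restored_df = pickle.load(f)

print(restored_df.equals(example_df))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.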

### 5.1.4 - CoNaLa Mined Data
[[Back To TOC]](#Table-of-Contents)

Early in this project, I conducted lengthy EDA on the "Mined Data" which was provided in the CoNaLa Competition. After progressing with this, and developing a method to clean and combine this data with the CoNaLa training data, I conducted modelling with this combined set. 

Unfortunately, these preliminary modelling activities taught me some lessons which indicated that continuing to include the mined data would be suboptimal for the purposes of this study. These include the following: 
- The mined data set is far larger than both the training and test data sets. Its influence on the modelling and conclusions would represent a significant imbalance without extensive efforts to balance the data.
- The mined data set communicates its data in a very different manner compared to the training and test sets
    - The mined data set does not have a `rewritten_intent` field to compare with that of the training and test sets.
    - The mined data set includes a `prob` feature which indicates the chance that the response code snippet is in fact related to the Stack Overflow Query. 
- The quality of the data has not been reviewed by volunteers and so cannot be trusted with the same confidence of the others. This is especially clear when conducting an analysis of the distribution of the `prob` column. The average probability is very low, less than 10%, which indicates a very low amount of relation between the provided fields.  
- Given the computing resources and strict timelines I'd been working with, timeliness of modelling is a major consideration. Because of this, it's important to restrict the data set to only the necessary elements.
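
To make the `prob` consideration concrete, the following is a rough sketch of how the score's distribution could be summarized and a cutoff applied. The rows and the cutoff value are invented for illustration; only the `prob` column name comes from the CoNaLa description, and `toy_mined_df`, `prob_cutoff`, and `filtered_df` are hypothetical names.

```python
import pandas as pd

# Made-up rows standing in for the mined data; the values are invented.
toy_mined_df = pd.DataFrame({
    "intent": ["sort a list", "read a file", "parse json", "merge dicts"],
    "snippet": ["x.sort()", "open(p).read()", "json.loads(s)", "{**a, **b}"],
    "prob": [0.02, 0.85, 0.10, 0.60],
})

# Summarize the distribution of the relevance score.
print(toy_mined_df["prob"].describe())

# Keep only rows at or above a hypothetical cutoff.
prob_cutoff = 0.5  # assumed value, not taken from the project
filtered_df = toy_mined_df[toy_mined_df["prob"] >= prob_cutoff].reset_index(drop=True)

print(f"Kept {len(filtered_df)} of {len(toy_mined_df)} rows")
```

A filter like this trades data volume for relevance, so the right cutoff would depend on how many usable rows remain afterwards.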

I decided not to include the _CoNaLa_ Mined Data in my analysis and modelling at this stage. In the future this data will be more thoroughly examined and hopefully can be used to bolster modelling. 

I've provided a supplemental file in my submission titled `conala_mined_data.ipynb`, which presents a sample of the work done in loading and exploring the mined data. This is not, however, an exhaustive representation of how the mined data was used: early modelling iterations included the mined data as well, but these were abandoned because of lengthy runtime and poor performance. Additionally, including these cells would have introduced a great deal of redundancy in reporting on the process.

### Cleaning Data
[[Back To TOC]](#Table-of-Contents)

In [None]:
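# Count fully duplicated rows
display(conala_train_df.duplicated().sum())
# Count missing values per column
display(conala_train_df.isna().sum())
# Summary statistics
display(conala_train_df.describe())
# Impute?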




When conducting research and preliminary analysis, I also loaded some other data sets from a variety of different sources, each requiring a different workflow in order to bring into this workbook and analyze. I've included some of this exploration in the accompanying file: ``

## Vectorizing Text Data
[[Back To TOC]](#Table-of-Contents)

### Simple Bag of Words Vectorization
[[Back To TOC]](#Table-of-Contents)

#### Vectorizing `conala_train_df` with Bag of Words
[[Back To TOC]](#Table-of-Contents)

In [None]:
# Check for nan
conala_train_df.isna().sum()

In [None]:
# Fill with ""
conala_train_df.fillna('', inplace=True)

conala_train_df.isna().sum()

In [None]:
# Instantiate 
conala_train_bagofwords = CountVectorizer(stop_words="english", min_df=5)

# Fit 
conala_train_bagofwords.fit(conala_train_df["rewritten_intent"])

# Transform with the bag of words.
conala_train_bag_SM = conala_train_bagofwords.transform(conala_train_df["rewritten_intent"])
conala_train_bag_SM

In [None]:
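# Create a DataFrame (more workable) from the Sparse Matrix
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
conala_train_bag_df = pd.DataFrame(columns=conala_train_bagofwords.get_feature_names_out(),
                                   data=conala_train_bag_SM.toarray())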

In [None]:
conala_train_bag_df.sum().sort_values(ascending=False)

#### Vectorizing `conala_test_df`

In [None]:
# Check for nan
conala_test_df.isna().sum()

In [None]:
# Fill with ""
conala_test_df.fillna('', inplace=True)

conala_test_df.isna().sum()

In [None]:
# Transform with the bag of words from the train df
conala_test_bag_SM = conala_train_bagofwords.transform(conala_test_df["rewritten_intent"])
conala_test_bag_SM

In [None]:
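# Create a DataFrame (more workable) from the Sparse Matrix
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
conala_test_bag_df = pd.DataFrame(columns=conala_train_bagofwords.get_feature_names_out(),
                                  data=conala_test_bag_SM.toarray())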

Since this is our test set, we shouldn't peek at the results of the transformation here.
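
The fit-on-train, transform-on-test pattern used above can be illustrated with a toy corpus. This is only a sketch: the sentences are adapted from intents previewed earlier, and the names `toy_train`, `toy_test`, and `toy_vectorizer` are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the train and test `rewritten_intent` columns.
toy_train = ["convert a list of integers into a single integer",
             "check if all elements in a list are identical"]
toy_test = ["decode a hex string into a list of bytes"]

# Fit the vocabulary on the training text only.
toy_vectorizer = CountVectorizer(stop_words="english")
toy_train_bag = toy_vectorizer.fit_transform(toy_train)

# Transform the test text with the already-fitted vocabulary.
toy_test_bag = toy_vectorizer.transform(toy_test)

# Both matrices share the same columns (one per training-vocabulary term).
print(toy_train_bag.shape, toy_test_bag.shape)

# Terms that never appeared in the training text have no column at all.
print("hex" in toy_vectorizer.vocabulary_)
```

Because the vocabulary is learned from the training text alone, test-only words are silently ignored, which keeps information from the test set out of the features.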

#### Vectorizing `conala_mined_df` with Bag of Words
[[Back To TOC]](#Table-of-Contents)

We should have less confidence in the `intent` field contained in `conala_mined_df`, since this field comes from a procedurally collected dataset and has not been manually cleaned like that of the training data. Because of this, I'm proposing to use the bag of words we have fit to the train dataset to transform the mined dataset here. This will (hopefully) have the effect of identifying the words that should be considered significant in determining intent, as indicated by the fit on the training data set.

In [None]:
# Check for nan
conala_mined_df.isna().sum()

In [None]:
# Transform with the bag of words from the train df
conala_mined_bag_SM = conala_train_bagofwords.transform(conala_mined_df["intent"])
conala_mined_bag_SM

The number of stored elements from the mined df is a bit low, so this may not be a good way of interpreting the mined data. There are some things we can try or consider: 

- Graph `min_df` against vocabulary size and the number of stored elements, as well as the ratios of vocabulary to records and of stored elements to records.
- Only some of the mined records have a high probability anyway, so we could filter out low probabilities and try again.
- We could filter out the rows that contain no words from the bag of words, which will cut down the data anyway.
- Maybe both.
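
The first idea in the list above could be explored along these lines. This is a sketch on an invented five-sentence corpus, not the project's data; `toy_corpus`, `sweep_results`, and `sweep_df` are hypothetical names, and the `min_df` values are chosen only so that the toy vocabulary never becomes empty.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus standing in for an intent column.
toy_corpus = [
    "convert a list of integers into a single integer",
    "convert a string into a list",
    "check if all elements in a list are identical",
    "sort a list of strings",
    "convert a datetime string into a datetime object",
]

sweep_results = []
for min_df in [1, 2, 3]:
    vectorizer = CountVectorizer(stop_words="english", min_df=min_df)
    bag = vectorizer.fit_transform(toy_corpus)
    n_records, vocab_size = bag.shape
    stored = bag.nnz  # number of non-zero (stored) elements in the sparse matrix
    sweep_results.append({
        "min_df": min_df,
        "vocab_size": vocab_size,
        "stored_elements": stored,
        "vocab_per_record": vocab_size / n_records,
        "stored_per_record": stored / n_records,
    })

# One row per min_df value; these columns could then be plotted.
sweep_df = pd.DataFrame(sweep_results)
print(sweep_df)
```

Raising `min_df` can only shrink the vocabulary, so the table makes it easy to see where the number of stored elements starts to fall off.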

In [None]:
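# Create a DataFrame (more workable) from the Sparse Matrix
# Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2
conala_mined_bag_df = pd.DataFrame(columns=conala_train_bagofwords.get_feature_names_out(),
                                   data=conala_mined_bag_SM.toarray())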

In [None]:
conala_mined_bag_df.sum().sort_values(ascending=False)

#### Comparing Vectorized `conala_mined_df` and `conala_trained_df`
[[Back To TOC]](#Table-of-Contents)

In [None]:
conala_train_bag_df.sum().index==conala_mined_bag_df.sum().index

In [None]:
conala_train_bag_df.sum().values
conala_mined_bag_df.sum().values

In [None]:
# Create a df for comparison of word frequency in bag of words
bag_df = pd.DataFrame(data={"train_freq":conala_train_bag_df.sum().values, "mined_freq":conala_mined_bag_df.sum().values},
             index=conala_train_bag_df.sum().index)

In [None]:
# Inspecting the most common terms of the bag of words.
display(bag_df.sort_values(by="train_freq", ascending=False))
display(bag_df.sort_values(by="mined_freq", ascending=False))

In [None]:
train_bag_sorted = bag_df.sort_values(by="train_freq", ascending=False)

In [None]:
# Plot term frequency, mined compared with train
plt.figure(figsize=(8,80))
plt.barh(train_bag_sorted.index,train_bag_sorted["train_freq"], fill=False, edgecolor='b')
plt.barh(train_bag_sorted.index,train_bag_sorted["mined_freq"], fill=False, edgecolor='r')
plt.autoscale(enable=True, axis='y', tight=True)
plt.show()

The terms are actually quite comparable. Enough so that I'm comfortable proceeding with this for preliminary modelling.

One point in particular is worth noting: the term "python" is far more represented in the mined data. This indicates to me that the term is so frequently used that it is not actually helpful in identifying intent. This is not a surprising result, since the CoNaLa competition was designed around Python code specifically. **I'm going to drop this term**, since it does not give valuable information when all of the code should be written in Python anyway.
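
Since the two data sets differ in size, one way to make a comparison like this more direct is to convert raw counts to per-record rates. The sketch below is an assumption about how that could be done, not the method used in this notebook: the counts are invented, and only the row totals (2379 and 3385) are taken from the DataFrame shapes reported in this notebook.

```python
import pandas as pd

# Invented counts for illustration only; index entries are example terms.
toy_bag_df = pd.DataFrame(
    {"train_freq": [300, 120, 15], "mined_freq": [450, 150, 400]},
    index=["list", "string", "python"],
)
n_train_rows, n_mined_rows = 2379, 3385  # row counts reported in this notebook

# Occurrences per record make differently sized sets comparable.
toy_bag_df["train_rate"] = toy_bag_df["train_freq"] / n_train_rows
toy_bag_df["mined_rate"] = toy_bag_df["mined_freq"] / n_mined_rows

# A ratio well above 1 flags terms over-represented in the mined data.
toy_bag_df["mined_to_train"] = toy_bag_df["mined_rate"] / toy_bag_df["train_rate"]

print(toy_bag_df.sort_values("mined_to_train", ascending=False))
```

With these made-up numbers, "python" stands out with by far the highest ratio, mirroring the kind of pattern described above.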

#### Combining DataFrames
[[Back To TOC]](#Table-of-Contents)

In [None]:
print(conala_train_bag_df.shape)
print(conala_mined_bag_df.shape)
print(conala_train_bag_df.shape[0]+conala_mined_bag_df.shape[0])

In [None]:
combined_bag_df = pd.concat([conala_train_bag_df, conala_mined_bag_df], ignore_index=True)
combined_bag_df

The first 2379 rows are from the train data, and the last 3385 are from the mined data.
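Since `ignore_index=True` discards the original row labels, the origin of each row is only recoverable from its position. One possible way to keep track of it, sketched here with hypothetical miniature DataFrames (`toy_train_df`, `toy_mined_df`), is a separate label Series that stays out of the feature matrix.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the two vectorized DataFrames.
toy_train_df = pd.DataFrame({"list": [1, 0], "string": [0, 2]})
toy_mined_df = pd.DataFrame({"list": [3, 1, 0], "string": [0, 0, 1]})

toy_combined_df = pd.concat([toy_train_df, toy_mined_df], ignore_index=True)

# Record the origin of each row in a separate Series, aligned by position.
source = pd.Series(
    ["train"] * len(toy_train_df) + ["mined"] * len(toy_mined_df),
    name="source",
)

print(source.value_counts())
```

Keeping the label separate means the clustering models never see it as a feature, but cluster assignments can still be broken down by source afterwards.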

In [None]:
# Dropping the `python` column (columns only; also passing index=1 would drop row 1 as well)
combined_bag_df.drop(columns="python", inplace=True)

In [None]:
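# Pickling all three dfs, one file per DataFrame.
# Names are given explicitly: nameof(df) inside a loop returns "df" for every item,
# which would write all three objects to the same file.
pickle_dict = {"conala_train_bag_df": conala_train_bag_df,
               "conala_mined_bag_df": conala_mined_bag_df,
               "combined_bag_df": combined_bag_df}

for name, df in pickle_dict.items():
    # 'wb' overwrites on re-run; 'ab+' would append a second copy to the file.
    with open(f'pickled_{name}', 'wb') as file:
        pickle.dump(df, file)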
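Reading a pickled DataFrame back is the mirror image of writing it. The sketch below round-trips a hypothetical `toy_df` through a temporary directory so that it leaves no files behind; the file name is illustrative only.

```python
import os
import pickle
import tempfile

import pandas as pd

toy_df = pd.DataFrame({"list": [1, 0], "string": [0, 2]})  # hypothetical stand-in

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pickled_toy_df")
    with open(path, "wb") as file:   # write in binary mode
        pickle.dump(toy_df, file)
    with open(path, "rb") as file:   # read back in binary mode
        restored_df = pickle.load(file)

print(restored_df.equals(toy_df))
```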

#### Dimension Reduction of Bag of Words
[[Back To TOC]](#Table-of-Contents)

##### PCA on Bag of Words
[[Back To TOC]](#Table-of-Contents)

##### T-SNE on Bag of Words
[[Back To TOC]](#Table-of-Contents)

### Word2Vec Text Vectorization
[[Back To TOC]](#Table-of-Contents)

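Word2Vec embeddings are dense vector representations of words, learned so that words appearing in similar contexts end up with similar vectors.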

See also Doc2Vec, FastText and wrappers for VarEmbed and WordRank.
[[x]](#References)

In [None]:
# Import Gensim, and get word2vec model methods. 
from gensim.models import Word2Vec
import gensim.downloader # allows downloading of existing models

# Downloading pre-trained GloVe vectors (50 dimensions, trained on Twitter data)
wv = gensim.downloader.load('glove-twitter-50')

In [None]:
# Checking vocab type
type(wv.vocab)

In [None]:
# Terms in vocab
len(wv.vocab)

In [None]:
# Checking for similar terms, cosine similarity!
wv.most_similar("man")
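`most_similar` ranks terms by cosine similarity, that is, the cosine of the angle between two word vectors. A small sketch of the calculation, using hypothetical 3-dimensional vectors (`toy_vectors`) rather than the real 50-dimensional ones:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings"; the values are made up.
toy_vectors = {
    "list": np.array([1.0, 0.0, 1.0]),
    "array": np.array([1.0, 0.0, 0.8]),
    "regex": np.array([0.0, 1.0, 0.0]),
}

# Vectors pointing in nearly the same direction score close to 1 ...
print(cosine_similarity(toy_vectors["list"], toy_vectors["array"]))
# ... while orthogonal vectors score 0.
print(cosine_similarity(toy_vectors["list"], toy_vectors["regex"]))
```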

In [None]:
# Check if word is in wv vocab
"cat" in wv.vocab

In [None]:
# How many unique words are in our corpus?
len(unique_words)

Now check how many of these are in the pre-trained model.

In [None]:
# Find the list of words contained in model, and those missing.
contained=[] # list of terms in both our corpus and the model
missing=[] # list of terms in our corpus, but not the model
msk=[] # True/false mask for unique words that are in the model. 
for i in unique_words:
    if(i in wv.vocab):
        msk.append(1)
        contained.append(i)
    else:
        msk.append(0)
        missing.append(i)
sum(msk)

In [None]:
# peek at missing words
missing

&&&& Loading Pre-existing vec model

&&&&& When using Word2Vec, there's much extra thought to be given regarding how the sentences I'm feeding to the model will be handled. There's a large number of special characters such as brackets and "%" for example.

&&&&& Comparing the unique words to vocab of pre-trained.

In [None]:
# A couple of functions to help process lists of text sentences.

import re
import nltk
nltk.download('punkt')

def clean_split_text_list(li):
    '''
    Takes a list of sentences.
    Returns a list of lists, each inner list is the words in a sentence.
    Also adds a space on either side of each non-letter, non-digit char.
    This allows for brackets, etc. to be considered as their own word, unless
    vectorized with a model which does not include them.
    Non-string items (e.g. None or NaN) are passed through unchanged.
    '''

    new_list = list()
    for i in li:
        if isinstance(i, str):
            i = i.lower()  # lowercase the sentence
            i = re.sub(r'([^a-zA-Z \d])', r' \1 ', i)  # add spaces around special chars
            i = i.split()  # split on whitespace, dropping empty strings
        new_list.append(i)
    return new_list

def vectorize_text_list(li):
    '''
    Takes a list of lists.
        - outer list is a list of sentences
        - each inner list is the list of words in a sentence.
    Returns a list with one entry per sentence: a list of word vectors,
    one for each word found in the model `wv`.
    Words missing from the model are skipped.
    A None sentence is returned as a single zero vector of wv shape.
    '''
    new_list=list() # new list object to be returned at end.
    for i in li:
        if i is None:
            new_list.append(np.zeros_like(wv["empty"])) # If None, empty array of wv shape.
            continue
        if isinstance(i, float):
            i = [str(i)] # treat a stray float (e.g. NaN) as a single token
        sub_list=list() # list of vecs, representing a sentence
        for j in i:
            try:
                vec = wv[j]
                sub_list.append(vec)
            except KeyError:
                continue
        new_list.append(sub_list)
    return new_list
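The clustering models used later expect one fixed-length vector per sentence, whereas a list of word vectors varies in length with the sentence. One common way to bridge that gap is to average the word vectors. This is only a sketch of that option, not necessarily the approach this notebook will settle on; `toy_wv`, `VECTOR_SIZE` and `mean_sentence_vector` are hypothetical names.

```python
import numpy as np

# Hypothetical stand-in for the pre-trained model: a dict of 3-d word vectors.
toy_wv = {
    "sort": np.array([1.0, 0.0, 0.0]),
    "list": np.array([0.0, 1.0, 0.0]),
}
VECTOR_SIZE = 3

def mean_sentence_vector(word_vectors, size=VECTOR_SIZE):
    """Average a list of word vectors; return zeros if the list is empty."""
    if len(word_vectors) == 0:
        return np.zeros(size)
    return np.mean(word_vectors, axis=0)

tokens = ["sort", "a", "list"]  # "a" is deliberately missing from toy_wv
vectors = [toy_wv[t] for t in tokens if t in toy_wv]
sentence_vector = mean_sentence_vector(vectors)

print(sentence_vector)
```

Averaging discards word order, so it is a coarse summary, but it gives every sentence the same dimensionality regardless of length.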

#### PCA on Word2Vec
[[Back To TOC]](#Table-of-Contents)

#### T-SNE on Word2Vec
[[Back To TOC]](#Table-of-Contents)

## Intent Paradigms
[[Back To TOC]](#Table-of-Contents)

We can look at the above graph to see some common themes which emerge, at least on the level of word frequency. 

- String manipulation 
- List manipulation 
- Type change
- Regular Expression
- DataFrame Manipulation
- Find object  


&&...



# Modelling and Analysis
[[Back To TOC]](#Table-of-Contents)


## ML Clustering Models

In [None]:
# For this preliminary modelling, we'll work with: 
combined_bag_df

With this data, our goal is to identify clusters of snippets which are "similar" to one another. These clusters can give an understanding of the paradigms which are commonly found in code snippets (at least on Stack Overflow).

So the plan of action is to apply several clustering models to the vectorized data and see what we can learn from each in turn. The four we will try are:
- Agglomerative
- DBSCAN
- KMeans
- Gaussian Mixture
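As a rough illustration of how the four models compare on a common footing, the sketch below fits each one to hypothetical toy data (two well-separated 2-D blobs) and reports a silhouette score. The data and every parameter value are assumptions chosen for illustration, not settings tuned for the bag-of-words matrix.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Hypothetical toy data: two well-separated blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(50, 2)),
])

# Parameter values are illustrative, not tuned.
labels = {
    "Agglomerative": AgglomerativeClustering(n_clusters=2).fit_predict(X),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5).fit_predict(X),
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "Gaussian Mixture": GaussianMixture(n_components=2, random_state=0).fit_predict(X),
}

# Silhouette score: closer to 1 means tighter, better-separated clusters.
for name, lab in labels.items():
    print(name, round(silhouette_score(X, lab), 3))
```

On data this clean all four should agree; the interesting cases are the ones where they do not, which is what the sections below explore on the real data.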

In [None]:
# Importing the libraries
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import fcluster
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

### Agglomerative Clustering
[[Back To TOC]](#Table-of-Contents)

The common linkage criteria are:
- Single
- Maximum (also called complete)
- Average
- Ward's


In [None]:
%%time
from scipy.cluster.hierarchy import dendrogram, linkage
# we are using the average linkage here
linkagemat = linkage(combined_bag_df, 'average') 

In [None]:
%%time
plt.figure(figsize=(25, 10))
dendrogram(
    linkagemat,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.  # font size for the x axis labels
);

From the dendrogram above, we can see how the number of clusters reduces as the average distance is increased.
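A dendrogram is read by drawing a horizontal cut: every branch that crosses the cut becomes one cluster. SciPy's `fcluster` performs that cut on a linkage matrix, either by a target number of clusters or by a distance threshold. A minimal sketch on hypothetical toy points (`toy_points`), with thresholds chosen only for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: three tight groups on a line.
toy_points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [12.0]])
toy_linkage = linkage(toy_points, "average")

# Cut by a target number of clusters ...
labels_by_count = fcluster(toy_linkage, t=3, criterion="maxclust")
# ... or by a distance threshold read off the dendrogram's y axis.
labels_by_distance = fcluster(toy_linkage, t=1.0, criterion="distance")

print(labels_by_count)
print(labels_by_distance)
```

Raising the distance threshold merges more branches and so yields fewer clusters, which is the relationship visible in the plot.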

In [None]:
%%time
agg_clus = AgglomerativeClustering(n_clusters=20, linkage='average').fit(combined_bag_df)

In [None]:
agg_clus.labels_

In [None]:
np.unique(agg_clus.labels_, return_counts=True)

In [None]:
from sklearn.metrics.cluster import silhouette_score

silhouette_score(combined_bag_df, agg_clus.labels_)

This doesn't seem all that helpful. We do have multiple clusters, but the vast majority of the samples lie in a single one.

We can try standard scaling the data and running the same model.

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize
ss = StandardScaler()

# Fit 
ss_fit = ss.fit(combined_bag_df)

# Transform
combined_bag_df_ss = ss.transform(combined_bag_df)

In [None]:
%%time
from scipy.cluster.hierarchy import dendrogram, linkage
# we are using the average linkage here
linkagemat = linkage(combined_bag_df_ss, 'average') 

In [None]:
%%time
plt.figure(figsize=(25, 10))
dendrogram(
    linkagemat,
    leaf_rotation=90.,  # rotates the x axis labels
    leaf_font_size=8.  # font size for the x axis labels
);

From the dendrogram above, we can see how the number of clusters reduces as the average distance is increased.

In [None]:
%%time
agg_clus = AgglomerativeClustering(n_clusters=20, linkage='average').fit(combined_bag_df_ss)

In [None]:
# Pickle the model for rapid use later.
# 'wb' overwrites any previous file; 'ab+' would append a second copy on re-run.
with open('pickled_agglom_model', 'wb') as agglom_model:
    # source, destination
    pickle.dump(agg_clus, agglom_model)

In [None]:
agg_clus.labels_

In [None]:
np.unique(agg_clus.labels_, return_counts=True)

In [None]:
from sklearn.metrics.cluster import silhouette_score

# Score on the scaled data, since that is what this model was fit on.
silhouette_score(combined_bag_df_ss, agg_clus.labels_)

This is just as bad, and the silhouette score is worse.

### DBSCAN
[[Back To TOC]](#Table-of-Contents)


In [None]:
# Instantiate
db = DBSCAN(eps=2, min_samples=10)

In [None]:
db.fit(combined_bag_df.sample(10))

In [None]:
%%time
# %%time is used rather than %%timeit so that `db` persists for the cells below.
from sklearn.cluster import DBSCAN

# Instantiate
db = DBSCAN(eps=2, min_samples=10)

# Fit
db.fit(combined_bag_df)

In [None]:
type(db)

In [None]:
#try this out with a range of eps and min_samples
print(db.labels_.sum()) # labels

In [None]:
np.unique(db.labels_, return_counts=True)

Still not great results here.

Try a larger `eps` and a smaller `min_samples`.

In [None]:
%%time
# Instantiate (%%time rather than %%timeit, so that `db` persists for the cells below)
db = DBSCAN(eps=4, min_samples=5)

# Fit
db.fit(combined_bag_df)

In [None]:
#try this out with a range of eps and min_samples
print(db.labels_.sum()) # labels

In [None]:
np.unique(db.labels_, return_counts=True)

Not much better.
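Rather than trying `eps` and `min_samples` one pair at a time, a small sweep can summarize each setting by its number of clusters and noise points. DBSCAN labels noise as `-1`, so counting distinct labels and noise separately is more informative than summing the labels. The sketch below runs on hypothetical toy data, and the parameter grid is an assumption for illustration rather than a recommendation for the bag-of-words matrix.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two blobs plus one far-away outlier.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(40, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(40, 2)),
    [[20.0, 20.0]],
])

results = []
for eps in [0.05, 1.0, 10.0]:          # illustrative values only
    for min_samples in [5, 10]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Noise points carry the label -1 and are not a cluster.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        results.append((eps, min_samples, n_clusters, n_noise))
        print(f"eps={eps}, min_samples={min_samples}: "
              f"{n_clusters} clusters, {n_noise} noise points")
```

A very small `eps` tends to mark most points as noise, while a very large one merges everything into a single cluster; useful settings lie somewhere in between.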

In [None]:
db_labelled_df = combined_bag_df.copy()
db_labelled_df.insert(0,"DB_label", db.labels_)

In [None]:
db_zero = db_labelled_df[db_labelled_df["DB_label"]==0]

## Autoencoding

Attempting Dimension Reduction with Autoencoding

# Next Steps
[[Back To TOC]](#Table-of-Contents)


# Conclusion
[[Back To TOC]](#Table-of-Contents)


# References
[[Back To TOC]](#Table-of-Contents)

[1] Is it possible to translate Python code to English? September 18, 2018. Quora. [online] Available at: https://www.quora.com/Is-it-possible-to-translate-Python-code-to-English

[2] Stack Overflow. 2020. [online] Available at: https://stackoverflow.com

[3] Conda. 2020. [online] Available at: https://docs.conda.io/projects/conda/en/latest/index.html

[4] CoNaLa: The Code/Natural Language Challenge. 2020. CoNaLa: The Code/Natural Language Challenge. [online] Available at: <https://conala-corpus.github.io/#dataset-information> [Accessed 13 November 2020].





[x] Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. arXiv:1805.08949v1. 23 May 2018. Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, Graham Neubig, Carnegie Mellon University, USA. [online]. Available at: https://arxiv.org/pdf/1805.08949.pdf

[x] Neural Machine Translation (seq2seq) Tutorial. 2017. Minh-Thang Luong and Eugene Brevdo and Rui Zhao. [online]. Available at: https://github.com/tensorflow/nmt

[x] Character-level recurrent sequence-to-sequence model. 2020/04/26. Francois Chollet. [online]. Available at: https://keras.io/examples/nlp/lstm_seq2seq/

[x] models.word2vec – Word2vec embeddings. 2020/11/04. Radim Řehůřek. [online]. Available at: https://radimrehurek.com/gensim/models/word2vec.html
