
# Code Intent Prediction
## With Applied Machine Learning Techniques
***
### Justin Hugh
#### Data Science Diploma Candidate, BrainStation
##### December 18, 2020

***

# Table of Contents
## [ Introduction](#Introduction)
## [ Limitations and Assumptions](#Limitations-and-Assumptions)
## [ Background](#Background)
## [ The Data](#The-Data)  
### [- Sources of Data](#Sources-of-Data)  
### [- Data Characteristics](#Data-Characteristics)  
## [ Data Wrangling](#Data-Wrangling)  
## [ Conclusion](#Conclusion)  
## [ References](#References)

***

# Introduction
[[Back to TOC]](#Table-of-Contents)

Software and code are becoming unavoidable in our daily lives, both personal and professional, but only a fraction of us are literate in code. Even among developers, there exists such a wide range of languages and frameworks that no one is familiar with them all. A model that could predict the intent of code would be hugely impactful for:  
- Education. Making code more accessible and interpretable.  
- Security. Identifying code with malicious intent.  
- Development. Providing contextual tooltips, suggestions, resources.   

The goal of this project is to develop an ML model employing NLP tools to interpret what a piece of code is trying to accomplish.

***

# Limitations and Assumptions
[[Back To TOC]](#Table-of-Contents)

In this section we'll recognize some of the limitations and assumptions of the modelling and analysis we will conduct. Those listed here are generally applicable to the project at large. Any that apply more specifically to a certain step are discussed at that point in the analysis.

- Some of this data is not current. One of our main sources of data is a competition that was launched in 2018. Software changes quickly, since updates to packages are relatively cheap to make. My model's performance may be less applicable to code written today, and will degrade over time as libraries and languages are updated.

- I assume this data set does not contain significant errors, such as incorrect application of code or erroneous syntax. If these are present in abundance, then the system will have "learned" incorrect code application. 

- Developers are not uniquely identified in the data I've used. Not having this information prevents me from drawing deeper insights into code and intent on a developer-by-developer level, which could potentially mean more accurate interpretations. However, this is a good and necessary practice from a privacy standpoint. If developers were uniquely identified in the data, this could potentially be used to reconstruct personal data, constituting a notable privacy concern.
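
The syntax assumption in the second bullet could be spot-checked before modelling. The following is a minimal sketch, assuming the snippets are Python and that "erroneous syntax" means "does not parse under the Python 3 grammar". The helper name `is_valid_python` and the sample snippets are hypothetical and are not taken from the data set.

```python
import ast


def is_valid_python(snippet: str) -> bool:
    """Return True if the snippet parses as Python 3 source."""
    try:
        ast.parse(snippet)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers inputs such as
        # strings containing null bytes on some Python versions.
        return False


# Hypothetical snippets for illustration only.
samples = [
    "x.sort(reverse=True)",
    "print 'hello'",               # Python 2 style, rejected by a Python 3 parser
    "for i in range(3) print(i)",  # missing colon
]

invalid = [s for s in samples if not is_valid_python(s)]
print(f"{len(invalid)} of {len(samples)} snippets failed to parse")
```

A parse failure here does not necessarily mean the snippet is wrong: older Python 2 code and partial expressions may also be rejected, so a check like this would only flag candidates for review.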

***

# Background
[[Back To TOC]](#Table-of-Contents)

In any project there's always important context beyond the code and the model. In this section, we'll discuss the important subject matter surrounding the problems we're tackling. 

## Packages and Libraries
[[Back To TOC]](#Table-of-Contents)

There's a wealth of support openly available in the form of packages for Machine Learning, and other problem areas we'll touch on in this project. 

We'll import some necessary ones below in this section. 

In [3]:
# The usual packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline

# Data Wrangling
from sklearn.model_selection import train_test_split 
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, MinMaxScaler 
from sklearn.decomposition import PCA

# Model Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, roc_auc_score

# The classifiers 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# NLP
from sklearn.feature_extraction.text import CountVectorizer
import nltk 
import string
from sklearn.neighbors import NearestNeighbors

ModuleNotFoundError: No module named 'nltk'

***

# The Data
[[Back To TOC]](#Table-of-Contents)

A model is only as good as the data it uses. In this section we'll discuss the inputs to this project: where they come from, what they look like, and how we hope to use them.

## Sources of Data
[[Back To TOC]](#Table-of-Contents)

To acquire the data used in this project, I accessed numerous resources, hoping to create a varied but informative data set. 

### CoNaLa

[_The Code/Natural Language Challenge (CoNaLa)_](https://conala-corpus.github.io/#dataset-information) is a challenge that was created by [_Carnegie Mellon University (CMU)_](https://www.cmu.edu/) along with [_NeuLab_](http://www.cs.cmu.edu/~neulab/) and [_STRUDEL Lab_](https://cmustrudel.github.io/) on May 31, 2018 in order to test systems for generating programs from natural language [[1]](#References). The original intent was to have a system take an English input such as "sort list x in reverse order" and output `x.sort(reverse=True)` in Python. 

_CoNaLa_ is a competition with no end date, and its data are offered for use within the challenge itself or in any other research at the intersection of code and natural language, which this project falls nicely into.

_CoNaLa_ provides a wealth of publicly available data which is well suited to the needs of the challenge (and ours), including: 
- Data crawled from _Stack Overflow_ with 2,379 training examples, and 500 test examples. These have been curated by annotators.
- Automatically-mined data with 600,000 examples. 
- Links to other helpful and similar data sets:
    - [Django Dataset](https://ahcweb01.naist.jp/pseudogen/)  
    - [StaQC](https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset)  
    - [Code Docstring Corpus](https://github.com/EdinburghNLP/code-docstring-corpus)  
    
I accessed these data in a couple of different ways (direct download from CoNaLa, git, etc.)

## Data Characteristics
[[Back To TOC]](#Table-of-Contents)


# Exploratory Data Analysis
[[Back To TOC]](#Table-of-Contents)

The purpose of Exploratory Data Analysis (EDA) is to familiarize ourselves with the data, determine whether it has missing values or other deficiencies, clean the data so it may be analyzed, and peek at some of the more immediately evident relations of the data and parameters we're working with. By the end of these activities, we will have a cleaned set of data which is prepared for modelling and deeper analysis.
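
As a rough illustration of the "missing values" check described above, the sketch below counts how many records lack a usable value for each field. It assumes the data arrive as a list of dictionaries. The field names `intent` and `snippet`, the helper `count_missing`, and the toy records are all hypothetical placeholders, not the project's actual schema.

```python
def count_missing(records, fields):
    """Count, per field, how many records lack a usable value.

    A value counts as missing if the key is absent, the value is None,
    or the value is a blank (whitespace-only) string.
    """
    missing = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
    return missing


# Toy records for illustration only.
toy_records = [
    {"intent": "sort list x in reverse order", "snippet": "x.sort(reverse=True)"},
    {"intent": "", "snippet": "len(x)"},
    {"intent": "reverse a string s", "snippet": None},
]

print(count_missing(toy_records, ["intent", "snippet"]))
```

Records flagged this way could then be dropped or repaired during cleaning, depending on how many there are.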

## Importing the Data
[[Back To TOC]](#Table-of-Contents)

Our data come from a variety of sources, each requiring a different workflow to bring into this workbook and analyze. In this section we'll outline our methods for doing this and import the data itself.
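
One possible workflow for a source distributed as a JSON file is sketched below. It assumes the file holds a JSON array of records. The helper `load_pairs`, the file name `toy-pairs.json`, and the field names `intent` and `snippet` are placeholders for illustration, so the real file names and schema should be checked against the downloaded data.

```python
import json
import tempfile
from pathlib import Path


def load_pairs(path):
    """Read a JSON file holding an array of records and return a list of dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    return records


# Round-trip demo: write one toy record to a temporary file, then load it back.
toy = [{"intent": "sort list x in reverse order", "snippet": "x.sort(reverse=True)"}]
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy-pairs.json"  # hypothetical file name
    toy_path.write_text(json.dumps(toy), encoding="utf-8")
    records = load_pairs(toy_path)

print(records[0]["intent"], "->", records[0]["snippet"])
```

A list of dictionaries like this can be handed to `pd.DataFrame(records)` to continue the analysis in pandas.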

In [None]:
# CoNaLa Training Data

## Intent Paradigms
[[Back To TOC]](#Table-of-Contents)

(not exclusive?)
- String manipulation
- List manipulation 
- Type change
- Regular Expression
- DataFrame Manipulation
- Find object  
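
To make the "not exclusive?" question concrete, here is a minimal sketch of a rule-based tagger that can assign several of the paradigms above to one snippet. The regular expressions are hypothetical keyword heuristics invented for illustration (including the assumption that DataFrames are named `df` and pandas is imported as `pd`). They are not the project's labelling method and would miss many real cases.

```python
import re

# Hypothetical keyword patterns, one per paradigm listed above; illustrative only.
PARADIGM_PATTERNS = {
    "String manipulation": r"\.(split|join|replace|strip|upper|lower|format)\(",
    "List manipulation": r"\.(append|extend|sort|pop|insert)\(|\bsorted\(",
    "Type change": r"\b(int|float|str|list|tuple|set|dict)\(",
    "Regular Expression": r"\bre\.(search|match|findall|sub|split|compile)\(",
    "DataFrame Manipulation": r"\b(df|pd)\.|\.(groupby|merge|iloc|loc)\b",
    "Find object": r"\.(find|index)\(|\bin\b",
}


def tag_paradigms(snippet):
    """Return the set of paradigm names whose pattern matches the snippet."""
    return {
        name
        for name, pattern in PARADIGM_PATTERNS.items()
        if re.search(pattern, snippet)
    }


print(tag_paradigms("x.sort(reverse=True)"))
print(tag_paradigms("re.split(pattern, s)"))
```

The second call matches both the string and the regular expression patterns, which suggests the paradigms may be better treated as multi-label tags than as mutually exclusive classes.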


&&...



# Data Wrangling
[[Back To TOC]](#Table-of-Contents)


# Conclusion
[[Back To TOC]](#Table-of-Contents)


# References
[[Back To TOC]](#Table-of-Contents)

[1] CoNaLa: The Code/Natural Language Challenge. 2020. CoNaLa: The Code/Natural Language Challenge. [online] Available at: <https://conala-corpus.github.io/#dataset-information> [Accessed 13 November 2020].