
# Code Intent Prediction
## With Applied Machine Learning Techniques
***
### Justin Hugh
#### Data Science Diploma Candidate, BrainStation
##### December 18, 2020

***

# Table of Contents
## [ Introduction](#Introduction)
## [ Limitations and Assumptions](#Limitations-and-Assumptions)
## [ Background](#Background)
## [ The Data](#The-Data)
- ### [ Sources of Data](#Sources-of-Data)  
    - #### [ CoNaLa](#CoNaLa)
- ### [ Data Characteristics](#Data-Characteristics)  

## [ Exploratory Data Analysis](#Exploratory-Data-Analysis)  
- ### [ Importing Data](#Importing-Data)   
    - #### [ CoNaLa Competition Data](#CoNaLa-Competition-Data)  
    - #### [ CoNaLa Mined Data](#CoNaLa-Mined-Data)
- ### [ Intent Paradigms](#Intent-Paradigms)  

## [ Modelling and Analysis](#Modelling-and-Analysis)
    
## [ Conclusion](#Conclusion)  
## [ References](#References)

***

# Introduction
[[Back to TOC]](#Table-of-Contents)

Software is now present in nearly every part of our personal and professional lives, yet only a fraction of us can read code. Even among those who can, the range of languages and frameworks is so wide that no one is familiar with all of it. 

I propose a model which could predict the intent or purpose of a sample of code. A tool like this would help us understand more of the world around us and would be hugely impactful for:  
- Education. Making code more accessible and interpretable.  
- Security. Identifying code with malicious intent.  
- Development. Providing contextual tooltips, suggestions, resources.   

The goal of this project is to develop an ML model employing NLP tools to interpret what a piece of code is trying to accomplish.

***

# Limitations and Assumptions
[[Back To TOC]](#Table-of-Contents)

In this section I'll recognize some of the limitations and assumptions of the modelling and analysis I will conduct. Those listed here are generally applicable to the project at large. Any that apply more specifically to a certain step are discussed at that point in the analysis.

- Some of this data is not current. One of the main sources of data is a competition that was launched in 2018. Software changes quickly, since updates to packages are relatively cheap to release. My model may therefore be less applicable to code written today, and its performance will degrade over time as libraries and languages are updated.

- I assume this data set does not have known significant errors, such as incorrect application of code or erroneous syntax. If these are present in abundance, then the model will have "learned" incorrect code usage. 

- Developers are not uniquely identified in the data I've used. Not having this information prevents me from drawing deeper insights into code and intent on a developer-by-developer level, which could potentially mean more accurate interpretations. However, this is a good and necessary practice from a privacy standpoint. If developers were uniquely identified in the data, this could potentially be used to reconstruct personal data, constituting a notable privacy concern.

***

# Background
[[Back To TOC]](#Table-of-Contents)

In any project it's important to recognize the context of what's being investigated and how, beyond just the code and the model we create. In this section, we'll discuss the important subject matter surrounding the problems we're tackling. 

## Packages and Libraries
[[Back To TOC]](#Table-of-Contents)

There's a wealth of support openly available in the form of packages for Machine Learning, and other problem areas I'll touch on in this project. 

I import some necessary ones below in this section. 

In [1]:
# The usual packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline

# Data Import
import pickle

# Data Wrangling
from sklearn.model_selection import train_test_split 
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.decomposition import PCA
import json
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import BaggingClassifier

# Model Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from tempfile import mkdtemp

# The classifiers 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

# NLP
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
import string
from sklearn.neighbors import NearestNeighbors

# runtime
from tqdm import tqdm
import time 
import warnings
warnings.filterwarnings("ignore")

# timeseries
from statsmodels.api import tsa
from statsmodels.tsa.ar_model import AR
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.statespace.sarimax import SARIMAX

# pipeline
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.compose import ColumnTransformer

***

# The Data
[[Back To TOC]](#Table-of-Contents)

A model is only as good as the data it uses. In this section I'll discuss the inputs to this project: where they come from, what they look like, and how I hope to use them.

## Sources of Data
[[Back To TOC]](#Table-of-Contents)

To acquire the data used in this project, I drew on several resources, aiming to balance quality against quantity. Some of the data is well structured and already cleaned; to increase the amount of data I had, I also accessed other, larger data sets. 

_Open question: conduct some scraping and intent learning myself?_

I'll discuss where we obtained the data below. It's important to note where the data used to train the model I've created comes from. With that understanding it's possible to see some quirks, patterns, and other important information to keep in mind throughout this project.

### CoNaLa

[_The Code/Natural Language Challenge (CoNaLa)_](https://conala-corpus.github.io/#dataset-information) is a challenge that was created by [_Carnegie Mellon University (CMU)_](https://www.cmu.edu/) along with [_NeuLab_](http://www.cs.cmu.edu/~neulab/) and [_STRUDEL Lab_](https://cmustrudel.github.io/) on May 31, 2018 in order to test systems for generating programs from natural language [[1]](#References). The original intent was for a system, given an English input such as "sort list x in reverse order", to output `x.sort(reverse=True)` in Python. 
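
To make the shape of the task concrete, here is a minimal sketch of one intent/snippet pair built from the example above. The dictionary keys mirror the `intent` and `snippet` fields that appear in the CoNaLa files loaded later in this notebook; the sample list `x` is an assumption made purely for illustration.

```python
# One natural-language intent paired with the Python snippet that fulfils it.
pair = {
    "intent": "sort list x in reverse order",
    "snippet": "x.sort(reverse=True)",
}

# Hypothetical input list, used only to show what the snippet does.
namespace = {"x": [3, 1, 2]}

# Run the snippet against the sample list in an isolated namespace.
exec(pair["snippet"], namespace)
sorted_x = namespace["x"]

print(pair["intent"], "->", sorted_x)
```

The challenge goes from the intent to the snippet, whereas this project attempts the reverse direction: from the snippet back to a description of what it is trying to accomplish.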

_CoNaLa_ is a competition with no end date, and its data sets are offered for use within the challenge itself or in any other research at the intersection of code and natural language, which this project falls nicely into.

_CoNaLa_ provides a wealth of publicly available data which is well suited to the needs of the challenge (and ours), including: 
- Data crawled from _Stack Overflow_ with 2,379 training examples and 500 test examples. These have been curated by annotators.
- Automatically-mined data with approximately 600,000 examples. 
- Links to other helpful and similar data sets:
    - [Django Dataset](https://ahcweb01.naist.jp/pseudogen/)  
    - [StaQC](https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset)[3]  
    - [Code Docstring Corpus](https://github.com/EdinburghNLP/code-docstring-corpus)[4]  
    
I accessed these data in a couple of different ways (direct download from CoNaLa, git, etc.).

## Data Characteristics
[[Back To TOC]](#Table-of-Contents)


# Exploratory Data Analysis
[[Back To TOC]](#Table-of-Contents)

The purpose of Exploratory Data Analysis (EDA) is to familiarize ourselves with the data, determine whether it has missing values or other deficiencies, clean the data so it may be analyzed, and peek at some of the more immediately evident relations of the data and parameters we're working with. By the end of these activities, we will have a cleaned set of data which is prepared for modelling and deeper analysis.

## Importing Data
[[Back To TOC]](#Table-of-Contents)

Our data come from a variety of sources, each requiring a different workflow to bring into this workbook and analyze. In this section we'll outline our methods for doing this and import the data itself.

### CoNaLa Competition Data
[[Back To TOC]](#Table-of-Contents)


In [2]:
# CoNaLa Training Data

# Open file, handle with `with` and load the contents which are contained as a json object.
# Instantiate conala_train_data to hold the data.
with open('data/conala-corpus/conala-train.json') as f:
    conala_train_data = json.load(f)

In [3]:
conala_train_data

[{'intent': 'How to convert a list of multiple integers into a single integer?',
  'rewritten_intent': "Concatenate elements of a list 'x' of multiple integers to a single integer",
  'snippet': 'sum(d * 10 ** i for i, d in enumerate(x[::-1]))',
  'question_id': 41067960},
 {'intent': 'How to convert a list of multiple integers into a single integer?',
  'rewritten_intent': 'convert a list of integers into a single integer',
  'snippet': "r = int(''.join(map(str, x)))",
  'question_id': 41067960},
 {'intent': 'how to convert a datetime string back to datetime object?',
  'rewritten_intent': "convert a DateTime string back to a DateTime object of format '%Y-%m-%d %H:%M:%S.%f'",
  'snippet': "datetime.strptime('2010-11-13 10:33:54.227806', '%Y-%m-%d %H:%M:%S.%f')",
  'question_id': 4170655},
 {'intent': 'Averaging the values in a dictionary based on the key',
  'rewritten_intent': 'get the average of a list values for each key in dictionary `d`)',
  'snippet': '[(i, sum(j) / len(j)) for 

In [4]:
# CoNaLa Test Data

# Open file
# Instantiate conala_test_data to hold the data.
with open('data/conala-corpus/conala-test.json') as f:
    conala_test_data = json.load(f)

In [5]:
conala_test_data

[{'intent': 'How can I send a signal from a python program?',
  'rewritten_intent': 'send a signal `signal.SIGUSR1` to the current process',
  'snippet': 'os.kill(os.getpid(), signal.SIGUSR1)',
  'question_id': 15080500},
 {'intent': 'Decode Hex String in Python 3',
  'rewritten_intent': "decode a hex string '4a4b4c' to UTF-8.",
  'snippet': "bytes.fromhex('4a4b4c').decode('utf-8')",
  'question_id': 3283984},
 {'intent': 'check if all elements in a list are identical',
  'rewritten_intent': 'check if all elements in list `myList` are identical',
  'snippet': 'all(x == myList[0] for x in myList)',
  'question_id': 3844801},
 {'intent': 'Format string dynamically',
  'rewritten_intent': 'format number of spaces between strings `Python`, `:` and `Very Good` to be `20`',
  'snippet': "print('%*s : %*s' % (20, 'Python', 20, 'Very Good'))",
  'question_id': 4302166},
 {'intent': 'How to convert a string from CP-1251 to UTF-8?',
  'rewritten_intent': None,
  'snippet': "d.decode('cp1251').en

In [6]:
# Create DataFrames from the CoNaLa train and test sets, both from a list of dictionary objects
conala_train_df = pd.DataFrame.from_dict(conala_train_data)
conala_test_df = pd.DataFrame.from_dict(conala_test_data)


# Peek at the dfs
display(conala_test_df.head())
display(conala_train_df.head())

Unnamed: 0,intent,rewritten_intent,snippet,question_id
0,How can I send a signal from a python program?,send a signal `signal.SIGUSR1` to the current ...,"os.kill(os.getpid(), signal.SIGUSR1)",15080500
1,Decode Hex String in Python 3,decode a hex string '4a4b4c' to UTF-8.,bytes.fromhex('4a4b4c').decode('utf-8'),3283984
2,check if all elements in a list are identical,check if all elements in list `myList` are ide...,all(x == myList[0] for x in myList),3844801
3,Format string dynamically,format number of spaces between strings `Pytho...,"print('%*s : %*s' % (20, 'Python', 20, 'Very G...",4302166
4,How to convert a string from CP-1251 to UTF-8?,,d.decode('cp1251').encode('utf8'),7555335


Unnamed: 0,intent,rewritten_intent,snippet,question_id
0,How to convert a list of multiple integers int...,Concatenate elements of a list 'x' of multiple...,"sum(d * 10 ** i for i, d in enumerate(x[::-1]))",41067960
1,How to convert a list of multiple integers int...,convert a list of integers into a single integer,"r = int(''.join(map(str, x)))",41067960
2,how to convert a datetime string back to datet...,convert a DateTime string back to a DateTime o...,datetime.strptime('2010-11-13 10:33:54.227806'...,4170655
3,Averaging the values in a dictionary based on ...,get the average of a list values for each key ...,"[(i, sum(j) / len(j)) for i, j in list(d.items...",29565452
4,zip lists in python,"zip two lists `[1, 2]` and `[3, 4]` into a lis...","zip([1, 2], [3, 4])",13704860


### CoNaLa Mined Data
[[Back To TOC]](#Table-of-Contents)

In [9]:
# CoNaLa Mined Data

# This file is different in format from the other CoNaLa Competition Data: 
# it contains multiple json objects, one per line, so we need to handle it differently.

# First instantiate an empty list. This will be used to hold all of the dictionary objects
# as a list of dictionaries.
conala_mined_data_list = []

# Open file, loop through the json objects in the file, appending each one to the list. 
with open('data/conala-corpus/conala-mined.jsonl') as f:
    for jsonObj in tqdm(f):
        code_dic = json.loads(jsonObj)
        conala_mined_data_list.append(code_dic)

593891it [00:45, 12978.64it/s]
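
As an alternative to the explicit loop, pandas can read a file with one JSON object per line directly through `pd.read_json(..., lines=True)`. The sketch below uses a small in-memory stand-in for the file; the two records and their values are assumptions for illustration, not rows from the mined data.

```python
import io
import json

import pandas as pd

# Two made-up records in the same one-object-per-line layout as the mined file.
jsonl_text = "\n".join([
    json.dumps({"question_id": 1, "intent": "sort a list", "snippet": "sorted(x)", "prob": 0.8}),
    json.dumps({"question_id": 2, "intent": "reverse a list", "snippet": "x[::-1]", "prob": 0.3}),
])

# Manual approach: parse each line separately, as in the cell above.
records = [json.loads(line) for line in jsonl_text.splitlines()]
manual_df = pd.DataFrame(records)

# Direct approach: let pandas parse the JSON Lines content in one call.
direct_df = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(direct_df)
```

With a real file, the `io.StringIO` object would be replaced by the file path. The loop keeps the `tqdm` progress bar, which is useful for a file of this size.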


In [10]:
conala_mined_data_list[::-1]

[{'parent_answer_post_id': 39398969,
  'prob': 3.0130267068701258e-05,
  'snippet': 'tr\nfold',
  'intent': 'Script works differently when ran from the terminal and ran from Python',
  'id': '39397034_39398969_13',
  'question_id': 39397034},
 {'parent_answer_post_id': 41140750,
  'prob': 4.903111128849962e-05,
  'snippet': 'Red\nBlue',
  'intent': 'BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are',
  'id': '2957013_41140750_6',
  'question_id': 2957013},
 {'parent_answer_post_id': 5180297,
  'prob': 7.572041759033881e-05,
  'snippet': 'import re\nRawPurchaseAmount',
  'intent': 'Python Remove Comma In Dollar Amount',
  'id': '5180184_5180297_5',
  'question_id': 5180184},
 {'parent_answer_post_id': 5180297,
  'prob': 7.73206863173198e-05,
  'snippet': 'RawPurchaseAmount',
  'intent': 'Python Remove Comma In Dollar Amount',
  'id': '5180184_5180297_4',
  'question_id': 5180184},
 {'parent_answer_post_id': 2957181,
  'prob': 7.832874238417845e-05,
  '

In [11]:
conala_mined_df = pd.DataFrame(conala_mined_data_list)

In [33]:
# Peek at the DataFrame for the mined CoNaLa data.
conala_mined_df.head()

Unnamed: 0,parent_answer_post_id,prob,snippet,intent,id,question_id
0,34705233,0.869,"sorted(l, key=lambda x: (-int(x[1]), x[0]))",Sort a nested list by two elements,34705205_34705233_0,34705205
1,13905946,0.85267,[int(x) for x in str(num)],converting integer to list in python,13905936_13905946_0,13905936
2,13838041,0.852143,c.decode('unicode_escape'),Converting byte string in unicode string,13837848_13838041_0,13837848
3,23490179,0.850829,"parser.add_argument('-t', dest='table', help='...",List of arguments with argparse,23490152_23490179_0,23490152
4,2721807,0.840372,"datetime.datetime.strptime(s, '%Y-%m-%dT%H:%M:...",How to convert a Date string to a DateTime obj...,2721782_2721807_0,2721782


Note that the index is not random: the records are sorted so that the probabilities decrease along the list. That is, the lower the index, the higher the probability score.
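
This ordering can be checked rather than assumed. A minimal sketch, using the five `prob` values shown in the head of the DataFrame above as a toy column:

```python
import pandas as pd

# prob values copied from the first five rows displayed above.
toy_df = pd.DataFrame({"prob": [0.869, 0.85267, 0.852143, 0.850829, 0.840372]})

# True when every value is less than or equal to the one before it.
is_sorted_desc = toy_df["prob"].is_monotonic_decreasing

# Swapping the first two rows breaks the ordering, so the check fails.
swapped = toy_df.iloc[[1, 0, 2, 3, 4]]
swapped_is_sorted_desc = swapped["prob"].is_monotonic_decreasing

print(is_sorted_desc, swapped_is_sorted_desc)
```

Running the same property on the full `conala_mined_df["prob"]` column would confirm whether the whole file is sorted, not just the first few rows.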

In [12]:
# Looking at the DataFrame's statistics for numerical columns.
conala_mined_df.describe()

Unnamed: 0,parent_answer_post_id,prob,question_id
count,593891.0,593891.0,593891.0
mean,18599590.0,0.064413,16724310.0
std,12606240.0,0.085104,12278070.0
min,595.0,3e-05,502.0
25%,7669063.0,0.016376,6196250.0
50%,16570900.0,0.034263,14166150.0
75%,29606530.0,0.074638,25656130.0
max,42773100.0,0.869,42771820.0


The statistics for the `parent_answer_post_id` and `question_id` columns are not very meaningful, since the numbers stored in these columns are nominal identifiers.

The statistics we see for the `prob` column are quite informative here. Of note, the mean probability is very low, at 6.4%, and even the 75th percentile is much lower than expected at 7.5%. This means there are far fewer records where the code snippet is confidently associated with an intent than I was hoping for. This problem is discussed in the study which produced this mined data [[2]](#References): 

> existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained.

To get a better intuition for how this `prob` score describes the association of the given `intent` and `snippet` I manually sampled the data and reviewed it.
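
One way to make this kind of manual review more systematic is to draw a few records from each band of `prob`, so that low, middle and high scores can be compared side by side. The sketch below is only an illustration: the data is randomly generated, and the bin edges and the `prob_bin` column name are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mined DataFrame (not real CoNaLa records).
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "prob": rng.uniform(0, 1, size=200),
    "intent": [f"intent {i}" for i in range(200)],
    "snippet": [f"snippet_{i}()" for i in range(200)],
})

# Assign each record to a probability band (hypothetical bin edges).
bins = [0.0, 0.25, 0.5, 0.75, 1.0]
toy_df["prob_bin"] = pd.cut(toy_df["prob"], bins=bins, include_lowest=True)

# Draw the same number of records from every band for manual review.
per_bin_sample = toy_df.groupby("prob_bin", observed=True).sample(n=3, random_state=0)

print(per_bin_sample[["prob_bin", "prob", "intent", "snippet"]])
```

Because the real `prob` values are heavily concentrated near zero, the bin edges would need to be adapted before applying this to `conala_mined_df`.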

In [102]:
# Sample from the DataFrame, among records with prob > 0.5.
conala_mined_df[conala_mined_df['prob']>0.5].sample(10)

Unnamed: 0,parent_answer_post_id,prob,snippet,intent,id,question_id
2437,38273689,0.531273,""""""""""""".join(map(lambda x: x * 7, 'map'))",How to repeat individual characters in strings...,38273353_38273689_0,38273353
999,25293078,0.607091,df['Season2'] = df['Season'].apply(split_it),applying regex to a pandas dataframe,25292838_25293078_3,25292838
1492,36166644,0.575944,Foo.objects.filter(Q(bar_x__name='bar x') | Q(...,How to traverse a GenericForeignKey in Django?,36164654_36166644_0,36164654
1237,31385715,0.591987,"""""""package ([^\\s]+)\\s+is([\\s\\S]*)end\\s+(p...",python backreference regex,31385457_31385715_0,31385457
2161,12739974,0.54313,"zip(['a', 'c', 'e'], ['b', 'd'])",python convert list to dictionary,6900955_12739974_4,6900955
1049,42462626,0.604554,"df.replace(' ', '_', regex=True)",How to replace the white space in a string in ...,42462530_42462626_0,42462530
143,3805981,0.710027,SomeModel.objects.filter(id=id).delete(),How to delete a record in Django models?,3805958_3805981_0,3805958
43,27760083,0.7574,"driver.execute_script('window.scrollTo(0, docu...",How can I scroll a web page using selenium web...,20986631_27760083_0,20986631
3255,10149202,0.503644,"df.set_index(['Z', 'A', 'pos']).unstack('pos')",Multiindex from array in Pandas with non uniqu...,10133021_10149202_0,10133021
3069,6510636,0.509928,[x for x in file.namelist() if x.endswith('/')],How can i list only the folders in zip archive...,6510477_6510636_0,6510477


In [117]:
# Use cell to view full contents of intent and snippet strings for specific 
# records in the sample, by its index.
ind = 143
print(conala_mined_df.loc[ind,'intent'])
print(conala_mined_df.loc[ind,'snippet'])

How to delete a record in Django models?
SomeModel.objects.filter(id=id).delete()


In [118]:
# Append this record to a helper csv file to aid manually viewing individual
# records from the random sample. 
conala_mined_df.loc[[ind]].to_csv('prob_explore.csv', mode='a', header=False)

In [31]:
# Then look at another sample of records with prob > 0.5.
conala_mined_df[conala_mined_df['prob']>0.5].sample(10)

Unnamed: 0,parent_answer_post_id,prob,snippet,intent,id,question_id
1735,3809287,0.563476,(a.T * b).T,numpy matrix multiplication,3809265_3809287_0,3809265
1985,6432924,0.551054,"[[1, 2, 5], [3, 4, 5]]",How to modify elements of iterables with itera...,6432898_6432924_0,6432898
2472,7142731,0.529971,"time.strptime('2011-03-06T03:36:45+0000', '%Y-...",Parse FB Graph API date string into python dat...,7142618_7142731_0,7142618
323,8188287,0.676402,plt.show(),How do I autosize text in matplotlib python?,8182124_8188287_0,8182124
1950,21882971,0.55285,plt.savefig('temp.png'),How to show matplotlib plots in python,8575062_21882971_0,8575062
1865,17368230,0.556817,"connection.uid('STORE', '-FLAGS', '(\\Seen)')",python imaplib - mark email as unread or unseen,17367611_17368230_0,17367611
1172,10012452,0.596319,session.query(Shots).filter_by(event_id=event_id),How do I model a many-to-many relationship ove...,9995999_10012452_1,9995999
2149,13811775,0.543824,plt.show(),Plot only on continent in matplotlib,13796315_13811775_0,13796315
3182,16883459,0.506159,"codecs.open('myfile', 'r', 'iso-8859-1').read()","How to read a ""C source, ISO-8859 text""",16883447_16883459_0,16883447
2808,13368753,0.518488,"[list(v) for k, v in itertools.groupby(mylist,...",Python split a list into subsets based on pattern,13368723_13368753_0,13368723


In [120]:
# Count how many records have a probability greater than `prob_thresh`
prob_thresh = 0.5
len(conala_mined_df[conala_mined_df["prob"]>prob_thresh])

3385
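
Since the choice of threshold trades data volume against label quality, it can help to count the surviving records at several thresholds at once. A minimal sketch on a toy `prob` column; both the values and the list of thresholds are assumptions for illustration:

```python
import pandas as pd

# Toy probabilities standing in for conala_mined_df["prob"].
toy_df = pd.DataFrame({"prob": [0.9, 0.7, 0.55, 0.4, 0.2, 0.05, 0.01]})

# Hypothetical thresholds to compare.
thresholds = [0.1, 0.3, 0.5, 0.7]

# Number of records strictly above each threshold.
counts = {t: int((toy_df["prob"] > t).sum()) for t in thresholds}

for threshold, count in counts.items():
    print(f"prob > {threshold}: {count} records")
```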

We have three DataFrames now. Let's look at their shapes and column names. We should compare them to make sure they are consistent, or create a plan for making them consistent.

In [18]:
# Summarize the shapes
print("Shape of CoNaLa train df:", conala_train_df.shape)
print("Shape of CoNaLa test df:", conala_test_df.shape)
print("Shape of CoNaLa mined df:", conala_mined_df.shape)

print("\nColumns of CoNaLa train df:\n", conala_train_df.columns)
print("\nColumns of CoNaLa test df:\n", conala_test_df.columns)
print("\nColumns of CoNaLa mined df:\n", conala_mined_df.columns)

Shape of CoNaLa train df: (2379, 4)
Shape of CoNaLa test df: (500, 4)
Shape of CoNaLa mined df: (3000, 6)

Columns of CoNaLa train df:
 Index(['intent', 'rewritten_intent', 'snippet', 'question_id'], dtype='object')

Columns of CoNaLa test df:
 Index(['intent', 'rewritten_intent', 'snippet', 'question_id'], dtype='object')

Columns of CoNaLa mined df:
 Index(['parent_answer_post_id', 'prob', 'snippet', 'intent', 'id',
       'question_id'],
      dtype='object')


The `conala_mined_df` has three columns that are the same as the other two, and is missing one column. 

Same columns:
- `question_id`
- `snippet`
- `intent`

Missing Column:
- `rewritten_intent`

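The `question_id` appears to be used as an index. I should be able to do the same, but I'll have to check that the values are unique and that they don't overlap with those in the mined data. The training records displayed above already list `question_id` 41067960 twice, so uniqueness can't be assumed.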
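
The two checks on `question_id` could be sketched as follows. The training ids are the five shown in the head of the training DataFrame above; the first two mined ids come from the head of the mined DataFrame, while the third is a hypothetical value added only to force an overlap.

```python
import pandas as pd

# question_id values from the first five training rows displayed above.
train_ids = pd.Series([41067960, 41067960, 4170655, 29565452, 13704860])

# First two ids are from the mined DataFrame head; the last one is a
# hypothetical value, included so that the overlap check has something to find.
mined_ids = pd.Series([34705205, 13905936, 4170655])

# Check 1: does any id appear more than once within the training set?
train_has_duplicates = bool(train_ids.duplicated().any())

# Check 2: which ids appear in both data sets?
overlapping_ids = set(train_ids.tolist()) & set(mined_ids.tolist())

print(train_has_duplicates, overlapping_ids)
```

If either check finds something on the real data, `question_id` alone cannot serve as a unique index across the combined data.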

The `snippet` column contains the snippets of code. It's very likely that this column shouldn't be modified before being vectorized, since this is the data that I'm trying to interpret and I do not want to introduce bias. When I vectorize this column shortly, there will be some decisions to be made about how to break up the data.

The `intent` column contains plain-English questions submitted by developers to [Stack Overflow](https://stackoverflow.com) in order to achieve certain tasks. This contains information about the **desired intent**. Unfortunately, most of these are written as questions, so the intent is not structured as I'd prefer. 

The `rewritten_intent` column has been reviewed by the CoNaLa team to create a more plain-English description of the code in question. This will be extremely helpful, since it represents some preliminary cleaning which has already been done for me. Unfortunately this doesn't exist for the `conala_mined_df`, so I'll have to determine how to either create it or handle the disparity.
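
One possible way to handle the disparity, offered only as a sketch, is to build a single text column that uses `rewritten_intent` where it exists and falls back to `intent` otherwise. The column name `intent_text` is hypothetical; the example strings are taken from the records displayed earlier.

```python
import pandas as pd

# Curated-style records: the second one has no rewritten_intent.
curated_df = pd.DataFrame({
    "intent": [
        "Decode Hex String in Python 3",
        "How to convert a string from CP-1251 to UTF-8?",
    ],
    "rewritten_intent": ["decode a hex string '4a4b4c' to UTF-8.", None],
})

# Mined-style records only have the original intent.
mined_df = pd.DataFrame({"intent": ["Sort a nested list by two elements"]})

# Hypothetical unified column: prefer rewritten_intent, fall back to intent.
curated_df["intent_text"] = curated_df["rewritten_intent"].fillna(curated_df["intent"])
mined_df["intent_text"] = mined_df["intent"]

# Both frames now share a column and can be stacked.
combined = pd.concat(
    [curated_df[["intent_text"]], mined_df[["intent_text"]]],
    ignore_index=True,
)
print(combined)
```

This differs from filling missing values with an empty string: the fallback keeps some text for every record, at the cost of mixing question-style and description-style wording in one column.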

This is a good time to export the data to a readable format for me to review visually.

In [None]:
# Export to csv for readability. Each DataFrame gets its own file.
conala_train_df.to_csv(r'conala_train_df.csv')
conala_test_df.to_csv(r'conala_test_df.csv')
# This is a big dataset. Note that sample(frac=1) only shuffles the rows;
# use a smaller frac to export a reduced subset.
conala_mined_df.sample(frac=1).to_csv(r'conala_mined_df.csv')

## Count Vectorizing
[[Back To TOC]](#Table-of-Contents)

### Vectorizing `conala_train_df`

In [39]:
# Check for nan
conala_train_df.isna().sum()

intent               0
rewritten_intent    79
snippet              0
question_id          0
dtype: int64

In [42]:
# Fill with ""
conala_train_df.fillna('', inplace=True)

conala_train_df.isna().sum()

intent              0
rewritten_intent    0
snippet             0
question_id         0
dtype: int64

In [70]:
# 1. Instantiate 
conala_train_bagofwords = CountVectorizer(stop_words="english")

# 2. Fit 
conala_train_bagofwords.fit(conala_train_df["rewritten_intent"])

# 3. Transform
conala_train_bag_df = conala_train_bagofwords.transform(conala_train_df["rewritten_intent"])
conala_train_bag_df

<2379x2348 sparse matrix of type '<class 'numpy.int64'>'
	with 13298 stored elements in Compressed Sparse Row format>
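
To see what the instantiate, fit and transform steps produce, here is a minimal sketch on two `rewritten_intent` strings taken from the records shown earlier. It uses the same `CountVectorizer(stop_words="english")` settings as the cell above; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two rewritten intents copied from the CoNaLa records displayed earlier.
toy_intents = [
    "convert a list of integers into a single integer",
    "check if all elements in list `myList` are identical",
]

# Fit and transform in one step; the result is a sparse document-term matrix.
vectorizer = CountVectorizer(stop_words="english")
toy_bag = vectorizer.fit_transform(toy_intents)

# Learned vocabulary in alphabetical order, which matches the column order.
vocabulary = sorted(vectorizer.vocabulary_)

# Dense view, only sensible for a tiny example like this one.
dense_counts = toy_bag.toarray()

print(vocabulary)
print(dense_counts)
```

Note that the default settings lowercase every token, so the variable name `myList` is stored as `mylist`. That is harmless for intents but will matter if the same approach is applied to the `snippet` column.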

### Vectorizing `conala_test_df`

In [65]:
# Check for nan
conala_test_df.isna().sum()

intent              0
rewritten_intent    0
snippet             0
question_id         0
dtype: int64

In [66]:
# Fill with ""
conala_test_df.fillna('', inplace=True)

conala_test_df.isna().sum()

intent              0
rewritten_intent    0
snippet             0
question_id         0
dtype: int64

In [71]:
# Transform with the bag of words from the train df
conala_test_bag_df = conala_train_bagofwords.transform(conala_test_df["rewritten_intent"])
conala_test_bag_df

<500x2348 sparse matrix of type '<class 'numpy.int64'>'
	with 2376 stored elements in Compressed Sparse Row format>
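
Because the test set is transformed with the vocabulary learned from the training set, any word that never appeared in training is silently dropped. A minimal sketch with made-up training intents (assumptions, not taken from the data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training intents that define the vocabulary.
train_intents = ["sort list in reverse order", "convert string to integer"]
vectorizer = CountVectorizer(stop_words="english").fit(train_intents)

# Every non-stop word here was seen during fitting.
seen = vectorizer.transform(["convert list to string"])

# None of these words were seen during fitting.
unseen = vectorizer.transform(["scroll a web page using selenium"])

# nnz is the number of non-zero entries stored in the sparse matrix.
print(seen.nnz, unseen.nnz)
```

Both results have the same number of columns as the training matrix, but the second row is entirely empty, which is worth keeping in mind when reading the stored-element counts for data sets the vocabulary was not fitted on.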

### Vectorizing `conala_mined_df`

In [56]:
# Check for nan
conala_mined_df.isna().sum()

parent_answer_post_id    0
prob                     0
snippet                  0
intent                   0
id                       0
question_id              0
dtype: int64

In [72]:
# Transform with the bag of words from the train df
conala_mined_bag_df = conala_train_bagofwords.transform(conala_mined_df["intent"])
conala_mined_bag_df

<593891x2348 sparse matrix of type '<class 'numpy.int64'>'
	with 2273980 stored elements in Compressed Sparse Row format>

The mined data is covered much more thinly by this vocabulary than the curated data is. The mined matrix stores 2,273,980 non-zero elements across 593,891 records (about 3.8 per record), whereas the training matrix stores 13,298 across 2,379 records (about 5.6 per record). This may not be a good way of interpreting the mined code. There are some things we can try or consider: 

- Graph `min_df` against vocabulary size and the number of stored elements, as well as the ratios vocabulary:records and elements:records. 
- Only some of the mined records have a high probability anyway, so we could filter out low probabilities and try again.
- We could filter out the rows that contain no words from the bag of words, which would cut down the data anyway.
- Maybe both.
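
The two filtering ideas above can be sketched together as follows. Everything here is toy data: the training intents, the mined records, and the 0.5 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training intents that define the vocabulary.
train_intents = ["sort list in reverse order", "convert string to integer"]

# Hypothetical mined records with their probability scores.
mined_df = pd.DataFrame({
    "intent": [
        "convert integer to list",
        "scroll a web page using selenium",
        "sort a nested list",
    ],
    "prob": [0.8, 0.9, 0.1],
})

vectorizer = CountVectorizer(stop_words="english").fit(train_intents)
mined_bag = vectorizer.transform(mined_df["intent"])

# Total count of in-vocabulary words in each mined record.
hits_per_row = np.asarray(mined_bag.sum(axis=1)).ravel()

# Keep records that are both confident enough and covered by the vocabulary.
prob_thresh = 0.5
keep = (mined_df["prob"].to_numpy() > prob_thresh) & (hits_per_row > 0)
keep_idx = np.flatnonzero(keep)

filtered_df = mined_df.iloc[keep_idx]
filtered_bag = mined_bag[keep_idx]

print(filtered_df)
```

In this toy case only the first record survives: the second has a high score but no known words, and the third has known words but a low score.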

## Intent Paradigms
[[Back To TOC]](#Table-of-Contents)

Candidate intent categories (not necessarily mutually exclusive):
- String manipulation
- List manipulation 
- Type change
- Regular Expression
- DataFrame Manipulation
- Find object  





# Modelling and Analysis
[[Back To TOC]](#Table-of-Contents)


# Conclusion
[[Back To TOC]](#Table-of-Contents)


# References
[[Back To TOC]](#Table-of-Contents)

[1] CoNaLa: The Code/Natural Language Challenge. 2020. [online] Available at: <https://conala-corpus.github.io/#dataset-information> [Accessed 13 November 2020].

[2] Yin, P., Deng, B., Chen, E., Vasilescu, B. and Neubig, G. 2018. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. arXiv:1805.08949v1, 23 May 2018. Carnegie Mellon University, USA. [online] Available at: <https://arxiv.org/pdf/1805.08949.pdf>