# ML-SQL language (1st-take)

## Authors

Written by: Neeraj Asthana (under Professor Robert Brunner)

University of Illinois at Urbana-Champaign

Summer 2016

## Acknowledgements

Followed Tutorial at: http://www.onlamp.com/lpt/a/6435

## Description

This notebook experiments with constructs for the ML-SQL language. The goal is to parse ML-SQL syntax and translate each command into actionable directives in Python.

___

In [142]:
#Libraries
#from pyparsing import Word, Literal, alphas, Optional, OneOrMore, Group, Or, Combine, oneOf
from pyparsing import *
import string
import sys
import pandas as pd

___

### Grammar Definition

Literals and valid symbols available in the ML-SQL language.

In [121]:
letters = string.ascii_letters
punctuation = string.punctuation
numbers = string.digits
whitespace = string.whitespace

#combinations
everything = letters + punctuation + numbers
everythingWOQuotes = everything.replace("\"", "").replace("'", "")

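#Booleans: a list of alternatives (consumed by Or(bools) in the READ grammar)
#Using + here would build a sequence that requires "True" followed by "False"
bools = [Literal("True"), Literal("False")]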

#Parenthesis and Quotes
openParen = Literal("(").suppress()
closeParen = Literal(")").suppress()
Quote = Literal('"').suppress()

#every allowed character (letters, punctuation, digits) except whitespace
everything

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~0123456789'
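
To see what these building blocks do in isolation, the short sketch below parses two made-up inputs. It suggests that `suppress()` keeps a literal out of the result while still requiring it to match, and that `Word(everything)` consumes characters up to the first whitespace. The variable names and sample strings here are illustrative assumptions and are not part of the notebook.

```python
import string
from pyparsing import Literal, Word

# Same character set as the notebook's `everything` (no whitespace)
everything = string.ascii_letters + string.punctuation + string.digits

open_paren = Literal("(").suppress()
close_paren = Literal(")").suppress()

# Suppressed literals must match but are left out of the result
wrapped = open_paren + Word(string.ascii_letters) + close_paren
wrapped_tokens = wrapped.parseString("( hello )").asList()

# Word(everything) stops at the first whitespace character
path_tokens = Word(everything).parseString("/tmp/iris.data rest").asList()

print(wrapped_tokens)
print(path_tokens)
```

Because `everything` contains parentheses and quotes, a filename written directly against an opening parenthesis would swallow it, so statements need whitespace between the filename and its options.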

___

### READ

In [137]:
filename = Word(everything).setResultsName("filename")

#accept multiple spellings of the READ keyword
readKeyword = oneOf(["Read", "READ"]).suppress()

#Define Read Optionals
#header
headerLiteral = (Literal("header") + Literal("=")).suppress()
header = Optional(headerLiteral + Or(bools).setResultsName("header"), default = "False" )

#separator
separatorLiteral = (Or([Literal("sep"), Literal("separator")]) + Literal("=")).suppress()
definesep = Quote + Word(everythingWOQuotes + whitespace).setResultsName("sep") + Quote
separator = Optional(separatorLiteral + definesep, default = ",")

#Compose Read Optionals
readOptions = Optional(openParen + separator + header + closeParen)

read = readKeyword + filename + readOptions
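
One way to check how the optional clauses behave is to parse statements with and without the parenthesised options. The sketch below rebuilds a compact version of the READ grammar (using `oneOf` for the alternatives) and fills in defaults with `ParseResults.get`. That choice is an assumption made for this example: the results names sit on the inner expressions, so a default passed to `Optional` may not be reachable by name. `parse_read` and the sample statements are hypothetical.

```python
import string
from pyparsing import Literal, Optional, Word, oneOf

everything = string.ascii_letters + string.punctuation + string.digits
no_quotes = everything.replace('"', "").replace("'", "")

open_paren = Literal("(").suppress()
close_paren = Literal(")").suppress()
quote = Literal('"').suppress()

filename = Word(everything).setResultsName("filename")
read_keyword = oneOf(["Read", "READ"]).suppress()

# header=True|False (optional)
header_literal = (Literal("header") + Literal("=")).suppress()
header = Optional(header_literal + oneOf(["True", "False"]).setResultsName("header"))

# sep="<chars>" or separator="<chars>" (optional)
sep_literal = (oneOf(["sep", "separator"]) + Literal("=")).suppress()
sep_value = quote + Word(no_quotes + string.whitespace).setResultsName("sep") + quote
separator = Optional(sep_literal + sep_value)

read_options = Optional(open_paren + separator + header + close_paren)
read = read_keyword + filename + read_options

def parse_read(statement):
    """Hypothetical helper: parse a READ statement into a plain dict."""
    result = read.parseString(statement)
    return {
        "filename": result.filename,
        "sep": result.get("sep", ","),            # default separator
        "header": result.get("header", "False"),  # default: no header row
    }

print(parse_read("READ iris.data"))
print(parse_read('READ iris.data (sep=";" header=True)'))
```

Returning a plain dict keeps the parsing step separate from whatever later acts on the statement, such as a call to `pd.read_csv`.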

In [149]:
readTest = 'READ /home/ubuntu/notebooks/ML-SQL/Classification/iris.data (sep="," header=False)'

readTestResult = read.parseString(readTest)

filename = readTestResult.filename
header = readTestResult.header
sep = readTestResult.sep

#Function to map the parsed string "True" or "False" to the value pandas expects for its header argument
def str_to_header(s):
    if s == 'True':
        return 0     #the first row holds the column names
    elif s == 'False':
        return None  #the file has no header row
    else:
        raise ValueError("Cannot convert value " + s + " to a header setting")

#read parameters from parsed statement and read the file
f = pd.read_csv(filename, sep = sep, header = str_to_header(header))
f.head()

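|   | 0   | 1   | 2   | 3   | 4           |
|---|-----|-----|-----|-----|-------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |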


___

### SPLIT

Splits the dataset into training, testing, and validation sets. Specify the three proportions as non-negative decimals that sum to 1. The grammar below accepts only values written with a leading decimal point, such as `.6`.

In [165]:
#accept multiple spellings of the SPLIT keyword
splitKeyword = oneOf(["Split", "SPLIT"]).suppress()

#Phrases used to organize splits
trainPhrase = (Literal("train") + Literal("=")).suppress()
testPhrase = (Literal("test") + Literal("=")).suppress()
valPhrase = (Literal("validation") + Literal("=")).suppress()

#train, test, validation split values
trainS = Combine(Literal(".") + Word(numbers)).setResultsName("train_split")
testS = Combine(Literal(".") + Word(numbers)).setResultsName("test_split")
valS = Combine(Literal(".") + Word(numbers)).setResultsName("validation_split")

#Compose phrases and values together 
training = trainPhrase + trainS
testing = testPhrase + testS
val = valPhrase + valS

#Creating Optional Split phrase
ocomma = Optional(",").suppress()
split = Optional(splitKeyword + openParen + training + ocomma + testing + ocomma + val + closeParen)

#Combining READ and SPLIT keywords into one clause for combined use
read_split = read + split
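
As a rough illustration of how the combined clause might be exercised, the sketch below parses a single statement that contains both READ and SPLIT. It is an assumption-laden example: the READ part is reduced to keyword plus filename, the `fraction` and `keyed` helpers are hypothetical, and the sample statement is made up.

```python
import string
from pyparsing import Combine, Literal, Optional, Word, oneOf

everything = string.ascii_letters + string.punctuation + string.digits
open_paren = Literal("(").suppress()
close_paren = Literal(")").suppress()

# Simplified READ: keyword plus filename only (options omitted for brevity)
read = oneOf(["Read", "READ"]).suppress() + Word(everything).setResultsName("filename")

def fraction(name):
    """A leading-dot decimal such as .6, stored under the given results name."""
    return Combine(Literal(".") + Word(string.digits)).setResultsName(name)

def keyed(key, name):
    """A `key = .N` pair; the key and the equals sign are suppressed."""
    return (Literal(key) + Literal("=")).suppress() + fraction(name)

comma = Optional(",").suppress()
split = Optional(
    oneOf(["Split", "SPLIT"]).suppress() + open_paren
    + keyed("train", "train_split") + comma
    + keyed("test", "test_split") + comma
    + keyed("validation", "validation_split") + close_paren
)

read_split = read + split

statement = "READ iris.data SPLIT (train = .6, test = .2, validation = .2)"
result = read_split.parseString(statement)
print(result.filename, result.train_split, result.test_split, result.validation_split)

# SPLIT is optional, so a bare READ statement still parses
read_only = read_split.parseString("READ iris.data")
print(read_only.asList())
```

Since `split` is wrapped in `Optional`, the same `read_split` expression should cover statements with or without a SPLIT clause.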

In [166]:
#Split test
splitTest = "SPLIT (train = .6, test = .2, validation = .2)"

print(split.parseString(splitTest))

['.6', '.2', '.2']
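
The parsed split values are strings, so acting on them needs a conversion and a check that they sum to 1. The following is a minimal sketch of that step using only the standard library. `to_fractions` and `split_rows` are hypothetical helpers, and the shuffle-then-slice strategy is an assumption, not the notebook's method.

```python
import random

def to_fractions(tokens):
    """Convert parsed strings such as '.6' to floats and validate them."""
    fractions = [float(t) for t in tokens]
    if len(fractions) != 3 or any(f < 0 for f in fractions):
        raise ValueError("expected three non-negative fractions")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return fractions

def split_rows(rows, tokens, seed=0):
    """Shuffle rows and slice them into train, test, and validation lists."""
    train, test, _ = to_fractions(tokens)
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    n_train = int(round(train * len(shuffled)))
    n_test = int(round(test * len(shuffled)))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],  # validation takes the remainder
    )

train_rows, test_rows, val_rows = split_rows(range(10), ['.6', '.2', '.2'])
print(len(train_rows), len(test_rows), len(val_rows))
```

Giving the validation set whatever remains after the first two slices means every row lands in exactly one set, even when rounding makes the counts uneven.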


In [None]:
#Read with Split test
read_split_test = 