## Exploratory Data Analysis

In [1]:
from data import load_file

In [2]:
inputs, outputs = load_file("inputs/test_set.txt") # load only the test set; the full dataset is a very large file

In [3]:
print('Number of samples in test set:', len(inputs))

Number of samples in test set: 200000


In [4]:
# print the first 10 inputs and targets
# left is the input, right is the target

print(f"{'INPUT':^30}{'TARGET':^36}")
print("-" * 66)
for i in range(10):
    print(f"{inputs[i]:^30} | {outputs[i]:^30}")
    print("-" * 66)

            INPUT                            TARGET               
------------------------------------------------------------------
         -5*h*(5-2*h)          |          10*h**2-25*h         
------------------------------------------------------------------
          s*(8*s-21)           |          8*s**2-21*s          
------------------------------------------------------------------
       (21-t)*(-6*t-4)         |        6*t**2-122*t-84        
------------------------------------------------------------------
       (21-5*c)*(3*c-7)        |       -15*c**2+98*c-147       
------------------------------------------------------------------
          4*n*(n+22)           |          4*n**2+88*n          
------------------------------------------------------------------
        (k+2)*(5*k+29)         |         5*k**2+39*k+58        
------------------------------------------------------------------
       (k-15)*(2*k+29)         |          2*k**2-k-435         
----------------

Besides integers, operators, and parentheses, the data also contains the names of trigonometric functions (tan, cos, sin, etc.).

In [5]:
# longest input and target
print("Longest input:", max(len(i) for i in inputs))
print("Longest target:", max(len(i) for i in outputs))

Longest input: 29
Longest target: 28


In [6]:
import collections, re

In [7]:
def freq(pattern, s):
    return collections.Counter(re.findall(pattern, s)).most_common()

In [8]:
# frequency of every character across inputs and targets
freq(".", "".join(inputs + outputs))

[('*', 1259482),
 ('-', 586615),
 ('2', 548812),
 ('(', 343740),
 (')', 343740),
 ('1', 311256),
 ('+', 249907),
 ('4', 191116),
 ('3', 189635),
 ('6', 170624),
 ('5', 160975),
 ('8', 159509),
 ('7', 129956),
 ('0', 124439),
 ('n', 114056),
 ('s', 113387),
 ('i', 105536),
 ('9', 99978),
 ('c', 57806),
 ('a', 56863),
 ('t', 56709),
 ('o', 56515),
 ('j', 49350),
 ('z', 49306),
 ('y', 48835),
 ('x', 48778),
 ('h', 48540),
 ('k', 48422)]

The only letters that appear are those in the trig function names and those used as variables.

In [9]:
# lowercase terms
freq("[a-z]+", " ".join(inputs + outputs))

[('n', 98493),
 ('i', 97833),
 ('s', 97584),
 ('c', 49706),
 ('j', 49350),
 ('z', 49306),
 ('a', 49003),
 ('t', 48849),
 ('y', 48835),
 ('x', 48778),
 ('h', 48540),
 ('k', 48422),
 ('o', 48415),
 ('cos', 8100),
 ('tan', 7860),
 ('sin', 7703)]

sin, cos, and tan are the only multi-letter terms, so they are the trig functions. Every other term is a single letter used as a variable.
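The split between function names and variables can also be derived from the counts rather than read off by eye. The following is a minimal sketch, assuming that any multi-letter run is a function name and any single letter is a variable; `samples` is a hypothetical stand-in for `inputs + outputs`.

```python
import collections
import re

# Hypothetical sample expressions standing in for `inputs + outputs`.
samples = ["s*(8*s-21)", "8*s**2-21*s", "sin(c)*(c-3)", "-2*tan(t)**2"]

# Count every run of lowercase letters, as in the cell above.
term_counts = collections.Counter(re.findall(r"[a-z]+", " ".join(samples)))

# Assumption: multi-letter terms are function names, single letters are variables.
functions = sorted(t for t in term_counts if len(t) > 1)
variables = sorted(t for t in term_counts if len(t) == 1)

print("functions:", functions)
print("variables:", variables)
```

On the real data this should reproduce the three trig names and the thirteen single-letter variables listed in the output above.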

In [10]:
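# symbol terms
# note: inside [...] the "|" is a literal character and "|-|" parses as a range,
# so "-" is not matched by this pattern and is absent from the counts
freq(r"[\*|-|\(|\)|\+|=]+", " ".join(inputs + outputs))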

[('*', 649932),
 ('+', 245303),
 ('**', 201702),
 (')', 200402),
 ('(', 152467),
 (')*(', 126574),
 ('*(', 63186),
 (')**', 6490),
 (')+', 4604),
 ('))*(', 1513),
 (')*', 1504),
 ('))*', 389),
 ('))', 181)]
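The counts above contain no `-`, even though it is the second most frequent character. Inside a character class `|` is literal and `|-|` is read as a one-character range, so the minus sign is never matched. A minimal sketch of the difference, using two strings from the sample table earlier and a hypothetical alternative pattern `with_minus`:

```python
import collections
import re

# Two strings from the sample table, standing in for the full data.
samples = ["(21-t)*(-6*t-4)", "6*t**2-122*t-84"]
text = " ".join(samples)

original = r"[\*|-|\(|\)|\+|=]+"  # pattern from the cell above
with_minus = r"[-*()+=]+"         # hypothetical variant: "-" placed first is literal

original_counts = collections.Counter(re.findall(original, text))
minus_counts = collections.Counter(re.findall(with_minus, text))

print(original_counts.most_common())
print(minus_counts.most_common())
```

With the variant pattern, runs such as `)*(-` are counted as a single symbol term, so the numbers for `)*(` and `*(` on the full dataset would differ from those shown above.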

### We can use all of the above information to establish a 'vocabulary' for the model.

### This vocabulary consists of:

Trigonometric Functions
* sin, cos, tan

Digits
* 0-9

Parentheses:
* (, ), which also appear in runs such as )) and ))*(

Operators:
* +, -, *, **

Variables:
* s, i, n, c, z, y, h, k, x, o, a, j, t

### We can use this vocabulary to create a tokenizer for the model.

In [11]:
# storing final vocabulary
vocab_items = r"sin|cos|tan|\d|\w|\(|\)|\+|-|\*+"  # raw string so the regex escapes are passed through unchanged
vocab = set(re.findall(vocab_items, " ".join(inputs + outputs))) # set of all unique terms

In [12]:
print("Vocabulary size:", len(vocab))
print("Vocabulary:", sorted(vocab)) 

Vocabulary size: 32
Vocabulary: ['(', ')', '*', '**', '+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'c', 'cos', 'h', 'i', 'j', 'k', 'n', 'o', 's', 'sin', 't', 'tan', 'x', 'y', 'z']
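As an illustration of how this vocabulary could back a tokenizer, here is a minimal sketch. The names `tokenize`, `encode`, and `decode` and the integer id assignment are assumptions made for the example, not part of the original notebook; the pattern and the 32 terms are the ones shown above.

```python
import re

# Same alternation as `vocab_items`, compiled once.
TOKEN_PATTERN = re.compile(r"sin|cos|tan|\d|\w|\(|\)|\+|-|\*+")

# The 32 terms printed above, in sorted order.
vocab = sorted(
    ["(", ")", "*", "**", "+", "-"]
    + list("0123456789")
    + list("achijknostxyz")
    + ["cos", "sin", "tan"]
)

# Hypothetical id assignment: position in the sorted vocabulary.
token_to_id = {term: idx for idx, term in enumerate(vocab)}
id_to_token = {idx: term for term, idx in token_to_id.items()}


def tokenize(expression):
    """Split an expression string into vocabulary terms."""
    return TOKEN_PATTERN.findall(expression)


def encode(expression):
    """Map an expression to a list of integer ids."""
    return [token_to_id[term] for term in tokenize(expression)]


def decode(ids):
    """Map a list of integer ids back to an expression string."""
    return "".join(id_to_token[i] for i in ids)


example = "-2*sin(s)**2"
print(tokenize(example))
print(decode(encode(example)) == example)
```

Because `sin`, `cos`, and `tan` come first in the alternation, they are matched as whole terms before `\w` can split them into single letters.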


### Wrapping up the EDA

In [13]:
# checking that inputs and outputs contain only terms from vocab:
# the matched terms must all be in vocab and, joined back together,
# must reproduce the original string (no character is skipped)
for s in inputs + outputs:
    terms = re.findall(vocab_items, s)
    if "".join(terms) != s or any(term not in vocab for term in terms):
        print("Not fully covered by vocab:", s)

The above cell printed nothing, so every term in the inputs and outputs is in the vocabulary.

In [14]:
# saving the vocabulary
with open("vocab.txt", "w") as f:
    # one line of comma-separated terms, without a trailing separator
    f.write(", ".join(sorted(vocab)))
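To use the saved vocabulary later, the file has to be split on the same `", "` separator. A minimal sketch, assuming a file written term by term with `", "` after each term; the short `terms` list and the temporary path are hypothetical stand-ins for the real vocabulary and `vocab.txt`.

```python
import os
import tempfile

# Hypothetical subset of the vocabulary, already sorted.
terms = ["(", ")", "*", "**", "+", "-", "cos", "sin", "tan"]

# Write to a temporary location instead of the real vocab.txt.
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(path, "w") as f:
    for term in terms:
        f.write(term + ", ")

# Read the terms back, dropping the empty string left by a trailing separator.
with open(path) as f:
    loaded = [term for term in f.read().split(", ") if term]

print(loaded == terms)
```

Splitting on `", "` is safe here because no vocabulary term contains a comma or a space.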