# E-mail Validation in Software 2.0

A few weeks ago I read Andrej Karpathy's article on [Software 2.0](https://medium.com/@karpathy/software-2-0-a64152b37c35). It paints an enlightening picture of what the future of software development could look like with tools like deep learning. Pete Warden wrote a [follow-up](https://petewarden.com/2017/11/13/deep-learning-is-eating-software/) to Karpathy's post that explores even more possibilities on the subject. That got me thinking about how some of the simple things programmers do could be automated: developing validators, cleaning up input text, parsing text, data, and images, and so on.

So, I figured it might be neat to write an e-mail validation tool. 

![E-mail validation is pretty popular!](email_validation.png)

E-mail validation happens everywhere, and I'd wager that every programmer has either used an off-the-shelf validator or written a simple one themselves. Getting it right is [far from trivial](http://www.regular-expressions.info/email.html), and most validators are not 100% correct.
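
To make the "far from trivial" point concrete, here is a minimal sketch of the kind of hand-written validator most of us have used at some point. The pattern and the names `SIMPLE_EMAIL_RE` and `looks_like_email` are assumptions made up for illustration; they are not taken from the linked article and they do not implement RFC 5322.

```python
import re

# Deliberately simplistic pattern (an illustrative assumption, not RFC 5322):
# a local part, "@", a dotted domain, and a top-level domain of 2+ letters.
SIMPLE_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_like_email(text):
    """Return True if the whole string matches the simplistic pattern."""
    return SIMPLE_EMAIL_RE.fullmatch(text) is not None


if __name__ == "__main__":
    candidates = [
        "test@gmail.com",      # accepted
        "example@",            # rejected: no domain
        "@example.com",        # rejected: no local part
        "matt!@whatsup.net",   # rejected, although "!" is allowed in a local part
        "user@example..com",   # accepted, although the domain has an empty label
    ]
    for candidate in candidates:
        print(candidate, looks_like_email(candidate))
```

The last two cases show both kinds of mistake: the pattern rejects an address that the e-mail syntax rules permit, and it accepts one with a malformed domain.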

A natural test of what Software 2.0 looks like is to ask, "Could someone develop a deep learning model to validate e-mails?" Borrowing Warden's terms, rather than programming a validator, the software engineer is now teaching a model how to validate e-mails. The real test is whether the validator works not only on unseen e-mails, but also on valid e-mails that look completely novel.

## Building some tools to help

Before training any models, we build some helper functions to generate both "real-looking" and non-real-looking e-mail addresses.

In [1]:
import random
import string

import scipy.stats

def random_letters(length=10):
    """Generates random letters with a certain length"""
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(length))

def fake_email_generator(n=10):
    """Generates a fake 'valid-looking' email address"""
    
    domains = ['gmail.com', 'yahoo.com', 'aol.com', 'hotmail.com',
              'live.com', 'mail.ru']
    for _ in range(n):
        l = scipy.stats.poisson(9).rvs() + 1  # Strictly positive Poisson with mean 10
        domain = random.choice(domains)
        yield random_letters(l) + '@' + domain
        
def fake_non_email_generator(n=10):
    """Generates a fake string that isn't an email address"""
    for _ in range(n):
        l = scipy.stats.poisson(9).rvs() + 1
        yield random_letters(l)

In [2]:
print(list(fake_email_generator()))
print(list(fake_non_email_generator()))

['rpqlkssdapdaxwqmsc@aol.com', 'odwwohwtjc@yahoo.com', 'fociqvyzk@mail.ru', 'spaxeres@gmail.com', 'gxjbwt@hotmail.com', 'tzodige@gmail.com', 'vlighecxss@live.com', 'bgtyumzk@mail.ru', 'ykzcjesfiii@yahoo.com', 'omytwerxyfzbysm@yahoo.com']
['wummyupw', 'foqphyljepyofq', 'tgzfvidt', 'qsduzjzp', 'znyjexojvt', 'jwhmffyewu', 'njsfvscwidd', 'oupneoony', 'wfmrmr', 'cnrugyhvbjb']


In [3]:
import numpy as np

n_fake, n_real = 100, 50

raw_email_strings = list(fake_non_email_generator(n_fake)) + list(fake_email_generator(n_real))
y = np.concatenate([np.zeros((n_fake,)), np.ones((n_real,))])

In [4]:
# Perform some random shuffling of the data
# NB: shuffling needs to be the same between strings and y

shuffle_idx = np.arange(len(raw_email_strings))
np.random.shuffle(shuffle_idx)
shuffle_idx


raw_email_strings = [
    raw_email_strings[i] for i in shuffle_idx
]
y = y[shuffle_idx]

In [5]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tzr = Tokenizer(filters='', char_level=True)
tzr.fit_on_texts(raw_email_strings)
email_sequences = pad_sequences(tzr.texts_to_sequences(raw_email_strings))

Using TensorFlow backend.


In [6]:
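# One embedding row per distinct character, plus one for index 0,
# which the tokenizer never assigns and pad_sequences uses as padding
num_characters = len(tzr.word_index) + 1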
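
For readers who have not used these Keras utilities, the following is a rough pure-Python sketch of what the character-level tokenizer and `pad_sequences` do with their default settings, and why the vocabulary size needs the extra `+ 1`. The helpers `fit_char_index` and `texts_to_padded` are hypothetical names written for this post; they mimic the behaviour only approximately and are not the Keras implementation.

```python
from collections import Counter


def fit_char_index(texts):
    """Map each character to an integer >= 1, most frequent character first."""
    counts = Counter(ch for text in texts for ch in text.lower())
    # Index 0 is never assigned, so it stays free to act as the padding value.
    return {ch: i + 1 for i, (ch, _) in enumerate(counts.most_common())}


def texts_to_padded(texts, char_index, maxlen=None):
    """Encode texts as index lists, then left-pad with zeros to a common length."""
    # Characters that were not seen while fitting are silently skipped.
    seqs = [[char_index[ch] for ch in text.lower() if ch in char_index]
            for text in texts]
    if maxlen is None:
        maxlen = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        s = s[max(0, len(s) - maxlen):]             # too long: drop from the front
        padded.append([0] * (maxlen - len(s)) + s)  # too short: zeros on the left
    return padded


if __name__ == "__main__":
    index = fit_char_index(["ab@c.com", "abc"])
    print(len(index) + 1)                           # 7 distinct characters + padding = 8
    print(texts_to_padded(["abc", "a!b"], index))   # [[2, 3, 1], [0, 2, 3]]
```

Note how the `!` in the second string simply disappears. The same thing would happen to `matt!@whatsup.net` if `!` never occurred in the training strings.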

In [7]:
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, Embedding, LSTM

In [8]:
model = Sequential()
model.add(Embedding(num_characters, 10))
model.add(LSTM(200))
model.add(Dense(200, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [9]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', 
              metrics=['acc'])

In [10]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 10)          290       
_________________________________________________________________
lstm_1 (LSTM)                (None, 200)               168800    
_________________________________________________________________
dense_1 (Dense)              (None, 200)               40200     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 201       
Total params: 209,491
Trainable params: 209,491
Non-trainable params: 0
_________________________________________________________________
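
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard formulas for each layer type with biases enabled and a vocabulary size of 29, which is what the 290 embedding parameters imply (28 distinct characters plus the padding index). The function names are made up for this check.

```python
def embedding_params(vocab_size, dim):
    """One dense vector of size `dim` per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four weight blocks (input, forget, output gates and the cell candidate),
    each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


if __name__ == "__main__":
    counts = [
        embedding_params(29, 10),   # 290
        lstm_params(10, 200),       # 168800
        dense_params(200, 200),     # 40200
        dense_params(200, 1),       # 201
    ]
    print(counts, sum(counts))      # total: 209491
```

Almost all of the capacity sits in the LSTM layer, which is a lot of parameters for 150 training strings.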


In [11]:
history = model.fit(email_sequences,
                    y,
                    batch_size=10,
                    epochs=10,
                    validation_split=0.2)

Train on 120 samples, validate on 30 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [12]:
test_email = ['test@gmail.com',
              'another.example@yahoo.com',
              'example@',
              '@example.com',
              'testtesttesttesttesttesttest',
              'gmail.com',
              'matt!@whatsup.net',
              'jason@mindspring.org',
              'blah@gatech.edu',
              '@yahoo.com']
test_seq = tzr.texts_to_sequences(test_email)

In [13]:
test_seq = pad_sequences(test_seq, maxlen=email_sequences.shape[-1])

In [15]:
np.set_printoptions(formatter={'float': '{:2f}'.format})
predictions = model.predict(test_seq)
predictions

array([[0.827229],
       [0.955771],
       [0.199287],
       [0.784823],
       [0.405493],
       [0.630595],
       [0.382036],
       [0.727012],
       [0.700789],
       [0.682027]], dtype=float32)

With a cut-off of 0.7, the predictions split as follows.

Positive predictions:

In [31]:
np.array(test_email)[predictions.reshape(-1) > 0.7]

array(['test@gmail.com', 'another.example@yahoo.com', '@example.com',
       'jason@mindspring.org', 'blah@gatech.edu'],
      dtype='<U28')

Negative predictions:

In [32]:
np.array(test_email)[predictions.reshape(-1) < 0.7]

array(['example@', 'testtesttesttesttesttesttest', 'gmail.com',
       'matt!@whatsup.net', '@yahoo.com'],
      dtype='<U28')
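
The same split can be reproduced without NumPy. This sketch copies the test strings and predicted scores from the cells above and uses `>=` for the positive side, an assumption on my part so that a score of exactly 0.7 would not fall through both strict comparisons.

```python
test_email = ['test@gmail.com', 'another.example@yahoo.com', 'example@',
              '@example.com', 'testtesttesttesttesttesttest', 'gmail.com',
              'matt!@whatsup.net', 'jason@mindspring.org', 'blah@gatech.edu',
              '@yahoo.com']

# Scores copied from the model output shown earlier
scores = [0.827229, 0.955771, 0.199287, 0.784823, 0.405493,
          0.630595, 0.382036, 0.727012, 0.700789, 0.682027]

cutoff = 0.7

# Every string lands in exactly one of the two groups
positives = [email for email, score in zip(test_email, scores) if score >= cutoff]
negatives = [email for email, score in zip(test_email, scores) if score < cutoff]

if __name__ == "__main__":
    print(positives)
    print(negatives)
```

Several scores sit very close to the cut-off (`blah@gatech.edu` at 0.700789 and `@yahoo.com` at 0.682027), so a slightly different threshold would change which strings count as valid.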

Overall, our neural network trained on a fairly simplistic (and synthetic) dataset of e-mails is able to hone in on a few features to determine 