I really liked Andrej Karpathy's article on Software 2.0 (https://medium.com/@karpathy/software-2-0-a64152b37c35), which lays out a vision of what software development could look like with tools like deep learning in the picture. Pete Warden wrote a follow-up that explores even more possibilities (https://petewarden.com/2017/11/13/deep-learning-is-eating-software/), and that got me thinking about how some of the simple things programmers do could be automated.

The first thing that came to mind was writing an email validator. Email validation turns out to be very tricky to get perfect, but most programmers could come up with something sufficient for the majority of cases.
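For comparison, here is a minimal sketch of the kind of hand-written check a programmer might reach for first. The pattern and the `looks_like_email` name are my own illustration, not a standards-compliant validator: it only requires a single "@", no whitespace, and a dot somewhere in the domain.

```python
import re

# Deliberately simple heuristic: something@something.something, no spaces,
# exactly one "@". This is an illustrative pattern, not an RFC-grade validator.
SIMPLE_EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def looks_like_email(text):
    """Return True if text has the rough 'local@domain.tld' shape."""
    return SIMPLE_EMAIL_RE.fullmatch(text) is not None

for candidate in ["test@gmail.com", "example@", "@example.com", "gmail.com"]:
    print(candidate, looks_like_email(candidate))
```

A rule like this handles the common cases but will accept or reject some edge cases incorrectly, which is exactly the "good enough for most inputs" trade-off described above.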

Note: it looks like there's no longer a free download of the Alexa 1M top sites, so for domains, I'll pick them from the [Majestic Million](https://blog.majestic.com/development/majestic-million-csv-daily/).
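The generator code in this post uses a short hard-coded domain list. If you want to draw domains from a downloaded CSV instead, a sketch could look like the following. The `load_domains` helper, the `Domain` column name, and the sample rows are all assumptions for illustration; check the header of the file you actually download.

```python
import csv
import io

def load_domains(csv_file, column="Domain", limit=1000):
    """Read up to `limit` values from `column` of an open CSV file object.

    The column name is an assumption about the file's header row.
    """
    reader = csv.DictReader(csv_file)
    domains = []
    for row in reader:
        if len(domains) >= limit:
            break
        domains.append(row[column])
    return domains

# Tiny made-up sample standing in for the real download.
sample_csv = io.StringIO(
    "GlobalRank,Domain\n1,example.com\n2,example.org\n3,example.net\n"
)
print(load_domains(sample_csv, limit=2))
```

With a real file you would pass `open(path, newline='')` in place of the in-memory sample.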

In [30]:
import random
import string

import scipy.stats

def random_letters(length=10):
    """Return a string of `length` random lowercase ASCII letters."""
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for _ in range(length))

def fake_email_generator(n=10):
    """Yield n fake 'valid-looking' email addresses."""
    domains = ['gmail.com', 'yahoo.com', 'aol.com', 'hotmail.com',
               'live.com', 'mail.ru']
    for _ in range(n):
        # Shifted Poisson: strictly positive length with mean 10
        length = scipy.stats.poisson(9).rvs() + 1
        domain = random.choice(domains)
        yield random_letters(length) + '@' + domain

def fake_non_email_generator(n=10):
    """Yield n fake strings that aren't email addresses."""
    for _ in range(n):
        length = scipy.stats.poisson(9).rvs() + 1
        yield random_letters(length)

In [32]:
print(list(fake_email_generator()))
print(list(fake_non_email_generator()))

['jftlwpsrbimchis@gmail.com', 'nmnydijn@mail.ru', 'gywhp@mail.ru', 'fgjorxecfy@hotmail.com', 'yfjwpsz@live.com', 'jfuazntshrb@gmail.com', 'kigwlgoacmi@gmail.com', 'zmdlvzgpnhkiw@mail.ru', 'dnlefovfzo@gmail.com', 'hiqfhcayyboii@mail.ru']
['keqtmcw', 'tmdsv', 'ollylurmfvljzrnrhww', 'nvvedo', 'xteobcwmtvurhl', 'bbsmbnreoiyltdxdn', 'uhudwwixyj', 'hycoilxxri', 'ibqqdqsbuihads', 'ndquifquf']


In [183]:
import numpy as np

raw_email_strings = list(fake_non_email_generator()) + list(fake_email_generator())
y = np.concatenate([np.zeros((10,)), np.ones((10,))])

In [215]:
# Longest string in the dataset; used as the padded sequence length
maxlen = max(len(e) for e in raw_email_strings)
print(maxlen)

24


In [184]:
np.concatenate([np.zeros((10,)), np.ones((10,))])

array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [202]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tzr = Tokenizer(filters='', char_level=True)

In [204]:
tzr.fit_on_texts(raw_email_strings)

In [187]:
email_seq = tzr.texts_to_sequences(raw_email_strings)

In [216]:
email_seq = pad_sequences(email_seq, maxlen=maxlen)

In [217]:
email_seq.shape

(20, 24)
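To make the tokenize-and-pad step less of a black box, here is a plain-Python sketch of what I understand it to be doing: each character gets an integer id (more frequent characters get smaller ids, starting at 1), and shorter sequences are padded with zeros on the left, which is the `pad_sequences` default. The function names and the alphabetical tie-breaking are my own choices, so the exact ids will not necessarily match what the Keras `Tokenizer` assigns.

```python
from collections import Counter

def build_char_index(texts):
    """Map each character to an integer id, most frequent first, starting at 1."""
    counts = Counter(ch for text in texts for ch in text)
    # Sort by descending count; ties are broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def encode_and_pad(texts, char_index, maxlen):
    """Encode texts as id lists, left-padded with 0, keeping the last maxlen ids."""
    rows = []
    for text in texts:
        # Characters never seen during fitting are silently dropped.
        ids = [char_index[ch] for ch in text if ch in char_index]
        ids = ids[-maxlen:]  # truncate from the front if too long
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows

texts = ["ab@c", "abb"]
index = build_char_index(texts)
print(index)                                   # {'b': 1, 'a': 2, '@': 3, 'c': 4}
print(encode_and_pad(texts, index, maxlen=5))  # [[0, 2, 1, 3, 4], [0, 0, 2, 1, 1]]
```

One consequence worth keeping in mind: the ids are arbitrary labels, so a dense layer that consumes them directly treats them as numeric magnitudes.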

In [194]:
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, Embedding

In [231]:
model = Sequential()
model.add(Dense(200, input_dim=maxlen, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

In [232]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', 
              metrics=['acc'])

In [233]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_25 (Dense)             (None, 200)               5000      
_________________________________________________________________
dense_26 (Dense)             (None, 1)                 201       
=================================================================
Total params: 5,201
Trainable params: 5,201
Non-trainable params: 0
_________________________________________________________________
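The parameter counts in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. A quick sketch (the `dense_params` helper is just for illustration):

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

maxlen = 24  # padded sequence length printed earlier
hidden = dense_params(maxlen, 200)  # 24 * 200 + 200
output = dense_params(200, 1)       # 200 * 1 + 1
print(hidden, output, hidden + output)  # 5000 201 5201
```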


In [234]:
history = model.fit(email_seq,
                    y,
                    batch_size=2,
                    epochs=10,
                    validation_split=0.2)

Train on 16 samples, validate on 4 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
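One thing to watch with this setup: Keras takes the `validation_split` fraction from the end of the arrays before any shuffling, and the labels here are ordered with all non-emails first. The sketch below uses plain Python lists (not the actual training arrays) to show what that implies for the 4 held-out samples, and how shuffling beforehand would change it. The `tail_split` helper and the seed are my own illustrative choices.

```python
import random

labels = [0] * 10 + [1] * 10  # same ordering as y: non-emails first, then emails

def tail_split(items, validation_fraction=0.2):
    """Hold out the last fraction of items, mimicking an unshuffled split."""
    n_val = round(len(items) * validation_fraction)
    cut = len(items) - n_val
    return items[:cut], items[cut:]

train, val = tail_split(labels)
print(val)  # [1, 1, 1, 1] -> the validation set contains only emails

# Shuffling first (seed chosen arbitrarily) lets both classes reach either part.
rng = random.Random(0)
shuffled = labels[:]
rng.shuffle(shuffled)
train_s, val_s = tail_split(shuffled)
print(len(train_s), len(val_s))
```

With real data you would shuffle the inputs and labels together using the same permutation so they stay aligned.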


In [245]:
test_email = ['test@gmail.com',
              'another.example@yahoo.com',
              'example@',
              '@example.com',
              'testtesttesttesttesttesttest',
              'gmail.com']
test_seq = tzr.texts_to_sequences(test_email)

In [246]:
test_seq = pad_sequences(test_seq, maxlen=maxlen)

In [247]:
np.set_printoptions(formatter={'float': '{:.6f}'.format})  # fixed-point, six decimals
model.predict(test_seq)

array([[0.975970],
       [0.999845],
       [0.000176],
       [0.914399],
       [0.743429],
       [0.057131]], dtype=float32)
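To read these scores as yes/no decisions, one option is to apply a cut-off. The sketch below copies the scores printed above and assumes a threshold of 0.5; the `expected` labels are my own judgment of which test strings are well-formed emails, not something the model was given.

```python
emails = ['test@gmail.com', 'another.example@yahoo.com', 'example@',
          '@example.com', 'testtesttesttesttesttesttest', 'gmail.com']
scores = [0.975970, 0.999845, 0.000176, 0.914399, 0.743429, 0.057131]
expected = [1, 1, 0, 0, 0, 0]  # assumed labels: only the first two are valid

threshold = 0.5  # assumed cut-off, not tuned
predicted = [int(score >= threshold) for score in scores]

for email, p, e in zip(emails, predicted, expected):
    print(f"{email!r}: predicted={p} expected={e}")

accuracy = sum(p == e for p, e in zip(predicted, expected)) / len(expected)
print(f"accuracy: {accuracy:.2f}")  # 4 of 6 correct
```

Under these assumptions the model gets '@example.com' and the long string of letters wrong, which is not too surprising given it was trained on only 20 examples.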