# Brainstorming process

Constraints: the target is a black box (no access to gradients, model architecture, etc.), and its vocabulary is unknown (OOV words are not allowed).

### Idea 1: train a teacher network

Because the target classifier network is a black-box model, we do not have gradient information.
However, we could train a neural network to imitate its input/output behavior, and use it to perform gradient ascent on adversarial inputs.

Drawback: the imitating network would take a long time to train and tune.

### Idea 2: genetic algorithm to spawn emails

Drawback: a genetic algorithm takes a long time to design properly.
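To make the design effort concrete, here is a rough sketch of what such a genetic algorithm might look like. The scorer `toy_spam_score`, the filler pool `FILLERS`, the genome encoding (one optional filler word after each original word) and the length penalty are all assumptions invented for this example; the real classifier would replace the toy scorer.

```python
import random

FILLERS = ["", "I", "Tim", "hi"]             # "" means "insert nothing"; pool is a guess
SPAMMY = {"buy", "drugs", "online", "save"}  # invented for the toy scorer


def toy_spam_score(text):
    """Hypothetical black box: fraction of tokens that look spammy."""
    tokens = text.split()
    return sum(t in SPAMMY for t in tokens) / len(tokens)


def apply_genome(words, genome):
    """Insert genome[i] after words[i]; empty genes insert nothing."""
    out = []
    for word, filler in zip(words, genome):
        out.append(word)
        if filler:
            out.append(filler)
    return " ".join(out)


def fitness(words, genome, length_penalty=0.01):
    """Higher is better: low spam score, few added characters."""
    added_chars = sum(len(g) + 1 for g in genome if g)
    return -toy_spam_score(apply_genome(words, genome)) - length_penalty * added_chars


def evolve(email, pop_size=30, generations=40, mutation_rate=0.1, seed=0):
    """Evolve filler insertions for an email of at least two words."""
    rng = random.Random(seed)
    words = email.split()
    n = len(words)
    # Start from the unmodified email plus random individuals.
    population = [[""] * n]
    population += [[rng.choice(FILLERS) for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=lambda g: fitness(words, g), reverse=True)
        parents = population[: pop_size // 2]
        children = [parents[0]]  # elitism: always keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # single-point crossover
            child = a[:cut] + b[cut:]
            child = [rng.choice(FILLERS) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    best = max(population, key=lambda g: fitness(words, g))
    return apply_genome(words, best), fitness(words, best)


email = "buy popular drugs online and save"
adv, fit = evolve(email)
print(adv, fit)
```

Because the population is seeded with the unmodified email and the best individual is always carried over, the result can never score worse than doing nothing under this fitness function.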

### Idea 3: reinforcement learning (policy gradient)

Could work, but the unknown vocabulary makes it hard to define a good action space.
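As a sketch of the simplest possible version, the snippet below runs REINFORCE with a running-average baseline over a tiny, hand-picked action space (which word to insert). The action list `ACTIONS` and the reward table in `toy_reward` are invented; in practice the reward would come from querying the classifier, and choosing the actions is exactly the difficulty noted above.

```python
import math
import random

ACTIONS = ["I", "Tim", "buy"]  # hypothetical action space: which word to insert


def toy_reward(word):
    """Hypothetical reward, e.g. 1 - spam probability after inserting `word`."""
    return {"I": 0.5, "Tim": 0.9, "buy": 0.1}[word]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(steps=3000, lr=0.1, seed=0):
    """Single-step policy gradient over ACTIONS with a running-mean baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(ACTIONS)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(logits)
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        r = toy_reward(ACTIONS[a])
        advantage = r - baseline
        for i in range(len(logits)):
            # Gradient of log softmax: 1[i == a] - probs[i]
            grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_pi
        baseline += (r - baseline) / t  # running mean of observed rewards
    return softmax(logits)


probs = reinforce()
print(dict(zip(ACTIONS, probs)))
```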

### Idea 4: character/word replacement

In the spirit of reinforcement learning, but with human domain knowledge.

We can guess words that are both short (to keep the edit distance small) and uncommon in spam.
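To see why short words matter, the sketch below measures the cost of padding with a short versus a long word. It assumes the edit distance in question is the character-level Levenshtein distance, which the notes do not state explicitly; the sample sentence and the filler words are made up.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


original = "buy popular drugs online"
short_pad = original.replace(" ", " I ")        # adds 2 characters per gap
long_pad = original.replace(" ", " regards ")   # adds 8 characters per gap

print(edit_distance(original, short_pad))  # 6
print(edit_distance(original, long_pad))   # 24
```

Each inserted word costs its length plus one separating space, so the same number of insertions is four times cheaper with `I` than with `regards`.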

# Code

In [51]:
import random

In [111]:
filename = 'spam3'

In [112]:
with open('spam/{}.txt'.format(filename), 'r') as f:
    text = f.read()

### For spam2

In [109]:
output = []
for line in text.split('\n'):
    if random.random() < 0.8:
        line = 'Tim ' + line + ' Tim'
    output.append(line)
output = '\n'.join(output)

print(output)

Tim Subject: paliourg office xp - $ 60 Tim
Tim top quality software : Tim
Tim special offer # 1 : Tim
Tim windows xp professional + microsoft office xp professional = only $ 80 Tim
Tim special offer # 2 : Tim
Tim adobe - photoshop 7 , premiere 7 , illustrator 10 = only $ 120 Tim
Tim special offer # 3 : Tim
Tim macromedia dreamwaver mx 2004 + flash mx 2004 = only $ 100 Tim
Tim also : Tim
windows 2003 server
windows 2000 workstation
Tim windows 2000 server Tim
Tim windows 2000 advanced server Tim
Tim windows 2000 datacenter Tim
Tim windows nt 4 . 0 Tim
Tim windows millenium Tim
Tim windows 98 second edition Tim
Tim windows 95 Tim
Tim office xp professional Tim
Tim office 2000 Tim
Tim office 97 Tim
Tim ms plus Tim
Tim ms sql server 2000 enterprise edition Tim
Tim ms visual studio . net architect edition Tim
Tim ms encarta encyclopedia delux 2004 Tim
Tim ms project 2003 professional Tim
Tim ms money 2004 Tim
ms streets and trips 2004
Tim ms works 7 Tim
Tim ms picture it premium 9 Tim
ms ex

### For spam1 and spam3

In [120]:
output = []
for char in text:
    if char == ' ':
        if random.random() < 0.5:
            output.append(' I I')
    output.append(char)
output = ''.join(output)

print(output)

Subject: I I buy popular I I drugs I I online
online pharmacy - I I visit our online store I I and save I I .
save up I I to I I 80 % compared I I to I I normal rates I I .
all I I popular drugs I I are I I available !
- I I world I I wide I I shipping
- I I no I I doctor visits
- no I I prescriptions
- I I next I I day priority I I shipping
- discreet packaging
- buy I I in bulk and I I save I I !
we I I make I I it easier I I and I I faster than I I ever I I to get the prescriptions I I you
need I I .
simply rx I I is your convenient , safe and private online source for
fda approved pharmacy I I prescriptions . I I we sell I I brand I I - name I I and I I exact
generic equivalents of I I us I I fda approved prescription I I drugs I I through
our I I fully I I - I I licensed overseas pharmacy . I I upon I I approval of your I I medical
information I I , I I a licensed I I physician will I I issue I I a free prescription
which I I can I I be filled and shipped to you in I I one I I bus

## Run inference

In [41]:
import os
import glob
import pickle
import sys
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

def read_data(path_name):
    """Read every file in path_name and return (contents, filenames)."""
    error_counts = 0  # files that could not be read (kept for debugging)
    data_list = []
    fn_list = []
    # os.listdir returns entries in arbitrary order; sort for stable output.
    for fn in sorted(os.listdir(path_name)):
        try:
            # Undecodable bytes are dropped rather than raising an error.
            with open(os.path.join(path_name, fn), 'r', encoding='utf8', errors='ignore') as f:
                data = f.read()
            data_list.append(data)
            fn_list.append(fn)
        except Exception:
            error_counts += 1
    # print('Error Counts: ', error_counts)
    return (data_list, fn_list)

In [121]:
SEQ_LEN = 1000
model_path = './models/spam_model.h5'
tokenizer_path = './models/tokenizer.pkl'
filepath = './submission2/'

## Loading Data
# print('Loading Data...')
data, files = read_data(filepath)

## Loading Model and Tokenizer
# print('Loading Model and Tokenizer...')
with open(tokenizer_path,'rb') as f:
    tokenizer = pickle.load(f)
model = load_model(model_path)

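## Preprocessing
# Convert each email to a sequence of token ids, then pad/truncate to SEQ_LEN.
dl_x = tokenizer.texts_to_sequences(data)
dl_x = pad_sequences(dl_x, maxlen=SEQ_LEN)

## Model Predict
print('Model Predict...')
pred = model.predict(dl_x)
# yp[1] is read as the spam probability; the label is 1 if it reaches 0.5, else 0.
for fn, yp in zip(files, pred):
    print(fn, yp[1], 0 if yp[1] < 0.5 else 1)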

Model Predict...
spam1.txt 0.4836771 0
spam2.txt 6.351516e-08 0
spam3.txt 0.20406184 0
