# In this notebook, I use an autoencoder for fraud detection. The fraud patterns are learned from the given hosts, which are known to receive substantial amounts of fraudulent traffic

For some context: when an ad is shown on a webpage you are visiting, the image (or
animation or video) is loaded into your browser from another site (the ad server). Along
with the request for the ad, your browser may also generate requests to other servers,
e.g. those of tracking companies that want to learn about your habits, or of
companies that measure where ads are shown, for how long they are seen, etc.
This data was collected by one such measurement company.

Advertising fraud is a huge problem for digital media and brands. You will be trying to
help solve this problem. Attached is a file with one day of ad impression logs for selected IP addresses. The fields are, in order:
- Timestamp
- IP address
- Detected browser type
- User agent string
- Host (URL)
- Whether the impression was in view (1.0 = yes, 0.0 = no)
- Number of plugins installed
- Browser window position and size (x, y, width, height)
- Network latency

Your task is to identify hosts which are receiving a substantial amount of fraudulent
traffic. As part of this, you may also wish to identify IP addresses home to machines that
are part of botnets, but this is not required. The definition of "substantial" is up to you --
this may be a ranked list of all hosts, or a list of hosts reaching a certain threshold, or
you may choose not to quantify the amount of fraud and simply classify hosts as likely
to be experiencing high fraud or not.
To get you started, here is a list of hosts which are known to receive substantial
amounts of fraudulent traffic:

- featureplay.com
- uvidi.com
- spryliving.com
- greatboxgames.com
- mmabay.co.uk
- workingmothertv.com
- besthorrorgame.com
- dailyparent.com
- superior-movies.com
- yourhousedesign.com
- outdoorlife.tv
- drumclub.info
- cycleworld.tv
- hmnp.us
- nlinevideos.com

In [None]:
import pandas as pd
import numpy as np

## Load the data, profile the data

In [None]:
data = 'HW_assignment_data_set.tsv'
data_df = pd.read_csv(data, names=['timestamp', 'ip', 'browser_type', 'user_agent_string', 'host', 'viewed',
                                   'no_of_plugins', 'position', 'network_latency'],
                      sep='\t', parse_dates=['timestamp'])
data_df.shape

(235083, 9)

In [None]:
data_df.head()

Unnamed: 0,timestamp,ip,browser_type,user_agent_string,host,viewed,no_of_plugins,position,network_latency
0,2014-08-25,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,http://www.domain.com.au,0.0,,"(0,0,1280,629)",0.0
1,2014-08-25,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,http://www.domain.com.au,0.0,,"(0,0,1280,629)",0.0
2,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,
3,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,
4,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,


In [None]:
data_df.dtypes

timestamp            datetime64[ns]
ip                           object
browser_type                 object
user_agent_string            object
host                         object
viewed                      float64
no_of_plugins               float64
position                     object
network_latency             float64
dtype: object

In [None]:
data_df.describe(include='all')

Unnamed: 0,timestamp,ip,browser_type,user_agent_string,host,viewed,no_of_plugins,position,network_latency
count,235083,235083,235083,235075,235083,211811.0,162304.0,118451,200283.0
unique,74914,8216,6,4778,11779,,,10302,
top,2014-08-25 23:18:24,411.517.507.552,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://pixel.adsafeprotected.com,,,"(0,0,1366,667)",
freq,20,12828,76595,26509,20548,,,2271,
first,2014-08-25 00:00:00,,,,,,,,
last,2014-08-25 23:59:58,,,,,,,,
mean,,,,,,0.431998,10.139442,,345.2002
std,,,,,,0.495355,7.637172,,13292.82
min,,,,,,0.0,1.0,,0.0
25%,,,,,,0.0,2.0,,0.0


## Feature engineering: One-hot encode browser type

In [None]:
data_df = pd.concat([data_df, pd.get_dummies(data_df['browser_type'])], axis=1)
data_df.head()

Unnamed: 0,timestamp,ip,browser_type,user_agent_string,host,viewed,no_of_plugins,position,network_latency,Chrome,Firefox,Internet Explorer,Opera,Safari/Webkit,Unknown
0,2014-08-25,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,http://www.domain.com.au,0.0,,"(0,0,1280,629)",0.0,0,0,0,0,1,0
1,2014-08-25,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,http://www.domain.com.au,0.0,,"(0,0,1280,629)",0.0,0,0,0,0,1,0
2,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,,0,0,0,0,0,1
3,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,,0,0,0,0,0,1
4,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,,0,0,0,0,0,1


## Feature engineering: Calculate browser area

In [None]:
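def browser_area_cal(s):
    # Missing positions are NaN; pass them through unchanged.
    if pd.isna(s):
        return s
    # A position looks like "(x,y,width,height)"; strip the parentheses before splitting.
    _, _, width, height = s.strip('()').split(',')
    return int(width) * int(height)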

In [None]:
data_df['browser_area'] = data_df['position'].apply(browser_area_cal)
data_df.head()

Unnamed: 0,timestamp,ip,browser_type,user_agent_string,host,viewed,no_of_plugins,position,network_latency,Chrome,Firefox,Internet Explorer,Opera,Safari/Webkit,Unknown,browser_area
0,2014-08-25,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,http://www.domain.com.au,0.0,,"(0,0,1280,629)",0.0,0,0,0,0,1,0,805120.0
1,2014-08-25,393.414.443.469,Safari/Webkit,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...,http://www.domain.com.au,0.0,,"(0,0,1280,629)",0.0,0,0,0,0,1,0,805120.0
2,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,,0,0,0,0,0,1,
3,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,,0,0,0,0,0,1,
4,2014-08-25,325.441.386.395,Unknown,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://www.mangareader.net,,,,,0,0,0,0,0,1,
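
To make the parsing step concrete, here is a minimal standalone sketch of how a position string of the form `(x,y,width,height)` turns into an area. The helper names `parse_position` and `browser_area` are hypothetical and are not part of the notebook; the sketch assumes missing positions arrive as float NaN, as they do in the DataFrame above.

```python
import math

def parse_position(position):
    """Split a '(x,y,width,height)' string into four ints; return None if missing."""
    # Missing positions show up as float NaN in the DataFrame.
    if position is None or (isinstance(position, float) and math.isnan(position)):
        return None
    x, y, width, height = (int(part) for part in position.strip("()").split(","))
    return x, y, width, height

def browser_area(position):
    """Window area in square pixels, or NaN when the position is missing."""
    parsed = parse_position(position)
    if parsed is None:
        return float("nan")
    _, _, width, height = parsed
    return width * height

print(browser_area("(0,0,1280,629)"))  # 805120
```

The printed value matches the `browser_area` shown for the first rows of the table above.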


## Generate train and test data based on a list of hosts which are known to receive substantial amounts of fraudulent traffic

In [None]:
host_list = ['featureplay.com', 'uvidi.com', 'spryliving.com', 'greatboxgames.com', 'mmabay.co.uk',
             'workingmothertv.com', 'besthorrorgame.com', 'dailyparent.com','superior-movies.com',
             'yourhousedesign.com', 'outdoorlife.tv', 'drumclub.info', 'cycleworld.tv', 'hmnp.us', 'nlinevideos.com']

In [None]:
# train_df: known hosts
train_df = data_df[data_df['host'].apply(lambda full_host: any(host in full_host for host in host_list))]
train_df.shape

(2343, 16)

In [None]:
# test_df: unknown hosts
test_df = data_df[~data_df['host'].apply(lambda full_host: any(host in full_host for host in host_list))]
test_df.shape

(232740, 16)

In [None]:
train_df.shape[0] + test_df.shape[0] - data_df.shape[0]

0
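
The split above uses a plain substring test, so a URL such as `http://notuvidi.com` would also count as a known host. As a hedged alternative, the sketch below matches on the parsed hostname instead. The helper `is_known_fraud_host` and the shortened domain set are assumptions for illustration only, and the sketch assumes every host value carries a scheme such as `http://`, as in the sample rows.

```python
from urllib.parse import urlparse

# Subset of the known hosts, for illustration only.
KNOWN_FRAUD_DOMAINS = {"uvidi.com", "hmnp.us", "outdoorlife.tv"}

def is_known_fraud_host(url, domains=KNOWN_FRAUD_DOMAINS):
    """True if the URL's hostname equals a listed domain or is a subdomain of one."""
    # urlparse only fills in hostname when the URL has a scheme (e.g. "http://").
    hostname = (urlparse(url).hostname or "").lower()
    return any(hostname == d or hostname.endswith("." + d) for d in domains)

print(is_known_fraud_host("http://www.uvidi.com"))  # True
print(is_known_fraud_host("http://notuvidi.com"))   # False
print("uvidi.com" in "http://notuvidi.com")         # True (substring test)
```

Whether the stricter match changes the train/test split for this data set is something I have not checked.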

## Botnet IPs: Top IPs by count in train data

In [None]:
pd.DataFrame(train_df['ip'].value_counts())

Unnamed: 0,ip
324.338.423.496,224
496.529.325.519,219
574.491.567.341,208
476.494.399.426,186
529.366.487.475,156
496.437.522.387,139
574.452.484.501,136
496.325.356.347,76
489.462.542.447,67
438.562.383.508,66


## Build an autoencoder on the train data to learn the traffic patterns of hosts that have received substantial amounts of fraudulent traffic

#### For simplicity, we only use the features ['viewed', 'no_of_plugins', 'network_latency', 'Chrome', 'Firefox', 'Internet Explorer', 'Opera', 'Safari/Webkit', 'Unknown', 'browser_area']

In [None]:
selected_features = ['viewed', 'no_of_plugins', 'network_latency', 'Chrome', 'Firefox', 'Internet Explorer', 'Opera',
                     'Safari/Webkit', 'Unknown', 'browser_area']

In [None]:
train_selected = train_df[selected_features]
test_selected = test_df[selected_features]
train_selected.describe(include='all')

Unnamed: 0,viewed,no_of_plugins,network_latency,Chrome,Firefox,Internet Explorer,Opera,Safari/Webkit,Unknown,browser_area
count,2265.0,1559.0,2074.0,2343.0,2343.0,2343.0,2343.0,2343.0,2343.0,2012.0
mean,0.7766,5.780629,2372.299421,0.428937,0.068715,0.466069,0.0,0.005548,0.03073,867278.3
std,0.416616,8.048353,8356.602506,0.49503,0.253023,0.498954,0.0,0.074297,0.172622,417020.5
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,689152.0
50%,1.0,3.0,127.5,0.0,0.0,0.0,0.0,0.0,0.0,723968.0
75%,1.0,3.0,942.5,1.0,0.0,1.0,0.0,0.0,0.0,1052898.0
max,1.0,26.0,221849.0,1.0,1.0,1.0,0.0,1.0,1.0,4096000.0


#### We first fill np.nan with the column means

In [None]:
train_selected = train_selected.fillna(train_selected.mean())
test_selected = test_selected.fillna(test_selected.mean())
train_selected.describe(include='all')

Unnamed: 0,viewed,no_of_plugins,network_latency,Chrome,Firefox,Internet Explorer,Opera,Safari/Webkit,Unknown,browser_area
count,2343.0,2343.0,2343.0,2343.0,2343.0,2343.0,2343.0,2343.0,2343.0,2343.0
mean,0.7766,5.780629,2372.299421,0.428937,0.068715,0.466069,0.0,0.005548,0.03073,867278.3
std,0.40962,6.564434,7862.052698,0.49503,0.253023,0.498954,0.0,0.074297,0.172622,386429.3
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,689152.0
50%,1.0,3.0,155.0,0.0,0.0,0.0,0.0,0.0,0.0,867278.3
75%,1.0,5.780629,2372.299421,1.0,0.0,1.0,0.0,0.0,0.0,1049088.0
max,1.0,26.0,221849.0,1.0,1.0,1.0,0.0,1.0,1.0,4096000.0
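
The cell above fills the test data with its own column means. One alternative, sketched here with NumPy on made-up numbers, is to reuse the train means for both sets so that the test data goes through exactly the preprocessing the model was trained on. The arrays are toy stand-ins, not values from the data set.

```python
import numpy as np

# Toy stand-ins for two numeric columns (values are made up).
train = np.array([[1.0, 100.0], [3.0, np.nan], [np.nan, 300.0]])
test = np.array([[np.nan, 50.0], [2.0, np.nan]])

# Column means of the train data, ignoring missing values.
train_means = np.nanmean(train, axis=0)

# Fill missing entries in both sets with the train means.
train_filled = np.where(np.isnan(train), train_means, train)
test_filled = np.where(np.isnan(test), train_means, test)

print(train_means.tolist())  # [2.0, 200.0]
print(test_filled.tolist())  # [[2.0, 50.0], [2.0, 200.0]]
```

With the notebook's DataFrames, the equivalent would be `test_selected.fillna(train_selected.mean())`.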


#### We normalize three features to the range [0, 1] (the scaler is fit on the train data and then applied to the test data)

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
min_max_scaler = MinMaxScaler()

In [None]:
X_train_3features = min_max_scaler.fit_transform(train_selected[['no_of_plugins', 'network_latency', 'browser_area']])
X_train_3features

array([[0.19122514, 0.01575847, 0.16825   ],
       [0.08      , 0.        , 0.1765    ],
       [0.19122514, 0.00064909, 0.32      ],
       ...,
       [0.19122514, 0.0011314 , 0.22444287],
       [0.19122514, 0.01069331, 0.22444287],
       [0.04      , 0.        , 0.32      ]])

In [None]:
X_test_3features = min_max_scaler.transform(test_selected[['no_of_plugins', 'network_latency', 'browser_area']])
X_test_3features

array([[3.67268655e-01, 0.00000000e+00, 1.96562500e-01],
       [3.67268655e-01, 0.00000000e+00, 1.96562500e-01],
       [3.67268655e-01, 1.46040423e-03, 9.35165208e+06],
       ...,
       [3.67268655e-01, 1.46040423e-03, 9.35165208e+06],
       [3.20000000e-01, 7.66286979e-04, 9.35165208e+06],
       [0.00000000e+00, 0.00000000e+00, 2.90625000e-02]])
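
Values such as 9.35e+06 in the output above lie far outside [0, 1]. Min-max scaling computes (x - min) / (max - min) with the minimum and maximum learned from the train data, so any test value beyond the train range lands outside the unit interval. The sketch below reproduces that effect with NumPy on made-up areas. The helpers `fit_min_max` and `apply_min_max` are hypothetical names, and clipping is shown only as one possible way to bound the result.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column (min, max) learned from the train data."""
    return train.min(axis=0), train.max(axis=0)

def apply_min_max(data, col_min, col_max, clip=False):
    """Scale with (x - min) / (max - min); optionally clip to [0, 1]."""
    scaled = (data - col_min) / (col_max - col_min)
    return np.clip(scaled, 0.0, 1.0) if clip else scaled

train = np.array([[0.0], [2048000.0], [4096000.0]])  # toy areas
test = np.array([[1024000.0], [8192000.0]])          # second value exceeds the train max

col_min, col_max = fit_min_max(train)
print(apply_min_max(test, col_min, col_max).ravel().tolist())             # [0.25, 2.0]
print(apply_min_max(test, col_min, col_max, clip=True).ravel().tolist())  # [0.25, 1.0]
```

Recent scikit-learn versions also expose a `clip` option on `MinMaxScaler`. Clipping hides how extreme an outlier is, so whether it is appropriate here is a judgment call.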

In [None]:
rest_features = [feature for feature in selected_features 
                 if feature not in ['no_of_plugins', 'network_latency', 'browser_area']]
rest_features

['viewed',
 'Chrome',
 'Firefox',
 'Internet Explorer',
 'Opera',
 'Safari/Webkit',
 'Unknown']

In [None]:
X_train = np.concatenate((X_train_3features, train_selected[rest_features].values), axis=1)
X_test = np.concatenate((X_test_3features, test_selected[rest_features].values), axis=1)
X_train

array([[0.19122514, 0.01575847, 0.16825   , ..., 0.        , 0.        ,
        0.        ],
       [0.08      , 0.        , 0.1765    , ..., 0.        , 0.        ,
        0.        ],
       [0.19122514, 0.00064909, 0.32      , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.19122514, 0.0011314 , 0.22444287, ..., 0.        , 0.        ,
        0.        ],
       [0.19122514, 0.01069331, 0.22444287, ..., 0.        , 0.        ,
        0.        ],
       [0.04      , 0.        , 0.32      , ..., 0.        , 0.        ,
        0.        ]])

## Build Autoencoder on X_train for fraud detection, including encoder and decoder

In [None]:
from keras.layers import Input, Dense, Dropout
from keras.models import Model

Using TensorFlow backend.


In [None]:
input_dim = X_train.shape[1]
encoding_dim = input_dim//2
print(input_dim, encoding_dim)

10 5


In [None]:
# Input placeholder, dropout layer, encoder and decoder
input_att = Input(shape=(input_dim,))
input_dropout = Dropout(0.2)(input_att)
encoded = Dense(encoding_dim, activation='relu')(input_dropout)
decoded = Dense(input_dim, activation='linear')(encoded)

In [None]:
autoencoder = Model(input_att, decoded)
autoencoder.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 10)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 55        
_________________________________________________________________
dense_2 (Dense)              (None, 10)                60        
Total params: 115
Trainable params: 115
Non-trainable params: 0
_________________________________________________________________
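
As a rough illustration of what this network computes at inference time, the NumPy sketch below runs the same two dense layers by hand: a ReLU encoder from 10 to 5 units and a linear decoder back to 10. The weights are random stand-ins, not the trained ones, and dropout is omitted because it is only active during training. The parameter counts match the summary above.

```python
import numpy as np

input_dim, encoding_dim = 10, 5
rng = np.random.default_rng(0)

# Randomly initialised weights stand in for trained ones.
W_enc, b_enc = rng.normal(size=(input_dim, encoding_dim)), np.zeros(encoding_dim)
W_dec, b_dec = rng.normal(size=(encoding_dim, input_dim)), np.zeros(input_dim)

def forward(x):
    """Encode with ReLU, then decode with a linear layer."""
    code = np.maximum(x @ W_enc + b_enc, 0.0)
    return code @ W_dec + b_dec

x = rng.random(size=(3, input_dim))
print(forward(x).shape)         # (3, 10)
print(W_enc.size + b_enc.size)  # 55 parameters in the encoder
print(W_dec.size + b_dec.size)  # 60 parameters in the decoder
```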


In [None]:
autoencoder.compile(loss='mean_squared_error', optimizer='adam')

In [None]:
autoencoder.fit(X_train, X_train, epochs=100, shuffle=True, validation_split=0.2, verbose=1)

Train on 1874 samples, validate on 469 samples
Epoch 1/100
...
Epoch 100/100


<keras.callbacks.callbacks.History at 0x7f13f7b86b00>

#### Evaluate the loss on X_train and X_test: huge difference

In [None]:
autoencoder.evaluate(X_train, X_train)



0.019190713044373628

In [None]:
autoencoder.evaluate(X_test, X_test)



2.842567490605635e+17

## Define a function to calculate the MSE for each record

In [None]:
def mse_for_each_record(act, pred):
    # Square the residual once, then average across features (axis=1)
    # so that each record gets its own mean squared error.
    squared_error = np.square(act - pred)
    return np.mean(squared_error, axis=1)

In [None]:
pred_train = autoencoder.predict(X_train)
pred_test = autoencoder.predict(X_test)

In [None]:
mse_train = mse_for_each_record(X_train, pred_train)
mse_test = mse_for_each_record(X_test, pred_test)

In [None]:
print("-------mse_train-------")
print(pd.Series(mse_train).describe())
print("\n-------mse_test-------")
print(pd.Series(mse_test).describe())

-------mse_train-------
count    2343.000000
mean        0.007036
std         0.019955
min         0.000024
25%         0.000070
50%         0.000127
75%         0.002992
max         0.111754
dtype: float64

-------mse_test-------
count    2.327400e+05
mean     6.757804e+40
std      2.305288e+43
min      4.164897e-05
25%      7.424011e-03
50%      1.896629e-01
75%      6.860657e+26
max      7.864056e+45
dtype: float64


## Decide the fraud threshold as the 99th percentile of the train-data errors

In [None]:
threshold = np.percentile(mse_train,99)

print("Fraud-Threshold = {}".format(threshold))

Fraud-Threshold = 0.10048884204631418
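
The thresholding idea can be sketched on toy numbers: compute a reconstruction error per record, take a percentile of the train errors as the cut-off, and treat test records at or below it as resembling the train (known fraud-heavy) traffic. The helper `per_record_mse`, the made-up arrays, and the 50th percentile are assumptions chosen to keep the example small; the notebook itself uses the 99th percentile.

```python
import numpy as np

def per_record_mse(actual, reconstructed):
    """Mean squared reconstruction error for each row."""
    return np.mean(np.square(actual - reconstructed), axis=1)

# Toy inputs and reconstructions (values are made up).
train_actual = np.array([[0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]])
train_recon = np.array([[0.0, 1.0], [0.0, 1.0], [0.5, 0.0], [1.0, 1.0]])

train_errors = per_record_mse(train_actual, train_recon)
threshold = np.percentile(train_errors, 50)

# Hypothetical per-record errors for three test records.
test_errors = np.array([0.01, 0.4, 0.3125])
resembles_train = test_errors <= threshold

print(train_errors.tolist())     # [0.0, 0.5, 0.125, 1.0]
print(float(threshold))          # 0.3125
print(resembles_train.tolist())  # [True, False, True]
```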


In [None]:
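# The autoencoder was trained on traffic to known fraud-heavy hosts, so a low
# reconstruction error means a test record resembles that traffic.
print("Fraud records in test data = {}%".format(np.sum(mse_test <= threshold)/X_test.shape[0]*100))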

Fraud records in test data = 39.75251353441608%


In [None]:
fraud_in_test = test_df[mse_test <= threshold]
fraud_in_test.shape

(92520, 16)

In [None]:
fraud_in_test.head()

Unnamed: 0,timestamp,ip,browser_type,user_agent_string,host,viewed,no_of_plugins,position,network_latency,Chrome,Firefox,Internet Explorer,Opera,Safari/Webkit,Unknown,browser_area
5,2014-08-25 00:00:00,326.432.563.561,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://failblog.cheezburger.com,0.0,25.0,"(0,0,1366,643)",73.0,1,0,0,0,0,0,878338.0
10,2014-08-25 00:00:04,369.438.566.432,Chrome,Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKi...,http://www.stuffyoushouldknow.com,0.0,17.0,"(0,0,1920,993)",0.0,1,0,0,0,0,0,1906560.0
11,2014-08-25 00:00:04,369.438.566.432,Chrome,Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKi...,http://www.stuffyoushouldknow.com,0.0,17.0,"(0,0,1920,993)",0.0,1,0,0,0,0,0,1906560.0
15,2014-08-25 00:00:06,488.432.432.344,Chrome,Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...,http://madamenoire.com,0.0,23.0,"(0,0,1366,643)",139.0,1,0,0,0,0,0,878338.0
22,2014-08-25 00:00:09,325.554.366.399,Chrome,Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKi...,http://jobsearch.about.com,1.0,16.0,"(0,0,1366,643)",0.0,1,0,0,0,0,0,878338.0


## Hosts the autoencoder suggests are receiving substantial amounts of fraudulent traffic

In [None]:
# Top hosts by count are my results:
fraud_in_test['host'].value_counts()

http://pixel.adsafeprotected.com            2202
http://www.answers.com                      1970
http://www.domain.com.au                    1839
http://www.mynet.com                        1692
http://fw.adsafeprotected.com               1642
http://www.stuffyoushouldknow.com           1522
http://www.pandora.com                      1452
http://www.cars.com                         1419
http://www.amazon.com                       1026
http://www.startribune.com                   837
http://www.huffingtonpost.com                834
http://www.usmagazine.com                    681
http://living.msn.com                        671
http://www.ebay.co.uk                        622
http://celebs.answers.com                    607
http://www.dailymail.co.uk                   546
http://webmaila.juno.com                     532
http://www.realtor.com                       524
http://www.ratemyprofessors.com              517
http://www.cnn.com                           514
http://www.pch.com  
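
Raw counts favor hosts with a lot of traffic overall. As a complementary view, the sketch below ranks hosts by the fraction of their impressions that were flagged, which is closer to the "ranked list of all hosts" mentioned in the task. The function `rank_hosts_by_flag_rate`, its `min_impressions` parameter, and the example hosts are hypothetical.

```python
from collections import Counter

def rank_hosts_by_flag_rate(hosts, flagged, min_impressions=1):
    """Return (host, flagged_fraction, impressions), highest fraction first."""
    totals = Counter(hosts)
    hits = Counter(h for h, f in zip(hosts, flagged) if f)
    rows = [(h, hits[h] / n, n) for h, n in totals.items() if n >= min_impressions]
    # Sort by fraction descending, breaking ties by traffic volume.
    return sorted(rows, key=lambda r: (-r[1], -r[2]))

# Made-up hosts and flags, purely for illustration.
hosts = ["a.example"] * 10 + ["b.example"] * 2
flagged = [True] * 3 + [False] * 7 + [True] * 2

for row in rank_hosts_by_flag_rate(hosts, flagged):
    print(row)
# ('b.example', 1.0, 2)
# ('a.example', 0.3, 10)
```

Here `a.example` has more flagged impressions in absolute terms, but `b.example` ranks first by rate. A `min_impressions` floor helps avoid ranking hosts that have only a handful of records.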

## Botnet IPs suggested by the autoencoder, by IP count. Note that IPs such as 529.366.487.475 show up in both the train and test data

In [None]:
# Top IPs by count are suggested to be part of botnets.
pd.DataFrame(fraud_in_test['ip'].value_counts())

Unnamed: 0,ip
411.517.507.552,6785
358.472.462.434,2524
337.508.436.325,1982
411.374.349.559,1857
369.438.566.432,1528
544.531.344.383,1293
391.367.519.448,1267
357.394.348.456,1008
529.366.487.475,967
515.446.450.448,921


# If I had more time:
- I would apply feature engineering to the 'user_agent_string' feature and extract key tokens from it.
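
One possible way to extract such tokens, sketched under the assumption that the useful pieces are the `name/version` product tokens of a user agent string. The regular expression, the helper `user_agent_tokens`, and the full example string are illustrative and not taken from the data set, where the strings are shown truncated.

```python
import re

# Matches product tokens such as "Chrome/36.0" or "Mozilla/5.0".
PRODUCT_TOKEN = re.compile(r"([A-Za-z][A-Za-z0-9_.-]*)/([\w.]+)")

def user_agent_tokens(user_agent):
    """Return a dict of product name -> version found in a user agent string."""
    if not isinstance(user_agent, str):
        return {}  # missing values (NaN) yield no tokens
    return {name: version for name, version in PRODUCT_TOKEN.findall(user_agent)}

# Example string written for this sketch.
ua = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36")
print(user_agent_tokens(ua))
# {'Mozilla': '5.0', 'AppleWebKit': '537.36', 'Chrome': '36.0.1985.143', 'Safari': '537.36'}
```

The resulting names and versions could then be one-hot encoded or compared against the `browser_type` column to spot inconsistent records.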