First, load the list of URLs and get a general sense of the data.

In [37]:
import pandas as pd

In [38]:
malicious_urls = pd.read_csv('test.txt', sep='\t', header=None)

In [39]:
malicious_urls

Unnamed: 0,0,1
0,bd-un1.wanglv.com/auycbe.js,1
1,www.hzed.com/front/activity/verifyMobileForReg,1
2,47.93.177.253:4346/HZJH.aspx?Lottery=CQSSC&Mke...,1
3,115.231.155.32/ttyzcpzoomtmbcbohggmaztaspeannb...,1
4,222.73.132.173/zijian.hls.video.qq.com/DE26960...,1
5,www.ntwxjx.cn/favicon.ico,1
6,125.65.241.24/om.tc.qq.com/ArnrLlOlGDRFZR0puMs...,1
7,aipiin2.cn/video/m3u8/2018/12/04/e5ad1a5e/0226.ts,1
8,www.xinyuzhaiwu.com/files/article/image/0/21/2...,1
9,img.aituwen.com/qZEERmOUyvL8LHTEmRyYy9w8O4g=?i...,1


Next, check how many malicious (label 1) and benign (label 0) samples there are.

In [40]:
malicious_urls[1].value_counts()

1    1794
0     368
Name: 1, dtype: int64

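Malicious samples are the majority (1794 versus 368, about 83% of the 2162 rows). The skew is moderate, roughly 5:1, so no special handling for class imbalance is applied in model selection or evaluation here. The accuracy figures reported later should still be read against the 83% that a majority-class predictor would reach.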
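To put the class counts in context, the sketch below computes the accuracy of a trivial majority-class predictor and a set of inverse-frequency class weights. The weighting formula, total / (n_classes * count), is a common convention assumed here; the original notebook does not use class weights.

```python
# Class counts copied from the value_counts() output above.
counts = {1: 1794, 0: 368}
total = sum(counts.values())

# Accuracy obtained by always predicting the majority class.
majority_baseline = max(counts.values()) / total

# Inverse-frequency weights (assumed convention: total / (n_classes * count)).
class_weight = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total samples: {total}")
print(f"majority baseline accuracy: {majority_baseline:.4f}")
print(f"class weights: {class_weight}")
```

A dictionary of this shape is what Keras `model.fit` accepts through its `class_weight` argument, should the imbalance turn out to matter.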

Sort the URLs to get an idea of which features might be useful.

In [41]:
malicious_urls.sort_values(0)

Unnamed: 0,0,1
1921,1.1/android_update.htm?uid=9DC3091F90310C72FFE...,1
581,1.wenzhangba.cn/bwosfoscy.js,1
580,1.wenzhangba.cn/source/js/web/35zptm.js?p=kcgm...,1
752,112.117.221.11/qpdxv/v0/20160324/7b/7e/86b3cea...,1
2151,112.117.221.11/qpdxv/v0/20181108/d8/ea/348617c...,1
604,112.117.221.24/qpdxv/v0/20181019/9b/78/7db423f...,1
1107,112.117.221.41/qpdxv/v0/20180907/10/32/d459412...,1
1803,112.117.221.41/qpdxv/v0/20181129/5e/79/2569603...,1
1826,112.67.251.142/vlivehls.tc.qq.com/A7EkeJdcsGXt...,1
859,112.67.251.162/zijian.hls.video.qq.com/698D3A8...,1


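A quick look suggests that this dataset does not define a malicious URL in a clear-cut way, for example by web attack type. Instead, most of the flagged URLs involve pornographic images, videos, sports betting, politically sensitive content and similar material, and such definitions of "malicious" are subjective. With a traditional machine learning approach based on feature extraction, the space of candidate features would be very broad. I therefore want to try a deep learning approach similar to sentiment analysis, using a model such as conv1d, an RNN or an LSTM. Because the parts of a URL do not need strong contextual relationships with one another, I chose conv1d: a one-dimensional convolution suits text without such dependencies and is also much faster.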
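As a minimal sketch of why a one-dimensional convolution followed by global max pooling does not depend on where a pattern occurs, the pure-Python example below slides one hand-written kernel over two toy sequences. The kernel values and sequences are made up for illustration; a real `Conv1D` layer learns many such kernels.

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal without padding (no kernel flip)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def global_max_pool(features):
    """Keep only the strongest response, discarding its position."""
    return max(features)


kernel = [1.0, 2.0, 1.0]            # hypothetical pattern detector
early = [1, 2, 1, 0, 0, 0, 0, 0]    # pattern at the start
late = [0, 0, 0, 0, 0, 1, 2, 1]     # same pattern at the end

pooled_early = global_max_pool(conv1d_valid(early, kernel))
pooled_late = global_max_pool(conv1d_valid(late, kernel))
print(pooled_early, pooled_late)
```

Both sequences produce the same pooled value, which is the property that makes this architecture a reasonable fit when the position of a token in the URL matters little.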

In [42]:
malicious_urls[0]

0                             bd-un1.wanglv.com/auycbe.js
1          www.hzed.com/front/activity/verifyMobileForReg
2       47.93.177.253:4346/HZJH.aspx?Lottery=CQSSC&Mke...
3       115.231.155.32/ttyzcpzoomtmbcbohggmaztaspeannb...
4       222.73.132.173/zijian.hls.video.qq.com/DE26960...
5                               www.ntwxjx.cn/favicon.ico
6       125.65.241.24/om.tc.qq.com/ArnrLlOlGDRFZR0puMs...
7       aipiin2.cn/video/m3u8/2018/12/04/e5ad1a5e/0226.ts
8       www.xinyuzhaiwu.com/files/article/image/0/21/2...
9       img.aituwen.com/qZEERmOUyvL8LHTEmRyYy9w8O4g=?i...
10      cdn.caishache.com/preloads/d0b86a0a-eca7-4dd6-...
11                       www.methanometer.com/favicon.ico
12                     www.zhenren.com/batch.ad.php?id=57
13                     www.zhenren.com/batch.ad.php?id=53
14                                            m.69bj.com/
15      static.sistalk.cn/uploads/avatar/2017/0612/sis...
16      static.sistalk.cn/uploads/album/2018/1209/sist...
...

In [43]:
from keras import layers, models, utils, preprocessing, callbacks, regularizers
from sklearn.model_selection import train_test_split
import tensorflow as tf
from sklearn.metrics import accuracy_score
import numpy as np

In [44]:
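max_words = 10000
# The default Tokenizer filters treat punctuation such as '.', '/', '?' and '='
# as separators, so each URL is split into lowercase tokens at those characters.
tokenizer = preprocessing.text.Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(malicious_urls[0])
sequences = tokenizer.texts_to_sequences(malicious_urls[0])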

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 7758 unique tokens.
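To see what kind of tokens the tokenizer produces from a URL, the sketch below approximates the splitting step in plain Python. It assumes the Keras default `filters` string and `lower=True`; `split_url` is a hypothetical helper written for illustration, not part of Keras.

```python
# Assumed to match the default `filters` argument of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TABLE = str.maketrans({char: ' ' for char in FILTERS})


def split_url(url):
    """Lowercase the URL, turn filtered characters into spaces, then split."""
    return url.lower().translate(_TABLE).split()


tokens_a = split_url('www.zhenren.com/batch.ad.php?id=57')
tokens_b = split_url('bd-un1.wanglv.com/auycbe.js')
print(tokens_a)
print(tokens_b)
```

Under these assumptions, host labels, path segments and query keys all become separate tokens, which is why a few thousand URLs yield several thousand unique tokens.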


In [45]:
# Length of the longest token sequence
maxlen = max(len(seq) for seq in sequences)
maxlen

141

In [46]:
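# The longest sequence has 141 tokens but maxlen is 50. pad_sequences truncates
# from the front by default (truncating='pre'), so longer URLs keep only their
# last 50 tokens; shorter ones are zero-padded at the end (padding='post').
X = preprocessing.sequence.pad_sequences(sequences, maxlen=50, padding='post')
y = malicious_urls[1].values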
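The effect of `maxlen=50` with `padding='post'` can be sketched in plain Python. The helper below is a hypothetical re-implementation that assumes the `pad_sequences` default of `truncating='pre'`; it is meant only to show which tokens survive.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items, then pad with `value` on the right."""
    kept = list(seq[-maxlen:])
    return kept + [value] * (maxlen - len(kept))


short = pad_post_truncate_pre([1, 2, 3], maxlen=5)
long_ = pad_post_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long_)
```

For URLs, truncating from the front removes the host name first, so passing `truncating='post'` may be worth considering if the domain is expected to be informative.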

In [47]:
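# Xtrain, Xval, ytrain, yval = train_test_split(X, y, test_size=0.3, random_state=1)
# Sequential, unshuffled split in file order: rows 0-1499 for training,
# rows 1500-1999 for validation, and the remaining 162 rows for testing.
Xtrain, Xval, Xtest = X[:1500], X[1500:2000], X[2000:]
ytrain, yval, ytest = y[:1500], y[1500:2000], y[2000:]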
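Slicing by position relies on the file order being effectively random. As an alternative, the sketch below produces a shuffled, stratified three-way split of row indices using only the standard library. The 70/15/15 proportions and the seed are assumptions for illustration, and the label list is synthetic, built from the class counts reported earlier.

```python
import random
from collections import defaultdict


def stratified_split(labels, sizes=(0.7, 0.15, 0.15), seed=1):
    """Return index lists (one per entry in `sizes`) with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    parts = [[] for _ in sizes]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for k, frac in enumerate(sizes):
            # The last part takes whatever remains, so no index is lost.
            last = k == len(sizes) - 1
            end = len(indices) if last else start + round(frac * len(indices))
            parts[k].extend(indices[start:end])
            start = end
    return parts


# Synthetic labels with the same class counts as the dataset.
labels = [1] * 1794 + [0] * 368
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index lists can be used to select rows from `X` and `y`; scikit-learn's `train_test_split` with its `stratify` argument is the more usual route when that library is available.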

In [48]:
# Note: with patience=10 and epochs=10 this callback can never stop training early.
early_stopping = callbacks.EarlyStopping(patience=10)

model = models.Sequential()
model.add(layers.Embedding(len(word_index) + 1, 64, input_length=50))
model.add(layers.Conv1D(64, 7, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(Xtrain, ytrain, epochs=10, batch_size=100, validation_data=(Xval, yval), callbacks=[early_stopping])

Train on 1500 samples, validate on 500 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a31532c88>
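The training call above combines `patience=10` with `epochs=10`. The sketch below emulates the patience rule in plain Python to show how many epochs would run under different settings. It is a simplified model of the callback (monitoring a loss, no minimum delta), and the validation losses are hypothetical numbers, not values from this notebook.

```python
def epochs_run(val_losses, patience):
    """Return the number of epochs completed before early stopping triggers."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses: best at epoch 2, then slowly worsening.
val_losses = [0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48]

ran_with_10 = epochs_run(val_losses, patience=10)
ran_with_3 = epochs_run(val_losses, patience=3)
print(ran_with_10, ran_with_3)
```

For the callback to have any effect, patience has to be smaller than the number of epochs; a value of 2 or 3 would be a plausible choice for a 10-epoch run.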

In [49]:
model.evaluate(Xtest, ytest)



[0.030722566886034645, 0.9876543209876543]

In [50]:
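# Convert each URL to character codes. The intended range is 0-255 (256 values),
# the same as the intensity range of a grayscale image.
# Also track the longest URL so that shorter ones can be zero-padded later.
urls_lst = []
maxlen = 0
for url in malicious_urls[0].values:
    # ord() never raises TypeError for a single character, so the original
    # try/except was dead code; map code points above 255 to '?' explicitly.
    url_lst = [ord(char) if ord(char) < 256 else ord('?') for char in url]
    urls_lst.append(url_lst)
    maxlen = max(maxlen, len(url_lst))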

In [51]:
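X = preprocessing.sequence.pad_sequences(urls_lst, maxlen=maxlen, padding='post')
# Reshape the 2-D matrix to 3-D (samples, steps, 1 channel) as input for the
# 1-D convolution, and scale the values to [0, 1] so that overly large or small
# inputs do not cause vanishing or exploding gradients.
X = X.reshape((X.shape[0], X.shape[1], 1)) / 255.0
y = malicious_urls[1].values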
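The character-level preprocessing can be summarized for a single URL as follows. This is a plain-Python sketch, not the notebook's code path: `encode_url` is a hypothetical helper, and clamping code points above 255 to `'?'` is an assumption about the intended behavior.

```python
def encode_url(url, maxlen):
    """Return a (maxlen, 1) nested list of character codes scaled to [0, 1]."""
    codes = [ord(c) if ord(c) < 256 else ord('?') for c in url]
    codes = codes[-maxlen:]                       # truncate from the front
    padded = codes + [0] * (maxlen - len(codes))  # zero-pad at the end
    return [[code / 255.0] for code in padded]


encoded = encode_url('m.69bj.com/', maxlen=16)
print(len(encoded), encoded[0], encoded[-1])
```

Each URL becomes a single-channel sequence, which is the shape a `Conv1D` layer expects for one sample.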

In [52]:
Xtrain, Xval, Xtest = X[:1500], X[1500:2000], X[2000:]
ytrain, yval, ytest = y[:1500], y[1500:2000], y[2000:]

In [53]:
# Note: patience=80 exceeds epochs=50, so this callback can never stop training early.
early_stopping = callbacks.EarlyStopping(patience=80)

model = models.Sequential()
model.add(layers.Conv1D(64, 7, activation='relu', kernel_initializer='lecun_normal'))
model.add(layers.MaxPooling1D(7))
model.add(layers.Conv1D(64, 9, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='nadam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(Xtrain, ytrain, epochs=50, batch_size=100, validation_data=(Xval, yval), callbacks=[early_stopping])

Train on 1500 samples, validate on 500 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x1a34fb3da0>
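The second model stacks a convolution, a pooling layer and another convolution, so it only works if the input is long enough. The sketch below traces the sequence length through those layers, assuming the Keras defaults of `padding='valid'`, stride 1 for `Conv1D`, and a pooling stride equal to the pool size. The input length of 200 is hypothetical, since the notebook does not print the character-level `maxlen`.

```python
def conv1d_len(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1


def maxpool1d_len(length, pool_size):
    """Output length of non-overlapping max pooling without padding."""
    return length // pool_size


def feature_lengths(input_len):
    """Lengths after Conv1D(64, 7), MaxPooling1D(7) and Conv1D(64, 9)."""
    after_conv1 = conv1d_len(input_len, 7)
    after_pool = maxpool1d_len(after_conv1, 7)
    after_conv2 = conv1d_len(after_pool, 9)
    return after_conv1, after_pool, after_conv2


print(feature_lengths(200))  # hypothetical input length
print(feature_lengths(69))   # shortest input that still yields one output step
```

Under these assumptions the padded URL length must be at least 69 characters for the second convolution to produce any output.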

In [54]:
model.evaluate(Xtest, ytest)



[0.06108299964739953, 0.9814814814814815]
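With only 162 test rows (2162 total minus the first 2000), each accuracy figure corresponds to a small number of errors. The sketch below converts the two reported accuracies back to counts and computes a 95% Wilson score interval for each. The choice of the Wilson interval is an assumption made here for illustration; the original notebook reports accuracy only.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_test = 2162 - 2000
reported = {
    'method one': 0.9876543209876543,
    'method two': 0.9814814814814815,
}

intervals = {}
for name, acc in reported.items():
    correct = round(acc * n_test)
    intervals[name] = (correct, wilson_interval(correct, n_test))
    low, high = intervals[name][1]
    print(f"{name}: {correct}/{n_test} correct, 95% interval {low:.3f} to {high:.3f}")
```

The two models differ by a single test sample, so the gap between them is well within the uncertainty of a test set this small.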

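The models above use two different approaches, both built on one-dimensional convolution:

1. Token-based 1-D convolution: each URL is converted to a sequence of integer token ids, and an embedding layer maps those ids to dense, low-dimensional word vectors.
2. Character-based 1-D convolution: each URL is converted to its character codes. Because that range (0-255) matches the per-channel range of RGB and grayscale images, the sequence is treated as a grayscale image with a depth of 1 and passed through 1-D convolutions.

Overall, method one works better: it trains quickly, the model is simpler, and its test accuracy is high (about 98.8%, versus about 98.1% for method two). Method two is the more novel idea, but its model is more complex than method one's and it takes longer to train. What explains both results is mainly the strength of one-dimensional convolution on small, order-insensitive text data.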
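One way to make the comparison between the two models concrete is to count their trainable parameters. The sketch below uses the standard formulas for embedding, convolution and dense layers with the layer sizes from the two model definitions; the helper functions are written for illustration and are not Keras calls.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases


def dense_params(inputs, units):
    return inputs * units + units  # weights + biases


# Method one: Embedding(7758 + 1, 64) -> Conv1D(64, 7) -> Dense(1)
method_one = (
    embedding_params(7758 + 1, 64)
    + conv1d_params(7, 64, 64)
    + dense_params(64, 1)
)

# Method two: Conv1D(64, 7) on 1 channel -> Conv1D(64, 9) -> Dense(1)
method_two = (
    conv1d_params(7, 1, 64)
    + conv1d_params(9, 64, 64)
    + dense_params(64, 1)
)

print(f"method one: {method_one:,} parameters")
print(f"method two: {method_two:,} parameters")
```

By this measure most of method one's parameters sit in its embedding table, while method two's extra cost comes from its additional layers and from processing one step per character rather than 50 tokens per URL.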